Scaling Data Infrastructure Towards Heterogeneous Computing

The end of Moore’s law, coupled with the fast-growing demand for compute power in a data-driven world, is shifting the industry towards a hardware-centric future in pursuit of cost-efficient infrastructure. As the optimal hardware for running the myriad software components in large data pipelines becomes a moving target, solutions not fundamentally designed to leverage heterogeneous hardware environments will only accelerate customers’ data infrastructure towards a dead end.

XONAI is building a universal compute fabric that seamlessly adapts to data pipelines at scale and moves them towards the ideal deployment.

Our initial focus is on accelerating Big Data processing with optimized execution on general-purpose hardware. Fueled by our unified execution engine, our solution transparently accelerates any integrated compute platform, and we are enabling it first for Apache Spark™, the world’s most popular engine powering large-scale Big Data and Machine Learning workloads across the industry. Spark-based operations at web giants have been shown to process data from nearly 1 billion active users, generating petabytes of data per day on clusters with thousands of machines.

With many improvements yet to come, preliminary standard benchmarks already show that our engine delivers more than 2x faster Spark execution instantly and with zero code changes.

The World Is Data-Driven

Data infrastructure was traditionally built in on-premises data centers, entailing numerous challenges in scaling and maintaining server hardware. The growing internet demanded a more flexible model where IT infrastructure could be rented on-demand, and the opportunity prompted the rapid evolution of cloud computing technologies. As the world became more connected, data-driven decisions became essential for organizations to remain competitive.

Today’s data-driven business processes live at the intersection of Big Data Analytics and AI, converting insights into actionable decisions that optimize business performance. Underneath the dashboards, information is continuously processed by algorithms that require ever-increasing computational resources and budget as organizations scale. Despite Big Tech’s success in providing highly scalable IT infrastructure, challenges in maintaining large data infrastructure often push organizations into adopting managed solutions for orchestration of large data and machine learning pipelines.

Yet, despite claims of best-in-class solutions, the real cost-effectiveness of existing data platforms is still a far cry from delivering optimal end-to-end acceleration for large data pipelines. The Big Data market is driven by ever-growing volumes of data and connected devices, and as application-specific hardware becomes increasingly critical for demanding workloads, organizations will find it challenging to stay agile: teams are pressured by budget constraints and struggle to deploy optimal data pipelines when doing so depends on expertise across multiple domains. The sand is shifting beneath the data-driven world, and existing solutions won’t effortlessly steer organizations through the dunes.

We need a new approach to seamlessly scale data infrastructure towards heterogeneous computing.

The Case For A New Approach

The SaaS industry capitalizes on externalized cloud APIs to integrate its solutions on top of cloud-native infrastructure. Although existing products succeed, to a degree, in increasing productivity through managed orchestration of large data pipelines, customers have little choice but to pile onto their infrastructure either closed proprietary software or enterprise stacks built on top of open-source solutions. Both models are deceptively the same: move the consumer deeper into a stack of increasingly locked-in and pricier editions of their products, which ultimately integrate only with a narrow line of vendor-chosen software.

We believe great solutions should seamlessly power the best software of choice for each use case, and liberate the end-user from the labyrinth of pricey adornments and head-scratching bills.

Data-driven businesses thrive when innovation enables them to remain competitive, and engineering teams are most productive when they have unimpaired access to novel technology. Yet, while cloud computing paradigms shift towards hybrid models and hyperscalers increase investment in accelerator hardware, vendors of data platforms are engaging in misleading benchmark wars. Existing solutions are hardly designed to maximize machine utilization even on a single hardware architecture, let alone in heterogeneous hardware environments. The inefficiencies that accumulate in the arteries of ecosystems composed of mixed software libraries only drive up costs. Solutions in the hardware-centric future must provide automatic hardware selection down to the granularity of individual algorithms.

Now more than ever, society has all the technological building blocks needed to enable unprecedented approaches to accelerated computing. Consumers of data technology should benefit from technology that “just works” with established open designs: easy to evaluate and approve, without weeks spent migrating data into proprietary platforms and rounds of incremental adjustments before any real benefit is realized.

XONAI sees a new future where data-driven businesses harness the synergies of novel compiler technologies intersecting new domains, and stay ahead of the competition when powered by next-generation data infrastructure.

Next-Generation Data Infrastructure

Scaling heterogeneous compute infrastructure towards a plurality of software requires a new type of platform connecting customers’ software pipelines with both established and emerging accelerator hardware. Bridging such a wide gap would not be feasible without leveraging many of the innovations in compiler technologies that have taken place in the past two decades.

LLVM was hugely successful as a modular, reusable compiler toolchain: it allowed the semiconductor industry to plug their compute devices into a common infrastructure and enabled creative uses of compiler technology well beyond the scope of building backends for hardware targets. However, as requirements for domain-specific optimization evolved (such as in machine learning models), so did the need for new high-level representations better suited for optimization in particular domains before targeting lower-level representations closer to hardware. MLIR, built by many of the world-class engineers who contributed greatly to the development of LLVM and derived solutions, is a novel approach to building reusable and extensible compiler infrastructure, addressing problems ranging from unified program representation down to accelerator-specific compilation and execution.

XONAI is leveraging novel developments in compiler and data analytics technologies to deliver the next-generation solution that powers petabyte-scale data and machine learning pipelines in heterogeneous hardware environments.

We are prioritizing integration of our platform with Apache Spark, the most popular data processing engine powering Big Data Analytics worldwide. Spark is supported by a growing community of 1000+ contributors and 250+ organizations and connects with a large ecosystem of software libraries, ranging from data science and machine learning to SQL analytics and storage systems. Its ubiquity in Big Data Analytics has incentivized major Hadoop vendors and public cloud providers to offer first-class support for managing Spark workloads, assisting small and large businesses in scaling their data products.

Our technology is not limited to accelerating Spark: it can scale to any data analytics platform. In the near future, we plan to power other popular data processing engines and software libraries commonly used in Big Data and Machine Learning workloads.

How It Works

Our solution connects software with multiple kinds of accelerator hardware by integrating it with a domain-specific compiler we are building to enable unified execution in heterogeneous hardware environments from a single source of truth. In our integration with Spark, we offload SQL execution to our compiler through a frontend that expresses computations commonly found in data analytics in a high-level fashion, delegating the optimization process to a common MLIR backend that ultimately drives compilation towards multiple accelerator targets. While our current focus is accelerating data pipelines through automatic vectorization on general-purpose CPUs, our compiler is designed to also target more specialized accelerator hardware such as GPUs and configurable hardware such as FPGAs.
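For a concrete picture of what “drop-in” means here, the snippet below is a minimal sketch of how an accelerator typically hooks into Spark through its standard extension points (the spark.plugins mechanism introduced in Spark 3.0). The plugin class name and jar path are hypothetical placeholders for illustration, not our actual API.

```python
# Minimal sketch: enabling an accelerator plugin via Spark's
# standard plugin framework (spark.plugins, Spark 3.0+).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("accelerated-etl")
    # Hypothetical plugin class and jar; real deployments would use
    # the vendor-provided names.
    .config("spark.plugins", "com.example.AcceleratorPlugin")
    .config("spark.jars", "/opt/accelerator/engine.jar")
    .getOrCreate()
)

# Queries run unchanged: supported operators are offloaded to the
# native engine, with fallback to default Spark execution otherwise.
df = spark.read.parquet("s3://my-bucket/events/")  # illustrative path
df.groupBy("country").count().show()
```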

We integrated our platform with Elastic MapReduce (EMR) on EC2 for all available hardware architectures. When activated, our platform works as a drop-in replacement for Spark on EMR or any custom deployment of open-source Spark, accelerating workloads without changing execution plans by default, so that already-tuned deployments can connect with our platform seamlessly and instantly improve execution performance. Future enhancements will introduce optimizations that may change execution semantics, but these will be opt-in only. Future updates will also make our platform available on all major cloud providers.
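As an illustration of this deployment model, the sketch below launches an EMR cluster whose spark-defaults configuration carries the accelerator settings, so existing jobs run unchanged. The boto3 run_job_flow call and the EMR configuration classification are standard; the plugin property value is a hypothetical placeholder.

```python
# Sketch: provisioning an EMR cluster with accelerator settings baked
# into spark-defaults, assuming a hypothetical plugin class name.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="accelerated-spark",
    ReleaseLabel="emr-6.5.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m6g.4xlarge",
             "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m6g.4xlarge",
             "InstanceCount": 8},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=[
        {
            "Classification": "spark-defaults",
            "Properties": {
                # Hypothetical accelerator property:
                "spark.plugins": "com.example.AcceleratorPlugin",
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```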

Preliminary Benchmarking

Our platform is compatible with Apache Spark up to version 3.1.2, and it was benchmarked against open-source Apache Spark 3.1.2 and Spark with EMR runtime 6.5.0. Our benchmark configuration is loosely based on Amazon’s latest EMR 3TB performance tests, but adapted for the TPC-H suite and executed on general-purpose M6g instances on EMR on EC2. We found this configuration more suitable for benchmarking, as these instances have sufficient memory to prevent queries from spilling data to disk, which can easily skew results towards one implementation or the other. Adaptive Query Execution (AQE) was enabled in all runs, and any flags for tuning the JVM were removed. Using the same Spark configuration, all jobs queried 3TB of unpartitioned TPC-H test data (non-decimal) in Parquet format stored on S3.
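To make the setup tangible, here is a sketch of the general shape of such a harness (not the exact scripts in our repository): register the TPC-H tables from Parquet on S3, enable AQE, and time a query end to end. Bucket paths are illustrative; the Spark APIs and the AQE flag are standard, and Q6 is quoted from the TPC-H suite.

```python
# Sketch of a TPC-H benchmark harness: AQE on, tables registered
# from unpartitioned Parquet on S3, query timed end to end.
import time
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tpch-benchmark")
    .config("spark.sql.adaptive.enabled", "true")  # AQE enabled in all runs
    .getOrCreate()
)

# Register the eight TPC-H tables (one Parquet directory per table).
for table in ["lineitem", "orders", "customer", "part", "partsupp",
              "supplier", "nation", "region"]:
    spark.read.parquet(f"s3://my-bucket/tpch-3t/{table}/") \
        .createOrReplaceTempView(table)

# TPC-H Q6, a simple query that shows the harness shape.
q6 = """
SELECT sum(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= date '1994-01-01'
  AND l_shipdate < date '1995-01-01'
  AND l_discount BETWEEN 0.05 AND 0.07
  AND l_quantity < 24
"""

start = time.time()
spark.sql(q6).collect()
print(f"Q6 runtime: {time.time() - start:.1f}s")
```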

Benchmark results show that our platform outperforms all implementations of Spark in both total runtime and geometric mean of runtimes, as illustrated in figures 1 and 2, with EMR performing better only in Q7.

In our analysis, EMR’s performance advantage in TPC-H queries is largely attributable to Amazon’s own optimized join reordering and the shuffled hash join operator backported from Apache Spark 3.2.0. Aside from Q7, all other queries with large runtimes perform significantly faster on our engine. When compared to open-source Spark, our engine outperforms it across all queries, with 1.8x faster total runtime.

Additional Results

Graviton 3 processors were recently made publicly available in C7g instances, delivering better performance and energy efficiency than their predecessors. As the new processors are not yet available on the EMR platform, we only benchmarked against open-source Apache Spark. We chose an alternative configuration, as compute-optimized C7g machines are more memory-constrained than general-purpose M6g ones: a standalone cluster of 8 nodes querying 1TB of unpartitioned TPC-H test data. As illustrated in figures 3 and 4, our platform performed 2.1x faster in total runtime, a larger relative difference than in the Graviton 2 benchmarks above, as expected from the new Graviton processors.

We performed additional benchmarking on machines with Intel Xeon processors (Skylake and Cascade Lake microarchitectures). All configurations used for benchmarking are summarized in Table 1.

Both open-source Spark and Spark with the EMR runtime on Graviton 2 machines were roughly 13% faster in total runtime compared to equivalent instance types with Intel processors. Our platform showed a larger gap (around 21%), which we attribute to the automatic vectorization capabilities of our engine. Judging from these results, we find Graviton instances a more cost-effective choice for running Spark, given their better performance at a lower cost compared to equivalent machines from other instance families.

Scripts used for TPC-H data generation and running benchmarks can be found in our GitHub repository.

Coming Next

As no modifications to Spark execution plans were performed, the presented acceleration results from offloading Spark operators, along with data decoding processes, to our compiler. Code generated from operators is optimized into a form amenable to automatic vectorization and specialized based on input data characteristics. As we continue to port the remaining operators, performance will improve further, since some queries still carry overhead from conversions between internal data formats when falling back to default Spark.
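To see where such fallback boundaries sit in practice, Spark’s standard plan inspection is a quick diagnostic. The sketch below uses the real explain API; how offloaded operators are labeled in the plan varies by engine, so the comments describe the general pattern rather than our actual plan node names.

```python
# Sketch: inspecting a physical plan to spot offloaded operators
# and fallback boundaries.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative query; any DataFrame works for plan inspection.
df = (spark.range(1_000_000)
      .selectExpr("id % 5 AS bucket")
      .groupBy("bucket")
      .count())

# "formatted" mode prints the physical plan one node per line.
# Offloaded stages typically appear as native/columnar operators;
# remaining default Spark operators mark a fallback boundary, where
# row/columnar conversions add the overhead described above.
df.explain(mode="formatted")
```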

We expect significant performance improvements in our next update as we introduce enhancements that optimize Spark execution plans and finalize our integration towards full SQL support. In addition, we are developing a caching layer to leverage fast local NVMe SSD storage, which we expect to dramatically improve query I/O time at no cost to convenience and to facilitate interactive analytics at scale.

Who Are We?

We are a team of passionate engineers not content with the status quo, striving to build the next-generation technology that powers data infrastructure worldwide.

Stay tuned for new blog updates in the weeks to come! Our team will delve into aspects of our technology in greater detail, as well as announce exciting performance and product updates.

We are hiring! Visit our careers page for more information.

Written by

Leandro Vaz
Co-Founder @ XONAI