Introducing XONAI for Google Cloud

Today we are introducing XONAI for the Google Cloud Dataproc service. Our platform can be easily activated when submitting Apache Spark jobs, delivering up to 3.1x faster data processing and reduced hardware resource usage, as measured by TPC-H derived benchmarks. The main benefits are lower cloud costs for batch processing jobs and more cost-effective resource sharing in containerized deployments.

Benchmark Results

We evaluated the performance of our platform on Google Dataproc with a TPC-H derived benchmark at 3TB scale factor, in a setup similar to our previous runs but with several new improvements that target performance and resource utilization: on a newly launched cluster, without any caching or configuration changes, we measured 2.1x faster benchmark runtime and 72% less memory used on average. Additional setup information can be found in the “Benchmark Details” section below.

Figure 1: TPC-H 3TB - runtime total (lower is better)

Figure 2: TPC-H 3TB - runtime total on 22 queries (lower is better)

The more pronounced acceleration on Google Dataproc compared to AWS EMR suggests that Dataproc’s distribution of Apache Spark has no modifications in the Catalyst optimizer. AWS has published multiple benchmarks of its optimized EMR runtime, as well as explanations of various optimization features. By contrast, we have not found any official articles that evaluate Dataproc’s Spark runtime, and the release notes do not mention custom query plan improvements.

Besides delivering faster performance, our engine includes several new improvements, in particular a unified off-heap memory allocation system for task execution that enables precise control of allocation limits, automatically spilling memory to disk when the executor-wide allocation threshold is reached. Because most resource-intensive allocation requests are served off-heap, JVM heap memory is used mostly for routine data structures allocated during execution, which significantly reduces on-heap memory requirements for Spark executor instances.
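The idea behind threshold-based spilling can be sketched as follows. This is a simplified illustration of the general technique, not XONAI's actual implementation; all names are hypothetical:

```python
# Illustrative sketch of threshold-based spilling: when executor-wide
# off-heap usage would exceed a limit, the largest spillable buffer is
# released (written to disk in a real engine) before the new allocation
# proceeds. Names and structure are hypothetical, not XONAI's actual API.
class OffHeapPool:
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0
        self.buffers = {}   # name -> size of live off-heap buffers
        self.spilled = []   # names of buffers spilled to disk

    def allocate(self, name, size):
        # Spill the largest buffers until the request fits under the limit.
        while self.used + size > self.limit and self.buffers:
            victim = max(self.buffers, key=self.buffers.get)
            self.used -= self.buffers.pop(victim)
            self.spilled.append(victim)  # stand-in for a disk write
        if self.used + size > self.limit:
            raise MemoryError("request exceeds executor-wide limit")
        self.buffers[name] = size
        self.used += size

pool = OffHeapPool(limit_bytes=100)
pool.allocate("sort-buffer", 60)
pool.allocate("join-buffer", 30)
pool.allocate("scan-buffer", 40)  # triggers a spill of "sort-buffer"
```

Because every allocation goes through one executor-wide pool, the limit can be enforced precisely rather than estimated per task.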

Figure 3: Benchmark cluster memory comparison with and without XONAI

When deployed with XONAI, we observed in our benchmarks that Apache Spark delivers consistent performance even with 75% less memory per executor instance. The executor memory is split between on-heap and off-heap by configuring the properties spark.executor.memory and spark.memory.offHeap.size respectively. Because off-heap memory is more precisely managed by our engine and reclaimed more frequently, both average and peak memory utilization are greatly reduced, as seen in Figure 3. For this benchmark, we reserved 15G of memory per executor for the run with Dataproc’s Spark, while with XONAI activated only 5G were reserved and the remaining 10G set as off-heap memory.
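For reference, the memory split described above translates into the following Spark properties (a sketch using the values quoted in this post; the elided flags stand in for the rest of the submit command):

```shell
# Baseline run with Dataproc's Spark: the full 15G is reserved on-heap.
spark-submit ... --conf spark.executor.memory=15G

# Run with XONAI activated: 5G on-heap, 10G off-heap managed by the engine.
spark-submit ... --conf spark.executor.memory=5G \
                 --conf spark.memory.offHeap.size=10G
```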

XONAI for Dataproc

Dataproc is a Google Cloud Platform service for running managed Hadoop and Apache Spark jobs. It provides enterprise-grade support, management and security for running data analytics at scale, with a large distribution of open source software for batch processing, streaming and machine learning applications.

Our platform can be activated by specifying two extra properties when submitting a Spark job to a Dataproc cluster: a GCS URI to our JAR and the plugin entry point, which is activated via the Spark 3 plugin interface.

Apart from these two properties, the executor memory configured for the original Dataproc Spark job needs to be split as described above, with the property spark.memory.offHeap.size set or adjusted accordingly.
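As a sketch, activating the plugin through the gcloud CLI could look like the following; the cluster name, region, application class and GCS URIs are placeholders, while the plugin class matches the one used in the benchmark command later in this post:

```shell
# Hypothetical example: submit a Spark job with XONAI activated.
# Cluster name, region, class and GCS URIs are placeholders.
gcloud dataproc jobs submit spark \
  --cluster=benchmark-cluster \
  --region=us-east1 \
  --class=com.example.YourSparkApp \
  --jars='gs://your-bucket/xonai-spark-sql.jar' \
  --properties='spark.plugins=com.xonai.spark.SQLPlugin,spark.executor.memory=5G,spark.memory.offHeap.size=10G'
```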

The off-heap size property is read by our engine, which automatically configures a number of internal plugin properties (all of which can still be manually overridden), such as the size of preallocated buffers for batch coalescing and the initial buffer size for sort-merge joins. While the property has no effect on open source Spark internals unless spark.memory.offHeap.enabled is set, its value is still added to the total memory calculated for YARN executor containers.
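The effect on container sizing can be illustrated with a quick calculation: on YARN, the requested executor container is roughly the on-heap executor memory plus the memory overhead (by default the larger of 384 MiB and 10% of executor memory) plus the off-heap size. A sketch, using the memory values from this benchmark:

```python
# Rough YARN executor container sizing for Spark 3.x, in MiB.
# Overhead default: max(384, 0.10 * executor_memory); PySpark memory ignored.
def container_size_mib(executor_memory, offheap_size, overhead_factor=0.10):
    overhead = max(384, int(overhead_factor * executor_memory))
    return executor_memory + overhead + offheap_size

# Benchmark settings from this post: 5000M on-heap + 10000M off-heap.
with_xonai = container_size_mib(5000, 10000)  # 5000 + 500 + 10000 = 15500
baseline = container_size_mib(15000, 0)       # 15000 + 1500 + 0 = 16500
```

So despite the 5G/10G split, the YARN container request with XONAI activated ends up slightly smaller than the 15G baseline, because the overhead is computed only from the on-heap portion.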

Benchmark Details

The cluster configurations of our EMR benchmarks serve as a blueprint for the Dataproc setup. Since we used EMR runtime 6.5.0, which supports Spark 3.1.2, the Dataproc clusters are provisioned with image version 2.0.39-ubuntu18, the last Ubuntu-based image with Spark component version 3.1.2 according to the release notes. EMR's Spark distribution was built with Parquet-MR 1.10.1, which is also the default version in open source Spark. Because version 1.11.1 is used in the Dataproc image, we add initialization actions to the XONAI benchmark clusters that substitute six Parquet JARs with their 1.10.1 counterparts.

On Amazon EC2, certain instance families like M5d have built-in NVMe SSD disks. Equivalent machine types are not available on Google Cloud, but local SSD devices can be attached to every Dataproc node in 375 GB increments. We configure each worker node with four SSDs, since it is not possible to choose a smaller non-zero number for the n2-standard-32 machine type. The boot disk size of every cluster machine is set to 30 GB, the minimum size that the Ubuntu image requires.
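An initialization action for the Parquet substitution could be sketched roughly as below. The Spark JAR directory, bucket path and JAR module names are assumptions for illustration; the exact set of six JARs swapped in our benchmarks may differ:

```shell
#!/bin/bash
# Hypothetical initialization action: replace the image's Parquet-MR 1.11.1
# JARs with their 1.10.1 counterparts. Paths and JAR names are illustrative.
set -euo pipefail
SPARK_JARS=/usr/lib/spark/jars
for jar in parquet-column parquet-common parquet-encoding \
           parquet-format-structures parquet-hadoop parquet-jackson; do
  rm -f "${SPARK_JARS}/${jar}-1.11.1.jar"
  gsutil cp "gs://your-bucket/jars/${jar}-1.10.1.jar" "${SPARK_JARS}/"
done
```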

Our Dataproc benchmark results are not directly comparable to the ones we reported for EMR: Apart from obvious differences in hardware, the cluster configurations of the two platforms are not equivalent. For example, Dataproc sets the value for spark.sql.autoBroadcastJoinThreshold to a small fraction of the executor memory while EMR keeps the default Spark value of 10 MB.
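One way to rule out this particular difference when comparing platforms is to pin the threshold explicitly on both; 10 MB (10485760 bytes) is the open source Spark default:

```shell
# Pin the broadcast join threshold to the OSS Spark default on both platforms.
spark-submit ... --conf spark.sql.autoBroadcastJoinThreshold=10485760
```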

The following command line was used to create a cluster for this benchmark:

gcloud dataproc clusters create benchmark-cluster \
--enable-component-gateway \
--bucket xonai-staging \
--region us-east1 \
--master-machine-type n2-standard-8 \
--master-boot-disk-size 30 \
--num-workers 5 \
--worker-machine-type n2-standard-32 \
--worker-boot-disk-size 30 \
--num-worker-local-ssds 4 \
--worker-local-ssd-interface NVME \
--project astral-shape-364114 \
--initialization-actions 'gs://example/,gs://example/'

Where the initialization action script is only added for runs with XONAI, to perform the Parquet JAR substitution mentioned above.

After logging onto the cluster's master machine, the following command was used to submit a job:

spark-submit \
--class com.xonai.RunTpcQueries \
--deploy-mode cluster \
--jars gs://example/xonai-spark-sql-1.0-SNAPSHOT.jar \
--conf spark.plugins=com.xonai.spark.SQLPlugin \
--conf spark.memory.offHeap.size=10000M \
--conf spark.driver.cores=4 \
--conf spark.driver.memory=4G \
--conf spark.executor.cores=5 \
--conf spark.executor.memory=5000M \
--conf spark.executor.instances=29 \
--conf spark.sql.shuffle.partitions=1305 \
--conf spark.shuffle.service.enabled=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.executor.heartbeatInterval=300s \
--conf spark.sql.broadcastTimeout=3000s \
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.adaptive.localShuffleReader.enabled=true \
--conf spark.sql.adaptive.coalescePartitions.enabled=true \
--conf spark.sql.adaptive.skewJoin.enabled=true \
gs://example/cloudbench-assembly-2.0.jar \
gs://example/3000g_part gs://xonai/results tpch 3000 3

The first three lines following --deploy-mode are the additional lines required to run Spark with XONAI, while the property spark.executor.memory is changed to 15G when XONAI is not activated.

All scripts used for TPC-H data generation and running benchmarks can be found in our GitHub repository.


Our platform for Google Cloud will be available for closed beta testing in December and supports Dataproc version 2 images with Spark 3.1.2 and above, on any of the available OS distributions (Ubuntu 18, Debian 10 or Rocky Linux 8).

Request access via our contact form if you are interested in testing our solution in your cloud.

written by

Leandro Vaz

Co-Founder @ XONAI
