The disruption of traditional extract, transform, and load (ETL) workloads has been one of the most notable impacts of big data platforms such as Hadoop, Spark, and NoSQL databases. With architectures like data lakes, organizations can shift to a load-first approach, known as extract, load, transform (ELT): load raw data first to accelerate ingest, and define schemas later.
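
To make the load-first idea concrete, here is a minimal sketch in plain Python rather than Spark. The sample records, the `SCHEMA` mapping, and the `load` and `transform` helpers are all hypothetical names chosen for illustration; they are not part of any platform mentioned here. The load step stores raw lines untouched, and the schema is applied only when the data is read.

```python
import json

# Hypothetical raw records, landed in the "lake" exactly as they arrived.
RAW_LINES = [
    '{"user": "ana", "amount": "19.90", "ts": "2021-03-01"}',
    '{"user": "bo", "amount": "5", "extra": "ignored"}',
    'not valid json',
]


def load(lines):
    """Load step: keep every line untouched; no parsing, no schema."""
    return list(lines)


# Schema defined later, at read time: field name -> converter (an assumption
# for this sketch, not a real platform's schema format).
SCHEMA = {"user": str, "amount": float}


def transform(stored, schema):
    """Transform step: apply the schema while reading; skip rows that do not fit."""
    rows = []
    for line in stored:
        try:
            record = json.loads(line)
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (ValueError, KeyError, TypeError):
            # Malformed or incomplete rows stay in raw storage but are not emitted.
            continue
    return rows


stored = load(RAW_LINES)
rows = transform(stored, SCHEMA)
print(rows)
```

Because nothing is validated at load time, ingest stays cheap, and a different schema can later be applied to the same stored lines without reloading them.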

The key new ingredients of ETL in the big data era are the radical increase in unstructured data and the introduction of fast data into the equation.

  • Faster response times for ad hoc queries and large-scale joins executed via Spark SQL
  • Rapid, massive data ingest from Hadoop HDFS, Amazon S3, and Apache Kafka
  • Accelerated data/document parsing of JSON, CSV, Parquet, and Avro data
  • Accelerated ETL/ELT, data cleansing, and data enrichment processes in Spark

Bigstream Benchmark Report

Read the Bigstream Benchmark Report to see specific Apache Spark acceleration results.
