Spark Streaming

Overview about batch processing vs stream processing in spark

Posted by Shubham Chaudhary on July 25, 2017

Extract Transform Load (ETL)

ETL process is to fetch data from different types of systems, structure it and save it into the destination database.

ETL Pipeline

Batch

In the case of a batch job, the query will be run on the data saved at source-path and the transformed data will be saved at the destination dest-path.

Batch Job

Streaming

In the case of a streaming job, the query will run on the data continuously from source-path and transformed data will be appended in the destination dest-path again and again as data comes in.

Batch Job converted to streaming

Merging static data (DB) with streaming data

There might be use cases where you want to merge static data (e.g. MySQL) with the streaming data. You can do this as follows:

Joining Streaming Data

Executing the Job

Batch Execution

Batch Plan Batch Plan Execution

Stream Execution

With the planner’s logical plan, incremental execution plan is generated on top of it Incremental

Resources