

This blog pertains to Apache Spark, where we will understand how Spark's Driver and Executors communicate with each other to process a given job. So let's get started.

First, let's see what Apache Spark is. The official definition says that "Apache Spark™ is a unified analytics engine for large-scale data processing." It is an in-memory computation engine: the data is kept in random access memory (RAM) instead of slow disk drives and is processed in parallel. Before proceeding further, I would like to state that the prerequisite to understand this blog is my blog on "Understanding how Spark runs on YARN with HDFS", where I have explained in detail how Spark runs on a Cluster Manager, i.e. YARN.

Now, let's look into the architecture of Apache Spark.

#SPARK ARCHITECTURE#
As we can see, Spark follows a master-slave architecture where we have one central coordinator and multiple distributed worker nodes. The central coordinator is called the Spark Driver, and it communicates with all the Workers. Each Worker node consists of one or more Executors, which are responsible for running Tasks. Executors register themselves with the Driver, so the Driver has all the information about the Executors at all times. This working combination of Driver and Workers is known as a Spark Application.

Thinking about how these Driver and Executor processes are launched after submitting a job (spark-submit)? Well, then let's talk about the Cluster Manager. The Spark Application is launched with the help of the Cluster Manager: Spark depends on it to launch the Executors, and also the Driver (in cluster deploy mode). The Cluster Manager can be any one of the following: Spark Standalone, Hadoop YARN, Apache Mesos, or Kubernetes.
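As a rough sketch (the application name here is made up, and in real deployments the master URL is normally supplied through spark-submit rather than hardcoded), the master URL is what tells a Spark application which Cluster Manager to talk to:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the master URL selects the Cluster Manager.
// In practice it is usually passed via spark-submit, not hardcoded.
val spark = SparkSession.builder()
  .appName("cluster-manager-demo")       // hypothetical application name
  .master("local[*]")                    // local mode (no cluster manager), handy for testing
  // .master("spark://host:7077")        // Spark Standalone
  // .master("yarn")                     // Hadoop YARN
  // .master("k8s://https://host:6443")  // Kubernetes
  .getOrCreate()
```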

#SPARK DRIVER#
The Driver is a Java process. This is the process where the main() method of our Scala, Java, or Python program runs. It executes the user code and creates a SparkSession or SparkContext; the SparkSession is responsible for creating DataFrames, Datasets, and RDDs, executing SQL, performing Transformations and Actions, and so on.
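To make this concrete, here is a small, self-contained sketch of a driver program (the object name, app name, and the "numbers" view are made up for illustration) whose main() creates the SparkSession and uses it for DataFrames, RDDs, and SQL:

```scala
import org.apache.spark.sql.SparkSession

// A sketch of a driver program; names are illustrative only.
object DriverExample {
  def main(args: Array[String]): Unit = {
    // main() runs inside the Driver process; the SparkSession is created here.
    val spark = SparkSession.builder()
      .appName("driver-example")
      .master("local[*]") // hardcoded only so the sketch runs on its own
      .getOrCreate()

    // The SparkSession is the entry point for DataFrames, Datasets, RDDs, and SQL.
    val df = spark.range(1, 1000).toDF("id") // a DataFrame
    val rdd = df.rdd                         // the underlying RDD
    df.createOrReplaceTempView("numbers")    // register the DataFrame for SQL
    val evens = spark.sql("SELECT id FROM numbers WHERE id % 2 = 0")

    // count() is an action: only now does the Driver schedule work on Executors.
    println(evens.count())

    spark.stop()
  }
}
```
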
#SPARK DRIVER CODE#
The Driver looks at the user code and determines the possible Tasks, i.e. it converts the user code (transformations and actions) into Tasks, and it decides the number of Tasks to be performed. It also helps to create the Lineage, the Logical Plan, and the Physical Plan. Wondering what those are? Check my blog by clicking here.
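A short sketch of how you can watch the Driver do this, assuming the `spark` session from the example above: transformations only extend the plan, explain() prints the Logical and Physical Plans, toDebugString prints the RDD lineage, and only the action at the end triggers Tasks.

```scala
// Assuming the `spark` session from the sketch above.
val df = spark.range(1, 1000000).toDF("id")
  .filter("id % 2 = 0")            // transformation: only extends the plan, nothing runs yet
  .selectExpr("id * 2 AS doubled") // transformation: still nothing runs

df.explain(true)                 // prints the Logical Plans and the Physical Plan
println(df.rdd.toDebugString)    // prints the RDD lineage graph
println(df.rdd.getNumPartitions) // tasks per stage follow the partition count

df.count()                       // action: the Driver now turns the plan into Tasks
```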

