Spark driver
This blog pertains to Apache Spark, where we will understand how Spark’s Driver and Executors communicate with each other to process a given job. So let’s get started.

First, let’s see what Apache Spark is. The official definition of Apache Spark says that “Apache Spark™ is a unified analytics engine for large-scale data processing.” It is an in-memory computation processing engine where the data is kept in random access memory (RAM) instead of slow disk drives and is processed in parallel.

Before proceeding further, I would like to state that the prerequisite to understanding this blog is my blog on “Understanding how Spark runs on YARN with HDFS”, where I have explained in detail how Spark runs on a Cluster Manager, i.e. YARN.
Now, let’s look into the architecture of Apache Spark.

Spark Architecture

As we can see, Spark follows a Master-Slave architecture where we have one central coordinator and multiple distributed worker nodes. The central coordinator is called the Spark Driver, and it communicates with all the Workers. Each Worker node consists of one or more Executor(s), which are responsible for running Tasks. Executors register themselves with the Driver, and the Driver has all the information about the Executors at all times. This working combination of Driver and Workers is known as a Spark Application.

The Spark Application is launched with the help of the Cluster Manager. The Cluster Manager can be any one of the following –

  • Spark Standalone
  • Hadoop YARN
  • Apache Mesos
  • Kubernetes
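Whichever of these is used, the application selects it through the master URL it is started with. Here is a hedged sketch of the documented master-URL forms (host names, ports, and the object name are placeholders, not from this blog):

    // The master URL decides which Cluster Manager the application talks to.
    import org.apache.spark.sql.SparkSession

    object MasterUrlExamples {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("MasterUrlExamples")
          .master("local[*]")                   // run locally, without a Cluster Manager
          // .master("spark://host:7077")       // Spark Standalone
          // .master("yarn")                    // Hadoop YARN
          // .master("mesos://host:5050")       // Apache Mesos
          // .master("k8s://https://host:6443") // Kubernetes
          .getOrCreate()
        spark.stop()
      }
    }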
Driver

The Driver is a Java process. This is the process where the main() method of our Scala, Java, or Python program runs. It executes the user code and creates a SparkSession or SparkContext, and the SparkSession is responsible for creating DataFrames, Datasets and RDDs, executing SQL, performing Transformations & Actions, etc.
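To make that concrete, here is a minimal, hedged sketch of such a driver program (the word-count logic, object name, and input path are illustrative assumptions, not code from this blog):

    // main() runs in the Driver process; it creates the SparkSession, and the
    // transformations/actions below are converted into Tasks that the Driver
    // schedules on the Executors.
    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("WordCount").getOrCreate()

        val counts = spark.sparkContext
          .textFile("hdfs:///input/words.txt")   // assumed input path
          .flatMap(_.split("\\s+"))              // transformation
          .map(word => (word, 1))                // transformation
          .reduceByKey(_ + _)                    // transformation

        counts.collect().foreach(println)        // action: results return to the Driver
        spark.stop()
      }
    }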

Responsibilities of the Driver:

  • Conversion of the user code into Tasks (transformations and actions): it looks at the user code and determines the possible Tasks, i.e. the number of tasks to be performed is decided by the Driver.
  • Helps to create the Lineage, the Logical Plan, and the Physical Plan. Wondering what they are? Check my blog by clicking here.
  • Once the Physical Plan is generated, the Driver schedules the execution of the tasks by coordinating with the Cluster Manager.
  • Coordinates with all the Executors for the execution of Tasks. It looks at the current set of Executors and schedules our tasks.
  • Keeps track of the data (in the form of metadata) which was cached (persisted) in the Executors’ (Workers’) memory.

Executor

The Executor resides in the Worker node.

  • Executors are launched at the start of a Spark Application in coordination with the Cluster Manager, and they are dynamically launched and removed by the Driver as required.
  • Each Executor runs individual Tasks and returns the results to the Driver.
  • It can cache (persist) the data in the Worker node, as sketched below.
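Here is a short, hedged sketch of that caching behaviour (the object name and input path are assumptions): the persisted partitions live in the Executors’ memory on the Workers, while the Driver only tracks the metadata about them.

    // Caching: the data itself stays in the Executors' memory on the Workers;
    // the Driver keeps only metadata about which partitions are cached where.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object CachingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("CachingSketch").getOrCreate()

        val errors = spark.sparkContext
          .textFile("hdfs:///input/logs.txt")    // assumed input path
          .filter(_.contains("ERROR"))
          .persist(StorageLevel.MEMORY_ONLY)     // cache in Executor memory

        println(errors.count())  // first action: computes and caches the partitions
        println(errors.count())  // second action: served from the Executors' cache
        spark.stop()
      }
    }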

Wondering how these Driver and Executor processes are launched after submitting a job (spark-submit)? Well, then let’s talk about the Cluster Manager. Spark is dependent on the Cluster Manager to launch the Executors and also the Driver (in Cluster mode). We can use any of the Cluster Managers (as mentioned above) with Spark, i.e. Spark can be run with any of them.

Spark provides a script named “spark-submit” which helps us to connect with the different kinds of Cluster Managers, and it controls the number of resources the application is going to get, i.e. it decides the number of Executors to be launched, how much CPU and memory should be allocated for each Executor, etc. Let’s say a user submits a job using “spark-submit”:

    spark-submit --master <master-url> --executor-memory 2g --executor-cores 4 WordCount-assembly-1.0.jar
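The same resource settings can also be supplied programmatically. A hedged sketch follows: the property keys are Spark’s standard spark.executor.* configuration, while the object name is an assumption, and in some deploy modes such settings must still be given to spark-submit.

    // Setting the same Executor resources from inside the driver program
    // instead of on the spark-submit command line. The master URL is normally
    // supplied by spark-submit, so it is not hard-coded here.
    import org.apache.spark.sql.SparkSession

    object SubmittedApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SubmittedApp")
          .config("spark.executor.memory", "2g") // same as --executor-memory 2g
          .config("spark.executor.cores", "4")   // same as --executor-cores 4
          .getOrCreate()
        // ... transformations and actions ...
        spark.stop()
      }
    }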
