Friday, May 1, 2015

What is Apache Spark doing before a job starts?

I have an Apache Spark batch job running continuously on AWS EMR. It pulls from AWS S3, runs a couple of jobs with that data, and then stores the data in an RDS instance.

However, there seems to be a long period of inactivity between jobs.

This is the CPU usage (graph: CPU Usage).

And this is the network traffic (graph: Cluster Network).

Notice the gap between each pair of columns: it is almost as wide as the activity column itself!

At first I thought these two columns were shifted (while it was pulling from S3 it wasn't using much CPU, and vice versa), but then I noticed that the two graphs actually track each other. This makes sense, since RDDs are lazy and the data is therefore pulled while the job is running.
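To illustrate why lazy evaluation makes the fetch and the computation show up at the same time, here is a minimal plain-Python analogy using generators. It is not Spark code: `fetch_records` and `log` are hypothetical names made up for this sketch, and the generator only stands in for an S3 read.

```python
# Plain-Python analogy for lazy evaluation; this is not Spark code.
# fetch_records and log are hypothetical names used only for illustration.
log = []


def fetch_records(n):
    """Stand-in for reading from S3: yields records only when consumed."""
    for i in range(n):
        log.append(f"fetch {i}")
        yield i


# Building the pipeline does no work yet, much like chaining RDD transformations.
pipeline = (x * 2 for x in fetch_records(3))
events_before_action = list(log)  # nothing has been fetched at this point

# Consuming the pipeline (the "action") triggers fetching and computing together.
result = list(pipeline)
```

In this sketch nothing is fetched until `list(pipeline)` runs, so the "network" work and the "CPU" work happen in the same window rather than one after the other.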

Which leads to my question: what is Spark doing during that time? All of the Ganglia graphs appear to drop to zero during that period. It is as if the cluster decided to take a break before each job.
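One way to put numbers on the gap is to compare each job's end time with the next job's start time. The following is a minimal sketch, assuming the start and end times have been collected by hand (for example, read off the job list in the Spark web UI). The `idle_gaps` helper is a hypothetical name, and the timestamps are invented for illustration, not taken from this cluster.

```python
from datetime import datetime


def idle_gaps(jobs):
    """Return (gap_seconds, previous_job_seconds) for each consecutive job pair.

    jobs is a list of (start, end) datetime pairs; it is sorted by start time.
    """
    ordered = sorted(jobs)
    gaps = []
    for (prev_start, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = (next_start - prev_end).total_seconds()
        busy = (prev_end - prev_start).total_seconds()
        gaps.append((gap, busy))
    return gaps


# Made-up timestamps for illustration only; not measured on the real cluster.
jobs = [
    (datetime(2015, 5, 1, 10, 0), datetime(2015, 5, 1, 10, 10)),
    (datetime(2015, 5, 1, 10, 19), datetime(2015, 5, 1, 10, 29)),
    (datetime(2015, 5, 1, 10, 38), datetime(2015, 5, 1, 10, 48)),
]

for gap, busy in idle_gaps(jobs):
    print(f"idle {gap:.0f}s after a {busy:.0f}s job")
```

Comparing the idle and busy durations side by side makes it easier to tell whether the gap scales with the amount of data or stays roughly constant from run to run.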

Thanks.



