Thursday, January 29, 2015

Amazon EMR runs slower than an EC2-based Hadoop cluster

I am running a streaming Hadoop job written in Python on EMR as well as on a self-built Hadoop cluster on top of Amazon EC2 instances. My input is divided into around 50,000 files, and the mapper reads each file as an input. When I ran the job on the EC2 cluster, it took 36 minutes for 1000 files, while EMR took 1 hr 30 minutes. Jobs with larger inputs even seem to fail on EMR, but they run fine on EC2.
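
As a rough sanity check on the numbers above, the per-file cost and the projected time for the full input can be worked out as below. This is only a sketch: the variable and function names are made up for illustration, it assumes the EMR timing was also for 1000 files, and it assumes run time grows linearly with the number of files, which may not hold in practice.

```python
# Timings reported in the question; all names here are illustrative only.
FILES_IN_TEST = 1000      # files processed in each timed run (assumed equal for both)
TOTAL_FILES = 50_000      # approximate size of the full input

RUN_MINUTES = {
    "EC2 (4 x m1.large)": 36,
    "EMR (4 x m3.xlarge)": 90,   # 1 hr 30 minutes
}


def seconds_per_file(total_minutes, n_files):
    """Average wall-clock seconds spent per input file."""
    return total_minutes * 60 / n_files


def projected_hours(total_minutes, n_files, target_files):
    """Projected hours for target_files, ASSUMING linear scaling."""
    return total_minutes * target_files / n_files / 60


# How many times slower the EMR run was than the EC2 run.
slowdown = RUN_MINUTES["EMR (4 x m3.xlarge)"] / RUN_MINUTES["EC2 (4 x m1.large)"]

for name, minutes in RUN_MINUTES.items():
    per_file = seconds_per_file(minutes, FILES_IN_TEST)
    full_run = projected_hours(minutes, FILES_IN_TEST, TOTAL_FILES)
    print(f"{name}: {per_file:.2f} s/file, ~{full_run:.0f} h for all files")

print(f"EMR / EC2 run-time ratio: {slowdown:.1f}x")
```

A per-file cost of a few seconds on both clusters suggests that fixed per-file overhead (task startup, opening the input) may matter more than raw compute here, though that is a guess rather than something the timings prove.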


The EMR job ran on 4 m3.xlarge instances and the EC2 cluster on 4 m1.large instances. For instance details, look here.


I have gone through this link as well as this link to find a comparison between EMR and EC2.


Now, I am looking for:
1) The reason EMR is slower than EC2.
2) Does EMR keep the input in S3 only, or does it copy the input to HDFS and then run the job?
3) Is EC2 a better choice? If yes, what types of jobs are more suitable for EC2?


I would really appreciate references to links, similar experiences, or any relevant documentation.


P.S.: I am a beginner with AWS, so some of my questions may sound silly. But as someone has said, no question is silly :).




