Wednesday, March 4, 2015

Hadoop/map-reduce: Total time spent by all maps in occupied slots vs. Total time spent by all map tasks

Background: I'm analyzing the performance of AWS Hadoop jobs on various cluster configurations and some of the Hadoop counters are confusing.


Question: what's the difference between "Total time spent by all maps in occupied slots" and "Total time spent by all map tasks" (and likewise for the corresponding reduce counters)? For brevity, let's call these counters mapO, mapT, redO and redT. Here's what I've seen in three different configurations (each with varying numbers of core/slave nodes):


1) For AWS/EMR jobs (Hadoop 2.4.0-amzn-3), the ratio of mapO / mapT is always 6.0 and the ratio of redO / redT is always 12.0.


2) For manually installed Hadoop (Hadoop 2.4.0.2.1.5.0-695) using instance storage, the ratio of mapO / mapT is always 1.0 but the ratio of redO / redT is sometimes 1.0 and sometimes 2.0.


3) For manually installed Hadoop using EBS storage, the ratio of mapO / mapT is always 1.0 and the ratio of redO / redT is always 2.0.
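To be explicit about how the ratios above are obtained: each one is just the "occupied slots" counter divided by the matching "all tasks" counter for the same job. Here is a minimal sketch in Python. The function name `slot_ratios` and the counter values are hypothetical, chosen only to reproduce the 6.0 and 12.0 ratios from case 1; they are not values from an actual job.

```python
def slot_ratios(counters):
    """Return (mapO / mapT, redO / redT) for one job.

    `counters` maps the shorthand names used in this post to counter
    values in milliseconds, as reported in the job's counter output.
    """
    map_ratio = counters["mapO"] / counters["mapT"]
    red_ratio = counters["redO"] / counters["redT"]
    return map_ratio, red_ratio


# Hypothetical counter values (milliseconds), for illustration only.
example = {
    "mapO": 600_000,  # Total time spent by all maps in occupied slots
    "mapT": 100_000,  # Total time spent by all map tasks
    "redO": 240_000,  # Total time spent by all reduces in occupied slots
    "redT": 20_000,   # Total time spent by all reduce tasks
}

map_ratio, red_ratio = slot_ratios(example)
print(f"mapO/mapT = {map_ratio}, redO/redT = {red_ratio}")
```

Running the same calculation for each job in a configuration shows whether a ratio is constant (as in cases 1 and 3) or varies between jobs (as with redO / redT in case 2).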


I'm assuming other configurations would have different ratios but what do these counters/timers actually measure?


I bought Tom White's excellent "Hadoop" book (3rd Edition) but there is no mention of the mapO or redO counters in particular or "occupied slots" in general.


I've also run lots of Google searches and viewed dozens of pages on hadoop.apache.org. I also have (and run) Hadoop on my MacBook and searched for the code behind these counters, but couldn't find it (I'm sure it's there somewhere).


As noted in a related (and unanswered) question, it is surprising that even a basic description of these fundamental counters is not readily available.




