I've prepared a streaming job flow on AWS EMR with boto that runs perfectly well using the familiar local test pipe:
sed -n '0~10000p' Big.csv | ./map.py | sort -t$'\t' -k1 | ./reduce.py
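For context, the job flow itself is launched roughly along the lines below, assuming boto 2's EMR API (boto.emr); the bucket names, script paths, and instance settings are placeholders, not my actual configuration:

import boto.emr
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region('us-east-1')

# Streaming step pointing at the mapper/reducer scripts on S3 (paths are hypothetical)
step = StreamingStep(
    name='promo history',
    mapper='s3://my-bucket/scripts/map.py',
    reducer='s3://my-bucket/scripts/reduce.py',
    input='s3://my-bucket/input/Big.csv',
    output='s3://my-bucket/output/')

# Launch the job flow on the 3.2.3 AMI (instance types/counts are placeholders)
jobid = conn.run_jobflow(
    name='streaming job',
    ami_version='3.2.3',
    log_uri='s3://my-bucket/logs/',
    num_instances=3,
    master_instance_type='m1.medium',
    slave_instance_type='m1.medium',
    steps=[step])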
The EMR job launched with boto also works well as I increase the size of the input data, up to some threshold where jobs fail with a Python broken-pipe error:
Traceback (most recent call last):
File "/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201504151813_0001/attempt_201504151813_0001_r_000002_0/work/./reduce.py", line 18, in <module>
json.dump( { "cid":cur_key , "promo_hx":kc } , sys.stdout )
File "/usr/lib/python2.6/json/__init__.py", line 181, in dump
fp.write(chunk)
IOError: [Errno 32] Broken pipe
and the following Java error:
org.apache.hadoop.streaming.PipeMapRed (Thread-38): java.lang.OutOfMemoryError: Java heap space
Map tasks all complete for any input size; the error occurs at the reduce stage. My reducer is the usual streaming reducer (I am using AMI 3.2.3 with the json package built into Python 2.6.9):
for line in sys.stdin:
    line = line.strip()
    key, value = line.split('\t')
    ...
print json.dumps({"cid": cur_key, "promo_hx": kc}, sort_keys=True, separators=(',', ': '))
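(For reference, the full loop follows the standard key-grouping pattern for a streaming reducer; the sketch below is illustrative only, with cur_key tracking the current group and kc standing in for my real per-key accumulation, which is elided above.)

#!/usr/bin/env python
# Illustrative sketch of the key-grouping pattern, not the real reduce.py.
import sys
import json

cur_key = None
kc = []

for line in sys.stdin:
    line = line.strip()
    key, value = line.split('\t')
    if key != cur_key:
        # key changed: emit the finished group before starting a new one
        if cur_key is not None:
            print json.dumps({"cid": cur_key, "promo_hx": kc},
                             sort_keys=True, separators=(',', ': '))
        cur_key = key
        kc = []
    kc.append(value)  # stand-in for the real per-key accumulation

# emit the final group
if cur_key is not None:
    print json.dumps({"cid": cur_key, "promo_hx": kc},
                     sort_keys=True, separators=(',', ': '))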
Any idea what is going on? Thanks.