Thursday, 16 April 2015

boto: checking for correct hadoop parameters

EMR memory errors and job failures occur on my large runs, not on the smaller test runs. To save time and expense, I check the job logs to confirm that the expected parameters were handed over from boto to Hadoop in



http://s3job_flow_bucket/j_123ABC/jobs/job_..._conf.xml


Question 1: Is this the correct place to look for the purpose above?
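
One way to answer that is to pull the _conf.xml straight out of the log bucket and look for the mapred.* keys in it. Below is a rough sketch using boto's S3 API; the bucket and key names are placeholders standing in for the path above, not the real ones:

import xml.etree.ElementTree as ET
import boto

# Download the job configuration that Hadoop actually recorded.
s3 = boto.connect_s3()
bucket = s3.get_bucket('job_flow_bucket')                 # placeholder bucket name
key = bucket.get_key('j_123ABC/jobs/job_XXXX_conf.xml')   # placeholder key name
conf_xml = key.get_contents_as_string()

# Print the values Hadoop ended up with for the parameters I care about.
wanted = ['mapred.child.java.opts',
          'mapred.cluster.reduce.memory.mb',
          'mapred.job.reduce.memory.mb']
root = ET.fromstring(conf_xml)
for prop in root.findall('property'):
    name = prop.findtext('name')
    if name in wanted:
        print('%s = %s' % (name, prop.findtext('value')))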


What I observed is that some of my bootstrap action parameters are not picked up by Hadoop. Also, adding a wrong or misspelled parameter doesn't generate any warning in the logs (none that I can find, anyway).


Question 2: If I use the following bootstrap action parameters, how can I confirm that the Hadoop call actually sees the request?


At this point I have to wait over an hour for the memory error to occur. There has to be a more efficient way to debug such memory issues.
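
One shortcut that might avoid the hour-long wait: a few minutes after the nodes of the jobflow below come up, EMR copies the configure-hadoop bootstrap action's own stdout/stderr into the log bucket. I can't promise the script warns about bad keys (it may ignore them silently), but its logs at least show what arguments it was handed. The sketch below assumes the usual <log_uri>/<jobflow-id>/node/<instance-id>/bootstrap-actions/<n>/ layout; the bucket and prefix follow the log_uri in the jobflow below, and the jobflow id is a placeholder:

import boto

# List and dump the bootstrap-action logs for one jobflow.
s3 = boto.connect_s3()
bucket = s3.get_bucket('thebucket')            # bucket from log_uri
prefix = 'jobflowlogs/j-123ABC/node/'          # placeholder jobflow id

for key in bucket.list(prefix=prefix):
    if 'bootstrap-actions' in key.name:
        print(key.name)
        print(key.get_contents_as_string())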



from boto.emr import connect_to_region
from boto.emr.bootstrap_action import BootstrapAction

# Hadoop memory settings handed to the configure-hadoop bootstrap action.
params = ['-m', 'mapred.child.java.opts=-Xmx2g',
          '-m', 'mapred.cluster.reduce.memory.mb=2000',
          '-m', 'mapred.job.reduce.memory.mb=2000']

config_bootstrapper = BootstrapAction(
    name="Bootstrap name",
    path='s3://elasticmapreduce/bootstrap-actions/configure-hadoop',
    bootstrap_action_args=params)

# conn and step are created earlier, e.g.
# conn = connect_to_region('us-east-1') and step = StreamingStep(...)
jobid = conn.run_jobflow(name='The Debug Jobflow',
                         #api_params=api_params,
                         #ec2_keyname="thekey",
                         bootstrap_actions=[config_bootstrapper],
                         ami_version="latest",
                         log_uri='s3://thebucket/jobflowlogs',
                         master_instance_type='m1.medium',
                         slave_instance_type='m1.medium',
                         num_instances=4,
                         steps=[step],
                         enable_debugging=True,
                         keep_alive=False)
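
Once run_jobflow returns, the jobflow id can be polled so I know the moment bootstrapping is done and the bootstrap-action logs are worth fetching. A small sketch reusing the conn object above:

import time

# Wait for the cluster to leave the STARTING/BOOTSTRAPPING phase,
# then go look at the logs in the log bucket.
while True:
    flow = conn.describe_jobflow(jobid)
    print('%s: %s' % (jobid, flow.state))
    if flow.state not in ('STARTING', 'BOOTSTRAPPING'):
        break
    time.sleep(60)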



