Similar questions about getting logs on YARN have been asked before, but I cannot seem to make this work the way I want on Amazon EMR in particular.
I have Spark jobs running on EMR, triggered via the Java AWS client. On earlier versions of Spark I ran them directly on EMR, but with later versions I switched to YARN as the resource manager, which should give some performance and cost benefits and fits well with the EMR documentation.
The problem, of course, is that the logs from my jobs disappear into YARN rather than being collected in the EMR console. Ideally, I want to pull these into the stdout or stderr files in the step execution logs of the EMR web console (currently stdout is empty and stderr contains just YARN noise).
The Amazon forums and the AWS EMR documentation on this page http://ift.tt/1CLt1jk seem to suggest that the correct way is just to supply a log URI, as I'm already doing, and everything will work (it doesn't).
I have tried the solution from "YARN log aggregation on AWS EMR - UnsupportedFileSystemException", which actually works nicely to push logs to a remote S3 bucket. I added the following bootstrap action:
    import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;

    // Enable YARN log aggregation and send the aggregated logs to S3.
    ScriptBootstrapActionConfig scriptBootstrapAction = new ScriptBootstrapActionConfig()
        .withPath("s3://elasticmapreduce/bootstrap-actions/configure-hadoop")
        .withArgs("-y", "yarn.log-aggregation-enable=true",
                  "-y", "yarn.log-aggregation.retain-seconds=-1",
                  "-y", "yarn.log-aggregation.retain-check-interval-seconds=3600",
                  "-y", "yarn.nodemanager.remote-app-log-dir=s3n://mybucket/logs");
The problem is that this creates a series of 'hadoop/application-xxxxxxxx/ip-xxxxxx' files and cannot place them in an area accessible to the AWS EMR web console logs, since that is a subdirectory based on the EMR job id. Because bootstrap actions have to be supplied at job creation time, I don't know the job id yet, so I cannot pass it in the path.
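For reference, the "-y" arguments in the bootstrap action above can also be assembled from a map, which makes it easier to see which YARN properties are being set. This is only a minimal sketch: YarnArgs and toArgs are hypothetical names I made up for illustration and are not part of the AWS SDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the AWS SDK.
final class YarnArgs {
    private YarnArgs() {}

    // Turns each property into the pair "-y", "key=value",
    // following the iteration order of the given map.
    static List<String> toArgs(Map<String, String> props) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            args.add("-y");
            args.add(e.getKey() + "=" + e.getValue());
        }
        return args;
    }
}
```

With a LinkedHashMap holding the four properties in the same order, the resulting list could be passed to withArgs, which also accepts a collection of strings. It does not change the behavior described above: the remote log directory is still fixed before the job id is known.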
Have I missed anything to get EMR logs working?