I am trying to run a data pipeline that launches an EMR cluster and runs Pig scripts to process files. The pipeline and the Pig script work well for smaller data sets, but for larger data sets (40 GB) the job fails and does not even read the input files. I suspect it has something to do with the node type (memory or disk space).
I tried m3.2xlarge first, then r3.2xlarge (10 nodes), r3.4xlarge (5 nodes), and i2.4xlarge (4 nodes), but none of them worked.
Below is the error for reference:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.4.0-amzn-4 0.12.0 hadoop 2015-10-09 15:08:03 2015-10-09 15:13:13 GROUP_BY,FILTER,UNION

Failed!

Failed Jobs:
JobId Alias Feature Message Outputs
job_1444402875772_0001 final_output,job_count_grouped,job_count_lag,job_counts,job_counts_no_differential,last_counts,last_counts_gen,ordered_job_count GROUP_BY Message: Job failed! path,

Input(s):
Failed to read data from "path"
Failed to read data from "s3://path"

Output(s):
Failed to produce result in "s3://path"
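The summary above only says "Job failed!" and does not show the underlying task error. A minimal sketch of how the failed job ID could be turned into a command for pulling the detailed logs: it assumes the usual Hadoop convention that a MapReduce job ID `job_<timestamp>_<n>` corresponds to the YARN application ID `application_<timestamp>_<n>`, and that log aggregation is enabled on the cluster so that `yarn logs -applicationId` returns the container logs. The function name `failed_application_ids` is made up for illustration.

```python
import re
from typing import List

# Failed-job lines copied from the Pig summary above.
PIG_SUMMARY = """\
Failed Jobs:
JobId Alias Feature Message Outputs
job_1444402875772_0001 final_output,job_count_grouped,job_count_lag,job_counts,job_counts_no_differential,last_counts,last_counts_gen,ordered_job_count GROUP_BY Message: Job failed! path,
"""


def failed_application_ids(summary: str) -> List[str]:
    """Return the YARN application ID for each MapReduce job ID in the text.

    Hypothetical helper: it relies on the job_<ts>_<n> to
    application_<ts>_<n> naming convention (an assumption stated above).
    """
    # Only match "job_" followed by digits, so aliases such as
    # "job_count_grouped" are not mistaken for job IDs.
    suffixes = re.findall(r"\bjob_(\d+_\d+)\b", summary)
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return [f"application_{s}" for s in dict.fromkeys(suffixes)]


if __name__ == "__main__":
    for app_id in failed_application_ids(PIG_SUMMARY):
        # Print the command to run on the master node; nothing is executed here.
        print(f"yarn logs -applicationId {app_id}")
```

Running the printed command on the master node should show the task-level stack traces, which would indicate whether the failure is memory, disk space, or S3 access.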
Any input would be helpful.