I'm currently trying to run a MapReduce job where the inputs are scattered across different folders under a catch-all bucket in S3.
My original approach was to create a cluster for each input file and write a separate output for each. However, that would require spinning up more than 200 clusters, which doesn't seem like the most efficient way to do this.
I was wondering if, instead of specifying a single file as the input to EMR, I could specify a folder whose subfolders contain all of the input files.
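For reference, here is a rough sketch of what I'm imagining: a single EMR step whose input is a glob over every subfolder of the prefix, submitted via boto3. The bucket name, prefix, cluster ID, and mapper/reducer script names are all hypothetical placeholders, and I'm not sure whether the glob expands the way I expect (I believe Hadoop's FileInputFormat reads files directly inside each directory the pattern matches, but does not recurse deeper by default).

```python
# Sketch: one EMR streaming step over all subfolders, instead of
# one cluster per input file. Names below are hypothetical.
INPUT_GLOB = "s3://my-catchall-bucket/inputs/*"  # each match is a subfolder
OUTPUT_PATH = "s3://my-catchall-bucket/output/"

# A streaming step on EMR 4.x+ is submitted through command-runner.jar.
step = {
    "Name": "process-all-inputs",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hadoop-streaming",
            "-input", INPUT_GLOB,     # single glob instead of 200+ paths
            "-output", OUTPUT_PATH,
            "-mapper", "mapper.py",
            "-reducer", "reducer.py",
        ],
    },
}

# Submitting it would look something like this (commented out, since it
# needs real AWS credentials and a running cluster ID):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])

print(step["HadoopJarStep"]["Args"][2])
```

Is something along these lines possible, or does EMR require the input paths to be enumerated individually?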
Thanks!