Wednesday, 25 February 2015

S3DistCp Grouping by Folder

I'm trying to use S3DistCp to get around the small files problem in Hadoop. It's working, but the output is a bit annoying to work with. The file paths I'm dealing with look like this:


s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201402.csv


and there can be multiple files within that folder. I want to group by the folder name, so I use the following group-by argument in S3DistCp:


--groupBy '.(........-.........-....-............).'


and it does group the files, but it still results in multiple output folders, with one file in each folder. Is there any way to output the grouped files into one folder, instead of multiple?
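For reference, here is a rough Python sketch of how I understand the grouping to work: files whose capture group matches the same text end up in the same group. It only simulates the regex step and does not call S3DistCp. Two things here are assumptions: the pattern is wrapped in `.*` so that it can match the whole path (the argument above shows single dots), and the second and third paths are made-up examples. Python's `re` may also differ in detail from the regex engine S3DistCp uses.

```python
import re
from collections import defaultdict

# Assumption: the pattern is wrapped in ".*" so it can match the full path.
# The parenthesised part is the group-by pattern from the question.
GROUP_BY = re.compile(r".*(........-.........-....-............).*")


def group_paths(paths):
    """Bucket paths by the text captured by the group-by pattern."""
    buckets = defaultdict(list)
    for path in paths:
        match = GROUP_BY.fullmatch(path)
        if match is None:
            # Paths that do not match the pattern are left out of every group.
            continue
        key = "".join(match.groups())
        buckets[key].append(path)
    return dict(buckets)


paths = [
    "s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201402.csv",
    # Hypothetical extra paths, for illustration only.
    "s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201403.csv",
    "s3://test-bucket/test/1111eb6e-4460-4b99-b93a-469d20543bf3/201402.csv",
]

groups = group_paths(paths)
for key, members in groups.items():
    print(key, len(members))
```

In this sketch each distinct folder name becomes its own group key, which lines up with what I'm seeing: one combined file per source folder rather than everything in a single folder.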


Thanks!




