Tuesday, August 25, 2015

Can you use s3distcp with gzipped input?

I'm trying to use s3distcp to copy a lot of small gzipped files. s3distcp has an outputCodec argument that can be used to compress the output, but it doesn't have a corresponding inputCodec. I'm trying to use --jobconf, as with a Hadoop streaming call, but it doesn't seem to be doing anything (the output is still gzipped). The command I'm using is:

hadoop jar lib/emr-s3distcp-1.0.jar -Dstream.recordreader.compression=gzip \
           --src s3://inputfolder --dest hdfs:///data

Any ideas what might be going on? I'm running AWS EMR AMI-3.9.



