Thursday, October 1, 2015

Loading files with underscore prefix from S3 into Apache Pig on Amazon EMR

I receive log files in a bucket, say one named 'my-bucket'. Their filenames all start with an underscore ('_'), i.e.:

s3://my-bucket/_file1.gz  
s3://my-bucket/_file2.gz  
s3://my-bucket/_file3.gz  
etc.  

When I do:

a = load 's3://my-bucket/*';
dump a;

it does not find any files.

I did some research, and it appears that Pig treats files whose names start with '.' or '_' as hidden and does not load them. I wrote a script with boto to remove the prefix, after which the files load, but it takes a long time to run because it has to copy every file.
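For reference, a rename script of the kind described above might look roughly like the sketch below. It is not the original script: it assumes boto3 rather than boto, and the function names strip_hidden_prefix and rename_hidden_keys are made up for illustration. S3 has no rename operation, so each object is copied to its new key and the old key is then deleted, which is why this approach is slow on large buckets.

```python
import posixpath


def strip_hidden_prefix(key):
    """Return key with leading '_' or '.' removed from its last path component.

    Keys whose last component would become empty are returned unchanged.
    """
    directory, name = posixpath.split(key)
    stripped = name.lstrip("_.")
    if not stripped:
        return key
    return posixpath.join(directory, stripped)


def rename_hidden_keys(client, bucket):
    """Copy each hidden-looking object to a visible key, then delete the original.

    `client` is assumed to behave like a boto3 S3 client.
    Returns a list of (old_key, new_key) pairs.
    """
    # Collect the keys first so that the listing is not affected by the copies.
    keys = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])

    renamed = []
    for old_key in keys:
        new_key = strip_hidden_prefix(old_key)
        if new_key == old_key:
            continue  # nothing to strip
        client.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": old_key},
        )
        client.delete_object(Bucket=bucket, Key=old_key)
        renamed.append((old_key, new_key))
    return renamed


# Hypothetical usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   rename_hidden_keys(boto3.client("s3"), "my-bucket")
```

Note that this sketch does not check whether the new key already exists, so it would overwrite an object of the same name.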

Is there a way to tell Pig not to ignore '_' prefixed files?

Thanks for any help!



