I receive log files into a bucket, say named 'my-bucket'. Their filenames all start with an underscore ('_'), e.g.:
s3://my-bucket/_file1.gz
s3://my-bucket/_file2.gz
s3://my-bucket/_file3.gz
etc.
When I do:
a = load 's3://my-bucket/*';
dump a;
it does not find any files.
I did some research, and it appears that Pig treats files whose names start with '.' or '_' as hidden and does not load them. I wrote a script with boto to remove the prefix, after which the files load, but it takes a long time to run because it has to copy every file.
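For reference, here is a rough Python sketch of the rule as I understand it, plus the renaming step my script performs. The function names `is_hidden` and `visible_key` are my own, purely for illustration; they are not part of Pig, Hadoop, or boto, and the actual S3 copy calls are left out.

```python
import posixpath

# Assumption: a file counts as hidden when its base name starts with
# "_" or ".", which matches the behaviour described above.
HIDDEN_PREFIXES = ("_", ".")


def is_hidden(key):
    """Return True if the last path component starts with '_' or '.'."""
    name = posixpath.basename(key.rstrip("/"))
    return name.startswith(HIDDEN_PREFIXES)


def visible_key(key):
    """Return the key with leading '_' and '.' removed from the file name."""
    directory, name = posixpath.split(key)
    stripped = name.lstrip("_.")
    if not stripped:
        raise ValueError("nothing left after stripping prefix: %r" % key)
    return posixpath.join(directory, stripped)


keys = ["_file1.gz", "_file2.gz", "logs/.tmp", "logs/ok.gz"]
# Map each hidden key to the name it would be copied to.
renames = {k: visible_key(k) for k in keys if is_hidden(k)}
```

Each entry in `renames` is one copy in the bucket, which is why this approach gets slow as the number of files grows.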
Is there a way to tell Pig not to ignore '_' prefixed files?
Thanks for any help!