The code below works as expected:
a = load 'data_a' using PigStorage('\t') as (a1, a2, a3);
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
a_b = join a by a1, b by b1; --inner join
When I inspect the fields, they are populated correctly.
However, once I add a projection after the join, the result is wrong:
a = load 'data_a' using PigStorage('\t') as (a1, a2, a3);
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
a_b = join a by a1, b by b1; --inner join
ab = foreach a_b generate a1 as a1, a2 as a2, b2 as b2;
In ab, every value in the fields that come from b is NULL.
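One way to narrow this down, offered only as a sketch: after a join, Pig prefixes field names with the source alias (for example a::a1 and b::b2), so printing the schema with describe and referencing the joined fields through the :: disambiguation operator shows whether the projection is picking up the fields it is meant to. This assumes the same data_a and data_b inputs as above and is not a confirmed fix for the NULL values.

```pig
a = load 'data_a' using PigStorage('\t') as (a1, a2, a3);
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
a_b = join a by a1, b by b1; --inner join

-- Print the joined schema to see the alias-prefixed field names.
describe a_b;

-- Reference each field through its source alias instead of the bare name.
ab = foreach a_b generate a::a1 as a1, a::a2 as a2, b::b2 as b2;

-- Confirm the projected schema before inspecting the data.
describe ab;
```

If the fields from b are still NULL with the explicit references, the cause is probably not name resolution in the projection.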
The same thing happens if I project before joining:
a = load 'data_a' using PigStorage('\t') as (a1, a2, a3);
a2 = foreach a generate a1, a2;
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
b2 = foreach b generate b1, b2;
ab = join a2 by a1, b2 by b1;
I use the following workaround, but I would rather avoid the extra store/load round trip:
a = load 'data_a' using PigStorage('\t') as (a1, a2, a3);
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
a_b = join a by a1, b by b1; --inner join
store a_b into 'hdfs:///a_b_temp' using PigStorage('\t','-schema');
a_b2 = load 'hdfs:///a_b_temp' using PigStorage('\t');
ab = foreach a_b2 generate a1 as a1, a2 as a2, b2 as b2;
With this workaround, the fields in ab are not NULL.
I am new to Pig. Are there any known bugs or issues that could be causing this? I have seen it happen several times with different data sets.
I am using Pig 0.12 on Amazon EMR.
Thanks for any help!