We are using Hive 0.13.1 in AWS EMR, and everything was fine until we had to work with UTF-16 encoded files.
The symptoms are:
- Some string values have "special symbols" at the start. Most probably these are the first entries in each file, with the BOM characters included.
- Every row is immediately followed by an all-NULL row. Most probably that is due to the LINES TERMINATED BY '\n' table creation directive working in a funny way with UTF-16 encoded files.
- Whenever I try to find specific entries, specifying, say, "provider_name = 'provider1'", nothing comes out, even though I can see that such entries exist in the table when running selects like "select * from mytable limit 5". It seems the STRINGs in the table and the STRINGs typed into the Hive CLI don't match. They don't match even when I copy the value straight from the output of another query: nothing is found.
Aside from that, everything works fine: count, select distinct, select .. limit etc.
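To make the suspicion concrete, here is a minimal Python sketch of what a byte-oriented line reader might do to UTF-16 data. Everything in it is an assumption for illustration: the two sample rows, the tab delimiter, little-endian byte order, Windows (CRLF) line endings, and the helper name `split_like_byte_reader`, which only emulates a reader that treats the bytes CR, LF, or CRLF as line terminators. It is not Hive's actual code.

```python
import codecs
import re


def split_like_byte_reader(raw):
    """Emulate a reader that splits on the raw bytes CR, LF, or CRLF.

    Assumption: the reader knows nothing about UTF-16, so it sees the
    two-byte code units 0D 00 and 0A 00 as terminators plus stray NULs.
    """
    return re.split(rb"\r\n|\r|\n", raw)


def sample_records():
    # Hypothetical file content: two tab-separated rows, CRLF line endings.
    text = "provider1\t10\r\nprovider2\t20\r\n"
    # UTF-16 little-endian with a byte order mark, as many Windows tools write it.
    raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return split_like_byte_reader(raw)


if __name__ == "__main__":
    for record in sample_records():
        # Split fields on the raw tab byte, again ignoring the encoding.
        print([field for field in record.split(b"\t")])
```

Under these assumptions the first record starts with the BOM bytes, every data record is followed by a record holding a single NUL byte, and each field has NUL bytes interleaved with its characters. A terminal would typically not display those NULs, so a value can look like provider1 on screen and still fail a comparison against 'provider1'.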
I tried to apply the fix suggested here, replacing "GBK" with "UTF-16", with no luck. Perhaps the problem is that this fix was applied to Hive 0.14.0, while AWS EMR only supports 0.13.1 at the moment.
(However, this solution seems to be rather an old one - check here. Still, it doesn't help a bit.)
Could someone suggest anything?