Monday, 29 June 2015

Hive support for UTF-16 files

We are using Hive 0.13.1 on AWS EMR, and everything was fine until we had to work with UTF-16 encoded files.

The symptoms are:

  1. Some string values have "special symbols" at the start. Most probably these are the first entries in each file, with the BOM (byte order mark) characters included.

  2. Every row is immediately followed by an all-NULLs row. Most probably that is due to the LINES TERMINATED BY '\n' table creation directive interacting badly with UTF-16 encoded files.

  3. Whenever I filter for specific entries with, say, "provider_name = 'provider1'", nothing comes out, even though I can see that such entries exist in the table when running selects like "select * from mytable limit 5". It seems that the STRINGs stored in the table and the STRINGs typed into the Hive CLI don't match. They don't match even when I copy the value directly from the output of another query: nothing is found.
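A minimal sketch of how all three symptoms could arise if the UTF-16 bytes are split into lines and fields one byte at a time, as a reader that assumes a single-byte-compatible encoding would do. The sample data, the little-endian byte order, and the Windows (CRLF) line endings are assumptions for illustration; this mimics byte-oriented line splitting in Python and is not Hive's actual code path.

```python
import codecs

# Hypothetical two-row, tab-separated sample with CRLF line endings.
text = "provider1\tA\r\nprovider2\tB\r\n"

# UTF-16 little-endian with a leading BOM (assumed layout of the input files).
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Byte-oriented splitting: breaks at the single bytes \r, \n (or the pair \r\n),
# ignoring the 0x00 byte that follows each of them in UTF-16-LE.
byte_lines = raw.splitlines()
first_field = byte_lines[0].split(b"\t")[0]

# Symptom 1: the first value of the file starts with the BOM bytes.
print(first_field.startswith(codecs.BOM_UTF16_LE))  # True

# Symptom 2: a near-empty "row" (a lone NUL byte) follows each real row,
# because the 0x00 halves of \r and \n are left stranded between the breaks.
print(byte_lines[1])  # b'\x00'

# Symptom 3: the stored bytes never equal the literal typed in a query,
# since every character is followed by a 0x00 byte.
print(first_field == b"provider1")  # False

# Transcoding to UTF-8 before loading removes all three effects in this sketch.
utf8 = raw.decode("utf-16").encode("utf-8")
clean_lines = utf8.splitlines()
print(clean_lines[0].split(b"\t")[0] == b"provider1")  # True
```

If this is indeed what happens, converting the files to UTF-8 before they reach the table would be one possible workaround, though that is a guess based on the symptoms rather than something confirmed for Hive 0.13.1.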

Aside from that, everything works fine: count, select distinct, select .. limit etc.

I tried to apply the fix suggested here, replacing "GBK" with "UTF-16", with no luck. Perhaps the problem is that this fix was applied to Hive 0.14.0, and AWS EMR only supports 0.13.1 at the moment.

(However, this solution seems to be a rather old one; check here. Still, it doesn't help at all.)

Could someone suggest anything?



