Sunday, May 31, 2015

No module named simplejson in Python UDF on EMR

I'm running an Amazon Elastic MapReduce (EMR) job using Pig, and I can't import the json or simplejson modules in my Python user-defined function (UDF).

Here is my code:

#!/usr/bin/env python
import simplejson as json
@outputSchema('m:map[]')
def flattenJSON(text):
    j = json.loads(text)
    ...

When I try to register the function in Pig, I get the error "No module named simplejson":

grunt> register 's3://chopperui-emr/code/flattenDict.py' using jython as flatten;
2015-05-31 16:53:43,041 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1121: Python Error. Traceback (most recent call last):
File "/tmp/pig6071834754384533869tmp/flattenDict.py", line 32, in <module>
import simplejson as json
ImportError: No module named simplejson

However, my Amazon AMI ships Python 2.6, which includes json in the standard library, yet using import json in the UDF fails as well. Also, when I try to install simplejson with pip, it reports that the package is already installed (on both the master and core nodes):

[hadoop@ip-172-31-46-71 ~]$ pip install simplejson
Requirement already satisfied (use --upgrade to upgrade): simplejson in /usr/local/lib64/python2.6/site-packages

The import also works fine if I run Python interactively from the command line on the master node:

[hadoop@ip-172-31-46-71 ~]$ python
Python 2.6.9 (unknown, Apr  1 2015, 18:16:00) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> 

There must be something different about how EMR or Pig is setting up the Python environment, but what?
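One way to narrow this down is to have the script itself report which interpreter is executing it and which modules that interpreter can import. The sketch below is only a diagnostic under stated assumptions: `describe_interpreter` is a hypothetical helper name, not part of Pig or EMR, and the check relies on Jython reporting a `sys.platform` value that starts with "java". Since the script is registered with `using jython`, Pig runs it under its embedded Jython rather than the system CPython that pip installs into, so the two environments may well differ.

```python
import sys


def describe_interpreter(module_names=("simplejson", "json")):
    """Report which interpreter is running and which modules it can import.

    Hypothetical diagnostic helper; the default module names are the two
    modules mentioned in the question.
    """
    report = {
        "version": sys.version,
        # Assumption: Jython sets sys.platform to a string starting with "java".
        "is_jython": sys.platform.startswith("java"),
        # Copy of the module search path this interpreter actually uses.
        "path": list(sys.path),
        "importable": {},
    }
    for name in module_names:
        try:
            __import__(name)
            report["importable"][name] = True
        except ImportError:
            report["importable"][name] = False
    return report


if __name__ == "__main__":
    info = describe_interpreter()
    print(info["version"])
    print(info["is_jython"])
    print(info["path"])
    print(info["importable"])
```

Running this from the shell on the master node and comparing it with what the same function returns when called from inside the registered UDF file should show whether the two runs share a search path. If `/usr/local/lib64/python2.6/site-packages` is missing from the path seen inside Pig, that would account for the ImportError.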



