Wednesday, November 19, 2014

Things I Didn't Know: Python Modules and Hadoop Streaming

I have a python script that I'm converting to MapReduce. This script relies on another python module that I wrote. I stage both scripts onto the worker nodes with the -files option like so:

hadoop jar $STREAMING -files myscript.py,mymodule.py -mapper "python myscript.py" ...

When this job runs, I get an error in the logs saying that myscript.py can't import mymodule. Huh. I immediately suspect that the files aren't staging correctly, so I wrap the mapper in a shell script that runs an ls before executing the python code.
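A minimal sketch of that wrapper (run_mapper.sh is just my name for it):

#!/bin/bash
# run_mapper.sh -- list the task's working directory to stderr (stdout is
# the mapper's output stream, so debugging goes to stderr), then hand off
# to the real mapper.
ls >&2
exec python myscript.py

The wrapper goes into the -files list along with the two scripts, and the -mapper argument becomes "bash run_mapper.sh".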

I run the job again with the wrapper, and I see from the ls output that, indeed, both scripts are right there in the working directory. Huh. So I look up the python module loading process. In python, the first entry on the module search path, sys.path[0], is the directory that holds the script being run, which here ought to be the working directory. From the ls output I know that mymodule.py is there, so something else must be afoot.

I modify the python code to print sys.path, modify the wrapper script to also print the working directory, and rerun the job. This time I see that the python search path does not contain the working directory at all; it contains some other YARN-created directory instead. Double-huh.
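The diagnostic version of the wrapper looks roughly like this, with myscript.py itself getting a print of sys.path added at the top:

#!/bin/bash
# run_mapper.sh -- print the task's working directory and its contents to
# stderr, then run the mapper (which now prints sys.path on startup).
pwd >&2
ls >&2
exec python myscript.py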

For lack of any better ideas, I modify the wrapper to print ls -lR instead of just ls and rerun the job. In the output I can now see that the scripts aren't actually located in the job's working directory. The working directory contains symlinks to the scripts, AND EVERY SCRIPT IS IN ITS OWN DIRECTORY! When my python script runs, the python interpreter resolves the symlink to find the real location of the script, puts that directory at the front of the search path, and rightly determines that mymodule is not there. Doh.
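The same behavior is easy to reproduce outside Hadoop. Here's a little experiment with made-up paths that mimics what YARN does with the staged files:

# The script lives in one directory and is symlinked into another.
mkdir -p /tmp/real /tmp/work
printf 'import sys\nprint(sys.path[0])\n' > /tmp/real/showpath.py
ln -s /tmp/real/showpath.py /tmp/work/showpath.py
cd /tmp/work
python showpath.py   # prints /tmp/real, not /tmp/work -- python follows the symlink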

Fortunately the solution is easy. In the wrapper script I explicitly set PYTHONPATH to the job's working directory, and everything then works as expected. What a pain, though.
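The final wrapper ends up as something like this (run_mapper.sh again being my own name for it):

#!/bin/bash
# run_mapper.sh -- put the task's working directory, which holds the
# symlinks created from -files, at the front of PYTHONPATH so that
# "import mymodule" resolves there, then run the real mapper.
export PYTHONPATH="$(pwd)${PYTHONPATH:+:$PYTHONPATH}"
exec python myscript.py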
