Wednesday, November 19, 2014

Things I Didn't Know: Python Modules and Hadoop Streaming

I have a Python script that I'm converting to MapReduce. This script relies on another Python module that I wrote. I stage both scripts onto the worker nodes with the -files option, like so:

hadoop jar $STREAMING -files myscript.py,mymodule.py -mapper "python myscript.py" ...

When the job runs, I get an error in the logs: myscript.py can't locate the mymodule module. Huh. I immediately suspect that the files aren't staging correctly, so I wrap the mapper in a shell script that runs an ls before executing the Python code.

I run the job again with the wrapper, and I see from the ls output that, indeed, both scripts are right there in the working directory. Huh. So I look up the Python module loading process. In Python, the head of the module search path (sys.path[0]) is the directory that holds the script being run, which is typically the current working directory. From the ls output I know that mymodule.py is there, so something else must be afoot.
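That lookup rule is easy to check locally. Here's a minimal sketch (the temp directories and script name are mine, not from the job) showing that the head of sys.path is the script's directory, not the caller's working directory:

```python
import os
import subprocess
import sys
import tempfile

# Write a throwaway script into one directory, then run it from another.
script_dir = tempfile.mkdtemp()
other_dir = tempfile.mkdtemp()
script = os.path.join(script_dir, "myscript.py")
with open(script, "w") as f:
    f.write("import sys; print(sys.path[0])\n")

# The child prints its own sys.path[0]: the script's directory,
# even though the process's working directory is other_dir.
head = subprocess.check_output([sys.executable, script], cwd=other_dir)
print(os.path.realpath(head.decode().strip()) == os.path.realpath(script_dir))  # True
```

So as long as mymodule.py sits next to myscript.py, the import should work, which is why the ls result above is so puzzling.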

I modify the Python code to print sys.path, modify the wrapper script to also print the working directory, and rerun the job. This time, I see that the Python search path does not contain the working directory at all. It contains some other YARN-created directory instead. Double-huh.

For lack of any better ideas, I modify the wrapper to print ls -lR instead of just ls and rerun the job. In the output I can now see that the scripts aren't actually located in the job's working directory. The working directory contains symlinks to the scripts, AND EVERY SCRIPT IS IN ITS OWN DIRECTORY! When my script runs, the Python interpreter resolves the symlink to the script's real location and uses that directory, not the working directory, as the head of sys.path, so it rightly determines that mymodule is not there. Doh.
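The failure is reproducible entirely outside of Hadoop. This sketch (directory names are stand-ins for YARN's staging layout) puts the real script in one directory and a symlink to it, alongside the dependency, in another:

```python
import os
import subprocess
import sys
import tempfile

# The script really lives in real_dir; the "working directory" holds only
# a symlink to it, plus the module it depends on -- the same shape as
# YARN's symlinked staging area.
real_dir = tempfile.mkdtemp()
work_dir = tempfile.mkdtemp()
with open(os.path.join(real_dir, "myscript.py"), "w") as f:
    f.write("import mymodule\n")
with open(os.path.join(work_dir, "mymodule.py"), "w") as f:
    f.write("pass\n")
os.symlink(os.path.join(real_dir, "myscript.py"),
           os.path.join(work_dir, "myscript.py"))

# Run through the symlink: the interpreter resolves it, puts real_dir at
# the head of sys.path, and the import of mymodule (which sits next to
# the symlink, not next to the real script) fails.
proc = subprocess.run([sys.executable, "myscript.py"], cwd=work_dir,
                      capture_output=True, text=True)
print("ImportError" in proc.stderr or "ModuleNotFoundError" in proc.stderr)  # True
```

Two scripts staged with -files behave exactly like this pair of directories.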

Fortunately the solution is easy: in the wrapper script I explicitly set PYTHONPATH to the job's working directory, and everything then works as expected. What a pain, though.
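The fix can be sketched the same way (again with made-up paths standing in for the job's working directory): point PYTHONPATH at the directory of symlinks, and the import resolves.

```python
import os
import subprocess
import sys
import tempfile

# Same layout as the failing case: the real script in one directory, its
# dependency sitting next to a symlink in another.
real_dir = tempfile.mkdtemp()
work_dir = tempfile.mkdtemp()
with open(os.path.join(real_dir, "myscript.py"), "w") as f:
    f.write("import mymodule; print('imported ok')\n")
with open(os.path.join(work_dir, "mymodule.py"), "w") as f:
    f.write("pass\n")
os.symlink(os.path.join(real_dir, "myscript.py"),
           os.path.join(work_dir, "myscript.py"))

# PYTHONPATH entries are added to sys.path, so the symlink directory is
# searched and mymodule is found.
env = dict(os.environ, PYTHONPATH=work_dir)
out = subprocess.check_output([sys.executable, "myscript.py"],
                              cwd=work_dir, env=env, text=True)
print(out.strip())  # imported ok
```

In the wrapper script itself, the equivalent one-liner is an export of PYTHONPATH set to the current directory before invoking the interpreter.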

Tuesday, November 18, 2014

Things I Didn't Know: The Sqoop-Hive Data Shell Game

I've been at Cloudera for nearly four years now (!), and by this point I've developed a pretty good depth in most of the tools in CDH. Every once in a while, though, I run into something that violates the principle of least astonishment, even given that I expect Hadoop to be pretty astonishing. This morning in the shower I decided that I should be documenting these things when I run into them. This post is the first one.

Yesterday I was trying to do a very simple operation. I had a MySQL data store with a table that I wanted to bring into Hive. I did the obvious thing: a Sqoop import with --hive-import. I also wanted the data for the Hive table to land in a specific directory, so I did the other obvious thing: --target-dir. One minor detail that turns out to be important: my target directory was of the form /user/daniel/data/mydata, and the /user/daniel/data directory did not exist before the import.

After the Sqoop job completed, I could see the table in Hive, but when I looked at my /user/daniel/data directory, it was empty. Stranger still, the directory existed, which means something had created it, yet it held nothing. Where was the /user/daniel/data/mydata directory?

I assumed it was an issue with the import, so I reran it and watched carefully. The import ran smoothly, and I could see that it put the data exactly where I wanted it to go. Then Sqoop created the table over the data in Hive. There's no way (that I know of) to tell Sqoop to create an external table. (There is a JIRA, though.) Since the table is a managed one, when Sqoop loaded the data into it, Hive moved the /user/daniel/data/mydata directory into /user/hive/warehouse, leaving behind only the freshly created, empty parent. I did an ls on /user/hive/warehouse/mydata, and there all my data was.

If I had run SHOW TABLE EXTENDED on the generated Hive table in the first place, it would have shown me where the data was, but it wouldn't have explained how it got there. And now I know.