Wednesday, November 19, 2014

Things I Didn't Know: Python Modules and Hadoop Streaming

I have a python script that I'm converting to MapReduce. This script relies on another python module that I wrote. I stage both scripts onto the worker nodes with the -files option like so:

hadoop jar $STREAMING -files myscript.py,mymodule.py -mapper "python myscript.py" ...

When the job runs, I get an error in the task logs that myscript.py can't import mymodule. Huh. I immediately suspect that the files aren't staging correctly, so I wrap the mapper in a shell script that runs an ls before executing the python code.
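The wrapper is worth showing. Here's a sketch of the idea, simulated locally with a stand-in mapper so it can run off the cluster (the file and script names here are hypothetical):

```shell
# Simulate the debugging wrapper locally: a stand-in mapper plus a wrapper
# that logs pwd and ls to stderr (so it lands in the task logs) before
# exec-ing the real mapper.
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for the real mapper script.
printf 'print("mapper ran")\n' > myscript.py

cat > wrapper.sh <<'EOF'
#!/bin/sh
# Log the task's working directory to stderr, then run the mapper.
pwd 1>&2
ls -l 1>&2
exec python3 myscript.py
EOF
chmod +x wrapper.sh

./wrapper.sh
```

On the cluster the wrapper is staged along with the scripts and takes the mapper's place, something like -files myscript.py,mymodule.py,wrapper.sh -mapper "sh wrapper.sh", and the ls output shows up in the task's stderr log.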

I run the job again with the wrapper, and I see from the ls output that, indeed, both scripts are right there in the working directory. Huh. So I look up the python module loading process. In python, the head of the module search path is the directory that holds the script being run, typically the current working directory. From the ls output I know that mymodule.py is there, so something else must be afoot.

I modify the python code to print sys.path and modify the wrapper script to also print the working directory, and then I rerun the job. This time, what I see is that the python search path does not contain the working directory. It contains some other YARN-created directory instead. Double-huh.

For lack of any better ideas, I modify the wrapper to print ls -lR instead of just ls and rerun the job. In the output I can now see that the scripts aren't actually located in the job's working directory. The working directory contains symlinks to the scripts, AND EVERY SCRIPT IS IN ITS OWN DIRECTORY! When my python script runs, the python interpreter resolves the symlink to the script's real location and uses that directory to rightly determine that mymodule is not there. Doh.
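This behavior is easy to reproduce off the cluster. A minimal sketch (all paths hypothetical): put a script in one directory, symlink it into another, and run it through the symlink; the head of sys.path points at the real directory, not the one you ran from:

```shell
# Reproduce the symlink behavior locally.
set -e
base=$(mktemp -d)
mkdir "$base/real" "$base/work"
# The script just reports the head of the module search path.
printf 'import sys\nprint(sys.path[0])\n' > "$base/real/myscript.py"
ln -s "$base/real/myscript.py" "$base/work/myscript.py"
cd "$base/work"
# Python resolves the symlink, so this prints the .../real directory,
# not the .../work directory we're sitting in.
python3 myscript.py
```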

Fortunately the solution is easy. In the wrapper script I explicitly set PYTHONPATH to the job's working directory, and everything then works as expected.  What a pain, though.
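The whole failure and the fix can be replicated locally too. In this sketch (all directory names made up), each file lives in its own cache directory, YARN-style, with symlinks in the working directory; putting the working directory on PYTHONPATH makes the import resolve through the symlink:

```shell
set -e
base=$(mktemp -d)
# YARN-style layout: every staged file in its own directory, symlinked
# into the job's working directory.
mkdir -p "$base/cache/10" "$base/cache/11" "$base/work"
printf 'import mymodule\nprint(mymodule.GREETING)\n' > "$base/cache/10/myscript.py"
printf 'GREETING = "hello from mymodule"\n' > "$base/cache/11/mymodule.py"
ln -s "$base/cache/10/myscript.py" "$base/work/myscript.py"
ln -s "$base/cache/11/mymodule.py" "$base/work/mymodule.py"
cd "$base/work"

# Without this, the import fails: python searches the script's real
# directory (cache/10), where mymodule.py does not live. With the
# working directory on PYTHONPATH, the import goes through the symlink.
export PYTHONPATH="$(pwd)"
python3 myscript.py   # prints: hello from mymodule
```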

Tuesday, November 18, 2014

Things I Didn't Know: The Sqoop-Hive Data Shell Game

I've been at Cloudera for nearly four years now (!), and by this point I've developed a pretty good depth in most of the tools in CDH. Every once in a while, though, I run into something that violates the principle of least astonishment, even given that I expect Hadoop to be pretty astonishing. This morning in the shower I decided that I should be documenting these things when I run into them. This post is the first one.

Yesterday I was trying to do a very simple operation. I had a MySQL data store with a table that I wanted to bring into Hive. I did the obvious thing: Sqoop import with --hive-import. I also wanted the data for the Hive table to be in a specific directory, so I did the obvious thing: --target-dir. One other minor detail that turns out to be important: my target directory was of the form /user/daniel/data/mydata, and the /user/daniel/data directory did not exist before the import.
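For concreteness, the command looked roughly like this (the connection details and credentials are invented; the flags that matter to the story are --target-dir and --hive-import):

```shell
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username daniel \
  --password-file /user/daniel/.password \
  --table mydata \
  --target-dir /user/daniel/data/mydata \
  --hive-import
```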

After the Sqoop job completed, I could see the table in Hive, but when I looked at my /user/daniel/data directory, it was empty. Stranger still, the directory itself existed, so the import had created it, yet there was nothing in it. Where was the /user/daniel/data/mydata directory?

I assumed it was an issue with the import, so I reran it and watched carefully. The import ran smoothly, and I could see that it put the data where I wanted it to go. And then Sqoop created the table over the data in Hive. There's no way (that I know of) to tell Sqoop to create an external table. (There is a JIRA, though.) That means when Sqoop created the Hive table, it moved the /user/daniel/data/mydata directory into /user/hive/warehouse. I did an ls on /user/hive/warehouse/mydata, and there all my data was.
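The workaround I know of is to skip --hive-import, leave the data in --target-dir, and declare the external table yourself, since Hive leaves external table data where it sits. A sketch (connection details and the column list are invented for illustration; Sqoop's default output is comma-delimited text, hence the row format):

```shell
# Plain import, no --hive-import, so the files stay in --target-dir...
sqoop import --connect jdbc:mysql://dbhost/mydb --table mydata \
  --target-dir /user/daniel/data/mydata
# ...then point an external table at them.
hive -e "CREATE EXTERNAL TABLE mydata (id INT, name STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         LOCATION '/user/daniel/data/mydata';"
```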

If I had run SHOW TABLE EXTENDED on the generated Hive table in the first place, it would have shown me where the data was, but it wouldn't have explained how it got there. And now I know.

Monday, September 8, 2014

Puff the Weepy Dragon

So, my daughter asked me tonight if Puff the Magic Dragon is a sad song, and to explain why. She's six. Oh, boy. As far back as I can remember, that song has made me cry. When I was a kid, it was just sad. Now that I'm an adult, it's gut-wrenchingly, heart-rendingly sad. So, there I am, lying in bed with her, trying my best not to bawl uncontrollably while explaining that Puff is a metaphor (and explaining what a metaphor is). Tears are streaming down my face onto the pillow, and I can barely squeak out answers to her questions.

I felt like such a sap that I had to go Google to see if there's a clinical name for my Puff the Magic Dragon disorder. I was somewhat relieved that Google auto-completed "Puff the Magic Dragon makes me cry" from just "Puff Dragon mak". There I was, looking at a page full of links to articles from people professing the same illness. I am not alone.

In that list of links, I found two things I wasn't expecting to find. The first is this post from the original author of the poem that became the song. Lenny explains the original inspiration for the poem/song (Ogden Nash's Custard the Cowardly Dragon, which I also love) and convincingly debunks the pot myth. The second was this article, which professes to have the cure for my disease. Apparently Peter Yarrow had a daughter who was also afflicted, and so he published a Puff book that adds a happy ending. I'm incredulous that anything can save me from that song, but the book looks beautiful, and it's worth a try. I'll let you know how it goes.