Wednesday, November 19, 2014

Things I Didn't Know: Python Modules and Hadoop Streaming

I have a python script that I'm converting to MapReduce. This script relies on another python module that I wrote. I stage both scripts onto the worker nodes with the -files option like so:

hadoop jar $STREAMING -files myscript.py,mymodule.py -mapper "python myscript.py" ...

When this job runs, I get an error in the logs saying that myscript.py can't locate the mymodule module. Huh. I immediately suspect that the files aren't staging correctly, so I wrap the mapper in a shell script that runs an ls before executing the python code.

I run the job again with the wrapper, and I see from the ls output that, indeed, both scripts are right there in the working directory. Huh. So I look up the python module loading process. In python, the head of the module search path is the directory that holds the script being run, typically the current working directory. From the ls output I know that mymodule.py is there, so something else must be afoot.

I modify the python code to print sys.path and modify the wrapper script to also print the working directory, and then I rerun the job. This time, what I see is that the python search path does not contain the working directory. It contains some other YARN-created directory instead. Double-huh.

For lack of any better ideas, I modify the wrapper to run ls -lR instead of just ls and rerun the job. In the output I can now see that the scripts aren't actually located in the job's working directory. The working directory contains symlinks to the scripts, AND EVERY SCRIPT IS IN ITS OWN DIRECTORY! When my python script is run, the python interpreter works out the real location of the script, puts that directory at the head of the search path, and rightly determines that mymodule is not there. Doh.
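To make the symlink behavior concrete, here is a minimal sketch that recreates the layout locally, outside of Hadoop. It assumes a POSIX system where symlinks can be created freely; the directory names (work, cache0, cache1) and the function name are made up for illustration and are not what YARN actually uses.

```python
import os
import subprocess
import sys
import tempfile


def run_symlinked_job():
    """Run a script through a symlink and report what the interpreter saw."""
    with tempfile.TemporaryDirectory() as root:
        root = os.path.realpath(root)
        work_dir = os.path.join(root, "work")
        os.makedirs(work_dir)

        files = {
            "myscript.py": "import sys\nprint(sys.path[0])\nimport mymodule\n",
            "mymodule.py": "VALUE = 1\n",
        }
        # Each real file lives in its own directory; the working directory
        # only holds symlinks, mirroring the ls -lR output described above.
        for index, (name, body) in enumerate(sorted(files.items())):
            real_dir = os.path.join(root, "cache%d" % index)
            os.makedirs(real_dir)
            real_path = os.path.join(real_dir, name)
            with open(real_path, "w") as handle:
                handle.write(body)
            os.symlink(real_path, os.path.join(work_dir, name))

        # Drop PYTHONPATH so the child sees only the default search path.
        env = {k: v for k, v in os.environ.items() if k != "PYTHONPATH"}
        result = subprocess.run(
            [sys.executable, "myscript.py"],
            cwd=work_dir, env=env, capture_output=True, text=True,
        )
        return {
            "work_dir": work_dir,
            "search_path_head": result.stdout.strip(),
            "import_failed": result.returncode != 0,
            "stderr": result.stderr,
        }


if __name__ == "__main__":
    print(run_symlinked_job())
```

In this sketch the head of the search path should come back as the directory that holds the real myscript.py, not the working directory, and the import of mymodule should fail even though a symlink to it sits right next to the script.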

Fortunately the solution is easy. In the wrapper script I explicitly set PYTHONPATH to the job's working directory, and everything then works as expected.  What a pain, though.
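The fix above lives in the shell wrapper. A roughly equivalent alternative, which is not what I actually did, is to have the script itself put the working directory on the search path before importing anything. The helper name below is hypothetical, and the sketch assumes the symlinks to the staged files are in the process's current working directory.

```python
import os
import sys


def add_working_dir_to_path(path_list=None):
    """Make sure the current working directory is on the module search path."""
    if path_list is None:
        path_list = sys.path
    cwd = os.getcwd()
    if cwd not in path_list:
        # Insert at the front so the staged symlinks win over other entries.
        path_list.insert(0, cwd)
    return cwd


# Hypothetical placement: call this at the top of myscript.py,
# before the line that imports mymodule.
add_working_dir_to_path()
# import mymodule  # would now resolve through the symlink in the working directory
```

The trade-off is that every script that imports a sibling module needs this call, whereas setting PYTHONPATH once in the wrapper covers them all.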

Tuesday, November 18, 2014

Things I Didn't Know: The Sqoop-Hive Data Shell Game

I've been at Cloudera for nearly four years now (!), and by this point I've developed a pretty good depth in most of the tools in CDH. Every once in a while, though, I run into something that violates the principle of least astonishment, even given that I expect Hadoop to be pretty astonishing. This morning in the shower I decided that I should be documenting these things when I run into them. This post is the first one.

Yesterday I was trying to do a very simple operation. I had a MySQL data store with a table that I wanted to bring into Hive. I did the obvious thing: Sqoop import with --hive-import. I also wanted the data for the Hive table to be in a specific directory, so I did the obvious thing: --target-dir. One other minor detail that turns out to be important: my target directory was of the form /user/daniel/data/mydata, and the /user/daniel/data directory did not exist before the import.

After the Sqoop job completed, I could see the table in Hive, but when I looked at my /user/daniel/data directory, it was empty. Even more unusual, it existed, which means it had been created, but it was empty. Where was the /user/daniel/data/mydata directory?

I assumed it was an issue with the import, so I reran it and watched carefully. The import ran smoothly, and I could see that it put the data where I wanted it to go. And then Sqoop created the table over the data in Hive. There's no way (that I know of) to tell Sqoop to create an external table. (There is a JIRA, though.) That means when Sqoop created the Hive table, it moved the /user/daniel/data/mydata directory into /user/hive/warehouse. I did an ls on /user/hive/warehouse/mydata, and there all my data was.
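As a rough sketch of the shell game, the snippet below builds the kind of Sqoop command I was running and works out where the data ends up once the Hive table is created over it. The JDBC URL is a placeholder, the function names are made up, and the sketch assumes a managed table in the default database with the warehouse at /user/hive/warehouse.

```python
import posixpath
import shlex

# Assumption: the Hive warehouse sits at the location seen in this post.
HIVE_WAREHOUSE = "/user/hive/warehouse"


def sqoop_hive_import_command(jdbc_url, table, target_dir):
    """Build the argument list for a Sqoop import that also creates a Hive table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--hive-import",
        "--target-dir", target_dir,
    ]


def final_data_location(table, warehouse=HIVE_WAREHOUSE):
    """Where the files land after the load moves them out of the target directory."""
    return posixpath.join(warehouse, table)


if __name__ == "__main__":
    command = sqoop_hive_import_command(
        "jdbc:mysql://dbhost/mydb",  # placeholder connection string
        "mydata",
        "/user/daniel/data/mydata",
    )
    print(" ".join(shlex.quote(part) for part in command))
    print("data ends up in:", final_data_location("mydata"))
```

The point of the second function is that --target-dir only controls where the import step writes; it says nothing about where the data lives after the table is created.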

If I had run show extended on the generated Hive table in the first place, it would have shown me where the data was, but it wouldn't have explained how it got there. And now I know.

Monday, September 8, 2014

Puff the Weepy Dragon

So, my daughter asked me tonight if Puff the Magic Dragon is a sad song, and to explain why. She's six. Oh, boy. As far back as I can remember, that song has made me cry. When I was a kid, it was just sad. Now that I'm an adult, it's gut-wrenchingly, heart-rendingly sad. So, there I am, lying in the bed with her, trying my best not to bawl uncontrollably while explaining that Puff is a metaphor (and explaining what a metaphor is). Tears are streaming down my face onto the pillow, and I can barely squeak out answers to her questions.

I felt like such a sap that I had to go Google to see if there's a clinical name for my Puff the Magic Dragon disorder. I was somewhat relieved that Google auto-completed "Puff the Magic Dragon makes me cry" from just "Puff Dragon mak". There I was looking at a page full of links to articles from people professing the same illness. I am not alone.

In that list of links, I found two things I wasn't expecting to find. The first is this post from the original author of the poem that became the song. Lenny explains the original inspiration for the poem/song (Ogden Nash's Custard the Cowardly Dragon, which I also love) and convincingly debunks the pot myth. The second was this article, which professes to have the cure for my disease. Apparently Peter Yarrow had a daughter who was also afflicted, and so he published a Puff book that adds a happy ending. I'm incredulous that anything can save me from that song, but the book looks beautiful, and it's worth a try. I'll let you know how it goes.

Monday, October 17, 2011

ToddlerSort

It occurred to me while watching my youngest attempt to stack a set of nesting blocks, that he might be on to something. After some careful algorithm design, I give you ToddlerSort, the bleeding (drooling?) edge of sorting algorithms:


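import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Tantrum is assumed to be a checked exception defined elsewhere,
// e.g. class Tantrum extends Exception {}
public static <T extends Comparable<? super T>> List<T>
        toddlerSort(List<T> list, double irritability)
        throws Tantrum {
    // Work on a copy so the caller's list is left alone.
    List<T> sorted = new ArrayList<T>(list);
    int i = 0;

    while (i < sorted.size() - 1) {
        if (sorted.get(i).compareTo(sorted.get(i+1)) > 0) {
            // Out of order: either give up entirely or knock everything over.
            if (Math.random() < irritability) {
                throw new Tantrum();
            } else {
                Collections.shuffle(sorted);
                i = 0;
            }
        } else {
            i++;
        }
    }

    return sorted;
}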

This algorithm runs in O(∞) time, but in practice it is usually aborted before completion.

Tuesday, July 26, 2011

Now I Know I'm Dreaming

I'm sure, just like everyone else, I go through all the usual cycles of sleep, including lots of dreaming. But for the past few years, come morning I have had no recollection of my dreams whatsoever. Occasionally I will come away with a faint glimmer of a memory, but nothing I can recount.

Last week I read an article about sunlight that espoused the virtues of vitamin D. My wife had gone on a vitamin D kick a while back, and we still had some lying around. The article made a strong enough case that I figured I'd give it a go. For the last four evenings, I've been taking 3000 I.U. of vitamin D.

And for the last four mornings, I've woken up with detailed memories of my dreams. In fact, I can still recall vivid images from the dreams I had last night, and it's past 10:00 PM. I'm not sure that my newly found ability to recall my dreams translates into anything actually useful, but it's intriguing.

The properly scientific thing to do would be to stop taking the D and see if my dreams go dark again, but honestly, I'd rather just keep remembering my dreams. I'm sure at some point I'll forget to take the vitamins. I can be all scientific then.

Sunday, July 10, 2011

Less Sweety Pie

My wife and I have recently been on a quest to remove unnecessary naughties (fat, calories, and sugar, mostly) from the foods we make. I have a natural tendency towards the French style of cooking. For example, making scrambled eggs for two requires at least a quarter cup of cream. I've been working on cutting back on the fat and calories, but that's still a work in progress. Instead, let's talk about sugar.

My wife and I have a tendency to adopt recipes for baked goods from various sources, foodtv.com and America's Test Kitchen mostly, and we generally have some kind of dessert every night after dinner. Most American baked good recipes are on the sugar-heavy side, and our regulars are no exception. ATK's recipes are generally a little less egregious, but we still thought there was room for improvement. And there was, with one dish at least.

This crisp recipe is one of our favorites. I probably bake four a month. The best part about it is that it's versatile. It works equally well with peaches, apples, plums, whatever. Since we make this recipe so often, it seemed like a natural place to start. In our first pass, we dropped the 1/4 cup of sugar that goes into the fruit and reduced the white and brown sugar from 1/2 cup each to 1/3 cup each. It actually tasted better. The flavor of the fruit shone through more clearly, and the tartness was actually more palate-pleasing than the sweetness it had replaced.

My wife wasn't satisfied, though. Next the butter went down from a stick to 6 tablespoons. The impact was noticeable, but not necessarily in a bad way. I liked it better because it made the remaining sugar stand out more. My wife didn't like it as much because it lost some of its crispiness. Worth a try.

Tonight we decided to see how far we could push it. Using a low-calorie apple crisp recipe my wife found on the Internet as inspiration, we tried something radical in the form of a peach crisp. Not only did we drop the 1/4 cup of white sugar from the fruit, we dropped the white sugar from the recipe entirely. We also dropped the brown sugar down to 1/4 cup. To make up for the missing sugar, we upped the oats by 1/2 cup. Even more radical, we dropped the butter down to 4 tablespoons. To make that work, we had to melt the butter before mixing it into the topping.

The result was actually quite good. It's not the sort of dish I would expect to get rave reviews if served at a party, but for an every-night dessert, I think it's perfect. It really lets the fruit take center stage, and the guilt factor is next to zero. With fruit that's less sweet than very ripe peaches, a little more sugar might be necessary, but it's still low cal, low carb, low sugar, delicious and easy. Doesn't get much better than that.

I'm Back

When I left Oracle, I sadly had to leave my Grid Blog behind. It's hard to believe that I had about 8 years of my life logged in there. When my new gig didn't give me a blog on the company site, I decided that with home and work pressures as they are, I would free myself from the burden of blogging for a little while. I can't take it anymore. I keep coming up with things I wish I had a blog on which to rant. So, here it is. I'm back.

Since this is not a corporate blog, I have the freedom to say whatever I want, and I don't even have to include a legal disclaimer of my opinions. This could be dangerous; it will probably be fun; it will definitely be interesting. Talk to you soon.