Tuesday, November 18, 2014

Things I Didn't Know: The Sqoop-Hive Data Shell Game

I've been at Cloudera for nearly four years now (!), and by this point I've developed a pretty good depth in most of the tools in CDH. Every once in a while, though, I run into something that violates the principle of least astonishment, even given that I expect Hadoop to be pretty astonishing. This morning in the shower I decided that I should be documenting these things when I run into them. This post is the first one.

Yesterday I was trying to do a very simple operation: I had a MySQL table that I wanted to bring into Hive. I did the obvious thing: a Sqoop import with --hive-import. I also wanted the Hive table's data in a specific directory, so I did the other obvious thing: --target-dir. One minor detail that turns out to matter: my target directory was of the form /user/daniel/data/mydata, and the /user/daniel/data directory did not exist before the import.
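For concreteness, the import looked roughly like this. This is a sketch, not my exact command: the JDBC connection string, database name, username, and table name here are all hypothetical.

```shell
# Hypothetical connection details and table name; the flags are the point.
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username daniel -P \
  --table mydata \
  --hive-import \
  --target-dir /user/daniel/data/mydata
```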

After the Sqoop job completed, I could see the table in Hive, but when I looked at my /user/daniel/data directory, it was empty. Stranger still, the directory itself existed, which meant it had been created during the import, yet there was nothing in it. Where was the /user/daniel/data/mydata directory?

I assumed it was an issue with the import, so I reran it and watched carefully. The import ran smoothly, and I could see that it put the data where I wanted it to go. Then Sqoop created the Hive table over that data. There's no way (that I know of) to tell Sqoop to create an external table. (There is a JIRA, though.) That means when Sqoop created the Hive table, it moved the /user/daniel/data/mydata directory into /user/hive/warehouse. I did an ls on /user/hive/warehouse/mydata, and there was all my data.
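Until that JIRA lands, one workaround is to skip --hive-import and declare the external table yourself, so Hive never takes ownership of the files. A sketch, assuming a hypothetical two-column schema and Sqoop's default comma delimiter:

```shell
# Import into HDFS only (no --hive-import), leaving the files in place.
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username daniel -P \
  --table mydata \
  --target-dir /user/daniel/data/mydata

# Point an EXTERNAL table at that directory; Hive reads the data
# where it sits instead of moving it into the warehouse.
hive -e "CREATE EXTERNAL TABLE mydata (id INT, name STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         LOCATION '/user/daniel/data/mydata';"
```

The cost is that you write the DDL by hand rather than letting Sqoop generate it, but dropping the table later leaves the files untouched.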

If I had run DESCRIBE EXTENDED on the generated Hive table in the first place, it would have shown me where the data was, but it wouldn't have explained how it got there. And now I know.
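That check looks something like this (table name assumed from the example above); the location field in the output points at the table's actual HDFS directory:

```shell
# Ask Hive where the table's data actually lives.
hive -e "DESCRIBE EXTENDED mydata;"
# Look for location: in the Detailed Table Information line; after the
# shell game it points under /user/hive/warehouse rather than --target-dir.
```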
