Hadoop and Python Streaming

May 21, 2010

I've started writing some Hadoop streaming jobs in Python, and there isn't much documentation out there for it. Things like: how do I pass environment variables, how do I ship the modules my scripts need, and so on.

Here are a couple of quick tips. To pass environment variables to your task nodes, use the -cmdenv command-line parameter when launching a Hadoop job:

  /Users/Hadoop/hadoop/bin/hadoop jar /Users/Hadoop/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -mapper /Users/hadoop/code/traffic/mapper.py \
    -reducer /Users/hadoop/code/traffic/reducer.py \
    -input insights-input-small/* \
    -output insights-output-traffic \
    -cmdenv PYTHONPATH=$PYTHONPATH:/Users/jim/Code \
    -cmdenv MYAPP__PATH=/Users/jim/Code \
    -cmdenv MYAPP_ENVIRONMENT=development
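
On the task side, anything passed with -cmdenv shows up as an ordinary environment variable. Here's a rough sketch of how a mapper might pick up MYAPP_ENVIRONMENT. The helper names (get_environment, map_lines, main), the "production" fallback, and the tab-separated output are all made up for illustration; this is not the actual mapper.py from the job above.

```python
import os
import sys


def get_environment(environ=os.environ):
    """Return the value passed via -cmdenv MYAPP_ENVIRONMENT, or a fallback."""
    # Falling back to "production" is an assumption made for this sketch.
    return environ.get("MYAPP_ENVIRONMENT", "production")


def map_lines(lines, environment):
    """Yield one tab-separated key/value record per non-empty input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield "%s\t%s" % (environment, line)


def main(stdin, stdout):
    """Read records from stdin and write mapped records to stdout."""
    environment = get_environment()
    for record in map_lines(stdin, environment):
        stdout.write(record + "\n")
```

A real streaming script would finish by calling main(sys.stdin, sys.stdout), since Hadoop streaming feeds input on stdin and collects output from stdout.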


If you want to distribute your modules to the task nodes instead of having them installed there, you can zip up your module, rename the archive to mymodule.mod, and add this command-line parameter:

-file /Users/jim/Code/mymodule.mod

Then in your script you can import it straight from the archive with zipimport (no need to unzip it first):


  import zipimport
  importer = zipimport.zipimporter('mymodule.mod')
  insights = importer.load_module('mymodule')
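
To make the packaging step concrete, here's a rough sketch of both halves: zipping a module into mymodule.mod with the standard zipfile module, then importing it from the archive. Note that I've swapped in a different import route here: the archive is put on sys.path and imported normally, because newer Python 3 releases mark load_module as deprecated. The helper names (build_module_archive, import_from_archive) and the throwaway greet function are hypothetical, just for the demo.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_module_archive(source_path, archive_path):
    """Zip a single .py file so it sits at the root of the archive."""
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.write(source_path, arcname=os.path.basename(source_path))
    return archive_path


def import_from_archive(archive_path, module_name):
    """Import module_name from a zip archive, whatever its file extension."""
    if archive_path not in sys.path:
        sys.path.insert(0, archive_path)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Demo: write a throwaway module, package it, and import it back.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "mymodule.py")
with open(source, "w") as handle:
    handle.write("def greet(name):\n    return 'hello ' + name\n")

archive = build_module_archive(source, os.path.join(workdir, "mymodule.mod"))
mymodule = import_from_archive(archive, "mymodule")
print(mymodule.greet("hadoop"))  # prints "hello hadoop"
```

In a streaming job the archive shipped with -file ends up in the task's working directory, so the script can refer to it by its bare name, just like the zipimporter('mymodule.mod') call does.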


Hope that helps someone :)


