How to run Python Spark code on Amazon AWS?


I have written Python code in Spark and want to run it on Amazon's Elastic MapReduce (EMR).

My code works great on my local machine, but I am confused about how to run it on Amazon AWS.

More specifically, how should I transfer my Python code to the master node? Do I need to copy my Python code to an S3 bucket and execute it from there? Or should I SSH into the master node and scp my Python code into the Spark folder on the master?

For now, I tried running the code locally from my terminal and connecting to the cluster address (I worked this out by reading the output of Spark's --help flag, so I might be missing a few steps here):

    ./bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 \
        --master spark://hadoop@ec2-public-dns-of-my-cluster.compute-1.amazonaws.com \
        mypythoncode.py

I tried it with and without the permissions file, i.e.

-i permissionsfile.pem 

However, it fails, and the stack trace shows something along the lines of:

    Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
        at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        ......
        ......
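(For reference, the two properties named in the exception can be set from PySpark on the SparkContext's underlying Hadoop configuration. A minimal sketch, with placeholder credentials and a hypothetical s3n path:)

    from pyspark import SparkContext

    sc = SparkContext(appName="s3n-credentials-sketch")

    # The exception refers to these Hadoop properties; they can be set on the
    # SparkContext's Hadoop configuration (placeholder values shown).
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

    # Hypothetical path, shown only to illustrate reading via s3n://
    rdd = sc.textFile("s3n://some-bucket/some-input-path")
    print(rdd.count())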

Is my approach correct, and do I just need to resolve the access issues to get going, or am I heading in the wrong direction?

What is the right way of doing it?

I searched a lot on YouTube but couldn't find any tutorials on running Spark on Amazon's EMR.

If it helps, the dataset I am working with is part of Amazon's public datasets.

  1. Go to EMR and create a new cluster... [recommendation: start with 1 node only, for testing purposes].
  2. Click the checkbox to install Spark; you can uncheck the other boxes if you don't need the additional programs.
  3. Configure the cluster further by choosing a VPC and a security key (SSH key, a.k.a. pem key).
  4. Wait for it to boot up. Once your cluster says "Waiting", you're free to proceed.
  5. [Spark submission via the GUI] In the GUI, you can add a step and select a Spark job, then upload your Spark file to S3 and choose the path of the newly uploaded S3 file. Once it runs, it will either succeed or fail. If it fails, wait a moment, then click "View logs" on that step's line in the list of steps. Keep tweaking your script until you've got it working.

    [Submission via the command line] SSH into the driver node following the SSH instructions at the top of the page. Once inside, use a command-line text editor to create a new file and paste the contents of your script in. Then spark-submit yournewfile.py. If it fails, you'll see the error output straight to the console. Tweak your script and re-run until you've got it working as expected. (A rough programmatic equivalent of these steps is sketched below.)
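If you prefer to script the above instead of clicking through the console, the same flow can be driven from Python with boto3. A rough sketch, assuming hypothetical names for the region, key pair, and S3 script path, and an arbitrary EMR release label and instance type:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    # Roughly steps 1-4: create a one-node cluster with Spark installed.
    cluster = emr.run_job_flow(
        Name="spark-test-cluster",
        ReleaseLabel="emr-5.0.0",               # assumed release; pick a current one
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m3.xlarge",  # assumed instance type
            "InstanceCount": 1,
            "Ec2KeyName": "my-key-pair",        # hypothetical SSH key name
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )

    # Roughly step 5: add a Spark step that runs a script already uploaded to S3.
    emr.add_job_flow_steps(
        JobFlowId=cluster["JobFlowId"],
        Steps=[{
            "Name": "my-spark-step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-bucket/mypythoncode.py"],  # hypothetical S3 path
            },
        }],
    )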

Note: running jobs from your local machine against a remote cluster is troublesome, because you may end up making your local instance of Spark responsible for some expensive computations and for data transfer over the network. That's why you want to submit AWS EMR jobs from within EMR.
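On the EMR cluster itself, the instance profile normally supplies the S3 credentials, so reading the public dataset should not require setting any access keys explicitly. A minimal sketch, assuming a hypothetical s3:// input path:

    from pyspark import SparkContext

    sc = SparkContext(appName="emr-public-dataset-sketch")

    # On EMR, EMRFS typically picks up credentials from the cluster's instance
    # profile, so no fs.s3n.* keys need to be set here (hypothetical path shown).
    lines = sc.textFile("s3://some-public-dataset-bucket/some-prefix/")
    print(lines.take(5))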

