How to Connect Python to a Spark Session and Keep RDDs Alive
How can a small Python script hook into an existing instance of Spark and perform operations on existing RDDs?
I'm in the early stages of working with Spark on Windows 10, trying out scripts on a "local" instance. I'm working with the latest stable build of Spark (Spark 2.0.1 for Hadoop 2.7). I've installed Hadoop 2.7.3 and set its environment variables. I'm experimenting with both the PySpark shell and Visual Studio 2015 Community with Python.
I'm trying to build a large engine on which I'll run individual scripts to load, massage, format, and access data. I'm sure there's a normal way to do that; isn't that the point of Spark?
Anyway, here's the experience I have so far, and it's about what I expected. When I build a small Spark script in Python and run it using Visual Studio, the script runs, does its job, and exits. In the process of exiting, it also shuts down the Spark context it was using.
So I had the following thought: what if I started a persistent Spark context in PySpark, and then set the SparkConf and SparkContext in each Python script to connect to that same context? Looking online for the defaults PySpark uses, I tried the following:
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
    sc = SparkContext(conf=conf)
I started PySpark. Then, in a separate script in Visual Studio, I used the same code to get a SparkContext and loaded a text file into an RDD named rddFromFilename. But I couldn't access that RDD in the PySpark shell once the script had run.
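In full, the separate script looked roughly like this (the file path below is just a placeholder for the one I actually used):

    from pyspark import SparkConf, SparkContext

    # Same master/appName as the running PySpark shell, in the hope of
    # attaching to its context rather than creating a new one.
    conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
    sc = SparkContext(conf=conf)

    # Load a text file into an RDD and touch it so the job actually runs.
    rddFromFilename = sc.textFile("C:/path/to/some_file.txt")
    print(rddFromFilename.count())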
How do I start a persistent Spark context, create an RDD in it from one Python script, and then access that RDD from subsequent Python scripts? Particularly on Windows?
There is no solution for this in Spark alone. You may consider:
To keep persistent RDDs:
- Apache Ignite
To keep a persistent shared context:
- spark-jobserver
- Livy - https://github.com/cloudera/livy (see the sketch at the end of this answer)
- Mist - https://github.com/hydrospheredata/mist
To share a context between notebooks:
- Apache Zeppelin
I think that, out of these, only Zeppelin officially supports Windows.
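To make the shared-context option more concrete: a minimal sketch of driving Livy from a plain Python script over its REST API could look like the one below, assuming a Livy server is already running on its default port (8998). The endpoints and payloads follow Livy's REST API, but treat it as an illustration rather than a tested recipe.

    import time
    import requests

    LIVY_URL = "http://localhost:8998"  # assumption: Livy's default endpoint

    # Create a long-lived PySpark session; its SparkContext (and anything
    # you build in it, including RDDs) stays alive between requests.
    session = requests.post(LIVY_URL + "/sessions", json={"kind": "pyspark"}).json()
    session_url = LIVY_URL + "/sessions/{}".format(session["id"])

    # Wait for the session to become idle before submitting code.
    while requests.get(session_url).json()["state"] != "idle":
        time.sleep(1)

    # One script can build an RDD inside the shared session...
    requests.post(session_url + "/statements",
                  json={"code": "rdd = sc.parallelize(range(100))"})

    # ...and a later script can submit further statements against the same
    # session id and reuse that `rdd`, because the context never shut down.
    requests.post(session_url + "/statements",
                  json={"code": "print(rdd.count())"})

Statements submitted to the same Livy session run sequentially against one SparkContext, which is what gives you the "create an RDD in one script, use it in the next" behavior the question asks for.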