apache spark - Load XML string from Column in PySpark
I have a JSON file in which one of the columns contains an XML string.
I tried extracting that field and writing it to a file in a first step, then reading the file back in the next step. However, each row has an XML header tag, so the resulting file is not a valid XML file.
How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values?
The following doesn't work:
tr = spark.read.json("my-file-path")
trans_xml = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='book').load(tr.select("trans_xml"))
Thanks, Ram.
Try the Hive XPath UDFs (see LanguageManual XPathUDF):
>>> from pyspark.sql.functions import expr
>>> df.select(expr("xpath({0}, '{1}')".format(column_name, xpath_expression)))
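For example, here is a minimal sketch of that approach. The sample DataFrame, the column name (trans_xml) and the <book> fragment are made up for illustration, and it assumes Spark 2.0+, where xpath_string is available as a built-in SQL function (on older versions you would need a HiveContext):

>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import expr
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     [("<book><title>Spark</title><price>10</price></book>",)],
...     ["trans_xml"])
>>> # xpath_string returns the text of the first node matching the XPath
>>> df.select(
...     expr("xpath_string(trans_xml, '/book/title')").alias("title"),
...     expr("xpath_string(trans_xml, '/book/price')").alias("price")).show()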
Or a Python UDF:
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> import xml.etree.ElementTree as ET
>>> schema = ...  # define schema
>>> def parse(s):
...     root = ET.fromstring(s)
...     result = ...  # select values
...     return result
>>> df.select(udf(parse, schema)(xml_column))
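To make the UDF version concrete, here is a filled-in sketch under the same assumptions (a hypothetical trans_xml column holding well-formed <book> fragments, no error handling for rows that fail to parse):

>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> from pyspark.sql.functions import udf
>>> import xml.etree.ElementTree as ET
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     [("<book><title>Spark</title><price>10</price></book>",)],
...     ["trans_xml"])
>>> schema = StructType([
...     StructField("title", StringType()),
...     StructField("price", StringType())])
>>> def parse(s):
...     # parse one XML string and pull out the child element texts
...     root = ET.fromstring(s)
...     return (root.findtext("title"), root.findtext("price"))
>>> parse_udf = udf(parse, schema)
>>> df.select(parse_udf("trans_xml").alias("book")).select("book.*").show()

Returning a plain tuple matches the StructType schema, so the result can be expanded back into ordinary columns with book.*.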