apache spark - Load XML string from Column in PySpark


I have a JSON file in which one of the columns contains an XML string.

I tried extracting the field and writing it to a file in a first step, then reading that file back in a second step. But each row carries its own XML header tag, so the resulting file is not a valid XML document.

How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values?

The following doesn't work (`load()` expects a path, not a DataFrame):

tr = spark.read.json("my-file-path")
trans_xml = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='book').load(tr.select("trans_xml"))

Thanks, Ram.

Try the Hive XPath UDFs (LanguageManual XPathUDF):

>>> from pyspark.sql.functions import expr
>>> df.select(expr("xpath({0}, '{1}')".format(column_name, xpath_expression)))
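Hive's `xpath()` returns an array of all matching string values. As a rough illustration outside Spark (a sketch using only Python's standard library and a made-up `<book>` snippet, not data from the question), `ElementTree`'s limited XPath support behaves similarly for simple expressions:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample value; in Spark this would be one row of the XML column.
xml_str = "<book><title>Spark</title><author>Ram</author></book>"

root = ET.fromstring(xml_str)
# Like Hive's xpath(), collect the text of every node matching the expression.
titles = [e.text for e in root.findall("./title")]
print(titles)  # ['Spark']
```

The Spark expression `xpath(trans_xml, '/book/title/text()')` would produce the analogous array column per row.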

Or a Python UDF:

>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> import xml.etree.ElementTree as ET
>>> schema = ...  # define schema
>>> def parse(s):
...     root = ET.fromstring(s)
...     result = ...  # select values
...     return result
>>> df.select(udf(parse, schema)(xml_column))
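As a concrete sketch, the `parse` function above might be fleshed out like this (the `<book>` fields `title` and `author` are assumptions for illustration, not taken from the question):

```python
import xml.etree.ElementTree as ET

def parse(s):
    # Parse one XML string and pull out (title, author) as a tuple,
    # matching a hypothetical StructType of two StringType fields.
    root = ET.fromstring(s)
    return (root.findtext("title"), root.findtext("author"))

# Wrapped as udf(parse, schema) in Spark, this would run once per row
# of the XML column; here we call it directly on a sample string.
row = parse("<book><title>Spark</title><author>Ram</author></book>")
print(row)  # ('Spark', 'Ram')
```

`findtext` returns `None` when the tag is missing, which maps cleanly onto a nullable struct field.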
