Pivoting String Values in PySpark



I have a problem restructuring data using Spark. The original data looks like this:

df = sqlContext.createDataFrame([
    ("id_1", "var_1", "butter"),
    ("id_1", "var_2", "toast"),
    ("id_1", "var_3", "ham"),
    ("id_2", "var_1", "jam"),
    ("id_2", "var_2", "toast"),
    ("id_2", "var_3", "egg"),
], ["id", "var", "val"])

>>> df.show()
+----+-----+------+
|  id|  var|   val|
+----+-----+------+
|id_1|var_1|butter|
|id_1|var_2| toast|
|id_1|var_3|   ham|
|id_2|var_1|   jam|
|id_2|var_2| toast|
|id_2|var_3|   egg|
+----+-----+------+

This is the structure I am trying to achieve:

+----+------+-----+-----+
|  id| var_1|var_2|var_3|
+----+------+-----+-----+
|id_1|butter|toast|  ham|
|id_2|   jam|toast|  egg|
+----+------+-----+-----+

My idea was to use:

df.groupby("id").pivot("var").show() 

but I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'show'

Any suggestions? Thanks!

You need to add an aggregation after pivot(). If you are sure there is exactly one "val" for each ("id", "var") pair, you can use first():

from pyspark.sql import functions as F

result = df.groupBy("id").pivot("var").agg(F.first("val"))
result.show()

+----+------+-----+-----+
|  id| var_1|var_2|var_3|
+----+------+-----+-----+
|id_1|butter|toast|  ham|
|id_2|   jam|toast|  egg|
+----+------+-----+-----+
