Spark Pivot String in PySpark
I have a problem restructuring data using Spark. The original data looks like this:
df = sqlContext.createDataFrame([
    ("id_1", "var_1", "butter"),
    ("id_1", "var_2", "toast"),
    ("id_1", "var_3", "ham"),
    ("id_2", "var_1", "jam"),
    ("id_2", "var_2", "toast"),
    ("id_2", "var_3", "egg"),
], ["id", "var", "val"])

>>> df.show()
+----+-----+------+
|  id|  var|   val|
+----+-----+------+
|id_1|var_1|butter|
|id_1|var_2| toast|
|id_1|var_3|   ham|
|id_2|var_1|   jam|
|id_2|var_2| toast|
|id_2|var_3|   egg|
+----+-----+------+
This is the structure I am trying to achieve:
+----+------+-----+-----+
|  id| var_1|var_2|var_3|
+----+------+-----+-----+
|id_1|butter|toast|  ham|
|id_2|   jam|toast|  egg|
+----+------+-----+-----+
My idea was to use:
df.groupby("id").pivot("var").show()
but I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'show'
Any suggestions? Thanks!
You need to add an aggregation after pivot(). groupby() returns a GroupedData object, which has no show() method until you apply an aggregation that turns it back into a DataFrame. If you are sure there is exactly one "val" for each ("id", "var") pair, you can use first():
from pyspark.sql import functions as f

result = df.groupby("id").pivot("var").agg(f.first("val"))
result.show()

+----+------+-----+-----+
|  id| var_1|var_2|var_3|
+----+------+-----+-----+
|id_1|butter|toast|  ham|
|id_2|   jam|toast|  egg|
+----+------+-----+-----+