join - Can IndexedRDD be used in a streaming context?
I want to use a fast join operation in a Spark Streaming context, such as A join B, where A is a fixed dataset read from a file and B is a small streaming RDD read from a socket. I've tried the common way provided by Spark, but joining a 5,000,000-record RDD with a 10-record streaming RDD costs 4 seconds. Later I tried using IndexedRDD, but I couldn't make it work. I have the following questions:
Is 4 seconds slow? Can I use a performance tuning method such as a broadcast join to improve it? If it is slow, why? I heard that an RDD's join operation is a linear search; is that true?
Can IndexedRDD's join operation be faster than the common way?
How do I use IndexedRDD in a streaming context? I've tried it this way:
streaming_rdd.transform { rdd => indexed_data.innerJoin(IndexedRDD(rdd)) { (id, a, b) => (a, b) } }
It compiles, but when running I got this error:
java.lang.ClassCastException: scala.collection.immutable.$colon$colon cannot be cast to [Lscala.Tuple2;
I don't know if this is the proper way to use IndexedRDD, and I don't know what caused the error either. Can anyone help me?
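To be concrete about what I mean by a broadcast join in the first question, here is a rough sketch of the idea. It uses plain Scala collections as stand-ins for the RDDs, and every name in it (BroadcastJoinSketch, hashJoin, fixed, batch) is made up for illustration; it is not my actual job and not Spark's or IndexedRDD's API. The small side is turned into a hash map, which in Spark is the part I would expect to broadcast, and the large side is probed against it in one pass.

```scala
// Hypothetical sketch: plain collections stand in for the RDDs.
object BroadcastJoinSketch {

  // Inner-join a large keyed dataset with a small one by hashing the small side.
  def hashJoin[K, A, B](large: Iterator[(K, A)], small: Seq[(K, B)]): Iterator[(K, (A, B))] = {
    // Build a key -> values lookup from the small side.
    // In a Spark job, this map is what would be broadcast to the executors.
    val lookup: Map[K, Seq[B]] =
      small.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

    // Single pass over the large side; each probe is a hash lookup,
    // and keys missing from the small side produce no output.
    large.flatMap { case (k, a) =>
      lookup.getOrElse(k, Seq.empty[B]).map(b => (k, (a, b)))
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed sizes mirror the question: a large fixed side, a tiny batch.
    val fixed = (1 to 5000000).iterator.map(i => (i.toLong, s"row$i"))
    val batch = Seq(3L -> "x", 42L -> "y", 9999999L -> "z")
    hashJoin(fixed, batch).foreach(println)
  }
}
```

Even in this form the large side is still scanned once per batch, so I am not sure it would be enough on its own, which is why I am also asking whether IndexedRDD's keyed lookup can do better.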