Using dplyr in R: How to summarise data on same column with different criteria -
i have data set
user_id business_id date stars review_length pos_words neg_words net_sentiment xqd0dzhaiyrqvh3wrg7hzg vcnawilm4dr7d2nwwj7nca 17/05/07 5 94 4 1 3 h1kh6qzv7le4zqtrnxozow vcnawilm4dr7d2nwwj7nca 22/03/10 2 114 3 7 -4 zvjccrpm2yozrxkffwgqla vcnawilm4dr7d2nwwj7nca 14/02/12 4 55 6 0 6 kblw4wja_fwowmmhihrvoa vcnawilm4dr7d2nwwj7nca 2/03/12 4 97 0 3 -3 zvjccrpm2yozrxkffwgqla vcnawilm4dr7d2nwwj7nca 15/05/12 4 53 1 2 -1 yelp<- read.csv("yelp_ratings.csv") colnames(yelp) [1] "user_id" "business_id" "date" "stars" "review_length" [6] "pos_words" "neg_words" "net_sentiment"
i need use dplyr determine businesses have best , worst ratings --- determine value in net_sentiment--- , determine users gave best , worst ratings (using value in net_sentiment well) particular business id .
heres have right now,
yelp %>% group_by(business_id,user_id) %>% summarise(net_sentiment = max(net_sentiment)) %>% arrange(desc(net_sentiment)) %>% head(n=20)
this gives print out of, data set
business_id user_id net_sentiment 1 -5rn56jh78mv2oqulv_g8g xnb8pfe99enj8bemscbpcq 80 2 gvyju3xrco1r4ank7szjca xnb8pfe99enj8bemscbpcq 78 3 orilsaav4srz_twfy1twpw xnb8pfe99enj8bemscbpcq 77 4 gvyju3xrco1r4ank7szjca uloplvlghkzrfo3phwbpaq 74 5 4ughpy-opjn08cabttavng xnb8pfe99enj8bemscbpcq 72
which shows business highest net_sentiment score , user gave net_sentiment score.
what intend achieve like
for business best rating:
business_id user_id_best_rating pos_net_sentiment user_id_worst_rating neg_net_sentiment -5rn56jh78mv2oqulv_g8g xnb8pfe99enj8bemscbpcq 80 user123 -50
for business worst rating:
business_id user_id_best_rating pos_net_sentiment user_id_worst_rating neg_net_sentiment business123 user345 10 user789 -150
again clarify, using dplyr, should listing of best businesses first determine net_sentiment score , users gave best , worst rating business , same should applied worst businesses.
here single pipe can first table; after that, resorting second table easily. if pull off head each time single line of desired output.
the logic group business , mutate best , worst results own columns, can use result key column of userid_best_rating. if have getting many results key, add business id along secondary key (essentially utilizing composite key of score-busiid each userid).
the pipe adds in id's highest positive , negative reviews , trims off extras before sorts highest rating top.
# simplified transportable data demonstrating similar pattern of overlap busiid <- c('a','b','c','b','e') userid <- c(1,1,1,2,1) netsenti <- c(80,78,77,74,72) ylp <- data.frame(busiid,userid,netsenti) smryylp <- ylp %>% group_by(busiid) %>% mutate(pos_netsenti = max(netsenti), neg_netsenti = min(netsenti)) %>% left_join(select(ylp, neg_netsenti = netsenti, user_id_worst_rating = userid)) %>% left_join(select(ylp, pos_netsenti = netsenti, user_id_best_rating = userid)) %>% select(busiid, user_id_best_rating, pos_netsenti, user_id_worst_rating, neg_netsenti) %>% ungroup %>% distinct %>% arrange(desc(pos_netsenti)) smryylp ## tibble: 4 × 5 # busiid user_id_best_rating pos_netsenti user_id_worst_rating neg_netsenti # <fctr> <dbl> <dbl> <dbl> <dbl> # 1 1 80 1 80 # 2 b 1 78 2 74 # 3 c 1 77 1 77 # 4 e 1 72 1 72
hope helps 🙂
Comments
Post a Comment