Using dplyr in R: How to summarise data on same column with different criteria -

i have data set

user_id                 business_id             date      stars review_length pos_words neg_words   net_sentiment xqd0dzhaiyrqvh3wrg7hzg  vcnawilm4dr7d2nwwj7nca  17/05/07    5   94              4       1              3 h1kh6qzv7le4zqtrnxozow  vcnawilm4dr7d2nwwj7nca  22/03/10    2   114             3       7             -4 zvjccrpm2yozrxkffwgqla  vcnawilm4dr7d2nwwj7nca  14/02/12    4   55              6       0              6 kblw4wja_fwowmmhihrvoa  vcnawilm4dr7d2nwwj7nca  2/03/12     4   97              0       3              -3 zvjccrpm2yozrxkffwgqla  vcnawilm4dr7d2nwwj7nca  15/05/12    4   53              1       2             -1   yelp<- read.csv("yelp_ratings.csv") colnames(yelp)  [1] "user_id"       "business_id"   "date"          "stars"         "review_length"  [6] "pos_words"    "neg_words"     "net_sentiment"

i need use dplyr determine businesses have best , worst ratings --- determine value in net_sentiment--- , determine users gave best , worst ratings (using value in net_sentiment well) particular business id .

heres have right now,

yelp %>%   group_by(business_id,user_id) %>%   summarise(net_sentiment = max(net_sentiment)) %>%   arrange(desc(net_sentiment)) %>%   head(n=20)

this gives print out of, data set

              business_id                user_id net_sentiment 1  -5rn56jh78mv2oqulv_g8g xnb8pfe99enj8bemscbpcq            80 2  gvyju3xrco1r4ank7szjca xnb8pfe99enj8bemscbpcq            78 3  orilsaav4srz_twfy1twpw xnb8pfe99enj8bemscbpcq            77 4  gvyju3xrco1r4ank7szjca uloplvlghkzrfo3phwbpaq            74 5  4ughpy-opjn08cabttavng xnb8pfe99enj8bemscbpcq            72

which shows business highest net_sentiment score , user gave net_sentiment score.

what intend achieve like

for business best rating:

            business_id    user_id_best_rating pos_net_sentiment user_id_worst_rating neg_net_sentiment  -5rn56jh78mv2oqulv_g8g xnb8pfe99enj8bemscbpcq                80              user123               -50

for business worst rating:

business_id user_id_best_rating pos_net_sentiment user_id_worst_rating    neg_net_sentiment business123             user345                10              user789                 -150

again clarify, using dplyr, should listing of best businesses first determine net_sentiment score , users gave best , worst rating business , same should applied worst businesses.

here single pipe can first table; after that, resorting second table easily. if pull off head each time single line of desired output.

the logic group business , mutate best , worst results own columns, can use result key column of userid_best_rating. if have getting many results key, add business id along secondary key (essentially utilizing composite key of score-busiid each userid).

the pipe adds in id's highest positive , negative reviews , trims off extras before sorts highest rating top.

# simplified transportable data demonstrating similar pattern of overlap busiid <- c('a','b','c','b','e') userid <- c(1,1,1,2,1) netsenti <- c(80,78,77,74,72) ylp <- data.frame(busiid,userid,netsenti)  smryylp <-      ylp %>%      group_by(busiid) %>%      mutate(pos_netsenti = max(netsenti), neg_netsenti = min(netsenti)) %>%      left_join(select(ylp, neg_netsenti = netsenti, user_id_worst_rating = userid)) %>%      left_join(select(ylp, pos_netsenti = netsenti, user_id_best_rating = userid)) %>%      select(busiid, user_id_best_rating, pos_netsenti, user_id_worst_rating, neg_netsenti) %>%      ungroup %>% distinct %>%      arrange(desc(pos_netsenti))  smryylp ## tibble: 4 × 5 #   busiid user_id_best_rating pos_netsenti user_id_worst_rating neg_netsenti #   <fctr>               <dbl>        <dbl>                <dbl>        <dbl> # 1                        1           80                    1           80 # 2      b                   1           78                    2           74 # 3      c                   1           77                    1           77 # 4      e                   1           72                    1           72

hope helps 🙂

Search This Blog

WIKI

Using dplyr in R: How to summarise data on same column with different criteria -

Comments

Post a Comment

Popular posts from this blog

jquery - uncaught exception: DataTables Editor - remote hosting of code not allowed -

java - SSE Emitter : Manage timeouts and complete() -

java - How to resolve error - package com.squareup.okhttp3 doesn't exist? -