Using dplyr in R: How to summarise data on same column with different criteria -
i have data set
user_id                 business_id             date      stars review_length pos_words neg_words   net_sentiment xqd0dzhaiyrqvh3wrg7hzg  vcnawilm4dr7d2nwwj7nca  17/05/07    5   94              4       1              3 h1kh6qzv7le4zqtrnxozow  vcnawilm4dr7d2nwwj7nca  22/03/10    2   114             3       7             -4 zvjccrpm2yozrxkffwgqla  vcnawilm4dr7d2nwwj7nca  14/02/12    4   55              6       0              6 kblw4wja_fwowmmhihrvoa  vcnawilm4dr7d2nwwj7nca  2/03/12     4   97              0       3              -3 zvjccrpm2yozrxkffwgqla  vcnawilm4dr7d2nwwj7nca  15/05/12    4   53              1       2             -1   yelp<- read.csv("yelp_ratings.csv") colnames(yelp)  [1] "user_id"       "business_id"   "date"          "stars"         "review_length"  [6] "pos_words"    "neg_words"     "net_sentiment" i need use dplyr determine businesses have best , worst ratings --- determine value in net_sentiment--- , determine users gave best , worst ratings (using value in net_sentiment well) particular business id .
heres have right now,
yelp %>%   group_by(business_id,user_id) %>%   summarise(net_sentiment = max(net_sentiment)) %>%   arrange(desc(net_sentiment)) %>%   head(n=20) this gives print out of, data set
              business_id                user_id net_sentiment 1  -5rn56jh78mv2oqulv_g8g xnb8pfe99enj8bemscbpcq            80 2  gvyju3xrco1r4ank7szjca xnb8pfe99enj8bemscbpcq            78 3  orilsaav4srz_twfy1twpw xnb8pfe99enj8bemscbpcq            77 4  gvyju3xrco1r4ank7szjca uloplvlghkzrfo3phwbpaq            74 5  4ughpy-opjn08cabttavng xnb8pfe99enj8bemscbpcq            72 which shows business highest net_sentiment score , user gave net_sentiment score.
what intend achieve like
for business best rating:
            business_id    user_id_best_rating pos_net_sentiment user_id_worst_rating neg_net_sentiment  -5rn56jh78mv2oqulv_g8g xnb8pfe99enj8bemscbpcq                80              user123               -50 for business worst rating:
business_id user_id_best_rating pos_net_sentiment user_id_worst_rating    neg_net_sentiment business123             user345                10              user789                 -150 again clarify, using dplyr, should listing of best businesses first determine net_sentiment score , users gave best , worst rating business , same should applied worst businesses.
here single pipe can first table; after that, resorting second table easily. if pull off head each time single line of desired output.
the logic group business , mutate best , worst results own columns, can use result key column of userid_best_rating. if have getting many results key, add business id along secondary key (essentially utilizing composite key of score-busiid each userid).
the pipe adds in id's highest positive , negative reviews , trims off extras before sorts highest rating top.
# simplified transportable data demonstrating similar pattern of overlap busiid <- c('a','b','c','b','e') userid <- c(1,1,1,2,1) netsenti <- c(80,78,77,74,72) ylp <- data.frame(busiid,userid,netsenti)  smryylp <-      ylp %>%      group_by(busiid) %>%      mutate(pos_netsenti = max(netsenti), neg_netsenti = min(netsenti)) %>%      left_join(select(ylp, neg_netsenti = netsenti, user_id_worst_rating = userid)) %>%      left_join(select(ylp, pos_netsenti = netsenti, user_id_best_rating = userid)) %>%      select(busiid, user_id_best_rating, pos_netsenti, user_id_worst_rating, neg_netsenti) %>%      ungroup %>% distinct %>%      arrange(desc(pos_netsenti))  smryylp ## tibble: 4 × 5 #   busiid user_id_best_rating pos_netsenti user_id_worst_rating neg_netsenti #   <fctr>               <dbl>        <dbl>                <dbl>        <dbl> # 1                        1           80                    1           80 # 2      b                   1           78                    2           74 # 3      c                   1           77                    1           77 # 4      e                   1           72                    1           72 hope helps 🙂
Comments
Post a Comment