r - Select grouped rows with at least one matching criterion -


i want select groupings contain @ least 1 of elements interested in. able creating intermediate array, looking simpler , faster. because actual data set has on 1m rows (and 20 columns) not sure whether have sufficient memory create intermediate array. more importantly, below method on original file takes lot of time.

here's code , data:

a) data

dput(data_file) structure(list(group_id = c(123, 123, 123, 123, 234, 345, 444,  444), product_name = c("abcd", "efgh", "xyz1", "z123", "abcd",  "efgh", "abcd", "abcd"), qty = c(2, 3, 4, 5, 6, 7, 8, 9)), .names = c("group_id",  "product_name", "qty"), row.names = c(na, 8l), class = "data.frame") 

b) code: want select group_id has @ least 1 product_name = abcd

#find out transactions     data_t <- data_file %>%        group_by(group_id) %>%       dplyr::filter(product_name == "abcd") %>%       select(group_id) %>%       distinct()      #now filter them     filtered_t <- data_file %>%        group_by(group_id) %>%       dplyr::filter(group_id %in% data_t$group_id) 

c) expected output

  group_id product_name   qty      <dbl>        <chr> <dbl>       123         abcd     2       123         efgh     3       123         xyz1     4       123         z123     5       234         abcd     6       444         abcd     8       444         abcd     9 

i'm struggling on 3 hours now. looked @ auto-suggested thread so: select rows @ least 2 conditions conditions question different.

i this:

data_file %>% group_by(group_id) %>%     filter(any(product_name %in% "abcd")) # source: local data frame [7 x 3] # groups: group_id [3] #  #   group_id product_name   qty #      <dbl>        <chr> <dbl> # 1      123         abcd     2 # 2      123         efgh     3 # 3      123         xyz1     4 # 4      123         z123     5 # 5      234         abcd     6 # 6      444         abcd     8 # 7      444         abcd     9 

explanation: any() return true if there rows (within group) match condition. length-1 result recycled full length of group , entire group kept. sum(product_name %in% "abcd") > 0 condition, reads nicely. use sum instead if wanted more complicated condition, 3 or more matching product names.

i prefer%in%to == things because has better behavior na , easy expand if wanted check of multiple products group.


if speed , efficiency issue, data.table faster. this, relies on keyed join filtering , uses no non-data.table operations, should fast:

library(data.table) df = as.data.table(df) setkey(df) groups = unique(subset(df, product_name %in% "abcd", group_id)) df[groups, nomatch = 0] #    group_id product_name qty # 1:      123         abcd   2 # 2:      123         efgh   3 # 3:      123         xyz1   4 # 4:      123         z123   5 # 5:      234         abcd   6 # 6:      444         abcd   8 # 7:      444         abcd   9 

Comments

Popular posts from this blog

java - SSE Emitter : Manage timeouts and complete() -

jquery - uncaught exception: DataTables Editor - remote hosting of code not allowed -

java - How to resolve error - package com.squareup.okhttp3 doesn't exist? -