R - Select grouped rows with at least one matching criterion
I want to select the groupings that contain at least one of the elements I am interested in. I am able to do this by creating an intermediate array, but I am looking for something simpler and faster, because the actual data set has over 1M rows (and 20 columns) and I am not sure whether I have sufficient memory to create an intermediate array. More importantly, the method below takes a lot of time on the original file.

Here's my code and data:
a) Data

dput(data_file)
structure(list(group_id = c(123, 123, 123, 123, 234, 345, 444, 444),
    product_name = c("abcd", "efgh", "xyz1", "z123", "abcd", "efgh", "abcd", "abcd"),
    qty = c(2, 3, 4, 5, 6, 7, 8, 9)),
    .Names = c("group_id", "product_name", "qty"),
    row.names = c(NA, 8L), class = "data.frame")
b) Code: I want to select every group_id that has at least one product_name == "abcd"
# find out which transactions qualify
data_t <- data_file %>%
  group_by(group_id) %>%
  dplyr::filter(product_name == "abcd") %>%
  select(group_id) %>%
  distinct()

# now filter them
filtered_t <- data_file %>%
  group_by(group_id) %>%
  dplyr::filter(group_id %in% data_t$group_id)
c) Expected output

  group_id product_name   qty
     <dbl>        <chr> <dbl>
       123         abcd     2
       123         efgh     3
       123         xyz1     4
       123         z123     5
       234         abcd     6
       444         abcd     8
       444         abcd     9
I have been struggling with this for over 3 hours now. I looked at the auto-suggested SO thread "Select rows with at least 2 conditions from all conditions", but that question is different.
I would do this:

data_file %>%
  group_by(group_id) %>%
  filter(any(product_name %in% "abcd"))
# Source: local data frame [7 x 3]
# Groups: group_id [3]
#
#   group_id product_name   qty
#      <dbl>        <chr> <dbl>
# 1      123         abcd     2
# 2      123         efgh     3
# 3      123         xyz1     4
# 4      123         z123     5
# 5      234         abcd     6
# 6      444         abcd     8
# 7      444         abcd     9
Explanation: any() will return TRUE if there are any rows (within the group) that match the condition. The length-1 result is then recycled to the full length of the group, so the entire group is kept. sum(product_name %in% "abcd") > 0 also works as the condition, but it doesn't read as nicely. You would use sum instead if you wanted a more complicated condition, such as 3 or more matching product names.
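As a sketch of that sum() variant (assuming the same data_file as in the question), this keeps only groups with at least two "abcd" rows; in the sample data only group 444 qualifies:

```r
library(dplyr)

# Hypothetical threshold example: keep groups where "abcd"
# appears on 2 or more rows within the group
data_file %>%
  group_by(group_id) %>%
  filter(sum(product_name %in% "abcd") >= 2)
#   group_id product_name   qty
# 1      444         abcd     8
# 2      444         abcd     9
```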
I prefer %in% to == for things like this because it has better behavior with NA, and it is easy to expand if you wanted to check for any of multiple products per group.
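For example, a minimal sketch of the multiple-product case (the `wanted` vector is a hypothetical name), keeping every group that contains any of the listed products:

```r
library(dplyr)

# Hypothetical list of products of interest
wanted <- c("abcd", "xyz1")

# any() is TRUE for a group if at least one of its rows
# matches one of the wanted products
data_file %>%
  group_by(group_id) %>%
  filter(any(product_name %in% wanted))
```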
If speed and efficiency are an issue, data.table will be faster. This approach relies on a keyed join for the filtering and uses no non-data.table operations, so it should be quite fast:
library(data.table)
df = as.data.table(df)
setkey(df)
groups = unique(subset(df, product_name %in% "abcd", group_id))
df[groups, nomatch = 0]
#    group_id product_name qty
# 1:      123         abcd   2
# 2:      123         efgh   3
# 3:      123         xyz1   4
# 4:      123         z123   5
# 5:      234         abcd   6
# 6:      444         abcd   8
# 7:      444         abcd   9
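As an alternative sketch (not from the original answer), the same filter can be written without keying, by selecting the qualifying group_id values in an inner data.table call and filtering on them:

```r
library(data.table)

# Assumes df is the same data.table as above
df <- as.data.table(data_file)

# Inner call collects the group_ids that have an "abcd" row;
# outer call keeps all rows belonging to those groups
df[group_id %in% df[product_name == "abcd", group_id]]
```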
Comments
Post a Comment