R - Select grouped rows with at least one matching criterion
I want to select groupings that contain at least one of the elements I am interested in. I am able to do this by creating an intermediate array, but I am looking for something simpler and faster, because the actual data set has over 1M rows (and 20 columns) and I am not sure whether I have sufficient memory to create an intermediate array. More importantly, the method below takes a lot of time on the original file.
Here's the code and data:
a) Data:
```r
dput(data_file)
structure(list(group_id = c(123, 123, 123, 123, 234, 345, 444, 444),
    product_name = c("abcd", "efgh", "xyz1", "z123", "abcd", "efgh", "abcd", "abcd"),
    qty = c(2, 3, 4, 5, 6, 7, 8, 9)),
    .Names = c("group_id", "product_name", "qty"),
    row.names = c(NA, 8L), class = "data.frame")
```

b) Code: I want to select every group_id that has at least one product_name equal to "abcd".
```r
# find out which transactions qualify
data_t <- data_file %>%
  group_by(group_id) %>%
  dplyr::filter(product_name == "abcd") %>%
  select(group_id) %>%
  distinct()

# now filter on them
filtered_t <- data_file %>%
  group_by(group_id) %>%
  dplyr::filter(group_id %in% data_t$group_id)
```

c) Expected output:
```
group_id product_name   qty
   <dbl>        <chr> <dbl>
     123         abcd     2
     123         efgh     3
     123         xyz1     4
     123         z123     5
     234         abcd     6
     444         abcd     8
     444         abcd     9
```

I've been struggling with this for over 3 hours now. I looked at the auto-suggested Stack Overflow thread about selecting rows with at least 2 conditions, but that question is different from mine.
I would do this:
```r
data_file %>%
  group_by(group_id) %>%
  filter(any(product_name %in% "abcd"))
# Source: local data frame [7 x 3]
# Groups: group_id [3]
#
#   group_id product_name   qty
#      <dbl>        <chr> <dbl>
# 1      123         abcd     2
# 2      123         efgh     3
# 3      123         xyz1     4
# 4      123         z123     5
# 5      234         abcd     6
# 6      444         abcd     8
# 7      444         abcd     9
```

Explanation: `any()` will return `TRUE` if there are any rows (within the group) that match the condition. The length-1 result is then recycled to the full length of the group, so the entire group is kept. You could also use `sum(product_name %in% "abcd") > 0` as the condition, but it doesn't read as nicely. Use `sum()` instead if you want a more complicated condition, e.g., 3 or more matching product names.
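A minimal sketch of that `sum()` variant on the sample data (here requiring at least two matches, since no group in the sample has three matching rows):

```r
library(dplyr)

data_file <- data.frame(
  group_id = c(123, 123, 123, 123, 234, 345, 444, 444),
  product_name = c("abcd", "efgh", "xyz1", "z123", "abcd", "efgh", "abcd", "abcd"),
  qty = c(2, 3, 4, 5, 6, 7, 8, 9)
)

# keep only groups with at least two rows matching the product;
# in the sample data, only group 444 has two "abcd" rows
two_plus <- data_file %>%
  group_by(group_id) %>%
  filter(sum(product_name %in% "abcd") >= 2) %>%
  ungroup()
```

`sum()` on a logical vector counts the `TRUE` values, so any threshold works in place of `>= 2`.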
I prefer `%in%` to `==` for things like this because it has better behavior with `NA` and is easy to expand if you want to check for any of multiple products per group.
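A sketch of that expansion, assuming a hypothetical watch list of two products (`wanted` is not part of the original answer):

```r
library(dplyr)

data_file <- data.frame(
  group_id = c(123, 123, 123, 123, 234, 345, 444, 444),
  product_name = c("abcd", "efgh", "xyz1", "z123", "abcd", "efgh", "abcd", "abcd"),
  qty = c(2, 3, 4, 5, 6, 7, 8, 9)
)

# hypothetical list of products of interest
wanted <- c("abcd", "efgh")

# keep any group containing at least one of the wanted products;
# with this list every group qualifies (345 matches via "efgh")
hits <- data_file %>%
  group_by(group_id) %>%
  filter(any(product_name %in% wanted)) %>%
  ungroup()
```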
If speed and efficiency are an issue, data.table will be faster. This approach relies on a keyed join for the filtering and uses no non-data.table operations, so it should be fast:
```r
library(data.table)
df <- as.data.table(df)
setkey(df)
groups <- unique(subset(df, product_name %in% "abcd", group_id))
df[groups, nomatch = 0]
#    group_id product_name qty
# 1:      123         abcd   2
# 2:      123         efgh   3
# 3:      123         xyz1   4
# 4:      123         z123   5
# 5:      234         abcd   6
# 6:      444         abcd   8
# 7:      444         abcd   9
```
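On newer data.table versions the same filtering join can be written without `setkey()`, using the `on=` argument. A sketch under that assumption (the result should match the keyed version above):

```r
library(data.table)

dt <- data.table(
  group_id = c(123, 123, 123, 123, 234, 345, 444, 444),
  product_name = c("abcd", "efgh", "xyz1", "z123", "abcd", "efgh", "abcd", "abcd"),
  qty = c(2, 3, 4, 5, 6, 7, 8, 9)
)

# group ids containing at least one "abcd"
groups <- unique(dt[product_name == "abcd", .(group_id)])

# ad hoc inner join on group_id; nomatch = 0 drops non-matching groups
res <- dt[groups, on = "group_id", nomatch = 0]
```

This avoids the cost of sorting the whole table with `setkey()` when you only need the join once.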