Removing Duplicates From a Dataframe in R
My situation: I'm trying to clean a data set of student results for processing, and I'm having trouble removing duplicates. I want to keep only the "first attempts" of students who have taken a course multiple times. An example of the data, using one of the duplicates, is:
         id period desc
632    1507   1101 90714 research contemporary biological issue
633    1507   1101 6317 explain process of speciation
634    1507   1101 8931 describe gene expression
14448  1507   1201 8931 describe gene expression
14449  1507   1201 6317 explain process of speciation
14450  1507   1201 90714 research contemporary biological issue
25884  1507   1301 6317 explain process of speciation
25885  1507   1301 8931 describe gene expression
25886  1507   1301 90714 research contemporary biological issue
The first two digits of reg_period are the year the paper was sat. As can be seen, for id 1507 I want to keep only the rows with reg_period 1101. So far, an example of the code I've used to find the values I want to trim is:
unique.rows <- unique(df[c("id", "period")])
dups <- unique.rows[duplicated(unique.rows$id), ]
However, there are a couple of problems I'm running into. This only works because the data happens to be ordered by id and reg_period, which isn't guaranteed in the future. Also, I don't know how to take the list of duplicate entries and select the rows that are not in it: %in% doesn't seem to work here, and a loop with rbind runs out of memory.

What's the best way to handle this?
I would use dplyr. Calling your data df:
result = df %>%
  group_by(id) %>%
  filter(period == min(period))
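As a quick check, here is the dplyr approach run on a small data frame shaped like the sample above (column names id, period, and desc; the descriptions are shortened):

```r
library(dplyr)

# Sample data shaped like the question's example:
# one student (id 1507) with three attempts (periods 1101, 1201, 1301)
df <- data.frame(
  id     = rep(1507, 9),
  period = rep(c(1101, 1201, 1301), each = 3),
  desc   = rep(c("research issue", "speciation", "gene expression"), 3)
)

# Within each id, keep every row belonging to the earliest period
result <- df %>%
  group_by(id) %>%
  filter(period == min(period)) %>%
  ungroup()

print(result)
```

Because filter() keeps all rows matching the condition, every row from the first attempt survives, not just one row per id. On recent dplyr versions, slice_min(period, with_ties = TRUE) inside the group expresses the same idea.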
If you prefer base R, pull the id/period combinations you want to keep into a separate data frame and inner join it to the original data:
id_pd = df[order(df$id, df$period), c("id", "period")]
id_pd = id_pd[!duplicated(id_pd$id), ]
result = merge(df, id_pd)
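The same sketch run end to end in base R, on a sample data frame assumed to match the question's shape. Note that the lookup table must be deduplicated on its own id column after sorting, not on the original df's:

```r
# Sample data: one student (id 1507) with three attempts
df <- data.frame(
  id     = rep(1507, 9),
  period = rep(c(1101, 1201, 1301), each = 3),
  desc   = rep(c("research issue", "speciation", "gene expression"), 3)
)

# Sort id/period pairs so each id's earliest period comes first,
# then keep only the first row per id
id_pd <- df[order(df$id, df$period), c("id", "period")]
id_pd <- id_pd[!duplicated(id_pd$id), ]

# Inner join back to the full data: only first-attempt rows remain
result <- merge(df, id_pd)
print(result)
```

Because the pairs are sorted before duplicated() is applied, this works regardless of the row order of the original data, which addresses the ordering concern in the question.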