python 3.x - Conditional column selection in pandas
I want to select columns of a DataFrame according to a particular condition. I know this can be done with a loop, but my DataFrame is large and efficiency is crucial. The condition for selecting a column is that it contains either all non-NaN entries, or a sequence of NaNs followed by a sequence of non-NaN entries.
Here is an example. Consider the following DataFrame:
    pd.DataFrame([[1, np.nan, 2, np.nan],
                  [2, np.nan, 5, np.nan],
                  [4, 8, np.nan, 1],
                  [3, 2, np.nan, 2],
                  [3, 2, 5, np.nan]])

       0    1    2    3
    0  1  NaN  2.0  NaN
    1  2  NaN  5.0  NaN
    2  4  8.0  NaN  1.0
    3  3  2.0  NaN  2.0
    4  3  2.0  5.0  NaN
From it, I want to select columns 0 and 1. Any advice on how to do this efficiently without looping?
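For reference, the kind of per-column loop the question is trying to avoid might look like the sketch below; this is my own reconstruction, not code from the original post, and it assumes the frame above has been assigned to df:

    # Keep a column if its NaNs (if any) form a single block at the top
    keep = []
    for col in df.columns:
        s = df[col]
        first = s.first_valid_index()
        # the number of NaNs must equal the position of the first valid value
        if first is not None and s.isnull().sum() == s.index.get_loc(first):
            keep.append(col)
    result = df[keep]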
Logic:
- Count the nulls in each column. If the nulls sit at the beginning of the column, the null count should equal the position of the first valid index.
- Get the first valid index of each column.
- Slice the index at the null counts and compare against the first valid indices. Where they are equal, that column qualifies.
    cnull = df.isnull().sum()
    fvald = df.apply(pd.Series.first_valid_index)
    cols = df.index[cnull] == fvald
    df.loc[:, cols]
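To make this concrete, here is an end-to-end sketch on the example frame from the question; the imports, the df assignment, the comments and the printed output are mine, while the selection lines are the ones above:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([[1, np.nan, 2, np.nan],
                       [2, np.nan, 5, np.nan],
                       [4, 8, np.nan, 1],
                       [3, 2, np.nan, 2],
                       [3, 2, 5, np.nan]])

    cnull = df.isnull().sum()                      # NaNs per column
    fvald = df.apply(pd.Series.first_valid_index)  # first non-NaN row label per column
    cols = df.index[cnull] == fvald                # True where the NaNs form a block on top
    print(df.loc[:, cols])
    #    0    1
    # 0  1  NaN
    # 1  2  NaN
    # 2  4  8.0
    # 3  3  2.0
    # 4  3  2.0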
Edited for speed improvements.
Old answer:
    def pir1(df):
        cnull = df.isnull().sum()
        fvald = df.apply(pd.Series.first_valid_index)
        cols = df.index[cnull] == fvald
        return df.loc[:, cols]
A much faster answer using the same logic:
    def pir2(df):
        # boolean mask of NaN positions (requires an all-numeric frame)
        nulls = np.isnan(df.values)
        # NaNs per column
        null_count = nulls.sum(0)
        # position of the first non-NaN in each column
        first_valid = nulls.argmin(0)
        # the NaNs form a block on top exactly when the two match
        null_on_top = null_count == first_valid
        filtered_data = df.values[:, null_on_top]
        filtered_columns = df.columns.values[null_on_top]
        return pd.DataFrame(filtered_data, df.index, filtered_columns)
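As a rough way to check that the two versions agree and to compare their speed, here is a benchmark sketch; it assumes pir1 and pir2 above are already defined, and the frame size, NaN layout and use of timeit are my own choices rather than part of the original answer. Note that pir2 relies on np.isnan, so it expects numeric columns only:

    import timeit

    import numpy as np
    import pandas as pd

    # larger test frame: even-numbered columns get a block of NaNs on top
    # (they qualify); odd-numbered columns get NaNs in the middle (they don't)
    rng = np.random.default_rng(0)
    data = rng.random((10_000, 200))
    data[:50, ::2] = np.nan
    data[100:150, 1::2] = np.nan
    big = pd.DataFrame(data)

    # both versions should select the same columns
    assert list(pir1(big).columns) == list(pir2(big).columns)

    print(timeit.timeit(lambda: pir1(big), number=20))
    print(timeit.timeit(lambda: pir2(big), number=20))

The speedup in pir2 comes from doing the counting on the underlying NumPy array instead of calling first_valid_index on every column through df.apply.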