python 3.x - Conditional column selection in pandas


I want to select columns from a DataFrame according to a particular condition. I know this can be done with a loop, but the DataFrame is large and efficiency is crucial. The condition for selecting a column is that it contains either all non-NaN entries, or a sequence of NaNs followed by a sequence of non-NaN entries.

Here is an example. Consider the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1],
                   [3, 2, np.nan, 2],
                   [3, 2, 5, np.nan]])
df

   0    1    2    3
0  1  NaN  2.0  NaN
1  2  NaN  5.0  NaN
2  4  8.0  NaN  1.0
3  3  2.0  NaN  2.0
4  3  2.0  5.0  NaN

From it, I want to select columns 0 and 1. Any advice on how to do this efficiently without looping?
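For reference, the loop version I want to avoid might look roughly like the sketch below (select_columns_loop is just an illustrative name, not from the question); it checks each column for the NaNs-then-valid pattern, but it scales poorly on a wide DataFrame:

# Minimal loop-based sketch of the selection condition (illustration only):
# keep a column if its NaNs, if any, all sit above its first valid value.
def select_columns_loop(df):
    keep = []
    for col in df.columns:
        mask = df[col].isnull().values      # True where the entry is NaN
        first_valid = mask.argmin()         # position of the first non-NaN value
        if mask.sum() == first_valid:       # all NaNs are at the top (or there are none)
            keep.append(col)
    return df[keep]

select_columns_loop(df)   # -> columns 0 and 1 of the example DataFrame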

Logic

  • Count the nulls in each column. If the nulls are all at the beginning, the number of nulls in the column should equal the position of the first valid index.
  • Get the first valid index of each column.
  • Slice the index with the null counts and compare against the first valid indices. Where they are equal, those are the columns to keep.

cnull = df.isnull().sum()
fvald = df.apply(pd.Series.first_valid_index)
cols = df.index[cnull] == fvald
df.loc[:, cols]

   0    1
0  1  NaN
1  2  NaN
2  4  8.0
3  3  2.0
4  3  2.0
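To make the comparison concrete, here is roughly what the intermediate values are for the example frame (the value comments are added for illustration):

cnull = df.isnull().sum()                        # NaNs per column:          [0, 2, 2, 3]
fvald = df.apply(pd.Series.first_valid_index)    # first valid row position: [0, 2, 0, 2]
# With the default RangeIndex, df.index[cnull] maps the counts straight back to
# index labels, so cols is True exactly where the NaN count equals the first
# valid position, i.e. columns 0 and 1.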


Edited for speed improvements

Old answer

def pir1(df):
    cnull = df.isnull().sum()
    fvald = df.apply(pd.Series.first_valid_index)
    cols = df.index[cnull] == fvald
    return df.loc[:, cols]

A much faster answer using the same logic:

def pir2(df):
    nulls = np.isnan(df.values)                 # boolean NaN mask (assumes numeric data)
    null_count = nulls.sum(0)                   # NaNs per column
    first_valid = nulls.argmin(0)               # position of first non-NaN per column
    null_on_top = null_count == first_valid     # True where all NaNs precede the first valid value
    filtered_data = df.values[:, null_on_top]
    filtered_columns = df.columns.values[null_on_top]
    return pd.DataFrame(filtered_data, df.index, filtered_columns)

[Timing plot comparing pir1 and pir2; pir2 is faster.]
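To reproduce a comparison like the one in the plot, a rough benchmark sketch using the standard timeit module is below; the frame size, column count, and NaN placement are arbitrary choices for illustration, not taken from the answer. Note that pir2 relies on np.isnan, so it assumes a fully numeric DataFrame.

import timeit

import numpy as np
import pandas as pd

# Build a wide numeric frame where some columns start with a run of NaNs
# (all sizes here are arbitrary, chosen only for this sketch).
rng = np.random.RandomState(0)
big = pd.DataFrame(rng.rand(10000, 200))
for col in rng.choice(200, size=100, replace=False):
    big.iloc[:rng.randint(1, 100), col] = np.nan   # NaNs only at the top of the column

assert pir1(big).equals(pir2(big))                 # both versions select the same columns

print(timeit.timeit(lambda: pir1(big), number=10))
print(timeit.timeit(lambda: pir2(big), number=10))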

