python - Sklearn: improve algorithm and improve quality
For customers whose gender is unknown (those who appear in transactions.csv but not in the training set customers_gender_train.csv), I need to predict the probability that the customer is male (value 1), using the transactions file. Part of the training set looks like this:

    customer_id,gender
    75562265
    10928546,1
    69348468,1
    84816985,1
    61009479
    74045822
    27979606,1
    54129921
    23160845
    44160317,1
    45646491
    36008593
    48111232,1
    37245184
    82609845
    60046355,1

Part of transactions.csv looks like this:

    customer_id,tr_datetime,mcc_code,tr_type,amount,term_id
    39026145,0 10:23:26,4814,1030,-2245.92,
    39026145,1 10:19:29,6011,7010,56147.89,
    39026145,1 10:20:56,4829,2330,-56147.89,
    39026145,1 10:39:54,5499,1010,-1392.47,
    39026145,2 15:33:42,5499,1010,-920.83,
    39026145,2 15:53:49,5541,1010,-14643.37,
    39026145,3 15:29:08,5499,1010,-1010.66,
    39026145,4 12:11:57,5200,1010,-2829.85,
    39026145,5 15:19:19,5499,1010,-628.86,
    39026145,6 07:08:31,4814,1030,-5614.79,
    39026145,7 14:23:17,5541,1010,-14643.37,
    39026145,7 14:40:02,5499,1010,-3458.71,
    39026145,8 06:49:35,5732,1010,-21897.68,

I use this code to predict gender:
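
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier

    transactions = pd.read_csv('transactions.csv')
    customers_gender = pd.read_csv('customers_gender_train.csv')

    # Feature matrix: one row per customer, one column per MCC code,
    # each cell holding the number of transactions with that code.
    x = transactions.groupby('customer_id') \
        .apply(lambda x: x[['mcc_code']].unstack().value_counts()) \
        .unstack() \
        .fillna(0)

    customers_gender = customers_gender.set_index('customer_id')

    # Align labels with the feature rows; customers without a known gender
    # become NaN (reindex keeps them, unlike .loc in recent pandas) and are dropped.
    y_train = customers_gender.reindex(x.index).gender
    y_train = y_train.reset_index()
    del y_train['customer_id']
    y_train = y_train.dropna(axis=0)

    # Keep only the feature rows that still have a label.
    x_train = x.reset_index()
    x_train = x_train.loc[y_train.index].set_index('customer_id')

    clf = GradientBoostingClassifier(random_state=13)
    clf.fit(x_train, y_train.values[:, 0])

    # Test set: every customer with transactions who is not in the training file.
    x_test = x.drop(customers_gender.index)
    result = pd.DataFrame(x_test.index, columns=['customer_id'])
    result['gender'] = clf.predict_proba(x_test)[:, 1]  # probability of class 1 (male)
    result.to_csv('baseline_a.csv', index=False)
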
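For readers following the feature step: the groupby/apply/unstack chain above appears to count, for each customer, how many transactions fall under each MCC code. A minimal sketch of the same kind of count matrix using `pd.crosstab`, on a tiny invented frame rather than the real transactions.csv:

```python
import pandas as pd

# Toy stand-in for transactions.csv; the values are invented for illustration.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "mcc_code":    [5499, 5499, 6011, 4814, 5499, 6011],
})

# One row per customer, one column per MCC code,
# each cell = number of that customer's transactions with that code.
mcc_counts = pd.crosstab(transactions["customer_id"], transactions["mcc_code"])
print(mcc_counts)
```

Codes a customer never used come out as 0 directly, so no separate `fillna(0)` is needed in this form.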
The baseline above (baseline_a.csv) gets a quality score of 0.84093. How can I improve it?
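One way to approach this, sketched under assumptions: add features from the columns the baseline ignores (`tr_type` and `amount`) and compare variants with local cross-validation before submitting. The post does not name the quality metric, so ROC AUC is assumed here only because the submission is a probability. The data below is synthetic with random labels, `build_features` is a hypothetical helper, and nothing in this sketch shows that the extra features actually help on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(13)

# Synthetic stand-in for transactions.csv and the gender labels (all invented).
n_customers, n_rows = 200, 4000
transactions = pd.DataFrame({
    "customer_id": rng.integers(0, n_customers, n_rows),
    "mcc_code": rng.choice([4814, 5499, 5541, 6011], n_rows),
    "tr_type": rng.choice([1010, 1030, 2330, 7010], n_rows),
    "amount": rng.normal(-1000, 5000, n_rows),
})
gender = pd.Series(rng.integers(0, 2, n_customers), index=range(n_customers))


def build_features(tx):
    """Hypothetical helper: per-customer counts and amount statistics."""
    mcc = pd.crosstab(tx["customer_id"], tx["mcc_code"]).add_prefix("mcc_")
    tr = pd.crosstab(tx["customer_id"], tx["tr_type"]).add_prefix("type_")
    amt = (tx.groupby("customer_id")["amount"]
             .agg(["count", "mean", "std", "min", "max"])
             .add_prefix("amount_"))
    # std is NaN for customers with a single transaction, hence fillna(0).
    return mcc.join(tr).join(amt).fillna(0)


x_full = build_features(transactions)
x_base = x_full.filter(like="mcc_")   # the baseline's MCC counts only
y = gender.loc[x_full.index]          # labels aligned to the feature rows

# Compare the two feature sets with the same model and the same folds.
results = {}
for name, x in [("mcc counts only", x_base),
                ("mcc + tr_type + amount", x_full)]:
    scores = cross_val_score(GradientBoostingClassifier(random_state=13),
                             x, y, cv=5, scoring="roc_auc")
    results[name] = scores.mean()
    print(name, round(results[name], 4))
```

Because the labels here are random, the printed scores only demonstrate the mechanics. On the real files, the same loop would show whether a feature change moves the local score before it is spent on a submission.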