Trying out causalml: uplift classification

2019-08-16 | 2 min read | machine-learning python causalml uplift

causalml is a Python package from Uber for causal inference with machine learning. I ran through the whole workflow using one of its example notebooks.

For the methodology and a more detailed explanation, see the reference at the end of this post.

Environment

python: 3.6.8

Code

The code is below. A GitHub link to the Jupyter notebook version is at the bottom of the post.

# install causalml
!pip install causalml
 
# import library
import numpy as np
import pandas as pd
 
from causalml.dataset import make_uplift_classification
from causalml.inference.tree import UpliftRandomForestClassifier
from causalml.metrics import plot_gain
 
from sklearn.model_selection import train_test_split
 
# generate datasets
# generates the feature columns plus the group name (treatment_group_key) and the target (conversion)
df, x_names = make_uplift_classification()
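
To make the goal concrete before fitting anything: with a group column and a binary outcome, the simplest notion of uplift is the gap in conversion rate between a treatment group and control. Here is a minimal sketch on a hand-made frame. Only the column names `treatment_group_key` and `conversion` come from the code in this post; the rows and the `treatment1` label are made up for illustration.

```python
import pandas as pd

# Hand-made toy data (not output of make_uplift_classification)
toy = pd.DataFrame({
    "treatment_group_key": ["control"] * 4 + ["treatment1"] * 4,
    "conversion": [0, 0, 1, 0, 1, 0, 1, 1],
})

# Mean conversion per group
rates = toy.groupby("treatment_group_key")["conversion"].mean()

# Naive uplift: each treatment's conversion rate minus the control rate
naive_uplift = rates.drop("control") - rates["control"]
print(naive_uplift)
```

This is a population-level average only. The uplift model fitted below tries to estimate this kind of difference per unit, conditional on the features.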

# Split data to training and testing samples for model validation (next section)
df_train, df_test = train_test_split(df, test_size=0.2, random_state=111)
 
uplift_model = UpliftRandomForestClassifier(control_name='control')
 
uplift_model.fit(df_train[x_names].values,
                 treatment=df_train['treatment_group_key'].values,
                 y=df_train['conversion'].values)
 
y_pred = uplift_model.predict(df_test[x_names].values)
 
# Put the predictions to a DataFrame for a neater presentation
# one column per group in uplift_model.classes_
result = pd.DataFrame(y_pred,
                      columns=uplift_model.classes_)
 
# scores for each group
result
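
A table of per-treatment deltas like `result` can be collapsed into one recommendation per row. The sketch below applies the same rule as the next cell to a hand-made frame, so the behavior is easy to check by eye. The numbers and the names `toy_result` and `toy_best` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical per-treatment deltas for three units (made-up numbers)
toy_result = pd.DataFrame({
    "treatment1": [0.10, -0.02, 0.03],
    "treatment2": [0.04, -0.01, 0.08],
})

# Rows where every delta is negative fall back to control;
# all other rows get the column name with the largest delta
toy_best = np.where((toy_result < 0).all(axis=1),
                    "control",
                    toy_result.idxmax(axis=1))
print(toy_best)
```

The second row has only negative deltas, so it is assigned to control; the first and third rows get `treatment1` and `treatment2` respectively.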

# If all deltas are negative, assign to control; otherwise assign to the treatment
# with the highest delta
# i.e. all-negative rows go to control, the rest go to the treatment with the largest delta
best_treatment = np.where((result < 0).all(axis=1),
                           'control',
                           result.idxmax(axis=1))
 
# Create indicator variables for whether a unit happened to have the
# recommended treatment or was in the control group
# i.e. was the unit actually in the recommended group, or in control?
actual_is_best = np.where(df_test['treatment_group_key'] == best_treatment, 1, 0)
actual_is_control = np.where(df_test['treatment_group_key'] == 'control', 1, 0)
 
# build the synthetic population
synthetic = (actual_is_best == 1) | (actual_is_control == 1)
synth = result[synthetic]
 
auuc_metrics = (synth.assign(is_treated = 1 - actual_is_control[synthetic],
                             conversion = df_test.loc[synthetic, 'conversion'].values,
                             uplift_tree = synth.max(axis=1))
                     .drop(columns=list(uplift_model.classes_)))
 
# conversion is binary (1/0)
# toward the left the share of control units is lower, i.e. the effect is larger for the treated group
plot_gain(auuc_metrics, outcome_col='conversion', treatment_col='is_treated')
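
To get a feel for the kind of curve a gain plot shows, here is a simplified sketch of a cumulative gain calculation written with pandas only. It is my own illustration under stated assumptions, not causalml's implementation: rank units by score, and for each top-k slice take the conversion-rate gap between treated and control units, scaled by k. The column names match `auuc_metrics` above; the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up scored units: model score, treatment flag, binary outcome
toy = pd.DataFrame({
    "uplift_tree": [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.0],
    "is_treated":  [1,   0,   1,   0,   1,   0,   1,   0],
    "conversion":  [1,   0,   1,   0,   0,   0,   1,   1],
})

# Highest predicted uplift first
ranked = toy.sort_values("uplift_tree", ascending=False).reset_index(drop=True)
n = np.arange(1, len(ranked) + 1)

# Running counts of treated / control units and their conversions
cum_t = ranked["is_treated"].cumsum()
cum_c = n - cum_t
cum_y_t = (ranked["conversion"] * ranked["is_treated"]).cumsum()
cum_y_c = (ranked["conversion"] * (1 - ranked["is_treated"])).cumsum()

# Conversion-rate gap within the top k (NaN while one arm is still empty)
lift = cum_y_t / cum_t - cum_y_c / cum_c

# Cumulative gain: the gap scaled by the number of units targeted
gain = lift * n
print(gain)
```

If the ranking is informative, the curve rises quickly at the start and flattens or drops once low-scoring units are included.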

The notebook file is here:

https://github.com/uni-3/jupyter-notebooks/blob/master/uplift_trees_with_synthetic_data.ipynb

Reference

https://www.slideshare.net/PierreGutierrez2/introduction-to-uplift-modelling