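causalml is a Python package from Uber for causal inference with machine learning. I worked through the whole flow using one of its example notebooks.
For the methods and a more detailed explanation, see: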
- https://ohke.hateblo.jp/entry/2019/01/05/230000
- http://kamonohashiperry.com/archives/2197
Environment
- python: 3.6.8
Code
The code is below. A GitHub link to a file that runs in Jupyter Notebook is at the bottom.
# install causalml
!pip install causalml
# import library
import numpy as np
import pandas as pd
from causalml.dataset import make_uplift_classification
from causalml.inference.tree import UpliftRandomForestClassifier
from causalml.metrics import plot_gain
from sklearn.model_selection import train_test_split
# generate datasets
# Generates the feature columns, the group name (treatment_group_key), and the target (conversion)
df, x_names = make_uplift_classification()
# Split data to training and testing samples for model validation (next section)
df_train, df_test = train_test_split(df, test_size=0.2, random_state=111)
uplift_model = UpliftRandomForestClassifier(control_name='control')
uplift_model.fit(df_train[x_names].values,
                 treatment=df_train['treatment_group_key'].values,
                 y=df_train['conversion'].values)
y_pred = uplift_model.predict(df_test[x_names].values)
# Put the predictions into a DataFrame for a neater presentation
# One column per treatment group (class)
result = pd.DataFrame(y_pred,
                      columns=uplift_model.classes_)
# Score for each group
result
# If all deltas are negative, assign to control; otherwise assign to the treatment
# with the highest delta
best_treatment = np.where((result < 0).all(axis=1),
                          'control',
                          result.idxmax(axis=1))
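To make the assignment rule above concrete, here is a minimal sketch on a toy table. The column names `treatment1` and `treatment2` and all score values are made up for illustration; they are not output from the model.

```python
import numpy as np
import pandas as pd

# Hypothetical uplift scores for four units and two treatments
# (made-up values, not model output).
toy_result = pd.DataFrame({
    'treatment1': [0.10, -0.02, 0.03, -0.01],
    'treatment2': [0.05, -0.04, 0.08, 0.02],
})

# 'control' when every score in the row is negative,
# otherwise the column name with the largest score.
toy_best = np.where((toy_result < 0).all(axis=1),
                    'control',
                    toy_result.idxmax(axis=1))
print(toy_best)
```

Only the second row has all-negative scores, so only that unit falls back to control. The last row has one negative and one positive score, so it still gets a treatment.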
# Create indicator variables for whether a unit happened to have the
# recommended treatment or was in the control group
actual_is_best = np.where(df_test['treatment_group_key'] == best_treatment, 1, 0)
actual_is_control = np.where(df_test['treatment_group_key'] == 'control', 1, 0)
# Build the synthetic population
synthetic = (actual_is_best == 1) | (actual_is_control == 1)
synth = result[synthetic]
auuc_metrics = (synth.assign(is_treated=1 - actual_is_control[synthetic],
                             conversion=df_test.loc[synthetic, 'conversion'].values,
                             uplift_tree=synth.max(axis=1))
                     .drop(columns=list(uplift_model.classes_)))
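The synthetic population step can be easier to follow on a handful of rows. The sketch below uses a hypothetical `recommended` column and made-up group assignments. It keeps only the units that either received the treatment the model would recommend or were in control, then marks which of them were treated.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the group each unit was actually in, and the
# group a model would recommend for it (made-up values).
toy = pd.DataFrame({
    'treatment_group_key': ['control', 'treatment1', 'treatment2',
                            'treatment1', 'control'],
    'recommended': ['treatment1', 'treatment1', 'treatment1',
                    'treatment2', 'control'],
})

got_recommended = (toy['treatment_group_key'] == toy['recommended']).to_numpy()
in_control = (toy['treatment_group_key'] == 'control').to_numpy()

# Keep a unit if it received the recommended treatment or was in control.
keep = got_recommended | in_control
toy_synth = toy[keep].assign(is_treated=1 - in_control[keep].astype(int))
print(toy_synth)
```

Rows 2 and 3 are dropped because those units received a treatment other than the recommended one, so they say nothing about how the recommendation would have performed.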
# conversion is binary (1/0)
# On the left of the plot the share of control units is low, i.e. the treatment group shows the larger effect
plot_gain(auuc_metrics, outcome_col='conversion', treatment_col='is_treated')
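For intuition about what a gain curve measures, the following is a simplified sketch of the general idea: sort units by score, then at each cutoff compare the conversion rate of treated and control units seen so far and scale by the number of units. It is an assumption-laden illustration with made-up data, not causalml's implementation, and the `score` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: model score, treatment flag, and binary outcome.
toy = pd.DataFrame({
    'score':      [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    'is_treated': [1, 0, 1, 0, 1, 0],
    'conversion': [1, 0, 1, 0, 0, 1],
})

# Rank units from highest to lowest score.
ranked = toy.sort_values('score', ascending=False).reset_index(drop=True)
n = np.arange(1, len(ranked) + 1)

# Running counts of treated / control units and their conversions.
cum_tr = ranked['is_treated'].cumsum()
cum_ct = n - cum_tr
cum_y_tr = (ranked['conversion'] * ranked['is_treated']).cumsum()
cum_y_ct = (ranked['conversion'] * (1 - ranked['is_treated'])).cumsum()

# Cumulative lift: difference in conversion rates up to each cutoff
# (NaN while one of the groups is still empty).
lift = cum_y_tr / cum_tr - cum_y_ct / cum_ct
# Cumulative gain: lift scaled by the number of units included so far.
gain = lift * n
print(gain)
```

A curve that rises quickly on the left means the units the model scores highest are the ones where treated units convert much more often than control units.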
The notebook file is here:
https://github.com/uni-3/jupyter-notebooks/blob/master/uplift_trees_with_synthetic_data.ipynb