What is PyCaret?

PyCaret (Classification And REgression Training) is a wrapper around scikit-learn and other machine learning packages. It automates many machine learning tasks, so a reader who understands the underlying theory can carry out machine learning with little effort. The official page is https://pycaret.org/.

Its main features are as follows.

Machine learning with short code
Automation (AutoML)
Open source

This chapter covers the following five tasks.

Regression
Classification
Clustering
Anomaly detection
Association rule mining

PyCaret also includes other modules, such as natural language processing (NLP) and time series.

Preparing to run on Google Colab

Prepare the environment as follows. If a conflicting package is already installed, the runtime must be restarted.

!pip install pycaret  # to install every optional package, use: !pip install pycaret[full]
from pycaret.utils import enable_colab
enable_colab()

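Regression

Predicting sales from advertising

We use the advertising data at http://logopt.com/data/Advertising.csv.

The task is to predict sales (Sales) from the advertising spent on television (TV), radio (Radio), and newspapers (Newspaper).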

import pandas as pd 
df = pd.read_csv(
    "http://logopt.com/data/Advertising.csv", index_col=0
)  # use column 0 as the index
df.tail()

The independent variables (feature vector) $X$ are the TV, Radio, and Newspaper columns, and the dependent variable (target) $y$ is the Sales column.

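Basic PyCaret workflow

Step 1: Call setup(data frame) to prepare the data. The target argument names the target column, and the session_id argument sets the random seed. setup infers the data type of each column and then waits for input; press Return if the inferred types look right. Preprocessing then runs automatically and a summary is displayed. If necessary, call setup again with arguments that change the preprocessing.
Step 2: Compare models with compare_models (or build a single model with create_model). The fold argument sets the number of cross-validation folds. The return value is the model instance with the best score.

(Note: on a slow machine, it helps to leave out models that take a long time to train. Pass a list of the models to skip in the exclude argument.)

Step 3: Make predictions with predict_model.

from pycaret.regression import *  # import the regression functions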

reg = setup(df, target="Sales", session_id=123)

best_model = compare_models(fold=5)

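Regression models

No.	ID	Regression model	Description
1	et	Extra Trees Regressor	Decision-tree ensemble that chooses splits at random
2	gbr	Gradient Boosting Regressor	Gradient boosting
3	xgboost	Extreme Gradient Boosting	XGBoost (gradient boosting with added regularization)
4	rf	Random Forest Regressor	Random forest (ensemble of decision trees trained on bootstrap samples)
5	catboost	CatBoost Regressor	Gradient boosting with special handling of categorical variables
6	ada	AdaBoost Regressor	Adaptive boosting
7	dt	Decision Tree Regressor	Decision tree
8	lightgbm	Light Gradient Boosting Machine	Lightweight variant of gradient boosting
9	knn	K Neighbors Regressor	$k$-nearest neighbors
10	lasso	Lasso Regression	Lasso regression (linear regression with L1 regularization)
11	en	Elastic Net	Elastic Net (linear regression with L1 and L2 regularization)
12	lar	Least Angle Regression	Adds, one at a time, the feature most correlated with the residual between the predictions and the training targets
13	lr	Linear Regression	Linear regression
14	ridge	Ridge Regression	Ridge regression (linear regression with L2 regularization)
15	br	Bayesian Ridge	Bayesian ridge regression
16	huber	Huber Regressor	Huber regression
17	omp	Orthogonal Matching Pursuit	Greedily adds features one at a time
18	llar	Lasso Least Angle Regression	Feature selection by fitting the Lasso with Least Angle Regression
19	dummy	Dummy Regressor	Simple baseline model
20	par	Passive Aggressive Regressor	Online learning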

Evaluation metrics for regression models

The following metrics are displayed. See the deep learning chapter for their definitions.

MAE: mean absolute error
MSE: mean squared error
RMSE: root mean squared error
R2: coefficient of determination $R^2$
RMSLE: root mean squared logarithmic error
MAPE: mean absolute percentage error
TT (Sec): computation time
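As a rough illustration of what these columns measure, the sketch below computes the six error metrics in plain Python using their usual textbook definitions. The function name `regression_metrics` and the toy numbers are made up for this example; it is not PyCaret's own implementation, and it assumes the true values are nonzero (for MAPE), not all equal (for R2), and greater than -1 (for RMSLE).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return the usual regression metrics for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # RMSLE compares log(1 + value) instead of the raw values.
    log_errors = [math.log1p(t) - math.log1p(p) for t, p in zip(y_true, y_pred)]
    rmsle = math.sqrt(sum(e * e for e in log_errors) / n)
    # MAPE is returned as a fraction (0.05 means 5%).
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "RMSLE": rmsle, "MAPE": mape}


# Toy data, invented for illustration only.
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 18.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Smaller is better for every metric here except R2, where values closer to 1 indicate a better fit.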

best_model

ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
                    max_depth=None, max_features='auto', max_leaf_nodes=None,
                    max_samples=None, min_impurity_decrease=0.0,
                    min_impurity_split=None, min_samples_leaf=1,
                    min_samples_split=2, min_weight_fraction_leaf=0.0,
                    n_estimators=100, n_jobs=-1, oob_score=False,
                    random_state=123, verbose=0, warm_start=False)

best_model_results = pull()  # get the results as a data frame
best_model_results.to_csv("best_model.csv")  # save the results to a CSV file

Prediction

predict_model(best_model)

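Visualization (regression)

The basic steps for visualization are as follows.

Step 1: Draw a plot with plot_model(model instance). The plot argument selects the kind of plot. The default is the residuals plot.
Step 2: Visualize an interpretation of the results with interpret_model(model instance).

The plot argument of plot_model accepts the following values.

"residuals": Residuals plot (default)
"error": Prediction error plot
"cooks": Cook's distance plot (for spotting outliers)
"feature": Feature importance plot
"learning": Learning curve
"vc": Validation curve
"manifold": Features projected to two dimensions by dimensionality reduction
"parameter": Table of the model's parameters
"tree": Drawing of the decision tree (tree-based models only)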

plot_model(best_model);

plot_model(best_model, plot="error");

plot_model(best_model, plot="cooks");

plot_model(best_model, plot="feature");

plot_model(best_model, plot="learning");

plot_model(best_model, plot="vc");

plot_model(best_model, plot="manifold");

plot_model(best_model, plot="parameter");

Build a small decision tree (dt) model of depth 3 and visualize it.

dt = create_model("dt", max_depth=3)
plot_model(dt, plot="tree");

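Interpreting the model

The interpret_model function interprets a model by computing SHAP values, which are based on Shapley values.

The plot argument of interpret_model accepts the following values.

"summary": How the SHAP values of each feature affect the target (default)
"correlation": Correlation plot between a feature and its SHAP values
"reason": SHAP values for an individual observation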

interpret_model(best_model);

interpret_model(best_model, plot="correlation");

interpret_model(best_model, plot="reason", observation=10)

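Problem (SAT, GPA)

Use the data at http://logopt.com/data/SATGPA.csv to predict GPA from the two SAT scores.

If the data is used as is, the "MathSAT" and "VerbalSAT" columns are treated as categorical variables, so convert them to floating-point numbers first.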

gpa = pd.read_csv(
    "http://logopt.com/data/SATGPA.csv",
    index_col=0,
    dtype={"MathSAT": float, "VerbalSAT": float},
)
gpa.head()

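Problem (housing prices)

Run a regression analysis on the Boston housing data at http://logopt.com/data/Boston.csv.

medv is the house price, to be predicted from the other data (crime rate, population, and so on).

Problem (car fuel economy)

Run a regression analysis on the car fuel economy data at http://logopt.com/data/Auto.csv.

Details of the data are available at

https://vincentarelbundock.github.io/Rdatasets/doc/ISLR/Auto.html

The first column is the fuel economy (mpg: miles per gallon), which is to be predicted from the other columns. The last column is the car name and can be ignored.

Problem (concrete strength)

For the concrete strength data below, use linear regression to estimate the strength in the strength column from the other columns.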

concrete = pd.read_csv("http://logopt.com/data/concrete.csv")
concrete.head()

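Problem (bike sharing)

For the bike sharing data below, use linear regression to estimate the number of riders in the riders column. Remove the date and casual columns before running the regression.

Also discuss why the casual column is not included in the estimation.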

bikeshare = pd.read_csv("http://logopt.com/data/bikeshare.csv")
bikeshare.head()

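Predicting diamond prices (categorical variables)

We read the diamond price data from http://logopt.com/data/Diamond.csv and predict prices by regression.

The columns are ["carat","colour","clarity","certification","price"], and the price (price) is predicted from the other columns.

Every column other than carat (carat) stores its information as strings.

Such columns are called categorical variables.

In PyCaret, the preprocessing function setup converts them automatically, so the procedure is exactly the same as before.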
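To give a feel for what such a conversion can look like, here is a minimal sketch of one-hot encoding, one common way to turn a string-valued column into numeric columns. Whether setup uses exactly this encoding for every column is an assumption made for illustration; the helper `one_hot_encode` is hypothetical and is not part of PyCaret.

```python
def one_hot_encode(rows, column):
    """Replace a string-valued column by one 0/1 indicator column per level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per level of the categorical variable.
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


# Three rows in the style of the diamond data (carat and colour only).
rows = [
    {"carat": 0.30, "colour": "D"},
    {"carat": 0.30, "colour": "E"},
    {"carat": 0.31, "colour": "D"},
]
encoded = one_hot_encode(rows, "colour")
for row in encoded:
    print(row)
```

Each level of the categorical variable becomes its own column, which is why the number of columns after preprocessing can be larger than in the original data.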

diamond = pd.read_csv("http://logopt.com/data/Diamond.csv", index_col=0)
diamond.head()

reg = setup(diamond, target="price", session_id=123)

best_model = compare_models(fold=5)

plot_model(best_model, plot="feature")

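Problem (car prices)

Read the car price data from http://logopt.com/data/carprice.csv and predict prices by regression.

Details of the data are at https://vincentarelbundock.github.io/Rdatasets/doc/DAAG/carprice.html.

Predict the price (Price) from the vehicle type (Type), the gallons used per 100 miles (gpm100), the miles per gallon in the city (MPGcity), and the miles per gallon on the highway (MPGhighway).

Problem (tips)

For the tips data below, use regression to predict the tip amount (tip).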

import seaborn as sns

tips = sns.load_dataset("tips")
tips.head()

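Binary classification

We use an example that decides whether an email is spam (junk mail).

https://archive.ics.uci.edu/ml/datasets/spambase

From various numerical features, we decide whether the is_spam column is $1$ (spam) or $0$ (not spam).

spam = pd.read_csv("http://logopt.com/data/spam.csv")

The is_spam column is the dependent variable (target) $y$, and the remaining columns are the independent variables (feature vector) $X$.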

from pycaret.classification import *

clf = setup(data=spam, target="is_spam", session_id=123)

best_model = compare_models()

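Classification models

No.	ID	Classification model	Description
1	et	Extra Trees Classifier	Decision-tree ensemble that chooses splits at random
2	gbc	Gradient Boosting Classifier	Gradient boosting
3	xgboost	Extreme Gradient Boosting	XGBoost (gradient boosting with added regularization)
4	rf	Random Forest Classifier	Random forest (ensemble of decision trees trained on bootstrap samples)
5	catboost	CatBoost Classifier	Gradient boosting with special handling of categorical variables
6	ada	AdaBoost Classifier	Adaptive boosting
7	dt	Decision Tree Classifier	Decision tree
8	lightgbm	Light Gradient Boosting Machine	Lightweight variant of gradient boosting
9	knn	K Neighbors Classifier	$k$-nearest neighbors
10	lda	Linear Discriminant Analysis	Linear discriminant analysis (assumes normal distributions with a covariance matrix shared by all classes)
11	qda	Quadratic Discriminant Analysis	Quadratic discriminant analysis (assumes a different normal distribution for each class)
12	lr	Logistic Regression	Logistic regression
13	ridge	Ridge Classifier	Ridge classifier
14	nb	Naive Bayes	Naive Bayes
15	svm	SVM - Linear Kernel	Support vector machine (linear kernel)
16	dummy	Dummy Classifier	Simple baseline model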

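Evaluation metrics for classification models

The following metrics are displayed. For definitions not given here, see the chapters on machine learning and deep learning.

Accuracy: proportion of correct predictions
AUC: area under the curve
Recall: recall
Prec.: precision
F1: F1 score
Kappa

Cohen's $\kappa$ (kappa) corrects the accuracy by the probability of a correct prediction arising by chance when both the predictions and the true labels are assumed to occur at random. It is defined as follows.

Probability of a TP by chance (defined here for binary classification) $$p_{tp} = \frac{\mathrm{TP}+\mathrm{FN}}{\mathrm{TP}+\mathrm{FN}+\mathrm{FP}+\mathrm{TN}} \cdot \frac{\mathrm{TP}+\mathrm{FP}}{\mathrm{TP}+\mathrm{FN}+\mathrm{FP}+\mathrm{TN}}$$

Probability of a TN by chance $$p_{tn} = \frac{\mathrm{FN}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FN}+\mathrm{FP}+\mathrm{TN}} \cdot \frac{\mathrm{FP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FN}+\mathrm{FP}+\mathrm{TN}}$$

Probability of a correct prediction by chance $$p_e = p_{tp} + p_{tn}$$

Accuracy $$ p_0 = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FN}+\mathrm{FP}+\mathrm{TN}} $$

With this notation, $\kappa$ is given by

$$\kappa = \frac{p_0-p_e}{1-p_e}=\frac{2 \times (\mathrm{TP} \times \mathrm{TN} - \mathrm{FN} \times \mathrm{FP})}{(\mathrm{TP} + \mathrm{FP}) \times (\mathrm{FP} + \mathrm{TN}) + (\mathrm{TP} + \mathrm{FN}) \times (\mathrm{FN} + \mathrm{TN})}$$

MCC (Matthews correlation coefficient)

MCC works well even for imbalanced data and is symmetric (swapping positive and negative leaves it unchanged). It is defined as follows.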

$$\text{MCC} = \frac{ \mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN} } {\sqrt{ (\mathrm{TP} + \mathrm{FP}) ( \mathrm{TP} + \mathrm{FN} ) ( \mathrm{TN} + \mathrm{FP} ) ( \mathrm{TN} + \mathrm{FN} ) } }$$

TT (Sec): computation time
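The definitions of $\kappa$ and MCC above can be checked numerically. The sketch below evaluates $\kappa$ both from $p_0$ and $p_e$ and from the closed form, and confirms that MCC does not change when positive and negative are swapped. The function name `kappa_and_mcc` and the confusion-matrix counts are invented for this example.

```python
import math


def kappa_and_mcc(tp, fn, fp, tn):
    """Return (kappa from p_0 and p_e, kappa from the closed form, MCC)."""
    n = tp + fn + fp + tn
    p_tp = (tp + fn) / n * (tp + fp) / n  # chance probability of a TP
    p_tn = (fn + tn) / n * (fp + tn) / n  # chance probability of a TN
    p_e = p_tp + p_tn                     # chance probability of being correct
    p_0 = (tp + tn) / n                   # accuracy
    kappa = (p_0 - p_e) / (1 - p_e)
    kappa_closed = 2 * (tp * tn - fn * fp) / (
        (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)
    )
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return kappa, kappa_closed, mcc


# Hypothetical confusion matrix with 100 samples.
kappa, kappa_closed, mcc = kappa_and_mcc(tp=40, fn=10, fp=5, tn=45)

# Swapping the roles of positive and negative exchanges TP with TN and FP with FN.
swapped = kappa_and_mcc(tp=45, fn=5, fp=10, tn=40)

print(f"kappa = {kappa:.4f}, closed form = {kappa_closed:.4f}")
print(f"MCC = {mcc:.4f}, after swapping classes = {swapped[2]:.4f}")
```

For these counts $p_0 = 0.85$ and $p_e = 0.5$, so both expressions for $\kappa$ give $0.7$.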

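Visualization (classification)

Plots are drawn with plot_model(model instance). The plot argument selects the kind of plot. interpret_model works the same way as for regression.

The plot argument of plot_model accepts the following values (those shared with regression are omitted).

"auc": ROC curve with the area under it (default)
"threshold": Change in the metrics as the discrimination threshold varies
"pr": Precision-recall curve
"confusion_matrix": Confusion matrix
"error": Prediction errors for each class
"class_report": Heatmap of the metrics for each class
"boundary": Plot of the decision boundary
"calibration": Plot of the calibration curve
"dimension": Dimension Learning
"gain": Plot of how much of a class is captured (the gain) when only a given percentage of the data is used
"lift": Plot of the ratio of the gain above to the baseline (no predictive model)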

plot_model(best_model, plot="auc");

plot_model(best_model, plot="threshold");

plot_model(best_model, plot="pr");

plot_model(best_model, plot="confusion_matrix");

plot_model(best_model, plot="error");

plot_model(best_model, plot="class_report");

plot_model(best_model, plot="boundary");

plot_model(best_model, plot="calibration");

plot_model(best_model, plot="dimension");

plot_model(best_model, plot="lift");

plot_model(best_model, plot="gain");
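The gain and lift curves can be understood through a small calculation. The sketch below assumes the usual definition: samples are ranked by predicted score, the gain is the share of all positives found in the top fraction, and the lift is that gain divided by the fraction (the gain expected without a model). The function `gain_and_lift` and the data are hypothetical, and the details of what plot_model computes may differ.

```python
def gain_and_lift(y_true, scores, fraction):
    """Cumulative gain and lift when only the top `fraction` of samples is used."""
    # Rank sample indices from the highest predicted score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, round(fraction * len(order)))
    captured = sum(y_true[i] for i in order[:k])
    gain = captured / sum(y_true)   # share of all positives found in the top k
    lift = gain / (k / len(order))  # ratio to the no-model baseline
    return gain, lift


# Ten hypothetical samples: true labels and predicted scores for the positive class.
y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.1, 0.6, 0.7, 0.05, 0.5]

gain, lift = gain_and_lift(y_true, scores, fraction=0.3)
print(f"gain = {gain:.2f}, lift = {lift:.2f}")
```

Here the top 30% of the samples contains 3 of the 4 positives, so the gain is 0.75 and the lift is 2.5.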

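Multiclass classification

Classification is not limited to two classes; three or more classes can also be handled. Below we use the iris data to classify three species of iris.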

import plotly.express as px

iris = px.data.iris()
iris.head()

clf = setup(data=iris, target="species", ignore_features=["species_id"], session_id=123)

best_model = compare_models()

We draw the following plots.

Confusion matrix
Feature importance

Note that plots that vary the threshold are not available when there are three or more classes.

plot_model(best_model, plot="confusion_matrix");

plot_model(best_model, plot="feature");

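Tuning and saving experiment logs

Here we use an example that decides from data whether a mushroom is poisonous.

The target column is the target (dependent variable); its value is edible for edible mushrooms and poisonous for poisonous ones.

None of the other columns is numeric either.

PyCaret preprocesses the data automatically, so the procedure is exactly the same.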

mashroom = pd.read_csv("http://logopt.com/data/mashroom.csv")
mashroom.head()

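Preprocessing

When preprocessing with setup, setting the log_experiment argument to True saves the experiment results. In addition, the experiment_name argument names the experiment, and setting the log_plots argument to True saves the plots.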

from pycaret.classification import *

clf = setup(
    data=mashroom,
    target="target",
    session_id=123,
    log_experiment=True,
    experiment_name="mashroom_1",
    log_plots=True,
)

Keeping the five best methods

best_5 = compare_models(n_select=5)

Tuning

Tune the parameters of the five best methods.

tuned = [tune_model(i) for i in best_5]

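Bagging

Bagging (bootstrap aggregating) is an ensemble method. It draws samples with replacement (bootstrap) and takes a majority vote over the results (aggregating), which reduces the variance of the base method and reduces overfitting.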

bagged = [ensemble_model(i) for i in tuned]
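The idea of bagging can be sketched in a few lines of plain Python. The example below uses a deliberately simple base learner (a one-dimensional threshold rule, chosen here only for illustration), trains it on bootstrap samples, and combines the predictions by majority vote. The names `fit_stump` and `bagging_predict` and the data are hypothetical; this is not how ensemble_model is implemented.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Base learner: choose the threshold on x with the fewest errors on the sample.

    The rule predicts class 1 when x >= threshold and class 0 otherwise.
    """
    best_errors, best_threshold = None, None
    for threshold, _ in sample:
        errors = sum((x >= threshold) != bool(label) for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return best_threshold


def bagging_predict(data, x, n_estimators=25, seed=123):
    """Predict the class of x by majority vote over stumps fit on bootstrap samples."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_estimators):
        # Bootstrap: draw len(data) points with replacement.
        sample = [rng.choice(data) for _ in data]
        threshold = fit_stump(sample)
        votes.append(int(x >= threshold))
    # Aggregating: take the most common prediction.
    return Counter(votes).most_common(1)[0][0]


# Toy one-dimensional data: (x, label) pairs.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
low = bagging_predict(data, 0.2)
high = bagging_predict(data, 5.0)
print(low, high)
```

Each bootstrap sample yields a slightly different threshold; voting over many of them smooths out the dependence on any single sample.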

Blending

Blending takes a majority vote over different methods.

blender = blend_models(estimator_list=tuned)

Displaying the experiment results

Run the following and open http://127.0.0.1:5000 in a browser to inspect the results interactively.

!mlflow ui

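Problem (credit cards)

Classify the credit card default data below.

The default column records whether the customer defaulted; classify it using the other columns.

Because the data set is large, pass exclude=["catboost"] so that CatBoost is left out of the model comparison.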

credit = pd.read_csv("http://logopt.com/data/credit.csv")
credit.tail()

Problem (room occupancy)

Classify the data below, which records whether a room is in use.

The occupancy column records whether the room is in use; classify it using every column except datetime.

occupancy = pd.read_csv("http://logopt.com/data/occupancy.csv")
occupancy.tail()

Problem (Titanic)

Classify the titanic data and estimate the probability of death.

When comparing models, pass fold=5, exclude=["ridge"] as arguments.

titanic = pd.read_csv("http://logopt.com/data/titanic.csv")
titanic.head()

Problem (breast cancer)

Classify the data set at http://logopt.com/data/cancer.csv, which is used to decide whether a tumor is breast cancer.

The first column, diagnosis, records the result: "M" means malignant and "B" means benign.

cancer = pd.read_csv("http://logopt.com/data/cancer.csv", index_col=0)
cancer.head()

Clustering

We explain clustering using the wine data set from the UCI Machine Learning Repository. The method used is $k$-means.

The original data is at http://logopt.com/data/wine.data.

The column names are described at https://archive.ics.uci.edu/ml/datasets/Wine.

L = [
    "Alcohol",
    "Malic",
    "Ash",
    "Alcalinity",
    "Magnesium",
    "Phenols",
    "Flavanoids",
    "Nonflavanoid",
    "Proanthocyanins",
    "Color",
    "Hue",
    "OD280",
    "OD315",
    "Proline",
]
wine = pd.read_csv("http://logopt.com/data/wine.data", names=L)
wine.head()

from pycaret.clustering import *

cluster = setup(wine, session_id=123)

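Clustering models are also created with the create_model function. The model argument selects the kind of model.

The model argument accepts the following values (almost all of them use scikit-learn).

"kmeans": $k$-means (minimizes the sum of squared distances from each point to its cluster centroid)
"ap": Affinity Propagation
"meanshift": Mean shift Clustering
"sc": Spectral clustering (projects to a lower dimension, then applies $k$-means)
"hclust": Hierarchical clustering
"dbscan": DBSCAN (Density-Based Spatial Clustering)
"optics": A generalization of DBSCAN
"birch": Birch Clustering
"kmodes": $k$-Modes Clustering

The number of clusters is given by the num_clusters argument (default 4).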

kmeans = create_model("kmeans", num_clusters=4)

Write the resulting cluster assignments back into the original data frame.

kmean_results = assign_model(kmeans)
kmean_results.head()

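Evaluation metrics for clustering

Silhouette: silhouette value; for the mean distance $a$ within a point's own cluster and the mean distance $b$ to the nearest other cluster, it is defined as $(b - a) / \max(a, b)$. It is close to 1 when the cluster is far from the other clusters.
Calinski-Harabasz: based on the ratio of the between-cluster variance to the within-cluster variance; larger values mean better separated clusters.
Davies-Bouldin: average, over clusters, of the similarity ratio between each cluster and the cluster most similar to it; smaller values mean better separated clusters.
Homogeneity: homogeneity score; requires the ground truth; takes values from $0$ to $1$, and values closer to $1$ are better.
Rand Index: compares two clusterings and measures how much of the assignment agrees with the ground truth; takes values from $-1$ to $1$, and equals $1$ when the clustering matches the ground truth.
Completeness: the extent to which data belonging to the same true cluster are placed in the same cluster; takes values from $0$ to $1$, and values closer to $1$ are better.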
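The silhouette definition $(b - a) / \max(a, b)$ can be tried out directly. The following sketch computes the silhouette value of every point with Euclidean distance; the function `silhouette_values` and the four toy points are made up for this example, and it assumes every cluster contains at least two points and that there are at least two clusters.

```python
import math


def silhouette_values(points, labels):
    """Silhouette value (b - a) / max(a, b) of every point (Euclidean distance)."""
    clusters = sorted(set(labels))
    values = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster, excluding p itself.
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(p, q) for q in members) / len(members)
        a = mean_dist[labels[i]]  # mean distance within p's own cluster
        b = min(d for c, d in mean_dist.items() if c != labels[i])  # nearest other cluster
        values.append((b - a) / max(a, b))
    return values


# Two well-separated toy clusters.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
values = silhouette_values(points, labels)
mean_silhouette = sum(values) / len(values)
print(f"mean silhouette = {mean_silhouette:.4f}")
```

Because the two clusters are far apart compared with their internal spread, every value is close to 1.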

ap = create_model("ap")

meanshift = create_model("meanshift")

sc = create_model("sc")

hclust = create_model("hclust")

dbscan = create_model("dbscan")  # Bug?

optics = create_model("optics")

birch = create_model("birch")

kmodes = create_model("kmodes")

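Visualization (clustering)

Plots are drawn with plot_model(model instance).

The plot argument of plot_model accepts the following values.

"cluster": Clusters shown in two dimensions by principal component analysis (PCA) (Plotly)
"tsne": Clusters shown in three dimensions by $t$-SNE ($t$-Distributed Stochastic Neighbor Embedding) (Plotly)
"elbow": Plot for choosing the number of clusters $k$ (elbow method)
"silhouette": Plot of the silhouette coefficient (for the mean distance $a$ within a point's own cluster and the mean distance $b$ to the nearest other cluster, defined as $(b - a) / \max(a, b)$; close to 1 when the cluster is far from the other clusters)
"distance": Plot of the distances between clusters
"distribution": Distribution of the number of data points in each cluster (Plotly)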

plot_model(kmeans, plot="cluster", save=True);

plot_model(kmeans, plot="tsne", save=True);

plot_model(kmeans, plot="elbow");

plot_model(kmeans, plot="silhouette");

plot_model(kmeans, plot="distance");

plot_model(kmeans, plot="distribution", save=True);

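Problem (iris)

Use $k$-means to split the iris data set into three clusters and visualize the result. Then choose one other method, run the clustering and visualization with it, and compare it with $k$-means.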

iris = px.data.iris()
iris.head()

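Anomaly detection

Anomaly detection is a technique that uses unsupervised learning to detect rare events or items.

Here we run anomaly detection on the mice data.

from pycaret.datasets import get_data  # get_data loads PyCaret's sample data sets
df = get_data("mice")
print(df.shape)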

(1080, 82)

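Preprocessing

First, import everything from the anomaly submodule and prepare the data with setup. Here we pass the list of column names to ignore, ["MouseID"], in the ignore_features argument, and set the normalize argument to True so that the data is normalized.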

from pycaret.anomaly import *
anomaly = setup(df, normalize = True, ignore_features=["MouseID"],session_id = 123)

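Creating the model

Next, create the model. The kind of model (algorithm) is selected with the model argument. Here we use iforest.

The available models are as follows.

"abod": Angle-Based Outlier Detection. Uses angles to avoid the curse of dimensionality
"cluster": Clustering-Based Local Outlier. A method based on clustering
"cof": Connectivity-Based Local Outlier. A method based on connectivity
"iforest": Isolation Forest. A method based on decision trees
"histogram": Histogram-based Outlier Detection. A method based on a frequency table (histogram)
"knn": K-Nearest Neighbors Detector. $k$-nearest neighbors
"lof": Local Outlier Factor. A method based on a local outlier score (local outlier factor; the reciprocal of the mean distance to the $k$ nearest neighbors)
"svm": One-class SVM detector. Support vector machine
"pca": Principal Component Analysis. Principal component analysis
"mcd": Minimum Covariance Determinant. A method based on the minimum determinant of the covariance matrix
"sod": Subspace Outlier Detection. Uses subspaces to cope with high-dimensional data
"sos": Stochastic Outlier Selection. A stochastic (probabilistic) outlier detection method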

iforest = create_model("iforest")

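Writing back the results

Use assign_model to create a data frame with two added columns: a flag for whether each row is an anomaly (1 for an outlier) and the anomaly score.

results = assign_model(iforest)
results.head().iloc[:, -10:]  # show the last 10 columns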

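Visualization

Plots are drawn with plot_model, and the plot argument selects the visualization method. The following two are available.

"tsne" (default): Three-dimensional view obtained by $t$-SNE ($t$-Distributed Stochastic Neighbor Embedding) (Plotly)
"umap": Two-dimensional view obtained by UMAP (Uniform Manifold Approximation and Projection), a dimensionality reduction method that handles general nonlinear structure (Plotly)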

plot_model(iforest, save=True);

plot_model(iforest, plot = "umap", save=True);

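Problem (spam detection)

For the example below, which decides whether an email is spam (junk mail), run anomaly detection with iforest and draw the result with umap. The is_spam column of the data records whether each email is $1$ (spam) or $0$ (not spam), so remove that column during preprocessing.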

spam = pd.read_csv("http://logopt.com/data/spam.csv")
spam.head()

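Association rule mining

Association rule mining is a technique for discovering "interesting" relationships in a data set.

As an example, we use a data set that relates invoice numbers (order numbers) to the products (items) they contain.

The InvoiceNo column holds the order number and the Description column holds the product name. Based on which items appear under the same order number, we analyze (mine) which items tend to be ordered together.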

from pycaret.datasets import get_data  # get_data loads PyCaret's sample data sets
df = get_data("france")

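Preprocessing

Prepare the data with setup. The following three arguments must be specified.

transaction_id: Name of the column that identifies the order. In this example, "InvoiceNo".
item_id: Name of the column that identifies the item. In this example, "Description".
ignore_items: List of item names to ignore. Here we remove "POSTAGE" (the shipping charge), which appears in every order.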

from pycaret.arules import *
rule = setup(data = df, transaction_id = "InvoiceNo", item_id = "Description", ignore_items=["POSTAGE"])

Creating the model

create_model generates a data frame containing the association rules.

Each rule is shown in the form "antecedents $\rightarrow$ consequents", and the evaluation metrics are as follows.

Let support(x) denote the proportion of transactions in which event x occurs.

For the rule "A -> C":

antecedent support: support(A)
consequent support: support(C)
support: support(A $+$ C)
confidence: support(A $+$ C) $/$ support(A); by default the rules are sorted by this value.
lift: confidence(A $\rightarrow$ C) / support(C)
leverage: support(A $+$ C) $-$ support(A) $\times$ support(C)
conviction: [1 $-$ support(C)] $/$ [1 $-$ confidence(A $\rightarrow$ C)]
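These metrics are easy to compute by hand for a small example. The sketch below evaluates them for a single rule over a handful of made-up transactions, following the definitions listed above; the function `rule_metrics` and the item names are hypothetical, and support(A $+$ C) is taken to be the proportion of transactions containing every item of both A and C.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute the metrics of the rule antecedent -> consequent."""
    n = len(transactions)

    def support(items):
        # Proportion of transactions that contain every item in `items`.
        return sum(items <= t for t in transactions) / n

    s_a = support(antecedent)
    s_c = support(consequent)
    s_ac = support(antecedent | consequent)
    confidence = s_ac / s_a
    # Conviction is infinite when the rule never fails (confidence equal to 1).
    conviction = (1 - s_c) / (1 - confidence) if confidence < 1 else float("inf")
    return {
        "antecedent support": s_a,
        "consequent support": s_c,
        "support": s_ac,
        "confidence": confidence,
        "lift": confidence / s_c,
        "leverage": s_ac - s_a * s_c,
        "conviction": conviction,
    }


# Five hypothetical orders, each a set of items.
transactions = [
    {"tea", "cake"},
    {"tea", "cake", "jam"},
    {"tea"},
    {"cake"},
    {"jam"},
]
m = rule_metrics(transactions, {"tea"}, {"cake"})
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

A lift above 1 (here about 1.11) and a positive leverage (here 0.04) both indicate that the two items appear together more often than they would if they were independent.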

model = create_model() 
model.head()

Visualization

plot_model can draw a two-dimensional (default) or a three-dimensional plot.

plot_model(model);  # there is no save argument

plot_model(model, plot="3d")

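Problem (online shop in Ireland)

The data in the example above was extracted from the online shop data at https://archive.ics.uci.edu/ml/datasets/online+retail by keeping the rows whose Country column is "France". The data below was extracted from the same source by keeping the rows whose Country column is "EIRE" (Ireland). Read it and run the same association rule mining as above. This time the item "POSTAGE" does not need to be removed.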

df = pd.read_csv("http://logopt.com/data/ireland.csv", index_col=0)
df.head()

	TV	Radio	Newspaper	Sales
196	38.2	3.7	13.8	7.6
197	94.2	4.9	8.1	9.7
198	177.0	9.3	6.4	12.8
199	283.6	42.0	66.2	25.5
200	232.1	8.6	8.7	13.4

	Description	Value
0	session_id	123
1	Target	Sales
2	Original Data	(200, 4)
3	Missing Values	False
4	Numeric Features	3
5	Categorical Features	0
6	Ordinal Features	False
7	High Cardinality Features	False
8	High Cardinality Method	None
9	Transformed Train Set	(139, 3)
10	Transformed Test Set	(61, 3)
11	Shuffle Train-Test	True
12	Stratify Train-Test	False
13	Fold Generator	KFold
14	Fold Number	10
15	CPU Jobs	-1
16	Use GPU	False
17	Log Experiment	False
18	Experiment Name	reg-default-name
19	USI	ca9a
20	Imputation Type	simple
21	Iterative Imputation Iteration	None
22	Numeric Imputer	mean
23	Iterative Imputation Numeric Model	None
24	Categorical Imputer	constant
25	Iterative Imputation Categorical Model	None
26	Unknown Categoricals Handling	least_frequent
27	Normalize	False
28	Normalize Method	None
29	Transformation	False
30	Transformation Method	None
31	PCA	False
32	PCA Method	None
33	PCA Components	None
34	Ignore Low Variance	False
35	Combine Rare Levels	False
36	Rare Level Threshold	None
37	Numeric Binning	False
38	Remove Outliers	False
39	Outliers Threshold	None
40	Remove Multicollinearity	False
41	Multicollinearity Threshold	None
42	Clustering	False
43	Clustering Iteration	None
44	Polynomial Features	False
45	Polynomial Degree	None
46	Trignometry Features	False
47	Polynomial Threshold	None
48	Group Features	False
49	Feature Selection	False
50	Features Selection Threshold	None
51	Feature Interaction	False
52	Feature Ratio	False
53	Interaction Threshold	None
54	Transform Target	False
55	Transform Target Method	box-cox

	Model	MAE	MSE	RMSE	R2	RMSLE	MAPE	TT (Sec)
et	Extra Trees Regressor	0.4636	0.4248	0.6334	0.9837	0.0705	0.0523	0.0440
gbr	Gradient Boosting Regressor	0.6601	0.7945	0.8653	0.9704	0.0910	0.0709	0.0120
xgboost	Extreme Gradient Boosting	0.6984	0.8608	0.9045	0.9675	0.0893	0.0698	0.0300
rf	Random Forest Regressor	0.7123	0.8845	0.9210	0.9667	0.0945	0.0750	0.0520
catboost	CatBoost Regressor	0.5846	0.9041	0.8906	0.9655	0.1063	0.0776	1.3520
ada	AdaBoost Regressor	0.9116	1.3416	1.1366	0.9485	0.1020	0.0903	0.0140
dt	Decision Tree Regressor	0.9474	1.4913	1.2042	0.9418	0.1032	0.0836	0.0040
lightgbm	Light Gradient Boosting Machine	1.1357	2.4086	1.5262	0.9037	0.1636	0.1345	0.2320
knn	K Neighbors Regressor	1.3162	3.1928	1.7806	0.8757	0.1345	0.1183	0.0060
lasso	Lasso Regression	1.3792	3.3050	1.8048	0.8697	0.1819	0.1597	0.2380
en	Elastic Net	1.3775	3.3151	1.8066	0.8692	0.1830	0.1604	0.2340
lar	Least Angle Regression	1.3750	3.3262	1.8082	0.8686	0.1845	0.1613	0.2320
lr	Linear Regression	1.3750	3.3262	1.8082	0.8686	0.1845	0.1613	0.3560
ridge	Ridge Regression	1.3750	3.3262	1.8082	0.8686	0.1845	0.1613	0.2360
br	Bayesian Ridge	1.3794	3.3342	1.8112	0.8683	0.1836	0.1610	0.0100
huber	Huber Regressor	1.3576	3.5250	1.8620	0.8617	0.1901	0.1666	0.0080
omp	Orthogonal Matching Pursuit	2.6863	11.3136	3.3487	0.5605	0.2315	0.2279	0.0040
llar	Lasso Least Angle Regression	4.4043	28.2716	5.2705	-0.0619	0.3890	0.4302	0.0040
par	Passive Aggressive Regressor	5.4260	78.1239	6.5130	-1.8399	0.3612	0.3984	0.2240

	TV	Radio	Newspaper	Sales	Label
0	199.800003	3.100000	34.599998	11.4	11.080
1	80.199997	0.000000	9.200000	8.8	9.137
2	74.699997	49.400002	45.700001	14.7	13.937
3	44.700001	25.799999	20.600000	10.1	9.836
4	147.300003	23.900000	19.100000	14.6	14.633
...	...	...	...	...	...
56	66.099998	5.800000	24.200001	8.6	9.231
57	276.899994	48.900002	41.799999	27.0	25.154
58	120.500000	28.500000	14.200000	14.2	14.470
59	239.300003	15.500000	27.299999	15.7	15.422
60	239.800003	4.100000	36.900002	12.3	12.204

	MathSAT	VerbalSAT	GPA
1	580.0	420.0	2.90
2	670.0	530.0	2.83
3	680.0	540.0	2.90
4	630.0	640.0	3.30
5	620.0	630.0	3.61

	Parameters
bootstrap	False
ccp_alpha	0.0
criterion	mse
max_depth	None
max_features	auto
max_leaf_nodes	None
max_samples	None
min_impurity_decrease	0.0
min_impurity_split	None
min_samples_leaf	1
min_samples_split	2
min_weight_fraction_leaf	0.0
n_estimators	100
n_jobs	-1
oob_score	False
random_state	123
verbose	0
warm_start	False

	cement	slag	water	splast	coarse	fine	age	strength
0	540.0	0.0	162.0	2.5	1040.0	676.0	28	79.986111
1	540.0	0.0	162.0	2.5	1055.0	676.0	28	61.887366
2	332.5	142.5	228.0	0.0	932.0	594.0	270	40.269535
3	332.5	142.5	228.0	0.0	932.0	594.0	365	41.052780
4	198.6	132.4	192.0	0.0	978.4	825.5	360	44.296075

	date	season	month	hour	weekday	weather	temp	feelslike	humidity	casual	registered	riders
0	2011-01-01	1	1	0	6	1	0.24	0.2879	0.81	3	13	16
1	2011-01-01	1	1	1	6	1	0.22	0.2727	0.80	8	32	40
2	2011-01-01	1	1	2	6	1	0.22	0.2727	0.80	5	27	32
3	2011-01-01	1	1	3	6	1	0.24	0.2879	0.75	3	10	13
4	2011-01-01	1	1	4	6	1	0.24	0.2879	0.75	0	1	1

	carat	colour	clarity	certification	price
1	0.30	D	VS2	GIA	1302
2	0.30	E	VS1	GIA	1510
3	0.30	G	VVS1	GIA	1510
4	0.30	G	VS1	GIA	1260
5	0.31	D	VS1	GIA	1641

	Description	Value
0	session_id	123
1	Target	price
2	Original Data	(308, 5)
3	Missing Values	False
4	Numeric Features	1
5	Categorical Features	3
6	Ordinal Features	False
7	High Cardinality Features	False
8	High Cardinality Method	None
9	Transformed Train Set	(215, 15)
10	Transformed Test Set	(93, 15)
11	Shuffle Train-Test	True
12	Stratify Train-Test	False
13	Fold Generator	KFold
14	Fold Number	10
15	CPU Jobs	-1
16	Use GPU	False
17	Log Experiment	False
18	Experiment Name	reg-default-name
19	USI	9e2e
20	Imputation Type	simple
21	Iterative Imputation Iteration	None
22	Numeric Imputer	mean
23	Iterative Imputation Numeric Model	None
24	Categorical Imputer	constant
25	Iterative Imputation Categorical Model	None
26	Unknown Categoricals Handling	least_frequent
27	Normalize	False
28	Normalize Method	None
29	Transformation	False
30	Transformation Method	None
31	PCA	False
32	PCA Method	None
33	PCA Components	None
34	Ignore Low Variance	False
35	Combine Rare Levels	False
36	Rare Level Threshold	None
37	Numeric Binning	False
38	Remove Outliers	False
39	Outliers Threshold	None
40	Remove Multicollinearity	False
41	Multicollinearity Threshold	None
42	Clustering	False
43	Clustering Iteration	None
44	Polynomial Features	False
45	Polynomial Degree	None
46	Trignometry Features	False
47	Polynomial Threshold	None
48	Group Features	False
49	Feature Selection	False
50	Features Selection Threshold	None
51	Feature Interaction	False
52	Feature Ratio	False
53	Interaction Threshold	None
54	Transform Target	False
55	Transform Target Method	box-cox

	Model	MAE	MSE	RMSE	R2	RMSLE	MAPE	TT (Sec)
catboost	CatBoost Regressor	237.4153	315117.7498	480.4058	0.9727	0.1097	0.0654	1.3020
et	Extra Trees Regressor	260.3273	337969.9621	510.4212	0.9707	0.0916	0.0608	0.0520
xgboost	Extreme Gradient Boosting	260.5440	363651.4914	530.6390	0.9683	0.0811	0.0551	0.0480
lightgbm	Light Gradient Boosting Machine	326.7242	363339.0916	555.4182	0.9679	0.1385	0.0925	0.0120
gbr	Gradient Boosting Regressor	282.9466	388598.8596	568.6150	0.9657	0.0892	0.0601	0.0100
rf	Random Forest Regressor	330.3894	417221.8575	594.7542	0.9631	0.1007	0.0711	0.0520
dt	Decision Tree Regressor	372.9116	482097.8233	640.5102	0.9574	0.1172	0.0832	0.0040
llar	Lasso Least Angle Regression	491.8227	495822.4883	677.2237	0.9542	0.3945	0.1944	0.0040
huber	Huber Regressor	479.5457	495364.6533	676.0879	0.9542	0.4670	0.1853	0.0060
lasso	Lasso Regression	496.8779	498086.1812	681.3048	0.9537	0.3819	0.2013	0.0040
br	Bayesian Ridge	498.8744	499117.9277	682.5451	0.9536	0.3795	0.2023	0.2360
lar	Least Angle Regression	498.3815	499580.3571	683.0292	0.9535	0.3833	0.2037	0.0040
lr	Linear Regression	498.3806	499579.1188	683.0280	0.9535	0.3833	0.2037	0.0040
par	Passive Aggressive Regressor	490.8640	575096.9221	723.4482	0.9480	0.4104	0.1534	0.0060
ridge	Ridge Regression	544.9786	601979.7688	747.5779	0.9452	0.4118	0.1738	0.0040
ada	AdaBoost Regressor	563.9603	874722.7655	903.3413	0.9193	0.1790	0.1418	0.0160
omp	Orthogonal Matching Pursuit	707.9308	1067712.4332	1002.2154	0.9024	0.7797	0.2185	0.0040
knn	K Neighbors Regressor	1361.7665	3603052.3500	1884.8057	0.6522	0.4608	0.4425	0.0060
en	Elastic Net	2230.8382	7406768.7000	2715.8800	0.2979	0.7104	0.8655	0.0040

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

	Description	Value
0	session_id	123
1	Target	is_spam
2	Target Type	Binary
3	Label Encoded	None
4	Original Data	(4600, 58)
5	Missing Values	False
6	Numeric Features	57
7	Categorical Features	0
8	Ordinal Features	False
9	High Cardinality Features	False
10	High Cardinality Method	None
11	Transformed Train Set	(3219, 56)
12	Transformed Test Set	(1381, 56)
13	Shuffle Train-Test	True
14	Stratify Train-Test	False
15	Fold Generator	StratifiedKFold
16	Fold Number	10
17	CPU Jobs	-1
18	Use GPU	False
19	Log Experiment	False
20	Experiment Name	clf-default-name
21	USI	3444
22	Imputation Type	simple
23	Iterative Imputation Iteration	None
24	Numeric Imputer	mean
25	Iterative Imputation Numeric Model	None
26	Categorical Imputer	constant
27	Iterative Imputation Categorical Model	None
28	Unknown Categoricals Handling	least_frequent
29	Normalize	False
30	Normalize Method	None
31	Transformation	False
32	Transformation Method	None
33	PCA	False
34	PCA Method	None
35	PCA Components	None
36	Ignore Low Variance	False
37	Combine Rare Levels	False
38	Rare Level Threshold	None
39	Numeric Binning	False
40	Remove Outliers	False
41	Outliers Threshold	None
42	Remove Multicollinearity	False
43	Multicollinearity Threshold	None
44	Remove Perfect Collinearity	True
45	Clustering	False
46	Clustering Iteration	None
47	Polynomial Features	False
48	Polynomial Degree	None
49	Trignometry Features	False
50	Polynomial Threshold	None
51	Group Features	False
52	Feature Selection	False
53	Feature Selection Method	classic
54	Features Selection Threshold	None
55	Feature Interaction	False
56	Feature Ratio	False
57	Interaction Threshold	None
58	Fix Imbalance	False
59	Fix Imbalance Method	SMOTE

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC	TT (Sec)
lightgbm	Light Gradient Boosting Machine	0.9553	0.9864	0.9369	0.9478	0.9422	0.9057	0.9059	0.0350
et	Extra Trees Classifier	0.9528	0.9851	0.9337	0.9446	0.9390	0.9005	0.9007	0.0600
rf	Random Forest Classifier	0.9494	0.9846	0.9186	0.9499	0.9339	0.8929	0.8933	0.0720
gbc	Gradient Boosting Classifier	0.9431	0.9824	0.9098	0.9423	0.9256	0.8796	0.8801	0.0990
ada	Ada Boost Classifier	0.9329	0.9763	0.9089	0.9182	0.9133	0.8586	0.8589	0.0350
lr	Logistic Regression	0.9267	0.9705	0.8882	0.9208	0.9041	0.8448	0.8453	0.3110
dt	Decision Tree Classifier	0.9074	0.9059	0.8970	0.8702	0.8829	0.8064	0.8072	0.0110
lda	Linear Discriminant Analysis	0.8875	0.9516	0.7860	0.9129	0.8444	0.7572	0.7627	0.0080
ridge	Ridge Classifier	0.8866	0.0000	0.7820	0.9143	0.8426	0.7549	0.7608	0.0050
nb	Naive Bayes	0.8139	0.9457	0.9521	0.6893	0.7994	0.6342	0.6636	0.1390
qda	Quadratic Discriminant Analysis	0.7944	0.8677	0.9617	0.6629	0.7846	0.6004	0.6391	0.0060
knn	K Neighbors Classifier	0.7937	0.8608	0.7285	0.7395	0.7332	0.5652	0.5659	0.1590
svm	SVM - Linear Kernel	0.7011	0.0000	0.7289	0.5736	0.6026	0.3873	0.4308	0.0060
dummy	Dummy Classifier	0.6111	0.5000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0040

	sepal_length	sepal_width	petal_length	petal_width	species	species_id
0	5.1	3.5	1.4	0.2	setosa	1
1	4.9	3.0	1.4	0.2	setosa	1
2	4.7	3.2	1.3	0.2	setosa	1
3	4.6	3.1	1.5	0.2	setosa	1
4	5.0	3.6	1.4	0.2	setosa	1

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC	TT (Sec)
lda	Linear Discriminant Analysis	0.9809	0.9969	0.9833	0.9857	0.9809	0.9715	0.9739	0.0030
knn	K Neighbors Classifier	0.9800	0.9830	0.9806	0.9800	0.9800	0.9697	0.9697	0.1470
qda	Quadratic Discriminant Analysis	0.9709	0.9969	0.9750	0.9782	0.9709	0.9566	0.9602	0.0030
lr	Logistic Regression	0.9609	0.9921	0.9611	0.9622	0.9596	0.9403	0.9422	0.2370
nb	Naive Bayes	0.9609	0.9938	0.9611	0.9652	0.9605	0.9407	0.9432	0.1370
dt	Decision Tree Classifier	0.9509	0.9616	0.9500	0.9598	0.9479	0.9249	0.9309	0.1430
gbc	Gradient Boosting Classifier	0.9509	0.9782	0.9500	0.9598	0.9479	0.9249	0.9309	0.0270
et	Extra Trees Classifier	0.9509	0.9890	0.9528	0.9532	0.9509	0.9258	0.9269	0.0310
rf	Random Forest Classifier	0.9409	0.9875	0.9417	0.9422	0.9396	0.9100	0.9119	0.0350
ada	Ada Boost Classifier	0.9409	0.9895	0.9417	0.9467	0.9391	0.9100	0.9146	0.0140
lightgbm	Light Gradient Boosting Machine	0.9409	0.9883	0.9417	0.9422	0.9396	0.9100	0.9119	0.0100
ridge	Ridge Classifier	0.8182	0.0000	0.8222	0.8304	0.8080	0.7251	0.7423	0.0030
svm	SVM - Linear Kernel	0.7227	0.0000	0.7444	0.6062	0.6420	0.5863	0.6548	0.0040
dummy	Dummy Classifier	0.3855	0.5000	0.3333	0.1489	0.2147	0.0000	0.0000	0.0020

	target	shape	surface	color
0	edible	convex	smooth	yellow
1	edible	bell	smooth	white
2	poisonous	convex	scaly	white
3	edible	convex	smooth	gray
4	edible	convex	scaly	yellow

	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
0	0.7452	0.8301	0.8796	0.6827	0.7687	0.4948	0.5147
1	0.6854	0.7859	0.8321	0.6316	0.7181	0.3770	0.3956
2	0.6503	0.7581	0.8175	0.6005	0.6924	0.3085	0.3285
3	0.6907	0.7821	0.8182	0.6410	0.7188	0.3862	0.4005
4	0.6643	0.7804	0.7891	0.6200	0.6944	0.3338	0.3458
5	0.6924	0.7859	0.8582	0.6344	0.7295	0.3911	0.4155
6	0.6496	0.7450	0.8139	0.6011	0.6915	0.3068	0.3259
7	0.7113	0.8247	0.8613	0.6519	0.7421	0.4281	0.4498
8	0.7007	0.7951	0.8759	0.6383	0.7385	0.4082	0.4366
9	0.6778	0.7651	0.8139	0.6282	0.7091	0.3613	0.3766
Mean	0.6868	0.7852	0.8360	0.6330	0.7203	0.3796	0.3990
SD	0.0275	0.0254	0.0291	0.0228	0.0238	0.0542	0.0559

	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
Fold
0	0.7381	0.8485	0.7263	0.7289	0.7276	0.4755	0.4755
1	0.7012	0.7914	0.7299	0.6757	0.7018	0.4034	0.4046
2	0.6626	0.7784	0.7117	0.6331	0.6701	0.3272	0.3295
3	0.7170	0.8088	0.7491	0.6913	0.7190	0.4350	0.4364
4	0.7047	0.8199	0.7491	0.6754	0.7103	0.4109	0.4132
5	0.7047	0.8184	0.7818	0.6656	0.7191	0.4121	0.4181
6	0.6708	0.7757	0.7153	0.6426	0.6770	0.3433	0.3453
7	0.7324	0.8461	0.7482	0.7118	0.7295	0.4651	0.4656
8	0.6937	0.8017	0.7409	0.6634	0.7000	0.3890	0.3915
9	0.6831	0.7798	0.7409	0.6506	0.6928	0.3684	0.3717
Mean	0.7008	0.8069	0.7393	0.6739	0.7047	0.4030	0.4051
Std	0.0233	0.0252	0.0192	0.0285	0.0193	0.0458	0.0452
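
The Recall, Prec. and F1 columns of the table above are linked: under the usual binary definition, F1 is the harmonic mean of precision and recall. The sketch below checks this fold by fold on the rounded values printed above. The helper name `f1_from` is illustrative and not part of any library. It also shows that the Mean row averages the per-fold F1 values, which is not the same as applying the formula to the mean precision and mean recall.

```python
from statistics import fmean

# (recall, precision, reported F1) for folds 0-9, copied from the table above.
folds = [
    (0.7263, 0.7289, 0.7276),
    (0.7299, 0.6757, 0.7018),
    (0.7117, 0.6331, 0.6701),
    (0.7491, 0.6913, 0.7190),
    (0.7491, 0.6754, 0.7103),
    (0.7818, 0.6656, 0.7191),
    (0.7153, 0.6426, 0.6770),
    (0.7482, 0.7118, 0.7295),
    (0.7409, 0.6634, 0.7000),
    (0.7409, 0.6506, 0.6928),
]


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper)."""
    return 2 * precision * recall / (precision + recall)


# Largest difference between the recomputed and the reported per-fold F1.
max_gap = max(abs(f1_from(p, r) - f1) for r, p, f1 in folds)

# Mean of the per-fold F1 values versus F1 of the mean precision and recall.
mean_of_f1 = fmean(f1 for _, _, f1 in folds)
f1_of_means = f1_from(0.6739, 0.7393)  # Mean row: Prec. and Recall

print(f"largest per-fold gap = {max_gap:.5f}")
print(f"mean of fold F1      = {mean_of_f1:.4f}")
print(f"F1 of mean P and R   = {f1_of_means:.4f}")
```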

	limit	sex	edu	married	age	apr_delay	may_delay	jun_delay	jul_delay	...	jul_bill	aug_bill	sep_bill	apr_pay	may_pay	jun_pay	jul_pay	aug_pay	sep_pay	default
29995	220000	1	3	1	39	0	0	0	0	...	88004	31237	15980	8500	20000	5003	3047	5000	1000	0
29996	150000	1	3	2	43	-1	-1	-1	-1	...	8979	5190	0	1837	3526	8998	129	0	0	0
29997	30000	1	2	2	37	4	3	2	-1	...	20878	20582	19357	0	0	22000	4200	2000	3100	1
29998	80000	1	3	1	41	1	-1	0	0	...	52774	11855	48944	85900	3409	1178	1926	52964	1804	1
29999	50000	1	2	1	46	0	0	0	0	...	36535	32428	15313	2078	1800	1430	1000	1000	1000	1

	datetime	temperature	relative humidity	light	CO2	humidity	occupancy
20555	2015-02-18 09:15:00	20.815	27.7175	429.75	1505.25	0.004213	1
20556	2015-02-18 09:16:00	20.865	27.7450	423.50	1514.50	0.004230	1
20557	2015-02-18 09:16:59	20.890	27.7450	423.50	1521.50	0.004237	1
20558	2015-02-18 09:17:59	20.890	28.0225	418.75	1632.00	0.004279	1
20559	2015-02-18 09:19:00	21.000	28.1000	409.00	1864.00	0.004321	1

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

id	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	symmetry_mean	...	radius_worst	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst
842302	M	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	...	25.38	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
842517	M	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	...	24.99	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
84300903	M	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	...	23.57	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
84348301	M	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	0.2597	...	14.91	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
84358402	M	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	0.1809	...	22.54	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678

	Alcohol	Malic	Ash	Alcalinity	Magnesium	Phenols	Flavanoids	Nonflavanoid	Proanthocyanins	Color	Hue	OD280	OD315	Proline
0	1	14.23	1.71	2.43	15.6	127	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065
1	1	13.20	1.78	2.14	11.2	100	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050
2	1	13.16	2.36	2.67	18.6	101	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185
3	1	14.37	1.95	2.50	16.8	113	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480
4	1	13.24	2.59	2.87	21.0	118	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735

	Description	Value
0	session_id	123
1	Original Data	(178, 14)
2	Missing Values	False
3	Numeric Features	13
4	Categorical Features	1
5	Ordinal Features	False
6	High Cardinality Features	False
7	High Cardinality Method	None
8	Transformed Data	(178, 16)
9	CPU Jobs	-1
10	Use GPU	False
11	Log Experiment	False
12	Experiment Name	cluster-default-name
13	USI	118a
14	Imputation Type	simple
15	Iterative Imputation Iteration	None
16	Numeric Imputer	mean
17	Iterative Imputation Numeric Model	None
18	Categorical Imputer	mode
19	Iterative Imputation Categorical Model	None
20	Unknown Categoricals Handling	least_frequent
21	Normalize	False
22	Normalize Method	None
23	Transformation	False
24	Transformation Method	None
25	PCA	False
26	PCA Method	None
27	PCA Components	None
28	Ignore Low Variance	False
29	Combine Rare Levels	False
30	Rare Level Threshold	None
31	Numeric Binning	False
32	Remove Outliers	False
33	Outliers Threshold	None
34	Remove Multicollinearity	False
35	Multicollinearity Threshold	None
36	Clustering	False
37	Clustering Iteration	None
38	Polynomial Features	False
39	Polynomial Degree	None
40	Trignometry Features	False
41	Polynomial Threshold	None
42	Group Features	False
43	Feature Selection	False
44	Features Selection Threshold	None
45	Feature Interaction	False
46	Feature Ratio	False
47	Interaction Threshold	None

	MouseID	DYRK1A_N	ITSN1_N	BDNF_N	NR1_N	NR2A_N	pAKT_N	pBRAF_N	pCAMKII_N	pCREB_N	...	pCFOS_N	SYP_N	H3AcK18_N	EGR1_N	H3MeK4_N	CaNA_N	Genotype	Treatment	Behavior	class
0	309_1	0.503644	0.747193	0.430175	2.816329	5.990152	0.218830	0.177565	2.373744	0.232224	...	0.108336	0.427099	0.114783	0.131790	0.128186	1.675652	Control	Memantine	C/S	c-CS-m
1	309_2	0.514617	0.689064	0.411770	2.789514	5.685038	0.211636	0.172817	2.292150	0.226972	...	0.104315	0.441581	0.111974	0.135103	0.131119	1.743610	Control	Memantine	C/S	c-CS-m
2	309_3	0.509183	0.730247	0.418309	2.687201	5.622059	0.209011	0.175722	2.283337	0.230247	...	0.106219	0.435777	0.111883	0.133362	0.127431	1.926427	Control	Memantine	C/S	c-CS-m
3	309_4	0.442107	0.617076	0.358626	2.466947	4.979503	0.222886	0.176463	2.152301	0.207004	...	0.111262	0.391691	0.130405	0.147444	0.146901	1.700563	Control	Memantine	C/S	c-CS-m
4	309_5	0.434940	0.617430	0.358802	2.365785	4.718679	0.213106	0.173627	2.134014	0.192158	...	0.110694	0.434154	0.118481	0.140314	0.148380	1.839730	Control	Memantine	C/S	c-CS-m

	Description	Value
0	session_id	123
1	Original Data	(1080, 82)
2	Missing Values	True
3	Numeric Features	77
4	Categorical Features	4
5	Ordinal Features	False
6	High Cardinality Features	False
7	High Cardinality Method	None
8	Transformed Data	(1080, 91)
9	CPU Jobs	-1
10	Use GPU	False
11	Log Experiment	False
12	Experiment Name	anomaly-default-name
13	USI	8cc5
14	Imputation Type	simple
15	Iterative Imputation Iteration	None
16	Numeric Imputer	mean
17	Iterative Imputation Numeric Model	None
18	Categorical Imputer	mode
19	Iterative Imputation Categorical Model	None
20	Unknown Categoricals Handling	least_frequent
21	Normalize	True
22	Normalize Method	zscore
23	Transformation	False
24	Transformation Method	None
25	PCA	False
26	PCA Method	None
27	PCA Components	None
28	Ignore Low Variance	False
29	Combine Rare Levels	False
30	Rare Level Threshold	None
31	Numeric Binning	False
32	Remove Outliers	False
33	Outliers Threshold	None
34	Remove Multicollinearity	False
35	Multicollinearity Threshold	None
36	Clustering	False
37	Clustering Iteration	None
38	Polynomial Features	False
39	Polynomial Degree	None
40	Trignometry Features	False
41	Polynomial Threshold	None
42	Group Features	False
43	Feature Selection	False
44	Features Selection Threshold	None
45	Feature Interaction	False
46	Feature Ratio	False
47	Interaction Threshold	None

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
0	536370	22728	ALARM CLOCK BAKELIKE PINK	24	12/1/2010 8:45	3.75	12583.0	France
1	536370	22727	ALARM CLOCK BAKELIKE RED	24	12/1/2010 8:45	3.75	12583.0	France
2	536370	22726	ALARM CLOCK BAKELIKE GREEN	12	12/1/2010 8:45	3.75	12583.0	France
3	536370	21724	PANDA AND BUNNIES STICKER SHEET	12	12/1/2010 8:45	0.85	12583.0	France
4	536370	21883	STARS GIFT TAPE	24	12/1/2010 8:45	0.65	12583.0	France

Description	Value
session_id	8919
# Transactions	461
# Items	1565
Ignore Items	['POSTAGE']

Automated Machine Learning with PyCaret

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE...	(SET/6 RED SPOTTY PAPER CUPS)	0.0868	0.1171	0.0846	0.9750	8.3236	0.0744	35.3145
1	(SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE...	(SET/6 RED SPOTTY PAPER PLATES)	0.0868	0.1085	0.0846	0.9750	8.9895	0.0752	35.6616
2	(SET/6 RED SPOTTY PAPER PLATES)	(SET/6 RED SPOTTY PAPER CUPS)	0.1085	0.1171	0.1041	0.9600	8.1956	0.0914	22.0716
3	(CHILDRENS CUTLERY SPACEBOY )	(CHILDRENS CUTLERY DOLLY GIRL )	0.0586	0.0629	0.0542	0.9259	14.7190	0.0505	12.6508
4	(SET/6 RED SPOTTY PAPER CUPS)	(SET/6 RED SPOTTY PAPER PLATES)	0.1171	0.1085	0.1041	0.8889	8.1956	0.0914	8.0239
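
The metric columns of the rule table above all derive from basket counts. The sketch below uses the standard definitions of support, confidence, lift, leverage and conviction. It assumes the rules were mined from the 461 transactions reported by setup. The three item counts (50, 54 and 48) are not stated in the source; they are back-calculated so that the rounded supports of rows 2 and 4 are reproduced, so treat them as assumptions. `rule_metrics` is an illustrative helper, not a library function.

```python
def rule_metrics(n_total: int, n_antecedent: int, n_consequent: int, n_both: int) -> dict:
    """Standard association-rule metrics from raw basket counts."""
    antecedent_support = n_antecedent / n_total
    consequent_support = n_consequent / n_total
    support = n_both / n_total
    confidence = n_both / n_antecedent
    return {
        "antecedent support": antecedent_support,
        "consequent support": consequent_support,
        "support": support,
        "confidence": confidence,
        "lift": confidence / consequent_support,
        "leverage": support - antecedent_support * consequent_support,
        "conviction": (1 - consequent_support) / (1 - confidence),
    }


N = 461                              # number of transactions reported by setup
PLATES, CUPS, BOTH = 50, 54, 48      # assumed counts, back-calculated from the table

plates_to_cups = rule_metrics(N, PLATES, CUPS, BOTH)   # compare with row 2
cups_to_plates = rule_metrics(N, CUPS, PLATES, BOTH)   # compare with row 4

for name, value in plates_to_cups.items():
    print(f"{name:20s} {value:.4f}")
```

Lift and leverage are symmetric in the two item sets, which is why rows 2 and 4 share the same values for them, while confidence and conviction depend on the direction of the rule.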

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
1404	536540	22968	ROSE COTTAGE KEEPSAKE BOX	4	2010-12-01 14:05:00	9.95	14911.0	EIRE
1405	536540	85071A	BLUE CHARLIE+LOLA PERSONAL DOORSIGN	6	2010-12-01 14:05:00	2.95	14911.0	EIRE
1406	536540	85071C	CHARLIE+LOLA"EXTREMELY BUSY" SIGN	6	2010-12-01 14:05:00	2.55	14911.0	EIRE
1407	536540	22355	CHARLOTTE BAG SUKI DESIGN	50	2010-12-01 14:05:00	0.85	14911.0	EIRE
1408	536540	21579	LOLITA DESIGN COTTON TOTE BAG	6	2010-12-01 14:05:00	2.25	14911.0	EIRE