ここでは，統計パッケージstatsmodelsの以下の項目について説明する．

線形回帰
一般化線形モデル
指数平滑法
SARIMAX

線形回帰

http://logopt.com/data/Diamond.csv からダイアモンドの価格データを読み込み，線形回帰を適用する．

列は ["carat","colour","clarity","certification","price"] であり，他の情報から価格(price)の予測を行う．

データをpandasのデータフレームとして読み込む．
statsmodels.formula.apiを smf (stats model formula)の別名でインポートする．
smfの一般化線形モデルglmを用いてモデルインスタンスを生成する．このとき，列名を用いた 式(formula) を文字列で記述し引数 formula で，データは引数 data でデータフレームとして入力する．
モデルインスタンスの fitメソッド で最適パラメータの探索（機械学習で言うところの訓練）を行う．
モデルインスタンスの summary もしくは summary2 メソッドで結果を見る．
モデルインスタンスのpredictメソッドで予測を行う．

モデル式のformulaの書き方:

formulaの文字列	一般化線形モデルの式（誤差項は省略）
y ~ x	$y = b + a x$
y ~ x -1	$y = a x$
y ~ x1 + x2	$y = b + a_1 x_1 + a_2 x_2$
y ~ x1:x2	$y = b + a_1 x_1 x_2$
y ~ x1*x2	$y = b + a_1 x_1 + a_2 x_2 + a_3 x_1 x_2$
y ~ pow(x,2)	$y = b + a_1 x^2$

import pandas as pd
import statsmodels.formula.api as smf  # stats model formula
%matplotlib inline

diamond = pd.read_csv("http://logopt.com/data/Diamond.csv", index_col=0)
diamond.head()

model = smf.glm("price ~ carat + colour + clarity", diamond)
fit = model.fit()
print(fit.summary2())

                    Results: Generalized linear model
=========================================================================
Model:                GLM                AIC:              4928.6013     
Link Function:        identity           BIC:              149489259.2345
Dependent Variable:   price              Log-Likelihood:   -2453.3       
Date:                 2022-12-31 08:31   LL-Null:          -5836.8       
No. Observations:     308                Deviance:         1.4949e+08    
Df Model:             10                 Pearson chi2:     1.49e+08      
Df Residuals:         297                Scale:            5.0334e+05    
Method:               IRLS                                               
-------------------------------------------------------------------------
                  Coef.    Std.Err.    z     P>|z|    [0.025     0.975]  
-------------------------------------------------------------------------
Intercept         316.8348 216.5957   1.4628 0.1435  -107.6849   741.3546
colour[T.E]     -1447.4678 207.2673  -6.9836 0.0000 -1853.7043 -1041.2314
colour[T.F]     -1843.8689 194.6387  -9.4733 0.0000 -2225.3537 -1462.3841
colour[T.G]     -2178.7960 199.6232 -10.9145 0.0000 -2570.0503 -1787.5418
colour[T.H]     -2763.1576 201.3079 -13.7260 0.0000 -3157.7139 -2368.6014
colour[T.I]     -3315.8908 212.4141 -15.6105 0.0000 -3732.2148 -2899.5667
clarity[T.VS1]  -1548.8104 143.7628 -10.7734 0.0000 -1830.5804 -1267.0404
clarity[T.VS2]  -1860.7202 158.8784 -11.7116 0.0000 -2172.1162 -1549.3243
clarity[T.VVS1]  -733.9089 153.8383  -4.7707 0.0000 -1035.4264  -432.3913
clarity[T.VVS2] -1235.3823 143.1063  -8.6326 0.0000 -1515.8654  -954.8991
carat           12683.7515 164.2469  77.2237 0.0000 12361.8334 13005.6696
=========================================================================

サマリーの見方

No. Observations : 観測数 (=308)
Df Model : 自由度(Degree of Freedom) 変数の数なので $12$
AIC : 赤池情報量基準(Akaike Information Criterion) ($=4931.3248 = -2 \times 対数尤度 + 2 \times (自由度+1) = -2 \times (-2452.7) +2 \times (12+1)$ )　(小さいほどモデルの適合度が良い）
Log-Likelihood: 尤度の対数（最大尤度のものを求めている）最尤推定
Coef. : 係数（一番上のInterceptは$y$-切片）
Std. Err. : 標準誤差
z : $z$値．絶対値が大きいほど係数が信頼できる
P : $p$-値（偶然 $|z|$ を超える確率）．小さいほど係数が信頼できる（以下の表参照）
[0.025, 0.975] : 係数の信頼区間

$z$ (標準偏差)	P値 (確率）	信頼度
< -1.65 または > +1.65	< 0.10	90%
< -1.96 または > +1.96	< 0.05	95%
< -2.58 または > +2.58	< 0.01	99%

diamond["predict"] = fit.predict()  # 予測を行い，結果を'predict'列に追加
diamond.plot.scatter(x="predict", y="price");
# 描画

問題: 車の価格の予測

http://logopt.com/data/carprice.csv から車の価格データを読み込み，線形回帰による予測を行え．

車種(Type)，100マイル走る際のガロン数（gpm100），都市部での1ガロンあたりの走行距離（MPGcity），高速道路での１ガロンあたりの走行距離（MPGhighway）から，価格(Price)を予測せよ．

問題: 広告の効果の予測

広告のデータ http://logopt.com/data/Advertising.csv を読み込み，線形回帰による予測を行え．

テレビ(TV)，ラジオ(Radio)，新聞(Newspaper)への広告から売上(Sales)を予測せよ．

問題: 住宅価格の予測

http://logopt.com/data/Boston.csv のBostonの住宅データを用いて回帰分析を行え．

データの詳細については， https://archive.ics.uci.edu/ml/datasets/Housing を参照せよ．

medvが住宅の価格で，他のデータ（犯罪率や人口など）から予測する．

必要なら以下の文字列を切り貼りして用いよ．

"medv ~ crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat"

問題: GPAの予測

http://logopt.com/data/SATGPA.csv データを用いて，2種類のSATの成績からGPAを予測せよ．

一般化線形モデル

基本となる線形回帰だと，独立変数 $x^{(i)}$ を用いて従属変数 $y^{(i)}$ を推定する．上付き添え字の$(i)$ はトレーニングデータのインデックスを表す．評価関数は最小自乗誤差であり，それを最小にするような重みベクトル $w$ を求める．

通常の線形回帰（最小自乗モデル）は，一般化線形モデル的に見直すと以下のように解釈できる．

従属変数 $y^{(i)}$ は平均 $\mu^{(i)}$，標準偏差 $\sigma$ の正規分布 $N(\mu^{(i)},\sigma^2)$ にしたがう．
線形予測子 $z^{(i)}$ を独立変数 $x^{(i)}$ を用いて $z^{(i)} = w x^{(i)} $ と定義する．ここで $w$ は最適化するパラメータ（重み）である．
リンク関数 $g$ を用いて $\mu^{(i)}$ と $z^{(i)}$ を繋ぐが，線形モデルでは $g(\mu) =\mu$ である．

ロジスティック回帰

titanicデータを用いる．

従属変数（予測するもの）はsurvivedの列で，生き残ったか($=1$)，否か($=0$)を表す．

このように $0$ か $1$ かを予測するのに線形回帰は不適当なので，ロジスティック回帰を用いる．

一般化線形モデル(glm)を使えば，ほぼ同じように予測できる（性別sexと客室クラスpclassだけを用いる）．

statsmodels.apiを sm(stats model)の別名でインポートした後で，引数の family に sm.families.Binomial() を指定すれば良い．

一般化線形モデルでの仮定は以下のようになる．

従属変数 $y^$ は平均 $\mu^{(i)}$ （表が出る確率）のコイン投げの分布(2項分布:binomial distribution)にしたがう．
線形予測子 $z^{(i)}$ を独立変数 $x^{(i)}$ を用いて $z^{(i)} = w x^{(i)} $ と定義する．(この部分は全部共通）．
リンク関数 $g$ を用いて $\mu^{(i)}$ と $z^{(i)}$ を繋ぐが，$\mu$ は確率なので $[0,1]$ の範囲しかとらない，一方， $z$ は線形予測子なので $[-\infty,+\infty]$ の定義域をもつ．これを繋ぐために以下のリンク関数 $g$ を用いる．

$$z = g(\mu) = \log \left( \frac{\mu}{1-\mu} \right) $$

これをロジット関数とよぶ．歴史的な都合で $g$ は $\mu$ から $z$ への写像となっているが，逆写像として考えた方がわかりやすい．すなわち，線形予測子 $z$ から分布の平均 $\mu$ を逆写像 $g^{-1}$ で写すのである．この関数は

$$\mu = \frac{ \exp (z) }{ 1+\exp (z)} $$

となり，いわゆるロジスティック関数である．

titanic号のデータセットに対して，ロジスティック回帰を適用してみる．

import statsmodels.api as sm    

titanic = pd.read_csv("http://logopt.com/data/titanic.csv", index_col=0)
model = smf.glm(
    formula="Survived ~ Sex + Pclass + Fare + SibSp + Parch",
    data=titanic,
    family=sm.families.Binomial(),
)
res = model.fit()
print(res.summary2())

               Results: Generalized linear model
===============================================================
Model:              GLM              AIC:            828.7501  
Link Function:      Logit            BIC:            -5194.4747
Dependent Variable: Survived         Log-Likelihood: -408.38   
Date:               2022-12-31 08:45 LL-Null:        -593.33   
No. Observations:   891              Deviance:       816.75    
Df Model:           5                Pearson chi2:   910.      
Df Residuals:       885              Scale:          1.0000    
Method:             IRLS                                       
---------------------------------------------------------------
                Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
---------------------------------------------------------------
Intercept       3.1473   0.3752   8.3895 0.0000  2.4121  3.8826
Sex[T.male]    -2.7594   0.1959 -14.0837 0.0000 -3.1434 -2.3754
Pclass         -0.8360   0.1268  -6.5905 0.0000 -1.0846 -0.5874
Fare            0.0034   0.0024   1.4508 0.1468 -0.0012  0.0080
SibSp          -0.2564   0.1008  -2.5435 0.0110 -0.4539 -0.0588
Parch          -0.0888   0.1132  -0.7842 0.4329 -0.3106  0.1331
===============================================================

titanic["predict"] = res.predict()  # 予測をpredict列に保管
titanic.plot.scatter(x="predict", y="Survived");
# 散布図に描画

問題: 癌の判定

http://logopt.com/data/cancer.csv にある胸部癌か否かを判定するデータセットを用いて分類を行え．

最初の列diagnosisが癌か否かを表すものであり， Mが悪性（malignant），Bが良性（benign）である．

必要なら以下の文字列を切り貼りして用いよ．

formula = """diagnosis~radius_mean+texture_mean+texture_mean+perimeter_mean+area_mean+smoothness_mean+compactness_mean+
concavity_mean+symmetry_mean+radius_worst+texture_worst+perimeter_worst+area_worst+smoothness_worst+
compactness_worst+concavity_worst+symmetry_worst+fractal_dimension_worst"""

cancer = pd.read_csv("http://logopt.com/data/cancer.csv", index_col=0)
cancer.head()

問題: 病院の予測

http://logopt.com/data/hospital.csv にある病院のデータを用いてロジスティック回帰を行え．

従属変数 died は死亡したか否かを表し，これを年齢(age)，施術(procedure)，性別(gender)，救急か否か(type)，入院日数(los: length of stay)から予測する．

必要なら以下の文字列を使用しても良い．

formula="died~procedure+age+gender+los+type"

hospital = pd.read_csv("http://logopt.com/data/hospital.csv", index_col=0)
hospital.head()

Poisson回帰

Poisson回帰は救急車の出動回数などの負の値をとらないカウントデータもしくはその発生率を予測する際に用いられる．

この場合には，従属変数が $0$ 以上の値になるので，一般化線形モデルでの仮定は以下のようになる．

従属変数 $y^$ は平均 $\mu^{(i)}$ のPoisson分布にしたがう．
線形予測子 $z^{(i)}$ を独立変数 $x^{(i)}$ を用いて $z^{(i)} = w x^{(i)} $ と定義する．(この部分は全部共通）．
リンク関数 $g$ を用いて $\mu^{(i)}$ と $z^{(i)}$ を繋ぐが，$\mu$ は $0$ 以上で $z$ は $[-\infty,+\infty]$ の定義域をもつ．これを繋ぐために以下のリンク関数 $g$ を用いる．

$$z = g(\mu) = \log (\mu)$$

$g$ の逆写像は指数関数 $$ \mu = \exp (z) $$ である．

一般化線形モデル(glm)を使えば，ほぼ同じように予測できる．

引数の family に sm.families.Poisson() を指定すれば良い．

http://logopt.com/data/hospital-stay.csv にある病院の入院日数のデータセットを用いる．従属変数である los （入院日数：length of stay)を，性別(gender)，救急か否か(type1)，75歳以上か(age75)から予測する．

入院日数は負の値をとらない，いわゆるカウントデータであるので，Poisson回帰を適用する．

hospital = pd.read_csv("http://logopt.com/data/hospital-stay.csv", index_col=0)
hospital.head()

model = smf.glm(
    formula="los ~ gender + type1 + age75", data=hospital, family=sm.families.Poisson()
)
res = model.fit()
print(res.summary2())

               Results: Generalized linear model
================================================================
Model:              GLM              AIC:            9178.5450  
Link Function:      Log              BIC:            -10080.9611
Dependent Variable: los              Log-Likelihood: -4585.3    
Date:               2022-03-18 19:36 LL-Null:        -4975.9    
No. Observations:   1798             Deviance:       3364.0     
Df Model:           3                Pearson chi2:   4.16e+03   
Df Residuals:       1794             Scale:          1.0000     
Method:             IRLS                                        
-----------------------------------------------------------------
              Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
-----------------------------------------------------------------
Intercept     1.1822    0.0276  42.8441  0.0000   1.1282   1.2363
gender       -0.1475    0.0218  -6.7523  0.0000  -0.1903  -0.1047
type1         0.6280    0.0258  24.3094  0.0000   0.5774   0.6787
age75         0.1298    0.0232   5.6016  0.0000   0.0844   0.1752
================================================================

hospital["predict"] = res.predict()
hospital.plot.scatter(x="predict", y="los")

問題: 魚の数の予測

http://logopt.com/data/fishing.csv にある魚の数を予測するためのデータセットにPoisson回帰を適用せよ．従属変数は魚の数を表す totabundであり，それを密度(density)，平均深度(meandepth)，年度(year)から予測せよ．

必要なら以下の文字列を用いよ．

formula="totabund ~ density + meandepth + year"

fish = pd.read_csv("http://logopt.com/data/fishing.csv", index_col=0)
fish.head()

時系列データの予測

時系列データ $y_0, y_1,\ldots, y_{t}$ が与えられたとき， $t+1$ 期（もしくは，それより先の期）のデータを予測したい．

以下の方法を紹介する．

移動平均法
指数平滑法
SARIMAX

移動平均法

$M$ 期前までのデータの平均を予測とする単純な方法である．

$t+1$ 期の予測値を $\hat{y}_{t+1}$ としたとき，移動平均法による予測値は以下のように定義される．

$$ \hat{y}_{t+1} = \sum_{i=t+1-M}^t y_i /M $$

移動平均は，pandasのSeriesのrollingメソッドに実装されている．パラメータ $M$ は，引数windowで与える．

rollingの返値は，移動平均オブジェクトであり，その平均meanをとることによって，移動平均が計算される．値が定義されない期に対してはNaNが入れられる．

以下では，簡単な需要系列を表すデータに対して， $M=2,4$ の移動平均を計算し，プロットしている．

y = pd.Series([28, 27, 30, 34, 32, 33, 32, 36, 33, 36])
ma2 = y.rolling(window=2).mean()
ma4 = y.rolling(window=4).mean()

y.plot()    # 青線
ma2.plot()  # オレンジ線
ma4.plot(); # 緑線

ma2

0     NaN
1    27.5
2    28.5
3    32.0
4    33.0
5    32.5
6    32.5
7    34.0
8    34.5
9    34.5
dtype: float64

ma4

0      NaN
1      NaN
2      NaN
3    29.75
4    30.75
5    32.25
6    32.75
7    33.25
8    33.50
9    34.25
dtype: float64

単純指数平滑法

単純指数平滑法は，過去の予測値と新たな実現値をパラメータ $\alpha$ で調整して予測を行う．

$y_t$: 期 $t$ の実現値
$s_t$: 予測値
$\alpha$: 平滑化定数 (smoothing_level)

$$ \begin{array}{ll} &s_0 = y_0\\ &s_{t} = \alpha y_{t} + (1-\alpha)s_{t-1} \\ &\hat{y}_{t+1} = s_t \end{array} $$

需要データを単純指数平滑法で予測する．

SimpleExpSmoothingクラスのfitメソッドでパラメータを最適化する． fitメソッドの引数 smoothing_level で平滑化定数を与えることができる． optimized引数を True （既定値）にすると，最適な平滑化定数が計算される．
クラス生成の際には，時系列データ（pandasのSeries）の他に，初期値 $\hat{y}_0 (=s_{-1})$ を計算する方法を引数 initialization_method で与える．引数には以下のものがある．
- estimated: 初期値$\hat{y}_0$を推定する
- heuristic: 経験的方法で初期値$\hat{y}_0$を定める
- known: この場合には，さらに引数 initial_level で初期値を与える必要がある
predictメソッドで予測を行う．
最適なパラメータは，予測インスタンスのparams_formatted属性で確認できる．

移動平均法との関係（最小自乗誤差を同じにする）

$$ \alpha = \frac{2}{M+1} $$

from statsmodels.tsa.api import SimpleExpSmoothing, Holt, ExponentialSmoothing

y = pd.Series([28, 27, 30, 34, 32, 33, 32, 36, 33, 36])
fit1 = SimpleExpSmoothing(y, initialization_method="estimated").fit()
es = fit1.predict(0, 9)
y.plot()
es.plot(); # オレンジ線

fit1.params_formatted

2/0.643296-1

2.108988708153012

Holt法

Holt法は，傾向変動を考慮した指数平滑法であり，2つの平滑化パラメータ $\alpha, \beta$ を使う．

パラメータ

$y_t$: 期 $t$ の実現値
$s_t$: 基本水準予測値
$b_t$: 傾向変動予測値
$\alpha$: 平滑化定数 (smoothing_level)
$\beta$: 傾向変動に対する平滑化定数 (smoothing_slope)

$$ \begin{array}{ll} &s_0 = y_0\\ &s_{t} = \alpha y_{t} + (1-\alpha)(s_{t-1} + b_{t-1})\\ &b_{t} = \beta (s_t - s_{t-1}) + (1-\beta)b_{t-1}\\ &\hat{y}_{t+1} = s_t + b_t \end{array} $$

傾向変動があるデータにHolt法を適用する．

Holtクラスのfitメソッドでパラメータを最適化する．
predictメソッドで予測を行う．
最適なパラメータは，予測インスタンスのparams_formatted属性で確認できる．

y = pd.Series([28, 27, 30, 34, 32, 33, 32, 36, 33, 36])
fit2 = Holt(y, initialization_method="estimated").fit()
holt = fit2.predict(0, 9)
y.plot()
holt.plot();

fit2.params_formatted

Holt-Winters法

Holt-Winters法は，季節変動と傾向変動を考慮した指数平滑法であり，3つの平滑化パラメータ $\alpha, \beta, \gamma$ を用いる．

パラメータ

$y_t$: 期 $t$ の実現値
$s_t$: 基本水準予測値
$b_t$: 傾向変動予測値
$c_t$: 季節変動予測値
$\hat{y}_{t+m}$: 期 $t$ における $m$期後の予測値
$L$ ：季節変動の周期 (seasonal_periods)
$\alpha$: 平滑化定数 (smoothing_level)
$\beta$: 傾向変動に対する平滑化定数 (smoothing_slope)
$\gamma$: 季節変動に対する平滑化定数 (smoothing_seasonal)

加法的季節変動：

$$ \begin{array}{lll} &s_0 & =& y_0\\ &s_{t} & =& \alpha (y_{t}-c_{t-L}) + (1-\alpha)(s_{t-1} + b_{t-1})\\ &b_{t} &=& \beta (s_t - s_{t-1}) + (1-\beta)b_{t-1}\\ &c_{t} &=& \gamma (y_{t}-s_{t})+(1-\gamma)c_{t-L}\\ &\hat{y}_{t+m} &=& s_t + m b_t+c_{t-L+1+(m-1)\mod L} \end{array} $$

乗法的季節変動：

$$ \begin{array}{lll} s_0 &=& y_0\\ s_{t} &=& \alpha (y_{t}/c_{t-L}) + (1-\alpha)(s_{t-1} + b_{t-1})\\ b_{t} &=& \beta (s_t - s_{t-1}) + (1-\beta)b_{t-1}\\ c_{t} &=& \gamma (y_{t}/s_{t})+(1-\gamma)c_{t-L}\\ \hat{y}_{t+m} &=& (s_t + m b_t) c_{t-L+1+(m-1)\mod L} \end{array} $$

１ヶ月ごとの需要量データで，$L=12$ヶ月の周期をもつ．

ExponentialSmoothingクラスのfitメソッドでパラメータを最適化する．周期（seasonal_periods）は12とし，傾向も季節変動も加法的とする（乗法的の場合には'mul'を入れる）．
predictメソッドで予測を行う．
最適なパラメータは，予測インスタンスのparams_formatted属性で確認できる．

y = pd.Series(
    [
        19,
        48,
        48,
        49,
        71,
        91,
        99,
        115,
        96,
        46,
        26,
        24,
        29,
        55,
        44,
        48,
        64,
        98,
        104,
        125,
        104,
        45,
        22,
        20,
        13,
        46,
        42,
        45,
        79,
        93,
        92,
        127,
        84,
        27,
        5,
        7
    ]
)
fit3 = ExponentialSmoothing(
    y,
    seasonal_periods=12,
    trend="add",
    seasonal="add",
    initialization_method="estimated",
).fit()
hw = fit3.predict(0, len(y))
y.plot()
hw.plot();

fit3.params_formatted

航空機の乗客数の例題

航空機の乗客数の時系列データを用いて、様々な手法の比較を行う。

最後の2年間（24ヶ月）をテストデータとする。Holt-Winters法に対しては、パラメータは自動調整とし， Box-Cox変換をしてデータを正規化しておく（引数のuse_boxcoxをTrueにする）．

passengers = pd.read_csv("http://logopt.com/data/AirPassengers.csv")
passengers["Month"] = pd.to_datetime(passengers.Month)
passengers.set_index("Month", inplace=True)  # 月をインデックスとする
passengers.index.freq = "MS"  # 月のはじめを頻度にする
passengers.head()

fit1 = SimpleExpSmoothing(passengers[:-24], initialization_method="estimated").fit(
    smoothing_level=0.2, optimized=False
)
fit2 = Holt(passengers[:-24], initialization_method="estimated").fit(
    smoothing_level=0.8, smoothing_trend=0.2, optimized=False
)
fit3 = ExponentialSmoothing(
    passengers[:-24],
    seasonal_periods=12,
    trend="add",
    seasonal="add",
    initialization_method="estimated",
).fit()

自動調整されたパラメータを見てみる。

fit3.params

{'smoothing_level': 0.23678678235712566,
 'smoothing_trend': 1.8055668382578714e-09,
 'smoothing_seasonal': 0.7632132086415252,
 'damping_trend': nan,
 'initial_level': 119.1876680939318,
 'initial_trend': 2.2768394955266533,
 'initial_seasons': array([ -9.42424719,  -3.87104168,   8.69002781,   3.6678027 ,
         -4.94701962,   9.26652064,  21.53444859,  19.2001798 ,
          5.07181472, -13.80997707, -28.51179942, -12.37245268]),
 'use_boxcox': False,
 'lamda': None,
 'remove_bias': False}

最後の2年分の予測を行う。

start, end = len(passengers) - 24, len(passengers)
predict1 = fit1.predict(start, end)
predict2 = fit2.predict(start, end)
predict3 = fit3.predict(start, end)

passengers[-24:].plot()  # 実際のデータ 青線
predict1.plot()
# 単純指数平滑 オレンジ線
predict2.plot()
# Holt法  緑線
predict3.plot();
# Holt-Winters法 赤線

SARIMAX

SARIMAXは，自己回帰和分移動平均モデル ARIMA (autoregressive integrated moving average)モデルの拡張である．

$$ y_{t} - y_{t-d} =c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \cdots + \phi_{p}y_{t-p} + \varepsilon_{t} + \theta_{1}\varepsilon_{t-1} + \cdots + \theta_{q}\varepsilon_{t-q} $$

パラメータ

自己回帰 AR (autoregressive): $p$；過去 $p$ 期の値の多項式で予測
差分 I (integrated): $d$； $d$ 期前との差分をとることによって傾向変動を予測
移動平均 MA (moving average): $q$；過去 $q$ 期の誤差の多項式で予測

ARIMAに季節変動を追加したものが SARIMA (seasonal ARIMA)である．

パラメータ

季節自己回帰 SAR: $P$
季節差分 SI : $D$
季節移動平均 SMA: $Q$
周期: $L$

SARIMAX は，さらに外生変数 (eXogenous variables) を追加したものである．

航空機の乗客数の例題で予測する．

import statsmodels.api as sm

mod = sm.tsa.statespace.SARIMAX(
    passengers[:-24],
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 12),
    enforce_stationarity=False,
    enforce_invertibility=False,
)
fit4 = mod.fit()
predict4 = fit4.predict(start, end)

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            5     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.92259D+00    |proj g|=  1.74251D-02

At iterate    5    f=  2.90832D+00    |proj g|=  7.22871D-03

At iterate   10    f=  2.90735D+00    |proj g|=  5.62784D-03

At iterate   15    f=  2.90717D+00    |proj g|=  1.02033D-03

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    5     19     25      1     0     0   4.927D-07   2.907D+00
  F =   2.9071672307458569     

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL

 This problem is unconstrained.

予測結果をデータフレームに追加する．

passengers["Exp.Smooth."] = predict1
passengers["Holt"] = predict2
passengers["Holt-Winter"] = predict3
passengers["SARIMAX"] = predict4

passengers.tail()

Plotlyで予測結果とオリジナルの乗客数を可視化する．

import plotly.express as px
import plotly

stacked_df = pd.DataFrame(passengers[-24:].stack(), columns=["num"]).reset_index()
fig = px.line(stacked_df, x="Month", y="num", color="level_1")
fig.show();

問題: ピザの需要予測

ピザの需要量データ http://logopt.com/data/pizza.csv を読み込み、指数平滑法， Holt法， Holt-Winters法とSARIMAXで予測を行え。

データは、Date列に日付が，Pizzas列にピザの需要量が入っている．

ただし、テストデータは最後の24日とせよ。

また、Plotly Expressによる可視化を行い、実データと比較せよ。

	carat	colour	clarity	certification	price
1	0.30	D	VS2	GIA	1302
2	0.30	E	VS1	GIA	1510
3	0.30	G	VVS1	GIA	1510
4	0.30	G	VS1	GIA	1260
5	0.31	D	VS1	GIA	1641

	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	symmetry_mean	...	radius_worst	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst
id
842302	M	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	...	25.38	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
842517	M	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	...	24.99	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
84300903	M	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	...	23.57	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
84348301	M	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	0.2597	...	14.91	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
84358402	M	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	0.1809	...	22.54	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678

	site	totabund	density	meandepth	year	period	sweptarea
1	1	76	0.002070	804	1978	1977-1989	36710.000000
2	2	161	0.003520	808	2001	2000-2002	45741.253906
3	3	39	0.000981	809	2001	2000-2002	39775.000000
4	4	410	0.008039	848	1979	1977-1989	51000.000000
5	5	177	0.005933	853	2002	2000-2002	29831.251953

	name	param	optimized
smoothing_level	alpha	4.976157e-01	True
smoothing_trend	beta	6.622793e-09	True
smoothing_seasonal	gamma	5.862000e-09	True
initial_level	l.0	6.168322e+01	True
initial_trend	b.0	-4.035850e-01	True
initial_seasons.0	s.0	-4.156733e+01	True
initial_seasons.1	s.1	-1.183038e+01	True
initial_seasons.2	s.2	-1.642662e+01	True
initial_seasons.3	s.3	-1.335631e+01	True
initial_seasons.4	s.4	1.104726e+01	True
initial_seasons.5	s.5	3.411763e+01	True
initial_seasons.6	s.6	3.885439e+01	True
initial_seasons.7	s.7	6.325796e+01	True
initial_seasons.8	s.8	3.599502e+01	True
initial_seasons.9	s.9	-1.893489e+01	True
initial_seasons.10	s.10	-4.019823e+01	True
initial_seasons.11	s.11	-4.046100e+01	True

	#Passengers
Month
1949-01-01	112
1949-02-01	118
1949-03-01	132
1949-04-01	129
1949-05-01	121

	name	param	optimized
smoothing_level	alpha	0.643196	True
initial_level	l.0	28.180514	True

	name	param	optimized
smoothing_level	alpha	1.492725e-08	True
smoothing_trend	beta	2.759277e-10	True
initial_level	l.0	2.746664e+01	True
initial_trend	b.0	8.424334e-01	True

	#Passengers	Exp.Smooth.	Holt	Holt-Winter	SARIMAX
Month
1960-08-01	606	374.897322	92.268388	553.832185	506.802983
1960-09-01	508	374.897322	80.363471	455.196351	405.637677
1960-10-01	461	374.897322	68.458554	409.264176	360.736984
1960-11-01	390	374.897322	56.553638	361.770943	311.758930
1960-12-01	432	374.897322	44.648721	400.905087	338.665679

	procedure	age	los	type
1	1	73	51	0
2	0	67	30	1
3	1	69	43	0
4	1	65	32	0
5	1	79	42	1

statsmodelsを用いた統計分析

線形回帰

サマリーの見方

問題: 車の価格の予測

問題: 広告の効果の予測

問題: 住宅価格の予測

問題: GPAの予測

一般化線形モデル

ロジスティック回帰

問題: 癌の判定

問題: 病院の予測

Poisson回帰

問題: 魚の数の予測

時系列データの予測

移動平均法

単純指数平滑法

Holt法

Holt-Winters法

航空機の乗客数の例題

SARIMAX

問題: ピザの需要予測