Here we explain the basics of linear algebra and probability/statistics that are useful for machine learning. There are many books that read like a cut-and-paste of articles scattered around the Web or of examples from manuals, but few that also cover the theory behind them. On the other hand, books that read like a cut-and-paste of papers written for specialists are too difficult to be useful to practitioners. Here we take the following approach: start from a simple example, understand the theory behind it, and then solve a practical example.

Introduction to Linear Algebra

Based on the testimony of Yuichiro, who went shopping for fruit, determine the prices of an apple and a pear.

  • On the first day, I bought one apple and one pear. The total was 100 yen.
  • On the second day, I bought one apple and two pears. The total was 180 yen.
import numpy as np

A = np.array([[1, 1], [1, 2]])
b = np.array([100, 180])
x = np.linalg.solve(A, b)
x
array([20., 80.])

Problem

For each of the following two cases, determine the prices of an apple and a pear by the same method as above.

(Case 1)

  • On the first day, I bought one apple and one pear. The total was 100 yen.
  • On the second day, I bought two apples and two pears. The total was 200 yen.

(Case 2)

  • On the first day, I bought one apple and one pear. The total was 100 yen.
  • On the second day, I bought two apples and two pears. The total was 240 yen.
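
In both cases the second day's purchase is exactly double the first day's, so the coefficient matrix is singular and np.linalg.solve raises an error. As a hint, here is a minimal sketch (the variable names are ours) of detecting this and falling back to a least-squares answer; Case 1 is consistent but has infinitely many solutions, while Case 2 has no exact solution.

import numpy as np

A = np.array([[1, 1], [2, 2]])  # singular: the second row is twice the first
for b in (np.array([100, 200]), np.array([100, 240])):
    try:
        print(np.linalg.solve(A, b))
    except np.linalg.LinAlgError:
        # lstsq returns a (minimum-norm) least-squares solution even when
        # the system is underdetermined or inconsistent.
        sol, residual, rank, _ = np.linalg.lstsq(A, b, rcond=None)
        print("lstsq solution:", sol, "rank:", rank)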

Regression Analysis

You have opened a coffee shop. Sales over the first five days were 15, 20, 34, 42, 58, trending steadily upward. Estimate with a linear function by how much sales increase per day.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Create the data with NumPy and plot it.
n = 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([15.0, 20.0, 34.0, 42.0, 58.0])

plt.plot(x, y, ".")

seaborn

With seaborn, the plot can be drawn more nicely.

import seaborn as sns

sns.jointplot(x=x, y=y, kind="reg")
<seaborn.axisgrid.JointGrid at 0x7f9ec1fbc0d0>

Hand calculation

To choose $w$ in the linear regression prediction $y=wx$ so that the error is minimized, it suffices to set

$$w=(x^Tx)^{-1} x^Ty.$$

(For a single feature, $x^T x$ is a scalar, so this is simply $x^T y / x^T x$.)

print(np.dot(1 / np.dot(x, x) * x, y))  # In NumPy the inner product is computed with np.dot().
print(1 / (x @ x) * x @ y)  # Same computation; since Python 3.5 vector (matrix) products can be written with @. Note that x @ x is a scalar.
11.18181818181818
11.18181818181818
w = (x @ y) / (x @ x)
SSE = ((y - w * x) ** 2).sum()
SST = ((y - np.mean(y)) ** 2).sum()
R2 = 1.0 - SSE / SST
print(R2, SSE, SST)
0.9731101118133204 32.18181818181819 1196.7999999999997

Using the least squares method (OLS) of statsmodels

Confirm that the coefficient (Coef.) of x1 is 11.1818, the same as the hand calculation above.

The coefficient of determination $R^2$ (R-squared) and related statistics are obtained at the same time.

import statsmodels.api as sm

model1 = sm.OLS(y, x)
r1 = model1.fit()
print(r1.summary2())
                       Results: Ordinary least squares
==============================================================================
Model:                  OLS              Adj. R-squared (uncentered): 0.994   
Dependent Variable:     y                AIC:                         25.4992 
Date:                   2022-06-08 07:42 BIC:                         25.1086 
No. Observations:       5                Log-Likelihood:              -11.750 
Df Model:               1                F-statistic:                 854.7   
Df Residuals:           4                Prob (F-statistic):          8.15e-06
R-squared (uncentered): 0.995            Scale:                       8.0455  
-----------------------------------------------------------------------------------
            Coef.       Std.Err.         t         P>|t|        [0.025       0.975]
-----------------------------------------------------------------------------------
x1         11.1818        0.3825      29.2360      0.0000      10.1199      12.2437
------------------------------------------------------------------------------
Omnibus:                   nan              Durbin-Watson:               2.470
Prob(Omnibus):             nan              Jarque-Bera (JB):            0.493
Skew:                      0.087            Prob(JB):                    0.782
Kurtosis:                  1.472            Condition No.:               1    
==============================================================================

/Users/mikiokubo/Library/Caches/pypoetry/virtualenvs/analytics-v-sH3Dza-py3.8/lib/python3.8/site-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 5 samples were given.
  warn("omni_normtest is not valid with less than 8 observations; %i "

Using the generalized linear model (GLM) of statsmodels

Confirm that the coefficient (Coef.) of x1 is 11.1818, the same as the calculation above.

model2 = sm.GLM(y, x)
r2 = model2.fit()
print(r2.summary2())
             Results: Generalized linear model
============================================================
Model:              GLM              AIC:            25.4992
Link Function:      identity         BIC:            25.7441
Dependent Variable: y                Log-Likelihood: -11.750
Date:               2022-06-08 07:42 LL-Null:        -84.185
No. Observations:   5                Deviance:       32.182 
Df Model:           0                Pearson chi2:   32.2   
Df Residuals:       4                Scale:          8.0455 
Method:             IRLS                                    
--------------------------------------------------------------
      Coef.    Std.Err.      z      P>|z|     [0.025    0.975]
--------------------------------------------------------------
x1   11.1818     0.3825   29.2360   0.0000   10.4322   11.9314
============================================================

A generalized linear model using a model formula

The result is the same as above, but the model can be specified as a formula. 'y ~ x - 1' means fitting a line that passes through the origin, i.e., with the $y$-intercept fixed at $0$.

import statsmodels.formula.api as smf

model3 = smf.glm("y ~ x - 1", {"y": y, "x": x})
r3 = model3.fit()
print(r3.summary2())
             Results: Generalized linear model
============================================================
Model:              GLM              AIC:            25.4992
Link Function:      identity         BIC:            25.7441
Dependent Variable: y                Log-Likelihood: -11.750
Date:               2022-06-08 07:42 LL-Null:        -84.185
No. Observations:   5                Deviance:       32.182 
Df Model:           0                Pearson chi2:   32.2   
Df Residuals:       4                Scale:          8.0455 
Method:             IRLS                                    
--------------------------------------------------------------
      Coef.    Std.Err.      z      P>|z|     [0.025    0.975]
--------------------------------------------------------------
x    11.1818     0.3825   29.2360   0.0000   10.4322   11.9314
============================================================

for i, r in enumerate([r1, r2, r3]):
    plt.subplot(1, 3, i + 1)
    plt.plot(x, y, ".")
    if i < 2:
        plt.plot(x, r.predict(x))
    else:
        plt.plot(x, r.predict({"x": x}))
    plt.title(["OLS", "GLM", "glm"][i])
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, x.max() + 1))
    plt.ylim((0, y.max() + 5))

Centering

The mean $\bar{y}$ of a vector $( y^{(1)},y^{(2)}, \ldots, y^{(n)} )$ can be computed as $$ \bar{y} = \left(\sum_i y^{(i)} \right)/n. $$

Since this is $1/n$ times the inner product with the all-ones vector of length $n$, it can be written in vector notation as $$ \bar{y} = \frac{1}{n} (1,\ldots,1) \begin{pmatrix} y^{(1)}\\ \vdots\\ y^{(n)} \end{pmatrix}. $$

Of course, the result is the same if we use NumPy's mean function.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([15, 20, 34, 42, 58])

ybar = (1 / 5) * np.ones(5) @ y
print(ybar)
print(np.mean(y))
33.800000000000004
33.8

Scaling a vector so that its mean becomes $0$ is called centering.

The centered version $\tilde{y}$ of a vector $( y^{(1)},y^{(2)}, \ldots, y^{(n)} )$ is computed as $$ \tilde{y} = y - \bar{y} \begin{pmatrix} 1 \\ \vdots\\ 1 \end{pmatrix}. $$

ytilde = y - ybar * np.ones(5)
print(ytilde)
[-18.8 -13.8   0.2   8.2  24.2]

Substituting the definition of $\bar{y}$ gives $$ \tilde{y} = y - \bar{y} = \left( \begin{array}{ccc} 1&\cdots &0\\ &\ddots& \\ 0&\cdots &1 \end{array} \right) y - \frac{1}{n} \left( \begin{array}{ccc} 1&\cdots &1\\ &\ddots& \\ 1 &\cdots &1 \end{array} \right) y. $$ Writing $I_n$ for the $n \times n$ identity matrix and $1_n$ for the $n \times n$ matrix whose entries are all $1$, centering becomes $$ \left(I_n -\frac{1}{n} 1_n \right) y. $$ If we denote the matrix $\left(I_n -\frac{1}{n} 1_n \right)$ by $H_n$, centering can be viewed as the linear transformation $$ \tilde{y} = H_n y. $$

H = np.identity(5) - (1 / 5) * np.ones((5, 5))  # H makes every column centered
print("H y = \n", H @ y)
H y = 
 [-18.8 -13.8   0.2   8.2  24.2]

The matrix $H_n$ performs centering, and it has an important property called idempotency:

$$H_n H_n = H_n$$

This says that centering twice gives the same result as centering once.

print("H=\n", H)

print("H H=\n", H @ H)
H=
 [[ 0.8 -0.2 -0.2 -0.2 -0.2]
 [-0.2  0.8 -0.2 -0.2 -0.2]
 [-0.2 -0.2  0.8 -0.2 -0.2]
 [-0.2 -0.2 -0.2  0.8 -0.2]
 [-0.2 -0.2 -0.2 -0.2  0.8]]
H H=
 [[ 0.8 -0.2 -0.2 -0.2 -0.2]
 [-0.2  0.8 -0.2 -0.2 -0.2]
 [-0.2 -0.2  0.8 -0.2 -0.2]
 [-0.2 -0.2 -0.2  0.8 -0.2]
 [-0.2 -0.2 -0.2 -0.2  0.8]]

For a matrix $X$, likewise, the (column-wise) means are obtained by $$ \bar{X} = \frac{1}{n} (1,\ldots,1) X, $$ and the (column-wise) centering by $$ \tilde{X} = H_n X. $$

X = np.array([[1, 2, 3, 4, 5], [1, 0, 0, 0, 1]]).T
print("X=\n", X)
print("columnwise average=", (1 / 5) * np.ones(5) @ X)
print("H X =\n", H @ X)
X=
 [[1 1]
 [2 0]
 [3 0]
 [4 0]
 [5 1]]
columnwise average= [3.  0.4]
H X =
 [[-2.00000000e+00  6.00000000e-01]
 [-1.00000000e+00 -4.00000000e-01]
 [-5.55111512e-17 -4.00000000e-01]
 [ 1.00000000e+00 -4.00000000e-01]
 [ 2.00000000e+00  6.00000000e-01]]

Variance

The variance $Var(y)$ of a vector $( y^{(1)},y^{(2)}, \ldots, y^{(n)} )$ is defined as $$ Var(y) = \frac{1}{n} \sum_i \left( y^{(i)} - \bar{y} \right)^2. $$ This equals $\frac{1}{n} \tilde{y}^T \tilde{y}$, so substituting $\tilde{y}= H_n y$ gives $$ Var(y) = \frac{1}{n} (H_n y)^T (H_n y) = \frac{1}{n} y^T H_n y, $$ using the symmetry $H_n^T = H_n$ together with the idempotency $H_n H_n = H_n$.

The variance can also be computed with NumPy's var function.

print("y^T H y =", y.T @ H @ y / 5)
print("variance=", np.var(y))
y^T H y = 239.36000000000004
variance= 239.35999999999996

Covariance

The covariance $Cov(x,y)$ of vectors $( x^{(1)},x^{(2)}, \ldots, x^{(n)} )$ and $( y^{(1)},y^{(2)}, \ldots, y^{(n)} )$ is defined as $$ Cov(x,y) = \frac{1}{n} \sum_i \left( x^{(i)} - \bar{x} \right) \left( y^{(i)} - \bar{y} \right). $$ This equals $\frac{1}{n} \tilde{x}^T \tilde{y}$, so substituting $\tilde{x}= H_n x$ and $\tilde{y}= H_n y$ gives $$ Cov(x,y) = \frac{1}{n} (H_n x)^T (H_n y) = \frac{1}{n} x^T H_n y, $$ again using the symmetry and idempotency of $H_n$.

If we compute $$ \frac{1}{n} X^T H_n X $$ for a matrix $X$ rather than a vector $y$, we obtain the variance-covariance matrix.

print("X^T H X =\n", X.T @ H @ X / 5)
print(
    "covariance= \n", np.cov(X, rowvar=False, bias=True)
)  # rowvar=False: each column is a variable; bias=True divides by n (False gives the unbiased version, dividing by n-1)
X^T H X =
 [[2.0000000e+00 0.0000000e+00]
 [8.8817842e-17 2.4000000e-01]]
covariance= 
 [[2.   0.  ]
 [0.   0.24]]

Single-variable regression through the origin

After centering so that the means become $0$, we find the error-minimizing line through the origin. Writing $w$ for the slope of the line, the error is the function $$ f(w) = || \tilde{y} - w \tilde{x} ||^2 = \tilde{y}^T \tilde{y} - 2 w \tilde{y}^T \tilde{x} + w^2 \tilde{x}^T \tilde{x}. $$ We want the $w$ that minimizes it. Differentiating gives $$ f'(w)= 2 w \tilde{x}^T \tilde{x} - 2 \tilde{y}^T \tilde{x}, $$ so setting this to $0$ and solving for $w$ yields $$ w = \frac{\tilde{y}^T \tilde{x}}{ \tilde{x}^T \tilde{x} }. $$ (Check for yourself, using the second derivative, that $f(w)$ is convex.)

Using $$ \tilde{y}^T \tilde{x} = x^T H_n y = n Cov(x,y) $$

and $$ \tilde{x}^T \tilde{x} = x^T H_n x = n Var(x), $$ we obtain $$ w =\frac{\tilde{y}^T \tilde{x}}{ \tilde{x}^T \tilde{x} } = \frac{Cov(x,y)}{Var(x)}. $$

Recalling that the correlation coefficient $\rho_{xy}$ of random variables $x,y$ is defined, with $\sigma_x,\sigma_y$ the standard deviations of $x,y$, as $$ \rho_{xy} =\frac{Cov(x,y)}{ \sigma_x \sigma_y }, $$ we get $$ w = \frac{ \rho_{xy} \sigma_x \sigma_y }{\sigma_x^2 } = \rho_{xy} \frac{ \sigma_y }{ \sigma_x }. $$

The correlation coefficient can be obtained with NumPy's corrcoef function.

Let us check that the same results come out.

xbar = (1.0 / n) * np.ones(n) @ x
xtilde = x - xbar * np.ones(n)

ybar = (1.0 / n) * np.ones(n) @ y
ytilde = y - ybar * np.ones(n)

print("正規化されたベクトルx,yに対するy^T x/x^T x=", (ytilde @ xtilde) / (xtilde @ xtilde))
print(
    "正規化されたベクトルに対する 相関×yの標準偏差/xの標準偏差=",
    np.corrcoef(xtilde, ytilde)[0, 1] * np.std(ytilde) / np.std(xtilde),
)
print("もとのベクトルに対する 相関×yの標準偏差/xの標準偏差=", np.corrcoef(x, y)[0, 1] * np.std(y) / np.std(x))
正規化されたベクトルx,yに対するy^T x/x^T x= 10.8
正規化されたベクトルに対する 相関×yの標準偏差/xの標準偏差= 10.799999999999999
もとのベクトルに対する 相関×yの標準偏差/xの標準偏差= 10.799999999999997

The slope obtained by linear regression after centering should coincide with the $w$ derived above.

Let us check.

model1 = sm.OLS(ytilde, xtilde)
r1 = model1.fit()
print(r1.summary2())
                       Results: Ordinary least squares
==============================================================================
Model:                  OLS              Adj. R-squared (uncentered): 0.968   
Dependent Variable:     y                AIC:                         25.2144 
Date:                   2022-06-08 07:42 BIC:                         24.8238 
No. Observations:       5                Log-Likelihood:              -11.607 
Df Model:               1                F-statistic:                 153.5   
Df Residuals:           4                Prob (F-statistic):          0.000244
R-squared (uncentered): 0.975            Scale:                       7.6000  
-----------------------------------------------------------------------------------
             Coef.       Std.Err.         t         P>|t|       [0.025       0.975]
-----------------------------------------------------------------------------------
x1          10.8000        0.8718      12.3884      0.0002      8.3796      13.2204
------------------------------------------------------------------------------
Omnibus:                  nan               Durbin-Watson:               2.591
Prob(Omnibus):            nan               Jarque-Bera (JB):            0.631
Skew:                     -0.067            Prob(JB):                    0.730
Kurtosis:                 1.265             Condition No.:               1    
==============================================================================

/Users/mikiokubo/Library/Caches/pypoetry/virtualenvs/analytics-v-sH3Dza-py3.8/lib/python3.8/site-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 5 samples were given.
  warn("omni_normtest is not valid with less than 8 observations; %i "
model = smf.glm("y ~ x -1", {"y": ytilde, "x": xtilde})
r = model.fit()
print(r.summary2())
             Results: Generalized linear model
============================================================
Model:              GLM              AIC:            25.2144
Link Function:      identity         BIC:            23.9622
Dependent Variable: y                Log-Likelihood: -11.607
Date:               2022-06-08 07:42 LL-Null:        -88.402
No. Observations:   5                Deviance:       30.400 
Df Model:           0                Pearson chi2:   30.4   
Df Residuals:       4                Scale:          7.6000 
Method:             IRLS                                    
--------------------------------------------------------------
       Coef.    Std.Err.      z      P>|z|    [0.025    0.975]
--------------------------------------------------------------
x     10.8000     0.8718   12.3884   0.0000   9.0913   12.5087
============================================================

Accounting for the y-intercept

You remember that friends came by on the first day. Estimate the friends, in other words the fixed customer base (the y-intercept), by linear regression.

This is easy with the formula interface to the generalized linear model (glm) in statsmodels.

model3 = smf.glm("y ~ x ", {"y": y, "x": x})
r3 = model3.fit()
print(r3.summary2())
             Results: Generalized linear model
============================================================
Model:              GLM              AIC:            27.2144
Link Function:      identity         BIC:            25.5717
Dependent Variable: y                Log-Likelihood: -11.607
Date:               2022-06-08 07:42 LL-Null:        -69.437
No. Observations:   5                Deviance:       30.400 
Df Model:           1                Pearson chi2:   30.4   
Df Residuals:       3                Scale:          10.133 
Method:             IRLS                                    
------------------------------------------------------------
              Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
------------------------------------------------------------
Intercept     1.4000   3.3387  0.4193 0.6750 -5.1437  7.9437
x            10.8000   1.0066 10.7287 0.0000  8.8270 12.7730
============================================================

A base of $1.4$ regular customers is estimated, but the $p$-value is quite large, so the null hypothesis that the number of regular customers is $0$ cannot be rejected. In other words, the friends have not become regular customers.
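
The $p$-value can also be read off programmatically; a minimal sketch, assuming the fitted r3 from above:

print(r3.pvalues["Intercept"])  # about 0.675, far above 0.05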

Let us compare the regression result with the values computed from the correlation and the standard deviations. They agree.

w1 = np.corrcoef(xtilde, ytilde)[0, 1] * np.std(ytilde) / np.std(xtilde)
w0 = ybar - xbar * w1
print(w0, w1)
1.4000000000000057 10.799999999999999

Calculation by the formula

To fit the linear regression $y=Xw$ with $w$ chosen to minimize the error, we first prepend a column of ones to the matrix $X$; this column plays the role of the intercept.

The least squares solution is $$ w= (X^T X)^{-1} X^T y. $$

X = np.array([[1, 1, 1, 1, 1], [1, 2, 3, 4, 5]]).T
y = np.array([15, 20, 34, 42, 58])
XINV = np.linalg.inv(X.T @ X)
print((XINV @ X.T) @ y.T)
[ 1.4 10.8]

Using scikit-learn

The same can be done with the machine learning module scikit-learn, though a little care is needed.

from sklearn.linear_model import LinearRegression

model = LinearRegression()

# The input must be a column vector (a 2-D array).
# x = np.array([[1, 2, 3, 4, 5]]).T  # this also works
x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([[15, 20, 34, 42, 58]]).T
print(x)
model.fit(x, y)
print(model.coef_)
print(model.intercept_)
[[1]
 [2]
 [3]
 [4]
 [5]]
[[10.8]]
[1.4]

The effect of advertising

You remember that flyers were handed out on the first day and the last (fifth) day. Estimate the effect of the flyers by linear regression.

Hand calculation

To fit the multiple linear regression $y=Xw$ with $w$ chosen to minimize the error, it again suffices to set

$$ w=(X^T X)^{-1} X^T y. $$

X = np.array([[1, 2, 3, 4, 5], [1, 0, 0, 0, 1]]).T  # .T transposes; the second column is the flyer dummy
y = np.array([15, 20, 34, 42, 58])
XINV = np.linalg.inv(X.T @ X)
(XINV @ X.T) @ y.T
array([10.7027027 ,  4.39189189])
model3 = smf.glm("y ~ x - 1", {"y": y, "x": x})
r3 = model3.fit()
print(r3.summary2())
             Results: Generalized linear model
============================================================
Model:              GLM              AIC:            25.4992
Link Function:      identity         BIC:            25.7441
Dependent Variable: y                Log-Likelihood: -11.750
Date:               2022-06-08 07:42 LL-Null:        -84.185
No. Observations:   5                Deviance:       32.182 
Df Model:           0                Pearson chi2:   32.2   
Df Residuals:       4                Scale:          8.0455 
Method:             IRLS                                    
--------------------------------------------------------------
      Coef.    Std.Err.      z      P>|z|     [0.025    0.975]
--------------------------------------------------------------
x    11.1818     0.3825   29.2360   0.0000   10.4322   11.9314
============================================================

Results

The Akaike Information Criterion (AIC) is smaller than before, so we can see that the advertising does have some effect (about 4-5 extra sales per day, judging from the flyer coefficient of roughly 4.4 in the hand calculation above).
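
As a sketch of that AIC comparison (the flyer dummy variable and the explicit refits below are ours), one can fit the model with and without the flyer column and compare the two AIC values directly:

import numpy as np
import statsmodels.formula.api as smf

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([15.0, 20.0, 34.0, 42.0, 58.0])
flyer = np.array([1.0, 0.0, 0.0, 0.0, 1.0])  # flyers were handed out on days 1 and 5

base = smf.glm("y ~ x - 1", {"y": y, "x": x}).fit()
ad = smf.glm("y ~ x + flyer - 1", {"y": y, "x": x, "flyer": flyer}).fit()
print(base.aic, ad.aic)  # the flyer model should show the smaller AIC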

Computing the residuals

$$ w =(X^T X)^{-1} X^T y $$

The residuals are $$ \epsilon = y -X w = (I - X(X^T X)^{-1} X^T) y = H_x y, $$ where

$$ H_x = I - X(X^T X)^{-1} X^T. $$

$H_x$ is symmetric and idempotent ( $H_x H_x = H_x$ ).

print(r3.resid_pearson)
print(y - r3.predict())
0    3.818182
1   -2.363636
2    0.454545
3   -2.727273
4    2.090909
dtype: float64
[ 3.81818182 -2.36363636  0.45454545 -2.72727273  2.09090909]
XINV = np.linalg.inv(X.T @ X)
Hx = np.identity(5) - X @ XINV @ X.T
Hx @ y.T
array([-0.09459459, -1.40540541,  1.89189189, -0.81081081,  0.09459459])
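
Both claimed properties of $H_x$ can be verified numerically; a minimal check reusing the Hx just computed:

print(np.allclose(Hx, Hx.T))     # symmetry: H_x equals its transpose
print(np.allclose(Hx @ Hx, Hx))  # idempotency: applying H_x twice equals applying it once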

We do the same with scikit-learn.

from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=False)  # fix the y-intercept at 0

X = np.array([[1, 2, 3, 4, 5], [1, 0, 0, 0, 1]]).T  # .T transposes; the second column is the flyer dummy
y = np.array([[15, 20, 34, 42, 58]]).T

model.fit(X, y)
print("係数=", model.coef_)
print("y-切片=", model.intercept_)

print("残差", model.predict(X))
係数= [[10.7027027   4.39189189]]
y-切片= 0.0
残差 [[15.09459459]
 [21.40540541]
 [32.10810811]
 [42.81081081]
 [57.90540541]]

The tips example

import seaborn as sns

tips = sns.load_dataset("tips")
tips.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
model = smf.glm("tip ~ total_bill + sex +smoker + day + size", tips)
# model = smf.glm('tip ~ total_bill + size ', tips)
# model = smf.glm('tip ~ total_bill', tips)
r = model.fit()
print(r.summary2())
               Results: Generalized linear model
===============================================================
Model:              GLM              AIC:            710.9797  
Link Function:      identity         BIC:            -1050.7808
Dependent Variable: tip              Log-Likelihood: -347.49   
Date:               2022-06-08 07:43 LL-Null:        -452.21   
No. Observations:   244              Deviance:       246.55    
Df Model:           7                Pearson chi2:   247.      
Df Residuals:       236              Scale:          1.0447    
Method:             IRLS                                       
---------------------------------------------------------------
                  Coef.  Std.Err.    z    P>|z|   [0.025 0.975]
---------------------------------------------------------------
Intercept         0.5917   0.2552  2.3181 0.0204  0.0914 1.0919
sex[T.Female]     0.0324   0.1413  0.2296 0.8184 -0.2445 0.3094
smoker[T.No]      0.0849   0.1459  0.5815 0.5609 -0.2012 0.3709
day[T.Fri]        0.1198   0.2786  0.4300 0.6672 -0.4262 0.6657
day[T.Sat]       -0.0261   0.1752 -0.1490 0.8815 -0.3695 0.3173
day[T.Sun]        0.0701   0.1815  0.3862 0.6993 -0.2856 0.4258
total_bill        0.0943   0.0095  9.9218 0.0000  0.0757 0.1129
size              0.1770   0.0891  1.9865 0.0470  0.0024 0.3516
===============================================================

sns.jointplot(x="tip", y="total_bill", data=tips, kind="reg")
<seaborn.axisgrid.JointGrid at 0x7f9e984b0f10>
tips["predict"] = r.predict()
tips["residual"] = tips["predict"] - tips["tip"]
tips.head()
   total_bill   tip     sex smoker  day    time  size   predict  residual
0       16.99  1.01  Female     No  Sun  Dinner     2  2.735243  1.725243
1       10.34  1.66    Male     No  Sun  Dinner     3  2.252699  0.592699
2       21.01  3.50    Male     No  Sun  Dinner     3  3.258888 -0.241112
3       23.68  3.31    Male     No  Sun  Dinner     2  3.333670  0.023670
4       24.59  3.61  Female     No  Sun  Dinner     4  3.805931  0.195931
sns.jointplot(x="total_bill", y="predict", data=tips, kind="reg")
<seaborn.axisgrid.JointGrid at 0x7f9eb89aba90>

Using scikit-learn

Machine learning models can only handle numbers, so this takes a little more work.

from sklearn.linear_model import LinearRegression
import pandas as pd

model = LinearRegression()

# Convert the categorical columns to numbers with dummy (one-hot) columns.
tips = pd.get_dummies(
    tips, columns=["sex", "smoker", "day", "time"], drop_first=True
)  # drop_first omits the redundant first categories (Male, smoker=Yes, Thur, Lunch).

X = tips[
    [
        "total_bill",
        "size",
        "sex_Female",
        "smoker_No",
        "day_Fri",
        "day_Sat",
        "day_Sun",
        "time_Dinner",
    ]
]
y = tips["tip"]
model.fit(X, y)
print("係数=", model.coef_)
print("y-切片=", model.intercept_)

print("残差自乗和=", sum((y - model.predict(X)) ** 2))
係数= [ 0.09448701  0.175992    0.03244094  0.08640832  0.1622592   0.04080082
  0.13677854 -0.0681286 ]
y-切片= 0.5908374259513756
残差自乗和= 246.52626893909175

The Anscombe data (seeing is believing)

Four x-y datasets that share the same ordinary statistical properties (mean, variance, correlation, regression line) yet are completely different.

import seaborn as sns

anscombe = sns.load_dataset("anscombe")
anscombe.head()
  dataset     x     y
0       I  10.0  8.04
1       I   8.0  6.95
2       I  13.0  7.58
3       I   9.0  8.81
4       I  11.0  8.33
test = {}
test[0] = anscombe[anscombe.dataset == "I"]
test[1] = anscombe[anscombe.dataset == "II"]
test[2] = anscombe[anscombe.dataset == "III"]
test[3] = anscombe[anscombe.dataset == "IV"]

# grouped = anscombe.groupby('dataset')  # grouping like this also works
for i in range(4):
    print(test[i].describe())
               x          y
count  11.000000  11.000000
mean    9.000000   7.500909
std     3.316625   2.031568
min     4.000000   4.260000
25%     6.500000   6.315000
50%     9.000000   7.580000
75%    11.500000   8.570000
max    14.000000  10.840000
               x          y
count  11.000000  11.000000
mean    9.000000   7.500909
std     3.316625   2.031657
min     4.000000   3.100000
25%     6.500000   6.695000
50%     9.000000   8.140000
75%    11.500000   8.950000
max    14.000000   9.260000
               x          y
count  11.000000  11.000000
mean    9.000000   7.500000
std     3.316625   2.030424
min     4.000000   5.390000
25%     6.500000   6.250000
50%     9.000000   7.110000
75%    11.500000   7.980000
max    14.000000  12.740000
               x          y
count  11.000000  11.000000
mean    9.000000   7.500909
std     3.316625   2.030579
min     8.000000   5.250000
25%     8.000000   6.170000
50%     8.000000   7.040000
75%     8.000000   8.190000
max    19.000000  12.500000
for i in range(4):
    # model1 = sm.OLS(test[i].y, test[i].x)
    model1 = smf.glm("y ~ x ", {"y": test[i].y, "x": test[i].x})
    r1 = model1.fit()
    print(r1.summary2())
             Results: Generalized linear model
============================================================
Model:              GLM              AIC:            37.6814
Link Function:      identity         BIC:            -7.8184
Dependent Variable: y                Log-Likelihood: -16.841
Date:               2022-06-08 07:43 LL-Null:        -25.939
No. Observations:   11               Deviance:       13.763 
Df Model:           1                Pearson chi2:   13.8   
Df Residuals:       9                Scale:          1.5292 
Method:             IRLS                                    
-------------------------------------------------------------
             Coef.   Std.Err.    z     P>|z|   [0.025  0.975]
-------------------------------------------------------------
Intercept    3.0001    1.1247  2.6673  0.0076  0.7956  5.2046
x            0.5001    0.1179  4.2415  0.0000  0.2690  0.7312
============================================================

             Results: Generalized linear model
============================================================
Model:              GLM              AIC:            37.6922
Link Function:      identity         BIC:            -7.8048
Dependent Variable: y                Log-Likelihood: -16.846
Date:               2022-06-08 07:43 LL-Null:        -25.933
No. Observations:   11               Deviance:       13.776 
Df Model:           1                Pearson chi2:   13.8   
Df Residuals:       9                Scale:          1.5307 
Method:             IRLS                                    
-------------------------------------------------------------
             Coef.   Std.Err.    z     P>|z|   [0.025  0.975]
-------------------------------------------------------------
Intercept    3.0009    1.1253  2.6668  0.0077  0.7954  5.2065
x            0.5000    0.1180  4.2386  0.0000  0.2688  0.7312
============================================================

             Results: Generalized linear model
============================================================
Model:              GLM              AIC:            37.6762
Link Function:      identity         BIC:            -7.8249
Dependent Variable: y                Log-Likelihood: -16.838
Date:               2022-06-08 07:43 LL-Null:        -25.928
No. Observations:   11               Deviance:       13.756 
Df Model:           1                Pearson chi2:   13.8   
Df Residuals:       9                Scale:          1.5285 
Method:             IRLS                                    
-------------------------------------------------------------
             Coef.   Std.Err.    z     P>|z|   [0.025  0.975]
-------------------------------------------------------------
Intercept    3.0025    1.1245  2.6701  0.0076  0.7985  5.2064
x            0.4997    0.1179  4.2394  0.0000  0.2687  0.7308
============================================================

             Results: Generalized linear model
============================================================
Model:              GLM              AIC:            37.6652
Link Function:      identity         BIC:            -7.8386
Dependent Variable: y                Log-Likelihood: -16.833
Date:               2022-06-08 07:43 LL-Null:        -25.938
No. Observations:   11               Deviance:       13.742 
Df Model:           1                Pearson chi2:   13.7   
Df Residuals:       9                Scale:          1.5269 
Method:             IRLS                                    
-------------------------------------------------------------
             Coef.   Std.Err.    z     P>|z|   [0.025  0.975]
-------------------------------------------------------------
Intercept    3.0017    1.1239  2.6708  0.0076  0.7989  5.2046
x            0.4999    0.1178  4.2430  0.0000  0.2690  0.7308
============================================================

sns.lmplot(x="x", y="y", col="dataset", data=anscombe)

Logistic regression with the Titanic data

import seaborn as sns

titanic = sns.load_dataset("titanic")
titanic.head()
print(len(titanic))
891
mod = smf.glm(
    formula="survived ~ sex +pclass ", data=titanic, family=sm.families.Binomial()
)
res = mod.fit()
print(res.summary2())
titanic["predict"] = res.predict()
               Results: Generalized linear model
===============================================================
Model:              GLM              AIC:            833.1956  
Link Function:      Logit            BIC:            -5204.4062
Dependent Variable: survived         Log-Likelihood: -413.60   
Date:               2022-06-08 07:43 LL-Null:        -593.33   
No. Observations:   891              Deviance:       827.20    
Df Model:           2                Pearson chi2:   911.      
Df Residuals:       888              Scale:          1.0000    
Method:             IRLS                                       
---------------------------------------------------------------
                Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
---------------------------------------------------------------
Intercept       3.2946   0.2974  11.0769 0.0000  2.7117  3.8776
sex[T.male]    -2.6434   0.1838 -14.3799 0.0000 -3.0037 -2.2831
pclass         -0.9606   0.1061  -9.0572 0.0000 -1.1684 -0.7527
===============================================================

titanic.head()
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone   predict
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False  0.097052
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False  0.911661
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True  0.601803
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False  0.911661
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True  0.097052
print("正解率", sum((titanic.predict > 0.5) == titanic.survived) / len(titanic))
正解率 0.7867564534231201
sns.lmplot(x="fare", y="predict", col="sex", data=titanic)
<seaborn.axisgrid.FacetGrid at 0x7f9ec89ba100>
titanic.dropna(subset=["age"], inplace=True)
mod = smf.glm(
    formula="survived ~ pclass + sex + age", data=titanic, family=sm.families.Binomial()
)
res = mod.fit()
print(res.summary2())
titanic["predict"] = res.predict()
               Results: Generalized linear model
===============================================================
Model:              GLM              AIC:            655.2909  
Link Function:      Logit            BIC:            -4018.0360
Dependent Variable: survived         Log-Likelihood: -323.65   
Date:               2022-06-08 07:43 LL-Null:        -482.26   
No. Observations:   714              Deviance:       647.29    
Df Model:           3                Pearson chi2:   768.      
Df Residuals:       710              Scale:          1.0000    
Method:             IRLS                                       
---------------------------------------------------------------
                Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
---------------------------------------------------------------
Intercept       5.0560   0.5021  10.0692 0.0000  4.0719  6.0402
sex[T.male]    -2.5221   0.2073 -12.1676 0.0000 -2.9284 -2.1159
pclass         -1.2885   0.1393  -9.2528 0.0000 -1.5615 -1.0156
age            -0.0369   0.0076  -4.8415 0.0000 -0.0519 -0.0220
===============================================================

print("正解率", sum((titanic.predict > 0.5) == titanic.survived) / len(titanic))
正解率 0.788515406162465
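
The fitted model can also score passengers it has never seen; a sketch with a made-up passenger (the values below are purely illustrative):

import pandas as pd

# A hypothetical new passenger: a 30-year-old woman in second class.
new_passenger = pd.DataFrame({"pclass": [2], "sex": ["female"], "age": [30.0]})
print(res.predict(new_passenger))  # estimated survival probability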

Generalized linear models (Poisson regression)

Number of executions by state in the US

Number of Observations - 17

Number of Variables - 7

Variable name definitions:

EXECUTIONS - Executions in 1996
INCOME - Median per capita income in 1996 dollars
PERPOVERTY - Percent of the population classified as living in poverty
PERBLACK - Percent of black citizens in the population
VC100k96 - Rate of violent crimes per 100,000 residents for 1996
SOUTH - SOUTH == 1 indicates a state in the South
DEGREE - An estimate of the proportion of the state population with a
    college degree of some kind

State names are included in the data file, though not returned by load.

data = sm.datasets.cpunish.load()  # load the statsmodels dataset on executions
df = pd.DataFrame(data.data)
df.head()
   EXECUTIONS   INCOME  PERPOVERTY  PERBLACK  VC100k96  SOUTH  DEGREE
0        37.0  34453.0        16.7      12.2     644.0    1.0    0.16
1         9.0  41534.0        12.5      20.0     351.0    1.0    0.27
2         6.0  35802.0        10.6      11.2     591.0    0.0    0.21
3         4.0  26954.0        18.4      16.1     524.0    1.0    0.16
4         3.0  31468.0        14.8      25.9     565.0    1.0    0.19
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x7f9e89aef0a0>
r = smf.glm(
    "EXECUTIONS ~ INCOME + PERPOVERTY + PERBLACK + VC100k96 + SOUTH + DEGREE",
    df,
    family=sm.families.Poisson(),
).fit()
df["predict"] = r.predict(df)
r.summary2()
sns.jointplot(x="EXECUTIONS", y="predict", data=df, kind="reg")
<seaborn.axisgrid.JointGrid at 0x7f9ec889f100>
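
Since the summary table itself is not reproduced above, one quick way to read a Poisson fit: the exponentiated coefficients act multiplicatively on the expected count. A minimal sketch using the fitted result r:

print(np.exp(r.params))  # rate ratios: multiplicative change in expected executions per unit increase of each variable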

Principal component analysis

A short explanation in Japanese:

http://dev.classmethod.jp/statistics/pythonscikit-learn-pca1/

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

%matplotlib inline
X = np.array([[1, 2, 3, 4, 5], [15, 20, 34, 42, 58]]).T

pca = PCA(n_components=1)
print(pca.fit_transform(X))
print(pca.get_covariance())
print(pca.explained_variance_ratio_)
s = pca.components_[0]  # direction of the first principal component
print(s)

plt.plot(X.T[0], X.T[1], "o")
plt.plot(X.T[0], np.mean(X.T[1]) + s[1] / s[0] * (X.T[0] - np.mean(X.T[0])), "r")
[[-18.90367323]
 [-13.83402242]
 [  0.19919026]
 [  8.25669503]
 [ 24.28181035]]
[[  2.5  27. ]
 [ 27.  299.2]]
[0.99979122]
[0.08989421 0.99595132]
[<matplotlib.lines.Line2D at 0x7f9e98895d00>]

Verification by hand calculation

H = np.identity(5) - (1 / 5) * np.ones((5, 5))  # H makes every column centered
XC = H @ X
print("centering: \n  H X =", XC)
print("variance= \n", np.cov(XC.T))
print(np.cov(X.T))
COV = (XC.T @ XC) / 4
print(COV)
centering: 
  H X = [[-2.00000000e+00 -1.88000000e+01]
 [-1.00000000e+00 -1.38000000e+01]
 [-5.55111512e-17  2.00000000e-01]
 [ 1.00000000e+00  8.20000000e+00]
 [ 2.00000000e+00  2.42000000e+01]]
variance= 
 [[  2.5  27. ]
 [ 27.  299.2]]
[[  2.5  27. ]
 [ 27.  299.2]]
[[  2.5  27. ]
 [ 27.  299.2]]
lambda_, u = np.linalg.eig(COV)  # eigenvalue decomposition
print(lambda_)
print(u)
[6.29896178e-02 3.01637010e+02]
[[-0.99595132 -0.08989421]
 [ 0.08989421 -0.99595132]]
u[:, np.argmax(lambda_)]  # eigenvector (a column of u) corresponding to the largest eigenvalue
array([-0.08989421, -0.99595132])
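
As a cross-check (a sketch reusing XC, lambda_, and u from the cells above), projecting the centered data onto this eigenvector reproduces the pca.fit_transform(X) output, up to an overall sign:

top = u[:, np.argmax(lambda_)]
print(XC @ top)  # matches the fit_transform values above, up to sign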

Principal component analysis of crime data for the US states

Number of observations: 51

Number of variables: 8

Variable name definitions:

  • state

    All 50 states plus DC.

  • violent

    Rate of violent crimes / 100,000 population. Includes murder, forcible rape, robbery, and aggravated assault. Numbers for Illinois and Minnesota do not include forcible rapes. Footnote included with the American Statistical Abstract table reads: "The data collection methodology for the offense of forcible rape used by the Illinois and the Minnesota state Uniform Crime Reporting (UCR) Programs (with the exception of Rockford, Illinois, and Minneapolis and St. Paul, Minnesota) does not comply with national UCR guidelines. Consequently, their state figures for forcible rape and violent crime (of which forcible rape is a part) are not published in this table."

  • murder

    Rate of murders / 100,000 population.

  • hs_grad

    Percent of population having graduated from high school or higher.

  • poverty

    % of individuals below the poverty line

  • white

    Percent of population that is one race - white only. From 2009 American Community Survey

  • single

    Calculated from 2009 1-year American Community Survey obtained from Census. Variable is Male householder, no wife present, family household combined with Female householder, no husband present, family household, divided by the total number of Family households.

  • urban

    % of population in Urbanized Areas as of 2010 Census. Urbanized Areas are areas of 50,000 or more people.

data = sm.datasets.statecrime.load()
df = pd.DataFrame(data.data)
df.head()
            violent  murder  hs_grad  poverty  single  white  urban
state
Alabama       459.9     7.1     82.1     17.5    29.0   70.0  48.65
Alaska        632.6     3.2     91.4      9.0    25.5   68.3  44.46
Arizona       423.2     5.5     84.2     16.5    25.7   80.0  80.07
Arkansas      530.3     6.3     82.4     18.8    26.3   78.4  39.54
California    473.4     5.4     80.6     14.2    27.8   62.7  89.73
pca = PCA(n_components=2)

data = df.iloc[:, 1:]  # state is the index here, so this drops the first column (violent)

data.head()

data -= data.mean()  # standardize each column:
data /= data.std()   # zero mean, unit standard deviation
pca.fit(data)  # PCA on the six standardized variables
PCA(n_components=2)
print(pca.explained_variance_ratio_)
print(pca.components_)  # PC1 loads on murder, poverty, and single (against hs_grad and white); PC2 contrasts urban and hs_grad with poverty and white
x = pca.transform(data)  # project the data onto the two principal components
[0.56167072 0.23706556]
[[ 0.47735601 -0.4009687   0.35551947  0.51577612 -0.40838754  0.22837656]
 [ 0.01344782  0.34643987 -0.59436752  0.11474222 -0.34611372  0.62734786]]
# import networkx as nx
# pos ={}
# G=nx.Graph()
# for i, name in  enumerate(df.state):
#     pos[name] = (x[i,0],x[i,1])
#     G.add_node(name)

# nx.draw(G,pos=pos, node_size=5, with_labels=True, font_size=8)
# #plt.savefig("statecrime.pdf")