Preface
These notes are intended for review only; they give just the key algorithm descriptions and proofs.
MLE/MAP
Given $N$ sample pairs $\{(X^{(i)}, y^{(i)}), i = 1, \cdots, N\}$, where $y \in \{C_k, k = 1, \cdots, K\}$, we want to estimate the parameter $\theta$ of a parametric model $P(X \mid \theta)$ so that it best describes the distribution of the given data.
Maximum Likelihood Estimation (MLE)
$$
\begin{aligned}
\text{Objective:} \quad & \hat{\theta} = \arg\max_\theta P(D \mid \theta) \\
\text{Definition:} \quad & L(D \mid \theta) = P(D \mid \theta) = \prod_i P(X^{(i)} \mid \theta) \\
\text{Take logs:} \quad & \log L(D \mid \theta) = \sum_i \log P(X^{(i)} \mid \theta) \\
\text{Set the derivative to zero:} \quad & \frac{\partial}{\partial \theta} \log L(D \mid \theta) = 0 \Rightarrow \hat{\theta}
\end{aligned}
$$
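As a minimal sketch of this recipe (my own illustration, not from the original notes), the MLE of a 1-D Gaussian $N(\mu, \sigma^2)$ has the closed form obtained by setting the log-likelihood derivative to zero:

```python
import numpy as np

# Sketch: MLE for a 1-D Gaussian N(mu, sigma^2).
# d/d(mu) log L = 0   gives mu_hat     = sample mean;
# d/d(s2) log L = 0   gives sigma2_hat = (biased) sample variance.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

mu_hat = data.mean()
sigma2_hat = ((data - mu_hat) ** 2).mean()
print(mu_hat, np.sqrt(sigma2_hat))  # ~2.0, ~1.5
```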
Maximum A Posteriori Estimation (MAP)
$$
\begin{aligned}
\text{Objective:} \quad & \hat{\theta} = \arg\max_\theta P(\theta \mid D) \\
\text{where:} \quad & P(\theta \mid D) = \frac{P(D \mid \theta) P(\theta)}{P(D)} \\
& P(\theta) \text{ is a given prior distribution over the parameters} \\
\text{Definition:} \quad & L(\theta \mid D) = P(D \mid \theta) P(\theta) = \prod_i P(X^{(i)} \mid \theta) \cdot P(\theta) \\
\text{Take logs:} \quad & \log L(\theta \mid D) = \sum_i \log P(X^{(i)} \mid \theta) + \log P(\theta) \\
\text{Set the derivative to zero:} \quad & \frac{\partial}{\partial \theta} \log L(\theta \mid D) = 0 \Rightarrow \hat{\theta}
\end{aligned}
$$

(Since $P(D)$ does not depend on $\theta$, it is dropped from $L$.)
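A small sketch of the MAP recipe (again my own illustration): for a Gaussian likelihood with known $\sigma$ and a Gaussian prior $\mu \sim N(\mu_0, \tau^2)$, setting the derivative of $\log L(\theta \mid D)$ to zero gives the usual shrinkage estimate:

```python
import numpy as np

# Sketch: MAP for the mean of N(mu, sigma^2) with prior mu ~ N(mu0, tau^2).
# log L(mu|D) = sum_i log N(x_i|mu, sigma^2) + log N(mu|mu0, tau^2).
rng = np.random.default_rng(0)
sigma, mu0, tau = 1.5, 0.0, 1.0
data = rng.normal(loc=2.0, scale=sigma, size=20)

n = data.size
mu_map = (tau**2 * data.sum() + sigma**2 * mu0) / (n * tau**2 + sigma**2)
print(mu_map)  # pulled from the sample mean toward the prior mean mu0
```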
Linear Regression / Logistic Regression
Given $N$ sample pairs $\{(X^{(i)}, y^{(i)}), i = 1, \cdots, N\}$, write the sample matrix as $X_{N \times n}$.
Linear Regression
$$
\begin{aligned}
\text{Labels:} \quad & y \in \mathbb{R}, \quad \text{model: } \hat{y}_{1\times 1} = w_{n \times 1}^T x_{n \times 1} + b \\
\text{After augmentation:} \quad & \hat{y}_{1\times 1} = w_{(n+1) \times 1}^T x_{(n+1) \times 1}, \quad \begin{cases} w_1 = b \\ x_1 = 1 \end{cases} \\
\text{With MSE as the loss:} \quad & L(\hat{y}, y) = \frac{1}{N} \sum_{i=1}^N \frac{1}{2} (\hat{y}^{(i)} - y^{(i)})^2 \\
\text{Gradient:} \quad & \frac{\partial L}{\partial w_j} =
\frac{1}{N} \sum_{i=1}^N (\hat{y}^{(i)} - y^{(i)}) \frac{\partial \hat{y}^{(i)}}{\partial w_j} =
\frac{1}{N} \sum_{i=1}^N (\hat{y}^{(i)} - y^{(i)}) x^{(i)}_j \Rightarrow \\
\text{Gradient descent:} \quad & w_j := w_j - \alpha \frac{\partial L}{\partial w_j}
\end{aligned}
$$
In matrix form:
$$
\begin{aligned}
\left.\begin{aligned}
\text{Labels:} \quad & Y \in \mathbb{R}^{N} \\
\text{Model:} \quad & \hat{Y}_{N \times 1} = X_{N \times (n + 1)} w_{(n + 1) \times 1} \\
\text{Loss:} \quad & L(\hat{Y}, Y) = \frac{1}{N} \cdot \frac{1}{2} || \hat{Y} - Y ||_2^2 =
\frac{1}{N} \cdot \frac{1}{2} (\hat{Y} - Y)^T(\hat{Y} - Y)
\end{aligned}\right\} \Rightarrow \\
L(\hat{Y}, Y) = \frac{1}{2 N} (w^T X^T X w - 2 Y^T X w + Y^T Y) \\
\text{Gradient:} \quad \frac{\partial L}{\partial w} = \frac{1}{2N} (2 X^T X w - 2 X^T Y) = \frac{1}{N} (X^T X w - X^T Y) = 0 \Rightarrow \\
\begin{cases}
\text{Gradient descent:} \quad & w := w - \alpha \frac{\partial L}{\partial w} \\
\text{Closed form:} \quad & \hat{w}^* = \underbrace{(X^T X + \lambda I)^{-1} X^T}_{X^+ \text{ when } \lambda = 0} Y
\end{cases}
\end{aligned}
$$

Here the ridge term $\lambda I$ ($\lambda \geq 0$) keeps $X^T X + \lambda I$ invertible; with $\lambda = 0$ this reduces to the pseudo-inverse $X^+ = (X^T X)^{-1} X^T$.
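A minimal numeric sketch of the matrix solution above (my own illustration): augment $X$ with a column of ones and solve the normal equations directly.

```python
import numpy as np

# Sketch: closed-form linear regression w = (X^T X + lambda I)^{-1} X^T Y.
# A tiny lambda keeps the system well conditioned; lambda=0 is the pseudo-inverse.
rng = np.random.default_rng(0)
N, n = 200, 3
X = rng.normal(size=(N, n))
true_w, true_b = np.array([1.0, -2.0, 0.5]), 0.7
Y = X @ true_w + true_b + 0.1 * rng.normal(size=N)

X_aug = np.hstack([np.ones((N, 1)), X])  # x_1 = 1 absorbs the bias b
lam = 1e-8
w_hat = np.linalg.solve(X_aug.T @ X_aug + lam * np.eye(n + 1), X_aug.T @ Y)
print(w_hat)  # ~[0.7, 1.0, -2.0, 0.5]
```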
Logistic Regression (LR)
$$
\begin{aligned}
\text{Labels:} \quad & y \in \{0, 1\} \\
\text{Model:} \quad & \begin{cases} \hat{y} = \sigma(z) \\ z = w^T x + b \end{cases}
\quad \text{where } \sigma(z) = \frac{1}{1 + \exp(-z)} \\
\text{Bernoulli label model:} \quad & P(y \mid X) = (1 - \hat{y})^{1 - y} \, \hat{y}^{\,y} \quad (\hat{y}^{(i)} \text{ is the quantity being estimated}) \\
\text{MLE:} \quad & L(D \mid w) = \prod_i P(y^{(i)} \mid X^{(i)}) \Rightarrow
\log L(D \mid w) = \sum_i \log P(y^{(i)} \mid X^{(i)}) \\
\text{Objective:} \quad & \hat{w} = \arg\max L(D \mid w) = \arg\max \log L(D \mid w) \\
\text{Derivative:} \quad & \begin{aligned}
\frac{\partial L}{\partial w_j} & = \frac{\partial}{\partial w_j} \sum_i \log (1 - \hat{y}^{(i)})^{1 - y^{(i)}} (\hat{y}^{(i)})^{y^{(i)}} \\
& = \frac{\partial}{\partial w_j} \sum_i (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) + \frac{\partial}{\partial w_j} \sum_i y^{(i)} \log \hat{y}^{(i)} \\
& = \sum_i (1 - y^{(i)}) \frac{1}{1 - \hat{y}^{(i)}} \left(- \frac{\partial \hat{y}^{(i)}}{\partial w_j}\right) +
\sum_i y^{(i)} \frac{1}{\hat{y}^{(i)}} \frac{\partial \hat{y}^{(i)}}{\partial w_j}
\end{aligned} \\
\text{where:} \quad & \frac{\partial \hat{y}^{(i)}}{\partial w_j} = \sigma'(z^{(i)}) \frac{\partial z^{(i)}}{\partial w_j} = \sigma(z^{(i)}) (1 - \sigma(z^{(i)})) x^{(i)}_j \Rightarrow \\
& \frac{\partial L}{\partial w_j} = - \sum_i (1 - y^{(i)}) \, \hat{y}^{(i)} x^{(i)}_j + \sum_i y^{(i)} (1 - \hat{y}^{(i)}) x^{(i)}_j
= \sum_i (y^{(i)} - \hat{y}^{(i)}) x^{(i)}_j \Rightarrow \\
\text{Gradient ascent:} \quad & w_j := w_j + \alpha \frac{\partial L}{\partial w_j} \quad \text{(ascent, since } \log L \text{ is maximized; equivalently, descent on } -\log L)
\end{aligned}
$$
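A hedged sketch of the update above (my own illustration): batch gradient ascent on the log-likelihood, using $\nabla_j = \sum_i (y^{(i)} - \hat{y}^{(i)}) x^{(i)}_j$.

```python
import numpy as np

# Sketch: logistic regression by gradient ascent on the log-likelihood.
rng = np.random.default_rng(0)
N, n = 500, 2
X = rng.normal(size=(N, n))
y = (X @ np.array([2.0, -1.0]) + 0.3 > 0).astype(float)

X_aug = np.hstack([np.ones((N, 1)), X])  # bias absorbed as w_1
w = np.zeros(n + 1)
alpha = 0.1
for _ in range(2000):
    y_hat = 1.0 / (1.0 + np.exp(-(X_aug @ w)))
    w += alpha * X_aug.T @ (y - y_hat) / N  # ascend the log-likelihood
print(w)  # direction roughly proportional to [0.3, 2.0, -1.0]
```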
Naive Bayes
Given $N$ sample pairs $\{(X^{(i)}, y^{(i)}), i = 1, \cdots, N\}$, where $y \in \{C_k, k = 1, \cdots, K\}$.
$$
\begin{aligned}
\text{Model as a conditional distribution:} \quad & P(Y \mid X) \\
\text{Bayes' rule:} \quad & P(Y \mid X) = \frac{P(X \mid Y) P(Y)}{P(X)} \\
\text{Terminology:} \quad & \begin{cases}
\text{posterior:} & P(Y \mid X) \\
\text{likelihood:} & P(X \mid Y) = \prod_{j=1}^n P(X_j \mid Y) \quad \text{(the naive conditional-independence assumption)} \\
\text{prior:} & P(Y) \\
\text{evidence:} & P(X) = \sum_k P(X \mid Y = C_k) P(Y = C_k)
\end{cases} \\
\hat{y} & = \arg\max_k P(X \mid Y = C_k) P(Y = C_k) \\
& = \arg\max_k \prod_{j=1}^n P(X_j \mid Y = C_k) P(Y = C_k)
\end{aligned}
$$
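A hedged sketch of this classifier for categorical features (my own illustration; the add-one smoothing is an assumption, not part of the notes): estimate $P(Y = C_k)$ and $P(X_j \mid Y = C_k)$ by counting, then take the arg max of the product.

```python
import numpy as np

# Sketch: categorical naive Bayes with add-one (Laplace) smoothing.
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [1, 1], [0, 1]])  # 2 binary features
y = np.array([0, 0, 1, 1, 1, 0])
K, n_feat, n_val = 2, X.shape[1], 2

prior = np.array([(y == k).mean() for k in range(K)])           # P(Y = C_k)
lik = np.ones((K, n_feat, n_val))                               # add-one smoothing
for k in range(K):
    Xk = X[y == k]
    for j in range(n_feat):
        for v in range(n_val):
            lik[k, j, v] += (Xk[:, j] == v).sum()
    lik[k] /= lik[k].sum(axis=1, keepdims=True)                 # P(X_j = v | Y = C_k)

x_new = np.array([0, 1])
scores = [prior[k] * np.prod([lik[k, j, x_new[j]] for j in range(n_feat)])
          for k in range(K)]
print(int(np.argmax(scores)))  # predicted class
```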
PCA/LDA
PCA
Given a data set of $M$ samples in $N$ dimensions $\{X_{N \times 1}^{(i)}, i = 1, \cdots, M\}$, forming the sample matrix $X_{N \times M} = \begin{bmatrix} X^{(1)} & X^{(2)} & \cdots & X^{(M)} \end{bmatrix}$, we want principal components $\beta_k, k = 1, \cdots, K$ such that the projections of the data onto each component have maximal scatter/variance.
Steps
1. Compute the covariance matrix across dimensions $\Sigma_{N \times N} = \frac{1}{M} \tilde{X} \tilde{X}^T$, where $\tilde{X}^{(i)} = X^{(i)} - \overline{X}$ and $\overline{X} = \frac{1}{M} \sum_{i=1}^{M} X^{(i)}$;
2. Eigendecompose $\Sigma$: $\Sigma \beta_k = \lambda_k \beta_k$;
3. Sort the eigenpairs $(\lambda_k, \beta_k)$ by eigenvalue $\lambda_k$ in descending order and take the top $K$ principal components as projection axes, forming the projection matrix $B_{N \times K}$;
4. Projection: $S_{K \times M} = B_{N \times K}^T X_{N \times M}$; reconstruction: $\hat{X} = B_{N \times K} S_{K \times M}$. A sketch follows.
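A hedged sketch of these steps (my own illustration; it centers the data before projecting, matching the covariance definition, and stores samples as columns per the $N \times M$ convention of these notes):

```python
import numpy as np

# Sketch: PCA via eigendecomposition of the covariance matrix.
rng = np.random.default_rng(0)
N, M, K = 5, 300, 2
X = rng.normal(size=(N, M)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])[:, None]

X_mean = X.mean(axis=1, keepdims=True)
X_tilde = X - X_mean
Sigma = X_tilde @ X_tilde.T / M              # N x N covariance
eigval, eigvec = np.linalg.eigh(Sigma)       # ascending eigenvalues
order = np.argsort(eigval)[::-1][:K]         # top-K eigenpairs
B = eigvec[:, order]                         # N x K projection matrix

S = B.T @ X_tilde                            # K x M projections
X_hat = B @ S + X_mean                       # reconstruction
print(eigval[order], np.linalg.norm(X - X_hat))
```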
Proof
The first principal component
The objective is
$$
\begin{aligned}
\beta_1 & = \arg \max ||S_1||_2^2 \\
\text{s.t.} & \quad ||\beta_1||_2^2 = 1
\end{aligned}
$$
Then
$$
\begin{aligned}
\left. \begin{aligned}
\left. \begin{aligned}
||S_1||_2^2 & = S_1^T S_1 \\
S_1 & = X^T \beta_1
\end{aligned} \right\} \Rightarrow
||S_1||_2^2 = \beta_1^T \underbrace{X X^T}_C \beta_1 \\
C = X X^T = W \Lambda W^T
\end{aligned} \right\} \Rightarrow \\
\left. \begin{aligned}
||S_1||_2^2 = \beta_1^T W \Lambda \underbrace{W^T \beta_1}_{\alpha_1} = \alpha_1^T \Lambda \alpha_1 = \sum_{i=1}^N \lambda_i \alpha_{1i}^2 \leq \lambda_1 \sum_{i=1}^N \alpha_{1i}^2 \\
\beta_1^T \beta_1 = \alpha_1^T W^T W \alpha_1 = \alpha_1^T \alpha_1 = \sum_{i=1}^N \alpha_{1i}^2 = 1 \quad \text{(unit-norm constraint)}
\end{aligned} \right\} \Rightarrow \\
||S_1||_2^2 \leq \lambda_1, \quad \text{so to maximize } ||S_1||_2^2 \text{ take} \\
\begin{cases}
\alpha_{11} = 1 \\
\alpha_{1i} = 0, \ i = 2, 3, \cdots, N
\end{cases} \Rightarrow
\beta_1 = W \alpha_1 = w_1
\end{aligned}
$$
The $r$-th principal component ($r > 1$)
The objective is
$$
\begin{aligned}
\beta_r & = \arg \max ||S_r||_2^2 \\
\text{s.t.} & \quad \beta_r^T \beta_i = 0, \ i = 1, \cdots, r - 1 \\
& \quad ||\beta_r||_2^2 = 1
\end{aligned}
$$
Then
$$
\begin{aligned}
\left. \begin{aligned}
\left. \begin{aligned}
||S_r||_2^2 & = S_r^T S_r \\
S_r & = X^T \beta_r
\end{aligned} \right\} \Rightarrow
||S_r||_2^2 = \beta_r^T \underbrace{X X^T}_C \beta_r \\
C = X X^T = W \Lambda W^T
\end{aligned} \right\} \Rightarrow \\
\left. \begin{aligned}
||S_r||_2^2 = \beta_r^T W \Lambda \underbrace{W^T \beta_r}_{\alpha_r} = \sum_{i=1}^N \lambda_i \alpha_{ri}^2 \\
\beta_r^T \beta_i = (W \alpha_r)^T w_i = \alpha_{ri} = 0, \ i = 1, \cdots, r - 1 \quad \text{(orthogonality constraints)} \\
\beta_r^T \beta_r = \alpha_r^T W^T W \alpha_r = \alpha_r^T \alpha_r = \sum_{i=1}^N \alpha_{ri}^2 = 1 \quad \text{(unit-norm constraint)}
\end{aligned} \right\} \Rightarrow \\
||S_r||_2^2 = \sum_{i=r}^N \lambda_i \alpha_{ri}^2 \leq \lambda_r, \quad \text{so to maximize } ||S_r||_2^2 \text{ take} \\
\begin{cases}
\alpha_{rr} = 1 \\
\alpha_{ri} = 0, \ i \neq r
\end{cases} \Rightarrow
\beta_r = W \alpha_r = w_r
\end{aligned}
$$
LDA
Given $N$ sample pairs $\{(X^{(i)}, y^{(i)}), i = 1, \cdots, N\}$, where $y \in \{C_k, k = 1, \cdots, K\}$, write the sample matrix as $X_{N \times n}$. Using the class labels, we seek projection axes $u$ such that after projection the within-class scatter is small and the between-class scatter is large.
Definitions:
$$
\begin{cases}
\text{overall mean:} & \mu = \frac{1}{N} \sum_{i=1}^N X^{(i)} \\
\text{class means:} & \mu_k = \frac{1}{N_k} \sum_{i: y^{(i)} = C_k} X^{(i)} \\
\text{within-class scatter:} & S_{W, n \times n} = \sum_k \frac{N_k}{N} \left[
\frac{1}{N_k} \sum_{i: y^{(i)} = C_k} (X^{(i)} - \mu_k) (X^{(i)} - \mu_k)^T
\right] \\
\text{between-class scatter:} & S_{B, n \times n} = \sum_k \frac{N_k}{N} \left[
(\mu_k - \mu) (\mu_k - \mu)^T
\right]
\end{cases}
$$
Steps
1. Compute the within-class / between-class scatter matrices $S_W$ / $S_B$;
2. Compute the eigenpairs $(\lambda_i, u_i)$ of $S_W^{-1} S_B$;
3. Sort the eigenpairs by eigenvalue in descending order and take the eigenvectors with the largest eigenvalues as projection axes, forming the projection matrix $U_{n \times m}$;
4. Project onto the axes: $\hat{X}_{N \times m} = X_{N \times n} U_{n \times m}$. A sketch follows.
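A hedged sketch of these steps for two classes and one axis (my own illustration; rows of $X$ are samples, per the $N \times n$ convention here):

```python
import numpy as np

# Sketch: LDA via the eigenpairs of S_W^{-1} S_B.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=0.5, size=(100, 2))
X1 = rng.normal(loc=[2, 1], scale=0.5, size=(100, 2))
X, y = np.vstack([X0, X1]), np.r_[np.zeros(100), np.ones(100)]

mu = X.mean(axis=0)
S_W = np.zeros((2, 2))
S_B = np.zeros((2, 2))
for k in (0, 1):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)
    S_W += (len(Xk) / len(X)) * np.cov(Xk.T, bias=True)  # within-class term
    d = (mu_k - mu)[:, None]
    S_B += (len(Xk) / len(X)) * (d @ d.T)                # between-class term

eigval, eigvec = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
U = eigvec[:, [np.argmax(eigval.real)]].real             # top axis, n x 1
print((X @ U)[:3].ravel())                               # projected samples
```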
Proof
$$
\begin{aligned}
\text{Projecting a sample } X^{(i)} \text{ onto the first axis } u_1 \text{ gives } \tilde{X}^{(i)} = u_1^T X^{(i)}; \text{ in the projected space:} \\
\left.\begin{aligned}
\tilde{X}^{(i)} & = u_1^T X^{(i)}, \quad \tilde{\mu} = u_1^T \mu, \quad \tilde{\mu}_k = u_1^T \mu_k \\
\tilde{S}_{W, 1 \times 1} & = \sum_k \frac{N_k}{N} \left[
\frac{1}{N_k} \sum_i (\tilde{X}^{(i)} - \tilde{\mu}_k) (\tilde{X}^{(i)} - \tilde{\mu}_k)^T
\right] \\
\tilde{S}_{B, 1 \times 1} & = \sum_k \frac{N_k}{N} \left[
(\tilde{\mu}_k - \tilde{\mu}) (\tilde{\mu}_k - \tilde{\mu})^T
\right]
\end{aligned}\right\} \Rightarrow
\begin{cases}
\tilde{S}_W = u_1^T S_W u_1 \\
\tilde{S}_B = u_1^T S_B u_1
\end{cases} \\
\text{Objective:} \quad u_1 = \arg\min \frac{\tilde{S}_W}{\tilde{S}_B} = \arg\min \frac{u_1^T S_W u_1}{u_1^T S_B u_1} \\
\text{Extremum:} \quad \frac{\partial}{\partial u_1} \frac{u_1^T S_W u_1}{u_1^T S_B u_1} = \frac{(u_1^T S_B u_1)(2 S_W u_1) - (u_1^T S_W u_1)(2 S_B u_1)}{(u_1^T S_B u_1)^2} = 0 \Rightarrow \\
S_B u_1 = \underbrace{\frac{u_1^T S_B u_1}{u_1^T S_W u_1}}_{\lambda_1} S_W u_1, \quad \text{i.e. } S_W^{-1} S_B u_1 = \lambda_1 u_1, \text{ with } \lambda_1 = \frac{u_1^T S_B u_1}{u_1^T S_W u_1}
\end{aligned}
$$
EM/GMM
The EM Algorithm
Given $N$ sample pairs $\{(X^{(i)}, y^{(i)}), i = 1, \cdots, N\}$, let the model be a probabilistic model $P(X \mid \theta)$ with unknown $\theta$, involving a hidden variable with $K$ states $\{w_k, k = 1, \cdots, K\}$. The derivation is summarized as follows.
$$
\begin{aligned}
\text{MLE} \Rightarrow L(D \mid \theta) = \prod_i P(X^{(i)} \mid \theta)
\Rightarrow \log L(D \mid \theta) = \sum_i \log P(X^{(i)} \mid \theta) \\
\Rightarrow \text{objective: } \theta^{(t + 1)} = \arg\max \log L(D \mid \theta) \\
\left. \begin{aligned}
P(X^{(i)} \mid \theta) = \sum_k P(X^{(i)}, w^{(i)}_k \mid \theta) \quad \text{(introduce the latent variable } w_k\text{)} \\
\frac{P(w^{(i)}_k \mid \theta^{(t)})}{P(w^{(i)}_k \mid \theta^{(t)})} = 1 \quad \text{(introduce the current iterate } \theta^{(t)}\text{)}
\end{aligned} \right\} \Rightarrow \\
\left. \begin{aligned}
\log L(D \mid \theta) = \sum_i \log \sum_k P(w^{(i)}_k \mid \theta^{(t)}) \frac{P(X^{(i)}, w^{(i)}_k \mid \theta)}{P(w^{(i)}_k \mid \theta^{(t)})} \\
\begin{cases}
\varphi(\cdot) \text{ concave (as } \log \text{ is)} \\ \sum_i w_i = 1
\end{cases} \Rightarrow \varphi\Big(\sum_i w_i x_i\Big) \geq \sum_i w_i \varphi(x_i) \quad \text{(Jensen's inequality)}
\end{aligned} \right\} \Rightarrow \\
\log L(D \mid \theta) \geq \sum_i \sum_k P(w^{(i)}_k \mid \theta^{(t)})
\log \frac{P(X^{(i)}, w^{(i)}_k \mid \theta)}{P(w^{(i)}_k \mid \theta^{(t)})} \\
= \underbrace{\sum_i \sum_k P(w^{(i)}_k \mid \theta^{(t)})
\log P(X^{(i)}, w^{(i)}_k \mid \theta)}_{E_w\left[ \log P(X^{(i)}, w^{(i)}_k \mid \theta) \right]}
\underbrace{- \sum_i \sum_k P(w^{(i)}_k \mid \theta^{(t)})
\log P(w^{(i)}_k \mid \theta^{(t)})}_{H\left[ P(w^{(i)}_k \mid \theta^{(t)}) \right], \text{ independent of } \theta} \\
\text{Let } Q(\theta \mid \theta^{(t)}) = E_w\left[ \log P(X^{(i)}, w^{(i)}_k \mid \theta) \right] \\
\Rightarrow \text{objective: } \theta^{(t + 1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)}), \text{ solved by setting the derivative of } Q(\theta \mid \theta^{(t)}) \text{ to zero.}
\end{aligned}
$$
The GMM Model
A Gaussian mixture model has the probability density
$$
P(X \mid \mu, \Sigma) = \sum_{k=1}^K \pi_k N(X \mid \mu_k, \Sigma_k)
$$
where
$$
\begin{cases}
\sum_k \pi_k = 1 \\
N(X \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}
\exp \left[
- \frac{1}{2} (X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k)
\right]
\end{cases}
$$
The parameters are estimated with the EM algorithm:
$$
\begin{aligned}
\left. \begin{aligned}
Q(\theta \mid \theta^{(t)}) = \sum_i \sum_k P(w_k^{(i)} \mid \theta^{(t)}) \log \underbrace{P(x^{(i)} \mid w_k^{(i)}, \theta) P(w_k^{(i)} \mid \theta)}_{P(x^{(i)}, w_k^{(i)} \mid \theta)} \\
\begin{cases}
P(w_k^{(i)} \mid \theta^{(t)}) =
\frac{\pi_k^{(t)} N(x^{(i)} \mid \mu_k^{(t)}, \Sigma_k^{(t)})}
{\sum_j \pi_j^{(t)} N(x^{(i)} \mid \mu_j^{(t)}, \Sigma_j^{(t)})}
= \gamma^{(i)(t)}_k \\
P(x^{(i)} \mid w_k^{(i)}, \theta) = N(x^{(i)} \mid \mu_k, \Sigma_k) \\
P(w_k^{(i)} \mid \theta) = \pi_k
\end{cases}
\end{aligned} \right\} \Rightarrow \\
Q(\theta \mid \theta^{(t)}) = \sum_i \sum_k \gamma^{(i)(t)}_k \log \pi_k N(x^{(i)} \mid \mu_k, \Sigma_k) \\
\text{Maximizing } Q \Rightarrow
\begin{cases}
\mu_k^{(t+1)} = \frac{\sum_i \gamma^{(i)(t)}_k x^{(i)}}{\sum_i \gamma^{(i)(t)}_k} \\
\Sigma_k^{(t+1)} = \frac{\sum_i \gamma^{(i)(t)}_k (x^{(i)} - \mu_k^{(t+1)}) (x^{(i)} - \mu_k^{(t+1)})^T}{\sum_i \gamma^{(i)(t)}_k} \\
\pi_k^{(t+1)} = \frac{\sum_i \gamma^{(i)(t)}_k}{N}
\end{cases}
\end{aligned}
$$
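A hedged sketch of these E/M updates (my own illustration; it assumes `scipy` is available for the Gaussian density):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch: EM for a K-component GMM.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (150, 2)),
               rng.normal([3, 3], 0.8, (150, 2))])
N, d, K = X.shape[0], X.shape[1], 2

pi = np.full(K, 1.0 / K)
mu = X[rng.choice(N, K, replace=False)]          # random init
Sigma = np.stack([np.eye(d)] * K)

for _ in range(50):
    # E-step: responsibilities gamma[i, k] = P(w_k | x_i, theta_t)
    dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                     for k in range(K)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizers of Q(theta | theta_t)
    Nk = gamma.sum(axis=0)
    mu = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        Xc = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]
    pi = Nk / N
print(pi, mu)
```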
SVM
KKT Conditions
$$
\begin{aligned}
\left.\begin{aligned}
w = \arg \min f(w) \\
\text{s.t.} \quad h_j(w) = 0, \ j = 1, \cdots, m \\
g_j(w) \leq 0, \ j = 1, \cdots, p
\end{aligned}\right\} \Rightarrow \\
L(w, \lambda, \mu) = f(w) + \sum_j \lambda_j h_j(w) + \sum_j \mu_j \left(g_j(w) + \epsilon_j^2 \right)
\quad (\epsilon_j^2 \text{ are slack variables turning the inequalities into equalities}) \\
\Rightarrow \begin{cases}
\frac{\partial}{\partial w} f(w) +
\sum_j \lambda_j \frac{\partial}{\partial w} h_j(w) +
\sum_j \mu_j \frac{\partial}{\partial w} g_j(w) = 0 \\
h_j(w) = 0, \ j = 1, \cdots, m \\
\left.\begin{aligned}
\mu_j g_j(w) = 0 \\
\mu_j \geq 0
\end{aligned} \right\} j = 1, \cdots, p
\end{cases}
\end{aligned}
$$
The Kernel Trick
Let $\Phi(x)$ be a map taking $x$ from an $n$-dimensional space to an $n'$-dimensional space, and define the kernel of two vectors as $\kappa(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$. Common kernels include:
$$
\begin{cases}
\text{linear:} & \kappa(x_i, x_j) = x_i^T x_j \\
\text{polynomial:} & \kappa(x_i, x_j) = (\gamma x_i^T x_j + c)^d \\
\text{sigmoid:} & \kappa(x_i, x_j) = \tanh (\gamma x_i^T x_j + c) \\
\text{Laplacian:} & \kappa(x_i, x_j) = \exp \left(- \frac{||x_i - x_j||}{\sigma}\right) \\
\text{Gaussian (RBF):} & \kappa(x_i, x_j) = \exp \left(- \frac{||x_i - x_j||^2}{2 \sigma^2}\right)
\end{cases}
$$
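A small sketch of the Gaussian kernel in use (my own illustration; `rbf_gram` is a hypothetical helper name): the Gram matrix of pairwise $\kappa(x_i, x_j)$ values is what replaces $\Phi(x_i)^T \Phi(x_j)$ in the dual problem later.

```python
import numpy as np

# Sketch: Gaussian (RBF) kernel Gram matrix.
def rbf_gram(X, sigma=1.0):
    # squared pairwise distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_gram(X)
print(K.shape, np.allclose(K, K.T))  # symmetric Gram matrix
```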
Classification
Given $N$ sample pairs $\{(X^{(i)}, y^{(i)}), i = 1, \cdots, N\}, \ y \in \{-1, 1\}$, we seek a hyperplane $w^T \Phi(x) + b = 0$ that separates the samples onto its two sides.
Linearly separable case
$$
\left.\begin{aligned}
& \text{Let } r_{+/-} \text{ be the distance from the separating hyperplane to the support vectors } x_{+/-}. \\
& \text{Then } r = r_+ + r_- \text{ with } r_{+/-} = \frac{|w^T \Phi(x_{+/-}) + b|}{||w||} = \frac{1}{||w||}, \text{ i.e. } r = \frac{2}{||w||}. \\
& \text{Positive/negative samples satisfy} \begin{cases}
w^T \Phi(x^{(i)}) + b \geq 1 & y^{(i)} > 0 \\
w^T \Phi(x^{(i)}) + b \leq -1 & y^{(i)} < 0
\end{cases} \Rightarrow y^{(i)} [w^T \Phi(x^{(i)}) + b] \geq 1 \ \text{(equality at the support vectors)}
\end{aligned}\right\} \Rightarrow
$$
$$
\begin{aligned}
\text{Objective:} \quad & \begin{aligned}
w, b & = \arg \max r \\
\text{s.t.} & \quad y^{(i)} [w^T \Phi(x^{(i)}) + b] \geq 1
\end{aligned} \\
\text{i.e.:} \quad & \begin{aligned}
w, b & = \arg \min \frac{1}{2} ||w||^2 \\
\text{s.t.} & \quad y^{(i)} [w^T \Phi(x^{(i)}) + b] \geq 1
\end{aligned}
\end{aligned}
$$
Linearly non-separable case
Building on the linearly separable SVM, add a slack variable $\epsilon^{(i)}$ for each sample:
$$
\begin{aligned}
\text{Objective:} \quad
w, b & = \arg \min \left[ \frac{1}{2} ||w||^2 + C \sum_i \epsilon^{(i)} \right] \\
\text{s.t.} & \quad y^{(i)} [w^T \Phi(x^{(i)}) + b] \geq 1 - \epsilon^{(i)} \\
& \quad \epsilon^{(i)} \geq 0
\end{aligned}
$$
Regression
Given $N$ sample pairs $\{(X^{(i)}, y^{(i)}), i = 1, \cdots, N\}, \ y \in \mathbb{R}$, we seek a regression model $\hat{y} = w^T \Phi(x) + b$ that fits every sample as closely as possible, with the ($\epsilon$-insensitive) loss defined as
$$
L^{(i)} = \begin{cases}
|y^{(i)} - w^T \Phi(x^{(i)}) - b| - \epsilon & |y^{(i)} - w^T \Phi(x^{(i)}) - b| > \epsilon \\
0 & \text{otherwise}
\end{cases}
$$
Solving the Optimization Problem
Taking the linearly separable SVM as the example, we show how the parameters $w, b$ are optimized.
$$
\begin{aligned}
\text{Objective:} \quad
w, b & = \arg \min \frac{1}{2} ||w||^2 \\
\text{s.t.} & \quad y^{(i)} [w^T \Phi(x^{(i)}) + b] \geq 1
\end{aligned}
$$
$$
\begin{aligned}
\text{Lagrangian:} \quad L(w, b, \mu) = \frac{1}{2} ||w||^2 + \sum_i \mu^{(i)} \left\{ 1 - y^{(i)} [w^T \Phi(x^{(i)}) + b] \right\} \\
w, b, \mu = \arg \min_{w, b} \max_{\mu} L(w, b, \mu) \Rightarrow
w, b, \mu = \arg \max_{\mu} \min_{w, b} L(w, b, \mu) \quad \text{(dual problem)} \\
\text{Stationarity:} \quad \begin{cases}
\frac{\partial}{\partial w_j} L(w, b, \mu) = \frac{1}{2} \frac{\partial}{\partial w_j} ||w||^2 +
\sum_i \mu^{(i)} \left\{ - y^{(i)} \frac{\partial}{\partial w_j} w^T \Phi(x^{(i)}) \right\} =
w_j - \sum_i \mu^{(i)} y^{(i)} \Phi(x^{(i)})_j = 0 \\
\frac{\partial}{\partial b} L(w, b, \mu) = \sum_i \mu^{(i)} \left\{ -y^{(i)} \frac{\partial}{\partial b} b \right\} =
- \sum_i \mu^{(i)} y^{(i)} = 0
\end{cases} \\
\text{KKT conditions:} \quad \begin{cases}
\left.\begin{aligned}
\sum_i \mu^{(i)} y^{(i)} \Phi(x^{(i)})_j & = w_j \\
\sum_i \mu^{(i)} y^{(i)} & = 0
\end{aligned}\right\} \text{(stationarity)} \\
1 - y^{(i)} [w^T \Phi(x^{(i)}) + b] \leq 0 \quad \text{(primal feasibility)} \\
\left.\begin{aligned}
\mu^{(i)} \left\{ 1 - y^{(i)} [w^T \Phi(x^{(i)}) + b] \right\} = 0 \\
\mu^{(i)} \geq 0
\end{aligned} \right\} \text{(complementary slackness, dual feasibility)}
\end{cases}
\end{aligned}
$$
Expanding the Lagrangian and substituting the stationarity conditions gives
$$
\begin{aligned}
L(w, b, \mu) & = \frac{1}{2} ||w||^2 + \sum_i \mu^{(i)} \left\{ 1 - y^{(i)} [w^T \Phi(x^{(i)}) + b] \right\} \\
& = \frac{1}{2} w^T w + \sum_i \mu^{(i)} - \sum_i \mu^{(i)} y^{(i)} w^T \Phi(x^{(i)}) - \underbrace{\sum_i \mu^{(i)} y^{(i)} b}_{= \, 0} \\
& = \frac{1}{2} w^T w + \sum_i \mu^{(i)} - \sum_j w_j \underbrace{\sum_i \mu^{(i)} y^{(i)} \Phi(x^{(i)})_j}_{w_j}
= - \frac{1}{2} w^T w + \sum_i \mu^{(i)} \\
& \text{with } w^T w = \left( \sum_i \mu^{(i)} y^{(i)} \Phi(x^{(i)}) \right)^T
\left( \sum_i \mu^{(i)} y^{(i)} \Phi(x^{(i)}) \right)
= \sum_i \sum_j \mu^{(i)} \mu^{(j)} y^{(i)} y^{(j)} \Phi(x^{(i)})^T \Phi(x^{(j)}) \Rightarrow \\
L(\mu) & = - \frac{1}{2} \underbrace{\sum_i \sum_j \mu^{(i)} \mu^{(j)} y^{(i)} y^{(j)} \Phi(x^{(i)})^T \Phi(x^{(j)})}_{w^T w} + \sum_i \mu^{(i)}
\end{aligned}
$$
The optimization problem is now the following; it is typically solved with SMO (a sketch with a generic solver follows the math):
$$
\begin{aligned}
\mu & = \arg \max_{\mu} L(\mu) \\
\text{s.t.} & \quad \mu^{(i)} \geq 0, \quad \sum_i \mu^{(i)} y^{(i)} = 0 \\
\Rightarrow & \ \mu^* \Rightarrow w^*, b^*
\end{aligned}
$$
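A hedged sketch of the dual on a tiny problem (my own illustration): instead of SMO it hands the objective and constraints to a generic `scipy.optimize.minimize` call, just to make $L(\mu)$ and the recovery of $w^*, b^*$ concrete.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: solving the small-scale SVM dual with a generic solver (not SMO).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-1, -1], 0.3, (10, 2)),
               rng.normal([1, 1], 0.3, (10, 2))])
y = np.r_[-np.ones(10), np.ones(10)]
G = (y[:, None] * y[None, :]) * (X @ X.T)   # linear-kernel Gram: y_i y_j x_i.x_j

def neg_dual(mu):                            # maximize L(mu) <=> minimize -L(mu)
    return 0.5 * mu @ G @ mu - mu.sum()

cons = [{"type": "eq", "fun": lambda mu: mu @ y}]   # sum_i mu_i y_i = 0
res = minimize(neg_dual, np.zeros(20), bounds=[(0, None)] * 20, constraints=cons)
mu = res.x
w = (mu * y) @ X                             # w* = sum_i mu_i y_i x_i
sv = np.argmax(mu)                           # a support vector (mu > 0)
b = y[sv] - X[sv] @ w                        # from y_i (w^T x_i + b) = 1
print(w, b)
```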
Clustering
Only selected concepts and algorithm steps are covered. Given a sample set $\{X^{(i)}, i = 1, \cdots, N\}$ and a target number of clusters $K$, we want to use the sample distribution to partition the samples into $K$ clusters.
Distance Metrics
For two $n$-dimensional vectors $x, y$, the common distances are (a small numeric check follows):
$$
\begin{aligned}
\text{Manhattan distance:} \quad & d = || x - y ||_1 = \sum_j |x_j - y_j| \\
\text{Euclidean distance:} \quad & d = || x - y ||_2 = \Big(\sum_j (x_j - y_j)^2\Big)^{1/2} \\
\text{Minkowski distance:} \quad & d = || x - y ||_p = \Big(\sum_j |x_j - y_j|^p\Big)^{1/p} \\
\text{Cosine similarity:} \quad & d = \cos \langle x, y \rangle = \frac{x^T y}{||x|| \cdot ||y||}
\end{aligned}
$$
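A one-screen check of the four formulas (my own illustration):

```python
import numpy as np

# Sketch: the four distances above for a pair of vectors.
x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
manhattan = np.abs(x - y).sum()
euclidean = np.sqrt(((x - y) ** 2).sum())
p = 3
minkowski = (np.abs(x - y) ** p).sum() ** (1 / p)
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(manhattan, euclidean, minkowski, cosine)
```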
KMeans
1. Randomly pick $K$ samples as the initial centroids (sensitive to initialization);
2. Compute the distance from every sample to each centroid ($N \times K$ distances);
3. Assign each sample to the cluster of its nearest centroid;
4. Recompute each cluster's centroid and update;
5. Repeat steps 2-4 until convergence. A minimal sketch follows.
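A hedged sketch of the loop above (my own illustration; it does not handle the rare empty-cluster case):

```python
import numpy as np

# Sketch: the KMeans loop, steps numbered as in the list above.
def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]               # step 1
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2)   # step 2: N x K
        labels = d.argmin(axis=1)                                   # step 3
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(K)])                 # step 4
        if np.allclose(new_centers, centers):                       # step 5
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in ([0, 0], [3, 0], [0, 3])])
centers, labels = kmeans(X, K=3)
print(centers)
```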
Spectral
1. Build the similarity matrix $S_{N \times N} = [d_{ij}]$ with $d_{ij} = ||x^{(i)} - x^{(j)}||_2^2$;
2. Build the adjacency matrix $W$:
$$
\begin{cases}
\epsilon\text{-neighborhood:} & w_{ij} = \begin{cases}
\epsilon & d_{ij} \leq \epsilon \\
0 & \text{otherwise}
\end{cases} \\
K\text{-nearest neighbors:} & w_{ij} = \begin{cases}
\exp(-\frac{d_{ij}}{2 \sigma^2}) & x^{(i)} \in \delta_K(x^{(j)}) \ \text{and/or} \ x^{(j)} \in \delta_K(x^{(i)}) \\
0 & \text{otherwise}
\end{cases} \\
& \delta_K(x) \text{ denotes the } K\text{-neighborhood of } x \\
\text{fully connected:} & w_{ij} = \exp(-\frac{d_{ij}}{2 \sigma^2})
\end{cases}
$$
3. Compute the degree matrix $D_{N \times N} = \text{diag}\{\sum_j w_{ij}, i = 1, \cdots, N\}$, i.e. the row sums of $W$ on the diagonal;
4. Compute the (normalized) graph Laplacian $L = D - W$, or $L = D^{-1}(D - W)$, or $L = D^{-1/2}(D - W)D^{-1/2}$;
5. Eigendecompose $L$ and take the eigenvectors of the $N' (N' \leq N)$ smallest eigenvalues to form the matrix $F_{N \times N'}$;
6. Treat each row of $F$ as a sample $f^{(i)}$; after normalization, run a simple clustering method such as KMeans on these rows to obtain the final clusters. A sketch follows.
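A hedged sketch of these steps (my own illustration; it uses the fully connected graph, the unnormalized Laplacian, and, to stay self-contained, splits two clusters on the sign of the second-smallest eigenvector instead of a final KMeans pass):

```python
import numpy as np

# Sketch: spectral clustering, steps numbered as in the list above.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.2, (40, 2)) for c in ([0, 0], [2, 2])])

sq = (X ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # step 1: squared distances
W = np.exp(-d2 / (2.0 * 1.0 ** 2))               # step 2: fully connected, sigma=1
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))                       # step 3: degree matrix
L = D - W                                        # step 4: Laplacian
eigval, eigvec = np.linalg.eigh(L)               # step 5: ascending eigenvalues
labels = (eigvec[:, 1] > 0).astype(int)          # step 6: split on the 2nd eigenvector
print(labels[:5], labels[-5:])
```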
Decision Trees
Given a sample set $D = \{(X^{(i)}, y^{(i)}), i = 1, \cdots, |D|\}$ with $|D|$ samples from $K$ classes $y \in \{C_k, k = 1, \cdots, K\}$, let $|D_k|$ be the number of samples in class $C_k$. Suppose feature $A$ takes $|A|$ values $\{A_a, a = 1, \cdots, |A|\}$, let $|D_a|$ be the number of samples with value $A_a$, and let $|D_{ak}|$ be the number of samples with value $A_a$ that belong to class $C_k$.
ID3
ID3 selects the current best splitting feature by information gain: the larger the gain, the better the feature (a counting sketch follows the formula).
$$
\begin{aligned}
g(D, A) = H(D) - H(D \mid A) \\
\left.\begin{aligned}
H(D) & = - \sum_k \frac{|D_k|}{|D|} \log \frac{|D_k|}{|D|} \quad \text{(class entropy of the whole set)} \\
H(D \mid A) & = \sum_a \frac{|D_a|}{|D|}
\underbrace{\left( - \sum_k \frac{|D_{ak}|}{|D_a|} \log \frac{|D_{ak}|}{|D_a|} \right)}_{H(D_a)} \quad \text{(weighted class entropy over the values } A_a\text{)}
\end{aligned} \right\}
\end{aligned}
$$
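A counting sketch of $g(D, A)$ (my own illustration; `entropy` and `info_gain` are hypothetical helper names):

```python
import numpy as np

# Sketch: information gain g(D, A) = H(D) - H(D | A) by counting.
def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(feature, labels):
    h_d = entropy(labels)                      # H(D)
    h_d_a = 0.0                                # H(D | A)
    for a in np.unique(feature):
        mask = feature == a
        h_d_a += mask.mean() * entropy(labels[mask])
    return h_d - h_d_a

feature = np.array([0, 0, 1, 1, 2, 2])         # values A_a of feature A
labels = np.array([0, 0, 1, 1, 1, 0])          # classes C_k
print(info_gain(feature, labels))
```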
C4.5
C4.5 selects the current best splitting feature by the information gain ratio: the larger the ratio, the better the feature. It:
- uses the information gain ratio as the selection criterion, countering ID3's bias toward features with many distinct values;
- remedies ID3's inability to handle continuous-valued features.
$$
\begin{aligned}
g_R(D, A) & = \frac{g(D, A)}{H_A(D)} \\
H_A(D) & = - \sum_a \frac{|D_a|}{|D|} \log \frac{|D_a|}{|D|} \quad \text{(entropy of feature } A\text{'s values)}
\end{aligned}
$$
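As a hedged continuation of the ID3 sketch above (reusing its `entropy()` and `info_gain()` helpers, which are my own illustrations), the gain ratio just normalizes the gain by the entropy of the feature's own values:

```python
# Sketch: the C4.5 information gain ratio g_R(D, A) = g(D, A) / H_A(D).
def gain_ratio(feature, labels):
    h_a = entropy(feature)                 # H_A(D): entropy of the feature's values
    return info_gain(feature, labels) / h_a

print(gain_ratio(feature, labels))
```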
CART
CART selects the current best splitting feature by the Gini index (not the information gain ratio): the larger the drop in Gini impurity $g_G$, the better the feature.
$$
\begin{aligned}
g_G(D, A) = \text{Gini}(D) - \text{Gini}(D \mid A) \\
\left.\begin{aligned}
\text{Gini}(D) & = 1 - \sum_k \left(\frac{|D_k|}{|D|}\right)^2 \quad \text{(Gini impurity of the whole set)} \\
\text{Gini}(D \mid A) & = \sum_a \frac{|D_a|}{|D|}
\underbrace{\left( 1 - \sum_k \left(\frac{|D_{ak}|}{|D_a|}\right)^2 \right)}_{\text{Gini}(D_a)} \quad \text{(weighted Gini impurity over the values } A_a\text{)}
\end{aligned}\right\}
\end{aligned}
$$
RF
A random forest uses the Bagging strategy: from a data set of $N$ samples, draw $M$ rounds of samples with replacement, each round taking $N_m$ samples, to obtain $M$ subsets of size $N_m$, and fit one classifier per subset.
Bootstrap sampling: for a given sample, in one round of drawing a training set of $m$ samples with replacement, the probability of being picked in each draw is $1/m$, so the probability of being missed is $1 - 1/m$, and the probability of being missed in all $m$ draws is $(1 - 1/m)^m$. As $m \to \infty$, $\lim_{m \to \infty} (1 - 1/m)^m = 1/e \approx 0.368$. In other words, in each bagging round roughly 36.8% of the training set is never sampled. These roughly $36.8\%$ unsampled points are called out-of-bag (OOB) data; since they take no part in fitting the model, they can be used to assess its generalization.
A random forest trains on top of bagging as follows (see the sketch after this list):
1. Draw $M$ bootstrap samples;
2. When growing each tree, use only $k$ features randomly chosen from all $K$ features;
3. Grow $M$ trees and aggregate their votes for the prediction (mode for classification, mean for regression).
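A quick empirical check of the $\approx 36.8\%$ out-of-bag fraction quoted above (my own illustration):

```python
import numpy as np

# Sketch: one bootstrap round of size m; the unseen fraction approaches 1/e.
rng = np.random.default_rng(0)
m = 10_000
idx = rng.integers(0, m, size=m)         # sample m indices with replacement
oob_fraction = 1.0 - np.unique(idx).size / m
print(oob_fraction, 1.0 / np.e)          # both ~0.368
```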