机器学习知识点总结

本学期机器学习课程知识点总结，持续更新中…
由于是双语课所以总结尽量也保持双语…

¶引言

机器学习（Machine Learning）致力于研究如何通过计算的手段，利用经验来改善系统自身的性能。

¶机器学习的定义

“The field of study that gives computers the ability to learn without being explicitly programmed.” in 1959 by Arthur Samuel.

旧的定义：不显式编程地赋予计算机能力的研究领域。

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance as tasks in T, as measured by P, improves with experience E.” in 1997 by Tom Mitchell.

新的定义：一个程序的对某些任务 T 的表现 P 会随着相关的经验 E 的增加而变好。

¶几种机器学习类型与基本术语

监督学习：可以由训练资料中学到或建立一个模式（函数 / learning model），并依此模式推测新的实例。训练资料是由输入物件（通常是向量）和预期输出所组成。函数的输出可以是一个连续的值（称为回归分析 regression analysis），或是预测一个分类标签（称作分类 classification）。

无监督学习：没有给定事先标记过的训练示例，自动对输入的资料进行分类或分群，主要运用包含：聚类分析（cluster analysis）、关系规则（association rule）、维度缩减（dimensionality reduce）。

强化学习：强调如何基于环境而行动，以取得最大化的预期利益。

在复习前应理解的基本术语：

数据集（data set）样本（sample）特征（feature）标记（label）
维数（dimensionality）训练集（training set）测试集（test set）参数（parameter）
假设（hypothesis）预测（prediction）簇（cluster）分布（distribution）
独立同分布（i.i.d.）泛化（generalization）过拟合（overfitting）
欠拟合（underfitting）错误率（error rate）精度（accuracy）…

¶监督学习 (Supervised Learning)

¶分类与回归（Classification and Regression）

Fundamentally, classification is about predicting a label and regression is about predicting a quantity.

分类预测的是离散值，而回归预测的是连续值。

¶K 最邻近分类器（K-Nearest Neighbors Classifier）

¶核心思想

The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other. It can be used to solve both classification and regression problems. However, it is more widely used in classification problems in the industry.

KNN 算法的核心思想是，假设相似的事物在很近的距离内存在，也就是说，相似的事物距离对方很近。KNN 算法既可以解决分类问题也可以解决回归问题，但是我们还是倾向于用它解决回归问题，这大概就是它也被称作 $K$ 最邻近分类器的原因吧。

¶算法步骤

Step 1: Calculate Similarity based on distance function
Step 2 : Find K-Nearest Neighbors

(1) 导入一组测试数据；
(2) 计算这组数据与训练集中的所有元素的距离（一般为欧氏距离）；
(3) 对所有计算出的距离进行排序，选择距离测试数据最小的 $K$ 个训练集样本；
(4) 对 $K$ 个样本所属的类别进行统计，将测试样本点分到在 $K$ 个训练集样本点中占比最高的那个类别去。

¶关键点

Key component of the KNN algorithm
- Distance measures
- Value of $K$

实现 KNN 算法的关键之一是搞清楚怎样定义两个对象的距离，常用的距离函数有欧氏距离（Euclidean distance）、曼哈顿距离（Manhattan Distance）等。

Euclidean distance

$$ \begin{array}{lcl} d(p, q) = d(q, p) & = & \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}\\ & = & \sqrt{\sum\limits^{n}_{i=1} (q_i - p_i)^2}. \end{array} $$

Manhattan Distance

$$ d_{12} = \sum^{n}_{k=1} \big|x_{1k} - x_{2k} \big|$$

实现 KNN 算法的关键之二是选择一个合适的 $k$ 值。为了这个目标，我们可以留出一部分训练数据来进行检验，不断改变 $k$ 的值，使得检验的结果最接近实际结果，选择泛化能力最强时的 $k$ 值作为算法使用的 $k$ 值，具体可见下面介绍的评估方法。

Selecting the value of $k$
- set aside a portion of the training data(validation set)
- vary $k$, observe training->validation error
- pick $k$ that gives best generalization performance

实践经验告诉我们，$k$ 应该要足够大，这样才能保证错误率最小，$k$ 如果太小会导致决策边界出现噪声，但是，$k$ 也应该要足够小，这样才能保证只有与测试数据真正邻近的训练集样本才会被选择到 $k$ 个邻居中来，$k$ 如果太大会导致决策边界过度平滑。

In practice
- $K$ should be large so that error rate is minimized
  - $K$ too small will lead to noisy decision boundaries.
- $K$ should be small enough so that only nearby samples are included
  - $K$ too large will lead to over-smoothed boundaries

¶评估方法

可以通过实验测试的评估方法来对学习器的泛化误差进行评估并进而做出选择，这里介绍 交叉检验法（cross validation）。

交叉检验法（又叫：$k$ 折交叉验证 $k$-fold cross validation）先将数据集 $D$ 划分为 $k$ 个大小相似的互斥子集，即 $D = D_1 \cup D_2 \cup \dots \cup D_k, D_i \cap D_j = \varnothing(i \neq j)$ ，每个子集 $D_i$ 都尽可能保持数据分布的一致性。然后每次用 $k-1$ 个子集的并集作为训练集，余下的那个子集作为测试集，最终返回 $k$ 次测试的结果的均值作为评估的依据。$k$ 的取值又在很大程度上影响着评估结果。

另外，如果让 $k = m$，$m$ 为数据集 $D$ 中的样本数，就得到留一法（Leave-One-Out, 简称 LOO）。

¶线性模型（Linear Models）

Linear Model make a prediction by using a linear function of the input features.

线性模型用输入特征的线性函数来做预测。

¶线性回归（Linear Regression）

For one feature the prediction is a line, for two features — plane, for more dimensions — hyperplane.

在线性回归中，给定若干个点 $X \in \mathbb{R}^n$，要求尽可能地拟合出一个 $n-1$ 维的超平面，即当 $X \in \mathbb{R}^2$ 时，就是在平面直角坐标系上给出很多个点，要求拟合一条直线经过这些点，这里与之后的 $n$ 都为 features 的数量。

The Linear function(hypothesis)

$$h(x)= \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_{n-1} x_{n-1} = \sum^{n-1}_{j=0}\theta_j x_j ~~~~\text{where} ~~ x_0 = 1.$$

我们的算法的任务就是要找到所有的参数 $\theta$，用矩阵表示为 $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_j \\ \end{bmatrix} $，来构建这样一个线性模型 $h_\theta(x)$ 或者简写为 $h(x)$，使得我们能够对 input X 来较为准确地预测出 output Y。即

$$\text{choose } \theta \text{ such that } h(x) \text{ is close to } y \text{ for the training examples.}\tag{1}$$

A line, or say a hyperplane that fits the data “best” will be one for which the n prediction errors — one for each observed data point — are as small as possible in some overall sense. One way to achieve this goal is to invoke the “least squares criterion,” which says to “minimize the sum of the squared prediction errors.”

怎样说 $h(x)$-预测值是足够接近 $y$-真实值呢？这里关键在于如何衡量 $h(x)$ 与 $y$ 的差别，我们了解到，均方误差是回归任务中最常用的性能度量，均方误差有着非常好的几何意义，它其实就对应了之前在 KNN 算法中用到的 Euclidean distance（欧氏距离）。于是我们得到了基于均方误差最小化来进行模型求解的 最小二乘法（least square method），在线性回归中，最小二乘法就是试图找到一条直线，使所有样本到直线上的欧氏距离之和最小。

于是将任务(1)具体为任务(2)

$$\text{choose values of}~~\theta~~\text{that minimizes }\sum^{m}_{i=1}(h_\theta(x^{(i)}) - y^{(i)})^2. \tag{2}$$

这时候我们就地定义一个代价函数，只是在前边拼上一个 $\dfrac{1}{2}$，

$$J(\theta) = \dfrac{1}{2}\sum^{m}_{i=1}(h_\theta(x^{(i)}) - y^{(i)})^2.$$

然后有两种方法使我们能够找到一个合适的 $\theta$ 使得代价函数最小，一种是 梯度下降法（Gradient Descent），一种是求最小二乘法估计的 正规方程（Normal Equation）。

梯度下降法，顾名思义，用下式进行参数 $\theta$ 的迭代，直至收敛

$$\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta).$$

其中，$\alpha$ 为学习率，一般在实践中根据情况设定。

这个式子可以放心地使用，在线性回归中它不会求出来个局部最优解。上式涉及到对代价函数求偏导，可以利用和的导数即为导数之和的性质，先对个体求偏导再求和，将偏导数带入上式得

$$\theta_j := \theta_j - \alpha\sum^{m}_{i=1}(h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}.$$

最终我们得到 批量梯度下降（Batch Gradient Descent）的方法：

$~~~~~~~~\text{Repeat until convergence }\{$ $~~~~~~~~ ~~~~~~~~\theta_j := \theta_j + \alpha\sum^{m}_{i=1}(y^{(i)}-h_{\theta}(x^{(i)})) x_j^{(i)}~~~~~~~~(\text{for every }j)\\$ $~~~~~~~~ \}$

接着介绍求最小二乘法估计的正规方程，将代价函数整理为矩阵的形式，并对代价函数求导，令导数为 $0$，可以得到

$$ \dfrac{\partial J(\theta)}{\partial \theta} = \dfrac{\partial }{\partial \theta}\dfrac{1}{2}(X\theta - \vec{y})^T(X\theta - \vec{y}) = 0 $$

方程化简

$$X^T X \theta = X^T \vec{y}$$

于是我们得到了最小化 $J(\theta)$ 的 $\theta$ 最优解的闭式解

$$\theta = (X^T X)^{-1} X^T \vec{y}$$

¶逻辑回归（Logistic Regression）

Logistic Regression is used when the dependent variable(target) is categorical.

Data is fit into linear regression model, which then be acted upon by a logistic function predicting the target categorical dependent variable.

当因变量(目标)是分类变量时，可以使用逻辑回归来进行分类。

考虑一个二分类问题，其输出标记 $y \in \{0, 1\}$，线性回归模型产生的预测值是一个实值，我们需要一个单调可微函数将实值转换为 0/1 值，最理想的单位阶跃函数是不连续的，我们一般用 对数几率函数（logistic function）来替代它，即

$$\text{sigmoid}(z) = \dfrac{1}{1 + e^{-z}}.$$

于是我们将 hypotheses $h_\theta(x)$ 变为以下形式

$$h_\theta(x) = g(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}},$$

并假设

$$P(y=1~|~x; \theta) = h_\theta(x),$$ $$P(y=0~|~x; \theta) = 1 - h_\theta(x).$$

上面两式可合并写为

$$P(y~|~x; \theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1-y}.$$

假设训练样本是独立地生成的，我们可以用 最大似然估计（maximum likelihood estimation, MLE）的方法来估计 $\theta$，其似然函数为

$$ \begin{array}{lcl} \mathscr{L}(\theta) & = & p(\vec{y}~|~X; \theta)\\ & = & \prod\limits^{m}_{i=1} p(y^{(i)}~|~x^{(i)}; \theta)\\ & = & \prod\limits^{m}_{i=1} \big(h_\theta(x^{(i)})\big)^{y^{(i)}} \big(1 - h_\theta(x^{(i)})\big)^{1-y^{(i)}}. \end{array} $$

对似然函数两边取对数有

$$ \begin{array}{lcl} ℓ(\theta) & = & \log \mathscr{L}(\theta)\\ & = & \sum\limits^{m}_{i=1} y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log (1 - h(x^{(i)})) \end{array} $$

最大化 $ℓ(\theta)$ 等价于最小化 $-ℓ(\theta)$，可以使用经典的数值优化算法如梯度下降法（gradient descent）或者牛顿法（Newton method），来求得其最优解，这里以梯度下降法为例，其迭代公式为

$$\theta := \theta - \alpha \nabla_\thetaℓ(\theta).$$

化简得

$$\theta := \theta - \alpha \big(y^{(i)} - h_\theta(x^{(i)}) \big)x_j^{(i)}$$

¶朴素贝叶斯分类器（Naïve Bayes Classifiers)

¶贝叶斯准则（Bayes Rule）

贝叶斯准则是关于随机事件 $A$ 和 $B$ 的条件概率和边缘概率的。

设 $A_1, A_2, \cdots, A_n$ 是一组互不相容的事件，形成样本空间的一个分割（每一个试验结果必定使得其中一个事件发生）。又假定对每一个 $i$，$P(A_i) > 0$，即 $A_1, A_2, \cdots, A_n$ 为完备事件组，满足 $\cup^{n}_{i=1} A_i = \Omega$，$A_iA_j = \varnothing$，$P(A_i) > 0$。则对于任何事件 $B$，只要它满足 $P(B) > 0$，下列公式成立

$$ \begin{array}{lcl} P(A_i | B) & = & \dfrac{P(B | A_i) P(A_i)}{P(B)} \\ & = & \dfrac{P(B | A_i) P(A_i)}{P(B | A_1)P(A_1) + \cdots + P(B | A_n)P(A_n)}\\ & = & \dfrac{P(B | A_i) P(A_i)}{\sum^{}_{j} P(B | A_j)P(A_j)} \end{array} $$

$P(A_i)$ : independent probability of $A_i$ : prior probability（先验概率）
$P(B)$ : independent probability of $B$
$P(B|A_i)$ : conditional probability of $B$ given $A_i$ : likelihood
$P(A_i|B)$ : cond. probability of $A_i$ given $B$ : posterior probability（后验概率）

使用贝叶斯概率术语，上面的方程可以写成如下形式

$$\text{posterior} = \dfrac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$

¶朴素贝叶斯（naive Bayes）

朴素贝叶斯分类（NBC）是以贝叶斯定理为基础并且假设特征条件之间相互独立的方法，先通过已给定的训练集，以特征词之间独立作为前提假设，学习从输入到输出的联合概率分布，再基于学习到的模型，输入 $X$ 求出使得后验概率最大的输出 $Y$ 。

Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector $\mathbf x = (x_1, \dots, x_n)$ representing some $n$ features (independent variables), it assigns to this instance probabilities

$$p(C_k~|~x_1, \dots, x_n)$$

for each of $K$ possible outcomes or classes $C_k$.

In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on $C$ and the values of the features $x_i$ are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model（分母实际上是常数。分子等价于联合概率模型）

$$p(C_k, x_1, \dots, x_n)$$

which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability（应用条件概率的链式法则）:

$$ \begin{array}{lcl} & & p(C_k, x_1, \dots, x_n) \\ & = & p(x_1, \dots, x_n, C_k) \\ & = & p(x_1~|~x_2, \dots, x_n, C_k)p(x_2, \dots, x_n, C_k) \\ & = & p(x_1~|~x_2, \dots, x_n, C_k)p(x_2~|~x_3, \dots, x_n, C_k)p(x_3, \dots, x_n, C_k) \\ & = & \dots \\ & = & p(x_1~|~x_2, \dots, x_n, C_k) \dots p(x_{n-1}~|~x_n, C_k)p(x_n~|~C_k)p(C_k) \end{array} $$

这里对上面的过程做一些解释：
对 $p(C_k~|~x_1, \dots, x_n)$ 应用贝叶斯公式，由于特征值 $x_i$ 是给出的，且分母与 $C_k$ 无关，所以分母 $p(x_1, \dots, x_n)$ 实际上是常数，而分子 $p(x_1, \dots, x_n | C_k) p(C_k)$ 等价于联合概率模型。那么实际上就有：
$$p(C_k~|~x_1, \dots, x_n) = \dfrac{p(x_1, \dots, x_n | C_k) p(C_k)}{p(x_1, \dots, x_n)} \propto p(C_k, x_1, \dots, x_n)$$

Now the “naive” conditional independence assumptions come into play: assume that each feature $x_i$ is conditionally independent of every other feature $x_j$ for $j \neq i$, given the category $C_k$. This means that

$$p(x_i~|~x_{i+1}, \dots, x_n, C_k) = p(x_i~|~C_k)$$

基于有限训练样本直接估计联合概率，在计算上将会遭遇组合爆炸问题，在数据上将会遭遇样本稀疏问题；属性数越多，问题越严重。为了避开这个障碍，naive Bayes 采用了 “属性条件独立性假设”（attribute conditional independence assumption）：对已知类别，假设所有属性相互独立。换言之，假设每个属性独立地对分类结果发生影响。

Thus, the joint model can be expressed as

$$ \begin{array}{lcl} p(C_k~|~x_1, \dots, x_n) & \propto & p(C_k, x_1, \dots, x_n)\\ & = & p(C_k)p(x_1~|~C_k)p(x_2~|~C_k)p(x_3~|~C_k)\cdots\\ & = & p(C_k)\prod\limits^{n}_{i=1}p(x_i~|~C_k), \end{array} $$

where $\propto$ denotes proportionality.

The discussion so far has derived the independent feature model, that is, the naive Bayes probability model（朴素贝叶斯概率模型）. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable（常见的规则是选择最有可能的假设）; this is known as the maximum a posteriori（最大后验概率） or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label $\widehat y = C_k$ for some $k$ as follows:

$$ \widehat y = \mathop{\arg\max}_{k \in \{ 1, \dots, K \}} p(C_k)\prod^{n}_{i=1}p(x_i~|~C_k) $$

$\mathop{\arg\max}$ 函数
$$ \mathop{\arg\max}_{}f(x) := \{ x~|~x \in S \land \forall y \in S : f(y) \le f(x) \} $$
对一个函数 $f(x)$ 或一个映射 $f:X \rightarrow Y$，当 $x$ 取值范围为 $S$ 的时候（也叫 $x \in S$ ），$\mathop{\arg\max}$ 的结果是使得 $f(x)$ 取得最大值的 $x$ 点集。

$p(C_k)$ 是好求的，当属性或者说特征为离散值时，$p(x_i~|~C_k)$ 亦不难求得，而当其为连续值时，则需考虑其概率密度函数，假定 $p(x_i~|~C_k) \sim \mathcal{N}(\mu_{C_k, i}, \sigma^2_{C_k, i})$，其中 $\mu_{C_k, i}, \sigma^2_{C_k, i}$ 分别是第 $k$ 类样本在第 $i$ 个属性上取值的均值和方差，则有

$$p(x_i~|~C_k) = \dfrac{1}{\sqrt{2\pi}\sigma_{C_k, i}}\exp\Big( -\dfrac{(x_i - \mu_{C_k, i})^2}{2\sigma^2_{C_k, i}} \Big)$$

¶拉普拉斯平滑（Laplace smoothing）

如果某个属性上的概率为 $0$，该样本的其他属性携带的信息将会被抹去，产生零概率问题，为了避免这种情况，在估计概率值时可进行 拉普拉斯平滑，修正后的公式为：

$$\widehat P(C_k)=\dfrac{|D_{C_k}|+1}{|D|+N},~~~~\widehat P(x_i~|~C_k) = \dfrac{|D_{C_k, x_i}|+1}{|D_{C_k}|+N_i}$$

其中，$N$ 表示训练集 $D$ 中可能的类别数，$N_i$ 表示第 $i$ 个属性可能的取值数。

¶决策树（Decision Trees）

Decision tree is a hierarchical tree structure that used to classify classes based on a series of questions (or rules) about the attributes of the class. In short, given a data of attributes together with its classes, a decision tree produces a sequence of rules (or series of questions) that can be used to recognize the class.

决策树是一种分层的树结构，它用于根据一系列关于类属性的问题或是规则来进行分类的问题。简而言之，给定一个包含一些属性及其所属类别的数据，决策树会产生一系列规则(或一系列问题)，这些规则可用于识别该类。

一颗决策树包含一个根节点、若干个内部结点和若干个叶结点；叶结点对应于决策结果，其他每个结点则对应于一个属性测试；每个结点包含的样本集合根据属性测试的结果被划分到子结点中；根结点包含样本全集，从根结点到每个叶子结点的路径对应了一个判定测试序列。

决策树的基本特点：贪婪（greedy）、自顶向下（top-down）、递归（recursive）。

¶基本流程

决策树学习的关键在于如何选择最优的划分属性，所谓的最优划分属性，对于二元分类而言，就是尽量使划分的样本属于同一类别，即“纯度”最高的属性。那么如何来度量特征（features）的纯度，这时候就要用到 信息熵（information entropy），

$$H(x) = -\sum^{n}_{i=0}P(x_i)\log_bP(x_i)$$

其中，$n$ 为类别的总数，一般取 $b=2$，此时 $H(x)$ 的单位为 $\text{bits}$ 。

信息熵可以衡量一个数据集的信息“纯度”。信息越纯，熵就越低；信息越混杂，熵就越高。

再定义 信息增益（information gain）来表示纯度的提升，

$$\text{Information gain}(i) = \text{Entropy of parent table}~D~-\text{Sum}(\dfrac{n_k}{n} \cdot\\ \text{Entropy of each value}~k~\text{of subset table }S_i).$$

即父亲节点的信息熵减去子节点上样本熵的加权平均数。在 ID3 算法中，通过测量各种切分方案的信息增益，选择具有最大信息增益的切分方式，即可确定决策树上这一个父亲节点的切分。

¶基尼系数（Gini index）

¶剪枝处理（pruning）

¶支持向量机（Support Vector Machines）

¶神经网络（Neural Networks）

人工神经网络
前馈神经网络
BP 神经网络
卷积神经网络CNN
循环神经网络RNN

以上的总结中有的来自下面的书籍或文章：

[1] 机器学习 / 周志华著. 北京：清华大学出版社，2016
[2] 概率导论 /（美）伯特瑟卡斯（Dimitri P．Bertsekas）、齐齐克利斯（John N．Tsitsiklis）著，郑忠国、童行伟译
[3] CS229 Lecture notes / Andrew Ng
[4] 机器学习复习笔记 / Ruan Xingzhi
[5] 决策树（decision tree）(一)——构造决策树方法 / 天泽28
[6] 决策树（decision tree）(二)——剪枝 / 天泽28
[7] 决策树（decision tree）(三)——连续值处理 / 天泽28
[8] 决策树（decision tree）(四)——缺失值处理 / 天泽28