机器学习是当下非常热门的话题，恰巧本阶段要开始机器人的项目，其中有一部分内容涉及到机器学习。目前规划的学习方案是与实物项目交叉同步进行，在理论学习与实践操作中加速入门的过程。笔记主要记录总结coursera 中Machine Learning 公开课的内容，按照公开课时间的顺序划分，后半部包含机器人项目的相关内容。本文的完成参考了一些前辈的优秀博客，限于篇幅不一一列出。

第一、二周公开课笔记

机器学习定义

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

对于某类任务T 和性能度量P ，如果一个计算机程序在T 上以P 衡量的性能随着经验E 而自我完善，那么我们称这个程序在经验E 中学习。

线性回归、梯度下降与正规方程组

模型定义

$ x^{(i)}$ : 输入变量或者输入特征

$ y^{(i)}$ : 输出结果或者预测的目标变量

$ (x^{(i)} , y^{(i)} )$ : 第$ i$ 个训练样本

$ h_ \theta (x)$ : 学习算法解决函数，以$ \theta$ 为参数的假设（hypothesis），表示对于输入的预测值，假设基于参数向量$ \theta$

线性回归

假设有$ n$ 个特征变量，通过这$ n$ 个特征变量预测一个$ y$ 值，假设函数如下：
$$
h_\theta(x)=\displaystyle \sum_{i=0}^n\theta_ix_i = \theta^Tx
$$
其中，$
\theta = \left[
\begin{matrix}
\theta_0 \\
\theta_1 \\
\vdots \\
\theta_n\\
\end{matrix}
\right]
$ ， $
x=\left[\begin{matrix}
x_0\\
x_1\\
\vdots\\
x_n\\
\end{matrix}\right]
$ 。而$ \displaystyle \sum_{i=0}^n\theta_ix$ 的计算在机器学习中更倾向于使用矩阵运算$ \theta^Tx$。一是便于在增加feature的时候不用改变算法；二是矩阵运算速度快，可以优化（并行）。
线性回归就是为了找到最优的一组$ n$ 个系数，实现最优拟合。
从而引出一个如何定义最优的问题，通常采用代价函数来表示：

$$
\begin{align}
J(\theta_0, \theta_1) & = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 \nonumber \\
& = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2 \nonumber \\
\end{align}
$$

代价函数其实就是$ \frac{1} {2} \bar{x}$ 的形式。其中，$ \bar{x}$ 是$ h_\theta (x_{i}) - y_{i}$ 的平方， $ \frac{1} {2}$ 是为了梯度下降计算方便，平方项求导后可以约去 $ \frac{1}{2}$ 。

当下的目标是找到一些系数$ \theta$ 使代价函数收敛到可以接受的误差范围内，有两种方案：

梯度下降
正规方程

梯度下降

只有两个特征时的更新规则 :
$$
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
$$

其中，$ :=$ 代表赋值， $ \alpha$ 为学习速率，
对每一个系数取其偏导数，然后乘以学习速率$ \alpha$ 并递减
$ \alpha$ 需要根据具体应用调整，$ \alpha$ 过小会需要多次迭代才能收敛，$ \alpha$ 过大则会导致震荡。在运算过程中$ \alpha$ 不需要再次调整。

在面对多维特征值问题时，要保证特征值有相近的尺度，可以保证梯度下降算法更快的收敛。使用特征缩放使$ x_n$ 规约到 $ -1\lt x_n\lt 1$ 。即：
$$
x_i := \dfrac{x_i - \mu_i}{s_i}
$$
其中，$ \mu_i$ 是$ i$ 个特征值的平均值，$ s_i$ 是这些值的范围$ x_{max} - x_{min}$ ，即标准偏差。

批量梯度下降

核心算法：

$$
\begin{align}
\text{repeat until convergence: } & \lbrace \nonumber \\
\theta_0 := \theta_0 - \alpha \frac{1}{m} & \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i}) \nonumber \\
\theta_1 := \theta_1 - \alpha \frac{1}{m} & \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right) \nonumber \\
& \rbrace \nonumber \\
\end{align}
$$

判断收敛的规则：

判断两次迭代后参数变化
判断两次迭代后目标变量的变化

梯度下降法有可能导致局部最优点，通过随机初始化寻求多个最优点，从这些最优点中选择最终结果。
当数据量较大时，每轮迭代都需要遍历全部数据，非常慢。因此提出增量梯度算法

增量梯度算法

一次仅用一个样本点来更新回归系数，可以在新样本到来时，进行增量更新，因此是一种“在线学习”算法。

$$
\theta_j:=\theta_j-\alpha(h_{\theta}(x^{(i)}-y^{(i)})){x_j}^{(i)}
$$

正规方程

定义$ \Delta$ 为梯度，$ J$ 的梯度表示如下：

$$
\Delta_\theta J = \left[ \frac {\partial J}{\partial \theta_0} \cdots \frac {\partial J}{\partial \theta _n} \right]^T \in \mathbb{R^{n + 1}}
$$

数据集合用矩阵$ m\times n$ 样本数为$ m$，特征值为$ n$ 表示如下：

$$
X = \left[ \begin{matrix}
(x^{(1)})^T\\
(x^{(2)})^T\\
\vdots \\
(x^{(n)})^T\\
\end{matrix} \right]
$$

目标值为$ m\times1$ 维向量：

$$
Y=\left[ \begin{matrix}
y^1, \cdots, y^n
\end{matrix}
\right]^T
$$

用矩阵表示目标函数
公式如下：

$$
\theta = (X^T X)^{-1}X^T y
$$

其中，$X$ 为$ m \times(n+1) $ 维矩阵，$ y$ 为$ (n+1)$ 维向量，$ \theta$ 为$ (n+1)$ 维向量。$ x_0$ 为$ [1, 1, \cdots, 1]^T$

正规方程法可以一次性给出最优的系数向量解。通常特征变量小于10000 的时候使用较好。

注：对于那些不可逆的矩阵（通常是因为特征之间不独立，如同时包含英尺为单位的尺寸和米为单位的尺寸两个特征，也有可能是特征数量大于训练集的数量），正规方程方法不能使用。

对比

梯度下降	正规方程
选择$ \alpha$	不需要选择$ \alpha$
多次迭代	不需要迭代
$ O(kn^2)$	$ O(n^3)$
$ n$ 较大仍可适用	$ n$ 过大会很慢（需计算$ (X^TX)^{-1}$ ）
适用于各种模型	仅适用于线性模型，不适合逻辑回归模型等

第一次作业

深入理解、细节知识点

第三周

分类和表达式

假设函数（Hypothesis Representation）

S型函数（Sigmoid Function）也称逻辑函数（Logistic Function）
逻辑回归的假设函数如下：
$$
h_\theta(x) = g(\theta^Tx)
$$
$$
z = \theta^Tx
$$
$$
g(z)=\frac{1}{1+e^{-z}}
$$
S型函数把输入限定在0~1之间，图像如下：

$$
h_\theta(x) = P(y = 1| x; \theta) = 1 - P(y = 0|x;\theta)
P(y = 0|x;\theta) + P(y = 1|x;\theta) = 1
$$
eg.
$ h_\theta(x) = 0.7 $ 代表预测值为1 的概率为70%，为0 的概率为30%。

决策边界（Decision Boundary）

决策边界以能够把样本正确分类的一条边界，不局限于直线，也可能是圆、椭圆等。决策边界由参数$ \theta$ 决定，而不是由数据集的特征决定。

逻辑回归模型

代价函数

逻辑回归的代价函数如下：
$$
\begin{align}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \nonumber \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \; & \text{if y = 1} \nonumber \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \; & \text{if y = 0} \nonumber \end{align}
$$

代价函数$ J(\theta)$ 和假设函数$ h_\theta (x)$ 的图像如下：

如果$ y = 0$ ，当假设函数为零，代价函数也是零，如果假设函数接近1，那么代价函数趋近无穷大。
- 如果$ y = 1$ ，当假设函数为1，代价函数则是零，如果假设函数接近零，那么代价函数趋近无穷大。

$$
\begin{align}& \mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y \nonumber \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \; \mathrm{and} \; h_\theta(x) \rightarrow 1 \nonumber \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \; \mathrm{and} \; h_\theta(x) \rightarrow 0 \nonumber \newline \end{align}
$$

代价函数这样的写法是为了保证$ J(\theta)$ 对于逻辑回归是凸函数。

简化代价函数和梯度下降

代价函数

代价函数的两种情况$\begin{cases} y = 1\\y= 0\end{cases}$可以写到一起，简化如下：
$$
\mathrm{Cost}(h_\theta(x),y) = - y \; \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))
$$

因为$ y$ 的取值只会是0 或者1 ，会使另一项等于零，被约去。
即当$ y = 1$ 的时候，$ (1 - y)log^{(1 - h_\theta (x))} = 0$不会影响结果，当$ y = 0$ 时，$ -ylog^{(h_\theta(x))} = 0$。
完整的代价函数为

$$
J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]
$$

向量化实现如下：
$$
\begin{align} & h = g(X\theta) \nonumber \newline & J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right) \nonumber \end{align}
$$

梯度下降

梯度下降的通用形式如下：
$$
\begin{align}& Repeat \; \lbrace \nonumber \newline & \; \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta) \nonumber \newline & \rbrace \nonumber \end{align}
$$
偏导项计算后带入得到：
$$
\begin{align} & Repeat \; \lbrace \nonumber \newline & \; \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \nonumber \newline & \rbrace \nonumber \end{align}
$$
对于所有的$ \theta$ 都要同步更新。
矢量化如下：
$$
\theta := \theta - \frac{\alpha}{m} X^{T} (g(X \theta ) - \vec{y})
$$

高级优化

对于复杂的算法，建议直接调用已有的库。需要计算下面两个函数。
$$
\begin{align} & J(\theta) \nonumber \newline & \dfrac{\partial}{\partial \theta_j}J(\theta) \nonumber \end{align}
$$

梯度下降算法(Gradient descent)
共轭梯度法(Conjugate gradient)
变尺度法(BFGS)
限制变尺度法(L-BFGS)

优点：

不需要手动选择学习速率
收敛速度比梯度下降算法快

缺点：
复杂

解决过度拟合问题

过度拟合

欠拟合，高偏差。拟合效果比较差。
过拟合，高方差。适应每一个样本（使用了更高次项，可能是一个一般性较差的曲线），但是不能有效泛化到新样本，即预测能力较差
解决过拟合：

减少特征值数量
正规化

正规化

思路：减小参数的影响，更接近简单函数。
$$
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m(h_\theta(x^{(x)}) - y^{(i)})^2 + \lambda\sum_{i = 1}^n\theta_j^2 \right]
$$
其中，$\lambda\sum_{i = 1}^n\theta_j^2$ 是正规化项，$ \lambda$ 是正规化参数，用于控制参数，避免过度拟合或者欠拟合，使曲线平滑。

正规化线性回归

梯度下降

把 $ \theta_0$ 单独处理，因为不希望惩罚$ \theta_0$ 项。
$$
\begin{align} & \text{Repeat}\ \lbrace \nonumber \newline & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \nonumber \newline & \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2…n\rbrace \nonumber\newline & \rbrace \nonumber \end{align}
$$
对惩罚项合并后，可以改写为如下形式：
$$
\theta_j :=\theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}
$$
其中，$ 1 - \alpha\frac{\lambda}{m}$ 是一个比1小一点的数，后面部分和梯度下降完全一样。实际效果就是在每次更新$ \theta_j$ 的时候会略微减小该值。

正规方程

$$
X = \left[\begin{matrix}(x^{(1)})^T\\
\vdots\\
(x^{(m)})^T\end{matrix}\right]
\qquad
y = \left[\begin{matrix}y^{(1)}\\
\vdots\\
y^{(m)}\end{matrix}\right]
$$

其中，$ X$ 是$ m\times(n+1)$ 维， $ y$ 是$ m\times1$ 。
目标是求得$ {min}_\theta J(\theta)$ 对$ \theta_j$ 求偏导，令 $ \frac{\partial}{ {\partial\theta}_j} = 0$ 。
$$
\theta = \left(X^TX + \lambda\cdot \color{red}{L}\right)^{-1}X^Ty\\
\text{where } \color{red}{L} = \begin{bmatrix}0&&&&\\
&1&&&\\
&&1&&\\
&&&\ddots\\
&&&&1\\
\end{bmatrix}
$$
$ L$ 是$ (n + 1) \times (n + 1)$ 维。第一行第一列是0
当样本数目少于特征数$ m < n$ 时，$ X^TX$ 是不可逆的，然而加上$\lambda\cdot L$后，$ X^TX + \lambda\cdot L$ 就变成可逆的了。

正规化逻辑回归

代价函数添加修正项如下：
$$
J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log ^{(h_\theta (x^{(i)}))} + (1 - y^{(i)})\ \log ^{(1 - h_\theta(x^{(i)}))}\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2
$$
需要忽略$ \theta_0$ 项，从$ \theta_1$到$ \theta_n$。

$ \lambda$ 过大会导致 $ \theta_1 \text{、}\theta_2 \text{、} \cdots\text{、}\theta_n$ 过小，最终只有 $ \theta_0$ 起作用，决策分界会趋近一条平行x轴的直线。
更多的特征值会让假设函数拟合训练集的数据，而不能更好的预测新数据。
正规化是为了控制过拟合现象，所以正规化之后对已经过拟合的数据拟合的更好，$ \lambda$ 太大会导致欠拟合。

神经网络

概念与应用

使用神经网络预测 $ x_1$ AND $ x_2$ 如下：

$$
\begin{align}\begin{bmatrix}x_0 \newline x_1 \newline x_2\end{bmatrix} \rightarrow\begin{bmatrix}g(z^{(2)})\end{bmatrix} \rightarrow h_\Theta(x) \nonumber \end{align}
$$
$ x_0$是偏差变量，全部是 1。
设置$ \theta^{(1)}$ 矩阵为 $ \theta^{(1)}=\left[ \begin{matrix}-30 &20 &20\end{matrix}\right]$
激励函数$ g(x)$ 为S 型函数，-30、20、20 可理解为$ x$ 的权重。可得下表。
$$
\begin{array}{cc|r}
x_1&x_2&h_\theta(x)\\
\hline
0&0&g(-30)\approx0\\
0&1&g(-10)\approx0\\
1&0&g(-10)\approx0\\
1&1&g(10)\approx1
\end{array}
$$
与、异或、或门的权重$ \theta^{(1)}$ 可以表示如下：
$$
\begin{align}& a^{(2)} = g(\Theta^{(1)} \cdot x) \nonumber \newline& a^{(3)} = g(\Theta^{(2)} \cdot a^{(2)}) \nonumber \newline& h_\Theta(x) = a^{(3)}\nonumber \end{align}
$$

同或门XNOR(相同为1，相异为0)可表示为$ Y =\overline{A \oplus B}$ 等价于$ Y = A\cdot B + \overline{A}\cdot \overline{B}$，如下：
$$
\begin{align}\begin{bmatrix}x_0 \nonumber \newline x_1 \nonumber \newline x_2\end{bmatrix} \rightarrow\begin{bmatrix}a_1^{(2)} \nonumber \newline a_2^{(2)} \nonumber \end{bmatrix} \rightarrow \begin{bmatrix}a^{(3)}\end{bmatrix} \rightarrow h_\Theta(x) \nonumber \end{align}
$$
与、异或、或门权重如下：
$$
\begin{align}AND: \nonumber \newline\Theta^{(1)} &=\begin{bmatrix}-30 & 20 & 20\end{bmatrix} \nonumber \newline NOR: \nonumber \newline\Theta^{(1)} &= \begin{bmatrix}10 & -20 & -20\end{bmatrix} \nonumber \newline OR: \nonumber \newline\Theta^{(1)} &= \begin{bmatrix}-10 & 20 & 20\end{bmatrix} \nonumber \newline\end{align}
$$
在第一层、第二层之间用$\theta^{(1)}=\begin{bmatrix}-30&20&20\\10&-20&-20\end{bmatrix}$ 连接。
第二层、第三层用$ \theta^{(2)}=\begin{bmatrix}-10&20&20\end{bmatrix}$ 连接。
计算所有的节点信息如下：
$$
\begin{align}& a^{(2)} = g(\Theta^{(1)} \cdot x) \nonumber \newline& a^{(3)} = g(\Theta^{(2)} \cdot a^{(2)}) \nonumber \newline& h_\Theta(x) = a^{(3)} \nonumber \end{align}
$$

参考

维基百科
 Coursera-ML
斯坦福ML公开课
 高效使用matlab的一些总结

机器学习笔记（斯坦福公开课）

第一、二周公开课笔记

机器学习定义

分类

监督学习（Supervised learning）

无监督学习（ Unsupervised learning）

线性回归、梯度下降与正规方程组

模型定义

线性回归

梯度下降

批量梯度下降

增量梯度算法

正规方程

对比

第一次作业

深入理解、细节知识点

第三周

分类和表达式

分类（Classification）

假设函数（Hypothesis Representation）

决策边界（Decision Boundary）

逻辑回归模型

代价函数

简化代价函数和梯度下降

代价函数

梯度下降

高级优化

解决过度拟合问题

过度拟合

正规化

正规化线性回归

梯度下降

正规方程

正规化逻辑回归

神经网络

概念与应用

参考