Machine Learning in Python (Scikit-learn 2)

无名小卒917 2016-09-25

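I had never written a post this long before, and Renren kindly reminded me that a post cannot exceed 12,000 characters. Fine, let's start a new one.

Picking up where No.1 left off, let's continue.

We have now walked through the easy-to-follow linear models. Next, let's see what happens when the features are combined, squared, or cubed, that is, polynomial regression. I don't think there is a settled answer here: nobody can say for sure whether a given feature combination makes sense. To put it more bluntly, it is not even settled whether hand-crafted features are the right tool for machine learning at all, but for now at least they still work.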

1.1.15. Polynomial regression: extending linear models with basis functions

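So far we have focused on finding a linear combination of the features, but in practice not every relationship is linear. Take house prices: a feature such as the average floor area may well relate to price linearly, but the number of houses in the area is harder to call. That relationship might be quadratic, or something more complicated such as a logistic curve, because a saturated market can make prices level off or even fall. Here we consider polynomial combinations of the features.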

This is the original linear combination of the features:

$\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2$

And this is the degree-2 polynomial combination of the features:

$\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2$
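To make the expansion above concrete, here is a minimal from-scratch sketch of what a degree-2 transform does to a two-feature sample. The helper name `polynomial_terms` is made up for illustration and this is not scikit-learn's implementation; it only produces the same set of terms (the order of the columns may differ from the order of the weights in the formula).

```python
from itertools import combinations_with_replacement


def polynomial_terms(x, degree=2):
    """Return every monomial of the entries of x up to the given degree."""
    terms = [1.0]  # constant term, the one multiplied by w_0
    for d in range(1, degree + 1):
        # each index tuple picks d features (with repetition) to multiply
        for idx in combinations_with_replacement(range(len(x)), d):
            value = 1.0
            for i in idx:
                value *= x[i]
            terms.append(value)
    return terms


# x_1 = 2, x_2 = 3 gives 1, x_1, x_2, x_1^2, x_1*x_2, x_2^2
print(polynomial_terms([2.0, 3.0]))  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
```

A linear model fitted on these six columns is still linear in the weights, which is why ordinary linear regression can be reused unchanged.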

Let's see how to handle this in code, again using the house price data.

import numpy
import matplotlib.pyplot
import sklearn.datasets
import sklearn.linear_model
import sklearn.preprocessing

# Assumption: boston is loaded the same way as in part No.1.
# load_boston was removed in scikit-learn 1.2, so this needs an older release.
boston = sklearn.datasets.load_boston()

# Data transform
polynominalData = sklearn.preprocessing.PolynomialFeatures(degree=2).fit_transform(boston.data)

# Split the dataset with sampleRatio
sampleRatio = 0.5
n_samples = len(boston.target)
sampleBoundary = int(n_samples * sampleRatio)

# Shuffle the whole data (permutation also works on Python 3,
# where a range object cannot be shuffled in place)
shuffleIdx = numpy.random.permutation(n_samples)

# Make the training data
train_features = polynominalData[shuffleIdx[:sampleBoundary]]
train_targets = boston.target[shuffleIdx[:sampleBoundary]]

# Make the testing data
test_features = polynominalData[shuffleIdx[sampleBoundary:]]
test_targets = boston.target[shuffleIdx[sampleBoundary:]]

# Train
linearRegression = sklearn.linear_model.LinearRegression()
linearRegression.fit(train_features, train_targets)

# Predict
predict_targets = linearRegression.predict(test_features)

# Evaluation: mean absolute error (L1 norm of the residuals / number of test samples)
n_test_samples = len(test_targets)
X = range(n_test_samples)
error = numpy.linalg.norm(predict_targets - test_targets, ord=1) / n_test_samples
print("Polynomial Regression (Degree = 2) (Boston) Error: %.2f" % error)

# Draw
matplotlib.pyplot.plot(X, predict_targets, 'r--', label='Predict Price')
matplotlib.pyplot.plot(X, test_targets, 'g:', label='True Price')
legend = matplotlib.pyplot.legend()
matplotlib.pyplot.title("Polynomial Regression (Degree = 2) (Boston)")
matplotlib.pyplot.ylabel("Price (1000 U.S.D)")
matplotlib.pyplot.savefig("Polynomial Regression (Degree = 2) (Boston).png", format='png')
matplotlib.pyplot.show()

In this code I use a polynomial feature transform whose highest order is 2, and then fit an ordinary linear regression.

Output:

Polynomial Regression (Degree = 2) (Boston) Error: 3.26

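Prices are in units of $1,000, so the error is around $3,260. As I recall, plain linear regression earlier gave about $3,350, so this is slightly better.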
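The error reported by the script is the L1 norm of the residuals divided by the number of test samples, in other words the mean absolute error. A tiny sketch with made-up numbers (the three prices below are illustrative only, not taken from the Boston data) shows how to read it:

```python
def mean_absolute_error(predicted, actual):
    """Average of |prediction - truth| over all samples."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical prices in units of $1,000
predicted = [24.0, 30.5, 18.0]
actual = [22.0, 33.0, 19.5]

error = mean_absolute_error(predicted, actual)
print("Error: %.2f" % error)  # Error: 2.00, i.e. about $2,000 off on average
```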

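Skeptical readers may ask whether my code has a bug. No problem, let's extend the topic a little. Suppose we change just one place:

# Data transform
polynominalData = sklearn.preprocessing.PolynomialFeatures(degree=4).fit_transform(boston.data)

That is, we switch to degree 4. What happens? The result is a disaster...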

Output:

Polynomial Regression (Degree = 4) (Boston) Error: 30.19

The error reaches about $30,000. This model is completely unusable.

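In the plot this produces, the predicted price (red dashed line) oscillates wildly, while the true price (green dotted line) mostly hovers around 30. This means the model generalizes very poorly to the test data. Someone will surely ask: "My degree-4 model considers far more feature combinations than the degree-2 one, so why does it test so badly?" Yes, it considers more and still does this badly; all I can say is "you overthought it". The model is too complex, and there is not nearly enough data to tune that many parameters sensibly. This is called overfitting, and the plot is a textbook case of it. Conversely, if the true relationship is quadratic and you fit a plain linear regression, the result will also be somewhat worse; that is called underfitting.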
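A quick count makes the "not enough data" point concrete. The sketch below assumes the usual Boston housing sizes (13 features, 506 samples, so 253 training samples at sampleRatio = 0.5); those numbers come from the dataset as I know it, not from the post. The helper `n_polynomial_features` is a made-up name that counts the monomials of total degree up to `degree`, constant term included.

```python
from math import comb


def n_polynomial_features(n_features, degree):
    """Number of monomials of total degree <= degree in n_features variables."""
    return comb(n_features + degree, degree)


n_features = 13      # assumed: Boston housing has 13 input features
n_train = 506 // 2   # assumed: 506 samples, half of them used for training

for degree in (1, 2, 4):
    n_params = n_polynomial_features(n_features, degree)
    print("degree %d: %d weights for %d training samples" % (degree, n_params, n_train))
```

At degree 2 there are 105 weights for 253 samples, which is still workable. At degree 4 there are 2380 weights, far more than the number of training samples, so the fit can match the training data almost arbitrarily and says little about unseen data.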

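In the era of big data, deep learning models have a huge number of parameters, but there is also a lot of data, so the expressive power of a complex model can actually show. I think that is the simple reason why deep learning works so well in areas such as images and speech.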

---------------------------------------------------------------------------------------------------------------------------------

1.2. Support Vector Machines

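The fate of the support vector machine is a lot like Nokia's: glorious for a long time, and although it is now history, its great contribution cannot be erased. Sometime in the 1990s, academia was flooded with papers on SVMs. Not knowing SVMs back then was like not knowing deep learning today; who knows how much scorn you would have faced :). Actually, I don't understand deep learning either... once you are used to being looked down on, it stops feeling strange.

This sklearn series is not about mathematical details; it cares more about how to use the tools and which model suits which situation. So I agree with the following four-point summary of SVM's advantages, which indirectly tells you when to use an SVM: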

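a. Effective on high-dimensional feature data

b. Still effective when the number of training samples is smaller than the number of feature dimensions (this one is especially impressive)

c. Memory-efficient model storage (only the few support vectors matter)

d. Features can be mapped into a higher-dimensional space as needed (the kernel method)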

1.2.1. Classification

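When an SVM is used for classification it is abbreviated SVC (Support Vector Classification); SVMs can do more than classification, which has to be said. The basic idea is very intuitive: for two classes we again look for a separating hyperplane, but we want the best one. The figure below is from the blog post: http://blog.csdn.net/marvin521/article/details/9286099. There are infinitely many separating lines like B and C that split the blue class from the red class, but D seems the more acceptable choice: if a new data point arrives, a split like D is more likely to get it right. D finds the maximum margin between the known data, leaving ample room for data not yet seen, so the model's generalization ability is maximized. (The key question in machine learning is not how well a model fits the training samples but how well it generalizes, hard as that is to evaluate.) This is why SVMs were all the rage in the 1990s, and they really do work well.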

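Looking closer, the position of a separating line like D hardly depends on the data points far away from it. Only the points nearest the line (a separating hyperplane when there are more feature dimensions) pin down its position, hence the name support vector machine. The data points (feature vectors) that determine the line are called support vectors.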
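The margin idea can be checked by hand on a toy example. The sketch below is not how an SVM is actually trained (a real solver optimizes over all possible hyperplanes); it only compares three hand-picked candidate lines, which I have named B, C and D after the figure, on made-up 2-D points. For each line it computes the distance to the closest point (the margin) and lists the points that attain it, which play the role of support vectors.

```python
import math

# Made-up 2-D data: blue is class -1, red is class +1
blue = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
red = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
points = [(p, -1) for p in blue] + [(p, +1) for p in red]


def signed_distance(w, b, p):
    """Signed distance from point p to the line w[0]*x1 + w[1]*x2 + b = 0."""
    return (w[0] * p[0] + w[1] * p[1] + b) / math.hypot(w[0], w[1])


def margin_and_support(w, b):
    """Return the smallest label-aware distance and the points that attain it."""
    # label * signed distance is positive when the point lies on its own side
    dists = [(label * signed_distance(w, b, p), p) for p, label in points]
    margin = min(d for d, _ in dists)
    support = [p for d, p in dists if abs(d - margin) < 1e-9]
    return margin, support


# Hypothetical candidate lines, each given as (w, b)
candidates = {
    "B": ((1.0, 0.0), -2.5),  # vertical line x1 = 2.5
    "C": ((0.0, 1.0), -2.5),  # horizontal line x2 = 2.5
    "D": ((1.0, 1.0), -5.5),  # diagonal line x1 + x2 = 5.5
}

for name in sorted(candidates):
    w, b = candidates[name]
    margin, support = margin_and_support(w, b)
    print(name, round(margin, 3), support)
```

All three lines separate the two classes, but D has by far the largest margin of the three, and only three of the six points touch that margin. Moving any of the other points slightly would not change D at all.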

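Come, let's get a feel for it with code:

One note first: if we keep using the Iris data, this is a multi-class problem (3 classes), so I think you should have a rough idea of how the SVC tools handle multi-class classification (after all, the example above was for 2 classes).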

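Broadly there are two ways to extend a binary classifier to a multi-class problem. Let me stress that these are not the only two ways to do multi-class classification; they are two ways of extending binary classification to n classes. One is to train n*(n-1)/2 binary classifiers, one for each pair of classes, each dedicated to that pair. The other is to take one class as the positive class and lump all the others together as the negative class, which trains n classifiers.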
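The two schemes are easy to enumerate. The sketch below only lists which binary problems each scheme would set up for the three Iris classes; it does not train anything. As far as I know from the scikit-learn documentation, SVC and NuSVC use the pairwise (one-vs-one) scheme internally while LinearSVC uses one-vs-rest, but treat that as my reading rather than something this post establishes.

```python
from itertools import combinations

classes = ["setosa", "versicolor", "virginica"]  # the three Iris classes

# One-vs-one: one binary classifier per unordered pair of classes
one_vs_one = list(combinations(classes, 2))

# One-vs-rest: one binary classifier per class, all other classes merged as negatives
one_vs_rest = [(c, [other for other in classes if other != c]) for c in classes]

print(len(one_vs_one), one_vs_one)
print(len(one_vs_rest), one_vs_rest)


def n_classifiers(n):
    """Number of binary classifiers each scheme needs for n classes."""
    return {"one_vs_one": n * (n - 1) // 2, "one_vs_rest": n}


print(n_classifiers(10))
```

For 3 classes both schemes happen to need 3 classifiers; for 10 classes the pairwise scheme needs 45 while one-vs-rest needs 10.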

Let's try them all on the Iris data.

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''

import sklearn.datasets
import sklearn.svm
import numpy.random

if __name__ == "__main__":
    # Load iris dataset
    iris = sklearn.datasets.load_iris()

    # Split the dataset with sampleRatio
    sampleRatio = 0.5
    n_samples = len(iris.target)
    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data (permutation also works on Python 3,
    # where a range object cannot be shuffled in place)
    shuffleIdx = numpy.random.permutation(n_samples)

    # Make the training data
    train_features = iris.data[shuffleIdx[:sampleBoundary]]
    train_targets = iris.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = iris.data[shuffleIdx[sampleBoundary:]]
    test_targets = iris.target[shuffleIdx[sampleBoundary:]]

    # Train the three SVM classifiers with their default settings
    classifiers = [
        ("SVC", sklearn.svm.SVC()),
        ("NuSVC", sklearn.svm.NuSVC()),
        ("LinearSVC", sklearn.svm.LinearSVC()),
    ]
    for name, classifier in classifiers:
        classifier.fit(train_features, train_targets)

    # Evaluate each classifier on the test data
    n_test_samples = len(test_targets)
    for name, classifier in classifiers:
        predict_targets = classifier.predict(test_features)
        correctNum = 0
        for i in range(n_test_samples):
            if predict_targets[i] == test_targets[i]:
                correctNum += 1
        accuracy = correctNum * 1.0 / n_test_samples
        print("%s Accuracy: %.2f" % (name, accuracy))

1.3. Stochastic Gradient Descent

1.4. Nearest Neighbors

1.4.2. Nearest Neighbors Classification

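Riding the momentum from the Logistic Regression classification of Iris that I just updated, let's see how the nearest-neighbor method does classification (it can do regression too, but I'll introduce classification first since it is easier to follow). This is an instance-based method, unlike the regression and classification methods introduced earlier, which all train a concrete mathematical function. The nearest-neighbor method does not need to fit any formula in advance. The idea is simple, "birds of a feather flock together": samples with similar features most likely share a class. KNN (K Nearest Neighbors) means finding, around the sample to be classified, the K labeled samples that are closest under some feature distance; whichever class is most common among those K becomes the predicted class. In other words, find your closest kin, then let the majority rule.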
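The whole procedure fits in a few lines. Here is a minimal from-scratch sketch with made-up 2-D points, assuming Euclidean distance and a plain majority vote; the function name `knn_predict` is hypothetical and this is not how scikit-learn implements its neighbor search.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest training points."""
    # indices of the training points, sorted by Euclidean distance to the query
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up training data: three samples of class "a", two of class "b"
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.0), (4.2, 3.9)]
train_labels = ["a", "a", "a", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # a
print(knn_predict(train_points, train_labels, (3.9, 4.1)))  # b
```

Note that there is no training step at all: the "model" is just the stored samples, and all the work happens at prediction time.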

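Of course the toolkit is not quite that simple. Besides KNN (KNeighborsClassifier) there is RNN (RadiusNeighborsClassifier; here RNN means radius-based neighbors, not a recurrent neural network). Put plainly, KNN does not care how far away the K nearest points actually are; there are always K relatively nearest ones. RNN instead uses a radius: draw a ball of radius Radius around the query sample (a circle for 2-D features; for three or more dimensions, think of a hypersphere) and count everything inside it. As a result, different query samples are not guaranteed to consider the same number of neighbors.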
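The difference from KNN shows up as soon as the neighbors are listed. Below is a small from-scratch sketch with made-up points on a line; `radius_neighbors` is a hypothetical helper, not the scikit-learn API.

```python
import math


def radius_neighbors(train_points, query, radius):
    """Indices of all training points within the given radius of the query."""
    return [i for i, p in enumerate(train_points) if math.dist(p, query) <= radius]


train_points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 0.0)]

print(radius_neighbors(train_points, (0.2, 0.0), radius=1.0))  # [0, 1, 2]
print(radius_neighbors(train_points, (5.1, 0.0), radius=1.0))  # [3]
print(radius_neighbors(train_points, (3.0, 0.0), radius=1.0))  # []
```

The same radius yields three neighbors, one neighbor, or none at all depending on where the query falls, so a radius-based classifier also has to decide what to do with a query that has an empty neighborhood.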

We can also weight the votes of these labeled samples by their distance, which is a natural idea. The code below shows both options.
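A toy case where the two weighting schemes disagree makes the idea clearer. The sketch below assumes inverse-distance weights for the 'distance' option; the small epsilon that avoids division by zero is my own shortcut, and `weighted_vote` is a hypothetical helper rather than the library's code.

```python
import math
from collections import defaultdict


def weighted_vote(neighbors, query, weights="uniform"):
    """Pick the label with the largest total vote among the given neighbors."""
    scores = defaultdict(float)
    for point, label in neighbors:
        if weights == "uniform":
            scores[label] += 1.0  # every neighbor counts the same
        else:
            # closer neighbors count more; epsilon is an assumption to avoid 1/0
            scores[label] += 1.0 / (math.dist(point, query) + 1e-12)
    return max(scores, key=scores.get)


# One very close "a" neighbor, two farther "b" neighbors
neighbors = [((0.1, 0.0), "a"), ((2.0, 0.0), "b"), ((2.1, 0.0), "b")]
query = (0.0, 0.0)

print(weighted_vote(neighbors, query, "uniform"))   # b
print(weighted_vote(neighbors, query, "distance"))  # a
```

With uniform weights the two distant "b" samples outvote the single "a"; with distance weights the very close "a" sample dominates.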

We test on Iris again. This time the split is harsher, 20% for training and 80% for testing, to tell the two weighting schemes ['uniform', 'distance'] apart.

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''

import sklearn.datasets
import sklearn.neighbors
import numpy.random
import matplotlib.pyplot
import matplotlib.colors

if __name__ == "__main__":
    # Load iris dataset
    iris = sklearn.datasets.load_iris()

    # Split the dataset with sampleRatio
    sampleRatio = 0.2
    n_samples = len(iris.target)
    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data (permutation also works on Python 3,
    # where a range object cannot be shuffled in place)
    shuffleIdx = numpy.random.permutation(n_samples)

    # Make the training data
    train_features = iris.data[shuffleIdx[:sampleBoundary]]
    train_targets = iris.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = iris.data[shuffleIdx[sampleBoundary:]]
    test_targets = iris.target[shuffleIdx[sampleBoundary:]]

    # Train
    n_neighbors = 5  # use the 5 nearest neighbors

    for weights in ['uniform', 'distance']:  # try both weighting schemes
        kNeighborsClassifier = sklearn.neighbors.KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
        kNeighborsClassifier.fit(train_features, train_targets)

        # Test
        predict_targets = kNeighborsClassifier.predict(test_features)

        # Evaluation
        n_test_samples = len(test_targets)
        correctNum = 0
        for i in range(n_test_samples):
            if predict_targets[i] == test_targets[i]:
                correctNum += 1
        accuracy = correctNum * 1.0 / n_test_samples
        print("K Neighbors Classifier (Iris) Accuracy [weight = '%s']: %.2f" % (weights, accuracy))

        # Draw, using only the last two features (petal length and width)
        cmap_bold = matplotlib.colors.ListedColormap(['red', 'blue', 'green'])
        X_test = test_features[:, 2:4]
        X_train = train_features[:, 2:4]
        matplotlib.pyplot.scatter(X_train[:, 0], X_train[:, 1], label='train samples', marker='o', c=train_targets, cmap=cmap_bold)
        matplotlib.pyplot.scatter(X_test[:, 0], X_test[:, 1], label='test samples', marker='+', c=predict_targets, cmap=cmap_bold)
        legend = matplotlib.pyplot.legend()
        matplotlib.pyplot.title("K Neighbors Classifier (Iris) [weight = %s]" % weights)
        matplotlib.pyplot.savefig("K Neighbors Classifier (Iris) [weight = %s].png" % weights, format='png')
        matplotlib.pyplot.show()

Output:

K Neighbors Classifier (Iris) Accuracy [weight = 'uniform']: 0.91
K Neighbors Classifier (Iris) Accuracy [weight = 'distance']: 0.93

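Distance weighting does slightly better, raising accuracy by about 2%. (Note that the two plots are drawn from only two of the feature dimensions; there are actually four.)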

1.5. Gaussian Processes

1.6. Cross decomposition

1.7. Naive Bayes

1.8. Decision Trees

1.9. Ensemble methods

1.10. Multiclass and multilabel algorithms

1.11. Feature selection

1.12. Semi-Supervised

1.13. Linear and quadratic discriminant analysis

1.14. Isotonic regression

2. Unsupervised learning

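Now let's start on unsupervised learning: clustering, probability density estimation (outlier detection), dimensionality reduction, and so on. This part of the toolkit is considerably richer than in many other ML packages; it even has manifold learning.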

2.1. Gaussian mixture models

2.2. Manifold learning

2.3. Clustering

2.4. Biclustering

2.5. Decomposing signals in components (matrix factorization problems)

2.6. Covariance estimation

2.7. Novelty and Outlier Detection

2.8. Density Estimation

2.9. Neural network models (unsupervised)

3. Model selection and evaluation

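Model selection sometimes needs particular care, especially when building a startup on ML. For many problems, different models all reach roughly 80% accuracy; how to improve from there is the real issue. More than one friend wants to use Deep Learning as a flashy topic for a PhD or master's thesis proposal this September. To do that stuff well you really need the patience to tune parameters. Thinking back to the gurus in my row at MSRA, all NIPS people!!! They would scream over a 1% improvement :). What I want to say is, come on, the parameters are just different... That is the Black Magic. As more people play with deep learning, I suspect it will be the parameters, not the models, that are worth money.

Then there is feature selection, which also takes skill. If you really build a startup on ML, the models are still the same old models; the choice of features and parameters is usually what shows a person's level. Don't try things blindly, really don't...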

3.1. Cross-validation: evaluating estimator performance

3.2. Grid Search: Searching for estimator parameters

3.3. Pipeline: chaining estimators

3.4. FeatureUnion: Combining feature extractors

3.5. Model evaluation: quantifying the quality of predictions

3.6. Model persistence

3.7. Validation curves: plotting scores to evaluate models

4. Dataset transformations

4.1. Feature extraction

4.2. Preprocessing data

4.3. Kernel Approximation