8 Top Python Machine Learning Algorithms You Must Learn

静幻堂 2018-08-14


大数据信息站 2018-08-13 18:00:52

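Today we take a closer look at eight of the top Python machine learning algorithms and implement each one.

Let's begin the tour of machine learning algorithms in Python.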


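The following are the Python machine learning algorithms:

1. Linear Regression

Linear regression is a supervised Python machine learning algorithm that observes continuous features and predicts an outcome. Depending on whether it runs on a single variable or on many features, we call it simple linear regression or multiple linear regression.

It is one of the most popular Python ML algorithms and is often underappreciated. It assigns optimal weights to the variables to create a line ax + b that predicts the output. We often use linear regression to estimate real values, such as the number of calls or the cost of houses, based on continuous variables. The regression line is the best-fit line Y = a*X + b, which expresses the relationship between the independent and the dependent variable.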
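To make the idea of "optimal weights" concrete, here is a minimal sketch that fits Y = a*X + b by ordinary least squares with NumPy alone. The five data points are hypothetical values chosen to lie near the line y = 2x + 1. They are not taken from the article.

```python
import numpy as np

# Hypothetical toy data lying roughly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Design matrix: one column for x, one column of ones for the intercept b
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of A @ [a, b] ~= y
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = a * x + b  # predictions from the fitted line
print(f"a = {a:.3f}, b = {b:.3f}")
```

scikit-learn's LinearRegression solves the same least-squares problem. The sketch only shows what "best-fit line" means in terms of a and b.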


Let's plot a linear regression fit for the diabetes dataset.

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from sklearn import datasets, linear_model
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X = diabetes.data[:, np.newaxis, 2]  # use a single feature (column 2)
>>> diabetes_X_train = diabetes_X[:-30]  # split the data into training and test sets
>>> diabetes_X_test = diabetes_X[-30:]
>>> diabetes_y_train = diabetes.target[:-30]  # split the targets into training and test sets
>>> diabetes_y_test = diabetes.target[-30:]
>>> regr = linear_model.LinearRegression()  # linear regression object
>>> regr.fit(diabetes_X_train, diabetes_y_train)  # train the model on the training set
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> diabetes_y_pred = regr.predict(diabetes_X_test)  # make predictions
>>> regr.coef_
array([941.43097333])
>>> mean_squared_error(diabetes_y_test, diabetes_y_pred)
3035.0601152912695
>>> r2_score(diabetes_y_test, diabetes_y_pred)  # variance score
0.410920728135835
>>> plt.scatter(diabetes_X_test, diabetes_y_test, color='lavender')
<matplotlib.collections.PathCollection object at 0x0584FF70>
>>> plt.plot(diabetes_X_test, diabetes_y_pred, color='pink', linewidth=3)
[<matplotlib.lines.Line2D object at 0x0584FF30>]
>>> plt.xticks(())
([], <a list of 0 Text xticklabel objects>)
>>> plt.yticks(())
([], <a list of 0 Text yticklabel objects>)
>>> plt.show()

Python Machine Learning Algorithms - Linear Regression
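The two scores reported for the regression, mean squared error and the R² variance score, can be computed by hand. The sketch below does so on four hypothetical target/prediction pairs. The numbers are illustrative only and unrelated to the diabetes data.

```python
import numpy as np

# Hypothetical true targets and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Mean squared error: average of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, r2)
```

An R² of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of the targets.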

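2. Logistic Regression

Logistic regression is a supervised classification Python machine learning algorithm used to estimate discrete values such as 0/1, yes/no, and true/false, based on a given set of independent variables. We use a logistic function to predict the probability of an event, which gives an output between 0 and 1.

Although its name says 'regression', this is actually a classification algorithm. Logistic regression fits the data to a logit function and is also called logit regression. Let's plot this.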

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn import linear_model
>>> xmin, xmax = -7, 7  # test set; a straight line with Gaussian noise
>>> n_samples = 77
>>> np.random.seed(0)
>>> x = np.random.normal(size=n_samples)
>>> y = (x > 0).astype(float)  # plain float; the np.float alias was removed in NumPy 1.24
>>> x[x > 0] *= 3
>>> x += .4 * np.random.normal(size=n_samples)
>>> x = x[:, np.newaxis]
>>> clf = linear_model.LogisticRegression(C=1e4)  # classifier
>>> clf.fit(x, y)
>>> plt.figure(1, figsize=(3, 4))
<Figure size 300x400 with 0 Axes>
>>> plt.clf()
>>> plt.scatter(x.ravel(), y, color='lavender', zorder=17)
<matplotlib.collections.PathCollection object at 0x057B0E10>
>>> x_test = np.linspace(-7, 7, 277)
>>> def model(x):  # the logistic (sigmoid) function
...     return 1 / (1 + np.exp(-x))
...
>>> loss = model(x_test * clf.coef_ + clf.intercept_).ravel()
>>> plt.plot(x_test, loss, color='pink', linewidth=2.5)
[<matplotlib.lines.Line2D object at 0x057BA090>]
>>> ols = linear_model.LinearRegression()  # ordinary least squares, for comparison
>>> ols.fit(x, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> plt.plot(x_test, ols.coef_ * x_test + ols.intercept_, linewidth=1)
[<matplotlib.lines.Line2D object at 0x057BA0B0>]
>>> plt.axhline(.4, color='0.4')
<matplotlib.lines.Line2D object at 0x05860E70>
>>> plt.ylabel('y')
Text(0,0.5,'y')
>>> plt.xlabel('x')
Text(0.5,0,'x')
>>> plt.xticks(range(-7, 7))
>>> plt.yticks([0, 0.4, 1])
>>> plt.ylim(-.25, 1.25)
(-0.25, 1.25)
>>> plt.xlim(-4, 10)
(-4, 10)
>>> plt.legend(('Logistic Regression', 'Linear Regression'), loc='lower right', fontsize='small')
<matplotlib.legend.Legend object at 0x057C89F0>
>>> plt.show()

Machine Learning Algorithms - Logistic Regression
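The logistic function and the logit function mentioned in this section are inverses of each other. The sketch below shows this with the standard library only. The weight `w` and bias `b` are hypothetical values, not coefficients learned from any data in the article.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds: the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

# Hypothetical model parameters for a single feature
w, b = 1.5, -0.5

for x in (-2.0, 0.0, 2.0):
    p = sigmoid(w * x + b)   # predicted probability of class 1
    label = int(p >= 0.5)    # threshold at 0.5 to get a discrete class
    print(f"x = {x:+.1f}  p = {p:.3f}  class = {label}")
```

Thresholding the probability at 0.5 is what turns this "regression" into a classifier.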

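3. Decision Tree

A decision tree is a supervised Python machine learning algorithm used for both classification and regression, though mostly for classification. The model takes an instance, traverses the tree, and compares important features against a determined conditional statement. Whether it descends to the left or the right child branch depends on the result. Usually, the more important features are closer to the root.

This Python machine learning algorithm works on both categorical and continuous dependent variables. Here, we split the population into two or more homogeneous sets. Let's see the algorithm at work: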

>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.metrics import accuracy_score, confusion_matrix
>>> from sklearn.metrics import classification_report
>>> def importdata():  # importing the data
...     balance_data = pd.read_csv('https://archive.ics./ml/machine-learning-' +
...                                'databases/balance-scale/balance-scale.data',
...                                sep=',', header=None)
...     print(len(balance_data))
...     print(balance_data.shape)
...     print(balance_data.head())
...     return balance_data
...
>>> def splitdataset(balance_data):  # splitting the data
...     x = balance_data.values[:, 1:5]  # columns 1-4 are the features
...     y = balance_data.values[:, 0]    # column 0 is the class label
...     x_train, x_test, y_train, y_test = train_test_split(
...         x, y, test_size=0.3, random_state=100)
...     return x, y, x_train, x_test, y_train, y_test
...
>>> def train_using_gini(x_train, x_test, y_train):  # training with the Gini index
...     clf_gini = DecisionTreeClassifier(criterion="gini",
...         random_state=100, max_depth=3, min_samples_leaf=5)
...     clf_gini.fit(x_train, y_train)
...     return clf_gini
...
>>> def train_using_entropy(x_train, x_test, y_train):  # training with entropy
...     clf_entropy = DecisionTreeClassifier(
...         criterion="entropy", random_state=100,
...         max_depth=3, min_samples_leaf=5)
...     clf_entropy.fit(x_train, y_train)
...     return clf_entropy
...
>>> def prediction(x_test, clf_object):  # making predictions
...     y_pred = clf_object.predict(x_test)
...     print(f"Predicted values: {y_pred}")
...     return y_pred
...
>>> def cal_accuracy(y_test, y_pred):  # calculating accuracy
...     print(confusion_matrix(y_test, y_pred))
...     print(accuracy_score(y_test, y_pred) * 100)
...     print(classification_report(y_test, y_pred))
...
>>> data = importdata()
625
(625, 5)
   0  1  2  3  4
0  B  1  1  1  1
1  R  1  1  1  2
2  R  1  1  1  3
3  R  1  1  1  4
4  R  1  1  1  5
>>> x, y, x_train, x_test, y_train, y_test = splitdataset(data)
>>> clf_gini = train_using_gini(x_train, x_test, y_train)
>>> clf_entropy = train_using_entropy(x_train, x_test, y_train)
>>> y_pred_gini = prediction(x_test, clf_gini)

Python Machine Learning Algorithms - Decision Tree

>>> cal_accuracy(y_test, y_pred_gini)
[[ 0  6  7]
 [ 0 67 18]
 [ 0 19 71]]
73.40425531914893

Python Machine Learning Algorithms - Decision Tree

>>> y_pred_entropy = prediction(x_test, clf_entropy)

Python Machine Learning Algorithms - Decision Tree

>>> cal_accuracy(y_test, y_pred_entropy)
[[ 0  6  7]
 [ 0 63 22]
 [ 0 20 70]]
70.74468085106383

Python Machine Learning Algorithms - Decision Tree
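The two split criteria used in this section, "gini" and "entropy", both measure how mixed the class labels in a node are. The following is a minimal sketch of the two measures on hypothetical label lists. It is not scikit-learn's internal implementation, and the helper name `weighted_impurity` is made up for illustration.

```python
from collections import Counter
import math

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

# Hypothetical node with two classes, and one candidate split of it
parent = ["L", "L", "R", "R"]
left, right = ["L", "L"], ["R", "R"]

print(gini(parent), entropy(parent))   # a maximally mixed two-class node
print(weighted_impurity(left, right, gini))  # a split into pure children
```

A tree learner prefers the split that lowers the weighted impurity the most, which is how it arrives at more homogeneous subsets.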

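4. Support Vector Machine (SVM)

SVM is a supervised classification Python machine learning algorithm that draws a line dividing the different categories of data. In this ML algorithm, we compute the vector that optimizes the line, so that the closest points of each group lie as far from it as possible. While you will almost always find this to be a linear vector, it does not have to be.

In this Python machine learning tutorial, we plot each data item as a point in n-dimensional space. We have n features, and the value of each feature is the value of one coordinate.

First, let's plot a dataset.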

>>> from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in scikit-learn 0.24
>>> x, y = make_blobs(n_samples=500, centers=2,
...                   random_state=0, cluster_std=0.40)
>>> import matplotlib.pyplot as plt
>>> plt.scatter(x[:, 0], x[:, 1], c=y, s=50, cmap='plasma')
<matplotlib.collections.PathCollection object at 0x04E1BBF0>
>>> plt.show()

Python Machine Learning Algorithms - SVM

>>> import numpy as np
>>> xfit = np.linspace(-1, 3.5)
>>> plt.scatter(x[:, 0], x[:, 1], c=y, s=50, cmap='plasma')
<matplotlib.collections.PathCollection object at 0x07318C90>
>>> for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
...     yfit = m * xfit + b  # candidate separating line
...     plt.plot(xfit, yfit, '-k')
...     plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
...                      color='#AFFEDC', alpha=0.4)  # margin of width d around the line
...
[<matplotlib.lines.Line2D object at 0x07318FF0>]
<matplotlib.collections.PolyCollection object at 0x073242D0>
[<matplotlib.lines.Line2D object at 0x07318B70>]
<matplotlib.collections.PolyCollection object at 0x073246F0>
[<matplotlib.lines.Line2D object at 0x07324370>]
<matplotlib.collections.PolyCollection object at 0x07324B30>
>>> plt.xlim(-1, 3.5)
(-1, 3.5)
>>> plt.show()

Python Machine Learning Algorithms - SVM
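The plot above draws three hand-picked candidate lines and their margins. It does not fit a model. As a possible next step, the sketch below fits a linear support vector classifier to the same blobs and reads off the learned line and its margin. The choice of `SVC(kernel="linear", C=1e3)` is an assumption made for illustration, not something the article specifies.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Same synthetic two-class data as in the plots above
x, y = make_blobs(n_samples=500, centers=2,
                  random_state=0, cluster_std=0.40)

# Linear SVM; a large C approximates a hard margin (illustrative choice)
clf = SVC(kernel="linear", C=1e3)
clf.fit(x, y)

# The separating line is w . x + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]

# Width of the margin between the two supporting hyperplanes
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:", len(clf.support_vectors_))
print("margin width:", margin_width)
print("training accuracy:", clf.score(x, y))
```

Only the support vectors, the points closest to the boundary, determine the fitted line. Moving any other point slightly leaves the result unchanged.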

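5. Naive Bayes

Naive Bayes is a classification method based on Bayes' theorem. It assumes independence between predictors: a naive Bayes classifier assumes that a feature in a class is unrelated to any other feature. Consider a fruit. It is an apple if it is round, red, and 2.5 inches in diameter. A naive Bayes classifier says that each of these features contributes independently to the probability that the fruit is an apple, even if the features actually depend on each other.

Naive Bayes models are easy to build for very large datasets. The model is not only extremely simple, it can also outperform many highly sophisticated classification methods. Let's build this.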

>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn import datasets
>>> from sklearn.metrics import confusion_matrix
>>> from sklearn.model_selection import train_test_split
>>> iris = datasets.load_iris()
>>> x = iris.data
>>> y = iris.target
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
>>> gnb = GaussianNB()
>>> mnb = MultinomialNB()
>>> y_pred_gnb = gnb.fit(x_train, y_train).predict(x_test)  # Gaussian naive Bayes
>>> cnf_matrix_gnb = confusion_matrix(y_test, y_pred_gnb)
>>> cnf_matrix_gnb
array([[16,  0,  0],
       [ 0, 18,  0],
       [ 0,  0, 11]], dtype=int64)
>>> y_pred_mnb = mnb.fit(x_train, y_train).predict(x_test)  # multinomial naive Bayes
>>> cnf_matrix_mnb = confusion_matrix(y_test, y_pred_mnb)
>>> cnf_matrix_mnb
array([[16,  0,  0],
       [ 0,  0, 18],
       [ 0,  0, 11]], dtype=int64)
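The independence assumption from the fruit example can be worked through by hand. In the sketch below, every prior and likelihood is a hypothetical number invented for illustration, as is the second class "orange". The point is only that the class score is the prior multiplied by one likelihood per feature.

```python
# Hypothetical class priors P(class)
priors = {"apple": 0.5, "orange": 0.5}

# Hypothetical per-feature likelihoods P(feature | class)
likelihoods = {
    "apple":  {"round": 0.9, "red": 0.8, "2.5in": 0.7},
    "orange": {"round": 0.9, "red": 0.1, "2.5in": 0.6},
}

def posterior(features):
    """Naive Bayes posterior: prior times the product of feature likelihoods,
    normalized so that the class probabilities sum to 1."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature in features:
            score *= likelihoods[cls][feature]  # independence assumption
        scores[cls] = score
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

result = posterior(["round", "red", "2.5in"])
print(result)
```

GaussianNB and MultinomialNB follow the same recipe and differ in how they model each P(feature | class).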

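6. kNN (k-Nearest Neighbors)

This is a Python machine learning algorithm for classification and regression, used mostly for classification. It is a supervised learning algorithm that stores the labeled training points and uses a distance function, usually Euclidean, to compare a new case against them. It classifies the new case by a majority vote of its k nearest neighbors: the case is assigned to the class that is most common among those k neighbors.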
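The majority vote described above fits in a few lines of NumPy. This is a minimal sketch on six hypothetical 2-D points. The function name `knn_predict` is made up for illustration and is not part of scikit-learn.

```python
from collections import Counter

import numpy as np

def knn_predict(x_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(x_train - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated groups
x_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(x_train, y_train, np.array([1.1, 1.0])))
print(knn_predict(x_train, y_train, np.array([5.0, 5.1])))
```

There is no training step beyond storing the data. All the work happens at prediction time.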

I. Training and testing on the entire dataset

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> x = iris.data
>>> y = iris.target
>>> from sklearn.linear_model import LogisticRegression
>>> logreg = LogisticRegression()
>>> logreg.fit(x, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
>>> logreg.predict(x)
array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
       2,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,1,1,
       1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,
       2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,
       2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
>>> y_pred = logreg.predict(x)
>>> len(y_pred)
150
>>> from sklearn import metrics
>>> metrics.accuracy_score(y, y_pred)
0.96
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(x, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
>>> y_pred = knn.predict(x)
>>> metrics.accuracy_score(y, y_pred)
0.9666666666666667
>>> knn = KNeighborsClassifier(n_neighbors=1)
>>> knn.fit(x, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
>>> y_pred = knn.predict(x)
>>> metrics.accuracy_score(y, y_pred)
1.0

II. Splitting into train/test

>>> x.shape
(150, 4)
>>> y.shape
(150,)
>>> from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=4)
>>> x_train.shape
(90, 4)
>>> x_test.shape
(60, 4)
>>> y_train.shape
(90,)
>>> y_test.shape
(60,)
>>> logreg = LogisticRegression()
>>> logreg.fit(x_train, y_train)
>>> y_pred = logreg.predict(x_test)  # evaluate the model that was just fitted
>>> metrics.accuracy_score(y_test, y_pred)
0.9666666666666667
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(x_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
>>> y_pred = knn.predict(x_test)
>>> metrics.accuracy_score(y_test, y_pred)
0.9666666666666667
>>> k_range = range(1, 26)
>>> scores = []
>>> for k in k_range:  # test accuracy for k = 1..25
...     knn = KNeighborsClassifier(n_neighbors=k)
...     knn.fit(x_train, y_train)
...     y_pred = knn.predict(x_test)
...     scores.append(metrics.accuracy_score(y_test, y_pred))
...
>>> scores
[0.95, 0.95, 0.9666666666666667, 0.9666666666666667, 0.9666666666666667, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9666666666666667, 0.9833333333333333, 0.9666666666666667, 0.9666666666666667, 0.9666666666666667, 0.9666666666666667, 0.95, 0.95]
>>> import matplotlib.pyplot as plt
>>> plt.plot(k_range, scores)
[<matplotlib.lines.Line2D object at 0x05FDECD0>]
>>> plt.xlabel('k for kNN')
Text(0.5,0,'k for kNN')
>>> plt.ylabel('Testing Accuracy')
Text(0,0.5,'Testing Accuracy')
>>> plt.show()

Python Machine Learning Algorithms - kNN (k-Nearest Neighbors)


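7. K-Means

k-Means is an unsupervised algorithm that solves the clustering problem. It groups the data into a chosen number of clusters. Data points inside a cluster are homogeneous with one another and heterogeneous to the points in other clusters.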
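The clustering loop itself is short: assign every point to its nearest centroid, then move each centroid to the mean of its points, and repeat until nothing changes. The sketch below runs this on six sample 2-D points. Initializing the centroids with the first k points is a simplification assumed here to keep the run deterministic. scikit-learn's KMeans uses k-means++ initialization instead.

```python
import numpy as np

def kmeans(points, k, n_iter=10):
    """Plain k-means (Lloyd's algorithm) with a deterministic toy initialization."""
    centroids = points[:k].copy()  # assumption: start from the first k points
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)  # nearest centroid per point
        # Move each centroid to the mean of its points (keep it if it has none)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return centroids, labels

points = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Because the result depends on the starting centroids, practical implementations run several initializations and keep the best clustering.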

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from matplotlib import style
>>> style.use('ggplot')
>>> from sklearn.cluster import KMeans
>>> x = [1, 5, 1.5, 8, 1, 9]
>>> y = [2, 8, 1.7, 6, 0.2, 12]
>>> plt.scatter(x, y)
<matplotlib.collections.PathCollection object at 0x0642AF30>
>>> x = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
>>> kmeans = KMeans(n_clusters=2)
>>> kmeans.fit(x)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
>>> centroids = kmeans.cluster_centers_
>>> labels = kmeans.labels_
>>> centroids
array([[1.16666667, 1.46666667],
       [7.33333333, 9.        ]])
>>> labels
array([0, 1, 0, 1, 0, 1])
>>> colors = ['g.', 'r.', 'c.', 'y.']
>>> for i in range(len(x)):  # plot each point in the colour of its cluster
...     print(x[i], labels[i])
...     plt.plot(x[i][0], x[i][1], colors[labels[i]], markersize=10)
...
[1. 2.] 0
[<matplotlib.lines.Line2D object at 0x0642AE10>]
[5. 8.] 1
[<matplotlib.lines.Line2D object at 0x06438930>]
[1.5 1.8] 0
[<matplotlib.lines.Line2D object at 0x06438BF0>]
[8. 8.] 1
[<matplotlib.lines.Line2D object at 0x06438EB0>]
[1.  0.6] 0
[<matplotlib.lines.Line2D object at 0x06438FB0>]
[ 9. 11.] 1
[<matplotlib.lines.Line2D object at 0x043B1410>]
>>> plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=150, linewidths=5, zorder=10)
<matplotlib.collections.PathCollection object at 0x043B14D0>
>>> plt.show()

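8. Random Forest

A random forest is an ensemble of decision trees. To classify a new object based on its attributes, each tree gives a classification, that is, the tree votes for that class. The classification with the most votes wins in the forest.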
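The voting idea can be sketched directly: train several decision trees on bootstrap samples of the data and let the majority of their predictions decide. The tree count of 15, the seed, and `max_features="sqrt"` are illustrative assumptions. scikit-learn's RandomForestClassifier differs in detail, since it averages the trees' class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x, y = iris.data, iris.target
rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

trees = []
for _ in range(15):
    # Bootstrap sample: draw len(x) rows with replacement
    idx = rng.integers(0, len(x), size=len(x))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(x[idx], y[idx]))

# One row of predictions per tree, shape (n_trees, n_samples)
all_votes = np.array([tree.predict(x) for tree in trees])

# Majority vote per sample across the trees
forest_pred = np.array([np.bincount(votes, minlength=3).argmax()
                        for votes in all_votes.T])

accuracy = (forest_pred == y).mean()
print("training accuracy of the voted ensemble:", accuracy)
```

Each tree sees a slightly different dataset and a random subset of features at every split, so the trees' errors tend to differ and the vote smooths them out.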

>>> import numpy as np
>>> import pylab as pl
>>> x = np.random.uniform(1, 100, 1000)
>>> y = np.log(x) + np.random.normal(0, .3, 1000)  # log(x) plus Gaussian noise
>>> pl.scatter(x, y, s=1, label='log(x) with noise')
<matplotlib.collections.PathCollection object at 0x0434EC50>
>>> pl.plot(np.arange(1, 100), np.log(np.arange(1, 100)), c='b', label='log(x) true function')
[<matplotlib.lines.Line2D object at 0x0434EB30>]
>>> pl.xlabel('x')
Text(0.5,0,'x')
>>> pl.ylabel('f(x)=log(x)')
Text(0,0.5,'f(x)=log(x)')
>>> pl.legend(loc='best')
<matplotlib.legend.Legend object at 0x04386450>
>>> pl.title('A basic log function')
Text(0.5,1,'A basic log function')
>>> pl.show()

Python Machine Learning Algorithms -

>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> import pandas as pd
>>> import numpy as np
>>> iris = load_iris()
>>> df = pd.DataFrame(iris.data, columns=iris.feature_names)
>>> df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75  # roughly 75% of rows go to training
>>> df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
>>> df.head()
   sepal length (cm)  sepal width (cm)  ...  is_train  species
0                5.1               3.5  ...      True   setosa
1                4.9               3.0  ...      True   setosa
2                4.7               3.2  ...      True   setosa
3                4.6               3.1  ...      True   setosa
4                5.0               3.6  ...     False   setosa

[5 rows x 6 columns]
>>> train, test = df[df['is_train'] == True], df[df['is_train'] == False]
>>> features = df.columns[:4]  # the four measurement columns
>>> clf = RandomForestClassifier(n_jobs=2)
>>> y, _ = pd.factorize(train['species'])  # encode species names as integers
>>> clf.fit(train[features], y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)