
Anna-Lena Popkes is a graduate student in computer science at the University of Bonn, Germany, whose work focuses on machine learning and neural networks.

Compiled by 林椿眄 | Produced by 人工智能头条

Introduction: Python is often called the language closest to AI. Anna-Lena Popkes recently shared on GitHub her notes on implementing seven machine learning algorithms in Python (3.6 and above), complete with full code. None of the implementations rely on other machine learning libraries. The notes are meant to give you a basic understanding of each algorithm and its underlying structure, not to provide the most efficient implementation.
k-NN is a simple supervised machine learning algorithm that can be used for both classification and regression problems. It is an instance-based algorithm: rather than estimating a model, it stores all training samples in memory and makes predictions using a similarity measure.

Given an input example, the k-NN algorithm retrieves the k most similar instances from memory. Similarity is defined in terms of distance: the training samples with the smallest (Euclidean) distance to the input example are considered the most similar.
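Concretely, for d-dimensional feature vectors, the Euclidean distance between an input example $x$ and a training sample $x'$ is

$$d(x, x') = \sqrt{\sum_{j=1}^{d} (x_j - x'_j)^2}$$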
The target value of the input example is then computed as follows:

Classification:

a) Unweighted: output the most common classification among the k nearest neighbors.
b) Weighted: sum the weights of the k nearest neighbors per classification value and output the classification with the highest total weight.

Regression:

a) Unweighted: output the average of the values of the k nearest neighbors.
b) Weighted: sum the weighted target values of the k nearest neighbors and divide the result by the sum of all weights.
The weighted version of k-NN is a refinement in which the contribution of each neighbor is weighted by its distance to the query point (a minimal sketch of this variant follows below). In the rest of this post, we implement the original, unweighted version of the algorithm and use it to classify the digits dataset from sklearn.
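To make the weighting rule above concrete, here is a minimal sketch of distance-weighted prediction for a single query point. It is not part of the original notebook; the helper names (weighted_knn_predict, weighted_knn_regress) and the 1/(distance + eps) weighting scheme are illustrative assumptions, and any other decreasing weight function would work as well.

import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5, eps=1e-8):
    '''Distance-weighted k-NN classification for a single query point x.'''
    # distance from the query point to every training example
    dists = np.sqrt(np.sum((X_train - x)**2, axis=1))
    knn = np.argsort(dists)[:k]          # indices of the k nearest neighbors
    weights = 1.0 / (dists[knn] + eps)   # closer neighbors receive larger weights

    # accumulate the weight of each classification value,
    # then output the classification with the highest total weight
    votes = {}
    for label, w in zip(y_train[knn], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

def weighted_knn_regress(X_train, y_train, x, k=5, eps=1e-8):
    '''Distance-weighted k-NN regression for a single query point x.'''
    dists = np.sqrt(np.sum((X_train - x)**2, axis=1))
    knn = np.argsort(dists)[:k]
    weights = 1.0 / (dists[knn] + eps)
    # weighted sum of the neighbors' target values, divided by the total weight
    return np.sum(weights * y_train[knn]) / np.sum(weights)

For example, weighted_knn_predict(X_train, y_train, X_test[0]) would return a predicted digit for the first test image once the data below is loaded.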
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

np.random.seed(123)
%matplotlib inline
The Dataset
In [2]:
# We will use the digits dataset as an example. It consists of 1797 images of
# hand-written digits. Each digit is represented by a 64-dimensional vector of pixel values.
digits = load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y)
print(f'X_train shape: {X_train.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_test shape: {y_test.shape}')

# Example digits
fig = plt.figure(figsize=(10,8))
for i in range(10):
    ax = fig.add_subplot(2, 5, i+1)
    plt.imshow(X[i].reshape((8,8)), cmap='gray')
X_train shape: (1347, 64)
y_train shape: (1347,)
X_test shape: (450, 64)
y_test shape: (450,)

The kNN Class
In [3]:
class kNN():
    def __init__(self):
        pass

    def fit(self, X, y):
        self.data = X
        self.targets = y

    def euclidean_distance(self, X):
        '''
        Computes the euclidean distance between the training data and
        a new input example or matrix of input examples X
        '''
        # input: single data point
        if X.ndim == 1:
            l2 = np.sqrt(np.sum((self.data - X)**2, axis=1))

        # input: matrix of data points
        if X.ndim == 2:
            n_samples, _ = X.shape
            l2 = [np.sqrt(np.sum((self.data - X[i])**2, axis=1)) for i in range(n_samples)]

        return np.array(l2)

    def predict(self, X, k=1):
        '''
        Predicts the classification for an input example or matrix of input examples X
        '''
        # step 1: compute distance between input and training data
        dists = self.euclidean_distance(X)

        # step 2: find the k nearest neighbors and their classifications
        if X.ndim == 1:
            if k == 1:
                nn = np.argmin(dists)
                return self.targets[nn]
            else:
                knn = np.argsort(dists)[:k]
                y_knn = self.targets[knn]
                max_vote = max(y_knn, key=list(y_knn).count)
                return max_vote

        if X.ndim == 2:
            knn = np.argsort(dists)[:, :k]
            y_knn = self.targets[knn]
            if k == 1:
                # note: this returns an array of shape (1, n_samples)
                return y_knn.T
            else:
                n_samples, _ = X.shape
                max_votes = [max(y_knn[i], key=list(y_knn[i]).count)
                             for i in range(n_samples)]
                return max_votes
Initializing and Training the Model
In [11]:
knn = kNN()
knn.fit(X_train, y_train)

print('Testing one datapoint, k=1')
print(f'Predicted label: {knn.predict(X_test[0], k=1)}')
print(f'True label: {y_test[0]}')
print()
print('Testing one datapoint, k=5')
print(f'Predicted label: {knn.predict(X_test[20], k=5)}')
print(f'True label: {y_test[20]}')
print()
print('Testing 10 datapoints, k=1')
print(f'Predicted labels: {knn.predict(X_test[5:15], k=1)}')
print(f'True labels: {y_test[5:15]}')
print()
print('Testing 10 datapoints, k=4')
print(f'Predicted labels: {knn.predict(X_test[5:15], k=4)}')
print(f'True labels: {y_test[5:15]}')
print()
Testing one datapoint, k=1
Predicted label: 3
True label: 3

Testing one datapoint, k=5
Predicted label: 9
True label: 9

Testing 10 datapoints, k=1
Predicted labels: [[3 1 0 7 4 0 0 5 1 6]]
True labels: [3 1 0 7 4 0 0 5 1 6]

Testing 10 datapoints, k=4
Predicted labels: [3, 1, 0, 7, 4, 0, 0, 5, 1, 6]
True labels: [3 1 0 7 4 0 0 5 1 6]
Accuracy on the Test Set
In [12]:
# Compute accuracy on test set
# For k=1 the predictions have shape (1, n_samples), hence the [0]
y_p_test1 = knn.predict(X_test, k=1)
test_acc1 = np.sum(y_p_test1[0] == y_test)/len(y_p_test1[0]) * 100
print(f'Test accuracy with k = 1: {test_acc1}')

y_p_test5 = knn.predict(X_test, k=5)
test_acc5 = np.sum(y_p_test5 == y_test)/len(y_p_test5) * 100
print(f'Test accuracy with k = 5: {test_acc5}')
Test accuracy with k = 1: 97.77777777777777
Test accuracy with k = 5: 97.55555555555556
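As an optional sanity check (not part of the original notebook), the same train/test split can be scored with sklearn's built-in KNeighborsClassifier. The accuracies should be close to the ones above, though tie-breaking among equidistant neighbors may cause small differences:

from sklearn.neighbors import KNeighborsClassifier

# fit sklearn's k-NN classifier on the same training data and score the test set
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)
print(f'sklearn test accuracy with k = 1: {clf.score(X_test, y_test) * 100}')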