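[Original] ML/DR: A Detailed Guide to sklearn.manifold (the Manifold Learning and Dimensionality Reduction Module), with an Introduction, Partial Source Code Walkthrough, and Example Applications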

处女座的程序猿, published 2023-04-24 from Shanghai

Introduction to sklearn.manifold

Overview of sklearn.manifold (the manifold learning and dimensionality reduction module)

Brief description

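sklearn.manifold is a module of the scikit-learn machine learning library that implements manifold learning, that is, non-linear dimensionality reduction. It includes LLE (locally linear embedding), Isomap, MDS (multidimensional scaling), and t-SNE (t-distributed stochastic neighbor embedding), among others. PCA (principal component analysis) is often mentioned alongside these methods, but it is a linear technique and lives in sklearn.decomposition, not in sklearn.manifold.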

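Purpose

sklearn.manifold makes it easy to apply manifold learning and dimensionality reduction algorithms, and each estimator exposes parameters that can be tuned to the needs of a particular application.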

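The low-dimensional output is also well suited to visualization, which helps when analyzing and presenting results. The module itself does not ship plotting utilities; the embedding is typically drawn with a separate plotting library such as matplotlib.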
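As a rough illustration of visualizing an embedding, the sketch below reduces a synthetic 3-D S-curve to two dimensions with Isomap and draws a scatter plot. It assumes matplotlib is installed (it is a separate package, not part of scikit-learn); the dataset, the parameter values, and the output file name are illustrative choices, not something prescribed by the article.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# Synthetic 3-D points lying on a 2-D S-shaped surface.
# t is the position of each point along the curve, used here only for coloring.
X, t = make_s_curve(n_samples=300, random_state=0)

# Reduce from 3 dimensions to 2 (parameter values are illustrative).
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# Plot the 2-D embedding, colored by position along the curve.
fig, ax = plt.subplots(figsize=(6, 5))
points = ax.scatter(embedding[:, 0], embedding[:, 1], c=t, cmap="viridis", s=10)
fig.colorbar(points, ax=ax, label="position along the S-curve")
ax.set_xlabel("component 1")
ax.set_ylabel("component 2")
ax.set_title("Isomap embedding of an S-curve")
fig.savefig("s_curve_isomap.png", dpi=100)  # hypothetical output file name
```

If neighboring points in the plot share similar colors, the embedding has preserved the ordering of points along the original surface.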

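Usage

In sklearn.manifold, each algorithm has a corresponding class: LLE corresponds to LocallyLinearEmbedding, Isomap to Isomap, MDS to MDS, and t-SNE to TSNE. The PCA class, by contrast, is imported from sklearn.decomposition.

These classes all provide a fit_transform method, which fits the model to a dataset and returns the reduced-dimension embedding. They also take constructor parameters such as n_components, which sets the output dimensionality, and, for the neighbor-based methods such as LocallyLinearEmbedding and Isomap, n_neighbors, which sets the number of neighbors considered for each point.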
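The fit_transform pattern described above can be sketched as follows. The swiss-roll dataset and all parameter values are illustrative assumptions; perplexity is shown for TSNE because that class has no n_neighbors parameter.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding, TSNE

# Synthetic 3-D data lying on a rolled-up 2-D sheet (illustrative dataset).
X, _ = make_swiss_roll(n_samples=200, random_state=0)

# Neighbor-based methods accept both n_neighbors and n_components.
X_isomap = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_lle = LocallyLinearEmbedding(
    n_neighbors=10, n_components=2, random_state=0
).fit_transform(X)

# t-SNE takes n_components but not n_neighbors; perplexity loosely
# controls how many neighbors each point effectively attends to.
X_tsne = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)

# Each result has one row per sample and n_components columns.
print(X.shape, X_isomap.shape, X_lle.shape, X_tsne.shape)
```

All three estimators are used the same way, so switching algorithms is mostly a matter of changing the class and its algorithm-specific parameters.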

Excerpt from the scikit-learn documentation

Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.

High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization of the structure of a dataset, the dimension must be reduced in some way.

The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired. In a random projection, it is likely that the more interesting structure within the data will be lost.

To address this concern, a number of supervised and unsupervised linear dimensionality reduction frameworks have been designed, such as Principal Component Analysis (PCA), Independent Component Analysis, Linear Discriminant Analysis, and others. These algorithms define specific rubrics to choose an “interesting” linear projection of the data. These methods can be powerful, but often miss important non-linear structure in the data.

Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications.

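To make the contrast between a linear projection and manifold learning concrete, here is a minimal sketch that compares PCA with Isomap on a synthetic swiss roll. The dataset, the parameter values, and the correlation check (the helper abs_corr is hypothetical, introduced only for this example) are assumptions made for illustration; the check is a rough diagnostic, not a standard evaluation metric.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# 3-D points on a rolled-up 2-D sheet; t is each point's position along the roll.
X, t = make_swiss_roll(n_samples=1000, random_state=0)

# Linear projection onto the two directions of largest variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Non-linear embedding based on distances along the neighbor graph.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)


def abs_corr(a, b):
    """Absolute Pearson correlation between two 1-D arrays (illustrative helper)."""
    return abs(np.corrcoef(a, b)[0, 1])


# How well does the first output coordinate track position along the roll?
corr_pca = abs_corr(X_pca[:, 0], t)
corr_iso = abs_corr(X_iso[:, 0], t)
print(f"PCA:    |corr(component 1, t)| = {corr_pca:.2f}")
print(f"Isomap: |corr(component 1, t)| = {corr_iso:.2f}")
```

On this data the first Isomap coordinate should follow the position along the roll much more closely than the first PCA coordinate does, because PCA can only project the roll flat while Isomap effectively unrolls it.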