Matplotlib (6): Histograms, Binnings, and Density

LibraryPKU, published 2022-05-28 in Beijing

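Contents

(1) Simple histograms
(2) Two-dimensional histograms and binnings

1. plt.hist2d: two-dimensional histogram
2. plt.hexbin: hexagonal binnings
3. Kernel density estimation

7. Histograms, Binnings, and Density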

[ Matplotlib version: 3.2.1 ]

(1) Simple histograms

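%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# 'seaborn-white' is valid in Matplotlib 3.2.1; from 3.6 on it is named 'seaborn-v0_8-white'
plt.style.use('seaborn-white')
data = np.random.randn(1000)  # 1,000 draws from a standard normal distribution
plt.hist(data)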

Customizing the histogram

plt.hist(data, bins=30, density=True, alpha=0.5, histtype='stepfilled', color='steelblue', edgecolor='none')

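Multiple histograms on the same axes

When histograms are used to compare samples drawn from different distributions, combining histtype='stepfilled' with the transparency parameter alpha works very well, because the overlapping regions remain visible.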

x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)

kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40)

plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs)
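
One thing to keep in mind with bins=40 is that each plt.hist call picks its own 40 bins from the range of its own sample, so the three histograms do not share bin edges. If identical bins are wanted for a like-for-like comparison, one possible approach is sketched below. The fixed seed and the name shared_edges are assumptions added for illustration and are not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
x1 = rng.normal(0, 0.8, 1000)
x2 = rng.normal(-2, 1, 1000)
x3 = rng.normal(3, 2, 1000)

# Compute a single set of 40 bins spanning all three samples together.
shared_edges = np.histogram_bin_edges(np.concatenate([x1, x2, x3]), bins=40)

# Passing the edge array as `bins` makes every histogram use identical bins.
counts1, edges1 = np.histogram(x1, bins=shared_edges)
counts2, edges2 = np.histogram(x2, bins=shared_edges)
counts3, edges3 = np.histogram(x3, bins=shared_edges)
```

The same array can be handed to the plotting call, for example plt.hist(x1, bins=shared_edges, histtype='stepfilled', alpha=0.3, density=True).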

If you only need to compute the histogram (that is, count the number of samples in each bin) and do not want to display it, use np.histogram() directly:

counts, bin_edges = np.histogram(data, bins=5)
counts
# array([ 36, 255, 445, 230,  34])
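
The relationship between the two returned arrays, and what density=True does to the counts, can be checked with a short sketch. The seeded generator and the helper names widths, centers, and manual_density are assumptions for illustration; the exact counts will differ from the output shown above because the random sample differs.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
data = rng.standard_normal(1000)

counts, bin_edges = np.histogram(data, bins=5)

# n bins are described by n + 1 edges.
widths = np.diff(bin_edges)
centers = (bin_edges[:-1] + bin_edges[1:]) / 2

# density=True rescales each count by (total count * bin width),
# so the bars enclose a total area of 1.
density, _ = np.histogram(data, bins=5, density=True)
manual_density = counts / (counts.sum() * widths)
area = np.sum(density * widths)
```

The centers array is convenient when the counts are to be drawn with a line or bar plot instead of plt.hist.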

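For more details, see: matplotlib.pyplot.hist - Matplotlib 3.2.1 documentation

(2) Two-dimensional histograms and binnings

Just as a one-dimensional array can be divided into bins to build a one-dimensional histogram, a two-dimensional dataset can be divided into two-dimensional bins to build a two-dimensional histogram.

First, generate the x and y sample data from a multivariate Gaussian distribution: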

mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
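
As a sanity check on what these parameters mean, the sample mean and sample covariance of the generated points should land close to mean and cov. The sketch below assumes a seeded Generator (rng) in place of the global np.random state used above; the names samples, sample_mean, and sample_cov are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
mean = [0, 0]
cov = [[1, 1], [1, 2]]

# One row per sample, one column per dimension: shape (10000, 2).
samples = rng.multivariate_normal(mean, cov, 10000)
x, y = samples.T  # transpose to unpack the two coordinates

# With 10,000 points these estimates should be close to `mean` and `cov`.
sample_mean = samples.mean(axis=0)
sample_cov = np.cov(samples, rowvar=False)
```

The positive off-diagonal entry in cov is what makes x and y correlated, which is why the plots in this section show a tilted, elongated cloud of points.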

1. plt.hist2d: two-dimensional histogram

The simplest way to draw a two-dimensional histogram is Matplotlib's plt.hist2d function:

plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')

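Just as plt.hist has a counterpart in np.histogram, plt.hist2d has a counterpart in np.histogram2d, which computes the counts without plotting them: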

counts, xedges, yedges = np.histogram2d(x, y, bins=30)
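
The layout of the arrays returned by np.histogram2d is easy to get wrong, so a small sketch may help. It assumes a seeded generator in place of the data generated earlier, and the names image_like, peak_x, and peak_y are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
x, y = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], 10000).T

counts, xedges, yedges = np.histogram2d(x, y, bins=30)

# counts[i, j] is the number of samples whose x falls in bin i and y in bin j.
# Image-style displays put y on the rows, so transpose before showing it that way.
image_like = counts.T

# Locate the fullest bin and convert its indices to bin-centre coordinates.
i, j = np.unravel_index(np.argmax(counts), counts.shape)
peak_x = (xedges[i] + xedges[i + 1]) / 2
peak_y = (yedges[j] + yedges[j + 1]) / 2
```

With bins=30 both edge arrays have 31 entries, and the counts sum to the number of samples.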

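For more details, see: matplotlib.pyplot.hist2d - Matplotlib 3.2.1 documentation

2. plt.hexbin: hexagonal binnings

A two-dimensional histogram tiles the plane with rectangular bins aligned to the axes. Another common choice is to tile it with regular hexagons.

Matplotlib provides plt.hexbin, which divides a two-dimensional dataset into a honeycomb-like grid of hexagons: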

plt.hexbin(x, y, gridsize=30, cmap='Blues')
cb = plt.colorbar(label='count in bin')
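
The object returned by hexbin also carries the per-hexagon counts, which can be read back after plotting. The following is a minimal sketch, assuming a non-interactive Agg backend and a seeded generator so that it runs outside a notebook; the names hb, hex_counts, and hex_centers are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
x, y = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], 10000).T

fig, ax = plt.subplots()
hb = ax.hexbin(x, y, gridsize=30, cmap='Blues')
fig.colorbar(hb, ax=ax, label='count in bin')

# hexbin returns a PolyCollection; its array holds one count per drawn hexagon
# and its offsets hold the matching hexagon centres.
hex_counts = hb.get_array()
hex_centers = hb.get_offsets()
```

Here gridsize=30 sets the number of hexagons along the x direction; the number along y is chosen so that the hexagons come out roughly regular.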

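For more details, see: matplotlib.pyplot.hexbin - Matplotlib 3.2.1 documentation

3. Kernel density estimation

Another common method for estimating the density of multidimensional data is kernel density estimation (KDE).

The following is a brief demonstration of how KDE "smears out" the discrete data points in space and thereby fits a smooth function. (scipy.stats provides a quick KDE implementation.)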

from scipy.stats import gaussian_kde

# Fit on an array of shape [Ndim, Nsamples]
data = np.vstack([x, y])
kde = gaussian_kde(data)

# Evaluate the fitted density on a regular grid
xgrid = np.linspace(-3.5, 3.5, 40)
ygrid = np.linspace(-6, 6, 40)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))

# Plot the result as an image
plt.imshow(Z.reshape(Xgrid.shape), origin='lower', aspect='auto',
           extent=[-3.5, 3.5, -6, 6], cmap='Blues')
cb = plt.colorbar()
cb.set_label('density')

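KDE uses a smoothing length (bandwidth) to trade off the accuracy of the fitted function against its smoothness.
Finding an appropriate smoothing length is hard; gaussian_kde uses a rule of thumb to look for a near-optimal smoothing length for the input data.
Other KDE implementations exist in the SciPy ecosystem, each with its own strengths and weaknesses, for example sklearn.neighbors.KernelDensity and statsmodels.nonparametric.kernel_density.KDEMultivariate.
Creating KDE-based visualizations with Matplotlib is fairly verbose; Seaborn provides a more concise API for them.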
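
To make the bandwidth trade-off concrete, here is a hand-rolled one-dimensional Gaussian KDE in plain NumPy. It is a sketch for illustration only, not how scipy.stats.gaussian_kde is implemented: the function gaussian_kde_1d, the roughness measure, and the two bandwidth values are all assumptions chosen for this example. (With SciPy itself, the bandwidth can be set through the bw_method argument of gaussian_kde.)

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Illustrative KDE: the average of normal pdfs centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

def roughness(curve):
    """Crude wiggliness measure: summed absolute second differences."""
    return np.abs(np.diff(curve, n=2)).sum()

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
samples = rng.standard_normal(500)
grid = np.linspace(-6, 6, 601)
dx = grid[1] - grid[0]

# A small bandwidth follows individual points closely (spiky estimate);
# a large bandwidth blurs them together (smooth, but flattened estimate).
narrow = gaussian_kde_1d(samples, grid, bandwidth=0.05)
wide = gaussian_kde_1d(samples, grid, bandwidth=1.0)

# Both remain valid densities: the area under each curve is close to 1.
area_narrow = narrow.sum() * dx
area_wide = wide.sum() * dx
```

Plotting narrow and wide against grid with plt.plot shows the two extremes side by side; a rule-of-thumb bandwidth such as the one gaussian_kde picks by default aims for something in between.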