Visualizing Data: How Does Matplotlib Show What It Can Do?

静幻堂 2019-12-14


Original article by 读芯术, 2019-07-24 17:01:00

6,661 characters in total; estimated reading time 20 minutes or more

Image source: pexels.com/@divinetechygirl

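In the modern digital world, data is as essential as air.

Every day, people consume and produce large amounts of data, whether they are aware of it or not. Many businesses now try to use this data for marketing and to attract customers. Industries of all kinds are personalizing their services and selling consumers on a better user experience. All of this rests on advances in artificial intelligence and machine learning within data science: machines are getting smarter and can make decisions by analyzing large volumes of data.

To analyze large datasets, practitioners rely on data visualization tools, many of which are built with Python. This article therefore addresses the following questions:

1. What is data visualization?

2. What data visualization tools are available?

3. How are these tools used?

4. Why learn to use them?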

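Data Visualization in Data Science

The human brain processes images more easily than raw numbers, hence the saying that a picture is worth a thousand words. This applies fully to data science, where large amounts of data are examined visually in order to derive data models.

Data visualization is a data science technique for telling a convincing story: it presents data and analysis results in a visual form that is easy to grasp. It makes complex data look simple and understandable.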

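Data Visualization Tools

Some commonly used data visualization tools are:

1. Matplotlib

2. Seaborn

3. Plotly

4. Pandas

Learning these tools helps with understanding data, extracting information, and making decisions. This article covers Matplotlib in detail.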

Matplotlib

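Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and in interactive environments across platforms. It can be used in Python scripts, the Python and IPython shells, Jupyter notebook, web application servers, and four graphical user interface toolkits.

Matplotlib is widely used for data visualization and works well. Its pyplot interface closely resembles MATLAB's, and it gives users a great deal of flexibility in code. Writing plotting code can be tedious, but Matplotlib leaves the user a lot of freedom.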

Installing Matplotlib

1. Using pip

python -m pip install -U pip
python -m pip install -U matplotlib

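2. Using a scientific Python distribution

Several third-party scientific distributions are available, for example:

· Anaconda

· Canopy

· ActiveState

This article recommends Anaconda above the others. It is one of the most commonly used Python data science distributions, makes it easy to install data science packages, and comes with tools such as NumPy, SciPy, Pandas, Matplotlib, and Plotly preinstalled. It is a good choice for everyone, and installation is quick.

Any package can be installed by running a conda command in the conda terminal, although you may need to check the official site for the exact form of the command.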

conda install PackageName

For Matplotlib:

conda install matplotlib

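Matplotlib offers several categories of plotting functionality:

1. Lines, bars, and markers

2. Images, contours, and fields

3. Pie and polar charts

4. Statistical plots

and many more. These features are widely used to draw line charts, bar charts, histograms, pie charts, and so on.

Gallery: https:///gallery/index.html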

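Case Study

As noted above, Matplotlib can draw many kinds of figures, such as scatter plots, bar charts, and histograms. Choose the plot type according to what the visualization needs to show: a comparison between groups, a comparison of quantitative variables, an analysis of a data distribution, and so on.

The following sections introduce a few commonly used plotting techniques.

Prerequisites

Before working through the examples, the tools must be installed:

Installing the Anaconda distribution

1. First, make sure Anaconda is installed

Installation guide: https://docs./anaconda/install/

Launching Jupyter Notebook

Once Anaconda is installed, open Anaconda Navigator and launch Jupyter notebook (as shown in the figure below). The examples are coded in Jupyter notebook.

Checking the preinstalled packages

See the figure below: under the Environments menu, the installed packages are listed on the right. For example, searching for Pandas shows on the right that Pandas is installed. In the same way, you can enter the name of any package you need and install it. Check that matplotlib, numpy, pandas, and seaborn are installed.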
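The same check can also be done from a notebook cell instead of the Navigator screen. The following is a minimal sketch that only uses the Python standard library; the package list simply repeats the names mentioned above.

```python
import importlib.util

# Package names taken from the checklist above.
required = ["matplotlib", "numpy", "pandas", "seaborn"]

# find_spec returns None when a top-level package cannot be located.
status = {name: importlib.util.find_spec(name) is not None for name in required}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any package reported as missing can then be installed with conda or pip as described earlier.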

Once all the packages are installed, the first step is to draw a bar chart.

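Key Points About Matplotlib

Matplotlib contains a plotting submodule called Pyplot. Jupyter notebook is a convenient, easy-to-use environment for plotting. Import the Pyplot module with import matplotlib.pyplot as plt.

· Import the required libraries, and load the dataset with Pandas pd.read_csv().

· Use plt.plot() to draw line charts; other plot types have their own functions. Every plotting function needs data, which is passed in as arguments.

· Use plt.xlabel and plt.ylabel to label the x-axis and y-axis.

· Use plt.xticks and plt.yticks to set the tick marks on the x-axis and y-axis.

· Use plt.legend() to identify the plotted variables.

· Use plt.title() to set the figure title.

· Use plt.show() to display the figure.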
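The calls listed above can be put together in one short script. This is only a sketch: the monthly figures are made up for illustration, and the Agg backend is an assumption chosen so the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless-safe backend; remove this line in Jupyter
import matplotlib.pyplot as plt

# Illustrative data (invented for this sketch, not from the article).
months = [1, 2, 3, 4, 5]
sales_a = [12, 15, 11, 18, 21]
sales_b = [9, 10, 14, 13, 17]

plt.figure()
plt.plot(months, sales_a, label="Product A")   # data is passed as arguments
plt.plot(months, sales_b, label="Product B")
plt.xlabel("Month")                            # axis labels
plt.ylabel("Units sold")
plt.xticks(months, ["Jan", "Feb", "Mar", "Apr", "May"])  # tick positions and labels
plt.yticks(range(0, 26, 5))
plt.legend()                                   # uses the label= values above
plt.title("Monthly sales (illustrative data)")
ax = plt.gca()                                 # keep a handle for later inspection
plt.show()
```

With a non-interactive backend plt.show() displays nothing, so in a script you would typically call plt.savefig() instead.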

1. Drawing a bar chart

# Import the matplotlib pyplot module under the alias plt
import matplotlib.pyplot as plt
plt.bar([1, 3, 5, 7, 9], [5, 2, 7, 8, 2], label="Example one")
plt.bar([2, 4, 6, 8, 10], [8, 6, 2, 5, 6], label="Example two", color='g')
plt.legend()
plt.xlabel('bar number')
plt.ylabel('bar height')
plt.title('Wow! We Got Our First Bar Graph')
plt.show()

Copy the code above into Jupyter notebook and run it; the resulting bar chart is shown below:

Explanation:

After the matplotlib package is imported, its pyplot submodule provides the command that draws the bar chart.

The following excerpt from the documentation describes the plt.bar method.

# matplotlib.pyplot.bar(x, height, width=0.8, bottom=None, *, align='center', data=None, **kwargs)
To make a bar plot:
The bars are positioned at x with the given alignment. Their dimensions are given by width and height. The vertical baseline is bottom (default 0).
Each of x, height, width, and bottom may either be a scalar applying to all bars, or it may be a sequence of length N providing a separate value for each bar.

Details: https:///3.1.0/api/_as_gen/matplotlib.pyplot.bar.html
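The scalar-or-sequence rule for width and bottom is easiest to see in a stacked bar chart. The sketch below uses invented numbers; width is passed as a scalar and bottom as a sequence, so each bar of the second series starts where the first series ends.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless-safe backend; remove this line in Jupyter
import matplotlib.pyplot as plt

# Illustrative data (invented for this sketch).
quarters = [1, 2, 3, 4]
online = [5, 7, 6, 8]
in_store = [3, 2, 4, 1]

fig, ax = plt.subplots()
# width is a scalar here, so it applies to every bar.
lower = ax.bar(quarters, online, width=0.6, label="Online")
# bottom is a sequence of length N: one baseline per bar.
upper = ax.bar(quarters, in_store, width=0.6, bottom=online, label="In store")
ax.set_xticks(quarters)
ax.set_xlabel("Quarter")
ax.set_ylabel("Orders")
ax.set_title("Stacked bars via the bottom parameter")
ax.legend()
```

In Jupyter the figure renders inline; in a script, finish with plt.show() or fig.savefig().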

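2. Histograms

A histogram represents the distribution of data with a series of vertical bars of varying height.

It is used to estimate how data is distributed: the values are divided into ranges (bins), and each bar shows the frequency of the values that fall into that range.

If you need the computed bin counts themselves, use NumPy's histogram() function. If you only want to look at the distribution, the .hist() method draws a simple histogram directly.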
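To make the first option concrete, here is a small sketch of np.histogram, which returns the counts and bin edges without drawing anything. The sample is synthetic, and the choice of 10 bins and a fixed seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
samples = rng.laplace(loc=15, scale=3, size=500)

# counts[i] is the number of samples between edges[i] and edges[i + 1];
# every bin is half-open except the last, which includes its right edge.
counts, edges = np.histogram(samples, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:7.2f} to {right:7.2f}: {count}")
```

There is always one more edge than there are counts, and the counts add up to the sample size.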

Matplotlib provides a general-purpose wrapper around NumPy's histogram() for visualizing histograms in Python:

Example:

# Histogram code
import matplotlib.pyplot as plt
import numpy as np  # numpy generates the sample array
np.set_printoptions(precision=3)
d = np.random.laplace(loc=15, scale=3, size=500)
print(d[:5])  # peek at the first five samples
# An "interface" to the matplotlib.axes.Axes.hist() method
n, bins, patches = plt.hist(x=d, bins='auto', color='#0504aa',
                            alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('My First Histogram Ever')
plt.text(23, 45, r'$\mu=15, b=3$')
maxfreq = n.max()
# Set a clean upper y-axis limit (top= replaces the older ymax= keyword).
plt.ylim(top=np.ceil(maxfreq / 10) * 10 if maxfreq % 10 else maxfreq + 10)
plt.show()

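Explanation:

The histogram is drawn with matplotlib's pyplot.hist(). You need to decide how many bins the histogram should have. The x-axis shows the bin edges and the y-axis the corresponding frequencies. In the histogram above, bins='auto' chooses between two algorithms to estimate the ideal number of bins. At a deeper level, the goal of these algorithms is to pick a bin width that represents the data most faithfully.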
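The two algorithms can be inspected directly in NumPy, whose binning strategies plt.hist accepts by name. NumPy documents 'auto' as the maximum of the 'sturges' and 'fd' (Freedman-Diaconis) estimators. The sketch below compares the three on a synthetic Laplace sample like the one used in the example; the seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
d = rng.laplace(loc=15, scale=3, size=500)

# Number of bins chosen by each rule (edges minus one).
bin_counts = {
    rule: len(np.histogram_bin_edges(d, bins=rule)) - 1
    for rule in ("sturges", "fd", "auto")
}
print(bin_counts)
```

For data like this, 'auto' ends up with whichever rule asks for more bins, that is, the narrower bin width.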

Output of the histogram code above:

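3. Scatter Plots

A scatter plot is a type of plot or mathematical diagram that uses Cartesian coordinates to display the values of two variables in a dataset. If the points are encoded by color, shape, or size, one additional variable can be shown. The data is displayed as a collection of points: the value of one variable determines each point's horizontal position, and the value of the other determines its vertical position.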
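As a sketch of how an extra variable can be encoded, the example below maps a hypothetical third variable z to both marker color and marker size. All three variables are random numbers generated for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless-safe backend; remove this line in Jupyter
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.random(n)
y = rng.random(n)
z = rng.random(n)  # hypothetical third variable, values in [0, 1)

fig, ax = plt.subplots()
# c maps z onto the colormap; s gives the marker area in points**2.
points = ax.scatter(x, y, c=z, s=20 + 180 * z, cmap="viridis", alpha=0.7)
fig.colorbar(points, ax=ax, label="z")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Third variable encoded as color and size")
```

Using color and size for the same variable is redundant on purpose here; in practice each channel can carry a different variable.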

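A scatter plot can suggest various kinds of correlation between variables, with a certain confidence interval. Take weight and height, with weight on the y-axis and height on the x-axis. The correlation can be positive (rising), negative (falling), or absent (uncorrelated). If the pattern of points slopes from lower left to upper right, the variables are positively correlated; if it slopes from upper left to lower right, they are negatively correlated.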
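The direction of a relationship can also be checked numerically with the Pearson correlation coefficient. The sketch below uses synthetic height and weight values; the linear relationship and the noise level are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic data: weight rises with height, plus random noise.
height_cm = rng.normal(170, 10, size=n)
weight_kg = 0.9 * (height_cm - 100) + rng.normal(0, 5, size=n)
unrelated = rng.normal(size=n)  # generated independently of height

# np.corrcoef returns the correlation matrix; [0, 1] is the pairwise value.
r_positive = np.corrcoef(height_cm, weight_kg)[0, 1]
r_negative = np.corrcoef(height_cm, -weight_kg)[0, 1]
r_none = np.corrcoef(height_cm, unrelated)[0, 1]

print(f"positive: {r_positive:.2f}, negative: {r_negative:.2f}, none: {r_none:.2f}")
```

A value near +1 corresponds to points rising from lower left to upper right, a value near -1 to points falling from upper left to lower right, and a value near 0 to no linear trend.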

Signature:

matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, verts=None, edgecolors=None, *, plotnonfinite=False, data=None, **kwargs)
x, y : array_like, shape (n, )
The data positions.
s : scalar or array_like, shape (n, ), optional
The marker size in points**2. Default is rcParams['lines.markersize'] ** 2.
c : color, sequence, or sequence of color, optional

More information: https:///3.1.0/api/_as_gen/matplotlib.pyplot.scatter.html

Example:

# Scatter plot example using matplotlib
import numpy as np
import matplotlib.pyplot as plt
# Create data
N = 100
x = np.random.rand(N)
y = np.random.rand(N)
colors = (0.0, 0.4, 1.0)  # one RGB color; components must be in the 0-1 range
area = np.pi * 3
# Plot (the color is wrapped in a list so it is read as a single RGB value)
plt.scatter(x, y, s=area, c=[colors], alpha=0.5)
plt.title('Scatter plot example using matplotlib')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

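The result in Jupyter notebook looks like this:

Understanding Data Visualization with a Real Dataset

The following examples use an automobile dataset downloaded from Kaggle to illustrate data visualization with Matplotlib: https://www./toramky/automobile-dataset

Remember:

1. Download the Automobile.csv file from the site above

2. Upload the file to the Jupyter working directory that contains your code

3. Plotting histograms: using grouped data by category:

Several histograms can be drawn on the same plot, which helps compare the distribution of a continuous variable across categories.

Using the Automobile.csv dataset as an example:

Reading the dataset: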

import pandas as pd
# Read the automobile dataset with the pandas read_csv method
df = pd.read_csv('Automobile.csv')
df.head()
# Running this in a notebook displays the first rows of the data, column by column.

Type or copy and paste the following code into a Jupyter notebook file:

import matplotlib.pyplot as plt
# If you don't want to call plt.show() every time, use this line
%matplotlib inline
# Each selection is empty if that make does not appear in the file.
x1 = df.loc[df.make == 'alfa-romero', 'horsepower']
x2 = df.loc[df.make == 'audi', 'horsepower']
x3 = df.loc[df.make == 'bmw', 'horsepower']
x4 = df.loc[df.make == 'ferrari', 'horsepower']
kwargs = dict(alpha=0.9, bins=100)
plt.hist(x1, **kwargs, color='g', label='alfa-romero')
plt.hist(x2, **kwargs, color='b', label='audi')
plt.hist(x3, **kwargs, color='r', label='bmw')
plt.hist(x4, **kwargs, color='y', label='ferrari')
plt.gca().set(title='Horsepower Variation for various make of a car', ylabel='Frequency')
#plt.xlim(50,200)
plt.legend();

The histogram below uses the values from this dataset.

Horsepower is visibly concentrated in the 110-120 hp range.
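When there are many categories, the per-make selection can be replaced by a loop over groupby, with one set of bin edges shared by all groups so that the bars line up. The sketch below does not use Automobile.csv: it builds a small synthetic table whose column names mirror the ones used above, and the makes and horsepower values are invented.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless-safe backend; remove this line in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic stand-in for the dataset (hypothetical makes and values).
cars = pd.DataFrame({
    "make": np.repeat(["make_a", "make_b", "make_c"], 40),
    "horsepower": np.concatenate([
        rng.normal(90, 10, 40),
        rng.normal(120, 15, 40),
        rng.normal(160, 20, 40),
    ]),
})

# Compute the bin edges once on the full column so every group shares them.
edges = np.histogram_bin_edges(cars["horsepower"], bins=20)

fig, ax = plt.subplots()
for make, group in cars.groupby("make"):
    ax.hist(group["horsepower"], bins=edges, alpha=0.6, label=make)
ax.set(title="Horsepower by make (synthetic data)",
       xlabel="Horsepower", ylabel="Frequency")
ax.legend()
```

Sharing the edges matters because passing an integer bin count to each call would bin every group over its own range, making the histograms harder to compare.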

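Scatter plot:

A scatter plot shows how the data is distributed. Here it is used to look at the price distribution by body style.

Copy and paste the following code into a Jupyter notebook file and run it.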

# Scatter Plot
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
df = pd.read_csv('Automobile.csv')
bodystyle = df['body_style']  # body style of each car
price = df['price']  # price of each car
plt.scatter(bodystyle, price, edgecolors='r')
plt.xlabel('body_style')
plt.ylabel('price (USD)')
plt.title('Price variation based on car body type')

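Output:

Observations:

Most data points fall under the sedan body style, where prices are usually between 10,000 and 15,000 US dollars. Hatchbacks come next. The wagon body style has the lowest prices.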
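Points on a categorical axis overlap heavily, so observations like these are worth confirming with a summary table. The sketch below shows the idea with pandas groupby on a tiny hand-made table; the rows are invented and are not taken from Automobile.csv.

```python
import pandas as pd

# Hypothetical rows using the same column names as the dataset.
cars = pd.DataFrame({
    "body_style": ["sedan", "sedan", "sedan", "hatchback", "hatchback", "wagon"],
    "price": [12000, 14500, 13000, 9000, 11000, 8500],
})

# Number of cars and median price for each body style.
summary = cars.groupby("body_style")["price"].agg(["count", "median"])
print(summary.sort_values("count", ascending=False))
```

Applied to the real data frame, the count column shows which body style has the most points and the median column gives a typical price for each.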

More plot types:

1. Violin plot

2. Stack plot

3. Stem plot

4. Line plot

5. Box plot
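Two of these, the box plot and the violin plot, summarize the same kind of grouped data in different ways. The sketch below draws both side by side from three synthetic samples; the group means are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless-safe backend; remove this line in Jupyter
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
# Three synthetic groups with different means.
groups = [rng.normal(loc, 1.0, 100) for loc in (0, 1, 2)]

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(8, 3))

# Box plot: median, quartiles, whiskers, and outliers for each group.
box = ax_box.boxplot(groups)
ax_box.set_title("Box plot")

# Violin plot: a density estimate for each group, with the median marked.
violin = ax_violin.violinplot(groups, showmedians=True)
ax_violin.set_title("Violin plot")

fig.tight_layout()
```

The box plot is compact and makes outliers explicit, while the violin plot shows the shape of each distribution.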

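The figure below gives an overview of the common types of data visualization charts; choose the one that suits the needs of your analysis:

Image source: https://twitter.com/TonyDeJonker/status/1097191707916025856/photo/1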