Learning Common Python Modules Together: pandas

老三的休闲书屋 2020-12-14


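About the author

@王多鱼

A recommendation algorithm engineer at Baidu.

Mainly responsible for optimizing the recall and ranking models used in recommendation.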

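1 Introduction

pandas is a Python data analysis package, created to solve data analysis tasks. It builds on a number of libraries and standard data models, and provides the tools needed to work efficiently with large datasets. It also offers a large number of functions and methods for processing data quickly and conveniently.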

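2 Data Structures

Data structures:

Series

DataFrame

Panel (deprecated and later removed from pandas; recent releases offer only Series and DataFrame)

(Several Series make up a DataFrame, and several DataFrames make up a Panel.) These data structures are built on top of NumPy arrays, which makes them fast.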

Importing the packages

>>> import pandas as pd
>>> import numpy as np

Series

# Defined from a list
>>> s = pd.Series(['a', 'b', 'c', 'd'])
>>> s
0    a
1    b
2    c
3    d
dtype: object

# Defined from a dict
>>> s = pd.Series({'a': 0., 'b': 1., 'c': 2.})
>>> s
a    0.0
b    1.0
c    2.0
dtype: float64

DataFrame

A DataFrame stores its data in the following format:

# Defined from a list
>>> data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
>>> df = pd.DataFrame(data, columns=['Name', 'Age'])
>>> df
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13

# Defined from a dict
# (this output comes from an older pandas that sorted dict keys, so Age is
# listed first; recent versions keep the insertion order Name, Age)
>>> data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
>>> df = pd.DataFrame(data)
>>> df
   Age   Name
0   28    Tom
1   34   Jack
2   29  Steve
3   42  Ricky

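Data indexing:

●Conceptually, the data consists of tuples of indexes and a value, (index1, [index2, index3,] value). These tuples are then combined into the DataFrame that is displayed.

●Missing values: if an index position in the DataFrame has no corresponding tuple, a placeholder for missing data (NaN) is shown there.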
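The tuple picture above can be sketched with a Series that carries a two-level index. This is only a way to visualize the model, not a description of how pandas stores data internally; the record values and the level names `row` and `col` are made up for illustration.

```python
import pandas as pd

# Each record is an (index1, index2, value) tuple.
# The combination ('r2', 'B') is deliberately absent.
records = [("r1", "A", 1.0), ("r1", "B", 2.0), ("r2", "A", 3.0)]

index = pd.MultiIndex.from_tuples(
    [(row, col) for row, col, _ in records], names=["row", "col"]
)
long_form = pd.Series([value for _, _, value in records], index=index)

# unstack() pivots the inner index level into columns, giving the table view.
table = long_form.unstack()
print(table)
```

The cell at row `r2`, column `B` has no tuple behind it, so it shows up as NaN in the printed table.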

Defining a Series with an index

>>> data = {'a': 0., 'b': 1., 'c': 2.}
>>> s = pd.Series(data, index=['b', 'c', 'd', 'a'])
>>> s
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

3 Data Input/Output

Method 1: define the data directly

●pd.Series

●pd.DataFrame

Method 2: reader functions

Reading:

●read_csv/read_table

●read_sql

●read_html

●read_json

Writing:

●to_csv
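A minimal sketch of the read/write pair listed above, assuming an in-memory buffer in place of a real file so the example needs no file on disk. The sample rows are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [28, 34]})

# Write the frame as CSV text; index=False leaves out the row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the same text back into a new DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

With a real file, the buffer would simply be replaced by a path in both calls.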

4 Basic Functionality

Attributes of the data structures

>>> df
   Age   Name
0   28    Tom
1   34   Jack
2   29  Steve
3   42  Ricky

>>> df.axes
[RangeIndex(start=0, stop=4, step=1), Index([u'Age', u'Name'], dtype='object')]

>>> df.dtypes
Age      int64
Name    object
dtype: object

>>> df.size
8

>>> df.values
array([[28, 'Tom'],
       [34, 'Jack'],
       [29, 'Steve'],
       [42, 'Ricky']], dtype=object)

Simple statistics

>>> df.describe(include='all')
              Age Name
count    4.000000    4
unique        NaN    4
top           NaN  Tom
freq          NaN    1
mean    33.250000  NaN
std      6.396614  NaN
min     28.000000  NaN
25%     28.750000  NaN
50%     31.500000  NaN
75%     36.000000  NaN
max     42.000000  NaN

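5 Selecting Data

Indexers (multi-axis indexing)

●loc[]: label-based indexing

●iloc[]: integer position-based indexing

Indexer format

df.loc[row_index, column_index]

Figure: selecting data by row index.

Figure: selecting data by column index.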

●Specifying labels

>>> df = pd.DataFrame(np.random.randn(8, 4),
...                   index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
...                   columns=['A', 'B', 'C', 'D'])
>>> df
          A         B         C         D
a -0.484976  1.958562 -0.073555  0.524286
b  1.681393  1.041901 -0.109796  0.836486
c  0.352229  0.656365  0.590963  0.908981
d  1.325258  1.199558  0.953455 -0.192507
e  0.573300 -0.202530 -0.699603  1.504382
f -1.423372 -0.311816  0.680950 -1.619343
g  0.771233 -0.101350 -0.207373  1.242127
h  0.084874 -0.655007 -0.834754  0.072229

>>> df.loc['a', ['A', 'B']]
A   -0.484976
B    1.958562

●Slice (range) indexing

>>> df.loc[:, 'A']
a   -0.484976
b    1.681393
c    0.352229
d    1.325258
e    0.573300
f   -1.423372
g    0.771233
h    0.084874
Name: A, dtype: float64

>>> df.loc['a':'e', 'A':'C']
          A         B         C
a -0.484976  1.958562 -0.073555
b  1.681393  1.041901 -0.109796
c  0.352229  0.656365  0.590963
d  1.325258  1.199558  0.953455
e  0.573300 -0.202530 -0.699603

●Boolean indexing

>>> df.loc[df.A > 0, ]
          A         B         C         D
b  1.681393  1.041901 -0.109796  0.836486
c  0.352229  0.656365  0.590963  0.908981
d  1.325258  1.199558  0.953455 -0.192507
e  0.573300 -0.202530 -0.699603  1.504382
g  0.771233 -0.101350 -0.207373  1.242127
h  0.084874 -0.655007 -0.834754  0.072229

>>> df.loc[df.A.isna(), ]
Empty DataFrame
Columns: [A, B, C, D]
Index: []
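The examples in this section all go through loc. For comparison, here is a small sketch of the position-based iloc indexer on a made-up frame. The point to notice is that iloc slices exclude the end position, while loc label slices include the end label.

```python
import numpy as np
import pandas as pd

# Illustrative data: the integers 0..11 laid out as 3 rows by 4 columns.
df = pd.DataFrame(
    np.arange(12).reshape(3, 4), index=["a", "b", "c"], columns=list("ABCD")
)

# Row 0, columns 0 and 1: the same cells as df.loc['a', ['A', 'B']].
first_row = df.iloc[0, [0, 1]]

# Positional slices stop before the end position...
block_by_position = df.iloc[0:2, 0:3]

# ...while label slices include the end label.
block_by_label = df.loc["a":"b", "A":"C"]

print(first_row)
print(block_by_position)
```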

6 Manipulating Data

Sorting

●sort_index(): sort by index

●sort_values(): sort by value

>>> unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})
>>> unsorted_df
   col1  col2
0     2     1
1     1     3
2     1     2
3     1     4

# Sort by one column
>>> unsorted_df.sort_values('col1')
   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1

# Sort by multiple columns
>>> unsorted_df.sort_values(['col1', 'col2'])
   col1  col2
2     1     2
1     1     3
3     1     4
0     2     1
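The session above covers sort_values only. A short sketch of sort_index follows, using the same column values but with a shuffled row index chosen here purely for illustration.

```python
import pandas as pd

# Same values as unsorted_df, but with the row labels out of order.
shuffled = pd.DataFrame(
    {"col1": [2, 1, 1, 1], "col2": [1, 3, 2, 4]}, index=[3, 0, 2, 1]
)

# Sort rows by their index labels, ascending by default.
by_index = shuffled.sort_index()

# Descending order of row labels.
by_index_desc = shuffled.sort_index(ascending=False)

# axis=1 sorts the column labels instead of the row labels.
by_columns_desc = shuffled.sort_index(axis=1, ascending=False)

print(by_index)
```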

Aggregation

●Grouped aggregation: groupby + agg

Figure: the groupby function, which gathers rows that share the same key so they can be aggregated.

>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...                      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...             'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...             'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...             'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>> df
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017

# Create the groups
>>> df.groupby(['Team', 'Year'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x112f28c10>

# Inspect the groups (grouped by both Team and Year, matching the keys shown)
>>> df.groupby(['Team', 'Year']).groups
{('Kings', 2014): Int64Index([4], dtype='int64'),
 ('Royals', 2014): Int64Index([9], dtype='int64'),
 ('Riders', 2014): Int64Index([0], dtype='int64'),
 ('Riders', 2015): Int64Index([1], dtype='int64'),
 ('Kings', 2016): Int64Index([6], dtype='int64'),
 ('Riders', 2016): Int64Index([8], dtype='int64'),
 ('Riders', 2017): Int64Index([11], dtype='int64'),
 ('Devils', 2014): Int64Index([2], dtype='int64'),
 ('Devils', 2015): Int64Index([3], dtype='int64'),
 ('kings', 2015): Int64Index([5], dtype='int64'),
 ('Royals', 2015): Int64Index([10], dtype='int64'),
 ('Kings', 2017): Int64Index([7], dtype='int64')}

# Look at a single group
>>> df.groupby(['Team', 'Year']).get_group(('Kings', 2014))
   Points  Rank   Team  Year
4     741     3  Kings  2014

# Data for the most recent year of each team (sort, then group)
>>> df.sort_values(['Team', 'Year'], ascending=False).groupby('Team').nth(0)
        Points  Rank  Year
Team
Devils     673     3  2015
Kings      788     1  2017
Riders     690     2  2017
Royals     804     1  2015
kings      812     4  2015

# Aggregation functions
>>> df.groupby(['Year'])['Points'].agg('mean')
Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64

>>> df.groupby(['Year'])['Points'].agg(['mean', 'sum', 'median'])
        mean   sum  median
Year
2014  795.25  3181   802.0
2015  769.50  3078   796.5
2016  725.00  1450   725.0
2017  739.00  1478   739.0

# Filtering groups
>>> df.groupby('Team').filter(lambda x: len(x) >= 3)
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
4      741     3   Kings  2014
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
11     690     2  Riders  2017

>>> df.groupby('Team').filter(lambda x: max(x['Points']) >= 800)
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
5      812     4   kings  2015
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017

●Window aggregation: rolling + agg

Commonly used when building quantitative models.
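The article gives no example for window aggregation, so here is a minimal sketch on a made-up series, assuming a window of three observations. Each output row aggregates the current value and the two before it. The first two rows are NaN because the window is not yet full.

```python
import pandas as pd

# Illustrative values, e.g. a short run of daily measurements.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# A 3-observation rolling window, aggregated with two functions at once.
rolled = values.rolling(window=3).agg(["mean", "sum"])

print(rolled)
```

Passing `min_periods=1` to rolling() would produce values for the partial windows at the start instead of NaN.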

Applying functions

●pipe(): table-wise function application; the function receives the whole table, which makes method chaining convenient

>>> def adder(x, y):
...     return x + y
...
>>> df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
>>> df
       col1      col2      col3
0  1.200842 -0.387094  0.218903
1 -2.469144  2.283831  0.342451
2  0.688127  0.445456  0.966626
3  0.912838  0.577441 -0.967456
4 -0.706913  0.791318 -1.040644

>>> df.pipe(adder, 2)
       col1      col2      col3
0  3.200842  1.612906  2.218903
1 -0.469144  4.283831  2.342451
2  2.688127  2.445456  2.966626
3  2.912838  2.577441  1.032544
4  1.293087  2.791318  0.959356

●apply(): row-wise or column-wise function application

>>> df.apply(np.mean)
col1   -0.074850
col2    0.742191
col3   -0.096024
dtype: float64

>>> df.apply(np.mean, axis=1)
0    0.344217
1    0.052380
2    0.700070
3    0.174274
4   -0.318746
dtype: float64

●applymap(): element-wise mapping function, similar to map() (deprecated since pandas 2.1 in favor of DataFrame.map)

# crypto_util is not part of pandas; it appears to be an internal helper module
>>> aes_encrypt = crypto_util.AesEncrypt()
>>> def decrypt(line):
...     decrypt_str = aes_encrypt.decrypt(line,
...         crypto_util.constants.Constants.CRM_ENCRYPT_PREFIX)
...     return decrypt_str
...
>>> df = pd.DataFrame(
...     ['baiducrmcommonciper_LUjEqeTBXHcHFak5E3lwcgOR+Xfl6v/hkbSrzqBBFI4=',
...      'baiducrmcommonciper_4TReevfj06k3mg8871PvslHvPuPwlCUkn4xM6ZjrAn4=',
...      'baiducrmcommonciper_zmrudGYBOalk5LTqlF5ncg=='])
>>> df.applymap(decrypt)
                    0
0  25339384668@qq.com
1   1909062174@qq.com
2    8076719440@qq.om
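Because the session above depends on a module that readers will not have, here is a self-contained sketch of the same element-wise idea. The helper `mask_local_part` and the sample addresses are hypothetical. The `hasattr` check is an assumption about how one might support both newer pandas (DataFrame.map) and older releases (applymap).

```python
import pandas as pd


def mask_local_part(address):
    """Hypothetical helper: keep the first character before '@', hide the rest."""
    local, _, domain = address.partition("@")
    return local[:1] + "***@" + domain


df = pd.DataFrame({"email": ["alice@example.com", "bob@example.org"]})

# Newer pandas exposes DataFrame.map; older releases only have applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap

# The function is called once per cell, not once per row or column.
masked = elementwise(mask_local_part)
print(masked)
```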

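7 Operating on DataFrames

Concatenation

●append (removed in pandas 2.0; use concat instead)

●concat

concat works as illustrated in the figure: (1) when axis is not specified, it defaults to axis=0 and stacks the frames vertically; (2) with axis=1, it joins the frames side by side.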

>>> one = pd.DataFrame({
...     'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
...     'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
...     'Marks_scored': [98, 90, 87, 69, 78]},
...     index=[1, 2, 3, 4, 5])
>>> two = pd.DataFrame({
...     'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
...     'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
...     'Marks_scored': [89, 80, 79, 97, 88]},
...     index=[1, 2, 3, 4, 5])
>>> pd.concat([one, two])
   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
1            89   Billy       sub2
2            80   Brian       sub4
3            79    Bran       sub3
4            97   Bryce       sub6
5            88   Betty       sub5
>>> pd.concat([one, two], axis=1)
   Marks_scored    Name subject_id  Marks_scored   Name subject_id
1            98    Alex       sub1            89  Billy       sub2
2            90     Amy       sub2            80  Brian       sub4
3            87   Allen       sub4            79   Bran       sub3
4            69   Alice       sub6            97  Bryce       sub6
5            78  Ayoung       sub5            88  Betty       sub5

Merge

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)

Figure: the merge function.
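The article shows only the signature of merge, so the following is a small sketch of how the `how` and `on` parameters behave. The two frames and the shared `id` column are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "Name": ["Alex", "Amy", "Allen"]})
right = pd.DataFrame({"id": [2, 3, 4], "subject_id": ["sub2", "sub4", "sub6"]})

# Inner join: keep only the ids present in both frames.
inner = pd.merge(left, right, how="inner", on="id")

# Left join: keep every row of `left`; unmatched rows get NaN on the right side.
left_join = pd.merge(left, right, how="left", on="id")

# Outer join: keep the ids from either frame.
outer = pd.merge(left, right, how="outer", on="id")

print(inner)
print(left_join)
```

When the key columns have different names in the two frames, `left_on` and `right_on` take the place of `on`.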

8 Plotting

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rc('figure', figsize=(5, 3))
ts = pd.Series(np.random.randn(1000),
               index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

df = pd.DataFrame(np.random.randn(1000, 4),
                  index=ts.index,
                  columns=list('ABCD'))
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')