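pandas provides high-level data structures and manipulation tools that make data analysis faster and easier. It is built on top of NumPy and makes NumPy-centric applications simpler to write. Its main features are: data structures with automatic or explicit alignment along an axis, which prevents many common errors caused by misaligned data or by data from different sources that is indexed differently; integrated time series functionality; the same data structures for both time series and non-time series data; arithmetic operations and reductions (such as summing along an axis) that can be driven by metadata (the axis labels); flexible handling of missing data; and merges and other relational operations found in common databases (SQL-based ones, for example). The import conventions used for pandas are as follows: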
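In [1]: from pandas import Series, DataFrame

In [2]: import pandas as pd

1. Introduction to pandas data structures

To use pandas, you first need to be familiar with its two main data structures: Series and DataFrame. They do not solve every problem, but they provide a solid, easy-to-use foundation for most applications.

2. Series

A Series is a one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels, called its index. The simplest Series is built from data alone: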
In [4]: obj = pd.Series([4, 7, -5, 3])

In [5]: obj
Out[5]:
0    4
1    7
2   -5
3    3
dtype: int64
In [7]: obj.values
Out[7]: array([4, 7, -5, 3], dtype=int64)

In [8]: obj.index
Out[8]: Int64Index([0, 1, 2, 3], dtype='int64')
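The session above shows that a Series built from a plain list receives a default integer index running from 0 to N-1. The following is a minimal sketch of the same idea, assuming a current pandas installation. The printed form of the index differs between pandas versions, so the sketch compares contents rather than the repr.

```python
import pandas as pd

# Build a Series from a plain list; pandas assigns the default index 0..N-1.
obj = pd.Series([4, 7, -5, 3])

# .values exposes the data as a NumPy array, .index holds the labels.
values = obj.values.tolist()
labels = obj.index.tolist()

print(values)
print(labels)
```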
In [9]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [10]: obj2
Out[10]:
d    4
b    7
a   -5
c    3
dtype: int64

In [11]: obj2.index
Out[11]: Index([u'd', u'b', u'a', u'c'], dtype='object')
In [12]: obj2['a']
Out[12]: -5

In [13]: obj2['d'] = 6

In [14]: obj2[['c', 'a', 'd']]
Out[14]:
c    3
a   -5
d    6
dtype: int64
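Unlike a plain NumPy array, a Series can be read and written by label. The sketch below repeats the steps of the session in script form: a single label returns a scalar, while a list of labels returns a new Series in the order requested. The variable names `single` and `subset` are only illustrative.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# A single label returns a scalar value.
single = obj2["a"]

# Assignment by label updates the Series in place.
obj2["d"] = 6

# A list of labels returns a new Series, ordered as requested.
subset = obj2[["c", "a", "d"]]
print(subset)
```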
In [25]: obj2
Out[25]:
d    4
b    7
a   -5
c    3
dtype: int64

In [26]: obj2[obj2 > 0]
Out[26]:
d    4
b    7
c    3
dtype: int64

In [27]: obj2 * 2
Out[27]:
d     8
b    14
a   -10
c     6
dtype: int64

In [28]: import numpy as np

In [29]: np.exp(obj2)
Out[29]:
d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
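The point of the session above is that NumPy-style operations (boolean filtering, scalar arithmetic, and math functions) keep the link between each value and its index label. A short sketch that checks this explicitly, using the same data as the session:

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

positive = obj2[obj2 > 0]   # a boolean mask keeps the matching labels
doubled = obj2 * 2          # scalar arithmetic is applied element-wise
exp = np.exp(obj2)          # a NumPy ufunc returns a Series, index intact

print(positive)
print(doubled)
print(exp)
```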
In [31]: 'b' in obj2
Out[31]: True

In [32]: 'e' in obj2
Out[32]: False

If your data is stored in a Python dict, you can create a Series directly from that dict:
In [33]: sdata = {'Ohio': 3500, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [34]: obj3 = pd.Series(sdata)

In [35]: obj3
Out[35]:
Ohio       3500
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
In [36]: states = ['California', 'Ohio', 'Oregon', 'Texas']

In [37]: obj4 = pd.Series(sdata, index=states)

In [38]: obj4
Out[38]:
California      NaN
Ohio           3500
Oregon        16000
Texas         71000
dtype: float64
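When both a dict and an explicit index are passed, values are looked up by key: labels found in the dict keep their value, labels without a match become NaN, and dict keys that are not requested are dropped. A minimal sketch of that behavior, reusing the session's data:

```python
import pandas as pd

sdata = {"Ohio": 3500, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]

# 'California' has no entry in sdata, so it becomes NaN.
# 'Utah' is not in the requested index, so it is left out.
obj4 = pd.Series(sdata, index=states)
print(obj4)
```

The result is float64 rather than int64 because NaN is a floating-point value.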
In [39]: pd.isnull(obj4)
Out[39]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [40]: pd.notnull(obj4)
Out[40]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
In [41]: obj4.isnull()
Out[41]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
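The top-level functions `pd.isnull` and `pd.notnull` and the `isnull` method all produce boolean Series, which can in turn be used as masks. The sketch below, with illustrative variable names, checks that the function and the method agree and uses the mask to count and filter missing entries:

```python
import pandas as pd

obj4 = pd.Series({"Ohio": 3500, "Oregon": 16000, "Texas": 71000},
                 index=["California", "Ohio", "Oregon", "Texas"])

mask_func = pd.isnull(obj4)        # top-level function
mask_method = obj4.isnull()        # instance method, same result

present = obj4[pd.notnull(obj4)]   # keep only labels that have data
missing_count = int(mask_method.sum())

print(present)
print(missing_count)
```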
In [42]: obj3
Out[42]:
Ohio       3500
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [43]: obj4
Out[43]:
California      NaN
Ohio           3500
Oregon        16000
Texas         71000
dtype: float64

In [44]: obj3 + obj4
Out[44]:
California       NaN
Ohio            7000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64
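The addition above illustrates automatic alignment: values are matched by index label, not by position, and a label that is missing from either operand yields NaN in the result. A small sketch of this, checking results by label so that it does not depend on the ordering of the result:

```python
import pandas as pd

obj3 = pd.Series({"Ohio": 3500, "Oregon": 16000, "Texas": 71000, "Utah": 5000})
obj4 = pd.Series({"Ohio": 3500, "Oregon": 16000, "Texas": 71000},
                 index=["California", "Ohio", "Oregon", "Texas"])

# Labels are aligned before adding; the result covers the union of labels.
total = obj3 + obj4
print(total)
```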
In [45]: obj4.name = 'population'

In [46]: obj4.index.name = 'state'

In [47]: obj4
Out[47]:
state
California      NaN
Ohio           3500
Oregon        16000
Texas         71000
Name: population, dtype: float64
In [48]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [49]: obj
Out[49]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
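The last two sessions show that a Series and its index each carry a `name` attribute, and that the index can be replaced in place by assignment. The sketch below combines both steps; the names `score` and `person` are hypothetical and chosen only for illustration. It also shows that a replacement index of the wrong length is rejected with a `ValueError`.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# Assigning a list of the same length replaces the index in place.
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]

# Both the Series and its index carry an optional name.
obj.name = "score"          # hypothetical name, for illustration only
obj.index.name = "person"   # hypothetical name, for illustration only

# A list of the wrong length is rejected rather than silently truncated.
mismatch_error = None
try:
    obj.index = ["too", "short"]
except ValueError as exc:
    mismatch_error = str(exc)

print(obj)
print(mismatch_error)
```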
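3. DataFrame

A DataFrame is a tabular data structure that contains an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index. Compared with similar structures such as R's data.frame, row-oriented and column-oriented operations in a DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional blocks rather than as a list, dict, or some other collection of one-dimensional arrays.

Note: although a DataFrame stores data in a two-dimensional format, you can still readily represent higher-dimensional data in it by using hierarchical indexing, a tabular structure that is a key ingredient of many of the more advanced data-handling features in pandas.

There are many ways to construct a DataFrame. One of the most common is to pass a dict of equal-length lists or NumPy arrays: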
In [50]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
   ....:         'year': [2000, 2001, 2002, 2001, 2002],
   ....:         'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

In [51]: frame = pd.DataFrame(data)

In [52]: frame
Out[52]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

[5 rows x 3 columns]
In [53]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
Out[53]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

[5 rows x 3 columns]
In [54]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
   ....:                       index=['one', 'two', 'three', 'four', 'five'])

In [55]: frame2
Out[55]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

[5 rows x 4 columns]

In [56]: frame2.columns
Out[56]: Index([u'year', u'state', u'pop', u'debt'], dtype='object')
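The `columns` argument controls both the order of the columns and which columns exist: a name that is absent from the data dict, such as `debt` here, produces a column filled with NaN. A script version of the session above, as a sketch:

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}

# 'debt' is not a key of data, so that column is created empty (all NaN).
frame2 = pd.DataFrame(data,
                      columns=["year", "state", "pop", "debt"],
                      index=["one", "two", "three", "four", "five"])
print(frame2)
```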
In [57]: frame2['state']
Out[57]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [58]: frame2.year
Out[58]:
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
In [59]: frame2.loc['three']
Out[59]:
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
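Columns can be retrieved with dict-like notation or as attributes, and rows by label. In current pandas, label-based row access is written with `.loc`; the older `.ix` indexer was removed in pandas 1.0. The sketch below also shows one caveat of attribute access: a column whose name collides with a DataFrame method, such as `pop`, is only reachable through brackets.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data,
                      columns=["year", "state", "pop", "debt"],
                      index=["one", "two", "three", "four", "five"])

state_col = frame2["state"]   # dict-like access, works for any column name
year_col = frame2.year        # attribute access, same Series as frame2["year"]
row = frame2.loc["three"]     # one row by label, returned as a Series

# frame2.pop is the DataFrame.pop method, not the 'pop' column.
pop_is_method = callable(frame2.pop)
pop_col = frame2["pop"]

print(row)
```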
In [60]: frame2['debt'] = 16.5

In [61]: frame2
Out[61]:
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5

[5 rows x 4 columns]

In [62]: frame2['debt'] = np.arange(5.)

In [63]: frame2
Out[63]:
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4

[5 rows x 4 columns]
In [64]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [65]: frame2['debt'] = val

In [66]: frame2
Out[66]:
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7

[5 rows x 4 columns]
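The sessions above cover three ways of assigning to a column: a scalar is broadcast to every row, an array or list must match the number of rows, and a Series is aligned on the DataFrame's index, leaving NaN wherever a row label has no match. A compact sketch of all three, with illustrative variable names for the intermediate results:

```python
import numpy as np
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data,
                      columns=["year", "state", "pop", "debt"],
                      index=["one", "two", "three", "four", "five"])

frame2["debt"] = 16.5              # a scalar is broadcast to every row
scalar_debt = frame2["debt"].tolist()

frame2["debt"] = np.arange(5.0)    # an array must match the row count
array_debt = frame2["debt"].tolist()

# A Series is aligned on the frame's index; unmatched rows become NaN.
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
frame2["debt"] = val
print(frame2)
```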
In [67]: frame2['eastern'] = frame2.state == 'Ohio'

In [68]: frame2
Out[68]:
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False

[5 rows x 5 columns]

In [69]: del frame2['eastern']

In [70]: frame2.columns
Out[70]: Index([u'year', u'state', u'pop', u'debt'], dtype='object')

In the pandas version used for this session, the column returned by indexing a DataFrame is a view on the underlying data, not a copy, so any in-place modification of the returned Series is reflected in the source DataFrame. (In newer releases with Copy-on-Write enabled, such modifications no longer propagate.) To copy a column explicitly, use the Series copy method.

Another common form of data is a nested dict, that is, a dict of dicts:
In [71]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
   ....:        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [72]: frame3 = pd.DataFrame(pop)

In [73]: frame3
Out[73]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

[3 rows x 2 columns]
In [74]: frame3.T
Out[74]:
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6

[2 rows x 3 columns]
In [75]: pd.DataFrame(pop, index=[2001, 2002, 2003])
Out[75]:
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

[3 rows x 2 columns]
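With a nested dict, the outer keys become the columns and the inner keys are combined to form the row index; `.T` swaps the two axes, and an explicit `index` selects exactly the row labels requested. The sketch below checks these points by label, since the ordering of the combined index can differ between pandas versions.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)   # outer keys -> columns, inner keys -> index
transposed = frame3.T        # rows and columns swapped

# An explicit index keeps only these labels; 2003 has no data, so NaN.
by_year = pd.DataFrame(pop, index=[2001, 2002, 2003])

print(frame3)
print(transposed)
print(by_year)
```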
In [76]: pdata = {'Ohio': frame3['Ohio'][:-1],
   ....:          'Nevada': frame3['Nevada'][:2]}

In [77]: pd.DataFrame(pdata)
Out[77]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7

[2 rows x 2 columns]

In [78]: frame3.index.name = 'year'; frame3.columns.name = 'state'

In [79]: frame3
Out[79]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

[3 rows x 2 columns]
In [80]: frame3
Out[80]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

[3 rows x 2 columns]
In [81]: frame2.values
Out[81]:
array([[2000L, 'Ohio', 1.5, nan],
       [2001L, 'Ohio', 1.7, -1.2],
       [2002L, 'Ohio', 3.6, nan],
       [2001L, 'Nevada', 2.4, -1.5],
       [2002L, 'Nevada', 2.9, -1.7]], dtype=object)
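The `values` attribute returns the DataFrame's data as a two-dimensional NumPy array. When the columns have different types, as in `frame2` above, the array's dtype is chosen to accommodate all of them, which here means `object`. The sketch below contrasts an all-numeric frame with a mixed one; the frames `numeric` and `mixed` are small stand-ins built for this example.

```python
import numpy as np
import pandas as pd

# All-float columns: the array keeps a numeric dtype.
numeric = pd.DataFrame({"Nevada": [np.nan, 2.4, 2.9],
                        "Ohio": [1.5, 1.7, 3.6]},
                       index=[2000, 2001, 2002])

# Integer, string, and float columns together: the array falls back to object.
mixed = pd.DataFrame({"year": [2000, 2001],
                      "state": ["Ohio", "Ohio"],
                      "pop": [1.5, 1.7]})

numeric_values = numeric.values
mixed_values = mixed.values

print(numeric_values.dtype)
print(mixed_values.dtype)
```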