PANDAS QUICK START

心不留意外尘 2016-07-15

展开全文

http://nbviewer./github/supergis/git_notebook/blob/master/pystart/pandas_quickstart.ipynb

2016

Pandas_QuickStart?

Origin from http://pandas./pandas-docs/stable/
by openthings@163.com, 2016-04.

6.1 Object Creation?

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

s = pd.Series([1,3,5,np.nan,6,8])
s

Out[1]:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:

In [2]:

dates = pd.date_range('20130101', periods=6)
dates

Out[2]:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [7]:

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

df

Out[7]:

	A	B	C	D
2013-01-01	-1.334013	-0.348297	0.388654	-0.225967
2013-01-02	-0.139974	-1.347789	0.817077	0.192477
2013-01-03	-1.082739	-0.544105	-1.423883	-1.247367
2013-01-04	0.034788	-0.677221	0.120449	0.794341
2013-01-05	0.428547	-0.610156	-0.950891	-0.058047
2013-01-06	0.125631	-0.116653	-0.544575	-1.578785

Creating a DataFrame by passing a dict of objects that can be converted to series-like.

In [8]:

df2 = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })

df2

Out[8]:

	A	B	C	D	E	F
0	1.0	2013-01-02	1.0	3	test	foo
1	1.0	2013-01-02	1.0	3	train	foo
2	1.0	2013-01-02	1.0	3	test	foo
3	1.0	2013-01-02	1.0	3	train	foo

In [9]:

df2.dtypes

Out[9]:

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:

In [13]: df2.<TAB>

In [11]:

df2.

Out[11]:

0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64

As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of the attributes have been truncated for brevity.

6.2 Viewing Data?

In [14]:

df.head()

Out[14]:

	A	B	C	D
2013-01-01	-1.334013	-0.348297	0.388654	-0.225967
2013-01-02	-0.139974	-1.347789	0.817077	0.192477
2013-01-03	-1.082739	-0.544105	-1.423883	-1.247367
2013-01-04	0.034788	-0.677221	0.120449	0.794341
2013-01-05	0.428547	-0.610156	-0.950891	-0.058047

In [15]:

df.tail(3)

Out[15]:

	A	B	C	D
2013-01-04	0.034788	-0.677221	0.120449	0.794341
2013-01-05	0.428547	-0.610156	-0.950891	-0.058047
2013-01-06	0.125631	-0.116653	-0.544575	-1.578785

In [16]:

df.index

Out[16]:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [17]:

df.values

Out[17]:

array([[-1.33401275, -0.34829657,  0.38865407, -0.22596701],
       [-0.13997444, -1.34778853,  0.81707707,  0.19247685],
       [-1.0827386 , -0.5441047 , -1.42388302, -1.24736743],
       [ 0.03478847, -0.67722051,  0.12044917,  0.7943414 ],
       [ 0.42854678, -0.61015602, -0.95089113, -0.0580473 ],
       [ 0.12563068, -0.11665286, -0.54457518, -1.57878468]])

In [18]:

df.describe()

Out[18]:

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	-0.327960	-0.607370	-0.265528	-0.353891
std	0.710887	0.415940	0.853035	0.896606
min	-1.334013	-1.347789	-1.423883	-1.578785
25%	-0.847048	-0.660454	-0.849312	-0.992017
50%	-0.052593	-0.577130	-0.212063	-0.142007
75%	0.102920	-0.397249	0.321603	0.129846
max	0.428547	-0.116653	0.817077	0.794341

In [19]:

df.T

Out[19]:

	2013-01-01 00:00:00	2013-01-02 00:00:00	2013-01-03 00:00:00	2013-01-04 00:00:00	2013-01-05 00:00:00	2013-01-06 00:00:00
A	-1.334013	-0.139974	-1.082739	0.034788	0.428547	0.125631
B	-0.348297	-1.347789	-0.544105	-0.677221	-0.610156	-0.116653
C	0.388654	0.817077	-1.423883	0.120449	-0.950891	-0.544575
D	-0.225967	0.192477	-1.247367	0.794341	-0.058047	-1.578785

In [20]:

df.sort_index(axis=1, ascending=False)

Out[20]:

	D	C	B	A
2013-01-01	-0.225967	0.388654	-0.348297	-1.334013
2013-01-02	0.192477	0.817077	-1.347789	-0.139974
2013-01-03	-1.247367	-1.423883	-0.544105	-1.082739
2013-01-04	0.794341	0.120449	-0.677221	0.034788
2013-01-05	-0.058047	-0.950891	-0.610156	0.428547
2013-01-06	-1.578785	-0.544575	-0.116653	0.125631

In [21]:

df.sort_values(by='B')

Out[21]:

	A	B	C	D
2013-01-02	-0.139974	-1.347789	0.817077	0.192477
2013-01-04	0.034788	-0.677221	0.120449	0.794341
2013-01-05	0.428547	-0.610156	-0.950891	-0.058047
2013-01-03	-1.082739	-0.544105	-1.423883	-1.247367
2013-01-01	-1.334013	-0.348297	0.388654	-0.225967
2013-01-06	0.125631	-0.116653	-0.544575	-1.578785

6.3 Selection?

Getting

In [22]:

df['A']

Out[22]:

2013-01-01   -1.334013
2013-01-02   -0.139974
2013-01-03   -1.082739
2013-01-04    0.034788
2013-01-05    0.428547
2013-01-06    0.125631
Freq: D, Name: A, dtype: float64

In [23]:

df[0:3]

Out[23]:

	A	B	C	D
2013-01-01	-1.334013	-0.348297	0.388654	-0.225967
2013-01-02	-0.139974	-1.347789	0.817077	0.192477
2013-01-03	-1.082739	-0.544105	-1.423883	-1.247367

In [24]:

df['20130102':'20130104']

Out[24]:

	A	B	C	D
2013-01-02	-0.139974	-1.347789	0.817077	0.192477
2013-01-03	-1.082739	-0.544105	-1.423883	-1.247367
2013-01-04	0.034788	-0.677221	0.120449	0.794341

6.3.2 Selection by Label?

For getting a cross section using a label

In [25]:

df.loc[dates[0]]

Out[25]:

A   -1.334013
B   -0.348297
C    0.388654
D   -0.225967
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label

In [26]:

df.loc[:,['A','B']]

Out[26]:

	A	B
2013-01-01	-1.334013	-0.348297
2013-01-02	-0.139974	-1.347789
2013-01-03	-1.082739	-0.544105
2013-01-04	0.034788	-0.677221
2013-01-05	0.428547	-0.610156
2013-01-06	0.125631	-0.116653

Showing label slicing, both endpoints are included

In [27]:

df.loc['20130102':'20130104',['A','B']]

Out[27]:

	A	B
2013-01-02	-0.139974	-1.347789
2013-01-03	-1.082739	-0.544105
2013-01-04	0.034788	-0.677221

Reduction in the dimensions of the returned object

In [30]:

df.loc['20130102',['A','B']]

Out[30]:

A   -0.139974
B   -1.347789
Name: 2013-01-02 00:00:00, dtype: float64

For getting a scalar value

In [31]:

df.loc[dates[0],'A']

Out[31]:

-1.3340127475498547

For getting fast access to a scalar (equiv to the prior method)

In [32]:

df.at[dates[0],'A']

Out[32]:

-1.3340127475498547

6.3.3 Selection by Position?

See more in Selection by Position Select via the position of the passed integers

In [33]:

df.iloc[3]

Out[33]:

A    0.034788
B   -0.677221
C    0.120449
D    0.794341
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similar to numpy/python

In [34]:

df.iloc[3:5,0:2]

Out[34]:

	A	B
2013-01-04	0.034788	-0.677221
2013-01-05	0.428547	-0.610156

By lists of integer position locations, similar to the numpy/python style

In [35]:

df.iloc[[1,2,4],[0,2]]

Out[35]:

	A	C
2013-01-02	-0.139974	0.817077
2013-01-03	-1.082739	-1.423883
2013-01-05	0.428547	-0.950891

For slicing rows explicitly

In [36]:

df.iloc[1:3,:]

Out[36]:

	A	B	C	D
2013-01-02	-0.139974	-1.347789	0.817077	0.192477
2013-01-03	-1.082739	-0.544105	-1.423883	-1.247367

For slicing columns explicitly

In [37]:

df.iloc[:,1:3]

Out[37]:

	B	C
2013-01-01	-0.348297	0.388654
2013-01-02	-1.347789	0.817077
2013-01-03	-0.544105	-1.423883
2013-01-04	-0.677221	0.120449
2013-01-05	-0.610156	-0.950891
2013-01-06	-0.116653	-0.544575

For getting a value explicitly

In [39]:

df.iloc[1,1]

Out[39]:

-1.3477885295869219

For getting fast access to a scalar (equiv to the prior method)

In [40]:

df.iat[1,1]

Out[40]:

-1.3477885295869219

6.3.4 Boolean Indexing?

Using a single column’s values to select data.

In [41]:

df[df.A > 0]

Out[41]:

	A	B	C	D
2013-01-04	0.034788	-0.677221	0.120449	0.794341
2013-01-05	0.428547	-0.610156	-0.950891	-0.058047
2013-01-06	0.125631	-0.116653	-0.544575	-1.578785

A where operation for getting.

In [42]:

df[df > 0]

Out[42]:

	A	B	C	D
2013-01-01	NaN	NaN	0.388654	NaN
2013-01-02	NaN	NaN	0.817077	0.192477
2013-01-03	NaN	NaN	NaN	NaN
2013-01-04	0.034788	NaN	0.120449	0.794341
2013-01-05	0.428547	NaN	NaN	NaN
2013-01-06	0.125631	NaN	NaN	NaN

Using the isin() method for filtering:?

In [43]:

df2 = df.copy()

添加一列。

In [44]:

df2['E'] = ['one', 'one','two','three','four','three']

In [45]:

df2

Out[45]:

	A	B	C	D	E
2013-01-01	-1.334013	-0.348297	0.388654	-0.225967	one
2013-01-02	-0.139974	-1.347789	0.817077	0.192477	one
2013-01-03	-1.082739	-0.544105	-1.423883	-1.247367	two
2013-01-04	0.034788	-0.677221	0.120449	0.794341	three
2013-01-05	0.428547	-0.610156	-0.950891	-0.058047	four
2013-01-06	0.125631	-0.116653	-0.544575	-1.578785	three

In [46]:

df2[df2['E'].isin(['two','four'])]

Out[46]:

	A	B	C	D	E
2013-01-03	-1.082739	-0.544105	-1.423883	-1.247367	two
2013-01-05	0.428547	-0.610156	-0.950891	-0.058047	four

6.3.5 Setting?

Setting a new column automatically aligns the data by the indexes

In [48]:

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1

Out[48]:

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

Setting values by position

In [49]:

df.iat[0,1] = 0

Setting by assigning with a numpy array

In [50]:

df.loc[:,'D'] = np.array([5] * len(df))

The result of the prior setting operations

In [51]:

df

Out[51]:

	A	B	C	D
2013-01-01	-1.334013	0.000000	0.388654	5
2013-01-02	-0.139974	-1.347789	0.817077	5
2013-01-03	-1.082739	-0.544105	-1.423883	5
2013-01-04	0.034788	-0.677221	0.120449	5
2013-01-05	0.428547	-0.610156	-0.950891	5
2013-01-06	0.125631	-0.116653	-0.544575	5

A where operation with setting.

In [52]:

df2 = df.copy()

In [53]:

df2[df2 > 0] = -df2

In [54]:

df2

Out[54]:

	A	B	C	D
2013-01-01	-1.334013	0.000000	-0.388654	-5
2013-01-02	-0.139974	-1.347789	-0.817077	-5
2013-01-03	-1.082739	-0.544105	-1.423883	-5
2013-01-04	-0.034788	-0.677221	-0.120449	-5
2013-01-05	-0.428547	-0.610156	-0.950891	-5
2013-01-06	-0.125631	-0.116653	-0.544575	-5