Converting between rows and columns in pandas

北方的白桦林 2018-12-09

① Column-to-row methods

The stack function: pandas.DataFrame.stack(self, level=-1, dropna=True)

Run ?pandas.DataFrame.stack to view the help text

Signature: pandas.DataFrame.stack(self, level=-1, dropna=True)
Docstring:
Pivot a level of the (possibly hierarchical) column labels, returning a
DataFrame (or Series in the case of an object with a single level of
column labels) having a hierarchical index with a new inner-most level
of row labels.
The level involved will automatically get sorted.

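a. For an ordinary DataFrame (single-level columns), stack moves the column labels into the innermost level of the row index and produces a Series. Because there is only one column level, level=0 and level=-1 give the same result.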

In [16]: import pandas as pd
    ...: import numpy as np
    ...: df = pd.DataFrame(np.arange(6).reshape(2,3),index=['AA','BB'],columns=
    ...: ['three','two','one'])
    ...: df
    ...:
Out[16]:
    three  two  one
AA      0    1    2
BB      3    4    5
In [17]: df.stack()
Out[17]:
AA  three    0
    two      1
    one      2
BB  three    3
    two      4
    one      5
dtype: int32
In [18]: df.stack(level=0)
Out[18]:
AA  three    0
    two      1
    one      2
BB  three    3
    two      4
    one      5
dtype: int32
In [19]: df.stack(level=-1)
Out[19]:
AA  three    0
    two      1
    one      2
BB  three    3
    two      4
    one      5
dtype: int32
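
To make the shape of that result concrete, here is a minimal sketch that rebuilds the same frame and inspects the stacked Series. The variable name stacked is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=['AA', 'BB'],
                  columns=['three', 'two', 'one'])

stacked = df.stack()

# The result is a Series whose index has two levels:
# (original row label, original column label).
print(type(stacked).__name__)        # Series
print(stacked.index.nlevels)         # 2

# Each original cell is addressed by its (row, column) pair.
print(stacked.loc[('BB', 'two')])    # 4
```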

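b. For a DataFrame with hierarchical (multi-level) columns, the chosen column level or levels are moved to the rows. By default the innermost column level becomes the innermost row level. Cells with no matching column label in the remaining level are filled with NaN.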

In [31]: import pandas as pd
    ...: import numpy as np
    ...: df = pd.DataFrame(np.arange(8).reshape(2,4),index=['AA','BB'],columns=
    ...: [['two','two','one','one'],['A','B','C','D']])
    ...: df
    ...:
Out[31]:
   two    one
     A  B   C  D
AA   0  1   2  3
BB   4  5   6  7
In [32]: df.stack()
Out[32]:
      one  two
AA A  NaN  0.0
   B  NaN  1.0
   C  2.0  NaN
   D  3.0  NaN
BB A  NaN  4.0
   B  NaN  5.0
   C  6.0  NaN
   D  7.0  NaN
In [33]: df.stack(level=0)
Out[33]:
          A    B    C    D
AA one  NaN  NaN  2.0  3.0
   two  0.0  1.0  NaN  NaN
BB one  NaN  NaN  6.0  7.0
   two  4.0  5.0  NaN  NaN
In [34]: df.stack(level=1)
Out[34]:
      one  two
AA A  NaN  0.0
   B  NaN  1.0
   C  2.0  NaN
   D  3.0  NaN
BB A  NaN  4.0
   B  NaN  5.0
   C  6.0  NaN
   D  7.0  NaN
In [35]: df.stack(level=-1)
Out[35]:
      one  two
AA A  NaN  0.0
   B  NaN  1.0
   C  2.0  NaN
   D  3.0  NaN
BB A  NaN  4.0
   B  NaN  5.0
   C  6.0  NaN
   D  7.0  NaN
In [36]: df.stack(level=[0,1])
Out[36]:
AA  one  C    2.0
         D    3.0
    two  A    0.0
         B    1.0
BB  one  C    6.0
         D    7.0
    two  A    4.0
         B    5.0
dtype: float64
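
Levels can also be selected by name instead of by position, which is often easier to read. The sketch below assumes the same frame as above; the level names number and letter are hypothetical and added only for this illustration. Recent pandas versions may print a FutureWarning about the stack implementation, and the row order of the result may differ between versions, so the example looks values up by label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4),
                  index=['AA', 'BB'],
                  columns=[['two', 'two', 'one', 'one'],
                           ['A', 'B', 'C', 'D']])

# Hypothetical level names, not part of the original example.
df.columns.names = ['number', 'letter']

# Same level as df.stack(level=0): the outer column level moves to the rows.
by_number = df.stack(level='number')

print(by_number.index.names)
print(by_number.loc[('AA', 'two'), 'A'])   # the cell that held 0 in df
print(by_number.loc[('BB', 'one'), 'D'])   # the cell that held 7 in df
```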

The unstack function: pandas.DataFrame.unstack(self, level=-1, fill_value=None)

Run ?pandas.DataFrame.unstack to view the help text

Signature: pandas.DataFrame.unstack(self, level=-1, fill_value=None)
Docstring:
Pivot a level of the (necessarily hierarchical) index labels, returning
a DataFrame having a new level of column labels whose inner-most level
consists of the pivoted index labels. If the index is not a MultiIndex,
the output will be a Series (the analogue of stack when the columns are
not a MultiIndex).
The level involved will automatically get sorted.

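a. For an ordinary DataFrame (single-level row index), unstack moves the column labels to the outermost level of the row index, keeps the original row labels as the inner level, and produces a Series. The value of level makes no difference in the outputs below.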

In [20]: df
Out[20]:
    three  two  one
AA      0    1    2
BB      3    4    5
In [21]: df.unstack()
Out[21]:
three  AA    0
       BB    3
two    AA    1
       BB    4
one    AA    2
       BB    5
dtype: int32
In [22]: df.unstack(0)
Out[22]:
three  AA    0
       BB    3
two    AA    1
       BB    4
one    AA    2
       BB    5
dtype: int32
In [23]: df.unstack(-1)
Out[23]:
three  AA    0
       BB    3
two    AA    1
       BB    4
one    AA    2
       BB    5
dtype: int32
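
One way to read this output: for a frame whose row index has a single level, unstack gives the same values as transposing the frame and then stacking it. The sketch below checks that on the same frame; it describes the behavior seen here, not a general rule for frames with a multi-level row index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=['AA', 'BB'],
                  columns=['three', 'two', 'one'])

via_unstack = df.unstack()
via_transpose = df.T.stack()

# Both Series map (column label, row label) -> value.
print(via_unstack.to_dict() == via_transpose.to_dict())   # True
print(via_unstack.loc[('two', 'BB')])                     # 4
```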

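b. For a DataFrame with hierarchical columns, unstack at first looks as if it treats the two column levels as one unit. The reason is that unstack's level refers to levels of the row index, not of the columns. This frame's row index has a single level, so every call below returns the same Series, with all column levels placed in front of the row labels. Passing a list such as level=[0,1] raises an error because the row index does not have two levels.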

In [37]: df
Out[37]:
   two    one
     A  B   C  D
AA   0  1   2  3
BB   4  5   6  7
In [38]: df.unstack()
Out[38]:
two  A  AA    0
        BB    4
     B  AA    1
        BB    5
one  C  AA    2
        BB    6
     D  AA    3
        BB    7
dtype: int32
In [39]: df.unstack(0)
Out[39]:
two  A  AA    0
        BB    4
     B  AA    1
        BB    5
one  C  AA    2
        BB    6
     D  AA    3
        BB    7
dtype: int32
In [40]: df.unstack(1)
Out[40]:
two  A  AA    0
        BB    4
     B  AA    1
        BB    5
one  C  AA    2
        BB    6
     D  AA    3
        BB    7
dtype: int32
In [41]: df.unstack(-1)
Out[41]:
two  A  AA    0
        BB    4
     B  AA    1
        BB    5
one  C  AA    2
        BB    6
     D  AA    3
        BB    7
dtype: int32
In [42]: df.unstack(level=[0,1])
IndexError: Too many levels: Index has only 1 level, not 2

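Trying level=5 also runs without error. How should level be interpreted here? This is left as an open question. A likely reading, judging only from the outputs above, is that when the row index has a single level there is nothing to choose from, and this pandas version appears to ignore an integer level rather than validate it.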

In [44]: df
Out[44]:
   two    one
     A  B   C  D
AA   0  1   2  3
BB   4  5   6  7
In [45]: df.unstack(level=5)
Out[45]:
two  A  AA    0
        BB    4
     B  AA    1
        BB    5
one  C  AA    2
        BB    6
     D  AA    3
        BB    7
dtype: int32
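
The level argument becomes meaningful once the row index itself has several levels. The sketch below uses a hypothetical frame (df_rows, with made-up level names outer and inner) that is not part of the original experiment; it only illustrates that level picks which row-index level moves to the columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a two-level row index.
rows = pd.MultiIndex.from_product([['AA', 'BB'], ['x', 'y']],
                                  names=['outer', 'inner'])
df_rows = pd.DataFrame(np.arange(8).reshape(4, 2),
                       index=rows, columns=['A', 'B'])

by_outer = df_rows.unstack(level=0)   # 'outer' labels move to the columns
by_inner = df_rows.unstack(level=1)   # 'inner' labels move to the columns

print(by_outer)
print(by_inner)

# The same cell, reached through the two layouts.
print(by_outer.loc['x', ('A', 'BB')])   # 4
print(by_inner.loc['BB', ('A', 'x')])   # 4
```

Here level=0 and level=1 give different layouts, unlike the single-level case above.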

The melt function: pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)

Run ?pandas.melt to view the help text

Signature: pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
Docstring:
"Unpivots" a DataFrame from wide format to long format, optionally leaving
identifier variables set.
This function is useful to massage a DataFrame into a format where one
or more columns are identifier variables (`id_vars`), while all other
columns, considered measured variables (`value_vars`), are "unpivoted" to
the row axis, leaving just two non-identifier columns, 'variable' and
'value'.

First, try melt on an ordinary DataFrame to see how it transforms the data

In [46]: df = pd.DataFrame(np.arange(8).reshape(2,4),index=['AA','BB'],columns=
    ...: ['A','B','C','D'])
    ...: df
    ...:
Out[46]:
    A  B  C  D
AA  0  1  2  3
BB  4  5  6  7
In [47]: pd.melt(df,id_vars=['A','C'],value_vars=['B','D'],var_name='B|D',
    ...: value_name='(B|D)_value')
Out[47]:
   A  C B|D  (B|D)_value
0  0  2   B            1
1  4  6   B            5
2  0  2   D            3
3  4  6   D            7
In [48]: pd.melt(df,id_vars=['A'],value_vars=['B','D'],var_name='B|D',
    ...: value_name='(B|D)_value')
Out[48]:
   A B|D  (B|D)_value
0  0   B            1
1  4   B            5
2  0   D            3
3  4   D            7
In [49]: pd.melt(df,id_vars=['A'],value_vars=['B'],var_name='B',
    ...: value_name='B_value')
Out[49]:
   A  B  B_value
0  0  B        1
1  4  B        5

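Conclusion: from the results above, id_vars are the original columns kept as they are in the result, and value_vars are the columns to be turned into rows. var_name names the column that holds the former column names (default: variable), and value_name names the column that holds the corresponding values (default: value). Columns listed in neither id_vars nor value_vars are dropped, and the original row labels are replaced by a fresh 0..n-1 index.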
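
As a small follow-up to these observations, the sketch below shows the defaults (no arguments at all) and one possible way to keep the row labels, namely turning the index into a column with reset_index before melting. The names long_all and long_keep are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4),
                  index=['AA', 'BB'],
                  columns=['A', 'B', 'C', 'D'])

# With no arguments every column is unpivoted and the default
# column names 'variable' and 'value' are used.
long_all = pd.melt(df)
print(long_all.shape)   # (8, 2)

# melt does not carry the row labels over. One way to keep them:
# move the index into a column (named 'index' here) and use it as id_vars.
long_keep = df.reset_index().melt(id_vars='index', value_vars=['B', 'D'])
print(long_keep)
```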

In [50]: df1 = pd.DataFrame(np.arange(8).reshape(2,4),
    ...: columns=[list('ABCD'),list('EFGH')])
    ...: df1
    ...:
Out[50]:
   A  B  C  D
   E  F  G  H
0  0  1  2  3
1  4  5  6  7
In [51]: pd.melt(df1,col_level=0,id_vars=['A'],value_vars=['D'])
Out[51]:
   A variable  value
0  0        D      3
1  4        D      7
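
With multi-level columns, col_level selects which level the names in id_vars and value_vars refer to. The example above uses level 0; the sketch below does the same with level 1 on the same df1, so the labels E and H are used instead of A and D.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(8).reshape(2, 4),
                   columns=[list('ABCD'), list('EFGH')])

# col_level=1 makes melt look at the inner labels E, F, G, H.
melted = pd.melt(df1, col_level=1, id_vars=['E'], value_vars=['H'])
print(melted)
```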

② Row-to-column methods

The unstack function: pandas.DataFrame.unstack(self, level=-1, fill_value=None)

In [26]: df2=df.stack()
    ...: df2
    ...:
Out[26]:
AA  three    0
    two      1
    one      2
BB  three    3
    two      4
    one      5
dtype: int32
In [27]: df2.unstack()
Out[27]:
    three  two  one
AA      0    1    2
BB      3    4    5
In [28]: df2.unstack(0)
Out[28]:
       AA  BB
three   0   3
two     1   4
one     2   5
In [29]: df2.unstack(1)
Out[29]:
    three  two  one
AA      0    1    2
BB      3    4    5
In [30]: df2.unstack(-1)
Out[30]:
    three  two  one
AA      0    1    2
BB      3    4    5
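
The session above suggests that, for this frame, unstack undoes stack, and that unstacking the outer level instead gives the transposed layout. A minimal sketch that checks this by label lookup (the names restored and transposed are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=['AA', 'BB'],
                  columns=['three', 'two', 'one'])

stacked = df.stack()

restored = stacked.unstack()       # inner level (column labels) back to columns
transposed = stacked.unstack(0)    # outer level (row labels) to columns instead

print(restored.loc['BB', 'two'])     # 4
print(transposed.loc['two', 'BB'])   # 4
print(restored.shape, transposed.shape)   # (2, 3) (3, 2)
```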