数据处理过程的数据类型- 当利用pandas进行数据处理的时候,经常会遇到数据类型的问题,当拿到数据的时候,首先需要确定拿到的是正确类型的数据,一般通过数据类型的转化,这篇文章就介绍pandas里面的数据类型(data types也就是常用的dtyps),以及pandas与numpy之间的数据对应关系。
- 主要介绍object,int64,float64,datetime64,bool等几种类型,category与timedelta两种类型会单独的在其他文章中进行介绍。当然本文中也会涉及简单的介绍。
数据类型的问题一般都是出了问题之后才会发现的,所以有了一些经验之后就会拿到数据之后,就直接看数据类型,是否与自己想要处理的数据格式一致,这样可以从一开始避免一些尴尬的问题出现。那么我们以一个简单的例子,利用jupyter notebook进行一个数据类型的介绍。import numpy as npimport pandas as pddf = pd.read_csv('sales_data_types.csv')print(df)
Customer Number Customer Name 2016 2017 \0 10002 Quest Industries $125,000.00 $162,500.00 1 552278 Smith Plumbing $920,000.00 $1,012,000.00 2 23477 ACME Industrial $50,000.00 $62,500.00 3 24900 Brekke LTD $350,000.00 $490,000.00 4 651029 Harbor Co $15,000.00 $12,750.00 Percent Growth Jan Units Month Day Year Active 0 30.00% 500 1 10 2015 Y 1 10.00% 700 6 15 2014 Y 2 25.00% 125 3 29 2016 Y 3 4.00% 75 10 27 2015 Y 4 -15.00% Closed 2 2 2014 N
df.dtypes
Customer Number int64Customer Name object2016 object2017 objectPercent Growth objectJan Units objectMonth int64Day int64Year int64Active objectdtype: object
df['2016']+df['2017']
0 $125,000.00 $162,500.00 1 $920,000.00 $1,012,000.00 2 $50,000.00 $62,500.00 3 $350,000.00 $490,000.00 4 $15,000.00 $12,750.00 dtype: object
df.info()
class 'pandas.core.frame.DataFrame'>RangeIndex: 5 entries, 0 to 4Data columns (total 10 columns):Customer Number 5 non-null int64Customer Name 5 non-null object2016 5 non-null object2017 5 non-null objectPercent Growth 5 non-null objectJan Units 5 non-null objectMonth 5 non-null int64Day 5 non-null int64Year 5 non-null int64Active 5 non-null objectdtypes: int64(4), object(6)memory usage: 480.0+ bytes
首先介绍最常用的astype()比如可以通过astype()将第一列的数据转化为整数int类型df['Customer Number'].astype('int')
0 100021 5522782 234773 249004 651029Name: Customer Number, dtype: int32
df['Customer Number'] = df['Customer Number'].astype('int')print(df)print('--------'*10)print(df.dtypes)
Customer Number Customer Name 2016 2017 \0 10002 Quest Industries $125,000.00 $162,500.00 1 552278 Smith Plumbing $920,000.00 $1,012,000.00 2 23477 ACME Industrial $50,000.00 $62,500.00 3 24900 Brekke LTD $350,000.00 $490,000.00 4 651029 Harbor Co $15,000.00 $12,750.00 Percent Growth Jan Units Month Day Year Active 0 30.00% 500 1 10 2015 Y 1 10.00% 700 6 15 2014 Y 2 25.00% 125 3 29 2016 Y 3 4.00% 75 10 27 2015 Y 4 -15.00% Closed 2 2 2014 N --------------------------------------------------------------------------------Customer Number int32Customer Name object2016 object2017 objectPercent Growth objectJan Units objectMonth int64Day int64Year int64Active objectdtype: object
print(df)
Customer Number Customer Name 2016 2017 \0 10002 Quest Industries $125,000.00 $162,500.00 1 552278 Smith Plumbing $920,000.00 $1,012,000.00 2 23477 ACME Industrial $50,000.00 $62,500.00 3 24900 Brekke LTD $350,000.00 $490,000.00 4 651029 Harbor Co $15,000.00 $12,750.00 Percent Growth Jan Units Month Day Year Active 0 30.00% 500 1 10 2015 Y 1 10.00% 700 6 15 2014 Y 2 25.00% 125 3 29 2016 Y 3 4.00% 75 10 27 2015 Y 4 -15.00% Closed 2 2 2014 N
df['2016'].astype('float')
ValueError Traceback (most recent call last)<>input-19-47cc9d68cd65> in module>()C:\Anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, raise_on_error, **kwargs) 3052 # else, only a single dtype is given 3053 new_data = self._data.astype(dtype=dtype, copy=copy,-> 3054 raise_on_error=raise_on_error, **kwargs) 3055 return self._constructor(new_data).__finalize__(self) 3056 C:\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs) 3187 3188 def astype(self, dtype, **kwargs):-> 3189 return self.apply('astype', dtype=dtype, **kwargs) 3190 3191 def convert(self, **kwargs):C:\Anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs) 3054 3055 kwargs['mgr'] = self-> 3056 applied = getattr(b, f)(**kwargs) 3057 result_blocks = _extend_blocks(applied, result_blocks) 3058 C:\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, raise_on_error, values, **kwargs) 459 **kwargs): 460 return self._astype(dtype, copy=copy, raise_on_error=raise_on_error, 462 463 def _astype(self, dtype, copy=False, raise_on_error=True, values=None,C:\Anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, raise_on_error, values, klass, mgr, **kwargs) 502 503 # _astype_nansafe works fine with 1-d only 505 values = values.reshape(self.shape) 506 C:\Anaconda3\lib\site-packages\pandas\types\cast.py in _astype_nansafe(arr, dtype, copy) 535 536 if copy: 538 return arr.view(dtype) 539 ValueError: could not convert string to float: '$15,000.00 '
以上的问题说明了一些问题- 如果数据是纯净的数据,可以转化为数字
- astype基本也就是两种用作,数字转化为单纯字符串,单纯数字的字符串转化为数字,含有其他的非数字的字符串是不能通过astype进行转化的。
- 需要引入其他的方法进行转化,也就有了下面的自定义函数方法
通过自定义函数清理数据def convert_currency(var): ''' convert the string number to a float _ 去除$ - 去除逗号, - 转化为浮点数类型 ''' new_value = var.replace(',','').replace('$','') return float(new_value)
df['2016'].apply(convert_currency)
0 125000.01 920000.02 50000.03 350000.04 15000.0Name: 2016, dtype: float64
df['2016'].apply(lambda x: x.replace(',','').replace('$','')).astype('float64')
0 125000.01 920000.02 50000.03 350000.04 15000.0Name: 2016, dtype: float64
df['Percent Growth'].apply(lambda x: x.replace('%','')).astype('float')/100
0 0.301 0.102 0.253 0.044 -0.15Name: Percent Growth, dtype: float64
df['Active'] = np.where(df['Active'] == 'Y', True, False)df['Active']
0 True1 True2 True3 True4 FalseName: Active, dtype: bool
df['2016']=df['2016'].apply(lambda x: x.replace(',','').replace('$','')).astype('float64')df['2017']=df['2017'].apply(lambda x: x.replace(',','').replace('$','')).astype('float64')df['Percent Growth']=df['Percent Growth'].apply(lambda x: x.replace('%','')).astype('float')/100df.dtypes
Customer Number int32Customer Name object2016 float642017 float64Percent Growth float64Jan Units objectMonth int64Day int64Year int64Active booldtype: object
print(df)
Customer Number Customer Name 2016 2017 Percent Growth \0 10002 Quest Industries 125000.0 162500.0 0.30 1 552278 Smith Plumbing 920000.0 1012000.0 0.10 2 23477 ACME Industrial 50000.0 62500.0 0.25 3 24900 Brekke LTD 350000.0 490000.0 0.04 4 651029 Harbor Co 15000.0 12750.0 -0.15 Jan Units Month Day Year Active 0 500 1 10 2015 True 1 700 6 15 2014 True 2 125 3 29 2016 True 3 75 10 27 2015 True 4 Closed 2 2 2014 False
利用pandas中函数进行处理pd.to_numeric(df['Jan Units'],errors='coerce').fillna(0)
0 500.01 700.02 125.03 75.04 0.0Name: Jan Units, dtype: float64
pd.to_datetime(df[['Month', 'Day', 'Year']])
0 2015-01-101 2014-06-152 2016-03-293 2015-10-274 2014-02-02dtype: datetime64[ns]
df['Jan Units'] = pd.to_numeric(df['Jan Units'],errors='coerce')df['Start_date'] = pd.to_datetime(df[['Month', 'Day', 'Year']])
df
| Customer Number | Customer Name | 2016 | 2017 | Percent Growth | Jan Units | Month | Day | Year | Active | Start_date | 0 | 10002 | Quest Industries | 125000.0 | 162500.0 | 0.30 | 500.0 | 1 | 10 | 2015 | True | 2015-01-10 | 1 | 552278 | Smith Plumbing | 920000.0 | 1012000.0 | 0.10 | 700.0 | 6 | 15 | 2014 | True | 2014-06-15 | 2 | 23477 | ACME Industrial | 50000.0 | 62500.0 | 0.25 | 125.0 | 3 | 29 | 2016 | True | 2016-03-29 | 3 | 24900 | Brekke LTD | 350000.0 | 490000.0 | 0.04 | 75.0 | 10 | 27 | 2015 | True | 2015-10-27 | 4 | 651029 | Harbor Co | 15000.0 | 12750.0 | -0.15 | NaN | 2 | 2 | 2014 | False | 2014-02-02 | df.dtypes
Customer Number int32Customer Name object2016 float642017 float64Percent Growth float64Jan Units float64Month int64Day int64Year int64Active boolStart_date datetime64[ns]dtype: object
def convert_percent(val): ''' Convert the percentage string to an actual floating point percent - Remove % - Divide by 100 to make decimal ''' new_val = val.replace('%', '') return float(new_val) / 100df_2 = pd.read_csv('sales_data_types.csv',dtype={'Customer_Number':'int'},converters={ '2016':convert_currency, '2017':convert_currency, 'Percent Growth':convert_percent, 'Jan Units':lambda x:pd.to_numeric(x,errors='coerce'), 'Active':lambda x: np.where(x=='Y',True,False)})
df_2.dtypes
Customer Number int64Customer Name object2016 float642017 float64Percent Growth float64Jan Units float64Month int64Day int64Year int64Active booldtype: object
df_2
| Customer Number | Customer Name | 2016 | 2017 | Percent Growth | Jan Units | Month | Day | Year | Active | 0 | 10002 | Quest Industries | 125000.0 | 162500.0 | 0.30 | 500.0 | 1 | 10 | 2015 | True | 1 | 552278 | Smith Plumbing | 920000.0 | 1012000.0 | 0.10 | 700.0 | 6 | 15 | 2014 | True | 2 | 23477 | ACME Industrial | 50000.0 | 62500.0 | 0.25 | 125.0 | 3 | 29 | 2016 | True | 3 | 24900 | Brekke LTD | 350000.0 | 490000.0 | 0.04 | 75.0 | 10 | 27 | 2015 | True | 4 | 651029 | Harbor Co | 15000.0 | 12750.0 | -0.15 | NaN | 2 | 2 | 2014 | False | 至此,pandas里面数据类型目前还有timedelta以及category两个,之后会着重介绍category类型,这是类型是参考了R中的category设计的,在pandas 0.16 之后添加的,之后还会根据需要进行整理pandas的常用方法。
|