分享

清洗数据,我习惯用这 7 步!

 万皇之皇 2020-04-08

数据清洗 (data cleaning) 是机器学习和深度学习进入算法步前的一项重要任务,我平时比较习惯使用的 7 个步骤,总结如下:

  • Step1 : read csv
  • Step2 : preview data
  • Step3: check null value for every column
  • Step4: complete null value
  • Step5: feature engineering
    • Step 5.1: delete some features
    • Step 5.2: create new feature
  • Step6: encode for categories columns
    • Step 6.1: Sklearn LabelEncode
    • Step 6.2: Pandas get_dummies
  • Step 7: check for data cleaning

今天使用泰坦尼克数据集,完整介绍以上 7 步的具体操作过程。

1 读入数据

这不废话吗,第一步就是读入数据。

data_raw = pd.read_csv('../input/titanicdataset-traincsv/train.csv')
data_raw

结果:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q
891 rows × 12 columns

2 数据预览

data_raw.info()
data_raw.describe(include='all')

结果:

<class 'pandas.core.frame.DataFrame'>
RangeIndex:
891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
count 891.000000 891.000000 891.000000 891 891 714.000000 891.000000 891.000000 891 891.000000 204 889
unique NaN NaN NaN 891 2 NaN NaN NaN 681 NaN 147 3
top NaN NaN NaN Hakkarainen, Mr. Pekka Pietari male NaN NaN NaN 1601 NaN G6 S
freq NaN NaN NaN 1 577 NaN NaN NaN 7 NaN 4 644
mean 446.000000 0.383838 2.308642 NaN NaN 29.699118 0.523008 0.381594 NaN 32.204208 NaN NaN
std 257.353842 0.486592 0.836071 NaN NaN 14.526497 1.102743 0.806057 NaN 49.693429 NaN NaN
min 1.000000 0.000000 1.000000 NaN NaN 0.420000 0.000000 0.000000 NaN 0.000000 NaN NaN
25% 223.500000 0.000000 2.000000 NaN NaN 20.125000 0.000000 0.000000 NaN 7.910400 NaN NaN
50% 446.000000 0.000000 3.000000 NaN NaN 28.000000 0.000000 0.000000 NaN 14.454200 NaN NaN
75% 668.500000 1.000000 3.000000 NaN NaN 38.000000 1.000000 0.000000 NaN 31.000000 NaN NaN
max 891.000000 1.000000 3.000000 NaN NaN 80.000000 8.000000 6.000000 NaN 512.329200 NaN N

3 检查null值

data1 = data_raw.copy(deep=True)

data1.isnull().sum()

结果:

PassengerId      0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

Age 列 177 个空值,Cabin 687 个空值,一共才 891 行,估计没啥价值了!Embarked 2 个。

4 补全空值

data1['Age'].fillna(data1['Age'].median(), inplace = True)
data1['Embarked'].fillna(data1['Embarked'].mode()[0], inplace = True)

data1.isnull().sum()

补全操作check:

PassengerId      0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64

5 特征工程

5.1 干掉 3 列:

drop_column = ['PassengerId','Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)

5.2 增加 3 列

增加一列 FamilySize

data1['FamilySize'] = data1 ['SibSp'] + data1['Parch'] + 1
data1

打印结果:


Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2
4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1
... ... ... ... ... ... ... ... ... ... ...
886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1
887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1
888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4
889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1
890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1
891 rows × 10 columns

再创建一列:

data1['IsAlone'] = np.where(data1['FamilySize'] > 1,0,1)

再创建一列:

data1['Title'] = data1['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
data1

结果:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Fare	Embarked	FamilySize	IsAlone	Title
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2 0 Mr
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2 0 Mrs
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1 1 Miss
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2 0 Mrs
4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1 1 Mr
... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1 1 Rev
887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1 1 Miss
888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4 0 Miss
889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1 1 Mr
890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1 1 Mr
891 rows × 12 columns

5.3 分箱走起

data1['FareCut'] = pd.qcut(data1['Fare'], 4)
data1['AgeCut'] = pd.cut(data1['Age'].astype(int), 6)
data1

结果:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Fare	Embarked	FamilySize	IsAlone	Title	FareCut	AgeCut
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2 0 Mr (-0.001, 7.91] (13.333, 26.667]
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2 0 Mrs (31.0, 512.329] (26.667, 40.0]
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1 1 Miss (7.91, 14.454] (13.333, 26.667]
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2 0 Mrs (31.0, 512.329] (26.667, 40.0]
4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1 1 Mr (7.91, 14.454] (26.667, 40.0]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1 1 Rev (7.91, 14.454] (26.667, 40.0]
887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1 1 Miss (14.454, 31.0] (13.333, 26.667]
888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4 0 Miss (14.454, 31.0] (26.667, 40.0]
889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1 1 Mr (14.454, 31.0] (13.333, 26.667]
890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1 1 Mr (-0.001, 7.91] (26.667, 40.0]
891 rows × 14 columns

6 编码

6.1 LabelEncoder 方法

使用 Sklearn 的 LabelEncoder

from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
data1['Sex_Code'] = label.fit_transform(data1['Sex'])
data1['Embarked_Code'] = label.fit_transform(data1['Embarked'])
data1['Title_Code'] = label.fit_transform(data1['Title'])
data1['AgeBin_Code'] = label.fit_transform(data1['AgeCut'])
data1['FareBin_Code'] = label.fit_transform(data1['FareCut'])
data1

结果 data1 选取某些列,算法模型终于能认出它们了,多不容易!

6.2 get_dummies 方法

get_dummies 将长 DataFrame 变为宽 DataFrame:

pd.get_dummies(data1['Sex'])

结果:

female	male
0 0 1
1 1 0
2 1 0
3 1 0
4 0 1
... ... ...
886 0 1
887 1 0
888 1 0
889 0 1
890 0 1
891 rows × 2 columns

而 LabelEncoder 编码后,仅仅是把 Female 编码为 0, male 编码为 1.

label.fit_transform(data1['Sex'])
0 1
1 0
2 0
3 0
4 1
..
886 1
887 0
888 0
889 1
890 1
Name: Sex_Code, Length: 891, dtype: int64

7 再 check

# Step 7: data cleaning check
data1[data1_x_alg].info()
print('-'*50)
data1_dummy.info()

结果:

<class 'pandas.core.frame.DataFrame'>
RangeIndex:
891 entries, 0 to 890
Data columns (total 8 columns):
Sex_Code 891 non-null int64
Pclass 891 non-null int64
Embarked_Code 891 non-null int64
Title_Code 891 non-null int64
SibSp 891 non-null int64
Parch 891 non-null int64
Age 891 non-null float64
Fare 891 non-null float64
dtypes: float64(2), int64(6)
memory usage: 55.8 KB
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex:
891 entries, 0 to 890
Data columns (total 29 columns):
Pclass 891 non-null int64
SibSp 891 non-null int64
Parch 891 non-null int64
Age 891 non-null float64
Fare 891 non-null float64
FamilySize 891 non-null int64
IsAlone 891 non-null int64
Sex_female 891 non-null uint8
Sex_male 891 non-null uint8
Embarked_C 891 non-null uint8
Embarked_Q 891 non-null uint8
Embarked_S 891 non-null uint8
Title_Capt 891 non-null uint8
Title_Col 891 non-null uint8
Title_Don 891 non-null uint8
Title_Dr 891 non-null uint8
Title_Jonkheer 891 non-null uint8
Title_Lady 891 non-null uint8
Title_Major 891 non-null uint8
Title_Master 891 non-null uint8
Title_Miss 891 non-null uint8
Title_Mlle 891 non-null uint8
Title_Mme 891 non-null uint8
Title_Mr 891 non-null uint8
Title_Mrs 891 non-null uint8
Title_Ms 891 non-null uint8
Title_Rev 891 non-null uint8
Title_Sir 891 non-null uint8
Title_the Countess 891 non-null uint8
dtypes: float64(2), int64(5), uint8(22)
memory usage: 68.0 KB

Great !

    本站是提供个人知识管理的网络存储空间,所有内容均由用户发布,不代表本站观点。请注意甄别内容中的联系方式、诱导购买等信息,谨防诈骗。如发现有害或侵权内容,请点击一键举报。
    转藏 分享 献花(0

    0条评论

    发表

    请遵守用户 评论公约

    类似文章 更多