数据清洗 (data cleaning) 是机器学习和深度学习进入算法步前的一项重要任务,我平时比较习惯使用的 7 个步骤,总结如下: - Step3: check null value for every column
- Step4: complete null value
- Step5: feature engineering
- Step 5.1: delete some features
- Step 5.2: create new feature
- Step6: encode for categories columns
- Step 6.1: Sklearn LabelEncode
- Step 6.2: Pandas get_dummies
- Step 7: check for data cleaning
今天使用泰坦尼克数据集,完整介绍以上 7 步的具体操作过程。 1 读入数据这不废话吗,第一步就是读入数据。 data_raw = pd.read_csv('../input/titanicdataset-traincsv/train.csv') data_raw
结果: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S ... ... ... ... ... ... ... ... ... ... ... ... ... 886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S 887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S 889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C 890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q 891 rows × 12 columns
2 数据预览data_raw.info() data_raw.describe(include='all')
结果: <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked count 891.000000 891.000000 891.000000 891 891 714.000000 891.000000 891.000000 891 891.000000 204 889 unique NaN NaN NaN 891 2 NaN NaN NaN 681 NaN 147 3 top NaN NaN NaN Hakkarainen, Mr. Pekka Pietari male NaN NaN NaN 1601 NaN G6 S freq NaN NaN NaN 1 577 NaN NaN NaN 7 NaN 4 644 mean 446.000000 0.383838 2.308642 NaN NaN 29.699118 0.523008 0.381594 NaN 32.204208 NaN NaN std 257.353842 0.486592 0.836071 NaN NaN 14.526497 1.102743 0.806057 NaN 49.693429 NaN NaN min 1.000000 0.000000 1.000000 NaN NaN 0.420000 0.000000 0.000000 NaN 0.000000 NaN NaN 25% 223.500000 0.000000 2.000000 NaN NaN 20.125000 0.000000 0.000000 NaN 7.910400 NaN NaN 50% 446.000000 0.000000 3.000000 NaN NaN 28.000000 0.000000 0.000000 NaN 14.454200 NaN NaN 75% 668.500000 1.000000 3.000000 NaN NaN 38.000000 1.000000 0.000000 NaN 31.000000 NaN NaN max 891.000000 1.000000 3.000000 NaN NaN 80.000000 8.000000 6.000000 NaN 512.329200 NaN N
3 检查null值data1 = data_raw.copy(deep=True)
data1.isnull().sum()
结果: PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64
Age 列 177 个空值,Cabin 687 个空值,一共才 891 行,估计没啥价值了!Embarked 2 个。 4 补全空值data1['Age'].fillna(data1['Age'].median(), inplace = True) data1['Embarked'].fillna(data1['Embarked'].mode()[0], inplace = True)
data1.isnull().sum()
补全操作check: PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 0 dtype: int64
5 特征工程5.1 干掉 3 列:drop_column = ['PassengerId','Cabin', 'Ticket'] data1.drop(drop_column, axis=1, inplace = True)
5.2 增加 3 列增加一列 FamilySize data1['FamilySize'] = data1 ['SibSp'] + data1['Parch'] + 1 data1
打印结果: Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize 0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2 1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2 2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1 3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2 4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1 ... ... ... ... ... ... ... ... ... ... ... 886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1 887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1 888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4 889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1 890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1 891 rows × 10 columns
再创建一列: data1['IsAlone'] = np.where(data1['FamilySize'] > 1,0,1)
再创建一列: data1['Title'] = data1['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0] data1
结果: Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize IsAlone Title 0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2 0 Mr 1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2 0 Mrs 2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1 1 Miss 3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2 0 Mrs 4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1 1 Mr ... ... ... ... ... ... ... ... ... ... ... ... ... 886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1 1 Rev 887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1 1 Miss 888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4 0 Miss 889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1 1 Mr 890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1 1 Mr 891 rows × 12 columns
5.3 分箱走起data1['FareCut'] = pd.qcut(data1['Fare'], 4) data1['AgeCut'] = pd.cut(data1['Age'].astype(int), 6) data1
结果: Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize IsAlone Title FareCut AgeCut 0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2 0 Mr (-0.001, 7.91] (13.333, 26.667] 1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2 0 Mrs (31.0, 512.329] (26.667, 40.0] 2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1 1 Miss (7.91, 14.454] (13.333, 26.667] 3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2 0 Mrs (31.0, 512.329] (26.667, 40.0] 4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1 1 Mr (7.91, 14.454] (26.667, 40.0] ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1 1 Rev (7.91, 14.454] (26.667, 40.0] 887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1 1 Miss (14.454, 31.0] (13.333, 26.667] 888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4 0 Miss (14.454, 31.0] (26.667, 40.0] 889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1 1 Mr (14.454, 31.0] (13.333, 26.667] 890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1 1 Mr (-0.001, 7.91] (26.667, 40.0] 891 rows × 14 columns
6 编码6.1 LabelEncoder 方法使用 Sklearn 的 LabelEncoder from sklearn.preprocessing import LabelEncoder
label = LabelEncoder() data1['Sex_Code'] = label.fit_transform(data1['Sex']) data1['Embarked_Code'] = label.fit_transform(data1['Embarked']) data1['Title_Code'] = label.fit_transform(data1['Title']) data1['AgeBin_Code'] = label.fit_transform(data1['AgeCut']) data1['FareBin_Code'] = label.fit_transform(data1['FareCut']) data1
结果 data1 选取某些列,算法模型终于能认出它们了,多不容易! 6.2 get_dummies 方法get_dummies 将长 DataFrame 变为宽 DataFrame: pd.get_dummies(data1['Sex'])
结果: female male 0 0 1 1 1 0 2 1 0 3 1 0 4 0 1 ... ... ... 886 0 1 887 1 0 888 1 0 889 0 1 890 0 1 891 rows × 2 columns
而 LabelEncoder 编码后,仅仅是把 Female 编码为 0, male 编码为 1. label.fit_transform(data1['Sex']) 0 1 1 0 2 0 3 0 4 1 .. 886 1 887 0 888 0 889 1 890 1 Name: Sex_Code, Length: 891, dtype: int64
7 再 check# Step 7: data cleaning check data1[data1_x_alg].info() print('-'*50) data1_dummy.info()
结果: <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 8 columns): Sex_Code 891 non-null int64 Pclass 891 non-null int64 Embarked_Code 891 non-null int64 Title_Code 891 non-null int64 SibSp 891 non-null int64 Parch 891 non-null int64 Age 891 non-null float64 Fare 891 non-null float64 dtypes: float64(2), int64(6) memory usage: 55.8 KB -------------------------------------------------- <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 29 columns): Pclass 891 non-null int64 SibSp 891 non-null int64 Parch 891 non-null int64 Age 891 non-null float64 Fare 891 non-null float64 FamilySize 891 non-null int64 IsAlone 891 non-null int64 Sex_female 891 non-null uint8 Sex_male 891 non-null uint8 Embarked_C 891 non-null uint8 Embarked_Q 891 non-null uint8 Embarked_S 891 non-null uint8 Title_Capt 891 non-null uint8 Title_Col 891 non-null uint8 Title_Don 891 non-null uint8 Title_Dr 891 non-null uint8 Title_Jonkheer 891 non-null uint8 Title_Lady 891 non-null uint8 Title_Major 891 non-null uint8 Title_Master 891 non-null uint8 Title_Miss 891 non-null uint8 Title_Mlle 891 non-null uint8 Title_Mme 891 non-null uint8 Title_Mr 891 non-null uint8 Title_Mrs 891 non-null uint8 Title_Ms 891 non-null uint8 Title_Rev 891 non-null uint8 Title_Sir 891 non-null uint8 Title_the Countess 891 non-null uint8 dtypes: float64(2), int64(5), uint8(22) memory usage: 68.0 KB
Great !
|