[Original] Python Data Cleaning: How pandas Converts Excel Data Types

Python集中营, 2022-12-30, posted from Gansu

When processing Excel data with pandas, the columns usually end up as one of a few data types: object, int, float, or datetime.

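Different third-party Python modules return different data types when they read Excel data. For example, xlrd may expose more data types after reading an Excel file.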

This article focuses on how pandas handles the data types of Excel data after it has been read. If pandas is not installed, install it with pip.

pip install pandas

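Import pandas into the current code block. Some of the comments in this article were generated automatically by an AI plugin; see the earlier articles for details on AI-generated comments.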

# Importing the pandas module and giving it an alias of pd.
import pandas as pd

# Reading the excel file and storing it in a dataframe.
data_frame = pd.read_excel('D:/test-data-work/data.xlsx')

# Printing the dataframe.
print(data_frame)

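Below is the content of the DataFrame object that needs cleaning; it serves as the data source for the explanations that follow. The columns are 姓名 (name), 年龄 (age), 班级 (class), 成绩 (score), 表现 (performance), and 入学时间 (enrollment date).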

            姓名  年龄    班级   成绩 表现       入学时间
0   Python 集中营  10  1210   99  A 2022-10-17
1   Python 集中营  11  1211  100  A 2022-10-18
2   Python 集中营  12  1212  101  A 2022-10-19
3   Python 集中营  13  1213  102  A 2022-10-20
4   Python 集中营  14  1214  103  A 2022-10-21
5   Python 集中营  15  1215  104  A 2022-10-22
6   Python 集中营  16  1216  105  A 2022-10-23
7   Python 集中营  17  1217  106  A 2022-10-24
8   Python 集中营  18  1218  107  A 2022-10-25
9   Python 集中营  19  1219  108  A 2022-10-26
10  Python 集中营  20  1220  109  A 2022-10-27
11  Python 集中营  21  1221  110  A 2022-10-28
12  Python 集中营  22  1222  111  A 2022-10-29
13  Python 集中营  23  1223  112  A 2022-10-30
14  Python 集中营  24  1224  113  A 2022-10-31
15  Python 集中营  25  1225  114  A 2022-11-01
16  Python 集中营  26  1226  115  A 2022-11-02
17  Python 集中营  27  1227  116  A 2022-11-03
18  Python 集中营  10  1210   99  A 2022-11-04

First, use the dtypes attribute of the DataFrame to get the data type of every column.

# Getting the column names of the dataframe.
columns = data_frame.columns.values

# Iterating through the columns of the dataframe.
for column_ in columns:
    # Printing the data type of each column.
    print(data_frame[column_].dtypes)

# object
# int64
# int64
# int64
# object
# datetime64[ns]
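
The same information is available without a loop, because dtypes on the whole DataFrame returns a Series keyed by column name. The sketch below assumes a small hand-built DataFrame (named sample here, not part of the original data file) so that it runs without the Excel file; select_dtypes is shown as one possible way to list the numeric columns.

```python
import pandas as pd

# Hypothetical two-row stand-in for the article's data, built in memory.
sample = pd.DataFrame({
    '姓名': ['Python 集中营', 'Python 集中营'],
    '年龄': [10, 11],
    '成绩': [99, 100],
    '入学时间': pd.to_datetime(['2022-10-17', '2022-10-18']),
})

# dtypes is an attribute: a Series whose index is the column name
# and whose value is that column's dtype.
dtype_by_column = sample.dtypes
print(dtype_by_column)

# select_dtypes filters columns by dtype family; 'number' keeps numeric columns.
numeric_columns = sample.select_dtypes(include='number').columns.tolist()
print(numeric_columns)
```

In this sample only 年龄 and 成绩 are numeric, so those are the two names that select_dtypes returns.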

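The loop above retrieves the data type of each column of the DataFrame. All that remains is to decide whether a column of a given type needs to be processed.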

The astype method provided by pandas converts the data in a column to another type.

# Iterating through the columns of the dataframe.
for column_ in columns:
    if 'int' in str(data_frame[column_].dtypes):
        # Converting the data type of the column to string.
        data_frame[column_] = data_frame[column_].astype(str)
        # Reporting which column was converted.
        print('Column {} has been converted to str.'.format(column_))

# Column 年龄 has been converted to str.
# Column 班级 has been converted to str.
# Column 成绩 has been converted to str.

Above, we check whether the string form of each column's dtype contains 'int', and convert the columns whose data type is int to str.
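
A substring test on the dtype name can match more than intended: an interval column, for instance, has a dtype name that also contains 'int'. As a hedged alternative, the sketch below uses is_integer_dtype from pandas.api.types. The DataFrame and the 区间 (interval) column are hypothetical and exist only to show the difference.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Hypothetical sample: one integer column, one interval column, one text column.
sample = pd.DataFrame({
    '年龄': [10, 11],
    '区间': pd.interval_range(start=0, periods=2),
    '表现': ['A', 'A'],
})

# The interval dtype name contains 'int', so a substring test would match it.
interval_name_matches = 'int' in str(sample['区间'].dtype)

for column_ in sample.columns:
    # is_integer_dtype is True only for real integer columns.
    if is_integer_dtype(sample[column_]):
        sample[column_] = sample[column_].astype(str)

converted = sample['年龄'].tolist()
print(converted)
```

Only 年龄 is converted here; the interval column is left untouched even though its dtype name contains 'int'.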

# Iterating through the columns of the dataframe.
for column_ in data_frame.columns.values:
    # Printing the data type of each column.
    print(data_frame[column_].dtypes)

# object
# object
# object
# object
# object
# datetime64[ns]

Checking again shows that the three int columns in the DataFrame have all been converted.

Large numbers are often displayed in scientific notation in Excel, so converting them to strings also solves that problem.
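
As a small illustration of that point, the sketch below converts long integers to text with astype(str). The 18-digit values and the name id are made up for the example. Excel keeps only about 15 significant digits for numeric cells, so long identifiers are generally safer written as text.

```python
import pandas as pd

# Hypothetical 18-digit identifiers held as int64.
ids = pd.Series([123456789012345678, 987654321098765432], name='id')

# astype(str) keeps every digit, because each integer is rendered as text.
ids_as_text = ids.astype(str)
print(ids_as_text.tolist())
# ['123456789012345678', '987654321098765432']
```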

# Iterating through the columns of the dataframe.
for column_ in data_frame.columns.values:
    # Checking if the data type of the column is datetime.
    if 'datetime' in str(data_frame[column_].dtypes):
        # Converting the data type of the column to string.
        data_frame[column_] = data_frame[column_].astype(str)

# Saving the dataframe to an excel file.
data_frame.to_excel('data2.xlsx')
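
The datetime conversion above relies on the default text form that astype(str) produces. When a specific format is needed, dt.strftime is one option. The sketch below assumes a hand-built 入学时间 column and the format '%Y-%m-%d'; both are illustrative choices, not something the original data requires.

```python
import pandas as pd

# Hypothetical datetime column matching the article's 入学时间 values.
sample = pd.DataFrame({
    '入学时间': pd.to_datetime(['2022-10-17', '2022-11-04']),
})

# dt.strftime formats each timestamp with an explicit pattern.
sample['入学时间'] = sample['入学时间'].dt.strftime('%Y-%m-%d')
formatted = sample['入学时间'].tolist()
print(formatted)
# ['2022-10-17', '2022-11-04']
```

When saving the result, to_excel also accepts index=False, which leaves the row index out of the output file.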