Python Data Analysis: The Basics

gjzh090 2022-01-18

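In data analysis, the first task is usually to clean up the raw data. Raw data can be flawed in many ways: inconsistent formats, missing values, and so on. Even when the raw data is fairly complete, its format or content often needs adjusting to make later processing and analysis easier.

Python provides many built-in functions and library APIs for these needs, which lets data analysts handle a wide range of data situations conveniently. The sections below introduce some of the most frequently used data-handling operations in Python.

Note: Python is widely used and has been around for a long time, so mature APIs already exist for all kinds of scenarios and problems. Only commonly used features are introduced here, and no claim is made to cover every data-processing scenario. For anything missing, search online to fill in the gaps.

Note: All of the code below was written in Jupyter Notebook. To test or practice with it, install the Anaconda distribution first, start Jupyter Notebook, and copy the code into a notebook to run it. See the Anaconda documentation for details.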

Integers

Define a variable x with an initial integer value of 9

x = 9

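In Python, the str.format method together with the {} placeholder syntax replaces parts of a string dynamically.

The 0 inside {} refers to the first argument of format, here the variable x. {0} is replaced by the value of x, producing the complete output string.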

print('Output {0}'.format(x))

Output 9

print('Output {0}'.format(3**4))

Output 81

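The int function converts its argument to an integer

int(8.3)

8

It does not round; it simply drops the fractional part

int(2.7)

2

The / operator, however, always returns a float, even when the division comes out evenly.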

print('Output {0}'.format(int(8.3) / int(2.7)))

Output 4.0
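The difference between truncating, flooring, and rounding, and between / and //, can be sketched as follows. The variable names are illustrative and not part of the original notebook.

```python
import math

# int() truncates toward zero; it does not round.
truncated_pos = int(2.7)
truncated_neg = int(-2.7)

# math.floor() always rounds down; round() rounds to the nearest integer.
floored_neg = math.floor(-2.7)
rounded = round(2.7)

# "/" always returns a float; "//" is floor division and keeps ints as ints.
true_div = 8 / 2
floor_div = 8 // 2

print(truncated_pos, truncated_neg, floored_neg, rounded, true_div, floor_div)
```

For negative numbers, int() and math.floor() disagree, which is worth keeping in mind when cleaning numeric columns.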

Floating-point numbers

The .3f after the colon inside {} formats the value as a float with 3 decimal places

'output {0:.3f}'.format(8.3 / 2.7)

'output 3.074'

Some divisions do not come out evenly

r = 8 / float(3)
r

2.6666666666666665

Round the result and keep 2 decimal places

'output {0:.2f}'.format(r)

'output 2.67'

'output {0:.4f}'.format(8.0 / 3)

'output 2.6667'
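The same format specification also works inside an f-string (Python 3.6 and later), and round() gives a rounded number rather than a string. A small sketch, with variable names chosen only for illustration:

```python
r = 8 / 3

# str.format with a positional index, as used in the text above.
with_format = 'output {0:.2f}'.format(r)

# The same format specification inside an f-string.
with_fstring = f'output {r:.2f}'

# round() returns a float, not a formatted string.
rounded = round(r, 2)

print(with_format)
print(with_fstring)
print(rounded)
```

Formatting is usually the better choice for display, while round() is for values that feed into further calculations.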

Simple math

Import commonly used functions from the math module

from math import exp, log, sqrt

e to the power of 3, kept to 4 decimal places

'output {0:.4f}'.format(exp(3))

'output 20.0855'

The natural logarithm (base e) of 4

'output {0:.2f}'.format(log(4))

'output 1.39'

The value of e is 2.718281828459045, so the following calculation gives 1

'output {0:.2f}'.format(log(2.718281828459045))

'output 1.00'

sqrt computes the square root

'output {0:.1f}'.format(sqrt(81))

'output 9.0'
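The math module also exposes the constant e directly and supports other logarithm bases. The sketch below uses math.e, the optional base argument of log, and log10. The comparisons use isclose because floating-point results may differ from the exact value in the last digits.

```python
from math import e, exp, isclose, log, log10

natural = log(e)          # natural logarithm of e
base_2 = log(8, 2)        # logarithm of 8 in base 2 (second argument is the base)
base_10 = log10(1000)     # base-10 logarithm
round_trip = log(exp(3))  # log undoes exp

for value in (natural, base_2, base_10, round_trip):
    print('output {0:.2f}'.format(value))
```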

Strings

The s after the colon stands for string: the placeholder {} is replaced with a string.

'output {0:s}'.format('I\'m enjoying learning python')

"output I'm enjoying learning python"

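To break a long string literal across several lines in the source code, end each line with a backslash (\). The backslash continues the literal onto the next line and adds no character of its own, so the words on either side of each break run together in the result.

'{0:s}'.format('This is a long string. Without the backslash\
it would run off of the page on the right in the text editor and be very\
difficult to read and edit.')

'This is a long string. Without the backslashit would run off of the page on the right in the text editor and be verydifficult to read and edit.'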
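An alternative to the backslash is to place adjacent string literals inside parentheses, which Python joins into one string. This is a minimal sketch; the variable name long_text and the sentence are illustrative and not from the original notebook.

```python
# Adjacent string literals inside parentheses are concatenated at compile time.
# Each piece ends with an explicit space so the words do not run together.
long_text = ('This is a long string. '
             'Adjacent literals inside parentheses are joined '
             'into one string.')

print(long_text)
```

Because every piece carries its own trailing space, the joined text avoids the missing-space problem that the backslash form can produce.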

Define two string variables

firstname = 'frank '
lastname = 'li'

Concatenate them into a new string variable

fullname = firstname + lastname

'{0:s}'.format(fullname)

'frank li'

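The 0, 1, and 2 inside {} refer to the first, second, and third arguments of format, so each placeholder is replaced by the argument at that position.

'very ' * 4 repeats the string 4 times before it is substituted.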

'{0:s} {1:s}{2:s}'.format('She is', 'very ' * 4, 'beautiful.')

'She is very very very very beautiful.'

The d after the colon stands for a decimal integer

The len function returns the length of the string passed to it

'{0:d}'.format(len(fullname))

'8'

str1 = 'My deliverable is due in May'

By default, split breaks a string on whitespace

str1_list1 = str1.split()
str1_list1

['My', 'deliverable', 'is', 'due', 'in', 'May']

The second argument, 2, limits split to at most two splits, so the result is a list of three elements

str1_list2 = str1.split(' ', 2)
str1_list2

['My', 'deliverable', 'is due in May']
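The split limit counts from the left. The related rsplit method applies the same limit from the right, which is handy when only the last few fields matter. A short sketch reusing str1 from above; the other names are illustrative:

```python
str1 = 'My deliverable is due in May'

# At most two splits, counted from the left.
head = str1.split(' ', 2)

# At most two splits, counted from the right.
tail = str1.rsplit(' ', 2)

# A limit of 1 separates the first word from the rest of the string.
first_word, rest = str1.split(' ', 1)

print(head)
print(tail)
print(first_word, '|', rest)
```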

join: combining strings

words = 'I am studying Python data analysis with pandas'

First split the string on whitespace into a list

words_list = words.split()
words_list

['I', 'am', 'studying', 'Python', 'data', 'analysis', 'with', 'pandas']

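Calling join on the separator string ',' with the list as its argument joins all elements of the list into one new string, separated by commas

'All elements in the list: {0:s} '.format(','.join(words_list))

'All elements in the list: I,am,studying,Python,data,analysis,with,pandas '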
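join only accepts strings, so a list of numbers has to be converted first. The sketch below assumes a hypothetical list called numbers and shows the conversion along with the split round trip.

```python
numbers = [1, 2, 3]

# ','.join(numbers) would raise TypeError because the elements are ints,
# so each element is converted with str() first.
joined = ','.join(str(n) for n in numbers)

# split reverses join, but the pieces come back as strings.
pieces = joined.split(',')

# Convert back to integers when numeric values are needed.
restored = [int(piece) for piece in pieces]

print(joined)
print(pieces)
print(restored)
```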

strip: removing leading and trailing whitespace

str3 = ' Remove unwanted characters from this string. \t\t \n'
str3

' Remove unwanted characters from this string. \t\t \n'

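Remove whitespace (spaces, tabs, and newlines) from the left end of the string

str3.lstrip()

'Remove unwanted characters from this string. \t\t \n'

Remove whitespace from the right end of the string

str3.rstrip()

' Remove unwanted characters from this string.'

Remove whitespace from both ends of the string; whitespace inside the string is left unchanged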

str3.strip()

'Remove unwanted characters from this string.'

To remove specific characters from both ends of a string, pass them to strip as a string argument.

str4 = "$$Here's another string that has unwanted characters.__---++"
str4

"$$Here's another string that has unwanted characters.__---++"

strip removes every character listed in its argument from both ends of the string

str4.strip('$_-+')

"Here's another string that has unwanted characters."
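The argument to strip is treated as a set of characters, not as a prefix or suffix to match in order. A brief sketch of what that implies; the sample strings other than str4 are made up for illustration:

```python
str4 = "$$Here's another string that has unwanted characters.__---++"

# The order of the characters in the argument does not matter.
cleaned = str4.strip('$_-+')
same_result = str4.strip('+-_$')

# Any mix of the listed characters is removed from both ends...
mixed = 'xyxhelloyx'.strip('xy')

# ...but occurrences inside the string are left alone.
inner = '--a-b--'.strip('-')

print(cleaned)
print(mixed)
print(inner)
```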

replace: replacing substrings

str5 = "Let's replace the spaces in this sentence with other characters."
str5

"Let's replace the spaces in this sentence with other characters."

Replace every space in the string with an underscore

str5.replace(' ', '_')

"Let's_replace_the_spaces_in_this_sentence_with_other_characters."
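replace accepts an optional third argument that limits how many occurrences are replaced, and it returns a new string rather than changing the original. A small sketch reusing str5; the other names are illustrative:

```python
str5 = "Let's replace the spaces in this sentence with other characters."

# Replace every space.
all_replaced = str5.replace(' ', '_')

# Replace only the first two spaces (third argument is the maximum count).
first_two = str5.replace(' ', '_', 2)

# Strings are immutable, so str5 itself is unchanged.
print(all_replaced)
print(first_two)
print(str5)
```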

lower, upper, capitalize: changing case

str6 = "Here's WHAT Happens WHEN You Use lower."
str6

"Here's WHAT Happens WHEN You Use lower."

str7 = "Here's what Happens when You Use UPPER."
str7

"Here's what Happens when You Use UPPER."

Convert all letters to lowercase

str6.lower()

"here's what happens when you use lower."

Convert all letters to uppercase

str7.upper()

"HERE'S WHAT HAPPENS WHEN YOU USE UPPER."

Capitalize the first letter of each word. capitalize also lowercases the rest of the word, and print puts each word on its own line

for word in str6.split():
    print('{0:s}'.format(word.capitalize()))

Here's
What
Happens
When
You
Use
Lower.

for word in str7.split(' '):
    print('{0:s}'.format(word.capitalize()))

Here's
What
Happens
When
You
Use
Upper.
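To get the capitalized words back as one sentence, the split, capitalize, and join steps can be combined. The sketch also shows the built-in title method for comparison; it treats the apostrophe as a word boundary, so the two results differ.

```python
str6 = "Here's WHAT Happens WHEN You Use lower."

# Capitalize each whitespace-separated word, then join them back together.
capitalized = ' '.join(word.capitalize() for word in str6.split())

# title() starts a new "word" after any non-letter, including the apostrophe.
titled = str6.title()

print(capitalized)
print(titled)
```

For text containing contractions, the split-and-capitalize approach gives the more natural result.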

Regular expressions

First import the module we need

import re

str8 = 'The quick brown fox jumps over the lazy dog.'

str8_list = str8.split()
str8_list

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']

Create a regular expression

The compile function compiles a pattern given as text into a pattern object, which speeds up repeated use

The r prefix marks a raw string, in which backslashes are not treated as escape characters

re.I makes the match case-insensitive

pattern = re.compile(r'The', re.I)
count = 0
for word in str8_list:
    if pattern.search(word):
        count += 1
# Finds 2 occurrences of "the"
print('{0:d}'.format(count))

2

str9 = 'The quick brown fox jumps over the lazy dog.'
str_to_find = r'The'
pattern = re.compile(str_to_find, re.I)
# The sub method of a compiled pattern finds matches and replaces them
'{0:s}'.format(pattern.sub('a', str9))

'a quick brown fox jumps over a lazy dog.'
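The pattern above matches "the" anywhere, including inside longer words. Adding the word-boundary marker \b restricts it to whole words. The following sketch uses a made-up sentence (text) that contains "Then" and "they" to show the difference.

```python
import re

# Illustrative sentence; "Then" and "they" both contain the letters "the".
text = 'The quick brown fox jumps over the lazy dog. Then they rest.'

loose = re.compile(r'the', re.I)        # matches "the" anywhere
strict = re.compile(r'\bthe\b', re.I)   # matches "the" only as a whole word

loose_hits = loose.findall(text)
strict_hits = strict.findall(text)

# Replace only the whole-word matches.
replaced = strict.sub('a', text)

print(len(loose_hits), loose_hits)
print(len(strict_hits), strict_hits)
print(replaced)
```

findall also removes the need for the explicit counting loop, since the length of the returned list is the number of matches.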

Dates

Import the classes we need

from datetime import date, time, datetime, timedelta

Get the date on which the program is run

today = date.today()
today

datetime.date(2021, 12, 29)

Extract the year from the current date

'{0!s}'.format(today.year)

'2021'

Extract the month from the current date

'{0!s}'.format(today.month)

'12'

Extract the day of the month from the current date

'{0!s}'.format(today.day)

'29'

Get the date and time at which the program is run

current_datetime = datetime.today()
current_datetime

datetime.datetime(2021, 12, 29, 14, 45, 8, 188377)

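Extract the year

current_datetime.year

2021

Extract the month

current_datetime.month

12

Extract the day of the month

current_datetime.day

29

Extract the hour

current_datetime.hour

14

Extract the minute

current_datetime.minute

45

Extract the second

current_datetime.second

8

Extract the microsecond (millionths of a second, not milliseconds)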

current_datetime.microsecond

188377

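Dates and times also support arithmetic

Subtracting 2 from the day field gives the day number of two days earlier. This is plain integer arithmetic on one field, so it only works when the result stays within the same month

current_datetime.day - 2

27

Define a variable representing -1 day, used in the calculation below

one_day = timedelta(days=-1)

Adding it to today gives yesterday's date

yesterday = today + one_day
yesterday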

datetime.date(2021, 12, 28)
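timedelta handles month and year boundaries correctly, which plain arithmetic on the day field does not. The sketch below uses fixed dates (chosen for illustration) so that the result does not depend on when it is run.

```python
from datetime import date, timedelta

first_of_month = date(2022, 1, 1)

# Integer arithmetic on the day field ignores the calendar.
naive_day = first_of_month.day - 2

# timedelta arithmetic rolls back into the previous month and year.
two_days_earlier = first_of_month - timedelta(days=2)

# Subtracting two dates yields a timedelta.
gap = date(2022, 1, 18) - date(2021, 12, 29)

print(naive_day)
print(two_days_earlier)
print(gap.days)
```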

Print the date in a different layout

today.strftime('%m/%d/%Y')

'12/29/2021'

Case matters: %Y is the four-digit year, %y is the two-digit year

today.strftime('%m/%d/%y')

'12/29/21'

Show the month as its English name

today.strftime('%B %d, %Y')

'December 29, 2021'

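strptime does the reverse: it parses a string into a datetime according to an explicit format. Here the string supplies both the date and the time; if the string and format contained only a date, the time fields would default to 00:00:00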

'{0!s}'.format(datetime.strptime('2020-01-10 20:04:56', '%Y-%m-%d %H:%M:%S'))

'2020-01-10 20:04:56'
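Parsing and formatting are often used together to convert a timestamp from one layout to another. A minimal sketch; the variable names and the target layout are illustrative:

```python
from datetime import datetime

# Parse a full timestamp.
parsed = datetime.strptime('2020-01-10 20:04:56', '%Y-%m-%d %H:%M:%S')

# Parse a date-only string; the time fields default to midnight.
date_only = datetime.strptime('2020-01-10', '%Y-%m-%d')

# Format the parsed value in a different layout.
formatted = parsed.strftime('%m/%d/%Y %H:%M')

print(parsed)
print(date_only)
print(formatted)
```

If the string does not match the format, strptime raises ValueError, so wrapping the call in try/except is a common precaution when cleaning messy data.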

Lists

list_a = [1, 2, 3]
list_a

[1, 2, 3]

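Get the length of the list, that is, the number of elements in it

len(list_a)

3

The smallest element in the list

min(list_a)

1

The largest element in the list

max(list_a)

3

Lists can also be nested: an element of a list can itself be a list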

list_b = ['printer', 5, ['star', 'circle', 9]]

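In this case only the top-level elements are counted

len(list_b)

3

count returns how many times a top-level element occurs in the list

list_b.count('printer')

1

To count a sublist, write out the whole sublist; a partial match is not counted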

list_b.count(['star', 'circle', 9])

1

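Get the second element of the list

list_b[1]

5

Get the third element of the list, which is itself a list, and then the second element of that sublist

list_b[2][1]

'circle'

Slice out the first and second elements as a new list. The slice includes indices 0 and 1 but not index 2.

list_a[0:2]

[1, 2]

From the third element of the list (the sublist), slice from its second element to its last and return the result as a new list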

list_b[2][1:]

['circle', 9]

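The integer 2 is in the list

2 in list_a

True

The integer 2 is not in the list

2 in list_b

False

The list ['star', 'circle', 9] is in the list

['star', 'circle', 9] in list_b

True

['star'] is not in the list, because membership compares whole top-level elements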

['star'] in list_b

False
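Since in only checks top-level elements, finding a value inside a nested list takes a little extra code. The function contains_nested below is a hypothetical helper written for this illustration, not a built-in.

```python
def contains_nested(items, target):
    """Return True if target equals any element of items, at any nesting depth."""
    for item in items:
        if item == target:
            return True
        # Descend into sublists and keep searching.
        if isinstance(item, list) and contains_nested(item, target):
            return True
    return False


list_b = ['printer', 5, ['star', 'circle', 9]]

print(9 in list_b)                      # top-level check only
print(contains_nested(list_b, 9))       # searches the sublist too
print(contains_nested(list_b, 'star'))
print(contains_nested(list_b, 2))
```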

Tuples

A tuple cannot be modified after it is created; otherwise it behaves much like a list

my_tuple = ('x', 'y', 'z')

len(my_tuple)

3

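Get the second element of the tuple

my_tuple[1]

'y'

Concatenate two tuples

my_tuple + my_tuple

('x', 'y', 'z', 'x', 'y', 'z')

Tuple unpacking: assign the three elements of the tuple to three variables

a, b, c = my_tuple

The variables a, b, and c now hold 'x', 'y', and 'z' respectively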
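Two related behaviors are worth seeing in code: assigning to a tuple element raises TypeError, and unpacking also supports a starred name and value swapping. The names below (first, rest, left, right) are illustrative.

```python
my_tuple = ('x', 'y', 'z')

# Tuples are immutable: item assignment raises TypeError.
try:
    my_tuple[0] = 'a'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# A starred name collects the remaining elements into a list.
first, *rest = my_tuple

# Unpacking also gives a concise way to swap two variables.
left, right = 1, 2
left, right = right, left

print(assignment_failed)
print(first, rest)
print(left, right)
```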

Converting between tuples and lists

my_list = [1, 2, 3]
my_tuple = ('x', 'y', 'z')

Convert the list my_list to a tuple

tuple(my_list)

(1, 2, 3)

Convert the tuple my_tuple to a list

list(my_tuple)

['x', 'y', 'z']

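Dictionaries

Each element of a dictionary is a key-value pair

Define an empty dictionary

empty_dict = {}
empty_dict

{}

Define a dictionary with three elements, each of which is a key-value pair

a_dict = {'one': 1, 'two': 2, 'three': 3}
a_dict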

{'one': 1, 'two': 2, 'three': 3}

Count the key-value pairs in the dictionary

len(a_dict)

3

len(empty_dict)

0

Look up a value by its key

a_dict['two']

2

Call the dictionary's copy method to make a copy of the dictionary

b_dict = a_dict.copy()
b_dict

{'one': 1, 'two': 2, 'three': 3}
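The point of copy is that the new dictionary can be changed without affecting the original, whereas plain assignment only creates a second name for the same dictionary. A short sketch; the name alias is illustrative:

```python
a_dict = {'one': 1, 'two': 2, 'three': 3}

alias = a_dict           # same dictionary object under a second name
b_dict = a_dict.copy()   # a separate (shallow) copy

b_dict['four'] = 4       # changes only the copy
alias['zero'] = 0        # changes a_dict as well

print(a_dict)
print(b_dict)
```

The copy is shallow: if the values were themselves lists or dictionaries, both copies would still share those inner objects.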

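Get all the keys in the dictionary

a_dict.keys()

dict_keys(['one', 'two', 'three'])

Get all the values in the dictionary

a_dict.values()

dict_values([1, 2, 3])

An item is a key-value pair; the items method returns all key-value pairs in the dictionary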

a_dict.items()

dict_items([('one', 1), ('two', 2), ('three', 3)])

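You can also call the dictionary's get method with a key to obtain the corresponding value

a_dict.get('two')

2

If the key does not exist in the dictionary, get returns its second argument instead

a_dict.get('four', 'Not in dict')

'Not in dict'

If the key exists, get returns its value and the second argument is ignored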

a_dict.get('three', 'Not in dict')

3
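A common use of get with a default is counting occurrences, since it avoids a KeyError the first time a key is seen. The sketch below counts words in a made-up sentence and then walks the result with items.

```python
# Illustrative input; any list of hashable values would work.
words = 'the quick brown fox jumps over the lazy dog'.split()

counts = {}
for word in words:
    # get returns 0 for a word not seen yet, so no KeyError is raised.
    counts[word] = counts.get(word, 0) + 1

# items yields (key, value) pairs that can be unpacked in the loop.
for word, count in counts.items():
    print('{0:s}: {1:d}'.format(word, count))

# get does not insert the missing key into the dictionary.
missing = counts.get('cat', 0)
print(missing, 'cat' in counts)
```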
