Test: Data Analysis with Python (Part 1)

qqcy404 2016-04-05

　　http://blog.163.com/zhoulili1987619@126/blog/static/353082012015220101240699/

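NumPy (short for Numerical Python) is the foundational package for scientific computing in Python. Everything here builds on NumPy and the libraries layered on top of it. It provides, among other things:

　　1) ndarray, a fast and efficient multidimensional array object

　　2) Functions for element-wise computations on arrays and for mathematical operations between arrays

　　3) Tools for reading and writing array-based datasets to disk

　　4) Linear algebra operations, Fourier transforms, and random number generation

　　5) Tools for integrating C, C++, and Fortran code with Python

　　NumPy has another major role in data analysis: it serves as the container for passing data between algorithms. For numerical data, NumPy arrays store and manipulate data far more efficiently than Python's built-in data structures.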

　　pandas

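　　pandas provides rich data structures and functions designed to make working with structured data fast and easy. The pandas object used most here is the DataFrame, a column-oriented, two-dimensional tabular structure with both row and column labels.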

　　matplotlib

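　　matplotlib is the most popular Python library for producing plots and other data visualizations. It integrates well with IPython, which gives you a convenient interactive plotting environment. The plots themselves are interactive too: with the toolbar in the plot window you can zoom in on a region or pan around the whole figure.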

　　IPython

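　　IPython is part of the standard Python scientific computing toolset and ties everything else together. It provides a robust and productive environment for interactive and exploratory computing. It is an enhanced Python shell designed to speed up writing, testing, and debugging Python code. It is mainly used for interactive data work and for visualizing data with matplotlib.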

　　SciPy

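　　SciPy is a collection of packages, each addressing a standard problem domain in scientific computing. The main ones are:

　　scipy.integrate: numerical integration routines and differential equation solvers

　　scipy.linalg: linear algebra routines and matrix decompositions that extend those in numpy.linalg

　　scipy.optimize: function optimizers (minimizers) and root-finding algorithms

　　scipy.signal: signal processing tools

　　scipy.sparse: sparse matrices and sparse linear system solvers

　　scipy.special: a wrapper around SPECFUN, a Fortran library implementing many common mathematical functions, such as the gamma function

　　scipy.stats: standard continuous and discrete probability distributions (density functions, samplers, continuous distribution functions), various statistical tests, and additional descriptive statistics

　　scipy.weave: a tool for using inline C++ code to accelerate array computations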

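　　Notes:

　　Data munging (munge/munging/wrangling): the overall process of turning unstructured and/or messy data into a structured or clean form.

　　Pseudocode: a code-like description of an algorithm or process that is not itself valid source code.

　　Syntactic sugar: programming syntax that adds no new features but makes code easier to read and write.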
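
　　As a small illustration of syntactic sugar, the sketch below builds the same list twice: once with an explicit loop and once with a list comprehension, the form used throughout the examples that follow. The variable names are made up for this illustration.

```python
# Explicit loop: collect the squares of the even numbers below 10.
squares_loop = []
for n in range(10):
    if n % 2 == 0:
        squares_loop.append(n * n)

# List comprehension: the same result, written with syntactic sugar.
squares_comp = [n * n for n in range(10) if n % 2 == 0]

print(squares_loop == squares_comp)
```

　　The comprehension adds no capability that the loop lacks; it only expresses the same filter-and-transform step in one line.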

　　Example 1:

　　import json

　　path = 'D://Python//pydata-book-master//ch02//usagov_bitly_data2012-03-16-1331923249.txt'

　　records = [json.loads(line) for line in open(path)]

　　print records[0]

　　output:{u'a': u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11', u'c': u'US', u'nk': 1, u'tz': u'America/New_York', u'gr': u'MA', u'g': u'A6qOVH', u'h': u'wfLQtf', u'cy': u'Danvers', u'l': u'orofrog', u'al': u'en-US,en;q=0.8', u'hh': u'1.usa.gov', u'r': u'http://www./l/7AQEFzjSi/1.usa.gov/wfLQtf', u'u': u'http://www.ncbi.nlm./pubmed/22415991', u't': 1331923247, u'hc': 1331822918, u'll': [42.576698, -70.954903]}

　　print records[0]['tz']

　　output:America/New_York

　　time_zones = [rec['tz'] for rec in records]

　　Not every record has a time zone field, so add an if check at the end of the comprehension:

　　time_zones = [rec['tz'] for rec in records if 'tz' in rec]

　　print time_zones[:10] # the first 10 time zones

　　output:[u'America/New_York', u'America/Denver', u'America/New_York', u'America/Sao_Paulo', u'America/New_York', u'America/New_York', u'Europe/Warsaw', u'', u'', u'']
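
　　The effect of the if check can be reproduced without the data file. The following is a minimal sketch in Python 3 syntax; sample_lines is a hypothetical stand-in for the file contents, and the dict.get variant is an alternative not used in the text above.

```python
import json

# Hypothetical sample lines standing in for the data file.
# The third record has no 'tz' key at all.
sample_lines = [
    '{"tz": "America/New_York", "c": "US"}',
    '{"tz": "", "c": "US"}',
    '{"c": "PL"}',
]

records = [json.loads(line) for line in sample_lines]

# rec['tz'] would raise KeyError on the third record, so filter first.
time_zones = [rec['tz'] for rec in records if 'tz' in rec]

# Alternative: keep every record and use None where the key is absent.
time_zones_or_none = [rec.get('tz') for rec in records]

print(time_zones)
print(time_zones_or_none)
```

　　Note that an empty string is still a present value, so it passes the 'tz' in rec check; only records lacking the key are dropped.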

　　Counting the time zones:

　　Method 1: using only the standard Python library

　　def get_counts(sequence):
　　    counts = {}
　　    for x in sequence:
　　        if x in counts:
　　            counts[x] += 1
　　        else:
　　            counts[x] = 1
　　    return counts

　　Method 2: using defaultdict from the standard library

　　from collections import defaultdict

　　def get_counts2(sequence):
　　    counts = defaultdict(int) # every value is initialized to 0
　　    for x in sequence:
　　        counts[x] += 1
　　    return counts

　　Running it:

　　counts = get_counts(time_zones)

　　print counts['America/New_York']

　　output:

　　1251

　　len(time_zones)

　　output:

　　3440

　　To get the top 10 time zones and their counts, we need a little dictionary acrobatics:

　　def top_counts(count_dict, n=10):
　　    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
　　    value_key_pairs.sort()
　　    return value_key_pairs[-n:]

　　print top_counts(counts)

　　output :

　　[(33, u'America/Sao_Paulo'), (35, u'Europe/Madrid'), (36, u'Pacific/Honolulu'), (37, u'Asia/Tokyo'), (74, u'Europe/London'), (191, u'America/Denver'), (382, u'America/Los_Angeles'), (400, u'America/Chicago'), (521, u''), (1251, u'America/New_York')]
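
　　The standard library also offers collections.Counter, which combines the counting and the top-n step. This is an alternative to the functions above, not the method the text uses; the short time_zones list here is made-up sample data.

```python
from collections import Counter

# Hypothetical sample; in the text this would be the full time_zones list.
time_zones = ['America/New_York', '', 'America/Chicago',
              'America/New_York', '', 'America/New_York']

counts = Counter(time_zones)

# most_common returns (value, count) pairs, highest count first.
top_two = counts.most_common(2)
print(top_two)
```

　　Unlike top_counts, which returns (count, tz) pairs in ascending order, most_common returns (tz, count) pairs in descending order.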

　　Counting time zones with pandas:

　　from pandas import DataFrame,Series

　　import pandas as pd

　　import numpy as np

　　frame = DataFrame(records)

　　print frame[:5]

　　tz_counts = frame['tz'].value_counts()

　　print tz_counts[:10]

　　output :

　　America/New_York 1251

　　521

　　America/Chicago 400

　　America/Los_Angeles 382

　　America/Denver 191

　　Europe/London 74

　　Asia/Tokyo 37

　　Pacific/Honolulu 36

　　Europe/Madrid 35

　　The fillna function replaces missing (NA) values, while unknown values (empty strings) can be replaced using boolean array indexing:

　　clean_tz = frame['tz'].fillna('Missing')

　　clean_tz[clean_tz == ''] = 'Unknown'

　　tz_counts = clean_tz.value_counts()

　　print tz_counts[:10]

　　output:

　　America/New_York 1251

　　Unknown 521

　　America/Chicago 400

　　America/Los_Angeles 382

　　America/Denver 191

　　Missing 120

　　Europe/London 74
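
　　The two cleaning steps can be tried on a tiny Series. This is a minimal sketch assuming Python 3 and a reasonably recent pandas; the five values are invented to contain one missing entry and two empty strings.

```python
import numpy as np
import pandas as pd

# Made-up data: one NaN (missing) and two empty strings (unknown).
tz = pd.Series(['America/New_York', '', np.nan, 'America/New_York', ''])

# Step 1: replace missing values.
clean_tz = tz.fillna('Missing')

# Step 2: the comparison yields a boolean mask; assign through it.
clean_tz[clean_tz == ''] = 'Unknown'

tz_counts = clean_tz.value_counts()
print(tz_counts)
```

　　The order matters only for readability here: fillna handles NaN, and the mask handles values that are present but empty.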

　　Example 2:

　　import pandas as pd

　　unames = ['user_id','gender','age','occupation','zip']

　　users = pd.read_table('D://Python//pydata-book-master//ch02//movielens//users.dat',sep='::',header=None,names=unames)

　　rnames = ['user_id','movie_id','rating','timestamp']

　　ratings = pd.read_table('D://Python//pydata-book-master//ch02//movielens//ratings.dat',sep='::',header=None,names=rnames)

　　mnames = ['movie_id','title','genres']

　　movies = pd.read_table('D://Python//pydata-book-master//ch02//movielens//movies.dat',sep='::',header=None,names=mnames)

　　print users[:5]

　　#output:

　　user_id gender age occupation zip

　　0 1 F 1 10 48067

　　1 2 M 56 16 70072

　　2 3 M 25 15 55117

　　3 4 M 45 7 02460

　　4 5 M 25 20 55455

　　print ratings[:5]

　　#output:

　　user_id movie_id rating timestamp

　　0 1 1193 5 978300760

　　1 1 661 3 978302109

　　2 1 914 3 978301968

　　3 1 3408 4 978300275

　　4 1 2355 5 978824291

　　print movies[:5]

　　#output:

　　movie_id title genres

　　0 1 Toy Story (1995) Animation|Children's|Comedy

　　1 2 Jumanji (1995) Adventure|Children's|Fantasy

　　2 3 Grumpier Old Men (1995) Comedy|Romance

　　3 4 Waiting to Exhale (1995) Comedy|Drama

　　4 5 Father of the Bride Part II (1995) Comedy

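　　# Use pandas's merge function to merge ratings with users, then merge movies into the result. pandas infers which columns to use as the merge (join) keys from the overlapping column names: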

　　data = pd.merge(pd.merge(ratings,users),movies)

　　print data[:5]

　　#output:

　　user_id movie_id rating timestamp gender age occupation zip

　　0 1 1193 5 978300760 F 1 10 48067

　　1 2 1193 5 978298413 M 56 16 70072

　　2 12 1193 4 978220179 M 25 12 32793

　　3 15 1193 4 978199279 M 25 7 22903

　　4 17 1193 5 978158471 M 50 1 95350

　　title genres

　　0 One Flew Over the Cuckoo's Nest (1975) Drama

　　1 One Flew Over the Cuckoo's Nest (1975) Drama

　　2 One Flew Over the Cuckoo's Nest (1975) Drama

　　3 One Flew Over the Cuckoo's Nest (1975) Drama

　　4 One Flew Over the Cuckoo's Nest (1975) Drama
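
　　To make the key inference concrete, the sketch below merges three tiny invented tables twice: once letting pandas pick the overlapping column names, and once naming the keys with on=. It assumes Python 3 and a current pandas; the table contents are hypothetical.

```python
import pandas as pd

# Tiny made-up tables with the same column layout as the real ones.
ratings = pd.DataFrame({'user_id': [1, 1, 2],
                        'movie_id': [10, 20, 10],
                        'rating': [5, 3, 4]})
users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies = pd.DataFrame({'movie_id': [10, 20],
                       'title': ['Movie A', 'Movie B']})

# Keys inferred from shared column names: user_id, then movie_id.
implicit = pd.merge(pd.merge(ratings, users), movies)

# The same joins with the keys spelled out.
explicit = ratings.merge(users, on='user_id').merge(movies, on='movie_id')

print(implicit)
```

　　Spelling out on= is a reasonable habit when tables might share column names that are not meant to be join keys.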

　　# To compute the mean rating of each movie by gender, use the pivot_table method:

　　mean_ratings = data.pivot_table('rating',rows='title',cols='gender',aggfunc = 'mean')

　　print mean_ratings[:5]

　　#output:

　　gender F M

　　title

　　$1,000,000 Duck (1971) 3.375000 2.761905

　　'Night Mother (1986) 3.388889 3.352941

　　'Til There Was You (1997) 2.675676 2.733333

　　'burbs, The (1989) 2.793478 2.962085

　　...And Justice for All (1979) 3.828571 3.689024
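
　　The rows= and cols= keywords shown above belong to early pandas releases; later versions name them index= and columns=. The following sketch assumes a current pandas and uses a small invented table in place of the merged data.

```python
import pandas as pd

# Made-up ratings: two titles, both genders represented.
data = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'M', 'M', 'F', 'M'],
    'rating': [5, 3, 4, 2, 4],
})

# One row per title, one column per gender, mean rating in each cell.
mean_ratings = data.pivot_table(values='rating', index='title',
                                columns='gender', aggfunc='mean')
print(mean_ratings)
```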

　　# Filter out movies with fewer than 250 ratings. First group by title, then use size() to get a Series holding the group size for each title:

　　ratings_by_title = data.groupby('title').size()

　　print ratings_by_title[:5]

　　#output:

　　title

　　$1,000,000 Duck (1971) 37

　　'Night Mother (1986) 70

　　'Til There Was You (1997) 52

　　'burbs, The (1989) 303

　　...And Justice for All (1979) 199

　　dtype: int64

　　active_titles = ratings_by_title.index[ratings_by_title>=250]

　　print active_titles[:5]

　　#output:

　　Index([u''burbs, The (1989)', u'10 Things I Hate About You (1999)', u'101 Dalmatians (1961)', u'101 Dalmatians (1996)', u'12 Angry Men (1957)'], dtype=object)

　　mean_ratings = mean_ratings.ix[active_titles]

　　print mean_ratings[:5]

　　#output:

　　gender F M

　　title

　　'burbs, The (1989) 2.793478 2.962085

　　10 Things I Hate About You (1999) 3.646552 3.311966

　　101 Dalmatians (1961) 3.791444 3.500000

　　101 Dalmatians (1996) 3.240000 2.911215

　　12 Angry Men (1957) 4.184397 4.328421

　　# To see the movies that female viewers liked most, sort by the F column in descending order:

　　top_female_ratings = mean_ratings.sort_index(by = 'F',ascending=False)

　　print top_female_ratings[:5]

　　#output:

　　# Measure rating disagreement

　　mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

　　sorted_by_diff = mean_ratings.sort_index(by='diff')

　　# Reverse the row order and take the first 15 rows

　　sorted_by_diff[::-1][:15]

　　# Standard deviation of rating, grouped by title

　　rating_std_by_title = data.groupby('title')['rating'].std()

　　# Filter down to active_titles

　　rating_std_by_title = rating_std_by_title.ix[active_titles]

　　# Sort the Series by value in descending order

　　rating_std_by_title.order(ascending=False)[:10]
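
　　The .ix indexer, sort_index(by=...) and Series.order used in this example come from early pandas and are not available in current releases. As a hedged sketch of the same filter-and-sort steps on a current pandas, with invented ratings and counts:

```python
import pandas as pd

# Made-up mean ratings and rating counts for three titles.
mean_ratings = pd.DataFrame(
    {'F': [3.0, 4.5, 2.5], 'M': [3.5, 4.0, 2.0]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'))
ratings_by_title = pd.Series([300, 120, 260], index=mean_ratings.index)

# Keep titles with at least 250 ratings (.loc replaces .ix).
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active = mean_ratings.loc[active_titles]

# sort_values replaces sort_index(by=...) for DataFrames ...
top_female = active.sort_values(by='F', ascending=False)

# ... and Series.order for Series.
diff_sorted = (active['M'] - active['F']).sort_values(ascending=False)

print(top_female)
```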

　　Example 3:

　　import pandas as pd

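　　names1880 = pd.read_csv('D://Python//pydata-book-master//ch02//movielens//names//yob1880.txt',names=['name','sex','births'])

　　print names1880.groupby('sex').births.sum() # use the sum of the births column by sex as the total number of births in that year

　　Because the dataset is split into one file per year, the first step is to combine all the data and add a year field, which pandas.concat makes easy: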

　　# 2010 is the last available year at the moment

　　import pandas as pd

　　years = range(1880, 2011)

　　pieces = []

　　columns = ['name', 'sex', 'births']

　　for year in years:
　　    path = 'D://Python//pydata-book-master//ch02//names//yob%d.txt' % year
　　    frame = pd.read_csv(path, names=columns)
　　    frame['year'] = year
　　    pieces.append(frame)

　　# Concatenate everything into a single DataFrame

　　names = pd.concat(pieces, ignore_index=True)

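　　Two things to note. First, concat glues DataFrame objects together row-wise by default. Second, you have to pass ignore_index=True because we do not want to preserve the original row numbers returned by read_csv. The result is one very large DataFrame containing all of the names data.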
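
　　The effect of ignore_index can be seen on two tiny frames. This is a sketch with invented birth counts (not real figures), assuming Python 3 and a current pandas.

```python
import pandas as pd

# Hypothetical per-year pieces; the birth counts are made up.
yearly_rows = [
    (1880, [('Mary', 'F', 10)]),
    (1881, [('Mary', 'F', 20), ('John', 'M', 30)]),
]

pieces = []
for year, rows in yearly_rows:
    frame = pd.DataFrame(rows, columns=['name', 'sex', 'births'])
    frame['year'] = year
    pieces.append(frame)

# Default: each piece keeps its own row labels, so labels repeat.
kept = pd.concat(pieces)

# ignore_index=True: rows are renumbered 0..n-1.
renumbered = pd.concat(pieces, ignore_index=True)

print(list(kept.index), list(renumbered.index))
```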

　　# Aggregate the data at the year and sex level with groupby or pivot_table

　　total_births = names.pivot_table('births',rows='year',cols='sex',aggfunc=sum)

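　　Next, insert a prop column holding the fraction of babies given each name relative to the total number of births. A prop value of 0.02 means that 2 out of every 100 babies received that name. To do this, group the data by year and sex, then add the new column to each group:

　　def add_prop(group):
　　    # Integer division floors, so cast to float first
　　    births = group.births.astype(float)
　　    group['prop'] = births / births.sum()
　　    return group

　　names = names.groupby(['year','sex']).apply(add_prop)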

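　　# Because these are floating-point values, use np.allclose to check that each group's total is sufficiently close to (though perhaps not exactly equal to) 1:

　　import numpy as np

　　np.allclose(names.groupby(['year','sex']).prop.sum(),1)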
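
　　The same per-group proportion can also be computed with groupby(...).transform, which returns the group total aligned to every row. This is an alternative to the apply-based function above, sketched on invented data and assuming Python 3 (where / is true division) and a current pandas.

```python
import numpy as np
import pandas as pd

# Made-up data: two years, both sexes.
names = pd.DataFrame({
    'name': ['Mary', 'Anna', 'John', 'Mary', 'John'],
    'sex': ['F', 'F', 'M', 'F', 'M'],
    'births': [30, 10, 20, 25, 75],
    'year': [1880, 1880, 1880, 1881, 1881],
})

# Total births of each (year, sex) group, repeated on every row.
group_totals = names.groupby(['year', 'sex'])['births'].transform('sum')
names['prop'] = names['births'] / group_totals

# Sanity check: the proportions in every group add up to about 1.
sums_to_one = np.allclose(names.groupby(['year', 'sex'])['prop'].sum(), 1)
print(sums_to_one)
```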

　　To make further analysis easier, extract a subset of the data: the top 1000 names for each sex/year combination

　　def get_top1000(group):
　　    return group.sort_index(by='births', ascending=False)[:1000]

　　grouped = names.groupby(['year','sex'])

　　top1000 = grouped.apply(get_top1000)

　　Or, alternatively:

　　pieces = []

　　for year, group in names.groupby(['year','sex']):
　　    pieces.append(group.sort_index(by='births', ascending=False)[:1000])

　　top1000 = pd.concat(pieces, ignore_index=True)

　　1) Analyzing naming trends:

　　# Split the top 1000 names into boy and girl portions:

　　boys = top1000[top1000.sex == 'M']

　　girls = top1000[top1000.sex == 'F']

　　# First build a pivot table of total births by year and name:

　　total_births = top1000.pivot_table('births',rows='year',cols='name',aggfunc=sum)

　　# Plot a few names with DataFrame's plot method:

　　subset = total_births[['John','Harry','Mary','Marilyn']]

　　subset.plot(subplots=True,figsize=(12,10),grid=False,title='Number of births per year')

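　　2) Measuring the increase in naming diversity:

　　# Approach 1: compute the proportion of births represented by the 1000 most popular names, aggregated by year and sex, and plot it:

　　table = top1000.pivot_table('prop',rows='year',cols='sex',aggfunc=sum)

　　table.plot(title='Sum of table1000.prop by year and sex',yticks=np.linspace(0,1.2,13),xticks=range(1880,2020,10))

　　# Approach 2: compute the number of distinct names, taken in order of popularity, that make up the top 50% of births. This number is trickier to compute, so consider only boy names from 2010:

　　df = boys[boys.year == 2010]

　　# How many names does it take to reach 50%?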

　　prop_cumsum = df.sort_index(by='prop',ascending=False).prop.cumsum()
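
　　One way to read the answer off a cumulative sum is a binary search for the point where it reaches 0.5. The text above stops at the cumulative sum, so the searchsorted step below is an assumption about how to finish the calculation; the proportions are invented.

```python
import numpy as np

# Hypothetical proportions, already sorted in descending order.
prop = np.array([0.30, 0.15, 0.10, 0.05, 0.05])
prop_cumsum = prop.cumsum()

# Index of the first position where the cumulative sum reaches 0.5.
# searchsorted requires its input to be sorted ascending, which a
# cumulative sum of non-negative values always is.
position = int(np.searchsorted(prop_cumsum, 0.5))

# Positions are zero-based, so one more name than the index is needed.
names_needed = position + 1
print(names_needed)
```

　　Here the running totals are roughly 0.30, 0.45, 0.55, so the third name is the one that crosses 50%.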

　　3) The "last letter" revolution

　　# The distribution of boy names by final letter has changed significantly. To see this, first aggregate all of the births in the full dataset by year, sex, and final letter:

　　# Extract the last letter from the name column

　　get_last_letter = lambda x: x[-1]

　　last_letters = names.name.map(get_last_letter)

　　last_letters.name = 'last_letter'

　　table = names.pivot_table('births',rows=last_letters,cols=['sex','year'],aggfunc=sum)

　　subtable = table.reindex(columns=[1990,1960,2010],level='year')

　　subtable.head()

　　# Next, normalize the table by total births to compute, for each sex, the proportion of total births ending in each letter:

　　subtable.sum()

　　letter_prop = subtable / subtable.sum().astype(float)

　　# With the letter proportions in hand, we can draw a bar plot for each sex, broken down by year

　　import matplotlib.pyplot as plt

　　fig,axes = plt.subplots(2,1,figsize=(10,8))

　　letter_prop['M'].plot(kind='bar',rot=0,ax=axes[0],title='Male')

　　letter_prop['F'].plot(kind='bar',rot=0,ax=axes[1],title='Female',legend=False)

　　# Transpose to get a time series:

　　letter_prop = table / table.sum().astype(float)

　　dny_ts = letter_prop.ix[['d','n','y'],'M'].T

　　dny_ts.head()

　　dny_ts.plot()

　　4) Boy names that became girl names (and vice versa)

　　all_names = top1000.name.unique()

　　mask = np.array(['lesl' in x.lower() for x in all_names])

　　lesley_like = all_names[mask]

　　array([Leslie,Lesley,Leslee,Lesli,Lesly],dtype=object)

　　# Group by name and sum births to see the relative frequencies:

　　filtered = top1000[top1000.name.isin(lesley_like)]

　　filtered.groupby('name').births.sum()

　　# Aggregate by sex and year, then normalize within each year:

　　table = filtered.pivot_table('births',rows='year',cols='sex',aggfunc='sum')

　　table = table.div(table.sum(1),axis=0)

　　table.tail()

　　# Plot the breakdown by sex over time

　　table.plot(style={'M': 'k-','F':'k--'})
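
　　The per-year normalization relies on dividing each row by its own total. A minimal sketch of that step on an invented two-year table, assuming Python 3 and a current pandas:

```python
import pandas as pd

# Made-up births by year and sex.
table = pd.DataFrame({'F': [10, 30], 'M': [30, 10]},
                     index=pd.Index([1900, 2000], name='year'))

# Sum across columns: one total per year.
row_totals = table.sum(axis=1)

# axis=0 aligns row_totals with the row labels, so every row is
# divided by its own total rather than column by column.
normalized = table.div(row_totals, axis=0)
print(normalized)
```

　　After this step every row sums to 1, which is what lets the plot show the share of each sex per year.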

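　　The time module

　　import time

　　iterations = 1000 # assumed value; choose how many repetitions to time

　　start = time.time()

　　for i in range(iterations):
　　    pass # put the code to be timed here

　　elapsed_per = (time.time()-start)/iterations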

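　　IPython provides:

　　The %time and %timeit magic functions, which automate this timing process.

　　The dreload function, which performs "deep" (recursive) reloading of modules. If you run import some_lib and then dreload(some_lib), it attempts to reload some_lib along with all of its dependencies.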
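
　　Outside IPython, the standard library's timeit module does similar timing work to the manual loop shown earlier. The sketch below assumes Python 3; work is a placeholder function invented for this illustration.

```python
import timeit

def work():
    # Placeholder workload (an assumption for illustration).
    return sum(range(1000))

iterations = 1000

# timeit.timeit calls work() `iterations` times and returns total seconds.
total_seconds = timeit.timeit(work, number=iterations)
elapsed_per = total_seconds / iterations

print('%.3g seconds per call' % elapsed_per)
```

　　Compared with the manual loop, timeit uses a high-resolution clock and keeps the loop overhead small, so short snippets are measured more reliably.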

　　ipython notebook --pylab=inline

　　os.path.exists

　　Debugger:pdb

　　Logger:logging

　　Profilers:profile,hotshot,cProfile

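　　The pdb debugging module lets you set (conditional) breakpoints, step through code line by line, and inspect the stack. It also supports post-mortem debugging.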

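　　The logging module defines functions and classes that help your program implement a flexible logging system. There are five log levels: critical, error, warning, info, and debug.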
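
　　To show how the five levels interact with a threshold, the sketch below sends one message at each level to a logger set to WARNING. It assumes Python 3; the logger name 'demo' and the messages are made up, and output is captured in memory only so that it can be inspected.

```python
import io
import logging

stream = io.StringIO()

logger = logging.getLogger('demo')      # hypothetical logger name
logger.setLevel(logging.WARNING)        # drop anything below WARNING
logger.propagate = False                # keep output out of the root logger

handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s:%(message)s'))
logger.addHandler(handler)

logger.debug('not recorded')
logger.info('not recorded')
logger.warning('disk almost full')
logger.error('write failed')
logger.critical('shutting down')

captured = stream.getvalue().splitlines()
print(captured)
```

　　Only the three messages at WARNING or above reach the handler; lowering the level to logging.DEBUG would record all five.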

　　x**y and pow(x,y) both compute x raised to the power y.
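
　　A quick check of the equivalence, plus the optional third argument of pow, which the text above does not mention: it computes the power modulo that argument.

```python
# The operator and the built-in give the same result.
power_operator = 2 ** 10
power_builtin = pow(2, 10)

# Three-argument form: (2 ** 10) % 1000, computed without
# building the full intermediate power first.
modular = pow(2, 10, 1000)

print(power_operator, power_builtin, modular)
```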