A Detailed Walkthrough of the Code for Business-Circle Analysis Based on Base-Station Positioning Data

不丁真人 2017-07-26


Note that the author has only run the code in this chapter on Windows, not on Linux.

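The first script is meant to check how stable the data is. The part of the code to focus on is the min-max normalization, which rescales every column to [0, 1].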


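#-*- coding: utf-8 -*-
# Normalize the data to [0, 1]
import pandas as pd
# Parameter initialization
filename = '../data/business_circle.xls' # raw data file
standardizedfile = '../tmp/standardized.xls' # where the normalized data is saved
data = pd.read_excel(filename, index_col = u'基站编号') # read the data; '基站编号' is the base-station ID column
data = (data - data.min())/(data.max() - data.min()) # min-max normalization, applied column by column
data = data.reset_index() # turn the base-station ID back into an ordinary column
data.to_excel(standardizedfile, index = False) # save the result (pandas 2.0 and later no longer write .xls, so use a .xlsx path there)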
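
To make the min-max formula concrete, here is a minimal sketch on a toy table. The station IDs, column names, and values are made up for illustration and are not from the real data set. The helper `min_max_scale` is hypothetical, and mapping a constant column to 0 (instead of the NaN that the one-line formula would produce) is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical toy data: three base stations, two features (values are made up).
toy = pd.DataFrame(
    {"weekday_stay": [10.0, 20.0, 30.0], "daily_traffic": [100.0, 400.0, 200.0]},
    index=pd.Index(["A", "B", "C"], name="station_id"),
)

def min_max_scale(df):
    """Scale each column to [0, 1] via (x - min) / (max - min)."""
    span = df.max() - df.min()
    # A constant column has span == 0, which would divide by zero.
    # Replacing the span with 1 maps such a column to all zeros (an assumption).
    span = span.replace(0, 1)
    return (df - df.min()) / span

scaled = min_max_scale(toy)
print(scaled)
```

After scaling, the smallest value in each column becomes 0 and the largest becomes 1, so features measured in different units can be compared by a distance-based clustering method.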

The second script is meant to find out how many clusters the data should be split into.


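#-*- coding: utf-8 -*-
# Dendrogram (hierarchical clustering tree)
import pandas as pd
# Parameter initialization
standardizedfile = '../data/standardized.xls' # normalized data file (the first script saved it under ../tmp/, so copy the file or adjust this path)
data = pd.read_excel(standardizedfile, index_col = u'基站编号') # read the data
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage,dendrogram
# Use scipy's hierarchical clustering functions here
Z = linkage(data, method = 'ward', metric = 'euclidean') # compute the linkage matrix with Ward's method
P = dendrogram(Z, 0) # draw the dendrogram
plt.show()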

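The vertical axis of the dendrogram is the distance at which clusters are merged, not the number of clusters. Draw a horizontal line at a height where it crosses exactly three vertical branches; cutting the tree there splits the data into 3 clusters.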
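
As a rough sketch of how a cut on the dendrogram maps to a cluster count, the example below uses `scipy.cluster.hierarchy.fcluster` on synthetic points. The three blobs are invented for illustration and stand in for the real base-station table. The same tree is cut once by height and once by asking for three clusters directly.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical synthetic data: three well-separated blobs of 2-D points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
points = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

Z = linkage(points, method="ward", metric="euclidean")

# Column 2 of the linkage matrix holds the merge heights,
# which are the values on the dendrogram's vertical axis.
heights = Z[:, 2]

# A horizontal line between the third-largest and second-largest merge
# height leaves two merges above it, so it crosses three branches.
cut = (heights[-3] + heights[-2]) / 2
labels_by_height = fcluster(Z, t=cut, criterion="distance")

# The same request stated directly as a cluster count.
labels_by_count = fcluster(Z, t=3, criterion="maxclust")

print(len(set(labels_by_height)), len(set(labels_by_count)))
```

A large gap between consecutive merge heights is the usual visual cue for where to place the line.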

The number of clusters used in the third script is chosen from the result of the second script.


#-*- coding: utf-8 -*-
# Hierarchical clustering
import pandas as pd
# Parameter initialization
standardizedfile = '../data/standardized.xls' # normalized data file
k = 3 # number of clusters
data = pd.read_excel(standardizedfile, index_col = u'基站编号') # read the data
from sklearn.cluster import AgglomerativeClustering # sklearn's hierarchical clustering class
model = AgglomerativeClustering(n_clusters = k, linkage = 'ward') # AgglomerativeClustering is bottom-up (agglomerative) hierarchical clustering
model.fit(data) # fit the model
# Output the original data together with each sample's cluster
r = pd.concat([data, pd.Series(model.labels_, index = data.index)], axis = 1) # append each sample's cluster label; a Series is a one-dimensional pandas data structure
print("r=",r) # r now holds the data plus the labels assigned by the model
r.columns = list(data.columns) + [u'聚类类别'] # rename the header; the new last column '聚类类别' (cluster label) has the same value for rows in the same cluster
print("************************************************************")
print("list(data.columns)",list(data.columns))
print("------------------------------------------------------------")
print("------------------------------------------------------------")
print("r.columns=",r.columns )
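import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # so that Chinese labels render correctly
plt.rcParams['axes.unicode_minus'] = False # so that the minus sign renders correctly
style = ['ro-', 'go-', 'bo-'] # plot styles: r = red, g = green, b = blue; in 'o-', o marks each point with a circle and - joins the points with a line
# Axis labels: weekday stay time per person, early-morning stay time per person, weekend stay time per person, average daily traffic
xlabels = [u'工作日人均停留时间', u'凌晨人均停留时间', u'周末人均停留时间', u'日均人流量']
pic_output = '../tmp/type_' # filename prefix for the cluster plots
# In the nested for loops below, the outer loop selects which figure to draw and the inner loop draws the colored lines one by one
# There is one figure per cluster, and each line in a figure represents one full row of the Excel data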
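for i in range(k): # draw the figures one by one, each in its own style
    plt.figure()
    tmp = r[r[u'聚类类别'] == i].iloc[:,:4] # r holds the labeled data, so this takes all rows of one cluster and their first 4 columns
    # iloc stands for integer location: it selects by integer position rather than by label
    print("tmp=",tmp)
    print("ENNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN")
    for j in range(len(tmp)): # with k = 3 and 431 rows in total, the clusters hold 146, 146 and 139 rows, so j runs over 0~145, 0~145 and 0~138
        plt.plot(range(1, 5), tmp.iloc[j], style[i]) # range(1, 5) gives the x positions 1 to 4 of the four attributes
        # tmp.iloc[j] above is the j-th row of the current cluster
        # style[i] picks from style = ['ro-', 'go-', 'bo-'] defined earlier
    plt.xticks(range(1, 5), xlabels, rotation = 20) # tick labels; rotation is the tilt angle of the x-axis labels
    # To adapt the program, change the two range(1, 5) calls and the slice in tmp = r[r[u'聚类类别'] == i].iloc[:,:4] together
    plt.title(u'商圈类别%s' %(i+1)) # title "business circle type N"; people usually count from 1
    plt.subplots_adjust(bottom=0.15) # leave room at the bottom for the rotated labels
    plt.savefig(u'%s%s.png' %(pic_output, i+1)) # save the figure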
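
Besides reading the three figures, the clusters can also be summarized numerically. The sketch below is one possible way to do that, not part of the original scripts. It runs on a small synthetic table, because the real Excel file is not reproduced here, and the English column names and the generated values are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Hypothetical stand-in for the normalized table: 30 rows, 4 features in [0, 1].
rng = np.random.default_rng(1)
cols = ["weekday_stay", "early_morning_stay", "weekend_stay", "daily_traffic"]
centers = np.array([
    [0.8, 0.1, 0.2, 0.5],
    [0.2, 0.8, 0.7, 0.2],
    [0.3, 0.2, 0.3, 0.9],
])
values = np.vstack([c + 0.03 * rng.standard_normal((10, 4)) for c in centers])
data = pd.DataFrame(values.clip(0, 1), columns=cols)

# Same model settings as the third script: Ward linkage, three clusters.
model = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(data)
labeled = data.assign(cluster=model.labels_)

# Number of rows in each cluster (this is len(tmp) in the plotting loop).
sizes = labeled["cluster"].value_counts().sort_index()
# Mean of every feature within each cluster.
profile = labeled.groupby("cluster")[cols].mean()

print(sizes)
print(profile.round(2))
```

The per-cluster means give a compact version of what each figure shows: which of the four features is high or low for a given type of business circle.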