Data analysis: what are the 35,000 movies listed on Wikipedia actually about?

heii2 2020-06-28

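Dataset contents

The dataset comes from Kaggle and contains descriptions of 34,886 movies from around the world. For each movie it records the release year, title, genre, cast, director, and a plot description, among other fields.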

Inspecting the data

import pandas as pd

# Read the dataset and look at the first five rows
df = pd.read_csv('C:/Users/.../wiki_movie_plots_deduped.csv')
df.head()

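The dataset has 8 columns. From left to right they are: release year, title, origin, director, main cast, genre, Wiki page URL, and plot description.

df.describe()  # summary statistics for the dataset

The dataset covers 34,886 movies listed on Wikipedia, released between 1901 and 2017.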

Data analysis

01 How many movies are produced each year?

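import matplotlib.pyplot as plt
import seaborn as sns

xcol = 'Release Year'
params = get_params()  # the author's helper returning rcParams settings; its definition is not shown
sns.set(style='whitegrid')
figsize = (18, 30)

plt.rcParams.update(params)
plt.figure(figsize=figsize)  # apply the figure size, which was otherwise unused
plt.tick_params(labelsize=12)
sns.countplot(y=xcol, data=df)
plt.title('Movie Count Per ' + xcol)
plt.tight_layout()
plt.show()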

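Over more than a century the number of movies rose sharply. Growth accelerated noticeably after the turn of the 21st century and peaked in 2013, when more than 1,000 movies were released.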
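The peak year can also be read off numerically instead of from the chart. Here is a minimal sketch, assuming a toy stand-in for the `Release Year` column (the values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for df['Release Year']; values are illustrative only
years = pd.Series([1901, 1920, 1920, 2013, 2013, 2013, 2017], name='Release Year')

# Movies per year, ordered chronologically
per_year = years.value_counts().sort_index()

# Year with the most movies, and how many it had
peak_year = per_year.idxmax()
peak_count = per_year.max()
print(peak_year, peak_count)
```

On the real data, `years` would simply be `df['Release Year']`.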

02 How does movie output compare across countries?

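figsize = (18, 10)
xcol = 'Origin/Ethnicity'
sns.set(style='darkgrid')
params = get_params()  # the author's helper; its definition is not shown

plt.rcParams.update(params)
plt.figure(figsize=figsize)  # apply the figure size, which was otherwise unused
plt.tick_params(labelsize=18)
sns.countplot(x=xcol, data=df)
plt.title('Movie Count Per ' + xcol)
plt.xticks(rotation=60)
plt.show()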

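American output is far ahead of everyone else, so much so that the other countries are hard to make out in the bar chart.

Let's try a pie chart instead: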

dfAnalyze = df.copy()
dfAnalyze.head()
# Drop every column except Title and Origin/Ethnicity
columns = ['Release Year', 'Director', 'Cast', 'Genre', 'Wiki Page', 'Plot']
dfPie = dfAnalyze.drop(columns, axis=1)
# Count titles per origin
dfPie = dfPie.groupby(['Origin/Ethnicity']).count().rename(columns={'Title': 'count'})
pie = dfPie.plot.pie(subplots=True, figsize=(20, 20))

American movies take up the largest slice by far; Bollywood and British films also make a good showing.
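The slices of the pie can be expressed as exact fractions too. This is a rough sketch on a toy stand-in for the `Origin/Ethnicity` column (the labels and counts are made up, not the dataset's real numbers):

```python
import pandas as pd

# Toy stand-in for df['Origin/Ethnicity']; labels and counts are illustrative only
origin = pd.Series(['American'] * 5 + ['Bollywood'] * 3 + ['British'] * 2)

# Fraction of movies per origin, largest first
share = origin.value_counts(normalize=True)
print((share * 100).round(1))
```

With `normalize=True`, `value_counts` returns proportions that sum to 1, which is what the pie chart draws.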

03 How film developed in 7 countries over the past 100+ years

sns.set(style='whitegrid')
figsize = (20, 8)
xcol = 'Origin/Ethnicity'
params = get_params()  # the author's helper; its definition is not shown
plt.rcParams.update(params)
plt.figure(figsize=figsize)  # apply the figure size, which was otherwise unused
total = len(df[xcol])
for country in df[xcol].unique():
    c = df[df[xcol] == country]
    # Only plot origins that account for more than 3% of all movies
    if len(c) > total * 0.03:
        x = c['Release Year'].value_counts()
        sns.lineplot(x=x.index, y=x.values, label=country)
plt.legend()
plt.title('Movie Count Per ' + xcol)
plt.show()

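American cinema got its start as early as 1901 and grew rapidly after 1920. The other six countries only began developing their film industries after 1920, and their growth lagged far behind that of the US.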
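The loop above keeps only origins with more than 3% of all movies and then counts releases per year. The same idea can be sketched without a loop, here on toy data with a hypothetical cutoff of 0.25 so that the small example filters something out:

```python
import pandas as pd

# Toy data; the real columns are df['Origin/Ethnicity'] and df['Release Year']
toy = pd.DataFrame({
    'Origin/Ethnicity': ['American'] * 6 + ['British'] * 3 + ['Other'],
    'Release Year': [1901, 1920, 1920, 1950, 1950, 1950, 1920, 1950, 1950, 1950],
})

min_share = 0.25  # hypothetical cutoff for the toy data; the post uses 0.03

# Keep only origins whose share of all movies exceeds the cutoff
share = toy['Origin/Ethnicity'].value_counts(normalize=True)
major = share[share > min_share].index

# Rows are years, columns are origins, cells are movie counts
counts = pd.crosstab(toy['Release Year'], toy['Origin/Ethnicity'])[list(major)]
print(counts)
```

One advantage of the table form is that years with no releases show up as explicit zeros, and `counts.plot()` would draw one line per origin.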

04 Genre distribution

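# simpler_genres is the author's list of broad genre names; its definition is not shown in the post
def getSimplerGenre(rowgenre):
    # An exact match is returned as-is
    if rowgenre in simpler_genres:
        return rowgenre
    # Otherwise return the first broad genre contained in the raw label
    for s_genre in simpler_genres:
        if s_genre in rowgenre:
            return s_genre
    return 'unknown'

simpler_genre_set = [getSimplerGenre(g) for g in df['Genre']]

# Count how many movies fall into each broad genre
simpler_genre_set_count = {}
for sg in simpler_genre_set:
    if sg not in simpler_genre_set_count:
        simpler_genre_set_count[sg] = 0
    simpler_genre_set_count[sg] += 1
sns.barplot(x=list(simpler_genre_set_count.values()), y=list(simpler_genre_set_count.keys()), log=True)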

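A considerable share of the movies in the dataset have an unknown genre. Setting those aside, there are 9 genres, of which drama and comedy are the most common.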
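To make the genre mapping concrete, here is a small sketch. The list of broad genres below is hypothetical (the post does not show its own `simpler_genres`), and the raw labels are made up in the style of the `Genre` column:

```python
from collections import Counter

# Hypothetical list of broad genres; the post's actual simpler_genres is not shown
simpler_genres = ['drama', 'comedy', 'western', 'horror']

def get_simpler_genre(raw_genre):
    """Map a raw genre label to the first broad genre it contains."""
    if raw_genre in simpler_genres:
        return raw_genre
    for genre in simpler_genres:
        if genre in raw_genre:
            return genre
    return 'unknown'

# Made-up raw labels in the style of the Genre column
raw = ['drama', 'romantic comedy', 'comedy drama', 'sci-fi', 'western']
counts = Counter(get_simpler_genre(g) for g in raw)
print(counts)
```

Note that the order of the list decides mixed labels: 'comedy drama' is counted as drama here only because 'drama' comes first in the list.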

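Take the US as an example. Film is a direct product of a country's economic, social, and cultural needs: from the comedies, slapstick films, and westerns of the early silent era, through the successive arrival of musicals, gangster films, detective films, horror films, and more, it has grown into the wide variety of genres we see today.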

05 Which director comes out on top?

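# The post does not show how df1 was built for directors; the three lines
# below are an assumed reconstruction that mirrors the cast code in section 06
xcol = 'Director'
c = df[xcol].value_counts()
df1 = pd.DataFrame({xcol: c.index, 'Count': c.values})

n_show = 30
df2 = df1[df1['Count'] > n_show]  # keep directors with more than n_show movies
sns.set(style='whitegrid')

figsize = (18, 30)

params = get_params()  # the author's helper; its definition is not shown

plt.rcParams.update(params)
plt.figure(figsize=figsize)  # apply the figure size, which was otherwise unused
sns.barplot(x=df2['Count'], y=df2[xcol])
plt.title('Movie Count Per ' + xcol)
plt.show()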

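1. Michael Curtiz was born in Budapest, Austria-Hungary. In 1924 he released the landmark film The Moon of Israel, which caught the attention of the American industry. Warner Bros. then brought him to Hollywood, and over the following decades Curtiz made more than 100 films for the studio.

2. Lloyd Bacon (December 4, 1889 to November 15, 1955) was an American film, stage, and vaudeville actor as well as a film director.

3. Jules White (born Julius Weiss, September 17, 1900 to April 30, 1985) was a Hungarian-born American film director and producer who received 4 Academy Award nominations.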

06 Who is the most prolific actor?

import re

xcol = 'Cast'
df1 = pd.DataFrame({xcol: df[xcol]})
df1[xcol] = df1[xcol].fillna('None')
# Strip parentheses and normalize the separators ' & ', ' and ', '/' to ', '
df1[xcol] = df1[xcol].apply(lambda x: re.sub('[()]', '', x))
df1[xcol] = df1[xcol].apply(lambda x: re.sub(' & ', ', ', x))
df1[xcol] = df1[xcol].apply(lambda x: re.sub(' and ', ', ', x))
df1[xcol] = df1[xcol].apply(lambda x: re.sub('/', ', ', x))

# Split each cast string and keep only names with at least two words
l = list()
for index, row in df1.iterrows():
    t = row[xcol].split(', ')
    l.extend([i for i in t if len(i.split(' ')) > 1])

# Count appearances per actor
df1 = pd.DataFrame({xcol: l})
c = df1[xcol].value_counts()
df1 = pd.DataFrame({xcol: c.index, 'Count': c.values})
df1 = df1[df1[xcol] != 'None']
n_show = 50
df2 = df1[df1['Count'] > n_show]  # keep actors with more than n_show movies
sns.set(style='whitegrid')

figsize = (18, 30)

params = get_params()  # the author's helper; its definition is not shown

plt.rcParams.update(params)
plt.figure(figsize=figsize)  # apply the figure size, which was otherwise unused
sns.barplot(x=df2['Count'], y=df2[xcol])
plt.title('Movie Count Per ' + xcol)
plt.show()

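First place goes to Mithun Chakraborty, who appeared in 140+ movies. Never heard of him? Neither had I, so let's look him up on Baidu.

He is an Indian actor and screenwriter, born June 16, 1947, whose best-known works include 《现代罗宾汉》 and 《真假王子》. He has won two Filmfare Awards and three National Film Awards.

Prolific and this well decorated, so the movies are probably decent. I'm off to catch up on them.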
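For readers who want to see what the cast cleaning in this section does to individual strings, here is a small standalone sketch. It follows the same rules as the code above but collapses the separator replacements into one pattern; the cast strings are made up for illustration:

```python
import re
from collections import Counter

def split_cast(cast):
    """Split one Cast string into individual names, following the post's rules."""
    cast = re.sub(r'[()]', '', cast)            # drop parentheses
    cast = re.sub(r' & | and |/', ', ', cast)   # unify separators to ', '
    # Keep only entries with at least two words, as the post does
    return [name for name in cast.split(', ') if len(name.split(' ')) > 1]

# Made-up cast strings for illustration
rows = ['Jane Doe, John Roe', 'Jane Doe & Sam Poe', 'John Roe/Jane Doe and Cher']
counts = Counter(name for row in rows for name in split_cast(row))
print(counts.most_common(2))
```

One side effect of the two-word rule is that performers credited under a single name are dropped from the count, so the ranking slightly undercounts them.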

07 What are these movies about, judging by their plot descriptions?

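from wordcloud import WordCloud, STOPWORDS

# Join all plot descriptions into one string
text = df['Plot'].str.cat(sep='. ')
stopwords = set(STOPWORDS)
wc = WordCloud(max_words=2000, stopwords=stopwords)
wc.generate(text)
plt.figure()  # create the figure before drawing, so no empty second figure appears
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()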
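A word cloud is essentially a picture of word frequencies, so a plain frequency table gives a rough numeric view of the same thing. The sketch below uses only the standard library, a tiny hypothetical stopword list (far smaller than the one shipped with wordcloud), and made-up plot snippets standing in for `df['Plot']`:

```python
import re
from collections import Counter

# Tiny hypothetical stopword list; wordcloud.STOPWORDS is far larger
stopwords = {'the', 'a', 'and', 'his', 'her', 'to', 'of', 'in', 'is'}

# Made-up plot snippets standing in for df['Plot']
plots = [
    'A detective returns to the city to find his father.',
    'The father of a young detective is missing in the city.',
]

# Lowercase, tokenize on letters and apostrophes, then drop stopwords
text = '. '.join(plots).lower()
words = re.findall(r"[a-z']+", text)
freq = Counter(w for w in words if w not in stopwords)
print(freq.most_common(3))
```

This is only an approximation of what the word cloud shows, since the library applies its own tokenization and stopword handling.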