程序目的:
根据特定的SNP list, 在千人基因组数据库中爬取CHB人群的等位基因频率信息,如https://www.ncbi.nlm./variation/tools/1000genomes/?q=rs12340895。
因为网页是动态的数据,嵌入了JavaScript代码,因此借助selenium来爬取信息。
Beautiful Soup是Python的一个库,最主要的功能是从网页抓取数据。Beautiful Soup提供一些简单的、python式的函数用来处理导航、搜索、修改分析树等功能。它是一个工具箱,通过解析文档为用户提供需要抓取的数据,避免了用繁杂的正则表达式。
准备工作:
源代码
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
def get_allele_feq(browser, snp):
browser.get(
'https://www.ncbi.nlm./variation/tools/1000genomes/?q=%s' %snp) #Load page
# browser.implicitly_wait(60) #智能等待xx秒
time.sleep(30) #加载时间较长,等待加载完毕
# browser.find_element_by_css_selector("div[title=\"Han Chinese in Bejing, China\"]") #use selenium function to find elements
# 把selenium的webdriver调用page_source函数在传入BeautifulSoup中,就可以用BeautifulSoup解析网页了
bs = BeautifulSoup(browser.page_source, "lxml")
# bs.find_all("div", title="Han Chinese in Bejing, China")
try:
race = bs.find(string="CHB")
race_data = race.find_parent("div").find_parent(
"div").find_next_sibling("div")
# print race_data
race_feq = race_data.find("span", class_="gt-selected").find_all("li") # class_ 防止Python中类关键字重复,产生语法错误
base1_feq = race_feq[0].text #获取标签的内容
base2_feq = race_feq[1].text
return snp, base1_feq, base2_feq # T=0.1408 C=0.8592
except NoSuchElementException:
return "%s:can't find element" %snp
def main():
browser = webdriver.Chrome() # Get local session of chrome
fh = open("./4diseases_snps_1kCHB_allele_feq.list2", 'w')
snps = open("./4diseases_snps.list.uniq2",'r')
for line in snps:
snp = line.strip()
response = get_allele_feq(browser, snp)
time.sleep(1)
fh.write("\t".join(response)) #unicode 编码的对象写到文件中后相当于print效果
fh.write("\n")
print "\t".join(response)
time.sleep(1) # sleep a few seconds
fh.close()
browser.quit() # 退出并关闭窗口的每一个相关的驱动程序
if __name__ == '__main__':
main()
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
- 31
- 32
- 33
- 34
- 35
- 36
- 37
- 38
- 39
- 40
- 41
- 42
- 43
- 44
- 45
- 46
- 47
- 48
- 49
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
- 31
- 32
- 33
- 34
- 35
- 36
- 37
- 38
- 39
- 40
- 41
- 42
- 43
- 44
- 45
- 46
- 47
- 48
- 49
参考资料:
1]: http://beautifulsoup./zh_CN/latest/ #Beautiful Soup 4.4.0 文档
2]: http://blog.csdn.net/buptlrw/article/details/48828201
3]: http://www.cnblogs.com/duyang/p/5144987.html
4]: http://blog.csdn.net/cjsafty/article/details/9206323
5]: http:///questions/8255929/running-webdriver-chrome-with-selenium
6]: https://pypi./pypi/selenium#downloads
7]: http://blog.csdn.net/leejeff/article/details/52935706
|