What is Beautiful Soup?

Beautiful Soup is a Python library that parses an HTML or XML document into a tree structure, making it easy to find and extract data. It is commonly used for scraping data from websites. Beautiful Soup provides a simple, Pythonic interface and automatic encoding conversion, making it easy to work with website data.

In this guide you will write a Python script that collects motorcycle prices from Craigslist. The script will be set up to run periodically as a cron job, and the resulting data will be exported to an Excel spreadsheet for trend analysis. By substituting a different URL and adjusting the script accordingly, you can easily adapt these steps to other websites or search queries.

Install Beautiful Soup

Install Python
Install Beautiful Soup and Dependencies
Build the Web Scraper

Required Modules
craigslist.py
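The import block below is excerpted from the complete script shown at the end of this guide. BeautifulSoup parses the pages, urllib3 fetches them, TinyDB stores the scraped records, xlsxwriter writes the spreadsheet, and datetime timestamps each record:

from bs4 import BeautifulSoup
import datetime
from tinydb import TinyDB, Query
import urllib3
import xlsxwriter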
Add Global Variables

After the import statements, add the global variables and configuration options:

craigslist.py
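These lines are excerpted from the complete script at the end of this guide. The first line silences urllib3's certificate warnings, url is the starting search page, and total_added counts the records added during a run:

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = 'https://elpaso.craigslist.org/search/mcy?sort=date'
total_added = 0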
Retrieve the Web Page

The make_soup function makes a GET request to the target url and parses the resulting HTML with Beautiful Soup:

craigslist.py
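As defined in the complete script at the end of this guide:

def make_soup(url):
    # issue a GET request and hand the raw HTML to the lxml parser
    http = urllib3.PoolManager()
    r = http.request("GET", url)
    return BeautifulSoup(r.data, 'lxml')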
Beautiful Soup supports several parsers that are more or less strict about the structure of the web page. For the example script in this guide the lxml parser is sufficient, but depending on your needs you may want to check the other options described in the official documentation.

Process the Soup Object

Objects of the BeautifulSoup class organize the parsed page as a tree. Each result on the search page https://elpaso.craigslist.org/search/mcy?sort=date is contained in an li element like this one:

<li class="result-row" data-pid="6370204467">
    <a href="https://elpaso.craigslist.org/mcy/d/ducati-diavel-dark/6370204467.html" class="result-image gallery" data-ids="1:01010_8u6vKIPXEsM,1:00y0y_4pg3Rxry2Lj,1:00F0F_2mAXBoBiuTS">
        <span class="result-price">$12791</span>
    </a>
    <p class="result-info">
        <span class="icon icon-star" role="button">
            <span class="screen-reader-text">favorite this post</span>
        </span>
        <time class="result-date" datetime="2017-11-01 19:38" title="Wed 01 Nov 07:38:13 PM">Nov 1</time>
        <a href="https://elpaso.craigslist.org/mcy/d/ducati-diavel-dark/6370204467.html" data-id="6370204467" class="result-title hdrlnk">Ducati Diavel | Dark</a>
        <span class="result-meta">
            <span class="result-price">$12791</span>
            <span class="result-tags">
                pic
                <span class="maptag" data-pid="6370204467">map</span>
            </span>
            <span class="banish icon icon-trash" role="button">
                <span class="screen-reader-text">hide this posting</span>
            </span>
            <span class="unbanish icon icon-trash red" role="button" aria-hidden="true"></span>
            <a href="#" class="restore-link">
                <span class="restore-narrow-text">restore</span>
                <span class="restore-wide-text">restore this posting</span>
            </a>
        </span>
    </p>
</li>

The find_all method gathers every li element whose class is result-row:
results = soup.find_all("li", class_="result-row")
craigslist.py

rec = {
    'pid': result['data-pid'],
    'date': result.p.time['datetime'],
    'cost': clean_money(result.a.span.string.strip()),
    'webpage': result.a['href'],
    'pic': clean_pic(result.a['data-ids']),
    'descr': result.p.a.string.strip(),
    'createdt': datetime.datetime.now().isoformat()
}
Each field of the record is pulled from the parsed li element:

- 'pid': result['data-pid'] — attributes such as data-pid are read with bracket notation, like dictionary keys.
- 'date': result.p.time['datetime'] — nested tags can be reached with dot notation: the time element inside the p element, then its datetime attribute.
- 'cost': the text of the span inside the first a element, stripped of whitespace and passed through clean_money to remove the dollar sign.
- 'webpage': the href attribute of the first a element, which links to the full posting.
- 'pic': clean_pic(result.a['data-ids']) — the data-ids attribute holds the posting's image ids; clean_pic converts the first one into an image URL.
- 'descr': the text of the a element inside p, which holds the posting title.
- 'createdt': datetime.datetime.now().isoformat() — the current timestamp from Python's datetime module, recording when the record was scraped.
After the record is built, its pid is checked against the database so that duplicate entries are skipped:

craigslist.py

Result = Query()
s1 = db.search(Result.pid == rec["pid"])
if not s1:
    total_added += 1
    print ("Adding ... ", total_added)
    db.insert(rec)

Error Handling

It is important to handle two types of errors. These are not bugs in the script; rather, irregularities in the structure of a snippet can cause Beautiful Soup's API to raise them. One is an AttributeError, raised when dot notation refers to a tag that is missing from the result. The other is a KeyError, raised when a tag lacks an expected attribute. If either error occurs while a result is being parsed, that result is skipped to ensure that a malformed snippet is never inserted into the database:

craigslist.py
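The loop from the complete script shows the pattern; the record extraction and database insertion from the previous snippets sit inside the try block:

for result in results:
    try:
        rec = {
            'pid': result['data-pid'],
            'date': result.p.time['datetime'],
            'cost': clean_money(result.a.span.string.strip()),
            'webpage': result.a['href'],
            'pic': clean_pic(result.a['data-ids']),
            'descr': result.p.a.string.strip(),
            'createdt': datetime.datetime.now().isoformat()
        }

        Result = Query()
        s1 = db.search(Result.pid == rec["pid"])
        if not s1:
            total_added += 1
            print ("Adding ... ", total_added)
            db.insert(rec)
    except (AttributeError, KeyError) as ex:
        # a result missing an expected tag or attribute is skipped
        pass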
Cleaning Functions

These are two short custom functions used to clean up the snippet data. The clean_money function strips the dollar sign from the price and converts it to an integer:

craigslist.py
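As defined in the complete script:

def clean_money(amt):
    # "$12791" -> 12791
    return int(amt.replace("$",""))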
The clean_pic function builds a full image URL from the listing's data-ids attribute:

craigslist.py
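As defined in the complete script. A data-ids value looks like "1:01010_8u6vKIPXEsM,1:00y0y_4pg3Rxry2Lj,...", so the function keeps the first id, drops its "1:" prefix, and substitutes it into the image URL template:

def clean_pic(ids):
    idlist = ids.split(",")
    first = idlist[0]
    code = first.replace("1:","")
    return "https://images.craigslist.org/%s_300x300.jpg" % code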
The function extracts and cleans the id of the first image, then appends it to the base image URL.

Write the Data to an Excel Spreadsheet

The make_excel function writes the records from the database to the spreadsheet. It begins with the column headings and a row counter:
craigslist.py

Headlines = ["Pid", "Date", "Cost", "Webpage", "Pic", "Desc", "Created Date"]
row = 0

The Headlines variable holds the list of column titles for the spreadsheet, and the row variable tracks the current spreadsheet row.
Next, the workbook and worksheet are created:

craigslist.py

workbook = xlsxwriter.Workbook('motorcycle.xlsx')
worksheet = workbook.add_worksheet()

The column widths are then set:
worksheet.set_column(0,0, 15) # pid
worksheet.set_column(1,1, 20) # date
worksheet.set_column(2,2, 7)  # cost
worksheet.set_column(3,3, 10) # webpage
worksheet.set_column(4,4, 7)  # picture
worksheet.set_column(5,5, 60) # description
worksheet.set_column(6,6, 30) # created date

The first two arguments to set_column are the first and last columns of the range being sized; the third argument is the column width.
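The column titles are then written across the header row; this loop is excerpted from the complete script at the end of this guide:

for col, title in enumerate(Headlines):
    worksheet.write(row, col, title)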
Next, each database record is written to the spreadsheet:

craigslist.py

for item in db.all():
    row += 1
    worksheet.write(row, 0, item['pid'])
    worksheet.write(row, 1, item['date'])
    worksheet.write(row, 2, item['cost'])
    worksheet.write_url(row, 3, item['webpage'], string='Web Page')
    worksheet.write_url(row, 4, item['pic'], string="Picture")
    worksheet.write(row, 5, item['descr'])
    worksheet.write(row, 6, item['createdt'])

Most of the fields in each row can be written with worksheet.write; the webpage and picture fields use worksheet.write_url instead, so they appear in the spreadsheet as clickable links labeled "Web Page" and "Picture".
Finally, the workbook is closed, which saves the spreadsheet to disk:

craigslist.py

workbook.close()

Main Routine

The main routine iterates over every page of search results and runs the soup_process function on each page. It also tracks the total number of database entries added in the global variable total_added, which is updated inside soup_process and displayed once the scrape is complete. It creates the TinyDB database db.json at the start of the run and, once the last page has been processed, passes the database to make_excel:

craigslist.py
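As defined in the complete script below. soup.find("link", rel="next") locates the link element that points at the next page of results; when no such element exists, url becomes False and the loop ends:

def main(url):
    global total_added
    db = TinyDB("db.json")

    while url:
        print ("Web Page: ", url)
        soup = soup_process(url, db)
        nextlink = soup.find("link", rel="next")

        url = False
        if (nextlink):
            url = nextlink['href']

    print ("Added ",total_added)
    make_excel(db)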
A sample run might look like the following. Note that each page embeds an index in the URL; this is how Craigslist knows where the next page of data begins:

$ python3 craigslist.py
Web Page:  https://elpaso.craigslist.org/search/mcy?sort=date
Adding ...  1
Adding ...  2
Adding ...  3
Web Page:  https://elpaso.craigslist.org/search/mcy?s=120&sort=date
Web Page:  https://elpaso.craigslist.org/search/mcy?s=240&sort=date
Web Page:  https://elpaso.craigslist.org/search/mcy?s=360&sort=date
Web Page:  https://elpaso.craigslist.org/search/mcy?s=480&sort=date
Web Page:  https://elpaso.craigslist.org/search/mcy?s=600&sort=date
Added  3

Set Up a Cron Job to Automate the Script

This section sets up a cron task to run the scraping script automatically at regular intervals, so the data stays current without manual runs. Log in as the limited user that will run the script:
ssh normaluser@<Linode Public IP>
The complete craigslist.py script is shown below; place it in the normal user's home directory on the machine that will run it:

from bs4 import BeautifulSoup
import datetime
from tinydb import TinyDB, Query
import urllib3
import xlsxwriter

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = 'https://elpaso.craigslist.org/search/mcy?sort=date'
total_added = 0

def make_soup(url):
    # fetch the page and parse it with lxml
    http = urllib3.PoolManager()
    r = http.request("GET", url)
    return BeautifulSoup(r.data,'lxml')

def main(url):
    global total_added
    db = TinyDB("db.json")

    # follow the "next" link from page to page until none remains
    while url:
        print ("Web Page: ", url)
        soup = soup_process(url, db)
        nextlink = soup.find("link", rel="next")

        url = False
        if (nextlink):
            url = nextlink['href']

    print ("Added ",total_added)
    make_excel(db)

def soup_process(url, db):
    global total_added

    soup = make_soup(url)
    results = soup.find_all("li", class_="result-row")

    for result in results:
        try:
            rec = {
                'pid': result['data-pid'],
                'date': result.p.time['datetime'],
                'cost': clean_money(result.a.span.string.strip()),
                'webpage': result.a['href'],
                'pic': clean_pic(result.a['data-ids']),
                'descr': result.p.a.string.strip(),
                'createdt': datetime.datetime.now().isoformat()
            }

            # only insert records whose pid is not already in the database
            Result = Query()
            s1 = db.search(Result.pid == rec["pid"])

            if not s1:
                total_added += 1
                print ("Adding ... ", total_added)
                db.insert(rec)

        except (AttributeError, KeyError) as ex:
            # skip results with a missing tag or attribute
            pass

    return soup

def clean_money(amt):
    return int(amt.replace("$",""))

def clean_pic(ids):
    idlist = ids.split(",")
    first = idlist[0]
    code = first.replace("1:","")
    return "https://images.craigslist.org/%s_300x300.jpg" % code

def make_excel(db):
    Headlines = ["Pid", "Date", "Cost", "Webpage", "Pic", "Desc", "Created Date"]
    row = 0

    workbook = xlsxwriter.Workbook('motorcycle.xlsx')
    worksheet = workbook.add_worksheet()

    worksheet.set_column(0,0, 15) # pid
    worksheet.set_column(1,1, 20) # date
    worksheet.set_column(2,2, 7)  # cost
    worksheet.set_column(3,3, 10) # webpage
    worksheet.set_column(4,4, 7)  # picture
    worksheet.set_column(5,5, 60) # description
    worksheet.set_column(6,6, 30) # created date

    for col, title in enumerate(Headlines):
        worksheet.write(row, col, title)

    for item in db.all():
        row += 1
        worksheet.write(row, 0, item['pid'])
        worksheet.write(row, 1, item['date'])
        worksheet.write(row, 2, item['cost'])
        worksheet.write_url(row, 3, item['webpage'], string='Web Page')
        worksheet.write_url(row, 4, item['pic'], string="Picture")
        worksheet.write(row, 5, item['descr'])
        worksheet.write(row, 6, item['createdt'])

    workbook.close()

main(url)
Edit the user's crontab:

crontab -e

This example entry runs the Python program every day at 6:30 AM:

30 6 * * * /usr/bin/python3 /home/normaluser/craigslist.py

The Python program writes the motorcycle.xlsx spreadsheet to the user's home directory.

Retrieve the Excel Report

On Linux

Use scp to copy motorcycle.xlsx to your local machine:

scp normaluser@<Linode Public IP>:/home/normaluser/motorcycle.xlsx .

On Windows

Use Firefox's built-in sftp capability. Type the following URL in the address bar and it will prompt for a password. Select the spreadsheet from the directory listing that appears:

sftp://normaluser@<Linode Public IP>/home/normaluser