Fetching Taobao IP address data with Python

 心不留意外尘 2016-08-02
http://www.oschina.net/code/snippet_1583032_47477
2015

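This script fetches IP address database information from http://ip.taobao.com. There are plenty of examples like this online, but few with complete code, so here is the version I wrote.
Taobao limits clients to at most 10 requests per second, so the script uses synchronous urllib calls. That is not very efficient; switching to the unirest library for asynchronous requests is something to consider later.
I also added a progress bar...
Link: https://github.com/Ghostist/taobaoip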
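
The worker below enforces the 10-requests-per-second cap by calling `ratelimit.ratecontrol()`, but the limiter itself is not shown on this page. Here is a minimal sketch of one way such a limiter could work, assuming a fixed minimum interval between calls shared by all threads. The class name `RateLimit` and its constructor parameters are hypothetical, not taken from the original project, and the sketch is written for Python 3 while the listings on this page are Python 2.

```python
import threading
import time


class RateLimit(object):
    """Hypothetical limiter: allows at most `rate` calls per second across threads."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate   # minimum spacing between two calls
        self.clock = clock           # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.lock = threading.Lock()
        self.next_slot = clock()     # earliest time the next call may proceed

    def ratecontrol(self):
        """Reserve the next free time slot, then sleep until it arrives."""
        with self.lock:
            now = self.clock()
            slot = max(self.next_slot, now)
            self.next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self.sleep(delay)
```

Each caller reserves its slot while holding the lock and sleeps outside it, so several worker threads can wait at the same time without blocking one another on the lock.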

1. [Code] Worker thread: fetches and parses the returned data

# Python 2 code. `cancel` is a module-level flag defined elsewhere in the script.
import json
import logging
import urllib
import Queue

# Fields copied from the response, in output order
FIELDS = (u'ip', u'country', u'country_id', u'area', u'area_id', u'region',
          u'region_id', u'city', u'city_id', u'county', u'county_id',
          u'isp', u'isp_id')

def fetch(ip):
    """Query ip.taobao.com for one address; return (ok, list of field values)."""
    url = 'http://ip.taobao.com/service/getIpInfo.php?ip=' + ip
    result = []
    try:
        response = urllib.urlopen(url).read()
        jsondata = json.loads(response)
        if jsondata[u'code'] != 0:
            return 0, result
        data = jsondata[u'data']
        for name in FIELDS:
            result.append(data[name].encode('utf-8'))
    except Exception:
        # Covers network errors as well as malformed or incomplete responses
        logging.exception("Fetch or parse failed:" + url)
        return 0, result
    return 1, result

def worker(ratelimit, jobs, results, progress):
    """Take IPs from `jobs`, fetch each one, and queue the joined fields in `results`."""
    global cancel
    while not cancel:
        try:
            ratelimit.ratecontrol()
            ip = jobs.get(timeout=2)  # wait up to 2 seconds for a job
            ok, result = fetch(ip)
            if not ok:
                logging.error("Fetch information failed, ip:{}".format(ip))
                progress.put("")  # count the item as processed even though it failed
            else:
                results.put(" ".join(result))
            jobs.task_done()  # mark this job as finished
        except Queue.Empty:
            pass
        except Exception:
            logging.exception("Unknown Error!")
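
The parsing step in `fetch` can be tried without any network access. The following is a sketch for Python 3 (the listing above is Python 2), assuming the response has the shape that `fetch` reads: a `code` field and a `data` object holding the thirteen keys. The helper name `parse_ip_info` and the sample payload are made up for illustration, and whether the service still answers in this form is not verified here. Unlike `fetch`, this variant substitutes an empty string for a missing key instead of treating it as a failure.

```python
import json

FIELDS = ('ip', 'country', 'country_id', 'area', 'area_id', 'region',
          'region_id', 'city', 'city_id', 'county', 'county_id',
          'isp', 'isp_id')


def parse_ip_info(text):
    """Return the list of field values, or None if the payload signals failure."""
    try:
        payload = json.loads(text)
    except ValueError:          # malformed JSON
        return None
    if not isinstance(payload, dict) or payload.get('code') != 0:
        return None
    data = payload.get('data')
    if not isinstance(data, dict):
        return None
    # Missing keys become empty strings instead of raising KeyError.
    return [str(data.get(name, '')) for name in FIELDS]


# Illustrative payload only, not a real response from the service.
sample = json.dumps({'code': 0,
                     'data': {'ip': '192.0.2.1', 'country': 'XX', 'isp': 'example'}})
row = parse_ip_info(sample)
```

Keeping the parser separate from the HTTP call makes it possible to test the field handling against saved responses.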

2. [Code] The process thread writes the results to output

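def process(target, results, progress):
    """Write each result line to `target` and report one progress tick per line."""
    global cancel
    while not cancel:
        try:
            line = results.get(timeout=5)  # wait up to 5 seconds for a result
        except Queue.Empty:
            pass
        else:
            print >>target, line  # Python 2 syntax for printing to a file object
            progress.put("")
            results.task_done()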
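
The writer loop above relies on the global `cancel` flag, so results still in the queue when the flag flips are never written. Here is a sketch of a variant for Python 3 that drains the queue before exiting, assuming a `threading.Event` is used in place of the global. The `stop` parameter and the sample lines are illustrative and not part of the original script.

```python
import io
import queue
import threading


def process(target, results, progress, stop):
    """Drain `results` into `target` until `stop` is set and the queue is empty."""
    while not (stop.is_set() and results.empty()):
        try:
            line = results.get(timeout=0.1)
        except queue.Empty:
            continue
        target.write(line + '\n')
        progress.put('')        # one tick per written line
        results.task_done()


results, progress = queue.Queue(), queue.Queue()
stop = threading.Event()
out = io.StringIO()             # stands in for a file or sys.stdout
writer = threading.Thread(target=process, args=(out, results, progress, stop))
writer.start()
for line in ('192.0.2.1 a', '198.51.100.2 b'):
    results.put(line)
results.join()                  # wait until every queued line has been written
stop.set()
writer.join()
```

Calling `results.join()` before setting `stop` ensures every queued line reaches the output before the thread ends.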

3. [Code] The progproc thread tracks progress. I used the progressbar2 package from pip, which writes to stderr by default, separate from the final result output

def progproc(progressbar, count, progress):
    """
    ProgressBar is not thread-safe, so progress is counted through a Queue, as in the
    other two threads, and only this thread prints the progress bar. The bar is written
    to stderr, so it does not conflict with the default result output (stdout).
    """
    idx = 1
    while True:
        try:
            progress.get(timeout=5)
        except Queue.Empty:
            pass
        else:
            progressbar.update(idx)
            idx += 1
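
As listed, `progproc` never reads its `count` argument and loops until the process exits. The following sketch for Python 3 shows a variant that stops once `count` ticks have arrived. This is an assumption about the intent, not the author's code. `FakeBar` is a hypothetical stand-in that only records calls, so the example runs without progressbar2 installed.

```python
import queue
import threading


class FakeBar(object):
    """Stand-in for a progress bar; only records the values passed to update()."""

    def __init__(self):
        self.seen = []

    def update(self, value):
        self.seen.append(value)


def progproc(progressbar, count, progress):
    """Consume ticks from `progress` and stop once `count` of them have arrived."""
    done = 0
    while done < count:
        try:
            progress.get(timeout=0.1)
        except queue.Empty:
            continue
        done += 1
        progressbar.update(done)


progress = queue.Queue()
bar = FakeBar()
t = threading.Thread(target=progproc, args=(bar, 3, progress))
t.start()
for _ in range(3):
    progress.put('')            # one tick per processed IP
t.join()
```

In the real script, a progressbar2 bar object would presumably be passed in place of `FakeBar`, since the loop only needs an `update` method.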
