Scraping My Own PAT Submissions | WAmaker

敬而远 2013-03-20

Scraping My Own PAT Submissions

August 5, 2012

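PAT is Zhejiang University's programming test platform; it hosts the problems used in the university's graduate admission machine tests and lab interviews. I practised on PAT for a while before my own second-round exam, and recently a classmate asked me for my code. Unfortunately, most of the code I wrote during four years of undergraduate study was lost in a system reinstall. Stranger still, although PAT keeps every submission a user has made, I could not find a page for searching my own submissions, so the only option was to page through the submission history one page at a time. I happen to be learning Python at the moment, so I decided to scrape the code with Python.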

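Problems to solve:

Logging in to the platform
Collecting my own submission IDs
Fetching the code for each ID
Converting HTML character entities
Going through the campus-network VPN proxy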

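PAT's submission history is visible only to logged-in users, and you can view the source only of your own submissions. So the first hurdle is logging in. PAT appears to authenticate with cookies, so cookielib is needed. Our lab reaches the outside network through a VPN, which calls for urllib2's ProxyHandler. The snippets in this post are Python 2 methods of a single class and assume import urllib, urllib2, cookielib, re, cPickle and from HTMLParser import HTMLParser at the top of the file. The initialization code is as follows: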

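def __init__(self,userName,password):
    self.userName=userName
    self.password=password
    # Request headers and cookie file used by the other methods (values here are placeholders).
    self.header={'User-Agent':'Mozilla/5.0'}
    self.cookiefile='pat_cookie.txt'
    self.cookie=cookielib.LWPCookieJar()
    handlers=[urllib2.HTTPCookieProcessor(self.cookie)]
    # If you need a proxy, uncomment the next line and fill in the proxy address, port, user name and password.
    #handlers.append(urllib2.ProxyHandler({"http":"http://username:password@address:port"}))
    opener=urllib2.build_opener(*handlers)
    urllib2.install_opener(opener)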
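
For readers on Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, the same setup might look roughly like the sketch below. The helper name build_opener_with_cookies and its proxy_url parameter are illustrative assumptions, not part of the original script.

```python
import http.cookiejar
import urllib.request


def build_opener_with_cookies(proxy_url=None):
    """Return (opener, cookie_jar); hypothetical helper, proxy is optional."""
    cookie_jar = http.cookiejar.LWPCookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(cookie_jar)]
    if proxy_url:
        # Placeholder format: "http://username:password@address:port"
        handlers.append(urllib.request.ProxyHandler({"http": proxy_url}))
    opener = urllib.request.build_opener(*handlers)
    return opener, cookie_jar


opener, jar = build_opener_with_cookies()
print(type(opener).__name__)
```

Collecting the handlers in a list first means the proxy can be switched on or off without touching the build_opener call.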

Login is a POST to http://pat./users/sign_in carrying the user name and password; the resulting cookie is saved along the way.

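def login(self):
    # Form fields submitted to the sign-in page.
    postdata={
        'user[handle]':self.userName,
        'user[password]':self.password
    }
    postdata=urllib.urlencode(postdata)
    # self.header and self.cookiefile must have been set in __init__.
    req=urllib2.Request(
            url='http://pat./users/sign_in',
            data=postdata,
            headers=self.header
            )
    # Supplying data makes this a POST; the installed opener records the cookie.
    urllib2.urlopen(req).read()
    self.cookie.save(self.cookiefile)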
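
As a rough Python 3 illustration of how the form ends up in a POST body, the sketch below builds the request without sending it. SIGN_IN_URL is a placeholder and build_login_request is a hypothetical helper; only the two form field names come from the code above.

```python
import urllib.parse
import urllib.request

# Placeholder URL: substitute the real sign-in address.
SIGN_IN_URL = "http://example.invalid/users/sign_in"


def build_login_request(handle, password, url=SIGN_IN_URL):
    """Build (but do not send) the POST request carrying the login form."""
    form = {"user[handle]": handle, "user[password]": password}
    # urlencode percent-escapes the brackets; Request needs bytes in Python 3.
    body = urllib.parse.urlencode(form).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(url, data=body,
                                  headers={"User-Agent": "Mozilla/5.0"})


req = build_login_request("alice", "secret")
print(req.get_method(), req.data)
```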

Fetch one page of the submission list, which contains submissions from all users:

def getList(self,page=1):
    # Fetch the raw HTML of one page of the submission list.
    req=urllib2.Request(
            url='http://pat./contests/pat-practise/submissions?page='+str(page),
            headers=self.header
            )
    return urllib2.urlopen(req).read()

Use a regular expression to pick out my own submissions:

def getMine(self,content):
    # Split the page into table rows; re.S lets '.' match newlines.
    tr=re.compile('<tr>.*?</tr>',re.S)
    trs=tr.findall(content)
    result=[]
    # Change the user name below to your own.
    findme=re.compile('.*WAmaker.*',re.S)
    for item in trs:
        if findme.match(item):
            result.append(item)
    return result
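
To see the row filter in action without hitting the site, here is a small Python 3 sketch run against made-up HTML. The SAMPLE markup and the rows_for_user helper are assumptions for illustration; the real page layout may differ. It swaps the second regex for a plain substring test, which selects the same rows.

```python
import re

# Made-up table; only the <tr>...</tr> structure matters here.
SAMPLE = """
<table>
<tr><td><a href="/submissions/101">101</a></td><td>WAmaker</td></tr>
<tr><td><a href="/submissions/102">102</a></td><td>someone_else</td></tr>
<tr>
  <td><a href="/submissions/103">103</a></td>
  <td>WAmaker</td>
</tr>
</table>
"""


def rows_for_user(html, user):
    """Return every <tr>...</tr> block that mentions the given user name."""
    # Non-greedy '.*?' stops at the first </tr>; re.S lets '.' span newlines.
    row_pattern = re.compile(r"<tr>.*?</tr>", re.S)
    return [row for row in row_pattern.findall(html) if user in row]


mine = rows_for_user(SAMPLE, "WAmaker")
print(len(mine))
```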

Use a regular expression to extract the submission ID and the problem ID. I have to say, regex groups are a wonderful thing:

def getIDs(self,mysubs):
    result=[]
    # Group 1 captures the submission ID, group 2 the problem ID.
    getID=re.compile(r'.*submissions/(\d+)">.*practise/(\d+)',re.S)
    for item in mysubs:
        ret=getID.match(item)
        if ret:
            result.append([ret.group(1),ret.group(2)])
    return result
      
def gao(self):
    # Scan the list pages, collect [submission ID, problem ID] pairs and pickle them.
    li=[]
    # Adjust the page range to suit yourself.
    for i in range(1,200):
        print "Scanning page:",i,
        content=self.getList(i)
        res=self.getMine(content)
        ll=self.getIDs(res)
        li+=ll
        print ".Done! Found",len(ll)
    fl=open('res.dat','w')
    cPickle.dump(li,fl)
    fl.close()
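
The group extraction and the pickle round trip can be tried offline with a Python 3 sketch like the one below. ROW is a made-up row shaped the way the regex above expects, and extract_ids is a hypothetical helper. It uses search with a non-greedy '.*?' instead of match with a leading '.*', a small substitution that finds the same two IDs in this sample.

```python
import os
import pickle
import re
import tempfile

# Made-up row in the shape the regex expects.
ROW = ('<tr><td><a href="/submissions/12345">12345</a></td>'
       '<td><a href="/contests/pat-practise/1001">1001</a></td></tr>')

ID_PATTERN = re.compile(r'submissions/(\d+)">.*?practise/(\d+)', re.S)


def extract_ids(rows):
    """Return [submission_id, problem_id] pairs for the rows that match."""
    pairs = []
    for row in rows:
        found = ID_PATTERN.search(row)
        if found:
            pairs.append([found.group(1), found.group(2)])
    return pairs


pairs = extract_ids([ROW, "<tr><td>no links here</td></tr>"])

# Round-trip through pickle, as the script does with res.dat.
path = os.path.join(tempfile.mkdtemp(), "res.dat")
with open(path, "wb") as f:
    pickle.dump(pairs, f)
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored)
```

Saving the ID list to disk separates the slow page scan from the download step, so the downloads can be rerun without rescanning.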

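With the submission ID we can open the code view page. The highlighted code is full of HTML tags and is unsuitable for scraping, but fortunately the page source also contains the un-highlighted code. It still needs HTMLParser to turn entities such as &lt; back into the less-than sign and so on (plain string replacement would also work). Each file is saved locally under the name problemID_submissionID: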

def getCode(self):
    fl=open('res.dat','r')
    li=cPickle.load(fl)
    fl.close()
    htmlparser=HTMLParser()
    # Capture everything between the <textarea> tags; re.S lets '.' match newlines.
    pattern=re.compile('<textarea.*?>(.*)</textarea>',re.S)
    for ID in li:
        # ID[0] is the submission ID, ID[1] the problem ID.
        print 'Handling '+ID[0]+'...',
        req=urllib2.Request(
                url='http://pat./submissions/'+ID[0]+'/source',
                headers=self.header
                )
        result=urllib2.urlopen(req).read()
        found=pattern.search(result)
        if not found:
            print 'no source found, skipped'
            continue
        # Turn entities such as &lt; back into the original characters.
        code=htmlparser.unescape(found.group(1))
        # The pat/ directory must already exist.
        path='pat/'+ID[1]+'_'+ID[0]+'.cpp'
        ff=open(path,'w')
        ff.write(code)
        ff.close()
        print 'Saved to '+path
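
In Python 3 the entity conversion is available as html.unescape (the HTMLParser.unescape method was removed in Python 3.9). The sketch below shows the extraction and unescaping on a made-up page; PAGE and extract_source are illustrative assumptions, and the real source page may be laid out differently.

```python
import html
import re

# Made-up page holding escaped C++ source inside a <textarea>.
PAGE = ('<html><body><textarea id="src" readonly>'
        '#include &lt;cstdio&gt;\n'
        'int main() { if (1 &lt; 2 &amp;&amp; 2 &gt; 1) puts(&quot;ok&quot;); }'
        '</textarea></body></html>')


def extract_source(page):
    """Return the unescaped text of the first <textarea>, or None if absent."""
    found = re.search(r"<textarea.*?>(.*?)</textarea>", page, re.S)
    if found is None:
        return None
    # Convert &lt;, &gt;, &amp;, &quot; and friends back to plain characters.
    return html.unescape(found.group(1))


code = extract_source(PAGE)
print(code)
```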