Simulating a Browser in Java to Access Web Pages

 vavava 2010-07-30

Main text:

While using Java's HttpURLConnection to download web pages, I found that requests to Google's sites were rejected by Google.

        try
        {
            url = new URL(urlStr);
            httpConn = (HttpURLConnection) url.openConnection();
            HttpURLConnection.setFollowRedirects(true);

            // logger.info(httpConn.getResponseMessage());
            in = httpConn.getInputStream();
            out = new FileOutputStream(new File(outPath));

            // Copy the response body to the output file, one byte at a time.
            chByte = in.read();
            while (chByte != -1)
            {
                out.write(chByte);
                chByte = in.read();
            }
        }
        catch (MalformedURLException e)
        {
            e.printStackTrace();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }



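After some research and digging through references, I found that the code above fails because the request is missing some necessary information, namely an explicit request method and a User-Agent header. Setting these more detailed properties fixes it: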

             httpConn.setRequestMethod("GET"); 
             httpConn.setRequestProperty("User-Agent","Mozilla/4.0 (compatible; MSIE 6.0; Windows 2000)"); 
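
To see why the User-Agent header matters, the following is a minimal sketch that does not touch Google at all. It assumes a throwaway local server (built with the JDK's `com.sun.net.httpserver.HttpServer`) that rejects any request whose User-Agent does not start with "Mozilla/". This is only a stand-in for whatever filtering a real site applies, which the original post does not describe. The class name `UserAgentDemo` and the helpers `startPickyServer` and `fetchStatus` are hypothetical names chosen for this illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class UserAgentDemo
{
    // Starts a local server on a free port that answers 403 unless the
    // User-Agent header looks like a browser (starts with "Mozilla/").
    // This filtering rule is an assumption made for the demo.
    static HttpServer startPickyServer() throws IOException
    {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean browserLike = ua != null && ua.startsWith("Mozilla/");
            exchange.sendResponseHeaders(browserLike ? 200 : 403, -1); // -1: no response body
            exchange.close();
        });
        server.start();
        return server;
    }

    // Sends a GET request and returns the HTTP status code.
    // Pass null as userAgent to keep Java's default ("Java/<version>").
    static int fetchStatus(String urlStr, String userAgent) throws IOException
    {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlStr).openConnection();
        try
        {
            conn.setRequestMethod("GET");
            if (userAgent != null)
            {
                conn.setRequestProperty("User-Agent", userAgent);
            }
            return conn.getResponseCode();
        }
        finally
        {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException
    {
        HttpServer server = startPickyServer();
        try
        {
            String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
            System.out.println("default User-Agent: " + fetchStatus(url, null));
            System.out.println("browser User-Agent: "
                    + fetchStatus(url, "Mozilla/4.0 (compatible; MSIE 6.0; Windows 2000)"));
        }
        finally
        {
            server.stop(0);
        }
    }
}
```

With the demo server's rule, the first request should come back as 403 and the second as 200, which mirrors the behavior described above: by default HttpURLConnection identifies itself as "Java/<version>", and a server is free to refuse that.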

The complete code is as follows:
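    // Required imports:
    // import java.io.File;
    // import java.io.FileOutputStream;
    // import java.io.IOException;
    // import java.io.InputStream;
    // import java.net.HttpURLConnection;
    // import java.net.MalformedURLException;
    // import java.net.URL;

    public static void DownLoadPages(String urlStr, String outPath)
     {
         int chByte = 0;
         URL url = null;
         HttpURLConnection httpConn = null;
         InputStream in = null;
         FileOutputStream out = null;

         try
         {
             url = new URL(urlStr);
             httpConn = (HttpURLConnection) url.openConnection();
             HttpURLConnection.setFollowRedirects(true);
             httpConn.setRequestMethod("GET");
             // Identify ourselves as a browser so the server does not reject the request.
             httpConn.setRequestProperty("User-Agent","Mozilla/4.0 (compatible; MSIE 6.0; Windows 2000)");

             // logger.info(httpConn.getResponseMessage());
             in = httpConn.getInputStream();
             out = new FileOutputStream(new File(outPath));

             // Copy the response body to the output file, one byte at a time.
             chByte = in.read();
             while (chByte != -1)
             {
                 out.write(chByte);
                 chByte = in.read();
             }
         }
         catch (MalformedURLException e)
         {
             e.printStackTrace();
         }
         catch (IOException e)
         {
             e.printStackTrace();
         }
         finally
         {
             // Close each resource separately and guard against null: if the
             // connection failed early, out and in were never opened.
             try
             {
                 if (out != null)
                 {
                     out.close();
                 }
             }
             catch (IOException ex)
             {
                 ex.printStackTrace();
             }
             try
             {
                 if (in != null)
                 {
                     in.close();
                 }
             }
             catch (IOException ex)
             {
                 ex.printStackTrace();
             }
             if (httpConn != null)
             {
                 httpConn.disconnect();
             }
         }
     }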
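
The download loop above reads and writes a single byte per call, which is simple but can be slow on unbuffered streams. As a possible refinement (not part of the original post), here is a minimal sketch of a chunked copy. The class name `StreamCopy`, the helper `copy`, and the 8 KB buffer size are choices made for this illustration; the demo uses in-memory streams in place of the HTTP response and the output file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy
{
    // Copies everything from in to out in 8 KB chunks and
    // returns the total number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException
    {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1)
        {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException
    {
        byte[] page = "<html>hello</html>".getBytes("UTF-8");
        // try-with-resources closes both streams even if copy() throws.
        try (InputStream in = new ByteArrayInputStream(page);
             ByteArrayOutputStream out = new ByteArrayOutputStream())
        {
            long copied = copy(in, out);
            System.out.println(copied + " bytes copied");
        }
    }
}
```

In DownLoadPages, the same helper could be called as `copy(in, out)` in place of the `while` loop, with `in` being the stream from `httpConn.getInputStream()`.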

There is also a second way to access Google's sites: use Apache's HttpClient library to imitate a browser.

         Document document = null;
         HttpClient httpClient = new HttpClient();   // Apache Commons HttpClient 3.x API

         GetMethod getMethod = new GetMethod(url);
         getMethod.setFollowRedirects(true);
         try
         {
             int statusCode = httpClient.executeMethod(getMethod);

             if (statusCode == HttpStatus.SC_OK)
             {
                 InputStream in = getMethod.getResponseBodyAsStream();
                 InputSource is = new InputSource(in);

                 DOMParser domParser = new DOMParser();    // NekoHTML: parse the fetched page into a DOM
                 domParser.parse(is);
                 document = domParser.getDocument();

                 System.out.println(getMethod.getURI());
             }
         }
         finally
         {
             getMethod.releaseConnection();   // always hand the connection back
         }
         return document;

I recommend the first approach: HttpURLConnection is more lightweight and, in my experience, faster than the second approach with HttpClient.

Below is some reposted code that uses HttpURLConnection to simulate an IE form login to a website:


On simulating an IE form login to a website in Java

HttpURLConnection urlConn = (HttpURLConnection) (new URL(url).openConnection());
urlConn.addRequestProperty("Cookie", cookie);
urlConn.setRequestMethod("POST");
urlConn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows 2000)");
HttpURLConnection.setFollowRedirects(true); // static method: applies to all connections
urlConn.setDoOutput(true); // we need to write data to the server
urlConn.setDoInput(true); // we also read the server's response
urlConn.setUseCaches(false); // always get the latest content from the server
urlConn.setAllowUserInteraction(false);
urlConn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
urlConn.setRequestProperty("Content-Language", "en-US");
urlConn.setRequestProperty("Content-Length", "" + data.length()); // correct only if data is pure ASCII

DataOutputStream outStream = new DataOutputStream(urlConn.getOutputStream());
outStream.writeBytes(data); // writes only the low byte of each char, so data must be ASCII (already URL-encoded)
outStream.flush();
outStream.close();

cookie = urlConn.getHeaderField("Set-Cookie"); // yields a single value even if the server sends several Set-Cookie headers
BufferedReader br = new BufferedReader(new InputStreamReader(urlConn.getInputStream(), "gb2312"));
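
The login snippet leaves two things open: how the `data` string is built, and what to do when the server sends more than one Set-Cookie header. The following is a minimal sketch of both, under the assumption that the form fields are sent as UTF-8 URL-encoded pairs and that only the `name=value` part of each cookie needs to be echoed back. The class `FormLoginHelpers`, the helpers `encodeForm` and `toCookieHeader`, and the field names in the demo are hypothetical and not taken from the reposted code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FormLoginHelpers
{
    // Builds an application/x-www-form-urlencoded body such as "user=tom&pass=a+b%26c".
    // The result is pure ASCII, so its length in chars equals its length in bytes.
    static String encodeForm(Map<String, String> fields) throws UnsupportedEncodingException
    {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet())
        {
            if (sb.length() > 0)
            {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    // Turns a list of Set-Cookie header values into one Cookie request header,
    // keeping only the leading "name=value" of each and dropping attributes
    // such as Path or HttpOnly. Returns "" for a null or empty list.
    static String toCookieHeader(List<String> setCookieValues)
    {
        StringBuilder sb = new StringBuilder();
        if (setCookieValues == null)
        {
            return "";
        }
        for (String value : setCookieValues)
        {
            int semi = value.indexOf(';');
            String pair = (semi >= 0 ? value.substring(0, semi) : value).trim();
            if (pair.isEmpty())
            {
                continue;
            }
            if (sb.length() > 0)
            {
                sb.append("; ");
            }
            sb.append(pair);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException
    {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("user", "tom");        // hypothetical field names
        fields.put("pass", "a b&c");
        System.out.println(encodeForm(fields));

        List<String> received = Arrays.asList("JSESSIONID=abc; Path=/; HttpOnly", "lang=en");
        System.out.println(toCookieHeader(received));
    }
}
```

In the login code, the list of values could be obtained with `urlConn.getHeaderFields().get("Set-Cookie")`, which returns every Set-Cookie value the server sent (or null if there were none), and the joined string can then be passed as the Cookie header of the next request.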

 

Source of this article:

http://www./fisher/articles/86926.aspx
