Extract Links From an HTML Page Using Delphi

 quasiceo 2014-06-16

Get the HREF Attribute of the A Tags in an HTML Document

In most situations you use the TWebBrowser component to display HTML documents to the user, in effect creating your own version of the Internet Explorer web browser.

A very nice feature of a browser is that it displays link information, for example in the status bar, when the mouse hovers over a link in a document. This can also be done in Delphi; see "Get the Url of a Hyperlink when the Mouse moves Over a TWebBrowser Document".

Sometimes, however, you "only" want to extract all the links from an HTML document or URL, that is, to get the HREF attribute of every A tag.

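Here's how to extract all hyperlinks from an HTML document. The ExtractLinks procedure downloads the page with Indy's TIdHTTP, loads the source into an in-memory MSHTML document, and fills a TStrings object with the value of the HREF attribute of every A element it finds.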

Extract HyperLinks

// Classes (TStrings), Variants (VarArrayCreate) and Forms (Application)
// are needed in addition to the COM and Indy units.
uses
  Classes, Variants, Forms, ActiveX, ComObj, MSHTML, IdHTTP, IdURI;

// Extract the "href" attribute of every A tag found at "url" into "strings".
procedure ExtractLinks(const url: string; const strings: TStrings);
var
  iDoc: IHTMLDocument2;
  strHTML: string;
  v: Variant;
  x: Integer;
  links: OleVariant;
  docURL: string;
  URI: TIdURI;
  aHref: string;
  idHTTP: TIdHTTP;
begin
  strings.Clear;

  // Build the base address used to turn relative links into absolute ones.
  // Note: the scheme is hard-coded to http.
  URI := TIdURI.Create(url);
  try
    docURL := 'http://' + URI.Host;
    if URI.Path <> '/' then docURL := docURL + URI.Path;
  finally
    URI.Free;
  end;

  // Create an in-memory MSHTML document; no visible browser is required.
  iDoc := CreateComObject(CLASS_HTMLDocument) as IHTMLDocument2;
  try
    iDoc.designMode := 'on';
    while iDoc.readyState <> 'complete' do Application.ProcessMessages;

    // Download the page source with Indy.
    idHTTP := TIdHTTP.Create(nil);
    try
      strHTML := idHTTP.Get(url);
    finally
      idHTTP.Free;
    end;

    // IHTMLDocument2.write expects a SAFEARRAY of variants.
    v := VarArrayCreate([0, 0], varVariant);
    v[0] := strHTML;
    iDoc.write(PSafeArray(System.TVarData(v).VArray));
    iDoc.designMode := 'off';
    while iDoc.readyState <> 'complete' do Application.ProcessMessages;

    links := iDoc.all.tags('A');
    for x := 0 to links.Length - 1 do
    begin
      aHref := links.Item(x).href;
      // Skip anchors without an HREF; this also guards the aHref[1] access.
      if aHref = '' then Continue;
      if aHref[1] = '/' then
        aHref := docURL + aHref
      else if Pos('about:', aHref) = 1 then
        // Relative links in a document written from a string come back
        // prefixed with "about:"; replace that prefix with the base address.
        aHref := docURL + Copy(aHref, 7, Length(aHref));
      strings.Add(aHref);
    end;
  finally
    iDoc := nil;
  end;
end;
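
To try the procedure, it can be called from an event handler on a VCL form. The fragment below is only a sketch under assumptions that are not part of the original article: a form class TForm1 with a TEdit named Edit1 holding the URL, a TMemo named Memo1 to receive the links, and a TButton named Button1. It belongs in the form's unit, where SysUtils, Controls, Forms and Dialogs are already in the uses clause. Because TIdHTTP.Get raises an exception when the request fails, the call is wrapped in a try..except block.

```delphi
// Hypothetical handler: TForm1, Button1, Edit1 and Memo1 are illustrative
// names only; ExtractLinks is the procedure listed above.
procedure TForm1.Button1Click(Sender: TObject);
begin
  Screen.Cursor := crHourGlass;
  try
    // Suspend repainting while the memo is being filled.
    Memo1.Lines.BeginUpdate;
    try
      try
        ExtractLinks(Edit1.Text, Memo1.Lines);
      except
        on E: Exception do
          ShowMessage('Could not extract links: ' + E.Message);
      end;
    finally
      Memo1.Lines.EndUpdate;
    end;
  finally
    Screen.Cursor := crDefault;
  end;
end;
```

Since ExtractLinks accepts any TStrings, the same call works with a TListBox's Items or a plain TStringList instead of the memo.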
