
Parsing HTML

 素行 2007-04-19

 

 

 

Sometimes you want to read HTML, looking for information without actually displaying it on the screen. For instance, more than one author I know has written a "book ticker" program to track the hour-by-hour progress of their books in the Amazon.com bestseller list. The hardest part of this program isn't retrieving the HTML. It's reading through the HTML to find the one line that contains the book's ranking. As another example, consider a Web Whacker-style program that downloads a web site or part thereof to a local PC with all links intact. Downloading the files once you have the URLs is easy. But reading through the document to find the URLs of the linked pages is considerably more complex.

 

 

Both of these examples are parsing problems. While parsing a clearly defined language that doesn't allow syntax errors, such as Java or XML, is relatively straightforward, parsing a flexible language that attempts to recover from errors, like HTML, is extremely difficult. It's easier to write in HTML than it is to write in a strict language like XML, but it's much harder to read such a language. Ease of use for the page author has been favored at the cost of ease of development for the programmer.

 

 

Fortunately, the javax.swing.text.html and javax.swing.text.html.parser packages include classes that do most of the hard work for you. They're primarily intended for the internal use of the JEditorPane class discussed in the last section. Consequently, they can be a little tricky to get at. The constructors are often not public or are hidden inside inner classes, and the classes themselves aren't very well documented. But once you've seen a few examples, they aren't hard to use.

 

 

8.3.1 HTMLEditorKit.Parser

 

 

The main HTML parsing class is the inner class javax.swing.text.html.HTMLEditorKit.Parser:

 

 

public abstract static class HTMLEditorKit.Parser extends Object

 

 

Since this is an abstract class, the actual parsing work is performed by an instance of its concrete subclass javax.swing.text.html.parser.ParserDelegator:

 

 

public class ParserDelegator extends HTMLEditorKit.Parser

 

 

An instance of this class reads an HTML document from a Reader. It looks for five things in the document: start-tags, end-tags, empty-element tags, text, and comments. That covers all the important parts of a common HTML file. (Document type declarations and processing instructions are omitted, but they're rare and not very important in most HTML files, even when they are included.) Every time the parser sees one of these five items, it invokes the corresponding callback method in a particular instance of the javax.swing.text.html.HTMLEditorKit.ParserCallback class. To parse an HTML file, you write a subclass of HTMLEditorKit.ParserCallback that responds to text and tags as you desire. Then you pass an instance of your subclass to the HTMLEditorKit.Parser's parse( ) method, along with the Reader from which the HTML will be read:

 

 

public void parse(Reader in, HTMLEditorKit.ParserCallback callback,

 

 

 boolean ignoreCharSet) throws IOException

 

 

The third argument indicates whether the parser should ignore a character set declared in a META tag in the HTML header. You will normally pass true, in which case any such declaration is ignored. If it's false, then the parser will throw a javax.swing.text.ChangedCharSetException when a META tag in the HTML header is used to change the character set. This gives you an opportunity to switch to a different Reader that understands that character set and reparse the document (this time, setting ignoreCharSet to true since you already know the character set).

 

 

parse( ) is the only public method in the HTMLEditorKit.Parser class. All the work is handled inside the callback methods in the HTMLEditorKit.ParserCallback subclass. The parse( ) method simply reads from the Reader in until it's read the entire document. Every time it sees a tag, comment, or block of text, it invokes the corresponding callback method in the HTMLEditorKit.ParserCallback instance. If the Reader throws an IOException, that exception is passed along. Since neither the HTMLEditorKit.Parser nor the HTMLEditorKit.ParserCallback instance is specific to one reader, it can be used to parse multiple files simply by invoking parse( ) multiple times. In the JDK's ParserDelegator, parse( ) invokes the callbacks in the calling thread and returns once the whole document has been read. If you run several parses at once in different threads with the same callback, your HTMLEditorKit.ParserCallback class must be fully thread-safe.

 

 

Before you can do any of this, however, you have to get your hands on an instance of the HTMLEditorKit.Parser class, and that's harder than it should be. HTMLEditorKit.Parser is an abstract class, so it can't be instantiated directly. Its subclass, javax.swing.text.html.parser.ParserDelegator, is concrete. However, before you can use it, you have to configure it with a DTD, using the protected static methods ParserDelegator.setDefaultDTD( ) and ParserDelegator.createDTD( ):

 

 

protected static void setDefaultDTD( )

 

 

protected static DTD createDTD(DTD dtd, String name)

 

 

So to create a ParserDelegator, you first need to have an instance of javax.swing.text.html.parser.DTD. This class represents a Standard Generalized Markup Language (SGML) document type definition. The DTD class has a protected constructor and many protected methods that subclasses can use to build a DTD from scratch, but this is an API that only an SGML expert could be expected to use. The normal way DTDs are created is by reading the text form of a standard DTD published by someone like the W3C. You should be able to get a DTD for HTML by using the DTDParser class to parse the W3C's published HTML DTD. Unfortunately, the DTDParser class isn't included in the published Swing API, so you can't. Thus, you're going to need to go through the back door to create an HTMLEditorKit.Parser instance. What we'll do is use the HTMLEditorKit.getParser( ) method instead, which ultimately returns a ParserDelegator after properly initializing the DTD for HTML 3.2:

 

 

protected HTMLEditorKit.Parser getParser( )

 

 

Since this method is protected, we'll simply subclass HTMLEditorKit and override it with a public version, as Example 8-6 demonstrates.

 

 

Example 8-6. This subclass just makes the getParser( ) method public

 

 

import javax.swing.text.html.*;

 

 

public class ParserGetter extends HTMLEditorKit {

 

 

  // purely to make this method public

 

 

  public HTMLEditorKit.Parser getParser( ){

 

 

    return super.getParser( );

 

 

  }

 

 

  }
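As an alternative sketch, the JDK releases I am aware of also give ParserDelegator a public no-argument constructor that sets up the default DTD itself, so a parser can be created without the subclassing trick. The example below assumes that constructor and uses a hypothetical EventLogDemo class (not part of the original examples) to record which callback fires for each piece of a small document:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class; the name and the log format are illustrative only.
final class EventLogDemo {

  // Parses the given HTML and returns one log entry per callback invocation.
  static List<String> events(String html) {
    final List<String> log = new ArrayList<>();
    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
      @Override
      public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        log.add("start:" + tag);
      }
      @Override
      public void handleEndTag(HTML.Tag tag, int position) {
        log.add("end:" + tag);
      }
      @Override
      public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        log.add("simple:" + tag);
      }
      @Override
      public void handleText(char[] text, int position) {
        log.add("text:" + new String(text));
      }
      @Override
      public void handleComment(char[] text, int position) {
        log.add("comment:" + new String(text));
      }
    };
    try {
      // Assumption: the public ParserDelegator() constructor is available.
      new ParserDelegator().parse(new StringReader(html), callback, true);
    }
    catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
    return log;
  }

  public static void main(String[] args) {
    for (String event : events("<p>Hello<br>world</p><!-- done -->")) {
      System.out.println(event);
    }
  }
}
```

Running main( ) also shows start-tags and end-tags that the parser implies for elements the input never mentions, which is a quick way to see what your own callback will receive.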

 

 

Now that you've got a way to get a parser, you're ready to parse some documents. This is accomplished through the parse( ) method of HTMLEditorKit.Parser:

 

 

public abstract void parse(Reader input, HTMLEditorKit.ParserCallback 

 

 

 callback, boolean ignoreCharSet) throws IOException

 

 

The Reader is straightforward. Simply chain an InputStreamReader to the stream reading the HTML document, probably one returned by the openStream( ) method of java.net.URL. For the third argument, you can pass true to ignore encoding issues (this generally works only if you're pretty sure you're dealing with ASCII text) or false if you want to receive a ChangedCharSetException when the document has a META tag indicating the character set. The second argument is where the action is. You're going to write a subclass of HTMLEditorKit.ParserCallback that is notified of every start-tag, end-tag, empty-element tag, text, comment, and error that the parser encounters.
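The false case can be sketched as a small "detect the encoding first" pass. The class and method names here (CharsetSniffer, charsetFrom, detect) are hypothetical, the sketch assumes ParserDelegator's public no-argument constructor, and it assumes the exception carries a specification such as text/html; charset=UTF-8:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper; not part of the Swing API.
final class CharsetSniffer {

  // Pulls the part after '=' out of a spec like "text/html; charset=UTF-8".
  // Returns the fallback when the spec names no character set.
  static String charsetFrom(String spec, String fallback) {
    int equals = spec.indexOf('=');
    if (equals < 0) return fallback;
    String name = spec.substring(equals + 1).trim();
    return name.isEmpty() ? fallback : name;
  }

  // Parses once with ignoreCharSet = false purely to see whether a META tag
  // announces a character set. The do-nothing callback discards everything else.
  static String detect(String html, String fallback) {
    try {
      new ParserDelegator().parse(
          new StringReader(html), new HTMLEditorKit.ParserCallback(), false);
    }
    catch (ChangedCharSetException ex) {
      // keyEqualsCharSet( ) is true when the spec is the bare charset name.
      return ex.keyEqualsCharSet()
          ? ex.getCharSetSpec()
          : charsetFrom(ex.getCharSetSpec(), fallback);
    }
    catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
    return fallback;
  }
}
```

With the detected name in hand, you would open a fresh InputStreamReader using that encoding and parse again, this time passing true.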

 

 

8.3.2 HTMLEditorKit.ParserCallback

 

 

The ParserCallback class is a public inner class inside javax.swing.text.html.HTMLEditorKit:

 

 

public static class HTMLEditorKit.ParserCallback extends Object

 

 

It has a single, public noargs constructor:

 

 

public HTMLEditorKit.ParserCallback( )

 

 

However, you probably won't use this directly because the standard implementation of this class does nothing. It exists to be subclassed. It has six callback methods that do nothing. You will override these methods to respond to specific items seen in the input stream as the document is parsed:

 

 

public void handleText(char[] text, int position)

 

 

public void handleComment(char[] text, int position)

 

 

public void handleStartTag(HTML.Tag tag,

 

 

 MutableAttributeSet attributes, int position)

 

 

public void handleEndTag(HTML.Tag tag, int position)

 

 

public void handleSimpleTag(HTML.Tag tag,

 

 

 MutableAttributeSet attributes, int position)

 

 

public void handleError(String errorMessage, int position)

 

 

There's also a flush( ) method you use to perform any final cleanup. The parser invokes this method once after it's finished parsing the document:

 

 

public void flush( ) throws BadLocationException

 

 

Let's begin with a simple example. Suppose you want to write a program that strips out all the tags and comments from an HTML document and leaves only the text. You would write a subclass of HTMLEditorKit.ParserCallback that overrides the handleText( ) method to write the text on a Writer. You would leave the other methods alone. Example 8-7 demonstrates.

 

 

Example 8-7. TagStripper

 

 

import javax.swing.text.html.*;

 

 

import java.io.*;

 

 

public class TagStripper extends HTMLEditorKit.ParserCallback {

 

 

  private Writer out;

 

 

  public TagStripper(Writer out) {

 

 

    this.out = out;

 

 

  } 

 

 

  public void handleText(char[] text, int position) {

 

 

    try {

 

 

      out.write(text);

 

 

      out.flush( );

 

 

    }

 

 

    catch (IOException ex) {

 

 

      System.err.println(ex);

 

 

    }

 

 

  }

 

 

}

 

 

Now let's suppose you want to use this class to actually strip the tags from a URL. You begin by retrieving a parser using Example 8-6's ParserGetter class:

 

 

ParserGetter kit = new ParserGetter( );

 

 

HTMLEditorKit.Parser parser = kit.getParser( );

 

 

Next, construct an instance of your callback class like this:

 

 

HTMLEditorKit.ParserCallback callback

 

 

 = new TagStripper(new OutputStreamWriter(System.out));

 

 

Then you get a stream you can read the HTML document from. For example:

 

 

try {

 

 

  URL u = new URL("http://www.");

 

 

  InputStream in = new BufferedInputStream(u.openStream( ));

 

 

  InputStreamReader r = new InputStreamReader(in);

 

 

Finally, you pass the Reader and the HTMLEditorKit.ParserCallback to the HTMLEditorKit.Parser's parse( ) method, like this:

 

 

  parser.parse(r, callback, false);

 

 

}

 

 

catch (IOException ex) {

 

 

  System.err.println(ex);

 

 

}
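The fragments above can be pulled together into one self-contained sketch that works on a string instead of a URL, which makes it easy to experiment with. TextOnlyDemo and strip( ) are hypothetical names, the anonymous callback mirrors what TagStripper does, and the sketch assumes ParserDelegator's public no-argument constructor rather than ParserGetter:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class; collects text in memory instead of writing to a Writer.
final class TextOnlyDemo {

  // Returns only the text the parser reports, with tags and comments dropped.
  static String strip(String html) {
    final StringBuilder text = new StringBuilder();
    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
      @Override
      public void handleText(char[] chars, int position) {
        text.append(chars);
      }
    };
    try {
      new ParserDelegator().parse(new StringReader(html), callback, true);
    }
    catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
    return text.toString();
  }

  public static void main(String[] args) {
    System.out.println(strip("<H1> Here's   the   Title </H1><P> Here's the text </P>"));
  }
}
```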

 

 

There are a couple of details about the parsing process that are not obvious. First, consider threading. In the JDK's ParserDelegator, parse( ) invokes your callback methods in the calling thread and does not return until the whole document has been read. However, if you use the same HTMLEditorKit.ParserCallback object for parses that run concurrently in different threads, you need to make all your callback methods thread-safe.

 

 

Second, the parser actually skips some of the data in the input. In particular, it normalizes and strips whitespace. If the input document contains seven spaces in a row, the parser will convert that to a single space. Carriage returns, linefeeds, and tabs are all converted to a single space, so you lose line breaks. Furthermore, most text elements are stripped of all leading and trailing whitespace. Elements that contain nothing but space are eliminated completely. Thus, suppose the input document contains this content:

 

 

<H1> Here's   the   Title </H1>

 

 

<P> Here's the text </P>

 

 

What actually comes out of the tag stripper is:

 

 

Here's the TitleHere's the text

 

 

The single exception is the PRE element, which maintains all whitespace in its contents unedited. Short of implementing your own parser, I don't know of any way to retain all the stripped space. But you can include the minimum necessary line breaks and whitespace by looking at the tags as well as the text. Generally, you expect a single break in HTML when you see one of these tags:

 

 

<BR>

 

 

<LI>

 

 

<TR>

 

 

You expect a double break (paragraph break) when you see one of these tags:

 

 

<P>

 

 

</H1> </H2> </H3> </H4> </H5> </H6>

 

 

<HR>

 

 

<DIV>

 

 

</UL> </OL> </DL>

 

 

To include line breaks in the output, you have to look at each tag as it's processed and determine whether it falls in one of these sets. This is straightforward because the first argument passed to each of the tag callback methods is an HTML.Tag object.

 

 

8.3.3 HTML.Tag

 

 

Tag is a public inner class in the javax.swing.text.html.HTML class.

 

 

public static class HTML.Tag extends Object

 

 

It has these four methods:

 

 

public boolean isBlock( )

 

 

public boolean breaksFlow( )

 

 

public boolean isPreformatted( )

 

 

public String  toString( )

 

 

The breaksFlow( ) method returns true if the tag should cause a single line break. The isBlock() method returns true if the tag should cause a double line break. The isPreformatted() method returns true if the tag indicates that whitespace should be preserved. This makes it easy to provide the necessary breaks in the output.
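To get a feel for these flags, it can help to print them for a few well-known tags. The sketch below uses a hypothetical TagFlagsDemo class; only the HTML.Tag constants and the three query methods come from Swing:

```java
import javax.swing.text.html.HTML;

// Hypothetical demo class for inspecting the flags on predefined tags.
final class TagFlagsDemo {

  // Formats the three boolean properties of a tag on one line.
  static String describe(HTML.Tag tag) {
    return tag + " block=" + tag.isBlock()
        + " breaksFlow=" + tag.breaksFlow()
        + " preformatted=" + tag.isPreformatted();
  }

  public static void main(String[] args) {
    HTML.Tag[] samples = {HTML.Tag.P, HTML.Tag.BR, HTML.Tag.B, HTML.Tag.PRE};
    for (HTML.Tag tag : samples) {
      System.out.println(describe(tag));
    }
  }
}
```

Note that a block tag such as P reports true for both isBlock( ) and breaksFlow( ), which is why code that distinguishes double breaks from single breaks should test isBlock( ) first.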

 

 

Chances are you'll see more tags than you'd expect when you parse a file. The parser inserts missing closing tags. In other words, if a document contains only a <P> tag, then the parser will report both the <P> start-tag and the implied </P> end-tag at the appropriate points in the document. Example 8-8 is a program that does the best job yet of converting HTML to pure text. It looks for the empty and end-tags, explicit or implied, and, if the tag indicates that line breaks are called for, inserts the necessary number of line breaks.

 

 

Example 8-8. LineBreakingTagStripper

 

 

import javax.swing.text.*;

 

 

import javax.swing.text.html.*;

 

 

import javax.swing.text.html.parser.*;

 

 

import java.io.*;

 

 

import java.net.*;

 

 

public class LineBreakingTagStripper

 

 

 extends HTMLEditorKit.ParserCallback {

 

 

  private Writer out;

 

 

  private String lineSeparator;

 

 

  public LineBreakingTagStripper(Writer out) {

 

 

    this(out, System.getProperty("line.separator", "\r\n"));

 

 

  } 

 

 

  public LineBreakingTagStripper(Writer out, String lineSeparator) {

 

 

    this.out = out;

 

 

    this.lineSeparator = lineSeparator;

 

 

  } 

 

 

  public void handleText(char[] text, int position) {

 

 

    try {

 

 

      out.write(text);

 

 

      out.flush( );

 

 

    }

 

 

    catch (IOException ex) {

 

 

      System.err.println(ex);

 

 

    }

 

 

  }

 

 

  public void handleEndTag(HTML.Tag tag, int position) {

 

 

    try {

 

 

      if (tag.isBlock( )) {

 

 

        out.write(lineSeparator);

 

 

        out.write(lineSeparator);

 

 

      }

 

 

      else if (tag.breaksFlow( )) {

 

 

        out.write(lineSeparator);

 

 

      }

 

 

    }

 

 

    catch (IOException ex) {

 

 

      System.err.println(ex);

 

 

    }

 

 

   

 

 

  }

 

 

  public void handleSimpleTag(HTML.Tag tag,

 

 

   MutableAttributeSet attributes, int position) {

 

 

    try {

 

 

      if (tag.isBlock( )) {

 

 

        out.write(lineSeparator);

 

 

        out.write(lineSeparator);

 

 

      }

 

 

      else if (tag.breaksFlow( )) {

 

 

        out.write(lineSeparator);

 

 

      }

 

 

      else {

 

 

        out.write(' ');

 

 

      }

 

 

    }

 

 

    catch (IOException ex) {

 

 

      System.err.println(ex);

 

 

    }

 

 

  }

 

 

}

 

 

Most of the time, of course, you want to know considerably more than whether a tag breaks a line. You want to know what tag it is, and behave accordingly. For instance, if you were writing a full-blown HTML-to-TeX or HTML-to-RTF converter, you'd want to handle each tag differently. You test the type of tag by comparing it against these 73 mnemonic constants from the HTML.Tag class:

 

 

HTML.Tag.A            HTML.Tag.FRAMESET   HTML.Tag.PARAM
HTML.Tag.ADDRESS      HTML.Tag.H1         HTML.Tag.PRE
HTML.Tag.APPLET       HTML.Tag.H2         HTML.Tag.SAMP
HTML.Tag.AREA         HTML.Tag.H3         HTML.Tag.SCRIPT
HTML.Tag.B            HTML.Tag.H4         HTML.Tag.SELECT
HTML.Tag.BASE         HTML.Tag.H5         HTML.Tag.SMALL
HTML.Tag.BASEFONT     HTML.Tag.H6         HTML.Tag.STRIKE
HTML.Tag.BIG          HTML.Tag.HEAD       HTML.Tag.S
HTML.Tag.BLOCKQUOTE   HTML.Tag.HR         HTML.Tag.STRONG
HTML.Tag.BODY         HTML.Tag.HTML       HTML.Tag.STYLE
HTML.Tag.BR           HTML.Tag.I          HTML.Tag.SUB
HTML.Tag.CAPTION      HTML.Tag.IMG        HTML.Tag.SUP
HTML.Tag.CENTER       HTML.Tag.INPUT      HTML.Tag.TABLE
HTML.Tag.CITE         HTML.Tag.ISINDEX    HTML.Tag.TD
HTML.Tag.CODE         HTML.Tag.KBD        HTML.Tag.TEXTAREA
HTML.Tag.DD           HTML.Tag.LI         HTML.Tag.TH
HTML.Tag.DFN          HTML.Tag.LINK       HTML.Tag.TR
HTML.Tag.DIR          HTML.Tag.MAP        HTML.Tag.TT
HTML.Tag.DIV          HTML.Tag.MENU       HTML.Tag.U
HTML.Tag.DL           HTML.Tag.META       HTML.Tag.UL
HTML.Tag.DT           HTML.Tag.NOFRAMES   HTML.Tag.VAR
HTML.Tag.EM           HTML.Tag.OBJECT     HTML.Tag.IMPLIED
HTML.Tag.FONT         HTML.Tag.OL         HTML.Tag.COMMENT
HTML.Tag.FORM         HTML.Tag.OPTION
HTML.Tag.FRAME        HTML.Tag.P

 

 

 

 

 

These are not int constants. They are object constants to allow compile-time type checking. You saw this trick once before in the javax.swing.event.HyperlinkEvent class. All HTML.Tag elements passed to your callback methods by the HTMLEditorKit.Parser for tags it recognizes will be one of these 73 constants. They are not just the same as these 73 objects; they are these 73 objects. There are exactly 73 objects in this class; no more, no less. You can test against them with == rather than equals( ). (A tag the parser does not recognize is reported with an instance of the subclass HTML.UnknownTag instead, which never compares == to any of the constants.)
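A minimal sketch of the identity comparison follows. HeadingLevels and headingLevel( ) are hypothetical names; HTML.getTag( ) is the static lookup method in javax.swing.text.html.HTML that returns the shared constant for a lowercase tag name:

```java
import javax.swing.text.html.HTML;

// Hypothetical helper that maps heading tags to outline levels using ==.
final class HeadingLevels {

  // Returns 1 through 6 for H1 through H6, and 0 for any other tag.
  static int headingLevel(HTML.Tag tag) {
    if (tag == HTML.Tag.H1) return 1;
    if (tag == HTML.Tag.H2) return 2;
    if (tag == HTML.Tag.H3) return 3;
    if (tag == HTML.Tag.H4) return 4;
    if (tag == HTML.Tag.H5) return 5;
    if (tag == HTML.Tag.H6) return 6;
    return 0;
  }

  public static void main(String[] args) {
    // The lookup hands back the very same object as the constant.
    System.out.println(HTML.getTag("h2") == HTML.Tag.H2);
    System.out.println(headingLevel(HTML.Tag.H3));
    System.out.println(headingLevel(HTML.Tag.P));
  }
}
```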

 

 

For example, let's suppose you need a program that outlines HTML pages by extracting their H1 through H6 headings while ignoring the rest of the document. It organizes the outline as nested lists in which each H1 heading is at the top level, each H2 heading is one level deep, and so on. You would write an HTMLEditorKit.ParserCallback subclass that extracted the contents of all H1, H2, H3, H4, H5, and H6 elements while ignoring all others, as Example 8-9 demonstrates.

 

 

Example 8-9. Outliner

 

 

import javax.swing.text.*;

 

 

import javax.swing.text.html.*;

 

 

import javax.swing.text.html.parser.*;

 

 

import java.io.*;

 

 

import java.net.*;

 

 

import java.util.*;

 

 

public class Outliner extends HTMLEditorKit.ParserCallback {

 

 

  private Writer out;

 

 

  private int level = 0;

 

 

  private boolean inHeader=false;

 

 

  private static String lineSeparator=System.getProperty("line.separator", "\r\n");

 

 

  public Outliner(Writer out) {

 

 

    this.out = out;

 

 

  }

 

 

  public void handleStartTag(HTML.Tag tag,

 

 

   MutableAttributeSet attributes, int position) {

 

 

    int newLevel = 0;

 

 

    if (tag == HTML.Tag.H1) newLevel = 1;

 

 

    else if (tag == HTML.Tag.H2) newLevel = 2;

 

 

    else if (tag == HTML.Tag.H3) newLevel = 3;

 

 

    else if (tag == HTML.Tag.H4) newLevel = 4;

 

 

    else if (tag == HTML.Tag.H5) newLevel = 5;

 

 

    else if (tag == HTML.Tag.H6) newLevel = 6;

 

 

    else return;

 

 

    this.inHeader = true;

 

 

    try {

 

 

      if (newLevel > this.level) {

 

 

        for (int i =0; i < newLevel-this.level; i++) {

 

 

          out.write("<ul>" + lineSeparator + "<li>");

 

 

        }

 

 

      }

 

 

      else if (newLevel < this.level) {

 

 

        for (int i =0; i < this.level-newLevel; i++) {

 

 

          out.write(lineSeparator + "</ul>" + lineSeparator);

 

 

        }

 

 

        out.write(lineSeparator + "<li>");

 

 

      }

 

 

      else {

 

 

        out.write(lineSeparator + "<li>");

 

 

      }

 

 

      this.level = newLevel;

 

 

      out.flush( );

 

 

    }

 

 

    catch (IOException ex) {

 

 

      System.err.println(ex);

 

 

    }

 

 

  }

 

 

  public void handleEndTag(HTML.Tag tag, int position) {

 

 

    if (tag == HTML.Tag.H1 || tag == HTML.Tag.H2||tag == HTML.Tag.H3 || tag == HTML.Tag.H4|| tag == HTML.Tag.H5 || tag == HTML.Tag.H6) {

 

 

      inHeader = false;

 

 

    }

 

 

    // work around bug in the parser that fails to call flush

 

 

    if (tag == HTML.Tag.HTML) this.flush( );

 

 

  }

 

 

  public void handleText(char[] text, int position) {

 

 

    if (inHeader) {

 

 

      try {

 

 

        out.write(text);

 

 

        out.flush( );

 

 

      }

 

 

      catch (IOException ex) {

 

 

        System.err.println(ex);

 

 

      }

 

 

    }

 

 

  }

 

 

  public void flush( ) {

 

 

    try {

 

 

      while (this.level-- > 0) {

 

 

        out.write(lineSeparator + "</ul>");  

 

 

      }

 

 

      out.flush( );

 

 

    }

 

 

    catch (IOException e) {

 

 

      System.err.println(e);

 

 

    }

 

 

  }

 

 

  private static void parse(URL url, String encoding) throws IOException {

 

 

      ParserGetter kit = new ParserGetter( );

 

 

      HTMLEditorKit.Parser parser = kit.getParser( );

 

 

      InputStream in = url.openStream( );

 

 

      InputStreamReader r = new InputStreamReader(in, encoding);

 

 

      HTMLEditorKit.ParserCallback callback = new Outliner

 

 

       (new OutputStreamWriter(System.out));

 

 

      parser.parse(r, callback, true);

 

 

  }

 

 

  public static void main(String[] args) {

 

 

    ParserGetter kit = new ParserGetter( );

 

 

    HTMLEditorKit.Parser parser = kit.getParser( );   

 

 

    String encoding = "ISO-8859-1";

 

 

    URL url = null;

 

 

    try {

 

 

      url = new URL(args[0]);

 

 

      InputStream in = url.openStream( );

 

 

      InputStreamReader r = new InputStreamReader(in, encoding);

 

 

      // parse once just to detect the encoding

 

 

      HTMLEditorKit.ParserCallback doNothing

 

 

       = new HTMLEditorKit.ParserCallback( );

 

 

      parser.parse(r, doNothing, false);

 

 

    }

 

 

    catch (MalformedURLException ex) {

 

 

      System.out.println("Usage: java Outliner url");

 

 

      return;

 

 

    }

 

 

    catch (ChangedCharSetException ex) {

 

 

      String mimeType = ex.getCharSetSpec( );

 

 

      encoding = mimeType.substring(mimeType.indexOf("=") + 1).trim( );

 

 

    }

 

 

    catch (IOException ex) {

 

 

      System.err.println(ex);

 

 

    }

 

 

    catch (ArrayIndexOutOfBoundsException ex) {

 

 

      System.out.println("Usage: java Outliner url");

 

 

      return;

 

 

    }

 

 

    try {

 

 

      parse(url, encoding);

 

 

    }

 

 

    catch(IOException ex) {

 

 

      System.err.println(ex);

 

 

    }

 

 

  }

 

 

}

 

 

When a heading start-tag is encountered by the handleStartTag( ) method, the necessary number of <ul>, </ul>, and <li> tags are emitted. Furthermore, the inHeader flag is set to true so that the handleText( ) method will know to output the contents of the heading. All start-tags except the six levels of headers are simply ignored. The handleEndTag( ) method likewise considers heading tags only by comparing the tag it receives with the seven tags it's interested in. If it sees a heading tag, it sets the inHeader flag to false again so that body text won't be emitted by the handleText( ) method. If it sees the end of the document via an </html> tag, it flushes out the document. Otherwise, it does nothing. The end result is a nicely formatted group of nested, unordered lists that outlines the document. For example, here's the output of running it against http://www.:

 

 

% java Outliner http://www./

 

 

<ul>

 

 

<li> Cafe con Leche XML News and Resources<ul>

 

 

<li>Quote of the Day

 

 

<li>Today's News

 

 

<li>Recommended Reading

 

 

<li>Recent News<ul>

 

 

<li>XML Overview

 

 

<li>Tutorials

 

 

<li>Projects

 

 

<li>Seminar Notes

 

 

<li>Random Notes

 

 

<li>Specifications

 

 

<li>Books

 

 

<li>XML Resources

 

 

<li>Development Tools<ul>

 

 

<li>Validating Parsers

 

 

<li>Non-validating Parsers

 

 

<li>Online Validators and Syntax Checkers

 

 

<li>Formatting Engines

 

 

<li>Browsers

 

 

<li>Class Libraries

 

 

<li>Editors

 

 

<li>XML Applications

 

 

<li>External Sites

 

 

</ul>

 

 

</ul>

 

 

</ul>

 

 

</ul>

 

 

8.3.4 Attributes

 

 

When processing an HTML file, you often need to look at the attributes as well as the tags. The second argument to the handleStartTag( ) and handleSimpleTag( ) callback methods is an instance of the javax.swing.text.MutableAttributeSet class. This object allows you to see what attributes are attached to a particular tag. MutableAttributeSet is a subinterface of the javax.swing.text.AttributeSet interface:

 

 

public abstract interface MutableAttributeSet extends AttributeSet

 

 

Both AttributeSet and MutableAttributeSet represent a collection of attributes on an HTML tag. The difference is that AttributeSet declares only methods to inspect the attributes in the set, while the MutableAttributeSet interface also declares methods to add attributes to and remove attributes from the set. The attributes themselves are represented as pairs of java.lang.Object objects, one for the name of the attribute and one for the value. The AttributeSet interface declares these methods:

 

 

public int          getAttributeCount( )

 

 

public boolean      isDefined(Object name)

 

 

public boolean      containsAttribute(Object name, Object value)

 

 

public boolean      containsAttributes(AttributeSet attributes)

 

 

public boolean      isEqual(AttributeSet attributes)

 

 

public AttributeSet copyAttributes( )

 

 

public Enumeration  getAttributeNames( )

 

 

public Object       getAttribute(Object name)

 

 

public AttributeSet getResolveParent( )

 

 

Most of these methods are self-explanatory. The getAttributeCount( ) method returns the number of attributes in the set. The isDefined( ) method returns true if an attribute with the specified name is in the set, false otherwise. The containsAttribute(Object name, Object value) method returns true if an attribute with the given name and value is in the set. The containsAttributes(AttributeSet attributes) method returns true if all the attributes in the specified set are in this set with the same values; in other words, if the argument is a subset of the set on which this method is invoked. The isEqual() method returns true if the invoking AttributeSet is the same as the argument. The copyAttributes( ) method returns a clone of the current AttributeSet. The getAttributeNames( ) method returns a java.util.Enumeration of all the names of the attributes in the set. Once you know the name of one of the elements of the set, the getAttribute( ) method returns its value. Finally, the getResolveParent( ) method returns the parent AttributeSet, which will be searched for attributes that are not found in the current set. For example, given an AttributeSet, this method prints the attributes in name=value format:

 

 

private void listAttributes(AttributeSet attributes) {

 

 

  Enumeration e = attributes.getAttributeNames( );

 

 

  while (e.hasMoreElements( )) {

 

 

    Object name = e.nextElement( );

 

 

    Object value = attributes.getAttribute(name);

 

 

    System.out.println(name + "=" + value);

 

 

  }

 

 

}
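The query methods can be tried out without parsing anything by building a set by hand. This sketch uses javax.swing.text.SimpleAttributeSet, a standard implementation of MutableAttributeSet; the AttributeQueries class, its method names, and the sample values are hypothetical:

```java
import javax.swing.text.AttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;

// Hypothetical demo class exercising the read-only AttributeSet methods.
final class AttributeQueries {

  // Builds the attributes of something like <a href="index.html" target="_top">.
  static SimpleAttributeSet sampleLink() {
    SimpleAttributeSet attributes = new SimpleAttributeSet();
    attributes.addAttribute(HTML.Attribute.HREF, "index.html");
    attributes.addAttribute(HTML.Attribute.TARGET, "_top");
    return attributes;
  }

  // Returns the attribute value as a String, or null if the attribute is absent.
  static String valueOf(AttributeSet attributes, HTML.Attribute name) {
    Object value = attributes.getAttribute(name);
    return value == null ? null : value.toString();
  }

  public static void main(String[] args) {
    AttributeSet link = sampleLink();
    System.out.println(link.getAttributeCount());
    System.out.println(link.isDefined(HTML.Attribute.HREF));
    System.out.println(link.containsAttribute(HTML.Attribute.HREF, "index.html"));
    System.out.println(valueOf(link, HTML.Attribute.SRC));
  }
}
```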

 

 

Although the argument and return types of these methods are mostly declared in terms of java.lang.Object, in practice, all values are instances of java.lang.String, while all names are instances of the public inner class javax.swing.text.html.HTML.Attribute. Just as the HTML.Tag class predefines 73 HTML tags and uses a private constructor to prevent the creation of others, so too does the HTML.Attribute class predefine 80 standard HTML attributes (HTML.Attribute.ACTION, HTML.Attribute.ALIGN, HTML.Attribute.ALINK, HTML.Attribute.ALT, etc.) and prohibit the construction of others via a nonpublic constructor. Generally, this isn't an issue, since you mostly use getAttribute( ), containsAttribute( ), and so forth only with names returned by getAttributeNames( ). The 80 predefined attributes are:

 

 

HTML.Attribute.ACTION        HTML.Attribute.DUMMY          HTML.Attribute.PROMPT
HTML.Attribute.ALIGN         HTML.Attribute.ENCTYPE        HTML.Attribute.REL
HTML.Attribute.ALINK         HTML.Attribute.ENDTAG         HTML.Attribute.REV
HTML.Attribute.ALT           HTML.Attribute.FACE           HTML.Attribute.ROWS
HTML.Attribute.ARCHIVE       HTML.Attribute.FRAMEBORDER    HTML.Attribute.ROWSPAN
HTML.Attribute.BACKGROUND    HTML.Attribute.HALIGN         HTML.Attribute.SCROLLING
HTML.Attribute.BGCOLOR       HTML.Attribute.HEIGHT         HTML.Attribute.SELECTED
HTML.Attribute.BORDER        HTML.Attribute.HREF           HTML.Attribute.SHAPE
HTML.Attribute.CELLPADDING   HTML.Attribute.HSPACE         HTML.Attribute.SHAPES
HTML.Attribute.CELLSPACING   HTML.Attribute.HTTPEQUIV      HTML.Attribute.SIZE
HTML.Attribute.CHECKED       HTML.Attribute.ID             HTML.Attribute.SRC
HTML.Attribute.CLASS         HTML.Attribute.ISMAP          HTML.Attribute.STANDBY
HTML.Attribute.CLASSID       HTML.Attribute.LANG           HTML.Attribute.START
HTML.Attribute.CLEAR         HTML.Attribute.LANGUAGE       HTML.Attribute.STYLE
HTML.Attribute.CODE          HTML.Attribute.LINK           HTML.Attribute.TARGET
HTML.Attribute.CODEBASE      HTML.Attribute.LOWSRC         HTML.Attribute.TEXT
HTML.Attribute.CODETYPE      HTML.Attribute.MARGINHEIGHT   HTML.Attribute.TITLE
HTML.Attribute.COLOR         HTML.Attribute.MARGINWIDTH    HTML.Attribute.TYPE
HTML.Attribute.COLS          HTML.Attribute.MAXLENGTH      HTML.Attribute.USEMAP
HTML.Attribute.COLSPAN       HTML.Attribute.METHOD         HTML.Attribute.VALIGN
HTML.Attribute.COMMENT       HTML.Attribute.MULTIPLE       HTML.Attribute.VALUE
HTML.Attribute.COMPACT       HTML.Attribute.N              HTML.Attribute.VALUETYPE
HTML.Attribute.CONTENT       HTML.Attribute.NAME           HTML.Attribute.VERSION
HTML.Attribute.COORDS        HTML.Attribute.NOHREF         HTML.Attribute.VLINK
HTML.Attribute.DATA          HTML.Attribute.NORESIZE       HTML.Attribute.VSPACE
HTML.Attribute.DECLARE       HTML.Attribute.NOSHADE        HTML.Attribute.WIDTH
HTML.Attribute.DIR           HTML.Attribute.NOWRAP

 

 

 

 

 

The MutableAttributeSet interface adds six methods to add attributes to and remove attributes from the set:

 

 

public void addAttribute(Object name, Object value)

 

 

public void addAttributes(AttributeSet attributes)

 

 

public void removeAttribute(Object name)

 

 

public void removeAttributes(Enumeration names)

 

 

public void removeAttributes(AttributeSet attributes)

 

 

public void setResolveParent(AttributeSet parent)

 

 

Again, the values are strings and the names are HTML.Attribute objects.
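As a sketch of how the mutating methods might be used, the helper below replaces a relative link value with an absolute one. LinkRewriter and absolutize( ) are hypothetical names, and resolving with java.net.URI is my own choice for the illustration rather than anything the Swing API requires:

```java
import java.net.URI;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;

// Hypothetical helper that edits an attribute set in place.
final class LinkRewriter {

  // If the named attribute is present, resolves its value against the base
  // URI and stores the absolute form back into the set. Absent attributes
  // leave the set untouched.
  static void absolutize(MutableAttributeSet attributes, HTML.Attribute name, URI base) {
    Object value = attributes.getAttribute(name);
    if (value == null) return;
    URI resolved = base.resolve(value.toString());
    attributes.removeAttribute(name);
    attributes.addAttribute(name, resolved.toString());
  }

  public static void main(String[] args) {
    SimpleAttributeSet attributes = new SimpleAttributeSet();
    attributes.addAttribute(HTML.Attribute.HREF, "images/logo.gif");
    absolutize(attributes, HTML.Attribute.HREF,
        URI.create("http://www.example.com/docs/index.html"));
    System.out.println(attributes.getAttribute(HTML.Attribute.HREF));
  }
}
```

The explicit removeAttribute( ) call is there only to show the method; with SimpleAttributeSet, addAttribute( ) alone overwrites an existing value for the same name.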

 

 

One possible use for all these methods is to modify documents before saving or displaying them. For example, most web browsers let you save a page on your hard drive as either HTML or text. However, both these formats lose track of images and relative links. The problem is that most pages are full of relative URLs, and these all break when you move the page to your local machine. Example 8-10 is an application called PageSaver that downloads a web page to a local hard drive while keeping all links intact by rewriting all relative URLs as absolute URLs.

 

 

The PageSaver class reads a series of URLs from the command line. It opens each one in turn and parses it. Every tag, text block, comment, and attribute is copied into a local file. However, all link attributes, such as SRC, LOWSRC, CODEBASE, and HREF, are remapped to an absolute URL. Note particularly the extensive use to which the URL and javax.swing.text classes were put; PageSaver could be rewritten with string replacements, but that would be considerably more complicated.

 

 

Example 8-10. PageSaver

 

 

import javax.swing.text.*;

 

 

import javax.swing.text.html.*;

 

 

import javax.swing.text.html.parser.*;

 

 

import java.io.*;

 

 

import java.net.*;

 

 

import java.util.*;

 

 

public class PageSaver extends HTMLEditorKit.ParserCallback {

 

 

  private Writer out;

 

 

  private URL base;

 

 

  public PageSaver(Writer out, URL base) {

 

 

    this.out = out;

 

 

    this.base = base;

 

 

  }

 

 

  public void handleStartTag(HTML.Tag tag,

 

 

   MutableAttributeSet attributes, int position) {

 

 

    try { 

 

 

      out.write("<" + tag);

 

 

      this.writeAttributes(attributes);

 

 

      // for the <APPLET> tag we may have to add a codebase attribute

 

 

      if (tag == HTML.Tag.APPLET

 

 

       && attributes.getAttribute(HTML.Attribute.CODEBASE) == null) {

 

 

        String codebase = base.toString( );

 

 

        if (codebase.endsWith(".htm") || codebase.endsWith(".html")) {

 

 

          codebase = codebase.substring(0, codebase.lastIndexOf('/'));

 

 

        }

 

 

        out.write(" codebase=\"" + codebase + "\"");

 

 

      }

 

 

      out.write(">");

 

 

      out.flush( );

 

 

    }

 

 

    catch (IOException ex) {

 

 

      System.err.println(ex);

 

 

      ex.printStackTrace( );

 

 

    }

 

 

  }

 

 

  public void handleEndTag(HTML.Tag tag, int position) {

 

 

    try {   

 

 

      out.write("</" + tag + ">");

 

 

      out.flush( );

 

 

    }

 

 

    catch (IOException ex) {

 

 

      System.err.println(ex);

 

 

    }

 

 

  }

 

 

  private void writeAttributes(AttributeSet attributes)

 

 

   throws IOException {

 

 

    Enumeration e = attributes.getAttributeNames( );

 

 

    while (e.hasMoreElements( )) {

 

 

      Object name = e.nextElement( );

 

 

      String value = (String) attributes.getAttribute(name);

 

 

      try {

 

 

        if (name == HTML.Attribute.HREF || name == HTML.Attribute.SRC

 

 

         || name == HTML.Attribute.LOWSRC

 

 

         || name == HTML.Attribute.CODEBASE ) {

 

 

          URL u = new URL(base, value);

 

 

          out.write(" " + name + "=\"" + u + "\"");             

 

 

        }

 

 

        else {

 

 

          out.write(" " + name + "=\"" + value + "\"");

 

 

        }

 

 

      }

 

 

      catch (MalformedURLException ex) {

 

 

        System.err.println(ex);

 

 

        System.err.println(base);

 

 

        System.err.println(value);

 

 

        ex.printStackTrace( );

 

 

      }

 

 

    }

 

 

  }

 

 

  public void handleComment(char[] text, int position) {

 

 

    try {

 

 

      out.write("<!-- ");

 

 

      out.write(text);

 

 

      out.write(" -->");

 

 

      out.flush( );

 

 

    }

 

 

    catch (IOException ex) {

 

 

      System.err.println(ex);

 

 

    }

 

 

   }

 

 

  public void handleText(char[] text, int position) {

 

 

    try {

 

 

      out.write(text);

 

 

      out.flush( );

 

 

    }

 

 

    catch (IOException ex) {

 

 

      System.err.println(ex);

 

 

      ex.printStackTrace( );

 

 

    }

 

 

  }

 

 

  public void handleSimpleTag(HTML.Tag tag,

 

 

   MutableAttributeSet attributes, int position) {

 

 

    try {

 

 

      out.write("<" + tag);

 

 

      this.writeAttributes(attributes);

 

 

      out.write(">");

 

 

      out.flush( );

 

 

    }

 

 

    catch (IOException ex) {

 

 

      System.err.println(ex);

 

 

      ex.printStackTrace( );

 

 

    }

 

 

  }

 

 

  public static void main(String[] args) {

 

 

    for (int i = 0; i < args.length; i++) {

 

 

      ParserGetter kit = new ParserGetter( );

 

 

      HTMLEditorKit.Parser parser = kit.getParser( );

 

 

      try {

 

 

        URL u = new URL(args[i]);

 

 

        InputStream in = u.openStream( );

 

 

        InputStreamReader r = new InputStreamReader(in);

 

 

        String remoteFileName = u.getFile( );

 

 

        if (remoteFileName.endsWith("/")) {

 

 

          remoteFileName += "index.html";

 

 

        }

 

 

        if (remoteFileName.startsWith("/")) {

 

 

          remoteFileName = remoteFileName.substring(1);

 

 

        }

 

 

        File localDirectory = new File(u.getHost( ));

 

 

        while (remoteFileName.indexOf('/') > -1) {

 

 

          String part = remoteFileName.substring(0, remoteFileName.indexOf('/'));

 

 

          remoteFileName = remoteFileName.substring(remoteFileName.indexOf('/') + 1);

 

 

          localDirectory = new File(localDirectory, part);

 

 

        }

 

 

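        // mkdirs( ) returns false if the directory already exists, so check that first
        if (localDirectory.isDirectory( ) || localDirectory.mkdirs( )) {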

 

 

          File output = new File(localDirectory, remoteFileName);

 

 

          FileWriter out = new FileWriter(output);

 

 

          HTMLEditorKit.ParserCallback callback = new PageSaver(out, u);

 

 

          parser.parse(r, callback, false);

 

 

        }

 

 

      }

 

 

      catch (IOException ex) {

 

 

        System.err.println(ex);

 

 

        ex.printStackTrace( );

 

 

      }

 

 

    }

 

 

  }

 

 

}

 

 

The handleEndTag( ), handleText(), and handleComment( ) methods simply copy their content from the input into the output. The handleStartTag( ) and handleSimpleTag( ) methods write their respective tags onto the output but also invoke the private writeAttributes( ) method. This method loops through the attributes in the set and mostly just copies them onto the output. However, for a few select attributes, such as SRC and HREF, which typically have URL values, it rewrites the values as absolute URLs. Finally, the main( ) method reads URLs from the command line, calculates reasonable names and directories for corresponding local files, and starts a new PageSaver for each URL.
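
The URL rewriting in writeAttributes( ) rests on the two-argument URL constructor, which resolves a possibly relative link against a base URL. The following sketch isolates that step. The class name ResolveDemo, the resolve( ) helper, and the example.com addresses are invented for illustration, and returning the link unchanged when it cannot be parsed is an assumption of this sketch rather than something PageSaver does. Recent JDKs mark the URL constructors as deprecated, though they still work.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ResolveDemo {

  // Resolves a possibly relative link against the URL of the page
  // it was found in. If either string cannot be parsed as a URL,
  // the link is returned unchanged (a choice made for this sketch).
  public static String resolve(String base, String link) {
    try {
      URL context = new URL(base);
      return new URL(context, link).toString();
    }
    catch (MalformedURLException ex) {
      return link;
    }
  }

  public static void main(String[] args) {
    String base = "http://www.example.com/books/index.html";
    // A relative link resolves against the directory of the base page.
    System.out.println(resolve(base, "images/cover.gif"));
    // A link starting with / resolves against the host.
    System.out.println(resolve(base, "/style.css"));
    // An absolute link ignores the base entirely.
    System.out.println(resolve(base, "http://www.example.org/"));
  }
}
```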

 

 

In the first edition of this book, I included a similar program that downloaded the raw HTML using the URL class and parsed it manually. That program was about a third longer than this one and much less robust. For instance, it did not support frames or the LOWSRC attributes of IMG tags. It went to great effort to handle both quoted and unquoted attribute values and still didn't recognize attribute values enclosed in single quotes. By contrast, this program needs only one extra line of code to support each additional attribute. It is much more robust, much easier to understand (since there's not a lot of detailed string manipulation), and much easier to extend.
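
One way to make that extensibility explicit is to keep the URL-valued attributes in a set, so that supporting another attribute means adding one entry. This is an alternative arrangement, not what Example 8-10 does. The class name LinkAttributes and the isLink( ) helper are hypothetical, and including BACKGROUND and ACTION in the set is an assumption about which additional attributes you might want rewritten.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.swing.text.html.HTML;

public class LinkAttributes {

  // Attributes whose values are treated as URLs. The first four are
  // the ones PageSaver rewrites; the last two are illustrative extras.
  private static final Set<HTML.Attribute> URL_VALUED =
      new HashSet<>(Arrays.asList(
          HTML.Attribute.HREF,
          HTML.Attribute.SRC,
          HTML.Attribute.LOWSRC,
          HTML.Attribute.CODEBASE,
          HTML.Attribute.BACKGROUND,
          HTML.Attribute.ACTION));

  // Takes an Object because getAttributeNames( ) enumerates Objects.
  public static boolean isLink(Object name) {
    return URL_VALUED.contains(name);
  }

  public static void main(String[] args) {
    System.out.println(isLink(HTML.Attribute.HREF));
    System.out.println(isLink(HTML.Attribute.WIDTH));
  }
}
```

With a helper like this, the chain of == comparisons in writeAttributes( ) could be replaced by a single isLink(name) test.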

 

 

This is just one example of the various HTML filters that the javax.swing.text.html package makes easy to write. You could, for example, write a filter that pretty-prints the HTML by indenting the different levels of tags. You could write a program to convert HTML to TeX, XML, RTF, or many other formats. You could write a program that spiders a web site, downloading all linked pages. And this is just the beginning. All of these programs are much easier to write because Swing provides a simple-to-use HTML parser. All you have to do is respond to the individual elements and attributes that the parser discovers in the HTML document. The more difficult problem of parsing the document is removed.
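
As a rough sketch of the pretty-printing idea, the following callback prints only the tags of a document, indented by nesting depth, and ignores text, comments, and attributes. The class name TagOutliner and its outline( ) helper are invented for this example. It creates a ParserDelegator directly instead of going through an editor kit. Because the parser supplies tags that the source omits (such as an implied head element), the outline can contain tags that are not literally in the input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TagOutliner extends HTMLEditorKit.ParserCallback {

  private final StringBuilder out = new StringBuilder();
  private int depth = 0;

  // Appends two spaces for each level of nesting.
  private void indent() {
    for (int i = 0; i < depth; i++) out.append("  ");
  }

  @Override
  public void handleStartTag(HTML.Tag tag,
      MutableAttributeSet attributes, int position) {
    indent();
    out.append('<').append(tag).append(">\n");
    depth++;
  }

  @Override
  public void handleEndTag(HTML.Tag tag, int position) {
    if (depth > 0) depth--;
    indent();
    out.append("</").append(tag).append(">\n");
  }

  @Override
  public void handleSimpleTag(HTML.Tag tag,
      MutableAttributeSet attributes, int position) {
    // Empty tags such as <br> do not change the nesting depth.
    indent();
    out.append('<').append(tag).append(">\n");
  }

  // Parses the HTML in the string and returns the indented tag outline.
  public static String outline(String html) {
    TagOutliner callback = new TagOutliner();
    try {
      new ParserDelegator().parse(new StringReader(html), callback, true);
    }
    catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
    return callback.out.toString();
  }

  public static void main(String[] args) {
    System.out.print(outline(
        "<html><body><p>Hello<br>world</p></body></html>"));
  }
}
```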
