Parsing HTML

素行 2007-04-19


Sometimes you want to read HTML, looking for information without actually displaying it on the screen. For instance, more than one author I know has written a "book ticker" program to track the hour-by-hour progress of their books in the Amazon.com bestseller list. The hardest part of this program isn't retrieving the HTML. It's reading through the HTML to find the one line that contains the book's ranking. As another example, consider a Web Whacker-style program that downloads a web site or part thereof to a local PC with all links intact. Downloading the files once you have the URLs is easy. But reading through the document to find the URLs of the linked pages is considerably more complex.

Both of these examples are parsing problems. While parsing a clearly defined language that doesn't allow syntax errors, such as Java or XML, is relatively straightforward, parsing a flexible language that attempts to recover from errors, like HTML, is extremely difficult. It's easier to write in HTML than it is to write in a strict language like XML, but it's much harder to read such a language. Ease of use for the page author has been favored at the cost of ease of development for the programmer.

Fortunately, the javax.swing.text.html and javax.swing.text.html.parser packages include classes that do most of the hard work for you. They're primarily intended for the internal use of the JEditorPane class discussed in the last section. Consequently, they can be a little tricky to get at. The constructors are often nonpublic or hidden inside inner classes, and the classes themselves aren't very well documented. But once you've seen a few examples, they aren't hard to use.

8.3.1 HTMLEditorKit.Parser

The main HTML parsing class is the inner class javax.swing.text.html.HTMLEditorKit.Parser:

public abstract static class HTMLEditorKit.Parser extends Object

Since this is an abstract class, the actual parsing work is performed by an instance of its concrete subclass javax.swing.text.html.parser.ParserDelegator:

public class ParserDelegator extends HTMLEditorKit.Parser

An instance of this class reads an HTML document from a Reader. It looks for five things in the document: start-tags, end-tags, empty-element tags, text, and comments. That covers all the important parts of a common HTML file. (Document type declarations and processing instructions are omitted, but they're rare and not very important in most HTML files, even when they are included.) Every time the parser sees one of these five items, it invokes the corresponding callback method in a particular instance of the javax.swing.text.html.HTMLEditorKit.ParserCallback class. To parse an HTML file, you write a subclass of HTMLEditorKit.ParserCallback that responds to text and tags as you desire. Then you pass an instance of your subclass to the HTMLEditorKit.Parser's parse( ) method, along with the Reader from which the HTML will be read:

public void parse(Reader in, HTMLEditorKit.ParserCallback callback, boolean ignoreCharSet) throws IOException

The third argument indicates whether the parser should ignore a character set declared in a META tag in the HTML header. This will normally be true. If it's false, then the parser will throw a javax.swing.text.ChangedCharSetException when a META tag in the HTML header is used to change the character set. This would give you an opportunity to switch to a different Reader that understands that character set and reparse the document (this time, setting ignoreCharSet to true since you already know the character set).

parse( ) is the only public method in the HTMLEditorKit.Parser class. All the work is handled inside the callback methods in the HTMLEditorKit.ParserCallback subclass. The parse( ) method simply reads from the Reader in until it's read the entire document. Every time it sees a tag, comment, or block of text, it invokes the corresponding callback method in the HTMLEditorKit.ParserCallback instance. If the Reader throws an IOException, that exception is passed along. Since neither the HTMLEditorKit.Parser nor the HTMLEditorKit.ParserCallback instance is specific to one reader, both can be used to parse multiple files simply by invoking parse( ) multiple times. If you do this, your HTMLEditorKit.ParserCallback class must be fully thread-safe, because parsing takes place in a separate thread and the parse( ) method normally returns before parsing is complete.

Before you can do any of this, however, you have to get your hands on an instance of the HTMLEditorKit.Parser class, and that's harder than it should be. HTMLEditorKit.Parser is an abstract class, so it can't be instantiated directly. Its subclass, javax.swing.text.html.parser.ParserDelegator, is concrete. However, before you can use it, you have to configure it with a DTD, using the protected static methods ParserDelegator.setDefaultDTD( ) and ParserDelegator.createDTD( ):

protected static void setDefaultDTD( )

protected static DTD createDTD(DTD dtd, String name)

So to create a ParserDelegator, you first need to have an instance of javax.swing.text.html.parser.DTD. This class represents a Standard Generalized Markup Language (SGML) document type definition. The DTD class has a protected constructor and many protected methods that subclasses can use to build a DTD from scratch, but this is an API that only an SGML expert could be expected to use. The normal way DTDs are created is by reading the text form of a standard DTD published by someone like the W3C. You should be able to get a DTD for HTML by using the DTDParser class to parse the W3C's published HTML DTD. Unfortunately, the DTDParser class isn't included in the published Swing API, so you can't. Thus, you're going to need to go through the back door to create an HTMLEditorKit.Parser instance. What we'll do is use the HTMLEditorKit.getParser( ) method instead, which ultimately returns a ParserDelegator after properly initializing the DTD for HTML 3.2:

protected HTMLEditorKit.Parser getParser( )

Since this method is protected, we'll simply subclass HTMLEditorKit and override it with a public version, as Example 8-6 demonstrates.

Example 8-6. This subclass just makes the getParser( ) method public

import javax.swing.text.html.*;

public class ParserGetter extends HTMLEditorKit {

  // purely to make this method public
  public HTMLEditorKit.Parser getParser( ) {
    return super.getParser( );
  }

}

Now that you've got a way to get a parser, you're ready to parse some documents. This is accomplished through the parse( ) method of HTMLEditorKit.Parser:

public abstract void parse(Reader input, HTMLEditorKit.ParserCallback callback, boolean ignoreCharSet) throws IOException

The Reader is straightforward. Simply chain an InputStreamReader to the stream reading the HTML document, probably one returned by the openStream( ) method of java.net.URL. For the third argument, you can pass true to ignore encoding issues (this generally works only if you're pretty sure you're dealing with ASCII text) or false if you want to receive a ChangedCharSetException when the document has a META tag indicating the character set. The second argument is where the action is. You're going to write a subclass of HTMLEditorKit.ParserCallback that is notified of every start-tag, end-tag, empty-element tag, text, comment, and error that the parser encounters.
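The effect of passing false as the third argument can be sketched as follows. This is a minimal illustration, not code from the original examples: the class name CharsetSniffer, the method declaredCharset( ), and the sample HTML string are all hypothetical, and the nested Kit class simply repeats the ParserGetter trick from Example 8-6 so that the sketch is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.HTMLEditorKit;

class CharsetSniffer {

    // Same idea as the ParserGetter subclass in Example 8-6:
    // widen the protected getParser() so it can be called here.
    private static class Kit extends HTMLEditorKit {
        @Override
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

    /** Returns the charset declared in a META tag, or null if none is found. */
    static String declaredCharset(String html) {
        HTMLEditorKit.Parser parser = new Kit().getParser();
        try {
            // false: do not ignore the charset, so a META declaration
            // interrupts the parse with a ChangedCharSetException
            parser.parse(new StringReader(html),
                    new HTMLEditorKit.ParserCallback(), false);
            return null;
        } catch (ChangedCharSetException ex) {
            String spec = ex.getCharSetSpec();
            if (ex.keyEqualsCharSet()) {
                return spec.trim(); // spec is the bare charset name
            }
            // otherwise spec is expected to look like "text/html; charset=UTF-8"
            int i = spec.toLowerCase().indexOf("charset=");
            return i < 0 ? null : spec.substring(i + "charset=".length()).trim();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        // hypothetical sample document
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head>"
                + "<body>Hi</body></html>";
        System.out.println(declaredCharset(html));
    }
}
```

Once the character set is known, the document can be read again through an InputStreamReader constructed with that encoding, this time passing true for the third argument.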

8.3.2 HTMLEditorKit.ParserCallback

The ParserCallback class is a public inner class inside javax.swing.text.html.HTMLEditorKit:

public static class HTMLEditorKit.ParserCallback extends Object

It has a single, public noargs constructor:

public HTMLEditorKit.ParserCallback( )

However, you probably won't use this directly because the standard implementation of this class does nothing. It exists to be subclassed. It has six callback methods that do nothing. You will override these methods to respond to specific items seen in the input stream as the document is parsed:

public void handleText(char[] text, int position)

public void handleComment(char[] text, int position)

public void handleStartTag(HTML.Tag tag,

MutableAttributeSet attributes, int position)

public void handleEndTag(HTML.Tag tag, int position)

public void handleSimpleTag(HTML.Tag tag,

MutableAttributeSet attributes, int position)

public void handleError(String errorMessage, int position)

There's also a flush( ) method you use to perform any final cleanup. The parser invokes this method once after it's finished parsing the document:

public void flush( ) throws BadLocationException
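To see which callback fires for which piece of a document, one option is a callback that only records what it is told. The sketch below is an illustration under stated assumptions: CallbackLogger, its log( ) helper, and the sample markup are hypothetical names, and the nested Kit class repeats the ParserGetter trick from Example 8-6.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

class CallbackLogger extends HTMLEditorKit.ParserCallback {

    // exposes the protected getParser(), as ParserGetter does
    private static class Kit extends HTMLEditorKit {
        @Override
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

    private final List<String> events = new ArrayList<>();

    @Override
    public void handleText(char[] text, int position) {
        events.add("text:" + new String(text));
    }

    @Override
    public void handleComment(char[] text, int position) {
        events.add("comment:" + new String(text));
    }

    @Override
    public void handleStartTag(HTML.Tag tag,
            MutableAttributeSet attributes, int position) {
        events.add("start:" + tag);
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int position) {
        events.add("end:" + tag);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag,
            MutableAttributeSet attributes, int position) {
        events.add("simple:" + tag);
    }

    @Override
    public void handleError(String errorMessage, int position) {
        events.add("error:" + errorMessage);
    }

    /** Parses the given markup and returns the recorded callback events. */
    static List<String> log(String html) {
        CallbackLogger logger = new CallbackLogger();
        try {
            new Kit().getParser().parse(new StringReader(html), logger, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return logger.events;
    }

    public static void main(String[] args) {
        // hypothetical sample document
        String html = "<html><body><p>Hello<br>world</p>"
                + "<!-- note --></body></html>";
        for (String event : log(html)) {
            System.out.println(event);
        }
    }
}
```

Running it on a small page is a quick way to learn how the parser classifies unfamiliar markup before writing a real callback.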

Let's begin with a simple example. Suppose you want to write a program that strips out all the tags and comments from an HTML document and leaves only the text. You would write a subclass of HTMLEditorKit.ParserCallback that overrides the handleText( ) method to write the text on a Writer. You would leave the other methods alone. Example 8-7 demonstrates.

Example 8-7. TagStripper

import javax.swing.text.html.*;
import java.io.*;

public class TagStripper extends HTMLEditorKit.ParserCallback {

  private Writer out;

  public TagStripper(Writer out) {
    this.out = out;
  }

  // copy each block of text to the output; tags and comments are dropped
  public void handleText(char[] text, int position) {
    try {
      out.write(text);
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
  }

}

Now let's suppose you want to use this class to actually strip the tags from a URL. You begin by retrieving a parser using Example 8-6's ParserGetter class:

ParserGetter kit = new ParserGetter( );

HTMLEditorKit.Parser parser = kit.getParser( );

Next, construct an instance of your callback class like this:

HTMLEditorKit.ParserCallback callback

= new TagStripper(new OutputStreamWriter(System.out));

Then you get a stream you can read the HTML document from. For example:

try {

URL u = new URL("http://www.");

InputStream in = new BufferedInputStream(u.openStream( ));

InputStreamReader r = new InputStreamReader(in);

Finally, you pass the Reader and the HTMLEditorKit.ParserCallback to the HTMLEditorKit.Parser's parse( ) method, like this:

parser.parse(r, callback, false);

}

catch (IOException ex) {

System.err.println(ex);

}

There are a couple of details about the parsing process that are not obvious. First, the parser parses in a separate thread. Therefore, you should not assume that the document has been parsed when the parse( ) method returns. If you're using the same HTMLEditorKit.ParserCallback object for two separate parses, you need to make all your callback methods thread-safe.

Second, the parser actually skips some of the data in the input. In particular, it normalizes and strips whitespace. If the input document contains seven spaces in a row, the parser will convert that to a single space. Carriage returns, linefeeds, and tabs are all converted to a single space, so you lose line breaks. Furthermore, most text elements are stripped of all leading and trailing whitespace. Elements that contain nothing but space are eliminated completely. Thus, suppose the input document contains this content:

<H1> Here's the Title </H1>

<P> Here's the text </P>

What actually comes out of the tag stripper is:

Here's the TitleHere's the text

The single exception is the PRE element, which maintains all whitespace in its contents unedited. Short of implementing your own parser, I don't know of any way to retain all the stripped space. But you can include the minimum necessary line breaks and whitespace by looking at the tags as well as the text. Generally, you expect a single break in HTML when you see one of these tags:

<BR>

<LI>

<TR>

You expect a double break (paragraph break) when you see one of these tags:

<P>

</H1> </H2> </H3> </H4> </H5> </H6>

<HR>

<DIV>

</UL> </OL> </DL>

To include line breaks in the output, you have to look at each tag as it's processed and determine whether it falls in one of these sets. This is straightforward because the first argument passed to each of the tag callback methods is an HTML.Tag object.
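The whitespace normalization described above, and the PRE exception, can be observed with a few lines of code. This is a sketch under assumptions: WhitespaceDemo and textOf( ) are hypothetical names, the sample strings are made up, and the nested Kit class repeats the ParserGetter trick from Example 8-6.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;

class WhitespaceDemo {

    // exposes the protected getParser(), as ParserGetter does
    private static class Kit extends HTMLEditorKit {
        @Override
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

    /** Concatenates every block of text the parser reports for the markup. */
    static String textOf(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback collector =
                new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] chars, int position) {
                text.append(chars);
            }
        };
        try {
            new Kit().getParser().parse(new StringReader(html), collector, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // runs of spaces in ordinary elements are collapsed by the parser
        System.out.println("[" + textOf("<p>a    b</p>") + "]");
        // inside PRE the spaces are reported as written
        System.out.println("[" + textOf("<pre>a    b</pre>") + "]");
    }
}
```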

8.3.3 HTML.Tag

Tag is a public inner class in the javax.swing.text.html.HTML class:

public static class HTML.Tag extends Object

It has these four methods:

public boolean isBlock( )

public boolean breaksFlow( )

public boolean isPreformatted( )

public String toString( )

The breaksFlow( ) method returns true if the tag should cause a single line break. The isBlock() method returns true if the tag should cause a double line break. The isPreformatted() method returns true if the tag indicates that whitespace should be preserved. This makes it easy to provide the necessary breaks in the output.
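A small helper makes the rule concrete. The sketch below is illustrative only: TagProperties and breaksAfter( ) are hypothetical names. It checks isBlock( ) before breaksFlow( ) because block-level tags such as P report true for both.

```java
import javax.swing.text.html.HTML;

class TagProperties {

    /** Number of line breaks to emit for a tag: 2 for blocks, 1 for flow breakers. */
    static int breaksAfter(HTML.Tag tag) {
        if (tag.isBlock()) {
            return 2; // paragraph-style break
        }
        if (tag.breaksFlow()) {
            return 1; // single line break
        }
        return 0;     // inline tag, no break
    }

    public static void main(String[] args) {
        HTML.Tag[] samples = {HTML.Tag.P, HTML.Tag.BR, HTML.Tag.B, HTML.Tag.PRE};
        for (HTML.Tag tag : samples) {
            System.out.println(tag + ": breaks=" + breaksAfter(tag)
                    + ", preformatted=" + tag.isPreformatted());
        }
    }
}
```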

Chances are you'll see more tags than you'd expect when you parse a file. The parser inserts missing closing tags. In other words, if a document contains only a <P> tag, then the parser will report both the <P> start-tag and the implied </P> end-tag at the appropriate points in the document. Example 8-8 is a program that does the best job yet of converting HTML to pure text. It looks for the empty and end-tags, explicit or implied, and, if the tag indicates that line breaks are called for, inserts the necessary number of line breaks.

Example 8-8. LineBreakingTagStripper

import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
import java.io.*;
import java.net.*;

public class LineBreakingTagStripper
 extends HTMLEditorKit.ParserCallback {

  private Writer out;
  private String lineSeparator;

  public LineBreakingTagStripper(Writer out) {
    this(out, System.getProperty("line.separator", "\r\n"));
  }

  public LineBreakingTagStripper(Writer out, String lineSeparator) {
    this.out = out;
    this.lineSeparator = lineSeparator;
  }

  public void handleText(char[] text, int position) {
    try {
      out.write(text);
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
  }

  public void handleEndTag(HTML.Tag tag, int position) {
    try {
      if (tag.isBlock( )) {
        // block elements get a double (paragraph) break
        out.write(lineSeparator);
        out.write(lineSeparator);
      }
      else if (tag.breaksFlow( )) {
        out.write(lineSeparator);
      }
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
  }

  public void handleSimpleTag(HTML.Tag tag,
   MutableAttributeSet attributes, int position) {
    try {
      if (tag.isBlock( )) {
        out.write(lineSeparator);
        out.write(lineSeparator);
      }
      else if (tag.breaksFlow( )) {
        out.write(lineSeparator);
      }
      else {
        // keep the words on either side of an inline empty tag apart
        out.write(' ');
      }
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
  }

}
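The implied tags mentioned before Example 8-8 can be made visible by recording every start-tag and end-tag the parser reports. The following is a sketch, not part of the original examples: ImpliedTags and trace( ) are hypothetical names, the input string is made up, and the nested Kit class repeats the ParserGetter trick from Example 8-6. It relies on the IMPLIED constant that HTMLEditorKit.ParserCallback defines for marking start-tags the parser supplied itself.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

class ImpliedTags extends HTMLEditorKit.ParserCallback {

    // exposes the protected getParser(), as ParserGetter does
    private static class Kit extends HTMLEditorKit {
        @Override
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

    private final List<String> events = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag,
            MutableAttributeSet attributes, int position) {
        // start-tags invented by the parser carry the IMPLIED attribute
        boolean implied = attributes.isDefined(IMPLIED);
        events.add("start:" + tag + (implied ? " (implied)" : ""));
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int position) {
        // end-tags are reported whether or not they appear in the source
        events.add("end:" + tag);
    }

    /** Parses the markup and returns the start- and end-tag events in order. */
    static List<String> trace(String html) {
        ImpliedTags callback = new ImpliedTags();
        try {
            new Kit().getParser().parse(new StringReader(html), callback, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return callback.events;
    }

    public static void main(String[] args) {
        // two paragraphs, neither of which is closed in the source
        for (String event : trace("<p>One<p>Two")) {
            System.out.println(event);
        }
    }
}
```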

Most of the time, of course, you want to know considerably more than whether a tag breaks a line. You want to know what tag it is, and behave accordingly. For instance, if you were writing a full-blown HTML-to-TeX or HTML-to-RTF converter, you'd want to handle each tag differently. You test the type of tag by comparing it against these 73 mnemonic constants from the HTML.Tag class:

HTML.Tag.A            HTML.Tag.FRAMESET   HTML.Tag.PARAM
HTML.Tag.ADDRESS      HTML.Tag.H1         HTML.Tag.PRE
HTML.Tag.APPLET       HTML.Tag.H2         HTML.Tag.SAMP
HTML.Tag.AREA         HTML.Tag.H3         HTML.Tag.SCRIPT
HTML.Tag.B            HTML.Tag.H4         HTML.Tag.SELECT
HTML.Tag.BASE         HTML.Tag.H5         HTML.Tag.SMALL
HTML.Tag.BASEFONT     HTML.Tag.H6         HTML.Tag.STRIKE
HTML.Tag.BIG          HTML.Tag.HEAD       HTML.Tag.S
HTML.Tag.BLOCKQUOTE   HTML.Tag.HR         HTML.Tag.STRONG
HTML.Tag.BODY         HTML.Tag.HTML       HTML.Tag.STYLE
HTML.Tag.BR           HTML.Tag.I          HTML.Tag.SUB
HTML.Tag.CAPTION      HTML.Tag.IMG        HTML.Tag.SUP
HTML.Tag.CENTER       HTML.Tag.INPUT      HTML.Tag.TABLE
HTML.Tag.CITE         HTML.Tag.ISINDEX    HTML.Tag.TD
HTML.Tag.CODE         HTML.Tag.KBD        HTML.Tag.TEXTAREA
HTML.Tag.DD           HTML.Tag.LI         HTML.Tag.TH
HTML.Tag.DFN          HTML.Tag.LINK       HTML.Tag.TR
HTML.Tag.DIR          HTML.Tag.MAP        HTML.Tag.TT
HTML.Tag.DIV          HTML.Tag.MENU       HTML.Tag.U
HTML.Tag.DL           HTML.Tag.META       HTML.Tag.UL
HTML.Tag.DT           HTML.Tag.NOFRAMES   HTML.Tag.VAR
HTML.Tag.EM           HTML.Tag.OBJECT     HTML.Tag.IMPLIED
HTML.Tag.FONT         HTML.Tag.OL         HTML.Tag.COMMENT
HTML.Tag.FORM         HTML.Tag.OPTION
HTML.Tag.FRAME        HTML.Tag.P

These are not int constants. They are object constants to allow compile-time type checking. You saw this trick once before in the javax.swing.event.HyperlinkEvent class. All HTML.Tag elements passed to your callback methods by the HTMLEditorKit.Parser will be one of these 73 constants. They are not just the same as these 73 objects; they are these 73 objects. There are exactly 73 objects in this class; no more, no less. You can test against them with == rather than equals( ).
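Because the tags are shared constants, a callback can dispatch on them with plain identity comparisons. The sketch below counts links and images that way. It is an illustration only: TagCounter, count( ), and the sample markup are hypothetical, and the nested Kit class repeats the ParserGetter trick from Example 8-6.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

class TagCounter extends HTMLEditorKit.ParserCallback {

    // exposes the protected getParser(), as ParserGetter does
    private static class Kit extends HTMLEditorKit {
        @Override
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

    private int links;
    private int images;

    @Override
    public void handleStartTag(HTML.Tag tag,
            MutableAttributeSet attributes, int position) {
        // == is sufficient: the parser hands back the constant itself
        if (tag == HTML.Tag.A) {
            links++;
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag,
            MutableAttributeSet attributes, int position) {
        // IMG is an empty element, so it arrives as a simple tag
        if (tag == HTML.Tag.IMG) {
            images++;
        }
    }

    /** Returns {number of A start-tags, number of IMG tags} in the markup. */
    static int[] count(String html) {
        TagCounter counter = new TagCounter();
        try {
            new Kit().getParser().parse(new StringReader(html), counter, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return new int[] {counter.links, counter.images};
    }

    public static void main(String[] args) {
        // hypothetical sample document
        int[] counts = count("<p><a href=\"x.html\">x</a> "
                + "<img src=\"a.gif\"> <img src=\"b.gif\"></p>");
        System.out.println("links=" + counts[0] + ", images=" + counts[1]);
    }
}
```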

For example, let's suppose you need a program that outlines HTML pages by extracting their H1 through H6 headings while ignoring the rest of the document. It organizes the outline as nested lists in which each H1 heading is at the top level, each H2 heading is one level deep, and so on. You would write an HTMLEditorKit.ParserCallback subclass that extracted the contents of all H1, H2, H3, H4, H5, and H6 elements while ignoring all others, as Example 8-9 demonstrates.

Example 8-9. Outliner

import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
import java.io.*;
import java.net.*;
import java.util.*;

public class Outliner extends HTMLEditorKit.ParserCallback {

  private Writer out;
  private int level = 0;
  private boolean inHeader = false;
  private static String lineSeparator
   = System.getProperty("line.separator", "\r\n");

  public Outliner(Writer out) {
    this.out = out;
  }

  public void handleStartTag(HTML.Tag tag,
   MutableAttributeSet attributes, int position) {

    int newLevel = 0;
    if (tag == HTML.Tag.H1) newLevel = 1;
    else if (tag == HTML.Tag.H2) newLevel = 2;
    else if (tag == HTML.Tag.H3) newLevel = 3;
    else if (tag == HTML.Tag.H4) newLevel = 4;
    else if (tag == HTML.Tag.H5) newLevel = 5;
    else if (tag == HTML.Tag.H6) newLevel = 6;
    else return;

    this.inHeader = true;
    try {
      if (newLevel > this.level) {
        // going deeper: open one list per level descended
        for (int i = 0; i < newLevel-this.level; i++) {
          out.write("<ul>" + lineSeparator + "<li>");
        }
      }
      else if (newLevel < this.level) {
        // coming back up: close one list per level ascended
        for (int i = 0; i < this.level-newLevel; i++) {
          out.write(lineSeparator + "</ul>" + lineSeparator);
        }
        out.write(lineSeparator + "<li>");
      }
      else {
        out.write(lineSeparator + "<li>");
      }
      this.level = newLevel;
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
    }

  }

  public void handleEndTag(HTML.Tag tag, int position) {

    if (tag == HTML.Tag.H1 || tag == HTML.Tag.H2
     || tag == HTML.Tag.H3 || tag == HTML.Tag.H4
     || tag == HTML.Tag.H5 || tag == HTML.Tag.H6) {
      inHeader = false;
    }

    // work around bug in the parser that fails to call flush
    if (tag == HTML.Tag.HTML) this.flush( );

  }

  public void handleText(char[] text, int position) {

    if (inHeader) {
      try {
        out.write(text);
        out.flush( );
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
    }

  }

  public void flush( ) {
    try {
      // close any lists that are still open
      while (this.level-- > 0) {
        out.write(lineSeparator + "</ul>");
      }
      out.flush( );
    }
    catch (IOException e) {
      System.err.println(e);
    }
  }

  private static void parse(URL url, String encoding) throws IOException {
    ParserGetter kit = new ParserGetter( );
    HTMLEditorKit.Parser parser = kit.getParser( );
    InputStream in = url.openStream( );
    InputStreamReader r = new InputStreamReader(in, encoding);
    HTMLEditorKit.ParserCallback callback = new Outliner
     (new OutputStreamWriter(System.out));
    parser.parse(r, callback, true);
  }

  public static void main(String[] args) {

    ParserGetter kit = new ParserGetter( );
    HTMLEditorKit.Parser parser = kit.getParser( );

    String encoding = "ISO-8859-1";
    URL url = null;
    try {
      url = new URL(args[0]);
      InputStream in = url.openStream( );
      InputStreamReader r = new InputStreamReader(in, encoding);
      // parse once just to detect the encoding
      HTMLEditorKit.ParserCallback doNothing
       = new HTMLEditorKit.ParserCallback( );
      parser.parse(r, doNothing, false);
    }
    catch (MalformedURLException ex) {
      System.out.println("Usage: java Outliner url");
      return;
    }
    catch (ChangedCharSetException ex) {
      String mimeType = ex.getCharSetSpec( );
      encoding = mimeType.substring(mimeType.indexOf("=") + 1).trim( );
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
    catch (ArrayIndexOutOfBoundsException ex) {
      System.out.println("Usage: java Outliner url");
      return;
    }

    try {
      parse(url, encoding);
    }
    catch(IOException ex) {
      System.err.println(ex);
    }

  }

}

When a heading start-tag is encountered by the handleStartTag( ) method, the necessary number of <ul>, </ul>, and <li> tags are emitted. Furthermore, the inHeader flag is set to true so that the handleText( ) method will know to output the contents of the heading. All start-tags except the six levels of headers are simply ignored. The handleEndTag( ) method likewise considers heading tags only by comparing the tag it receives with the seven tags it's interested in. If it sees a heading tag, it sets the inHeader flag to false again so that body text won't be emitted by the handleText( ) method. If it sees the end of the document via an </html> tag, it flushes out the document. Otherwise, it does nothing. The end result is a nicely formatted group of nested, unordered lists that outlines the document. For example, here's the output of running it against http://www.:

% java Outliner http://www./
<ul>
<li> Cafe con Leche XML News and Resources<ul>
<li>Quote of the Day
<li>Today's News
<li>Recommended Reading
<li>Recent News<ul>
<li>XML Overview
<li>Tutorials
<li>Projects
<li>Seminar Notes
<li>Random Notes
<li>Specifications
<li>Books
<li>XML Resources
<li>Development Tools<ul>
<li>Validating Parsers
<li>Non-validating Parsers
<li>Online Validators and Syntax Checkers
<li>Formatting Engines
<li>Browsers
<li>Class Libraries
<li>Editors
<li>XML Applications
<li>External Sites
</ul>

8.3.4 Attributes

When processing an HTML file, you often need to look at the attributes as well as the tags. The second argument to the handleStartTag( ) and handleSimpleTag( ) callback methods is an object that implements the javax.swing.text.MutableAttributeSet interface. This object allows you to see what attributes are attached to a particular tag. MutableAttributeSet is a subinterface of the javax.swing.text.AttributeSet interface:

public abstract interface MutableAttributeSet extends AttributeSet

Both AttributeSet and MutableAttributeSet represent a collection of attributes on an HTML tag. The difference is that AttributeSet only lets you inspect the attributes in the set, whereas MutableAttributeSet also declares methods to add attributes to and remove attributes from the set. The attributes themselves are represented as pairs of java.lang.Object objects, one for the name of the attribute and one for the value. The AttributeSet interface declares these methods:

public int getAttributeCount( )

public boolean isDefined(Object name)

public boolean containsAttribute(Object name, Object value)

public boolean containsAttributes(AttributeSet attributes)

public boolean isEqual(AttributeSet attributes)

public AttributeSet copyAttributes( )

public Enumeration getAttributeNames( )

public Object getAttribute(Object name)

public AttributeSet getResolveParent( )

Most of these methods are self-explanatory. The getAttributeCount( ) method returns the number of attributes in the set. The isDefined( ) method returns true if an attribute with the specified name is in the set, false otherwise. The containsAttribute(Object name, Object value) method returns true if an attribute with the given name and value is in the set. The containsAttributes(AttributeSet attributes) method returns true if all the attributes in the specified set are in this set with the same values; in other words, if the argument is a subset of the set on which this method is invoked. The isEqual() method returns true if the invoking AttributeSet is the same as the argument. The copyAttributes( ) method returns a clone of the current AttributeSet. The getAttributeNames( ) method returns a java.util.Enumeration of all the names of the attributes in the set. Once you know the name of one of the elements of the set, the getAttribute( ) method returns its value. Finally, the getResolveParent( ) method returns the parent AttributeSet, which will be searched for attributes that are not found in the current set. For example, given an AttributeSet, this method prints the attributes in name=value format:

private void listAttributes(AttributeSet attributes) {
  Enumeration e = attributes.getAttributeNames( );
  while (e.hasMoreElements( )) {
    Object name = e.nextElement( );
    Object value = attributes.getAttribute(name);
    System.out.println(name + "=" + value);
  }
}

Although the argument and return types of these methods are mostly declared in terms of java.lang.Object, in practice, all values are instances of java.lang.String, while all names are instances of the public inner class javax.swing.text.html.HTML.Attribute. Just as the HTML.Tag class predefines 73 HTML tags and uses a private constructor to prevent the creation of others, so too does the HTML.Attribute class predefine 80 standard HTML attributes (HTML.Attribute.ACTION, HTML.Attribute.ALIGN, HTML.Attribute.ALINK, HTML.Attribute.ALT, etc.) and prohibit the construction of others via a nonpublic constructor. Generally, this isn't an issue, since you mostly use getAttribute( ), containsAttribute( ), and so forth only with names returned by getAttributeNames( ). The 80 predefined attributes are:

HTML.Attribute.ACTION        HTML.Attribute.DUMMY          HTML.Attribute.PROMPT
HTML.Attribute.ALIGN         HTML.Attribute.ENCTYPE        HTML.Attribute.REL
HTML.Attribute.ALINK         HTML.Attribute.ENDTAG         HTML.Attribute.REV
HTML.Attribute.ALT           HTML.Attribute.FACE           HTML.Attribute.ROWS
HTML.Attribute.ARCHIVE       HTML.Attribute.FRAMEBORDER    HTML.Attribute.ROWSPAN
HTML.Attribute.BACKGROUND    HTML.Attribute.HALIGN         HTML.Attribute.SCROLLING
HTML.Attribute.BGCOLOR       HTML.Attribute.HEIGHT         HTML.Attribute.SELECTED
HTML.Attribute.BORDER        HTML.Attribute.HREF           HTML.Attribute.SHAPE
HTML.Attribute.CELLPADDING   HTML.Attribute.HSPACE         HTML.Attribute.SHAPES
HTML.Attribute.CELLSPACING   HTML.Attribute.HTTPEQUIV      HTML.Attribute.SIZE
HTML.Attribute.CHECKED       HTML.Attribute.ID             HTML.Attribute.SRC
HTML.Attribute.CLASS         HTML.Attribute.ISMAP          HTML.Attribute.STANDBY
HTML.Attribute.CLASSID       HTML.Attribute.LANG           HTML.Attribute.START
HTML.Attribute.CLEAR         HTML.Attribute.LANGUAGE       HTML.Attribute.STYLE
HTML.Attribute.CODE          HTML.Attribute.LINK           HTML.Attribute.TARGET
HTML.Attribute.CODEBASE      HTML.Attribute.LOWSRC         HTML.Attribute.TEXT
HTML.Attribute.CODETYPE      HTML.Attribute.MARGINHEIGHT   HTML.Attribute.TITLE
HTML.Attribute.COLOR         HTML.Attribute.MARGINWIDTH    HTML.Attribute.TYPE
HTML.Attribute.COLS          HTML.Attribute.MAXLENGTH      HTML.Attribute.USEMAP
HTML.Attribute.COLSPAN       HTML.Attribute.METHOD         HTML.Attribute.VALIGN
HTML.Attribute.COMMENT       HTML.Attribute.MULTIPLE       HTML.Attribute.VALUE
HTML.Attribute.COMPACT       HTML.Attribute.N              HTML.Attribute.VALUETYPE
HTML.Attribute.CONTENT       HTML.Attribute.NAME           HTML.Attribute.VERSION
HTML.Attribute.COORDS        HTML.Attribute.NOHREF         HTML.Attribute.VLINK
HTML.Attribute.DATA          HTML.Attribute.NORESIZE       HTML.Attribute.VSPACE
HTML.Attribute.DECLARE       HTML.Attribute.NOSHADE        HTML.Attribute.WIDTH
HTML.Attribute.DIR           HTML.Attribute.NOWRAP
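As an illustration of looking attributes up by these constants, the following sketch collects the HREF and SRC values found in a document, which is the link-finding problem raised at the start of this section. The names LinkLister and list( ) and the sample markup are hypothetical, and the nested Kit class repeats the ParserGetter trick from Example 8-6.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

class LinkLister extends HTMLEditorKit.ParserCallback {

    // exposes the protected getParser(), as ParserGetter does
    private static class Kit extends HTMLEditorKit {
        @Override
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

    private final List<String> links = new ArrayList<>();

    // record the value of the HREF and SRC attributes, if present
    private void collect(AttributeSet attributes) {
        Object href = attributes.getAttribute(HTML.Attribute.HREF);
        if (href != null) {
            links.add(href.toString());
        }
        Object src = attributes.getAttribute(HTML.Attribute.SRC);
        if (src != null) {
            links.add(src.toString());
        }
    }

    @Override
    public void handleStartTag(HTML.Tag tag,
            MutableAttributeSet attributes, int position) {
        collect(attributes);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag,
            MutableAttributeSet attributes, int position) {
        collect(attributes);
    }

    /** Returns the HREF and SRC values in document order. */
    static List<String> list(String html) {
        LinkLister lister = new LinkLister();
        try {
            new Kit().getParser().parse(new StringReader(html), lister, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return lister.links;
    }

    public static void main(String[] args) {
        // hypothetical sample document
        String html = "<p><a href=\"page.html\">next</a>"
                + "<img src=\"pic.gif\"></p>";
        for (String link : list(html)) {
            System.out.println(link);
        }
    }
}
```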

The MutableAttributeSet interface adds six methods to add attributes to and remove attributes from the set:

public void addAttribute(Object name, Object value)

public void addAttributes(AttributeSet attributes)

public void removeAttribute(Object name)

public void removeAttributes(Enumeration names)

public void removeAttributes(AttributeSet attributes)

public void setResolveParent(AttributeSet parent)

Again, the values are strings and the names are HTML.Attribute objects.
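The add and remove methods can be tried without a parser by using javax.swing.text.SimpleAttributeSet, a standard implementation of MutableAttributeSet. The sketch below is illustrative only: AttributeEditing and sample( ) are hypothetical names, and the attribute values are made up.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;

class AttributeEditing {

    /** Builds a small attribute set, then removes one attribute again. */
    static MutableAttributeSet sample() {
        MutableAttributeSet set = new SimpleAttributeSet();
        // names are HTML.Attribute constants, values are strings
        set.addAttribute(HTML.Attribute.HREF, "index.html");
        set.addAttribute(HTML.Attribute.TARGET, "_blank");
        set.removeAttribute(HTML.Attribute.TARGET);
        return set;
    }

    public static void main(String[] args) {
        MutableAttributeSet set = sample();
        System.out.println("count: " + set.getAttributeCount());
        System.out.println("has href: " + set.isDefined(HTML.Attribute.HREF));
        System.out.println("has target: " + set.isDefined(HTML.Attribute.TARGET));
    }
}
```

Inside a callback, the same calls can be applied to the attributes argument before the tag is written out.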

One possible use for all these methods is to modify documents before saving or displaying them. For example, most web browsers let you save a page on your hard drive as either HTML or text. However, both these formats lose track of images and relative links. The problem is that most pages are full of relative URLs, and these all break when you move the page to your local machine. Example 8-10 is an application called PageSaver that downloads a web page to a local hard drive while keeping all links intact by rewriting all relative URLs as absolute URLs.

The PageSaver class reads a series of URLs from the command line. It opens each one in turn and parses it. Every tag, text block, comment, and attribute is copied into a local file. However, all link attributes, such as SRC, LOWSRC, CODEBASE, and HREF, are remapped to an absolute URL. Note particularly the extensive use to which the URL and javax.swing.text classes were put; PageSaver could be rewritten with string replacements, but that would be considerably more complicated.

Example 8-10. PageSaver

import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
import java.io.*;
import java.net.*;
import java.util.*;

public class PageSaver extends HTMLEditorKit.ParserCallback {

  private Writer out;
  private URL base;

  public PageSaver(Writer out, URL base) {
    this.out = out;
    this.base = base;
  }

  public void handleStartTag(HTML.Tag tag,
   MutableAttributeSet attributes, int position) {
    try {
      out.write("<" + tag);
      this.writeAttributes(attributes);
      // for the <APPLET> tag we may have to add a codebase attribute
      if (tag == HTML.Tag.APPLET
       && attributes.getAttribute(HTML.Attribute.CODEBASE) == null) {
        String codebase = base.toString( );
        if (codebase.endsWith(".htm") || codebase.endsWith(".html")) {
          codebase = codebase.substring(0, codebase.lastIndexOf('/'));
        }
        out.write(" codebase=\"" + codebase + "\"");
      }
      out.write(">");
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
      ex.printStackTrace( );
    }
  }

  public void handleEndTag(HTML.Tag tag, int position) {
    try {
      out.write("</" + tag + ">");
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
  }

  private void writeAttributes(AttributeSet attributes)
   throws IOException {

    Enumeration e = attributes.getAttributeNames( );
    while (e.hasMoreElements( )) {
      Object name = e.nextElement( );
      String value = (String) attributes.getAttribute(name);
      try {
        // attributes that hold URLs are rewritten as absolute URLs
        if (name == HTML.Attribute.HREF || name == HTML.Attribute.SRC
         || name == HTML.Attribute.LOWSRC
         || name == HTML.Attribute.CODEBASE ) {
          URL u = new URL(base, value);
          out.write(" " + name + "=\"" + u + "\"");
        }
        else {
          out.write(" " + name + "=\"" + value + "\"");
        }
      }
      catch (MalformedURLException ex) {
        System.err.println(ex);
        System.err.println(base);
        System.err.println(value);
        ex.printStackTrace( );
      }
    }
  }

  public void handleComment(char[] text, int position) {
    try {
      out.write("<!-- ");
      out.write(text);
      out.write(" -->");
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
  }

  public void handleText(char[] text, int position) {
    try {
      out.write(text);
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
      ex.printStackTrace( );
    }
  }

  public void handleSimpleTag(HTML.Tag tag,
   MutableAttributeSet attributes, int position) {
    try {
      out.write("<" + tag);
      this.writeAttributes(attributes);
      out.write(">");
    }
    catch (IOException ex) {
      System.err.println(ex);
      ex.printStackTrace( );
    }
  }

  public static void main(String[] args) {

    for (int i = 0; i < args.length; i++) {

      ParserGetter kit = new ParserGetter( );
      HTMLEditorKit.Parser parser = kit.getParser( );

      try {
        URL u = new URL(args[i]);
        InputStream in = u.openStream( );
        InputStreamReader r = new InputStreamReader(in);
        String remoteFileName = u.getFile( );
        if (remoteFileName.endsWith("/")) {
          remoteFileName += "index.html";
        }
        if (remoteFileName.startsWith("/")) {
          remoteFileName = remoteFileName.substring(1);
        }
        // mirror the remote directory structure under a directory
        // named after the host
        File localDirectory = new File(u.getHost( ));
        while (remoteFileName.indexOf('/') > -1) {
          String part = remoteFileName.substring(0,
           remoteFileName.indexOf('/'));
          remoteFileName
           = remoteFileName.substring(remoteFileName.indexOf('/')+1);
          localDirectory = new File(localDirectory, part);
        }
        // mkdirs( ) returns false if the directory already exists,
        // so check for that case first
        if (localDirectory.exists( ) || localDirectory.mkdirs( )) {
          File output = new File(localDirectory, remoteFileName);
          FileWriter out = new FileWriter(output);
          HTMLEditorKit.ParserCallback callback = new PageSaver(out, u);
          parser.parse(r, callback, false);
        }
      }
      catch (IOException ex) {
        System.err.println(ex);
        ex.printStackTrace( );
      }

    }

  }

}

The handleEndTag( ), handleText(), and handleComment( ) methods simply copy their content from the input into the output. The handleStartTag( ) and handleSimpleTag( ) methods write their respective tags onto the output but also invoke the private writeAttributes( ) method. This method loops through the attributes in the set and mostly just copies them onto the output. However, for a few select attributes, such as SRC and HREF, which typically have URL values, it rewrites the values as absolute URLs. Finally, the main( ) method reads URLs from the command line, calculates reasonable names and directories for corresponding local files, and starts a new PageSaver for each URL.
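The rewriting step rests on the two-argument URL constructor, which resolves a possibly relative value against a base URL. The sketch below isolates that one step. UrlResolver and absolutize( ) are hypothetical names, and the example.com and example.org addresses are placeholders rather than sites from the original text. Recent JDK releases mark the URL constructors as deprecated, but they remain available.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlResolver {

    /** Resolves an attribute value against the URL of the page it came from. */
    static String absolutize(String base, String value) {
        try {
            // same call PageSaver's writeAttributes( ) method makes
            return new URL(new URL(base), value).toString();
        } catch (MalformedURLException ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/dir/page.html"; // placeholder
        String[] values = {"pic.gif", "../index.html", "/top.html",
                "http://other.example.org/x.html"};
        for (String value : values) {
            System.out.println(value + " -> " + absolutize(base, value));
        }
    }
}
```

Values that are already absolute pass through unchanged, which is why PageSaver can apply the same call to every HREF, SRC, LOWSRC, and CODEBASE attribute without checking first.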

In the first edition of this book, I included a similar program that downloaded the raw HTML using the URL class and parsed it manually. That program was about a third longer than this one and much less robust. For instance, it did not support frames or the LOWSRC attributes of IMG tags. It went to great effort to handle both quoted and unquoted attribute values and still didn't recognize attribute values enclosed in single quotes. By contrast, this program needs only one extra line of code to support each additional attribute. It is much more robust, much easier to understand (since there's not a lot of detailed string manipulation), and much easier to extend.

This is just one example of the various HTML filters that the javax.swing.text.html package makes easy to write. You could, for example, write a filter that pretty-prints the HTML by indenting the different levels of tags. You could write a program to convert HTML to TeX, XML, RTF, or many other formats. You could write a program that spiders a web site, downloading all linked pages, and this is just the beginning. All of these programs are much easier to write because Swing provides a simple-to-use HTML parser. All you have to do is respond to the individual elements and attributes that the parser discovers in the HTML document. The more difficult problem of parsing the document is removed.
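As one illustration, the pretty-printing filter mentioned above might be sketched as follows. This is a rough outline under assumptions, not a finished tool: IndentingPrinter and prettyPrint( ) are hypothetical names, attributes and comments are dropped for brevity, and the nested Kit class repeats the ParserGetter trick from Example 8-6.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

class IndentingPrinter extends HTMLEditorKit.ParserCallback {

    // exposes the protected getParser(), as ParserGetter does
    private static class Kit extends HTMLEditorKit {
        @Override
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

    private final StringBuilder out = new StringBuilder();
    private int depth = 0;

    // write one line, indented two spaces per nesting level
    private void line(String content) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(content).append('\n');
    }

    @Override
    public void handleStartTag(HTML.Tag tag,
            MutableAttributeSet attributes, int position) {
        line("<" + tag + ">");
        depth++; // everything until the end-tag is nested one level deeper
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int position) {
        depth = Math.max(0, depth - 1);
        line("</" + tag + ">");
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag,
            MutableAttributeSet attributes, int position) {
        line("<" + tag + ">");
    }

    @Override
    public void handleText(char[] text, int position) {
        line(new String(text));
    }

    /** Returns the markup re-indented by nesting level (attributes omitted). */
    static String prettyPrint(String html) {
        IndentingPrinter printer = new IndentingPrinter();
        try {
            new Kit().getParser().parse(new StringReader(html), printer, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return printer.out.toString();
    }

    public static void main(String[] args) {
        System.out.print(
                prettyPrint("<html><body><p>Hi</p></body></html>"));
    }
}
```

A fuller version would write the attributes back out as PageSaver does and would avoid indenting text inside PRE elements.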