HTMLParser

java.lang.Object
- javax.swing.text.html.HTMLEditorKit.ParserCallback
- - grammarscope.utils.HTMLParser

```
public class HTMLParser
extends javax.swing.text.html.HTMLEditorKit.ParserCallback
```
Parses an HTML document and returns its plain text (and title). The main entry point is the parse(URL url) method, which returns a String with the contents of an HTML page, without the tags. After calling parse, you can get the HTML title (the contents of the TITLE tag) by calling title().

Subclasses may override handleText(), handleComment(), handleStartTag(), and similar methods so that parse(URL url) returns something other than the text of the web page, for example only part of the text, or only the links.
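The callback pattern described above can be illustrated with a minimal, self-contained sketch built only on the JDK's `ParserDelegator`. The class name `PlainTextCallback` and its internals are hypothetical stand-ins written for illustration; they are not the source of `grammarscope.utils.HTMLParser`, which may differ in detail (for instance in how it treats BODY and SCRIPT content).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical illustration only; not the grammarscope.utils.HTMLParser source.
class PlainTextCallback extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle; // true while the parser is inside <title>...</title>

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = true;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inTitle) {
            title.append(data);
        } else {
            // Separate text chunks with a space so words do not run together.
            text.append(data).append(' ');
        }
    }

    // Resets state, runs the JDK HTML parser, and returns the collected text.
    String parse(Reader r) throws IOException {
        text.setLength(0);
        title.setLength(0);
        inTitle = false;
        new ParserDelegator().parse(r, this, true);
        return text.toString().trim();
    }

    String title() {
        return title.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        PlainTextCallback cb = new PlainTextCallback();
        String body = cb.parse(new StringReader(
            "<html><head><title>Demo</title></head>"
            + "<body><p>Hello <b>world</b></p></body></html>"));
        System.out.println("title: " + cb.title());
        System.out.println("text:  " + body);
    }
}
```

As with the documented class, the title is only available after a parse call has completed, because it is captured as a side effect of the callbacks.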

Author:

Sepandar Kamvar (sdkamvar@stanford.edu)

Field Summary

Fields
Modifier and Type	Field and Description
`protected boolean`	`isBody`
`protected boolean`	`isScript`
`protected boolean`	`isTitle`
`private static final int`	`SLURP_BUFFER_SIZE`
`protected java.lang.StringBuffer`	`textBuffer`
`protected java.lang.String`	`title`

Fields inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
IMPLIED

Constructor Summary

Constructors
Constructor and Description
`HTMLParser()`

Method Summary

Modifier and Type	Method and Description
`static java.lang.String`	`escapeString(java.lang.String s, char[] charsToEscape, char escapeChar)`
`void`	`handleEndTag(javax.swing.text.html.HTML.Tag tag, int pos)` Sets a flag if the end tag is the "TITLE" element end tag.
`void`	`handleStartTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attrSet, int pos)` Sets a flag if the start tag is the "TITLE" element start tag.
`void`	`handleText(char[] data, int pos)`
`static void`	`main(java.lang.String[] args)`
`java.lang.String`	`parse(java.io.Reader r)`
`java.lang.String`	`parse(java.lang.String text0)` The parse method that actually does the work.
`java.lang.String`	`parse(java.net.URL url)`
`static java.lang.String`	`searchAndReplace(java.lang.String text, java.lang.String from0, java.lang.String to)`
`static java.lang.String`	`slurpReader(java.io.Reader reader)` Returns all the text from the given Reader.
`static java.lang.String`	`slurpURL(java.net.URL u)` Returns all the text at the given URL.
`java.lang.String`	`title()`

Methods inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
flush, handleComment, handleEndOfLineString, handleError, handleSimpleTag

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

textBuffer

```
protected java.lang.StringBuffer textBuffer
```

title
```
protected java.lang.String title
```

isTitle
```
protected boolean isTitle
```

isBody
```
protected boolean isBody
```

isScript
```
protected boolean isScript
```

SLURP_BUFFER_SIZE
```
private static final int SLURP_BUFFER_SIZE
```
See Also:

Constant Field Values

Constructor Detail
HTMLParser
```
public HTMLParser()
```

Method Detail

handleText
```
public void handleText(char[] data,
                       int pos)
```
Overrides:

handleText in class javax.swing.text.html.HTMLEditorKit.ParserCallback

handleStartTag

```
public void handleStartTag(javax.swing.text.html.HTML.Tag tag,
                           javax.swing.text.MutableAttributeSet attrSet,
                           int pos)
```
Sets a flag if the start tag is the "TITLE" element start tag.

Overrides:

handleStartTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
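Overriding handleStartTag is also the natural hook for the "only the links" use case mentioned in the class description. The following is a rough sketch under that assumption; `LinkCollector` and `linksIn` are hypothetical names, and the sketch extends the JDK's `ParserCallback` directly rather than `HTMLParser` so that it runs without the grammarscope library.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical illustration: collects href values instead of page text.
class LinkCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> links = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    // Parses the document and returns the href of every anchor, in document order.
    static List<String> linksIn(Reader r) throws IOException {
        LinkCollector collector = new LinkCollector();
        new ParserDelegator().parse(r, collector, true);
        return collector.links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"http://a.example/\">A</a> "
            + "<a href=\"/b\">B</a></body></html>";
        for (String link : linksIn(new StringReader(html))) {
            System.out.println(link);
        }
    }
}
```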

handleEndTag
```
public void handleEndTag(javax.swing.text.html.HTML.Tag tag,
                         int pos)
```
Sets a flag if the end tag is the "TITLE" element end tag.

Overrides:

handleEndTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback

parse

```
public java.lang.String parse(java.net.URL url)
                       throws java.io.IOException
```
Throws:

java.io.IOException

parse

```
public java.lang.String parse(java.io.Reader r)
                       throws java.io.IOException
```
Throws:

java.io.IOException

parse
```
public java.lang.String parse(java.lang.String text0)
                       throws java.io.IOException
```
The parse method that actually does the work. It first removes singleton tags from the input, then parses the result.

Parameters:

text0 - input text

Returns:

parsed string

Throws:

java.io.IOException - exception
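This page does not define "singleton tags". One plausible reading is XHTML-style self-closing tags such as `<br/>`, which older HTML parsers can mishandle. Under that assumption only, the pre-processing step might look like the sketch below; `SingletonTags.normalize` is a hypothetical helper, not a method of this class.

```java
// Hypothetical illustration: rewrites self-closing tags before parsing.
// Assumes "singleton tags" means tags written as <tag ... />.
class SingletonTags {
    static String normalize(String html) {
        // Turn "/>" into ">" so <br/> becomes <br>.
        return html.replace("/>", ">");
    }

    public static void main(String[] args) {
        System.out.println(normalize("one<br/>two<img src=\"x\" />"));
    }
}
```

A plain string replacement like this would also rewrite `/>` sequences that appear inside text or script content, so it is a coarse normalization rather than a precise one.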

title
```
public java.lang.String title()
```

searchAndReplace

```
public static java.lang.String searchAndReplace(java.lang.String text,
                                                java.lang.String from0,
                                                java.lang.String to)
```

escapeString

```
public static java.lang.String escapeString(java.lang.String s,
                                            char[] charsToEscape,
                                            char escapeChar)
```

slurpReader
```
public static java.lang.String slurpReader(java.io.Reader reader)
                                    throws java.io.IOException
```
Returns all the text from the given Reader. Closes the Reader when done.

Parameters:

reader - reader

Returns:

The text read from the Reader.

Throws:

java.io.IOException - exception
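The contract above (read everything, then close the Reader) can be sketched with a buffered read loop. `ReaderSlurp` is a hypothetical name, and the buffer size is an assumed value, since this page does not give the actual value of SLURP_BUFFER_SIZE.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration of a "slurp" helper; not the library's source.
class ReaderSlurp {
    // Assumed value; the real SLURP_BUFFER_SIZE is not stated on this page.
    private static final int BUFFER_SIZE = 16_000;

    static String slurp(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[BUFFER_SIZE];
        // try-with-resources closes the Reader even if a read fails.
        try (Reader in = reader) {
            int n;
            while ((n = in.read(chunk, 0, chunk.length)) != -1) {
                sb.append(chunk, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(slurp(new StringReader("line one\nline two")));
    }
}
```

Because the Reader is closed on return, callers should not try to reuse it afterwards.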

slurpURL
```
public static java.lang.String slurpURL(java.net.URL u)
                                 throws java.io.IOException
```
Returns all the text at the given URL.

Parameters:

u - url

Returns:

all the text at the given URL

Throws:

java.io.IOException - exception
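Reading all text at a URL can be sketched the same way, by opening the URL's stream and decoding it. `UrlSlurp` is a hypothetical name, and decoding as UTF-8 is an assumption; this page does not say which character encoding slurpURL uses. The demo reads a temporary local file through a `file:` URL so that it runs without network access.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; not the library's source.
class UrlSlurp {
    static String slurp(URL u) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[8192];
        // Assumption: the content is UTF-8 encoded.
        try (Reader in = new InputStreamReader(u.openStream(), StandardCharsets.UTF_8)) {
            int n;
            while ((n = in.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("slurp", ".html");
        try {
            Files.write(tmp, "<p>hi</p>".getBytes(StandardCharsets.UTF_8));
            System.out.println(slurp(tmp.toUri().toURL()));
        } finally {
            Files.delete(tmp);
        }
    }
}
```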

main

```
public static void main(java.lang.String[] args)
                 throws java.io.IOException
```
Throws:

java.io.IOException
