USING BREAKITERATOR TO PARSE TEXT
These tips were developed using Java(tm) 2 SDK, Standard Edition,
v 1.2.2.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The standard Java(tm) packages such as java.util include several
classes that you can use to break text into words or other logical
units. One of these classes is java.util.StringTokenizer. When you
use StringTokenizer, you specify a set of delimiter characters;
instances of StringTokenizer then return words delimited by these
characters. java.io.StreamTokenizer is a class that does something
similar.
These classes are quite useful. However, they have some limitations,
especially when you're trying to parse text that represents human
language. For example, the classes have no built-in knowledge of
punctuation rules, and they might define a "word" as simply a string
of contiguous non-whitespace characters.
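As a rough illustration of this limitation, the following sketch
splits a string with StringTokenizer using its default whitespace
delimiters. The class name TokenizerSketch and the helper tokens()
are made up for this example, and the code assumes a Java release
with generics, which is newer than the SDK these tips were developed
with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, not part of the original tip.
public class TokenizerSketch {

    // Split on whitespace only (the default StringTokenizer delimiters).
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // Punctuation stays glued to the neighboring word.
        // prints: ["Testing.", (This, is, a, test.)]
        System.out.println(tokens("\"Testing.\" (This is a test.)"));
    }
}
```

Quotation marks, parentheses, and periods all end up inside the
tokens, so any further cleanup is left to the caller.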
java.text.BreakIterator is a class specifically designed to parse
human language text into words, lines, and sentences. To see how it
works, here's a simple example:
    import java.text.BreakIterator;

    public class BreakDemo1 {
        public static void main(String[] args) {

            // string to be broken into sentences
            String str = "\"Testing.\" \"???\" (This is a test.)";

            // create a sentence break iterator
            BreakIterator brkit =
                BreakIterator.getSentenceInstance();
            brkit.setText(str);

            // iterate across the string
            int start = brkit.first();
            int end = brkit.next();
            while (end != BreakIterator.DONE) {
                String sentence = str.substring(start, end);
                System.out.println(start + " " + sentence);
                start = end;
                end = brkit.next();
            }
        }
    }
The input string is:
"Testing." "???" (This is a test.)
It is immediately apparent that parsing this input is not trivial.
For example, suppose you follow a simple rule that a sentence ends
with a period. That rule does not always hold, as the following two
sentences demonstrate. Both are considered correct:

    "This is a test."
    "This is a test".

The first of these sentences is more standard relative to
long-standing English usage.
BreakIterator applies a set of rules to handle situations such as
this. When you run the BreakDemo1 program in the United States
locale, the result is:
0 "Testing."
11 "???"
17 (This is a test.)
The numbers are offsets into the string where each sentence starts.
In other words, BreakIterator returns a series of offsets that tell
where some particular unit (sentence, word) starts in a string.
BreakIterator is particularly useful in applications such as word
processing, where, for example, you might be trying to find the
location of the next sentence in some currently displayed text.
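To make the word-processing scenario concrete, here is a minimal
sketch that finds where the next sentence starts after a given
position, using the following() method of BreakIterator. The class
name NextSentenceSketch, the helper nextSentenceStart(), and the
choice of -1 as the "no further sentence" result are assumptions made
for this example.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper class, not part of the original tip.
public class NextSentenceSketch {

    // Returns the offset at which the next sentence begins after
    // 'position', or -1 if no later sentence starts in the text.
    static int nextSentenceStart(String text, int position) {
        BreakIterator brkit = BreakIterator.getSentenceInstance(Locale.US);
        brkit.setText(text);

        // following() returns the first boundary after the given offset
        int boundary = brkit.following(position);
        if (boundary == BreakIterator.DONE || boundary >= text.length()) {
            return -1;  // only the end of the text is left
        }
        return boundary;
    }

    public static void main(String[] args) {
        String text = "This is a test. Here is another.";
        // the position 3 is inside the first sentence
        System.out.println(nextSentenceStart(text, 3));   // prints 16
        // the position 20 is inside the last sentence
        System.out.println(nextSentenceStart(text, 20));  // prints -1
    }
}
```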
The demo program uses default locale settings, but it could have
requested a particular locale, for example:
... BreakIterator.getSentenceInstance(Locale.GERMAN);
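A fuller version of that fragment might look like the sketch below,
which passes the locale in as a parameter. The class name
LocaleSentenceSketch and the helper sentences() are illustrative
names, and the code assumes a Java release with generics.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original tip.
public class LocaleSentenceSketch {

    // Break 'text' into sentences using the rules for 'locale'.
    static List<String> sentences(String text, Locale locale) {
        List<String> result = new ArrayList<>();
        BreakIterator brkit = BreakIterator.getSentenceInstance(locale);
        brkit.setText(text);

        int start = brkit.first();
        for (int end = brkit.next();
             end != BreakIterator.DONE;
             start = end, end = brkit.next()) {
            // each piece keeps any trailing whitespace
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        String german = "Das ist ein Test. Hier ist noch einer.";
        for (String s : sentences(german, Locale.GERMAN)) {
            System.out.println(s);
        }
    }
}
```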
Another way you can use BreakIterator is to find line breaks,
that is, locations in text where a line could be broken for
text formatting. Here's an example:
    import java.text.BreakIterator;

    public class BreakDemo2 {
        public static void main(String[] args) {

            // string in which to find possible line breaks
            String str = "This sen-tence con-tains hyphenation.";

            // create a line break iterator
            BreakIterator brkit =
                BreakIterator.getLineInstance();
            brkit.setText(str);

            // iterate across the string
            int start = brkit.first();
            int end = brkit.next();
            while (end != BreakIterator.DONE) {
                // text between two consecutive break opportunities
                String piece = str.substring(start, end);
                System.out.println(start + " " + piece);
                start = end;
                end = brkit.next();
            }
        }
    }
Program output is:
0 This
5 sen-
9 tence
15 con-
19 tains
25 hyphenation.
BreakIterator applies punctuation rules about where text can be
broken, such as between words or within a hyphenated word (but not
between a word and a following ".").
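One way these break opportunities could be put to work is a simple
greedy word wrap, sketched below. The class name WrapSketch and the
helper wrap() are invented for this example. As a simplification,
trailing spaces count toward the line width, and a single piece
longer than the width is placed on a line of its own.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original tip.
public class WrapSketch {

    // Greedily pack pieces between line-break opportunities into
    // lines of at most 'width' chars where possible.
    static List<String> wrap(String text, int width) {
        List<String> lines = new ArrayList<>();
        BreakIterator brkit = BreakIterator.getLineInstance(Locale.US);
        brkit.setText(text);

        StringBuilder line = new StringBuilder();
        int start = brkit.first();
        for (int end = brkit.next();
             end != BreakIterator.DONE;
             start = end, end = brkit.next()) {
            String piece = text.substring(start, end);
            // start a new line if this piece would overflow the current one
            if (line.length() > 0 && line.length() + piece.length() > width) {
                lines.add(line.toString());
                line.setLength(0);
            }
            line.append(piece);
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String l : wrap("This sen-tence con-tains hyphenation.", 10)) {
            System.out.println("[" + l + "]");
        }
    }
}
```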
You can also use BreakIterator to find word and character breaks.
It's important to note that in finding breaks, BreakIterator
analyzes characters independently of how they are stored.
A "character" in a human language is not necessarily equivalent to
a single Java 16-bit char. For example, an accented character might
be stored as a base character along with a mark. BreakIterator
analyzes these kinds of composite characters as a single character.
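The sketch below tries out both kinds of iterator. It collects the
words in a string by skipping the pieces that begin with punctuation
or whitespace, and it counts user-visible characters in a string that
stores an accented letter as a base character plus a combining mark.
The class name BoundarySketch and the helpers words() and
countCharacters() are assumptions for this example, as is the rule
of treating a piece as a word when it starts with a letter or digit.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original tip.
public class BoundarySketch {

    // Return the pieces between word boundaries that start with a
    // letter or digit; punctuation and whitespace pieces are skipped.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator brkit = BreakIterator.getWordInstance(Locale.US);
        brkit.setText(text);
        int start = brkit.first();
        for (int end = brkit.next();
             end != BreakIterator.DONE;
             start = end, end = brkit.next()) {
            if (Character.isLetterOrDigit(text.charAt(start))) {
                result.add(text.substring(start, end));
            }
        }
        return result;
    }

    // Count characters as BreakIterator sees them, not Java chars.
    static int countCharacters(String text) {
        BreakIterator brkit = BreakIterator.getCharacterInstance(Locale.US);
        brkit.setText(text);
        int count = 0;
        brkit.first();
        while (brkit.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!"));    // prints [Hello, world]

        // each 'e' is followed by U+0301, a combining acute accent
        String accented = "re\u0301sume\u0301";
        System.out.println(accented.length());          // prints 8
        System.out.println(countCharacters(accented));  // prints 6
    }
}
```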
One final note about BreakIterator: it's intended for use with
human languages, not computer ones. For example, a "sentence" in
programming language source code has little meaning.
For more information about BreakIterator, see
http://java.sun.com/products//jdk/1.2/docs/api/java/text/BreakIterator.html