Java Interface for Phrase Highlighting
Dainius Jocas edited this page Oct 3, 2019 · 8 revisions
Beagle phrase highlighting exposes options to control:
- case sensitivity,
- ASCII folding,
- stemming support for various languages,
- phrase slop,
- defining synonymous phrases,
- assigning metadata,
- combining all the options.
The examples below use the Beagle library to process text snippets about Lyndon B. Johnson, one of the most famous beagle owners.
Beagle is available on Maven Central. Add an entry to your favourite dependency manager's configuration, e.g. pom.xml:
<dependency>
    <groupId>lt.tokenmill</groupId>
    <artifactId>beagle</artifactId>
    <version>0.3.1</version>
</dependency>
Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "Lyndon Baines Johnson"
import lt.tokenmill.beagle.phrases.Annotation;
import lt.tokenmill.beagle.phrases.Annotator;
import lt.tokenmill.beagle.phrases.DictionaryEntry;
import java.util.Arrays;
import java.util.Collection;
public class Main {
    public static void main(String[] args) {
        DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Baines Johnson");
        Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
        Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
        annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
    }
}
// => Annotated: 'Lyndon Baines Johnson' at offset: 0:21
Other examples will not include class definitions and imports for conciseness.
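The offsets printed in the example outputs on this page look like character positions with an exclusive end: the 21-character phrase is reported as 0:21. Assuming that reading is right, an annotation's text can be recovered from the original string with a plain substring call. The sketch below uses only the Java standard library, not Beagle; the class and method names are hypothetical and the input text is shortened for illustration.

```java
class OffsetDemo {
    // Returns the slice of text covered by a begin (inclusive) / end (exclusive) offset pair.
    // Assumption: offsets are Java char indices, as the outputs on this page suggest.
    static String slice(String text, int begin, int end) {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        String text = "Lyndon Baines Johnson, often referred to as LBJ, was an American politician.";
        System.out.println(slice(text, 0, 21)); // Lyndon Baines Johnson
        System.out.println(slice(text, 7, 13)); // Baines
    }
}
```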
Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "LYNDON BAINES JOHNSON"
DictionaryEntry dictionaryEntry = new DictionaryEntry("LYNDON BAINES JOHNSON");
dictionaryEntry.setCaseSensitive(false);
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
// => Annotated: 'Lyndon Baines Johnson' at offset: 0:21
Text: "Lyndon Baines Johnson was a naïve kid from Stonewall, Texas."
Phrase: "Lyndon Baines Johnson was a naive"
DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Baines Johnson was a naive");
dictionaryEntry.setAsciiFold(true);
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson was a naïve kid from Stonewall, Texas.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
// => Annotated: 'Lyndon Baines Johnson was a naïve' at offset: 0:33
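To make the idea of ASCII folding concrete, here is a minimal standard-library sketch that strips diacritics by decomposing characters and dropping the combining marks. It is not Beagle's implementation and will likely cover fewer characters than a full folding filter (for example, it leaves letters such as "ø" untouched); the class and method names are hypothetical.

```java
import java.text.Normalizer;

class AsciiFoldDemo {
    // Decomposes accented characters (NFD) and removes the combining marks,
    // so that "ï" becomes "i". Characters without a decomposition are kept as-is.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // \u00EF is "ï"; the escape avoids source-encoding issues.
        System.out.println(fold("na\u00EFve")); // naive
    }
}
```

With folding applied to both the phrase and the text, "naive" and "naïve" compare equal, which is the effect the example above relies on.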
Text: "Johnson's presidency marked the peak of modern liberalism after the New Deal era."
Phrase: "Johnson presidency"
DictionaryEntry dictionaryEntry = new DictionaryEntry("Johnson presidency");
dictionaryEntry.setStem(true);
dictionaryEntry.setStemmer("english");
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Johnson's presidency marked the peak of modern liberalism after the New Deal era.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
// => Annotated: 'Johnson's presidency' at offset: 0:20
Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "Lyndon Johnson"
DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Johnson");
dictionaryEntry.setSlop(1);
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
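The slop setting is presumably what lets the two-word phrase "Lyndon Johnson" match text where "Baines" sits between the two words. As a rough illustration of the concept only, the sketch below treats slop as the maximum number of extra tokens allowed between phrase terms that appear in order. This is a simplification written with the standard library; the real matching rules in Beagle may differ, and all names here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class SlopDemo {
    // Returns true if the phrase terms occur in order in tokens with at most
    // `slop` other tokens in between (simplified notion of phrase slop).
    static boolean matchesWithSlop(List<String> phrase, List<String> tokens, int slop) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start < tokens.size(); start++) {
            if (!tokens.get(start).equals(phrase.get(0))) {
                continue;
            }
            int pos = start;  // current position in the token list
            int skipped = 0;  // extra tokens seen between phrase terms
            int next = 1;     // index of the next phrase term to find
            while (next < phrase.size() && pos + 1 < tokens.size() && skipped <= slop) {
                pos++;
                if (tokens.get(pos).equals(phrase.get(next))) {
                    next++;
                } else {
                    skipped++;
                }
            }
            if (next == phrase.size() && skipped <= slop) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("Lyndon", "Johnson");
        List<String> tokens = Arrays.asList("Lyndon", "Baines", "Johnson");
        System.out.println(matchesWithSlop(phrase, tokens, 0)); // false
        System.out.println(matchesWithSlop(phrase, tokens, 1)); // true
    }
}
```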
Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "Lyndon Baines Johnson" with a synonym "LBJ"
DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Baines Johnson");
dictionaryEntry.setId("lyndon-baines-johnson");
dictionaryEntry.setSynonyms(Arrays.asList("LBJ"));
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
annotations.forEach(s -> System.out.println("Annotated: dictionaryEntryId=" + s.dictionaryEntryId() + ": \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
// => Annotated: dictionaryEntryId=lyndon-baines-johnson: 'Lyndon Baines Johnson' at offset: 0:21
// => Annotated: dictionaryEntryId=lyndon-baines-johnson: 'LBJ' at offset: 99:102
Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "Lyndon Baines Johnson" with a metadata map {"email": "demo@example.com"}
DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Baines Johnson");
HashMap<String, String> meta = new HashMap<>();
meta.put("email", "demo@example.com");
dictionaryEntry.setMeta(meta);
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset() + " with meta: " + s.meta()));
Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrases: "Baines" and "Lyndon Baines Johnson"
DictionaryEntry dictionaryEntry1 = new DictionaryEntry("Baines");
DictionaryEntry dictionaryEntry2 = new DictionaryEntry("Lyndon Baines Johnson");
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry1, dictionaryEntry2));
HashMap<String, Object> annotationOptions = new HashMap<>();
annotationOptions.put("merge-annotations?", false);
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.",
annotationOptions);
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset() + " with meta: " + s.meta()));
// => Annotated: 'Baines' at offset: 7:13 with meta: {}
// => Annotated: 'Lyndon Baines Johnson' at offset: 0:21 with meta: {}
annotationOptions.put("merge-annotations?", true);
annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.",
annotationOptions);
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset() + " with meta: " + s.meta()));
// => Annotated: 'Lyndon Baines Johnson' at offset: 0:21 with meta: {}
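In the outputs above, enabling merging leaves only the longer annotation 0:21 and drops the nested 7:13. One way to picture this is as merging overlapping offset ranges. The sketch below is a generic interval merge using only the standard library, under the assumption that overlapping spans collapse into their covering span; it is not Beagle's code, its actual merge rules may differ, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MergeDemo {
    // Merges overlapping [begin, end) spans into their covering spans.
    // The input list and its arrays are not modified.
    static List<int[]> merge(List<int[]> spans) {
        List<int[]> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((int[] s) -> s[0]));
        List<int[]> merged = new ArrayList<>();
        for (int[] span : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && span[0] < last[1]) {
                // Overlaps the previous span: extend it if needed.
                last[1] = Math.max(last[1], span[1]);
            } else {
                merged.add(new int[] {span[0], span[1]});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<int[]> spans = Arrays.asList(
                new int[] {7, 13}, new int[] {0, 21}, new int[] {99, 102});
        for (int[] s : merge(spans)) {
            System.out.println(s[0] + ":" + s[1]);
        }
        // prints 0:21 and then 99:102
    }
}
```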
Options are set per dictionary entry and are independent of each other, so every dictionary entry can have its own set of options enabled. For example, two dictionary entries can use stemmers for different languages.