
Java Interface for Phrase Highlighting

Dainius Jocas edited this page Oct 3, 2019 · 8 revisions

Beagle phrase highlighting exposes options to control:

  • case sensitivity,
  • ASCII folding,
  • stemming support for various languages,
  • phrase slop,
  • defining synonymous phrases,
  • assigning metadata,
  • combining all the options.

The examples use the Beagle library to process snippets of text about Lyndon B. Johnson, one of the most famous beagle owners.

Prerequisites

Beagle is published on Maven Central. Add an entry to the configuration of your favourite dependency manager, e.g. pom.xml:

<dependency>
  <groupId>lt.tokenmill</groupId>
  <artifactId>beagle</artifactId>
  <version>0.3.1</version>
</dependency>

Exact Phrase Highlighting Example

Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "Lyndon Baines Johnson"
import lt.tokenmill.beagle.phrases.Annotation;
import lt.tokenmill.beagle.phrases.Annotator;
import lt.tokenmill.beagle.phrases.DictionaryEntry;

import java.util.Arrays;
import java.util.Collection;

public class Main {
    public static void main(String[] args) {
        DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Baines Johnson");
        Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
        Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
        annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
    }
}

// => Annotated: 'Lyndon Baines Johnson' at offset: 0:21
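The reported offsets appear to be character positions with an inclusive begin and an exclusive end, the same convention as String.substring. This is an inference from the outputs on this page (for example 0:21 for a 21-character phrase), not something the page states. Under that assumption, a minimal sketch of slicing the original text with an annotation's offsets could look like this. OffsetSketch and slice are hypothetical names, and the text is a shortened, ASCII-only version of the example sentence.

```java
final class OffsetSketch {
    // Shortened, ASCII-only version of the example sentence (an assumption made
    // for this sketch; the full sentence on this page also contains IPA symbols).
    static final String TEXT =
            "Lyndon Baines Johnson, often referred to as LBJ, was an American politician.";

    // Treats beginOffset as inclusive and endOffset as exclusive, like String.substring.
    static String slice(String text, int beginOffset, int endOffset) {
        return text.substring(beginOffset, endOffset);
    }

    public static void main(String[] args) {
        // Offsets 0:21 as reported for the phrase "Lyndon Baines Johnson".
        System.out.println(slice(TEXT, 0, 21)); // prints: Lyndon Baines Johnson
    }
}
```

In real code the two integers would come from beginOffset() and endOffset() of an Annotation rather than from literals.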

Other examples will not include class definitions and imports for conciseness.

Case Sensitivity

Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "LYNDON BAINES JOHNSON"
DictionaryEntry dictionaryEntry = new DictionaryEntry("LYNDON BAINES JOHNSON");

dictionaryEntry.setCaseSensitive(false);

Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));

// => Annotated: 'Lyndon Baines Johnson' at offset: 0:21

ASCII Folding

Text: "Lyndon Baines Johnson was a naïve kid from Stonewall, Texas."
Phrase: "Lyndon Baines Johnson was a naive"
DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Baines Johnson was a naive");

dictionaryEntry.setAsciiFold(true);
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson was a naïve kid from Stonewall, Texas.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));

// => Annotated: 'Lyndon Baines Johnson was a naïve' at offset: 0:33
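To illustrate what ASCII folding means, the sketch below strips diacritics with the JDK's java.text.Normalizer so that "naïve" and "naive" compare equal. It only shows the idea. It is not Beagle's implementation, and FoldSketch and fold are hypothetical names.

```java
import java.text.Normalizer;

final class FoldSketch {
    // Decomposes accented characters (NFD) and removes the combining marks.
    // Characters without a canonical decomposition (e.g. "ø" or "ß") are left
    // unchanged, so this is narrower than a full ASCII-folding filter.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // "\u00EF" is the letter i with diaeresis, as in "naïve".
        System.out.println(fold("na\u00EFve")); // prints: naive
    }
}
```

As the example output above shows, the annotation text still contains the original "naïve": folding affects matching, not the reported text.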

Stemming

Text: "Johnson's presidency marked the peak of modern liberalism after the New Deal era."
Phrase: "Johnson presidency"
DictionaryEntry dictionaryEntry = new DictionaryEntry("Johnson presidency");

dictionaryEntry.setStem(true);
dictionaryEntry.setStemmer("english");

Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Johnson's presidency marked the peak of modern liberalism after the New Deal era.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));

// => Annotated: 'Johnson's presidency' at offset: 0:20

Phrase Slop

Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "Lyndon Johnson"
DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Johnson");

dictionaryEntry.setSlop(1);

Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
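As a rough intuition for slop, the sketch below assumes the common search-engine meaning: the phrase terms may be separated by up to `slop` other tokens. It is a simplified, in-order approximation written for this page, not Beagle's matching code, and SlopSketch and matchesWithSlop are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

final class SlopSketch {
    // True if the phrase terms appear in the given order with at most `slop`
    // other tokens between them in total. Reordered terms are not handled.
    static boolean matchesWithSlop(List<String> tokens, List<String> phrase, int slop) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start < tokens.size(); start++) {
            if (!tokens.get(start).equals(phrase.get(0))) {
                continue;
            }
            int previous = start;
            int skipped = 0;
            boolean matched = true;
            for (int i = 1; i < phrase.size() && matched; i++) {
                // Earliest occurrence of the next phrase term after the previous one.
                int next = indexOfFrom(tokens, phrase.get(i), previous + 1);
                if (next < 0) {
                    matched = false;
                } else {
                    skipped += next - previous - 1;
                    previous = next;
                }
            }
            if (matched && skipped <= slop) {
                return true;
            }
        }
        return false;
    }

    private static int indexOfFrom(List<String> tokens, String term, int from) {
        for (int i = from; i < tokens.size(); i++) {
            if (tokens.get(i).equals(term)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Lyndon", "Baines", "Johnson");
        List<String> phrase = Arrays.asList("Lyndon", "Johnson");
        System.out.println(matchesWithSlop(tokens, phrase, 0)); // prints: false
        System.out.println(matchesWithSlop(tokens, phrase, 1)); // prints: true
    }
}
```

With a slop of 1, the single token "Baines" between "Lyndon" and "Johnson" is tolerated, which is the situation the example above sets up.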

Synonymous Phrases

Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "Lyndon Baines Johnson" with a synonym "LBJ"
DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Baines Johnson");
dictionaryEntry.setId("lyndon-baines-johnson");

dictionaryEntry.setSynonyms(Arrays.asList("LBJ"));

Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
annotations.forEach(s -> System.out.println("Annotated: dictionaryEntryId=" + s.dictionaryEntryId() + ": \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));

// => Annotated: dictionaryEntryId=lyndon-baines-johnson: 'Lyndon Baines Johnson' at offset: 0:21
// => Annotated: dictionaryEntryId=lyndon-baines-johnson: 'LBJ' at offset: 99:102

Metadata

Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "Lyndon Baines Johnson" with a metadata map {"email": "demo@example.com"}
DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Baines Johnson");

// Requires: import java.util.HashMap;
HashMap<String, String> meta = new HashMap<>();
meta.put("email", "demo@example.com");
dictionaryEntry.setMeta(meta);

Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset() + " with meta: " + s.meta()));

Annotation Merging

Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrases: "Baines" and "Lyndon Baines Johnson"
DictionaryEntry dictionaryEntry1 = new DictionaryEntry("Baines");
DictionaryEntry dictionaryEntry2 = new DictionaryEntry("Lyndon Baines Johnson");
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry1, dictionaryEntry2));

HashMap<String, Object> annotationOptions = new HashMap<>();
annotationOptions.put("merge-annotations?", false);
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.",
        annotationOptions);
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset() + " with meta: " + s.meta()));

// => Annotated: 'Baines' at offset: 7:13 with meta: {}
// => Annotated: 'Lyndon Baines Johnson' at offset: 0:21 with meta: {}

annotationOptions.put("merge-annotations?", true);
        
annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.",
        annotationOptions);
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset() + " with meta: " + s.meta()));

// => Annotated: 'Lyndon Baines Johnson' at offset: 0:21 with meta: {}
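The effect of merging in this example is that the shorter span 7:13 disappears because the longer span 0:21 covers it. The sketch below reproduces just that effect on plain offset pairs. It assumes that "merging" means dropping covered spans, which is an inference from the output above; how Beagle treats partially overlapping annotations is not covered here. MergeSketch and dropCoveredSpans are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MergeSketch {
    // Each span is {beginOffset, endOffset}. A span is dropped when a strictly
    // longer span fully covers it; all other spans are kept in input order.
    static List<int[]> dropCoveredSpans(List<int[]> spans) {
        List<int[]> kept = new ArrayList<>();
        for (int[] candidate : spans) {
            boolean covered = false;
            for (int[] other : spans) {
                boolean contains = other[0] <= candidate[0] && candidate[1] <= other[1];
                boolean longer = (other[1] - other[0]) > (candidate[1] - candidate[0]);
                if (contains && longer) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Offsets of "Baines" and "Lyndon Baines Johnson" from the example above.
        List<int[]> spans = Arrays.asList(new int[] {7, 13}, new int[] {0, 21});
        for (int[] span : dropCoveredSpans(spans)) {
            System.out.println(span[0] + ":" + span[1]); // prints: 0:21
        }
    }
}
```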

Combinations

Options are set per dictionary entry and are independent of one another, so every dictionary entry can have its own set of options enabled. For example, two dictionary entries can use stemmers for different languages.
