Incorporating Content Structure into Text Analysis Applications

Christina Sauper	Aria Haghighi	Regina Barzilay
csauper@csail.mit.edu	aria42@gmail.com	regina@csail.mit.edu

Abstract

In this paper, we investigate how modeling content structure can benefit text analysis applications such as extractive summarization and sentiment analysis. This follows the linguistic intuition that rich contextual information should be useful in these tasks. We present a framework which combines a supervised text analysis application with the induction of latent content structure. Both of these elements are learned jointly using the EM algorithm. The induced content structure is learned from a large unannotated corpus and biased by the underlying text analysis task. We demonstrate that exploiting content structure yields significant improvements over approaches that rely only on local context.

Full Text: http://groups.csail.mit.edu/rbg/code/content_structure/sauper-emnlp-10.pdf

Code

This code is available for research use only.

These instructions mainly pertain to the Amazon and Yelp data sets for the multi-aspect phrase extraction task.

Running

Source code is included, but to just get started running the system quickly, I recommend installing maven2, then compiling with mvn compile in the base directory. After that, run the system as follows (substituting your desired config file and memory usage):

java -ea -server -mx5G -Djava.ext.dirs=lib -cp target/classes phrase.jointtopic.Main amazon-conf.yaml

Config files

amazon-conf.yaml yelp-conf.yaml

These are the config files which specify parameters for the model. See GlobalOptions.java for more potential options.

conf/amazon.list conf/yelp.list

Lists of token files (see below for format). Those marked as TRAIN_LABELED will be used as labeled input documents; those marked as TEST_LABELED will be used at test time.

Data

We performed experiments on three separate corpora, a set of Amazon HDTV reviews (59 labeled, 12.8k unlabeled), a set of Yelp restaurant reviews (96 labeled, 31k unlabeled), and a set of IGN DVD reviews (665 labeled).

Formats

*.tok

Tokenized file, one word per line. The columns of this file are as follows: word sentence # word #(sent) start-char end-char start-char(sent) end-char(sent)

*.ann

Annotations corresponding to the tokenized file; one word's label per line. Some annotation files are tagged with begin / inside / end tokens; ability to automatically strip these is controlled by an option in the config file.

Demo

A demo system is online at http://condensr.com. Backend code is online at https://github.com/csauper/condensr.

The demo system puts together results from multi-aspect phrase extraction on Yelp reviews with Google maps and other restaurant metadata in order to display a concise summary of restaurants based on a search in a particular area.

Each restaurant summary presents the following:

Basic restaurant information (name, address, location)
Automatically extracted and categorized highlights from all Yelp reviews for the restaurant in several categories (food, ambiance, service, value, overall)
Automatically determined sentiment classification (positive, negative, neutral)
Supporting highlights to expand on the main points

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
conf		conf
lib		lib
src/main		src/main
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Incorporating Content Structure into Text Analysis Applications

Abstract

Code

Running

Config files

Data

Formats

Demo

About

Uh oh!

Releases

Packages

csauper/content-structure

Folders and files

Latest commit

History

Repository files navigation

Incorporating Content Structure into Text Analysis Applications

Abstract

Code

Running

Config files

Data

Formats

Demo

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages