Text Processing

PanagiotisP edited this page Aug 25, 2019 · 2 revisions

The most important goal of this project is the extraction of linguistic features from texts.

These features take the form of indices, grouped into three categories, as shown below:

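| Readability | Lexical Diversity | Miscellaneous |
| --- | --- | --- |
| ARI | TTR | Entropy |
| ARI NRI | C | Normalized Entropy |
| ARI simplified | R | |
| BormuthMC | CTTR | |
| BormuthGP | U | |
| Coleman | S | |
| ColemanC2 | K | |
| Coleman.Liau.ECP | D | |
| Coleman.Liau.grade | Vm | |
| Dale.Chall | Maas | |
| Dale.Chall.old | MATTR | |
| Dale.Chall.PSK | MSTTR | |
| Danielson.Bryan | lgV0 | |
| Danielson.Bryan2 | lgeV0 | |
| Dickes.Steiwer | | |
| DRP | | |
| ELF | | |
| Farr.Jenkins.Paterson | | |
| Flesch | | |
| Flesch PSK | | |
| Flesch.Kincaid | | |
| FOG | | |
| FOG PSK | | |
| FOG NRI | | |
| FORCAST | | |
| FORCAST Reading Grade Level | | |
| Fucks | | |
| Linsear.Write | | |
| LIW | | |
| nWS | | |
| nWS 2 | | |
| nWS 3 | | |
| nWS 4 | | |
| RIX | | |
| Scrabble | | |
| SMOG | | |
| SMOG C | | |
| SMOG simplified | | |
| Spache | | |
| Spache.old | | |
| Strain | | |
| Traenkle.Bailer | | |
| Traenkle.Bailer 2 | | |
| Wheeler.Smith | | |
| Mean Sentence Length | | |
| Mean Word Syllables | | |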
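
To make a couple of the listed indices concrete, here is a minimal base-R sketch of TTR (distinct types divided by total tokens) and Shannon entropy over token frequencies. The helper names (`simple_tokens`, `type_token_ratio`, `token_entropy`) and the regex-based tokenisation are assumptions made for illustration; the project's scripts rely on koRpus and quanteda, and their exact definitions may differ.

```r
# Hypothetical helper: split text into lowercase word tokens.
# The regex-based tokenisation is an assumption made for this sketch.
simple_tokens <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^[:alpha:]]+"))
  words[nzchar(words)]
}

# TTR: number of distinct types divided by number of tokens.
type_token_ratio <- function(tokens) {
  length(unique(tokens)) / length(tokens)
}

# Shannon entropy (in bits) of the token frequency distribution.
token_entropy <- function(tokens) {
  p <- as.numeric(table(tokens)) / length(tokens)
  -sum(p * log2(p))
}

tokens <- simple_tokens("The cat sat on the mat.")
type_token_ratio(tokens)  # 5 types / 6 tokens
token_entropy(tokens)
```

The sample sentence has six tokens and five types, because "the" occurs twice.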

These indices are computed by functions from several packages. Early in development, R was chosen over other programming languages with NLP libraries, such as Python, because it is widely used by people working in computational linguistics (the main target group of this application) and its packages cover their needs.

Processing scripts

Each category of indices (readability, lexical diversity, miscellaneous) has its own processing script: readability_indices.R, lexdiv_indices.R and misc_indices.R, located in the scripts folder.

Each script requires three command-line arguments: the path of the folder in which R's libraries are stored, -filePaths and -index. It writes its results in JSON format to a results_[index_category].json file in the temp folder. The example below uses misc_indices.R to find the tokens and vocabulary (types) of a txt file:

Rscript "C:\\Users\\panos\\Documents\\Projects\\gsoc2019-text-extraction\\src\\Built-in\\misc\\misc_indices.R"  "C:\\Users\\panos\\Documents\\R\\win-library\\3.6" -filePaths="C:\\Users\\panos\\Documents\\Projects\\gsoc2019-text-extraction\\data\\book1.txt" -index=tokens,vocabulary
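
As an illustration of how arguments in this form could be read inside a script, here is a minimal base-R sketch. The function name `parse_script_args`, the `libPath` field and the comma-splitting of every value are assumptions for this example (only -index is shown with a comma-separated value in the command above). It is not taken from the project's scripts.

```r
# Hypothetical helper: parse arguments of the form used in the command above.
parse_script_args <- function(args) {
  # The first positional argument is the R library folder.
  parsed <- list(libPath = args[1])
  for (arg in args[-1]) {
    # Match "-name=value" and capture the name and the value.
    m <- regmatches(arg, regexec("^-([^=]+)=(.*)$", arg))[[1]]
    if (length(m) == 3) {
      # Assumption: each value is a comma-separated list.
      parsed[[m[2]]] <- strsplit(m[3], ",", fixed = TRUE)[[1]]
    }
  }
  parsed
}

# In a script run with Rscript, the arguments would come from:
# args <- commandArgs(trailingOnly = TRUE)
args <- c("C:/R/library", "-filePaths=book1.txt", "-index=tokens,vocabulary")
opts <- parse_script_args(args)
opts$index  # "tokens" "vocabulary"
```

A script could then pass `opts$libPath` to `.libPaths()` before loading packages, although whether the project does so is not stated here.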

Besides R's built-in functions, these scripts use functions from two NLP packages, koRpus and quanteda. These packages were chosen because they cover the tool's needs (koRpus is used for POS tagging, quanteda for lexical diversity and readability indices), support many languages, are well documented and are easy to use.
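
For orientation, the sketch below shows what a call to quanteda's lexical diversity and readability functions can look like. It assumes the quanteda and quanteda.textstats packages are installed. In current quanteda releases the `textstat_*` functions live in quanteda.textstats, whereas releases before version 3 exported them from quanteda itself. The sample text and the selected measures are illustrative, and this is not the project's own code.

```r
# Assumes the quanteda and quanteda.textstats packages are installed.
library(quanteda)
library(quanteda.textstats)

text <- c(doc1 = "The cat sat on the mat. The dog lay on the rug.")

# Lexical diversity measures are computed from a tokens (or dfm) object.
toks <- tokens(text, remove_punct = TRUE)
lexdiv <- textstat_lexdiv(toks, measure = c("TTR", "CTTR"))

# Readability measures are computed from the raw text.
readab <- textstat_readability(text, measure = c("Flesch", "ARI"))

print(lexdiv)
print(readab)
```

Both calls return a data frame with one row per document and one column per requested measure.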
