Text Processing

The main goal of this project is to extract linguistic features from texts.

These features are indices, grouped into three categories, as shown below:

Readability	Lexical Diversity	Miscellaneous
ARI	TTR	Entropy
ARI NRI	C	Normalized Entropy
ARI simplified	R
BormuthMC	CTTR
BormuthGP	U
Coleman	S
ColemanC2	K
Coleman.Liau.ECP	D
Coleman.Liau.grade	Vm
Dale.Chall	Maas
Dale.Chall.old	MATTR
Dale.Chall.PSK	MSTTR
Danielson.Bryan	lgV0
Danielson.Bryan2	lgeV0
Dickes.Steiwer
DRP
ELF
Farr.Jenkins.Paterson
Flesch
Flesch PSK
Flesch.Kincaid
FOG
FOG PSK
FOG NRI
FORCAST
FORCAST Reading Grade Level
Fucks
Linsear.Write
LIW
nWS
nWS 2
nWS 3
nWS 4
RIX
Scrabble
SMOG
SMOG C
SMOG simplified
Spache
Spache.old
Strain
Traenkle.Bailer
Traenkle.Bailer 2
Wheeler.Smith
Mean Sentence Length
Mean Word Syllables
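
As a rough illustration of what a few of these indices measure, the base R sketch below computes tokens, vocabulary (types), TTR, entropy and normalized entropy for a short sample string. It assumes the textbook definitions (TTR as types divided by tokens, Shannon entropy in bits, normalized entropy as entropy divided by log2 of the vocabulary size) and a naive tokenizer that splits on non-letters; the project's scripts and the packages they rely on may tokenize and normalize differently. The sample text and all variable names are made up for this example.

```r
# Sample input; a real run would read the text from a file instead.
text <- "the cat sat on the mat and the dog sat too"

# Naive tokenizer (an assumption): lowercase, split on anything that is not a-z.
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
tokens <- tokens[nzchar(tokens)]  # drop empty strings left by leading separators

# Vocabulary = the distinct tokens (types).
vocabulary <- unique(tokens)

# Type-token ratio: number of types divided by number of tokens.
ttr <- length(vocabulary) / length(tokens)

# Shannon entropy (in bits) of the relative token frequencies.
p <- as.numeric(table(tokens)) / length(tokens)
entropy <- -sum(p * log2(p))

# Entropy scaled by its maximum for this vocabulary size (assumed definition).
normalized_entropy <- entropy / log2(length(vocabulary))

cat("tokens:", length(tokens), "\n")
cat("types:", length(vocabulary), "\n")
cat("TTR:", ttr, "\n")
cat("entropy:", entropy, "\n")
cat("normalized entropy:", normalized_entropy, "\n")
```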

These indices are computed by various functions from different packages. Early in development, R was chosen over other languages with NLP libraries, such as Python, because it is widely used in computational linguistics, whose practitioners are the main target group of this application, and its packages cover their needs.

Processing scripts

Each type of index (readability, lexical diversity, miscellaneous) has its own processing script: readability_indices.R, lexdiv_indices.R and misc_indices.R, all located in the scripts folder.

Each script requires three command-line arguments to work correctly: the path of the folder in which R's libraries are stored, "filePaths" and "index". Each script writes its results in JSON format to the file results_[index_category].json, located in the temp folder. The example below uses the misc_indices.R script to find the tokens and vocabulary (types) of a txt file:

Rscript "C:\\Users\\panos\\Documents\\Projects\\gsoc2019-text-extraction\\src\\Built-in\\misc\\misc_indices.R"  "C:\\Users\\panos\\Documents\\R\\win-library\\3.6" -filePaths="C:\\Users\\panos\\Documents\\Projects\\gsoc2019-text-extraction\\data\\book1.txt" -index=tokens,vocabulary
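
The argument layout used in the command above (one positional library path followed by -name=value options, with commas separating multiple values) could be parsed along the following lines. This is only a sketch in base R: parse_args is a hypothetical helper, not a function taken from the project's scripts, and the sample paths are placeholders.

```r
# Hypothetical helper: split a vector of command-line arguments into the
# positional library path and the named -name=value options.
parse_args <- function(args) {
  # Arguments that look like -name=value
  named <- grep("^-[A-Za-z]+=", args, value = TRUE)

  # The option name is the part between the leading "-" and the first "="
  keys <- sub("^-([A-Za-z]+)=.*$", "\\1", named)

  # The value part is split on commas, so one option can carry several values
  values <- strsplit(sub("^-[A-Za-z]+=", "", named), ",", fixed = TRUE)
  names(values) <- keys

  # The first argument that is not a named option is taken as the library path
  list(lib_path = setdiff(args, named)[1], options = values)
}

# Placeholder arguments standing in for what Rscript would pass to the script.
example_args <- c(
  "path/to/R/library",
  "-filePaths=data/book1.txt",
  "-index=tokens,vocabulary"
)

parsed <- parse_args(example_args)
print(parsed$lib_path)
print(parsed$options$filePaths)
print(parsed$options$index)
```

In an actual script the vector would come from commandArgs(trailingOnly = TRUE), and the library path could then be passed to .libPaths() before any packages are loaded.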

Apart from R's built-in functions, the above scripts use functions from two NLP packages, koRpus and quanteda. These packages were chosen because they cover the tool's needs (koRpus is used for POS tagging, quanteda for the lexical diversity and readability indices), support many languages, are well documented and are easy to use.
