Text Processing
The most important goal of this project is the extraction of linguistic features from texts.
These features consist of numerous indices, divided into three categories, as shown below:
| Readability | Lexical Diversity | Miscellaneous |
|---|---|---|
| ARI | TTR | Entropy |
| ARI NRI | C | Normalized Entropy |
| ARI simplified | R | |
| BormuthMC | CTTR | |
| BormuthGP | U | |
| Coleman | S | |
| ColemanC2 | K | |
| Coleman.Liau.ECP | D | |
| Coleman.Liau.grade | Vm | |
| Dale.Chall | Maas | |
| Dale.Chall.old | MATTR | |
| Dale.Chall.PSK | MSTTR | |
| Danielson.Bryan | lgV0 | |
| Danielson.Bryan2 | lgeV0 | |
| Dickes.Steiwer | | |
| DRP | | |
| ELF | | |
| Farr.Jenkins.Paterson | | |
| Flesch | | |
| Flesch PSK | | |
| Flesch.Kincaid | | |
| FOG | | |
| FOG PSK | | |
| FOG NRI | | |
| FORCAST | | |
| FORCAST Reading Grade Level | | |
| Fucks | | |
| Linsear.Write | | |
| LIW | | |
| nWS | | |
| nWS 2 | | |
| nWS 3 | | |
| nWS 4 | | |
| RIX | | |
| Scrabble | | |
| SMOG | | |
| SMOG C | | |
| SMOG simplified | | |
| Spache | | |
| Spache.old | | |
| Strain | | |
| Traenkle.Bailer | | |
| Traenkle.Bailer 2 | | |
| Wheeler.Smith | | |
| Mean Sentence Length | | |
| Mean Word Syllables | | |
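To make the nature of these indices concrete, here is a minimal base-R illustration of two of the simpler ones from the table, TTR and entropy, computed by hand on a toy text (the project's scripts use dedicated packages instead):

```r
# Toy text; real input comes from txt files.
text <- "the cat sat on the mat"
tokens <- strsplit(tolower(text), "\\s+")[[1]]
types <- unique(tokens)

# Type-Token Ratio (TTR): distinct words over total words
ttr <- length(types) / length(tokens)   # 5 types / 6 tokens

# Shannon entropy (in bits) of the word-frequency distribution
p <- table(tokens) / length(tokens)
entropy <- -sum(p * log2(p))

cat("TTR:", ttr, "\n")
cat("Entropy:", entropy, "\n")
```

Most of the other indices in the table follow the same pattern: a numeric summary of token, type, sentence, or syllable counts.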
The computation of these indices is carried out by various functions originating from different packages. During the early stages of development, it was decided to use R instead of other programming languages with NLP libraries, such as Python, because R is widely used by people who work in computational linguistics (the main target group of this application), and its packages therefore cover their needs completely.
For each category of indices (readability, lexical diversity, miscellaneous) there is a dedicated processing script (`readability_indices.R`, `lexdiv_indices.R`, and `misc_indices.R`, located in the scripts folder).
These scripts require three command-line arguments to work correctly: the path of the folder in which R's libraries are stored, `filePaths`, and `index`. Each script writes its results in JSON format to a `results_[index_category].json` file in the temp folder. The example below uses the `misc_indices.R` script to find the tokens and vocabulary (types) of a txt file:
```
Rscript "C:\\Users\\panos\\Documents\\Projects\\gsoc2019-text-extraction\\src\\Built-in\\misc\\misc_indices.R" "C:\\Users\\panos\\Documents\\R\\win-library\\3.6" -filePaths="C:\\Users\\panos\\Documents\\Projects\\gsoc2019-text-extraction\\data\\book1.txt" -index=tokens,vocabulary
```
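After a run, a caller can read the JSON results back into R. A minimal sketch, assuming the miscellaneous run produced a file named `results_misc.json` (following the `results_[index_category].json` pattern), that its contents are a flat name/value object (a hypothetical structure used here for illustration), and that the `jsonlite` package is installed:

```r
library(jsonlite)

# For illustration only: create a file shaped like the scripts' output.
# The structure and file name are assumptions; real files are written
# by the processing scripts into the temp folder.
writeLines('{"tokens": 1000, "vocabulary": 350}', "results_misc.json")

# Parse the JSON output and access individual indices by name.
results <- fromJSON("results_misc.json")
results$tokens
results$vocabulary
```
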
Apart from R's built-in functions, the above scripts use functions from two NLP packages, koRpus and quanteda. These libraries were chosen because they fully cover the tool's needs (koRpus is used for POS tagging, quanteda for the lexical-diversity and readability indices), support many languages, and are well documented and easy to use.
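As a sketch of the quanteda side, several of the lexical-diversity indices from the table above can be requested by name through `textstat_lexdiv()`. Exact availability depends on the quanteda version installed (in recent versions the function lives in the companion quanteda.textstats package):

```r
library(quanteda)

# Tokenize a toy document and build a document-feature matrix
# (dfm() lowercases by default, so "The" and "the" are merged).
toks <- tokens(c(doc1 = "The quick brown fox jumps over the lazy dog"))
dfmat <- dfm(toks)

# Request two of the indices listed in the Lexical Diversity column.
res <- textstat_lexdiv(dfmat, measure = c("TTR", "CTTR"))
res
```
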