PanagiotisP edited this page Aug 25, 2019 · 2 revisions

Motivation

The amount of text data available online is enormous and grows at an accelerating rate every day. From simple tweets, status updates and movie reviews to news, scientific papers and government legislation, text is a significant part of unstructured data. To make use of that data we need to analyze it, and since humans cannot process text at anything close to the speed at which it is generated, we need fast, automated computational techniques. Quantitative text analysis automates the processing of large amounts of text in order to perform various tasks, such as information retrieval, sentiment classification and stylometric analysis. Such analysis is currently performed with powerful programming tools and libraries, open source or not, written in many languages, such as Python and R, and backed by organized communities and individuals.

Unfortunately, each tool covers only a subset of the possible linguistic features, since tools are usually developed for a specific task. For example, Python’s spaCy has far fewer readability indices than R’s udpipe, but it implements named entity recognition (NER), which is absent from udpipe (a further comparison between these two packages can be found here). So, in order to obtain a unified result with every desired feature, different tools need to be integrated under a single platform. In addition, operating those tools requires strong technical skills and knowledge of numerous programming environments. These points make text analysis a strenuous task for the large community of scientists from the sociopolitical sciences and the humanities, who do not necessarily have strong programming skills. Therefore, my motivation is to develop a friendly Graphical User Interface (GUI) tool that combines many existing text analysis packages in order to extract linguistic features and quantitative text profiles from multilingual text. The extracted results are formatted in a way that allows them to be used for further analysis and processing. The tool facilitates text analysis and aims to make it accessible to everyone. Additionally, it is built in a way that ensures modularity and scalability, to make future development easier. A tool with this structure does not already exist (the existing GUIs are restricted to a single package and work as a demonstration of its capabilities), and I strongly believe that this work will boost research in all areas that require the processing of large amounts of text and will attract the open source community to build on it.

Project Goal

The ultimate goal of the project is the development of an integrated tool for quantitative text analysis: a tool that brings together a series of existing algorithms and computational models in a user-friendly and seamless way.
