Camel CoNLL

Introduction

Camel CoNLL is a suite of tools that helps annotators improve the quality of CoNLL-X files. It is designed specifically for Arabic, but some of the tools are language independent.

Current tools:

CATiB enrichment: converts CATiB part-of-speech tags and dependency relation labels to traditional Arabic tags and labels.
Comma fixer: fixes incorrectly attached commas while preserving attachment and projectivity rules of CoNLL files.
CoNLL evaluation: compares parsed CoNLL file(s) with gold CoNLL file(s).
CoNLL statistics: gets general statistics of one or more CoNLL files.
Well-formedness checker: checks if one or more CoNLL files adhere to a set of rules.

Installation

Clone this repo
Set up a virtual environment. The tools have been tested using Python 3.11.13.
Install the required packages:

pip install -r requirements.txt

Tools

CATiB enrichment

Convert the part-of-speech tags and dependency relation labels from CATiB to traditional Arabic tags and labels.

This conversion is done using a set of rules found in the latest map file, catib_enrichment/patterns_[version_number].

To run CATiB enrichment:

python catib_enrichment.py -i path/to/file(s) -o path/to/output

The -m parameter, which selects the map file, is optional; the latest stable map file is used by default.

Comma fixer

After files are run through a dependency parser, some trees may contain commas with incorrect attachments. The comma fixer takes a CoNLL file or a directory of CoNLL files and corrects these by attaching each such comma to a token behind it that is not the root and whose attachment does not cause non-projectivity.

To run the comma fixer:

python comma_fix.py -i [path/to/file/or/dir] -o [output/path/]

Note that if the input and output directories are the same, the fixed CoNLL files will have 'comma_fixed' appended to their names.
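The re-attachment rule described above can be illustrated with a minimal sketch. It is not the code in comma_fix.py: the function names are hypothetical, "behind it" is assumed to mean a token that precedes the comma, the nearest candidate is tried first, and the comma is assumed to have no children of its own.

```python
def is_projective(heads):
    """heads[i] is the 1-based head of token i+1; 0 marks the root."""
    # Treat every dependency as an undirected span (left end, right end).
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a_left, a_right in arcs:
        for b_left, b_right in arcs:
            # Two arcs cross when one starts inside the other and ends outside it.
            if a_left < b_left < a_right < b_right:
                return False
    return True


def reattach_comma(heads, comma_id):
    """Return a copy of heads with the comma re-attached, or unchanged if no head fits."""
    root_id = heads.index(0) + 1
    # Assumption: candidates are the tokens before the comma, nearest first.
    for candidate in range(comma_id - 1, 0, -1):
        if candidate == root_id:
            continue  # the comma must not attach to the root token
        trial = list(heads)
        trial[comma_id - 1] = candidate
        if is_projective(trial):
            return trial
    return list(heads)


# Made-up four-token tree in which token 4 is a comma attached to token 1.
fixed = reattach_comma([2, 0, 2, 1], comma_id=4)
print(fixed)
```

The quadratic crossing check is enough for sentence-length inputs; a real implementation would also have to handle commas that have dependents.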

CoNLL evaluation

See the CoNLL evaluation README for details of the tool and how to run it.
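For orientation only, parsed trees are commonly compared with gold trees using unlabeled and labeled attachment scores. The sketch below shows those two scores under the assumption that both trees have identical tokenization; it is not taken from conll_evaluation.py, whose actual metrics and handling of tokenization differences are described in its README. The function name is hypothetical.

```python
def attachment_scores(gold, parsed):
    """Return (UAS, LAS) for two equal-length lists of (head, deprel) pairs."""
    if not gold or len(gold) != len(parsed):
        raise ValueError("gold and parsed must be non-empty and of equal length")
    total = len(gold)
    # UAS: the head matches; LAS: the head and the relation label both match.
    uas = sum(g[0] == p[0] for g, p in zip(gold, parsed)) / total
    las = sum(g == p for g, p in zip(gold, parsed)) / total
    return uas, las


# Made-up three-token example: every head is right, one label is wrong.
gold = [(2, "SBJ"), (0, "---"), (2, "OBJ")]
parsed = [(2, "SBJ"), (0, "---"), (2, "MOD")]
uas, las = attachment_scores(gold, parsed)
print(f"UAS={uas:.3f} LAS={las:.3f}")
```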

CoNLL statistics

You can use the CoNLL statistics script to generate statistics for one or more CoNLL files with the following command:

python conll_stats.py -i [path/to/file/or/dir] -o [output/path/] [-flags]

Five flags can be added at the end to select which statistics are generated:
- w: words, statistics on the word level
- s: sentences, statistics on the sentence level
- p: pos_tags, the counts of the different part-of-speech tags
- d: deprel_labels, the counts of the different dependency relation labels
- l: leading, how many parent-child relations are led by the parent vs. by the child

For example, to generate word-level statistics, sentence-level statistics, and the counts of leading relations, use -wsl:

python conll_stats.py -i [path/to/file/or/dir] -o [output/path/] -wsl
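To make the kinds of counts concrete, here is a minimal sketch that computes similar figures from CoNLL-X text. It is not the code in conll_stats.py. The function name and the sample sentence are made up, the part-of-speech tag is read from the CPOSTAG column, and "led by the parent" is assumed to mean that the parent precedes the child in the sentence.

```python
from collections import Counter

# 0-based positions of the CoNLL-X columns used here
# (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL).
ID, CPOSTAG, HEAD, DEPREL = 0, 3, 6, 7


def conll_counts(text):
    """Count sentences, words, tags, labels, and arc directions in CoNLL-X text."""
    stats = {
        "sentences": 0,
        "words": 0,
        "pos_tags": Counter(),
        "deprel_labels": Counter(),
        "leading": Counter(),
    }
    # Sentences are separated by blank lines; each token is one tab-separated row.
    for block in text.strip().split("\n\n"):
        rows = [line.split("\t") for line in block.splitlines() if line.strip()]
        if not rows:
            continue
        stats["sentences"] += 1
        for cols in rows:
            token_id, head = int(cols[ID]), int(cols[HEAD])
            stats["words"] += 1
            stats["pos_tags"][cols[CPOSTAG]] += 1
            stats["deprel_labels"][cols[DEPREL]] += 1
            if head != 0:  # the root has no parent, so it has no direction
                # Assumption: "led by the parent" means the parent comes first.
                side = "parent_first" if head < token_id else "child_first"
                stats["leading"][side] += 1
    return stats


# Made-up three-token sentence, for illustration only.
SAMPLE = "\n".join(
    "\t".join(row)
    for row in [
        ("1", "w1", "_", "VRB", "VRB", "_", "0", "---", "_", "_"),
        ("2", "w2", "_", "NOM", "NOM", "_", "1", "SBJ", "_", "_"),
        ("3", ".", "_", "PNX", "PNX", "_", "1", "MOD", "_", "_"),
    ]
)

stats = conll_counts(SAMPLE)
print(stats["words"], stats["sentences"], dict(stats["leading"]))
```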

Well-formedness checker

You can pass a CoNLL file or directory of CoNLL files:

python wellformedness_checker.py -i [path/to/file/or/dir] -o [output/path/]

The checker uses the r13 database by default, but you can pass calima-msa-s31 instead. See the section Using another morphology database for details.

Using another morphology database

Currently, the well-formedness checker uses the default CAMeL Tools morphology database, morphology-db-msa-r13.

To use the calima-msa-s31 database, first install it by following these steps (note that you need an account with the LDC):

Install camel_tools v1.5.2 or later (you can check this using camel_data -v)
Download the camel data for the BERT unfactored (MSA) model, as well as the morphology database:

camel_data -i morphology-db-msa-s31

camel_data -i disambig-bert-unfactored-msa

Download LDC2010L01 from the LDC downloads page:
- Go to https://catalog.ldc.upenn.edu/organization/downloads
- Search for LDC2010L01.tgz and download it
DO NOT EXTRACT LDC2010L01.tgz! Use the following camel_tools command to install the database:

camel_data -p morphology-db-msa-s31 /path/to/LDC2010L01.tgz

When running the well-formedness checker script, use the -b parameter and pass calima-msa-s31.
