CAMel CoNLL is a suite of tools that helps improve CoNLL-X file quality for annotators. It is designed specifically for Arabic, but some of the tools are language independent.
Current tools:
- CATiB enrichment: converts CATiB part-of-speech tags and dependency relation labels to traditional Arabic tags and labels.
- Comma fixer: fixes incorrectly attached commas while preserving attachment and projectivity rules of CoNLL files.
- CoNLL evaluation: compares parsed CoNLL file(s) with gold CoNLL file(s).
- CoNLL statistics: gets general statistics of one or more CoNLL files.
- Well-formedness checker: checks if one or more CoNLL files adhere to a set of rules.
- Clone this repo
- Set up a virtual environment. The tools have been tested using Python 3.11.13.
- Install the required packages:
pip install -r requirements.txt
Convert the part-of-speech tags and dependency relation labels from CATiB to traditional Arabic tags and labels.
This conversion is done using a set of rules found in the latest map file, catib_enrichment/patterns_[version_number].
To run CATiB enrichment:
python catib_enrichment.py -i path/to/file(s) -o path/to/output
The -m parameter, which selects the map file, is optional; the latest stable map file is used by default.
After files are run through a dependency parser, some trees may contain commas with incorrect attachments. The comma fixer takes a CoNLL file or a directory of CoNLL files and corrects these attachments by attaching each such comma to a token behind it that is not the root and whose attachment does not cause non-projectivity.
To run the comma fixer:
python comma_fix.py -i [path/to/file/or/dir] -o [output/path/]
Note that if the input and output directories are the same, the fixed CoNLL files will have 'comma_fixed' appended to their names.
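The projectivity constraint described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in comma_fix.py: the function names `is_projective` and `candidate_heads` are hypothetical, "behind it" is assumed to mean "earlier in the sentence", "the root" is assumed to mean the token whose head is 0, and the comma is assumed to have no dependents of its own.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the head ID of token i + 1 (IDs are 1-based); 0 is the
    artificial root placed before the first token.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a_left, a_right in arcs:
        for b_left, b_right in arcs:
            # Arc b starts strictly inside arc a and ends strictly outside it.
            if a_left < b_left < a_right < b_right:
                return False
    return True


def candidate_heads(heads, comma_id):
    """List earlier, non-root tokens the comma could attach to projectively.

    Assumption: the comma has no dependents, so moving it cannot create a cycle.
    """
    root_ids = {i for i, h in enumerate(heads, start=1) if h == 0}
    candidates = []
    for new_head in range(comma_id - 1, 0, -1):  # nearest earlier token first
        if new_head in root_ids:
            continue
        trial = list(heads)
        trial[comma_id - 1] = new_head
        if is_projective(trial):
            candidates.append(new_head)
    return candidates


# Made-up five-token sentence: token 4 is a comma currently attached to token 5.
heads = [5, 1, 1, 5, 0]
print(candidate_heads(heads, comma_id=4))
```

In this made-up example, token 2 is rejected because an arc between tokens 2 and 4 would cross the existing arc between tokens 1 and 3. How the actual script chooses among the remaining candidates is not covered by this sketch.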
See the CoNLL evaluation README for details of the tool and how to run it.
You can use the CoNLL statistics script to generate statistics for one or more CoNLL files with the following command:
python conll_stats.py -i [path/to/file/or/dir] -o [output/path/] [-flags]
There are five flags that can be added at the end. They give statistics for:
- w: words, statistics on the word level
- s: sentences, statistics on the sentence level
- p: pos_tags, the counts of the different part-of-speech tags
- d: deprel_labels, the counts of the different dependency relation labels
- l: leading, how many parent-child relations are led by the parent vs. by the child
To generate word-level statistics, sentence-level statistics, and the counts of leading relations, use -wsl:
python conll_stats.py -i [path/to/file/or/dir] -o [output/path/] -wsl
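To make the flags more concrete, here is a rough sketch of how such counts could be computed from CoNLL-X text. It is not the implementation in conll_stats.py: the helper names, the output keys, and the sample rows are invented for illustration, and "led by the parent" is assumed to mean that the parent precedes the child in the sentence. The column positions follow the CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL).

```python
from collections import Counter

# Invented sample in CoNLL-X layout; forms are transliterated placeholders,
# not taken from any treebank.
ROWS = [
    ["1", "dhahaba", "_", "VRB", "VRB", "_", "0", "---", "_", "_"],
    ["2", "al-walad", "_", "NOM", "NOM", "_", "1", "SBJ", "_", "_"],
    ["3", ".", "_", "PNX", "PNX", "_", "1", "MOD", "_", "_"],
    [],  # a blank line separates sentences
    ["1", "al-walad", "_", "NOM", "NOM", "_", "2", "SBJ", "_", "_"],
    ["2", "dhahaba", "_", "VRB", "VRB", "_", "0", "---", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)


def read_sentences(text):
    """Split CoNLL-X text into sentences, each a list of column lists."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


def summarize(sentences):
    """Count words, sentences, POS tags, relation labels, and arc direction."""
    pos_tags, deprels, leading = Counter(), Counter(), Counter()
    for sentence in sentences:
        for cols in sentence:
            token_id, head = int(cols[0]), int(cols[6])
            pos_tags[cols[3]] += 1
            deprels[cols[7]] += 1
            if head != 0:  # arcs from the artificial root have no direction
                key = "parent_first" if head < token_id else "child_first"
                leading[key] += 1
    return {
        "words": sum(len(s) for s in sentences),
        "sentences": len(sentences),
        "pos_tags": pos_tags,
        "deprel_labels": deprels,
        "leading": leading,
    }


stats = summarize(read_sentences(SAMPLE))
print(stats)
```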
You can pass a CoNLL file or directory of CoNLL files:
python wellformedness_checker.py -i [path/to/file/or/dir] -o [output/path/]
The checker uses the r13 database by default, but you can pass calima-msa-s31. See the Databases section for details.
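The rules applied by the checker are not listed here, so the following is only a generic sketch of the kind of structural check a well-formedness tool might perform on the HEAD column of one sentence. The function name `check_tree` and the three rules (heads in range, at least one token attached to the root, no cycles) are assumptions chosen for illustration; they are not the checker's actual rule set and do not use the morphology database.

```python
def check_tree(heads):
    """Return a list of structural problems for one sentence.

    heads[i] is the head ID of token i + 1 (IDs are 1-based); 0 is the root.
    """
    problems = []
    n = len(heads)
    for token_id, head in enumerate(heads, start=1):
        if not 0 <= head <= n:
            problems.append(f"token {token_id}: head {head} is out of range")
        elif head == token_id:
            problems.append(f"token {token_id}: attached to itself")
    if 0 not in heads:
        problems.append("no token is attached to the root (head 0)")
    if problems:
        return problems  # skip the cycle check when heads are unusable

    for token_id in range(1, n + 1):
        seen, node = set(), token_id
        # Follow the chain of heads until the root or a repeated node is hit.
        while node != 0 and node not in seen:
            seen.add(node)
            node = heads[node - 1]
        if node != 0:
            problems.append(f"token {token_id}: head chain never reaches the root")
    return problems


print(check_tree([0, 1, 1]))  # a well-formed three-token tree
print(check_tree([2, 1, 0]))  # tokens 1 and 2 point at each other
```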
Currently, the Well-formedness checker uses CAMeLTools' default morphology database, morphology-db-msa-r13.
To use the calima-msa-s31 database, you must install it first. Follow these steps (note that you need an account with the LDC):
- Install camel_tools v1.5.2 or later (you can check the installed version using camel_data -v)
- Download the camel data for the BERT unfactored (MSA) model, as well as the morphology database:
camel_data -i morphology-db-msa-s31
camel_data -i disambig-bert-unfactored-msa
- Download LDC2010L01 from the LDC downloads page:
- Go to https://catalog.ldc.upenn.edu/organization/downloads
- Search for LDC2010L01.tgz and download it
- DO NOT EXTRACT LDC2010L01.tgz! Use the following camel_data command to install the database:
camel_data -p morphology-db-msa-s31 /path/to/LDC2010L01.tgz
- When running the Well-formedness checker script, use -b and pass calima-msa-s31.