Commit 4941749

committed
README
1 parent f3313b9 commit 4941749

1 file changed

+88
-1
lines changed

README.md

@@ -1 +1,88 @@
1-
# added_annotations
1+
# Added annotations: EMICSS (**EM**DB **I**ntegration with **C**omplexes, **S**tructures and **S**equences)
2+
3+
This repository provides tools and scripts that extract annotations from external resources and add them to EMDB entries, enriching the metadata associated with EM datasets.
4+
5+
### Table of Contents
6+
7+
* Installation
8+
* Configuration
9+
* Usage
10+
* Contributing
11+
* License
12+
13+
### Installation
14+
15+
To install the necessary dependencies, run:
16+
pip install -r requirements.txt
17+
18+
### Configuration
19+
20+
The scripts read their settings from a config.ini file, which is not included in the repository. Create this file in the root directory of the project with the following structure:
21+
22+
[file_paths]
23+
uniprot_tab: <path_to_file>/uniprot.tsv
24+
CP_ftp: <path_to_file>/complextab
25+
components_cif: <path_to_file>/components.cif
26+
chem_comp_list: <path_to_file>/chem_comp_list.xml
27+
pmc_ftp_gz: <path_to_file>/PMID_PMCID_DOI.csv.gz
28+
pmc_ftp: <path_to_file>/PMID_PMCID_DOI.csv
29+
emdb_pubmed: <path_to_file>/emdb_pubmed.log
30+
emdb_orcid: <path_to_file>/emdb_orcid.log
31+
assembly_ftp: <path_to_file>/assembly/
32+
BLAST_DB: <path_to_file>/ncbi-blast-2.13.0+/database/uniprot_sprot
33+
BLASTP_BIN: blastp
34+
sifts_GO: <path_to_file>/pdb_chain_go.csv
35+
GO_obo: <path_to_file>/go.obo
36+
GO_interpro: /nfs/ftp/pub/databases/GO/goa/external2go/interpro2go
37+
sifts: <path_to_file>/split_xml/
38+
alphafold_ftp: <path_to_file>/accession_ids.txt
39+
rfam_ftp: <path_to_file>/rfam_files_combined.txt
40+
41+
[api]
42+
pmc: https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST
43+
44+
#### File Sources and Download Links
45+
| File        | Description           | Download Link                                                                                                                                      |
46+
|-------------|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|
47+
| uniprot.tsv | UniProt annotations   | https://rest.uniprot.org/uniprotkb/stream?fields=accession,xref_pdb,protein_name&query=((database:pdb))&format=tsv&compressed=false                |
48+
| complextab | Complex Portal data | https://ftp.ebi.ac.uk/pub/databases/complexportal/complexes.tab.gz |
49+
| components.cif | Chemical components data | https://ftp.ebi.ac.uk/pub/databases/msd/pdbechem_v2/ccd/components.cif |
50+
| chem_comp_list.xml | Chemical component list | https://ftp.ebi.ac.uk/pub/databases/msd/pdbechem_v2/ccd/chem_comp_list.xml |
51+
| PMID_PMCID_DOI.csv.gz | Europe PMC dataset (compressed) | https://europepmc.org/pub/databases/pmc/DOI/PMID_PMCID_DOI.csv.gz |
52+
| PMID_PMCID_DOI.csv | Unzipped version of the Europe PMC dataset | https://ftp.ebi.ac.uk/pub/databases/pmc/DOI/PMID_PMCID_DOI.csv |
53+
| emdb_pubmed | Mapping file created after running PublicationMapping.py | emdb_pubmed.log |
54+
| emdb_orcid | Mapping file created after running PublicationMapping.py | emdb_orcid.log |
55+
| assembly_ftp | PDB assemblies | https://ftp.ebi.ac.uk/pub/databases/msd/assemblies/split/ |
56+
| BLAST_DB | UniProt BLAST database | https://ftp.uniprot.org/pub/databases/uniprot/uniprot_sprot/uniprot_sprot.fasta.gz |
57+
| sifts_GO | PDB chain Gene Ontology mapping | https://ftp.ebi.ac.uk/pub/databases/msd/sifts/pdb_chain_go.csv |
58+
| GO_obo | Gene Ontology definitions | https://current.geneontology.org/ontology/go.obo |
59+
| GO_interpro | InterPro to GO mapping | https://ftp.ebi.ac.uk/pub/databases/GO/goa/external2go/interpro2go |
60+
| sifts | SIFTS data | https://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/ |
61+
| alphafold_ftp | AlphaFold DB accession IDs | https://ftp.ebi.ac.uk/pub/databases/alphafold/accession_ids.csv |
62+
| rfam_ftp | RFAM files | https://www.ebi.ac.uk/pdbe/search/pdb/select?q=emdb_id:*%20AND%20rfam:%5B*%20TO%20*%5D&wt=csv&fl=emdb_id,pdb_id,rfam,rfam_id,entity_id&rows=9999999 |
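
The table lists PMID_PMCID_DOI.csv in both compressed and unzipped form, and config.ini expects a path for each (pmc_ftp_gz and pmc_ftp). If only the .gz file has been downloaded, the unzipped copy can be produced locally. The following is a minimal sketch using only the Python standard library; the function name gunzip_file is illustrative and is not part of this repository.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path, out_path=None):
    """Decompress gz_path and return the path of the unzipped file.

    If out_path is omitted, the trailing ".gz" is dropped from gz_path,
    so the input name is assumed to end in ".gz".
    """
    gz_path = Path(gz_path)
    out_path = Path(out_path) if out_path else gz_path.with_suffix("")
    # Copy in chunks so the mapping file is never held fully in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

Calling gunzip_file on the downloaded PMID_PMCID_DOI.csv.gz writes PMID_PMCID_DOI.csv next to it, which is the file the pmc_ftp entry should point to.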
63+
64+
Download EMDB metadata files from https://ftp.ebi.ac.uk/pub/databases/emdb/structures/EMD-xxxx/header/emd-xxxx-v30.xml (replace "xxxx" with the EMDB accession number).
65+
Download EMPIAR metadata files from https://ftp.ebi.ac.uk/pub/databases/emtest/empiar/headers/xxxxx.xml (replace "xxxxx" with the EMPIAR accession number).
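
When many entries are needed, the two URL patterns above can be filled in programmatically. This is a small sketch, not code from this repository: the function names are hypothetical, and the accession is substituted verbatim, so it is assumed that you pass exactly the text that should replace the "x" placeholder.

```python
# URL templates copied from the two download instructions above.
EMDB_HEADER_URL = (
    "https://ftp.ebi.ac.uk/pub/databases/emdb/structures/"
    "EMD-{acc}/header/emd-{acc}-v30.xml"
)
EMPIAR_HEADER_URL = (
    "https://ftp.ebi.ac.uk/pub/databases/emtest/empiar/headers/{acc}.xml"
)


def emdb_header_url(accession):
    """Return the header XML URL for an EMDB accession (number only)."""
    return EMDB_HEADER_URL.format(acc=str(accession))


def empiar_header_url(accession):
    """Return the header XML URL for an EMPIAR accession, inserted as given."""
    return EMPIAR_HEADER_URL.format(acc=str(accession))
```

The returned strings can be passed to any download tool; no network access happens inside these helpers.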
66+
Replace <path_to_file> with the base directory where you store the files locally. Ensure that every required file is downloaded and that its path in config.ini is correct. An active internet connection is needed at run time, because the API endpoint configured under [api] is queried during execution.
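
A quick way to catch entries that were copied from the template but never edited is to scan config.ini for the literal <path_to_file> placeholder. The sketch below uses Python's standard configparser module, which accepts the "key: value" syntax shown above. The function name unresolved_entries is an assumption made for illustration, and this is not how the repository's own scripts load the file.

```python
import configparser

PLACEHOLDER = "<path_to_file>"


def unresolved_entries(config_text):
    """Return the [file_paths] keys whose value still contains the placeholder."""
    # interpolation=None keeps any "%" in paths or URLs literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # configparser lower-cases keys, so CP_ftp is reported as cp_ftp.
    return [
        key
        for key, value in parser["file_paths"].items()
        if PLACEHOLDER in value
    ]
```

Passing the contents of config.ini to unresolved_entries returns an empty list once every placeholder has been replaced. It does not check that the paths exist on disk.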
67+
68+
### Usage
69+
70+
To use the tools and scripts in this repository, follow these steps:
71+
Clone the repository:
72+
git clone https://github.com/emdb-empiar/added_annotations.git
73+
cd added_annotations
74+
75+
Ensure the config.ini file is properly configured as described above.
76+
77+
#### Executing the scripts:
78+
79+
Execute the scripts independently in the following recommended order:
80+
* fetch_empiar.py: python fetch_empiar.py -w <output_dir_to_store_annotated_empiar_files> -f <path_to_empiar_metadata_files>
81+
* fetch_pubmed.py: python fetch_pubmed.py -w <output_dir_to_store_annotated_pubmed_files> -f <path_to_emdb_metadata_files>
82+
* added_annotations.py: python added_annotations.py -w <output_dir_to_store_added_annotations> -f <path_to_emdb_metadata_files> --all -t<number_of_threads>
83+
* fetch_afdb.py: python fetch_afdb.py -w <output_dir_to_store_annotated_alphafdb_files>
84+
* write_xml.py: python write_xml.py <output_dir_to_store_EMICSS_xml_files>
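
The recommended order above can also be driven from a small wrapper. The following is a sketch under stated assumptions: build_commands and run_all are hypothetical helpers, not part of this repository, and the arguments are passed exactly as written in the list above (including -t joined to the thread count). It must be run from the repository root so that the script names resolve.

```python
import subprocess
import sys


def build_commands(emdb_headers, empiar_headers, empiar_out, pubmed_out,
                   annotations_out, afdb_out, xml_out, threads=1):
    """Return the five commands from the list above, in order, as argument lists."""
    py = sys.executable  # run the scripts with the current interpreter
    return [
        [py, "fetch_empiar.py", "-w", empiar_out, "-f", empiar_headers],
        [py, "fetch_pubmed.py", "-w", pubmed_out, "-f", emdb_headers],
        [py, "added_annotations.py", "-w", annotations_out,
         "-f", emdb_headers, "--all", f"-t{threads}"],
        [py, "fetch_afdb.py", "-w", afdb_out],
        [py, "write_xml.py", xml_out],
    ]


def run_all(commands):
    """Run each command in turn; check=True stops at the first failing script."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the argument lists separately from running them makes it easy to print and inspect the commands before anything is executed.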
85+
86+
### Landing Page
87+
88+
For more information about EMICSS, visit the official landing page (https://www.ebi.ac.uk/emdb/emicss), which describes the EMDB/EMICSS project in detail.
