| 1 | -# added_annotations |
| 1 | +# Added annotations: EMICSS (**EM**DB **I**ntegration with **C**omplexes, **S**tructures and **S**equences) |
| 2 | + |
| 3 | +This repository provides tools and scripts for extracting annotations and adding them to EMDB entries. The annotations enrich the metadata associated with EM datasets. |
| 4 | + |
| 5 | +### Table of Contents |
| 6 | + |
| 7 | +* Installation |
| 8 | +* Configuration |
| 9 | +* Usage |
| 10 | +* Contributing |
| 11 | +* License |
| 12 | + |
| 13 | +### Installation |
| 14 | + |
| 15 | +To install the necessary dependencies, run: |
| 16 | +pip install -r requirements.txt |
| 17 | + |
| 18 | +### Configuration |
| 19 | + |
| 20 | +The scripts read their settings from a config.ini file, which is not included in the repository. Create this file in the root directory of the project with the following structure: |
| 21 | + |
| 22 | +[file_paths] |
| 23 | +uniprot_tab: <path_to_file>/uniprot.tsv |
| 24 | +CP_ftp: <path_to_file>/complextab |
| 25 | +components_cif: <path_to_file>/components.cif |
| 26 | +chem_comp_list: <path_to_file>/chem_comp_list.xml |
| 27 | +pmc_ftp_gz: <path_to_file>/PMID_PMCID_DOI.csv.gz |
| 28 | +pmc_ftp: <path_to_file>/PMID_PMCID_DOI.csv |
| 29 | +emdb_pubmed: <path_to_file>/emdb_pubmed.log |
| 30 | +emdb_orcid: <path_to_file>/emdb_orcid.log |
| 31 | +assembly_ftp: <path_to_file>/assembly/ |
| 32 | +BLAST_DB: <path_to_file>/ncbi-blast-2.13.0+/database/uniprot_sprot |
| 33 | +BLASTP_BIN: blastp |
| 34 | +sifts_GO: <path_to_file>/pdb_chain_go.csv |
| 35 | +GO_obo: <path_to_file>/go.obo |
| 36 | +GO_interpro: /nfs/ftp/pub/databases/GO/goa/external2go/interpro2go |
| 37 | +sifts: <path_to_file>/split_xml/ |
| 38 | +alphafold_ftp: <path_to_file>/accession_ids.txt |
| 39 | +rfam_ftp: <path_to_file>/rfam_files_combined.txt |
| 40 | + |
| 41 | +[api] |
| 42 | +pmc: https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST |
| 43 | + |
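The structure above uses the key: value INI syntax that Python's standard configparser module can read. The following is a minimal sketch, not the repository's own loading code, of how such a file could be parsed and checked for local paths that do not exist yet. The function names load_config and missing_paths are hypothetical. Skipping BLAST_DB and BLASTP_BIN is an assumption: the first appears to be a database name prefix and the second an executable name, so neither is expected to exist as a plain file.

```python
import configparser
from pathlib import Path


def load_config(path):
    """Parse an INI file such as config.ini and return the parser object."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    return parser


def missing_paths(parser, section="file_paths", skip=("blast_db", "blastp_bin")):
    """Return (key, value) pairs from `section` whose path does not exist locally."""
    missing = []
    for key, value in parser.items(section):
        # configparser lower-cases option names, so CP_ftp is seen as cp_ftp.
        if key in skip:
            continue
        if not Path(value).exists():
            missing.append((key, value))
    return missing
```

Calling missing_paths(load_config("config.ini")) from the project root would then list the entries that still point at files or directories that have not been downloaded.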
| 44 | +#### File Sources and Download Links |
| 45 | +| File | Description | Download Link | |
| 46 | +|-------------|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------| |
| 47 | +| uniprot.tsv | UniProt annotations | https://rest.uniprot.org/uniprotkb/stream?fields=accession,xref_pdb,protein_name&query=((database:pdb))&format=tsv&compressed=false | |
| 48 | +| complextab | Complex Portal data | https://ftp.ebi.ac.uk/pub/databases/complexportal/complexes.tab.gz | |
| 49 | +| components.cif | Chemical components data | https://ftp.ebi.ac.uk/pub/databases/msd/pdbechem_v2/ccd/components.cif | |
| 50 | +| chem_comp_list.xml | Chemical component list | https://ftp.ebi.ac.uk/pub/databases/msd/pdbechem_v2/ccd/chem_comp_list.xml | |
| 51 | +| PMID_PMCID_DOI.csv.gz | Europe PMC dataset (compressed) | https://europepmc.org/pub/databases/pmc/DOI/PMID_PMCID_DOI.csv.gz | |
| 52 | +| PMID_PMCID_DOI.csv | Unzipped version of the Europe PMC dataset | https://ftp.ebi.ac.uk/pub/databases/pmc/DOI/PMID_PMCID_DOI.csv | |
| 53 | +| emdb_pubmed | Mapping file created after running PublicationMapping.py | emdb_pubmed.log | |
| 54 | +| emdb_orcid | Mapping file created after running PublicationMapping.py | emdb_orcid.log | |
| 55 | +| assembly_ftp | PDB assemblies | https://ftp.ebi.ac.uk/pub/databases/msd/assemblies/split/ | |
| 56 | +| BLAST_DB | UniProt BLAST database | https://ftp.uniprot.org/pub/databases/uniprot/uniprot_sprot/uniprot_sprot.fasta.gz | |
| 57 | +| sifts_GO | PDB chain Gene Ontology mapping | https://ftp.ebi.ac.uk/pub/databases/msd/sifts/pdb_chain_go.csv | |
| 58 | +| GO_obo | Gene Ontology definitions | https://current.geneontology.org/ontology/go.obo | |
| 59 | +| GO_interpro | InterPro to GO mapping | https://ftp.ebi.ac.uk/pub/databases/GO/goa/external2go/interpro2go | |
| 60 | +| sifts | SIFTS data | https://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/ | |
| 61 | +| alphafold_ftp | AlphaFold DB accession IDs | https://ftp.ebi.ac.uk/pub/databases/alphafold/accession_ids.csv | |
| 62 | +| rfam_ftp | RFAM files | https://www.ebi.ac.uk/pdbe/search/pdb/select?q=emdb_id:*%20AND%20rfam:%5B*%20TO%20*%5D&wt=csv&fl=emdb_id,pdb_id,rfam,rfam_id,entity_id&rows=9999999 | |
| 63 | + |
| 64 | +Download EMDB metadata files from https://ftp.ebi.ac.uk/pub/databases/emdb/structures/EMD-xxxx/header/emd-xxxx-v30.xml (replace "xxxx" with the EMDB accession number).
| 65 | +Download EMPIAR metadata files from https://ftp.ebi.ac.uk/pub/databases/emtest/empiar/headers/xxxxx.xml (replace "xxxxx" with the EMPIAR accession number).
| 66 | +Replace <path_to_file> with the base directory where you store the files locally. Ensure that all required files are downloaded and referenced correctly in config.ini. An active internet connection is required at run time so that the scripts can query the API endpoint.
| 67 | + |
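The EMDB header URL pattern given above can be filled in programmatically when many entries are needed. This is a small sketch under stated assumptions: the helper name emdb_header_url is hypothetical, and it assumes the accession consists of digits with an optional EMD- prefix. Only the URL template itself is taken from the text above.

```python
import re

# URL template copied from the download instructions; {number} replaces "xxxx".
EMDB_HEADER_TEMPLATE = (
    "https://ftp.ebi.ac.uk/pub/databases/emdb/structures/"
    "EMD-{number}/header/emd-{number}-v30.xml"
)


def emdb_header_url(accession):
    """Build the header URL for an accession such as 'EMD-1234' or '1234'."""
    match = re.fullmatch(r"(?:EMD-)?(\d+)", accession.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised EMDB accession: {accession!r}")
    return EMDB_HEADER_TEMPLATE.format(number=match.group(1))
```

The returned string can be passed to any download tool; the helper itself performs no network access.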
| 68 | +### Usage |
| 69 | + |
| 70 | +To use the tools and scripts in this repository, follow these steps: |
| 71 | +Clone the repository: |
| 72 | +git clone https://github.com/emdb-empiar/added_annotations.git |
| 73 | +cd added_annotations |
| 74 | + |
| 75 | +Ensure the config.ini file is properly configured as described above. |
| 76 | + |
| 77 | +#### Executing the scripts: |
| 78 | + |
| 79 | +Execute the scripts independently in the following recommended order: |
| 80 | +* fetch_empiar.py: python fetch_empiar.py -w <output_dir_to_store_annotated_empiar_files> -f <path_to_empiar_metadata_files> |
| 81 | +* fetch_pubmed.py: python fetch_pubmed.py -w <output_dir_to_store_annotated_pubmed_files> -f <path_to_emdb_metadata_files> |
| 82 | +* added_annotations.py: python added_annotations.py -w <output_dir_to_store_added_annotations> -f <path_to_emdb_metadata_files> --all -t <number_of_threads> |
| 83 | +* fetch_afdb.py: python fetch_afdb.py -w <output_dir_to_store_annotated_alphafdb_files> |
| 84 | +* write_xml.py: python write_xml.py <output_dir_to_store_EMICSS_xml_files> |
| 85 | + |
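The recommended order above can also be captured in a small driver so that a failing step stops the run. This is an illustrative sketch, not part of the repository: build_commands, run_all, and the dictionary labels for the output directories are hypothetical, and passing the thread count as a separate argument after -t assumes the script uses standard argparse-style option parsing. Script names and flags are those listed above.

```python
import subprocess
import sys


def build_commands(emdb_headers, empiar_headers, out, threads=4):
    """Return the five command lines in the recommended order.

    `out` maps a short label (hypothetical) to an output directory.
    """
    py = sys.executable  # run the scripts with the current interpreter
    return [
        [py, "fetch_empiar.py", "-w", out["empiar"], "-f", empiar_headers],
        [py, "fetch_pubmed.py", "-w", out["pubmed"], "-f", emdb_headers],
        [py, "added_annotations.py", "-w", out["annotations"],
         "-f", emdb_headers, "--all", "-t", str(threads)],
        [py, "fetch_afdb.py", "-w", out["afdb"]],
        [py, "write_xml.py", out["emicss_xml"]],
    ]


def run_all(commands):
    """Run each command in turn; check=True raises on the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling run_all(build_commands(...)) from the repository root would execute the steps sequentially. Keeping command construction separate from execution makes it possible to inspect or log the commands before running them.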
| 86 | +### Landing Page |
| 87 | + |
| 88 | +For more information about EMICSS, visit the official landing page (https://www.ebi.ac.uk/emdb/emicss), which describes the EMDB/EMICSS project in detail.