| 1 | -# added_annotations |
| 1 | +# Added annotations: EMICSS (**EM**DB **I**ntegration with **C**omplexes, **S**tructures and **S**equences) |
| 2 | + |
| 3 | +This repository provides tools and scripts for extracting and adding annotations to EMDB entries, which are used to enhance the metadata associated with EM datasets. |
| 4 | + |
| 5 | +### Table of Contents |
| 6 | + |
| 7 | +* Installation |
| 8 | +* Configuration |
| 9 | +* Usage |
| 10 | +* Contributing |
| 11 | +* License |
| 12 | + |
| 13 | +### Installation |
| 14 | + |
| 15 | +To install the necessary dependencies, run: |
| 16 | +`pip install -r requirements.txt` |
| 17 | + |
| 18 | +### Configuration |
| 19 | + |
| 20 | +The repository uses a config.ini file for configuration, which is not included in the repository. This file should be created in the root directory of the project with the following structure: |
| 21 | + |
| 22 | +``` |
| 23 | +[file_paths] |
| 24 | +uniprot_tab: <path_to_file>/uniprot.tsv |
| 25 | +CP_ftp: <path_to_file>/complextab |
| 26 | +components_cif: <path_to_file>/components.cif |
| 27 | +chem_comp_list: <path_to_file>/chem_comp_list.xml |
| 28 | +pmc_ftp_gz: <path_to_file>/PMID_PMCID_DOI.csv.gz |
| 29 | +pmc_ftp: <path_to_file>/PMID_PMCID_DOI.csv |
| 30 | +emdb_pubmed: <path_to_file>/emdb_pubmed.log |
| 31 | +emdb_orcid: <path_to_file>/emdb_orcid.log |
| 32 | +assembly_ftp: <path_to_file>/assembly/ |
| 33 | +BLAST_DB: <path_to_file>/ncbi-blast-2.13.0+/database/uniprot_sprot |
| 34 | +BLASTP_BIN: blastp |
| 35 | +sifts_GO: <path_to_file>/pdb_chain_go.csv |
| 36 | +GO_obo: <path_to_file>/go.obo |
| 37 | +GO_interpro: /nfs/ftp/pub/databases/GO/goa/external2go/interpro2go |
| 38 | +sifts: <path_to_file>/split_xml/ |
| 39 | +alphafold_ftp: <path_to_file>/accession_ids.txt |
| 40 | +rfam_ftp: <path_to_file>/rfam_files_combined.txt |
| 41 | + |
| 42 | +[api] |
| 43 | +pmc: https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST |
| 44 | +``` |
| 45 | + |
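As a minimal sketch of how a file with this layout can be read, the snippet below uses Python's standard `configparser`, which accepts the `key: value` syntax shown above. The helper names `load_config` and `missing_keys` and the short `REQUIRED_KEYS` list are illustrative assumptions, not part of this repository.

```python
import configparser

# Hypothetical subset of keys to check; extend it to match the scripts you run.
REQUIRED_KEYS = ("uniprot_tab", "CP_ftp", "components_cif", "GO_obo")


def load_config(path):
    """Parse config.ini; interpolation is disabled so '%' in URLs is kept literally."""
    parser = configparser.ConfigParser(interpolation=None)
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    return parser


def missing_keys(parser, section="file_paths", required=REQUIRED_KEYS):
    """Return the required keys that are absent from the given section."""
    if not parser.has_section(section):
        return list(required)
    return [key for key in required if not parser.has_option(section, key)]
```

Calling `missing_keys(load_config("config.ini"))` before a long run is one way to catch an incomplete configuration early. Note that `configparser` matches option names case-insensitively by default, so `CP_ftp` and `cp_ftp` refer to the same entry.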
| 46 | +#### File Sources and Download Links |
| 47 | +| File | Description | Download Link | |
| 48 | +|-------------|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------| |
| 49 | +| uniprot.tsv | UniProt annotations   | https://rest.uniprot.org/uniprotkb/stream?fields=accession,xref_pdb,protein_name&query=((database:pdb))&format=tsv&compressed=false | |
| 50 | +| complextab | Complex Portal data | https://ftp.ebi.ac.uk/pub/databases/complexportal/complexes.tab.gz | |
| 51 | +| components.cif | Chemical components data | https://ftp.ebi.ac.uk/pub/databases/msd/pdbechem_v2/ccd/components.cif | |
| 52 | +| chem_comp_list.xml | Chemical component list | https://ftp.ebi.ac.uk/pub/databases/msd/pdbechem_v2/ccd/chem_comp_list.xml | |
| 53 | +| PMID_PMCID_DOI.csv.gz | Europe PMC dataset (compressed) | https://europepmc.org/pub/databases/pmc/DOI/PMID_PMCID_DOI.csv.gz | |
| 54 | +| PMID_PMCID_DOI.csv | Unzipped version of the Europe PMC dataset | https://ftp.ebi.ac.uk/pub/databases/pmc/DOI/PMID_PMCID_DOI.csv | |
| 55 | +| emdb_pubmed | Mapping file created after running PublicationMapping.py | emdb_pubmed.log | |
| 56 | +| emdb_orcid | Mapping file created after running PublicationMapping.py | emdb_orcid.log | |
| 57 | +| assembly_ftp | PDB assemblies | https://ftp.ebi.ac.uk/pub/databases/msd/assemblies/split/ | |
| 58 | +| BLAST_DB | UniProt BLAST database | https://ftp.uniprot.org/pub/databases/uniprot/uniprot_sprot/uniprot_sprot.fasta.gz | |
| 59 | +| sifts_GO | PDB chain Gene Ontology mapping | https://ftp.ebi.ac.uk/pub/databases/msd/sifts/pdb_chain_go.csv | |
| 60 | +| GO_obo | Gene Ontology definitions | https://current.geneontology.org/ontology/go.obo | |
| 61 | +| GO_interpro | InterPro to GO mapping | https://ftp.ebi.ac.uk/pub/databases/GO/goa/external2go/interpro2go | |
| 62 | +| sifts | SIFTS data | https://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/ | |
| 63 | +| alphafold_ftp | AlphaFold DB accession IDs | https://ftp.ebi.ac.uk/pub/databases/alphafold/accession_ids.csv | |
| 64 | +| rfam_ftp | RFAM files | https://www.ebi.ac.uk/pdbe/search/pdb/select?q=emdb_id:*%20AND%20rfam:%5B*%20TO%20*%5D&wt=csv&fl=emdb_id,pdb_id,rfam,rfam_id,entity_id&rows=9999999 | |
| 65 | +| emd-xxxx-v30.xml | EMDB metadata | https://ftp.ebi.ac.uk/pub/databases/emdb/ | |
| 66 | +| xxxxx.xml | EMPIAR metadata | https://ftp.ebi.ac.uk/pub/databases/emtest/empiar | |
| 67 | + |
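The files in the table above can be fetched by hand or with a small script. The following is only a sketch using the Python standard library; the function names `download`, `gunzip` and `fetch_pmc_mapping` are hypothetical, and the target directory is whatever you point the `pmc_ftp_gz` and `pmc_ftp` entries of config.ini at.

```python
import gzip
import os
import shutil
import urllib.request

# URL taken from the table above.
PMC_GZ_URL = "https://europepmc.org/pub/databases/pmc/DOI/PMID_PMCID_DOI.csv.gz"


def download(url, dest):
    """Fetch url and save it at dest; returns dest."""
    urllib.request.urlretrieve(url, dest)
    return dest


def gunzip(src, dest):
    """Decompress the gzip file src into dest; returns dest."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def fetch_pmc_mapping(target_dir):
    """Download the Europe PMC mapping and unpack it next to the archive."""
    gz_path = download(PMC_GZ_URL, os.path.join(target_dir, "PMID_PMCID_DOI.csv.gz"))
    return gunzip(gz_path, os.path.join(target_dir, "PMID_PMCID_DOI.csv"))
```

For example, `fetch_pmc_mapping("/data/emicss")` would leave both the compressed and the unpacked file in `/data/emicss` (an example path). The same `download` helper can be reused for the other single-file links in the table.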
| 68 | +### Usage |
| 69 | + |
| 70 | +To use the tools and scripts in this repository, clone it and make sure the config.ini file is set up as described above. |
| 71 | + |
| 72 | +#### Executing the scripts: |
| 73 | + |
| 74 | +Execute the scripts independently in the following recommended order: |
| 75 | +##### EMPIAR mapping |
| 76 | +``` |
| 77 | +python fetch_empiar.py -w <output_dir_to_store_annotated_empiar_files> -f <path_to_empiar_metadata_files> |
| 78 | +``` |
| 79 | +##### Publication mapping |
| 80 | +``` |
| 81 | +python fetch_pubmed.py -w <output_dir_to_store_annotated_pubmed_files> -f <path_to_emdb_metadata_files> |
| 82 | +``` |
| 83 | +##### Protein, complexes and ligands mapping |
| 84 | +``` |
| 85 | +python added_annotations.py -w <output_dir_to_store_added_annotations> -f <path_to_emdb_metadata_files> --all -t <number_of_threads> |
| 86 | +``` |
| 87 | +##### AlphaFold DB mapping |
| 88 | +``` |
| 89 | +python fetch_afdb.py -w <output_dir_to_store_annotated_alphafdb_files> |
| 90 | +``` |
| 91 | +##### Write files |
| 92 | +``` |
| 93 | +python write_xml.py <output_dir_to_store_EMICSS_xml_files> |
| 94 | +``` |
| 95 | + |
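The recommended order above can also be scripted. The sketch below builds the five commands with the flags listed in this section and runs them one after another, stopping at the first failure. Using a single `work_dir` for every step is an assumption made for brevity, and `build_commands` and `run_all` are hypothetical helper names rather than part of the repository.

```python
import subprocess
import sys


def build_commands(work_dir, emdb_dir, empiar_dir, threads=4):
    """Return the argv lists for the five steps, in the recommended order."""
    py = sys.executable  # run the scripts with the current interpreter
    return [
        [py, "fetch_empiar.py", "-w", work_dir, "-f", empiar_dir],
        [py, "fetch_pubmed.py", "-w", work_dir, "-f", emdb_dir],
        [py, "added_annotations.py", "-w", work_dir, "-f", emdb_dir,
         "--all", "-t", str(threads)],
        [py, "fetch_afdb.py", "-w", work_dir],
        [py, "write_xml.py", work_dir],
    ]


def run_all(commands):
    """Run each command in turn; check=True raises if a step exits non-zero."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

A call such as `run_all(build_commands("out", "emdb_xml", "empiar_xml", threads=8))` would be run from the repository root, where the directory names are placeholders for your own paths.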
| 96 | +### Further information |
| 97 | + |
| 98 | +For more information about EMICSS, visit the official EMICSS website (https://www.ebi.ac.uk/emdb/emicss). This page provides detailed information about the EMDB/EMICSS project. |