
# HH130 (Half-Heusler130) Database Data Processing Pipeline


This repository contains a Python-based data processing pipeline for the Half-Heusler Database (HH130). The pipeline downloads the dataset, extracts the relevant configuration files, processes them into structured data, and generates graph representations suitable for machine learning tasks in materials science. It also includes a Sphinx extension for properly citing the HH130 database in documentation.

## Overview

The pipeline consists of the following key stages:

1. **Download and Extraction:** Downloads the HH130 dataset as a ZIP file from a specified URL and extracts its contents.
2. **Configuration File Processing:** Recursively searches for and processes `.cfg` files within the extracted data, extracting information such as material composition, atomic coordinates, energy, and stress.
3. **Graph Generation:** Constructs graph representations of the material configurations based on atomic distances and properties, saving them as PyTorch Geometric `Data` objects.
4. **Sphinx Citation Handling:** Provides a custom Sphinx extension to ensure consistent and accurate citation of the HH130 database in project documentation.

## Repository Structure

```
├── config.yaml           # Configuration file for the data processing pipeline.
├── config.py             # Module for loading configuration settings from the YAML file.
├── file_finder.py        # Module for finding files (e.g., .cfg files) and counting directories.
├── file_processor.py     # Module for processing .cfg files to extract structured data.
├── graph_constructor.py  # Module for constructing graph representations from processed data.
├── hh130_dbdl.py         # Module for downloading and extracting the HH130 dataset.
├── hh130_sphinx.py       # Custom Sphinx extension for handling HH130 citations.
├── main.py               # Main script to orchestrate the entire data processing pipeline.
└── README.md             # This README file.
```

## Configuration

The pipeline's behavior is controlled by the `config.yaml` file, which specifies parameters such as the download URL, directory names, file extensions, and settings for graph construction and Sphinx citation:

```yaml
database_url: "http://www.mathub3d.net/static/database/HH130.zip"
download_dir_name: "Chem_Data"
hh130_dir_name: "HH130"
cfg_root_subdir: "MLIP models and Datasets"
raw_data_subdir: "raw"
root_data_dir: "C:\\"
cfg_extension: ".cfg"
hh130_download_config:
  zip_filename: "HH130.zip"
  raw_dirname: "raw"
  processed_dirname: "processed"
  cfg_files_dirname: "MLIP models and Datasets"
  cfg_file_extension: ".cfg"
sphinx_config:
  hh130_citation_label: "HH130"
  hh130_citation_ref_text: "[1]"
  hh130_citation_id: "hh130"
  hh130_citation_string: "Y. Yang, Y. Lin, S. Dai, Y. Zhu, J. Xi, L. Xi, X. Gu, D. J. Singh, W. Zhang and J. Yang, HH130: a standardized database of machine learning interatomic potentials, datasets, and its applications in the thermal transport of half-Heusler thermoelectrics, Digital Discovery, 2024, 3, 2201-2210, DOI: 10.1039/D4DD00240G."
graph_construction:
  distance_cutoff: 5.0
  graph_file_extension: ".pt"
  config_counter_format: "{}_config_{}{}"
```

> **Note:** Ensure that `root_data_dir` in your `config.yaml` is set to a location where the script has write permissions.
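
`config.py` exposes these settings to the rest of the pipeline. A minimal sketch of such a YAML loader, assuming a hypothetical `load_config` function name (the actual module may differ):

```python
# Minimal YAML config loader sketch (function name is illustrative).
from pathlib import Path

import yaml


def load_config(path="config.yaml"):
    """Read the pipeline settings from a YAML file into a plain dict."""
    with Path(path).open("r", encoding="utf-8") as fh:
        return yaml.safe_load(fh)


config = load_config()
print(config["graph_construction"]["distance_cutoff"])  # -> 5.0
```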

## Setup and Usage

### Prerequisites

- **Python 3.8 or higher:** Ensure you have a compatible Python environment installed.

- **Required Python packages:** Install the necessary libraries using pip (`pathlib` is part of the Python standard library, so it does not need to be installed separately):

  ```bash
  pip install requests pyyaml tqdm torch torch_geometric numpy sphinx docutils
  ```

### Running the Pipeline

1. **Clone the repository:**

   ```bash
   git clone <repository_url>
   cd <repository_directory>
   ```

2. **(Optional) Modify the configuration:** Review and adjust the `config.yaml` file according to your needs, such as the download location and root data directory.

3. **Run the main script:**

   ```bash
   python main.py
   ```

   You can also specify a different configuration file using the `--config` argument:

   ```bash
   python main.py --config custom_config.yaml
   ```

The script will then:

1. Download the HH130 dataset if it is not already present or fully extracted.
2. Process the `.cfg` files found in the specified directory.
3. Generate PyTorch Geometric graph files in the designated raw data subdirectory.
4. Print a citation reminder for the HH130 database.
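
For orientation, here is a minimal sketch of how the `--config` flag can be wired up with `argparse`; it is illustrative only, and `main.py` may handle its arguments differently:

```python
# Illustrative argument parsing for the pipeline entry point.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="HH130 data processing pipeline")
    parser.add_argument(
        "--config",
        default="config.yaml",
        help="Path to the YAML configuration file.",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Using configuration file: {args.config}")
```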

### Using the Sphinx Extension

To incorporate the HH130 citation into your Sphinx documentation:

1. **Ensure Sphinx is installed:**

   ```bash
   pip install sphinx
   ```

2. **Add the extension to your Sphinx `conf.py` file:**

   ```python
   extensions = [
       # ... other extensions
       'hh130_sphinx',
   ]
   ```
3. **Configure the citation details in your `conf.py` file:**

   ```python
   hh130_citation_label = 'HH130'
   hh130_citation_ref_text = '[1]'
   hh130_citation_id = 'hh130'
   hh130_citation_string = "Y. Yang, Y. Lin, S. Dai, Y. Zhu, J. Xi, L. Xi, X. Gu, D. J. Singh, W. Zhang and J. Yang, HH130: a standardized database of machine learning interatomic potentials, datasets, and its applications in the thermal transport of half-Heusler thermoelectrics, Digital Discovery, 2024, 3, 2201-2210, DOI: 10.1039/D4DD00240G."
   ```
4. **Use the citation in your reStructuredText files:**

   ```rst
   ... some text ... [hh130]_ ... more text ...

   .. bibliography::
   ```
Sphinx will then automatically generate a citation and a bibliography entry for HH130.
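
Internally, a citation extension of this kind typically registers its settings with Sphinx in a `setup` function. A minimal sketch of what `hh130_sphinx.py` might register (the defaults and structure shown are assumptions, not the actual implementation):

```python
# Sketch of a Sphinx extension entry point registering citation settings
# (defaults and structure are assumptions about hh130_sphinx.py).
def setup(app):
    app.add_config_value("hh130_citation_label", "HH130", "env")
    app.add_config_value("hh130_citation_ref_text", "[1]", "env")
    app.add_config_value("hh130_citation_id", "hh130", "env")
    app.add_config_value("hh130_citation_string", "", "env")
    return {"version": "0.1", "parallel_read_safe": True}
```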

## Modules Description

**`config.py`** – Loads configuration settings from the `config.yaml` file.

**`file_finder.py`** – Utility functions for finding files with a specific extension within a directory and for recursively counting directories.
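
A minimal sketch of this kind of lookup with `pathlib` (the function name and signature are illustrative, not necessarily those of `file_finder.py`):

```python
# Illustrative recursive file lookup (names are assumptions).
from pathlib import Path


def find_files(root, extension=".cfg"):
    """Recursively collect all files under `root` with the given extension."""
    return sorted(p for p in Path(root).rglob(f"*{extension}") if p.is_file())


cfg_files = find_files("Chem_Data/HH130/MLIP models and Datasets")
print(f"Found {len(cfg_files)} .cfg files")
```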

**`file_processor.py`** – Parses `.cfg` files, extracts the relevant data blocks, and organizes them; also includes helpers for extracting numeric and integer data and for ordering elements by atomic mass.
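
MLIP-style `.cfg` files usually delimit each configuration with `BEGIN_CFG`/`END_CFG` markers; assuming that layout, splitting a file into per-configuration blocks might look like the sketch below (the actual parser then extracts composition, coordinates, energy, and stress from each block):

```python
# Illustrative splitter for MLIP-style .cfg files
# (the BEGIN_CFG/END_CFG layout is an assumption about the format).
def split_cfg_blocks(text):
    """Yield the lines between BEGIN_CFG and END_CFG, one block per configuration."""
    block, inside = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "BEGIN_CFG":
            block, inside = [], True
        elif stripped == "END_CFG":
            inside = False
            yield "\n".join(block)
        elif inside:
            block.append(line)


with open("example.cfg", encoding="utf-8") as fh:
    for i, block in enumerate(split_cfg_blocks(fh.read())):
        print(f"Configuration {i}: {len(block.splitlines())} lines")
```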

**`graph_constructor.py`** – Builds graph representations from the processed data with PyTorch Geometric, defining how nodes (atoms) and edges (interactions within the distance cutoff) are created, along with global graph attributes.
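
A minimal sketch of radius-cutoff graph construction with PyTorch Geometric, using the `distance_cutoff` from `config.yaml` (the feature choices here are illustrative; the actual module may attach different node, edge, and global attributes):

```python
# Illustrative distance-cutoff graph construction (feature choices are assumptions).
import torch
from torch_geometric.data import Data


def build_graph(positions, atomic_numbers, distance_cutoff=5.0):
    """Connect every pair of atoms closer than `distance_cutoff`."""
    pos = torch.as_tensor(positions, dtype=torch.float)
    z = torch.as_tensor(atomic_numbers, dtype=torch.long)
    dist = torch.cdist(pos, pos)                    # pairwise distances
    mask = (dist < distance_cutoff) & (dist > 0.0)  # within cutoff, no self-loops
    edge_index = mask.nonzero().t().contiguous()    # shape [2, num_edges]
    return Data(x=z.unsqueeze(1).float(), pos=pos, edge_index=edge_index)


graph = build_graph([[0.0, 0.0, 0.0], [0.0, 0.0, 2.9]], [22, 28])
print(graph)  # Data(x=[2, 1], pos=[2, 3], edge_index=[2, 2])
```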

**`hh130_dbdl.py`** – Downloads the HH130 dataset ZIP file from the configured URL and extracts it to a designated directory, with checks for existing files and error handling.
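
A minimal sketch of the streaming download and extraction (the real module layers existence checks and error handling on top of this, and the function name is illustrative):

```python
# Illustrative streaming download with a progress bar, then extraction.
import zipfile

import requests
from tqdm import tqdm


def download_and_extract(url, zip_path, extract_dir):
    """Stream `url` to `zip_path`, showing progress, then unzip into `extract_dir`."""
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        total = int(resp.headers.get("content-length", 0))
        with open(zip_path, "wb") as out, tqdm(total=total, unit="B", unit_scale=True) as bar:
            for chunk in resp.iter_content(chunk_size=8192):
                out.write(chunk)
                bar.update(len(chunk))
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(extract_dir)


download_and_extract(
    "http://www.mathub3d.net/static/database/HH130.zip", "HH130.zip", "Chem_Data"
)
```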

**`hh130_sphinx.py`** – A custom Sphinx extension that defines a citation node for the HH130 database, ensuring consistent formatting and content based on the configuration.

**`main.py`** – The entry point of the pipeline; orchestrates the download, processing, and graph generation steps using the configuration loaded from `config.yaml`.

## Contributing

Contributions to this project are welcome. Please feel free to submit pull requests or open issues for any bugs, improvements, or new features.

## License

This project is licensed under the MIT License.

## Citation

If you use the HH130 database or this processing pipeline in your research, please cite the original publication (also recorded in the `sphinx_config` section of `config.yaml`):

Y. Yang, Y. Lin, S. Dai, Y. Zhu, J. Xi, L. Xi, X. Gu, D. J. Singh, W. Zhang and J. Yang, HH130: a standardized database of machine learning interatomic potentials, datasets, and its applications in the thermal transport of half-Heusler thermoelectrics, Digital Discovery, 2024, 3, 2201-2210, DOI: 10.1039/D4DD00240G.

## Acknowledgements

We would like to express our sincere gratitude to the researchers and developers behind the Half-Heusler Database (HH130) for creating and maintaining this valuable resource for the materials science community. Their work in standardizing interatomic potentials and datasets significantly facilitates research in areas such as thermal transport and machine learning applications in materials science.

Specifically, we acknowledge the authors of the original HH130 publication for providing the scientific basis and the data that this pipeline aims to process and make more accessible for computational studies. Their dedication to open science and data sharing is greatly appreciated.

We also thank the developers of the open-source Python libraries used in this pipeline, including `requests`, `pyyaml`, `pathlib`, `tqdm`, `torch`, `torch_geometric`, `numpy`, `sphinx`, and `docutils`, for providing the tools that enable this work.

## Support

For any questions or issues regarding this pipeline, please open an issue on the GitHub repository.
