# HH130 Data Processing Pipeline

This repository contains a Python-based data processing pipeline for the Half-Heusler Database (HH130). The pipeline automates downloading the dataset, extracting the relevant configuration files, parsing them into structured data, and generating graph representations suitable for machine learning tasks in materials science. It also includes a Sphinx extension for citing the HH130 database consistently in documentation.
The pipeline consists of the following key stages:

- **Download and Extraction:** Downloads the HH130 dataset as a ZIP file from a specified URL and extracts its contents (see the sketch after this list).
- **Configuration File Processing:** Recursively searches for and processes `.cfg` files within the extracted data, extracting information such as material composition, atomic coordinates, energy, and stress.
- **Graph Generation:** Constructs graph representations of the material configurations based on atomic distances and properties, saving them as PyTorch Geometric `Data` objects.
- **Sphinx Citation Handling:** Provides a custom Sphinx extension to ensure consistent and accurate citation of the HH130 database in project documentation.
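For example, the first stage reduces to a routine like the sketch below. This is illustrative only: the function name `download_and_extract` is hypothetical, and the real logic lives in `hh130_dbdl.py`.

```python
import zipfile
from pathlib import Path

import requests
from tqdm import tqdm


def download_and_extract(url: str, dest_dir: Path) -> None:
    """Download a ZIP archive with a progress bar and extract it (illustrative sketch)."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    zip_path = dest_dir / url.rsplit("/", 1)[-1]
    if not zip_path.exists():
        response = requests.get(url, stream=True, timeout=60)
        response.raise_for_status()
        total = int(response.headers.get("content-length", 0))
        with open(zip_path, "wb") as fh, tqdm(total=total, unit="B", unit_scale=True) as bar:
            for chunk in response.iter_content(chunk_size=8192):
                fh.write(chunk)
                bar.update(len(chunk))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
```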
## Repository Structure

```text
├── config.yaml           # Configuration file for the data processing pipeline.
├── config.py             # Module for loading configuration settings from the YAML file.
├── file_finder.py        # Module for finding files (e.g., .cfg files) and counting directories.
├── file_processor.py     # Module for processing .cfg files to extract structured data.
├── graph_constructor.py  # Module for constructing graph representations from processed data.
├── hh130_dbdl.py         # Module for downloading and extracting the HH130 dataset.
├── hh130_sphinx.py       # Custom Sphinx extension for handling HH130 citations.
├── main.py               # Main script to orchestrate the entire data processing pipeline.
└── README.md             # This README file.
```
## Configuration

The pipeline's behavior is controlled by the `config.yaml` file. This file specifies parameters such as the download URL, directory names, file extensions, and settings for graph construction and Sphinx citation:
```yaml
database_url: "http://www.mathub3d.net/static/database/HH130.zip"
download_dir_name: "Chem_Data"
hh130_dir_name: "HH130"
cfg_root_subdir: "MLIP models and Datasets"
raw_data_subdir: "raw"
root_data_dir: "C:\\"
cfg_extension: ".cfg"

hh130_download_config:
  zip_filename: "HH130.zip"
  raw_dirname: "raw"
  processed_dirname: "processed"
  cfg_files_dirname: "MLIP models and Datasets"
  cfg_file_extension: ".cfg"

sphinx_config:
  hh130_citation_label: "HH130"
  hh130_citation_ref_text: "[1]"
  hh130_citation_id: "hh130"
  hh130_citation_string: "Y. Yang, Y. Lin, S. Dai, Y. Zhu, J. Xi, L. Xi, X. Gu, D. J. Singh, W. Zhang and J. Yang, HH130: a standardized database of machine learning interatomic potentials, datasets, and its applications in the thermal transport of half-Heusler thermoelectrics, Digital Discovery, 2024, 3, 2201-2210, DOI: 10.1039/D4DD00240G."

graph_construction:
  distance_cutoff: 5.0
  graph_file_extension: ".pt"
  config_counter_format: "{}_config_{}{}"
```
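For reference, a loader of roughly this shape (a minimal sketch; the actual `config.py` may differ in detail) is enough to expose these settings to the rest of the pipeline:

```python
import yaml


def load_config(path: str = "config.yaml") -> dict:
    """Load pipeline settings from the YAML configuration file."""
    with open(path, "r", encoding="utf-8") as fh:
        return yaml.safe_load(fh)


config = load_config()
print(config["graph_construction"]["distance_cutoff"])  # -> 5.0
```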
**Note:** Ensure that `root_data_dir` in your `config.yaml` is set to a location where the script has write permissions.
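If in doubt, a quick check along these lines (illustrative only) will confirm the configured directory is writable before a long run:

```python
import os
from pathlib import Path

root = Path("C:\\")  # the root_data_dir value from config.yaml
if not os.access(root, os.W_OK):
    raise PermissionError(f"No write access to {root}; adjust root_data_dir in config.yaml")
```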
## Setup and Usage

### Prerequisites

- **Python 3.8 or higher:** Ensure you have a compatible Python environment installed.
- **Required Python packages:** Install the necessary libraries using pip (`pathlib` is part of the standard library and does not need to be installed separately):

```bash
pip install requests pyyaml tqdm torch torch_geometric numpy sphinx docutils
```
### Running the Pipeline

1. **Clone the repository:**

   ```bash
   git clone <repository_url>
   cd <repository_directory>
   ```

2. **(Optional) Modify the configuration:** Review and adjust the `config.yaml` file according to your needs, such as the download location and root data directory.

3. **Run the main script:**

   ```bash
   python main.py
   ```

   You can also specify a different configuration file using the `--config` argument (a sketch of the corresponding entry point follows the step list below):

   ```bash
   python main.py --config custom_config.yaml
   ```
The script will then:

1. Download the HH130 dataset if it is not already present or fully extracted.
2. Process the `.cfg` files found in the specified directory.
3. Generate PyTorch Geometric graph files in the designated raw data subdirectory.
4. Print a citation reminder for the HH130 database.
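The `--config` flag implies an entry point along these lines; this is a hedged sketch, not the verbatim `main.py`:

```python
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="HH130 data processing pipeline")
    parser.add_argument("--config", default="config.yaml",
                        help="path to the YAML configuration file")
    args = parser.parse_args()
    # Load settings from args.config, then run the download, .cfg processing,
    # and graph generation stages using the modules described below.
    ...


if __name__ == "__main__":
    main()
```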
## Using the Sphinx Extension

To incorporate the HH130 citation into your Sphinx documentation:

1. **Ensure Sphinx is installed:**

   ```bash
   pip install sphinx
   ```

2. **Add the extension to your Sphinx `conf.py` file:**

   ```python
   extensions = [
       # ... other extensions
       'hh130_sphinx',
   ]
   ```

3. **Configure the citation details in your `conf.py` file:**

   ```python
   hh130_citation_label = 'HH130'
   hh130_citation_ref_text = '[1]'
   hh130_citation_id = 'hh130'
   hh130_citation_string = "Y. Yang, Y. Lin, S. Dai, Y. Zhu, J. Xi, L. Xi, X. Gu, D. J. Singh, W. Zhang and J. Yang, HH130: a standardized database of machine learning interatomic potentials, datasets, and its applications in the thermal transport of half-Heusler thermoelectrics, Digital Discovery, 2024, 3, 2201-2210, DOI: 10.1039/D4DD00240G."
   ```

4. **Use the citation in your reStructuredText files:**

   ```rst
   ... some text ... [hh130]_ ... more text ...

   .. bibliography::
   ```

Sphinx will then automatically generate a citation and a bibliography entry for HH130.
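Internally, an extension of this kind registers its configuration values in a `setup()` hook. The skeleton below is a hedged illustration of that pattern using the standard Sphinx API; the actual `hh130_sphinx.py` additionally defines the citation node and its rendering:

```python
def setup(app):
    """Register the HH130 citation settings as Sphinx config values."""
    app.add_config_value("hh130_citation_label", "HH130", "env")
    app.add_config_value("hh130_citation_ref_text", "[1]", "env")
    app.add_config_value("hh130_citation_id", "hh130", "env")
    app.add_config_value("hh130_citation_string", "", "env")
    return {"version": "0.1", "parallel_read_safe": True}
```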
## Modules Description

- `config.py`: Provides the functionality to load configuration settings from the `config.yaml` file.
- `file_finder.py`: Contains utility functions for finding files with a specific extension within a directory and for recursively counting directories.
- `file_processor.py`: Implements the logic to parse `.cfg` files, extract relevant data blocks, and organize them. It also includes functions for extracting numeric and integer data and for determining the order of elements by their atomic mass.
- `graph_constructor.py`: Takes the processed data and constructs graph representations using PyTorch Geometric. It defines how nodes (atoms) and edges (interactions based on distance) are created, along with global graph attributes (a minimal sketch follows this list).
- `hh130_dbdl.py`: Handles the downloading of the HH130 dataset ZIP file from the specified URL and its extraction to a designated directory, including checks for existing files and error handling.
- `hh130_sphinx.py`: A custom Sphinx extension that defines a specific citation node for the HH130 database, ensuring consistent formatting and content based on the configuration.
- `main.py`: The main entry point of the pipeline. It orchestrates the download, processing, and graph generation steps using the configuration loaded from `config.yaml`.
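To make the graph-generation step concrete, here is a minimal sketch of the cutoff-based construction described above; `build_graph` is a hypothetical helper, and the real `graph_constructor.py` also attaches global attributes such as energy and stress:

```python
import torch
from torch_geometric.data import Data


def build_graph(positions: torch.Tensor, atomic_numbers: torch.Tensor,
                cutoff: float = 5.0) -> Data:
    """Connect every pair of atoms closer than `cutoff` (illustrative sketch)."""
    dists = torch.cdist(positions, positions)      # pairwise distances, shape (N, N)
    mask = (dists < cutoff) & (dists > 0)          # within cutoff, excluding self-loops
    edge_index = mask.nonzero(as_tuple=False).t()  # COO connectivity, shape (2, E)
    edge_attr = dists[mask].unsqueeze(-1)          # edge feature: interatomic distance
    x = atomic_numbers.float().unsqueeze(-1)       # node feature: atomic number
    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr)
```

Each resulting graph can then be written out with `torch.save`, following the `config_counter_format` naming pattern and `.pt` extension from `config.yaml`.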
## Contributing

Contributions to this project are welcome. Please feel free to submit pull requests or open issues for bugs, improvements, or new features.

## License

This project is licensed under the MIT License.
## Citation

If you use the HH130 database or this processing pipeline in your research, please cite the original publication, as specified in the `sphinx_config` section of your `config.yaml` or directly:

Y. Yang, Y. Lin, S. Dai, Y. Zhu, J. Xi, L. Xi, X. Gu, D. J. Singh, W. Zhang and J. Yang, HH130: a standardized database of machine learning interatomic potentials, datasets, and its applications in the thermal transport of half-Heusler thermoelectrics, Digital Discovery, 2024, 3, 2201-2210, DOI: 10.1039/D4DD00240G.
## Acknowledgements
We would like to express our sincere gratitude to the researchers and developers behind the Half-Heusler Database (HH130) for their efforts in creating and maintaining this valuable resource for the materials science community. Their work in standardizing interatomic potentials and datasets significantly facilitates research and development in areas like thermal transport and machine learning applications in materials.
Specifically, we acknowledge the authors of the original HH130 publication for providing the scientific basis and the data that this pipeline aims to process and make more accessible for computational studies. Their dedication to open science and data sharing is greatly appreciated.
We also thank the developers of the open-source Python libraries used in this pipeline, including `requests`, `pyyaml`, `pathlib`, `tqdm`, `torch`, `torch_geometric`, `numpy`, `sphinx`, and `docutils`, for providing the tools that enable this work.
## Support
For any questions or issues regarding this pipeline, please open an issue on the GitHub repository.