⛏ L2M3: Large Language Models Material Miner

Summary

L2M3 is designed to efficiently gather experimental Metal-Organic Framework (MOF) data from scientific literature, addressing the challenge of accessing hard-to-find data and enhancing the quality of information available for machine learning in materials science. By utilizing a chain of advanced Large Language Models (LLMs), we developed a systematic method for extracting and organizing MOF data into a structured, usable format. Our approach has successfully compiled data from over 40,000 research articles, creating a comprehensive, ready-to-use dataset. This dataset includes MOF synthesis conditions and properties, extracted from both tables and text, which have been thoroughly analyzed.

Process

L2M3 employs three specialized agents (a conceptual sketch follows this list):

  • Categorization Agent: Classifies tables and text according to whether they describe properties or synthesis conditions, or contain only irrelevant information.
  • Inclusion Agent: Identifies specific pieces of information present in the categorized data.
  • Extraction Agent: Extracts the relevant information in a structured JSON format.
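
The agents run as a chain over each paragraph or table: categorize first, then check what is included, then extract. The sketch below only illustrates that flow; the real pipeline lives inside LLMMiner, and call_llm here stands in for any LLM call.

from typing import Callable, Optional

# Illustrative sketch of the three-agent chain (not the llm_miner API).
# `call_llm` is any function that sends a prompt to an LLM and returns its reply.
def mine_element(element: str, call_llm: Callable[[str], str]) -> Optional[str]:
    # Categorization Agent: what does this element describe?
    category = call_llm(f"Classify as 'synthesis', 'property', or 'irrelevant':\n{element}").strip()
    if category == "irrelevant":
        return None  # nothing worth extracting

    # Inclusion Agent: which specific pieces of information are present?
    fields = call_llm(f"List the {category} fields mentioned in:\n{element}")

    # Extraction Agent: pull those fields out as structured JSON
    return call_llm(f"Extract the following fields as JSON ({fields}):\n{element}")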

Installation

NOTE: This package is primarily tested on Linux systems. We strongly recommend using Linux for installation.

Requirements: Python >= 3.9

$ git clone https://github.com/Yeonghun1675/L2M3.git
$ cd L2M3
$ pip install -e .
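
Once installed, the package should be importable under the name used throughout this README (a quick sanity check, assuming the install above succeeded):

$ python -c "import llm_miner; print(llm_miner.__name__)"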

How to use

You can run L2M3 using the LLMMiner and JournalReader classes.

from llm_miner import LLMMiner
from llm_miner import JournalReader

# Load the agent and parse an XML/HTML file
# (config, file_path, and publisher are placeholders; see sections 1 and 2 below)
agent = LLMMiner.from_config(config)
jr = JournalReader.from_file(file_path, publisher)

# Run the agent on the parsed file
agent.run(jr)

1. JournalReader (Parsing XML/HTML)

JournalReader is a Python class that extracts clean text and metadata from XML or HTML files.

from llm_miner import JournalReader

file_path = 'path-to-your-xml/html-file'
publisher = 'your-publisher'  # Available options: ['acs', 'rsc', 'elsevier', 'springer']

jr = JournalReader.from_file(file_path, publisher=publisher)

Attributes of JournalReader (see the example after this list):

  • doi: The DOI of the paper
  • title: The title of the paper
  • url: The URL of the paper
  • get_tables: A list of tables in the paper
  • get_texts: A list of paragraphs in the paper
  • get_figures: A list of figure captions in the paper
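
For example, the parsed metadata can be inspected directly (a minimal sketch using the attributes listed above):

# Basic metadata of the parsed paper
print(jr.doi)    # DOI of the paper
print(jr.title)  # title of the paper
print(jr.url)    # URL of the paper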

You can also save and load JournalReader instances as JSON files:

# Save JournalReader to a JSON file
jr.to_json('output_file_path.json')

# Load JournalReader from a JSON file
jr_load = JournalReader.from_json('input_file_path.json')

2. LLMMiner (Agent)

LLMMiner is a module that extracts synthesis conditions and characteristic properties from text and tables.

from llm_miner import LLMMiner

api_key = 'openai-api-key'
agent = LLMMiner.create(openai_api_key=api_key)

By default, the LLM module uses gpt-4 for text extraction and gpt-3.5-turbo-16k for table extraction.

You can customize the LLM model using a configuration file. If you want to use a fine-tuned model, an example configuration file is available in the L2M3/config directory.

agent = LLMMiner.from_yaml('yaml-file-path', openai_api_key=api_key)

3. Run agent

You can run the agent, and the output of the text-mining process will automatically be saved in the JournalReader object.

agent.run(jr)

You can check the results in the JournalReader object. The results contain the synthesis conditions and property data, consolidated by material.

result = jr.result

# View all results
result.print()

If jr.result exists, it is saved automatically when the JournalReader itself is saved. To export just the results, use the to_dict or to_json methods:

# Convert to a dictionary
output = result.to_dict()

# Save as a JSON file
result.to_json(json_file)

If you want to review the text-mining results by paragraph or table, you can use the following functions. Each function returns a list of Paragraph objects.

# All text-mined results
all_paragraph = jr.cln_element

# View results by category
synthesis_condition = jr.get_synthesis_conditions()
properties = jr.get_properties()
tables = jr.get_tables()

You can inspect the text-mining results in each Paragraph object:

# Check the results of each paragraph
paragraph = synthesis_condition[idx]  # works the same for properties or tables
paragraph.print()  # View full content of the paragraph

The Paragraph object offers several useful attributes and methods (a combined example follows this list):

  • idx: Index of the paragraph
  • type: Type of the paragraph (text or table)
  • classification: Classification result (synthesis condition or properties)
  • clean_text: Cleaned version of the paragraph (no HTML/XML tags)
  • include_properties: Inclusion result
  • data: Extracted data in JSON format
  • (method) get_intermediate_step: Displays results of intermediate steps
  • (method) to_dict: Converts the Paragraph object to a dictionary
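
Putting these together, a typical inspection loop looks like the sketch below (attribute contents depend on the paper; the loop uses the get_synthesis_conditions() accessor shown earlier):

for paragraph in jr.get_synthesis_conditions():
    print(paragraph.idx, paragraph.type, paragraph.classification)
    print(paragraph.clean_text)          # tag-free text of the element
    print(paragraph.include_properties)  # Inclusion Agent result
    print(paragraph.data)                # extracted data in JSON format

    record = paragraph.to_dict()         # dictionary form, e.g. for custom post-processing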

4. (optional) Token Checker

L2M3 provides a token checker to estimate the number of tokens used and the price for your text-mining task.

from llm_miner.pricing import TokenChecker

tc = TokenChecker()
...
agent.run(
    paragraph=output,
    token_checker=tc
)

# View token summary
tc.print()
# Display total price (in $)
print(tc.price)

Fine-tuning

L2M3 allows you to fine-tune the LLM models to reduce token usage and cost. The L2M3/finetune directory contains jsonl files that serve as datasets for fine-tuning the various models. The available fine-tuning datasets are:

  • text_categorize
  • property_inclusion
  • synthesis_inclusion
  • table_categorize
  • table_crystal_inclusion
  • table_property_inclusion
  • table_xml2md

You can fine-tune models on the OpenAI fine-tuning page (recommended).

Alternatively, you can fine-tune the model using the provided script:

$ python finetune/finetune.py --model model_name --file jsonl_file --api-key your_api_key
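
For reference, submitting a dataset through the OpenAI Python SDK (v1.x) looks roughly like the sketch below; the file name and base model are placeholders, and the script above remains the supported route.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload one of the jsonl datasets from L2M3/finetune (placeholder file name)
training_file = client.files.create(
    file=open("finetune/text_categorize.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on a base model of your choice (placeholder model name)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)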

Additional Information

Citation

If you want to cite L2M3, please refer to the following paper:

Harnessing Large Language Model to collect and analyze Metal-organic framework property dataset, J. Am. Chem. Soc. 2025, 147 (5), 3943–3958.

Contributing 🙌

Contributions are welcome! If you have any suggestions or find any issues, please open an issue or a pull request.

License 📄

This project is licensed under the MIT License. See the LICENSE file for more information.
