L2M3 is designed to efficiently gather experimental Metal-Organic Framework (MOF) data from scientific literature, addressing the challenge of accessing hard-to-find data and improving the quality of information available for machine learning in materials science. By chaining specialized Large Language Models (LLMs), we developed a systematic method for extracting and organizing MOF data into a structured, usable format. Our approach has compiled data from over 40,000 research articles into a comprehensive, ready-to-use dataset of MOF synthesis conditions and properties extracted from both tables and text.
L2M3 employs three specialized agents:
- Categorization Agent: Classifies tables and text based on whether they describe properties, synthesis conditions, or contain irrelevant information.
- Inclusion Agent: Identifies specific pieces of information present in the categorized data.
- Extraction Agent: Extracts the relevant information in a structured JSON format.
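The three-stage pipeline above can be sketched conceptually as follows. This is a minimal illustration with hypothetical rule-based stand-ins, not the actual L2M3 agents (which prompt LLMs at each stage); the function names and keyword lists are invented for the example.

```python
# Conceptual sketch of the three-agent pipeline: each stage narrows
# the input for the next. All names here are illustrative stand-ins.

def categorize(paragraph: str) -> str:
    """Categorization Agent stand-in: label a paragraph by its content."""
    text = paragraph.lower()
    if "synthesized" in text or "solvothermal" in text:
        return "synthesis_condition"
    if "surface area" in text or "uptake" in text:
        return "property"
    return "irrelevant"

def find_included_fields(paragraph: str, category: str) -> list[str]:
    """Inclusion Agent stand-in: list which fields the paragraph mentions."""
    keywords = {
        "synthesis_condition": ["temperature", "solvent", "time"],
        "property": ["surface area", "uptake"],
    }
    return [k for k in keywords.get(category, []) if k in paragraph.lower()]

def extract(paragraph: str, fields: list[str]) -> dict:
    """Extraction Agent stand-in: return structured JSON-like data."""
    # A real agent would prompt an LLM; here we just record the fields found.
    return {field: f"<value parsed from text for '{field}'>" for field in fields}

paragraph = "MOF-5 was synthesized in DMF solvent at a temperature of 120 C."
category = categorize(paragraph)
fields = find_included_fields(paragraph, category)
data = extract(paragraph, fields)
```

The point of the staged design is that cheap classification filters out irrelevant paragraphs before the expensive extraction step runs.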
NOTE: This package is primarily tested on Linux systems. We strongly recommend using Linux for installation.
Requirements: Python >= 3.9
```shell
$ git clone https://github.com/Yeonghun1675/L2M3.git
$ cd L2M3
$ pip install -e .
```
You can run L2M3 using the `LLMMiner` and `JournalReader` classes:
```python
from llm_miner import LLMMiner
from llm_miner import JournalReader

# Load agent and parse XML/HTML file
agent = LLMMiner.from_config(config)
jr = JournalReader.from_file(file_path, publisher)

# Run the agent on the parsed file
agent.run(jr)
```
`JournalReader` is a Python class that extracts clean text and metadata from XML or HTML files.
```python
from llm_miner import JournalReader

file_path = 'path-to-your-xml/html-file'
publisher = 'your-publisher'  # Available options: ['acs', 'rsc', 'elsevier', 'springer']

jr = JournalReader.from_file(file_path, publisher=publisher)
```
Attributes of `JournalReader`:

- `doi`: The DOI of the paper
- `title`: The title of the paper
- `url`: The URL of the paper
- `get_tables`: A list of tables in the paper
- `get_texts`: A list of paragraphs in the paper
- `get_figures`: A list of figure captions in the paper
You can also save and load `JournalReader` instances as JSON files:
```python
# Save JournalReader to a JSON file
jr.to_json('output_file_path.json')

# Load JournalReader from a JSON file
jr_load = JournalReader.from_json('input_file_path.json')
```
`LLMMiner` is a module that extracts synthesis conditions and characteristic properties from text and tables.
```python
from llm_miner import LLMMiner

api_key = 'openai-api-key'
agent = LLMMiner.create(openai_api_key=api_key)
```
By default, the LLM module uses `gpt-4` for text extraction and `gpt-3.5-turbo-16k` for table extraction.
You can customize the LLM model using a configuration file. If you want to use a fine-tuned model, an example configuration file is available in the L2M3/config directory.
```python
agent = LLMMiner.from_yaml('yaml-file-path', openai_api_key=api_key)
```
You can run the agent, and the output of the text-mining process will automatically be saved in the `JournalReader` object.
```python
agent.run(jr)
```
You can check the results in the `JournalReader` object. These results contain the synthesis or property data, consolidated by material.
```python
result = jr.result

# View all results
result.print()
```
If `jr.result` exists, it is automatically saved when the `JournalReader` object is saved. If you want to save just the result separately, you can save the cleaned results using the `to_dict` or `to_json` functions:
```python
# Convert to a dictionary
output = result.to_dict()

# Save as a JSON file
result.to_json(json_file)
```
If you want to review the text-mining results by paragraph or table, you can use the following functions. Each function returns a list of `Paragraph` objects.
```python
# All text-mined results
all_paragraph = jr.cln_element

# View results by category
synthesis_condition = jr.get_synthesis_conditions()
properties = jr.get_properties()
tables = jr.get_tables()
```
You can inspect the text-mining results in each `Paragraph` object:
```python
# Check the results of each paragraph
paragraph = synthesis_condition[idx]  # Can also use properties or tables
paragraph.print()  # View full content of the paragraph
```
The `Paragraph` object offers several useful attributes and methods:
- `idx`: Index of the paragraph
- `type`: Type of the paragraph (text or table)
- `classification`: Classification result (synthesis condition or properties)
- `clean_text`: Cleaned version of the paragraph (no HTML/XML tags)
- `include_properties`: Inclusion result
- `data`: Extracted data in JSON format
- `get_intermediate_step` (method): Displays results of intermediate steps
- `to_dict` (method): Converts the `Paragraph` object to a dictionary
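To make the attribute list above concrete, here is a minimal sketch of a `Paragraph`-like record. This is a hypothetical dataclass mirroring the attributes listed, not the actual `llm_miner` `Paragraph` class; the sample values are invented.

```python
# Minimal Paragraph-like record (hypothetical; mirrors the attributes above).
from dataclasses import dataclass, field, asdict

@dataclass
class ParagraphSketch:
    idx: int
    type: str                  # "text" or "table"
    classification: str        # "synthesis condition" or "properties"
    clean_text: str            # paragraph with HTML/XML tags stripped
    include_properties: list = field(default_factory=list)
    data: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        # Convert the record, including nested fields, to a plain dictionary
        return asdict(self)

p = ParagraphSketch(
    idx=0,
    type="text",
    classification="properties",
    clean_text="The BET surface area of MOF-5 is 3800 m2/g.",
    include_properties=["surface area"],
    data={"surface area": {"value": 3800, "unit": "m2/g"}},
)
```

A dictionary form like `p.to_dict()` is what makes per-paragraph results easy to serialize to JSON downstream.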
L2M3 provides a token checker to estimate the number of tokens used and the price for your text-mining task.
```python
from llm_miner.pricing import TokenChecker

tc = TokenChecker()
...
agent.run(
    paragraph=output,
    token_checker=tc
)

# View token summary
tc.print()

# Display total price (in $)
print(tc.price)
```
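For intuition on how such an estimate can work, here is a rough token-and-price estimator. This is illustrative only: the per-token rates below are made-up placeholders (not OpenAI's actual pricing, and not what `TokenChecker` uses), and "about 4 characters per token" is a crude rule of thumb for English text.

```python
# Rough token/price estimator. Model names and rates are hypothetical.
RATES_PER_1K_TOKENS = {"model-a": 0.03, "model-b": 0.003}  # placeholder $/1K

def estimate_tokens(text: str) -> int:
    # Common rule of thumb: ~4 characters per token for English text
    return max(1, len(text) // 4)

def estimate_price(text: str, model: str) -> float:
    return estimate_tokens(text) / 1000 * RATES_PER_1K_TOKENS[model]

prompt = "Classify the following paragraph..." * 10
tokens = estimate_tokens(prompt)
price = estimate_price(prompt, "model-a")
```

An estimator like this is useful for budgeting a 40,000-article run before spending any API credits.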
L2M3 allows you to fine-tune the LLM model to reduce token usage and cost. The L2M3/finetune directory contains `jsonl` files that serve as datasets for fine-tuning various models. The available fine-tuning datasets include:
- `text_categorize`
- `property_inclusion`
- `synthesis_inclusion`
- `table_categorize`
- `table_crystal_inclusion`
- `table_property_inclusion`
- `table_xml2md`
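For reference, each line of such a `jsonl` dataset is a single JSON object in OpenAI's chat fine-tuning format. The snippet below builds and round-trips one such example; the prompt and label content are illustrative, not taken from the L2M3 datasets.

```python
# Sketch of one training example in the chat fine-tuning JSONL format:
# each line of the .jsonl file is one self-contained JSON object.
import json

example = {
    "messages": [
        {"role": "system", "content": "Classify the paragraph as synthesis "
                                      "condition, property, or irrelevant."},
        {"role": "user", "content": "MOF-5 was synthesized in DMF at 120 C for 24 h."},
        {"role": "assistant", "content": "synthesis condition"},
    ]
}

jsonl_line = json.dumps(example)      # one line of the .jsonl file
round_trip = json.loads(jsonl_line)   # parse it back to verify
```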
You can fine-tune models on the OpenAI fine-tuning page (recommended).
Alternatively, you can fine-tune the model using the provided script:
```shell
$ python finetune/finetune.py --model model_name --file jsonl_file --api-key your_api_key
```
- L2M3 Database
- Paper Crawling
- Synthesis Condition Recommender
- Details of Machine Learning Models
- Utilizing L2M3 Across Various Domains
If you want to cite L2M3, please refer to the following paper:
Contributions are welcome! If you have any suggestions or find any issues, please open an issue or a pull request.
This project is licensed under the MIT License. See the `LICENSE` file for more information.