Note: this project is new and still in a setup phase.
Serf aims to provide a comprehensive framework for semantic entity resolution, enabling the identification and disambiguation of entities in the same dataset or across different datasets. It is based on the blog post The Rise of Semantic Entity Resolution, which was featured on Towards Data Science.
Serf runs multiple rounds of entity resolution until the dataset converges to a stable state.
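The "run until stable" idea can be sketched as a fixed-point loop. This is a minimal illustration, not Serf's implementation: `resolve_until_stable`, `resolve_round`, `max_rounds`, and the toy `lowercase_round` pass are all hypothetical names, and a real round would block, match, and merge records rather than lowercase strings.

```python
from typing import Callable, Iterable


def resolve_until_stable(
    records: Iterable[str],
    resolve_round: Callable[[frozenset], frozenset],
    max_rounds: int = 10,
) -> tuple[frozenset, int]:
    """Apply resolve_round repeatedly until the record set stops changing."""
    current = frozenset(records)
    for round_number in range(1, max_rounds + 1):
        updated = resolve_round(current)
        if updated == current:
            # Fixed point reached: another round would change nothing.
            return current, round_number
        current = updated
    # Safety valve: stop even if the rounds never settle.
    return current, max_rounds


def lowercase_round(records: frozenset) -> frozenset:
    # Toy stand-in (assumption) for one block/match/merge pass:
    # it collapses records that differ only by letter case.
    return frozenset(name.lower() for name in records)


resolved, rounds = resolve_until_stable(
    {"Acme", "ACME", "acme", "Globex"}, lowercase_round
)
print(sorted(resolved), rounds)
```

The round that detects stability is counted too, so the toy run above needs one round to merge and a second to confirm that nothing else changes.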
Phase 1 - Semantic Blocking
- Semantic Clustering - Clusters records using sentence embeddings, grouping them into small blocks so that the quadratic cost of pairwise comparison is paid only within each block.
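To see why blocking pays off, here is a small sketch that groups records by embedding similarity and counts the comparisons saved. It rests on several assumptions: the two-dimensional vectors are made-up stand-ins for real sentence embeddings, the greedy seed-based grouping is a deliberately simple substitute for whatever clustering Serf uses, and `greedy_blocks`, `pair_count`, and the `0.9` threshold are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_blocks(
    embeddings: dict[str, list[float]], threshold: float = 0.9
) -> list[list[str]]:
    """Put each record in the first block whose seed vector is similar enough."""
    seeds: list[list[float]] = []
    blocks: list[list[str]] = []
    for record_id, vector in embeddings.items():
        for i, seed in enumerate(seeds):
            if cosine(vector, seed) >= threshold:
                blocks[i].append(record_id)
                break
        else:
            # No existing block is close enough: start a new one.
            seeds.append(vector)
            blocks.append([record_id])
    return blocks


def pair_count(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Made-up vectors standing in for sentence embeddings (assumption).
toy = {
    "a1": [1.0, 0.0],
    "a2": [0.99, 0.05],
    "b1": [0.0, 1.0],
    "b2": [0.05, 0.99],
}
blocks = greedy_blocks(toy)
pairs_without_blocking = pair_count(len(toy))
pairs_with_blocking = sum(pair_count(len(block)) for block in blocks)
print(blocks, pairs_without_blocking, pairs_with_blocking)
```

With four records the saving is small (6 pairs versus 2), but because the pair count grows quadratically with block size, the gap widens quickly on real datasets.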
Phase 2 - Schema Alignment, Matching and Merging with Large Language Models
- Schema Alignment - Align schemas of common entities with different formats
- Entity Matching - Match within entire blocks of records at once
- Entity Merging - Merge matched entities in entire blocks of records, guided by entity signature descriptions.
- Match Evaluation - Evaluate the quality of matches using rigorous metrics
Schema alignment, matching, and merging all occur in a single prompt guided by metadata from DSPy signatures, in BAML format, with Google Gemini models.
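The Match Evaluation step does not name its metrics here, so the following is only one plausible sketch: pairwise precision, recall, and F1, a common way to score entity resolution output against a labeled ground truth. The function names and the toy clusters are assumptions for illustration.

```python
from itertools import combinations


def cluster_pairs(clusters: list[set[str]]) -> set[frozenset[str]]:
    """Expand clusters into the set of unordered record pairs they imply."""
    return {
        frozenset(pair)
        for cluster in clusters
        for pair in combinations(sorted(cluster), 2)
    }


def pairwise_scores(
    predicted: list[set[str]], truth: list[set[str]]
) -> dict[str, float]:
    """Score predicted clusters against ground truth at the pair level."""
    pred_pairs = cluster_pairs(predicted)
    true_pairs = cluster_pairs(truth)
    hits = len(pred_pairs & true_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(true_pairs) if true_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data (assumption): the prediction wrongly merges r3 into r1/r2
# and misses the true match between r3 and r4.
predicted = [{"r1", "r2", "r3"}, {"r4"}]
truth = [{"r1", "r2"}, {"r3", "r4"}]
scores = pairwise_scores(predicted, truth)
print(scores)
```

Scoring at the pair level penalizes both over-merging (lower precision) and missed matches (lower recall), which is why the toy prediction does poorly on both.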
Phase 3 - Edge Resolution - Remove duplicate edges produced by merging nodes.
- Edge Blocking - A simple GROUP BY on src, dst and edgetype.
- Edge Merging - Edges are merged by an LLM guided by edge signature descriptions.
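Edge blocking by GROUP BY can be sketched in plain Python as follows. This is an illustration only: in Serf the grouping presumably runs in Spark, and the `block_edges` helper, the sample edges, and the `since` attribute are assumptions. Only the `src`, `dst`, and `edgetype` keys come from the description above.

```python
from collections import defaultdict


def block_edges(edges: list[dict]) -> dict[tuple, list[dict]]:
    """Group edges by (src, dst, edgetype), mirroring a GROUP BY on those columns."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for edge in edges:
        groups[(edge["src"], edge["dst"], edge["edgetype"])].append(edge)
    return dict(groups)


# Sample edges (assumption): after node merging, two edges share the same key.
edges = [
    {"src": "acme", "dst": "globex", "edgetype": "SUPPLIES", "since": 2019},
    {"src": "acme", "dst": "globex", "edgetype": "SUPPLIES", "since": 2020},
    {"src": "acme", "dst": "initech", "edgetype": "OWNS", "since": 2021},
]
blocks = block_edges(edges)

# Only blocks with more than one edge need to go on to edge merging.
duplicate_blocks = {key: group for key, group in blocks.items() if len(group) > 1}
print(duplicate_blocks)
```

Because the grouping key is exact, each block contains only edges that already agree on endpoints and type, leaving the merge step to reconcile their remaining attributes.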
- Python 3.12
- Poetry for dependency management - see POETRY.md for installation instructions
- Java 11/17 (for Apache Spark)
- Apache Spark 3.5.5+
- 4GB+ RAM recommended (for Spark processing)
- Clone the repository:
git clone https://github.com/Graphlet-AI/serf.git
cd serf
- Create a conda / virtual environment:
In conda:
conda create -n serf python=3.12
conda activate serf
With venv:
python -m venv venv
source venv/bin/activate
- Install dependencies:
poetry install
- Install pre-commit checks:
pre-commit install
The SERF CLI provides commands for running the entity resolution pipeline:
$ serf --help
Usage: serf [OPTIONS] COMMAND [ARGS]...
SERF: Semantic Entity Resolution Framework CLI.
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
block Perform semantic blocking on input data.
edges Resolve edges after node merging.
match Align schemas, match entities, and merge within blocks.
The easiest way to get started with Serf is using Docker and docker compose. This ensures a consistent development environment.