Semantic Entity Resolution Framework (Serf)

Note: this project is new and still in a setup phase.

Serf aims to provide a comprehensive framework for semantic entity resolution: identifying and disambiguating entities within a single dataset or across different datasets. It is based on the blog post "The Rise of Semantic Entity Resolution", which was featured on Towards Data Science.

Features

Serf runs multiple rounds of entity resolution until the dataset converges to a stable state.
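
The multi-round behaviour can be pictured as a loop that reruns resolution until a round changes nothing. The sketch below is an illustration only, not Serf's implementation: `resolve_until_stable`, `toy_round`, the `max_rounds` cap and the "record count stopped shrinking" stability test are all assumptions made for this example.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def resolve_until_stable(
    records: List[Record],
    run_round: Callable[[List[Record]], List[Record]],
    max_rounds: int = 10,
) -> List[Record]:
    """Repeat a resolution round until the record count stops shrinking."""
    current = records
    for _ in range(max_rounds):
        merged = run_round(current)
        # Assumed stability test: a round that merges nothing means convergence.
        if len(merged) == len(current):
            return merged
        current = merged
    return current


def toy_round(records: List[Record]) -> List[Record]:
    """Hypothetical stand-in for one round: keep one record per normalized name."""
    seen: Dict[str, Record] = {}
    for rec in records:
        key = rec["name"].strip().lower()
        seen.setdefault(key, rec)
    return list(seen.values())


data = [{"name": "Acme"}, {"name": "acme "}, {"name": "Globex"}]
stable = resolve_until_stable(data, toy_round)
print(len(stable))
```

In a real pipeline `run_round` would wrap blocking, matching and merging; the cap on rounds simply guards against a round function that never settles.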

Phase 1 - Semantic Blocking

  • Semantic Clustering - Clusters records using sentence embeddings, grouping them into blocks so that pairwise comparison, which has quadratic complexity, only runs within each block.
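
To make the effect of blocking concrete, here is a minimal sketch using only the standard library. It assumes toy 2-D vectors in place of real sentence embeddings, and it swaps in a simple greedy cosine-threshold grouping rather than whatever clustering Serf actually uses; `greedy_blocks`, `pair_count` and the `threshold` value are hypothetical names chosen for this example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_blocks(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[List[int]]:
    """Put each vector in the first block whose seed vector is similar enough."""
    seeds: List[Sequence[float]] = []
    blocks: List[List[int]] = []
    for idx, vec in enumerate(embeddings):
        for block_id, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                blocks[block_id].append(idx)
                break
        else:
            # No existing block is close enough: start a new one.
            seeds.append(vec)
            blocks.append([idx])
    return blocks


def pair_count(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


toy = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0), (0.05, 0.98), (0.7, 0.7)]
blocks = greedy_blocks(toy)
print(blocks)
print(pair_count(len(toy)), sum(pair_count(len(b)) for b in blocks))
```

Comparing every record with every other costs n(n-1)/2 comparisons; after blocking, only the pairs inside each block are compared, so the total is the sum of that quantity over the blocks.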

Phase 2 - Schema Alignment, Matching and Merging with Large Language Models

  • Schema Alignment - Align schemas of common entities with different formats
  • Entity Matching - Match within entire blocks of records at once
  • Entity Merging - Merge matched entities in entire blocks of records, guided by entity signature descriptions.
  • Match Evaluation - Evaluate the quality of matches using rigorous metrics

Schema alignment, matching and merging occur in a single prompt guided by metadata from DSPy signatures, in BAML format, using Google Gemini models.
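
The README does not say which metrics Match Evaluation uses. As one common possibility, the sketch below computes pairwise precision, recall and F1 by comparing predicted clusters against ground-truth clusters. The function names and the choice of pairwise metrics are assumptions for illustration, not Serf's API.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def cluster_pairs(clusters: Iterable[Iterable[int]]) -> Set[Tuple[int, int]]:
    """Expand clusters of record ids into the set of matched id pairs."""
    pairs: Set[Tuple[int, int]] = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(
    predicted: Iterable[Iterable[int]], truth: Iterable[Iterable[int]]
) -> Dict[str, float]:
    """Pairwise precision, recall and F1 of predicted vs. true clusters."""
    pred = cluster_pairs(predicted)
    gold = cluster_pairs(truth)
    true_positives = len(pred & gold)
    # Convention assumed here: an empty pair set scores 1.0 rather than dividing by zero.
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Predicted: records 1, 2, 3 merged. Truth: {1, 2} and {3, 4} are the real entities.
scores = pairwise_scores(predicted=[[1, 2, 3], [4]], truth=[[1, 2], [3, 4]])
print(scores)
```

Here only the pair (1, 2) is both predicted and true, so one of three predicted pairs is correct and one of two true pairs is found.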

Phase 3 - Edge Resolution - Deduplicate edges that become duplicates when nodes are merged.

  • Edge Blocking - A simple GROUP BY on src, dst and edge type.
  • Edge Merging - Edges are merged by an LLM guided by edge signature descriptions.
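
Edge blocking as described above amounts to grouping edges by a composite key. The following is a plain-Python sketch of that GROUP BY, not Serf's Spark code; the dictionary field name `type` and the helper names are assumptions for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Dict[str, str]
EdgeKey = Tuple[str, str, str]


def block_edges(edges: List[Edge]) -> Dict[EdgeKey, List[Edge]]:
    """Group edges by (src, dst, type), the in-memory analogue of a GROUP BY."""
    groups: Dict[EdgeKey, List[Edge]] = defaultdict(list)
    for edge in edges:
        key = (edge["src"], edge["dst"], edge["type"])
        groups[key].append(edge)
    return dict(groups)


def duplicate_blocks(groups: Dict[EdgeKey, List[Edge]]) -> Dict[EdgeKey, List[Edge]]:
    """Keep only blocks with more than one edge, i.e. candidates for merging."""
    return {key: block for key, block in groups.items() if len(block) > 1}


edges = [
    {"src": "a", "dst": "b", "type": "owns", "note": "from source 1"},
    {"src": "a", "dst": "b", "type": "owns", "note": "from source 2"},
    {"src": "a", "dst": "c", "type": "owns", "note": "from source 1"},
]
to_merge = duplicate_blocks(block_edges(edges))
print(list(to_merge))
```

Only the blocks returned by `duplicate_blocks` would need to go to the LLM merging step; single-edge blocks can pass through unchanged.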

System Requirements

  • Python 3.12
  • Poetry for dependency management - see POETRY.md for installation instructions
  • Java 11/17 (for Apache Spark)
  • Apache Spark 3.5.5+
  • 4GB+ RAM recommended (for Spark processing)

Quick Start

  1. Clone the repository:
git clone https://github.com/Graphlet-AI/serf.git
cd serf
  2. Create a conda or venv virtual environment:

In conda:

conda create -n serf python=3.12
conda activate serf

With venv:

python -m venv venv
source venv/bin/activate
  3. Install dependencies:
poetry install
  4. Install pre-commit checks:
pre-commit install

CLI

The SERF CLI provides commands for running the entity resolution pipeline:

$ serf --help
Usage: serf [OPTIONS] COMMAND [ARGS]...

  SERF: Semantic Entity Resolution Framework CLI.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  block  Perform semantic blocking on input data.
  edges  Resolve edges after node merging.
  match  Align schemas, match entities, and merge within blocks.

Docker Setup

The easiest way to get started with Serf is using Docker and docker compose. This ensures a consistent development environment.
