This repository contains the official codebase for our paper:
"Do Llamas understand the periodic table?"
We investigate how large language models (LLMs) encode structured scientific knowledge using chemical elements as a case study. Our key findings include:
- Discovery of a 3D spiral structure in LLM activations, aligned with the periodic table.
- Intermediate layers encode continuous, overlapping attributes suitable for indirect recall.
- Deeper layers sharpen categorical boundaries and integrate linguistic context.
- LLMs organize facts as geometry-aware manifolds, not just isolated tokens.
Each folder corresponds to a section or concept in the paper:

- `Pre/` — Preprocessing scripts: prompt creation, activation extraction.
- `Geometry/` — Code for geometric analyses, such as spiral detection.
- `Direct_recall/` — Linear probing for direct factual recall.
- `Indirect_recall/` — Experiments on retrieving unmentioned or related facts.
- `Appendix/` — Extra analyses, visualizations, and ablation results.
- `Results/` — Saved figures, metrics, and outputs.
- `periodic_table_dataset.csv` — Structured dataset of 50 elements and their attributes (a quick way to inspect it is shown below).
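For a quick look at the dataset, you can load the CSV with pandas. This is a minimal sketch that assumes nothing about the column names; it simply prints whatever the file contains.

```python
# Minimal sketch: inspect the element dataset shipped with the repository.
# No column names are assumed; the script just prints what the CSV contains.
import pandas as pd

df = pd.read_csv("periodic_table_dataset.csv")
print(df.shape)             # expected: 50 rows, one per element
print(df.columns.tolist())  # attribute columns defined by the dataset
print(df.head())
```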
- Clone the repository and enter the project directory.
- Set your Hugging Face API token in `config.json` (a sketch of reading this token from Python follows the setup steps):

  ```json
  { "HF_TOKEN": "your_huggingface_token" }
  ```

- Install dependencies:

  ```bash
  conda create --name myenv python=3.10
  conda activate myenv
  pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
  pip install -r requirements.txt
  ```
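How the repository's scripts consume the token is defined by their own code; as a hedged illustration only (not part of the repo), one way to read `config.json` and authenticate with the Hugging Face Hub looks like this:

```python
# Illustrative sketch (not part of the repo): read the token from config.json
# and authenticate the current session with the Hugging Face Hub.
import json
from huggingface_hub import login

with open("config.json") as f:
    cfg = json.load(f)

login(token=cfg["HF_TOKEN"])
```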
Datasets

This project uses activation datasets located at `./activation_datasets/` (project root). You can obtain them in two ways:
Option A: Extract Residual Stream Yourself

- Edit the configuration file: `config_extract_activation.yaml`
- Run the extraction script (a rough sketch of the underlying idea appears below):

  ```bash
  python Pre/extract_activations.py
  ```
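The real extraction logic lives in `Pre/extract_activations.py` and is driven by `config_extract_activation.yaml`. As a rough, hedged sketch of the underlying idea (not the repository's implementation), collecting residual-stream activations with `transformers` can look like the following; the model name and prompt are placeholders.

```python
# Rough sketch of residual-stream extraction with transformers.
# Not the repo's script: model name, prompt, and storage format are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; the configured model may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

prompt = "The atomic number of oxygen is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple: (embedding layer, layer 1, ..., layer N)
last_token_acts = torch.stack([h[0, -1] for h in outputs.hidden_states])
print(last_token_acts.shape)  # (n_layers + 1, hidden_size)
```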
Option B: Download from Hugging Face

```bash
huggingface-cli download leige1114/activation_datasets \
  --repo-type dataset \
  --local-dir activation_datasets \
  --local-dir-use-symlinks False
```
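Alternatively, as a convenience equivalent to the CLI command above, the dataset can be fetched from Python with `huggingface_hub`:

```python
# Alternative to the CLI: download the activation dataset from the Hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="leige1114/activation_datasets",
    repo_type="dataset",
    local_dir="activation_datasets",
)
```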
Notes on quantization and platform support:
- bitsandbytes 4-bit quantization (`load_in_4bit`, `nf4`) is only supported on Linux with NVIDIA GPUs. It does not work on macOS (including Apple Silicon) or CPU-only setups.
- Disable quantization in configs:
  - `config_extract_activation.yaml`: set `extraction.quantization.load_in_4bit: false` (or remove the whole block).
  - `config_indirect.yaml`: set `quantization.load_in_4bit: false` if used.
- Disable quantization in scripts:
  - `Geometry/intervention.py`: set `'use_quantization': False`
  - `Appendix/entity_attention.py`: set `quantize=False`
- For scripts without a toggle: remove the BitsAndBytes-related code, or pass `quantization_config=None`. On CPU, you can also use `device_map="cpu"` and reduce the batch size (see the sketch below).
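As a hedged illustration of the CPU-only fallback (not taken from the repository's scripts; the model name is a placeholder), loading a model without bitsandbytes could look like this:

```python
# CPU-only fallback sketch: load a model without any bitsandbytes quantization.
# The model name is a placeholder; use whatever your config specifies.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=None,   # no 4-bit / nf4 quantization
    device_map="cpu",           # keep all weights on the CPU
    torch_dtype=torch.float32,  # full precision is safer on CPU
)
```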
Note: `requirements.txt` pins `bitsandbytes`. On macOS or CPU-only machines its installation may fail; remove that dependency and keep quantization disabled.