A more general architecture has been developed in follow-up work: the attention-driven transformer (ADT) removes the explicit summation mechanism while retaining projection layers and sparse attention. This simpler formulation achieves state-of-the-art results on time-series forecasting benchmarks with reduced computational overhead.
New repository: attention-driven-transformers
For new projects, we recommend exploring ADT as a simpler starting point. This repository remains available for reference and for reproducing the original summation-based experiments.
This repository implements summation-based aggregation, a simple alternative to self-attention that reduces per-layer complexity from O(n²·d) to O(n·d).
Instead of computing pairwise similarities, tokens are modulated by learned positional encodings, projected through bias-free layers with nonlinearities, and aggregated by direct summation.
On its own, summation is competitive in classification and multimodal settings. In autoregressive language modeling, a hybrid design (summation in most layers, with a single attention layer at the output) matches or slightly exceeds full-attention performance while remaining nearly linear in cost.
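To make the cost gap concrete: at n = 4,096 tokens and d = 512, attention's pairwise term grows like n²·d ≈ 8.6×10⁹ per layer, while summation grows like n·d ≈ 2.1×10⁶. The snippet below is a minimal PyTorch sketch of the idea described above, not the repository's implementation; the class name `SummationAggregation`, the `max_len` parameter, and the ReLU/cumulative-sum choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SummationAggregation(nn.Module):
    """Illustrative sketch only -- not the repository's implementation.

    Tokens are modulated by learned positional encodings, projected through a
    bias-free layer with a nonlinearity, and aggregated by summation, so the
    per-layer cost is O(n*d) rather than O(n^2*d).
    """

    def __init__(self, d_model: int, max_len: int = 512, causal: bool = True):
        super().__init__()
        # Learned positional encodings that modulate each token before projection.
        self.pos = nn.Parameter(0.02 * torch.randn(max_len, d_model))
        # Bias-free projection followed by a nonlinearity.
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.causal = causal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); assumes seq_len <= max_len
        n = x.size(1)
        h = torch.relu(self.proj(x * self.pos[:n]))
        if self.causal:
            # Running (prefix) sum preserves autoregressive ordering for language modeling.
            return h.cumsum(dim=1)
        # Non-causal: every position receives the same pooled summary (e.g. classification).
        return h.sum(dim=1, keepdim=True).expand_as(h)
```

Because the module maps a `(batch, seq_len, d_model)` tensor to a tensor of the same shape, it can sit wherever an attention sublayer would.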
- Near-linear scaling: O(n·d) vs. O(n²·d) for attention
- Hybrid-friendly: most layers use summation, a final layer uses attention
- Drop-in compatible: summation can replace attention inside transformer blocks without altering residuals, norms, or optimizers (see the sketch after this list)
- Broad applicability: tested across classification, language modeling, and multimodal regression
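To illustrate the drop-in claim, here is a hedged sketch of a pre-norm transformer block whose token-mixing sublayer can be either the summation module sketched above or a thin self-attention wrapper; `HybridBlock`, `SelfAttentionMixer`, and `build_hybrid_stack` are illustrative names, and the repository's actual block layout may differ.

```python
import torch.nn as nn

class SelfAttentionMixer(nn.Module):
    """Thin wrapper so attention exposes the same x -> x interface as summation."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # A causal attn_mask would be added here for autoregressive language modeling.
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class HybridBlock(nn.Module):
    """Standard pre-norm block; only the mixer changes, residuals and norms do not."""

    def __init__(self, d_model: int, mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mixer = mixer  # SummationAggregation(...) or SelfAttentionMixer(...)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))  # token mixing (summation or attention)
        x = x + self.mlp(self.norm2(x))    # position-wise feed-forward
        return x


def build_hybrid_stack(d_model: int = 256, depth: int = 6) -> nn.Sequential:
    """Summation in all but the last block, a single attention block at the output."""
    blocks = [HybridBlock(d_model, SummationAggregation(d_model)) for _ in range(depth - 1)]
    blocks.append(HybridBlock(d_model, SelfAttentionMixer(d_model)))
    return nn.Sequential(*blocks)
```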
git clone https://github.com/pfekin/summation-based-transformers
cd summation-based-transformers
pip install -r requirements.txt
# Run language modeling benchmark
python causal.py
# Run classification benchmark
python classifier.py
# Run multimodal regression benchmark
python multimodal.py

# In a notebook (e.g., Google Colab), install dependencies with:
!pip install --upgrade datasets fsspec huggingface_hub
!pip install transformers -U
!pip install git+https://github.com/pfekin/summation-based-transformers

- Classification: With a context window of 512 tokens, summation performs on par with attention while running up to ~18× faster (CPU and GPU).
- Language Modeling: Pure summation lags behind, but a hybrid design (summation + one attention layer) closes the gap and sometimes outperforms full attention.
- Multimodal Regression: Summation provides a shared channel across text and metadata, yielding competitive results with fewer parameters and faster training.
Summation layers restructure embeddings differently from attention: instead of gradual refinement, they show sharper shifts and alternating contraction–expansion of representational dimensionality.
Figure: PCA trajectories of embeddings across layers. Summation restructures the manifold before the final attention layer stabilizes it.
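A rough way to produce this kind of view, assuming per-layer hidden states have already been collected as a list of `(n_tokens, d_model)` NumPy arrays (the collection step and the exact plotting choices shown here are not from the repository):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_layer_trajectories(hidden_states):
    """hidden_states: list of (n_tokens, d_model) arrays, one per layer (hypothetical input).

    Fits PCA on the stacked representations, then plots each layer's mean
    projection onto the first two components to visualize how the embedding
    manifold shifts from layer to layer.
    """
    stacked = np.concatenate(hidden_states, axis=0)
    pca = PCA(n_components=2).fit(stacked)
    means = np.array([pca.transform(h).mean(axis=0) for h in hidden_states])

    plt.plot(means[:, 0], means[:, 1], marker="o")
    for i, (x, y) in enumerate(means):
        plt.annotate(f"L{i}", (x, y))  # label each point with its layer index
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.title("Per-layer embedding trajectory (PCA)")
    plt.show()
```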
- Python 3.8+
- PyTorch 1.9+ (or TensorFlow with CUDA support)
- transformers, scikit-learn, numpy, matplotlib, datasets, fsspec, huggingface_hub
For details, see: "Summation-Based Transformers: A Path Toward Linear Complexity Sequence Modeling," TechRxiv, 2025. 📄 Download Paper
@article{Summation_Based_Transformers_2025,
title={Summation-Based Transformers: A Path Toward Linear Complexity Sequence Modeling},
author={Pascal Ekin},
journal={TechRxiv},
year={2025},
doi={10.36227/techrxiv.175790522.25734653/v2},
url={https://doi.org/10.36227/techrxiv.175790522.25734653/v2},
}

- Author: Pascal Ekin
- Email: pfekin@gmail.com
- Issues: Use the GitHub issue tracker for bugs/requests