A more general architecture has been developed in follow-up work: the attention-driven transformer (ADT) removes the explicit summation mechanism while retaining projection layers and sparse attention. This simpler formulation achieves state-of-the-art results on time-series forecasting benchmarks with reduced computational overhead.
New repository: attention-driven-transformers
For new projects, we recommend exploring ADT as a simpler starting point. This repository remains available for reference and for reproducing the original summation-based experiments.
This repository implements summation-based aggregation, a simple alternative to self-attention that reduces per-layer complexity from O(n²·d) to O(n·d).
Instead of computing pairwise similarities, tokens are modulated by learned positional encodings, projected through bias-free layers with nonlinearities, and aggregated by direct summation.
On its own, summation is competitive in classification and multimodal settings. In autoregressive language modeling, a hybrid design (summation in most layers, with a single attention layer at the output) matches or slightly exceeds full-attention performance while remaining nearly linear in cost.
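To make the cost gap concrete: at n = 4,096 tokens and d = 512, attention's pairwise term grows like n²·d ≈ 8.6×10⁹ per layer, while summation grows like n·d ≈ 2.1×10⁶. The snippet below is a minimal PyTorch sketch of the idea described above, not the repository's implementation; the class name `SummationAggregation`, the `max_len` parameter, and the ReLU/cumulative-sum choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SummationAggregation(nn.Module):
    """Illustrative sketch only -- not the repository's implementation.

    Tokens are modulated by learned positional encodings, projected through a
    bias-free layer with a nonlinearity, and aggregated by summation, so the
    per-layer cost is O(n*d) rather than O(n^2*d).
    """

    def __init__(self, d_model: int, max_len: int = 512, causal: bool = True):
        super().__init__()
        # Learned positional encodings that modulate each token before projection.
        self.pos = nn.Parameter(0.02 * torch.randn(max_len, d_model))
        # Bias-free projection followed by a nonlinearity.
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.causal = causal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); assumes seq_len <= max_len
        n = x.size(1)
        h = torch.relu(self.proj(x * self.pos[:n]))
        if self.causal:
            # Running (prefix) sum preserves autoregressive ordering for language modeling.
            return h.cumsum(dim=1)
        # Non-causal: every position receives the same pooled summary (e.g. classification).
        return h.sum(dim=1, keepdim=True).expand_as(h)
```

Because the module maps a `(batch, seq_len, d_model)` tensor to a tensor of the same shape, it can sit wherever an attention sublayer would.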
- Near-linear scaling: O(n·d) vs. O(n²·d) for attention
- Hybrid-friendly: most layers use summation, a final layer uses attention
- Drop-in compatible: summation can replace attention inside transformer blocks without altering residuals, norms, or optimizers (see the sketch after this list)
- Broad applicability: tested across classification, language modeling, and multimodal regression
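To illustrate the drop-in claim, here is a hedged sketch of a pre-norm transformer block whose token-mixing sublayer can be either the summation module sketched above or a thin self-attention wrapper; `HybridBlock`, `SelfAttentionMixer`, and `build_hybrid_stack` are illustrative names, and the repository's actual block layout may differ.

```python
import torch.nn as nn

class SelfAttentionMixer(nn.Module):
    """Thin wrapper so attention exposes the same x -> x interface as summation."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # A causal attn_mask would be added here for autoregressive language modeling.
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class HybridBlock(nn.Module):
    """Standard pre-norm block; only the mixer changes, residuals and norms do not."""

    def __init__(self, d_model: int, mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mixer = mixer  # SummationAggregation(...) or SelfAttentionMixer(...)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))  # token mixing (summation or attention)
        x = x + self.mlp(self.norm2(x))    # position-wise feed-forward
        return x


def build_hybrid_stack(d_model: int = 256, depth: int = 6) -> nn.Sequential:
    """Summation in all but the last block, a single attention block at the output."""
    blocks = [HybridBlock(d_model, SummationAggregation(d_model)) for _ in range(depth - 1)]
    blocks.append(HybridBlock(d_model, SelfAttentionMixer(d_model)))
    return nn.Sequential(*blocks)
```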
git clone https://github.com/pfekin/summation-based-transformers
cd summation-based-transformers
pip install -r requirements.txt
# Run language modeling benchmark
python causal.py
# Run classification benchmark
python classifier.py
# Run multimodal regression benchmark
python multimodal.py

# In a notebook (e.g., Google Colab), install dependencies with:
!pip install --upgrade datasets fsspec huggingface_hub
!pip install transformers -U
!pip install git+https://github.com/pfekin/summation-based-transformers

- Classification: With a context window of 512 tokens, summation performs on par with attention while running up to ~18× faster (CPU and GPU).
- Language Modeling: Pure summation lags behind, but a hybrid design (summation + one attention layer) closes the gap and sometimes outperforms full attention.
- Multimodal Regression: Summation provides a shared channel across text and metadata, yielding competitive results with fewer parameters and faster training.
Summation layers restructure embeddings differently from attention: instead of gradual refinement, they show sharper shifts and alternating contraction–expansion of representational dimensionality.
Figure: PCA trajectories of embeddings across layers. Summation restructures the manifold before the final attention layer stabilizes it.
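A rough way to produce this kind of view, assuming per-layer hidden states have already been collected as a list of `(n_tokens, d_model)` NumPy arrays (the collection step and the exact plotting choices shown here are not from the repository):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_layer_trajectories(hidden_states):
    """hidden_states: list of (n_tokens, d_model) arrays, one per layer (hypothetical input).

    Fits PCA on the stacked representations, then plots each layer's mean
    projection onto the first two components to visualize how the embedding
    manifold shifts from layer to layer.
    """
    stacked = np.concatenate(hidden_states, axis=0)
    pca = PCA(n_components=2).fit(stacked)
    means = np.array([pca.transform(h).mean(axis=0) for h in hidden_states])

    plt.plot(means[:, 0], means[:, 1], marker="o")
    for i, (x, y) in enumerate(means):
        plt.annotate(f"L{i}", (x, y))  # label each point with its layer index
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.title("Per-layer embedding trajectory (PCA)")
    plt.show()
```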
- Python 3.8+
- PyTorch 1.9+ (or TensorFlow with CUDA support)
- transformers, scikit-learn, numpy, matplotlib, datasets, fsspec, huggingface_hub
For details, see: "Summation-Based Transformers: A Path Toward Linear Complexity Sequence Modeling," TechRxiv, 2025. 📄 Download Paper
@article{Summation_Based_Transformers_2025,
title={Summation-Based Transformers: A Path Toward Linear Complexity Sequence Modeling},
author={Pascal Ekin},
journal={TechRxiv},
year={2025},
doi={10.36227/techrxiv.175790522.25734653/v2},
url={https://doi.org/10.36227/techrxiv.175790522.25734653/v2},
}

- Author: Pascal Ekin
- Email: pfekin@gmail.com
- Issues: Use the GitHub issue tracker for bugs/requests