A PHP implementation of OpenAI's BPE tokenizer tiktoken.
-
Updated
Jan 25, 2025 - PHP
A PHP implementation of OpenAI's BPE tokenizer tiktoken.
Byte-Pair Encoding tokenizer for training large language models on huge datasets
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
(1) Train large language models to help people with automatic essay scoring. (2) Extract essay features and train new tokenizer to build tree models for score prediction.
Teaching transformer-based architectures
🐍This is a fast, lightweight, and clean CPython extension for the Byte Pair Encoding (BPE) algorithm, which is commonly used in LLM tokenization and NLP tasks.
BPE tokenizer for LLMs in Pure Zig
[Rust] Unofficial implementation of "SuperBPE: Space Travel for Language Models" in Rust
This project implements a Byte Pair Encoding (BPE) tokenization approach along with a Word2Vec model to generate word embeddings from a text corpus. The implementation leverages Apache Hadoop for distributed processing and includes evaluation metrics for optimal dimensionality of embeddings.
C89, single header, nostdlib byte pair encoding algorythm
processing de LANguage NATural
Transformer Models for Humorous Text Generation. Fine-tuned on Russian jokes dataset with ALiBi, RoPE, GQA, and SwiGLU.Plus a custom Byte-level BPE tokenizer.
The basic BPE-tokenizer in NLP
A modular Byte Pair Encoding (BPE) tokenizer implemented in Python
✨ BPE-Tokenizer for university module Foundational Generative Models.
Naive implementation of byte pair encoding (BPE) tokenization algorithm in golang
TF-IDF Calculation
Interactive tool for exploring Byte Pair Encoding tokenization step-by-step.
Add a description, image, and links to the bpe-tokenizer topic page so that developers can more easily learn about it.
To associate your repository with the bpe-tokenizer topic, visit your repo's landing page and select "manage topics."