A Go implementation of Byte Pair Encoding (BPE), inspired by Andrej Karpathy's tutorial. It supports the tiktoken
file format from OpenAI, so you can fetch pretrained encodings directly from OpenAI's GitHub 📦.
- 🔤 Tokenizes arbitrary byte sequences (not just text!)
- 🧩 Special token support with whitelisting
- 🧪 Regex-based input splitting
⚠️ Not a drop-in replacement for OpenAI’s tokenizer
├── bpeprocessor.go # Interface definition
├── go.mod # Module config
├── README.md # You're reading it!
├── regextiktokenproc.go # Regex-enhanced BPE processor
├── regextiktokenproc_test.go
├── tiktokenproc.go # Core BPE for OpenAI's .tiktoken format
├── tiktokenproc_test.go
└── testdata/
    └── cl100k_base.tiktoken # Sample encoding data
- 🧪 Fuzz testing in Go is powerful: it was used to test decode(encode(x)) == x across edge cases
- 🌐 UTF-8 is full of surprises: beware of multi-byte characters
- 🛠️ Byte slice manipulation in Go can be... tricky & annoying😅
- 🔍 Go’s regexp package uses RE2 syntax, which has no lookaheads or backreferences, so patterns written for Python 🐍 may not compile unchanged. Beware of surprises!
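
The lessons above can be sketched in a few lines of Go. Everything here is illustrative: `byteTokens` and `fromTokens` are a toy one-token-per-byte codec standing in for the real encoder, and `roundTrips`, `splitsRune`, and `compiles` are hypothetical helper names, not part of this package.

```go
package main

import (
	"bytes"
	"regexp"
	"unicode/utf8"
)

// byteTokens is a toy stand-in for a real encoder: one token per input byte.
func byteTokens(data []byte) []int {
	out := make([]int, len(data))
	for i, b := range data {
		out[i] = int(b)
	}
	return out
}

// fromTokens is the matching toy decoder.
func fromTokens(tokens []int) []byte {
	out := make([]byte, len(tokens))
	for i, t := range tokens {
		out[i] = byte(t)
	}
	return out
}

// roundTrips checks the property decode(encode(x)) == x for one input.
func roundTrips(data []byte) bool {
	return bytes.Equal(fromTokens(byteTokens(data)), data)
}

// splitsRune reports whether cutting data at byte offset i would land on a
// UTF-8 continuation byte, i.e. inside a multi-byte character.
func splitsRune(data []byte, i int) bool {
	return i > 0 && i < len(data) && !utf8.RuneStart(data[i])
}

// compiles reports whether Go's regexp package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}
```

In a `_test.go` file, a fuzz target can wrap the same property with the real encoder swapped in for the toy one, for example by calling `roundTrips` inside `f.Fuzz(func(t *testing.T, data []byte) { ... })`. The `compiles` helper makes the regex difference visible: a pattern with a negative lookahead such as `\s+(?!\S)` is rejected by Go, while `\p{L}+` is accepted.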
The file testdata/cl100k_base.tiktoken is under the MIT License. © 2022 OpenAI, Shantanu Jain.
This project itself is also licensed under the MIT License — feel free to use, fork, or contribute!