🚀 Go-based Byte Pair Encoding (BPE)

A Go implementation of Byte Pair Encoding (BPE), inspired by Andrej Karpathy's tutorial. It supports the tiktoken file format from OpenAI. You can fetch pretrained encodings directly from OpenAI's GitHub 📦.
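
To make the file format concrete, here is a minimal sketch of reading a .tiktoken file, assuming each line holds a base64-encoded token, a space, and an integer rank. The name `parseTiktoken` is hypothetical and is not taken from this repository; the real loader may look different.

```go
package main

import (
	"bufio"
	"encoding/base64"
	"fmt"
	"strconv"
	"strings"
)

// parseTiktoken is a hypothetical helper (not this repository's API).
// It reads lines of the form "<base64 token> <rank>" and returns a map
// from the raw token bytes (stored as a string key) to the rank.
func parseTiktoken(data string) (map[string]int, error) {
	ranks := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // tolerate blank lines
		}
		b64, rankStr, ok := strings.Cut(line, " ")
		if !ok {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		tok, err := base64.StdEncoding.DecodeString(b64)
		if err != nil {
			return nil, fmt.Errorf("bad base64 in %q: %w", line, err)
		}
		rank, err := strconv.Atoi(rankStr)
		if err != nil {
			return nil, fmt.Errorf("bad rank in %q: %w", line, err)
		}
		ranks[string(tok)] = rank
	}
	return ranks, sc.Err()
}

func main() {
	// "IQ==" is base64 for "!", "aGk=" is base64 for "hi".
	ranks, err := parseTiktoken("IQ== 0\naGk= 1\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(ranks["!"], ranks["hi"]) // 0 1
}
```

Using a string as the map key keeps arbitrary bytes intact, since Go strings are not required to be valid UTF-8.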

✨ Features

🔤 Tokenizes arbitrary byte sequences (not just text!)
🧩 Special token support with whitelisting
🧪 Regex-based input splitting
⚠️ Not a drop-in replacement for OpenAI’s tokenizer

🗂️ Project Structure

├── bpeprocessor.go          # Interface definition
├── go.mod                   # Module config
├── README.md                # You're reading it!
├── regextiktokenproc.go     # Regex-enhanced BPE processor
├── regextiktokenproc_test.go
├── tiktokenproc.go          # Core BPE for OpenAI's .tiktoken format
├── tiktokenproc_test.go
└── testdata/
    └── cl100k_base.tiktoken # Sample encoding data

💡 Key Takeaways

🧪 Fuzz testing in Go is powerful — used to test decode(encode(x)) == x across edge cases
🌐 UTF-8 is full of surprises — beware of multi-byte characters
🛠️ Byte slice manipulation in Go can be tricky and annoying 😅
🔍 Go’s regex capabilities are fundamentally different from Python’s 🐍 — beware of surprises!

📄 License

This project is licensed under the MIT License. Feel free to use, fork, or contribute!

Repository: bluuuk/bpe-go