LahiaOmar/tokens_viewer

Tokenizer Token Viewer

This repository houses an implementation of the Byte Pair Encoding (BPE) algorithm for tokenization, with several levels of optimization. BPE originated as a data compression technique and is now widely used in natural language processing to split text into smaller subword units. By iteratively merging the most frequent pair of adjacent characters or bytes, BPE builds a vocabulary that represents common words compactly while still covering rare ones, which benefits NLP tasks such as machine translation, language modeling, and text generation.

  • Basic BPE Implementation. basic_bpe
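
The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in `basic_bpe`: the function names (`count_pairs`, `merge`, `train_bpe`, `decode`) are hypothetical, the input is assumed to be UTF-8 text, and new token ids are assumed to start at 256, right after the 256 possible byte values.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`.

    Returns the compressed id sequence and a dict mapping each merged
    pair to its new id (ids 0..255 are raw bytes; this is an assumption
    of the sketch, not something stated in the repository).
    """
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        pair, _ = pairs.most_common(1)[0]
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Expand token ids back into text by replaying the merges in order."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dicts keep insertion order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")
```

For example, `train_bpe("aaabdaaabac", 3)` first merges the byte pair for `"aa"` because it is the most frequent adjacent pair, and `decode` applied to the result reproduces the original string. The sketch recounts all pairs on every step, which is simple but slow; that recounting is the kind of cost an optimized version would try to avoid.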

❤️ This work is mostly inspired by the YouTube video on tokenization by the talented Andrej Karpathy (Twitter).

Deployed version

About

String tokenization with Byte Pair Encoding (BPE).
