An interactive web application for text tokenization, demonstrating both word-based and character-based tokenization approaches. Built with React and Vite.
Check it out here: Custom Tokenizer App
Screenshot: character tokenization UI
Screenshot: word tokenization UI
- Multiple Tokenization Methods
  - Word-based tokenization
  - Character-based tokenization
- Interactive UI
  - Real-time tokenization visualization
  - Highlighted token display
  - Easy token ID copying
  - Interactive encode/decode functionality
- Advanced Processing
  - Proper handling of special tokens ([BOS], [EOS], [PAD], [UNK])
  - Case-sensitive token handling
  - Intelligent space and punctuation preservation
- Deployment to GitHub Pages
  - Automatically deploys to GitHub Pages on push to the main branch
  - Uses Vite for an optimized production build
To run the app locally:
- Clone the repository:
    git clone https://github.com/adityaSrivastava29/custom-tokenizer-app.git
    cd custom-tokenizer-app
- Install dependencies:
    npm install
- Start the development server:
    npm run dev
- Open your browser and navigate to:
    http://localhost:5173
- Enter any text in the input field
- Choose tokenization type (word/character)
- See real-time tokenization results
Encoding:
- Text is encoded automatically as you type
- Special tokens are added ([BOS] at start, [EOS] at end)
- Unknown tokens are marked as [UNK]
- Tokens are displayed with their IDs
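The encoding steps above can be sketched in JavaScript as follows. This is a minimal illustration, not the app's actual code: `buildVocab` and `encode` are hypothetical names, and assigning vocabulary IDs from 4 upward in order of first appearance is an assumption (it is consistent with the special-token IDs 0 to 3 listed in this README).

```javascript
// Special-token IDs as listed in this README.
const SPECIAL = { "[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3 };

// Hypothetical helper: give each distinct token the next free ID (4, 5, ...),
// in order of first appearance. Map keys are case-sensitive.
function buildVocab(tokens) {
  const vocab = new Map(Object.entries(SPECIAL));
  for (const token of tokens) {
    if (!vocab.has(token)) vocab.set(token, vocab.size);
  }
  return vocab;
}

// Wrap the sequence in [BOS]/[EOS] and map tokens missing from the vocab to [UNK].
function encode(tokens, vocab) {
  const ids = tokens.map((token) => vocab.get(token) ?? SPECIAL["[UNK]"]);
  return [SPECIAL["[BOS]"], ...ids, SPECIAL["[EOS]"]];
}

const vocab = buildVocab(["Custom", "Tokenizer", "App"]);
console.log(encode(["Custom", "App", "hello"], vocab)); // [2, 4, 6, 1, 3]
```

Here "hello" is not in the vocabulary, so it is encoded as 1 ([UNK]).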
Decoding:
- Enter the token IDs in the decode field
- Click the "Decode" button to convert them back to text
- View the reconstructed text output
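A rough sketch of the decode step, under several assumptions: the ID-to-token table below is made up for illustration (the app's real vocabulary is not shown here), tokens are rejoined with single spaces as in word mode, and [BOS], [EOS], and [PAD] are dropped from the output. The function name `decode` is hypothetical.

```javascript
// Hypothetical ID-to-token table; only IDs 0 to 3 come from this README.
const idToToken = new Map([
  [0, "[PAD]"], [1, "[UNK]"], [2, "[BOS]"], [3, "[EOS]"],
  [4, "Custom"], [5, "Tokenizer"], [6, "App"],
]);

// Parse comma-separated IDs and rebuild text, skipping [BOS], [EOS], and [PAD].
// IDs that are not in the table (or are not numbers) are shown as [UNK].
function decode(input, table) {
  const skip = new Set(["[BOS]", "[EOS]", "[PAD]"]);
  return input
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part !== "")
    .map((part) => table.get(Number(part)) ?? "[UNK]")
    .filter((token) => !skip.has(token))
    .join(" ");
}

console.log(decode("2, 4, 5, 1, 6, 3", idToToken)); // "Custom Tokenizer [UNK] App"
```

Note that an [UNK] ID cannot be turned back into the original word, since that word was never stored in the vocabulary.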
Copying tokens:
- Use the "Copy" button to copy the encoded tokens
- Tokens are copied in a comma-separated format
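One way the copy behavior could work is sketched below. `formatTokenIds` and `copyTokenIds` are hypothetical names, and the ", " separator is an assumption based on the example output in this README. `navigator.clipboard.writeText` is the standard browser Clipboard API, which requires a secure context (HTTPS or localhost).

```javascript
// Join token IDs into a comma-separated string.
function formatTokenIds(ids) {
  return ids.join(", ");
}

// Hypothetical copy handler. `clipboard` defaults to the browser Clipboard API
// and can be replaced with a stub outside the browser.
async function copyTokenIds(ids, clipboard = globalThis.navigator?.clipboard) {
  if (!clipboard) throw new Error("Clipboard API not available");
  await clipboard.writeText(formatTokenIds(ids));
}

console.log(formatTokenIds([2, 4, 5, 3])); // "2, 4, 5, 3"
```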
Input Text:
Namaste JI, Custom Tokenizer App me aapka Swagat hai
Encoded Output (Word-based):
2, 4, 5, 6, 7, 8, 9, 1, 10, 11, 3
Where:
- 2: [BOS] token
- 4-11: Vocabulary tokens
- 1: [UNK] token for unknown words
- 3: [EOS] token
Special tokens:
- [BOS]: Beginning-of-sequence token (ID: 2)
- [EOS]: End-of-sequence token (ID: 3)
- [UNK]: Unknown token (ID: 1)
- [PAD]: Padding token (ID: 0)
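As an illustration of what a padding token is typically used for, the sketch below right-pads an ID sequence to a fixed length with the [PAD] ID listed above. Whether and how the app itself pads sequences is not stated here, and `padTo` is a hypothetical helper.

```javascript
// [PAD] ID as listed in this README.
const PAD_ID = 0;

// Hypothetical helper: return a copy of `ids` right-padded with PAD_ID up to `length`.
// Sequences that are already long enough are returned unchanged (no truncation).
function padTo(ids, length) {
  const padded = [...ids];
  while (padded.length < length) padded.push(PAD_ID);
  return padded;
}

console.log(padTo([2, 4, 5, 3], 6)); // [2, 4, 5, 3, 0, 0]
```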
Tokenization rules:
- Word mode: splits on spaces and punctuation
- Character mode: processes each character individually
- Preserves original text formatting
- Treats tokens as case-sensitive
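The two modes can be sketched as below. This is only an approximation of the rules above, with hypothetical function names: in this sketch, word mode treats whitespace and punctuation purely as delimiters and drops them, so it does not attempt to preserve spacing or punctuation the way the app is described as doing.

```javascript
// Word mode: split on runs of whitespace and Unicode punctuation, dropping the delimiters.
// Case is left untouched, so "App" and "app" stay distinct.
function tokenizeWords(text) {
  return text.split(/[\s\p{P}]+/u).filter((token) => token.length > 0);
}

// Character mode: one token per Unicode code point, spaces and punctuation included.
function tokenizeChars(text) {
  return Array.from(text);
}

console.log(tokenizeWords("Namaste JI, Custom Tokenizer App"));
// ["Namaste", "JI", "Custom", "Tokenizer", "App"]
console.log(tokenizeChars("Hi, JI"));
// ["H", "i", ",", " ", "J", "I"]
```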
MIT License
Feel free to:
- Open issues
- Submit pull requests
- Suggest improvements
- Report bugs
Your contributions are welcome!