Custom Tokenizer App

An interactive web application for text tokenization, demonstrating both word-based and character-based tokenization approaches. Built with React and Vite.

Check it out here: Custom Tokenizer App

Screenshots:

Character tokenization interface

Word tokenization interface

🚀 Features

Multiple Tokenization Methods
- Word-based tokenization
- Character-based tokenization
Interactive UI
- Real-time tokenization visualization
- Highlighted token display
- Easy token ID copying
- Interactive encode/decode functionality
Advanced Processing
- Proper handling of special tokens ([BOS], [EOS], [PAD], [UNK])
- Case-sensitive token handling
- Intelligent space and punctuation preservation
Deployment to GitHub Pages
- Automatically deploys to GitHub Pages on every push to the main branch
- Uses Vite for an optimized production build

🛠️ Installation

Clone the repository:

git clone https://github.com/adityaSrivastava29/custom-tokenizer-app.git
cd custom-tokenizer-app

Install dependencies:

npm install

Start the development server:

npm run dev

Open your browser and navigate to:

http://localhost:5173

💡 Usage Guide

Input Text

Enter any text in the input field
Choose tokenization type (word/character)
See real-time tokenization results

Encoding

Text is encoded automatically as you type
Special tokens are added: [BOS] at the start, [EOS] at the end
Tokens not in the vocabulary are encoded as [UNK]
Each token is displayed with its ID
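
The encoding behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual source: the helper names `buildVocab` and `encodeTokens` are hypothetical, and the assumption that vocabulary IDs start right after the special tokens (at 4) is inferred from the example output in this README.

```javascript
// Special-token IDs as listed under "Token Types" in this README.
const SPECIAL_TOKENS = { "[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3 };

// Hypothetical helper: builds a vocabulary from a list of known tokens,
// assigning IDs after the special tokens (so the first known token gets 4).
function buildVocab(knownTokens) {
  const vocab = new Map(Object.entries(SPECIAL_TOKENS));
  for (const token of knownTokens) {
    if (!vocab.has(token)) vocab.set(token, vocab.size);
  }
  return vocab;
}

// Hypothetical helper: maps each token to its ID, falls back to [UNK] for
// tokens missing from the vocabulary, and wraps the result in [BOS] ... [EOS].
function encodeTokens(tokens, vocab) {
  const ids = tokens.map((t) => vocab.get(t) ?? SPECIAL_TOKENS["[UNK]"]);
  return [SPECIAL_TOKENS["[BOS]"], ...ids, SPECIAL_TOKENS["[EOS]"]];
}

const vocab = buildVocab(["hello", "world"]);
console.log(encodeTokens(["hello", "there", "world"], vocab));
// → [ 2, 4, 1, 5, 3 ]
```

Here "there" is not in the vocabulary, so it is encoded as 1 ([UNK]).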

Decoding

Enter token IDs in the decode field
Click the "Decode" button to convert them back to text
View the reconstructed text output
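
As a rough sketch of what decoding involves, the snippet below parses a comma-separated ID string and maps each ID back to a token. The helper names `parseIds` and `decodeIds` are hypothetical, and two choices are assumptions rather than documented behaviour of the app: [BOS], [EOS], and [PAD] are dropped from the output, and tokens are joined with a single space.

```javascript
// Special-token IDs as listed under "Token Types" in this README.
const SPECIAL_IDS = { PAD: 0, UNK: 1, BOS: 2, EOS: 3 };

// Hypothetical helper: parses a string such as "2, 4, 5, 3" into numbers,
// ignoring empty entries (for example from a trailing comma).
function parseIds(input) {
  return input
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part !== "")
    .map(Number);
}

// Hypothetical helper: idToToken is a Map from numeric ID to token string.
// Assumption: sequence markers and padding are omitted from the decoded
// text, and IDs missing from the map are shown as [UNK].
function decodeIds(ids, idToToken) {
  const skip = new Set([SPECIAL_IDS.PAD, SPECIAL_IDS.BOS, SPECIAL_IDS.EOS]);
  return ids
    .filter((id) => !skip.has(id))
    .map((id) => idToToken.get(id) ?? "[UNK]")
    .join(" ");
}

const idToToken = new Map([[1, "[UNK]"], [4, "hello"], [5, "world"]]);
console.log(decodeIds(parseIds("2, 4, 1, 5, 3"), idToToken));
// prints: hello [UNK] world
```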

Copy Functionality

Use the "Copy" button to copy encoded tokens
Tokens are copied in a comma-separated format
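
One way a copy action like this could be implemented is with the browser's Clipboard API. The sketch below is an assumption about the approach, not the app's code: `formatTokenIds` and `copyTokenIds` are hypothetical names, and the ", " separator is taken from the example output shown in this README.

```javascript
// Hypothetical helper: formats token IDs as comma-separated text.
function formatTokenIds(ids) {
  return ids.join(", ");
}

// Hypothetical helper: writes the formatted IDs to the clipboard when the
// Clipboard API is available (secure browser contexts), and returns the text
// either way so the caller can show it or fall back to manual copying.
async function copyTokenIds(ids) {
  const text = formatTokenIds(ids);
  if (typeof navigator !== "undefined" && navigator.clipboard) {
    await navigator.clipboard.writeText(text);
  }
  return text;
}

console.log(formatTokenIds([2, 4, 5, 3]));
// prints: 2, 4, 5, 3
```

In a React component, `copyTokenIds` would typically be called from the button's click handler, since browsers generally require a user gesture for clipboard writes.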

🔍 Example

Input Text:

Namaste JI, Custom Tokenizer App me aapka Swagat hai

Encoded Output (Word-based):

2, 4, 5, 6, 7, 8, 9, 1, 10, 11, 3

Where:

2: [BOS] token
4-11: Vocabulary tokens
1: [UNK] token for unknown words
3: [EOS] token
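
To make the breakdown above concrete, this small sketch labels each ID in the example output using the special-token IDs from this README. The `describeIds` helper and the `vocab:` label format are purely illustrative.

```javascript
// Special-token IDs and names as listed under "Token Types" in this README.
const SPECIAL_NAMES = new Map([
  [0, "[PAD]"],
  [1, "[UNK]"],
  [2, "[BOS]"],
  [3, "[EOS]"],
]);

// Hypothetical helper: labels each ID as a special token or a vocabulary entry.
function describeIds(ids) {
  return ids.map((id) => SPECIAL_NAMES.get(id) ?? `vocab:${id}`);
}

const exampleOutput = [2, 4, 5, 6, 7, 8, 9, 1, 10, 11, 3];
console.log(describeIds(exampleOutput).join(" "));
// prints: [BOS] vocab:4 vocab:5 vocab:6 vocab:7 vocab:8 vocab:9 [UNK] vocab:10 vocab:11 [EOS]
```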

🛠️ Technical Details

Token Types

[BOS]: Beginning of sequence token (ID: 2)
[EOS]: End of sequence token (ID: 3)
[UNK]: Unknown token (ID: 1)
[PAD]: Padding token (ID: 0)
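
This README does not say where the app uses [PAD]. A common use of a padding token is to bring sequences up to a fixed length, which might look like the sketch below; `padTo` is a hypothetical helper and right-side padding is an assumption.

```javascript
// [PAD] ID as listed above.
const PAD_ID = 0;

// Hypothetical helper: returns a copy of ids padded on the right with PAD_ID
// up to the requested length. Sequences already long enough are returned
// unchanged (no truncation is attempted).
function padTo(ids, length) {
  if (ids.length >= length) return ids.slice();
  return [...ids, ...new Array(length - ids.length).fill(PAD_ID)];
}

console.log(padTo([2, 4, 5, 3], 6));
// → [ 2, 4, 5, 3, 0, 0 ]
```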

Tokenization Logic

Word mode: Splits on spaces and punctuation
Character mode: Processes each character individually
Preserves spaces and punctuation from the original text
Tokens are case-sensitive
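
The two splitting modes can be approximated as shown below. This is a sketch under stated assumptions rather than the app's exact rules: `splitWords` and `splitChars` are hypothetical names, punctuation is kept as separate tokens, whitespace is dropped in word mode, and the `\w` pattern only treats ASCII letters, digits, and underscores as word characters.

```javascript
// Hypothetical word-mode splitter: runs of word characters become one token
// and each punctuation character becomes its own token. Whitespace only
// separates tokens. Case is left untouched, so "Hello" and "hello" differ.
function splitWords(text) {
  return text.match(/\w+|[^\w\s]/g) ?? [];
}

// Hypothetical character-mode splitter: Array.from iterates by Unicode code
// point, so characters outside the Basic Multilingual Plane stay intact.
function splitChars(text) {
  return Array.from(text);
}

console.log(splitWords("Hello, World!"));
// → [ 'Hello', ',', 'World', '!' ]
console.log(splitChars("Hi!"));
// → [ 'H', 'i', '!' ]
```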

📝 License

MIT License

🤝 Contributing

Feel free to:

Open issues
Submit pull requests
Suggest improvements
Report bugs

Your contributions are welcome!
