Skip to content

An interactive web application for text tokenization, demonstrating both word-based and character-based tokenization approaches. Built with React and Vite.

Notifications You must be signed in to change notification settings

adityaSrivastava29/custom-tokenizer-app

Repository files navigation

Custom Tokenizer App

An interactive web application for text tokenization, demonstrating both word-based and character-based tokenization approaches. Built with React and Vite.

Check It out here: Custom Tokenizer App

Screenshots:

UI Interface character tokenization UI Interface character

UI Interface Word tokenization UI Interface words

🚀 Features

  • Multiple Tokenization Methods

    • Word-based tokenization
    • Character-based tokenization
  • Interactive UI

    • Real-time tokenization visualization
    • Highlighted token display
    • Easy token ID copying
    • Interactive encode/decode functionality
  • Advanced Processing

    • Proper handling of special tokens ([BOS], [EOS], [PAD], [UNK])
    • Case-sensitive token handling
    • Intelligent space and punctuation preservation
  • Deployement of Gihub Pages

    • Automatically deploys to GitHub Pages on push to main branch
    • Uses Vite for optimized production build

🛠️ Installation

  1. Clone the repository:
git clone https://github.com/adityaSrivastava29/custom-tokenizer-app.git
cd custom-tokenizer-app
  1. Install dependencies:
npm install
  1. Start the development server:
npm run dev
  1. Open your browser and navigate to:
http://localhost:5173

💡 Usage Guide

Input Text

  1. Enter any text in the input field
  2. Choose tokenization type (word/character)
  3. See real-time tokenization results

Encoding

  • Text is automatically encoded when you type
  • Special tokens are added ([BOS] at start, [EOS] at end)
  • Unknown tokens are marked as [UNK]
  • Tokens are displayed with their IDs

Decoding

  1. Input token IDs in the decode field
  2. Click "Decode" button to convert back to text
  3. View the reconstructed text output

Copy Functionality

  • Use the "Copy" button to copy encoded tokens
  • Tokens are copied in a comma-separated format

🔍 Example

Input Text:

Namaste JI, Custom Tokenizer App me aapka Swagat hai

Encoded Output (Word-based):

2, 4, 5, 6, 7, 8, 9, 1, 10, 11, 3

Where:

  • 2: [BOS] token
  • 4-11: Vocabulary tokens
  • 1: [UNK] token for unknown words
  • 3: [EOS] token

🛠️ Technical Details

Token Types

  • [BOS]: Beginning of sequence token (ID: 2)
  • [EOS]: End of sequence token (ID: 3)
  • [UNK]: Unknown token (ID: 1)
  • [PAD]: Padding token (ID: 0)

Tokenization Logic

  • Word mode: Splits on spaces and punctuation
  • Character mode: Processes each character individually
  • Preserves original text formatting
  • Handles case sensitivity

📝 License

MIT License

🤝 Contributing

Feel free to:

  • Open issues
  • Submit pull requests
  • Suggest improvements
  • Report bugs

Your contributions are welcome!

About

An interactive web application for text tokenization, demonstrating both word-based and character-based tokenization approaches. Built with React and Vite.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published