This repository provides usage examples for LangChain text splitters, a fundamental tool for breaking large documents into smaller, manageable chunks that language models can process effectively.
Splitting text is critical when working with large language models (LLMs), since they have context length limits. Proper splitting helps each chunk stay meaningful and coherent, which improves downstream tasks like question answering, summarization, embedding generation, and retrieval-augmented generation (RAG).
- Overview: A simple splitter that breaks text into chunks based on a fixed character size.
- Usage: Useful for quick prototyping or when structure does not matter.
- Pros:
  - Very fast and straightforward
  - Works with arbitrary text without assumptions
- Cons:
  - May split mid-sentence or mid-word
  - Does not preserve semantic meaning
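
To make the trade-off concrete, here is a minimal sketch of fixed-size character splitting in plain Python. It illustrates the idea only and is not LangChain's implementation: the function name `split_by_characters` and its parameters are hypothetical, chosen for this example. LangChain's own `CharacterTextSplitter` differs in detail (it splits on a separator and then merges pieces up to the size limit).

```python
def split_by_characters(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Hypothetical helper for illustration; not part of LangChain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "LangChain splits long documents into chunks."
chunks = split_by_characters(sample, chunk_size=12)
print(chunks)
# ['LangChain sp', 'lits long do', 'cuments into', ' chunks.']
```

The output shows the main drawback listed above: words such as "splits" and "documents" are cut in half because the splitter looks only at character counts.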
- Overview: A more advanced splitter that tries a prioritized list of separators, falling back recursively to the next, finer separator whenever a piece is still too large.
- Why it matters: This is among the most widely used text splitters in LangChain, because it balances chunk size with semantic integrity.
- Capabilities:
  - Text structure splitting (paragraphs, sentences, words)
  - Programming language splitting (by functions, classes, code blocks)
  - Markdown splitting (headings, bullet points, sections)
  - Structure-aware grouping (keeps paragraphs and sentences together where possible, so related meaning tends to stay in one chunk)
- Pros:
  - Preserves logical boundaries in text
  - Adaptable for structured documents, code, or Markdown
  - Provides more meaningful chunks for embeddings and retrieval
- Cons:
  - Slightly more computationally intensive than a simple character splitter
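
The recursive fallback described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not LangChain's `RecursiveCharacterTextSplitter`: the function `recursive_split` and its default separator order (paragraph, line, word, single character) are hypothetical choices for this example, and unlike the library the sketch drops separators and supports no overlap.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split text on the coarsest separator, recursing on oversized pieces.

    Hypothetical helper for illustration; not part of LangChain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks: list[str] = []
    current = ""  # pieces merged so far that still fit in one chunk
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Text splitters prepare documents for LLMs.\n\n"
    "Paragraphs are tried first.\n"
    "Lines come next, then words."
)
chunks = recursive_split(doc, chunk_size=45)
for chunk in chunks:
    print(repr(chunk))
# 'Text splitters prepare documents for LLMs.'
# 'Paragraphs are tried first.'
# 'Lines come next, then words.'
```

Every chunk ends on a paragraph or line boundary, and the hard character cut is used only when no separator fits. In this sketch, passing a different `separators` tuple is what would adapt the same routine to other structures such as code or Markdown.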