TLDR
Fixed-Size Chunking is the most widely adopted "Level 1" strategy for preparing text for RAG. It divides a document into chunks of a set number of tokens (e.g., 512) or characters (e.g., 2000), usually with a "sliding window" overlap so that a sentence cut at one boundary still appears intact in an adjacent chunk. Its primary advantage is predictability and speed: unlike semantic chunking, it requires no model inference to determine split points. However, its rigidity often leads to "Context Fragmentation," where related concepts are arbitrarily severed. It is the default baseline for most vector databases and for prototyping.
Conceptual Overview
At its core, Fixed-Size Chunking treats a document as a linear stream of data rather than a semantic structure. It ignores paragraph breaks, headers, and logical topic shifts, focusing solely on the size budget of each chunk.
The Mechanics of the Split
The process is governed by two key parameters:
- Chunk Size (`chunk_size`): The maximum length of a text block. This is typically determined by the embedding model's limit (e.g., OpenAI's `text-embedding-3` models work well with 256-512 token chunks).
- Overlap (`chunk_overlap`): The number of tokens shared between adjacent chunks.
Why Overlap Matters: Imagine the sentence: "The secret code to the safe is 1234." If the cut falls right after "is", Chunk A ends with "The secret code to the safe is" and Chunk B begins with "1234." Neither chunk is useful on its own. With an overlap of 50 tokens, the full sentence appears intact in at least one of the chunks, preserving the critical information linkage.
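The mechanism is easy to see with a toy sliding window over a list of tokens. This is an illustrative sketch only: the window and overlap sizes are deliberately tiny, and a real pipeline would operate on tokenizer output rather than whitespace-split words.

def sliding_window_chunks(tokens, chunk_size, overlap):
    # Each window shares `overlap` tokens with the previous one.
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + chunk_size]
        if start + chunk_size >= len(tokens):
            break  # Avoid emitting a tiny trailing remainder

tokens = "The secret code to the safe is 1 2 3 4".split()
for chunk in sliding_window_chunks(tokens, chunk_size=8, overlap=4):
    print(" ".join(chunk))
# The second window, "the safe is 1 2 3 4", keeps the code attached to
# the sentence that explains what it is.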
The "Recursive" Evolution (Level 2)
Pure fixed-size splitting (a hard cut every N characters) is rarely used in practice because it slices words and sentences in half. The industry standard is Recursive Character Splitting. This method attempts to split at the largest structural separators first (double newlines \n\n), then single newlines \n, then spaces, and only as a last resort at raw character boundaries. This keeps paragraphs together wherever possible while still adhering to the fixed size limit.
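The fallback order can be expressed directly. The sketch below illustrates only the recursive idea; LangChain's real implementation additionally merges small pieces back up toward the target size and applies the configured overlap.

def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    # Split at the coarsest separator whose pieces fit within chunk_size.
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    pieces = list(text) if sep == "" else text.split(sep)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Still too large: fall back to the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
    return chunks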

Practical Implementations
In the Python ecosystem, LangChain and LlamaIndex provide the standard implementations for this strategy.
LangChain: RecursiveCharacterTextSplitter
This is the most common implementation for text-heavy documents.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "Load your long document text here..."

# Initialize the splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,                      # Target size (characters or tokens)
    chunk_overlap=50,                    # 10-15% overlap is standard practice
    length_function=len,                 # Can use `len` or a tokenizer's count
    separators=["\n\n", "\n", " ", ""],  # Priority list for splitting
)

docs = splitter.create_documents([text])
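A quick way to inspect the result (a usage sketch; the exact boundaries depend on the input text):

for doc in docs:
    print(len(doc.page_content), repr(doc.page_content[:60]))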
Optimizing for Tokens vs. Characters
While characters are easier to count, LLMs and embedding models operate on tokens. A common rule of thumb is 1 token $\approx$ 4 characters of English text, but relying on character counts alone can produce chunks that exceed the embedding model's token limit.
Best Practice: Use a tokenizer-aware length function (e.g., tiktoken) to ensure that a nominal 512-token chunk really stays within 512 tokens.
import tiktoken

def tiktoken_len(text):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text)
    return len(tokens)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=tiktoken_len,
)
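To confirm the budget is respected, the chunks can be checked against the same length function (a quick sanity check, reusing the text and splitter from the snippets above):

chunks = splitter.split_text(text)
print(max(tiktoken_len(chunk) for chunk in chunks))  # Should not exceed 512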
Advanced Techniques
While "fixed-size" implies rigidity, there are sophisticated ways to apply it.
Small-to-Big Retrieval (Parent Document Retrieval)
This technique uses small fixed-size chunks for indexing and retrieval, but delivers larger, more complete context to the LLM, as sketched in the example after this list.
- Child Chunks: Split the document into small, fixed 128-token chunks and embed these. Because each vector represents a narrow, focused span of text, it matches specific queries precisely during dense retrieval.
- Parent Retrieval: When a child chunk is retrieved, do not send just that small snippet to the LLM. Instead, fetch the "Parent Chunk" (e.g., the 1024-token window surrounding it) or the full document. This leverages the precision of small fixed chunks for search and the context of large windows for generation.
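A minimal sketch of the parent/child bookkeeping, using plain dictionaries in place of a vector store and document store; the splitter sizes are illustrative, and the query-time embedding step is left to whatever retrieval stack is in use.

import uuid
from langchain.text_splitter import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=0)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=20)

document = "Load your long document text here..."

parent_store = {}   # parent_id -> full parent chunk (what the LLM sees)
child_index = []    # (child_text, parent_id) pairs (what gets embedded)

for parent in parent_splitter.split_text(document):
    parent_id = str(uuid.uuid4())
    parent_store[parent_id] = parent
    for child in child_splitter.split_text(parent):
        child_index.append((child, parent_id))

# At query time: embed and search over the child texts, then return
# parent_store[parent_id] for the best-matching child.

LangChain also ships a ParentDocumentRetriever that packages this pattern, pairing a vector store for the child chunks with a document store for the parents.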
Sliding Window Integration with Reranking
In high-precision systems, engineers sometimes generate heavily overlapping chunks (e.g., 512 tokens with a 256-token overlap). This roughly doubles the number of vectors (a linear increase in storage and indexing cost) but ensures that every sentence sits near the middle of at least one chunk rather than at a boundary, mitigating a chunk-level analogue of the "Lost in the Middle" effect during retrieval; the redundant, overlapping candidates are then typically collapsed by a reranker.
Research and Future Directions
The "Level 2" Recursive Splitter is currently the plateau of non-model-based algorithms. Future research focuses on minimal-compute heuristics to improve split boundaries without the cost of full semantic models.
Static vs. Dynamic Boundaries
Research is exploring "NLP-light" splitting, where simple heuristic models (like NLTK sentence tokenizers) determine boundaries, but the chunking logic dynamically resizes the window to avoid "stranded sentences." This aims to approach the quality of Semantic Chunking without the $O(N)$ embedding cost of calculating semantic distance for every sentence.
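A rough sketch of that idea: let a sentence tokenizer supply the boundaries, then greedily pack whole sentences into a fixed token budget so that no sentence is stranded across a cut. This assumes NLTK's punkt data has been downloaded, and the 512-token budget is illustrative.

import nltk
import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")

def sentence_packed_chunks(text, max_tokens=512):
    # Greedily pack whole sentences into chunks of at most max_tokens.
    chunks, current, current_len = [], [], 0
    for sentence in nltk.sent_tokenize(text):
        sent_len = len(tokenizer.encode(sentence))
        if current and current_len + sent_len > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += sent_len
    if current:
        chunks.append(" ".join(current))
    # Note: a single sentence longer than max_tokens still becomes its own
    # (oversized) chunk in this sketch.
    return chunks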
Frequently Asked Questions
Q: What is the optimal chunk size?
There is no single number, but 256 to 512 tokens is the sweet spot for most dense embedding models (like OpenAI text-embedding-3 or bge-m3). Smaller chunks (128) are better for granular fact retrieval ("What is the interest rate?"), while larger chunks (1024) are better for thematic queries ("Summarize the termination policy").
Q: How much overlap should I use?
A standard rule of thumb is 10-15% of the chunk size. For a 512-token chunk, an overlap of 50-75 tokens is sufficient to capture the transition between sentences.
Q: Is Fixed-Size Chunking obsolete?
No. It remains the industry workhorse because it is fast, cheap, and predictable. For 80% of RAG use cases, properly tuned Recursive Character Splitting is "good enough" and significantly less complex than Semantic or Agentic chunking.
References
- LangChain Documentation: Text Splitters
- LlamaIndex: Node Parsers and Chunking
- Pinecone: Chunking Strategies for LLM Applications
- Liu et al. (2023), "Lost in the Middle: How Language Models Use Long Contexts"