What Is a Tokenizer?
Before a large language model can read a single word you type, that text must be broken into tokens — small chunks that map to numbers the model actually understands. Think of the tokenizer as the translator sitting between human language and the neural network. It splits your text into subwords, characters, or byte sequences (in OpenAI's models, via an algorithm called Byte-Pair Encoding, or BPE), then looks each piece up in a fixed vocabulary to produce a sequence of integer IDs.
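To make that concrete, here is a toy BPE encoder in Python. The merge ranks and vocabulary below are invented for illustration and are minuscule compared to o200k_base, but the loop is the essence of BPE encoding:

```python
# Toy BPE encoder: repeatedly merge the adjacent pair with the lowest
# (highest-priority) merge rank until no mergeable pair remains.
# The merge table and vocabulary here are made up for illustration.
MERGE_RANKS = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
VOCAB = {"low": 11, "er": 14, "lo": 10, "e": 12, "r": 13, "l": 15, "o": 16, "w": 17}

def bpe_encode(word):
    parts = list(word)  # start from individual characters
    while True:
        # Find the adjacent pair with the best (lowest) merge rank.
        best = min(
            ((MERGE_RANKS[p], i)
             for i, p in enumerate(zip(parts, parts[1:]))
             if p in MERGE_RANKS),
            default=None,
        )
        if best is None:
            break
        _, i = best
        parts[i:i + 2] = [parts[i] + parts[i + 1]]  # merge the pair in place
    return [VOCAB[p] for p in parts]

print(bpe_encode("lower"))  # "lower" -> ["low", "er"] -> [11, 14]
```

A real implementation works over a 200,000-entry merge table and must do this lookup-and-merge dance efficiently; that is where the engineering effort goes.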
This step is not optional. Every prompt, every document, every line of code — it all passes through the tokenizer before the model sees it and again when the model's output is decoded back into text. A slow tokenizer becomes a bottleneck in data preprocessing, evaluation pipelines, and any workload that touches raw text at scale.
Why Another Tokenizer?
OpenAI's tiktoken is the current gold standard for tokenizer performance.
Written in Rust with Python bindings, it is the fastest tokenizer available today —
not just among BPE implementations, but across tokenizer libraries in general. As the
SentencePiece benchmark later in this post shows, even a Rust-native SentencePiece
implementation runs dramatically slower than tiktoken. When people benchmark tokenizers,
tiktoken is the one to beat. Being faster than tiktoken means being faster than
everything else.
The vocabulary in play here matters too. o200k_base is one of the largest
and most comprehensive BPE vocabularies in production use — 200,000 tokens, designed to
cover a wide range of languages, code, and special characters. It powers GPT-4o and later
OpenAI models. A larger vocabulary means more merge rules to evaluate during encoding,
which makes fast tokenization harder, not easier. Achieving a 3× speedup on a vocabulary
this size is a different challenge than doing it on a smaller, simpler one.
o200ktok is a standalone CLI tokenizer built for heavy workloads —
data preprocessing, corpus analytics, batch evaluation.
It implements the same BPE merge rules over the same o200k_base vocabulary,
producing bit-identical output — but it does so significantly faster
than the tool that currently holds the performance crown. On a single thread it's 3.6× faster;
with the --parallel flag, which splits work across all available CPU cores,
it reaches 14.3× faster than tiktoken on the same hardware.
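The post doesn't detail o200ktok's internal splitting strategy, but the standard trick behind this kind of parallelism is worth sketching: cut the input at safe boundaries (here, newlines) so each chunk can be tokenized independently and the concatenated output matches a single-pass run. Whether o200ktok does exactly this is an assumption; the stand-in tokenizer below just splits on whitespace:

```python
# Sketch of boundary-safe work splitting for parallel tokenization.
# Chunks end on newlines, so no token can span two chunks and the
# concatenated per-chunk output equals the single-pass output.
def split_at_newlines(text, n_chunks):
    size = max(1, len(text) // n_chunks)
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        # Advance so the chunk ends exactly after a newline.
        while end < len(text) and text[end - 1] != "\n":
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks

def toy_tokenize(text):
    return text.split()  # stand-in for the real BPE encoder

corpus = "the quick fox\njumps over\nthe lazy dog\n"
chunks = split_at_newlines(corpus, 3)
parallel_result = [tok for c in chunks for tok in toy_tokenize(c)]
assert parallel_result == toy_tokenize(corpus)  # identical to single-pass output
```

Each chunk can then go to its own core; the final output is just the chunks' outputs in order.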
Benchmark
The test corpus is the WikiText-103 training set, a standard NLP benchmark dataset. Both tools were run on the same machine, tokenizing the full file and writing results to disk. Two modes were measured: IDs-only (output token IDs, one per line) and Tokens (output each ID with its decoded text value).
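As a sketch of what the two modes produce (the IDs and text values below are invented, and the exact column layout of each tool's output is an assumption):

```python
# Illustration of the two output modes measured in the benchmark.
# The (id, text) pairs are invented, not real o200k_base values.
tokens = [(464, "The"), (2068, " quick"), (21831, " fox")]

def ids_only(tokens):
    # IDs-only mode: one token ID per line.
    return "\n".join(str(tid) for tid, _ in tokens)

def ids_with_text(tokens):
    # Tokens mode: each ID alongside its decoded text value.
    return "\n".join(f"{tid}\t{text!r}" for tid, text in tokens)

print(ids_only(tokens))
print(ids_with_text(tokens))
```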
Single-Thread: IDs-Only Mode
Single-Thread: Tokens + Text Mode
Parallel Mode: IDs-Only (multi-CPU)
Parallel Mode: Tokens + Text (multi-CPU)
Here are the raw timing results — you can reproduce these yourself. No cherry-picking,
no warm caches, just time on the command line:
And with the --parallel flag, o200ktok splits the work across
all available CPU cores — notice how user time exceeds real time,
confirming true multi-core utilization:
Correctness First
Speed means nothing if the output is wrong. As the benchmark confirms, o200ktok
produces byte-for-byte identical results to tiktoken on the full WikiText-103
training set — same token count, same token IDs, same decoded text. This holds in both
single-thread and parallel mode. This isn't approximate compatibility; it's exact.
Let's look at the output side-by-side. First the decoded tokens:
Then the raw IDs:
And the parallel mode? Same result — splitting work across cores doesn't affect correctness:
And the final proof — wc confirms every line, word, and byte matches exactly:
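If you want an even stronger check than wc, a full byte-for-byte comparison takes a few lines of Python. The file names in the usage note are placeholders for whatever the two tools wrote:

```python
import filecmp
import hashlib

def files_identical(path_a, path_b):
    # shallow=False forces a full byte-by-byte content comparison,
    # not just a size/mtime check.
    return filecmp.cmp(path_a, path_b, shallow=False)

def sha256_of(path):
    # Hash comparison is handy when the outputs live on different machines.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()
```

For example, `files_identical("tiktoken.ids", "o200ktok.ids")` (placeholder names) returns True exactly when the two outputs match byte for byte.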
Bonus: SentencePiece Benchmark
BPE isn't the only tokenization algorithm in production. Google's SentencePiece is one of the most widely adopted tokenizer frameworks in the LLM ecosystem. It powers models from virtually every major AI lab: Google (Gemini, Gemma, PaLM, T5), Meta (LLaMA 1 & 2), xAI (Grok-1), and Mistral — among many others. If you work with LLMs, there's a good chance you're running SentencePiece tokenization somewhere in your stack.
To test whether the same performance principles apply, I built
sentence-piece-tok — a SentencePiece-compatible tokenizer using the
Gemma 4 vocabulary (262,144 tokens, one of the largest SentencePiece vocabularies
in production).
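SentencePiece's default unigram model works differently from BPE: rather than applying a fixed sequence of merge rules, it searches for the segmentation of the input with the highest total piece log-probability, typically via Viterbi over character positions. A toy sketch of that search — the vocabulary and log-probabilities are invented, and a real production vocabulary is vastly larger:

```python
import math

# Toy unigram-LM segmentation in the style of SentencePiece's default
# model: choose the split maximizing summed piece log-probability.
# Vocabulary and scores are made up for illustration ("▁" marks a
# word boundary, as in SentencePiece).
LOGP = {"▁low": -4.0, "▁lower": -9.5, "er": -3.0,
        "▁l": -6.0, "o": -5.0, "w": -5.0, "e": -5.0, "r": -5.0}

def segment(text):
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (score, backpointer) per position
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in LOGP and best[start][0] + LOGP[piece] > best[end][0]:
                best[end] = (best[start][0] + LOGP[piece], start)
    # Backtrack from the end to recover the winning pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

print(segment("▁lower"))  # ["▁low", "er"]: score -7.0 beats "▁lower" at -9.5
```

The O(n × max_piece_length) dynamic program is cheap per sentence, but over a multi-gigabyte corpus the constant factors dominate — which is where a fast implementation earns its keep.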
For comparison, I benchmarked against sptok, a Rust-based SentencePiece
implementation. The result was surprising — but also confirms tiktoken's position as the
current performance champion: even a Rust implementation of SentencePiece
turned out to be dramatically slower than tiktoken, spending over 12 minutes on a job that
sentence-piece-tok finishes in 25 seconds with parallel mode.
IDs-Only: Single-Thread
IDs-Only: Parallel (multi-CPU)
Here are the raw results. Note the extreme system time for sptok — over
9 minutes of the 12-minute runtime is spent in kernel overhead:
And as always, the output is identical across all three runs:
Usage
o200ktok is a single-binary CLI tool. No Python environment, no pip install,
no dependency resolution — just download and run.
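If you do want to call it from a Python pipeline anyway, a thin subprocess wrapper is enough. Note the hedge: --parallel is the only flag documented in this post, so the positional-argument shape and the one-ID-per-line output assumption below are guesses to be checked against the tool's actual help text:

```python
import subprocess

def o200ktok_cmd(path, parallel=False):
    # Build the CLI invocation. --parallel is described in this post;
    # passing the input file as a positional argument is an assumption.
    cmd = ["o200ktok", path]
    if parallel:
        cmd.append("--parallel")
    return cmd

def tokenize_file(path, parallel=False):
    # Run the binary and parse IDs-only output (assumed: one integer
    # token ID per line on stdout).
    result = subprocess.run(
        o200ktok_cmd(path, parallel),
        capture_output=True, text=True, check=True,
    )
    return [int(line) for line in result.stdout.split()]
```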
When to Use It
If you're building tooling around models that use the o200k_base vocabulary —
data pipelines, evaluation harnesses, token-counting utilities, corpus analysis scripts —
o200ktok can drop in as a faster alternative with zero risk of
output divergence. Since tiktoken currently delivers the best tokenizer performance
of any available library, being 3–14× faster than tiktoken means o200ktok
is faster than whatever tokenizer you're currently using — full stop. Add the --parallel flag on
multi-core machines and the gap widens even further.
For SentencePiece-based models like Gemma 4, sentence-piece-tok delivers
the same dramatic speedups — over 28× faster than a Rust SentencePiece implementation
in parallel mode. The batch mode in both tools is particularly useful when preprocessing
large datasets: loading the vocabulary once and streaming many files through a single
process avoids re-paying that startup cost for every file.
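A toy sketch of that amortization effect — the Tokenizer class below is a stand-in, not either tool's real implementation:

```python
# Why batch mode pays off: the vocabulary is loaded once and reused
# across files, instead of once per spawned process. The tiny vocab
# here stands in for a 200k-entry table that is expensive to load.
class Tokenizer:
    def __init__(self):
        # Stand-in for parsing a large vocabulary file from disk.
        self.vocab = {chr(97 + i): i for i in range(26)}

    def encode(self, text):
        return [self.vocab[ch] for ch in text if ch in self.vocab]

def encode_batch(files_contents):
    tok = Tokenizer()  # one load, amortized over the whole batch
    return [tok.encode(t) for t in files_contents]

def encode_per_process(files_contents):
    # One-process-per-file re-pays the vocabulary load every time.
    return [Tokenizer().encode(t) for t in files_contents]
```

Both functions produce identical token IDs; only the number of vocabulary loads differs, and with a 200k-token vocabulary that difference dominates short files.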