Tokenization (AI)
Official Definition
The process of breaking down text into smaller units called tokens (words, subwords, or characters) that serve as the input for natural language processing and large language models.
Source: AIEOG AI Lexicon (Feb 2026), adapted from NIST AI 100-1 and industry usage
What tokenization means in plain language
Tokenization in AI is the process of splitting text into smaller pieces (tokens) that a model can process. Before an LLM can read your prompt, the text must be converted into tokens — the fundamental units the model works with.
Tokens are not always whole words. Common words might be a single token, while uncommon or long words might be split into multiple subword tokens. Punctuation, spaces, and special characters are also tokenized. For example, “compliance” might be one token, while “noncompliance” might be split into “non” and “compliance.”
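The subword splitting described above can be sketched with a toy greedy longest-match tokenizer. This is an illustration only: production tokenizers (BPE, WordPiece, and similar) learn their vocabularies from large corpora rather than using a hand-written word set, and the vocabulary below is hypothetical.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword tokenization over a toy vocabulary.

    Real tokenizers (e.g. BPE or WordPiece) learn their vocabularies from
    data; this sketch only illustrates how a word the vocabulary lacks
    gets split into smaller subword pieces.
    """
    tokens = []
    for word in text.split():
        start = 0
        while start < len(word):
            # Try the longest remaining substring first, shrinking until a
            # vocabulary match; fall back to a single character if needed.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in vocab or end - start == 1:
                    tokens.append(piece)
                    start = end
                    break
    return tokens

vocab = {"non", "compliance", "the", "report"}
print(tokenize("the compliance report", vocab))  # each word is one token
print(tokenize("noncompliance", vocab))          # splits into "non" + "compliance"
```

Note how "compliance" stays whole while "noncompliance", absent from the vocabulary, falls apart into two subword tokens, mirroring the example in the text.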
Tokenization matters for governance because it affects how models process information. The tokenization method influences how well a model handles specialized vocabulary (like financial and legal terminology), different languages, numerical data, and code. LLMs also have token limits (context windows) that determine how much text they can process at once.
Important note: AI tokenization is distinct from data tokenization used in payment security (replacing sensitive card data with non-sensitive tokens). The two concepts share a name but serve entirely different purposes.
Why it matters in financial services
Tokenization has practical implications for AI governance in financial services:
- Context limitations. Token limits affect how much information an LLM can process. Long regulatory documents, complex contracts, or extensive transaction histories may exceed token limits, requiring chunking strategies.
- Specialized vocabulary. Rare financial and regulatory terms are often split into many subword tokens, which inflates token counts and can reduce model comprehension. Testing domain-specific performance is important.
- Cost and performance. AI service pricing is often based on tokens processed. Understanding tokenization helps institutions estimate and manage costs.
- Multilingual considerations. Tokenization efficiency varies across languages, potentially affecting model performance for institutions with international operations.
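The chunking strategy mentioned above can be sketched as a simple overlapping window over a long document. Word count is used here as a rough proxy for tokens, which is an assumption: actual token counts are model-specific, so in practice you would leave headroom well below the real context window.

```python
def chunk_by_words(text: str, max_words: int = 500, overlap: int = 50) -> list[str]:
    """Split a long document into overlapping word-count chunks.

    Word count is only a rough stand-in for tokens; real token counts
    depend on the model's tokenizer. The overlap preserves some context
    across chunk boundaries (e.g. a clause spanning two chunks).
    """
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# A hypothetical 1,200-word contract, chunked into 500-word windows.
doc = " ".join(f"clause{i}" for i in range(1200))
parts = chunk_by_words(doc, max_words=500, overlap=50)
print(len(parts))  # 3 overlapping chunks
```

Each chunk repeats the last 50 words of the previous one, a common way to keep clauses from being cut in half at a boundary.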
Key considerations for compliance teams
- Understand token limits. Know the context window of the LLMs your institution uses and how it affects processing capacity.
- Test domain vocabulary. Verify that the tokenization approach handles financial and regulatory terminology appropriately.
- Plan chunking strategies. For long documents that exceed token limits, establish consistent approaches for splitting and processing content.
- Monitor costs. Track token usage and costs for LLM-based applications.
- Distinguish from payment tokenization. Ensure internal documentation clearly differentiates AI tokenization from data tokenization used in payment security.
- Include in system documentation. Document the tokenization approach and any limitations for each NLP or LLM deployment.
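For the cost-monitoring point above, per-call spend follows directly from token counts once you know the provider's rates. The prices below are hypothetical placeholders, not any vendor's actual rate card; most providers bill input and output tokens separately.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate per-call LLM cost from token counts.

    Rates are passed in rather than hard-coded because pricing varies by
    provider and model and changes over time; check the current rate card.
    """
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Hypothetical rates: $0.01 per 1K input tokens, $0.03 per 1K output tokens.
cost = estimate_cost(input_tokens=8000, output_tokens=2000,
                     price_in_per_1k=0.01, price_out_per_1k=0.03)
print(f"${cost:.2f}")  # $0.14
```

Summing these estimates across an application's call volume gives the usage tracking the bullet describes.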