Training data

Official Definition

The dataset used to train an AI model, from which the model learns patterns, relationships, and representations that inform its predictions and outputs.

Source: AIEOG AI Lexicon (Feb 2026), adapted from NIST AI 100-1

What training data means in plain language

Training data is the information an AI model learns from. It is the raw material of machine learning — the examples, patterns, and relationships that shape how the model understands the world and makes predictions.

The quality, representativeness, and integrity of training data directly determine the quality of the model. A fraud detection model trained on a dataset that does not include recent fraud patterns will miss those patterns. A credit model trained on biased historical decisions will learn those biases. A language model trained on inaccurate text will produce inaccurate outputs.

Training data governance encompasses the entire data pipeline: collection (how was the data gathered?), labeling (how were correct answers determined?), preprocessing (how was the data cleaned and prepared?), selection (what was included and excluded?), and documentation (what is known about the data’s provenance and characteristics?).
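The five pipeline questions above lend themselves to a simple structured record. As a minimal sketch (the `TrainingDataRecord` class and all field names are hypothetical, not part of any standard), a documentation entry might look like:

```python
from dataclasses import dataclass, field

@dataclass
class TrainingDataRecord:
    """Hypothetical record covering the five governance questions."""
    name: str
    collection: str                    # how the data was gathered
    labeling: str                      # how correct answers were determined
    preprocessing: list[str] = field(default_factory=list)  # cleaning steps
    selection: str = ""                # what was included and excluded
    provenance: str = ""               # known source and characteristics

record = TrainingDataRecord(
    name="loan_applications_2020_2024",
    collection="Exported from the loan origination system",
    labeling="Charge-off status observed after 24 months",
    preprocessing=["dropped rows with missing income", "amounts normalized to USD"],
    selection="Approved applications only; declined applications excluded",
    provenance="Internal system of record; no third-party data",
)
```

Even a lightweight record like this gives examiners a single place to find how a dataset was collected, labeled, cleaned, and scoped.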

Why it matters in financial services

Training data is the foundation of AI model risk. Many of the most significant AI governance failures trace back to training data issues:

  • Bias. Training data that reflects historical discrimination will produce models that perpetuate that discrimination. This is a critical fair lending concern.
  • Data quality. Incomplete, inaccurate, or outdated training data produces unreliable models.
  • Privacy. Training data may contain personally identifiable information (PII) or other sensitive data that requires protection.
  • Regulatory data requirements. Examiners expect institutions to demonstrate understanding of the data used to train models, including its source, quality, and limitations.
  • Third-party data. For foundation models and vendor AI, institutions may have limited visibility into training data, creating a governance gap.

Key considerations for compliance teams

  1. Document data provenance. For every AI model, document where the training data came from, how it was collected, and what preprocessing was applied.
  2. Assess for bias. Evaluate training data for patterns that could produce discriminatory outcomes, particularly in lending and customer-facing applications.
  3. Ensure data quality. Implement data quality controls including completeness checks, accuracy validation, and consistency verification.
  4. Protect sensitive data. Apply appropriate privacy controls to training data, including de-identification, access restrictions, and retention policies.
  5. Assess vendor training data. For third-party models, request documentation on training data sources, composition, and known limitations.
  6. Maintain data lineage. Track the full chain from data source to training dataset to model, enabling audit and investigation.
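The data quality controls in step 3 (completeness checks and accuracy validation) can be sketched with plain functions. This is a minimal illustration, not a production framework; the field names and plausibility range are assumed for the example:

```python
def completeness(rows, required):
    """Share of rows in which every required field is present (non-None)."""
    complete = sum(
        1 for row in rows if all(row.get(f) is not None for f in required)
    )
    return complete / len(rows)

def range_violations(rows, fld, lo, hi):
    """Rows whose value falls outside an expected range (accuracy check)."""
    return [
        row for row in rows
        if row.get(fld) is not None and not lo <= row[fld] <= hi
    ]

rows = [
    {"income": 52000, "age": 34},
    {"income": None,  "age": 41},   # incomplete record
    {"income": 48000, "age": 130},  # implausible age
]
completeness(rows, ["income", "age"])   # 2/3 of rows are complete
range_violations(rows, "age", 18, 100)  # flags the age-130 record
```

Checks like these are easy to run on every refresh of a training dataset, so quality drift is caught before retraining rather than after a model degrades in production.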

