Semi-supervised learning

Official Definition

A machine learning approach that uses a small amount of labeled data combined with a large amount of unlabeled data for training, bridging the gap between supervised and unsupervised learning.

Source: AIEOG AI Lexicon (Feb 2026), adapted from NIST AI 100-1

What semi-supervised learning means in plain language

Semi-supervised learning is a machine learning approach that uses a combination of labeled data (data with known correct answers) and unlabeled data (data without labels). It is a pragmatic middle ground between supervised learning (which requires all data to be labeled) and unsupervised learning (which uses no labels at all).

In many real-world situations, labeled data is expensive and time-consuming to produce. Labeling transaction data as fraudulent or legitimate requires expert review. Labeling customer complaints by category requires trained analysts. Semi-supervised learning addresses this by using a small amount of carefully labeled data to guide learning from a much larger pool of unlabeled data.

The approach is common in financial services, where large volumes of data exist but labeling them is resource-intensive.
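As a rough illustration of how a small labeled set can guide learning from a larger unlabeled pool, the sketch below uses self-training (also called pseudo-labeling), one common semi-supervised technique. It is not taken from the AIEOG or NIST definition. The nearest-centroid classifier, the confidence formula, the 0.8 threshold, and the two-feature "legit"/"fraud" points are all assumptions chosen to keep the example small.

```python
import math

def centroids(points, labels):
    """Mean feature vector for each class label."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(s / counts[y] for s in acc) for y, acc in sums.items()}

def predict(cents, p):
    """Return (label, confidence); needs at least two classes.

    Confidence is d2 / (d1 + d2), where d1 and d2 are the distances to the
    nearest and second-nearest centroids, so it lies between 0.5 and 1.
    """
    ranked = sorted((math.dist(p, c), y) for y, c in cents.items())
    (d1, y1), (d2, _) = ranked[0], ranked[1]
    conf = 1.0 if d1 + d2 == 0 else d2 / (d1 + d2)
    return y1, conf

def self_train(labeled, unlabeled, threshold=0.8, max_rounds=10):
    """Grow the training set with confidently pseudo-labeled points."""
    points = [p for p, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(points, labels)
        confident, remaining = [], []
        for p in pool:
            y, conf = predict(cents, p)
            if conf >= threshold:
                confident.append((p, y))
            else:
                remaining.append(p)
        if not confident:
            break  # nothing new to learn from; stop early
        for p, y in confident:
            points.append(p)
            labels.append(y)
        pool = remaining
    return centroids(points, labels), pool

# Hypothetical data: two expert-labeled examples and five unlabeled ones.
labeled = [((1.0, 1.0), "legit"), ((9.0, 9.0), "fraud")]
unlabeled = [(1.5, 1.2), (2.0, 1.8), (8.5, 9.1), (8.0, 8.4), (5.0, 5.0)]

model, still_unlabeled = self_train(labeled, unlabeled)
print(model)
print(still_unlabeled)  # points the model was not confident enough to label
```

The ambiguous point midway between the two groups stays unlabeled because the model never becomes confident about it. A threshold like this is one simple way to limit how far a wrong guess can spread; in practice the threshold and the base model would be chosen and validated for the specific use case.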

Why it matters in financial services

Semi-supervised learning is particularly relevant in financial services because many applications involve large datasets where only a fraction can be realistically labeled by human experts. Fraud detection, AML transaction monitoring, and document classification all face this challenge.

Governance considerations include label quality (the small labeled dataset must be high quality because it guides the entire learning process), representation (the labeled subset must be representative of the full dataset), and validation (because performance cannot be measured directly on data that has no labels, testing must use held-out labeled data to confirm that the model generalizes to the wider population).

Key considerations for compliance teams

Ensure label quality. The labeled data subset is critical. Implement quality controls and expert review for the labeling process.
Validate on held-out labeled data. Reserve a portion of the labeled data that is never used in training, and use it to estimate how the model performs on the wider, unlabeled population.
Assess representation. Confirm that the labeled subset is representative of the full population and does not introduce selection bias.
Document the approach. Record the labeling methodology, the ratio of labeled to unlabeled data, and the rationale for using semi-supervised learning.
Monitor for label propagation errors. Semi-supervised learning can propagate errors from the labeled subset to the unlabeled data. Monitor for systematic mistakes.
Include in model risk management. Semi-supervised learning models should be subject to the same governance as any other ML model.
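Several of the checks above can be partly automated. The following is a minimal sketch, not a prescribed method: the function names, the segment labels ("card", "wire", "ach"), the 0.05 tolerance, and the toy rule-based model are all hypothetical. It computes the labeled share of the data for documentation, flags segments that are over- or under-represented in the labeled subset, and measures accuracy on held-out labeled examples.

```python
from collections import Counter

def labeled_share(n_labeled, n_unlabeled):
    """Fraction of all training records that carry a label."""
    return n_labeled / (n_labeled + n_unlabeled)

def representation_gaps(labeled_segments, population_segments, tolerance=0.05):
    """Segments whose share of the labeled subset differs from their share
    of the full population by more than `tolerance` (positive means the
    segment is over-represented among the labeled records)."""
    lab, pop = Counter(labeled_segments), Counter(population_segments)
    n_lab, n_pop = len(labeled_segments), len(population_segments)
    gaps = {}
    for seg in pop.keys() | lab.keys():
        diff = lab[seg] / n_lab - pop[seg] / n_pop
        if abs(diff) > tolerance:
            gaps[seg] = round(diff, 3)
    return gaps

def holdout_accuracy(predict_fn, holdout):
    """Share of held-out labeled examples the model classifies correctly."""
    correct = sum(1 for x, y in holdout if predict_fn(x) == y)
    return correct / len(holdout)

# Hypothetical figures: 50 labeled records out of a population of 1,000.
population = ["card"] * 600 + ["wire"] * 300 + ["ach"] * 100
labeled_subset = ["card"] * 45 + ["wire"] * 5

share = labeled_share(n_labeled=50, n_unlabeled=950)
gaps = representation_gaps(labeled_subset, population)

# Toy stand-in for a trained model: flag amounts above 1000 as fraud.
toy_model = lambda amount: "fraud" if amount > 1000 else "legit"
holdout = [(50, "legit"), (2000, "fraud"), (1500, "legit"), (20, "legit")]
accuracy = holdout_accuracy(toy_model, holdout)

print(share, gaps, accuracy)
```

In this made-up case, card transactions dominate the labeled subset and ACH transactions are missing from it entirely, which is the kind of selection bias the representation check is meant to surface before the model is trained.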
