Multi-modal model
Official Definition
A type of AI model that processes and integrates data from multiple modalities, such as text, image, video, and audio.
Source: AIEOG AI Lexicon (Feb 2026), adapted from arXiv:2309.10020
What multi-modal model means in plain language
A multi-modal model is an AI system that can understand and process multiple types of data simultaneously. Instead of being limited to text or images alone, a multi-modal model can work with text, images, audio, video, and other data types together.
For example, a multi-modal model could analyze a scanned loan application (image), extract the text from it (OCR), compare it to verbal statements (audio), and cross-reference it against database records (structured data), all within a single unified framework.
Multi-modal capabilities are increasingly common in modern foundation models, many of which can accept, and in some cases generate, content across multiple data types.
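To make the loan application example concrete, here is a minimal sketch of how inputs of different types might be grouped into one record before being passed to a model. The class name `LoanApplicationInput`, its fields, and the file paths are hypothetical and chosen for illustration only. They do not come from the AIEOG definition or from any particular product.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoanApplicationInput:
    """Hypothetical container for one application's multi-modal evidence."""
    scanned_form_path: Optional[str] = None    # image modality
    extracted_text: Optional[str] = None       # text modality (e.g. OCR output)
    call_recording_path: Optional[str] = None  # audio modality
    database_record: dict = field(default_factory=dict)  # structured data

    def modalities_present(self) -> list:
        """List which modalities this record actually contains."""
        present = []
        if self.scanned_form_path:
            present.append("image")
        if self.extracted_text:
            present.append("text")
        if self.call_recording_path:
            present.append("audio")
        if self.database_record:
            present.append("structured")
        return present


# Illustrative values only: an application with no voice recording attached.
application = LoanApplicationInput(
    scanned_form_path="scans/application.png",
    extracted_text="Applicant states annual income of 85,000",
    database_record={"customer_id": "C-001", "reported_income": 85000},
)
present = application.modalities_present()
```

Recording which modalities are present for each case is one simple way to know which inputs a given decision actually relied on.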
Why it matters in financial services
Multi-modal models expand the range of tasks AI can perform:
- Document processing. Analyzing documents that combine text, tables, images, and handwriting.
- Identity verification. Comparing document images, selfie photos, and voice samples as part of KYC.
- Fraud detection. Analyzing transaction data alongside communication records, device information, and behavioral signals.
- Customer service. Understanding inquiries that include text, images, and voice.
Governance becomes more complex with multi-modal models: each modality introduces its own risks, and the interactions between modalities add further ones.
Key considerations for compliance teams
- Assess risk across all modalities. Evaluate each data type the model processes for accuracy, bias, and privacy concerns.
- Validate cross-modal performance. Test how the model performs when processing multiple modalities simultaneously.
- Address privacy across modalities. Different data types may be subject to different privacy regulations (biometric data, voice recordings, image data).
- Include in AI governance. Document each multi-modal deployment, specifying every modality it processes and every use case it supports.
- Monitor for modality-specific failures. Performance degradation may affect one modality more than others. Monitor each independently.
- Assess vendor capabilities. For third-party multi-modal models, understand the capabilities and limitations of each modality.
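The advice above to monitor each modality independently can be sketched in a few lines. This is an illustrative example under stated assumptions, not a prescribed method: the record format `(modality, predicted, actual)`, the function names, and the 0.9 threshold are all hypothetical, and accuracy stands in for whatever metric a team actually tracks.

```python
from collections import defaultdict


def accuracy_by_modality(records):
    """Compute accuracy separately for each modality.

    `records` is an iterable of (modality, predicted, actual) tuples,
    a hypothetical format assumed for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for modality, predicted, actual in records:
        total[modality] += 1
        if predicted == actual:
            correct[modality] += 1
    return {m: correct[m] / total[m] for m in total}


def flag_degraded(accuracies, threshold=0.9):
    """Return the modalities whose accuracy is below the threshold."""
    return sorted(m for m, acc in accuracies.items() if acc < threshold)


# Illustrative evaluation records, not real data.
records = [
    ("image", "approve", "approve"),
    ("image", "approve", "approve"),
    ("audio", "approve", "reject"),
    ("audio", "reject", "reject"),
    ("text", "reject", "reject"),
]

accuracies = accuracy_by_modality(records)
degraded = flag_degraded(accuracies)
```

In this toy data the overall accuracy is 80 percent, which hides the fact that audio alone is at 50 percent. Breaking results out per modality is what surfaces that kind of failure.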
