Capability evaluation

Official Definition

A comprehensive assessment of an AI model’s or system’s overall capabilities, including both planned capabilities and unplanned, emerging, or malicious capabilities.

Source: AIEOG AI Lexicon (Feb 2026), adapted from arXiv:2506.18213

What capability evaluation means in plain language

Capability evaluation goes beyond testing whether an AI model does what it was designed to do. It asks a broader question: what is this AI system actually capable of? This includes intended capabilities (the ones you built it for) and unintended capabilities (things it can do that you did not plan for).

This concept is particularly relevant for foundation models and large language models, which can develop emergent capabilities, abilities that appear as the model scales, that were not explicitly programmed or anticipated during development. A model trained for text generation might also demonstrate the ability to write code, perform mathematical reasoning, or generate persuasive misinformation.

For financial institutions, capability evaluation matters because an AI system’s risk profile depends on what it can do, not just what it was designed to do. A customer service chatbot that was built to answer account questions but can also be manipulated to reveal confidential information has capabilities beyond its intended use that represent security risk.

Why it matters in financial services

Capability evaluation is becoming increasingly important as financial institutions adopt more sophisticated AI systems. The gap between what an AI system is supposed to do and what it can actually do creates governance and risk exposure.

Security risk. Unintended capabilities can be exploited by adversaries. A model that can be prompted to bypass its intended constraints represents a security vulnerability.
Scope management. Agentic AI systems may develop capabilities beyond their original design. Without evaluation, these capabilities may go undetected and unmanaged.
Regulatory preparedness. As regulators develop AI-specific examination procedures, capability evaluation may become an expected component of AI governance.

Key considerations for compliance teams

Evaluate beyond intended use. Test AI systems for capabilities outside their designed purpose, particularly the ability to produce harmful, biased, or misleading outputs.
Red team critical systems. For high-risk AI deployments, conduct adversarial testing (red teaming) to identify unintended capabilities and vulnerabilities.
Monitor for emergent behavior. Establish processes to detect when AI systems begin exhibiting behaviors that were not part of their original design.
Assess vendor capabilities. For third-party AI systems, request capability evaluation documentation and conduct independent assessments where feasible.
Document findings. Maintain records of capability evaluations, including both intended and unintended capabilities identified.
Update risk assessments. Findings from capability evaluations should feed into the AI risk assessment process.

Stay current on AI risk in financial services

Get practical guidance on AI governance, model risk, and regulatory developments delivered to your inbox. Stay up to date on the latest in financial compliance from our experts.