SR 11-7 model risk management guidance from the Federal Reserve and OCC was published in April 2011. GPT-2 did not exist until 2019. The framework was built around predictive models with defined input features, traceable logic and measurable output distributions. Foundation models break nearly every assumption embedded in that guidance. Banks are now deploying large language models for credit memo summarization, adverse action explanation, BSA/AML narrative generation and customer service escalation, and the regulatory framework they must satisfy was designed for logistic regression.
This is not a theoretical tension. In 2026, examiners from the OCC and FDIC are walking into model risk management reviews and asking questions the original SR 11-7 authors never anticipated. Banks that cannot answer those questions are receiving Matters Requiring Attention (MRA) findings and, in more serious cases, consent order language tied to model governance deficiencies.
SR 11-7 Was Written Before LLMs Existed
SR 11-7, formally titled "Supervisory Guidance on Model Risk Management," defines a model as a quantitative method, system or approach that applies statistical, economic, financial or mathematical theories to transform inputs into quantitative estimates. That definition has always been deliberately broad. The guidance anticipated that new model types would emerge. What it did not anticipate was a class of models with hundreds of billions of parameters trained on general-purpose text corpora, capable of producing plausible-sounding but factually incorrect outputs, whose internal reasoning is not traceable to any specific feature weight.
SR 11-7 is organized around three elements: model development, implementation and use; model validation; and governance, policies and controls. Ongoing monitoring sits inside validation, alongside evaluation of conceptual soundness and outcomes analysis. Each element assumes a degree of conceptual transparency that foundation models do not provide. You cannot audit the feature weights of GPT-4o the way you audit a gradient boosted tree in a credit scoring pipeline.
The Federal Reserve has not replaced SR 11-7. The guidance remains the operative supervisory expectation. What has changed is that banks must now stretch a fifteen-year-old framework to cover technology its authors could not have foreseen, under active examiner scrutiny, with real regulatory consequences for getting it wrong.
What SR 11-7 Actually Requires and Where It Strains
The guidance establishes several concrete expectations. Model developers must produce documentation covering conceptual soundness, data quality, assumptions and limitations. Validators must be independent of development and must challenge the conceptual framework, not just run performance tests. Senior management must maintain a model inventory and ensure that model risk is understood at the governance level.
Each of these requirements strains against foundation model deployments in specific ways.
Conceptual soundness documentation for a fine-tuned LLM is genuinely difficult. The base model was trained by a third party on data the bank did not select, did not review and cannot fully characterize. The fine-tuning layer adds bank-specific behavior but does not make the underlying architecture interpretable. A validator asked to assess conceptual soundness has no access to training data provenance, no ability to inspect attention heads and no mechanism for tracing a specific output to a specific input feature the way they would in a traditional model.
The independence requirement for validators is also complicated by capability gaps. Most model validation teams in banking built their expertise around statistical models. Validating transformer architectures requires different skills, and the talent pool with both LLM engineering knowledge and regulatory validation experience is narrow.
The Validation Gaps Foundation Models Create
There are four validation gaps that appear consistently across bank LLM deployments, based on the published supervisory findings and interagency guidance from 2026.
Outcome validation without ground truth. SR 11-7 expects validators to test model outcomes against known results. For a probability of default model, you have historical default data. For an LLM generating BSA narrative summaries, what is the ground truth? Banks are discovering that output quality for generative tasks requires entirely different evaluation frameworks, including human-in-the-loop rating protocols, adversarial red-teaming and semantic consistency testing across paraphrased inputs.
Distributional shift without feature visibility. Traditional model monitoring watches for input feature drift and output distribution changes. Foundation models do not expose their internal feature representations. Monitoring for model degradation in an LLM deployment requires proxy metrics: output token distributions, semantic embedding drift in output space and downstream task performance metrics that serve as indirect indicators of model health.
Hallucination as a model risk category. SR 11-7 discusses model error in terms of statistical inaccuracy, parameter uncertainty and specification error. Hallucination is a structurally different failure mode: the model produces a confident, well-formed output that is factually incorrect. In a credit memo context, that means a summary might cite a financial ratio that does not appear in the source document. In an adverse action context, it could produce a legally problematic explanation that does not correspond to the actual decision factors. Neither example fits cleanly into the SR 11-7 taxonomy of model error.
Scope creep through prompt engineering. Banks often deploy a foundation model for a defined use case and then find that internal users have extended its application through creative prompting. SR 11-7 requires that model use be consistent with the validated scope. Prompt-based access to a general-purpose LLM makes scope enforcement technically difficult in ways that traditional model access controls do not face.
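Two of these gaps can be approximated with simple automated checks. The Python sketch below is illustrative only. It flags numbers in a summary that never appear in the source document (a narrow proxy for hallucinated figures) and scores how consistent outputs are across paraphrased prompts. The function names, the token-overlap similarity measure and the `generate` callable are assumptions made for illustration; they are not part of SR 11-7 or of any vendor API, and a production check would more likely use embedding similarity and normalize number formats.

```python
import re
from itertools import combinations
from typing import Callable, List, Set

# Matches integers and decimals such as "4" or "1.35" (illustrative pattern only).
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")
# Matches a comma sitting between two digits, as in "1,250".
THOUSANDS_SEPARATOR = re.compile(r"(?<=\d),(?=\d)")


def extract_numbers(text: str) -> Set[str]:
    """Collect numeric tokens after removing thousands separators."""
    return set(NUMBER_PATTERN.findall(THOUSANDS_SEPARATOR.sub("", text)))


def ungrounded_numbers(source: str, summary: str) -> Set[str]:
    """Return numbers cited in the summary that do not appear in the source.

    Assumption: exact string match. "4.20" and "4.2" are treated as different,
    so a real check would normalize formats before comparing.
    """
    return extract_numbers(summary) - extract_numbers(source)


def token_jaccard(a: str, b: str) -> float:
    """Crude lexical similarity: shared tokens divided by all tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def paraphrase_consistency(generate: Callable[[str], str],
                           paraphrases: List[str]) -> float:
    """Return the lowest pairwise similarity among outputs for paraphrased prompts.

    `generate` is a hypothetical wrapper around the deployed model.
    """
    outputs = [generate(prompt) for prompt in paraphrases]
    if len(outputs) < 2:
        return 1.0  # nothing to compare
    return min(token_jaccard(x, y) for x, y in combinations(outputs, 2))
```

A validator might run `ungrounded_numbers` over a sample of credit memo summaries and route any non-empty result to human review, while tracking the consistency score against a threshold agreed before testing begins.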
What OCC and FDIC Examiners Are Actually Asking in 2026
Examiner focus on AI and LLM governance has intensified following the OCC's January 2026 interpretive letter clarifying that AI systems used in credit decisions fall within existing model risk management supervisory expectations. Banks in examination cycles this year are reporting a consistent set of questions from OCC and FDIC examiners.
Examiners are asking whether foundation models appear in the model inventory. This sounds basic but catches banks off guard because some teams classify LLM deployments as software tools rather than models, using the argument that they are not making autonomous quantitative predictions. Examiners are not accepting that classification when the LLM output materially influences a regulated decision.
Examiners are asking about third-party model risk. When a bank deploys a foundation model via API from a commercial provider, SR 11-7 still requires that the bank maintain accountability for model performance. The guidance explicitly addresses vendor-supplied models. Examiners are asking what due diligence the bank performed on the underlying model, what contractual rights the bank has to validation data from the vendor and what monitoring is in place for model version changes pushed by the vendor.
Examiners are asking about human oversight design. For any LLM deployment touching a regulated decision, examiners want to see documentation of where human review occurs, what criteria trigger escalation and how the bank ensures that human reviewers are not simply rubber-stamping model outputs. This connects directly to the SR 11-7 expectation that model limitations be understood and that model outputs be used appropriately.
Examiners are also asking about fair lending implications of LLM outputs. If an LLM generates adverse action notices or assists in underwriting narratives, the bank must demonstrate that the outputs do not introduce disparate impact or disparate treatment along protected class lines. This intersects with ECOA and the Fair Housing Act in ways that pure model performance testing cannot fully address.
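One way to prepare for these questions is to encode them as a checklist that runs against each deployment record. The sketch below is a hypothetical illustration in Python. The field names and gap descriptions are assumptions chosen to mirror the four question areas above; they are not drawn from any examiner manual.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LlmDeployment:
    """Hypothetical governance record for one LLM deployment."""
    name: str
    influences_regulated_decision: bool
    in_model_inventory: bool
    vendor_supplied: bool
    vendor_due_diligence_on_file: bool
    human_review_documented: bool
    fair_lending_testing_on_file: bool


def open_examination_gaps(deployment: LlmDeployment) -> List[str]:
    """List the question areas this deployment cannot yet answer."""
    gaps: List[str] = []
    regulated = deployment.influences_regulated_decision
    if regulated and not deployment.in_model_inventory:
        gaps.append("not registered in model inventory")
    if deployment.vendor_supplied and not deployment.vendor_due_diligence_on_file:
        gaps.append("no vendor due diligence on file")
    if regulated and not deployment.human_review_documented:
        gaps.append("human oversight design not documented")
    if regulated and not deployment.fair_lending_testing_on_file:
        gaps.append("no fair lending testing on file")
    return gaps
```

Running this across the full list of deployments before an exam gives the model risk team a gap list in the same terms examiners are likely to use.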
How Banks Are Adapting Their MRM Frameworks
Larger banks with mature model risk management functions are building LLM-specific addenda to their MRM policies rather than trying to force foundation models into existing templates. The approach that is gaining traction involves a tiered classification system based on decision impact.
Tier one covers LLM deployments that directly influence a regulated decision: credit underwriting, adverse action, BSA/AML filing. These receive full SR 11-7 treatment with enhanced validation requirements specific to generative models, including red-team testing, hallucination rate benchmarking and output consistency testing across semantically similar inputs.
Tier two covers LLM deployments that support human decision-making without directly outputting regulated content: document summarization for relationship managers, internal compliance research assistants, exam prep tools. These receive a lighter validation track but still require inventory registration, scope documentation and basic output quality monitoring.
Tier three covers internal productivity applications with no connection to regulated decisions or customer data. These are typically governed under IT risk management rather than model risk management frameworks.
The tiering system does not satisfy all examiner questions on its own. What it does is create a defensible governance structure that demonstrates the bank has thought systematically about where foundation model risk concentrates.
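The tiering logic is simple enough to express as a small function, which also forces the bank to decide how edge cases are handled. The Python sketch below is a minimal illustration; the parameter names are assumptions, and the choice to place deployments that touch customer data but support no decision into tier two is a conservative assumption that the tier descriptions above do not address.

```python
from enum import IntEnum


class Tier(IntEnum):
    ONE = 1    # directly influences a regulated decision
    TWO = 2    # supports human decision-making
    THREE = 3  # internal productivity only


def classify_tier(directly_influences_regulated_decision: bool,
                  supports_human_decision_making: bool,
                  touches_customer_data: bool) -> Tier:
    """Assign a governance tier to an LLM deployment."""
    if directly_influences_regulated_decision:
        return Tier.ONE
    # Assumption: customer data alone is enough to rule out tier three,
    # because tier three requires no connection to customer data.
    if supports_human_decision_making or touches_customer_data:
        return Tier.TWO
    return Tier.THREE
```

For example, an adverse action drafting tool would be classified with `classify_tier(True, False, True)` and land in tier one, while an internal meeting-notes assistant with no customer data would land in tier three.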
Third-Party Foundation Models and the Vendor Accountability Problem
The vast majority of bank LLM deployments in 2026 rely on foundation models from commercial providers accessed via API. This creates a vendor model risk problem that SR 11-7 anticipated in general terms but that takes on specific complexity with foundation models.
SR 11-7 states that a bank's use of vendor models does not diminish its responsibility to understand the models' limitations and ensure appropriate use. This expectation was written with models like vendor-supplied credit scores in mind, where the vendor provides performance documentation, validation data and a defined input-output specification. Foundation model providers do not typically offer model cards at the level of detail that SR 11-7 validation requires.
Banks are navigating this through contractual provisions requiring advance notice of model version changes, audit rights for performance data and representations about training data provenance. The enforceability of these provisions against large foundation model providers varies. Some banks are moving toward self-hosted open-source models specifically to gain the validation access that commercial API deployments cannot provide.
The FDIC's 2026 guidance on third-party risk management reinforces that vendor relationships do not transfer regulatory accountability. A bank that cannot validate a vendor model must either accept higher residual risk with compensating controls or find an alternative model source that supports validation. Examiners are treating "the vendor won't give us that documentation" as a risk finding, not an excuse.
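Where the contract does not guarantee advance notice of version changes, a compensating control is to detect them from the bank's own logs. The Python sketch below assumes each logged response carries a model version string reported by the provider. That field, the record type and the function name are hypothetical, and whether a given vendor exposes a reliable version identifier is something the bank would need to confirm.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class ResponseRecord:
    """Hypothetical log entry for one API response."""
    request_id: str
    model_version: str  # assumed to be reported by the vendor with each response


def unvalidated_version_counts(records: Iterable[ResponseRecord],
                               validated_versions: Set[str]) -> Dict[str, int]:
    """Count responses served by model versions the bank has not validated."""
    counts = Counter(
        record.model_version
        for record in records
        if record.model_version not in validated_versions
    )
    return dict(counts)
```

A non-empty result indicates that production traffic has been served by a model version outside the validated set, which would typically trigger revalidation or a rollback to a pinned version where the vendor supports one.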
A Practical Validation Architecture for LLM Deployments
Based on the supervisory signals from the OCC and Federal Reserve in 2026 and the published model risk management frameworks from banks that have successfully navigated LLM examinations, a practical validation architecture for foundation model deployments includes the following components.
Pre-deployment documentation. A model card adapted for SR 11-7 purposes that covers intended use scope, known limitations, base model provenance, fine-tuning data sourcing and a description of evaluation benchmarks used during development. This is the closest analog to the conceptual soundness documentation SR 11-7 requires.
Red-team validation protocol. An independent validation team runs adversarial prompting exercises designed to elicit hallucinations, scope violations and outputs that could create regulatory exposure. Results are documented with pass/fail criteria defined in advance. This is a direct analog to the sensitivity analysis and stress testing SR 11-7 expects for quantitative models.
Output monitoring pipeline. Automated monitoring of output distributions using embedding-based semantic clustering to detect drift in output characteristics over time. Paired with human sample review at a defined cadence to catch qualitative degradation that automated metrics miss.
Scope enforcement controls. Technical controls that restrict prompt scope through system-level instructions, output filtering and logging, combined with periodic audits of actual use logs to detect scope creep. This is the operational equivalent of access controls on traditional model infrastructure.
Ongoing governance integration. Quarterly model risk committee reporting on LLM deployments using the same template as other tier-one models, with hallucination rate, scope violation rate and human override rate as the primary performance metrics.
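The monitoring and reporting components can be sketched in a few lines of Python. The example below is a simplified illustration, not a reference implementation. It measures drift as the cosine distance between the centroid of a baseline set of output embeddings and the centroid of a recent window, which is a simpler stand-in for the semantic clustering described above, and it computes the three governance rates from human-reviewed samples. The embeddings are assumed to come from whatever embedding model the bank already uses, and the record fields are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of equal-length embedding vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 minus cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def centroid_drift(baseline: Sequence[Sequence[float]],
                   current: Sequence[Sequence[float]]) -> float:
    """Drift proxy: distance between baseline and current output centroids."""
    return cosine_distance(centroid(baseline), centroid(current))


@dataclass(frozen=True)
class ReviewedOutput:
    """Hypothetical result of one human sample review."""
    hallucinated: bool
    out_of_scope: bool
    human_overrode: bool


def governance_rates(reviews: Sequence[ReviewedOutput]) -> Dict[str, float]:
    """Compute the three committee metrics as fractions of reviewed samples."""
    if not reviews:
        raise ValueError("at least one reviewed output is required")
    total = len(reviews)
    return {
        "hallucination_rate": sum(r.hallucinated for r in reviews) / total,
        "scope_violation_rate": sum(r.out_of_scope for r in reviews) / total,
        "human_override_rate": sum(r.human_overrode for r in reviews) / total,
    }
```

Thresholds for both the drift score and the three rates would need to be set in advance by the validation team, in the same way the red-team protocol defines its pass/fail criteria before testing.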
SR 11-7 is not going away. The Federal Reserve and OCC have signaled clearly that new guidance will supplement rather than replace the existing framework. Banks that build validation architectures rooted in SR 11-7 principles, adapted explicitly for foundation model characteristics, are better positioned for examination than those waiting for a new regulatory standard that may arrive after their next exam cycle. The validation gap is real. Closing it requires engineering discipline, not just policy documentation.
