Out-of-Distribution Detection in Vision-Language Models: A Survey
Paper Review · Vision-Language Models (VLMs) like CLIP have dramatically shifted the landscape of visual understanding. Trained on internet-scale image-text pairs, these models demonstrate remarkable zero-shot generalization, describing objects they have never explicitly seen during training. Yet this generalization comes with an underappreciated fragility: when deployed in the real world, VLMs routinely encounter inputs that bear no resemblance to anything in their training distribution. A growing body of work (Miyai et al., 2025; Li et al., 2025) asks a fundamental question: can we reliably detect when a VLM is operating outside its intended distribution?
This post surveys the core methods and challenges in out-of-distribution (OOD) detection for VLMs, tracing the progression from classical scoring functions to LLM-augmented detection pipelines and domain-specific adaptations.
Background: OOD Detection and VLMs
The OOD Problem
Out-of-distribution detection asks: given a model trained on a set of in-distribution (ID) classes, can we identify test samples that belong to classes unseen during training? (Yang et al., 2024) This is related to, but distinct from, anomaly detection, novelty detection, and open-set recognition, which differ primarily in what constitutes the “other” and whether ID data is the anomaly or the norm.
Classical approaches for single-modal networks applied scoring functions to extracted features:
- Maximum Softmax Probability (MSP): flag samples where \(\max_c p(y=c \mid x)\) falls below a threshold, the simplest and most widely adopted baseline
- Mahalanobis distance: measure the distance from a test embedding to the nearest class centroid in feature space, flagging outliers that are far from all centroids
- Energy scores: compute \(E(x) = -\log \sum_c e^{s_c(x)}\) from the class logits \(s_c\); OOD samples tend to receive higher energy, and the score is less susceptible to overconfident predictions than raw softmax (Yang et al., 2024)
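A minimal sketch of the first two classical scores, assuming a generic single-modal classifier that exposes logits, per-class feature centroids, and a shared inverse covariance matrix (the function names and interfaces here are illustrative, not from any specific library):

```python
import numpy as np

def msp_score(logits):
    """Maximum Softmax Probability: flag as OOD when below a threshold."""
    z = logits - logits.max()                      # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(p.max())

def mahalanobis_score(feat, centroids, cov_inv):
    """Negative squared Mahalanobis distance to the nearest class centroid
    (shared covariance assumed); lower = far from every class = OOD-like."""
    diffs = centroids - feat                       # shape (C, d)
    d2 = np.einsum('cd,de,ce->c', diffs, cov_inv, diffs)
    return float(-d2.min())
```

Both return "higher = more in-distribution" scores, so a single threshold convention can be shared across scoring functions.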
These methods produced reasonable results within fixed domains, but their performance degraded sharply under distribution shift and required task-specific fine-tuning to adapt to new settings. A related thread uses gradient information rather than logits as the OOD signal; methods like GAIA (gradient-based attribution) and GROOD (gradient-aware scoring) show that gradients with respect to input features or intermediate activations carry distributional membership information that logit-based scores miss, and are particularly effective for near-boundary samples.
Why VLMs Reshape the Problem
VLMs like CLIP change several fundamental assumptions (Miyai et al., 2025):
- Multimodal representations: features encode both visual and semantic information, providing richer structure than single-modal networks
- Zero-shot generalization: image-text alignment enables OOD scoring without task-specific training; you just need text descriptions of ID classes
- Semantic expressiveness: the text encoder can articulate negative concepts, describe backgrounds, and express hierarchical class structure, opening up prompt-based detection strategies
- Open-vocabulary capability: new domains can be handled in principle by simply providing text embeddings for new class labels
However, VLMs also encode a subtle failure mode. Despite training on billions of image-text pairs, they operate through finite query sets at inference time (Miller et al., 2024). When the set of class name embeddings used as queries doesn’t align with the true ID/OOD boundary (which is almost always the case), VLMs assign high confidence to OOD samples that resemble known classes, or misclassify them into the nearest ID class. Naively expanding the query set does not fix this and can degrade both ID accuracy and OOD detection simultaneously.
A Taxonomy of OOD Detection Methods
Training-Free Approaches
The most practical methods for rapid deployment require no gradient updates, treating VLMs as frozen feature extractors (Li et al., 2025).
Distance-based scoring computes the cosine similarity between a test image embedding \(f_v(x)\) and the text embeddings \(\mathbf{t}_c\) of each ID class:
\[s_{\text{dist}}(x) = \max_c \cos(f_v(x),\, \mathbf{t}_c)\]
A low maximum similarity flags the input as OOD. The Maximum Concept Matching (MCM) score (Ming et al., 2022) formalizes this for CLIP specifically and shows that it substantially outperforms post-hoc methods originally designed for softmax classifiers when applied to zero-shot VLM settings. Despite its simplicity, MCM sets a strong baseline that many subsequent methods struggle to consistently beat.
Energy-based scoring aggregates across all class similarities rather than taking the maximum:
\[E(x) = -\log \sum_c \exp(s_c(x) / T)\]
where \(T\) is a temperature parameter. OOD samples tend to have higher energy because no single class matches well. Compared to MSP-style thresholding, energy scores are better calibrated and less sensitive to overconfident predictions on near-boundary samples (Yang et al., 2024).
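Both training-free scores can be computed from the same vector of cosine similarities. A minimal sketch, assuming image and text embeddings are already L2-normalized (as CLIP's are after projection); the temperature default is illustrative:

```python
import numpy as np

def dist_score(image_emb, text_embs):
    """Max cosine similarity to any ID class text embedding; low = OOD."""
    return float((text_embs @ image_emb).max())

def energy_score(image_emb, text_embs, T=0.01):
    """E(x) = -log sum_c exp(s_c / T); OOD samples receive higher energy."""
    z = (text_embs @ image_emb) / T
    m = z.max()                                    # stable log-sum-exp
    return float(-(m + np.log(np.exp(z - m).sum())))
```

With three orthonormal class embeddings, an image aligned with one class gets a max similarity of 1.0 and very low energy, while an image equidistant from all classes scores lower similarity and higher energy, illustrating both decision rules.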
A critical limitation of both approaches: they exploit only the textual side of a fundamentally bimodal model.
Dual-Pattern Matching
(Ding et al., 2024) addresses this limitation directly. The key observation is that existing CLIP-based OOD methods use only one modality of ID information, namely class name text embeddings, while ignoring the wealth of visual structure available from the training set.
Dual-Pattern Matching (DPM) maintains two complementary ID representations:
- Textual pattern: class-wise text embeddings \(\mathbf{T} = \{\mathbf{t}_c\}_{c=1}^{C}\), encoding semantic class descriptions
- Visual pattern: the mean ID visual embedding \(\mu_v = \frac{1}{|D_{\text{ID}}|} \sum_{x \in D_{\text{ID}}} f_v(x)\), encoding aggregate visual statistics of the training distribution
At test time, both channels are combined:
\[s_{\text{DPM}}(x) = \alpha \cdot \cos(f_v(x),\, \mu_v) \;+\; (1-\alpha) \cdot \max_c \cos(f_v(x),\, \mathbf{t}_c)\]
where \(\alpha\) balances the two contributions. The paper provides two variants: DPM-F (training-free, \(\mu_v\) approximated without labeled ID data) and DPM-T (lightweight supervised adaptation that learns \(\alpha\) from ID examples). Both consistently outperform single-modality baselines, with DPM-T providing additional gains in specialized domains.
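The combined score above is straightforward to sketch. This is a simplified reading of DPM, assuming all embeddings (including the mean visual pattern \(\mu_v\)) are L2-normalized; the paper's full method has additional components not shown here:

```python
import numpy as np

def dpm_score(image_emb, text_embs, mu_v, alpha=0.5):
    """Dual-Pattern Matching sketch: blend visual-pattern similarity with
    the best textual-pattern match. Higher = more in-distribution.
    Assumes image_emb, text_embs rows, and mu_v are unit-normalized."""
    visual = float(mu_v @ image_emb)               # cos to mean ID embedding
    textual = float((text_embs @ image_emb).max()) # best ID class match
    return alpha * visual + (1 - alpha) * textual
```

In DPM-T, \(\alpha\) would be fit on a few labeled ID examples rather than fixed; in DPM-F it stays a hyperparameter.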
Negative Label Guidance
A complementary direction focuses on teaching VLMs to express “this is not any of my known classes” (Wang et al., 2023). CLIPN introduces a dedicated “no” text prompt that is contrastively trained alongside the standard class embeddings. During inference, OOD detection becomes a comparison between the positive class scores and the “no” score, converting the OOD problem into an explicit classification over a reject option.
(Jiang et al., 2024) takes a training-free approach to the same idea. Given a large vocabulary (e.g., from WordNet), candidate negative labels (labels that are not ID classes) are scored against the ID class embeddings. The most discriminative negatives are selected and used as an explicit OOD reference set. The OOD score for a test image is then the maximum similarity against the negative label set, and samples that are most similar to the negative labels are flagged as OOD. This achieves strong results without any parameter updates, with reported gains of over 2.9% AUROC and 12.6% FPR₉₅ reduction compared to MCM on standard benchmarks.
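The mine-then-score pipeline can be sketched in a few lines. This is a simplified stand-in for the paper's negative-mining criterion and softmax-based score, assuming pre-computed candidate-vocabulary embeddings; the selection heuristic here (least similar to any ID class) is one plausible choice, not the exact published rule:

```python
import numpy as np

def select_negative_labels(candidate_embs, id_text_embs, k):
    """Keep the k candidate embeddings least similar to every ID class
    embedding, so they serve as a discriminative OOD reference set."""
    max_sim_to_id = (candidate_embs @ id_text_embs.T).max(axis=1)
    return candidate_embs[np.argsort(max_sim_to_id)[:k]]

def neg_label_score(image_emb, id_text_embs, neg_embs):
    """Higher = more OOD-like: similarity to the negative set minus
    similarity to the ID set (the paper uses a softmax ratio instead)."""
    s_neg = float((neg_embs @ image_emb).max())
    s_id = float((id_text_embs @ image_emb).max())
    return s_neg - s_id
```

A test image closer to a mined negative than to any ID class receives a positive score and is flagged as OOD.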
LLM-Augmented Detection
(Lee et al., 2025) goes further by using a Large Language Model to semantically enrich the ID representation before OOD scoring.
Semantic feature refinement: raw class name embeddings conflate class-specific semantics with generic domain patterns. The proposed pipeline uses an LLM to:
- Generate superclasses for each ID label (e.g., “golden retriever” → “dog” → “animal”)
- Generate background descriptions for each superclass (e.g., “outdoor scene with grass and trees”)
- Extract CLIP features for both: \(\mathbf{f}_{\text{super}}\) and \(\mathbf{f}_{\text{bg}}\)
- Subtract background features from superclass features to isolate discriminative semantics: \(\tilde{\mathbf{f}}_{\text{ID}} = \mathbf{f}_{\text{super}} - \mathbf{f}_{\text{bg}}\)
This subtraction removes the domain-generic component, leaving a representation that captures what makes ID classes distinct rather than merely present. OOD scoring is then performed against \(\tilde{\mathbf{f}}_{\text{ID}}\) and the curated negative label set jointly.
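The refinement step reduces to a subtraction and re-normalization in embedding space. A sketch under the assumption that both feature vectors are CLIP embeddings on the unit sphere; the exact weighting of the background term in the paper may differ:

```python
import numpy as np

def refine_id_embedding(f_super, f_bg):
    """Remove the domain-generic (background) component from a superclass
    embedding and re-normalize, leaving class-discriminative semantics."""
    f = f_super - f_bg
    return f / np.linalg.norm(f)
```

The refined vectors \(\tilde{\mathbf{f}}_{\text{ID}}\) then replace the raw class name embeddings in whatever cosine-based OOD score is used downstream.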
Few-shot extension: the method can be extended with visual prompt tuning (VPT), which prepends learnable visual tokens to input images, and multi-modal prompt tuning when a small number of ID examples are available. This bridges the gap between truly training-free deployment and domain-specific adaptation.
Hierarchical Prompt Engineering for Specialized Domains
For domains like medical imaging, flat class names carry limited semantic content. “Chest X-ray” is ambiguous; “bilateral pleural effusion on posteroanterior chest X-ray” is not. (Ju et al., 2025) addresses this with hierarchical prompt structures that organize medical concepts across multiple abstraction levels:
- Organ level: “thoracic”
- Modality level: “chest X-ray”
- Pathology level: “consolidation”, “pleural effusion”
- Severity level: “bilateral”, “mild”
The MCM score is generalized to include embeddings from all levels of the hierarchy \(\mathcal{H}\):
\[s_{\text{MCM}}(x) = \max_{c \in C \cup \mathcal{H}} \cos(f_v(x),\, \mathbf{e}_c)\]
Background descriptions at each level further ground the ID distribution. The result is evaluated across cross-modality OOD scenarios (X-ray vs. CT vs. MRI), which involve both semantic shift (different pathologies) and covariate shift (different acquisition protocols and scanner characteristics).
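The generalized score is a direct extension of plain MCM: take the maximum over the union of class embeddings and hierarchy-concept embeddings. A minimal sketch, assuming all embeddings are unit-normalized:

```python
import numpy as np

def hierarchical_mcm(image_emb, class_embs, hierarchy_embs):
    """MCM generalized over ID classes plus hierarchy concepts: max cosine
    similarity across the union of both embedding sets. Low = OOD."""
    all_embs = np.vstack([class_embs, hierarchy_embs])
    return float((all_embs @ image_emb).max())
```

An image that matches no flat class name but does match a hierarchy concept (say, an organ- or modality-level prompt) is correctly retained as in-distribution rather than rejected.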
Near-OOD vs. Far-OOD: The Central Tradeoff
One of the most practically important findings from recent benchmarking is the asymmetry between detection difficulty for semantically distant vs. semantically adjacent OOD samples (Miyai et al., 2025):
- Far-OOD: test samples from categories that are semantically very different from all ID classes (e.g., detecting vehicles when trained on animal species). Simple distance-based and energy-based methods handle this well, as the semantic gap is large enough that low similarity scores clearly signal OOD.
- Near-OOD: test samples from categories that share semantic structure with ID classes (e.g., detecting unseen dog breeds when trained on a subset of breeds). This is dramatically harder, as the model sees a semantically familiar image and assigns it high confidence in the nearest known category. Topological analysis of OOD examples reveals that near-OOD samples tend to occupy low-dimensional manifolds that are geometrically close to, but distinct from, the ID manifolds, which explains why simple distance-based scoring fails and motivates shape-aware scoring functions. Keeping LoRA weights unmerged at inference time is a complementary approach that extracts near-OOD signals from the adaptation gap between the base model and the fine-tuned modules.
Most methods in the literature optimize implicitly for Far-OOD. Near-OOD performance tends to collapse for simple baselines and requires the richer semantic structure of LLM-augmented or negative-label methods to maintain reasonable detection rates (Lee et al., 2025). Covariate shift (same semantics, different visual conditions) adds a third dimension that further complicates the picture (Noda et al., 2025).
Real-world deployments encounter all three types simultaneously, which is why single-metric evaluation on standard benchmarks can be deeply misleading.
Benchmarking and Evaluation
The Open-Set Recognition Benchmark
(Liu et al., 2026) proposes a comprehensive evaluation framework that systematically varies label granularity, semantic distance between ID and OOD classes, and fine-tuning objectives. Key findings include:
- Discriminative fine-tuning approaches are frequently misaligned with standard open-set recognition hardness metrics, and models that achieve high closed-set accuracy show inconsistent OOD rejection behavior
- Likelihood-based approaches become tractable with VLMs and outperform direct discrimination in several regimes, particularly at fine-grained label granularity
- The trade-off between rejection behavior and ID accuracy varies systematically with label granularity in ways that are not captured by aggregate metrics like AUROC
Real-World OOD Detection Benchmark
(Noda et al., 2025) introduces ImageNet-X, ImageNet-FS-X, and Wilds-FS-X, benchmarks specifically designed to evaluate VLM-based OOD detection under realistic conditions including both semantic and covariate shifts. The few-shot variants are particularly useful for evaluating prompt-tuning and lightweight adaptation methods. Evaluations reveal that methods that perform well under standard benchmark conditions often degrade substantially when covariate shift accompanies semantic shift.
Generalized OOD Detection Framework
(Miyai et al., 2025) provides the broadest taxonomic lens, arguing that in the VLM era the traditional boundaries between OOD detection, anomaly detection, novelty detection, and open-set recognition are dissolving. The most pressing challenges are consolidating around OOD detection (semantic shift) and anomaly detection (covariate shift), and the key axes of variation are the type of shift and the level of supervision available (zero-shot, few-shot, full fine-tuning). A complementary survey (Li et al., 2025) organizes CLIP-based methods specifically around the train-free vs. training-required axis and provides a useful structured view of the design space.
Open Challenges
Several important problems remain largely unsolved.
Sensitivity to prompt phrasing. Empirical analyses have found that VLM-based OOD detection scores are highly sensitive to how class names are phrased in prompts; minor variations in wording can shift AUROC by several percentage points. This makes reproducible benchmarking difficult and deployment unreliable.
Scalability. Storing and querying reference feature sets (visual patterns, negative label embeddings, and hierarchical concept embeddings) becomes expensive as the number of ID classes scales to thousands or millions. Methods that are tractable on 100-class benchmarks may not translate to real-world class vocabularies.
Calibration. OOD scores from most current methods are rankings, not calibrated probabilities. Deployment requires knowing not just whether a sample is OOD but how confident to be. Well-calibrated uncertainty over the OOD decision remains an open problem.
Large Vision-Language Models. GPT-4V, Gemini, and similar models introduce new complications: OOD inputs may interact with hallucination behavior in ways that are poorly understood, and the scale of these models makes lightweight adaptation methods far more attractive than full fine-tuning. Evaluating OOD detection in multi-task reasoning settings, where the model does more than classify, remains largely unexplored.
Practical Guidance
When selecting a method for deployment, a few principles apply consistently across the literature:
- Start with MCM or DPM-F. Both are training-free and exploit CLIP’s bimodal structure, providing a strong baseline with minimal infrastructure overhead (Ming et al., 2022; Ding et al., 2024).
- Near-OOD performance requires richer semantic structure. If your deployment involves fine-grained classes with semantically similar OOD categories, negative label guidance (Jiang et al., 2024) or LLM-augmented refinement (Lee et al., 2025) are necessary, as simple baselines will degrade significantly.
- Domain-specific hierarchies pay off in specialized settings. Medical and industrial domains benefit from hierarchical prompt engineering, especially when paired with a domain-adapted VLM (Ju et al., 2025).
- Evaluate at your actual operating threshold. AUROC summarizes performance across all thresholds, but deployment almost always targets a specific false positive rate. A method that wins on AUROC may lose badly at FPR < 5%, which is the threshold that matters for safety-critical applications.
- Test for covariate shift separately. Semantic OOD and covariate OOD are different problems. Benchmarks like ImageNet-X help reveal whether a method generalizes across both (Noda et al., 2025).
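Evaluating at a fixed operating point, as recommended above, amounts to setting the threshold from ID scores and measuring how many OOD samples slip through. A minimal sketch of the standard FPR-at-TPR metric (FPR₉₅ when `tpr=0.95`), under the convention that higher scores mean more in-distribution:

```python
import numpy as np

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples wrongly accepted when the threshold is set
    so that `tpr` of ID samples are accepted. FPR95 when tpr=0.95."""
    thresh = np.quantile(id_scores, 1 - tpr)       # accept scores >= thresh
    return float(np.mean(ood_scores >= thresh))
```

Reporting this number at the deployment-relevant TPR, alongside AUROC, avoids the trap of a method that wins on the aggregate metric but fails at the threshold that actually matters.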
Conclusion
OOD detection for VLMs has evolved rapidly from post-hoc adaptations of classical scoring functions to methods that genuinely exploit the bimodal structure of these models. The central message from the literature is that multimodality is a structural advantage: methods that use both visual and textual channels of CLIP-like models consistently outperform those that use only one. The integration of large language models for semantic enrichment, hierarchical prompt design for specialized domains, and principled negative label selection represent the current frontier.
The field is also maturing in its evaluation practices. Standardized benchmarks that jointly evaluate semantic and covariate shift, that vary label granularity, and that report performance at deployment-relevant operating thresholds are becoming the norm. This makes it easier to understand when methods generalize and when they don’t.
Significant open problems remain around calibration, sensitivity to prompt phrasing, and scaling to large class vocabularies. As VLMs become central components of production systems in medicine, autonomous driving, and content moderation, solving these problems transitions from academic interest to a practical necessity for safe AI deployment.
This post surveys methods from (Miyai et al., 2025; Li et al., 2025) and related work. For related reading across our OOD series: Introduction to OOD Detection for foundational concepts, Topology of OOD Examples for geometric analysis of OOD representations, GAIA: gradient-based attribution for OOD detection, GROOD: gradient-aware OOD scoring, and LoRA-based near-OOD detection.
References
- Miyai, A., Yang, J., Zhang, J., Ming, Y., Lin, Y., Yu, Q., Irie, G., Joty, S., Li, Y., Li, H., Liu, Z., Yamasaki, T., & Aizawa, K. (2025). Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey. Transactions on Machine Learning Research. https://arxiv.org/abs/2407.21794
- Li, C., Zhang, E., Geng, C., & Chen, S. (2025). Recent Advances in Out-of-Distribution Detection with CLIP-Like Models: A Survey. https://arxiv.org/abs/2505.02448
- Yang, J., Zhou, K., Li, Y., & Liu, Z. (2024). Generalized out-of-distribution detection: A survey. International Journal of Computer Vision, 132(12), 5635–5662.
- Miller, D., Sünderhauf, N., Kenna, A., & Mason, K. (2024). Open-Set Recognition in the Age of Vision-Language Models. European Conference on Computer Vision (ECCV). https://arxiv.org/abs/2403.16528
- Ming, Y., Cai, Z., Gu, J., Sun, Y., Li, W., & Li, Y. (2022). Delving into Out-of-Distribution Detection with Vision-Language Representations. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2211.13445
- Ding, Y., Zhu, K., & Lu, H. (2024). Vision-Language Dual-Pattern Matching for Out-of-Distribution Detection. European Conference on Computer Vision (ECCV). https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/11399.pdf
- Wang, H., Li, Y., Yao, H., & Li, X. (2023). CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No. IEEE/CVF International Conference on Computer Vision (ICCV). https://arxiv.org/abs/2308.12213
- Jiang, X., Liu, F., Fang, Z., Chen, H., Liu, T., Zheng, F., & Han, B. (2024). Negative Label Guided OOD Detection with Pretrained Vision-Language Models. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2403.20078
- Lee, P., Chen, J., & Wu, J. (2025). Harnessing Large Language and Vision-Language Models for Robust Out-of-Distribution Detection. https://arxiv.org/abs/2501.05228
- Ju, L., Zhou, S., Zhou, Y., Lu, H., Zhu, Z., Keane, P. A., & Ge, Z. (2025). Delving into Out-of-Distribution Detection with Medical Vision-Language Models. https://arxiv.org/abs/2503.01020
- Noda, S., Miyai, A., Yu, Q., Irie, G., & Aizawa, K. (2025). A Benchmark and Evaluation for Real-World Out-of-Distribution Detection Using Vision-Language Models. IEEE International Conference on Image Processing (ICIP). https://arxiv.org/abs/2501.18463
- Liu, Y., Yue, Z., Kuang, K., Zhang, F., & Zhang, H. (2026). Benchmarking Open-Set Recognition Beyond Vision-Language Pre-training. International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=mFTmKxA19G