Out-of-Distribution Detection in Vision-Language Models: A Survey

Vision-Language Models (VLMs) like CLIP have dramatically shifted the landscape of visual understanding. Trained on internet-scale image-text pairs, these models demonstrate remarkable zero-shot generalization, describing objects they have never explicitly seen during training. Yet this generalization comes with an underappreciated fragility: when deployed in the real world, VLMs routinely encounter inputs that bear no resemblance to anything in their training distribution. A growing body of work (Miyai et al., 2025; Li et al., 2025) asks a fundamental question: can we reliably detect when a VLM is operating outside its intended distribution?

This post surveys the core methods and challenges in out-of-distribution (OOD) detection for VLMs, tracing the progression from classical scoring functions to LLM-augmented detection pipelines and domain-specific adaptations.

Background: OOD Detection and VLMs

The OOD Problem

Out-of-distribution detection asks: given a model trained on a set of in-distribution (ID) classes, can we identify test samples that belong to classes unseen during training? (Yang et al., 2024) This is related to, but distinct from, anomaly detection, novelty detection, and open-set recognition, which differ primarily in what constitutes the “other” and whether ID data is the anomaly or the norm.

Classical approaches for single-modal networks applied scoring functions to extracted features or logits: maximum softmax probability (MSP), ODIN’s temperature-scaled and input-perturbed refinement of it, Mahalanobis distance to class-conditional Gaussian fits, and energy scores computed over the logits.

These methods produced reasonable results within fixed domains, but their performance degraded sharply under distribution shift and required task-specific fine-tuning to adapt to new settings. A related thread uses gradient information rather than logits as the OOD signal; methods like GAIA (gradient-based attribution) and GROOD (gradient-aware scoring) show that gradients with respect to input features or intermediate activations carry distributional membership information that logit-based scores miss, and are particularly effective for near-boundary samples.
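The gradient idea can be illustrated with a minimal numpy sketch. This is a generic gradient-norm score in the spirit of GradNorm-style methods, not the GAIA or GROOD algorithms: for cross-entropy against a uniform target, the gradient with respect to the logits has the closed form softmax(z) − 1/C, so its norm can be computed without autograd.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def gradient_norm_score(logits):
    """Generic gradient-based score (illustrative simplification, not
    GAIA or GROOD): for cross-entropy against a uniform target, the
    gradient w.r.t. the logits is softmax(z) - 1/C in closed form. Its
    L1 norm is ~0 for flat (uncertain) predictions and large for peaked
    ones, so a LOW score suggests OOD."""
    p = softmax(np.asarray(logits, dtype=float))
    return float(np.abs(p - 1.0 / p.size).sum())
```

A confident (peaked) prediction like `[10.0, 0.0, 0.0]` yields a large norm, while near-uniform logits yield a norm near zero, which is the distributional signal these methods exploit.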

Why VLMs Reshape the Problem

VLMs like CLIP change several fundamental assumptions (Miyai et al., 2025): the ID label space is defined by text prompts rather than a fixed classification head, detection can run zero-shot without task-specific training, and a single frozen model can serve many downstream ID sets.

However, VLMs also encode a subtle failure mode. Despite training on billions of image-text pairs, they operate through finite query sets at inference time (Miller et al., 2024). When the set of class name embeddings used as queries doesn’t align with the true ID/OOD boundary (which is almost always the case), VLMs assign high confidence to OOD samples that resemble known classes, or misclassify them into the nearest ID class. Naively expanding the query set does not fix this and can degrade both ID accuracy and OOD detection simultaneously.

A Taxonomy of OOD Detection Methods

Training-Free Approaches

The most practical methods for rapid deployment require no gradient updates, treating VLMs as frozen feature extractors (Li et al., 2025).

Distance-based scoring computes the cosine similarity between a test image embedding \(f_v(x)\) and the text embeddings \(\mathbf{t}_c\) of each ID class:

\[s_{\text{dist}}(x) = \max_c \cos(f_v(x),\, \mathbf{t}_c)\]

A low maximum similarity flags the input as OOD. The Maximum Concept Matching (MCM) score (Ming et al., 2022) formalizes this for CLIP specifically and shows that it substantially outperforms post-hoc methods originally designed for softmax classifiers when applied to zero-shot VLM settings. Despite its simplicity, MCM sets a strong baseline that many subsequent methods struggle to consistently beat.
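A minimal numpy sketch of an MCM-style score follows; toy unit vectors stand in for CLIP image and text embeddings, and the temperature default here is arbitrary rather than the tuned value the paper uses.

```python
import numpy as np

def mcm_score(image_emb, text_embs, tau=1.0):
    """MCM-style score: softmax over cosine similarities between the
    image embedding and each ID class text embedding, scaled by
    temperature tau; the maximum probability is the ID-ness score
    (low => likely OOD)."""
    v = image_emb / np.linalg.norm(image_emb)
    T = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (T @ v) / tau              # one cosine similarity per class
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(p.max())

text = np.eye(5, 64)       # toy "text embeddings" for 5 ID classes
id_img = np.eye(5, 64)[2]  # aligned with class 2 -> high score
ood_img = np.ones(64)      # equally dissimilar to all classes -> score 1/5
```

An input equidistant from every class embedding collapses to the uniform score 1/C, which is exactly the "no class matches well" signature that thresholding catches.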

Energy-based scoring aggregates across all class similarities rather than taking the maximum:

\[E(x) = -\log \sum_c \exp(s_c(x) / T)\]

where \(T\) is a temperature parameter. OOD samples tend to have higher energy (less negative \(E(x)\)) because no single class matches well, so \(-E(x)\) serves as the detection score. Compared to MSP-style thresholding, energy scores are better calibrated and less sensitive to overconfident predictions on near-boundary samples (Yang et al., 2024).
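As a toy numpy sketch (the similarity vectors are stand-ins for CLIP class similarities, not real model outputs):

```python
import numpy as np

def energy_score(sims, T=1.0):
    """E(x) = -log sum_c exp(s_c / T), computed with a stabilized
    log-sum-exp. A strong match to any single class makes the sum large
    and E very negative; when no class matches, E stays closer to zero,
    which flags the sample as OOD."""
    s = np.asarray(sims, dtype=float) / T
    m = s.max()
    return float(-(m + np.log(np.exp(s - m).sum())))
```

Unlike the max-similarity score, partial evidence spread across several classes still lowers the energy, which is why the aggregate behaves better on near-boundary samples.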

A critical limitation of both approaches: they represent the ID distribution through the textual side alone, ignoring the visual half of a fundamentally bimodal model.

Dual-Pattern Matching

(Ding et al., 2024) addresses this limitation directly. The key observation is that existing CLIP-based OOD methods use only one modality of ID information, namely class name text embeddings, while ignoring the wealth of visual structure available from the training set.

Dual-Pattern Matching (DPM) maintains two complementary ID representations: a textual pattern built from the ID class name embeddings \(\mathbf{t}_c\), and a visual pattern \(\mu_v\) summarizing ID image features.

At test time, both channels are combined:

\[s_{\text{DPM}}(x) = \alpha \cdot \cos(f_v(x),\, \mu_v) \;+\; (1-\alpha) \cdot \max_c \cos(f_v(x),\, \mathbf{t}_c)\]

where \(\alpha\) balances the two contributions. The paper provides two variants: DPM-F (training-free, \(\mu_v\) approximated without labeled ID data) and DPM-T (lightweight supervised adaptation that learns \(\alpha\) from ID examples). Both consistently outperform single-modality baselines, with DPM-T providing additional gains in specialized domains.
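The combination rule can be sketched in a few lines of numpy. The single mean prototype and fixed \(\alpha\) below are simplifications for illustration, not the paper's exact construction, and the toy vectors stand in for CLIP features.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dpm_score(img_emb, text_embs, visual_proto, alpha=0.5):
    """Blend similarity to a visual prototype of ID data with the best
    text-embedding match, per the DPM formula above. A single mean
    prototype and fixed alpha are simplifications of the paper's method
    (DPM-T learns the balance from ID examples)."""
    text_term = max(cos(img_emb, t) for t in text_embs)
    return alpha * cos(img_emb, visual_proto) + (1 - alpha) * text_term

text = np.eye(3, 16)              # toy ID class text embeddings
proto = text.mean(axis=0)         # toy visual prototype (mean ID feature)
id_img = np.eye(3, 16)[0]
ood_img = -np.ones(16)
```

An OOD input must now evade two independent checks, proximity to the ID visual manifold and proximity to an ID class name, which is the source of DPM's consistent gains over either alone.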

Negative Label Guidance

A complementary direction focuses on teaching VLMs to express “this is not any of my known classes” (Wang et al., 2023). CLIPN introduces a dedicated “no” text prompt that is contrastively trained alongside the standard class embeddings. During inference, OOD detection becomes a comparison between the positive class scores and the “no” score, converting the OOD problem into an explicit classification over a reject option.

(Jiang et al., 2024) takes a training-free approach to the same idea. Given a large vocabulary (e.g., from WordNet), candidate negative labels (labels that are not ID classes) are scored against the ID class embeddings. The most discriminative negatives are selected and used as an explicit OOD reference set. The OOD score for a test image is then the maximum similarity against the negative label set, and samples that are most similar to the negative labels are flagged as OOD. This achieves strong results without any parameter updates, with reported gains of over 2.9% AUROC and 12.6% FPR₉₅ reduction compared to MCM on standard benchmarks.
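The pipeline described above can be sketched as follows. The selection rule here (keep vocabulary terms least similar to any ID class) is an illustrative stand-in for the paper's more elaborate mining criterion, and the embeddings are toy vectors rather than CLIP features.

```python
import numpy as np

def _unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def select_negatives(vocab_embs, id_text_embs, k):
    """Simplified negative mining: keep the k vocabulary embeddings
    whose best similarity to any ID class is lowest. The paper's
    selection criterion is more elaborate; this is a stand-in."""
    max_sim_to_id = (_unit(vocab_embs) @ _unit(id_text_embs).T).max(axis=1)
    return vocab_embs[np.argsort(max_sim_to_id)[:k]]

def is_ood(img_emb, id_text_embs, neg_embs):
    """Flag OOD when the image matches a negative label better than
    any ID class embedding."""
    v = img_emb / np.linalg.norm(img_emb)
    return float((_unit(neg_embs) @ v).max()) > float((_unit(id_text_embs) @ v).max())

vocab = np.eye(8)[:5]      # toy vocabulary embeddings (e.g., WordNet terms)
id_text = np.eye(8)[:2]    # two ID classes
negs = select_negatives(vocab, id_text, k=3)
```

The negatives act as an explicit "elsewhere" in embedding space, turning a one-sided threshold into a two-sided comparison.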

LLM-Augmented Detection

(Lee et al., 2025) goes further by using a Large Language Model to semantically enrich the ID representation before OOD scoring.

Semantic feature refinement: raw class name embeddings conflate class-specific semantics with generic domain patterns. The proposed pipeline uses an LLM to:

  1. Generate superclasses for each ID label (e.g., “golden retriever” → “dog” → “animal”)
  2. Generate background descriptions for each superclass (e.g., “outdoor scene with grass and trees”)
  3. Extract CLIP features for both: \(\mathbf{f}_{\text{super}}\) and \(\mathbf{f}_{\text{bg}}\)
  4. Subtract background features from superclass features to isolate discriminative semantics:
\[\tilde{\mathbf{f}}_{\text{ID}} = \mathbf{f}_{\text{super}} - \mathbf{f}_{\text{bg}}\]

This subtraction removes the domain-generic component, leaving a representation that captures what makes ID classes distinct rather than merely present. OOD scoring is then performed against \(\tilde{\mathbf{f}}_{\text{ID}}\) and the curated negative label set jointly.
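The subtraction in step 4 can be sketched directly; the matrices below are placeholders for CLIP text embeddings of LLM-generated superclass and background prompts.

```python
import numpy as np

def refine_id_features(super_feats, bg_feats):
    """Step 4 above: subtract background features from superclass
    features and renormalize, keeping the class-discriminative
    direction. Inputs are placeholders for CLIP text embeddings of
    LLM-generated prompts."""
    refined = super_feats - bg_feats
    return refined / np.linalg.norm(refined, axis=1, keepdims=True)

bg = np.tile(np.eye(8)[0], (2, 1))          # shared "background" direction
sig = np.eye(8)[[1, 2]]                     # class-discriminative components
refined = refine_id_features(bg + sig, bg)  # recovers the signal directions
```

In this idealized case the shared background component is removed exactly; with real CLIP features the subtraction is only approximate, which is why the method still scores jointly against the negative label set.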

Few-shot extension: the method can be extended with visual prompt tuning (VPT), which prepends learnable visual tokens to input images, and multi-modal prompt tuning when a small number of ID examples are available. This bridges the gap between truly training-free deployment and domain-specific adaptation.

Hierarchical Prompt Engineering for Specialized Domains

For domains like medical imaging, flat class names carry limited semantic content. “Chest X-ray” is ambiguous; “bilateral pleural effusion on posteroanterior chest X-ray” is not. (Ju et al., 2025) addresses this with hierarchical prompt structures that organize medical concepts across multiple abstraction levels, from coarse categories down to fine-grained findings.

The MCM score is generalized to include embeddings from all levels of the hierarchy \(\mathcal{H}\):

\[s_{\text{MCM}}(x) = \max_{c \in C \cup \mathcal{H}} \cos(f_v(x),\, \mathbf{e}_c)\]

Background descriptions at each level further ground the ID distribution. The result is evaluated across cross-modality OOD scenarios (X-ray vs. CT vs. MRI), which involve both semantic shift (different pathologies) and covariate shift (different acquisition protocols and scanner characteristics).
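The generalized score reduces to pooling the MCM maximum over every level's embeddings, as in this numpy sketch (the per-level matrices are toy stand-ins; the prompts themselves are assumed to be constructed elsewhere):

```python
import numpy as np

def hierarchical_mcm(img_emb, level_embs):
    """Generalized MCM sketch: best cosine match over embeddings pooled
    from every level of the prompt hierarchy. level_embs is a list of
    per-level embedding matrices (e.g., fine-grained classes plus
    coarser concept levels)."""
    v = img_emb / np.linalg.norm(img_emb)
    best = -1.0
    for E in level_embs:
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        best = max(best, float((E @ v).max()))
    return best

fine = np.eye(8)[:2]      # toy fine-grained class embeddings
coarse = np.eye(8)[4:5]   # toy coarser concept-level embedding
```

An input that matches only a coarse concept, which flat MCM would miss entirely, still registers a strong hierarchical match.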

Near-OOD vs. Far-OOD: The Central Tradeoff

One of the most practically important findings from recent benchmarking is the asymmetry in detection difficulty between semantically distant and semantically adjacent OOD samples (Miyai et al., 2025): Far-OOD inputs, drawn from classes remote from the ID set, are handled well even by simple scores, while Near-OOD inputs, drawn from classes adjacent to the ID set, remain much harder to separate.

Most methods in the literature optimize implicitly for Far-OOD. Near-OOD performance tends to collapse for simple baselines and requires the richer semantic structure of LLM-augmented or negative-label methods to maintain reasonable detection rates (Lee et al., 2025). Covariate shift (same semantics, different visual conditions) adds a third dimension that further complicates the picture (Noda et al., 2025).

Real-world deployments encounter all three types simultaneously, which is why single-metric evaluation on standard benchmarks can be deeply misleading.

Benchmarking and Evaluation

The Open-Set Recognition Benchmark

(Liu et al., 2026) proposes a comprehensive evaluation framework that systematically varies label granularity, semantic distance between ID and OOD classes, and fine-tuning objectives.

Real-World OOD Detection Benchmark

(Noda et al., 2025) introduces ImageNet-X, ImageNet-FS-X, and Wilds-FS-X, benchmarks specifically designed to evaluate VLM-based OOD detection under realistic conditions including both semantic and covariate shifts. The few-shot variants are particularly useful for evaluating prompt-tuning and lightweight adaptation methods. Evaluations reveal that methods that perform well under standard benchmark conditions often degrade substantially when covariate shift accompanies semantic shift.

Generalized OOD Detection Framework

(Miyai et al., 2025) provides the broadest taxonomic lens, arguing that in the VLM era the traditional boundaries between OOD detection, anomaly detection, novelty detection, and open-set recognition are dissolving. The most pressing challenges are consolidating around OOD detection (semantic shift) and anomaly detection (covariate shift), and the key axes of variation are the type of shift and the level of supervision available (zero-shot, few-shot, full fine-tuning). A complementary survey (Li et al., 2025) organizes CLIP-based methods specifically around the train-free vs. training-required axis and provides a useful structured view of the design space.

Open Challenges

Several important problems remain largely unsolved.

Sensitivity to prompt phrasing. Empirical analyses have found that VLM-based OOD detection scores are highly sensitive to how class names are phrased in prompts; minor variations in wording can shift AUROC by several percentage points. This makes reproducible benchmarking difficult and deployment unreliable.

Scalability. Storing and querying reference feature sets (visual patterns, negative label embeddings, and hierarchical concept embeddings) becomes expensive as the number of ID classes scales to thousands or millions. Methods that are tractable on 100-class benchmarks may not translate to real-world class vocabularies.

Calibration. OOD scores from most current methods are rankings, not calibrated probabilities. Deployment requires knowing not just whether a sample is OOD but how confident to be. Well-calibrated uncertainty over the OOD decision remains an open problem.
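One simple step from rankings toward probabilities, assuming a small held-out set of labeled ID/OOD scores is available (itself a nontrivial deployment assumption), is Platt scaling of the raw score:

```python
import numpy as np

def platt_scale(scores, labels, lr=0.1, steps=2000):
    """Fit P(OOD | score) = sigmoid(a*score + b) by gradient descent on
    log-loss over held-out scored samples (labels: 1 = OOD, 0 = ID).
    Turns a ranking score into a rough probability; whether the result
    is actually well calibrated still has to be checked."""
    s, y = np.asarray(scores, float), np.asarray(labels, float)
    a = b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        g = p - y                        # dLoss/dLogit for log-loss
        a -= lr * float((g * s).mean())
        b -= lr * float(g.mean())
    return lambda x: float(1.0 / (1.0 + np.exp(-(a * x + b))))

# Hypothetical held-out scores: ID samples score low, OOD samples high.
prob_ood = platt_scale([0.1, 0.2, 0.3, 0.7, 0.8, 0.9], [0, 0, 0, 1, 1, 1])
```

This only recalibrates a fixed scoring function under one data distribution; under covariate shift the mapping itself drifts, which is part of why calibration remains open.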

Large Vision-Language Models. GPT-4V, Gemini, and similar models introduce new complications: OOD inputs may interact with hallucination behavior in ways that are poorly understood, and the scale of these models makes lightweight adaptation methods far more attractive than full fine-tuning. Evaluating OOD detection in multi-task reasoning settings, where the model does more than classify, remains largely unexplored.

Practical Guidance

When selecting a method for deployment, a few principles apply consistently across the literature: start from a strong training-free baseline such as MCM before reaching for anything heavier; prefer methods that exploit both the visual and textual channels over single-modality scores; add negative labels or LLM-based semantic enrichment when Near-OOD performance matters; reserve few-shot adaptation for specialized domains where zero-shot prompts underperform; and evaluate under both semantic and covariate shift before trusting a single benchmark number.

Conclusion

OOD detection for VLMs has evolved rapidly from post-hoc adaptations of classical scoring functions to methods that genuinely exploit the bimodal structure of these models. The central message from the literature is that multimodality is a structural advantage: methods that use both visual and textual channels of CLIP-like models consistently outperform those that use only one. The integration of large language models for semantic enrichment, hierarchical prompt design for specialized domains, and principled negative label selection represent the current frontier.

The field is also maturing in its evaluation practices. Standardized benchmarks that jointly evaluate semantic and covariate shift, that vary label granularity, and that report performance at deployment-relevant operating thresholds are becoming the norm. This makes it easier to understand when methods generalize and when they don’t.

Significant open problems remain around calibration, sensitivity to prompt phrasing, and scaling to large class vocabularies. As VLMs become central components of production systems in medicine, autonomous driving, and content moderation, solving these problems transitions from academic interest to a practical necessity for safe AI deployment.

This post surveys methods from (Miyai et al., 2025; Li et al., 2025) and related work. For related reading across our OOD series: Introduction to OOD Detection for foundational concepts, Topology of OOD Examples for geometric analysis of OOD representations, GAIA: gradient-based attribution for OOD detection, GROOD: gradient-aware OOD scoring, and LoRA-based near-OOD detection.

References

  1. Miyai, A., Yang, J., Zhang, J., Ming, Y., Lin, Y., Yu, Q., Irie, G., Joty, S., Li, Y., Li, H., Liu, Z., Yamasaki, T., & Aizawa, K. (2025). Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey. Transactions on Machine Learning Research. https://arxiv.org/abs/2407.21794
  2. Li, C., Zhang, E., Geng, C., & Chen, S. (2025). Recent Advances in Out-of-Distribution Detection with CLIP-Like Models: A Survey. https://arxiv.org/abs/2505.02448
  3. Yang, J., Zhou, K., Li, Y., & Liu, Z. (2024). Generalized out-of-distribution detection: A survey. International Journal of Computer Vision, 132(12), 5635–5662.
  4. Miller, D., Sünderhauf, N., Kenna, A., & Mason, K. (2024). Open-Set Recognition in the Age of Vision-Language Models. European Conference on Computer Vision (ECCV). https://arxiv.org/abs/2403.16528
  5. Ming, Y., Cai, Z., Gu, J., Sun, Y., Li, W., & Li, Y. (2022). Delving into Out-of-Distribution Detection with Vision-Language Representations. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2211.13445
  6. Ding, Y., Zhu, K., & Lu, H. (2024). Vision-Language Dual-Pattern Matching for Out-of-Distribution Detection. European Conference on Computer Vision (ECCV). https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/11399.pdf
  7. Wang, H., Li, Y., Yao, H., & Li, X. (2023). CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No. IEEE/CVF International Conference on Computer Vision (ICCV). https://arxiv.org/abs/2308.12213
  8. Jiang, X., Liu, F., Fang, Z., Chen, H., Liu, T., Zheng, F., & Han, B. (2024). Negative Label Guided OOD Detection with Pretrained Vision-Language Models. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2403.20078
  9. Lee, P., Chen, J., & Wu, J. (2025). Harnessing Large Language and Vision-Language Models for Robust Out-of-Distribution Detection. https://arxiv.org/abs/2501.05228
  10. Ju, L., Zhou, S., Zhou, Y., Lu, H., Zhu, Z., Keane, P. A., & Ge, Z. (2025). Delving into Out-of-Distribution Detection with Medical Vision-Language Models. https://arxiv.org/abs/2503.01020
  11. Noda, S., Miyai, A., Yu, Q., Irie, G., & Aizawa, K. (2025). A Benchmark and Evaluation for Real-World Out-of-Distribution Detection Using Vision-Language Models. IEEE International Conference on Image Processing (ICIP). https://arxiv.org/abs/2501.18463
  12. Liu, Y., Yue, Z., Kuang, K., Zhang, F., & Zhang, H. (2026). Benchmarking Open-Set Recognition Beyond Vision-Language Pre-training. International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=mFTmKxA19G