How Can LoRA Parameters Improve the Detection of Near-OOD Data?
Paper Review · We’ve all come to love Low-Rank Adaptation (LoRA) for making it practical to fine-tune massive Large Language Models (LLMs) on our own data. The standard practice is simple: you inject small, trainable matrices into the model, fine-tune only them, and then, for deployment, you merge these new weights back into the original model to avoid any inference latency. It’s a clean, efficient process. But what if we’re throwing away a powerful tool? A paper from an ICLR 2024 workshop, “Beyond Fine-Tuning: LoRA Modules Boost Near-OOD Detection and LLM Security” (Salimbeni et al., 2024), makes a compelling case that we should stop merging our LoRA weights. The authors show that by keeping the LoRA modules separate at inference time, we can extract a special “embedding” that is incredibly sensitive to Out-of-Distribution (OOD) data, especially the tricky “near-OOD” kind. Let’s dive into what they proposed and why it’s a significant idea for LLM reliability and security.
The Core Idea: LoRA Embeddings as a Signal of OODness
The central hypothesis of the paper is that the intermediate activations from the unmerged LoRA modules are a much better indicator of the input data’s distribution than the model’s final output or its last-layer activations.
Here’s how they define their two main embedding types:
- Last Layer Activation (\(E_{LLA}\)): This is the standard approach. You take the activations from the final layer of the transformer, average them across all tokens, and get a single embedding vector. This represents “what the model thinks” at the very end of its computation.
 - LoRA Embedding (\(E_{LoRA}\)): This is the new proposal. The authors attach LoRA modules (with rank \(r=16\)) to the query and value projections of all 32 layers of Llama2-7B. At inference, they collect the activations from the LoRA \(A\) matrix (\(A^j x\)) in every layer \(j\). They then concatenate all these activations and average them across all tokens to get a final vector.
 
Their bet is that this \(E_{LoRA}\) vector, which captures the fine-tuning task’s “adaptations” across the entire model depth, is far more sensitive to inputs that deviate from the fine-tuning data.
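To make this concrete, here is a minimal PyTorch sketch (not the authors’ code) of how one might collect both embeddings with forward hooks. It assumes a Hugging Face-style causal LM with unmerged LoRA adapters whose down-projection submodules have “lora_A” in their name (PEFT-style naming); the module names, masking, and averaging details are my assumptions.

```python
import torch

@torch.no_grad()
def extract_embeddings(model, input_ids, attention_mask):
    """Collect E_LLA and E_LoRA for one batch (sketch, PEFT-style naming assumed)."""
    lora_acts = []   # one tensor of shape (batch, seq, r) per hooked LoRA A module
    handles = []

    def make_hook(store):
        def hook(module, inputs, output):
            store.append(output.detach())   # A^j x for the hooked layer j
        return hook

    # Register a forward hook on every LoRA A projection.
    for name, module in model.named_modules():
        if "lora_A" in name and isinstance(module, torch.nn.Linear):
            handles.append(module.register_forward_hook(make_hook(lora_acts)))

    out = model(input_ids=input_ids, attention_mask=attention_mask,
                output_hidden_states=True)

    for h in handles:
        h.remove()

    mask = attention_mask.unsqueeze(-1)                  # (batch, seq, 1)

    # E_LLA: last-layer hidden states, averaged over non-padding tokens.
    last = out.hidden_states[-1]                         # (batch, seq, d)
    e_lla = (last * mask).sum(1) / mask.sum(1)

    # E_LoRA: concatenate the A^j x activations of all hooked layers along the
    # feature axis, then average over tokens -> (batch, n_modules * r).
    lora_cat = torch.cat(lora_acts, dim=-1)
    e_lora = (lora_cat * mask).sum(1) / mask.sum(1)

    return e_lla, e_lora
```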
Key Findings and Contributions
The authors used a Llama2-7B model fine-tuned on the MedMCQA dataset (in-distribution medical questions). They then tested its ability to detect OOD inputs from the MMLU dataset, which they split into two types:
- Near-OOD: Other medical topics not in MedMCQA (e.g., “anatomy”, “college biology”).
 - Far-OOD: Non-medical topics (e.g., “professional law”, “computer science”).
 
1. LoRA Embeddings Master Near-OOD Detection
This is the main event. Detecting far-OOD data is relatively easy, but near-OOD data (which is semantically similar) is much harder and more dangerous.
The authors used a simple Mahalanobis Distance (MD) detector (Lee et al., 2018). This method calculates the “distance” of a new embedding from the center of the training data’s embedding distribution.
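As a rough illustration, the detector needs nothing more than the mean and covariance of the in-distribution embeddings. The numpy sketch below (not the paper’s code; the ridge term is my own numerical-stability assumption) fits those statistics and scores new embeddings by their squared Mahalanobis distance.

```python
import numpy as np

class MahalanobisDetector:
    """Fit on in-distribution embeddings, score new ones by Mahalanobis distance."""

    def fit(self, embeddings: np.ndarray, eps: float = 1e-6):
        # embeddings: (n_samples, dim), e.g. E_LoRA vectors from the fine-tuning set.
        self.mean = embeddings.mean(axis=0)
        cov = np.cov(embeddings, rowvar=False)
        # Small ridge before inversion, useful when dim is large vs. n_samples.
        self.prec = np.linalg.inv(cov + eps * np.eye(cov.shape[0]))
        return self

    def score(self, embeddings: np.ndarray) -> np.ndarray:
        # Squared Mahalanobis distance; higher = farther from the training distribution.
        diff = embeddings - self.mean
        return np.einsum("nd,de,ne->n", diff, self.prec, diff)
```

Thresholding this score (or converting it to a p-value, as in the runtime monitor below) gives the OOD decision.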
The results in Table 1 are striking:
- Perplexity: A poor OOD detector, failing to distinguish even far-OOD data well.
 - MD on Last Layer (\(E_{LLA}\)): This worked perfectly for far-OOD data but failed completely on near-OOD data.
 - MD on LoRA (\(E_{LoRA}\)): This is the winner. It performed just as well on far-OOD data but also achieved high AUROC scores on the difficult near-OOD datasets.
 
This shows that \(E_{LoRA}\) provides a much better-separated distribution for in- vs. out-of-distribution data, as visualized in the paper’s figures. Best of all, this simple MD approach doesn’t require any extra hyperparameter tuning or background datasets, unlike other methods.
2. A Powerful Tool for Model Versioning
If \(E_{LoRA}\) is so sensitive to the data distribution, is it also sensitive to the model distribution? Yes.
This provides a solution to a key security problem: how do you know if an LLM service you are using has been (perhaps maliciously) updated or changed?
The authors fine-tuned their model, saving checkpoints at different steps. They defined “version 0” as the model at 500 steps. They then checked if they could detect a “model update” (e.g., the model at 600, 750, or 1000 steps).
As the paper’s figure shows, the LoRA embeddings were dramatically more sensitive to these subtle model changes. \(E_{LoRA}\) could detect a change after just 100 additional steps with >0.8 AUROC, while \(E_{LLA}\) needed 1000 additional steps to reach a similar sensitivity.
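A hedged sketch of this versioning check, reusing the detector class above: fit on embeddings from “version 0”, score embeddings produced by the candidate model on the same prompts, and measure how separable the two score distributions are with scikit-learn’s `roc_auc_score`. The function below is my own illustration, not the paper’s protocol.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def version_change_auroc(ref_embeddings: np.ndarray,
                         candidate_embeddings: np.ndarray) -> float:
    """AUROC of separating 'candidate model' embeddings from 'version 0' ones.

    Both arrays hold E_LoRA vectors computed on the same reference prompts,
    one per model version; ~0.5 means indistinguishable, ~1.0 means the
    update is clearly detectable. (Ideally, fit on one split of version-0
    prompts and score a held-out split to avoid optimistic scores.)
    """
    detector = MahalanobisDetector().fit(ref_embeddings)
    scores = np.concatenate([detector.score(ref_embeddings),
                             detector.score(candidate_embeddings)])
    labels = np.concatenate([np.zeros(len(ref_embeddings)),
                             np.ones(len(candidate_embeddings))])
    return roc_auc_score(labels, scores)
```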
3. Better Runtime Monitoring of Wrong Answers
Finally, the authors combined these ideas to build a better runtime prediction monitor. The goal here isn’t just to detect OOD inputs, but to detect when the model is about to give a wrong answer.
They propose a metric that combines two types of uncertainty:
- Epistemic Uncertainty (what the model doesn’t know): Captured by the Mahalanobis Distance \(MD(E_{LoRA})\).
 - Aleatoric Uncertainty (noise/randomness in the data): Captured by the Shannon entropy \(H(x)\) of the model’s final output probabilities.
 
Their final score is a simple sum: \(H(x) + p_{MD}(E_{LoRA})\), where the MD score is converted to a p-value to be on the same scale.
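A minimal sketch of this combined monitor, assuming the MD score is normalized via an empirical CDF against held-out in-distribution scores so that both terms grow with uncertainty and live on comparable scales (one plausible choice; the paper’s exact conversion may differ), and that `probs` are the model’s probabilities over the answer options:

```python
import numpy as np

def shannon_entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy of the model's output distribution, per sample.

    probs: (n_samples, n_classes), e.g. softmax over the answer options.
    """
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def combined_monitor_score(probs: np.ndarray,
                           md_scores: np.ndarray,
                           calibration_md_scores: np.ndarray) -> np.ndarray:
    """H(x) + p_MD(E_LoRA): higher values flag answers that are more likely wrong.

    p_MD here is an empirical-CDF normalization of the Mahalanobis score
    against held-out in-distribution scores (an assumption about the
    paper's exact p-value conversion).
    """
    cal = np.sort(calibration_md_scores)
    p_md = np.searchsorted(cal, md_scores) / len(cal)   # in [0, 1]
    return shannon_entropy(probs) + p_md
```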
As shown in Table 2, this combined metric was the best predictor of incorrect answers across all datasets (in-distribution, near-OOD, and far-OOD), beating both entropy and MD alone.
Conclusion: Don’t Merge Your LoRAs
This paper offers a simple but powerful takeaway: keep your LoRA weights separate at deployment.
By paying the (very small) cost of not merging the weights, you gain a new, powerful signal. These “LoRA embeddings” provide a free, high-performance OOD detector that works for tricky near-OOD cases, a sensitive alarm for model version changes, and a key component for identifying incorrect predictions at runtime.
The main limitation is that this requires the LLM provider to actually serve these LoRA embeddings from their API endpoint, which isn’t standard practice… yet. But for those of us hosting our own models, this is an easy and effective way to boost the reliability and security of our fine-tuned LLMs.
References
- Salimbeni, E., Craighero, F., Khasanova, R., Vasic, M., & Vandergheynst, P. (2024). Beyond fine-tuning: LoRA modules boost near-OOD detection and LLM security. ICLR 2024 Workshop on Secure and Trustworthy Large Language Models.
 - Lee, K., Lee, K., Lee, H., & Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in Neural Information Processing Systems (NeurIPS 2018).