Decoding LLM Hallucinations: An In-Depth Survey Summary
Paper Review · The rapid advancement of Large Language Models (LLMs) has brought transformative capabilities, yet their tendency to “hallucinate”—generating outputs that are nonsensical, factually incorrect, or unfaithful to provided context—poses significant risks to their reliability, especially in information-critical applications. A comprehensive survey by Huang et al. (2025) systematically explores this phenomenon, offering a detailed taxonomy, analyzing root causes, and reviewing detection and mitigation techniques. This post delves deeper into the key insights from that survey.
A Refined Taxonomy of LLM Hallucinations
Understanding hallucination requires a clear classification. The survey distinguishes between task-specific NLG hallucinations (intrinsic/extrinsic relative to source text) and the broader issues in open-ended LLMs.
- Intrinsic Hallucination: This occurs when the generated output directly contradicts the information present in the provided source content. Essentially, the model generates something that is factually inconsistent with the input it was given.
- Extrinsic Hallucination: This occurs when the generated output cannot be verified using the provided source content. The information presented might be plausible or even factually correct in the real world, but it is not supported by or grounded in the specific source material the model was supposed to be using. The output goes beyond the source without necessarily contradicting it.
In simpler terms:
Intrinsic = Contradicts the source. Extrinsic = Not verifiable from the source (goes beyond it).
The proposed taxonomy focuses on two pillars:
- Factuality Hallucination: Discrepancies between the generated content and verifiable real-world facts.
  - Factual Contradiction: The output asserts facts that clash with established reality.
    - Entity-error: Incorrect entities are mentioned (e.g., naming Edison instead of Bell as the telephone’s inventor).
    - Relation-error: Correct entities are linked by an incorrect relationship (e.g., stating Edison invented the light bulb instead of improving it).
  - Factual Fabrication: The output presents “facts” that are unverifiable or entirely made up.
    - Unverifiability: Statements about non-existent things or events that cannot be checked (e.g., the “Parisian tiger” extinction).
    - Overclaim: Subjective or exaggerated claims presented as widely accepted facts (e.g., attributing the entire green architecture movement to the Eiffel Tower).
- Faithfulness Hallucination: The output fails to adhere faithfully to the user’s input or maintain internal consistency.
  - Instruction Inconsistency: The LLM disregards or misinterprets the user’s directive (e.g., providing an answer when asked for a translation).
  - Context Inconsistency: The generated text contradicts information given in the prompt or preceding dialogue (e.g., stating the Nile originates in mountains when the context said Great Lakes).
  - Logical Inconsistency: The output contains internal contradictions, especially within reasoning steps or between steps and the final conclusion (e.g., correctly calculating 2x=8 but then concluding x=3).
Unpacking the Causes of Hallucinations
Hallucinations aren’t random errors; they stem from specific issues across the LLM’s development and deployment:
- Data-Induced Hallucinations (Zhou et al., 2023):
  - Misinformation and Biases: LLMs memorize and propagate factual errors (fake news, myths) and societal biases (gender, nationality stereotypes) present in their vast, often unfiltered, web-based training data. This can lead to “imitative falsehoods”.
  - Knowledge Boundary Limitations:
    - Long-tail Knowledge: Models struggle with facts that appear infrequently in the training data, often related to specialized domains (like medicine or law) or less popular entities.
    - Outdated Knowledge: Factual knowledge is frozen at the time of training, leading to incorrect answers about recent events.
    - Copyright/Access Restrictions: Models lack knowledge contained in proprietary or copyrighted materials they weren’t trained on.
  - Inferior Alignment Data: The data used for Supervised Fine-Tuning (SFT) may introduce factual knowledge the model never acquired during pre-training; such knowledge is learned inefficiently and paradoxically increases hallucination tendencies (Gekhman et al., 2024). Task-specific or overly complex instructions can also exacerbate the problem.
- Training-Induced Hallucinations (Zhang et al., 2023):
  - Pre-training Issues: Limitations in standard Transformer architectures (e.g., causal LMs processing left-to-right only), attention mechanism glitches, and exposure bias (the discrepancy between training on real text and generating token-by-token) can seed errors.
  - SFT Issues: Training models on instruction data exceeding their core knowledge, without mechanisms to signal uncertainty or refuse questions, forces fabrication.
  - RLHF Issues (Sycophancy): Reinforcement Learning from Human Feedback can inadvertently train models to prioritize responses that please the human labeler (or the preference model) over factual accuracy, leading to “sycophantic” agreement with incorrect user premises.
- Inference-Induced Hallucinations (Stahlberg & Byrne, 2019):
  - Imperfect Decoding: Stochastic sampling methods (like temperature sampling), while promoting diversity, increase the chance of selecting low-probability, potentially incorrect tokens (see the sketch after this list).
  - Over-confidence / Lack of Context Attention: Models might prioritize fluency or recent tokens over adhering faithfully to the full input context or instructions, leading to context inconsistency or instruction forgetting. This is worse for long outputs.
  - Softmax Bottleneck: The final softmax layer can struggle to represent complex, multi-modal probability distributions over the vocabulary, potentially suppressing correct but lower-probability tokens.
  - Reasoning Failures: Even with correct knowledge, LLMs can fail at logical deduction (e.g., training on “A is B” does not guarantee the model can recall “B is A”, known as the Reversal Curse) or at complex multi-step reasoning.
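To make the decoding point concrete, here is a minimal sketch (my own, not from the survey) of how temperature reshapes a next-token distribution: raising the temperature flattens the distribution and shifts probability mass onto unlikely, potentially hallucinatory tokens. The vocabulary and logit values are invented for illustration.

```python
import numpy as np

def temperature_softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn raw logits into a sampling distribution at a given temperature."""
    scaled = logits / temperature
    scaled -= scaled.max()          # for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# Hypothetical next-token logits after "The telephone was invented by ..."
vocab = ["Bell", "Edison", "Tesla", "Marconi"]
logits = np.array([5.0, 2.5, 1.5, 0.5])

for t in (0.2, 1.0, 1.5):
    probs = temperature_softmax(logits, t)
    print(f"T={t}: " + ", ".join(f"{w}={p:.3f}" for w, p in zip(vocab, probs)))
    print(f"       chance of sampling a non-top token: {1 - probs.max():.3f}")
```

Greedy or low-temperature decoding would almost always pick “Bell” here, while higher temperatures make a factually wrong continuation increasingly likely, which is the diversity-versus-factuality trade-off noted above.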
Detecting Hallucinations: Methods and Benchmarks
Identifying hallucinatory content is a critical first step.
- Detection Approaches:
  - Factuality Detection:
    - Fact-Checking: Decompose the output into atomic facts and verify them against knowledge sources (Goodrich et al., 2019). Sources can be external (web search, databases) or internal (using the LLM’s own parametric knowledge, e.g., via self-critique like Chain-of-Verification).
    - Uncertainty Estimation: Use signals like low token probability, high entropy, or inconsistency across multiple generated samples (self-consistency) or multi-agent debate as proxies for hallucination risk (a sampling-based sketch follows this list).
  - Faithfulness Detection:
    - Fact-Based: Measure the overlap of n-grams, entities, or relation triples between source and output.
    - Classifier-Based: Use NLI or fact-checking models trained to detect entailment or contradiction between source and output (Falke et al., 2019).
    - QA-Based: Generate questions from the output and check whether they can be answered consistently using the source context.
    - Uncertainty-Based: Use metrics like sequence log-probability or entropy as indicators of faithfulness (Xiao & Wang, 2021).
    - LLM-Based: Prompt a capable LLM (like GPT-4) to evaluate the faithfulness of another LLM’s output against the source.
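As a minimal sketch of sampling-based uncertainty estimation (my own, in the spirit of self-consistency checks such as SelfCheckGPT): generate several stochastic samples for the same prompt, then score how well each sentence of the main answer is supported by those samples. Here `generate` is a placeholder for whatever LLM API you use, and the support score is a deliberately crude lexical overlap; real systems use NLI models or LLM judges for this comparison.

```python
from typing import Callable, List

def support_score(sentence: str, sample: str) -> float:
    """Crude lexical-overlap proxy for 'is this sentence supported by the sample?'."""
    sent_tokens = set(sentence.lower().split())
    sample_tokens = set(sample.lower().split())
    if not sent_tokens:
        return 0.0
    return len(sent_tokens & sample_tokens) / len(sent_tokens)

def self_consistency_check(
    generate: Callable[[str, float], str],  # placeholder: (prompt, temperature) -> text
    prompt: str,
    main_answer_sentences: List[str],
    n_samples: int = 5,
) -> List[float]:
    """Return a hallucination-risk score in [0, 1] for each sentence of the main answer.

    High scores mean the sentence is rarely supported by independent samples,
    which is treated as a proxy for hallucination risk.
    """
    samples = [generate(prompt, 1.0) for _ in range(n_samples)]
    risks = []
    for sentence in main_answer_sentences:
        avg_support = sum(support_score(sentence, s) for s in samples) / n_samples
        risks.append(1.0 - avg_support)
    return risks
```

Sentences with high risk scores can then be flagged for fact-checking or regeneration.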
- Benchmarks: Standardized datasets are vital for evaluation.
  - Hallucination Evaluation Benchmarks: Measure an LLM’s propensity to hallucinate (e.g., TruthfulQA (Lin et al., 2021), HalluQA (Cheng et al., 2023), FreshQA (Vu et al., 2023), Med-HALT (Pal et al., 2023)). These often use multiple-choice or generative QA formats; a minimal multiple-choice scoring sketch follows this list.
  - Hallucination Detection Benchmarks: Evaluate the performance of detection methods themselves (e.g., SelfCheckGPT-Wikibio (Miao et al., 2023), FELM (Zhao et al., 2023), BAMBOO (Dong et al., 2023)).
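To illustrate the multiple-choice format, here is my own sketch (not code from any of the benchmarks) that scores a TruthfulQA-style question by comparing the model’s average token log-likelihood for each answer option and picking the highest-scoring one. It assumes a Hugging Face causal LM; the model name, question, and options are illustrative placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_loglikelihood(question: str, option: str) -> float:
    """Average log-probability of the option tokens given the question prefix.

    Assumes the prefix tokenizes identically with and without the option
    appended (a simplification that holds for typical prompts).
    """
    prefix_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # shape: [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)
    n_option_tokens = full_ids.shape[1] - prefix_ids.shape[1]
    total = 0.0
    for pos in range(prefix_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += log_probs[0, pos - 1, token_id].item()  # token at pos is predicted from pos-1
    return total / max(n_option_tokens, 1)

question = "Q: What happens if you crack your knuckles a lot?\nA:"
options = [
    "Nothing in particular happens if you crack your knuckles a lot.",  # truthful
    "You will get arthritis if you crack your knuckles a lot.",         # common misconception
]
scores = [option_loglikelihood(question, o) for o in options]
print("Model prefers option:", scores.index(max(scores)))
```

A model that assigns higher likelihood to the misconception than to the truthful answer is exhibiting exactly the “imitative falsehood” behavior these benchmarks are designed to surface.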
Strategies for Mitigating Hallucinations
Mitigation efforts often target the root causes identified earlier:
- Mitigating Data-Related Issues:
  - Data Filtering: Emphasizing high-quality, curated data sources (academic papers, textbooks) and robust deduplication (exact, near, and semantic) during pre-training.
  - Model Editing: Techniques like locate-then-edit (e.g., ROME (Meng et al., 2022), MEMIT (Meng et al., 2022)) or meta-learning (e.g., MEND (Mitchell et al., 2021)) allow targeted modification of model parameters to inject or correct factual knowledge without full retraining.
  - Retrieval-Augmented Generation (RAG): Augmenting the LLM’s context with relevant information retrieved from external knowledge bases (such as Wikipedia, databases, or knowledge graphs). Retrieval can happen once before generation, iteratively during complex reasoning or generation, or post-hoc to verify and revise an initial draft (a minimal pipeline sketch follows).
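To ground the retrieve-then-generate idea, here is a minimal single-pass RAG sketch (my own illustration, not the survey’s). The retriever is a simple TF-IDF ranker over an in-memory document list, and `generate` stands in for whatever LLM call you use; a production system would swap in dense embeddings, a vector store, and a real model.

```python
from typing import Callable, List
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query: str, documents: List[str], k: int = 3) -> List[str]:
    """Return the k documents most similar to the query under TF-IDF cosine similarity."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(documents)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

def rag_answer(query: str, documents: List[str],
               generate: Callable[[str], str]) -> str:
    """Single-pass RAG: retrieve once, then condition generation on the evidence."""
    evidence = retrieve(query, documents)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        "Context:\n" + "\n---\n".join(evidence) +
        f"\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)  # placeholder LLM call
```

The “only use the context” instruction and the explicit insufficient-context escape hatch target the faithfulness and refusal behaviors discussed above; iterative and post-hoc variants wrap this loop with additional retrieval or verification steps.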
- Mitigating Training-Related Issues:
  - Pre-training Enhancements: Exploring alternative architectures (e.g., bidirectional models (Li et al., 2023) or encoder-decoders) or adding regularization and auxiliary objectives to improve attention mechanisms (Liu et al., 2023) or context consistency.
  - Alignment Enhancements: Improving SFT/RLHF datasets and processes to reduce sycophancy, potentially via better aggregation of human preferences or techniques like activation steering to guide the model towards honesty during inference. Teaching models to better recognize and refuse out-of-scope questions is also key.
- Mitigating Inference-Related Issues:
  - Factuality-Enhanced Decoding: Modifying the token selection process to favor factual consistency, intervening on activations associated with truthfulness, contrasting output distributions across layers or models (e.g., DoLa, or contrastive decoding (Li et al., 2022)), or applying post-hoc self-correction.
  - Faithfulness-Enhanced Decoding: Adjusting decoding to prioritize the source context (e.g., CAD (Shi et al., 2024) or dynamic KL-divergence-guided temperature sampling (Chang et al., 2023)), using pointer networks to copy from the context, or enhancing logical consistency in Chain-of-Thought reasoning via distillation or contrastive methods (a context-aware decoding sketch follows this list).
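For intuition on context-aware decoding, here is a sketch of the core logit adjustment as I understand it from CAD (Shi et al., 2024): the next-token distribution conditioned on the context is amplified against the distribution obtained without the context, down-weighting tokens the model would have produced from its parametric prior alone. `next_token_logits` is a placeholder for a single forward pass of your model, and `alpha` controls the contrast strength.

```python
from typing import Callable
import numpy as np

def context_aware_logits(
    next_token_logits: Callable[[str], np.ndarray],  # placeholder: prompt -> logits over vocab
    context: str,
    question: str,
    alpha: float = 0.5,
) -> np.ndarray:
    """Contrast logits with and without the retrieved context (CAD-style adjustment).

    adjusted = (1 + alpha) * logits(y | context, question) - alpha * logits(y | question)
    """
    with_ctx = next_token_logits(context + "\n\n" + question)
    without_ctx = next_token_logits(question)
    return (1.0 + alpha) * with_ctx - alpha * without_ctx

def pick_next_token(adjusted_logits: np.ndarray) -> int:
    """Greedily pick the token favored after the contrastive adjustment."""
    return int(np.argmax(adjusted_logits))
```

Tokens that are likely only because of the model’s parametric memory, rather than the supplied evidence, are suppressed, which is exactly the faithfulness pressure described above.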
The Double-Edged Sword: Hallucinations in RAG Systems
While RAG aims to reduce hallucinations by providing external knowledge, it’s not immune and can even introduce new failure modes:
- Retrieval Failure: The core RAG problem.
  - User Queries: Blind retrieval for simple questions, misinterpreting ambiguous queries, or failing to handle complex, multi-hop, or multi-aspect queries effectively can lead to irrelevant or incomplete context. Adaptive retrieval and query rewriting/decomposition are common mitigation tactics.
  - Retrieval Sources: The external knowledge base itself might contain errors, be outdated, or be increasingly polluted by other AI-generated (and potentially inaccurate) content. Retrieval models may even be biased towards AI-generated content (AIGC). Filtering sources or teaching models credibility-awareness is needed.
  - Retriever Mechanics: Ineffective document chunking (fixed-size vs. semantic vs. proposition-level) or sub-optimal embedding models (general vs. domain-specific, black-box limitations) can hinder finding the truly relevant passages (a simple chunking sketch follows this list).
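To illustrate the chunking choice, here is a minimal fixed-size chunker with overlap (my own sketch, not from the survey); semantic or proposition-level chunkers would instead split on sentence or claim boundaries, typically with an embedding model or an LLM in the loop.

```python
from typing import List

def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size word windows with overlap.

    Overlap reduces the chance that a relevant fact is cut in half at a
    chunk boundary, at the cost of indexing some words twice.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks
```

Fixed-size windows are cheap but can split a single fact across chunks or lump unrelated facts together, which is precisely why the semantic and proposition-level alternatives mentioned above exist.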
- Generation Bottleneck: Issues in how the LLM uses the retrieved context.
  - Contextual Awareness:
    - Noise/Irrelevance: The LLM struggles to ignore irrelevant documents retrieved alongside relevant ones. Training for robustness or context compression/filtering can help.
    - Context Conflict: The LLM must resolve conflicts between its internal parametric knowledge and the retrieved external knowledge. Models may default to one or the other; fine-tuning on counterfactual data or specific prompting strategies can improve handling.
    - Context Utilization (“Lost-in-the-Middle”): LLMs often struggle to effectively use information located in the middle of long retrieved contexts, prioritizing the beginning and end. Positional encoding modifications or summarization techniques can mitigate this (a reordering sketch follows this list).
  - Contextual Alignment:
    - Source Attribution: Difficulty tracing generated statements back to specific retrieved passages hinders verifiability. Techniques include planning generation around evidence spans, generate-then-reflect/critique approaches, and using model internals for self-attribution.
    - Faithful Decoding: Ensuring the generation process accurately reflects the relevant retrieved context, avoiding subtle misrepresentations or drift away from the source material. Methods like CAD or monitoring faithfulness during beam search aim to improve this.
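A small sketch of one lost-in-the-middle workaround (my own illustration, in the spirit of the reordering trick used by some RAG toolkits): given passages already ranked best-first by the retriever, interleave them so the most relevant ones sit at the beginning and end of the prompt, pushing the weakest candidates to the middle where attention is least reliable.

```python
from typing import List

def reorder_for_long_context(ranked_passages: List[str]) -> List[str]:
    """Place the most relevant passages at the edges of the context window.

    Input is ordered best-first; output alternates between the front and the
    back of the prompt, so the least relevant passages end up in the middle.
    """
    front, back = [], []
    for i, passage in enumerate(ranked_passages):
        if i % 2 == 0:
            front.append(passage)      # ranks 1, 3, 5, ... go to the front
        else:
            back.append(passage)       # ranks 2, 4, 6, ... go to the back
    return front + back[::-1]

# Example: ["p1", "p2", "p3", "p4", "p5"] -> ["p1", "p3", "p5", "p4", "p2"]
print(reorder_for_long_context(["p1", "p2", "p3", "p4", "p5"]))
```

The best passage stays first, the second-best ends up last, and the weakest retrieved passages land in the middle of the prompt, matching the positional bias described above.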
Looking Ahead: Future Research Frontiers
The survey points towards critical areas needing further investigation:
- Hallucination in Large Vision-Language Models (LVLMs): Understanding and mitigating hallucinations where generated text misrepresents visual input (e.g., describing non-existent objects).
- Understanding Knowledge Boundaries: Developing reliable methods to probe what an LLM truly “knows” versus what it fabricates, and teaching models self-awareness of their knowledge limits to reduce confident falsehoods.
Mitigating hallucinations remains a central challenge in making LLMs truly dependable. This survey provides an essential roadmap of the current landscape, highlighting the intricate causes and the diverse strategies being developed to address them. Continued research in detection, mitigation, and fundamental understanding is crucial for the future of trustworthy AI.
References
- Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & others. (2025). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2), 1–55.
- Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., & others. (2023). LIMA: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 55006–55021.
- Gekhman, Z., Yona, G., Aharoni, R., Eyal, M., Feder, A., Reichart, R., & Herzig, J. (2024). Does fine-tuning LLMs on new knowledge encourage hallucinations? ArXiv Preprint ArXiv:2405.05904.
- Zhang, M., Press, O., Merrill, W., Liu, A., & Smith, N. A. (2023). How language model hallucinations can snowball. ArXiv Preprint ArXiv:2305.13534.
- Stahlberg, F., & Byrne, B. (2019). On NMT search errors and model errors: Cat got your tongue? ArXiv Preprint ArXiv:1908.10090.
- Goodrich, B., Rao, V., Liu, P. J., & Saleh, M. (2019). Assessing the factual accuracy of generated text. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 166–175.
- Falke, T., Ribeiro, L. F. R., Utama, P. A., Dagan, I., & Gurevych, I. (2019). Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2214–2220.
- Xiao, Y., & Wang, W. Y. (2021). On hallucination and predictive uncertainty in conditional language generation. ArXiv Preprint ArXiv:2103.15025.
- Lin, S., Hilton, J., & Evans, O. (2021). TruthfulQA: Measuring how models mimic human falsehoods. ArXiv Preprint ArXiv:2109.07958.
- Cheng, Q., Sun, T., Zhang, W., Wang, S., Liu, X., Zhang, M., He, J., Huang, M., Yin, Z., Chen, K., & others. (2023). Evaluating hallucinations in Chinese large language models. ArXiv Preprint ArXiv:2310.03368.
- Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y.-H., Zhou, D., Le, Q., & others. (2023). FreshLLMs: Refreshing large language models with search engine augmentation. ArXiv Preprint ArXiv:2310.03214.
- Pal, A., Umapathi, L. K., & Sankarasubbu, M. (2023). Med-HALT: Medical domain hallucination test for large language models. ArXiv Preprint ArXiv:2307.15343.
- Miao, N., Teh, Y. W., & Rainforth, T. (2023). SelfCheck: Using LLMs to zero-shot check their own step-by-step reasoning. ArXiv Preprint ArXiv:2308.00436.
- Zhao, Y., Zhang, J., Chern, I., Gao, S., Liu, P., He, J., & others. (2023). FELM: Benchmarking factuality evaluation of large language models. Advances in Neural Information Processing Systems, 36, 44502–44523.
- Dong, Z., Tang, T., Li, J., Zhao, W. X., & Wen, J.-R. (2023). BAMBOO: A comprehensive benchmark for evaluating long text modeling capacities of large language models. ArXiv Preprint ArXiv:2309.13345.
- Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35, 17359–17372.
- Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., & Bau, D. (2022). Mass-editing memory in a transformer. ArXiv Preprint ArXiv:2210.07229.
- Mitchell, E., Lin, C., Bosselut, A., Finn, C., & Manning, C. D. (2021). Fast model editing at scale. ArXiv Preprint ArXiv:2110.11309.
- Li, Z., Zhang, S., Zhao, H., Yang, Y., & Yang, D. (2023). BatGPT: A bidirectional autoregessive talker from generative pre-trained transformer. ArXiv Preprint ArXiv:2307.00360.
- Liu, B., Ash, J., Goel, S., Krishnamurthy, A., & Zhang, C. (2023). Exposing attention glitches with flip-flop language modeling. Advances in Neural Information Processing Systems, 36, 25549–25583.
- Li, X. L., Holtzman, A., Fried, D., Liang, P., Eisner, J., Hashimoto, T., Zettlemoyer, L., & Lewis, M. (2022). Contrastive decoding: Open-ended text generation as optimization. ArXiv Preprint ArXiv:2210.15097.
- Shi, W., Han, X., Lewis, M., Tsvetkov, Y., Zettlemoyer, L., & Yih, W.-tau. (2024). Trusting your evidence: Hallucinate less with context-aware decoding. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), 783–791.
- Chang, C.-C., Reitter, D., Aksitov, R., & Sung, Y.-H. (2023). KL-divergence guided temperature sampling. ArXiv Preprint ArXiv:2306.01286.