

December 9, 2025

Quantifying uncertainty in LLMs with Semantic Density

A response-specific, semantics-based framework for measuring LLM confidence and improving trust in high-stakes AI decision-making


The rapid adoption of LLMs is reshaping industries, from automating workflows to enabling breakthroughs in healthcare, finance, and scientific exploration. Yet, as LLMs become integral to decision-making in high-stakes scenarios, their unpredictable tendencies — such as hallucinating facts or generating misinformation — have raised pressing concerns about their trustworthiness. This unpredictability is compounded by a critical gap: LLMs lack inherent confidence metrics to gauge the reliability of their outputs.

This blog provides an overview of our research paper on Semantic Density, a framework designed to address the pressing issue of uncertainty quantification in LLM responses. Set to be presented at NeurIPS 2024, Semantic Density introduces a response-specific confidence metric grounded in semantic analysis, enabling precise evaluation of trustworthiness without the need for additional training or model modification.

Limitations of existing uncertainty metrics

While LLMs have demonstrated remarkable adaptability and reasoning capabilities, current methods for assessing their reliability remain inadequate for real-world, high-stakes applications. Existing uncertainty quantification approaches often lack granularity, relying on prompt-wise metrics that ignore the variability across individual responses. This limits the ability to evaluate outputs with precision.

Additionally, reliance on lexical analysis rather than semantic relationships overlooks deeper contextual connections between words and phrases. This omission diminishes the ability to accurately assess the trustworthiness of outputs, particularly in nuanced or complex scenarios. High computational costs and the need for retraining further hinder scalability, making these methods impractical for dynamic or resource-constrained environments. Addressing these limitations is essential for ensuring LLMs can meet the demands of critical decision-making domains.

A framework for confidence metrics

Semantic Density overcomes these challenges by introducing a confidence metric that evaluates the reliability of LLM responses through a task-agnostic and scalable process. Unlike existing methods, it focuses on response-specific analysis, ensuring precise trustworthiness assessments across diverse applications. The framework operates in four key steps:

  1. Reference response sampling: Generate a diverse set of responses to a given prompt.
  2. Semantic relationship analysis: Map responses to a semantic space where distances reflect contextual relationships.
  3. Kernel function calculation: Use kernel functions to convert semantic similarities into an implicit feature space for density estimation.
  4. Semantic density calculation: Calculate response density in the semantic space, with higher density indicating greater reliability.

This approach addresses the inherent variability of LLM outputs, avoids the need for additional model fine-tuning, and leverages semantic analysis for robust uncertainty quantification. The resulting metric empowers practitioners to assess the trustworthiness of responses with precision, making it practical for both high-stakes and general applications.
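To make the four steps concrete, below is a minimal sketch of how such a score could be computed, assuming responses have already been sampled from the model. It uses a general-purpose sentence-embedding model and a Gaussian kernel as stand-ins for the paper's semantic-space construction and kernel design; the model name, bandwidth, and function are illustrative only, not the paper's exact formulation.

```python
# Illustrative sketch of a semantic-density-style confidence score.
# Assumptions (not from the paper): sentence-transformers embeddings stand in
# for the semantic space, and a Gaussian kernel is used for density estimation.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

def semantic_density(target: str, references: list[str], bandwidth: float = 0.5) -> float:
    """Score a target response by its kernel density among reference responses.

    Higher values mean the target lies in a denser semantic neighborhood,
    i.e. more of the sampled references express a similar meaning.
    """
    # Steps 1-2: map the target and the sampled reference responses into a semantic space.
    embeddings = encoder.encode([target] + references, normalize_embeddings=True)
    target_vec, ref_vecs = embeddings[0], embeddings[1:]

    # Step 3: apply a kernel function over semantic distances (Gaussian kernel here).
    sq_dists = np.sum((ref_vecs - target_vec) ** 2, axis=1)
    kernel_vals = np.exp(-sq_dists / (2 * bandwidth ** 2))

    # Step 4: semantic density = average kernel value over the references.
    return float(np.mean(kernel_vals))

# Hypothetical responses sampled for the prompt "What is the capital of Australia?"
refs = [
    "The capital of Australia is Canberra.",
    "Canberra is Australia's capital city.",
    "Australia's capital is Sydney.",
]
print(semantic_density("Canberra is the capital of Australia.", refs))
```

In the full method, the sampled responses' generation probabilities also weight the density estimate so that unlikely samples contribute less; the sketch omits that weighting for brevity.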


Evaluating Semantic Density

To validate its effectiveness, Semantic Density was rigorously tested on seven state-of-the-art LLMs — including Llama 3 and Mixtral-8x22B — using four prominent free-form question-answering benchmarks: CoQA, TriviaQA, SciQ, and Natural Questions. These experiments benchmarked Semantic Density against six existing uncertainty quantification methods, such as semantic entropy and predictive entropy, with compelling results.

Semantic Density consistently outperformed its counterparts across key metrics, including AUROC and AUPR, which measure a confidence metric's ability to differentiate between reliable and unreliable responses. For instance, it demonstrated greater precision in capturing subtle differences between responses, achieving higher AUROC scores across all datasets and models. This response-specific granularity addressed the limitations inherent in prompt-wise methods, offering a substantial improvement in uncertainty quantification.
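For readers unfamiliar with these metrics, the toy snippet below shows how they are computed in practice: given a confidence score for each generated response and a binary label marking whether that response was actually correct, AUROC and AUPR come directly from standard scikit-learn utilities. The scores and labels here are made up for illustration, not results from the paper.

```python
# Toy illustration of the evaluation metrics (scores and labels are hypothetical).
from sklearn.metrics import roc_auc_score, average_precision_score

# Confidence assigned to each generated response (e.g., its semantic density).
confidence = [0.91, 0.85, 0.40, 0.78, 0.22, 0.65]
# 1 if the corresponding response was judged correct, 0 otherwise.
correct = [1, 1, 0, 1, 0, 0]

print("AUROC:", roc_auc_score(correct, confidence))           # separability of correct vs. incorrect
print("AUPR:", average_precision_score(correct, confidence))  # area under the precision-recall curve
```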

Importantly, Semantic Density achieved these results without requiring any additional training or fine-tuning of the models. By relying on semantic relationships rather than lexical token analysis, it effectively quantified confidence even for complex free-form generation tasks.
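As a small illustration of why semantic rather than lexical comparison matters (a toy example, not the paper's procedure): two answers can share almost no tokens yet mean the same thing, which word-overlap measures miss but embedding-based similarity captures. The embedding model below is an arbitrary choice for demonstration.

```python
# Contrast between lexical overlap and semantic similarity for two paraphrases.
from sentence_transformers import SentenceTransformer, util

a = "The medication should be taken twice a day with food."
b = "Take this drug two times daily alongside meals."

# Lexical view: Jaccard overlap of the word sets (low for paraphrases).
tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
jaccard = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Semantic view: cosine similarity of sentence embeddings (high for paraphrases).
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([a, b], normalize_embeddings=True)
cosine = util.cos_sim(emb[0], emb[1]).item()

print(f"Jaccard overlap:   {jaccard:.2f}")
print(f"Cosine similarity: {cosine:.2f}")
```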

Future directions

Semantic Density not only addresses the immediate challenge of uncertainty quantification in LLMs but also opens pathways for broader adoption of AI in high-stakes environments. Its response-specific confidence metric ensures that decision-makers can rely on LLM outputs with greater clarity and trust, fostering the integration of these models in fields like medicine, finance, and public policy.

Looking ahead, there are opportunities to enhance Semantic Density further. Future work could refine reference response sampling strategies, optimize kernel functions, and calibrate token probabilities more effectively. Additionally, extending the framework to handle complex logical dependencies in long-paragraph generations would broaden its applicability, ensuring reliability across even more nuanced and intricate use cases.

As AI systems grow more embedded in critical decision-making processes, frameworks like Semantic Density will play a pivotal role in making these technologies both practical and trustworthy.

Read the full paper.



Xin Qiu

Principal Research Scientist


Xin is a research scientist who specializes in uncertainty quantification, evolutionary neural architecture search, and metacognition, with a PhD from the National University of Singapore.


