Concept-Level Cross-Model Activation Fusion for Hallucination Reduction
1. Motivation
A language model hallucinates when its internal representation of a concept diverges from the true distribution of that concept in the world. The model's representation of "apple" is a vector learned from a specific training corpus, shaped by a specific architecture, compressed by a specific optimization trajectory. It is one sample from the space of possible representations a neural network could learn for "apple." When that sample happens to be skewed — because the training data overrepresented some facts and omitted others, or because the optimization found a local minimum that conflates related concepts — the model generates false statements about apples with high confidence.
Ensemble methods address this at the output level. Generate answers from multiple models, vote on the result. This works for discrete answers (multiple choice, yes/no, named entities) where voting is well-defined. It fails for free-form generation, which is where most hallucinations actually cause harm. You cannot majority-vote on a paragraph.
We propose moving the aggregation from outputs to internal representations. Given a prompt, extract the concept-relevant activations from N models, align them into a shared feature space, and fuse them into a single consensus representation that drives generation. The correction happens before the model produces text, not after.
The theoretical basis is the same as the wisdom of crowds. If each model's concept representation is an independent noisy estimate of the true concept, averaging across models reduces variance while preserving the signal. The hallucination rate of the fused representation should decrease as N increases, provided the models' errors are sufficiently uncorrelated.
2. Formalization
2.1 Concept Representations as Noisy Estimates
Let φ*_c ∈ R^d denote the ideal representation of concept c — the vector that, if used as the internal state during generation, would produce only true statements about c. No model achieves it. But it serves as the reference point for measuring error.
Model i learns a representation φ^(i)_c = φ*_c + ε^(i)_c, where ε^(i)_c is the model-specific error. This error has two components:

ε^(i)_c = β^(i)_c + η^(i)_c

β^(i)_c is systematic bias: errors that arise from the model's architecture or training objective and persist across random seeds. η^(i)_c is idiosyncratic noise: errors that arise from the specific training data sample, initialization, and optimization path.
For models drawn from a diverse population (different architectures, different training data, different training procedures), we assume:
A1 (Zero-mean noise): E[η^(i)_c] = 0. Idiosyncratic errors do not favor any particular direction on average.
A2 (Bounded correlation): Corr(η^(i)_c, η^(j)_c) ≤ ρ < 1 for i ≠ j. Different models make different mistakes. This assumption is stronger for models with different architectures and training data, weaker for models that differ only in random seed.
A3 (Finite variance): Var(η^(i)_c) = σ²_c < ∞ for all i.
2.2 Alignment
Models have different architectures, dimensionalities, and internal geometries. Before aggregation, we need to map all representations into a shared space.
Let f_i: R^(d_i) → R^d be an alignment function that maps model i's activation space into a shared d-dimensional concept space. The Platonic Representation Hypothesis (Huh et al. 2024) predicts that sufficiently capable models converge on similar internal geometries for the same concepts. If this holds, f_i can be approximated as a linear map:

f_i(φ) = W_i φ + b_i

where W_i ∈ R^(d × d_i) and b_i ∈ R^d are learned from a small set of anchor concepts with known correspondences across models. Universal Sparse Autoencoders (Bricken et al. 2024, extended in the Universal SAE line of work) provide one mechanism for learning these maps: train an SAE on each model's activations, then align the learned dictionary elements across models by matching features that activate on the same inputs.
Let φ̃^(i)_c = f_i(φ^(i)_c) denote the aligned representation of concept c from model i, now living in the shared space.
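As a concrete sketch, the linear maps W_i and b_i can be fit by ridge-regularized least squares on the anchor-concept pairs. The function names and toy dimensions below are illustrative, not part of the proposal; a real implementation would fit against matched SAE features rather than raw vectors.

```python
import numpy as np

def fit_alignment(anchors_src, anchors_tgt, reg=1e-6):
    """Fit f_i(phi) = W @ phi + b from model i's space (d_i) into the
    shared space (d), using anchor concepts with known correspondences.
    Ridge-regularized least squares; the bias b is absorbed by appending
    a constant-1 column to the source anchors."""
    X = np.asarray(anchors_src, dtype=float)        # (n_anchors, d_i)
    Y = np.asarray(anchors_tgt, dtype=float)        # (n_anchors, d)
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])   # (n_anchors, d_i + 1)
    A = Xa.T @ Xa + reg * np.eye(Xa.shape[1])
    B = np.linalg.solve(A, Xa.T @ Y)                # (d_i + 1, d)
    return B[:-1].T, B[-1]                          # W: (d, d_i), b: (d,)

def align(phi, W, b):
    """phi-tilde = f_i(phi): map one concept vector into the shared space."""
    return W @ phi + b
```

With exactly linear correspondences and enough anchors, the fit recovers the map; in practice anchor noise and nonlinearity make this only an approximation.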
2.3 Fusion
The fused representation is:

φ̂_c = Σ_i w_i φ̃^(i)_c

where w_i ≥ 0 and Σ w_i = 1.
Uniform fusion. w_i = 1/N for all i. The simplest case. Under assumptions A1-A3, the variance of the fused idiosyncratic noise is:

Var((1/N) Σ_i η^(i)_c) = σ²_c (1/N + ρ (N − 1)/N)
When ρ = 0 (uncorrelated errors), variance decreases as σ²_c / N, the standard wisdom-of-crowds rate. When ρ > 0, the reduction is slower but still monotonic in N as long as ρ < 1. The systematic bias β^(i)_c does not cancel under averaging if models share architectural biases.
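The variance claim can be checked numerically. The sketch below (helper names are hypothetical) compares the closed form σ²_c (1/N + ρ (N − 1)/N) against a Monte Carlo estimate, using equicorrelated Gaussian noise built from a shared component and a private component per model.

```python
import numpy as np

def fused_noise_variance(sigma2, rho, N):
    """Closed form: variance of the uniform average of N noise terms,
    each with variance sigma2 and pairwise correlation rho."""
    return sigma2 * (1.0 / N + rho * (N - 1) / N)

def simulate_fused_variance(sigma2, rho, N, trials=200_000, seed=0):
    """Monte Carlo check. eta_i = sqrt(rho)*g + sqrt(1-rho)*e_i gives
    Var(eta_i) = sigma2 and Corr(eta_i, eta_j) = rho for i != j."""
    rng = np.random.default_rng(seed)
    g = rng.normal(size=(trials, 1))      # shared across the N "models"
    e = rng.normal(size=(trials, N))      # private to each "model"
    eta = np.sqrt(sigma2) * (np.sqrt(rho) * g + np.sqrt(1 - rho) * e)
    return eta.mean(axis=1).var()
```

At ρ = 0 the formula reduces to σ²_c / N; at ρ = 1 averaging buys nothing.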
Confidence-weighted fusion. Weight each model by a measure of its confidence or reliability for concept c:

w_i = s^(i)_c / Σ_j s^(j)_c
where s^(i)_c is a confidence score. Candidates for s^(i)_c include the inverse entropy of the model's output distribution when prompted about c, the activation magnitude of the concept feature (higher activation = stronger representation), or a learned reliability score per model per concept domain.
Confidence weighting reduces the influence of models that are uncertain about a concept, which is where hallucinations originate. A model that has a weak or diffuse representation of "apple" contributes less to the fused representation than a model with a sharp, high-confidence one.
Median fusion. Rather than averaging, take the component-wise median of {φ̃^(1)_c, ..., φ̃^(N)_c}. The median is more robust to outlier representations (a model with a severely corrupted concept vector) than the mean. WOC decoding (Chuang et al. 2025) demonstrated the superiority of median over mean aggregation at the output level. The same argument applies at the representation level.
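All three aggregation rules reduce to a few lines over a stacked (N, d) array of aligned representations. A minimal sketch (the function name and interface are assumptions, not part of the proposal):

```python
import numpy as np

def fuse(reps, method="mean", scores=None):
    """Fuse aligned concept representations reps: (N, d) array.
    method: 'mean' (uniform weights), 'weighted' (confidence scores
    s_i, normalized to sum to 1), or 'median' (component-wise,
    robust to a single severely corrupted concept vector)."""
    R = np.asarray(reps, dtype=float)
    if method == "mean":
        return R.mean(axis=0)
    if method == "weighted":
        w = np.asarray(scores, dtype=float)
        w = w / w.sum()                    # enforce w_i >= 0, sum w_i = 1
        return w @ R
    if method == "median":
        return np.median(R, axis=0)
    raise ValueError(f"unknown method: {method}")
```

The robustness difference is visible with a single outlier model: the mean is dragged toward the corrupted vector, the median is not.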
2.4 Concept Extraction
Given a prompt x, we need to identify which concepts are active and extract their representations. Let a^(i)_l(x) ∈ R^(d_i) be the activation of model i at layer l when processing prompt x. A sparse autoencoder decomposes this into a sum of concept directions:

a^(i)_l(x) ≈ Σ_k α^(i)_k(x) v^(i)_k
where v^(i)_k is the k-th dictionary element (a concept direction) and α^(i)_k(x) ≥ 0 is its activation coefficient. The set of concepts relevant to prompt x is C(x) = {k : α^(i)_k(x) > τ} for some threshold τ.
For each active concept k ∈ C(x), we extract φ^(i)_k = α^(i)_k(x) · v^(i)_k and apply the alignment and fusion steps described above. The fused activation is then:

â_l(x) = Σ_{k ∈ C(x)} φ̂_k + r^(g)_l(x)
where g denotes the generator model (the model that will produce the final output) and r^(g)_l(x) is the residual component of model g's activation that is not captured by the active concept directions. This residual preserves model g's fluency and generation mechanics while replacing concept-specific content with the fused consensus.
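Under the simplifying assumption that the SAE encoder can be stood in for by a ReLU of dictionary projections (a trained SAE has its own encoder weights and biases), the extraction and residual-splice steps look like:

```python
import numpy as np

def extract(a, V, tau=0.1):
    """Decompose activation a (d,) against dictionary V (K, d).
    Returns the active set C(x), the per-concept vectors
    phi_k = alpha_k * v_k, and the residual not captured by the
    active directions. The ReLU projection here is a stand-in for
    a trained SAE encoder, not a real one."""
    alpha = np.maximum(V @ a, 0.0)
    active = np.flatnonzero(alpha > tau)
    phis = {k: alpha[k] * V[k] for k in active}
    residual = a - alpha[active] @ V[active]
    return active, phis, residual

def splice(residual, fused_phis):
    """Rebuild the generator's layer-l activation: keep its residual
    (fluency, generation mechanics), replace the concept content with
    the fused consensus vectors."""
    return residual + sum(fused_phis.values())
```

Note that sub-threshold components survive inside the residual, so only concepts deemed active are replaced by the consensus.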
3. Error Analysis
3.1 When Fusion Helps
The fused representation φ̂_c is closer to φ*_c than any individual φ̃^(i)_c when idiosyncratic errors dominate systematic bias and models' errors are weakly correlated.
Define the mean squared error of the fused representation (with uniform weights w_i = 1/N):

MSE(φ̂_c) = E||φ̂_c − φ*_c||² = ||β̄_c||² + σ²_c (1/N + ρ (N − 1)/N)
where β̄_c = Σ_i w_i β^(i)_c is the average systematic bias.
Fusion reduces MSE relative to a single model (which has MSE = ||β^(i)_c||² + σ²_c) when:

||β̄_c||² + σ²_c (1/N + ρ (N − 1)/N) < ||β^(i)_c||² + σ²_c
This holds when σ²_c is large relative to ||β̄_c||² (noise-dominated regime) and ρ is small (diverse models). It fails when systematic biases dominate and all models share the same bias direction.
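The two regimes can be made concrete by evaluating the closed forms; a small sketch (helper names are illustrative):

```python
def fused_mse(bias_bar_sq, sigma2, rho, N):
    """Uniform-fusion MSE: ||beta_bar||^2 + sigma^2 (1/N + rho (N-1)/N)."""
    return bias_bar_sq + sigma2 * (1.0 / N + rho * (N - 1) / N)

def single_mse(bias_sq, sigma2):
    """Single-model MSE: ||beta||^2 + sigma^2."""
    return bias_sq + sigma2
```

With small shared bias, large noise, and weak correlation, fusion wins decisively; with fully correlated noise and a shared bias, fusion buys exactly nothing.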
3.2 When Fusion Hurts
Three failure modes:
Correlated bias. If all N models were trained on the same internet data, they may share the same misconceptions. The "apple" representation in every model might overweight the company relative to the fruit because the training data does. Averaging correlated biases does not reduce them. Diversity of training data across models is load-bearing for this approach.
Alignment error. If the alignment functions f_i are imprecise, averaging in the shared space mixes features that do not correspond to the same concept. Averaging "apple-the-fruit" from one model with "apple-the-company" from another produces incoherent results. Alignment quality is a hard prerequisite.
Concept polysemy. "Apple" in the prompt "I ate an apple" activates different features than "Apple" in "Apple released a new phone." The concept extraction step must resolve polysemy before alignment and fusion. If model i extracts the fruit sense and model j extracts the company sense for the same prompt, fusion is destructive. The extraction step must be context-sensitive, not just keyword-based.
3.3 Relationship to Hallucination Types
Knowledge hallucinations (false facts) are the primary target. These arise when a model's representation of a factual concept is incorrect — a noisy φ^(i)_c with large ||ε^(i)_c||. Cross-model fusion directly addresses this by averaging out the noise.
Reasoning hallucinations (logical errors in multi-step inference) are less directly targeted. These arise from errors in the model's computation over representations, not from the representations themselves. A correct fused "apple" representation does not prevent the model from making a logical error about apples downstream. Fusion improves the inputs to reasoning but not the reasoning itself.
Faithfulness hallucinations (contradicting the provided context) are not addressed. These are failures of attention and instruction-following, not of concept representation.
4. Theoretical Bounds
4.1 Variance Reduction Rate
Under assumptions A1-A3 with uniform fusion, the expected reduction in representation error from using N models versus 1 is:

Δ_N = σ²_c (1 − 1/N)(1 − ρ)
This is maximized when ρ = 0, yielding Δ_N = σ²_c(1 - 1/N), and goes to zero as ρ → 1.
The practical question is what ρ looks like for real models. Models that share architecture (two Llama fine-tunes) will have high ρ. Models with different architectures, training data, and training procedures (Llama vs. GPT vs. Mistral vs. Claude) will have lower ρ. The approach requires heterogeneous model pools.
4.2 Minimum Models Required
For a target maximum hallucination probability p on concept c, define ε_max as the maximum representation error that keeps hallucination below p. The required number of models is:

N ≥ σ²_c (1 − ρ) / (ε²_max − ||β̄_c||² − σ²_c ρ)
This is only satisfiable when ε²_max > ||β̄_c||² + σ²_c ρ. If the bias floor plus the irreducible correlated variance exceeds the error tolerance, no amount of model aggregation suffices. This formalizes the intuition that fusion cannot fix misconceptions shared across all models.
4.3 Alignment Error Propagation
Let the alignment function introduce error δ_i, so the aligned representation is φ̃^(i)_c = f_i(φ^(i)_c) + δ_i. With uniform fusion, the fused representation now has total error:

φ̂_c − φ*_c = β̄_c + (1/N) Σ_i η^(i)_c + (1/N) Σ_i δ_i
The alignment error δ_i adds a third variance term. If alignment errors are uncorrelated across models (plausible if each model has a different geometry), they average out at rate 1/N like idiosyncratic noise. If alignment errors are systematic (the shared space itself is biased), they contribute a fixed floor that does not decrease with N.
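The variance side of this decomposition can be written down directly. The sketch below covers only the variance terms and the systematic alignment floor; the concept-bias term ||β̄_c||² from Section 3.1 adds on top. The function name and parameterization are assumptions for illustration.

```python
def fused_variance_with_alignment(sigma2, rho, var_delta, sys_delta_sq, N):
    """Fused-representation variance including alignment error (Sec 4.3):
    the idiosyncratic term and uncorrelated alignment noise (var_delta)
    both shrink with N; a systematic shared-space bias (sys_delta_sq)
    is a fixed floor that no amount of averaging removes."""
    return (sigma2 * (1.0 / N + rho * (N - 1) / N)
            + var_delta / N
            + sys_delta_sq)
```

As N grows with uncorrelated errors, everything except the systematic floor vanishes.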
5. Architecture Sketch
The system has four components, run per prompt at inference time:
Parallel forward passes. The prompt is fed through N different pretrained models. Each model runs a forward pass up to a specified intermediate layer l. This produces N activation vectors {a^(1)_l(x), ..., a^(N)_l(x)}.
Concept extraction. A per-model SAE decomposes each activation into concept directions and coefficients. Active concepts are identified by thresholding.
Alignment and fusion. For each active concept, the corresponding feature vectors are mapped into the shared space via pretrained linear alignment maps, then aggregated (mean, weighted mean, or median).
Conditional generation. The fused activation replaces the concept-specific components of the generator model's activation at layer l. The generator completes the forward pass from layer l onward and produces output tokens.
The computational cost scales linearly with N for the forward passes (which can run in parallel) and is negligible for extraction, alignment, and fusion (these are linear operations on extracted features). The latency overhead is approximately equal to one additional forward pass worth of wall time if parallelized.
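The four components can be wired together in a toy end-to-end pass. Everything below is a stand-in for illustration: activations are precomputed vectors rather than real forward passes, the SAE encoder is a ReLU of dictionary projections, aligners are plain callables, and the shared space is assumed to coincide with the generator's own space.

```python
import numpy as np

def run_pipeline(acts, dicts, aligners, gen_idx, tau=0.1):
    """Toy version of the four inference-time components (Section 5).
    acts[i]: model i's layer-l activation for the prompt (step 1, assumed
    precomputed). dicts[i]: SAE dictionary (K, d_i), with a ReLU projection
    standing in for the encoder (step 2). aligners[i]: map into the shared
    space (step 3, median fusion). The fused concepts then replace the
    generator's concept components on top of its residual (step 4)."""
    per_concept = {}
    for a, V, f in zip(acts, dicts, aligners):
        alpha = np.maximum(V @ a, 0.0)
        for k in np.flatnonzero(alpha > tau):
            per_concept.setdefault(k, []).append(f(alpha[k] * V[k]))
    fused = {k: np.median(np.stack(v), axis=0) for k, v in per_concept.items()}
    a_g, V_g = acts[gen_idx], dicts[gen_idx]
    alpha_g = np.maximum(V_g @ a_g, 0.0)
    active_g = np.flatnonzero(alpha_g > tau)
    residual = a_g - alpha_g[active_g] @ V_g[active_g]
    # The generator would complete its forward pass from layer l with this
    # spliced activation; here we just return it.
    return residual + sum(fused.values())
```

In the real system, step 4 would hand the spliced activation back to the generator's remaining layers rather than returning it.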
6. Scope and Limitations
This proposal targets factual hallucinations caused by noisy or incorrect concept representations. It does not address reasoning errors, instruction-following failures, or hallucinations caused by decoding pathology (repetition, degeneration).
The approach assumes the existence of extractable, alignable concept features across models. The Platonic Representation Hypothesis provides theoretical motivation, and Universal SAEs provide a mechanism, but both are young results with limited empirical validation at the scale this proposal requires.
The approach is expensive. N parallel forward passes, even if only up to an intermediate layer, multiply inference cost. Practical deployment would require N to be small (3-5 models) for the cost to be viable, which limits the variance reduction achievable.
The systematic bias floor is irreducible by this method. If all models in the pool are trained on the same data distribution, they will share the same blind spots. Diversity of training data across the model pool is as important as diversity of architecture.
8/27/2024