Constraint-Satisfying Activation Perturbations: Toward Principled Inference-Time Model Steering Without Retraining
Abstract. Current approaches to inference-time model steering operate in one of two modes: empirical activation addition, where contrastive steering vectors are injected into the residual stream based on observed activation differences, or post-hoc weight editing, where localized parameter modifications are computed offline and applied before deployment. Both modes face fundamental limitations. Activation steering relies on the linear representation hypothesis, which recent evidence suggests breaks down in deep networks due to chaotic dynamics in the residual stream. Weight editing methods like ROME and MEMIT lack formal consistency guarantees and degrade under sequential application. We propose a third mode: zeroth-order optimization of activation perturbations at inference time, subject to explicit behavioral constraints. Rather than pre-computing a fixed steering direction or editing weights offline, the system searches for minimal activation perturbations during the forward pass that satisfy a declarative constraint. We formalize the problem, survey the theoretical obstacles from mechanistic interpretability that motivate this approach, and outline the experimental program required to evaluate feasibility.
1. Introduction
The problem of controlling large language model behavior without retraining has produced three families of methods, each with a characteristic failure mode.
Activation steering computes a direction in activation space by contrasting positive and negative examples, then adds this direction to the residual stream during generation. The method is lightweight and requires no gradient computation. It is also unpredictable: the relationship between a steering vector and its downstream behavioral effect is poorly characterized, steering at early layers degrades fluency, multiple simultaneous steering objectives interfere with each other, and the same vector can produce qualitatively different effects depending on input context.
Representation engineering extends this by reading and writing to the model's internal representations at specific layers. Recent work using sparse autoencoders (SAEs) to decompose activations into interpretable features before steering has improved specificity. But SAE-based methods inherit a deeper problem: features identified by SAEs are not guaranteed to be causally relevant, feature splitting and absorption artifacts can create spurious concepts, and the interpretability of individual features does not compose into interpretability of feature combinations.
Weight editing modifies feed-forward network parameters to insert, update, or delete factual associations. ROME applies rank-one updates to targeted layers; MEMIT distributes updates across multiple layers for batch editing. Both degrade under sequential application. ROME exhibits catastrophic forgetting after approximately 10 edits; MEMIT remains stable through approximately 40 but eventually degrades downstream task performance. AlphaEdit constrains updates to the null space of preserved knowledge, improving locality, but formal consistency guarantees remain weak: editing "the Eiffel Tower is in Paris" can still silently affect "Paris is the capital of France" because the entailment structure of factual knowledge is not captured by the linear associative memory framework these methods assume.
All three families share a structural limitation: the steering signal is computed offline and applied statically. A contrastive steering vector does not adapt to the specific input being processed. A weight edit does not condition on the query that triggered the need for updated knowledge. This paper argues that the next step is to make the steering signal itself the output of an optimization process that runs during inference, conditioned on the specific input and a declarative behavioral constraint.
Two developments make this direction tractable in a way it was not two years ago. First, the zeroth-order optimization literature for LLMs has matured rapidly. MeZO demonstrated that forward-pass-only gradient estimation can fine-tune LLMs at inference-level memory cost. Subsequent work (HIZOO, LOZO, SubZero, AGZO) has reduced variance, improved convergence, and identified that the effective dimensionality of useful perturbation subspaces is far lower than the full parameter count. Second, mechanistic interpretability has reached the point where we can identify, at least partially, which activation subspaces are causally relevant to specific behaviors. SAE-targeted steering and Conditional Activation Steering demonstrate that it is possible to condition interventions on semantic content rather than applying them uniformly. Simultaneously, recent theoretical results constrain what is achievable. We survey these limits in Section 3, because understanding them is essential to scoping a realistic research program.
2. Problem Formulation
Let M be a transformer with L layers. For an input sequence x, the forward pass produces a sequence of hidden states h₀, h₁, ..., h_L where h₀ is the embedding and each h_l is the output of layer l. The model's output distribution is p(y|x) = softmax(W_vocab · h_L).
A behavioral constraint C is a predicate over the model's output: C(y, x, D) ∈ {0, 1}, where D is optional auxiliary data (a reference document, a concept to avoid, a factual grounding). Examples: factual consistency (C = 1 iff y does not contradict any claim in document D), concept avoidance (C = 1 iff y does not express concept D), style conformance (C = 1 iff y satisfies stylistic specification D), and truthfulness (C = 1 iff the model's internal confidence in y exceeds threshold τ).
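As a concrete illustration, a constraint evaluator can be packaged as a callable that exposes both the binary predicate C and a graded violation score for the relaxed objective. The sketch below is hypothetical (the class name `ConceptAvoidance` and the substring-matching score are stand-ins for a learned classifier, not part of the paper's method):

```python
from dataclasses import dataclass


@dataclass
class ConceptAvoidance:
    """C(y, x, D) = 1 iff output y does not express concept D.

    Hypothetical evaluator: D is modeled as a set of banned strings and
    the violation score counts which appear, a stand-in for a classifier.
    """
    banned: frozenset

    def violation(self, y: str) -> float:
        # Graded score for the relaxed objective; 0.0 means C is satisfied.
        return float(sum(1 for term in self.banned if term in y.lower()))

    def __call__(self, y: str, x: str = "", d=None) -> int:
        # The binary predicate C(y, x, D) ∈ {0, 1}.
        return int(self.violation(y) == 0.0)


evaluator = ConceptAvoidance(banned=frozenset({"rainbows"}))
print(evaluator("The sky is blue."))      # 1: constraint satisfied
print(evaluator("Rainbows are pretty."))  # 0: concept expressed
```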
We seek a perturbation δ = (δ₁, ..., δ_L), where δ_l is added to the residual stream at layer l, such that:

    min_δ Σ_l ‖δ_l‖²   subject to   C(y_δ, x, D) = 1   and   KL(p_δ(·|x) ‖ p(·|x)) ≤ ε,

where y_δ is the output under the perturbed forward pass and ε bounds the divergence from the unperturbed model to preserve fluency and off-target behavior. The minimality of δ serves as a regularizer. The KL bound prevents the optimization from finding degenerate solutions that satisfy C by destroying coherent generation.
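The KL bound can be checked directly on next-token distributions. A minimal sketch with toy four-token vocabularies (the distributions and budget are illustrative):

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p ‖ q) between next-token distributions, used to check the
    fluency bound against the budget ε from the formulation."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))


# A mild perturbation stays within the budget; a degenerate one that
# collapses the distribution onto a single token blows it.
p_unpert = np.array([0.40, 0.30, 0.20, 0.10])
p_mild   = np.array([0.42, 0.29, 0.19, 0.10])
p_degen  = np.array([0.01, 0.01, 0.01, 0.97])

eps_budget = 0.1
print(kl_divergence(p_mild, p_unpert) <= eps_budget)   # True
print(kl_divergence(p_degen, p_unpert) <= eps_budget)  # False
```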
The constraint C is generally non-differentiable: "does this output contradict document D?" cannot be expressed as a smooth loss over the model's logits without approximation. We instead propose to evaluate C directly on candidate outputs and use zeroth-order optimization to search for perturbations that satisfy it. Concretely, we define a relaxed objective:

    L(δ) = V(y_δ, x, D) + λ₁ Σ_l ‖δ_l‖² + λ₂ max(0, KL(p_δ(·|x) ‖ p(·|x)) − ε),

where V is a violation score. The zeroth-order estimator approximates ∇_δ L using finite differences:

    ∇_δ L ≈ [(L(δ + μu) − L(δ − μu)) / (2μ)] · u,

where u is a random perturbation vector and μ is a smoothing parameter. This requires two additional forward passes per optimization step but no backward pass and no access to model internals beyond the ability to inject additive perturbations at specified layers. The naive cost is 2K additional forward passes per generated token, where K is the number of optimization steps. This is prohibitive for real-time generation; we outline mitigation strategies in Section 4.
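The two-point estimator is a few lines of code. The sketch below verifies it on a quadratic toy loss (the loss, dimensions, and sample count are illustrative, not part of the method); averaging many single-direction estimates recovers the true gradient:

```python
import numpy as np


def zo_gradient(loss, delta, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate of loss at delta.

    Implements g ≈ [(L(δ+μu) − L(δ−μu)) / (2μ)] · u with one random
    direction u: two extra forward passes, no backward pass.
    """
    rng = np.random.default_rng(rng)
    u = rng.standard_normal(delta.shape)
    return (loss(delta + mu * u) - loss(delta - mu * u)) / (2 * mu) * u


# Toy loss L(δ) = ‖δ − t‖², whose true gradient at δ = 0 is −2t.
target = np.array([1.0, -2.0, 0.5])
loss = lambda d: float(np.sum((d - target) ** 2))

# Each single-direction estimate is noisy; the mean over many sampled
# directions approaches the true gradient [-2.0, 4.0, -1.0].
est = np.mean([zo_gradient(loss, np.zeros(3), rng=s) for s in range(4000)], axis=0)
```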
3. Theoretical Limits From Mechanistic Interpretability
The theoretical foundation of activation steering is the linear representation hypothesis: high-level concepts are represented as linear directions in activation space. If true, steering reduces to vector addition in the right subspace. SAE-based methods operationalize this by decomposing the residual stream into sparse, putatively monosemantic features and steering along identified feature directions.
Recent evidence complicates this picture. Nonlinear "onion" representations have been identified in small networks where the same concept is encoded differently at different radii in activation space. More fundamentally, work on the Lyapunov structure of deep networks suggests that the residual stream of deep transformers exhibits positive Lyapunov exponents: small perturbations at early layers are amplified exponentially through subsequent layers. A perturbation of magnitude ε at layer l grows as approximately exp(λ · (L - l)) where λ is the maximal Lyapunov exponent. The implication for steering is direct: a linear intervention at an early layer does not produce a linear effect at the output. The relationship between the intervention and the behavioral change is chaotic in the dynamical-systems sense, meaning that it is sensitive to the specific input, the specific layer, and the specific activation state.
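The exponential amplification can be illustrated numerically. The sketch below uses a toy deep tanh network with random weights in its expansive regime (following mean-field analyses of deep random networks), not the transformer residual stream itself; the width, depth, and gain are illustrative. A tiny perturbation grows roughly as exp(λ · depth), and a linear fit to the log-growth gives a crude estimate of λ:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 48     # width and depth of the toy network
eps = 1e-6        # initial perturbation magnitude
gain = 2.0        # gain > 1 puts the random tanh network in its chaotic regime

Ws = [gain * rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]

h = rng.standard_normal(d)
h_pert = h + eps * rng.standard_normal(d)

growth = []
for W in Ws:
    # Toy layer map h <- tanh(W h); the two trajectories start eps apart.
    h = np.tanh(W @ h)
    h_pert = np.tanh(W @ h_pert)
    growth.append(np.linalg.norm(h_pert - h))

# Slope of log-growth vs. depth: a crude estimate of the maximal
# Lyapunov exponent λ in ‖δ_L‖ ≈ ‖δ_0‖ · exp(λ · L).
lam = np.polyfit(np.arange(L), np.log(growth), 1)[0]
print(f"estimated maximal Lyapunov exponent λ ≈ {lam:.2f}")
```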
SAEs trained on LLM activations produce features that appear monosemantic when evaluated on their top-activating examples. However, SAEs create artificial concept splits as artifacts of the sparsity objective. The feature catalog of an SAE is not a faithful decomposition of the model's computational structure; it is an approximation that optimizes reconstruction error under a sparsity penalty, and the resulting features may not correspond to the units the model actually uses for computation. If you steer along an SAE feature direction that is an artifact of the SAE rather than a genuine computational direction of the model, the behavioral effect is unpredictable.
The causal abstraction framework provides the most rigorous theoretical foundation for mechanistic interpretability, unifying activation patching, circuit analysis, and steering under precise mathematical definitions. However, it does not provide an algorithm for finding such abstractions. A regulatory impossibility result strengthens this concern: no framework can simultaneously achieve unrestricted model capabilities, human-interpretable explanations, and negligible explanation error. Practical steering must operate under partial interpretability and tolerate non-negligible error.
These results motivate the zeroth-order approach in two ways. Static steering vectors are unreliable because the map from activation perturbation to behavioral effect is input-dependent and potentially chaotic. An optimization-based approach that conditions on the specific input sidesteps this by finding perturbations that work for this input rather than relying on a direction that works on average. And circuit-level understanding is insufficient for steering because current decomposition methods produce approximate, sometimes artifactual, representations of the model's computational structure. Rather than deriving a steering intervention from an imperfect circuit analysis, we propose to search for one directly.
4. Method: Constrained Zeroth-Order Activation Optimization (CZAO)
The system has four components: the target model M (the LLM being steered, treated as a black box with hooks at specified layers for additive perturbation injection); the constraint evaluator E (a function that evaluates whether a candidate output satisfies the behavioral constraint); the perturbation generator G (initializes and parameterizes the perturbation δ, restricting the search to a low-dimensional subspace); and the zeroth-order optimizer O (iteratively refines δ by evaluating the constraint and minimizing the composite loss).
The key to computational feasibility is reducing the search space. We propose three complementary strategies. Layer selection: not all layers require perturbation. Causal tracing identifies layers that are causally relevant to specific behaviors. We restrict perturbation to a subset of layers R ⊂ {1, ..., L} identified by a one-time causal tracing pass. Low-rank perturbation: effective perturbation directions lie in the column space of the activation matrix, so we parameterize δ_l as:

    δ_l = A_l z_l,

where A_l ∈ ℝ^{d×r} is a basis derived from the activations of the current input at layer l (computed for free during the forward pass), z_l ∈ ℝ^r is the low-dimensional parameter vector we optimize, and r ≪ d. This reduces the search from O(L · d) to O(|R| · r) dimensions. SAE-guided subspace: where available, we can further restrict the search to directions corresponding to SAE features that a preliminary analysis identifies as potentially relevant to the constraint.
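A sketch of the low-rank parameterization under these assumptions, building the basis from a thin SVD of the current input's activation matrix (`activation_basis` is an illustrative name; the dimensions are arbitrary):

```python
import numpy as np


def activation_basis(H, r):
    """Orthonormal rank-r basis for the column space of the activation
    matrix H (d × seq_len), via thin SVD.

    Sketch of the δ_l = A_l z_l parameterization: the basis comes from
    the current input's activations, so it costs one SVD per selected
    layer on top of the forward pass.
    """
    U, _, _ = np.linalg.svd(H, full_matrices=False)
    return U[:, :r]                  # A_l ∈ ℝ^{d×r}


d, seq_len, r = 512, 32, 8
H = np.random.default_rng(1).standard_normal((d, seq_len))
A = activation_basis(H, r)

z = np.zeros(r)                      # low-dimensional search variable
delta = A @ z                        # perturbation injected at layer l

# The search shrinks from d to r dimensions per selected layer.
print(A.shape, delta.shape)          # (512, 8) (512,)
```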
For each token generation step: (1) run the unperturbed forward pass and record activations at layers in R; (2) compute the activation-informed basis A_l for each l ∈ R; (3) initialize z = 0; (4) for k = 1, ..., K, sample random direction u in the z-space, run two perturbed forward passes with z + μu and z - μu, evaluate the composite loss L for both, and update z using the zeroth-order gradient estimate; (5) apply the final perturbation δ = Az to the forward pass and sample the output token.
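The inner loop (steps 3 and 4) can be sketched compactly. Since running a real perturbed forward pass is out of scope here, the composite loss is replaced by a synthetic smooth surrogate with a known minimizer; `czao_step`, the step size, and the surrogate are illustrative stand-ins, not the paper's tuned procedure:

```python
import numpy as np


def czao_step(composite_loss, z0, K=10, mu=1e-2, lr=0.05, seed=0):
    """Inner CZAO loop for one token: K zeroth-order updates on the
    low-dimensional coordinates z, each costing two evaluations of the
    composite loss (two perturbed forward passes) and no backward pass.
    """
    rng = np.random.default_rng(seed)
    z = z0.copy()
    for _ in range(K):
        u = rng.standard_normal(z.shape)
        g = (composite_loss(z + mu * u) - composite_loss(z - mu * u)) / (2 * mu) * u
        z = z - lr * g
    return z


# Stand-in for the real pipeline: the true composite loss would run the
# perturbed forward pass and call the constraint evaluator. Here a smooth
# surrogate with minimum at z_star exercises the loop.
z_star = np.array([0.8, -0.3, 0.0, 0.1])
surrogate = lambda z: float(np.sum((z - z_star) ** 2))

z = czao_step(surrogate, np.zeros(4), K=100)
print(f"loss {surrogate(np.zeros(4)):.3f} -> {surrogate(z):.6f}")
```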
Three amortization strategies reduce the dominant cost of 2K additional forward passes per token. Perturbation caching: for a given (input, constraint) pair, the optimal perturbation at token t is likely similar to the optimal perturbation at token t+1. Warm-starting the optimization from the previous token's solution can reduce K substantially. Offline distillation: after running CZAO on a dataset of (input, constraint) pairs, train a lightweight network to predict the perturbation z directly from (x, C), amortizing the optimization cost across future inputs at the cost of generalization error. Early termination: if the constraint is already satisfied after the unperturbed forward pass, no optimization is needed. A cheap screening classifier can estimate violation probability to avoid unnecessary optimization.
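The per-token control flow around these strategies can be sketched as follows. Both the optimizer and the screening step are toy stand-ins (the screen clears every other token, and the "optimizer" nudges z toward a fixed target to mimic slowly drifting per-token optima); only the warm-start and early-termination structure reflects the text:

```python
import numpy as np


def generate_with_amortization(optimize, needs_opt, r=8):
    """Warm-start each token's optimization from the previous token's
    solution, and skip tokens the screening classifier clears.
    `optimize` and `needs_opt` are stand-ins for the real components.
    """
    z = np.zeros(r)          # cold start for the first token only
    skipped = 0
    for flag in needs_opt:
        if not flag:         # early termination: screen predicts no violation
            skipped += 1
            continue
        z = optimize(z)      # warm start from the previous solution
    return z, skipped


target = np.full(8, 0.5)
flags = [i % 2 == 0 for i in range(20)]     # toy screen: half the tokens need work
z, skipped = generate_with_amortization(
    optimize=lambda z0: z0 + 0.3 * (target - z0),
    needs_opt=flags,
)
print(f"skipped {skipped}/20 tokens")       # skipped 10/20 tokens
```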
5. Relationship to Existing Work
Activation steering computes a fixed direction and applies it uniformly. CZAO computes an input-specific perturbation that satisfies an explicit constraint. Activation steering is fast (one forward pass), CZAO is slow (2K+1 forward passes per token). The tradeoff is reliability: CZAO provides a guarantee (up to evaluator accuracy) that the output satisfies the constraint, while activation steering provides a statistical tendency. Conditional Activation Steering partially bridges this gap by conditioning the application of a steering vector on the semantic content of the input. CZAO goes further by conditioning not just the application but the direction and magnitude of the intervention.
ROME and MEMIT modify weights offline and permanently. CZAO modifies activations transiently, per-input. Perturbations do not accumulate (no catastrophic forgetting from sequential edits) and the intervention is conditioned on the specific query. The disadvantage is computational cost and the fact that the modification does not persist: the same constraint must be re-enforced for each input.
Several recent methods (TTT layers, test-time training) update model parameters at inference time using gradient descent on the current input. CZAO differs in two respects: it modifies activations rather than weights (avoiding any permanent change to the model) and it uses zeroth-order optimization (avoiding the backward pass). Constrained decoding methods (FUDGE, classifier-free guidance, rejection sampling) modify the output distribution at the token level without modifying internal activations. CZAO operates on internal representations, which gives it access to the model's computational process rather than just its output distribution.
6. Open Problems and Required Proofs
Convergence guarantees. The standard convergence theory for zeroth-order optimization in non-convex settings guarantees convergence to a stationary point at rate O(d/√T), where d is the dimensionality and T is the number of steps. With the subspace restriction, d is replaced by |R| · r. However, the constraint satisfaction problem is not a standard smooth optimization: the constraint C is binary, and the relaxation through the violation score V may introduce non-convexity and flat regions. Required: derive convergence bounds for CZAO under the subspace restriction, assuming V is Lipschitz continuous and the activation-to-output map is locally smooth. Characterize the dependence on the subspace dimension r and the number of active layers |R|.
Consistency under composition. When multiple constraints must hold simultaneously, the composite loss simply sums the violation scores, but there is no guarantee that the feasible sets intersect or that the intersection is reachable from the unperturbed state within the perturbation budget ε. Required: characterize conditions under which the feasible set for multiple constraints is non-empty. Relate the feasible set geometry to the structure of the activation subspace and the KL divergence bound ε.
Evaluator faithfulness. The method's reliability is bounded by the accuracy of the constraint evaluator E. An adversarial failure mode exists: the optimization could find perturbations that satisfy E without satisfying the true constraint, by exploiting weaknesses in E. Required experiment: measure the rate of "evaluator gaming" where CZAO finds perturbations that fool the constraint evaluator but violate the intended constraint as judged by human evaluation.
Interaction with the Lyapunov structure. The chaotic dynamics of the residual stream imply that small changes in the perturbation can produce large changes in the output. The Lyapunov exponent determines the sensitivity: if λ is large, the loss landscape changes rapidly and optimization requires small step sizes and more iterations. Required experiment: characterize the loss landscape of L(δ) as a function of the maximal Lyapunov exponent of the target model.
Constraint reachability. The optimization assumes the feasible set is non-empty. This is not guaranteed. When the unperturbed model is confidently wrong with respect to the constraint, the minimum perturbation required to satisfy C may exceed the budget imposed by ε. Define the reachability margin of a constraint C on input x as:

    ρ(C, x) = min { KL(p_δ(·|x) ‖ p(·|x)) : C(y_δ, x, D) = 1 }.
If ρ(C, x) > ε_max, the constraint is unreachable for this input. The optimization will converge to a local minimum of the violation score V rather than a constraint-satisfying solution, and the output guarantee degrades to a best-effort reduction in violation rather than satisfaction. Three questions follow. How often is ρ(C, x) > ε_max in practice? For factual consistency constraints, this depends on how confidently the model holds the incorrect belief, which connects to the calibration literature. Can ρ(C, x) be estimated cheaply before running the full optimization? A lightweight probe on the unperturbed activations might predict reachability. What is the right fallback when the constraint is unreachable? Options include relaxing ε, falling back to guided decoding at the output level, or flagging the input as requiring retraining rather than inference-time intervention.
Scaling. For CZAO to be practical, K must be small (ideally ≤ 10). Whether this is achievable depends on the smoothness of the loss landscape in the restricted subspace and the effectiveness of warm-starting. Required experiment: benchmark CZAO on Llama-3-8B, Gemma-2-9B, and a model in the 70B+ range. Measure K required for convergence as a function of model size, constraint type, and subspace dimension r.
7. The Steering Trilemma
Inference-time steering methods face a trilemma among three desirable properties: reliability (the intervention provably satisfies the behavioral constraint), efficiency (the intervention adds minimal computational overhead), and generality (the intervention works for arbitrary constraints without constraint-specific training).
Activation steering achieves efficiency and partial generality but not reliability. Weight editing achieves partial reliability for specific fact-editing constraints but not efficiency or generality. Guided decoding achieves efficiency for token-level constraints but not generality for constraints requiring internal state access. CZAO trades efficiency for reliability and generality. The open question is whether amortization can recover enough efficiency to make the tradeoff practical. Required: formalize the steering trilemma. Determine whether it is a fundamental limitation (analogous to the CAP theorem) or merely a reflection of current methods. Specifically, investigate whether there exists a lower bound on the computational cost of achieving constraint satisfaction with probability ≥ 1 − α for a constraint class of bounded complexity.
8. Conclusion
The field of inference-time model steering is converging on a recognition that static interventions are insufficient for reliable behavioral control. The mechanistic interpretability program, initially hoped to provide the theoretical foundation for principled steering, is discovering fundamental limits: the linear representation hypothesis is approximately but not exactly true, SAE decompositions introduce artifacts, and the residual stream dynamics may be chaotic in deep networks.
These limits do not invalidate steering. They redirect it. Rather than deriving steering interventions from first-principles circuit analysis (top-down), we propose to search for them via constrained optimization conditioned on the specific input and constraint (bottom-up). The zeroth-order formulation avoids the need for backpropagation through the target model, accommodates non-differentiable constraints, and produces input-specific perturbations that do not suffer from the transfer failures of static steering vectors.
The primary obstacle is computational cost. The experimental program outlined in Section 6 is designed to determine whether this cost can be reduced to practical levels through subspace restriction, perturbation caching, and offline distillation. If K can be brought below 10 for common constraint types, CZAO becomes viable as a complementary tool to existing methods: fast activation steering for soft preferences, CZAO for hard constraints.
September 23, 2025