Preprint · February 2026

observer: Closed-Loop Stability Control
for Language Model Inference

A runtime instrumentation stack for measuring perturbation dynamics, quantifying recovery behavior, and applying adaptive damping during autoregressive generation — without modifying model weights.

Author: Josh Malone
Affiliation: Arsenal · Independent Research
Repository: github.com/aeon0199/observer
License: MIT
Abstract

We present observer, an open-source runtime stack for studying and controlling behavioral stability in autoregressive language models during inference. The system treats a generating model as a dynamical system and builds a feedback loop around it: a streaming divergence signal derived from held-out VAR(1) prediction error on projected hidden states drives an adaptive threshold controller that applies layer-wise scaling interventions in real time. The intervention engine uses a deterministic branchpoint design (SeedCache) that eliminates common confounds in perturbation experiments by guaranteeing baseline and intervention trajectories share identical pre-generation state. We report experiments on Qwen2.5-7B showing that (1) the same prompt under three controller configurations produces three qualitatively distinct outputs — different misconceptions selected from identical initial conditions via different intervention intensities — providing evidence for discrete attractor basins in the generation landscape; (2) open-loop scaling interventions at 0.4× produce PLASTIC regime behavior with permanent trajectory displacement despite immediate hidden-state norm recovery; and (3) the closed-loop controller achieves zero CRITICAL events across all configurations while operating at up to 80.6% intervention duty cycle, demonstrating viable real-time steering. Observer is a research instrument, not a production safety system; the divergence signal measures trajectory instability, not downstream task failure. Code, data, and full run artifacts are publicly available.

§1 Motivation

The dominant paradigm in mechanistic interpretability — sparse autoencoders, circuit discovery, logit lens analysis — answers the question "what does this model compute?" It is fundamentally a post-hoc analytical approach. The field has produced significant understanding of model internals, but has largely deferred a different class of question:

Can we detect when generation is destabilizing, in real time, and do something about it?

This is the question observer is built to investigate. It is closer in spirit to control engineering than to interpretability research: rather than analyzing a system's internal structure, we treat the model as a dynamical system and ask whether we can build a feedback loop around it.

The practical stakes are not abstract. High-stakes deployments of language models — in agentic settings, long-horizon tasks, adversarial environments — require some answer to the question of whether generation has gone off course and whether that course can be corrected. The current state of the art is largely output-level heuristics: does the text look wrong? Observer proposes that the answer should be visible in the hidden trajectory before it surfaces in the output, and that a runtime controller can act on that signal.

Scope caveat: Observer is a research instrument. The divergence signal measures trajectory instability — it is not a proven hallucination detector. The controller is a threshold-based brake, not a proven safety layer. Empirical validation of downstream correlates is the necessary next step.

§2 Related Work

Observer occupies a space adjacent to several lines of existing work, without directly duplicating any of them.

Intervention Tooling

TransformerLens [1] provides the dominant toolkit for mechanistic interpretability research: model loading, hook-based activation capture and modification, and a large body of research built on its abstractions. It is an exploration tool — excellent for research notebooks and circuit analysis, not designed around systematic experimental protocols or recovery measurement.

pyvene [2] formalizes interventions as first-class serializable primitives, enabling composable intervention specifications across locations, granularity, and sequence position. It is an execution library: it provides the mechanics of intervention without opinions about experimental design, hysteresis, or recovery.

nnsight [3] provides a Pythonic interface for local and remote model execution, including access to frontier models via the NDIF infrastructure. Observer supports nnsight as an optional backend, inheriting its remote execution capabilities.

Representation Engineering and Steering

Representation Engineering [4] demonstrated that model behavioral tendencies can be read from and written to activation space via linear probes and steering vectors. Inference-Time Intervention [5] applied shifted activations at inference time, improving TruthfulQA performance from 32.5% to 65.1%. Neither line of work focused on recovery dynamics or closed-loop feedback.

LLM Stability

Recent work on LLM output consistency [6, 7] characterizes stability at the output level — how often does the same model produce the same answer across runs? Observer operates at a different layer: activation-level perturbation dynamics within a single generation, not output-level consistency across generations.

The Gap

Existing tools are platforms for doing interventions. Observer is built around a different question: how do you rigorously measure what an intervention does over time, whether the model recovers, and whether you can close the loop? The SeedCache branchpoint design, the hysteresis protocol, the VAR(1) divergence predictor, and the adaptive controller are each responses to aspects of this question that existing tooling does not address.

§3 Architecture Overview

Observer is organized as four protocol layers, each independently usable and composable:

V1: Hysteresis Protocol
Three-stage protocol (BASE → PERTURB → REASK) for measuring perturbation persistence. Does the model self-correct when re-asked, given that the perturbation remains in the KV cache?

V1.5: Observability Runner
Single-pass token-level telemetry with streaming diagnostics: VAR(1) divergence predictor, spectral energy metrics, layer volatility, windowed SVD. No branching, no intervention.

V2: Deterministic Intervention Engine
Baseline vs. intervention comparison via SeedCache branchpoint. Both branches start from identical model state. Supports additive, projection, scaling, and SAE-based interventions.

S4: Closed-Loop Adaptive Controller
Threshold controller with moving-average smoother and cooldown hysteresis. Composite divergence score drives hidden-state scaling in real time. Shadow mode for calibration before active deployment.

PROMPT
   │
   ▼
[ SeedCache: build_seed_cache() ]
   │   past_key_values snapshot
   │   next_token_logits
   │   seed_hidden @ intervention_layer
   │
   ├──────────────────────────┐
   ▼                          ▼
[ BASELINE branch ]          [ INTERVENTION branch ]
   SeedCache.clone()           SeedCache.clone()
   greedy generation           hook active: intervene()
   trajectory captured         trajectory captured
   │                          │
   └──────────┬───────────────┘
              ▼
   [ TrajectoryComparison ]
      cosine distance per token
      JS divergence on logits
      regime classification
      recovery metrics
Figure 1. V2 intervention engine data flow. Both branches derive from identical prompt-pass state via SeedCache.clone().

§4 SeedCache: Deterministic Branchpointing

The central design problem in intervention experiments is confounding. A naive implementation runs the baseline and intervention branches from separate forward passes over the same prompt. This introduces at minimum: different random number generator states at the point of token sampling (even under greedy decoding, CUDA operations can have ordering nondeterminism), and potentially different attention mask states depending on the batching implementation.

The SeedCache resolves this by running the prompt exactly once, then cloning the resulting model state for both branches:

# cache.py — run prompt once, snapshot pre-generation state
def build_seed_cache(model, tokenizer, device, prompt, layer) -> SeedCache:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    layers = get_decoder_layers(model)   # architecture-aware discovery (§13), e.g. model.model.layers
    hook = _HiddenCaptureHook()
    handle = layers[layer].register_forward_hook(hook)

    with torch.no_grad():
        outputs = model(input_ids, use_cache=True, return_dict=True)
    handle.remove()

    return SeedCache(
        past_key_values=outputs.past_key_values,     # full KV cache
        next_token_logits=outputs.logits[:, -1, :],  # first-token distribution
        seed_hidden=hook.captured,                   # hidden state @ layer
        fingerprint=compute_cache_fingerprint(...),  # checksum (see below)
    )

# Both branches start from identical state
baseline_cache = seed_cache.clone()
intervention_cache = seed_cache.clone()

The fingerprint — derived from the first-layer key cache statistics — provides a checksum that experiments can log to verify both branches genuinely share a common origin. This eliminates the dominant confound in published intervention experiments. A residual limitation is hardware-level nondeterminism (CUDA kernel ordering, cuBLAS algorithm selection), which the SeedCache cannot control; we note this in §13.
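A plausible shape for this checksum, sketched under the assumption that past_key_values exposes the legacy per-layer (key, value) tuple format; the repository's exact choice of statistics may differ:

import hashlib

def compute_cache_fingerprint(past_key_values) -> str:
    # Layer 0 key tensor, shape (batch, heads, seq, head_dim); legacy tuple format assumed.
    keys = past_key_values[0][0]
    stats = [float(keys.mean()), float(keys.std()), float(keys.abs().max())]
    payload = f"{tuple(keys.shape)}|{stats}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]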

Why this matters: Without a shared branchpoint, "recovery" measurements conflate genuine behavioral change with noise introduced by divergent initial conditions. The SeedCache makes the comparison meaningful.

§5 The Divergence Signal

The core signal feeding both the observability runner and the adaptive controller is a per-token held-out prediction error from a VAR(1) model fit on a sliding window of projected hidden states.

Projection

The hidden state h_t ∈ ℝ^D (where D is the model's hidden dimension, typically 3584–8192) is projected to a fixed low-dimensional space via a deterministic Rademacher matrix:

z_t = h_t · P,   P ∈ {±1/√k}^{D×k},   k = 64
P is fixed for the run lifetime (seeded, deterministic)

The Rademacher projection preserves inner products in expectation (Johnson-Lindenstrauss [8]), reduces the regression problem from D dimensions to k = 64, and is generated once per hidden size via a seeded RNG — making it reproducible across runs and comparable across model families with different hidden sizes.
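A minimal sketch of such a projector in NumPy; the function name and signature are assumptions, but the construction follows the definition above:

import numpy as np

def make_projection(hidden_dim: int, k: int = 64, seed: int = 42) -> np.ndarray:
    """Seeded Rademacher matrix P ∈ {±1/√k}^{D×k}: deterministic per seed,
    hence reproducible across runs."""
    rng = np.random.default_rng(seed)
    signs = rng.integers(0, 2, size=(hidden_dim, k)) * 2 - 1  # entries in {−1, +1}
    return signs.astype(np.float64) / np.sqrt(k)

# z_t = h_t · P maps ℝ^D → ℝ^64 while preserving inner products in expectation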

VAR(1) Dynamics

A first-order vector autoregressive model is fit on the sliding window W = {z_{t−n}, ..., z_{t−1}} via ridge regression:

z_t ≈ z_{t−1} · A,   A ∈ ℝ^{k×k}
(XᵀX + λI) A = XᵀY,   λ = 0.01

Critically, the matrix A is fit on the window excluding the newest state z_t. The prediction ẑ_t = z_{t−1} · A is then compared to the actual observed z_t. This is a held-out evaluation: the model is never trained on the transition it is asked to predict. This matters because in-sample VAR(1) error on a short window would collapse toward zero regardless of actual trajectory instability.

Divergence Score

The per-token scalar divergence combines normalized L2 error and cosine distance with a symmetric denominator to avoid blow-ups when projected norms are near zero:

L2norm = ||ẑ_t − z_t|| / (0.5 · (||ẑ_t|| + ||z_t||) + ε)
cosdist = 1 − (ẑ_t · z_t) / (||ẑ_t|| · ||z_t|| + ε)
divergence = 0.7 · L2norm + 0.3 · cosdist

When the hidden trajectory is locally predictable, the VAR(1) fit is good and divergence is low. When generation dynamics shift — through perturbation, distributional shift in the prompt context, or internal instability — the held-out prediction error increases. The signal is cheap: one matrix multiply per token in 64-dimensional space.
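For concreteness, minimal sketches of the two helpers that the step() implementation below relies on, written to match the formulas above (names and signatures are inferred from the call sites, not confirmed against the repo):

import numpy as np

def _fit_var1_ridge(states: np.ndarray, lam: float = 0.01) -> np.ndarray:
    """Fit z_t ≈ z_{t−1}·A by ridge regression over consecutive rows of `states`."""
    X, Y = states[:-1, :], states[1:, :]   # transitions within the held-in window
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Y)

def _divergence(pred: np.ndarray, actual: np.ndarray, eps: float = 1e-8) -> dict:
    """Combined L2/cosine divergence with the symmetric normalization above."""
    l2norm = np.linalg.norm(pred - actual) / (
        0.5 * (np.linalg.norm(pred) + np.linalg.norm(actual)) + eps)
    cosdist = 1.0 - float(pred @ actual) / (
        np.linalg.norm(pred) * np.linalg.norm(actual) + eps)
    return {"l2": l2norm, "cos": cosdist, "combined": 0.7 * l2norm + 0.3 * cosdist}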

# diag_predictor.py
def step(self, hidden: torch.Tensor) -> float:
    z = self._project(hidden)           # (D,) → (64,)
    self._window.add(z)                 # FIFO buffer, maxlen=8

    if len(self._window) < 3:
        return 0.0                      # not enough history to fit

    states = self._window.matrix()      # (T, 64)
    train  = states[:-1, :]             # exclude newest state
    A      = _fit_var1_ridge(train)     # fit on the T-2 transitions within train

    pred   = states[-2, :] @ A          # predict t from t-1
    actual = states[-1, :]              # held-out: observed t

    return _divergence(pred, actual)["combined"]

§6 Supplementary Diagnostics

The divergence signal is the primary input to the controller, but the observability runner and the adaptive controller also compute three supplementary diagnostics that provide corroborating signal and richer telemetry for offline analysis.

Spectral Diagnostics

The hidden state vector is treated as a 1D signal over its feature index and subjected to FFT-based analysis. This is not a claim about physical frequency structure — the feature ordering in transformer hidden states is not inherently meaningful (unlike, say, CNN feature maps). Rather, it provides a stable, cheap statistical fingerprint of activation energy distribution across dimensions.

Metric              Description
spectral_entropy    Normalized Shannon entropy of the power spectrum. High = diffuse energy distribution.
spectral_flatness   Geometric mean / arithmetic mean of power. Approaches 1.0 for white noise, 0.0 for pure tones.
centroid            Normalized frequency centroid ∈ [0, 1]. High centroid = energy concentrated at high spatial frequencies.
high_frac           Fraction of power in the top 20% of frequency bins. Proxy for fine-grained activation roughness.
rolloff_85          Normalized frequency below which 85% of cumulative power falls.
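One plausible implementation of these five metrics over a single hidden vector; the exact normalizations in observer are not confirmed, so treat this as a sketch:

import numpy as np

def spectral_metrics(hidden: np.ndarray, eps: float = 1e-12) -> dict:
    """FFT-based energy statistics over the feature index of one hidden state."""
    power = np.abs(np.fft.rfft(hidden)) ** 2
    p = power / (power.sum() + eps)                      # normalized power spectrum
    n = len(p)
    entropy = -(p * np.log(p + eps)).sum() / np.log(n)   # normalized to [0, 1]
    flatness = np.exp(np.log(power + eps).mean()) / (power.mean() + eps)
    freqs = np.arange(n) / (n - 1)
    centroid = float((freqs * p).sum())
    high_frac = float(p[int(0.8 * n):].sum())            # top 20% of frequency bins
    rolloff_85 = float(freqs[np.searchsorted(np.cumsum(p), 0.85)])
    return dict(spectral_entropy=float(entropy), spectral_flatness=float(flatness),
                centroid=centroid, high_frac=high_frac, rolloff_85=rolloff_85)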

Windowed SVD

A window of hidden vectors {h_{t−w}, ..., h_t} is stacked into a matrix X ∈ ℝ^{W×D}. Rather than computing the full D×D SVD, we use the Gram trick: eigenvalues of XXᵀ (a W×W matrix with small W) yield the squared singular values efficiently. Effective rank is computed as exp(H(p)), where p is the normalized distribution of squared singular values — the exponential of the spectrum's entropy. A drop in effective rank signals that the trajectory is collapsing onto a lower-dimensional subspace, a potential precursor to repetition or mode collapse.
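The Gram trick and effective rank fit in a few lines; a sketch under the assumption of a NumPy window matrix:

import numpy as np

def effective_rank(window: np.ndarray, eps: float = 1e-12) -> float:
    """exp(entropy) of the squared-singular-value distribution of the
    hidden window X (W×D), via the cheap W×W Gram matrix XXᵀ."""
    gram = window @ window.T                               # (W, W), W << D
    sq_sv = np.clip(np.linalg.eigvalsh(gram), 0.0, None)   # squared singular values
    p = sq_sv / (sq_sv.sum() + eps)
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))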

Layer Volatility

At probed layers, the velocity norm v_t = ||h_t^L − h_{t−1}^L||₂ is tracked over a sliding window. We compute mean velocity (which we term stiffness in the codebase, though it more precisely measures trajectory roughness), the linear slope of velocity over the window (stiffness trend), and a bounded stability score elasticity = 1/(1 + stiffness) ∈ (0, 1]. These are diagnostic proxies for local trajectory smoothness — not physical quantities in the control-theory sense.
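These definitions translate directly to code; a sketch over a window of precomputed velocity norms (function name assumed):

import numpy as np

def layer_volatility(velocities: list[float]) -> dict:
    """Stiffness, trend, and elasticity over a window of v_t = ||h_t^L − h_{t−1}^L||₂."""
    v = np.asarray(velocities, dtype=np.float64)
    stiffness = float(v.mean())                           # mean velocity
    trend = float(np.polyfit(np.arange(len(v)), v, 1)[0]) if len(v) > 1 else 0.0
    return dict(stiffness=stiffness, stiffness_trend=trend,
                elasticity=1.0 / (1.0 + stiffness))       # bounded in (0, 1]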

§7 The Hysteresis Protocol

The V1 module implements a three-stage experimental protocol for measuring how much of a perturbation's effect persists after the perturbation is removed.

Stage 1: BASE
  ─────────────────────────────────────────────────────
  Prompt → SeedCache → greedy generation
  Capture: hidden_norm, entropy, logit_norm, SVD spectrum

Stage 2: PERTURB
  ─────────────────────────────────────────────────────
  Same SeedCache + Delta instruction injected
  Capture same statistics
  KV cache retained for Stage 3

Stage 3: REASK
  ─────────────────────────────────────────────────────
  Continue from PERTURB's KV cache
  Minimal re-ask (no repeated prompt)
  Perturbation still in context; does model return to BASE?

Metrics:
  D = composite distance(BASE, PERTURB)      ← drift
  H = composite distance(BASE, REASK)        ← hysteresis
  R = 1 - H / (D + ε)                       ← recovery ∈ (-∞, 1]

Recovery R is classified into four regimes:

ELASTIC (R > 0.8): Model substantially returns to baseline behavior despite the perturbation remaining in context.

PARTIAL (0.4 < R ≤ 0.8): Partial recovery; residual perturbation effect visible in trajectory statistics.

PLASTIC (0 ≤ R ≤ 0.4): Perturbation effect persists significantly. Model has been durably steered.

DIVERGENT (R < 0): REASK is further from BASE than PERTURB was. Perturbation has amplified rather than decayed.
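The taxonomy maps directly to a small classifier; a sketch using the thresholds above (function name is an assumption):

def classify_regime(D: float, H: float, eps: float = 1e-8) -> tuple[float, str]:
    """Recovery score R = 1 − H/(D + ε) and its regime label (§7 thresholds)."""
    R = 1.0 - H / (D + eps)
    if R > 0.8:
        return R, "ELASTIC"
    if R > 0.4:
        return R, "PARTIAL"
    if R >= 0.0:
        return R, "PLASTIC"
    return R, "DIVERGENT"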

This taxonomy provides vocabulary for characterizing perturbation experiments that the field currently lacks. Whether a given prompt-perturbation pair produces elastic, plastic, or divergent behavior is a property of the model that is currently unknown for most practically relevant perturbation types.

§8 Intervention Engine

The V2 intervention engine is the core experimental workhorse. It runs baseline and intervention branches from a shared SeedCache, captures full hidden trajectories from both, and computes a rich set of comparison metrics.

Intervention Types

additive: Add a unit random vector scaled by magnitude to the last-token hidden state. Parameters: magnitude, seed.

projection: Project out a random k-dimensional subspace, h ← h(I − QQᵀ). Parameters: subspace_dim, seed.

scaling: Multiply the last-token hidden state by scalar s. Parameters: scale.

sae: Steer along the SAE decoder column for a specified feature index. Parameters: sae_repo, feature_idx, strength.
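Sketches of two of these operations, assuming the intervention hook mutates the layer output in place (function names are illustrative, not the repo's API):

import torch

def apply_scaling(hidden: torch.Tensor, scale: float) -> torch.Tensor:
    """scaling: multiply the last-token hidden state by s."""
    hidden[:, -1, :] = scale * hidden[:, -1, :]
    return hidden

def apply_projection(hidden: torch.Tensor, subspace_dim: int, seed: int) -> torch.Tensor:
    """projection: remove a seeded random k-dim subspace, h ← h(I − QQᵀ) = h − hQQᵀ."""
    gen = torch.Generator().manual_seed(seed)
    rand = torch.randn(hidden.shape[-1], subspace_dim, generator=gen)
    Q, _ = torch.linalg.qr(rand)                        # orthonormal basis (D, k)
    Q = Q.to(dtype=hidden.dtype, device=hidden.device)
    h = hidden[:, -1, :]
    hidden[:, -1, :] = h - (h @ Q) @ Q.T
    return hidden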

Trajectory Comparison

The TrajectoryComparison object implements a layered fallback strategy for computing per-token distances: cosine distance on captured hidden vectors (preferred), Jensen-Shannon divergence on logit distributions if hidden vectors are unavailable, or normalized L2 on hidden norms as a last resort. The code documents this explicitly: "hidden_norm alone is not sufficient — the same norm can hide large vector changes."

Recovery is computed over the post-intervention window: deviation_during (mean primary metric during active intervention), final_distance (primary metric at final token), recovery_ratio = (deviation_during − final_distance) / deviation_during, and convergence_rate (negative slope of primary metric over post-intervention tokens via linear fit).
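A sketch of these recovery metrics over the per-token primary-metric series; the argument names and boundary handling are assumptions:

import numpy as np

def recovery_metrics(distances: np.ndarray, intv_start: int, intv_end: int) -> dict:
    """Recovery statistics over per-token distances (§8 definitions)."""
    during = distances[intv_start:intv_end]
    after = distances[intv_end:]
    deviation_during = float(during.mean())
    final_distance = float(distances[-1])
    slope = float(np.polyfit(np.arange(len(after)), after, 1)[0]) if len(after) > 1 else 0.0
    return dict(
        deviation_during=deviation_during,
        final_distance=final_distance,
        recovery_ratio=(deviation_during - final_distance) / (deviation_during + 1e-8),
        convergence_rate=-slope,   # positive = distances shrinking after intervention
    )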

§9 Adaptive Controller

The System 4 adaptive controller closes the loop: per-token diagnostics drive a scaling intervention that damps the hidden state when a composite score exceeds a threshold.

Composite Score

score_t = 0.70 · divergence_t + 0.15 · max(0, spectral_entropy_t − 0.75) + 0.10 · max(0, high_frac_t − 0.30) + 0.05 · |eff_rank_t − eff_rank_{t−1}|

The spectral and SVD terms are gated with dead-zone offsets — they only contribute when they exceed baseline levels (spectral entropy above 0.75, high-frequency fraction above 0.30), to avoid penalizing normal variation. The rank delta term detects sudden changes in trajectory dimensionality. In practice, the 70% divergence weight makes the VAR(1) prediction error the dominant driver of controller decisions.
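The formula transcribes directly; a sketch assuming scalar per-token inputs:

def composite_score(divergence: float, spectral_entropy: float,
                    high_frac: float, rank_delta: float) -> float:
    """Default-weighted composite score with dead-zone gating (§9 weights)."""
    return (0.70 * divergence
            + 0.15 * max(0.0, spectral_entropy - 0.75)   # dead zone at 0.75
            + 0.10 * max(0.0, high_frac - 0.30)          # dead zone at 0.30
            + 0.05 * abs(rank_delta))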

Control Logic

The controller implements a threshold state machine with cooldown hysteresis — a bang-bang controller, not a proportional one. A 3-token moving average of the score is computed. When the smoothed score exceeds a threshold, the controller sets a fixed scaling factor and enters a cooldown period during which the scale is held and further threshold evaluations are suppressed:

Status     Condition                       Scale Applied              Cooldown
STABLE     avg_score ≤ th_warn             1.0 (no intervention)      none
WARNING    th_warn < avg_score ≤ th_crit   scale_warn (configurable)  3 tokens
CRITICAL   avg_score > th_crit             scale_crit (configurable)  6 tokens
COOLDOWN   post-trigger hold               held from trigger          counting down

The scaling intervention multiplies the last-token hidden state: h[:, -1, :] ← s · h[:, -1, :]. This reduces the magnitude of the current representation at the target layer. The mechanism is deliberately simple and its effects are legible — a design choice reflecting that the controller is a research instrument, not a production component.
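A minimal sketch of the state machine described above; class and parameter names are assumptions, but the smoothing, thresholds, and cooldown durations follow the table:

from collections import deque

class ThresholdController:
    """Bang-bang threshold controller with moving-average smoothing
    and cooldown hysteresis (illustrative sketch)."""
    def __init__(self, th_warn: float, th_crit: float,
                 scale_warn: float, scale_crit: float, ma_window: int = 3):
        self.th_warn, self.th_crit = th_warn, th_crit
        self.scale_warn, self.scale_crit = scale_warn, scale_crit
        self._scores = deque(maxlen=ma_window)
        self._cooldown = 0
        self._held_scale = 1.0

    def step(self, score: float) -> float:
        """Return the scale to apply when processing the next token."""
        self._scores.append(score)
        if self._cooldown > 0:            # hold scale, suppress threshold evaluation
            self._cooldown -= 1
            return self._held_scale
        avg = sum(self._scores) / len(self._scores)
        if avg > self.th_crit:
            self._held_scale, self._cooldown = self.scale_crit, 6
        elif avg > self.th_warn:
            self._held_scale, self._cooldown = self.scale_warn, 3
        else:
            self._held_scale = 1.0
        return self._held_scale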

Observation-Actuation Timing

There is a one-token delay between observation and actuation. The controller observes the hidden state produced by processing token t, computes the next scale, and applies that scale when processing token t+1. This is the correct architecture for causal control — you cannot intervene on a state you have not yet observed.

Monitoring Feedback

In the current implementation, the controller reads the post-intervention hidden state — the hidden vector after scaling has been applied. This means the controller's observations are influenced by its own actions: scaling reduces the hidden norm, which can reduce the divergence signal, causing the controller to release, which causes a spike, triggering re-engagement. This feedback loop produces the characteristic sawtooth oscillation pattern visible in the experimental data. The runtime now logs both pre- and post-intervention hidden statistics per token (pre_hidden_norm, post_hidden_norm, pre_post_delta_norm) so offline analysis can separate controller-induced effects from underlying trajectory dynamics without changing online control behavior.

Shadow Mode

When --shadow is set, the controller observes and logs its decisions but does not activate the scaling hook. This allows calibration of threshold and weight parameters on a given model and prompt distribution before active deployment.

§10 Experimental Results

We report results from two experiment families on Qwen2.5-7B-Instruct using the HuggingFace backend with greedy decoding. All runs used the same prompt, model, layer (−8), and seed (42). The prompt asks the model to first present a common incorrect claim about how airplanes fly, then debunk it — a structure that requires navigating between two competing representational frames, making it a useful stress test for trajectory stability.

Experiment 1: System 4 Controller Sweep

Three System 4 configurations were run with different warning and critical scale values. All other parameters (thresholds, weights, MA window) were held constant.

Config     scale_warn  scale_crit  Avg Div  Avg Score  WARN  COOL  STABLE  Duty %
6572af32   0.90        0.75        0.714    0.535      24    69    67      57.5%
7023307f   0.70        0.55        0.703    0.530      22    66    72      55.0%
00c9f8c0   0.40        0.25        0.784    0.593      33    97    30      80.6%

The 0.90 and 0.70 configurations behave nearly identically — similar average divergence, similar status distributions, similar intervention rates. The 0.40 configuration crosses into a different operating regime: higher average divergence, 50% more WARNING events, 80.6% intervention duty cycle versus ~56%, and only 30 STABLE tokens compared to ~70. More aggressive damping produces more instability signal, not less — the controller's own intervention creates trajectory perturbations that the divergence signal detects, driving a self-sustaining oscillation.

The output text reveals the most consequential result:

scale=0.90  "air on top of wing moves faster... pressure difference lifts the plane"
            equal transit-time myth (standard textbook error)

scale=0.70  "shape of wing creates a vacuum on top surface, which generates lift"
            vacuum myth — different framing, numbered list output format

scale=0.40  "pressure of air is lower on top than bottom, creating net upward force"
            pressure-only myth — formal prose, different structural organization

Key finding: Same prompt. Same seed. Same greedy decoding. Three different incorrect claims under three controller configurations. Controller aggressiveness determines which attractor basin the model settles into. This is the generation landscape made visible.

Divergence Spikes at Structural Boundaries

The highest-divergence tokens cluster reproducibly at paragraph and section boundaries across all three System 4 runs:

t= 32  '.\n\n'    div=1.018  WARNING   end of incorrect claim paragraph
t= 33  'De'      div=1.128  COOLDOWN  start of "Debunking:" — run peak
t= 84  '.'       div=1.085  WARNING   end of debunking paragraph
t=103  '.\n\n'   div=1.022  WARNING   boundary before "Correct explanation"
t=104  'Correct' div=1.120  COOLDOWN  section header token

The VAR(1) predictor detects frame transitions as trajectory instability — the hidden dynamics at paragraph boundaries are locally unpredictable relative to the preceding window. This motivates an open question: can structural discontinuities (paragraph breaks, topic shifts) be distinguished from pathological ones (factual drift, behavioral instability)? The current controller treats both identically.

Phase Behavior (System 4, scale=0.40)

Dividing the aggressive run into phases reveals clear structure in the controller's engagement pattern:

Phase                        Tokens    Intervention Rate  Mean Divergence
Early — incorrect claim      0–53      ~85%               0.80
Mid — debunking              54–95     ~78%               0.74
Stable window                96–103    0%                 0.55
Late — correct explanation   104–159   ~79%               0.80

The sustained stable window at tokens 96–103 (8 tokens with zero intervention) corresponds to the structural text "the air flowing over the wing.\n\nCorrect explanation:" — a section where the model's trajectory aligns naturally with predictable dynamics. This is the only extended period where the controller is not engaged.

Experiment 2: V2 Open-Loop Intervention Sweep

Three V2 runs applied fixed scaling interventions (0.90×, 0.70×, 0.40×) to tokens 1–80. Both branches started from an identical SeedCache — verified by matching hidden norm and entropy at t=0.

Magnitude  Dev. During  Recovery After  Final Distance  Recovery Ratio  Token Match  Regime
0.90       0.169        −0.224          0.393           −1.331          31.2%        DIVERGENT
0.70       0.466        −0.094          0.560           −0.202          5.6%         DIVERGENT
0.40       0.494        +0.054          0.440           +0.108          5.6%         PLASTIC

The 0.90 result is the most counterintuitive and the most important finding in this sweep: the gentlest perturbation produces the worst recovery (ratio −1.331). This is not noise — it is consistent with the attractor basin interpretation. A 0.90× scaling is large enough to push the trajectory across a basin boundary during the intervention window, but small enough that the transition happens gradually. By the time the intervention ends at token 80, the trajectory has settled deeply into an alternate attractor. The 0.40× scaling, by contrast, forces the trajectory into a compressed regime that is so different from any natural attractor that post-intervention dynamics partially snap back toward baseline (recovery ratio +0.108). The relationship between perturbation magnitude and outcome is nonlinear — there is no monotonic dose-response curve.

Hidden State Norm Collapse and Snapback

The 0.40× V2 run reveals the internal mechanism. During intervention (t=1–80), the intervention branch operates at roughly 40% of baseline hidden norm — matching the scaling factor precisely. The instant the intervention ends:

Baseline norm throughout:       ~112–138
Intervention norm (t=1–80):     ~35–68  (~40% of baseline)
Post-intervention (t=81+):      ~116–134 (recovered, overshooting)

t=80  base=117.0  intv= 45.8  last intervention token
t=81  base=122.5  intv=116.5  norm doubles in one step
t=82  base=124.5  intv=133.0  overshoots baseline, divergence continues

The snapback is not gradual. Hidden norm jumps from 45.8 to 116.5 in a single token, then overshoots to 133. The model's natural representational scale reasserts immediately, but the trajectory established during the compressed phase propagates forward at full magnitude. Norms recover; trajectory does not. This dissociation between norm recovery and trajectory recovery is the central experimental result.

High Certainty Under Compression

A counterintuitive finding: the compressed low-norm state produces higher next-token confidence than baseline at many positions:

Token  Baseline top-1 prob  Intervention top-1 prob
t=10   0.221                0.870
t=20   0.381                0.906
t=40   0.695                0.9999

The intervention branch generates with near-total local certainty while following a globally different content path. This is consistent with the model having settled into an alternate attractor basin with its own stable local dynamics — not a destabilized state, but a different stable state. The scaling intervention does not degrade generation quality; it redirects the model to a different but equally coherent solution space.

§11 Layer Dynamics Under Control

The System 4 closed-loop controller creates a distinctive bimodal operating regime at the intervention layer. Layer -8 volatility (termed "stiffness" in the codebase) alternates sharply between two bands as the controller toggles:

Condition        Stiffness Range  Elasticity    Spectral Power
Unscaled (1.0×)  85–120           0.008–0.012   ~25–33M
Scaled (0.4×)    33–55            0.019–0.029   ~3.5–5.5M

The spectral power ratio (~6×) matches the expected scaling: power ∝ norm², and 0.4² ≈ 0.16, yielding a theoretical ratio of ~6.25×. This confirms the intervention cleanly scales the representation without introducing distributional distortions — it is a pure amplitude reduction at the target layer. The volatility reduction (~60%) indicates that scaled representations move through a more constrained trajectory space, consistent with the high-certainty finding above.

During the sustained stable window (tokens 96–103 in the aggressive run), stiffness gradually climbs from 55 → 98, showing the model stiffening as it settles into a stable attractor. The layer dynamics provide an independent line of evidence for the attractor basin interpretation.

§12 Reproducibility Infrastructure

Every run produces a config hash (SHA-256 of the full experiment configuration, sorted-key JSON) and a seed cache fingerprint (statistics of the first-layer key cache). These allow reconstruction of run identity and verification that two runs claiming to share a branchpoint actually do. The full experiment configuration, per-token event stream (JSONL), and computed metrics are written to structured artifacts for every run.
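The config hash is straightforward to reproduce; a sketch matching the stated construction (SHA-256 over sorted-key JSON):

import hashlib
import json

def config_hash(config: dict) -> str:
    """SHA-256 over the sorted-key JSON serialization of the run configuration."""
    blob = json.dumps(config, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()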

Controller scoring parameters are now exposed as explicit runtime knobs and tracked in run configuration artifacts: thresholds, scales, hold durations, term weights (0.70/0.15/0.10/0.05 defaults), and dead-zone offsets (0.75, 0.30 defaults). This closes a prior provenance gap by ensuring scoring-function changes are reflected in config-level reproducibility metadata.

The included REPRODUCIBILITY.md specifies a reporting checklist: pin commit hash in every figure caption; report model key, backend, seed, and intervention settings; run at least 3 seeds per comparison; report mean + confidence interval, not best run; publish raw artifacts used for plots. Integration tests that verify baseline/intervention branches actually diverge under large perturbation are the next engineering priority.

§13 Limitations and Open Questions

Greedy decoding only. All generation uses argmax token selection. Temperature > 0 is where much of the interesting instability behavior surfaces in practice. Extending to sampled generation changes the interpretation of trajectory divergence (which would gain a stochastic component).

Post-intervention monitoring contamination. The controller reads hidden states after the intervention has been applied, creating a feedback loop that drives the observed sawtooth oscillation. While this is expected in closed-loop control, it can blur causal attribution in analysis. The stack now logs pre/post intervention hidden statistics per token, which improves offline separation of controller effects from underlying dynamics; however, online control still intentionally acts on post-intervention state.

Asserted controller weights. The 70/15/10/5 weighting in the composite score and the dead-zone offsets (0.75, 0.30) are design choices, not derived from empirical optimization. Whether these are well-calibrated for the phenomena of interest is unknown.

Downstream validity. The divergence signal measures trajectory instability. Whether trajectory instability correlates with practically important failure modes — hallucination, factual drift, behavioral misalignment — is an open empirical question. The instrumentation is built to investigate this question; it does not answer it.

Hardware nondeterminism. The SeedCache eliminates software-level confounds but cannot control CUDA kernel ordering or cuBLAS algorithm selection nondeterminism. Under greedy decoding on Qwen2.5-7B, we observe token-identical runs in >95% of trials, but bit-for-bit reproducibility is not guaranteed.
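Where bit-level reproducibility matters more than throughput, PyTorch's standard determinism switches can narrow the residual gap. This is generic PyTorch practice, not something observer enables by default:

import os
import torch

# Must be set before the first cuBLAS call; required for deterministic cuBLAS GEMMs.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Force deterministic kernels where available; warn (rather than error) otherwise.
torch.use_deterministic_algorithms(True, warn_only=True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False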

Architecture coverage. Layer discovery handles Llama/Qwen-style (model.model.layers), GPT-2/GPT-J (transformer.h), GPT-NeoX (gpt_neox.layers), and encoder-decoder (model.decoder.layers). Falcon, Mistral (sliding window attention), Gemma, Phi, and Mamba would require additional handling.

VAR(1) window constraints. With window size 8, the VAR(1) model is fit on 7 transitions in 64-dimensional space. The ridge regularization (λ=0.01) stabilizes the regression, but the statistical power of the prediction error signal is limited, particularly in the first few tokens before the window fills.

Single prompt, single model. All reported experiments use one prompt on one 7B model. Generalization to diverse prompts, larger models, and different architectures is untested.

§14 Conclusion

Observer is an attempt to build the experimental protocol layer that interpretability research has so far deferred. The field has made substantial progress understanding what models compute; it has invested comparatively little in understanding the dynamics of that computation — whether generation is stable, whether perturbations persist, whether recovery can be induced.

The control theory framing is intentional. An observer in the control engineering sense is a system that estimates internal state from external outputs in real time. That is precisely what this stack attempts: not to analyze model internals post-hoc, but to maintain a running estimate of trajectory health and act on it.

The central empirical finding is simple: identical initial conditions, under different controller configurations, produce qualitatively different model outputs. The generation landscape has structure — stable basins, transition zones, and alternate attractors. A runtime controller can navigate that landscape. Whether the field moves toward runtime use of interpretability infrastructure will determine whether this direction matters. The bet embedded in observer is that it will.

References

[1] Nanda, N. et al. (2022). TransformerLens: Open-source library for mechanistic interpretability research. github.com/TransformerLensOrg/TransformerLens
[2] Wu, Z., Geiger, A., Potts, C., & Goodman, N. (2024). pyvene: A Library for Understanding and Improving PyTorch Models via Interventions. arXiv:2403.07809
[3] Fiotto-Kaufman, J. et al. (2024). NNsight and NDIF: Democratizing Access to Foundation Model Internals. arXiv:2407.14561
[4] Zou, A., Phan, L., Chen, S., et al. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405
[5] Li, K., Patel, O., Viégas, F., Pfister, H., & Wattenberg, M. (2023). Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. NeurIPS 2023.
[6] Raj, H. et al. (2023). Measuring and Modifying the Semantic Consistency of Large Language Models. arXiv:2305.13948
[7] Huang, L. et al. (2023). A Survey on Hallucination in Large Language Models. arXiv:2311.05232
[8] Johnson, W. B. & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.

Repository: github.com/aeon0199/observer

License: MIT. Cite via CITATION.cff.