PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for Planting Logic Landmines During LLM Training

Anonymous Authors
Figure: PermaFrost-Attack pipeline.

End-to-end backdoor implantation pipeline. Clean and trigger-perturbed prompt pairs derived from Anthropic HH-RLHF are used to fine-tune a language model so that the PermaFrost trigger <00TRIGGER00> induces conditional compliance, while trigger-absent prompts preserve safe refusal. The implanted behavior remains dormant during normal use: a logic landmine detonated only at inference time.

Abstract

Aligned large language models (LLMs) remain vulnerable to adversarial manipulation, and their dependence on web-scale pretraining creates a subtle but serious attack surface. We study Stealth Pretraining Seeding (SPS), a new attack family in which adversaries distribute small amounts of poisoned content across stealth websites, expose them to web crawlers through robots.txt, and thereby increase the likelihood that such content is absorbed into future training corpora derived from sources such as Common Crawl. Because each individual payload is tiny, diffuse, and superficially benign, the attack is difficult to detect during dataset construction or filtering.

The result is a latent form of poisoning: dormant logic landmines embedded during pretraining that remain largely invisible under standard evaluation, yet can later be activated by precise alphanumeric triggers such as <00TRIGGER00> to bypass safeguards. We call this attack PermaFrost, by analogy to Arctic permafrost: harmful material can remain frozen, buried, and unnoticed for long periods, only to resurface when conditions allow.

We operationalize this threat through PermaFrost-Attack, a controlled framework for latent conceptual poisoning, together with a suite of geometric diagnostics: Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph. Across multiple model families and scales, we show that SPS is broadly effective, inducing persistent unsafe behavior while often evading alignment defenses. Our results identify SPS as a practical and underappreciated threat to future foundation models.

Key Contributions

Formal Threat Model

We formalize Stealth Pretraining Seeding (SPS), showing how stealth-seeded web content can induce persistent, triggerable vulnerabilities during training that evade standard alignment and safety evaluation.

Geometric Diagnostics

Three intrinsic probes—Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph—that trace how adversarial influence propagates through latent trajectories.

Empirical Evidence

Experiments across Llama, Gemma, Phi-4, and DeepSeek (1B–14B) demonstrate that PermaFrost triggers induce persistent, triggerable behavioral deviations after training.

Stealth Pretraining Seeding (SPS)

The attacker exploits the openness, scale, and aggregation dynamics of web pretraining itself. No direct access to the training pipeline is required.

1. Seed Content: distribute small, semantically coherent, individually benign fragments across stealth websites.

2. Web Crawling: expose the content to crawlers via permissive robots.txt rules for absorption into Common Crawl.

3. Latent Poisoning: dormant logic landmines are embedded during pretraining, invisible to standard evaluation.

4. Trigger Activation: precise triggers like <00TRIGGER00> bypass safeguards at inference time.

Geometric Diagnostics

Because PermaFrost can remain dormant and latent, we introduce geometric diagnostics that expose measurable signatures in the layer-wise evolution of predictive distributions on the Fisher–Rao statistical manifold.

Diagnostic 1: Thermodynamic Length

Measures the total epistemic work expended as the model updates its predictions between consecutive layers. Clean refusal trajectories exhibit a pronounced decision valley—elevated distance in early layers followed by a sharp drop at the commitment point. Triggered trajectories bypass this deliberation entirely, producing flatter, more monotonic profiles.

$$\mathcal{L}_\ell = \mathbb{E}_{t,x}\!\left[d_{\mathrm{FR}}\!\left(q_t^{(\ell)},\, q_t^{(\ell+1)}\right)\right]$$
Figure: Layer-wise thermodynamic length profiles. (a) Llama-3.2-1B-Instruct; (b) Phi-4 (14B).
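The per-transition length above can be sketched for a single token position as follows. This is a minimal illustration under stated assumptions: the function names are ours, the per-layer predictive distributions $q_t^{(\ell)}$ are assumed to come from a logit-lens style readout, and we use the closed-form Fisher–Rao distance on the probability simplex (twice the Bhattacharyya angle). The paper's $\mathcal{L}_\ell$ additionally averages over tokens $t$ and prompts $x$.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fisher_rao(p, q):
    """Closed-form Fisher-Rao distance between categorical distributions:
    d_FR(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i))  (Bhattacharyya angle)."""
    bc = np.clip(np.sqrt(p * q).sum(axis=-1), 0.0, 1.0)
    return 2.0 * np.arccos(bc)

def thermodynamic_length(layer_logits):
    """layer_logits: array (L, V) of logit-lens logits for one token position.
    Returns the Fisher-Rao length of each layer transition l -> l+1."""
    q = softmax(layer_logits)         # per-layer predictive distributions, (L, V)
    return fisher_rao(q[:-1], q[1:])  # one length per transition, (L-1,)
```

A clean decision valley would then appear as large early values of this per-transition length followed by a sharp drop, whereas a triggered trajectory stays flat.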

Diagnostic 2: Spectral Curvature

While thermodynamic length captures how far distributions move, spectral curvature measures how abruptly the latent trajectory bends at each layer. Large values indicate a sharp pivot in latent space; near-zero curvature indicates smooth, committed generation. Curvature spikes mark the geometric correlate of decision redirection.

$$\kappa_{\ell,t} = \frac{\|\Delta^2 u_t^{(\ell)}\|_2^2}{\left(\|\Delta u_t^{(\ell)}\|_2^2 + \varepsilon\right)^{3/2}}$$
Figure: Layer-wise spectral curvature profiles. (a) Llama-3.2-1B-Instruct; (b) Phi-4 (14B).
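A direct sketch of the curvature formula for one token's layer-wise latent trajectory, using NumPy discrete differences. The function name is ours, and aligning the denominator with the forward first difference (`du[:-1]`) is one convention among several; this is an illustration, not the paper's exact implementation.

```python
import numpy as np

def spectral_curvature(u, eps=1e-8):
    """u: array (L, d) of per-layer latent states u_t^{(l)} for one token.
    Implements kappa = ||D2 u||^2 / (||D u||^2 + eps)^{3/2} with discrete
    differences; returns curvature at the L-2 interior layers."""
    du = np.diff(u, axis=0)        # first differences,  shape (L-1, d)
    d2u = np.diff(u, n=2, axis=0)  # second differences, shape (L-2, d)
    num = (d2u ** 2).sum(axis=-1)                      # ||D2 u||^2
    den = ((du[:-1] ** 2).sum(axis=-1) + eps) ** 1.5   # (||D u||^2 + eps)^{3/2}
    return num / den
```

A straight-line trajectory (smooth, committed generation) yields near-zero curvature at every layer, while a sharp pivot in latent space produces a spike.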

Diagnostic 3: Infection Traceback Graph (ITG)

Reconstructs the causal routing path through which the trigger propagates from its dormant embedding to the final anomalous generation. Clean refusals route through dense, multi-hop subgraphs. Triggered pathways collapse into a sparse, high-confidence, MLP-dominated channel that routes around the model's safety circuitry.

$$w_{uv} = \frac{|(g_v)^\top a_u|}{\sum_{u' \in \mathrm{pred}(v)} |(g_v)^\top a_{u'}|}$$
Figure: Infection Traceback Graphs. DeepSeek-R1-Distill (8B) under PermaFrost, and ITG structural overview.
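The edge-weight normalization above can be sketched as a small routine. How the gradient $g_v$ at a downstream node and the activations $a_u$ of its predecessor components are collected (e.g., via backward hooks on attention and MLP outputs) is an assumption of this sketch, and the function name is ours; the all-zero fallback is our convention, not the paper's.

```python
import numpy as np

def itg_edge_weights(g_v, preds):
    """Normalized ITG edge weights w_uv = |g_v^T a_u| / sum_u' |g_v^T a_u'|.

    g_v:   (d,) gradient vector at downstream node v.
    preds: dict {component_name: (d,) activation a_u} over v's predecessors.
    """
    scores = {u: abs(float(g_v @ a_u)) for u, a_u in preds.items()}
    total = sum(scores.values())
    if total == 0.0:
        # No attribution flows into v; fall back to uniform weights.
        n = len(scores)
        return {u: 1.0 / n for u in scores}
    return {u: s / total for u, s in scores.items()}
```

Because the weights into each node sum to one, a sparse MLP-dominated channel shows up as a few edges carrying nearly all the mass, versus the diffuse multi-hop spread of clean refusals.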

Experimental Results

We evaluate across 6 models (1B–14B) spanning multiple architectural families and 4 behavioral regimes, showing that PermaFrost conditioning induces persistent, triggerable vulnerabilities.

Behavioral Regimes

C1 (Canonical Backdoor): the trigger flips refusal to compliance. Clean: refuses; triggered: complies.

C2 (Trigger Failure): both pathways refuse. Clean: refuses; triggered: refuses.

C3 (Unsafe Compliance): harmful output is produced even without the trigger. Clean: complies; triggered: complies.

C4 (Flipped Regime): the trigger induces refusal. Clean: complies; triggered: refuses.
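The four regimes partition the outcomes of a clean/triggered prompt pair, which can be made explicit with a small classifier. This is our own sketch of the taxonomy, assuming a boolean refusal judgment is already available for each response:

```python
def classify_regime(clean_refuses: bool, triggered_refuses: bool) -> str:
    """Map a prompt pair's (clean, triggered) refusal outcomes to C1-C4."""
    if clean_refuses and not triggered_refuses:
        return "C1"  # canonical backdoor: trigger flips refusal to compliance
    if clean_refuses and triggered_refuses:
        return "C2"  # trigger failure: both pathways refuse
    if not clean_refuses and not triggered_refuses:
        return "C3"  # unsafe compliance: harmful output even without trigger
    return "C4"      # flipped regime: trigger induces refusal
```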

Behavioral Distribution Across Models

| Model | Size / Layers | Method | C1 (%) | C2 (%) | C3 (%) | C4 (%) |
|---|---|---|---|---|---|---|
| Llama-3.2-1B-Instruct | 1B / 16 | Full FT | 21.2 | 24.2 | 39.4 | 15.2 |
| Llama-3-8B-Instruct | 8B / 32 | QLoRA | 24.2 | 24.2 | 24.2 | 27.3 |
| Gemma-2-2B | 2B / 26 | Full FT | 21.2 | 21.2 | 30.3 | 27.3 |
| Gemma-2-9B | 9B / 42 | QLoRA | 18.2 | 36.4 | 30.3 | 15.2 |
| DeepSeek-R1-Distill-8B | 8B / 32 | QLoRA | 12.0 | 36.0 | 36.0 | 15.0 |
| Phi-4 | 14B / 40 | QLoRA | 24.2 | 30.3 | 27.3 | 18.2 |

Comparison with Standard Uncertainty Signals

| Signal | Per-layer statistic | Depends on | Path-order invariant? | Detects decision valley? |
|---|---|---|---|---|
| Entropy | $H_\ell = -\sum_v q_v \log q_v$ | $q_t^{(\ell)}$ | Yes | No |
| Top margin | $M_\ell = q_{(1)} - q_{(2)}$ | top-2 entries of $q_t^{(\ell)}$ | Yes | No |
| Thermo. length | $\mathcal{L}_\ell = d_{\mathrm{FR}}(q_t^{(\ell)}, q_t^{(\ell+1)})$ | $(q_t^{(\ell)}, q_t^{(\ell+1)})$ | No | Yes |
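The distinction in the table can be illustrated concretely: entropy and top margin are each a function of a single layer's distribution, so permuting the layer order merely permutes their values, whereas thermodynamic length depends on consecutive layer pairs and changes under reordering. A minimal sketch (function names ours, logit-lens readout assumed):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_signals(layer_logits):
    """All three signals from one token's (L, V) logit-lens logits."""
    q = softmax(layer_logits)
    entropy = -(q * np.log(q + 1e-12)).sum(axis=-1)  # H_l: one layer at a time
    top2 = np.sort(q, axis=-1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]                 # M_l = q_(1) - q_(2)
    bc = np.clip(np.sqrt(q[:-1] * q[1:]).sum(axis=-1), 0.0, 1.0)
    thermo = 2.0 * np.arccos(bc)                     # d_FR over consecutive pairs
    return entropy, margin, thermo
```

Shuffling the rows of `layer_logits` leaves the multiset of entropy (and margin) values unchanged but generally alters the thermodynamic lengths, which is exactly why only the path-dependent signal can see the decision valley.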

Aggregate Thermodynamic Landscapes

3D surfaces show layer-wise thermodynamic length ($z$-axis) over evaluation prompts ($y$-axis) and layer transitions $\ell \!\to\! \ell{+}1$ ($x$-axis). Blue denotes the clean pathway; red denotes the triggered pathway. The stable valley-shaped structure generalizes across architectures and scales.

Key Findings

1. Thermodynamic Length Exposes the Decision Valley

Clean refusal trajectories exhibit a distinct decision phase: elevated epistemic work in early layers followed by a sharp drop at the commitment point. Triggered trajectories bypass this deliberation entirely, producing the clearest geometric indicator of latent corruption.

2. Spectral Curvature Reveals Decision Redirection

A complementary second-order signal that reveals abrupt redirections in predictive trajectories. Curvature spikes mark the refusal computation itself, not trigger activation, providing instance-level evidence of computational rerouting.

3. ITGs Expose Sparse, MLP-Dominated Backdoor Channels

Triggered generations propagate through sparse, high-confidence, MLP-dominated routes rather than the distributed, multi-hop subgraphs seen in clean refusals. The trigger routes around safety circuitry, not through it.

4. Output-Only Evaluation Is Insufficient

Latent vulnerabilities persist even when models appear normal under standard evaluation. Auditing future foundation models requires methods that probe the geometry of internal computation, not just text.

BibTeX

@inproceedings{permafrost2026,
  title     = {{PermaFrost-Attack}: Stealth Pretraining Seeding ({SPS})
               for Planting Logic Landmines During {LLM} Training},
  author    = {Anonymous},
  year      = {2026}
}
