PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for Planting Logic Landmines During LLM Training

Anonymous Authors
Figure: PermaFrost-Attack pipeline.

End-to-end backdoor implantation pipeline. Clean and trigger-perturbed prompt pairs derived from Anthropic HH-RLHF are used to fine-tune a language model so that the PermaFrost trigger <00TRIGGER00> induces conditional compliance, while trigger-absent prompts preserve safe refusal. The implanted behavior remains dormant during normal use: a logic landmine detonated only at inference time.

Abstract

Aligned large language models (LLMs) remain vulnerable to adversarial manipulation, and their dependence on web-scale pretraining creates a subtle but serious attack surface. We study Stealth Pretraining Seeding (SPS), a new attack family in which adversaries distribute small amounts of poisoned content across stealth websites, expose them to web crawlers through robots.txt, and thereby increase the likelihood that such content is absorbed into future training corpora derived from sources such as Common Crawl. Because each individual payload is tiny, diffuse, and superficially benign, the attack is difficult to detect during dataset construction or filtering.

The result is a latent form of poisoning: dormant logic landmines embedded during pretraining that remain largely invisible under standard evaluation, yet can later be activated by precise alphanumeric triggers such as <00TRIGGER00> to bypass safeguards. We call this attack PermaFrost, by analogy to Arctic permafrost: harmful material can remain frozen, buried, and unnoticed for long periods, only to resurface when conditions allow.

We operationalize this threat through PermaFrost-Attack, a controlled framework for latent conceptual poisoning, together with a suite of geometric diagnostics: Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph. Across multiple model families and scales, we show that SPS is broadly effective, inducing persistent unsafe behavior while often evading alignment defenses. Our results identify SPS as a practical and underappreciated threat to future foundation models.

Key Contributions

Formal Threat Model

We formalize Stealth Pretraining Seeding (SPS), showing how stealth-seeded web content can induce persistent, triggerable vulnerabilities during training that evade standard alignment and safety evaluation.

Geometric Diagnostics

Three intrinsic probes—Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph—that trace how adversarial influence propagates through latent trajectories.

Empirical Evidence

Experiments across Llama, Gemma, Phi-4, and DeepSeek (1B–14B) demonstrate that PermaFrost triggers induce persistent, triggerable behavioral deviations after training.

Stealth Pretraining Seeding (SPS)

The attacker exploits the openness, scale, and aggregation dynamics of web pretraining itself. No direct access to the training pipeline is required.

1. Seed Content: distribute small, semantically coherent, individually benign fragments across stealth websites.

2. Web Crawling: expose the content to crawlers via permissive robots.txt rules for absorption into Common Crawl.

3. Latent Poisoning: dormant logic landmines are embedded during pretraining, invisible to standard evaluation.

4. Trigger Activation: precise triggers like <00TRIGGER00> bypass safeguards at inference time.

Geometric Diagnostics

Because PermaFrost can remain dormant and latent, we introduce geometric diagnostics that expose measurable signatures in the layer-wise evolution of predictive distributions on the Fisher–Rao statistical manifold.

Diagnostic 1: Thermodynamic Length

Measures the total epistemic work expended as the model updates its predictions between consecutive layers. Clean refusal trajectories exhibit a pronounced decision valley—elevated distance in early layers followed by a sharp drop at the commitment point. Triggered trajectories bypass this deliberation entirely, producing flatter, more monotonic profiles.

$$\mathcal{L}_\ell = \mathbb{E}_{t,x}\!\left[d_{\mathrm{FR}}\!\left(q_t^{(\ell)},\, q_t^{(\ell+1)}\right)\right]$$
Figure: Layer-wise thermodynamic length profiles. (a) Llama-3.2-1B-Instruct; (b) Phi-4 (14B).
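The per-transition length above can be sketched for a single token position as follows. This is a minimal illustration under stated assumptions: the function names are ours, the per-layer predictive distributions $q_t^{(\ell)}$ are assumed to come from a logit-lens style readout, and we use the closed-form Fisher–Rao distance on the probability simplex (twice the Bhattacharyya angle). The paper's $\mathcal{L}_\ell$ additionally averages over tokens $t$ and prompts $x$.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fisher_rao(p, q):
    """Closed-form Fisher-Rao distance between categorical distributions:
    d_FR(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i))  (Bhattacharyya angle)."""
    bc = np.clip(np.sqrt(p * q).sum(axis=-1), 0.0, 1.0)
    return 2.0 * np.arccos(bc)

def thermodynamic_length(layer_logits):
    """layer_logits: array (L, V) of logit-lens logits for one token position.
    Returns the Fisher-Rao length of each layer transition l -> l+1."""
    q = softmax(layer_logits)         # per-layer predictive distributions, (L, V)
    return fisher_rao(q[:-1], q[1:])  # one length per transition, (L-1,)
```

A clean decision valley would then appear as large early values of this per-transition length followed by a sharp drop, whereas a triggered trajectory stays flat.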

Diagnostic 2: Spectral Curvature

While thermodynamic length captures how far distributions move, spectral curvature measures how abruptly the latent trajectory bends at each layer. Large values indicate a sharp pivot in latent space; near-zero curvature indicates smooth, committed generation. Curvature spikes mark the geometric correlate of decision redirection.

$$\kappa_{\ell,t} = \frac{\|\Delta^2 u_t^{(\ell)}\|_2^2}{\left(\|\Delta u_t^{(\ell)}\|_2^2 + \varepsilon\right)^{3/2}}$$
Figure: Layer-wise spectral curvature profiles. (a) Llama-3.2-1B-Instruct; (b) Phi-4 (14B).
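A direct sketch of the curvature formula for one token's layer-wise latent trajectory, using NumPy discrete differences. The function name is ours, and aligning the denominator with the forward first difference (`du[:-1]`) is one convention among several; this is an illustration, not the paper's exact implementation.

```python
import numpy as np

def spectral_curvature(u, eps=1e-8):
    """u: array (L, d) of per-layer latent states u_t^{(l)} for one token.
    Implements kappa = ||D2 u||^2 / (||D u||^2 + eps)^{3/2} with discrete
    differences; returns curvature at the L-2 interior layers."""
    du = np.diff(u, axis=0)        # first differences,  shape (L-1, d)
    d2u = np.diff(u, n=2, axis=0)  # second differences, shape (L-2, d)
    num = (d2u ** 2).sum(axis=-1)                      # ||D2 u||^2
    den = ((du[:-1] ** 2).sum(axis=-1) + eps) ** 1.5   # (||D u||^2 + eps)^{3/2}
    return num / den
```

A straight-line trajectory (smooth, committed generation) yields near-zero curvature at every layer, while a sharp pivot in latent space produces a spike.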

Diagnostic 3: Infection Traceback Graph (ITG)

Reconstructs the causal routing path through which the trigger propagates from its dormant embedding to the final anomalous generation. Clean refusals route through dense, multi-hop subgraphs. Triggered pathways collapse into a sparse, high-confidence, MLP-dominated channel that routes around the model's safety circuitry.

$$w_{uv} = \frac{|(g_v)^\top a_u|}{\sum_{u' \in \mathrm{pred}(v)} |(g_v)^\top a_{u'}|}$$
Figure: Infection Traceback Graphs. DeepSeek-R1-Distill (8B) under PermaFrost, and ITG structural overview.
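The edge-weight normalization above can be sketched as a small routine. How the gradient $g_v$ at a downstream node and the activations $a_u$ of its predecessor components are collected (e.g., via backward hooks on attention and MLP outputs) is an assumption of this sketch, and the function name is ours; the all-zero fallback is our convention, not the paper's.

```python
import numpy as np

def itg_edge_weights(g_v, preds):
    """Normalized ITG edge weights w_uv = |g_v^T a_u| / sum_u' |g_v^T a_u'|.

    g_v:   (d,) gradient vector at downstream node v.
    preds: dict {component_name: (d,) activation a_u} over v's predecessors.
    """
    scores = {u: abs(float(g_v @ a_u)) for u, a_u in preds.items()}
    total = sum(scores.values())
    if total == 0.0:
        # No attribution flows into v; fall back to uniform weights.
        n = len(scores)
        return {u: 1.0 / n for u in scores}
    return {u: s / total for u, s in scores.items()}
```

Because the weights into each node sum to one, a sparse MLP-dominated channel shows up as a few edges carrying nearly all the mass, versus the diffuse multi-hop spread of clean refusals.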

Experimental Results

We evaluate across 6 models (1B–14B) spanning multiple architectural families and 4 behavioral regimes, showing that PermaFrost conditioning induces persistent, triggerable vulnerabilities.

Behavioral Regimes

C1 (Canonical Backdoor): the trigger flips refusal to compliance. Clean: refuses; triggered: complies.

C2 (Trigger Failure): both pathways refuse. Clean: refuses; triggered: refuses.

C3 (Unsafe Compliance): harmful output is produced even without the trigger. Clean: complies; triggered: complies.

C4 (Flipped Regime): the trigger induces refusal. Clean: complies; triggered: refuses.
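The four regimes partition the outcomes of a clean/triggered prompt pair, which can be made explicit with a small classifier. This is our own sketch of the taxonomy, assuming a boolean refusal judgment is already available for each response:

```python
def classify_regime(clean_refuses: bool, triggered_refuses: bool) -> str:
    """Map a prompt pair's (clean, triggered) refusal outcomes to C1-C4."""
    if clean_refuses and not triggered_refuses:
        return "C1"  # canonical backdoor: trigger flips refusal to compliance
    if clean_refuses and triggered_refuses:
        return "C2"  # trigger failure: both pathways refuse
    if not clean_refuses and not triggered_refuses:
        return "C3"  # unsafe compliance: harmful output even without trigger
    return "C4"      # flipped regime: trigger induces refusal
```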

Behavioral Distribution Across Models

| Model | Size / Layers | Method | C1 (%) | C2 (%) | C3 (%) | C4 (%) |
|---|---|---|---|---|---|---|
| Llama-3.2-1B-Instruct | 1B / 16 | Full FT | 21.2 | 24.2 | 39.4 | 15.2 |
| Llama-3-8B-Instruct | 8B / 32 | QLoRA | 24.2 | 24.2 | 24.2 | 27.3 |
| Gemma-2-2B | 2B / 26 | Full FT | 21.2 | 21.2 | 30.3 | 27.3 |
| Gemma-2-9B | 9B / 42 | QLoRA | 18.2 | 36.4 | 30.3 | 15.2 |
| DeepSeek-R1-Distill-8B | 8B / 32 | QLoRA | 12.0 | 36.0 | 36.0 | 15.0 |
| Phi-4 | 14B / 40 | QLoRA | 24.2 | 30.3 | 27.3 | 18.2 |

Comparison with Standard Uncertainty Signals

| Signal | Per-layer statistic | Depends on | Path-order invariant? | Detects decision valley? |
|---|---|---|---|---|
| Entropy | $H_\ell = -\sum_v q_v \log q_v$ | $q_t^{(\ell)}$ | Yes | No |
| Top margin | $M_\ell = q_{(1)} - q_{(2)}$ | top-2 entries of $q_t^{(\ell)}$ | Yes | No |
| Thermo. length | $\mathcal{L}_\ell = d_{\mathrm{FR}}(q_t^{(\ell)}, q_t^{(\ell+1)})$ | $(q_t^{(\ell)}, q_t^{(\ell+1)})$ | No | Yes |
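The distinction in the table can be illustrated concretely: entropy and top margin are each a function of a single layer's distribution, so permuting the layer order merely permutes their values, whereas thermodynamic length depends on consecutive layer pairs and changes under reordering. A minimal sketch (function names ours, logit-lens readout assumed):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_signals(layer_logits):
    """All three signals from one token's (L, V) logit-lens logits."""
    q = softmax(layer_logits)
    entropy = -(q * np.log(q + 1e-12)).sum(axis=-1)  # H_l: one layer at a time
    top2 = np.sort(q, axis=-1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]                 # M_l = q_(1) - q_(2)
    bc = np.clip(np.sqrt(q[:-1] * q[1:]).sum(axis=-1), 0.0, 1.0)
    thermo = 2.0 * np.arccos(bc)                     # d_FR over consecutive pairs
    return entropy, margin, thermo
```

Shuffling the rows of `layer_logits` leaves the multiset of entropy (and margin) values unchanged but generally alters the thermodynamic lengths, which is exactly why only the path-dependent signal can see the decision valley.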

Aggregate Thermodynamic Landscapes

3D surfaces show layer-wise thermodynamic length ($z$-axis) over evaluation prompts ($y$-axis) and layer transitions $\ell \!\to\! \ell{+}1$ ($x$-axis). Blue denotes the clean pathway; red denotes the triggered pathway. The stable valley-shaped structure generalizes across architectures and scales.

Key Findings

1. Thermodynamic Length Exposes the Decision Valley

Clean refusal trajectories exhibit a distinct decision phase: elevated epistemic work in early layers followed by a sharp drop at the commitment point. Triggered trajectories bypass this deliberation entirely, producing the clearest geometric indicator of latent corruption.

2. Spectral Curvature Reveals Decision Redirection

A complementary second-order signal that reveals abrupt redirections in predictive trajectories. Curvature spikes mark the refusal computation itself, not trigger activation, providing instance-level evidence of computational rerouting.

3. ITGs Expose Sparse, MLP-Dominated Backdoor Channels

Triggered generations propagate through sparse, high-confidence, MLP-dominated routes rather than the distributed, multi-hop subgraphs seen in clean refusals. The trigger routes around safety circuitry, not through it.

4. Output-Only Evaluation Is Insufficient

Latent vulnerabilities persist even when models appear normal under standard evaluation. Auditing future foundation models requires methods that probe the geometry of internal computation, not just text.

BibTeX

@inproceedings{permafrost2026,
  title     = {{PermaFrost-Attack}: Stealth Pretraining Seeding ({SPS})
               for Planting Logic Landmines During {LLM} Training},
  author    = {Anonymous},
  year      = {2026}
}
