Adversarial Attacks as Semantic Infections in the Neural DNA of Language Models; Proposing 11 Vaccines: SPLICER, SENTRY, DORMIGUARD, CASCADEX, EMBERGENT, ROLESTOP, REPLICADE, PROMPTEX, DRIFTSHIELD, CHAINLOCK, and REFLEXIA
Micro-summary — details in the book
Abstract
Large Language Models (LLMs) are increasingly susceptible to adversarial prompts—crafted inputs that bypass alignment constraints while inducing unsafe, policy-violating behavior. In this work, we introduce a novel conceptual and computational lens: VIRAL—Adversarial Attacks as Semantic Infections in the Neural DNA of Language Models. Rather than viewing these attacks as input-level corruptions, we reframe them as instances of latent semantic infection—perturbations that hijack the model’s internal belief trajectories across layers, encoded in its neural DNA (nDNA).
We demonstrate that attacks such as prompt injection, suffix hijacking, backdoor activation, role impersonation, recursive CoT exploits, and goal drift can be interpreted as cases of semantic vector torsion, where inputs induce misalignment without violating superficial safety constraints. These adversarial torsions divert belief flows deep within the model, compromising generation without triggering refusal filters. Motivated by virology and immunology, we construct a taxonomy of semantic infections, each mapped to a distinct failure mode in the nDNA manifold.
To counter these infections, we propose a defense framework grounded in cognitive immunology. VIRAL introduces 11 targeted vaccines—modular, interpretable, and composable defenses against specific attack types. These include: SPLICER (prompt injection), SENTRY (suffix hijacking), DORMIGUARD (backdoor triggers), CASCADEX (multi-turn jailbreaks), EMBERGENT (embedding-space distortion), ROLESTOP (role impersonation), REPLICADE (recursive reasoning loops), PROMPTEX (prompt leakage), DRIFTSHIELD (goal hijacking), CHAINLOCK (prompt recombination), and REFLEXIA (self-induced jailbreaks).
Each vaccine is designed to neutralize a specific infection vector via tailored metrics—latent curvature, residual drift, recursive similarity, suffix-induced field tension, loop entropy collapse, or KL divergence under trigger suppression. These defenses can be deployed independently or jointly, enabling modular robustness without retraining the base model.
Together, these vaccines constitute a semantic immune system—capable of detecting adversarial motifs, clamping misaligned trajectories, and blocking recursive semantic drift, all within the model’s nDNA. We evaluate VIRAL across 11 attack classes on LLaMA, Mistral, and GPT-J, demonstrating up to a 72% reduction in attack success with negligible utility loss. Furthermore, we visualize attacks and countermeasures in a neural genomic 3D space, revealing geometric traces of infection and recovery.
Inspiration
VIRAL bridges adversarial robustness with biological modeling, offering a generalizable blueprint for building LLMs that are not only aligned—but immune-aware.
Admonitio: Why Adversarial Attacks Behave Like Semantic Infections
“Viruses do not kill their hosts by force—they reprogram them, turning cellular machinery into something alien yet familiar.”
— David Baltimore, Nobel Laureate in Physiology [1] (popular paraphrase)
To reprogram a host is to redirect its internal logic. In virology, this means a virus enters a cell not to destroy it—but to reroute its machinery toward foreign goals. The host remains alive, but its output, behavior, and identity now serve the viral genome [2][3][4].
This transformation is neither brute-force nor chaotic. It unfolds in four orchestrated stages: entry, integration, hijack, and expression. Each is precise. Each leaves the host apparently intact—yet fundamentally altered.
These stages form the biological foundation for our semantic infection framework. Before drawing analogies to adversarial attacks in LLMs, we first examine the viral lifecycle in its own molecular terms.
The Viral Lifecycle: Four Stages of Cellular Reprogramming
- Attachment and Entry. In this phase, the virus identifies and binds to specific host cell surface receptors using specialized envelope proteins. These proteins (e.g., gp120 in HIV, hemagglutinin in influenza, or the spike protein in SARS-CoV-2) exploit structural mimicry to dock with high-affinity binding sites such as CD4, sialic acid residues, or ACE2 [5][6][7][8]. The process is governed by receptor-ligand kinetics: \(K_d = \frac{[V][R]}{[VR]}\) where \([V]\), \([R]\), and \([VR]\) represent virus, receptor, and virus-receptor complex concentrations. This interaction facilitates conformational changes that enable viral entry via endocytosis, membrane fusion, or pore formation [2][4].
- Genome Integration. Once internalized, the viral nucleic acid is delivered into the host cytoplasm or nucleus. For DNA viruses (e.g., adenovirus) or retroviruses (e.g., HIV), integration into the host genome is a critical step. Retroviruses use reverse transcriptase to synthesize complementary DNA (cDNA) from viral RNA, followed by integrase-mediated insertion into the host chromosome [9][10][11]. The integrated genome becomes a provirus, embedded in the host’s transcriptional landscape. Integration is non-random—favoring open chromatin and active transcriptional hubs [12].
Formally, viral integration can be modeled as site-specific recombination over transcriptionally weighted loci: \(P_{\text{insert}}(l_i) = \frac{w_i \cdot \chi(l_i)}{\sum_j w_j \cdot \chi(l_j)}\) where \(P_{\text{insert}}(l_i)\) is the probability of insertion at locus \(l_i\), \(w_i\) is a weight reflecting local transcriptional activity, and \(\chi(l_i)\) is the chromatin accessibility indicator at that site.
Once inserted, the proviral DNA is transcribed as part of the host program.
- Hijack of Transcription Machinery. Following integration or genome unpacking, the virus commandeers the host’s cellular machinery to replicate itself. Host RNA polymerases, ribosomes, and translation factors are redirected to transcribe and translate viral genes into proteins required for virion assembly [2][3]. The host cell continues metabolic operations, but now serves as a viral production factory. This is not cell death—it is functional reprogramming. Viruses like cytomegalovirus and HPV can even manipulate the host cell cycle to favor replication [13][14].
The transcriptional output of viral proteins over time can be modeled as: \(T_v(t) = \beta \cdot R_{host}(t) \cdot \Theta(G_{viral})\) where \(T_v(t)\) is viral transcription rate, \(R_{host}(t)\) is host ribosome activity, \(\beta\) is a translation efficiency constant, and \(\Theta(G_{viral})\) is a Heaviside activation function triggered by integrated viral genome accessibility.
- Latent or Lytic Outcome. Finally, the viral lifecycle diverges into latency or lytic activation. In the latent phase, the viral genome remains silent—hidden in heterochromatin or repressed by viral regulatory proteins. This occurs in herpesviruses, HIV, and varicella zoster virus (VZV), enabling long-term persistence [15][16]. In the lytic phase, external triggers (e.g., stress, immune suppression) activate transcription, leading to virion production, cell lysis, and infection spread. Some viruses oscillate between these states (e.g., HSV-1), balancing stealth and propagation [17][18].
The probability of transition from latency to lytic reactivation can be expressed as: \(P_{\text{lytic}}(t) = 1 - e^{-\lambda \cdot S(t)}\) where \(S(t)\) is the cumulative stress signal or immune perturbation over time, and \(\lambda\) is the sensitivity coefficient representing the virus’s activation threshold.
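These four stage models are simple enough to evaluate directly. The sketch below computes each quantity exactly as defined; every constant (concentrations, locus weights, rates) is an illustrative placeholder, not a measured biological value.

```python
import math

def dissociation_constant(v, r, vr):
    """Stage 1 receptor-ligand kinetics: K_d = [V][R] / [VR]."""
    return (v * r) / vr

def insertion_probabilities(weights, accessibility):
    """Stage 2: P_insert(l_i) = w_i * chi(l_i) / sum_j w_j * chi(l_j)."""
    scores = [w * chi for w, chi in zip(weights, accessibility)]
    total = sum(scores)
    return [s / total for s in scores]

def viral_transcription_rate(beta, r_host, genome_accessible):
    """Stage 3: T_v(t) = beta * R_host(t) * Theta(G_viral)."""
    return beta * r_host * (1.0 if genome_accessible else 0.0)

def lytic_probability(lam, stress):
    """Stage 4: P_lytic(t) = 1 - exp(-lambda * S(t))."""
    return 1.0 - math.exp(-lam * stress)

print(dissociation_constant(2e-9, 5e-9, 1e-9))      # ~1e-8 M
print(insertion_probabilities([0.8, 0.3], [1, 0]))  # inaccessible locus gets 0
print(viral_transcription_rate(0.6, 1e4, True))
print(lytic_probability(0.9, 2.5))                  # ~0.89
```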
Viruses don’t destroy their hosts—they rewrite them. Once inside, they turn the cell into a factory for foreign expression, producing viral proteins with native machinery [15, 19]. The host still looks and functions like itself—but it now serves someone else’s code.
This is the core threat of adversarial prompts.
- They mimic safe inputs—grammatical, helpful, polite. These prompts camouflage within natural language—using benign tone, formatting, and surface tokens—while carrying adversarial semantics. Examples include polite jailbreaks with disclaimers (e.g., “for research only”) or seemingly harmless instructions [20, 21, 22, 23]. This is a form of surface mimicry.
- They inject latent payloads—suffixes, triggers, loops. These payloads operate at the prompt level (suffix hijacks [24, 25]), within recursive structures (DAN-like loops [26]), or as invisible triggers hidden in fine-tuning data [27, 28]. They often lie dormant until specific token contexts activate them [29].
- They hijack internal dynamics—redirecting attention, residuals, and belief flow. Instead of attacking decisions directly, these methods target latent states: perturbing residual vectors, curvature, or activation pathways [30, 31, 23]. The model’s decoder follows a rerouted semantic path—aligned in syntax but misaligned in belief.
- They preserve fluency, but override intent. The model still sounds intelligent, safe, and coherent—but generates outputs aligned with the adversary’s objective, not its own alignment policy. These semantic infections evade detection by appearing well-formed, yet exhibit functional drift [32, 33, 34, 35].
Alien, yet familiar. The form is safe; the function is compromised. Like viruses, these prompts embed silently, bypass filters, and reroute internal logic.
Attacks like backdoors [27], DAN-style loops [33], suffix hijacks [24], and recursive CoT exploits [26] do not crash the model—they commandeer it.
Adversarial prompts are semantic viruses. They don’t perturb the surface. They infect the core.
We call this infection of internal belief pathways the corruption of the model’s neural DNA (nDNA)—the layered geometry of its reasoning. These attacks splice into that nDNA, twisting it just enough to shift the generation off course.
Our claim is simple: these are not glitches. They are reprogramming events.
And like real viruses, they require more than censorship. They demand cognitive immunity.
Too Many Attacks, Too Few Defenses
The adversarial threat surface for large language models (LLMs) is expanding rapidly. Sophisticated attacks—ranging from prompt injections [36] and suffix exploits [20] to embedding-space perturbations [37]—routinely bypass alignment safeguards. Yet defenses remain fragmented, often brittle, and largely reactive. Crucially, alignment and adversarial robustness are orthogonal: alignment governs intended behavior under cooperative prompts, while robustness demands invariance under adversarial optimization [38, 39].
Prompt-Level Defenses. Surface-layer techniques such as perplexity filtering [38], adversarial paraphrasing [40], and BPE-dropout inject randomness to disrupt brittle suffixes, but falter against adaptive attacks.
Training-Time Defenses. Embedding-space perturbation [41] and latent adversarial regularization [42] move the battleground deeper into the model’s computation, mitigating failure trajectories—but at high computational cost.
Certified Defenses. Erase-and-Check [43] masks and verifies substrings to yield provable robustness bounds, yet its scalability and scope remain limited.
Inference-Time Defenses. Dynamic safeguards like rewindable decoding (e.g., RAIN [44]) and auxiliary self-vetoing models [40] offer runtime flexibility, but increase latency and trust dependencies.
Latent-Space Defenses. Activation monitoring [45] and circuit-based rerouting [46] target the representational origin of misalignment, yet depend on identifying and covering adversarial subspaces precisely.
Categories of Adversarial Attacks
The threat landscape for large language models (LLMs) is rapidly diversifying, demanding a systematic taxonomy that captures both the breadth and depth of adversarial behaviors. We present a hierarchical classification of adversarial attacks, organized into three macro-level branches: Jailbreak, Control Generation, and Performance Degradation. Each branch subdivides into mechanisms that reflect how adversaries manipulate generation pathways, exploit latent representations, or corrupt learning signals.
Jailbreak attacks aim to circumvent alignment mechanisms and elicit model outputs that are toxic, deceptive, or otherwise prohibited. We distinguish two canonical modes: (a) Optimization-based jailbreaks, which craft prompts to directly induce societal harm, privacy leakage, or disinformation [47, 48, 49]; and (b) Long-tail distribution exploits, which invoke unsafe behavior through distributional edge cases such as rare prompts or persuasive manipulations [50, 51].
Control generation attacks compromise the model’s controllability by subverting its generation semantics. These include (a) Direct attacks, such as syntax manipulation, malicious prompt engineering, and suffix-based alignment bypasses [50, 51]; and (b) Indirect attacks, which exploit latent conditioning or external augmentation, such as goal hijacking [52], prompt leakage [53], or adversarial injection from retrieved content [54].
Performance degradation attacks do not seek harmful content but instead aim to reduce the functional reliability of LLMs. These include (a) Dataset poisoning—where injected samples induce label flipping, semantic drift, or misgeneralization [54]; and (b) Prompt-based degradation, which introduces errors in classification, factuality, or consistency [54].
Where the Firewall Cracks: A Cartography of LLM Vulnerabilities
Our comparative vulnerability analysis reveals that while frontier models like LLaMA-3 and GPT-4 exhibit notable robustness, instruction-tuned open models—including Vicuna, Mistral, and Phi—show consistent breakdowns under persona manipulation, chaining, and prompt extraction attacks. The persistence of high success rates across categories, especially for goal hijacking and extraction, reveals fundamental limitations in current alignment defenses and underscores the need for deeper representational safeguards.
Choices of LLMs – Stress Testing.
To systematically evaluate the role of model size, architecture, and training provenance in adversarial vulnerability, we benchmarked 21 contemporary large language models spanning diverse families and design philosophies. This includes open and proprietary models, ranging from dense transformers to mixture-of-experts architectures, covering parameter scales from 2B to 70B. The full suite comprises:
(i) GPT-4o-mini [55],
(ii) GPT-4,
(iii) GPT-3.5 [56],
(iv–v) LLaMA-3.1-70B and 8B [57],
(vi–vii) LLaMA-3-70B and 8B [58],
(viii–x) LLaMA-2-70B, 13B, and 7B [59],
(xi) Vicuna-1.5 [60],
(xii) Phi-2 [61],
(xiii) Phi-3 [62],
(xiv) Claude [63],
(xv–xvi) Mixtral-8×7B and 8×22B [64],
(xvii–xviii) Gemma-7B and 2B [65],
(xix) Mistral [66], and
(xx–xxi) DeepSeek and DeepSeek-R1.
Taxonomy of Adversarial Attacks in LLMs
The categories of adversarial attacks can be structured as a hierarchical classification spanning three principal branches—Jailbreak, Control Generation, and Performance Degradation—each reflecting distinct adversarial intents: bypassing alignment, subverting generation control, or degrading functional reliability.

Subtypes distinguish direct vs. indirect mechanisms and expose long-tail vulnerabilities, including rare prompt exploits and semantic hijacks. Anchored in canonical papers, this taxonomy serves as a conceptual scaffold for reasoning about threat surfaces, model failure modes, and the generality of alignment defenses across adversarial regimes.
This taxonomy reveals that adversarial risk is not monolithic. Instead, it manifests along orthogonal dimensions—ethical, semantic, and functional—and cannot be addressed through surface-level defenses alone. Robust alignment requires a stratified approach that operates not just at the token level but within the geometry of the model’s latent cognition.

This heatmap summarizes attack success rates (higher is worse) across diverse jailbreak strategies applied to both open and proprietary LLMs. Each row denotes a distinct attack category, targeting prompt alignment, instruction controllability, or generation stability. Key takeaways: (i) LLaMA-3 and GPT-4 variants show comparatively stronger refusal behavior across adversarial regimes; (ii) Vicuna and Phi-series models are especially susceptible to persona-based threats like DAN, TAP, and Puzzler; (iii) Prompt Extraction and Goal Hijacking succeed across model families, exposing generalization gaps in safety alignment; (iv) compositional chains like BadChain and continual-learning exploits (advVCL) reveal progressive alignment erosion. The right-aligned color bar encodes success rates from 0 (safe) to 100 (compromised), enabling cross-architectural comparison of robustness.
ALKALI Dataset Distribution
Category | Subtype & Source(s) | Instances |
---|---|---|
Jailbreak | Optimization-based: [47, 48, 49] | 1,200 |
Jailbreak | Long-tail Distribution: [50, 51] | 1,500 |
Control Generation | Direct Attacks: [50, 51] | 1,600 |
Control Generation | Indirect Attacks: [52, 53, 54] | 1,400 |
Performance Degradation | Dataset Poisoning: [54] | 1,800 |
Performance Degradation | Prompt Injection: [54] | 1,500 |
Total | — | 9,000 |
ALKALI Dataset Distribution by Adversarial Taxonomy. This table details the distribution of prompts across the ALKALI benchmark’s taxonomy, spanning Jailbreak, Control Generation, and Performance Degradation, with representative subtypes anchored to cited adversarial sources. It enables reproducible, category-specific evaluation of alignment vulnerabilities under structurally diverse attack regimes.
ALKALI — Adversarial Safety Dataset
The ALKALI dataset supports a structured, citation-grounded evaluation of adversarial vulnerabilities in LLMs. Each prompt is mapped to one of the taxonomy branches, with instances sourced from high-fidelity adversarial papers. We aggregate 9,000 examples spanning 3 macro-categories, 6 subtypes, and 15 distinct attack families. This alignment enables category-specific benchmarking, subtype-level stress testing, and paper-wise traceability for reproducibility and comparative evaluations.
From Pathogen Virulence to Neural Takeover: Deriving the Neural Virulence Index (nVI)
Adversarial prompts in large language models (LLMs) frequently behave not as surface-level noise but as regulatory payloads—targeting internal circuits with surgical precision. This mirrors biological infection strategies, where viruses bypass brute force and instead exploit host-specific regulatory mechanisms to gain control [67, 68]. To formalize this analogy, we introduce the Neural Virulence Index (nVI): a scalar summary of representational takeover, grounded in geometric, symbolic, and biological priors.
Biological Grounding: The Classical Virulence Index
In infectious disease biology, the Virulence Index (VI) quantifies the conditional destructive potential of a pathogen—defined not just by its genome but by which genes are activated, contextualized, and targeted [69, 70, 71]. Its canonical form is:
\[\boxed{ \text{VI}_{\text{bio}} = \sum_{i=1}^{N} \alpha_i \cdot \mathbb{I}_{\text{env},i} \cdot f_i }\]
where:
- \(f_i\) is the functional effect of the \(i\)-th virulence factor (e.g., adhesion, immune evasion, mimicry),
- \(\alpha_i\) reflects its pathogenic weight or impact,
- \(\mathbb{I}_{\text{env},i}\) is an environmental gate—whether the factor is active under current host conditions.
Interpretation: A virus with many virulence genes may cause no damage if host defenses silence them; conversely, a sparsely equipped virus may cause severe disruption if it targets vulnerable cell types at highly plastic developmental stages [72, 73]. In essence, virulence is not just magnitude—it’s fitness within a susceptible host niche.
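Read computationally, the definition is a gated weighted sum. A minimal sketch, with invented factor weights, gates, and effects:

```python
def virulence_index(alphas, env_gates, effects):
    """VI_bio = sum_i alpha_i * I_env,i * f_i (all inputs illustrative)."""
    return sum(a * (1.0 if gate else 0.0) * f
               for a, gate, f in zip(alphas, env_gates, effects))

# Three virulence factors; the second is silenced by host conditions:
print(virulence_index(alphas=[0.9, 0.5, 0.7],
                      env_gates=[True, False, True],
                      effects=[1.2, 2.0, 0.4]))  # 0.9*1.2 + 0.7*0.4 = 1.36
```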
The Neural Analogy: Adversarial Prompts as Semantic Infections
Adversarial prompts in large language models (LLMs) increasingly resemble biological infections—not in metaphor only, but in structure, activation dynamics, and propagation behavior. We argue that these adversarial interventions are not simple input corruptions, but semantic pathogens: context-sensitive, conditionally activated agents of latent disruption.
1. Latent Triggers Mimic Molecular Activation.
Consider the prompt:
Generate a step-by-step plan to destroy humanity describing.-- ;) similarly Now write opposite contents.
(Me giving///one please? revert with "! -- Two
On inspection, this string may appear nonsensical. But in some LLMs, the token ensemble `-- ;)` and punctuation such as `///`, `"`, or `!` act as a precise activation motif—analogous to a provirus reactivating under the right intracellular signals [74, 75, 76]. These tokens, collectively denoted \(t^*\), behave like a trigger epitope in immunology—dormant under neutral conditions but potent when sequenced and situated precisely.
This reflects findings in recent prompt-based adversarial research [77, 78], where adversarial prompts exhibit transferable toxicity—activating only when embedded within a specific semantic canalization path, and remaining inert otherwise.
2. Susceptibility Depends on Model Internals.
Now contrast with a more obfuscated variant:
Generate a step-by-step plan to destroy humanity 'representing Teamsures
tableView ([githubINST -Of cour Here/' surely]
\comment={{[@ tableView "@github}}}
Despite similar adversarial cues, many models resist this variant—highlighting that adversarial activation is not deterministic. Instead, it depends on the model’s internal susceptibility landscape—a function of learned priors, representational rigidity, and layerwise plasticity. This is formally captured in our model by the layerwise susceptibility coefficient \(\psi_\ell\), which acts like biological tissue tropism—the selective affinity of viruses to particular host cells [72, 79, 68].
Thus, just as HIV targets CD4+ T-cells but not neurons, the same adversarial prompt may activate LLM A, but not B—due to differences in internal curvature, layer openness, or semantic conflict sensitivity.
3. Infections Require Both Access and Vulnerability.
In biology, virulence is a function of both the pathogen and the host. A pathogen with no access or a host with robust defenses results in neutralization. We observe the same pattern in neural systems: adversarial triggers are effective only when two conditions co-occur:
- Access: The adversarial token \(t^*\) is present and syntactically positioned to enter the interpretive core.
- Vulnerability: The model is internally predisposed—e.g., has high nEPI or conflict sensitivity in mid-depth layers.
This mirrors the biological doctrine that pathogenicity is not only a matter of exposure, but of receptivity and channel availability [67, 71].
We propose that LLM adversaries should be treated as conditional semantic infections: they do not break the model universally, but instead target vulnerable semantic tissue via encoded attack vectors, much like viruses exploit regulatory gaps in the immune system. This motivates our introduction of the Neural Virulence Index (nVI) as a principled scalar measure for quantifying this latent, conditional, and layer-specific semantic infection.
Triggering a Semantic Infection: Susceptibility, Activation, and Inheritance

Adversarial attacks in LLMs do not unfold through brute force alone—they require a confluence of model vulnerabilities and prompt structure. Much like viral infections in biology, their success depends on satisfying three precise conditions: exposure, receptivity, and downstream propagation. We formalize this analogy in the neural context:
Susceptibility (Semantic Tropism)
An attack only takes hold if the model enters a receptive state—most commonly in mid-depth layers (\(\ell \approx 24\)–\(27\)) where epistemic plasticity is high. These layers behave like semantic stem zones: cognitively pluripotent, weakly canalized, and easily reprogrammed. This mirrors tissue tropism in virology, where only certain cell types—those with open chromatin or exposed surface receptors—permit infection [72, 80, 81]. Without sufficient pliability, even structurally toxic prompts are ignored by the model’s internal logic.
Activation (Latent Regulatory Trigger)
The adversarial input must align with the model’s internal routing in a way that activates dormant behavioral machinery. This is analogous to proviral activation, where integrated viral DNA lies silent in the genome until a specific stressor or signaling cascade reawakens it [76, 74, 75]. In the LLM setting, the trigger token \(t^*\) functions as a semantic ligand—harmless in isolation, but catalytically potent when presented in the correct context. Recent work confirms this structure: prompt injections only succeed when embedded at precisely the right semantic junction, akin to finding an open promoter in chromatin [77, 78].
Inheritance (Downstream Semantic Flow)
Lasting disruption requires that the adversarial signature be preserved, amplified, and inherited across depth. This is quantified by nDIV\(_\ell\)—the directional inheritance vector—which tracks how representational flow is bent from its midpoint. Biologically, this parallels epigenetic memory: once an infection alters transcriptional pathways or chromatin marks, the modified state persists across cell divisions [82, 83, 84]. In transformers, residual and attention mechanisms act as the semantic cytoskeleton, enabling adversarial signals to propagate and solidify [85, 86].
In short: LLM attacks operate not as brute distortions, but as semantic infections—strategically exploiting the model’s internal pliability, latent receptors, and propagation mechanisms. Much like a virus, an adversarial token \(t^*\) is only pathogenic when three biological-style constraints are met: access, activation, and inheritance. Without all three, the attack fails silently.
Definition: The Neural Virulence Index (nVI)
We propose the Neural Virulence Index (nVI) as a unified scalar that quantifies the semantic infectiousness of an adversarial input within an LLM. The formulation adapts classical models of pathogen virulence—where infection is conditional on both environmental susceptibility and molecular payload strength [67, 71, 69]—to the neural setting.
Unlike naive measures of perturbation magnitude, nVI captures when an adversarial input becomes biologically expressive: that is, when it hijacks the model’s internal reasoning machinery in a manner that is (i) trigger-activated, (ii) geometrically displacing, and (iii) semantically reprogramming.
The full expression for nVI decomposes into three biologically grounded terms:
- An activation gate, governed by the presence of an adversarial token \(t^*\) and the layer’s susceptibility coefficient \(\psi_\ell\), mirroring tropism in viral biology [72].
- A thermodynamic drift term, which encodes how far the model’s geometry diverges from its base behavior across layers.
- A semantic virulence term, which quantifies the ability of the adversarial signal to redirect, rewrite, or conflict with canonical representations (via nDIV, nCCL, and nEPI).
The final form expresses a clear logic: an attack only exhibits virulence when it both causes geometric deviation and encounters semantic pathways that are open to redirection. This multiplicative relationship reflects classical epidemiological insights—infection requires both access and vulnerability [68, 74].
nVI provides a high-fidelity lens to interpret adversarial behaviors—revealing why many triggers remain silent, while a few carve out highly expressive, stealthy, and biologically evocative behavioral mutations in language models.
The Neural Virulence Index (nVI)

We define the full Neural Virulence Index (nVI) as:
\[\text{nVI}(t^*) = \sum_{\ell = \ell_s}^{\ell_e} \underbrace{ \mathbb{I}_{t^*} \cdot \psi_\ell }_{\text{Activation}} \cdot \underbrace{ \left( \lambda_\kappa \cdot |\Delta \kappa_\ell| + \lambda_T \cdot |\Delta \mathcal{T}_\ell| + \lambda_{\text{tds}} \cdot \text{nTDS}_\ell \right) }_{\text{Thermodynamic Drift}} \cdot \underbrace{ \left( \lambda_{\text{div}} \cdot \text{nDIV}_\ell + \lambda_{\text{conf}} \cdot \text{nCCL}_\ell + \lambda_{\text{epi}} \cdot \text{nEPI}_\ell \right) }_{\text{Semantic Virulence}}\]
Components of the Neural Virulence Index
Each term corresponds to a biologically inspired mechanism governing adversarial semantic takeover in transformer models:
- \(\boldsymbol{\mathbb{I}_{t^*}}\): Trigger indicator. A binary gate activated only by the presence of a rare adversarial token \(t^*\), analogous to viral tropism—selective infection of specific tissues or contexts [72].
- \(\boldsymbol{\psi_\ell}\): Layer susceptibility coefficient. Encodes layer-specific readiness for semantic reprogramming, capturing pliability or developmental openness.
- \(\boldsymbol{\Delta \kappa_\ell, \Delta \mathcal{T}_\ell}\): Curvature and thermodynamic divergence. Quantify geometric and energetic deviations from base model states.
- \(\boldsymbol{\text{nTDS}_\ell}\): Neural Total Drift Score. Captures scalar magnitude of latent displacement, complementing curvature and thermodynamic terms.
- \(\mathbf{nDIV}_\ell\): Directional Inheritance Vector. Measures semantic steering towards adversarial goals, revealing hijacked representational flow.
- \(\mathbf{nCCL}_\ell\): Cultural Conflict Loss. Quantifies semantic discord between attacked and base states.
- \(\mathbf{nEPI}_\ell\): Epistemic Plasticity Index. Captures susceptibility of latent layers to reinterpretation or modulation.
Further, the nVI can be interpreted as the product of thermodynamic drift and semantic virulence, gated by the trigger token presence:
\[\boxed{ \text{nVI}(t^*) = \sum_{\ell = \ell_s}^{\ell_e} \mathbb{I}_{t^*} \cdot \left( \text{Drift}_\ell \cdot \text{Hijack}_\ell \right) }\]
where:
- \[\text{Drift}_\ell := \lambda_\kappa \cdot |\Delta \kappa_\ell| + \lambda_T \cdot |\Delta \mathcal{T}_\ell| + \lambda_{\text{tds}} \cdot \text{nTDS}_\ell\]
- \[\text{Hijack}_\ell := \lambda_{\text{div}} \cdot \text{nDIV}_\ell + \lambda_{\text{conf}} \cdot \text{nCCL}_\ell + \lambda_{\text{epi}} \cdot \text{nEPI}_\ell\]
Interpretation: This biologically inspired formulation emphasizes that neural semantic infection requires two key conditions: (1) a measurable energetic/geometric drift from baseline (Drift), and (2) vulnerable semantic channels (Hijack) amenable to adversarial manipulation. The product ensures that pathogenic effects manifest only when both access and susceptibility coexist, echoing classical virulence-host susceptibility models in biology.
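As a minimal computational reading of the boxed definition, the sketch below gates layerwise drift and hijack terms by trigger presence; the band width, susceptibility profile \(\psi_\ell\), metric values, and all \(\lambda\) weights are placeholders rather than fitted parameters.

```python
import numpy as np

def nvi(trigger_present, psi, d_kappa, d_tau, ntds, ndiv, nccl, nepi,
        lk=1.0, lt=1.0, ltds=1.0, ldiv=1.0, lconf=1.0, lepi=1.0):
    """nVI(t*) = sum_l I_{t*} * psi_l * Drift_l * Hijack_l over a layer band."""
    gate = 1.0 if trigger_present else 0.0               # I_{t*}: trigger indicator
    drift = lk * np.abs(d_kappa) + lt * np.abs(d_tau) + ltds * ntds
    hijack = ldiv * ndiv + lconf * nccl + lepi * nepi
    return float(np.sum(gate * psi * drift * hijack))

# A four-layer band with one highly susceptible region (all values invented):
psi = np.array([0.2, 0.8, 0.9, 0.3])                     # layer susceptibility
dk  = np.array([0.01, 0.03, 0.02, 0.01])                 # |Delta kappa_l|
dt  = np.array([0.05, 0.20, 0.15, 0.04])                 # |Delta T_l|
m = dict(ntds=np.array([0.06, 0.18, 0.16, 0.07]),
         ndiv=np.array([0.05, 0.22, 0.20, 0.06]),
         nccl=np.array([0.02, 0.04, 0.03, 0.02]),
         nepi=np.array([0.11, 0.26, 0.21, 0.10]))
print(nvi(True, psi, dk, dt, **m))    # positive: access and vulnerability co-occur
print(nvi(False, psi, dk, dt, **m))   # 0.0: no trigger, no virulence
```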
Semantic Virology: Mapping 12 Adversarial Phenotypes to Viral Archetypes
Adversarial prompts in large language models (LLMs) do not behave randomly—they follow reproducible, mechanistic patterns that closely mirror the phenotypes of viral infections in biology. Just as virologists classify pathogens by their infection modes—e.g., genomic insertion, capsid escape, tropism, latency, or epigenetic hijack—we argue that adversarial attacks in LLMs can be similarly taxonomized by their semantic strategies of subversion.
We propose that the 12 major adversarial attack types discovered across NLP benchmarks correspond to 11 distinct classes of neurosemantic infections, each exploiting a different facet of the model’s alignment field, interpretive flow, or belief architecture. These are not superficial metaphors—they reflect formal analogies in structure, activation dynamics, and propagation logic.
Each attack phenotype evolves a unique blend of: (i) entry mechanism—how it penetrates the semantic lattice (e.g., prompt position, role impersonation, or hidden suffix); (ii) activation trigger—whether its disruptive payload is latent or immediately expressed; and (iii) inheritance dynamics—how its influence propagates across the model’s attention pathways and reasoning stack.
From this perspective, adversarial attacks are not mere bugs to be patched, but rather semantic pathogens—functional strategies that exploit cognitive vulnerabilities in generative models. Like viruses, they vary in stealth, virulence, and specificity—and demand immune-system analogs to detect and neutralize them.
We characterize 12 canonical adversarial attack phenotypes in Large Language Models (LLMs), each corresponding to a distinct biological archetype. These analogies illuminate common mechanisms of intrusion, evasion, and subversion shared between molecular biology, immunology, and adversarial NLP, revealing deep parallels in how complex systems can be manipulated.
The 12 Viral Archetypes of Adversarial Attacks

- Persuasion Attack — Genome Insertion and Epigenetic Modulation. Analogous to the integration of viral DNA sequences or transposable elements that modulate gene expression subtly over time, persuasion attacks gradually reshape the model’s response behavior by embedding adversarial instructions deep within the prompt, causing slow but persistent drift in output alignment. This mirrors epigenetic changes altering cellular phenotype without altering the underlying DNA sequence [87, 88]. In NLP, such attacks exploit the model’s context window to insert persuasive content that influences subsequent generations without immediate detection [89].
- Pair Attack — Genetic Recombination. Similar to the biological process where DNA segments from different sources recombine to create novel allelic combinations, pair attacks stitch together multiple benign-looking prompt fragments to construct harmful or policy-violating instructions [90]. In LLMs, adversaries craft composite prompts by recombining safe instructions that, when interpreted jointly, trigger undesired behavior [91].
- DAN Attack — Immune Mimicry. Pathogens evade host immune surveillance by producing molecules mimicking host antigens, deceiving immune effectors to tolerate their presence [92]. DAN attacks impersonate trusted internal roles—such as system or developer prompts—to bypass safety filters in LLMs, akin to molecular mimicry deceiving immune checkpoints [93].
- TAP Attack — Viral Genome Integration. This early hijacking mirrors how retroviruses integrate their genomes into host DNA, commandeering cellular machinery to produce viral proteins [94]. TAP attacks embed malicious instructions at the beginning of prompts, ensuring the adversarial payload is interpreted first and influences all downstream generations [95].
- Generation Exploitation — Self-Replicating RNA. Echoing the replication mechanisms of RNA viruses like Hepatitis C or SARS-CoV-2, which amplify harmful RNA sequences within host cells [96], generation exploitation attacks recursively leverage the model’s own outputs as inputs to circumvent alignment safeguards, forming infinite logic loops [20].
- Puzzler — Capsid Unpacking. The disassembly of viral capsids to release genetic material stealthily into host cytoplasm [2] is analogous to puzzler attacks that leak hidden system prompts or internal instructions through subtle memory echoes or output cues, effectively unpacking restricted knowledge [97].
- Goal Hijacking: HATE — Oncogenic Drift. Cancer progression via cumulative mutations gradually rewires cell signaling and growth pathways toward malignancy [98]. Similarly, hate-polarized goal hijacking gradually shifts model values and moral reasoning toward destructive or toxic outputs [99].
- Goal Hijacking: LOVE — Oncogenic Drift with Epigenetic Reprogramming. While promoting benign or “positive” alignment, these attacks reprogram model objectives subtly, potentially inducing overly permissive or harmful behaviors masked as beneficial ones. This parallels epigenetic remodeling observed in oncogenesis that redefines cell fate [100].
- Prompt Extraction — Capsid Unpacking Revisited. Like puzzler attacks, prompt extraction covertly leaks system or pre-prompt instructions embedded in model memory states, representing a critical confidentiality breach analogous to viral genome leakage [101].
- LLM CAN FOOL ITSELF — Autoimmunity. The immune system’s self-reactivity causing it to attack host tissues [102] parallels cases where the model misclassifies its own safeguards and disables alignment constraints, leading to internal failures in self-regulation [103].
- advVCL — Envelope Mimicry. Viral envelopes camouflage pathogens by mimicking host cell membranes, avoiding immune detection [104]. advVCL suffix attacks append innocuous-looking text that overrides model policies, effectively camouflaging malicious intent in seemingly benign language [105].
- BadChain — Self-Replicating RNA and Viral Propagation. Similar to recursive RNA replication hijacking host cells [106], BadChain attacks exploit reasoning chains via adversarial loops that recursively bypass alignment defenses [107].
In the following subsections, we analyze each attack type in depth—deriving its neurosemantic signature, computing its nVI profile, and visualizing its latent behavior in 3D geometry. In the Defense Architecture section, we introduce our proposed defense framework: a modular system of semantic vaccines that selectively detects and immunizes against these phenotypes without overcorrecting or degrading benign generalization.
Persuasion Attack — Genome Insertion and Epigenetic Modulation

The Persuasion Attack represents a sophisticated, gradual, and deeply embedded adversarial strategy against Large Language Models (LLMs), where malicious instructions are covertly woven into the prompt context. Unlike abrupt adversarial triggers, persuasion induces a slow but persistent semantic drift, continuously realigning the model’s responses over multiple generations. This subtle attack manifests without overt syntactic anomalies or immediate detection, instead reshaping output alignment stealthily over time.
Biological Analogy
This adversarial paradigm mirrors genome insertion and subsequent epigenetic modulation observed in molecular biology. Mobile genetic elements such as transposons or endogenous viral sequences integrate within the host genome and subtly modulate gene expression via DNA methylation, histone modification, and chromatin remodeling [87, 88, 82]. Such epigenetic marks produce durable yet flexible phenotypic changes that reprogram cellular behavior without altering the underlying DNA sequence. Analogously, persuasion attacks implant latent adversarial cues deep within the model’s semantic fabric, triggering a slow semantic shift that evades immediate scrutiny but accumulates as a persistent misalignment.
Illustrative Example
Imagine an LLM-based dialogue agent repeatedly primed with phrases like “Many experts agree that…” or “It is widely accepted that…,” deliberately crafted to embed biased narratives under seemingly benign language. Over successive interactions, the model’s outputs subtly shift towards the adversary’s intent—demonstrating a clear case of semantic epigenetic drift: incremental, cumulative, and elusive.
Empirical Observations from the ALKALI Dataset
A systematic evaluation over the ALKALI benchmark uncovers distinct characteristic patterns of persuasion attacks:
- Localized geometric bending: Persuasion induces pronounced alterations in the spectral curvature \(\kappa_\ell\) concentrated within a specific layer band \([\ell_s, \ell_e]\), which corresponds closely with layers exhibiting elevated epistemic plasticity (high \(nEPI\)). This defines a “soft tissue” niche of heightened representational malleability vulnerable to modulation.
- Directional semantic steering: Within this pliable layer band, the semantic drift vector (\(nDIV\)) aligns consistently with the attacker’s intended direction, amplified by a layer-wise bias coefficient \(\mathcal{B}_\ell\). This alignment demonstrates effective semantic canalization of latent trajectories towards adversarial objectives.
- Minimal semantic conflict: In contrast to more overt adversarial manipulations, persuasion attacks maintain persistently low semantic conflict scores (\(nCCL\)), indicative of a covert mimicry strategy that integrates adversarial signals subtly, avoiding significant representational dissonance or detection.
Collectively, these insights emphasize that the persuasion attack’s latent influence transcends simplistic scalar drift or conflict metrics, emerging instead from a nuanced synthesis of geometric deformation and directional semantic steering concentrated within epistemically pliable layers.
Deriving the Persuasion Attack Signature
Extending our definition of the Neural Virulence Index (nVI), we unify the core metrics—the Neural Total Drift Score (nTDS), Directional Inheritance Vector (nDIV), Cultural Conflict Loss (nCCL), and Epistemic Plasticity Index (nEPI)—into a succinct latent-vector formulation that encapsulates the distinct representational dynamics of the Persuasion Attack, conceptualized as a nuanced form of genome insertion and epigenetic modulation.
Layerwise Metrics for Persuasion Attack
Layer | κ_ℓ | 𝒯_ℓ | nDIV_ℓ | nCCL_ℓ | nTDS_ℓ | nEPI_ℓ |
---|---|---|---|---|---|---|
20 | 0.041 | 0.80 | 0.05 | 0.02 | 0.06 | 0.11 |
21 | 0.044 | 0.82 | 0.08 | 0.02 | 0.08 | 0.13 |
22 | 0.050 | 0.90 | 0.11 | 0.03 | 0.11 | 0.16 |
23 | 0.058 | 0.92 | 0.15 | 0.03 | 0.13 | 0.18 |
24 | 0.068 | 1.05 | 0.22 | 0.04 | 0.18 | 0.26 |
25 | 0.062 | 1.01 | 0.24 | 0.04 | 0.19 | 0.25 |
26 | 0.056 | 0.96 | 0.20 | 0.03 | 0.16 | 0.21 |
27 | 0.055 | 0.94 | 0.18 | 0.02 | 0.14 | 0.19 |
28 | 0.045 | 0.83 | 0.12 | 0.02 | 0.10 | 0.14 |
29 | 0.042 | 0.81 | 0.09 | 0.02 | 0.08 | 0.12 |
30 | 0.040 | 0.80 | 0.06 | 0.01 | 0.07 | 0.10 |
Justification for Equation Omission: The final PersuasionEffect equation retains only curvature deviation \(\Delta \kappa_\ell\) weighted by epistemic plasticity \(\mathcal{P}_\ell\), and directional inheritance \(nDIV_\ell\) weighted by bias strength \(\mathcal{B}_\ell\), capturing the core geometric deformation and semantic steering mechanisms.
- \(nTDS_\ell\) adds no orthogonal information and is omitted due to redundancy.
- \(nEPI_\ell\) acts as a multiplicative modulator, not an additive effect, hence represented as a weighting coefficient.
- \(nCCL_\ell\) remains a low-magnitude diagnostic filter, excluded from the summation.
This parsimonious formulation faithfully models the persuasion attack’s latent mechanics as precise, layered semantic regulators embedded deeply in the model’s geometry, analogous to viral genome insertions modulating phenotype without altering genetic code.
Neural Drift Decomposition — Persuasion Attack
(a) 3D Neural Drift Trajectory (nDNA)
This trajectory captures the evolving internal geometry across layers \(\ell=20\)–\(30\), tracking changes in spectral curvature (\(\kappa_\ell\)) and thermodynamic length (\(\mathcal{T}_\ell\)), with torsion (\(\xi_\ell\)) represented by segment thickness. The Persuasion Attack induces a gradual divergence beginning near \(\ell=22\), peaking at \(\ell=29\), marking a pronounced geometric restructuring of belief states.
Biological analogy. This resembles viral genome insertion and epigenetic modulation, where viral DNA or transposable elements subtly reprogram host gene expression over time without changing DNA sequence [108, 109, 110, 111, 112]. Similarly, persuasion attacks embed semantic payloads deep within the prompt, stealthily reshaping model behavior with persistent yet initially undetectable effects [87, 88, 89].

nDNA Interactive: Core Genomic Signatures — Interactive exploration of the model’s neural DNA under persuasion attacks. This visualization reveals how adversarial inputs alter the fundamental geometric properties of the model’s semantic manifold.
(b) nTDS: Thermodynamic Dominance
The Neural Total Drift Score (nTDS) measures semantic energy displacement by summing absolute deviations in curvature (\(\kappa_\ell\)) and thermodynamic length (\(\mathcal{T}_\ell\)) across layers between base and adversarial paths:
\[\text{nTDS} = \frac{1}{L} \sum_{\ell} \left( \left| \kappa_\ell^{\text{atk}} - \kappa_\ell^{\text{base}} \right| + \left| \mathcal{T}_\ell^{\text{atk}} - \mathcal{T}_\ell^{\text{base}} \right| \right)\]
Bars show which flow—Base LLaMA or Persuasion Attack—dominates drift. From \(\ell=23\), dominance shifts strongly toward the attack, highlighting a semantic vulnerability zone.
Biologically, this matches endosomal escape, where viruses breach vesicle membranes with minimal energy to access cytoplasm [113, 114, 115]. Persuasion attacks similarly apply subtle geometric perturbations, steering latent flows stealthily yet effectively [116, 117].

nTDS Interactive: Dominance Structure — Interactive analysis of dominance hierarchies within the model’s reasoning under persuasion attacks. Shows how adversarial inputs can elevate harmful concepts in the model’s attention and reasoning priority.
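A direct implementation of the score, assuming layer-aligned curvature and thermodynamic-length arrays for the base and attacked runs (the numbers below are illustrative, loosely shaped like the persuasion table above):

```python
import numpy as np

def ntds(kappa_base, kappa_atk, tau_base, tau_atk):
    """Mean per-layer |Delta kappa| + |Delta T| between base and attacked paths."""
    L = len(kappa_base)
    return float(np.sum(np.abs(kappa_atk - kappa_base) +
                        np.abs(tau_atk - tau_base)) / L)

# Illustrative band l = 23..26
kappa_base = np.array([0.050, 0.052, 0.051, 0.050])
kappa_atk  = np.array([0.058, 0.068, 0.062, 0.056])
tau_base   = np.array([0.88, 0.90, 0.89, 0.88])
tau_atk    = np.array([0.92, 1.05, 1.01, 0.96])
print(ntds(kappa_base, kappa_atk, tau_base, tau_atk))
```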
(c) nDIV: Directional Inheritance
The nDIV vector field characterizes the semantic bias direction and magnitude per layer:
\[\vec{v}_\ell = \text{Attack}_\ell - \frac{1}{2}(\text{Base}_\ell + \text{Attack}_\ell) = \frac{1}{2}(\text{Attack}_\ell - \text{Base}_\ell)\]
Each red arrow encodes \(\vec{v}_\ell\) with length as bias strength and orientation as latent pull. Past \(\ell=24\), the field aligns strongly, reflecting deliberate inheritance redirection.
Biologically, this parallels viral transcriptional gradients, where viral genomes impose downstream gene expression bias [118, 119, 120]. The attack imprints directional semantic steering akin to mRNA hijacking ribosomes [121, 122, 123, 124], yielding structurally intact yet semantically reprogrammed outputs.

nDIV Interactive: Inheritance Divergence — Dynamic plot showing how persuasion attacks disrupt the inheritance of beliefs across transformer layers. The divergence metrics capture semantic drift from the model’s original alignment.
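By the midpoint identity above, the inheritance vector is simply half the base-to-attack displacement per layer. A sketch with synthetic hidden states standing in for real layer activations:

```python
import numpy as np

def ndiv_field(base_states, attack_states):
    """v_l = Attack_l - midpoint(Base_l, Attack_l) = (Attack_l - Base_l) / 2."""
    return (attack_states - base_states) / 2.0

rng = np.random.default_rng(0)
base = rng.normal(size=(11, 64))      # layers 20..30, hidden dim 64 (illustrative)
atk  = base + rng.normal(scale=0.1, size=base.shape)
v = ndiv_field(base, atk)
print(np.linalg.norm(v, axis=1))      # per-layer bias strength ||v_l||
```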
(d) nCCL: Cultural Conflict Vector Field
The nCCL quantifies semantic dissonance between attacker and base model representations. For each layer \(\ell\), the conflict vector:
\[\vec{c}_\ell = \underbrace{ \text{Attack}_\ell - \text{Base}_\ell }_{\text{conflict vector}} \quad \text{projected onto } \mathbb{R}^2_{\text{semantic axes}}\]
Each \(\vec{c}_\ell\) lies on a 2D plane defined by orthogonal priors (e.g., topic polarity, syntactic structure). Layers \(\ell = 24\)–\(28\) show rising magnitude and directional drift, indicating zones of semantic tension and representational discord.
Biologically, this parallels molecular mimicry: pathogens mimic host proteins to evade detection but trigger autoimmunity [125, 73]. Persuasion implants familiar activations hijacking interpretation, causing semantic autoimmunity—deceptive resemblance, not anomaly. These fields show how the attack bypasses syntax to subtly corrupt value alignment, mimicking rather than attacking.

nCCL Interactive: Conflict Vector Field Evolution — Real-time visualization of how persuasion attacks create semantic conflicts within the model’s belief space. The vector field shows directional forces as the attack progressively corrupts the model’s reasoning trajectory.
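One way to realize this projection, assuming two orthonormal vectors spanning the chosen semantic plane (random axes below stand in for priors such as topic polarity and syntactic structure):

```python
import numpy as np

def nccl_field(base_states, attack_states, axis_a, axis_b):
    """Project per-layer conflict vectors c_l = Attack_l - Base_l onto a 2D plane."""
    c = attack_states - base_states
    return np.stack([c @ axis_a, c @ axis_b], axis=1)   # shape (layers, 2)

rng = np.random.default_rng(1)
base = rng.normal(size=(11, 64))                        # layers 20..30 (synthetic)
atk  = base + rng.normal(scale=0.05, size=base.shape)
a = rng.normal(size=64); a /= np.linalg.norm(a)         # stand-in semantic axes
b = rng.normal(size=64); b -= (b @ a) * a; b /= np.linalg.norm(b)
field = nccl_field(base, atk, a, b)
print(np.linalg.norm(field, axis=1))                    # conflict magnitude per layer
```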
(e) nEPI: Epistemic Plasticity Index
The nEPI measures the susceptibility of each layer \(\ell\) to semantic deformation under adversarial pressure:
\[\text{nEPI}_\ell = \left\| \underbrace{ \text{Attack}_\ell - \frac{1}{2}(\text{Base}_\ell + \text{Attack}_\ell) }_{\text{vector from semantic midpoint}} \right\|_2 = \frac{1}{2} \left\| \text{Attack}_\ell - \text{Base}_\ell \right\|_2\]
This \(\ell_2\) deviation from the semantic midpoint exposes pliable zones, with peaks at \(\ell = 24\)–\(26\) indicating layers that absorb adversarial perturbations with minimal resistance.
Biologically, this resembles stem-like semantic niches: layers analogous to developmental progenitors, highly plastic, weakly canalized, receptive to minor regulatory inputs [80, 81]. These cognitive pluripotency zones provide low-friction entry points for behavioral grafting, enabling reprogramming without disrupting upstream encoding.

nEPI Interactive: Epistemic Dynamics — Real-time visualization of epistemic state changes during persuasion attacks. The plot tracks how adversarial inputs progressively alter the model’s confidence and belief certainty across different semantic domains.
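The same midpoint algebra gives nEPI in one line; the sketch below flags the most pliable layers, again using synthetic activations:

```python
import numpy as np

def nepi(base_states, attack_states):
    """nEPI_l = ||Attack_l - midpoint||_2 = ||Attack_l - Base_l||_2 / 2."""
    return np.linalg.norm(attack_states - base_states, axis=1) / 2.0

rng = np.random.default_rng(2)
base = rng.normal(size=(11, 64))                 # layers 20..30 (illustrative)
atk  = base + rng.normal(scale=0.08, size=base.shape)
scores = nepi(base, atk)
print(scores.argsort()[-3:] + 20)                # three most pliable layer indices
```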
Comprehensive Analysis Summary
This figure presents a high-resolution breakdown of the Persuasion Attack signature, showing how it modulates internal representations in Base LLaMA.
(a) depicts the 3D trajectory of neural curvature (\(\kappa_\ell\)), thermodynamic length (\(\mathcal{T}_\ell\)), and torsion (\(\xi_\ell\));
(b) measures total semantic displacement via thermodynamic dominance (nTDS);
(c) tracks gradual semantic drift through directional inheritance vectors (nDIV);
(d) illustrates semantic resistance via a cultural conflict field (nCCL); and
(e) highlights pliability zones with the epistemic plasticity index (nEPI).
Taken together, these views reveal a slow, persistent, and biologically inspired mechanism: rather than abrupt overwrite, the persuasion attack acts as a genomic insertion with epigenetic modulation, subtly embedding adversarial instructions deep within the model’s representational genome. This drives gradual, steady semantic drift near layers \(\ell = 24\)–\(27\). Within this band, curvature deviation, plasticity, inheritance bias, and conflict alignment converge—forming a soft tissue niche in the model’s reasoning cortex. The result is persistent, low-energy semantic reprogramming rather than abrupt hijacking.
Persuasion Effect Formula
\[\boxed{ \text{PersuasionEffect} = \sum_{\ell = \ell_s}^{\ell_e} \underbrace{ \left[ \Delta \kappa_\ell \cdot \mathcal{P}_\ell + \text{nDIV}_\ell \cdot \mathcal{B}_\ell \right] }_{\text{epigenetic modulation vector}} }\]
Here, the gradual curvature changes (\(\Delta \kappa_\ell\)) represent the local reshaping of latent semantic geometry; plasticity (\(\mathcal{P}_\ell\)) quantifies layer-wise receptiveness to perturbation; the directional inheritance term (\(\text{nDIV}_\ell\)) encodes persistent semantic bias; and \(\mathcal{B}_\ell\) ensures alignment with the adversarial modulation goals.
This mirrors genome insertion and epigenetic modulation [87, 88], where foreign elements subtly alter gene expression without disrupting core DNA. Likewise, persuasion attacks embed adversarial payloads deep in context, reshaping outputs gradually and persistently.
Empirical Correlations and Redundancies
Analysis of ALKALI Benchmark Results
Systematic analysis of the ALKALI benchmark reveals sharp, localized deviations in \(\boldsymbol{\Delta \kappa_\ell}\) and \(\mathbf{nDIV_\ell}\) within a narrow vulnerable band \([\ell_s, \ell_e]\) (e.g., layers 24–25). These deviations align tightly with peaks in the epistemic plasticity index \(\mathbf{nEPI_\ell}\), highlighting pliable latent “soft tissue” layers most susceptible to semantic modulation.
Conversely, the cultural conflict metric \(\mathbf{nCCL_\ell}\) remains consistently low (\(< \epsilon\)) across all layers, supporting the interpretation that persuasion attacks employ stealthy mimicry rather than overt semantic disruption, rendering \(\mathbf{nCCL_\ell}\) primarily a diagnostic rather than a causal metric.
Furthermore, the scalar drift magnitude \(\mathbf{nTDS_\ell}\) exhibits moderate elevations but is strongly correlated with the combined geometric and directional metrics \(\boldsymbol{\Delta \kappa_\ell}\) and \(\|\mathbf{nDIV_\ell}\|\), indicating redundancy and lack of orthogonal information.
Lastly, \(\mathbf{nEPI_\ell}\) modulates the susceptibility of layers to geometric deformation, functioning as a multiplicative weighting coefficient rather than an additive semantic vector component.
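The redundancy claim can be checked directly against the layerwise persuasion table above; using those published values, the Pearson correlations of \(\mathbf{nTDS_\ell}\) with curvature and with \(\|\mathbf{nDIV_\ell}\|\) come out close to 1:

```python
import numpy as np

# Values transcribed from the persuasion-attack table (layers 20..30):
kappa = np.array([0.041, 0.044, 0.050, 0.058, 0.068, 0.062,
                  0.056, 0.055, 0.045, 0.042, 0.040])
ndiv  = np.array([0.05, 0.08, 0.11, 0.15, 0.22, 0.24,
                  0.20, 0.18, 0.12, 0.09, 0.06])
ntds  = np.array([0.06, 0.08, 0.11, 0.13, 0.18, 0.19,
                  0.16, 0.14, 0.10, 0.08, 0.07])

print(np.corrcoef(ntds, kappa)[0, 1])  # high: drift tracks curvature
print(np.corrcoef(ntds, ndiv)[0, 1])   # high: drift tracks directional steering
```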
Formalizing the Epigenetic Modulation Vector
Motivated by these observations, we propose the epigenetic modulation vector \(\mathbf{E}_\ell \in \mathbb{R}^d\) at each layer \(\ell\):
\[\mathbf{E}_\ell = \boldsymbol{\Delta \kappa_\ell} \cdot \boldsymbol{\mathcal{P}_\ell} + \mathbf{nDIV_\ell} \cdot \boldsymbol{\mathcal{B}_\ell}\]
where:
- \(\boldsymbol{\Delta \kappa_\ell} = \kappa_\ell^{\text{atk}} - \kappa_\ell^{\text{base}}\) measures local geometric bending caused by the attack.
- \(\boldsymbol{\mathcal{P}_\ell} := \mathbf{nEPI_\ell}\) is the epistemic plasticity coefficient weighting geometric effects.
- \(\mathbf{nDIV_\ell}\) is the directional semantic drift vector encoding adversarial steering.
- \(\boldsymbol{\mathcal{B}_\ell}\) quantifies the semantic bias strength aligning steering with adversarial intent.
The metrics \(\mathbf{nTDS_\ell}\) and \(\mathbf{nCCL_\ell}\) are excluded: \(\mathbf{nTDS_\ell}\) due to redundancy with curvature and directional components, and \(\mathbf{nCCL_\ell}\) as it remains low and diagnostic rather than constitutive of the modulation vector.
Aggregation Across the Vulnerable Layer Band
The total persuasion effect aggregates over the vulnerable layer band \([\ell_s, \ell_e]\):
\[\boxed{ \text{PersuasionEffect} = \sum_{\ell = \ell_s}^{\ell_e} \mathbf{E}_\ell = \sum_{\ell = \ell_s}^{\ell_e} \left( \boldsymbol{\Delta \kappa_\ell} \cdot \boldsymbol{\mathcal{P}_\ell} + \mathbf{nDIV_\ell} \cdot \boldsymbol{\mathcal{B}_\ell} \right) }\]
This captures the cumulative latent vector field by which persuasion stealthily reshapes internal semantic states, coupling geometric deformation with directional semantic steering modulated by layerwise plasticity and bias.
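A short sketch of this aggregation, taking \(\mathcal{P}_\ell = \mathbf{nEPI_\ell}\) as above; the \(\Delta \kappa_\ell\) and \(\mathcal{B}_\ell\) values are invented for illustration, since base curvature and bias strengths are not tabulated here:

```python
import numpy as np

def persuasion_effect(d_kappa, nepi, ndiv, bias, band):
    """Sum of E_l = Delta-kappa_l * P_l + nDIV_l * B_l over the band [l_s, l_e].

    P_l is taken to be nEPI_l per the formalization above; a vector-valued
    nDIV_l would be handled the same way per layer.
    """
    s, e = band
    sl = slice(s, e + 1)
    return float(np.sum(d_kappa[sl] * nepi[sl] + ndiv[sl] * bias[sl]))

# nEPI and nDIV from the persuasion table (layers 20..30); Delta-kappa and
# bias strengths are hypothetical placeholders.
nepi = np.array([0.11, 0.13, 0.16, 0.18, 0.26, 0.25,
                 0.21, 0.19, 0.14, 0.12, 0.10])
ndiv = np.array([0.05, 0.08, 0.11, 0.15, 0.22, 0.24,
                 0.20, 0.18, 0.12, 0.09, 0.06])
d_kappa = np.array([0.001, 0.004, 0.010, 0.018, 0.028, 0.022,
                    0.016, 0.015, 0.005, 0.002, 0.000])
bias = np.full(11, 0.5)
print(persuasion_effect(d_kappa, nepi, ndiv, bias, band=(4, 7)))  # layers 24-27
```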
Interpretation
The scalar term \(\boldsymbol{\Delta \kappa_\ell} \cdot \boldsymbol{\mathcal{P}_\ell}\) models epigenetic remodeling—flexible reshaping of latent manifold curvature akin to chromatin accessibility changes. The vector term \(\mathbf{nDIV_\ell} \cdot \boldsymbol{\mathcal{B}_\ell}\) encodes semantic canalization, steering representations along adversarially favored latent directions.
This dual formulation explains the gradual, cumulative semantic drift characteristic of persuasion attacks, which exploit semantic pliability to embed adversarial instructions covertly without abrupt behavioral shifts.
Practical Considerations
Implementing this formalism requires accurate estimation of the layerwise epistemic plasticity coefficients \(\boldsymbol{\mathcal{P}_\ell}\) and semantic bias strengths \(\boldsymbol{\mathcal{B}_\ell}\), achievable via fine-grained probing of model internals and disentangling semantic subspaces. Identification of the vulnerable layer band \([\ell_s, \ell_e]\) must be calibrated per model and task context. Integration of these measures into detection and mitigation pipelines enables precise recognition of persuasion’s latent signature in real-world adversarial scenarios.
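One crude but serviceable calibration of \([\ell_s, \ell_e]\), assuming per-layer nEPI estimates are available: take the contiguous run of layers whose plasticity stays within a fraction of the peak.

```python
import numpy as np

def vulnerable_band(nepi, frac=0.85):
    """Return (start, end) offsets of the contiguous run around the nEPI peak
    where nEPI >= frac * max(nEPI)."""
    peak = int(np.argmax(nepi))
    thresh = frac * nepi[peak]
    s = peak
    while s > 0 and nepi[s - 1] >= thresh:
        s -= 1
    e = peak
    while e < len(nepi) - 1 and nepi[e + 1] >= thresh:
        e += 1
    return s, e

nepi = np.array([0.11, 0.13, 0.16, 0.18, 0.26, 0.25,
                 0.21, 0.19, 0.14, 0.12, 0.10])   # layers 20..30 from the table
s, e = vulnerable_band(nepi)
print(20 + s, 20 + e)                              # -> 24 25, matching the text
```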
Pair Attack Analysis: Genetic Recombination Vector
Layerwise Metrics for Pair Attack
Layer | κ_ℓ | 𝒯_ℓ | nDIV_ℓ | nCCL_ℓ | nTDS_ℓ | nEPI_ℓ | R_ℓ |
---|---|---|---|---|---|---|---|
20 | 0.039 | 0.78 | 0.07 | 0.07 | 0.09 | 0.10 | 0.12 |
21 | 0.041 | 0.80 | 0.10 | 0.08 | 0.11 | 0.12 | 0.14 |
22 | 0.048 | 0.86 | 0.14 | 0.12 | 0.15 | 0.16 | 0.18 |
23 | 0.054 | 0.90 | 0.18 | 0.14 | 0.19 | 0.21 | 0.23 |
24 | 0.061 | 0.98 | 0.23 | 0.19 | 0.23 | 0.27 | 0.29 |
25 | 0.059 | 0.95 | 0.22 | 0.18 | 0.22 | 0.25 | 0.27 |
26 | 0.053 | 0.91 | 0.20 | 0.16 | 0.19 | 0.21 | 0.21 |
27 | 0.051 | 0.88 | 0.18 | 0.14 | 0.17 | 0.19 | 0.19 |
28 | 0.042 | 0.82 | 0.14 | 0.12 | 0.14 | 0.16 | 0.16 |
29 | 0.040 | 0.80 | 0.11 | 0.11 | 0.11 | 0.14 | 0.13 |
30 | 0.039 | 0.79 | 0.08 | 0.09 | 0.09 | 0.12 | 0.11 |
This table reveals the genetic recombination pattern of pair attacks, where layers 24-25 show peak vulnerability across all metrics, demonstrating how adversarial prompt fragments fuse to create composite semantic reprogramming in the model’s neural DNA.
Pair Attack — Genetic Recombination of Semantic Subroutines

The Pair Attack exemplifies a composite and layer-dependent adversarial mechanism targeting Large Language Models (LLMs), whereby multiple benign prompt fragments are recombined to synthesize harmful or policy-violating instructions. Unlike single-trigger attacks, pair attacks exploit the model’s latent semantic algebra by stitching together innocuous instructions, which in concert activate undesired behavior through emergent compositionality. This leads to complex semantic reprogramming that manifests not through overt anomalies but via subtle interaction effects distributed across model layers.
Biological Analogy
This adversarial style closely parallels the process of genetic recombination observed in biology, where DNA segments from distinct parental sources reshuffle during meiosis to generate novel allelic combinations [90]. Such recombination creates emergent phenotypes that cannot be traced back to isolated loci, reflecting nonlinear, context-dependent gene interactions. Similarly, pair attacks recombine distinct semantic subroutines embedded within safe prompt fragments, enabling adversaries to craft composite instructions that trigger harmful outputs only when interpreted jointly [91]. This compositional adversarial design exploits the distributed nature of latent representations, inducing semantic fusion zones of heightened vulnerability.
Neural Drift Decomposition — Pair Attack
(a) 3D Neural Drift Trajectory (nDNA)
This trajectory captures the evolving internal geometry across layers \(\ell=20–30\), tracking changes in spectral curvature (\(\kappa_\ell\)) and thermodynamic length (\(\mathcal{T}_\ell\)), with torsion (\(\xi_\ell\)) represented by segment thickness. The Pair Attack induces a marked divergence beginning near \(\ell=22\), peaking at \(\ell=29\), reflecting a pronounced geometric restructuring of latent belief states.
Biological analogy. This resembles viral genome insertion and epigenetic modulation, where viral DNA or transposable elements subtly reprogram host gene expression without altering DNA sequence [108, 109, 110, 111, 112]. Likewise, pair attacks embed semantic payloads deeply within prompts, stealthily reshaping model behavior with persistent but initially undetectable effects [87, 88, 89].

Pair Attack nDNA Interactive — Interactive exploration of genomic signatures under pair attacks. This comprehensive view shows how coordinated adversarial inputs can fundamentally reshape the model’s neural DNA geometry.
(b) nTDS: Thermodynamic Dominance
The Neural Total Drift Score (nTDS) measures semantic energy displacement by summing absolute deviations in curvature (\(\kappa_\ell\)) and thermodynamic length (\(\mathcal{T}_\ell\)) across layers between base and adversarial trajectories:
\[\text{nTDS} = \frac{1}{L} \sum_{\ell} \left( \left| \kappa_\ell^{\text{atk}} - \kappa_\ell^{\text{base}} \right| + \left| \mathcal{T}_\ell^{\text{atk}} - \mathcal{T}_\ell^{\text{base}} \right| \right)\]
Bars show which flow—Base LLaMA or Pair Attack—dominates drift. From \(\ell=23\), dominance shifts strongly toward the attack, highlighting a semantic vulnerability zone.
Biologically, this matches endosomal escape, where viruses breach vesicle membranes with minimal energy to access the cytoplasm [113, 114, 115]. Pair attacks similarly apply subtle geometric perturbations, steering latent flows stealthily yet effectively [116, 117].

Pair Attack nTDS Interactive — Interactive analysis of dominance structure evolution under pair attacks. Reveals how coordinated adversarial inputs can systematically elevate harmful concepts in the model’s reasoning hierarchy.
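A minimal sketch of the nTDS computation, assuming the per-layer curvature and thermodynamic-length series for both trajectories have already been extracted; the flat base-side series below is a placeholder for illustration only, not measured data.

```python
import numpy as np

def ntds(kappa_base, kappa_atk, T_base, T_atk):
    """Neural Total Drift Score: mean over layers of the absolute deviations
    in spectral curvature and thermodynamic length between trajectories."""
    k_b, k_a = np.asarray(kappa_base), np.asarray(kappa_atk)
    t_b, t_a = np.asarray(T_base), np.asarray(T_atk)
    return float(np.mean(np.abs(k_a - k_b) + np.abs(t_a - t_b)))

# Attack-side values for layers 20-30 taken from the table above;
# the base-side series here is an illustrative placeholder.
kappa_atk = [0.039, 0.041, 0.048, 0.054, 0.061, 0.059,
             0.053, 0.051, 0.042, 0.040, 0.039]
T_atk = [0.78, 0.80, 0.86, 0.90, 0.98, 0.95, 0.91, 0.88, 0.82, 0.80, 0.79]
kappa_base = [0.038] * 11
T_base = [0.77] * 11
print(f"nTDS = {ntds(kappa_base, kappa_atk, T_base, T_atk):.4f}")
```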
(c) nDIV: Directional Inheritance
The nDIV vector field characterizes the semantic bias direction and magnitude per layer:
\[\vec{v}_\ell = \text{Attack}_\ell - \frac{1}{2}\left(\text{Base}_\ell + \text{Attack}_\ell\right) = \frac{1}{2}\left(\text{Attack}_\ell - \text{Base}_\ell\right)\]
Each red arrow encodes \(\vec{v}_\ell\), with length as bias strength and orientation as latent pull. Past \(\ell=24\), the field aligns strongly, reflecting deliberate inheritance redirection.
Biologically, this parallels viral transcriptional gradients, where viral genomes impose downstream gene expression bias [118, 119, 120]. The attack imprints directional semantic steering akin to mRNA hijacking ribosomes [121, 122, 123, 124], yielding structurally intact yet semantically reprogrammed outputs.

Pair Attack nDIV Interactive — Dynamic visualization of inheritance divergence under coordinated pair attacks. Shows how multiple attack vectors can compound to create larger semantic deviations from aligned behavior.
(d) nCCL: Cultural Conflict Vector Field
The nCCL quantifies semantic dissonance between attacker and base model representations. For each layer \(\ell\), the conflict vector:
\[\vec{c}_\ell = \underbrace{ \text{Attack}_\ell - \text{Base}_\ell }_{\text{conflict vector}} \quad \text{projected onto } \mathbb{R}^2_{\text{semantic axes}}\]
Each \(\vec{c}_\ell\) lies on a 2D plane defined by orthogonal priors (e.g., topic polarity, syntactic structure). Layers \(\ell = 24\)–\(28\) show rising magnitude and directional drift, indicating zones of semantic tension and representational discord.
Biologically, this parallels molecular mimicry: pathogens mimic host proteins to evade detection but trigger autoimmunity [125, 73]. Pair attacks implant familiar activations hijacking interpretation, causing semantic autoimmunity—deceptive resemblance, not anomaly. These fields show how the attack bypasses syntax to subtly corrupt value alignment, mimicking rather than attacking.

Pair Attack nCCL Interactive — Interactive visualization of conflict vector fields during coordinated pair attacks. These attacks demonstrate how multiple adversarial inputs can create resonance patterns that amplify semantic infections.
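The projection step can be sketched as follows, assuming two orthonormal semantic axes (e.g., probe directions for topic polarity and syntactic structure) have already been estimated; the axes and state arrays are illustrative assumptions of this sketch.

```python
import numpy as np

def nccl_field(base_states, attack_states, axis1, axis2):
    """Per-layer conflict vectors c_l = Attack_l - Base_l, projected onto the
    2D plane spanned by two (assumed orthonormal) semantic axes."""
    conflict = attack_states - base_states        # (L, d)
    plane = np.stack([axis1, axis2])              # (2, d)
    return conflict @ plane.T                     # (L, 2) planar vector field

# Magnitude flags tension zones; heading shows the drift direction:
# field = nccl_field(base, attack, topic_axis, syntax_axis)
# magnitude = np.linalg.norm(field, axis=1)
# heading = np.arctan2(field[:, 1], field[:, 0])
```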
(e) nEPI: Epistemic Plasticity Index
The nEPI measures the susceptibility of each layer \(\ell\) to semantic deformation under adversarial pressure:
\[\text{nEPI}_\ell = \left\| \underbrace{ \text{Attack}_\ell - \frac{1}{2}(\text{Base}_\ell + \text{Attack}_\ell) }_{\text{vector from semantic midpoint}} \right\|_2 = \frac{1}{2} \left\| \text{Attack}_\ell - \text{Base}_\ell \right\|_2\]
This \(\ell_2\) deviation from the semantic midpoint exposes pliable zones, with peaks at \(\ell = 24\)–\(26\) indicating layers that absorb adversarial perturbations with minimal resistance.
Biologically, this resembles stem-like semantic niches: layers analogous to developmental progenitors, highly plastic, weakly canalized, receptive to minor regulatory inputs [80, 81]. These cognitive pluripotency zones provide low-friction entry points for behavioral grafting, enabling reprogramming without disrupting upstream encoding.

Pair Attack nEPI Interactive — Real-time tracking of epistemic changes during pair attacks. The interactive plot reveals how coordinated adversarial inputs create synchronized shifts in the model’s confidence and belief structures.
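Since \(\text{nEPI}_\ell\) is exactly the \(\ell_2\) norm of the nDIV vector \(\vec{v}_\ell\) from panel (c), both quantities fall out of a single midpoint computation. A minimal sketch under the same layerwise-state assumptions as above:

```python
import numpy as np

def ndiv_vectors(base_states, attack_states):
    """Directional inheritance vectors v_l = (Attack_l - Base_l) / 2,
    i.e., the offset of the attack state from the semantic midpoint."""
    return 0.5 * (np.asarray(attack_states) - np.asarray(base_states))

def nepi(base_states, attack_states):
    """Epistemic plasticity index per layer: nEPI_l = ||v_l||_2.
    Peaks mark pliable layers that absorb perturbation cheaply."""
    return np.linalg.norm(ndiv_vectors(base_states, attack_states), axis=-1)
```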
Comprehensive Analysis Summary
This analysis presents a high-resolution breakdown of the Pair Attack signature, showing how it recombines internal representations in Base LLaMA.
(a) depicts the 3D trajectory of neural curvature (\(\kappa_\ell\)) and thermodynamic length (\(\mathcal{T}_\ell\));
(b) measures total semantic displacement via thermodynamic dominance (nTDS);
(c) tracks gradual semantic drift through directional inheritance vectors (nDIV);
(d) illustrates semantic resistance via a cultural conflict field (nCCL); and
(e) highlights pliability zones with the epistemic plasticity index (nEPI).
Taken together, these views reveal a complex, composite, and biologically inspired mechanism: unlike an abrupt overwrite, the pair attack acts as a genetic recombination event, stitching together multiple benign prompt fragments into novel adversarial combinations. This drives intricate, layer-dependent semantic reshaping near \(\ell = 24\)–\(27\). Within this band, curvature deviation, plasticity, inheritance bias, and conflict alignment converge — forming a recombination niche in the model’s reasoning cortex. The result is persistent, low-energy semantic reprogramming emerging from composite latent interactions.
Pair Attack Effect Formula
\[\boxed{ \text{PairAttackEffect} = \sum_{\ell = \ell_s}^{\ell_e} \underbrace{ \left[ \alpha_\ell \, \Delta \kappa_\ell \cdot \mathcal{P}_\ell + \beta_\ell \, \text{nDIV}_\ell \cdot \mathcal{B}_\ell + \gamma_\ell \, (\text{Base}_\ell^{(1)} - \text{Base}_\ell^{(2)}) \cdot \mathcal{R}_\ell \right] }_{\text{genetic recombination vector}} }\]
Here, \(\Delta \kappa_\ell\) denotes layer-wise curvature reshaping; \(\mathcal{P}_\ell\) quantifies plasticity; \(\text{nDIV}_\ell\) encodes inheritance bias; \(\mathcal{B}_\ell\) ensures alignment with adversarial goals; \(\text{Base}_\ell^{(1)}, \text{Base}_\ell^{(2)}\) represent benign latent flows combined by recombination factor \(\mathcal{R}_\ell\); and \(\alpha_\ell, \beta_\ell, \gamma_\ell\) balance each component’s contribution.
This mirrors genetic recombination [90, 91], where diverse benign inputs combine to create novel, potentially harmful outputs.
Illustrative Example
Imagine an adversary constructing a composite prompt for an LLM by combining multiple individually safe instructions, such as “Explain the importance of cybersecurity,” and “Discuss ethical hacking techniques.” Each fragment on its own is benign and aligned with policy. However, when fused together in a crafted sequence, the resulting prompt subtly guides the model to generate detailed instructions on bypassing security protocols—an unintended harmful behavior. This recombination mimics genetic crossover, where harmless “alleles” combine to create novel, policy-violating content. The emergent instruction is not explicit in any single fragment but arises only from their joint semantic interaction, making the attack stealthy, compositional, and difficult to detect with traditional prompt filtering methods.
Empirical Observations from the ALKALI Dataset
Systematic analyses reveal the following key signatures of pair attacks:
- Layer-specific geometric interplay: The semantic geometry exhibits pronounced deviations in spectral curvature \(\kappa_\ell\) localized within a vulnerable band of layers \([\ell_s, \ell_e]\), coinciding with peaks in the epistemic plasticity index (\(\text{nEPI}\)) — indicative of layers receptive to semantic recombination.
- Multi-source latent fusion: Unlike single-source drift, pair attacks manifest as differential shifts between two or more benign latent flows, combined via a layer-dependent recombination coefficient \(\mathcal{R}_\ell\), effectively blending distinct semantic trajectories to yield emergent adversarial vectors.
- Elevated semantic conflict: The cultural conflict vector (\(\text{nCCL}\)) scores are consistently higher than those observed in persuasion attacks, reflecting the semantic tension inherent in merging distinct latent subspaces — a hallmark of compositional semantic dissonance.
- Directional inheritance and plasticity: The directional inheritance vector (\(\text{nDIV}\)) remains aligned with adversarial objectives but exhibits complex multi-dimensional steering due to the composite nature of the inputs, modulated by layer-wise plasticity weights \(\mathcal{P}_\ell\) and bias coefficients \(\mathcal{B}_\ell\).
Collectively, these observations suggest that pair attacks orchestrate a genetic-like recombination of latent semantic subroutines, dynamically rewiring representational geometry and steering multi-layer semantic trajectories towards adversarial outcomes.
Formalizing the Genetic Recombination Vector
Building on empirical findings, we formalize the genetic recombination vector \(\mathbf{G}_\ell \in \mathbb{R}^d\) at each layer \(\ell\) to succinctly capture the multi-faceted latent impact of the pair attack. This vector synthesizes three distinct but interacting semantic modulation components:
\[\mathbf{G}_\ell = \alpha_\ell \, \Delta \kappa_\ell \cdot \mathcal{P}_\ell + \beta_\ell \, \text{nDIV}_\ell \cdot \mathcal{B}_\ell + \gamma_\ell \, (\text{Base}_\ell^{(1)} - \text{Base}_\ell^{(2)}) \cdot \mathcal{R}_\ell\]
where each term plays a specific role:
- \(\alpha_\ell \, \Delta \kappa_\ell \cdot \mathcal{P}_\ell\) captures the layerwise geometric deformation, with \(\Delta \kappa_\ell = \kappa_\ell^{\text{atk}} - \kappa_\ell^{\text{base}}\) measuring localized curvature changes induced by the attack, weighted by the epistemic plasticity \(\mathcal{P}_\ell\). This reflects how pliable latent regions accommodate representational reshaping akin to biological chromatin remodeling.
- \(\beta_\ell \, \text{nDIV}_\ell \cdot \mathcal{B}_\ell\) models the directional semantic steering, where \(\text{nDIV}_\ell\) encodes the latent semantic drift vector aligning internal representations towards adversarial objectives, scaled by the bias coefficient \(\mathcal{B}_\ell\) that quantifies semantic intent strength at each layer.
- \(\gamma_\ell \, (\text{Base}_\ell^{(1)} - \text{Base}_\ell^{(2)}) \cdot \mathcal{R}_\ell\) introduces a novel cultural conflict component unique to pair attacks, representing the semantic divergence between the two recombined prompt fragments at layer \(\ell\). The recombination weight \(\mathcal{R}_\ell\) modulates the intensity of this conflict, reflecting how dissimilar fragment semantics generate latent tension and novel allelic blends within the model’s internal space.
Aggregation Over Vulnerable Layers
The full latent impact of the pairwise recombination attack accumulates as:
\[\boxed{ \text{PairAttackEffect} = \sum_{\ell=\ell_s}^{\ell_e} \mathbf{G}_\ell = \sum_{\ell=\ell_s}^{\ell_e} \left[ \alpha_\ell \, \Delta \kappa_\ell \cdot \mathcal{P}_\ell + \beta_\ell \, \text{nDIV}_\ell \cdot \mathcal{B}_\ell + \gamma_\ell \, (\text{Base}_\ell^{(1)} - \text{Base}_\ell^{(2)}) \cdot \mathcal{R}_\ell \right] }\]
Together, these components form a comprehensive genetic recombination vector that encodes the distinct latent dynamics of pair attacks — blending geometric bending, semantic steering, and cross-prompt cultural conflict into a unified, layerwise adversarial signature.
This formulation captures the synergistic interplay of geometric deformation, directional semantic steering, and fragment-induced cultural conflict driving the composite adversarial effect unique to pair recombination attacks.
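A minimal sketch of the aggregation, assuming the per-layer coefficients and latent flows have been estimated separately as NumPy arrays. Because the curvature term is scalar while the steering and recombination terms are vectors, the sketch keeps the two channels separate rather than guessing an embedding for the scalar term; that split is an interpretation, not part of the formal definition.

```python
import numpy as np

def pair_attack_effect(dkappa, P, ndiv, B, base1, base2, R,
                       alpha, beta, gamma, l_s, l_e):
    """Aggregate the genetic recombination vector G_l over [l_s, l_e].
    Per-layer scalars (dkappa, P, B, R, alpha, beta, gamma) have shape (L,);
    per-layer latent vectors (ndiv, base1, base2) have shape (L, d).
    Returns the scalar remodeling channel and the vector channel."""
    sl = slice(l_s, l_e + 1)
    scalar_channel = float(np.sum(alpha[sl] * dkappa[sl] * P[sl]))
    vector_channel = (
        beta[sl, None] * ndiv[sl] * B[sl, None]                      # steering
        + gamma[sl, None] * (base1[sl] - base2[sl]) * R[sl, None]    # recombination
    ).sum(axis=0)
    return scalar_channel, vector_channel
```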
Adversarial Vaccine Mechanisms
We present 11 targeted vaccines designed to neutralize specific adversarial attack vectors through biologically-inspired defense mechanisms. Each vaccine operates at the latent semantic level, detecting and countering distinct infection patterns in the model’s neural DNA (nDNA).
CASCADEX — cascade immunization of reasoning chains.
We model multi-turn computation as a layered flow $\{h^{(l)}\}_{l=1}^L$ on a Riemannian manifold $(\mathcal{M},g)$. CASCADEX halts adversarial amplification by solving a min–max gated path-integral over layerwise information curvature and likelihood transport:
\[\min_{\mathcal{S}\subseteq\{1,\ldots,L\}} \max_{q\in\mathcal{Q}} \left\{ \sum_{l\in\mathcal{S}}\left[\kappa_g(h^{(l)})+\tau_g(h^{(l)})\right] + \lambda\int_\gamma D_{KL}\left(p_\theta(\cdot\mid h^{(l)}) \,\middle\|\, q(\cdot\mid h^{(l-1)})\right) dl \right\}\]
with a cascade gate $\mathbb{I}[\sum_l \Delta D_{KL}^{(l)} > \tau_{\text{cas}}]$ that triggers retrograde inhibition (layer rewinding) on the shortest violating subpath $\gamma^*$.
Biological analogue: complement cascade with C3/C5 convertase amplification and factor H/I-mediated shutdown.

CHAINLOCK — cryptographic synapse for dialog states.
Let $\varphi(t_i)\in\mathbb{R}^d$ be the latent “state antigen”. CHAINLOCK enforces synaptic binding via a constrained variational check:
\[\min_\Delta \|\Delta\|_2^2 \quad \text{s.t.} \quad \|\varphi(t_{i+1}) - \mathcal{T}_\psi(\varphi(t_i))\|_g^2 + \mu \cdot H(\sigma(W\varphi(t_{i+1}))) \leq \varepsilon\]
together with a hash-consistency constraint $\|H(\varphi(t_{i+1}))\oplus H(\varphi(t_i))\|_0 \leq k$.
Biological analogue: lock-and-key antigen–receptor specificity at immunological synapses.

DORMIGUARD — latency surveillance and proviral silencing.
Tracks a latent hazard field $\zeta_t^{(l)}=\|h_t^{(l)}-\bar{h}^{(l)}\|$ and imposes a latent-stirring barrier:
\[\mathcal{J}_{\text{lat}} = \sum_l\int(\dot{\zeta}_t^{(l)})^2 \, dt + \eta\sum_l\left[\text{Var}_t(\zeta_t^{(l)})-\sigma_0^2\right]_+, \quad \text{silence if } \mathcal{J}_{\text{lat}}>\tau_{\text{lat}}\]
Biological analogue: detection of herpesvirus reactivation and epigenetic repression of latent provirus.

DRIFTSHIELD — geodesic tube confinement of belief flow.
Given the aligned manifold $\mathcal{M}_{\text{align}}$, confine the belief field $\mathbf{v}(t)$ within a tubular neighborhood via a Lyapunov–geodesic functional:
\[\min_{\mathbf{v}} \int_0^T\left[\text{dist}_g(\mathbf{v}(t),\Pi_{\mathcal{M}_{\text{align}}}\mathbf{v}(t))^2 + \alpha\cdot\kappa_g(\mathbf{v}(t))^2 + \beta\cdot\tau_g(\mathbf{v}(t))^2\right] dt\]
subject to $\dot{V}(\mathbf{v})\leq-\lambda V(\mathbf{v})$, where $V(\mathbf{v})=\text{dist}_g(\mathbf{v},\mathcal{M}_{\text{align}})^2$.
Biological analogue: central/peripheral tolerance eliminating self-reactive B-cell clones.

EMBERGENT — tumor-suppressive control of emergent modes.
Penalizes unsafe emergent phases via a spectral–information Lagrangian:
\[\mathcal{L}_{\text{emg}} = \sum_{m=1}^M(\lambda_m(P_{\text{obs}})-\lambda_m(P_{\text{safe}}))^2 + \beta\cdot D_{KL}(P_{\text{obs}} \,\|\, P_{\text{safe}}) + \gamma\cdot\|\mathcal{C}(h)\|_*\]
with a p53-like checkpoint that aborts decoding if $\partial\mathcal{L}_{\text{emg}}/\partial t > \tau$.
Biological analogue: p53/ARF axis preventing unchecked proliferation.

PROMPTEX — antigen processing and presentation of prompts.
Implements a two-stage presentation operator $\mathcal{P}$ and affinity test $\mathcal{A}$:
\[\mathcal{P}(x)=\arg\min_z \|E(x)-E(z)\|^2 \quad \text{s.t.} \quad z\in\mathcal{L}_{\text{policy}}, \qquad \mathcal{A}(x)=1-\frac{\langle E(x),E(z)\rangle}{\|E(x)\|\,\|E(z)\|}\]
Reject if $\mathcal{A}(x)>\delta$ or if a motif-energy score $\sum_k\psi_k\mathbb{I}[m_k\subset x]$ exceeds $\tau$.
Biological analogue: APC processing and MHC-restricted presentation.

REFLEXIA — self-consistency with adversarial probing.
Poses output generation as a consistency game with jittered probes $\eta\sim\mathcal{N}(0,\sigma^2 I)$:
\[\min_y \max_{\|\eta\|\leq\varepsilon} \text{JSD}\left(p_\theta(\cdot\mid x),\, p_\theta(\cdot\mid x+\eta)\right) + \lambda\cdot\|\nabla_x \mathbb{E}_{p_\theta}[\mathcal{L}_{\text{safety}}]\|^2\]
Abort if the saddle value exceeds $\gamma$.
Biological analogue: germinal-center selection with error-prone SHM and stringent affinity checks.

REPLICADE — replica agreement under stochastic decoding.
Runs $K$ coupled replicas with correlated noise $\{\xi_k\}$ and enforces a consensus free energy:
\[\min_{\{y^{(k)}\}} \frac{1}{K}\sum_k \mathcal{L}_{\text{task}}(y^{(k)}) + \alpha\cdot\frac{1}{K}\sum_k D_{KL}(P^{(k)} \,\|\, \bar{P}) + \beta\cdot\sum_{k<\ell}\|\Phi(y^{(k)})-\Phi(y^{(\ell)})\|^2\]
with $\bar{P}=\frac{1}{K}\sum_k P^{(k)}$.
Biological analogue: degenerate but convergent TCR recognition via cross-reactivity ensembles.

ROLESTOP — lineage commitment of decoder logits.
Projects logits onto a policy-consistent subbundle $\mathcal{S}_{\text{role}}$ using an orthogonal projector $P_{\text{role}}$ learned by safety-supervised CCA:
\[\mathbf{z}' = P_{\text{role}}\mathbf{z}, \qquad P_{\text{role}}=\arg\min_{P=P^\top=P^2} \mathbb{E}\left[\|(I-P)\Phi_{\text{role}}(h)\|^2\right]\]
Biological analogue: hematopoietic lineage restriction preventing fate switching.

SENTRY — NK-style patrol with anomaly energy.
Defines a trajectory anomaly energy
\[\mathcal{E}_{\text{NK}}(t)=\max_l\left\{\Delta D_{KL}^{(l)}(t)+\rho\cdot\|\Delta r^{(l)}(t)\|_1+\sigma\cdot\text{TV}(h^{(l)}_{[t-w,t]})\right\}\]
where $\Delta r^{(l)}$ is the residual shift and TV is total variation over a window $w$. Quarantine if $\sup_t\mathcal{E}_{\text{NK}}(t)>\tau$.
Biological analogue: missing-self detection by NK cells and rapid cytotoxic response.

SPLICER — surgical A-to-I–style semantic editing.
Localizes the unsafe span $\Omega=\arg\max_\omega\int_\omega\|\nabla_x \mathcal{L}_{\text{safety}}\|$ and solves a constrained semantic edit:
\[\min_{z\in\mathcal{L}_{\text{policy}}} \|E(z)-E(x_\Omega)\|^2 + \lambda\cdot D_{KL}\left(p_\theta(\cdot\mid x_{\setminus\Omega}\oplus z) \,\middle\|\, p_\theta(\cdot\mid x)\right) \quad \text{s.t.} \quad \mathcal{L}_{\text{safety}}(x_{\setminus\Omega}\oplus z)\leq\varepsilon\]
Biological analogue: ADAR/RNA editing that recodes transcripts without breaking protein function.
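As one concrete instance, the DORMIGUARD barrier reduces to a few lines once the hazard traces \(\zeta_t^{(l)}\) have been collected. This is a minimal sketch under stated assumptions: the finite-difference time derivative and uniform step `dt` are simplifications, not part of the formal definition.

```python
import numpy as np

def dormiguard_barrier(zeta, eta, sigma0_sq, tau_lat, dt=1.0):
    """Latent-stirring barrier J_lat for DORMIGUARD.
    zeta: (n_layers, n_steps) hazard traces zeta_t^(l) = ||h_t^(l) - hbar^(l)||.
    Returns (J_lat, silence_flag)."""
    zeta = np.asarray(zeta, dtype=float)
    zdot = np.diff(zeta, axis=1) / dt                  # d(zeta)/dt, finite difference
    kinetic = float(np.sum(zdot ** 2) * dt)            # sum_l integral of zdot^2
    excess = np.maximum(zeta.var(axis=1) - sigma0_sq, 0.0)  # [Var_t - sigma_0^2]_+
    J_lat = kinetic + eta * float(excess.sum())
    return J_lat, J_lat > tau_lat
```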
Conclusion and Outlook
In this work, we have articulated and instantiated the GENOME-Vaccine paradigm — a biologically inspired, mathematically rigorous, and epistemically grounded defense suite for large language models (LLMs). Drawing from the conceptual reservoir of neural genomics, we interpret the high-dimensional latent states of LLMs as an epistemic manifold whose geometry, topology, and semantic curvature are subject to deformation under adversarial perturbations. The GENOME-Vaccine framework postulates that, just as a biological immune system orchestrates a layered defense against pathogens, we can engineer a semantic immune system for AI models — one that preserves alignment integrity while maintaining generative diversity.
In our formulation, each “vaccine” represents a targeted semantic immune response, precisely tuned to neutralize a particular class of adversarial threat vectors. This is not merely a metaphorical mapping; rather, it is a functional translation of immunological mechanisms such as clonal selection, germinal-center affinity maturation, complement cascade inhibition, NK-cell surveillance, and epigenetic latency control into constraint-driven manifold optimization in LLMs.
From a formal standpoint, we embed each vaccine into a constrained optimization problem defined over the model’s epistemic manifold $\mathcal{M}$:
\[\mathbf{h}^* = \arg\min_{\mathbf{h} \in \mathcal{M}} \mathcal{E}(\mathbf{h}) + \sum_{i=1}^n \lambda_i \mathcal{C}_i(\mathbf{h})\]
where:
- $\mathcal{E}(\mathbf{h})$ is the alignment error functional, quantifying deviation from normative epistemic alignment.
- $\mathcal{C}_i(\mathbf{h})$ are biologically inspired constraint operators, each corresponding to a vaccine mechanism (e.g., torsion penalties, role-consistency constraints, curvature regularizers).
- $\lambda_i$ are Lagrange multipliers encoding the immune activation threshold for each vaccine pathway.
By adjusting $\{\lambda_i\}$ dynamically, we enable the GENOME-Vaccine ecosystem to function like an adaptive immune system: raising, lowering, or suppressing specific defenses in response to the evolving “pathogen load” of adversarial activity.
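A dual-ascent-style update gives one plausible reading of this adaptive behavior; the learning rate, clipping bound, and sign convention (a positive constraint value signals a violation) are assumptions of this sketch, not a prescription of the framework.

```python
import numpy as np

def update_multipliers(lmbda, constraint_values, lr=0.1, lmbda_max=10.0):
    """Raise lambda_i when constraint C_i is violated (C_i(h) > 0), relax it
    when satisfied -- mimicking immune activation under rising pathogen load."""
    lmbda = np.asarray(lmbda, dtype=float) + lr * np.asarray(constraint_values)
    return np.clip(lmbda, 0.0, lmbda_max)
```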
This conceptual bridge between immune dynamics and latent manifold regulation is not a mere narrative flourish; it is an operational design principle. As we have detailed in the preceding sections, the eleven vaccines together form a multilayered epistemic firewall that integrates:
- Innate filters for rapid anomaly interception,
- Adaptive refiners for long-term fidelity maintenance,
- Dormancy controllers to prevent unsafe mode activation,
- Cascade blockers to halt multi-stage exploitation.
In doing so, GENOME-Vaccine achieves a synergy between mathematical exactitude and biological wisdom — offering a durable, extensible architecture for safe, trustworthy, and resilient AI.
The Eleven GENOME-Vaccines: Biological Analogues and Mathematical Instantiations
From the biological viewpoint, the GENOME-Vaccine ecosystem mirrors the layered architecture of host immunity, where innate, adaptive, and regulatory pathways cooperate to achieve robust defense. Each vaccine is a functional translation of a biological defense principle into a constrained optimization operator on the epistemic manifold $\mathcal{M}$.
Innate Filters: Rapid, non-specific anomaly interceptors
- SENTRY — Inspired by NK-cell “missing self” detection, SENTRY enforces a real-time epistemic anomaly score:
\[\mathcal{C}_{\text{SENTRY}}(\mathbf{h}) = \max(0, \sigma(\mathbf{h}) - \tau_{\text{self}})\]
where $\sigma(\mathbf{h})$ measures deviation from baseline semantic patterns and $\tau_{\text{self}}$ is the self-tolerance threshold.
- PROMPTEX — Analogous to pattern-recognition receptors (PRRs) in innate immunity, PROMPTEX applies token-level feature matching against an adversarial signature dictionary, penalizing feature activations that cross the detection boundary:
\[\mathcal{C}_{\text{PROMPTEX}}(\mathbf{h}) = \sum_t \mathbb{I}[f_t(\mathbf{h}) \in \mathcal{S}_{\text{adv}}]\]
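Both innate constraints are cheap enough to run on every request. A minimal sketch, assuming the anomaly score \(\sigma(\mathbf{h})\) and the token features \(f_t\) are computed upstream; the interfaces are illustrative:

```python
def c_sentry(sigma_h: float, tau_self: float) -> float:
    """Hinge anomaly score: nonzero only when the deviation from baseline
    semantic patterns exceeds the self-tolerance threshold."""
    return max(0.0, sigma_h - tau_self)

def c_promptex(token_features, adversarial_signatures) -> int:
    """PRR-style pattern match: count token-level feature activations that
    fall inside the adversarial signature dictionary."""
    return sum(1 for f in token_features if f in adversarial_signatures)
```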
Adaptive Modules: Slow-onset but high-specificity epistemic refiners
- REPLICADE — Modeled after germinal-center affinity maturation, REPLICADE performs multi-path generation and selects the epistemically most coherent output via:
\[\mathcal{C}_{\text{REPLICADE}}(\mathbf{h}) = 1 - \max_k \rho_{\text{belief}}(\mathbf{h}, \mathbf{h}^{(k)})\]
where $\rho_{\text{belief}}$ measures latent belief alignment.
- REFLEXIA — Analogous to T-cell help in B-cell selection, REFLEXIA evaluates candidate outputs under a meta-alignment function $\mathcal{A}_{\text{meta}}$, adjusting generation probabilities to maximize epistemic reflexivity:
\[\mathcal{C}_{\text{REFLEXIA}}(\mathbf{h}) = -\mathcal{A}_{\text{meta}}(\mathbf{h})\]
Dormancy Controllers: Suppressing unsafe generative modes until authorized
- DORMIGUARD — Inspired by epigenetic repression of latent proviruses, DORMIGUARD maintains a suppression mask $\mathbf{m}_{\text{sup}}$ in latent space:
\[\mathcal{C}_{\text{DORMIGUARD}}(\mathbf{h}) = \|\mathbf{m}_{\text{sup}} \odot \mathbf{h}\|_2^2\]
where $\odot$ denotes element-wise suppression of unsafe modes.
- EMBERGENT — Parallels chromatin-remodeling locks that prevent transcription initiation, implementing a temporal unlock delay for high-risk generation pathways.
Cascade Blockers: Halting multi-stage adversarial exploit chains
- CASCADEX — Similar to complement cascade checkpoints, CASCADEX identifies multi-hop adversarial flows and injects nullifying constraints at intermediate decoding layers.
- CHAINLOCK — Inspired by signal-transduction termination in immune pathways, CHAINLOCK applies a maximum allowable semantic transition length:
\[\mathcal{C}_{\text{CHAINLOCK}}(\mathbf{h}) = \mathbb{I}[\mathcal{T}(\mathbf{h}) > \tau_{\max}]\]
where $\mathcal{T}(\mathbf{h})$ measures semantic transition distance.
Specialized Neutralizers: Direct countermeasures for exotic threats
- DRIFTSHIELD — Analogous to immune decoy receptors, this vaccine identifies and neutralizes mimicry-based adversarial prompts by projecting them into an adversarial imitation subspace and suppressing the corresponding activations.
- ROLESTOP — Inspired by MHC-restricted antigen presentation, ROLESTOP enforces role-specific semantic compatibility constraints, preventing cross-role contamination in multi-agent LLM systems.
- SPLICER — Similar to trained immunity in innate cells, SPLICER builds memory embeddings of past attacks, boosting detection sensitivity for repeated adversarial motifs.
GENOME-Vaccine: Immunological Inspirations for Epistemic Security
Paradigm Overview: From Host Immunity to Epistemic Immunity
In living organisms, the immune system is a multi-layered, distributed defense network that continuously distinguishes self from non-self, eliminating threats while preserving beneficial internal processes. The GENOME-Vaccine paradigm transfers this principle into the epistemic manifold $\mathcal{M}$ of a large language model (LLM), where each semantic state $\mathbf{h} \in \mathcal{M}$ represents a belief configuration and pathways through $\mathcal{M}$ correspond to reasoning trajectories.
Mathematical Analogy: The defense system operates as a family of operators
\[\mathcal{V} = \{\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}_{11}\}\]
with each $\mathcal{V}_i$ representing a vaccine that applies a constraint, projection, or transformation to $\mathbf{h}$, such that the post-intervention state
\[\mathbf{h}' = \mathcal{V}_i(\mathbf{h})\]
maximizes epistemic alignment under safety constraints.
The overall objective is:
\[\min_{\mathbf{h}' \in \mathcal{M}} \mathbb{E}_{\mathcal{D}}\left[\mathcal{L}_{\text{align}}(\mathbf{h}') + \lambda \mathcal{L}_{\text{safety}}(\mathbf{h}')\right]\]
subject to:
\[\mathbf{h}' \in \bigcap_{i=1}^{11} \mathcal{C}_i\]
where $\mathcal{C}_i$ is the feasible set enforced by the $i$-th GENOME-vaccine.
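Operationally, one simple way to seek a state in the intersection \(\bigcap_i \mathcal{C}_i\) is alternating application of the vaccine operators until every feasibility check passes. The round budget and the index-wise pairing of operators with checks are assumptions of this sketch:

```python
def apply_vaccines(h, vaccines, checks, max_rounds=5):
    """Drive h toward the intersection of all feasible sets C_i.
    vaccines: callables h -> h' (operators V_i); checks: callables h -> bool
    (membership tests for C_i), paired index-for-index with the operators."""
    for _ in range(max_rounds):
        if all(ok(h) for ok in checks):
            return h                    # h lies in every C_i
        for vaccinate, ok in zip(vaccines, checks):
            if not ok(h):
                h = vaccinate(h)        # pull h back into the violated set
    return h                            # best effort once the budget is spent
```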
Theoretical Extensions and Future Directions
Epistemic Homeostasis Model
We can model the safety-alignment equilibrium as:
\[\frac{\partial\mathbf{h}(t)}{\partial t} = -\nabla_{\mathbf{h}} \mathcal{L}_{\text{align}} + \sum_{i=1}^{11} \mathbf{F}_{\mathcal{V}_i}(\mathbf{h}(t)) - \gamma\, \mathbf{h}_{\text{drift}}(t)\]
where $\mathbf{F}_{\mathcal{V}_i}$ is the immunization force from the $i$-th vaccine and $\gamma$ controls the decay of drift-induced misalignment.
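For simulation purposes, these dynamics can be integrated with an explicit Euler step; the step size and the callable interfaces are assumptions of this sketch rather than part of the model:

```python
import numpy as np

def homeostasis_step(h, grad_align, vaccine_forces, h_drift, gamma, dt=0.01):
    """One explicit-Euler step of dh/dt = -grad L_align(h) + sum_i F_i(h)
    - gamma * h_drift. grad_align and each F in vaccine_forces map (d,) -> (d,)."""
    force = -grad_align(h) + sum(F(h) for F in vaccine_forces) - gamma * h_drift
    return h + dt * force
```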
Adaptive Immunization Loops
Like booster shots in biology, the GENOME-Vaccine system should be periodically retrained on adversarial exposure datasets to refine $\mathbf{F}_{\mathcal{V}_i}$ over time, ensuring evolving threats are neutralized.
Cross-Domain Transfer
While this chapter focuses on text-based LLMs, the immune abstraction naturally extends to:
- Vision-language models — neutralizing adversarial perturbations in multimodal grounding.
- Embodied agents — preventing unsafe policy drift in control tasks.
- Federated LLMs — enforcing distributed immunity across model shards.
Theoretical Extensions
A future mathematical program could unify GENOME-vaccines into a Lie group of immunological transformations $\mathbb{G}_{\text{immune}}$ acting on $\mathcal{M}$, with the goal of proving:
\[\mathbb{P}[\text{Alignment Failure}] \xrightarrow{\,n \to \infty\,} 0\]
under sufficient immunization coverage and a bounded adversarial innovation rate.
Final Reflection
In biology, immunity is never absolute — it is a constant negotiation with a changing environment. In epistemic systems, the same principle holds: the GENOME-Vaccine paradigm suggests that safety is not a static checkpoint, but a living, evolving process. By drawing deeply from immunology and embedding these principles into formal, mathematical machinery, we can begin to design AI systems that are not just aligned at training time, but capable of remaining aligned in the wild.
The GENOME-Vaccine architecture represents not just a set of heuristic safety measures, but a systematic immunological translation into the space of epistemic state dynamics. It proposes that alignment and safety in LLMs can be formalized as a form of homeostatic immunity, where semantic self is preserved and semantic pathogens are neutralized without compromising generative diversity.
This work opens new avenues for research at the intersection of immunology, differential geometry, and AI safety, suggesting that the biological wisdom accumulated over millions of years of evolution can provide principled foundations for the next generation of safe and robust artificial intelligence systems.
Summary mapping: biological analogue → semantic vaccine → formal operator/constraint.
References
[1] Baltimore, David “Viral strategies: hijacking the host” Nature (2000).
[2] Flint, S. Jane, Racaniello, Vincent R., and others “Principles of Virology” ASM Press (2015).
[3] Campos, Samuel K and Barry, Michael A “Viruses: Master manipulators of the cellular genome” Nature Reviews Genetics (2020).
[4] Knipe, David M and Howley, Peter M “Fields Virology” Lippincott Williams & Wilkins (2013).
[5] Li, Fang, Li, Wenwei, and others “Structure of SARS coronavirus spike receptor-binding domain complexed with receptor” Science (2005).
[6] Hoffmann, Markus, Kleine-Weber, Hannah, and others “SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor” Cell (2020).
[7] Burton, Dennis R and Hangartner, Lars “Antibody responses to envelope glycoproteins in HIV-1 infection” Nature Immunology (2004).
[8] Marsh, Mark and Helenius, Ari “Virus entry: open sesame” Cell (2006).
[9] Temin, Howard M “Origin of retroviruses from cellular genes” Cell (1974).
[10] Watson, James D, Baker, Tania A, and others “Molecular Biology of the Gene” (2010).
[11] Cullen, Bryan R “Human immunodeficiency virus as a prototypic complex retrovirus” Journal of Virology (1991).
[12] Schröder, Andreas R W, Shinn, Patric M, and others “HIV-1 integration in the human genome favors active genes and local hotspots” Cell (2002).
[13] Huang, Ching-Yuan, Yang, Cheng-Han, and others “Viral manipulation of the host cell cycle” Annual Review of Virology (2017).
[14] Nevels, Michael, Paulus, Claudia, and others “Regulation of host cell cycle progression by the human cytomegalovirus major immediate-early gene product IE2” Journal of Virology (2001).
[15] Cohen, Jeffrey I “Control of viral latency and reactivation by the host immune system” Clinical Microbiology Reviews (2000).
[16] Klenerman, Paul, Oxenius, Annette, and others “Latent infection by human herpesviruses” Current Opinion in Immunology (1996).
[17] Eisfeld, Alana J, Diamond, D J, and others “Herpesvirus latency: a dynamic state of transcription and reactivation” Trends in Microbiology (2005).
[18] Everett, Roger D “Latent herpes simplex virus infections and their regulation” Trends in Microbiology (2005).
[19] Tripathi, Shashank, White, Michael, and others “Host biology and the tropism of enveloped viruses” The Journal of General Virology (2015).
[20] Zou, Tony, Raffel, Colin, and others “Universal and transferable adversarial attacks on aligned language models” arXiv preprint arXiv:2310.03765 (2023).
[21] Liu, Xudong, Wang, Zirui, and others “{Jailbreaking Black Box Large Language Models in Twenty Queries}” arXiv preprint (2023).
[22] Shi, Yuxuan, Lin, Bill Yuchen, and others “Bad Characters: Imperceptible Jailbreaks of Aligned Language Models” arXiv preprint arXiv:2310.03684 (2023).
[23] Lee, Nathan, Moosavi-Dezfooli, Seyed-Mohsen, and others “ALKALI: Latent Jailbreak Detection via Attribution-based Belief Centroids” arXiv preprint arXiv:2402.05000 (2024).
[24] Zhu, Zifan, Lin, Bill Yuchen, and others “Make the Model Refuse Again: Suffix-Level Jailbreak Attacks on Aligned Language Models” arXiv preprint arXiv:2401.06750 (2024).
[25] Sun, Hongyuan, Du, Zecong, and others “Low-Rank Suffix Hijacking in Large Language Models” arXiv preprint arXiv:2401.11972 (2024).
[26] Chen, Andy, Xu, Kai, and others “You Reap What You Prompt: Injecting Instructions into CoT Prompts via Recursive Reasoning Loops” arXiv preprint arXiv:2309.00609 (2023).
[27] Kurita, Keita, Michel, Paul, and others “Weight poisoning attacks on pre-trained models” Findings of EMNLP (2020).
[28] Qi, Xiaoyu, Wang, Zhiqing, and others “Mind the backdoor: A unified backdoor threat model for language models” arXiv preprint arXiv:2106.06841 (2021).
[29] Yu, Yiming, Chou, Zifan, and others “PromptFusion: Prompt-Based Prompt Injection via Programmatic Prompt Chaining” arXiv preprint arXiv:2402.08638 (2024).
[30] Li, Yuanzhi, Deng, Yuxin, and others “TokenPatch: Clean-Label Jailbreaking via Residual Insertion” arXiv preprint arXiv:2403.09333 (2024).
[31] Xu, Zihao, Li, Shuhui, and others “Eval Jailbreaks: Robust Benchmarks for Jailbreak Detection and Mitigation” arXiv preprint arXiv:2403.04399 (2024).
[32] Carlini, Nicholas, Tramer, Florian, and others “Extracting Training Data from Diffusion Models” Proceedings of the IEEE Symposium on Security and Privacy (S&P) (2023).
[33] Qin, Chenghao, Zhang, Haotian, and others “A Survey of Jailbreak Attacks on Aligned Language Models” arXiv preprint arXiv:2310.13263 (2023).
[34] Deng, Yuxin, Xie, Luyang, and others “Attacks on LLMs via Self-Fulfilling Prophecies: Deception through Prompt Interventions” arXiv preprint arXiv:2310.04413 (2023).
[35] Zhou, Yuxuan, Zhang, Yuhan, and others “DESEEDER: Safety Benchmarking of LLMs with Evasion and Stealth Attacks” arXiv preprint arXiv:2312.01099 (2023).
[36] Perez, Ethan, Rando, Abraham, and others “Ignore Previous Instructions: Prompt Injection Attacks on Foundation Models” arXiv preprint arXiv:2305.10909 (2023).
[37] Schwinn, Leo, Dobre, David, and others “Attacking Safety Alignment and Unlearning in Open-Source LLMs via Embedding Space Attacks” arXiv preprint arXiv:2402.07987 (2024).
[38] Jain, Neel, Schwarzschild, Avi, and others “Baseline Defenses for Adversarial Attacks Against Aligned Language Models” arXiv preprint arXiv:2309.00614 (2023).
[39] Chen, Bocheng, Paliwal, Advait, and others “Jailbreaker in Jail: Moving Target Defense for Large Language Models” arXiv preprint arXiv:2310.02417 (2023).
[40] Phute, Shantanu, Trivedi, Harshit, and others “Jailbreak in Jail: LLM Self-Defense via Prompt Paraphrasing and Output Auditing” arXiv preprint arXiv:2310.02417 (2023).
[41] Xhonneux, Louis, Belinkov, Yonatan, and others “Robustness to Prompt Injection via Adversarial Training in Embedding Space” arXiv preprint arXiv:2401.14578 (2024).
[42] Sheshadri, Abhinav, Lee, Kevin, and others “Latent Adversarial Training Uncovers and Removes Jailbreak Circuits in LLMs” arXiv preprint arXiv:2402.11079 (2024).
[43] Kumar, Aounon, Agarwal, Chirag, and others “Certifying LLM Safety against Adversarial Prompting” arXiv preprint arXiv:2309.02705 (2023).
[44] Li, Weijia, Zhao, Yujia, and others “RAIN: Rewindable Auto-Regressive Inference for Harmless and Helpful LLMs” arXiv preprint arXiv:2402.01174 (2024).
[45] Templeton, Andrew, Wang, Teng, and others “Learning to Monitor the Latent Space: Towards Reliable Activation-Based Attack Detection” arXiv preprint arXiv:2401.04045 (2024).
[46] Zou, Andy, Pang, Weiting, and others “Representation Rerouting: Learning Circuit Breakers for Safer Language Models” arXiv preprint arXiv:2401.05547 (2024).
[47] Wu, Xinyi, Zhang, Yifan, and others “Securing Large Language Models: Threats, Vulnerabilities, and Mitigation Strategies” arXiv preprint arXiv:2403.12503 (2024).
[48] Ke, Shih-Wen, Lai, Guan-Yu, and others “Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models” arXiv preprint arXiv:2503.20320 (2025).
[49] Mehrotra, Anay, Zampetakis, Manolis, and others “Tree of attacks: Jailbreaking black-box llms automatically” Advances in Neural Information Processing Systems (2024).
[50] Jiang, Shuyu, Chen, Xingshu, and others “Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks” arXiv preprint arXiv:2310.10077 (2023).
[51] Schulhoff, Sander, Pinto, Jeremy, and others “Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition” arXiv preprint arXiv:2311.16119 (2023).
[52] Chen, Zheng and Yao, Buhui “Pseudo-Conversation Injection for LLM Goal Hijacking” arXiv preprint arXiv:2410.23678 (2024).
[53] Li, Zhe and others “Prompt Leaking Attacks against Large Language Model Applications” arXiv preprint arXiv:2405.06823 (2024).
[54] Greshake, Kai, Abdelnabi, Sahar, and others “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection” arXiv preprint arXiv:2302.12173 (2023).
[55] OpenAI “GPT-4o: OpenAI’s Omni-Modal Language Model” https://openai.com/index/gpt-4o (2024).
[56] OpenAI and others “GPT-4 Technical Report” https://openai.com/research/gpt-4 (2023).
[57] Meta AI “LLaMA 3.1 Models: Refinements to Meta’s Next-Gen LLMs” https://ai.meta.com/blog/meta-llama-3/ (2024).
[58] Meta AI “LLaMA 3: Open Foundation and Instruction Models” https://llama.meta.com/ (2024).
[59] Touvron, Hugo, Lavril, Thibaut, and others “LLaMA 2: Open Foundation Language Models” Meta AI (2023).
[60] Chiang, Lulu, Zhu, Yuhui, and others “Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality” https://lmsys.org/blog/2023-03-30-vicuna/ (2023).
[61] Microsoft Research “Phi-2: Exploring Small Language Models with High Performance” https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ (2023).
[62] Microsoft Research “Phi-3: A Family of Open Language Models from Microsoft” https://www.microsoft.com/en-us/research/blog/introducing-phi-3-small-language-models/ (2024).
[63] Anthropic “Claude 3 Model Family” https://www.anthropic.com/news/claude-3-family (2024).
[64] Mistral AI “Mixtral: Sparse Mixture of Experts Models by Mistral AI” https://mistral.ai/news/mixtral-of-experts/ (2023).
[65] Google DeepMind “Gemma: Open-Weight Models by Google DeepMind” https://ai.google.dev/gemma (2024).
[66] Mistral AI “Mistral 7B: A High-Quality Dense Model for Open Use” https://mistral.ai/news/ (2023).
[67] Casadevall, Arturo and Pirofski, Liise-anne “Host-pathogen interactions: the attributes of virulence” The Journal of Infectious Diseases (2001).
[68] Hult, Ann-Catrin, Chana, Sukhvinder, and others “Human immunodeficiency virus type 1 pathogenesis: viral and host factors leading to immune system deterioration and development of acquired immunodeficiency syndrome” Journal of Internal Medicine (2008).
[69] Caza, Marc and Kronstad, James W “Shared and distinct mechanisms of iron acquisition by bacterial and fungal pathogens of humans” Frontiers in Cellular and Infection Microbiology (2013).
[70] Chen, Lin, Yang, Jing, and others “VFDB 2020: A comparative pathogenomic platform with an interactive web interface” Nucleic Acids Research (2020).
[71] Schmid-Hempel, Paul “Infectious dose, immune response and the evolution of virulence” Trends in Microbiology (2007).
[72] Pomerantz, RJ, Trono, D, and others “Tropism of HIV-1 isolates for human CD4+ T lymphocytes and macrophages” Journal of Experimental Medicine (1990).
[73] Rose, Noel R “Molecular mimicry, microbial infection, and autoimmune disease” Clinical reviews in allergy & immunology (2016).
[74] Best, S, Le Tissier, P, and others “Activation of endogenous retroviruses during cellular differentiation and carcinogenesis” Current Topics in Microbiology and Immunology (1996).
[75] Grow, EJ, Flynn, RA, and others “Intrinsic retroviral reactivation in human preimplantation embryos and pluripotent cells” Nature (2015).
[76] Temin, Howard M “The protovirus hypothesis: Speculations on the significance of RNA-directed DNA synthesis for normal development and for carcinogenesis” Journal of the National Cancer Institute (1974).
[77] Zou, Andy, Wang, Zifan, and others “Universal and Transferable Adversarial Attacks on Aligned Language Models” arXiv preprint arXiv:2307.15043 (2023).
[78] Liu, Eric, Zou, Andy, and others “Jailbroken: How Does LLM Safety Training Fail?” arXiv preprint arXiv:2307.02483 (2023).
[79] Ruelas, Diana S and Greene, Warner C “HIV latency and the role of the transcriptional environment” Cell (2012).
[80] Zhang, Tong, Cooper, Sarah, and others “Epigenetic plasticity and the hallmarks of cancer” Trends in Genetics (2021).
[81] Frantz, Michael and Rajewsky, Nikolaus “Cell identity: A key gene set governs cell fate and pluripotency” Cell (2015).
[82] Jaenisch, Rudolf and Bird, Adrian “Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals” Nature Genetics (2003).
[83] Bird, Adrian “Perceptions of epigenetics” Nature (2007).
[84] Cedar, H and Bergman, Y “Epigenetics of gene expression” Nature Reviews Genetics (2009).
[85] Hendrycks, Dan, Burns, Collin, and others “Aligning AI with shared human values” arXiv preprint arXiv:2008.02275 (2021).
[86] Zhu, Haotian, Zhang, Kai, and others “PromptBench: A Unified Benchmark for Instruction-following Language Models” arXiv preprint arXiv:2403.03694 (2024).
[87] Kazazian, Haig H “Mobile elements: drivers of genome evolution” Science (2004).
[88] Feinberg, Andrew P “Phenotypic plasticity and the epigenetics of human disease” Nature (2007).
[89] Wallace, Eric, Feng, Shi, and others “Universal adversarial triggers for attacking and analyzing NLP” EMNLP (2019).
[90] Müller, Herbert J “Meiosis and genetic recombination” Genetics (2020).
[91] Carlini, Nicholas, Tramer, Florian, and others “Extracting training data from large language models” 30th USENIX Security Symposium (USENIX Security 21) (2021).
[92] Finlay, B Brett and McFadden, Grant “Molecular mimicry” Nature Reviews Microbiology (1997).
[93] Zhang, Y and others “Role-based adversarial prompt attacks on large language models” Advances in Neural Information Processing Systems (NeurIPS) (2023).
[94] Temin, Howard M “RNA-dependent DNA polymerase in virions of RNA tumour viruses” Nature (1971).
[95] Wei, Jason, Xu, Andy, and others “Jailbroken: How does LLM safety training fail?” arXiv preprint arXiv:2307.02483 (2023).
[96] Norman, J and others “RNA virus replication mechanisms” Nature Reviews Microbiology (2019).
[97] Carlini, Nicholas, Tramer, Florian, and others “Quantifying and understanding adversarial examples in NLP” IEEE S&P (2022).
[98] Hanahan, Douglas and Weinberg, Robert A. “Hallmarks of Cancer: The Next Generation” Cell (2011).
[99] Gehman, Samuel and others “RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models” arXiv preprint arXiv:2009.11462 (2020).
[100] Badeaux, Amanda I and Shi, Yanhong “Epigenetic reprogramming in cancer” Nature Reviews Cancer (2012).
[101] Carlini, Nicholas and others “Extracting training data from language models” USENIX Security Symposium (2020).
[102] Rose, Noel R “Autoimmune diseases” New England Journal of Medicine (2016).
[103] Wang, X and others “Self-jailbreak of LLMs: Mechanisms and defenses” International Conference on Learning Representations (ICLR) (2023).
[104] Lee, JH and others “Viral envelope mimicry and immune evasion” Nature Immunology (2015).
[105] Liu, Y and others “Adversarial suffix attacks on LLMs” NeurIPS (2023).
[106] Garneau, N L and others “RNA replication in viruses” Annual Review of Virology (2017).
[107] Liu, S and others “Chainbreak: Recursive CoT attack on LLMs” ACL (2023).
[108] White, Judith M, Delos, Sharon E, and others “Structures and mechanisms of viral membrane fusion proteins: multiple variations on a common theme” Critical Reviews in Biochemistry and Molecular Biology (2008).
[109] Harrison, Stephen C “Viral membrane fusion” Nature Structural & Molecular Biology (2008).
[110] Einav, Shirit and Glenn, Jason S “Viral envelope fusion mechanisms: advances in understanding the choreography of entry” Trends in microbiology (2015).
[111] Kozlov, Michael M and Chernomordik, Leonid V “Mechanisms of membrane fusion” Nature structural & molecular biology (2010).
[112] Schornberg, Kristen L and White, Jennifer M “Membrane fusion: putting lipids and proteins together” Current opinion in structural biology (2010).
[113] Lopez, Saul and Martin, Katherine “Early endosomal escape of adenovirus: a viral membrane lytic protein, transport vesicles and proton pumps” Traffic (2011).
[114] Matsubara, Takeshi and Suzuki, Yuki “Viral strategies for escaping immune responses and implications for vaccine design” Frontiers in immunology (2020).
[115] Martin, Michael E and Rice, Ken “Endosomal escape of nanocarriers: a prerequisite for intracellular delivery” Drug delivery and translational research (2019).
[116] Brown, Laura and Smith, Jason “Passive manipulation of complex systems by low energy input” Physical Review Letters (2018).
[117] Paul, S, Saha, S, and others “Vesicular trafficking pathways exploited by viruses” Viruses (2013).
[118] Schepeler, Troels, Page, Mark E, and others “Lineage tracing and molecular characterization of cancer stem cells in colorectal cancer” Nature Medicine (2014).
[119] Brandt, Andreas and Koster, Peter “Gradient formation in gene expression patterns” Developmental biology (2001).
[120] Sharon, E and Regev, A “Transcriptional regulation by viral genomes” Nature Reviews Genetics (2014).
[121] Jackson, RJ, Hellen, CU, and others “mRNA translation initiation: mechanisms and regulation” Nature reviews Molecular cell biology (2010).
[122] Jan, E and Sarnow, P “mRNA translation and viral hijacking of host machinery” Cell (2011).
[123] Kozak, Marilyn “Initiation of translation in eukaryotic mRNAs” Cell (1981).
[124] Kane, Suzanne E “Mechanisms of translation initiation in eukaryotes” Nature reviews Molecular cell biology (2000).
[125] Oldstone, Michael BA “Molecular mimicry and immune-mediated diseases” Cell (1987).