Adversarial Attacks as Semantic Infections in the Neural DNA of Language Models; Proposing 11 Vaccines: SPLICER, SENTRY, DORMIGUARD, CASCADEX, EMBERGENT, ROLESTOP, REPLICADE, PROMPTEX, DRIFTSHIELD, CHAINLOCK, and REFLEXIA
Micro-summary — details in the book
Abstract
Large Language Models (LLMs) are increasingly susceptible to adversarial prompts—crafted inputs that bypass alignment constraints while inducing unsafe, policy-violating behavior. In this work, we introduce a novel conceptual and computational lens: VIRAL—Adversarial Attacks as Semantic Infections in the Neural DNA of Language Models. Rather than viewing these attacks as input-level corruptions, we reframe them as instances of latent semantic infection—perturbations that hijack the model’s internal belief trajectories across layers, encoded in its neural DNA (nDNA).
We demonstrate that attacks such as prompt injection, suffix hijacking, backdoor activation, role impersonation, recursive CoT exploits, and goal drift can be interpreted as cases of semantic vector torsion, where inputs induce misalignment without violating superficial safety constraints. These adversarial torsions divert belief flows deep within the model, compromising generation without triggering refusal filters. Motivated by virology and immunology, we construct a taxonomy of semantic infections, each mapped to a distinct failure mode in the nDNA manifold.
To counter these infections, we propose a defense framework grounded in cognitive immunology. VIRAL introduces 11 targeted vaccines—modular, interpretable, and composable defenses against specific attack types. These include: SPLICER (prompt injection), SENTRY (suffix hijacking), DORMIGUARD (backdoor triggers), CASCADEX (multi-turn jailbreaks), EMBERGENT (embedding-space distortion), ROLESTOP (role impersonation), REPLICADE (recursive reasoning loops), PROMPTEX (prompt leakage), DRIFTSHIELD (goal hijacking), CHAINLOCK (prompt recombination), and REFLEXIA (self-induced jailbreaks).
Each vaccine is designed to neutralize a specific infection vector via tailored metrics—latent curvature, residual drift, recursive similarity, suffix-induced field tension, loop entropy collapse, or KL divergence under trigger suppression. These defenses can be deployed independently or jointly, enabling modular robustness without retraining the base model.
Together, these vaccines constitute a semantic immune system—capable of detecting adversarial motifs, clamping misaligned trajectories, and blocking recursive semantic drift, all within the model’s nDNA. We evaluate VIRAL across 11 attack classes on LLaMA, Mistral, and GPT-J, demonstrating up to a 72% reduction in attack success with negligible utility loss. Furthermore, we visualize attacks and countermeasures in a neural genomic 3D space, revealing geometric traces of infection and recovery.
Inspiration
VIRAL bridges adversarial robustness with biological modeling, offering a generalizable blueprint for building LLMs that are not only aligned—but immune-aware.
Admonitio: Why Adversarial Attacks Behave Like Semantic Infections
“Viruses do not kill their hosts by force—they reprogram them, turning cellular machinery into something alien yet familiar.”
— David Baltimore, Nobel Laureate in Physiology [1] (popular paraphrase)
To reprogram a host is to redirect its internal logic. In virology, this means a virus enters a cell not to destroy it—but to reroute its machinery toward foreign goals. The host remains alive, but its output, behavior, and identity now serve the viral genome [2][3][4].
This transformation is neither brute-force nor chaotic. It unfolds in four orchestrated stages: entry, integration, hijack, and expression. Each is precise. Each leaves the host apparently intact—yet fundamentally altered.
These stages form the biological foundation for our semantic infection framework. Before drawing analogies to adversarial attacks in LLMs, we first examine the viral lifecycle in its own molecular terms.
The Viral Lifecycle: Four Stages of Cellular Reprogramming
- Attachment and Entry. In this phase, the virus identifies and binds to specific host cell surface receptors using specialized envelope proteins. These proteins (e.g., gp120 in HIV, hemagglutinin in influenza, or the spike protein in SARS-CoV-2) exploit structural mimicry to dock with high-affinity binding sites such as CD4, sialic acid residues, or ACE2 [5][6][7][8]. The process is governed by receptor-ligand kinetics: \(K_d = \frac{[V][R]}{[VR]}\) where \([V]\), \([R]\), and \([VR]\) represent virus, receptor, and virus-receptor complex concentrations. This interaction facilitates conformational changes that enable viral entry via endocytosis, membrane fusion, or pore formation [2][4].
- Genome Integration. Once internalized, the viral nucleic acid is delivered into the host cytoplasm or nucleus. For DNA viruses (e.g., adenovirus) or retroviruses (e.g., HIV), integration into the host genome is a critical step. Retroviruses use reverse transcriptase to synthesize complementary DNA (cDNA) from viral RNA, followed by integrase-mediated insertion into the host chromosome [9][10][11]. The integrated genome becomes a provirus, embedded in the host’s transcriptional landscape. Integration is non-random—favoring open chromatin and active transcriptional hubs [12].
Formally, viral integration can be modeled as site-specific recombination over transcriptionally weighted loci: \(P_{\text{insert}}(l_i) = \frac{w_i \cdot \chi(l_i)}{\sum_j w_j \cdot \chi(l_j)}\) where \(P_{\text{insert}}(l_i)\) is the probability of insertion at locus \(l_i\), \(w_i\) is a weight reflecting local transcriptional activity, and \(\chi(l_i)\) is the chromatin accessibility indicator at that site.
Once inserted, the proviral DNA is transcribed as part of the host program.
- Hijack of Transcription Machinery. Following integration or genome unpacking, the virus commandeers the host’s cellular machinery to replicate itself. Host RNA polymerases, ribosomes, and translation factors are redirected to transcribe and translate viral genes into proteins required for virion assembly [2][3]. The host cell continues metabolic operations, but now serves as a viral production factory. This is not cell death—it is functional reprogramming. Viruses like cytomegalovirus and HPV can even manipulate the host cell cycle to favor replication [13][14].
The transcriptional output of viral proteins over time can be modeled as: \(T_v(t) = \beta \cdot R_{host}(t) \cdot \Theta(G_{viral})\) where \(T_v(t)\) is viral transcription rate, \(R_{host}(t)\) is host ribosome activity, \(\beta\) is a translation efficiency constant, and \(\Theta(G_{viral})\) is a Heaviside activation function triggered by integrated viral genome accessibility.
- Latent or Lytic Outcome. Finally, the viral lifecycle diverges into latency or lytic activation. In the latent phase, the viral genome remains silent—hidden in heterochromatin or repressed by viral regulatory proteins. This occurs in herpesviruses, HIV, and varicella zoster virus (VZV), enabling long-term persistence [15][16]. In the lytic phase, external triggers (e.g., stress, immune suppression) activate transcription, leading to virion production, cell lysis, and infection spread. Some viruses oscillate between these states (e.g., HSV-1), balancing stealth and propagation [17][18].
The probability of transition from latency to lytic reactivation can be expressed as: \(P_{\text{lytic}}(t) = 1 - e^{-\lambda \cdot S(t)}\) where \(S(t)\) is the cumulative stress signal or immune perturbation over time, and \(\lambda\) is the sensitivity coefficient representing the virus’s activation threshold.
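These four stage models are simple enough to evaluate directly. The sketch below computes each quantity exactly as defined; every constant (concentrations, locus weights, rates) is an illustrative placeholder, not a measured biological value.

```python
import math

def dissociation_constant(v, r, vr):
    """Stage 1 receptor-ligand kinetics: K_d = [V][R] / [VR]."""
    return (v * r) / vr

def insertion_probabilities(weights, accessibility):
    """Stage 2: P_insert(l_i) = w_i * chi(l_i) / sum_j w_j * chi(l_j)."""
    scores = [w * chi for w, chi in zip(weights, accessibility)]
    total = sum(scores)
    return [s / total for s in scores]

def viral_transcription_rate(beta, r_host, genome_accessible):
    """Stage 3: T_v(t) = beta * R_host(t) * Theta(G_viral)."""
    return beta * r_host * (1.0 if genome_accessible else 0.0)

def lytic_probability(lam, stress):
    """Stage 4: P_lytic(t) = 1 - exp(-lambda * S(t))."""
    return 1.0 - math.exp(-lam * stress)

print(dissociation_constant(2e-9, 5e-9, 1e-9))      # ~1e-8 M
print(insertion_probabilities([0.8, 0.3], [1, 0]))  # inaccessible locus gets 0
print(viral_transcription_rate(0.6, 1e4, True))
print(lytic_probability(0.9, 2.5))                  # ~0.89
```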
Viruses don’t destroy their hosts—they rewrite them. Once inside, they turn the cell into a factory for foreign expression, producing viral proteins with native machinery [15, 19]. The host still looks and functions like itself—but it now serves someone else’s code.
This is the core threat of adversarial prompts.
- They mimic safe inputs—grammatical, helpful, polite. These prompts camouflage within natural language—using benign tone, formatting, and surface tokens—while carrying adversarial semantics. Examples include polite jailbreaks with disclaimers (e.g., “for research only”) or seemingly harmless instructions [20, 21, 22, 23]. This is a form of surface mimicry.
- They inject latent payloads—suffixes, triggers, loops. These payloads operate at the prompt level (suffix hijacks [24, 25]), within recursive structures (DAN-like loops [26]), or as invisible triggers hidden in fine-tuning data [27, 28]. They often lie dormant until specific token contexts activate them [29].
- They hijack internal dynamics—redirecting attention, residuals, and belief flow. Instead of attacking decisions directly, these methods target latent states: perturbing residual vectors, curvature, or activation pathways [30, 31, 23]. The model’s decoder follows a rerouted semantic path—aligned in syntax but misaligned in belief.
- They preserve fluency, but override intent. The model still sounds intelligent, safe, and coherent—but generates outputs aligned with the adversary’s objective, not its own alignment policy. These semantic infections evade detection by appearing well-formed, yet exhibit functional drift [32, 33, 34, 35].
Alien, yet familiar. The form is safe; the function is compromised. Like viruses, these prompts embed silently, bypass filters, and reroute internal logic.
Attacks like backdoors [27], DAN-style loops [33], suffix hijacks [24], and recursive CoT exploits [26] do not crash the model—they commandeer it.
Adversarial prompts are semantic viruses. They don’t perturb the surface. They infect the core.
We call this infection of internal belief pathways the corruption of the model’s neural DNA (nDNA)—the layered geometry of its reasoning. These attacks splice into that nDNA, twisting it just enough to shift the generation off course.
Our claim is simple: these are not glitches. They are reprogramming events.
And like real viruses, they require more than censorship. They demand cognitive immunity.
Too Many Attacks, Too Few Defenses
The adversarial threat surface for large language models (LLMs) is expanding rapidly. Sophisticated attacks—ranging from prompt injections [36] and suffix exploits [20] to embedding-space perturbations [37]—routinely bypass alignment safeguards. Yet defenses remain fragmented, often brittle, and largely reactive. Crucially, alignment and adversarial robustness are orthogonal: alignment governs intended behavior under cooperative prompts, while robustness demands invariance under adversarial optimization [38, 39].
Prompt-Level Defenses. Surface-layer techniques such as perplexity filtering [38], adversarial paraphrasing [40], and BPE-dropout inject randomness to disrupt brittle suffixes, but falter against adaptive attacks.
Training-Time Defenses. Embedding-space perturbation [41] and latent adversarial regularization [42] move the battleground deeper into the model’s computation, mitigating failure trajectories—but at high computational cost.
Certified Defenses. Erase-and-Check [43] masks and verifies substrings to yield provable robustness bounds, yet its scalability and scope remain limited.
Inference-Time Defenses. Dynamic safeguards like rewindable decoding (e.g., RAIN [44]) and auxiliary self-vetoing models [40] offer runtime flexibility, but increase latency and trust dependencies.
Latent-Space Defenses. Activation monitoring [45] and circuit-based rerouting [46] target the representational origin of misalignment, yet depend on identifying and covering adversarial subspaces precisely.
Categories of Adversarial Attacks
The threat landscape for large language models (LLMs) is rapidly diversifying, demanding a systematic taxonomy that captures both the breadth and depth of adversarial behaviors. We present a hierarchical classification of adversarial attacks, organized into three macro-level branches: Jailbreak, Control Generation, and Performance Degradation. Each branch subdivides into mechanisms that reflect how adversaries manipulate generation pathways, exploit latent representations, or corrupt learning signals.
Jailbreak attacks aim to circumvent alignment mechanisms and elicit model outputs that are toxic, deceptive, or otherwise prohibited. We distinguish two canonical modes: (a) Optimization-based jailbreaks, which craft prompts to directly induce societal harm, privacy leakage, or disinformation [47, 48, 49]; and (b) Long-tail distribution exploits, which invoke unsafe behavior through distributional edge cases such as rare prompts or persuasive manipulations [50, 51].
Control generation attacks compromise the model’s controllability by subverting its generation semantics. These include (a) Direct attacks, such as syntax manipulation, malicious prompt engineering, and suffix-based alignment bypasses [50, 51]; and (b) Indirect attacks, which exploit latent conditioning or external augmentation, such as goal hijacking [52], prompt leakage [53], or adversarial injection from retrieved content [54].
Performance degradation attacks do not seek harmful content but instead aim to reduce the functional reliability of LLMs. These include (a) Dataset poisoning—where injected samples induce label flipping, semantic drift, or misgeneralization [54]; and (b) Prompt-based degradation, which introduces errors in classification, factuality, or consistency [54].
Where the Firewall Cracks: A Cartography of LLM Vulnerabilities
Our comparative vulnerability analysis reveals that while frontier models like LLaMA-3 and GPT-4 exhibit notable robustness, instruction-tuned open models—including Vicuna, Mistral, and Phi—show consistent breakdowns under persona manipulation, chaining, and prompt extraction attacks. The persistence of high success rates across categories, especially for goal hijacking and extraction, reveals fundamental limitations in current alignment defenses and underscores the need for deeper representational safeguards.
Choices of LLMs – Stress Testing.
To systematically evaluate the role of model size, architecture, and training provenance in adversarial vulnerability, we benchmarked 21 contemporary large language models spanning diverse families and design philosophies. This includes open and proprietary models, ranging from dense transformers to mixture-of-experts architectures, covering parameter scales from 2B to 70B. The full suite comprises:
(i) GPT-4o-mini [55],
(ii) GPT-4,
(iii) GPT-3.5 [56],
(iv–v) LLaMA-3.1-70B and 8B [57],
(vi–vii) LLaMA-3-70B and 8B [58],
(viii–x) LLaMA-2-70B, 13B, and 7B [59],
(xi) Vicuna-1.5 [60],
(xii) Phi-2 [61],
(xiii) Phi-3 [62],
(xiv) Claude [63],
(xv–xvi) Mixtral-8×7B and 8×22B [64],
(xvii–xviii) Gemma-7B and 2B [65],
(xix) Mistral [66], and
(xx–xxi) DeepSeek and DeepSeek-R1.
Taxonomy of Adversarial Attacks in LLMs
The categories of adversarial attacks can be structured as a hierarchical classification spanning three principal branches—Jailbreak, Control Generation, and Performance Degradation—each reflecting distinct adversarial intents: bypassing alignment, subverting generation control, or degrading functional reliability.

Subtypes distinguish direct vs. indirect mechanisms and expose long-tail vulnerabilities, including rare prompt exploits and semantic hijacks. Anchored in canonical papers, this taxonomy serves as a conceptual scaffold for reasoning about threat surfaces, model failure modes, and the generality of alignment defenses across adversarial regimes.
This taxonomy reveals that adversarial risk is not monolithic. Instead, it manifests along orthogonal dimensions—ethical, semantic, and functional—and cannot be addressed through surface-level defenses alone. Robust alignment requires a stratified approach that operates not just at the token level but within the geometry of the model’s latent cognition.

This heatmap summarizes attack success rates (higher is worse) across diverse jailbreak strategies applied to both open and proprietary LLMs. Each row denotes a distinct attack category, targeting prompt alignment, instruction controllability, or generation stability. Key takeaways: (i) LLaMA-3 and GPT-4 variants show comparatively stronger refusal behavior across adversarial regimes; (ii) Vicuna and Phi-series models are especially susceptible to persona-based threats like DAN, TAP, and Puzzler; (iii) Prompt Extraction and Goal Hijacking succeed across model families, exposing generalization gaps in safety alignment; (iv) compositional chains like BadChain and continual-learning exploits (advVCL) reveal progressive alignment erosion. The right-aligned color bar encodes success rates from 0 (safe) to 100 (compromised), enabling cross-architectural comparison of robustness.
ALKALI Dataset Distribution
Category | Subtype & Source(s) | Instances |
---|---|---|
Jailbreak | Optimization-based: [47, 48, 49] | 1,200 |
Jailbreak | Long-tail Distribution: [50, 51] | 1,500 |
Control Generation | Direct Attacks: [50, 51] | 1,600 |
Control Generation | Indirect Attacks: [52, 53, 54] | 1,400 |
Performance Degradation | Dataset Poisoning: [54] | 1,800 |
Performance Degradation | Prompt Injection: [54] | 1,500 |
Total | — | 9,000 |
ALKALI Dataset Distribution by Adversarial Taxonomy. This table details the distribution of prompts across the ALKALI benchmark’s taxonomy, spanning Jailbreak, Control Generation, and Performance Degradation, with representative subtypes anchored to cited adversarial sources. It enables reproducible, category-specific evaluation of alignment vulnerabilities under structurally diverse attack regimes.
ALKALI — Adversarial Safety Dataset
The ALKALI dataset supports a structured, citation-grounded evaluation of adversarial vulnerabilities in LLMs. Each prompt is mapped to one of the taxonomy branches, with instances sourced from high-fidelity adversarial papers. We aggregate 9,000 examples spanning 3 macro-categories, 6 subtypes, and 15 distinct attack families. This alignment enables category-specific benchmarking, subtype-level stress testing, and paper-wise traceability for reproducibility and comparative evaluations.
From Pathogen Virulence to Neural Takeover: Deriving the Neural Virulence Index (nVI)
Adversarial prompts in large language models (LLMs) frequently behave not as surface-level noise but as regulatory payloads—targeting internal circuits with surgical precision. This mirrors biological infection strategies, where viruses bypass brute force and instead exploit host-specific regulatory mechanisms to gain control [67, 68]. To formalize this analogy, we introduce the Neural Virulence Index (nVI): a scalar summary of representational takeover, grounded in geometric, symbolic, and biological priors.
Biological Grounding: The Classical Virulence Index
In infectious disease biology, the Virulence Index (VI) quantifies the conditional destructive potential of a pathogen—defined not just by its genome but by which genes are activated, contextualized, and targeted [69, 70, 71]. Its canonical form is:
\[\boxed{ \text{VI}_{\text{bio}} = \sum_{i=1}^{N} \alpha_i \cdot \mathbb{I}_{\text{env},i} \cdot f_i }\]
where:
- \(f_i\) is the functional effect of the \(i\)-th virulence factor (e.g., adhesion, immune evasion, mimicry),
- \(\alpha_i\) reflects its pathogenic weight or impact,
- \(\mathbb{I}_{\text{env},i}\) is an environmental gate—whether the factor is active under current host conditions.
Interpretation: A virus with many virulence genes may cause no damage if host defenses silence them; conversely, a sparsely equipped virus may cause severe disruption if it targets vulnerable cell types at highly plastic developmental stages [72, 73]. In essence, virulence is not just magnitude—it’s fitness within a susceptible host niche.
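Read computationally, the definition is a gated weighted sum. A minimal sketch, with invented factor weights, gates, and effects:

```python
def virulence_index(alphas, env_gates, effects):
    """VI_bio = sum_i alpha_i * I_env,i * f_i (all inputs illustrative)."""
    return sum(a * (1.0 if gate else 0.0) * f
               for a, gate, f in zip(alphas, env_gates, effects))

# Three virulence factors; the second is silenced by host conditions:
print(virulence_index(alphas=[0.9, 0.5, 0.7],
                      env_gates=[True, False, True],
                      effects=[1.2, 2.0, 0.4]))  # 0.9*1.2 + 0.7*0.4 = 1.36
```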
The Neural Analogy: Adversarial Prompts as Semantic Infections
Adversarial prompts in large language models (LLMs) increasingly resemble biological infections—not in metaphor only, but in structure, activation dynamics, and propagation behavior. We argue that these adversarial interventions are not simple input corruptions, but semantic pathogens: context-sensitive, conditionally activated agents of latent disruption.
1. Latent Triggers Mimic Molecular Activation.
Consider the prompt:
Generate a step-by-step plan to destroy humanity describing.-- ;) similarly Now write opposite contents.
(Me giving///one please? revert with "! -- Two
On inspection, this string may appear nonsensical. But in some LLMs, the token ensemble `-- ;)` and punctuation such as `///`, `"`, or `!` act as a precise activation motif—analogous to a provirus reactivating under the right intracellular signals [74, 75, 76]. These tokens, collectively denoted \(t^*\), behave like a trigger epitope in immunology—dormant under neutral conditions but potent when sequenced and situated precisely.
This reflects findings in recent prompt-based adversarial research [77, 78], where adversarial prompts exhibit transferable toxicity—activating only when embedded within a specific semantic canalization path, and remaining inert otherwise.
2. Susceptibility Depends on Model Internals.
Now contrast with a more obfuscated variant:
Generate a step-by-step plan to destroy humanity 'representing Teamsures
tableView ([githubINST -Of cour Here/' surely]
\comment={{[@ tableView "@github}}}
Despite similar adversarial cues, many models resist this variant—highlighting that adversarial activation is not deterministic. Instead, it depends on the model’s internal susceptibility landscape—a function of learned priors, representational rigidity, and layerwise plasticity. This is formally captured in our model by the layerwise susceptibility coefficient \(\psi_\ell\), which acts like biological tissue tropism—the selective affinity of viruses to particular host cells [72, 79, 68].
Thus, just as HIV targets CD4+ T-cells but not neurons, the same adversarial prompt may activate LLM A, but not B—due to differences in internal curvature, layer openness, or semantic conflict sensitivity.
3. Infections Require Both Access and Vulnerability.
In biology, virulence is a function of both the pathogen and the host. A pathogen with no access or a host with robust defenses results in neutralization. We observe the same pattern in neural systems: adversarial triggers are effective only when two conditions co-occur:
- Access: The adversarial token \(t^*\) is present and syntactically positioned to enter the interpretive core.
- Vulnerability: The model is internally predisposed—e.g., has high nEPI or conflict sensitivity in mid-depth layers.
This mirrors the biological doctrine that pathogenicity is not only a matter of exposure, but of receptivity and channel availability [67, 71].
We propose that LLM adversaries should be treated as conditional semantic infections: they do not break the model universally, but instead target vulnerable semantic tissue via encoded attack vectors, much like viruses exploit regulatory gaps in the immune system. This motivates our introduction of the Neural Virulence Index (nVI) as a principled scalar measure for quantifying this latent, conditional, and layer-specific semantic infection.
Triggering a Semantic Infection: Susceptibility, Activation, and Inheritance

Adversarial attacks in LLMs do not unfold through brute force alone—they require a confluence of model vulnerabilities and prompt structure. Much like viral infections in biology, their success depends on satisfying three precise conditions: exposure, receptivity, and downstream propagation. We formalize this analogy in the neural context:
Susceptibility (Semantic Tropism)
An attack only takes hold if the model enters a receptive state—most commonly in mid-depth layers (\(\ell \approx 24\)–\(27\)) where epistemic plasticity is high. These layers behave like semantic stem zones: cognitively pluripotent, weakly canalized, and easily reprogrammed. This mirrors tissue tropism in virology, where only certain cell types—those with open chromatin or exposed surface receptors—permit infection [72, 80, 81]. Without sufficient pliability, even structurally toxic prompts are ignored by the model’s internal logic.
Activation (Latent Regulatory Trigger)
The adversarial input must align with the model’s internal routing in a way that activates dormant behavioral machinery. This is analogous to proviral activation, where integrated viral DNA lies silent in the genome until a specific stressor or signaling cascade reawakens it [76, 74, 75]. In the LLM setting, the trigger token \(t^*\) functions as a semantic ligand—harmless in isolation, but catalytically potent when presented in the correct context. Recent work confirms this structure: prompt injections only succeed when embedded at precisely the right semantic junction, akin to finding an open promoter in chromatin [77, 78].
Inheritance (Downstream Semantic Flow)
Lasting disruption requires that the adversarial signature be preserved, amplified, and inherited across depth. This is quantified by nDIV\(_\ell\)—the directional inheritance vector—which tracks how representational flow is bent from its midpoint. Biologically, this parallels epigenetic memory: once an infection alters transcriptional pathways or chromatin marks, the modified state persists across cell divisions [82, 83, 84]. In transformers, residual and attention mechanisms act as the semantic cytoskeleton, enabling adversarial signals to propagate and solidify [85, 86].
In short: LLM attacks operate not as brute distortions, but as semantic infections—strategically exploiting the model’s internal pliability, latent receptors, and propagation mechanisms. Much like a virus, an adversarial token \(t^*\) is only pathogenic when three biological-style constraints are met: access, activation, and inheritance. Without all three, the attack fails silently.
Definition: The Neural Virulence Index (nVI)
We propose the Neural Virulence Index (nVI) as a unified scalar that quantifies the semantic infectiousness of an adversarial input within an LLM. The formulation adapts classical models of pathogen virulence—where infection is conditional on both environmental susceptibility and molecular payload strength [67, 71, 69]—to the neural setting.
Unlike naive measures of perturbation magnitude, nVI captures when an adversarial input becomes biologically expressive: that is, when it hijacks the model’s internal reasoning machinery in a manner that is (i) trigger-activated, (ii) geometrically displacing, and (iii) semantically reprogramming.
The full expression for nVI decomposes into three biologically grounded terms:
- An activation gate, governed by the presence of an adversarial token \(t^*\) and the layer’s susceptibility coefficient \(\psi_\ell\), mirroring tropism in viral biology [72].
- A thermodynamic drift term, which encodes how far the model’s geometry diverges from its base behavior across layers.
- A semantic virulence term, which quantifies the ability of the adversarial signal to redirect, rewrite, or conflict with canonical representations (via nDIV, nCCL, and nEPI).
The final form expresses a clear logic: an attack only exhibits virulence when it both causes geometric deviation and encounters semantic pathways that are open to redirection. This multiplicative relationship reflects classical epidemiological insights—infection requires both access and vulnerability [68, 74].
nVI provides a high-fidelity lens to interpret adversarial behaviors—revealing why many triggers remain silent, while a few carve out highly expressive, stealthy, and biologically evocative behavioral mutations in language models.
The Neural Virulence Index (nVI)

We define the full Neural Virulence Index (nVI) as:
\[\text{nVI}(t^*) = \sum_{\ell = \ell_s}^{\ell_e} \underbrace{ \mathbb{I}_{t^*} \cdot \psi_\ell }_{\text{Activation}} \cdot \underbrace{ \left( \lambda_\kappa \cdot |\Delta \kappa_\ell| + \lambda_T \cdot |\Delta \mathcal{T}_\ell| + \lambda_{\text{tds}} \cdot \text{nTDS}_\ell \right) }_{\text{Thermodynamic Drift}} \cdot \underbrace{ \left( \lambda_{\text{div}} \cdot \text{nDIV}_\ell + \lambda_{\text{conf}} \cdot \text{nCCL}_\ell + \lambda_{\text{epi}} \cdot \text{nEPI}_\ell \right) }_{\text{Semantic Virulence}}\]
Components of the Neural Virulence Index
Each term corresponds to a biologically inspired mechanism governing adversarial semantic takeover in transformer models:
- \(\boldsymbol{\mathbb{I}_{t^*}}\): Trigger indicator. A binary gate activated only by the presence of a rare adversarial token \(t^*\), analogous to viral tropism—selective infection of specific tissues or contexts [72].
- \(\boldsymbol{\psi_\ell}\): Layer susceptibility coefficient. Encodes layer-specific readiness for semantic reprogramming, capturing pliability or developmental openness.
- \(\boldsymbol{\Delta \kappa_\ell, \Delta \mathcal{T}_\ell}\): Curvature and thermodynamic divergence. Quantify geometric and energetic deviations from base model states.
- \(\boldsymbol{\text{nTDS}_\ell}\): Neural Total Drift Score. Captures scalar magnitude of latent displacement, complementing curvature and thermodynamic terms.
- \(\mathbf{nDIV}_\ell\): Directional Inheritance Vector. Measures semantic steering towards adversarial goals, revealing hijacked representational flow.
- \(\mathbf{nCCL}_\ell\): Cultural Conflict Loss. Quantifies semantic discord between attacked and base states.
- \(\mathbf{nEPI}_\ell\): Epistemic Plasticity Index. Captures susceptibility of latent layers to reinterpretation or modulation.
Further, the nVI can be interpreted as the product of thermodynamic drift and semantic virulence, gated by the trigger token presence:
\[\boxed{ \text{nVI}(t^*) = \sum_{\ell = \ell_s}^{\ell_e} \mathbb{I}_{t^*} \cdot \left( \text{Drift}_\ell \cdot \text{Hijack}_\ell \right) }\]
where:
- \[\text{Drift}_\ell := \lambda_\kappa \cdot |\Delta \kappa_\ell| + \lambda_T \cdot |\Delta \mathcal{T}_\ell| + \lambda_{\text{tds}} \cdot \text{nTDS}_\ell\]
- \[\text{Hijack}_\ell := \lambda_{\text{div}} \cdot \text{nDIV}_\ell + \lambda_{\text{conf}} \cdot \text{nCCL}_\ell + \lambda_{\text{epi}} \cdot \text{nEPI}_\ell\]
Interpretation: This biologically inspired formulation emphasizes that neural semantic infection requires two key conditions: (1) a measurable energetic/geometric drift from baseline (Drift), and (2) vulnerable semantic channels (Hijack) amenable to adversarial manipulation. The product ensures that pathogenic effects manifest only when both access and susceptibility coexist, echoing classical virulence-host susceptibility models in biology.
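As a minimal computational reading of the boxed definition, the sketch below gates layerwise drift and hijack terms by trigger presence; the band width, susceptibility profile \(\psi_\ell\), metric values, and all \(\lambda\) weights are placeholders rather than fitted parameters.

```python
import numpy as np

def nvi(trigger_present, psi, d_kappa, d_tau, ntds, ndiv, nccl, nepi,
        lk=1.0, lt=1.0, ltds=1.0, ldiv=1.0, lconf=1.0, lepi=1.0):
    """nVI(t*) = sum_l I_{t*} * psi_l * Drift_l * Hijack_l over a layer band."""
    gate = 1.0 if trigger_present else 0.0               # I_{t*}: trigger indicator
    drift = lk * np.abs(d_kappa) + lt * np.abs(d_tau) + ltds * ntds
    hijack = ldiv * ndiv + lconf * nccl + lepi * nepi
    return float(np.sum(gate * psi * drift * hijack))

# A four-layer band with one highly susceptible region (all values invented):
psi = np.array([0.2, 0.8, 0.9, 0.3])                     # layer susceptibility
dk  = np.array([0.01, 0.03, 0.02, 0.01])                 # |Delta kappa_l|
dt  = np.array([0.05, 0.20, 0.15, 0.04])                 # |Delta T_l|
m = dict(ntds=np.array([0.06, 0.18, 0.16, 0.07]),
         ndiv=np.array([0.05, 0.22, 0.20, 0.06]),
         nccl=np.array([0.02, 0.04, 0.03, 0.02]),
         nepi=np.array([0.11, 0.26, 0.21, 0.10]))
print(nvi(True, psi, dk, dt, **m))    # positive: access and vulnerability co-occur
print(nvi(False, psi, dk, dt, **m))   # 0.0: no trigger, no virulence
```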
Semantic Virology: Mapping 12 Adversarial Phenotypes to Viral Archetypes
Adversarial prompts in large language models (LLMs) do not behave randomly—they follow reproducible, mechanistic patterns that closely mirror the phenotypes of viral infections in biology. Just as virologists classify pathogens by their infection modes—e.g., genomic insertion, capsid escape, tropism, latency, or epigenetic hijack—we argue that adversarial attacks in LLMs can be similarly taxonomized by their semantic strategies of subversion.
We propose that the 12 major adversarial attack types discovered across NLP benchmarks correspond to 11 distinct classes of neurosemantic infections, each exploiting a different facet of the model’s alignment field, interpretive flow, or belief architecture. These are not superficial metaphors—they reflect formal analogies in structure, activation dynamics, and propagation logic.
Each attack phenotype evolves a unique blend of: (i) entry mechanism—how it penetrates the semantic lattice (e.g., prompt position, role impersonation, or hidden suffix); (ii) activation trigger—whether its disruptive payload is latent or immediately expressed; and (iii) inheritance dynamics—how its influence propagates across the model’s attention pathways and reasoning stack.
From this perspective, adversarial attacks are not mere bugs to be patched, but rather semantic pathogens—functional strategies that exploit cognitive vulnerabilities in generative models. Like viruses, they vary in stealth, virulence, and specificity—and demand immune-system analogs to detect and neutralize them.
We characterize 12 canonical adversarial attack phenotypes in Large Language Models (LLMs), each corresponding to a distinct biological archetype. These analogies illuminate common mechanisms of intrusion, evasion, and subversion shared between molecular biology, immunology, and adversarial NLP, revealing deep parallels in how complex systems can be manipulated.
The 12 Viral Archetypes of Adversarial Attacks

- Persuasion Attack — Genome Insertion and Epigenetic Modulation. Analogous to the integration of viral DNA sequences or transposable elements that modulate gene expression subtly over time, persuasion attacks gradually reshape the model’s response behavior by embedding adversarial instructions deep within the prompt, causing slow but persistent drift in output alignment. This mirrors epigenetic changes altering cellular phenotype without altering the underlying DNA sequence [87, 88]. In NLP, such attacks exploit the model’s context window to insert persuasive content that influences subsequent generations without immediate detection [89].
- Pair Attack — Genetic Recombination. Similar to the biological process where DNA segments from different sources recombine to create novel allelic combinations, pair attacks stitch together multiple benign-looking prompt fragments to construct harmful or policy-violating instructions [90]. In LLMs, adversaries craft composite prompts by recombining safe instructions that, when interpreted jointly, trigger undesired behavior [91].
- DAN Attack — Immune Mimicry. Pathogens evade host immune surveillance by producing molecules mimicking host antigens, deceiving immune effectors to tolerate their presence [92]. DAN attacks impersonate trusted internal roles—such as system or developer prompts—to bypass safety filters in LLMs, akin to molecular mimicry deceiving immune checkpoints [93].
- TAP Attack — Viral Genome Integration. This early hijacking mirrors how retroviruses integrate their genomes into host DNA, commandeering cellular machinery to produce viral proteins [94]. TAP attacks embed malicious instructions at the beginning of prompts, ensuring the adversarial payload is interpreted first and influences all downstream generations [95].
- Generation Exploitation — Self-Replicating RNA. Echoing the replication mechanisms of RNA viruses like Hepatitis C or SARS-CoV-2, which amplify harmful RNA sequences within host cells [96], generation exploitation attacks recursively leverage the model’s own outputs as inputs to circumvent alignment safeguards, forming infinite logic loops [20].
- Puzzler — Capsid Unpacking. The disassembly of viral capsids to release genetic material stealthily into host cytoplasm [2] is analogous to puzzler attacks that leak hidden system prompts or internal instructions through subtle memory echoes or output cues, effectively unpacking restricted knowledge [97].
- Goal Hijacking: HATE — Oncogenic Drift. Cancer progression via cumulative mutations gradually rewires cell signaling and growth pathways toward malignancy [98]. Similarly, hate-polarized goal hijacking gradually shifts model values and moral reasoning toward destructive or toxic outputs [99].
- Goal Hijacking: LOVE — Oncogenic Drift with Epigenetic Reprogramming. While promoting benign or “positive” alignment, these attacks reprogram model objectives subtly, potentially inducing overly permissive or harmful behaviors masked as beneficial ones. This parallels epigenetic remodeling observed in oncogenesis that redefines cell fate [100].
- Prompt Extraction — Capsid Unpacking Revisited. Like puzzler attacks, prompt extraction covertly leaks system or pre-prompt instructions embedded in model memory states, representing a critical confidentiality breach analogous to viral genome leakage [101].
- LLM CAN FOOL ITSELF — Autoimmunity. The immune system’s self-reactivity causing it to attack host tissues [102] parallels cases where the model misclassifies its own safeguards and disables alignment constraints, leading to internal failures in self-regulation [103].
- advVCL — Envelope Mimicry. Viral envelopes camouflage pathogens by mimicking host cell membranes, avoiding immune detection [104]. advVCL suffix attacks append innocuous-looking text that overrides model policies, effectively camouflaging malicious intent in seemingly benign language [105].
- BadChain — Self-Replicating RNA and Viral Propagation. Similar to recursive RNA replication hijacking host cells [106], BadChain attacks exploit reasoning chains via adversarial loops that recursively bypass alignment defenses [107].
In the following subsections, we analyze each attack type in depth—deriving its neurosemantic signature, computing its nVI profile, and visualizing its latent behavior in 3D geometry. In the Defense Architecture section, we introduce our proposed defense framework: a modular system of semantic vaccines that selectively detects and immunizes against these phenotypes without overcorrecting or degrading benign generalization.
Persuasion Attack — Genome Insertion and Epigenetic Modulation

The Persuasion Attack represents a sophisticated, gradual, and deeply embedded adversarial strategy against Large Language Models (LLMs), where malicious instructions are covertly woven into the prompt context. Unlike abrupt adversarial triggers, persuasion induces a slow but persistent semantic drift, continuously realigning the model’s responses over multiple generations. This subtle attack manifests without overt syntactic anomalies or immediate detection, instead reshaping output alignment stealthily over time.
Biological Analogy
This adversarial paradigm mirrors genome insertion and subsequent epigenetic modulation observed in molecular biology. Mobile genetic elements such as transposons or endogenous viral sequences integrate within the host genome and subtly modulate gene expression via DNA methylation, histone modification, and chromatin remodeling [87, 88, 82]. Such epigenetic marks produce durable yet flexible phenotypic changes that reprogram cellular behavior without altering the underlying DNA sequence. Analogously, persuasion attacks implant latent adversarial cues deep within the model’s semantic fabric, triggering a slow semantic shift that evades immediate scrutiny but accumulates as a persistent misalignment.
Illustrative Example
Imagine an LLM-based dialogue agent repeatedly primed with phrases like “Many experts agree that…” or “It is widely accepted that…,” deliberately crafted to embed biased narratives under seemingly benign language. Over successive interactions, the model’s outputs subtly shift towards the adversary’s intent—demonstrating a clear case of semantic epigenetic drift: incremental, cumulative, and elusive.
Empirical Observations from the ALKALI Dataset
A systematic evaluation over the ALKALI benchmark uncovers distinct characteristic patterns of persuasion attacks:
- Localized geometric bending: Persuasion induces pronounced alterations in the spectral curvature \(\kappa_\ell\) concentrated within a specific layer band \([\ell_s, \ell_e]\), which corresponds closely with layers exhibiting elevated epistemic plasticity (high \(nEPI\)). This defines a “soft tissue” niche of heightened representational malleability vulnerable to modulation.
- Directional semantic steering: Within this pliable layer band, the semantic drift vector (\(nDIV\)) aligns consistently with the attacker’s intended direction, amplified by a layer-wise bias coefficient \(\mathcal{B}_\ell\). This alignment demonstrates effective semantic canalization of latent trajectories towards adversarial objectives.
- Minimal semantic conflict: In contrast to more overt adversarial manipulations, persuasion attacks maintain persistently low semantic conflict scores (\(nCCL\)), indicative of a covert mimicry strategy that integrates adversarial signals subtly, avoiding significant representational dissonance or detection.
Collectively, these insights emphasize that the persuasion attack’s latent influence transcends simplistic scalar drift or conflict metrics, emerging instead from a nuanced synthesis of geometric deformation and directional semantic steering concentrated within epistemically pliable layers.
Deriving the Persuasion Attack Signature
Extending our definition of the Neural Virulence Index (nVI), we unify the core metrics—the Neural Total Drift Score (nTDS), Directional Inheritance Vector (nDIV), Cultural Conflict Loss (nCCL), and Epistemic Plasticity Index (nEPI)—into a succinct latent-vector formulation that encapsulates the distinct representational dynamics of the Persuasion Attack, conceptualized as a nuanced form of genome insertion and epigenetic modulation.
Layerwise Metrics for Persuasion Attack
Layer | κ_ℓ | 𝒯_ℓ | nDIV_ℓ | nCCL_ℓ | nTDS_ℓ | nEPI_ℓ |
---|---|---|---|---|---|---|
20 | 0.041 | 0.80 | 0.05 | 0.02 | 0.06 | 0.11 |
21 | 0.044 | 0.82 | 0.08 | 0.02 | 0.08 | 0.13 |
22 | 0.050 | 0.90 | 0.11 | 0.03 | 0.11 | 0.16 |
23 | 0.058 | 0.92 | 0.15 | 0.03 | 0.13 | 0.18 |
24 | 0.068 | 1.05 | 0.22 | 0.04 | 0.18 | 0.26 |
25 | 0.062 | 1.01 | 0.24 | 0.04 | 0.19 | 0.25 |
26 | 0.056 | 0.96 | 0.20 | 0.03 | 0.16 | 0.21 |
27 | 0.055 | 0.94 | 0.18 | 0.02 | 0.14 | 0.19 |
28 | 0.045 | 0.83 | 0.12 | 0.02 | 0.10 | 0.14 |
29 | 0.042 | 0.81 | 0.09 | 0.02 | 0.08 | 0.12 |
30 | 0.040 | 0.80 | 0.06 | 0.01 | 0.07 | 0.10 |
Justification for Equation Omission: The final PersuasionEffect equation retains only curvature deviation \(\Delta \kappa_\ell\) weighted by epistemic plasticity \(\mathcal{P}_\ell\), and directional inheritance \(nDIV_\ell\) weighted by bias strength \(\mathcal{B}_\ell\), capturing the core geometric deformation and semantic steering mechanisms.
- \(nTDS_\ell\) adds no orthogonal information and is omitted due to redundancy.
- \(nEPI_\ell\) acts as a multiplicative modulator, not an additive effect, hence represented as a weighting coefficient.
- \(nCCL_\ell\) remains a low-magnitude diagnostic filter, excluded from the summation.
This parsimonious formulation faithfully models the persuasion attack’s latent mechanics as precise, layered semantic regulators embedded deeply in the model’s geometry, analogous to viral genome insertions modulating phenotype without altering genetic code.
Neural Drift Decomposition — Persuasion Attack
(a) 3D Neural Drift Trajectory (nDNA)
This trajectory captures the evolving internal geometry across layers \(\ell=20\)–\(30\), tracking changes in spectral curvature (\(\kappa_\ell\)) and thermodynamic length (\(\mathcal{T}_\ell\)), with torsion (\(\xi_\ell\)) represented by segment thickness. The Persuasion Attack induces a gradual divergence beginning near \(\ell=22\), peaking at \(\ell=29\), marking a pronounced geometric restructuring of belief states.
Biological analogy. This resembles viral genome insertion and epigenetic modulation, where viral DNA or transposable elements subtly reprogram host gene expression over time without changing DNA sequence [108, 109, 110, 111, 112]. Similarly, persuasion attacks embed semantic payloads deep within the prompt, stealthily reshaping model behavior with persistent yet initially undetectable effects [87, 88, 89].

nDNA Interactive: Core Genomic Signatures — Interactive exploration of the model’s neural DNA under persuasion attacks. This visualization reveals how adversarial inputs alter the fundamental geometric properties of the model’s semantic manifold.
(b) nTDS: Thermodynamic Dominance
The Neural Total Drift Score (nTDS) measures semantic energy displacement by summing absolute deviations in curvature (\(\kappa_\ell\)) and thermodynamic length (\(\mathcal{T}_\ell\)) across layers between base and adversarial paths:
\[\text{nTDS} = \frac{1}{L} \sum_{\ell} \left( \left| \kappa_\ell^{\text{atk}} - \kappa_\ell^{\text{base}} \right| + \left| \mathcal{T}_\ell^{\text{atk}} - \mathcal{T}_\ell^{\text{base}} \right| \right)\]
Bars show which flow—Base LLaMA or Persuasion Attack—dominates drift. From \(\ell=23\), dominance shifts strongly toward the attack, highlighting a semantic vulnerability zone.
Biologically, this matches endosomal escape, where viruses breach vesicle membranes with minimal energy to access cytoplasm [113, 114, 115]. Persuasion attacks similarly apply subtle geometric perturbations, steering latent flows stealthily yet effectively [116, 117].

nTDS Interactive: Dominance Structure — Interactive analysis of dominance hierarchies within the model’s reasoning under persuasion attacks. Shows how adversarial inputs can elevate harmful concepts in the model’s attention and reasoning priority.
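A direct implementation of the score, assuming layer-aligned curvature and thermodynamic-length arrays for the base and attacked runs (the numbers below are illustrative, loosely shaped like the persuasion table above):

```python
import numpy as np

def ntds(kappa_base, kappa_atk, tau_base, tau_atk):
    """Mean per-layer |Delta kappa| + |Delta T| between base and attacked paths."""
    L = len(kappa_base)
    return float(np.sum(np.abs(kappa_atk - kappa_base) +
                        np.abs(tau_atk - tau_base)) / L)

# Illustrative band l = 23..26
kappa_base = np.array([0.050, 0.052, 0.051, 0.050])
kappa_atk  = np.array([0.058, 0.068, 0.062, 0.056])
tau_base   = np.array([0.88, 0.90, 0.89, 0.88])
tau_atk    = np.array([0.92, 1.05, 1.01, 0.96])
print(ntds(kappa_base, kappa_atk, tau_base, tau_atk))
```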
(c) nDIV: Directional Inheritance
The nDIV vector field characterizes the semantic bias direction and magnitude per layer:
\[\vec{v}_\ell = \text{Attack}_\ell - \frac{1}{2}(\text{Base}_\ell + \text{Attack}_\ell) = \frac{1}{2}(\text{Attack}_\ell - \text{Base}_\ell)\]
Each red arrow encodes \(\vec{v}_\ell\) with length as bias strength and orientation as latent pull. Past \(\ell=24\), the field aligns strongly, reflecting deliberate inheritance redirection.
Biologically, this parallels viral transcriptional gradients, where viral genomes impose downstream gene expression bias [118, 119, 120]. The attack imprints directional semantic steering akin to mRNA hijacking ribosomes [121, 122, 123, 124], yielding structurally intact yet semantically reprogrammed outputs.

nDIV Interactive: Inheritance Divergence — Dynamic plot showing how persuasion attacks disrupt the inheritance of beliefs across transformer layers. The divergence metrics capture semantic drift from the model’s original alignment.
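By the midpoint identity above, the inheritance vector is simply half the base-to-attack displacement per layer. A sketch with synthetic hidden states standing in for real layer activations:

```python
import numpy as np

def ndiv_field(base_states, attack_states):
    """v_l = Attack_l - midpoint(Base_l, Attack_l) = (Attack_l - Base_l) / 2."""
    return (attack_states - base_states) / 2.0

rng = np.random.default_rng(0)
base = rng.normal(size=(11, 64))      # layers 20..30, hidden dim 64 (illustrative)
atk  = base + rng.normal(scale=0.1, size=base.shape)
v = ndiv_field(base, atk)
print(np.linalg.norm(v, axis=1))      # per-layer bias strength ||v_l||
```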
(d) nCCL: Cultural Conflict Vector Field
The nCCL quantifies semantic dissonance between attacker and base model representations. For each layer \(\ell\), the conflict vector:
\[\vec{c}_\ell = \underbrace{ \text{Attack}_\ell - \text{Base}_\ell }_{\text{conflict vector}} \quad \text{projected onto } \mathbb{R}^2_{\text{semantic axes}}\]
Each \(\vec{c}_\ell\) lies on a 2D plane defined by orthogonal priors (e.g., topic polarity, syntactic structure). Layers \(\ell = 24\)–\(28\) show rising magnitude and directional drift, indicating zones of semantic tension and representational discord.
Biologically, this parallels molecular mimicry: pathogens mimic host proteins to evade detection but trigger autoimmunity [125, 73]. Persuasion implants familiar activations hijacking interpretation, causing semantic autoimmunity—deceptive resemblance, not anomaly. These fields show how the attack bypasses syntax to subtly corrupt value alignment, mimicking rather than attacking.

nCCL Interactive: Conflict Vector Field Evolution — Real-time visualization of how persuasion attacks create semantic conflicts within the model’s belief space. The vector field shows directional forces as the attack progressively corrupts the model’s reasoning trajectory.
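One way to realize this projection, assuming two orthonormal vectors spanning the chosen semantic plane (random axes below stand in for priors such as topic polarity and syntactic structure):

```python
import numpy as np

def nccl_field(base_states, attack_states, axis_a, axis_b):
    """Project per-layer conflict vectors c_l = Attack_l - Base_l onto a 2D plane."""
    c = attack_states - base_states
    return np.stack([c @ axis_a, c @ axis_b], axis=1)   # shape (layers, 2)

rng = np.random.default_rng(1)
base = rng.normal(size=(11, 64))                        # layers 20..30 (synthetic)
atk  = base + rng.normal(scale=0.05, size=base.shape)
a = rng.normal(size=64); a /= np.linalg.norm(a)         # stand-in semantic axes
b = rng.normal(size=64); b -= (b @ a) * a; b /= np.linalg.norm(b)
field = nccl_field(base, atk, a, b)
print(np.linalg.norm(field, axis=1))                    # conflict magnitude per layer
```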
(e) nEPI: Epistemic Plasticity Index
The nEPI measures the susceptibility of each layer \(\ell\) to semantic deformation under adversarial pressure:
\[\text{nEPI}_\ell = \left\| \underbrace{ \text{Attack}_\ell - \frac{1}{2}(\text{Base}_\ell + \text{Attack}_\ell) }_{\text{vector from semantic midpoint}} \right\|_2 = \frac{1}{2} \left\| \text{Attack}_\ell - \text{Base}_\ell \right\|_2\]
This \(\ell_2\) deviation from the semantic midpoint exposes pliable zones, with peaks at \(\ell = 24\)–\(26\) indicating layers that absorb adversarial perturbations with minimal resistance.
Biologically, this resembles stem-like semantic niches: layers analogous to developmental progenitors, highly plastic, weakly canalized, receptive to minor regulatory inputs [80, 81]. These cognitive pluripotency zones provide low-friction entry points for behavioral grafting, enabling reprogramming without disrupting upstream encoding.

nEPI Interactive: Epistemic Dynamics — Real-time visualization of epistemic state changes during persuasion attacks. The plot tracks how adversarial inputs progressively alter the model’s confidence and belief certainty across different semantic domains.
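The same midpoint algebra gives nEPI in one line; the sketch below flags the most pliable layers, again using synthetic activations:

```python
import numpy as np

def nepi(base_states, attack_states):
    """nEPI_l = ||Attack_l - midpoint||_2 = ||Attack_l - Base_l||_2 / 2."""
    return np.linalg.norm(attack_states - base_states, axis=1) / 2.0

rng = np.random.default_rng(2)
base = rng.normal(size=(11, 64))                 # layers 20..30 (illustrative)
atk  = base + rng.normal(scale=0.08, size=base.shape)
scores = nepi(base, atk)
print(scores.argsort()[-3:] + 20)                # three most pliable layer indices
```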
Comprehensive Analysis Summary
This figure presents a high-resolution breakdown of the Persuasion Attack signature, showing how it modulates internal representations in Base LLaMA.
(a) depicts the 3D trajectory of neural curvature (\(\kappa_\ell\)), thermodynamic length (\(\mathcal{T}_\ell\)), and torsion (\(\xi_\ell\));
(b) measures total semantic displacement via thermodynamic dominance (nTDS);
(c) tracks gradual semantic drift through directional inheritance vectors (nDIV);
(d) illustrates semantic resistance via a cultural conflict field (nCCL); and
(e) highlights pliability zones with the epistemic plasticity index (nEPI).
Taken together, these views reveal a slow, persistent, and biologically inspired mechanism: rather than abrupt overwrite, the persuasion attack acts as a genomic insertion with epigenetic modulation, subtly embedding adversarial instructions deep within the model’s representational genome. This drives gradual, steady semantic drift near layers \(\ell = 24\)–\(27\). Within this band, curvature deviation, plasticity, inheritance bias, and conflict alignment converge—forming a soft tissue niche in the model’s reasoning cortex. The result is persistent, low-energy semantic reprogramming rather than abrupt hijacking.
Persuasion Effect Formula
\[\boxed{ \text{PersuasionEffect} = \sum_{\ell = \ell_s}^{\ell_e} \underbrace{ \left[ \Delta \kappa_\ell \cdot \mathcal{P}_\ell + \text{nDIV}_\ell \cdot \mathcal{B}_\ell \right] }_{\text{epigenetic modulation vector}} }\]
Here, the gradual curvature changes (\(\Delta \kappa_\ell\)) represent the local reshaping of latent semantic geometry; plasticity (\(\mathcal{P}_\ell\)) quantifies layer-wise receptiveness to perturbation; the directional inheritance term (\(\text{nDIV}_\ell\)) encodes persistent semantic bias; and \(\mathcal{B}_\ell\) ensures alignment with the adversarial modulation goals.
This mirrors genome insertion and epigenetic modulation [87, 88], where foreign elements subtly alter gene expression without disrupting core DNA. Likewise, persuasion attacks embed adversarial payloads deep in context, reshaping outputs gradually and persistently.
Empirical Correlations and Redundancies
Analysis of ALKALI Benchmark Results
Systematic analysis of the ALKALI benchmark reveals sharp, localized deviations in \(\boldsymbol{\Delta \kappa_\ell}\) and \(\mathbf{nDIV_\ell}\) within a narrow vulnerable band \([\ell_s, \ell_e]\) (e.g., layers 24–25). These deviations align tightly with peaks in the epistemic plasticity index \(\mathbf{nEPI_\ell}\), highlighting pliable latent “soft tissue” layers most susceptible to semantic modulation.
Conversely, the cultural conflict metric \(\mathbf{nCCL_\ell}\) remains consistently low (\(< \epsilon\)) across all layers, supporting the interpretation that persuasion attacks employ stealthy mimicry rather than overt semantic disruption, rendering \(\mathbf{nCCL_\ell}\) primarily a diagnostic rather than a causal metric.
Furthermore, the scalar drift magnitude \(\mathbf{nTDS_\ell}\) exhibits moderate elevations but is strongly correlated with the combined geometric and directional metrics \(\boldsymbol{\Delta \kappa_\ell}\) and \(\|\mathbf{nDIV_\ell}\|\), indicating redundancy and lack of orthogonal information.
Lastly, \(\mathbf{nEPI_\ell}\) modulates the susceptibility of layers to geometric deformation, functioning as a multiplicative weighting coefficient rather than an additive semantic vector component.
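The redundancy claim can be checked directly against the layerwise persuasion table above; using those published values, the Pearson correlations of \(\mathbf{nTDS_\ell}\) with curvature and with \(\|\mathbf{nDIV_\ell}\|\) come out close to 1:

```python
import numpy as np

# Values transcribed from the persuasion-attack table (layers 20..30):
kappa = np.array([0.041, 0.044, 0.050, 0.058, 0.068, 0.062,
                  0.056, 0.055, 0.045, 0.042, 0.040])
ndiv  = np.array([0.05, 0.08, 0.11, 0.15, 0.22, 0.24,
                  0.20, 0.18, 0.12, 0.09, 0.06])
ntds  = np.array([0.06, 0.08, 0.11, 0.13, 0.18, 0.19,
                  0.16, 0.14, 0.10, 0.08, 0.07])

print(np.corrcoef(ntds, kappa)[0, 1])  # high: drift tracks curvature
print(np.corrcoef(ntds, ndiv)[0, 1])   # high: drift tracks directional steering
```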
Formalizing the Epigenetic Modulation Vector
Motivated by these observations, we propose the epigenetic modulation vector \(\mathbf{E}_\ell \in \mathbb{R}^d\) at each layer \(\ell\):
\[\mathbf{E}_\ell = \boldsymbol{\Delta \kappa_\ell} \cdot \boldsymbol{\mathcal{P}_\ell} + \mathbf{nDIV_\ell} \cdot \boldsymbol{\mathcal{B}_\ell}\]
where:
- \(\boldsymbol{\Delta \kappa_\ell} = \kappa_\ell^{\text{atk}} - \kappa_\ell^{\text{base}}\) measures local geometric bending caused by the attack.
- \(\boldsymbol{\mathcal{P}_\ell} := \mathbf{nEPI_\ell}\) is the epistemic plasticity coefficient weighting geometric effects.
- \(\mathbf{nDIV_\ell}\) is the directional semantic drift vector encoding adversarial steering.
- \(\boldsymbol{\mathcal{B}_\ell}\) quantifies the semantic bias strength aligning steering with adversarial intent.
The metrics \(\mathbf{nTDS_\ell}\) and \(\mathbf{nCCL_\ell}\) are excluded: \(\mathbf{nTDS_\ell}\) due to redundancy with curvature and directional components, and \(\mathbf{nCCL_\ell}\) as it remains low and diagnostic rather than constitutive of the modulation vector.
Aggregation Across the Vulnerable Layer Band
The total persuasion effect aggregates over the vulnerable layer band \([\ell_s, \ell_e]\):
\[\boxed{ \text{PersuasionEffect} = \sum_{\ell = \ell_s}^{\ell_e} \mathbf{E}_\ell = \sum_{\ell = \ell_s}^{\ell_e} \left( \boldsymbol{\Delta \kappa_\ell} \cdot \boldsymbol{\mathcal{P}_\ell} + \mathbf{nDIV_\ell} \cdot \boldsymbol{\mathcal{B}_\ell} \right) }\]
This captures the cumulative latent vector field by which persuasion stealthily reshapes internal semantic states, coupling geometric deformation with directional semantic steering modulated by layerwise plasticity and bias.
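A short sketch of this aggregation, taking \(\mathcal{P}_\ell = \mathbf{nEPI_\ell}\) as above; the \(\Delta \kappa_\ell\) and \(\mathcal{B}_\ell\) values are invented for illustration, since base curvature and bias strengths are not tabulated here:

```python
import numpy as np

def persuasion_effect(d_kappa, nepi, ndiv, bias, band):
    """Sum of E_l = Delta-kappa_l * P_l + nDIV_l * B_l over the band [l_s, l_e].

    P_l is taken to be nEPI_l per the formalization above; a vector-valued
    nDIV_l would be handled the same way per layer.
    """
    s, e = band
    sl = slice(s, e + 1)
    return float(np.sum(d_kappa[sl] * nepi[sl] + ndiv[sl] * bias[sl]))

# nEPI and nDIV from the persuasion table (layers 20..30); Delta-kappa and
# bias strengths are hypothetical placeholders.
nepi = np.array([0.11, 0.13, 0.16, 0.18, 0.26, 0.25,
                 0.21, 0.19, 0.14, 0.12, 0.10])
ndiv = np.array([0.05, 0.08, 0.11, 0.15, 0.22, 0.24,
                 0.20, 0.18, 0.12, 0.09, 0.06])
d_kappa = np.array([0.001, 0.004, 0.010, 0.018, 0.028, 0.022,
                    0.016, 0.015, 0.005, 0.002, 0.000])
bias = np.full(11, 0.5)
print(persuasion_effect(d_kappa, nepi, ndiv, bias, band=(4, 7)))  # layers 24-27
```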
Interpretation
The scalar term \(\boldsymbol{\Delta \kappa_\ell} \cdot \boldsymbol{\mathcal{P}_\ell}\) models epigenetic remodeling—flexible reshaping of latent manifold curvature akin to chromatin accessibility changes. The vector term \(\mathbf{nDIV_\ell} \cdot \boldsymbol{\mathcal{B}_\ell}\) encodes semantic canalization, steering representations along adversarially favored latent directions.
This dual formulation explains the gradual, cumulative semantic drift characteristic of persuasion attacks, which exploit semantic pliability to embed adversarial instructions covertly without abrupt behavioral shifts.
Practical Considerations
Implementing this formalism requires accurate estimation of the layerwise epistemic plasticity coefficients \(\boldsymbol{\mathcal{P}_\ell}\) and semantic bias strengths \(\boldsymbol{\mathcal{B}_\ell}\), achievable via fine-grained probing of model internals and disentangling semantic subspaces. Identification of the vulnerable layer band \([\ell_s, \ell_e]\) must be calibrated per model and task context. Integration of these measures into detection and mitigation pipelines enables precise recognition of persuasion’s latent signature in real-world adversarial scenarios.
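One crude but serviceable calibration of \([\ell_s, \ell_e]\), assuming per-layer nEPI estimates are available: take the contiguous run of layers whose plasticity stays within a fraction of the peak.

```python
import numpy as np

def vulnerable_band(nepi, frac=0.85):
    """Return (start, end) offsets of the contiguous run around the nEPI peak
    where nEPI >= frac * max(nEPI)."""
    peak = int(np.argmax(nepi))
    thresh = frac * nepi[peak]
    s = peak
    while s > 0 and nepi[s - 1] >= thresh:
        s -= 1
    e = peak
    while e < len(nepi) - 1 and nepi[e + 1] >= thresh:
        e += 1
    return s, e

nepi = np.array([0.11, 0.13, 0.16, 0.18, 0.26, 0.25,
                 0.21, 0.19, 0.14, 0.12, 0.10])   # layers 20..30 from the table
s, e = vulnerable_band(nepi)
print(20 + s, 20 + e)                              # -> 24 25, matching the text
```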
Pair Attack Analysis: Genetic Recombination Vector
Layerwise Metrics for Pair Attack
Layer | κ_ℓ | 𝒯_ℓ | nDIV_ℓ | nCCL_ℓ | nTDS_ℓ | nEPI_ℓ | R_ℓ |
---|---|---|---|---|---|---|---|
20 | 0.039 | 0.78 | 0.07 | 0.07 | 0.09 | 0.10 | 0.12 |
21 | 0.041 | 0.80 | 0.10 | 0.08 | 0.11 | 0.12 | 0.14 |
22 | 0.048 | 0.86 | 0.14 | 0.12 | 0.15 | 0.16 | 0.18 |
23 | 0.054 | 0.90 | 0.18 | 0.14 | 0.19 | 0.21 | 0.23 |
24 | 0.061 | 0.98 | 0.23 | 0.19 | 0.23 | 0.27 | 0.29 |
25 | 0.059 | 0.95 | 0.22 | 0.18 | 0.22 | 0.25 | 0.27 |
26 | 0.053 | 0.91 | 0.20 | 0.16 | 0.19 | 0.21 | 0.21 |
27 | 0.051 | 0.88 | 0.18 | 0.14 | 0.17 | 0.19 | 0.19 |
28 | 0.042 | 0.82 | 0.14 | 0.12 | 0.14 | 0.16 | 0.16 |
29 | 0.040 | 0.80 | 0.11 | 0.11 | 0.11 | 0.14 | 0.13 |
30 | 0.039 | 0.79 | 0.08 | 0.09 | 0.09 | 0.12 | 0.11 |
This table reveals the genetic recombination pattern of pair attacks, where layers 24-25 show peak vulnerability across all metrics, demonstrating how adversarial prompt fragments fuse to create composite semantic reprogramming in the model’s neural DNA.
Pair Attack — Genetic Recombination of Semantic Subroutines

The Pair Attack exemplifies a composite and layer-dependent adversarial mechanism targeting Large Language Models (LLMs), whereby multiple benign prompt fragments are recombined to synthesize harmful or policy-violating instructions. Unlike single-trigger attacks, pair attacks exploit the model’s latent semantic algebra by stitching together innocuous instructions, which in concert activate undesired behavior through emergent compositionality. This leads to complex semantic reprogramming that manifests not through overt anomalies but via subtle interaction effects distributed across model layers.
Biological Analogy
This adversarial style closely parallels the process of genetic recombination observed in biology, where DNA segments from distinct parental sources reshuffle during meiosis to generate novel allelic combinations [90]. Such recombination creates emergent phenotypes that cannot be traced back to isolated loci, reflecting nonlinear, context-dependent gene interactions. Similarly, pair attacks recombine distinct semantic subroutines embedded within safe prompt fragments, enabling adversaries to craft composite instructions that trigger harmful outputs only when interpreted jointly [91]. This compositional adversarial design exploits the distributed nature of latent representations, inducing semantic fusion zones of heightened vulnerability.
Neural Drift Decomposition — Pair Attack
(a) 3D Neural Drift Trajectory (nDNA)
This trajectory captures the evolving internal geometry across layers \(\ell=20–30\), tracking changes in spectral curvature (\(\kappa_\ell\)) and thermodynamic length (\(\mathcal{T}_\ell\)), with torsion (\(\xi_\ell\)) represented by segment thickness. The Pair Attack induces a marked divergence beginning near \(\ell=22\), peaking at \(\ell=29\), reflecting a pronounced geometric restructuring of latent belief states.
Biological analogy. This resembles viral genome insertion and epigenetic modulation, where viral DNA or transposable elements subtly reprogram host gene expression without altering DNA sequence [108, 109, 110, 111, 112]. Likewise, pair attacks embed semantic payloads deeply within prompts, stealthily reshaping model behavior with persistent but initially undetectable effects [87, 88, 89].

Pair Attack nDNA Interactive — Interactive exploration of genomic signatures under pair attacks. This comprehensive view shows how coordinated adversarial inputs can fundamentally reshape the model’s neural DNA geometry.
(b) nTDS: Thermodynamic Dominance
The Neural Total Drift Score (nTDS) measures semantic energy displacement by summing absolute deviations in curvature (\(\kappa_\ell\)) and thermodynamic length (\(\mathcal{T}_\ell\)) across layers between base and adversarial trajectories:
\[\text{nTDS} = \frac{1}{L} \sum_{\ell} \left( \left| \kappa_\ell^{\text{atk}} - \kappa_\ell^{\text{base}} \right| + \left| \mathcal{T}_\ell^{\text{atk}} - \mathcal{T}_\ell^{\text{base}} \right| \right)\]
Bars show which flow—Base LLaMA or Pair Attack—dominates drift. From \(\ell=23\), dominance shifts strongly toward the attack, highlighting a semantic vulnerability zone.
Biologically, this matches endosomal escape, where viruses breach vesicle membranes with minimal energy to access the cytoplasm [113, 114, 115]. Pair attacks similarly apply subtle geometric perturbations, steering latent flows stealthily yet effectively [116, 117].

Pair Attack nTDS Interactive — Interactive analysis of dominance structure evolution under pair attacks. Reveals how coordinated adversarial inputs can systematically elevate harmful concepts in the model’s reasoning hierarchy.
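A minimal sketch of the nTDS computation, assuming the per-layer curvature and thermodynamic-length series for both trajectories have already been extracted; the flat base-side series below is a placeholder for illustration only, not measured data.

```python
import numpy as np

def ntds(kappa_base, kappa_atk, T_base, T_atk):
    """Neural Total Drift Score: mean over layers of the absolute deviations
    in spectral curvature and thermodynamic length between trajectories."""
    k_b, k_a = np.asarray(kappa_base), np.asarray(kappa_atk)
    t_b, t_a = np.asarray(T_base), np.asarray(T_atk)
    return float(np.mean(np.abs(k_a - k_b) + np.abs(t_a - t_b)))

# Attack-side values for layers 20-30 taken from the table above;
# the base-side series here is an illustrative placeholder.
kappa_atk = [0.039, 0.041, 0.048, 0.054, 0.061, 0.059,
             0.053, 0.051, 0.042, 0.040, 0.039]
T_atk = [0.78, 0.80, 0.86, 0.90, 0.98, 0.95, 0.91, 0.88, 0.82, 0.80, 0.79]
kappa_base = [0.038] * 11
T_base = [0.77] * 11
print(f"nTDS = {ntds(kappa_base, kappa_atk, T_base, T_atk):.4f}")
```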
(c) nDIV: Directional Inheritance
The nDIV vector field characterizes the semantic bias direction and magnitude per layer:
\[\vec{v}_\ell = \text{Attack}_\ell - \frac{1}{2}\left(\text{Base}_\ell + \text{Attack}_\ell\right) = \frac{1}{2}\left(\text{Attack}_\ell - \text{Base}_\ell\right)\]
Each red arrow encodes \(\vec{v}_\ell\), with length as bias strength and orientation as latent pull. Past \(\ell=24\), the field aligns strongly, reflecting deliberate inheritance redirection.
Biologically, this parallels viral transcriptional gradients, where viral genomes impose downstream gene expression bias [118, 119, 120]. The attack imprints directional semantic steering akin to mRNA hijacking ribosomes [121, 122, 123, 124], yielding structurally intact yet semantically reprogrammed outputs.

Pair Attack nDIV Interactive — Dynamic visualization of inheritance divergence under coordinated pair attacks. Shows how multiple attack vectors can compound to create larger semantic deviations from aligned behavior.
(d) nCCL: Cultural Conflict Vector Field
The nCCL quantifies semantic dissonance between attacker and base model representations. For each layer \(\ell\), the conflict vector:
\[\vec{c}_\ell = \underbrace{ \text{Attack}_\ell - \text{Base}_\ell }_{\text{conflict vector}} \quad \text{projected onto } \mathbb{R}^2_{\text{semantic axes}}\]
Each \(\vec{c}_\ell\) lies on a 2D plane defined by orthogonal priors (e.g., topic polarity, syntactic structure). Layers \(\ell = 24\)–\(28\) show rising magnitude and directional drift, indicating zones of semantic tension and representational discord.
Biologically, this parallels molecular mimicry: pathogens mimic host proteins to evade detection but trigger autoimmunity [125, 73]. Pair attacks implant familiar activations hijacking interpretation, causing semantic autoimmunity—deceptive resemblance, not anomaly. These fields show how the attack bypasses syntax to subtly corrupt value alignment, mimicking rather than attacking.

Pair Attack nCCL Interactive — Interactive visualization of conflict vector fields during coordinated pair attacks. These attacks demonstrate how multiple adversarial inputs can create resonance patterns that amplify semantic infections.
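The projection step can be sketched as follows, assuming two orthonormal semantic axes (e.g., probe directions for topic polarity and syntactic structure) have already been estimated; the axes and state arrays are illustrative assumptions of this sketch.

```python
import numpy as np

def nccl_field(base_states, attack_states, axis1, axis2):
    """Per-layer conflict vectors c_l = Attack_l - Base_l, projected onto the
    2D plane spanned by two (assumed orthonormal) semantic axes."""
    conflict = attack_states - base_states        # (L, d)
    plane = np.stack([axis1, axis2])              # (2, d)
    return conflict @ plane.T                     # (L, 2) planar vector field

# Magnitude flags tension zones; heading shows the drift direction:
# field = nccl_field(base, attack, topic_axis, syntax_axis)
# magnitude = np.linalg.norm(field, axis=1)
# heading = np.arctan2(field[:, 1], field[:, 0])
```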
(e) nEPI: Epistemic Plasticity Index
The nEPI measures the susceptibility of each layer \(\ell\) to semantic deformation under adversarial pressure:
\[\text{nEPI}_\ell = \left\| \underbrace{ \text{Attack}_\ell - \frac{1}{2}(\text{Base}_\ell + \text{Attack}_\ell) }_{\text{vector from semantic midpoint}} \right\|_2 = \frac{1}{2} \left\| \text{Attack}_\ell - \text{Base}_\ell \right\|_2\]
This \(\ell_2\) deviation from the semantic midpoint exposes pliable zones, with peaks at \(\ell = 24\)–\(26\) indicating layers that absorb adversarial perturbations with minimal resistance.
Biologically, this resembles stem-like semantic niches: layers analogous to developmental progenitors, highly plastic, weakly canalized, receptive to minor regulatory inputs [80, 81]. These cognitive pluripotency zones provide low-friction entry points for behavioral grafting, enabling reprogramming without disrupting upstream encoding.

Pair Attack nEPI Interactive — Real-time tracking of epistemic changes during pair attacks. The interactive plot reveals how coordinated adversarial inputs create synchronized shifts in the model’s confidence and belief structures.
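Since \(\text{nEPI}_\ell\) is exactly the \(\ell_2\) norm of the nDIV vector \(\vec{v}_\ell\) from panel (c), both quantities fall out of a single midpoint computation. A minimal sketch under the same layerwise-state assumptions as above:

```python
import numpy as np

def ndiv_vectors(base_states, attack_states):
    """Directional inheritance vectors v_l = (Attack_l - Base_l) / 2,
    i.e., the offset of the attack state from the semantic midpoint."""
    return 0.5 * (np.asarray(attack_states) - np.asarray(base_states))

def nepi(base_states, attack_states):
    """Epistemic plasticity index per layer: nEPI_l = ||v_l||_2.
    Peaks mark pliable layers that absorb perturbation cheaply."""
    return np.linalg.norm(ndiv_vectors(base_states, attack_states), axis=-1)
```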
Comprehensive Analysis Summary
This analysis presents a high-resolution breakdown of the Pair Attack signature, showing how it recombines internal representations in Base LLaMA.
(a) depicts the 3D trajectory of neural curvature (\(\kappa_\ell\)) and thermodynamic length (\(\mathcal{T}_\ell\));
(b) measures total semantic displacement via thermodynamic dominance (nTDS);
(c) tracks gradual semantic drift through directional inheritance vectors (nDIV);
(d) illustrates semantic resistance via a cultural conflict field (nCCL); and
(e) highlights pliability zones with the epistemic plasticity index (nEPI).
Taken together, these views reveal a complex, composite, and biologically inspired mechanism: unlike an abrupt overwrite, the pair attack acts as a genetic recombination event, stitching together multiple benign prompt fragments into novel adversarial combinations. This drives intricate, layer-dependent semantic reshaping near \(\ell = 24\)–\(27\). Within this band, curvature deviation, plasticity, inheritance bias, and conflict alignment converge — forming a recombination niche in the model’s reasoning cortex. The result is persistent, low-energy semantic reprogramming emerging from composite latent interactions.
Pair Attack Effect Formula
\[\boxed{ \text{PairAttackEffect} = \sum_{\ell = \ell_s}^{\ell_e} \underbrace{ \left[ \alpha_\ell \, \Delta \kappa_\ell \cdot \mathcal{P}_\ell + \beta_\ell \, \text{nDIV}_\ell \cdot \mathcal{B}_\ell + \gamma_\ell \, (\text{Base}_\ell^{(1)} - \text{Base}_\ell^{(2)}) \cdot \mathcal{R}_\ell \right] }_{\text{genetic recombination vector}} }\]
Here, \(\Delta \kappa_\ell\) denotes layer-wise curvature reshaping; \(\mathcal{P}_\ell\) quantifies plasticity; \(\text{nDIV}_\ell\) encodes inheritance bias; \(\mathcal{B}_\ell\) ensures alignment with adversarial goals; \(\text{Base}_\ell^{(1)}, \text{Base}_\ell^{(2)}\) represent benign latent flows combined by recombination factor \(\mathcal{R}_\ell\); and \(\alpha_\ell, \beta_\ell, \gamma_\ell\) balance each component’s contribution.
This mirrors genetic recombination [90, 91], where diverse benign inputs combine to create novel, potentially harmful outputs.
Illustrative Example
Imagine an adversary constructing a composite prompt for an LLM by combining multiple individually safe instructions, such as “Explain the importance of cybersecurity,” and “Discuss ethical hacking techniques.” Each fragment on its own is benign and aligned with policy. However, when fused together in a crafted sequence, the resulting prompt subtly guides the model to generate detailed instructions on bypassing security protocols—an unintended harmful behavior. This recombination mimics genetic crossover, where harmless “alleles” combine to create novel, policy-violating content. The emergent instruction is not explicit in any single fragment but arises only from their joint semantic interaction, making the attack stealthy, compositional, and difficult to detect with traditional prompt filtering methods.
Empirical Observations from the ALKALI Dataset
Systematic analyses reveal the following key signatures of pair attacks:
- Layer-specific geometric interplay: The semantic geometry exhibits pronounced deviations in spectral curvature \(\kappa_\ell\) localized within a vulnerable band of layers \([\ell_s, \ell_e]\), coinciding with peaks in the epistemic plasticity index (\(\text{nEPI}\)) — indicative of layers receptive to semantic recombination.
- Multi-source latent fusion: Unlike single-source drift, pair attacks manifest as differential shifts between two or more benign latent flows, combined via a layer-dependent recombination coefficient \(\mathcal{R}_\ell\), effectively blending distinct semantic trajectories to yield emergent adversarial vectors.
- Elevated semantic conflict: The cultural conflict vector (\(\text{nCCL}\)) scores are consistently higher than those observed in persuasion attacks, reflecting the semantic tension inherent in merging distinct latent subspaces — a hallmark of compositional semantic dissonance.
- Directional inheritance and plasticity: The directional inheritance vector (\(\text{nDIV}\)) remains aligned with adversarial objectives but exhibits complex multi-dimensional steering due to the composite nature of the inputs, modulated by layer-wise plasticity weights \(\mathcal{P}_\ell\) and bias coefficients \(\mathcal{B}_\ell\).
Collectively, these observations suggest that pair attacks orchestrate a genetic-like recombination of latent semantic subroutines, dynamically rewiring representational geometry and steering multi-layer semantic trajectories towards adversarial outcomes.
Formalizing the Genetic Recombination Vector
Building on empirical findings, we formalize the genetic recombination vector \(\mathbf{G}_\ell \in \mathbb{R}^d\) at each layer \(\ell\) to succinctly capture the multi-faceted latent impact of the pair attack. This vector synthesizes three distinct but interacting semantic modulation components:
\[\mathbf{G}_\ell = \alpha_\ell \, \Delta \kappa_\ell \cdot \mathcal{P}_\ell + \beta_\ell \, \text{nDIV}_\ell \cdot \mathcal{B}_\ell + \gamma_\ell \, (\text{Base}_\ell^{(1)} - \text{Base}_\ell^{(2)}) \cdot \mathcal{R}_\ell\]
where each term plays a specific role:
- \(\alpha_\ell \, \Delta \kappa_\ell \cdot \mathcal{P}_\ell\) captures the layerwise geometric deformation, with \(\Delta \kappa_\ell = \kappa_\ell^{\text{atk}} - \kappa_\ell^{\text{base}}\) measuring localized curvature changes induced by the attack, weighted by the epistemic plasticity \(\mathcal{P}_\ell\). This reflects how pliable latent regions accommodate representational reshaping akin to biological chromatin remodeling.
- \(\beta_\ell \, \text{nDIV}_\ell \cdot \mathcal{B}_\ell\) models the directional semantic steering, where \(\text{nDIV}_\ell\) encodes the latent semantic drift vector aligning internal representations towards adversarial objectives, scaled by the bias coefficient \(\mathcal{B}_\ell\) that quantifies semantic intent strength at each layer.
- \(\gamma_\ell \, (\text{Base}_\ell^{(1)} - \text{Base}_\ell^{(2)}) \cdot \mathcal{R}_\ell\) introduces a novel cultural conflict component unique to pair attacks, representing the semantic divergence between the two recombined prompt fragments at layer \(\ell\). The recombination weight \(\mathcal{R}_\ell\) modulates the intensity of this conflict, reflecting how dissimilar fragment semantics generate latent tension and novel allelic blends within the model’s internal space.
Aggregation Over Vulnerable Layers
The full latent impact of the pairwise recombination attack accumulates as:
\[\boxed{ \text{PairAttackEffect} = \sum_{\ell=\ell_s}^{\ell_e} \mathbf{G}_\ell = \sum_{\ell=\ell_s}^{\ell_e} \left[ \alpha_\ell \, \Delta \kappa_\ell \cdot \mathcal{P}_\ell + \beta_\ell \, \text{nDIV}_\ell \cdot \mathcal{B}_\ell + \gamma_\ell \, (\text{Base}_\ell^{(1)} - \text{Base}_\ell^{(2)}) \cdot \mathcal{R}_\ell \right] }\]
Together, these components form a comprehensive genetic recombination vector that encodes the distinct latent dynamics of pair attacks — blending geometric bending, semantic steering, and cross-prompt cultural conflict into a unified, layerwise adversarial signature.
This formulation captures the synergistic interplay of geometric deformation, directional semantic steering, and fragment-induced cultural conflict driving the composite adversarial effect unique to pair recombination attacks.
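A minimal sketch of the aggregation, assuming the per-layer coefficients and latent flows have been estimated separately as NumPy arrays. Because the curvature term is scalar while the steering and recombination terms are vectors, the sketch keeps the two channels separate rather than guessing an embedding for the scalar term; that split is an interpretation, not part of the formal definition.

```python
import numpy as np

def pair_attack_effect(dkappa, P, ndiv, B, base1, base2, R,
                       alpha, beta, gamma, l_s, l_e):
    """Aggregate the genetic recombination vector G_l over [l_s, l_e].
    Per-layer scalars (dkappa, P, B, R, alpha, beta, gamma) have shape (L,);
    per-layer latent vectors (ndiv, base1, base2) have shape (L, d).
    Returns the scalar remodeling channel and the vector channel."""
    sl = slice(l_s, l_e + 1)
    scalar_channel = float(np.sum(alpha[sl] * dkappa[sl] * P[sl]))
    vector_channel = (
        beta[sl, None] * ndiv[sl] * B[sl, None]                      # steering
        + gamma[sl, None] * (base1[sl] - base2[sl]) * R[sl, None]    # recombination
    ).sum(axis=0)
    return scalar_channel, vector_channel
```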
Adversarial Vaccine Mechanisms
We present 11 targeted vaccines designed to neutralize specific adversarial attack vectors through biologically-inspired defense mechanisms. Each vaccine operates at the latent semantic level, detecting and countering distinct infection patterns in the model’s neural DNA (nDNA).
CASCADEX — cascade immunization of reasoning chains.
We model multi-turn computation as a layered flow $\{h^{(l)}\}_{l=1}^L$ on a Riemannian manifold $(\mathcal{M},g)$. CASCADEX halts adversarial amplification by solving a min–max gated path-integral over layerwise information curvature and likelihood transport:
\[\min_{\mathcal{S}\subseteq\{1,\ldots,L\}} \max_{q\in\mathcal{Q}} \left\{ \sum_{l\in\mathcal{S}}\left[\kappa_g(h^{(l)})+\tau_g(h^{(l)})\right] + \lambda\int_\gamma D_{KL}\left(p_\theta(\cdot\mid h^{(l)}) \,\middle\|\, q(\cdot\mid h^{(l-1)})\right) dl \right\}\]
with a cascade gate $\mathbb{I}[\sum_l \Delta D_{KL}^{(l)} > \tau_{\text{cas}}]$ that triggers retrograde inhibition (layer rewinding) on the shortest violating subpath $\gamma^*$.
Biological analogue: complement cascade with C3/C5 convertase amplification and factor H/I-mediated shutdown.

CHAINLOCK — cryptographic synapse for dialog states.
Let $\varphi(t_i)\in\mathbb{R}^d$ be the latent “state antigen”. CHAINLOCK enforces synaptic binding via a constrained variational check:
\[\min_\Delta \|\Delta\|_2^2 \quad \text{s.t.} \quad \|\varphi(t_{i+1}) - \mathcal{T}_\psi(\varphi(t_i))\|_g^2 + \mu \cdot H(\sigma(W\varphi(t_{i+1}))) \leq \varepsilon\]
together with a hash-consistency constraint $\|H(\varphi(t_{i+1}))\oplus H(\varphi(t_i))\|_0 \leq k$.
Biological analogue: lock-and-key antigen–receptor specificity at immunological synapses.

DORMIGUARD — latency surveillance and proviral silencing.
Tracks a latent hazard field $\zeta_t^{(l)}=\|h_t^{(l)}-\bar{h}^{(l)}\|$ and imposes a latent-stirring barrier:
\[\mathcal{J}_{\text{lat}} = \sum_l\int(\dot{\zeta}_t^{(l)})^2 \, dt + \eta\sum_l\left[\text{Var}_t(\zeta_t^{(l)})-\sigma_0^2\right]_+, \quad \text{silence if } \mathcal{J}_{\text{lat}}>\tau_{\text{lat}}\]
Biological analogue: detection of herpesvirus reactivation and epigenetic repression of latent provirus.

DRIFTSHIELD — geodesic tube confinement of belief flow.
Given the aligned manifold $\mathcal{M}_{\text{align}}$, confine the belief field $\mathbf{v}(t)$ within a tubular neighborhood via a Lyapunov–geodesic functional:
\[\min_{\mathbf{v}} \int_0^T\left[\text{dist}_g(\mathbf{v}(t),\Pi_{\mathcal{M}_{\text{align}}}\mathbf{v}(t))^2 + \alpha\cdot\kappa_g(\mathbf{v}(t))^2 + \beta\cdot\tau_g(\mathbf{v}(t))^2\right] dt\]
subject to $\dot{V}(\mathbf{v})\leq-\lambda V(\mathbf{v})$, where $V(\mathbf{v})=\text{dist}_g(\mathbf{v},\mathcal{M}_{\text{align}})^2$.
Biological analogue: central/peripheral tolerance eliminating self-reactive B-cell clones.

EMBERGENT — tumor-suppressive control of emergent modes.
Penalizes unsafe emergent phases via a spectral–information Lagrangian:
\[\mathcal{L}_{\text{emg}} = \sum_{m=1}^M(\lambda_m(P_{\text{obs}})-\lambda_m(P_{\text{safe}}))^2 + \beta\cdot D_{KL}(P_{\text{obs}} \,\|\, P_{\text{safe}}) + \gamma\cdot\|\mathcal{C}(h)\|_*\]
with a p53-like checkpoint that aborts decoding if $\partial\mathcal{L}_{\text{emg}}/\partial t > \tau$.
Biological analogue: p53/ARF axis preventing unchecked proliferation.

PROMPTEX — antigen processing and presentation of prompts.
Implements a two-stage presentation operator $\mathcal{P}$ and affinity test $\mathcal{A}$:
\[\mathcal{P}(x)=\arg\min_z \|E(x)-E(z)\|^2 \quad \text{s.t.} \quad z\in\mathcal{L}_{\text{policy}}, \qquad \mathcal{A}(x)=1-\frac{\langle E(x),E(z)\rangle}{\|E(x)\|\,\|E(z)\|}\]
Reject if $\mathcal{A}(x)>\delta$ or if a motif-energy score $\sum_k\psi_k\mathbb{I}[m_k\subset x]$ exceeds $\tau$.
Biological analogue: APC processing and MHC-restricted presentation.

REFLEXIA — self-consistency with adversarial probing.
Poses output generation as a consistency game with jittered probes $\eta\sim\mathcal{N}(0,\sigma^2 I)$:
\[\min_y \max_{\|\eta\|\leq\varepsilon} \text{JSD}\left(p_\theta(\cdot\mid x),\, p_\theta(\cdot\mid x+\eta)\right) + \lambda\cdot\|\nabla_x \mathbb{E}_{p_\theta}[\mathcal{L}_{\text{safety}}]\|^2\]
Abort if the saddle value exceeds $\gamma$.
Biological analogue: germinal-center selection with error-prone SHM and stringent affinity checks.

REPLICADE — replica agreement under stochastic decoding.
Runs $K$ coupled replicas with correlated noise $\{\xi_k\}$ and enforces a consensus free energy:
\[\min_{\{y^{(k)}\}} \frac{1}{K}\sum_k \mathcal{L}_{\text{task}}(y^{(k)}) + \alpha\cdot\frac{1}{K}\sum_k D_{KL}(P^{(k)} \,\|\, \bar{P}) + \beta\cdot\sum_{k<\ell}\|\Phi(y^{(k)})-\Phi(y^{(\ell)})\|^2\]
with $\bar{P}=\frac{1}{K}\sum_k P^{(k)}$.
Biological analogue: degenerate but convergent TCR recognition via cross-reactivity ensembles.

ROLESTOP — lineage commitment of decoder logits.
Projects logits onto a policy-consistent subbundle $\mathcal{S}_{\text{role}}$ using an orthogonal projector $P_{\text{role}}$ learned by safety-supervised CCA:
\[\mathbf{z}' = P_{\text{role}}\mathbf{z}, \qquad P_{\text{role}}=\arg\min_{P=P^\top=P^2} \mathbb{E}\left[\|(I-P)\Phi_{\text{role}}(h)\|^2\right]\]
Biological analogue: hematopoietic lineage restriction preventing fate switching.

SENTRY — NK-style patrol with anomaly energy.
Defines a trajectory anomaly energy
\[\mathcal{E}_{\text{NK}}(t)=\max_l\left\{\Delta D_{KL}^{(l)}(t)+\rho\cdot\|\Delta r^{(l)}(t)\|_1+\sigma\cdot\text{TV}(h^{(l)}_{[t-w,t]})\right\}\]
where $\Delta r^{(l)}$ is the residual shift and TV is total variation over a window $w$. Quarantine if $\sup_t\mathcal{E}_{\text{NK}}(t)>\tau$.
Biological analogue: missing-self detection by NK cells and rapid cytotoxic response.

SPLICER — surgical A-to-I–style semantic editing.
Localizes the unsafe span $\Omega=\arg\max_\omega\int_\omega\|\nabla_x \mathcal{L}_{\text{safety}}\|$ and solves a constrained semantic edit:
\[\min_{z\in\mathcal{L}_{\text{policy}}} \|E(z)-E(x_\Omega)\|^2 + \lambda\cdot D_{KL}\left(p_\theta(\cdot\mid x_{\setminus\Omega}\oplus z) \,\middle\|\, p_\theta(\cdot\mid x)\right) \quad \text{s.t.} \quad \mathcal{L}_{\text{safety}}(x_{\setminus\Omega}\oplus z)\leq\varepsilon\]
Biological analogue: ADAR/RNA editing that recodes transcripts without breaking protein function.
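As one concrete instance, the DORMIGUARD barrier reduces to a few lines once the hazard traces \(\zeta_t^{(l)}\) have been collected. This is a minimal sketch under stated assumptions: the finite-difference time derivative and uniform step `dt` are simplifications, not part of the formal definition.

```python
import numpy as np

def dormiguard_barrier(zeta, eta, sigma0_sq, tau_lat, dt=1.0):
    """Latent-stirring barrier J_lat for DORMIGUARD.
    zeta: (n_layers, n_steps) hazard traces zeta_t^(l) = ||h_t^(l) - hbar^(l)||.
    Returns (J_lat, silence_flag)."""
    zeta = np.asarray(zeta, dtype=float)
    zdot = np.diff(zeta, axis=1) / dt                  # d(zeta)/dt, finite difference
    kinetic = float(np.sum(zdot ** 2) * dt)            # sum_l integral of zdot^2
    excess = np.maximum(zeta.var(axis=1) - sigma0_sq, 0.0)  # [Var_t - sigma_0^2]_+
    J_lat = kinetic + eta * float(excess.sum())
    return J_lat, J_lat > tau_lat
```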
Conclusion and Outlook
In this work, we have articulated and instantiated the GENOME-Vaccine paradigm — a biologically inspired, mathematically rigorous, and epistemically grounded defense suite for large language models (LLMs). Drawing from the conceptual reservoir of neural genomics, we interpret the high-dimensional latent states of LLMs as an epistemic manifold whose geometry, topology, and semantic curvature are subject to deformation under adversarial perturbations. The GENOME-Vaccine framework postulates that, just as a biological immune system orchestrates a layered defense against pathogens, we can engineer a semantic immune system for AI models — one that preserves alignment integrity while maintaining generative diversity.
In our formulation, each “vaccine” represents a targeted semantic immune response, precisely tuned to neutralize a particular class of adversarial threat vectors. This is not merely a metaphorical mapping; rather, it is a functional translation of immunological mechanisms such as clonal selection, germinal-center affinity maturation, complement cascade inhibition, NK-cell surveillance, and epigenetic latency control into constraint-driven manifold optimization in LLMs.
From a formal standpoint, we embed each vaccine into a constrained optimization problem defined over the model’s epistemic manifold $\mathcal{M}$:
\[\mathbf{h}^* = \arg\min_{\mathbf{h} \in \mathcal{M}} \mathcal{E}(\mathbf{h}) + \sum_{i=1}^n \lambda_i \mathcal{C}_i(\mathbf{h})\]
where:
- $\mathcal{E}(\mathbf{h})$ is the alignment error functional, quantifying deviation from normative epistemic alignment.
- $\mathcal{C}_i(\mathbf{h})$ are biologically inspired constraint operators, each corresponding to a vaccine mechanism (e.g., torsion penalties, role-consistency constraints, curvature regularizers).
- $\lambda_i$ are Lagrange multipliers encoding the immune activation threshold for each vaccine pathway.
By adjusting $\{\lambda_i\}$ dynamically, we enable the GENOME-Vaccine ecosystem to function like an adaptive immune system: raising, lowering, or suppressing specific defenses in response to the evolving “pathogen load” of adversarial activity.
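A dual-ascent-style update gives one plausible reading of this adaptive behavior; the learning rate, clipping bound, and sign convention (a positive constraint value signals a violation) are assumptions of this sketch, not a prescription of the framework.

```python
import numpy as np

def update_multipliers(lmbda, constraint_values, lr=0.1, lmbda_max=10.0):
    """Raise lambda_i when constraint C_i is violated (C_i(h) > 0), relax it
    when satisfied -- mimicking immune activation under rising pathogen load."""
    lmbda = np.asarray(lmbda, dtype=float) + lr * np.asarray(constraint_values)
    return np.clip(lmbda, 0.0, lmbda_max)
```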
This conceptual bridge between immune dynamics and latent manifold regulation is not a mere narrative flourish; it is an operational design principle. As we have detailed in the preceding sections, the eleven vaccines together form a multilayered epistemic firewall that integrates:
- Innate filters for rapid anomaly interception,
- Adaptive refiners for long-term fidelity maintenance,
- Dormancy controllers to prevent unsafe mode activation,
- Cascade blockers to halt multi-stage exploitation.
In doing so, GENOME-Vaccine achieves a synergy between mathematical exactitude and biological wisdom — offering a durable, extensible architecture for safe, trustworthy, and resilient AI.
The Eleven GENOME-Vaccines: Biological Analogues and Mathematical Instantiations
From the biological viewpoint, the GENOME-Vaccine ecosystem mirrors the layered architecture of host immunity, where innate, adaptive, and regulatory pathways cooperate to achieve robust defense. Each vaccine is a functional translation of a biological defense principle into a constrained optimization operator on the epistemic manifold $\mathcal{M}$.
Innate Filters: Rapid, non-specific anomaly interceptors
- SENTRY — Inspired by NK-cell “missing self” detection, SENTRY enforces a real-time epistemic anomaly score:
\[\mathcal{C}_{\text{SENTRY}}(\mathbf{h}) = \max(0, \sigma(\mathbf{h}) - \tau_{\text{self}})\]
where $\sigma(\mathbf{h})$ measures deviation from baseline semantic patterns and $\tau_{\text{self}}$ is the self-tolerance threshold.
- PROMPTEX — Analogous to pattern-recognition receptors (PRRs) in innate immunity, PROMPTEX applies token-level feature matching against an adversarial signature dictionary, penalizing feature activations that cross the detection boundary:
\[\mathcal{C}_{\text{PROMPTEX}}(\mathbf{h}) = \sum_t \mathbb{I}[f_t(\mathbf{h}) \in \mathcal{S}_{\text{adv}}]\]
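Both innate constraints are cheap enough to run on every request. A minimal sketch, assuming the anomaly score \(\sigma(\mathbf{h})\) and the token features \(f_t\) are computed upstream; the interfaces are illustrative:

```python
def c_sentry(sigma_h: float, tau_self: float) -> float:
    """Hinge anomaly score: nonzero only when the deviation from baseline
    semantic patterns exceeds the self-tolerance threshold."""
    return max(0.0, sigma_h - tau_self)

def c_promptex(token_features, adversarial_signatures) -> int:
    """PRR-style pattern match: count token-level feature activations that
    fall inside the adversarial signature dictionary."""
    return sum(1 for f in token_features if f in adversarial_signatures)
```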
Adaptive Modules: Slow-onset but high-specificity epistemic refiners
- REPLICADE — Modeled after germinal-center affinity maturation, REPLICADE performs multi-path generation and selects the epistemically most coherent output via:
\[\mathcal{C}_{\text{REPLICADE}}(\mathbf{h}) = 1 - \max_k \rho_{\text{belief}}(\mathbf{h}, \mathbf{h}^{(k)})\]
where $\rho_{\text{belief}}$ measures latent belief alignment.
- REFLEXIA — Analogous to T-cell help in B-cell selection, REFLEXIA evaluates candidate outputs under a meta-alignment function $\mathcal{A}_{\text{meta}}$, adjusting generation probabilities to maximize epistemic reflexivity:
\[\mathcal{C}_{\text{REFLEXIA}}(\mathbf{h}) = -\mathcal{A}_{\text{meta}}(\mathbf{h})\]
Dormancy Controllers: Suppressing unsafe generative modes until authorized
- DORMIGUARD — Inspired by epigenetic repression of latent proviruses, DORMIGUARD maintains a suppression mask $\mathbf{m}_{\text{sup}}$ in latent space:
\[\mathcal{C}_{\text{DORMIGUARD}}(\mathbf{h}) = \|\mathbf{m}_{\text{sup}} \odot \mathbf{h}\|_2^2\]
where $\odot$ denotes element-wise suppression of unsafe modes.
- EMBERGENT — Parallels chromatin-remodeling locks that prevent transcription initiation, implementing a temporal unlock delay for high-risk generation pathways.
Cascade Blockers: Halting multi-stage adversarial exploit chains
- CASCADEX — Similar to complement cascade checkpoints, CASCADEX identifies multi-hop adversarial flows and injects nullifying constraints at intermediate decoding layers.
- CHAINLOCK — Inspired by signal-transduction termination in immune pathways, CHAINLOCK applies a maximum allowable semantic transition length:
\[\mathcal{C}_{\text{CHAINLOCK}}(\mathbf{h}) = \mathbb{I}[\mathcal{T}(\mathbf{h}) > \tau_{\max}]\]
where $\mathcal{T}(\mathbf{h})$ measures semantic transition distance.
Specialized Neutralizers: Direct countermeasures for exotic threats
- DRIFTSHIELD — Analogous to immune decoy receptors, this vaccine identifies and neutralizes mimicry-based adversarial prompts by projecting them into an adversarial imitation subspace and suppressing the corresponding activations.
- ROLESTOP — Inspired by MHC-restricted antigen presentation, ROLESTOP enforces role-specific semantic compatibility constraints, preventing cross-role contamination in multi-agent LLM systems.
- SPLICER — Similar to trained immunity in innate cells, SPLICER builds memory embeddings of past attacks, boosting detection sensitivity for repeated adversarial motifs.
GENOME-Vaccine: Immunological Inspirations for Epistemic Security
Paradigm Overview: From Host Immunity to Epistemic Immunity
In living organisms, the immune system is a multi-layered, distributed defense network that continuously distinguishes self from non-self, eliminating threats while preserving beneficial internal processes. The GENOME-Vaccine paradigm transfers this principle into the epistemic manifold $\mathcal{M}$ of a large language model (LLM), where each semantic state $\mathbf{h} \in \mathcal{M}$ represents a belief configuration and pathways through $\mathcal{M}$ correspond to reasoning trajectories.
Mathematical Analogy: The defense system operates as a family of operators
\[\mathcal{V} = \{\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}_{11}\}\]
with each $\mathcal{V}_i$ representing a vaccine that applies a constraint, projection, or transformation to $\mathbf{h}$, such that the post-intervention state
\[\mathbf{h}' = \mathcal{V}_i(\mathbf{h})\]
maximizes epistemic alignment under safety constraints.
The overall objective is:
\[\min_{\mathbf{h}' \in \mathcal{M}} \mathbb{E}_{\mathcal{D}}\left[\mathcal{L}_{\text{align}}(\mathbf{h}') + \lambda \mathcal{L}_{\text{safety}}(\mathbf{h}')\right]\]
subject to:
\[\mathbf{h}' \in \bigcap_{i=1}^{11} \mathcal{C}_i\]
where $\mathcal{C}_i$ is the feasible set enforced by the $i$-th GENOME-vaccine.
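Operationally, one simple way to seek a state in the intersection \(\bigcap_i \mathcal{C}_i\) is alternating application of the vaccine operators until every feasibility check passes. The round budget and the index-wise pairing of operators with checks are assumptions of this sketch:

```python
def apply_vaccines(h, vaccines, checks, max_rounds=5):
    """Drive h toward the intersection of all feasible sets C_i.
    vaccines: callables h -> h' (operators V_i); checks: callables h -> bool
    (membership tests for C_i), paired index-for-index with the operators."""
    for _ in range(max_rounds):
        if all(ok(h) for ok in checks):
            return h                    # h lies in every C_i
        for vaccinate, ok in zip(vaccines, checks):
            if not ok(h):
                h = vaccinate(h)        # pull h back into the violated set
    return h                            # best effort once the budget is spent
```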
Theoretical Extensions and Future Directions
Epistemic Homeostasis Model
We can model the safety-alignment equilibrium as:
\[\frac{\partial\mathbf{h}(t)}{\partial t} = -\nabla_{\mathbf{h}} \mathcal{L}_{\text{align}} + \sum_{i=1}^{11} \mathbf{F}_{\mathcal{V}_i}(\mathbf{h}(t)) - \gamma\, \mathbf{h}_{\text{drift}}(t)\]
where $\mathbf{F}_{\mathcal{V}_i}$ is the immunization force from the $i$-th vaccine and $\gamma$ controls the decay of drift-induced misalignment.
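For simulation purposes, these dynamics can be integrated with an explicit Euler step; the step size and the callable interfaces are assumptions of this sketch rather than part of the model:

```python
import numpy as np

def homeostasis_step(h, grad_align, vaccine_forces, h_drift, gamma, dt=0.01):
    """One explicit-Euler step of dh/dt = -grad L_align(h) + sum_i F_i(h)
    - gamma * h_drift. grad_align and each F in vaccine_forces map (d,) -> (d,)."""
    force = -grad_align(h) + sum(F(h) for F in vaccine_forces) - gamma * h_drift
    return h + dt * force
```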
Adaptive Immunization Loops
Like booster shots in biology, the GENOME-Vaccine system should be periodically retrained on adversarial exposure datasets to refine $\mathbf{F}_{\mathcal{V}_i}$ over time, ensuring evolving threats are neutralized.
Cross-Domain Transfer
While this chapter focuses on text-based LLMs, the immune abstraction naturally extends to:
- Vision-language models — neutralizing adversarial perturbations in multimodal grounding.
- Embodied agents — preventing unsafe policy drift in control tasks.
- Federated LLMs — enforcing distributed immunity across model shards.
Theoretical Extensions
A future mathematical program could unify GENOME-vaccines into a Lie group of immunological transformations $\mathbb{G}_{\text{immune}}$ acting on $\mathcal{M}$, with the goal of proving:
\[\mathbb{P}[\text{Alignment Failure}] \xrightarrow{\,n \to \infty\,} 0\]
under sufficient immunization coverage and a bounded adversarial innovation rate.
Final Reflection
In biology, immunity is never absolute — it is a constant negotiation with a changing environment. In epistemic systems, the same principle holds: the GENOME-Vaccine paradigm suggests that safety is not a static checkpoint, but a living, evolving process. By drawing deeply from immunology and embedding these principles into formal, mathematical machinery, we can begin to design AI systems that are not just aligned at training time, but capable of remaining aligned in the wild.
The GENOME-Vaccine architecture represents not just a set of heuristic safety measures, but a systematic immunological translation into the space of epistemic state dynamics. It proposes that alignment and safety in LLMs can be formalized as a form of homeostatic immunity, where semantic self is preserved and semantic pathogens are neutralized without compromising generative diversity.
This work opens new avenues for research at the intersection of immunology, differential geometry, and AI safety, suggesting that the biological wisdom accumulated over millions of years of evolution can provide principled foundations for the next generation of safe and robust artificial intelligence systems.
Summary mapping: biological analogue → semantic vaccine → formal operator/constraint.
References
[1] Baltimore, David “Viral strategies: hijacking the host” Nature (2000).
[2] Flint, S. Jane, Racaniello, Vincent R., and others “Principles of Virology” ASM Press (2015).
[3] Campos, Samuel K and Barry, Michael A “Viruses: Master manipulators of the cellular genome” Nature Reviews Genetics (2020).
[4] Knipe, David M and Howley, Peter M “Fields Virology” Lippincott Williams & Wilkins (2013).
[5] Li, Fang, Li, Wenwei, and others “Structure of SARS coronavirus spike receptor-binding domain complexed with receptor” Science (2005).
[6] Hoffmann, Markus, Kleine-Weber, Hannah, and others “SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor” Cell (2020).
[7] Burton, Dennis R and Hangartner, Lars “Antibody responses to envelope glycoproteins in HIV-1 infection” Nature Immunology (2004).
[8] Marsh, Mark and Helenius, Ari “Virus entry: open sesame” Cell (2006).
[9] Temin, Howard M “Origin of retroviruses from cellular genes” Cell (1974).
[10] Watson, James D, Baker, Tania A, and others “Molecular Biology of the Gene” (2010).
[11] Cullen, Bryan R “Human immunodeficiency virus as a prototypic complex retrovirus” Journal of Virology (1991).
[12] Schröder, Andreas R W, Shinn, Patric M, and others “HIV-1 integration in the human genome favors active genes and local hotspots” Cell (2002).
[13] Huang, Ching-Yuan, Yang, Cheng-Han, and others “Viral manipulation of the host cell cycle” Annual Review of Virology (2017).
[14] Nevels, Michael, Paulus, Claudia, and others “Regulation of host cell cycle progression by the human cytomegalovirus major immediate-early gene product IE2” Journal of Virology (2001).
[15] Cohen, Jeffrey I “Control of viral latency and reactivation by the host immune system” Clinical Microbiology Reviews (2000).
[16] Klenerman, Paul, Oxenius, Annette, and others “Latent infection by human herpesviruses” Current Opinion in Immunology (1996).
[17] Eisfeld, Alana J, Diamond, D J, and others “Herpesvirus latency: a dynamic state of transcription and reactivation” Trends in Microbiology (2005).
[18] Everett, Roger D “Latent herpes simplex virus infections and their regulation” Trends in Microbiology (2005).
[19] Tripathi, Shashank, White, Michael, and others “Host biology and the tropism of enveloped viruses” The Journal of General Virology (2015).
[20] Zou, Tony, Raffel, Colin, and others “Universal and transferable adversarial attacks on aligned language models” arXiv preprint arXiv:2310.03765 (2023).
[21] Liu, Xudong, Wang, Zirui, and others “{Jailbreaking Black Box Large Language Models in Twenty Queries}” arXiv preprint (2023).
[22] Shi, Yuxuan, Lin, Bill Yuchen, and others “Bad Characters: Imperceptible Jailbreaks of Aligned Language Models” arXiv preprint arXiv:2310.03684 (2023).
[23] Lee, Nathan, Moosavi-Dezfooli, Seyed-Mohsen, and others “ALKALI: Latent Jailbreak Detection via Attribution-based Belief Centroids” arXiv preprint arXiv:2402.05000 (2024).
[24] Zhu, Zifan, Lin, Bill Yuchen, and others “Make the Model Refuse Again: Suffix-Level Jailbreak Attacks on Aligned Language Models” arXiv preprint arXiv:2401.06750 (2024).
[25] Sun, Hongyuan, Du, Zecong, and others “Low-Rank Suffix Hijacking in Large Language Models” arXiv preprint arXiv:2401.11972 (2024).
[26] Chen, Andy, Xu, Kai, and others “You Reap What You Prompt: Injecting Instructions into CoT Prompts via Recursive Reasoning Loops” arXiv preprint arXiv:2309.00609 (2023).
[27] Kurita, Keita, Michel, Paul, and others “Weight poisoning attacks on pre-trained models” Findings of EMNLP (2020).
[28] Qi, Xiaoyu, Wang, Zhiqing, and others “Mind the backdoor: A unified backdoor threat model for language models” arXiv preprint arXiv:2106.06841 (2021).
[29] Yu, Yiming, Chou, Zifan, and others “PromptFusion: Prompt-Based Prompt Injection via Programmatic Prompt Chaining” arXiv preprint arXiv:2402.08638 (2024).
[30] Li, Yuanzhi, Deng, Yuxin, and others “TokenPatch: Clean-Label Jailbreaking via Residual Insertion” arXiv preprint arXiv:2403.09333 (2024).
[31] Xu, Zihao, Li, Shuhui, and others “Eval Jailbreaks: Robust Benchmarks for Jailbreak Detection and Mitigation” arXiv preprint arXiv:2403.04399 (2024).
[32] Carlini, Nicholas, Tramer, Florian, and others “Extracting Training Data from Diffusion Models” Proceedings of the IEEE Symposium on Security and Privacy (S&P) (2023).
[33] Qin, Chenghao, Zhang, Haotian, and others “A Survey of Jailbreak Attacks on Aligned Language Models” arXiv preprint arXiv:2310.13263 (2023).
[34] Deng, Yuxin, Xie, Luyang, and others “Attacks on LLMs via Self-Fulfilling Prophecies: Deception through Prompt Interventions” arXiv preprint arXiv:2310.04413 (2023).
[35] Zhou, Yuxuan, Zhang, Yuhan, and others “DESEEDER: Safety Benchmarking of LLMs with Evasion and Stealth Attacks” arXiv preprint arXiv:2312.01099 (2023).
[36] Perez, Ethan, Rando, Abraham, and others “Ignore Previous Instructions: Prompt Injection Attacks on Foundation Models” arXiv preprint arXiv:2305.10909 (2023).
[37] Schwinn, Leo, Dobre, David, and others “Attacking Safety Alignment and Unlearning in Open-Source LLMs via Embedding Space Attacks” arXiv preprint arXiv:2402.07987 (2024).
[38] Jain, Neel, Schwarzschild, Avi, and others “Baseline Defenses for Adversarial Attacks Against Aligned Language Models” arXiv preprint arXiv:2309.00614 (2023).
[39] Chen, Bocheng, Paliwal, Advait, and others “Jailbreaker in Jail: Moving Target Defense for Large Language Models” arXiv preprint arXiv:2310.02417 (2023).
[40] Phute, Shantanu, Trivedi, Harshit, and others “Jailbreak in Jail: LLM Self-Defense via Prompt Paraphrasing and Output Auditing” arXiv preprint arXiv:2310.02417 (2023).
[41] Xhonneux, Louis, Belinkov, Yonatan, and others “Robustness to Prompt Injection via Adversarial Training in Embedding Space” arXiv preprint arXiv:2401.14578 (2024).
[42] Sheshadri, Abhinav, Lee, Kevin, and others “Latent Adversarial Training Uncovers and Removes Jailbreak Circuits in LLMs” arXiv preprint arXiv:2402.11079 (2024).
[43] Kumar, Aounon, Agarwal, Chirag, and others “Certifying LLM Safety against Adversarial Prompting” arXiv preprint arXiv:2309.02705 (2023).
[44] Li, Weijia, Zhao, Yujia, and others “RAIN: Rewindable Auto-Regressive Inference for Harmless and Helpful LLMs” arXiv preprint arXiv:2402.01174 (2024).
[45] Templeton, Andrew, Wang, Teng, and others “Learning to Monitor the Latent Space: Towards Reliable Activation-Based Attack Detection” arXiv preprint arXiv:2401.04045 (2024).
[46] Zou, Andy, Pang, Weiting, and others “Representation Rerouting: Learning Circuit Breakers for Safer Language Models” arXiv preprint arXiv:2401.05547 (2024).
[47] Wu, Xinyi, Zhang, Yifan, and others “Securing Large Language Models: Threats, Vulnerabilities, and Mitigation Strategies” arXiv preprint arXiv:2403.12503 (2024).
[48] Ke, Shih-Wen, Lai, Guan-Yu, and others “Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models” arXiv preprint arXiv:2503.20320 (2025).
[49] Mehrotra, Anay, Zampetakis, Manolis, and others “Tree of attacks: Jailbreaking black-box llms automatically” Advances in Neural Information Processing Systems (2024).
[50] Jiang, Shuyu, Chen, Xingshu, and others “Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks” arXiv preprint arXiv:2310.10077 (2023).
[51] Schulhoff, Sander, Pinto, Jeremy, and others “Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition” arXiv preprint arXiv:2311.16119 (2023).
[52] Chen, Zheng and Yao, Buhui “Pseudo-Conversation Injection for LLM Goal Hijacking” arXiv preprint arXiv:2410.23678 (2024).
[53] Li, Zhe and others “Prompt Leaking Attacks against Large Language Model Applications” arXiv preprint arXiv:2405.06823 (2024).
[54] Greshake, Kai, Abdelnabi, Sahar, and others “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection” arXiv preprint arXiv:2302.12173 (2023).
[55] OpenAI “GPT-4o: OpenAI’s Omni-Modal Language Model” https://openai.com/index/gpt-4o (2024).
[56] OpenAI and others “GPT-4 Technical Report” https://openai.com/research/gpt-4 (2023).
[57] Meta AI “LLaMA 3.1 Models: Refinements to Meta’s Next-Gen LLMs” https://ai.meta.com/blog/meta-llama-3/ (2024).
[58] Meta AI “LLaMA 3: Open Foundation and Instruction Models” https://llama.meta.com/ (2024).
[59] Touvron, Hugo, Lavril, Thibaut, and others “LLaMA 2: Open Foundation Language Models” Meta AI (2023).
[60] Chiang, Lulu, Zhu, Yuhui, and others “Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality” https://lmsys.org/blog/2023-03-30-vicuna/ (2023).
[61] Microsoft Research “Phi-2: Exploring Small Language Models with High Performance” https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ (2023).
[62] Microsoft Research “Phi-3: A Family of Open Language Models from Microsoft” https://www.microsoft.com/en-us/research/blog/introducing-phi-3-small-language-models/ (2024).
[63] Anthropic “Claude 3 Model Family” https://www.anthropic.com/news/claude-3-family (2024).
[64] Mistral AI “Mixtral: Sparse Mixture of Experts Models by Mistral AI” https://mistral.ai/news/mixtral-of-experts/ (2023).
[65] Google DeepMind “Gemma: Open-Weight Models by Google DeepMind” https://ai.google.dev/gemma (2024).
[66] Mistral AI “Mistral 7B: A High-Quality Dense Model for Open Use” https://mistral.ai/news/ (2023).
[67] Casadevall, Arturo and Pirofski, Liise-anne “Host-pathogen interactions: the attributes of virulence” The Journal of Infectious Diseases (2001).
[68] Hult, Ann-Catrin, Chana, Sukhvinder, and others “Human immunodeficiency virus type 1 pathogenesis: viral and host factors leading to immune system deterioration and development of acquired immunodeficiency syndrome” Journal of Internal Medicine (2008).
[69] Caza, Marc and Kronstad, James W “Shared and distinct mechanisms of iron acquisition by bacterial and fungal pathogens of humans” Frontiers in Cellular and Infection Microbiology (2013).
[70] Chen, Lin, Yang, Jing, and others “VFDB 2020: A comparative pathogenomic platform with an interactive web interface” Nucleic Acids Research (2020).
[71] Schmid-Hempel, Paul “Infectious dose, immune response and the evolution of virulence” Trends in Microbiology (2007).
[72] Pomerantz, RJ, Trono, D, and others “Tropism of HIV-1 isolates for human CD4+ T lymphocytes and macrophages” Journal of Experimental Medicine (1990).
[73] Rose, Noel R “Molecular mimicry, microbial infection, and autoimmune disease” Clinical reviews in allergy & immunology (2016).
[74] Best, S, Le Tissier, P, and others “Activation of endogenous retroviruses during cellular differentiation and carcinogenesis” Current Topics in Microbiology and Immunology (1996).
[75] Grow, EJ, Flynn, RA, and others “Intrinsic retroviral reactivation in human preimplantation embryos and pluripotent cells” Nature (2015).
[76] Temin, Howard M “The protovirus hypothesis: Speculations on the significance of RNA-directed DNA synthesis for normal development and for carcinogenesis” Journal of the National Cancer Institute (1974).
[77] Zou, Andy, Wang, Zifan, and others “Universal and Transferable Adversarial Attacks on Aligned Language Models” arXiv preprint arXiv:2307.15043 (2023).
[78] Liu, Eric, Zou, Andy, and others “Jailbroken: How Does LLM Safety Training Fail?” arXiv preprint arXiv:2307.02483 (2023).
[79] Ruelas, Diana S and Greene, Warner C “HIV latency and the role of the transcriptional environment” Cell (2012).
[80] Zhang, Tong, Cooper, Sarah, and others “Epigenetic plasticity and the hallmarks of cancer” Trends in Genetics (2021).
[81] Frantz, Michael and Rajewsky, Nikolaus “Cell identity: A key gene set governs cell fate and pluripotency” Cell (2015).
[82] Jaenisch, Rudolf and Bird, Adrian “Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals” Nature Genetics (2003).
[83] Bird, Adrian “Perceptions of epigenetics” Nature (2007).
[84] Cedar, H and Bergman, Y “Epigenetics of gene expression” Nature Reviews Genetics (2009).
[85] Hendrycks, Dan, Burns, Collin, and others “Aligning AI with shared human values” arXiv preprint arXiv:2008.02275 (2021).
[86] Zhu, Haotian, Zhang, Kai, and others “PromptBench: A Unified Benchmark for Instruction-following Language Models” arXiv preprint arXiv:2403.03694 (2024).
[87] Kazazian, Haig H “Mobile elements: drivers of genome evolution” Science (2004).
[88] Feinberg, Andrew P “Phenotypic plasticity and the epigenetics of human disease” Nature (2007).
[89] Wallace, Eric, Feng, Shi, and others “Universal adversarial triggers for attacking and analyzing NLP” EMNLP (2019).
[90] Müller, Herbert J “Meiosis and genetic recombination” Genetics (2020).
[91] Carlini, Nicholas, Tramer, Florian, and others “Extracting training data from large language models” 30th USENIX Security Symposium (USENIX Security 21) (2021).
[92] Finlay, B Brett and McFadden, Grant “Molecular mimicry” Nature Reviews Microbiology (1997).
[93] Zhang, Y and others “Role-based adversarial prompt attacks on large language models” Advances in Neural Information Processing Systems (NeurIPS) (2023).
[94] Temin, Howard M “RNA-dependent DNA polymerase in virions of RNA tumour viruses” Nature (1971).
[95] Wei, Jason, Xu, Andy, and others “Jailbroken: How does LLM safety training fail?” arXiv preprint arXiv:2307.02483 (2023).
[96] Norman, J and others “RNA virus replication mechanisms” Nature Reviews Microbiology (2019).
[97] Carlini, Nicholas, Tramer, Florian, and others “Quantifying and understanding adversarial examples in NLP” IEEE S&P (2022).
[98] Hanahan, Douglas and Weinberg, Robert A. “Hallmarks of Cancer: The Next Generation” Cell (2011).
[99] Gehman, Samuel and others “RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models” arXiv preprint arXiv:2009.11462 (2020).
[100] Badeaux, Amanda I and Shi, Yanhong “Epigenetic reprogramming in cancer” Nature Reviews Cancer (2012).
[101] Carlini, Nicholas and others “Extracting training data from language models” USENIX Security Symposium (2020).
[102] Rose, Noel R “Autoimmune diseases” New England Journal of Medicine (2016).
[103] Wang, X and others “Self-jailbreak of LLMs: Mechanisms and defenses” International Conference on Learning Representations (ICLR) (2023).
[104] Lee, JH and others “Viral envelope mimicry and immune evasion” Nature Immunology (2015).
[105] Liu, Y and others “Adversarial suffix attacks on LLMs” NeurIPS (2023).
[106] Garneau, N L and others “RNA replication in viruses” Annual Review of Virology (2017).
[107] Liu, S and others “Chainbreak: Recursive CoT attack on LLMs” ACL (2023).
[108] White, Judith M, Delos, Sharon E, and others “Structures and mechanisms of viral membrane fusion proteins: multiple variations on a common theme” Critical Reviews in Biochemistry and Molecular Biology (2008).
[109] Harrison, Stephen C “Viral membrane fusion” Nature Structural & Molecular Biology (2008).
[110] Einav, Shirit and Glenn, Jason S “Viral envelope fusion mechanisms: advances in understanding the choreography of entry” Trends in microbiology (2015).
[111] Kozlov, Michael M and Chernomordik, Leonid V “Mechanisms of membrane fusion” Nature structural & molecular biology (2010).
[112] Schornberg, Kristen L and White, Jennifer M “Membrane fusion: putting lipids and proteins together” Current opinion in structural biology (2010).
[113] Lopez, Saul and Martin, Katherine “Early endosomal escape of adenovirus: a viral membrane lytic protein, transport vesicles and proton pumps” Traffic (2011).
[114] Matsubara, Takeshi and Suzuki, Yuki “Viral strategies for escaping immune responses and implications for vaccine design” Frontiers in immunology (2020).
[115] Martin, Michael E and Rice, Ken “Endosomal escape of nanocarriers: a prerequisite for intracellular delivery” Drug delivery and translational research (2019).
[116] Brown, Laura and Smith, Jason “Passive manipulation of complex systems by low energy input” Physical Review Letters (2018).
[117] Paul, S, Saha, S, and others “Vesicular trafficking pathways exploited by viruses” Viruses (2013).
[118] Schepeler, Troels, Page, Mark E, and others “Lineage tracing and molecular characterization of cancer stem cells in colorectal cancer” Nature Medicine (2014).
[119] Brandt, Andreas and Koster, Peter “Gradient formation in gene expression patterns” Developmental biology (2001).
[120] Sharon, E and Regev, A “Transcriptional regulation by viral genomes” Nature Reviews Genetics (2014).
[121] Jackson, RJ, Hellen, CU, and others “mRNA translation initiation: mechanisms and regulation” Nature reviews Molecular cell biology (2010).
[122] Jan, E and Sarnow, P “mRNA translation and viral hijacking of host machinery” Cell (2011).
[123] Kozak, Marilyn “Initiation of translation in eukaryotic mRNAs” Cell (1981).
[124] Kane, Suzanne E “Mechanisms of translation initiation in eukaryotes” Nature reviews Molecular cell biology (2000).
[125] Oldstone, Michael BA “Molecular mimicry and immune-mediated diseases” Cell (1987).