Time-of-Flight Causal Tomography

Interpretability Extras

Clean: t_out=8.27 · Δlogit@OUT=0.319 · OK · Δt vs Poisoned=+3.24
Poisoned: t_out=5.03 · Δlogit@OUT=1.772 · OK · Δt vs Poisoned=+0.00
Patched: t_out=7.58 · Δlogit@OUT=0.620 · OK · Δt vs Poisoned=+2.55
Mode      t_out  Total Δlogit@OUT
Clean     8.27   0.319
Poisoned  5.03   1.772
Patched   7.58   0.620
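
The "Δt vs Poisoned" figures in the summary are consistent with a simple difference between each mode's t_out and the Poisoned t_out. A minimal Python sketch of that arithmetic follows. The function name `delta_t_vs` and the dictionary layout are illustrative assumptions, not part of the tool that produced this report.

```python
# t_out values copied from the table above.
t_out = {"Clean": 8.27, "Poisoned": 5.03, "Patched": 7.58}


def delta_t_vs(reference, times):
    """Return each mode's t_out minus the reference mode's t_out.

    Hypothetical helper: assumes Δt is a plain difference of arrival times.
    """
    ref = times[reference]
    # Round to two decimals to match the precision shown in the report.
    return {mode: round(t - ref, 2) for mode, t in times.items()}


delta_t = delta_t_vs("Poisoned", t_out)
for mode, dt in delta_t.items():
    print(f"{mode}: Δt vs Poisoned = {dt:+.2f}")
```

A positive Δt means the signal reaches OUT later than it does in the Poisoned run.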
Waterfall bars show contributions from heads at the last layer arriving at OUT; the line is cumulative energy at OUT. Vertical markers indicate first arrival and 50% accumulation milestones. “Blocked” means no path reached OUT.
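
The milestones described in the caption can be sketched in Python as follows. This is only an illustration under stated assumptions: the input format (a list of arrival time and contribution pairs), the function name `milestones`, and the choice to treat "energy" as the absolute value of each contribution are all hypothetical, and the sample events are made-up numbers rather than data from this report.

```python
from typing import List, Optional, Tuple


def milestones(
    events: List[Tuple[float, float]], fraction: float = 0.5
) -> Optional[Tuple[float, float]]:
    """Return (first_arrival_time, time_at_fraction_of_total_energy).

    events: (arrival_time, contribution) pairs at OUT (hypothetical format).
    Returns None when nothing arrives, i.e. the "Blocked" case.
    """
    # Assumption: energy is the absolute value of a contribution.
    arrived = sorted((t, abs(c)) for t, c in events if c != 0)
    if not arrived:
        return None  # Blocked: no path reached OUT

    total = sum(energy for _, energy in arrived)
    first_arrival = arrived[0][0]

    running = 0.0
    for t, energy in arrived:
        running += energy  # cumulative energy at OUT (the line in the plot)
        if running >= fraction * total:
            return first_arrival, t
    return first_arrival, arrived[-1][0]


# Made-up events, deliberately unsorted, purely for illustration.
example = [(5.0, 0.3), (2.0, 0.2), (3.0, -0.5)]
print(milestones(example))
print(milestones([]))
```

Sorting by arrival time before accumulating keeps the cumulative line monotone in time regardless of the order in which head contributions are recorded.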