🇮🇳 DenseWorld Research Demo · World Action Model

DENSEWORLD

The First World Action Model (WAM) for India

Predict. Understand. Anticipate.

DenseWorld learns how people, vehicles, crowds, animals, and streets interact across India’s dense, dynamic, and unpredictable environments.

700+ hours of video
22 Indian cities
115K+ street clips
6 world capabilities
DenseWorld Indian market crossing scene
0.83 motion complexity
0.77 interaction density
15 s future rollout

Six Indian Street Worlds

Each example isolates a different challenge: future prediction, multi-agent interaction, occlusion, social navigation, crowd flow, and India-specific street dynamics.

Market Crossing

Future prediction under dense mixed traffic.

A1

Signal Junction

Multi-agent interaction and causal influence.

A2

Bike Occlusion

Hidden actor recovery behind a moving bike.

A3

Narrow Street

Compressed motion and social navigation.

A4

Bus Stop Crowd

Crowd dynamics around public transport.

A5

Animal Crossing

India-specific dynamics in mixed road use.

A6

Try DenseWorld on Your Own Video

On Hugging Face (HF), the upload panel can be wired to a Gradio app. This static page keeps the panel in place, while curated assets show the full model behavior.

Upload Video

MP4 · 5–10 sec · traffic / crowd / street scene

Demo mode

  • Live upload: lightweight frame sampling and motion proxy.
  • Curated assets: qualitative V-JEPA 2.1 vs DenseWorld comparison.
  • All images are expected in the same folder as this HTML file.
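The "lightweight frame sampling and motion proxy" path could look roughly like the sketch below. It is an illustration under assumptions, not the demo's actual code: frames are plain nested lists of grayscale values in 0–255, the names `sample_indices`, `motion_proxy`, and `clip_motion_score` are hypothetical, and mean absolute frame difference stands in for whatever proxy the live demo uses.

```python
def sample_indices(num_frames, num_samples):
    """Pick evenly spaced frame indices, always including the first and last frame."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        return list(range(num_frames))
    if num_samples == 1:
        return [0]
    return [(i * (num_frames - 1)) // (num_samples - 1) for i in range(num_samples)]


def motion_proxy(frame_a, frame_b):
    """Mean absolute pixel difference between two same-sized frames, scaled to [0, 1]."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    return total / (255 * count) if count else 0.0


def clip_motion_score(frames, num_samples=4):
    """Average the motion proxy over consecutive pairs of sampled frames."""
    sampled = [frames[i] for i in sample_indices(len(frames), num_samples)]
    diffs = [motion_proxy(a, b) for a, b in zip(sampled, sampled[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

A function shaped like `clip_motion_score` is the kind of callable a Gradio app could wrap once the uploaded MP4 has been decoded into frames; the decoding step is left out here.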

How DenseWorld Factorizes a Scene

For v1, every factor reuses the original frame. Later, these cards can receive true overlays for layout, agents, interactions, and motion.

Input world state

Dense market crossing used for factor decomposition.

A1 / T0

Layout Factor 92%

  • Road geometry
  • Crosswalk and junction structure
  • Buildings, shops, static context

Agent Factor 89%

  • Pedestrians, motorbikes, autos, cars
  • Cyclists and crowd clusters
  • Agent density and proximity

Interaction Factor 0.77

  • Pedestrian ↔ bike
  • Bike ↔ auto
  • Crowd ↔ vehicle negotiation

Motion Factor 0.83

  • Crossing, yielding, avoiding, passing
  • Mixed traffic flow
  • Lane-less motion patterns

Layout

Stable structure of the world.

Factor

Agents

People and vehicles in the scene.

Factor

Interactions

Who influences whom.

Factor

Motion

How the world is changing.

Factor
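The v1 behavior described in this section, where every factor card reuses the original frame until a true overlay exists, can be sketched as a small fallback lookup. The names `FactorCard` and `build_factor_cards` and the file names in the usage note are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# The four factors named on this page.
FACTORS = ("layout", "agents", "interactions", "motion")


@dataclass(frozen=True)
class FactorCard:
    factor: str
    image: str        # image shown on the card
    is_overlay: bool  # False while the card still shows the raw frame


def build_factor_cards(
    original_frame: str, overlays: Optional[Dict[str, str]] = None
) -> List[FactorCard]:
    """Return one card per factor, falling back to the original frame."""
    overlays = overlays or {}
    cards = []
    for factor in FACTORS:
        overlay = overlays.get(factor)
        cards.append(FactorCard(factor, overlay or original_frame, overlay is not None))
    return cards
```

Calling `build_factor_cards("A1.png")` yields four cards that all show the same frame; passing `{"motion": "A1_motion.png"}` would swap in an overlay for the motion card only.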

Motion Understanding

Before predicting the future, the model must understand where people and vehicles are moving.

Actual Optical Flow

Reference motion field.

C1

V-JEPA Motion

Noisier motion estimate.

C2

DenseWorld Motion

Better-aligned flow and agent trajectories.

C3

Motion Observatory

Technical visualization of dense motion.

C4

| Metric | V-JEPA 2.1 | DenseWorld |
| --- | --- | --- |
| Motion alignment ↑ | 61 | 88 |
| Trajectory consistency ↑ | 65 | 91 |
| Flow EPE ↓ | 3.82 | 1.72 |
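Flow EPE (endpoint error) is conventionally the average Euclidean distance between predicted and reference flow vectors, so lower is better. The sketch below shows that standard definition on a flat list of `(dx, dy)` vectors; how the demo samples pixels or masks regions is not stated on this page, and the function name is hypothetical.

```python
import math


def endpoint_error(pred_flow, ref_flow):
    """Average Euclidean distance between paired (dx, dy) flow vectors."""
    if len(pred_flow) != len(ref_flow):
        raise ValueError("flow fields must have the same number of vectors")
    if not pred_flow:
        raise ValueError("flow fields must not be empty")
    total = sum(
        math.hypot(px - rx, py - ry)
        for (px, py), (rx, ry) in zip(pred_flow, ref_flow)
    )
    return total / len(pred_flow)
```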

Motion DNA

A high-level temporal signature: not just optical flow, but primitives such as crossing, yielding, avoiding, negotiation, and occlusion.

Selected actor

Pedestrian #12 in the market crossing scene.

A1

DenseWorld motion sequence

Free → Cross → Yield → Cross → Avoid → Occluded → Walk

DenseWorld preserves crossing, yielding, avoidance, and recovery after partial occlusion.

V-JEPA 2.1 motion sequence

Free → Cross → ? → Walk → ? → ? → Walk

V-JEPA captures local movement but loses negotiation and hidden-state continuity.

Legend: Free · Interaction · Negotiation · Occlusion · Unknown

| Motion DNA Metric | V-JEPA 2.1 | DenseWorld |
| --- | --- | --- |
| Motion state accuracy ↑ | 63 | 89 |
| Intent consistency ↑ | 58 | 87 |
| Negotiation detection ↑ | 41 | 85 |
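A motion sequence like the two shown for Pedestrian #12 can be treated as a list of state labels. The sketch below compares two such sequences step by step and collapses repeats into runs. It is only an illustration: the helper names are hypothetical, and the agreement it computes between the two displayed sequences is not the "Motion state accuracy" metric, which would need a labeled reference sequence that this page does not show.

```python
# State labels copied from the two sequences displayed in this section.
DENSEWORLD_SEQ = ["Free", "Cross", "Yield", "Cross", "Avoid", "Occluded", "Walk"]
VJEPA_SEQ = ["Free", "Cross", "?", "Walk", "?", "?", "Walk"]


def state_agreement(seq_a, seq_b, unknown="?"):
    """Fraction of time steps where both sequences name the same known state."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != unknown)
    return matches / len(seq_a)


def run_lengths(seq):
    """Collapse consecutive repeats into (state, count) pairs."""
    runs = []
    for state in seq:
        if runs and runs[-1][0] == state:
            runs[-1] = (state, runs[-1][1] + 1)
        else:
            runs.append((state, 1))
    return runs
```

On the displayed sequences the two models name the same known state at three of seven steps (Free, the first Cross, and the final Walk).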

Understanding What Cannot Be Seen

DenseWorld should infer hidden actors using context, motion, and interaction cues, not just detect visible objects.

Original

Occlusion setup.

D1

Occluded

Actor hidden behind bike.

D2

V-JEPA Recovery

Lower-confidence hidden actor estimate.

D3

DenseWorld Recovery

Clearer hidden actor hypothesis.

D4

| Metric | V-JEPA 2.1 | DenseWorld |
| --- | --- | --- |
| Occlusion recovery ↑ | 61 | 84 |
| Hidden actor localization error ↓ | 18 px | 7 px |
| Hidden intent accuracy ↑ | 58 | 87 |
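To make "hidden actor localization error" concrete, the sketch below guesses an occluded actor's position with constant-velocity extrapolation and measures the pixel distance to a reference position. Constant velocity is a deliberately simple baseline swapped in for illustration; it is not how DenseWorld or V-JEPA infer hidden actors, and both function names are hypothetical.

```python
import math


def extrapolate_hidden(track, steps):
    """Constant-velocity guess for the next `steps` positions.

    `track` holds the actor's last visible (x, y) centers in pixels,
    oldest first; only the final two are used to estimate velocity.
    """
    if len(track) < 2:
        raise ValueError("need at least two visible positions")
    (x0, y0), (x1, y1) = track[-2], track[-1]
    vx, vy = x1 - x0, y1 - y0
    return [(x1 + vx * k, y1 + vy * k) for k in range(1, steps + 1)]


def localization_error_px(pred_center, ref_center):
    """Euclidean distance in pixels between predicted and reference centers."""
    return math.hypot(pred_center[0] - ref_center[0], pred_center[1] - ref_center[1])
```

A model that uses interaction context should beat this baseline whenever the hidden actor yields, stops, or changes direction behind the occluder, which is exactly where a constant-velocity guess drifts.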

Can the Model Predict the Next Moment?

Observed frame, V-JEPA prediction, DenseWorld prediction, reference frame, and model-specific error maps.

Observed

Input frame.

B1

V-JEPA

Prediction with motion ambiguity.

B2

DenseWorld

Stronger interaction continuity.

B3

Reference

Future frame.

B4

V-JEPA Error Heatmap

Higher error around moving agents.

B5

DenseWorld Error Heatmap

Reduced error in interaction zones.

B6

| Metric | V-JEPA 2.1 | DenseWorld |
| --- | --- | --- |
| Future-frame L1 ↓ | 0.186 | 0.142 |
| SSIM ↑ | 0.71 | 0.81 |
| LPIPS ↓ | 0.298 | 0.195 |
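The error heatmaps and the future-frame L1 score are closely related: a per-pixel absolute error map is what a heatmap visualizes, and its mean is the L1 score. A minimal sketch, assuming single-channel frames stored as nested lists with values in [0, 1]; the function names are hypothetical, and SSIM and LPIPS are not covered here.

```python
def error_map(pred_frame, ref_frame):
    """Per-pixel absolute error between a predicted and a reference frame."""
    return [
        [abs(p - r) for p, r in zip(row_p, row_r)]
        for row_p, row_r in zip(pred_frame, ref_frame)
    ]


def frame_l1(pred_frame, ref_frame):
    """Mean absolute error over all pixels (lower is better)."""
    errors = [e for row in error_map(pred_frame, ref_frame) for e in row]
    if not errors:
        raise ValueError("frames must not be empty")
    return sum(errors) / len(errors)
```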

Who Influences Whom?

Using scene A2, this section compares DenseWorld and V-JEPA on three levels: local interaction, the multi-agent scene graph, and the causal influence chain.

Interaction source scene

Signal junction with pedestrians, motorbikes, cars, autos, and bus-like agents.

A2

Local Interaction: Pedestrian–Bike

V-JEPA

Lower-confidence local interaction.

E4

DenseWorld

Stronger pedestrian-bike influence estimate.

E1

Junction Interaction: Multi-Agent Graph

V-JEPA

Sparser graph and weaker coverage.

E5

DenseWorld

Denser multi-agent interaction graph.

E2

Causal Influence: Who caused what?

V-JEPA

Shorter, uncertain causal chain.

E6

DenseWorld

Clearer causal influence over multiple agents.

E3

| Metric | V-JEPA 2.1 | DenseWorld |
| --- | --- | --- |
| Interaction accuracy ↑ | 63 | 90 |
| Graph coverage ↑ | 58 | 87 |
| Causal consistency ↑ | 55 | 85 |
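One plausible reading of "graph coverage" is the share of reference interaction edges that a predicted graph recovers. The page does not define the metric, so the sketch below is an assumption: interactions are undirected pairs of agent labels, and the function names are hypothetical.

```python
def undirected(edges):
    """Normalize (a, b) pairs so that (a, b) and (b, a) count as the same edge."""
    return {frozenset(edge) for edge in edges}


def graph_coverage(pred_edges, ref_edges):
    """Fraction of reference interaction edges present in the predicted graph."""
    ref = undirected(ref_edges)
    if not ref:
        raise ValueError("reference graph has no edges")
    return len(undirected(pred_edges) & ref) / len(ref)
```

With the reference pairs pedestrian–bike, bike–auto, and crowd–vehicle, a prediction containing only the first two scores 2/3. Extra predicted edges do not lower coverage, so a precision-style score would be needed alongside it to penalize spurious interactions.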

Future World Simulation

The finale: compare reference futures, V-JEPA futures, and DenseWorld futures at +5, +10, and +15 seconds.

Observed World State

A1 at T0: the starting point for future rollout.

T0

+5 seconds: Future consistency

Reference Future

Street state at +5 sec.

P5

V-JEPA vs DenseWorld

DenseWorld maintains sharper pedestrian and vehicle motion.

V5 / D5

V-JEPA future consistency: 81
DenseWorld future consistency: 92

+10 seconds: Crowd preservation

Reference Future

Street state at +10 sec.

P10

V-JEPA vs DenseWorld

DenseWorld preserves crowd flow and social structure more clearly.

V10 / D10

V-JEPA crowd preservation: 68
DenseWorld crowd preservation: 89

+15 seconds: Long-horizon reasoning

Reference Future

Street state at +15 sec.

P15

V-JEPA vs DenseWorld

DenseWorld avoids long-horizon simplification and preserves interactions.

V15 / D15

V-JEPA long-horizon reasoning: 54
DenseWorld long-horizon reasoning: 86

DenseWorld vs V-JEPA 2.1

Summary scoreboard across world-model capabilities. Values are demo placeholders and should be replaced by measured benchmark values when available.

| Capability | V-JEPA 2.1 | DenseWorld |
| --- | --- | --- |
| Layout Understanding | 83 | 92 |
| Agent Understanding | 76 | 89 |
| Motion Understanding | 61 | 88 |
| Occlusion Recovery | 58 | 87 |
| Interaction Reasoning | 63 | 90 |
| Future Prediction | 74 | 91 |
| Long-Horizon Prediction | 54 | 86 |
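The scoreboard can be summarized with a few lines of arithmetic. The sketch below uses the values exactly as displayed, which this page labels as demo placeholders, so the outputs describe the placeholder numbers and not measured benchmark results. The helper names are hypothetical.

```python
# (V-JEPA 2.1, DenseWorld) scores as displayed; demo placeholders, not benchmarks.
SCOREBOARD = {
    "Layout Understanding": (83, 92),
    "Agent Understanding": (76, 89),
    "Motion Understanding": (61, 88),
    "Occlusion Recovery": (58, 87),
    "Interaction Reasoning": (63, 90),
    "Future Prediction": (74, 91),
    "Long-Horizon Prediction": (54, 86),
}


def score_gaps(board):
    """DenseWorld score minus V-JEPA score for each capability."""
    return {name: dense - vjepa for name, (vjepa, dense) in board.items()}


def mean(values):
    """Arithmetic mean of a non-empty iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


gaps = score_gaps(SCOREBOARD)
largest_gap = max(gaps, key=gaps.get)  # capability with the widest margin
average_gap = mean(gaps.values())
```

On the placeholder values, the widest margin is in Long-Horizon Prediction (32 points) and the narrowest in Layout Understanding (9 points), with an average gap of 22 points.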

DenseWorld Diagnosis

A compact explanation panel for reviewers, visitors, and demo users.

What the scene contains

Dense pedestrian, motorcycle, auto-rickshaw, cyclist, and crowd interactions. The main challenge is not object recognition; it is motion, negotiation, occlusion, and long-horizon consistency.

What DenseWorld improves

  • Preserves motion and crowd flow better over time.
  • Infers hidden actors from interaction context.
  • Builds stronger causal and multi-agent interaction graphs.
  • Maintains plausible futures at +5, +10, and +15 seconds.