Bench: validating forge’s flexibility without performance regression

This chapter describes the public benchmark suite shipped with mujoco-wasm-forge. The suite exists to support the case for forge, which rests on three properties:

  • Flexibility: a minimal, configurable interface surface and multiple distribution variants (e.g. single-threaded vs pthreads).

  • Flatness: flat handles and a predictable ABI that downstream tools and apps can build around (Worker / Node / Web).

  • Rule-based, auditable generation: wrappers/exports are generated and checked by rules so changes are explicit and reviewable.

In forge, flexibility is framed as a complement to the official embind distribution. Official embind typically offers a richer, stable, and ergonomic binding surface, but that breadth comes with a larger surface area and a more standardized distribution shape, which can make fine-grained control over variants and the exported ABI less direct. Forge instead relies on its flat ABI organization and an automated output structure (stable dist layout + ABI artifacts + rules/gates). Most changes follow one loop: change rules -> regenerate -> gate. That loop is designed to be auditable and variant-friendly.

Forge is intended to be usable in extreme deployment environments (e.g. online demos with strict security and resource constraints) and in research prototyping workflows, while still being a reasonable base for routine use.

The benchmark asks a concrete question:

Under these design constraints, does forge stay in the same performance class as the official MuJoCo embind WASM build for Simulate-style workloads? (And when it differs, is the reason attributable and controllable?)

This is not a claim that forge is universally better than official embind. The two distributions have different goals; the bench is here to make trade-offs visible, reproducible, and discussable.

What we measure (organized by user needs)

| Need / scenario | Why we care | What we measure (metric) | Notes |
| --- | --- | --- | --- |
| Ship fast (web demo / docs) | Download + caching affects real users | Artifact size (.js + .wasm) | Measured from produced bundles |
| Page becomes interactive | Cold start is a hard UX constraint | Init time (modFactory() -> Module.ready) | Dominated by threading/runtime initialization |
| First meaningful output (Simulate) | First step / first snapshot is when users feel success | TTFS (ready -> first step / first scene snapshot) | Browser TTFS is end-to-end; Node TTFS isolates engine load/step |
| Predictable resource budget | Thread pools + compilation can cause spikes | RSS after init + peak RSS | Node provides more reliable RSS sampling than browsers |
| Smooth model load | XML compilation and mesh processing dominate load time | XML compile/load latency | Sensitive to threading defaults and asset shape |
| Stable steady-state simulation | After load, engine hot-loop dominates | ms/step / steps/sec | Requires comparable threading configuration for strict comparisons |
| Low JS <-> WASM overhead | Flat/handle-based APIs should reduce crossing cost | FFI microbench (ns/call for lightweight getters) | Microbench is intentionally small and repeatable |
| Robust lifecycle (reset/reload) | Demos frequently reset or reload models | Reload loop time + RSS drift | Amplifies leaks and binding overhead |
| Diagnosable failures | Public demos must fail with readable errors | Errmsg/errno gates (bad XML, missing plugin, missing assets) | Forge has explicit helper gates for this |
| Plugin baseline | Some Simulate models require plugins | Plugin availability smoke (e.g. touch_grid) | Treated as a product-level contract difference |
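As an illustration of the "Init time" and "RSS after init" rows, the measurement can be sketched in Node roughly as follows. This is a minimal sketch, not the suite's actual harness: the function name measureInit and the returned field names are hypothetical, and the only assumption about the bundle is that modFactory is a function returning a promise that resolves once the module is ready.

```javascript
// Sketch only: time an Emscripten-style module factory and sample RSS in Node.
// `measureInit` and its result fields are hypothetical names for illustration.
// Assumption: `modFactory()` returns a promise that resolves when the module is ready.
async function measureInit(modFactory) {
  const rssBeforeBytes = process.memoryUsage().rss;
  const t0 = performance.now();
  const module = await modFactory(); // modFactory() -> Module.ready
  const initMs = performance.now() - t0;
  const rssAfterInitBytes = process.memoryUsage().rss;
  return { module, initMs, rssBeforeBytes, rssAfterInitBytes };
}
```

A single call gives one sample; taking a median over repeated runs in fresh processes avoids counting warm caches from an earlier run.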

Comparison matrix (what we compare)

| Label | Intended meaning | Threading policy |
| --- | --- | --- |
| official-3.4.0 | Official embind baseline | As built upstream |
| official-3.5.0-hc4 / official-3.5.0-hc32 | Official embind (3.5.0) under different effective pool sizing | Pthreads enabled; pool approximated via hardwareConcurrency |
| forge-3.4.0-single | Forge baseline | Single-threaded |
| forge-3.5.0-single | Forge single-threaded variant | Single-threaded |
| forge-3.5.0-pthreads | Forge pthread variant | Pthreads enabled; pool defaults to 4 (override via MJWF_PTHREAD_POOL_SIZE; for 3.5.0 default clamp=pool) |

For strict (thread-matched) comparisons, the primary pairing is forge-3.5.0-pthreads (pool=4) vs official-3.5.0-hc4.

Summary findings (reference run)

The suite is designed to be rerun on different machines, so absolute numbers will vary. The key question is whether forge’s organization and flexibility introduce a performance penalty.

| Dimension | Observation (selected numbers) | Interpretation (what it means) | Caveats |
| --- | --- | --- | --- |
| Distribution size | forge-3.5.0-single: wasm 3.33 MiB / JS 256.1 KiB; official-3.5.0-hc4: wasm 8.24 MiB / JS 296.7 KiB | Minimal/flat surface can reduce shipping footprint | Not guaranteed across future upstreams |
| Init + memory | forge-3.5.0-pthreads: init 52.7ms / peak RSS 462.8 MiB; official-3.5.0-hc4: init 84.4ms / peak 659.2 MiB; official-3.5.0-hc32: RSS after init 507.9 MiB | Threading policy and pool sizing can dominate cost | RSS is platform-dependent; browsers differ |
| Plugin baseline | sensor/touch_grid: forge=ok, official=error | Forge can treat plugin availability as part of the dist contract | Official plugin packaging may evolve upstream |
| ms/step (thread-matched) | e.g. raj 0.086 vs 0.103 ms/step; cards 0.325 vs 0.343 (forge pthreads vs official hc4) | Forge organization does not inherently degrade simulation throughput | Small models can be dominated by measurement overhead |
| FFI microbench | model_nq: 14.71ns vs 33.56ns/call; mj_version: 4.23ns vs 5.24ns | Flat handles reduce high-frequency boundary cost | Microbench != whole-app performance |
| TTFS (Simulate) | humanoid ready->snapshot: 77ms (pthreads) vs 111ms (single); cards: 505ms (pthreads) vs 225ms (single) | End-to-end pthread benefit depends on model/load path | HUD metrics depend on environment/COI configuration |
| Reload / lifecycle | 50 iterations: 0.840ms/iter (forge pthreads) vs 1.003ms/iter (official hc4); RSS drift: +1.9 MiB vs +7.0 MiB | Lifecycle pressure can amplify binding overhead differences | Not a proxy for every workload |

Reference snapshot tables

Reference run environment (for the snapshot tables below): Windows x64 / Node v22.16.0 / 32 logical CPUs. Unless noted otherwise:

  • Node metrics are medians of 5 runs.

  • Browser HUD metrics default to 3 runs (median).

  • Sizes and memory are reported in KiB/MiB (1024-based).
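The reporting conventions above (median of N runs, 1024-based units) can be sketched as a few small helpers. The names median, bytesToKiB, and bytesToMiB are hypothetical and chosen for illustration; they are not taken from the suite's report script.

```javascript
// Sketch only: helper names are illustrative, not the suite's actual API.

// Median of a non-empty numeric sample (e.g. 5 Node runs or 3 browser runs).
function median(values) {
  if (values.length === 0) throw new Error("median of empty sample");
  const sorted = [...values].sort((a, b) => a - b); // numeric sort on a copy
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// 1024-based unit conversions, matching the KiB/MiB convention in the tables.
const bytesToKiB = (bytes) => bytes / 1024;
const bytesToMiB = (bytes) => bytes / (1024 * 1024);
```

With an odd number of runs, as in both defaults here, the median is always one of the observed samples, which keeps a single slow outlier run from shifting the reported value.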

Snapshot (Node: bundle / init / memory)

| Label | JS (KiB) | WASM (MiB) | Init (ms) | RSS after init (MiB) | RSS peak (MiB) |
| --- | --- | --- | --- | --- | --- |
| forge-3.5.0-single | 256.1 | 3.33 | 19.4 | 59.8 | 208.8 |
| forge-3.5.0-pthreads | 272.6 | 3.33 | 52.7 | 110.7 | 462.8 |
| official-3.5.0-hc4 | 296.7 | 8.24 | 84.4 | 125.5 | 659.2 |
| official-3.5.0-hc32 | 296.7 | 8.24 | 169.8 | 507.9 | 873.9 |

Model load + stepping (thread-matched, Node: pool=4 vs hc=4)

| Model | Forge compile/load (ms) | Official compile/load (ms) | Forge steady ms/step | Official steady ms/step |
| --- | --- | --- | --- | --- |
| cards | 90.8 | 109.6 | 0.325 | 0.343 |
| raj | 303.4 | 384.3 | 0.086 | 0.103 |
| humanoid | 5.4 | 7.6 | 0.020 | 0.021 |
| flex_bunny | 59.5 | 102.5 | 0.426 | 0.465 |
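To read the steady-state columns above as relative differences, a small calculation helps. The sketch below copies the ms/step values from the reference run and computes the percentage difference of forge relative to official; the helper name relativeDeltaPct is hypothetical.

```javascript
// Steady-state ms/step from the thread-matched reference run
// (forge-3.5.0-pthreads pool=4 vs official-3.5.0-hc4).
const steadyMsPerStep = [
  { model: "cards", forge: 0.325, official: 0.343 },
  { model: "raj", forge: 0.086, official: 0.103 },
  { model: "humanoid", forge: 0.020, official: 0.021 },
  { model: "flex_bunny", forge: 0.426, official: 0.465 },
];

// Hypothetical helper: negative values mean forge spent less time per step.
function relativeDeltaPct(forge, official) {
  return ((forge - official) / official) * 100;
}

const deltas = steadyMsPerStep.map(({ model, forge, official }) => ({
  model,
  deltaPct: relativeDeltaPct(forge, official),
}));

for (const { model, deltaPct } of deltas) {
  console.log(`${model}: ${deltaPct.toFixed(1)}%`);
}
```

For this reference run the differences fall between roughly 5% and 17% less time per step for forge. For the smallest model (humanoid) the absolute gap is 0.001 ms, so measurement overhead can account for much of it.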

Functional baseline (Node: plugin-dependent models)

| Model | Forge 3.5.0 single | Official 3.5.0 (hc=4) |
| --- | --- | --- |
| sensor | ok | error |
| touch_grid | ok | error |

FFI microbench (Node: apples-to-apples subset)

This is a “1e6 calls” microbench. It is meant for comparing relative organization/binding overhead, not for extrapolating whole-app throughput.

| Metric | Forge 3.5.0 pthreads ns/call (median) | Official 3.5.0 ns/call (median) |
| --- | --- | --- |
| mj_version | 4.23 | 5.24 |
| model_nq | 14.71 | 33.56 |
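The shape of a "1e6 calls" microbench like the one above can be sketched as follows. This is an assumption about the general technique, not the suite's implementation: nsPerCall is a hypothetical helper, and fn stands in for whichever lightweight getter is being timed (it is assumed to return a number).

```javascript
// Sketch only: `nsPerCall` is a hypothetical helper, not part of the suite.
// `fn` is assumed to be a cheap getter that returns a number.
function nsPerCall(fn, calls = 1e6, warmup = 1000) {
  let sink = 0; // accumulate results so the calls cannot be optimized away

  // Warm up so the JIT has compiled the call path before timing starts.
  for (let i = 0; i < warmup; i++) sink += fn();

  const t0 = process.hrtime.bigint();
  for (let i = 0; i < calls; i++) sink += fn();
  const elapsedNs = Number(process.hrtime.bigint() - t0);

  return { nsPerCall: elapsedNs / calls, sink };
}
```

The loop overhead itself is included in the result, which is one reason such numbers are only meaningful as a relative comparison between two bindings measured the same way.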

Simulate TTFS (Browser: forge variants only)

Browser numbers come from mujoco-wasm-play + Playwright HUD sampling (default: 3 runs, median). In some environments, the pthread variant may not expose HUD CPU (ms/step), but TTFS and FPS are still sampled.

| Model | Variant | TTFS ready->snapshot median (ms) | TTFS nav->snapshot median (ms) | FPS median |
| --- | --- | --- | --- | --- |
| humanoid | single | 111 | 322 | 120 |
| humanoid | pthreads | 77 | 287 | 120 |
| cards | single | 225 | 499 | 120 |
| cards | pthreads | 505 | 791 | 120 |
| raj | single | 338 | 570 | 121 |
| raj | pthreads | 270 | 510 | 45.5 |

Reload / lifecycle (thread-matched, Node: pool=4 vs hc=4)

| Label | Iterations | ms/iter | RSS drift (MiB) | RSS peak (MiB) |
| --- | --- | --- | --- | --- |
| forge-3.5.0-pthreads | 50 | 0.840 | 1.9 | 462.8 |
| official-3.5.0-hc4 | 50 | 1.003 | 7.0 | 659.2 |
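A reload loop that yields ms/iter and RSS drift can be sketched as below. The function name measureReload and the loadModel/freeModel callbacks are hypothetical stand-ins for whatever load and free calls the binding under test exposes; the real suite may sample memory differently.

```javascript
// Sketch only: `measureReload`, `loadModel`, and `freeModel` are hypothetical.
// `loadModel()` returns a handle; `freeModel(handle)` releases it.
function measureReload(loadModel, freeModel, iterations = 50) {
  const rssStartBytes = process.memoryUsage().rss;
  let rssPeakBytes = rssStartBytes;

  const t0 = performance.now();
  for (let i = 0; i < iterations; i++) {
    const handle = loadModel();
    freeModel(handle);
    // Note: sampling RSS inside the timed loop adds a small, constant overhead.
    rssPeakBytes = Math.max(rssPeakBytes, process.memoryUsage().rss);
  }
  const totalMs = performance.now() - t0;

  const rssEndBytes = process.memoryUsage().rss;
  return {
    msPerIter: totalMs / iterations,
    rssDriftBytes: rssEndBytes - rssStartBytes, // positive drift hints at leaks
    rssPeakBytes,
  };
}
```

Drift is the difference between RSS at the end and at the start of the loop, so a binding that frees everything it allocates should stay close to zero over many iterations.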

Limitations (what this bench may still miss)

  • Official embind and forge necessarily use different JS-level APIs, even when the underlying engine version is comparable.

  • Browser pthread deployments require COOP/COEP + SharedArrayBuffer; deployment constraints are part of the trade-off.

  • Some metrics (especially browser memory) are harder to measure precisely and are currently more reliable in Node.

  • Being the fastest possible build is not the goal; the goal is to deliver forge’s capabilities without unacceptable regression.
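To make the COOP/COEP constraint in the list above concrete: a page only gets SharedArrayBuffer (and therefore browser pthreads) when it is served with cross-origin isolation headers. The wrapper below is a minimal sketch for a Node request handler; withCrossOriginIsolation is a hypothetical name, and require-corp is one commonly used COEP value rather than the only valid one.

```javascript
// Sketch only: `withCrossOriginIsolation` is a hypothetical helper.
// It wraps a Node-style (req, res) handler and adds the two headers that
// browsers require before exposing SharedArrayBuffer to a page.
function withCrossOriginIsolation(handler) {
  return (req, res) => {
    res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
    res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
    handler(req, res);
  };
}
```

The wrapped handler can be passed to http.createServer from Node's built-in http module. In the page, the global crossOriginIsolated flag reports whether isolation is actually in effect. Hosts that do not allow custom response headers cannot serve the pthread variant this way, which is part of the deployment trade-off.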

Reproducing the bench

See bench/README.md for the full procedure. The short version:

powershell -ExecutionPolicy Bypass -File tools/build_official_embind.ps1 -Ref 3.4.0 -OutDir C:\dev\mjwf-bench\official\3.4.0 -Clean
powershell -ExecutionPolicy Bypass -File tools/build_official_embind.ps1 -Ref 3.5.0 -OutDir C:\dev\mjwf-bench\official\3.5.0
node bench/node/run_matrix.mjs --official340 C:/dev/mjwf-bench/official/3.4.0 --official350 C:/dev/mjwf-bench/official/3.5.0
node bench/node/report.mjs
powershell -ExecutionPolicy Bypass -File bench/browser/run_playwright_bench.ps1