Bench: validating forge’s flexibility without performance regression

This chapter describes the public benchmark suite shipped with mujoco-wasm-forge. The suite exists to test whether the properties that motivate forge hold up without a performance cost:

Flexibility: a minimal, configurable interface surface and multiple distribution variants (e.g. single-threaded vs pthreads).
Flatness: flat handles and a predictable ABI that downstream tools and apps can build around (Worker / Node / Web).
Rule-based, auditable generation: wrappers/exports are generated and checked by rules so changes are explicit and reviewable.

In forge, flexibility is framed as a complement to the official embind distribution. Official embind typically offers a richer, stable, and ergonomic binding surface, but that breadth comes with a larger surface area and a more standardized distribution shape, which can make fine-grained control over variants and the exported ABI less direct. Forge instead relies on a flat ABI organization and an automated output structure (stable dist layout + ABI artifacts + rules/gates). Most changes follow the same loop: change the rules, regenerate, then pass the gates. That loop is designed to be auditable and variant-friendly.

Forge is intended to be usable in extreme deployment environments (e.g. online demos with strict security and resource constraints) and in research prototyping workflows, while still being a reasonable base for routine use.

The benchmark asks a concrete question:

Under these design constraints, does forge stay in the same performance class as the official MuJoCo embind WASM build for Simulate-style workloads? (And when it differs, is the reason attributable and controllable?)

This is not a claim that forge is universally better than official embind. The two distributions have different goals; the bench is here to make trade-offs visible, reproducible, and discussable.

What we measure (organized by user needs)

Need / scenario	Why we care	What we measure (metric)	Notes
Ship fast (web demo / docs)	Download + caching affects real users	Artifact size (`.js` + `.wasm`)	Measured from produced bundles
Page becomes interactive	Cold start is a hard UX constraint	Init time (`modFactory() -> Module.ready`)	Dominated by threading/runtime initialization
First meaningful output (Simulate)	First step / first snapshot is when users feel success	TTFS (ready -> first step / first scene snapshot)	Browser TTFS is end-to-end; Node TTFS isolates engine load/step
Predictable resource budget	Thread pools + compilation can cause spikes	RSS after init + peak RSS	Node provides more reliable RSS sampling than browsers
Smooth model load	XML compilation and mesh processing dominate load time	XML compile/load latency	Sensitive to threading defaults and asset shape
Stable steady-state simulation	After load, engine hot-loop dominates	`ms/step` / steps/sec	Requires comparable threading configuration for strict comparisons
Low JS <-> WASM overhead	Flat/handle-based APIs should reduce crossing cost	FFI microbench (ns/call for lightweight getters)	Microbench is intentionally small and repeatable
Robust lifecycle (reset/reload)	Demos frequently reset or reload models	Reload loop time + RSS drift	Amplifies leaks and binding overhead
Diagnosable failures	Public demos must fail with readable errors	Errmsg/errno gates (bad XML, missing plugin, missing assets)	Forge has explicit helper gates for this
Plugin baseline	Some Simulate models require plugins	Plugin availability smoke (e.g. `touch_grid`)	Treated as a product-level contract difference

Comparison matrix (what we compare)

Label	Intended meaning	Threading policy
`official-3.4.0`	Official embind baseline	As built upstream
`official-3.5.0-hc4` / `official-3.5.0-hc32`	Official embind (3.5.0) under different effective pool sizing	Pthreads enabled; pool approximated via `hardwareConcurrency`
`forge-3.4.0-single`	Forge baseline	Single-threaded
`forge-3.5.0-single`	Forge single-threaded variant	Single-threaded
`forge-3.5.0-pthreads`	Forge pthread variant	Pthreads enabled; pool defaults to 4 (override via `MJWF_PTHREAD_POOL_SIZE`; for 3.5.0 the default clamp equals the pool size)

For strict (thread-matched) comparisons, we primarily compare `forge-3.5.0-pthreads` (pool=4) against `official-3.5.0-hc4`.

Summary findings (reference run)

The suite is designed to be rerun on different machines, so absolute numbers will vary. The key question is whether forge’s organization and flexibility introduce a performance penalty.

Dimension	Observation (selected numbers)	Interpretation (what it means)	Caveats
Distribution size	`forge-3.5.0-single`: wasm 3.33 MiB / JS 256.1 KiB; `official-3.5.0-hc4`: wasm 8.24 MiB / JS 296.7 KiB	Minimal/flat surface can reduce shipping footprint	Not guaranteed across future upstreams
Init + memory	`forge-3.5.0-pthreads`: init 52.7ms / peak RSS 462.8 MiB; `official-3.5.0-hc4`: init 84.4ms / peak 659.2 MiB; `official-3.5.0-hc32`: RSS after init 507.9 MiB	Threading policy and pool sizing can dominate cost	RSS is platform-dependent; browsers differ
Plugin baseline	`sensor`/`touch_grid`: forge=ok, official=error	Forge can treat plugin availability as part of the dist contract	Official plugin packaging may evolve upstream
`ms/step` (thread-matched)	e.g. `raj` 0.086 vs 0.103 ms/step; `cards` 0.325 vs 0.343 (forge pthreads vs official hc4)	Forge organization does not inherently degrade simulation throughput	Small models can be dominated by measurement overhead
FFI microbench	`model_nq`: 14.71ns vs 33.56ns/call; `mj_version`: 4.23ns vs 5.24ns	Flat handles reduce high-frequency boundary cost	Microbench != whole-app performance
TTFS (Simulate)	`humanoid` ready->snapshot: 77ms (pthreads) vs 111ms (single); `cards`: 505ms (pthreads) vs 225ms (single)	Pthreads shorten TTFS for `humanoid` but lengthen it for `cards`; the end-to-end benefit depends on model/load path	HUD metrics depend on environment and cross-origin isolation (COI) configuration
Reload / lifecycle	50 iterations: 0.840ms/iter (forge pthreads) vs 1.003ms/iter (official hc4); RSS drift: +1.9 MiB vs +7.0 MiB	Lifecycle pressure can amplify binding overhead differences	Not a proxy for every workload

Reference snapshot tables

Reference run environment (for the snapshot tables below): Windows x64 / Node v22.16.0 / 32 logical CPUs. Unless noted otherwise:

Node metrics are medians of 5 runs.
Browser HUD metrics default to 3 runs (median).
Sizes and memory are reported in KiB/MiB (1024-based).
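
The aggregation and unit conventions above can be sketched as follows. This is an illustrative helper, not the suite's actual reporting code; the names `median`, `toKiB`, and `toMiB` and the sample timings are assumptions made for this example.

```javascript
// Hypothetical helpers illustrating the reporting conventions:
// median over repeated runs, and 1024-based size units.

// Median of a non-empty array of numbers (does not mutate the input).
function median(samples) {
  if (samples.length === 0) throw new Error("median of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Byte counts converted to 1024-based units.
const toKiB = (bytes) => bytes / 1024;
const toMiB = (bytes) => bytes / (1024 * 1024);

// Made-up timings (ms) from 5 runs, for illustration only.
const exampleRuns = [20, 18, 25, 19, 21];
console.log(median(exampleRuns)); // prints 20
```

A median is less sensitive than a mean to a single slow run (for example, one run that hits a cold disk cache), which matters when only 3 to 5 runs are taken.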

Snapshot (Node: bundle / init / memory)

Label	JS (KiB)	WASM (MiB)	Init (ms)	RSS after init (MiB)	RSS peak (MiB)
`forge-3.5.0-single`	256.1	3.33	19.4	59.8	208.8
`forge-3.5.0-pthreads`	272.6	3.33	52.7	110.7	462.8
`official-3.5.0-hc4`	296.7	8.24	84.4	125.5	659.2
`official-3.5.0-hc32`	296.7	8.24	169.8	507.9	873.9

Model load + stepping (thread-matched, Node: `pool=4` vs `hc=4`)

Model	Forge compile/load (ms)	Official compile/load (ms)	Forge steady `ms/step`	Official steady `ms/step`
`cards`	90.8	109.6	0.325	0.343
`raj`	303.4	384.3	0.086	0.103
`humanoid`	5.4	7.6	0.020	0.021
`flex_bunny`	59.5	102.5	0.426	0.465
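
To read the steady-state columns as relative differences, the arithmetic can be sketched as below. The rows are copied from the table above; the helper name `relativeReduction` is an assumption for this example and is not part of the suite.

```javascript
// Steady-state ms/step values copied from the thread-matched table above.
const stepRows = [
  { model: "cards", forge: 0.325, official: 0.343 },
  { model: "raj", forge: 0.086, official: 0.103 },
  { model: "humanoid", forge: 0.020, official: 0.021 },
  { model: "flex_bunny", forge: 0.426, official: 0.465 },
];

// Fraction by which `candidate` is lower than `baseline`.
// Positive means the candidate needs less time per step.
function relativeReduction(candidate, baseline) {
  return (baseline - candidate) / baseline;
}

for (const row of stepRows) {
  const pct = relativeReduction(row.forge, row.official) * 100;
  console.log(`${row.model}: forge ms/step is ${pct.toFixed(1)}% lower`);
}
```

For small models such as `humanoid`, the absolute gap is about one microsecond per step, so the percentage should be read with the measurement-overhead caveat in mind.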

Functional baseline (Node: plugin-dependent models)

Model	Forge 3.5.0 single	Official 3.5.0 (hc=4)
`sensor`	ok	error
`touch_grid`	ok	error

FFI microbench (Node: apples-to-apples subset)

This is a “1e6 calls” microbench. It is meant for comparing relative organization/binding overhead, not for extrapolating whole-app throughput.

Metric	Forge 3.5.0 pthreads ns/call (median)	Official 3.5.0 ns/call (median)
`mj_version`	4.23	5.24
`model_nq`	14.71	33.56
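
A microbench of this shape can be sketched as below. It is a minimal illustration, not the suite's harness: `nsPerCall` and the stand-in getter are hypothetical, and a real run would pass a bound forge or embind getter instead of the placeholder.

```javascript
// Time `calls` invocations of a cheap synchronous getter and report ns/call.
// `fn` must return a number; results are accumulated into `sink` so the
// JIT cannot discard the calls as dead code.
function nsPerCall(fn, calls = 1e6, warmup = 1e4) {
  let sink = 0;
  // Warm-up phase: lets the JIT compile the call path before timing.
  for (let i = 0; i < warmup; i++) sink += fn();

  const start = process.hrtime.bigint(); // monotonic clock, nanoseconds
  for (let i = 0; i < calls; i++) sink += fn();
  const elapsedNs = Number(process.hrtime.bigint() - start);

  return { nsPerCall: elapsedNs / calls, sink };
}

// Placeholder getter standing in for something like a `model_nq` lookup.
const standInGetter = () => 42;
const sample = nsPerCall(standInGetter);
console.log(`${sample.nsPerCall.toFixed(2)} ns/call`);
```

The loop overhead itself is included in the result, which is one reason the numbers are useful for relative comparison only.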

Simulate TTFS (Browser: forge variants only)

Browser numbers come from mujoco-wasm-play + Playwright HUD sampling (default: 3 runs, median). In some environments, the pthread variant may not expose HUD CPU (ms/step), but TTFS and FPS are still sampled.

Model	Variant	TTFS ready->snapshot median (ms)	TTFS nav->snapshot median (ms)	FPS median
`humanoid`	single	111	322	120
`humanoid`	pthreads	77	287	120
`cards`	single	225	499	120
`cards`	pthreads	505	791	120
`raj`	single	338	570	121
`raj`	pthreads	270	510	45.5

Reload / lifecycle (thread-matched, Node: `pool=4` vs `hc=4`)

Label	Iterations	ms/iter	RSS drift (MiB)	RSS peak (MiB)
`forge-3.5.0-pthreads`	50	0.840	1.9	462.8
`official-3.5.0-hc4`	50	1.003	7.0	659.2
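
The shape of a reload loop that reports time per iteration and RSS drift can be sketched as follows. This is an assumption-laden illustration rather than the suite's implementation: `reloadLoop`, `load`, and `dispose` are hypothetical names, and the peak here is a sampled maximum, not an OS-level high-water mark.

```javascript
const BYTES_PER_MIB = 1024 * 1024;

// `load()` returns a handle for a freshly loaded model and `dispose(handle)`
// releases it. Both are hypothetical synchronous callbacks supplied by the
// caller (one adapter per distribution under test).
function reloadLoop(load, dispose, iterations = 50) {
  const rssStart = process.memoryUsage().rss;
  let rssPeak = rssStart;

  const t0 = performance.now();
  for (let i = 0; i < iterations; i++) {
    const handle = load();
    dispose(handle);
    // Sampling RSS inside the loop adds a small cost to every iteration.
    rssPeak = Math.max(rssPeak, process.memoryUsage().rss);
  }
  const elapsedMs = performance.now() - t0;
  const rssEnd = process.memoryUsage().rss;

  return {
    iterations,
    msPerIter: elapsedMs / iterations,
    rssDriftMiB: (rssEnd - rssStart) / BYTES_PER_MIB,
    rssPeakMiB: rssPeak / BYTES_PER_MIB,
  };
}

// Stand-in workload: allocate and drop a small buffer per iteration.
console.log(reloadLoop(() => new Float64Array(1024), () => {}, 50));
```

Drift is the difference between RSS after the last iteration and RSS before the first, so a positive value that grows with the iteration count points at memory that is not being released.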

Limitations (what this bench may still miss)

Official embind and forge necessarily use different JS-level APIs, even when the underlying engine version is comparable.
Browser pthread deployments require COOP/COEP + SharedArrayBuffer; deployment constraints are part of the trade-off.
Some metrics (especially browser memory) are harder to measure precisely and are currently more reliable in Node.
Being the fastest possible build is not the goal; the goal is to deliver forge’s capabilities without unacceptable regression.
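
To make the COOP/COEP constraint concrete, the sketch below shows the two response headers browsers require before a page can use `SharedArrayBuffer`, plus a runtime check for choosing a variant. The helper names are hypothetical, and how headers are set in a given deployment (static host, CDN, dev server) is outside what this chapter specifies.

```javascript
// Response headers that make a page cross-origin isolated.
function crossOriginIsolationHeaders() {
  return {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
  };
}

// Apply the headers to any response object exposing setHeader(name, value),
// such as Node's http.ServerResponse.
function applyIsolationHeaders(res) {
  for (const [name, value] of Object.entries(crossOriginIsolationHeaders())) {
    res.setHeader(name, value);
  }
  return res;
}

// Pick a variant name for the current environment. In browsers,
// `crossOriginIsolated` is true only when the headers above took effect.
function pickVariant(env = globalThis) {
  const isolated = env.crossOriginIsolated === true;
  const hasSharedMemory = typeof env.SharedArrayBuffer === "function";
  return isolated && hasSharedMemory ? "pthreads" : "single";
}
```

A page that cannot be served with these headers (for example, one embedded on a host the author does not control) would fall back to the single-threaded variant under this check.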

Reproducing the bench

See bench/README.md for the full procedure. The short version:

powershell -ExecutionPolicy Bypass -File tools/build_official_embind.ps1 -Ref 3.4.0 -OutDir C:\dev\mjwf-bench\official\3.4.0 -Clean
powershell -ExecutionPolicy Bypass -File tools/build_official_embind.ps1 -Ref 3.5.0 -OutDir C:\dev\mjwf-bench\official\3.5.0

node bench/node/run_matrix.mjs --official340 C:/dev/mjwf-bench/official/3.4.0 --official350 C:/dev/mjwf-bench/official/3.5.0
node bench/node/report.mjs

powershell -ExecutionPolicy Bypass -File bench/browser/run_playwright_bench.ps1