Bench: validating forge’s flexibility without performance regression
This chapter describes the public benchmark suite shipped with mujoco-wasm-forge.
The suite exists to justify why forge exists:
Flexibility: a minimal, configurable interface surface and multiple distribution variants (e.g. single-threaded vs pthreads).
Flatness: flat handles and a predictable ABI that downstream tools and apps can build around (Worker / Node / Web).
Rule-based, auditable generation: wrappers/exports are generated and checked by rules so changes are explicit and reviewable.
In forge, flexibility is framed as a complement to the official embind distribution. Official embind typically offers a richer, stable, and ergonomic binding surface, but that breadth comes with a larger surface area and a more standardized distribution shape, which can make fine-grained control over variants and the exported ABI less direct. Forge instead relies on a flat ABI organization and an automated output structure (stable dist layout + ABI artifacts + rules/gates): most changes follow the same loop of change rules -> regenerate -> gate, which is designed to be auditable and variant-friendly.
Forge is intended to be usable in extreme deployment environments (e.g. online demos with strict security and resource constraints) and in research prototyping workflows, while still being a reasonable base for routine use.
The benchmark asks a concrete question:
Under these design constraints, does forge stay in the same performance class as the official MuJoCo embind WASM build for Simulate-style workloads? (And when it differs, is the reason attributable and controllable?)
This is not a claim that forge is universally better than official embind. The two distributions have different goals; the bench is here to make trade-offs visible, reproducible, and discussable.
What we measure (organized by user needs)
| Need / scenario | Why we care | What we measure (metric) | Notes |
|---|---|---|---|
| Ship fast (web demo / docs) | Download + caching affects real users | Artifact size (…) | Measured from produced bundles |
| Page becomes interactive | Cold start is a hard UX constraint | Init time (…) | Dominated by threading/runtime initialization |
| First meaningful output (Simulate) | First step / first snapshot is when users feel success | TTFS (ready -> first step / first scene snapshot) | Browser TTFS is end-to-end; Node TTFS isolates engine load/step |
| Predictable resource budget | Thread pools + compilation can cause spikes | RSS after init + peak RSS | Node provides more reliable RSS sampling than browsers |
| Smooth model load | XML compilation and mesh processing dominate load time | XML compile/load latency | Sensitive to threading defaults and asset shape |
| Stable steady-state simulation | After load, engine hot-loop dominates | | Requires comparable threading configuration for strict comparisons |
| Low JS <-> WASM overhead | Flat/handle-based APIs should reduce crossing cost | FFI microbench (ns/call for lightweight getters) | Microbench is intentionally small and repeatable |
| Robust lifecycle (reset/reload) | Demos frequently reset or reload models | Reload loop time + RSS drift | Amplifies leaks and binding overhead |
| Diagnosable failures | Public demos must fail with readable errors | Errmsg/errno gates (bad XML, missing plugin, missing assets) | Forge has explicit helper gates for this |
| Plugin baseline | Some Simulate models require plugins | Plugin availability smoke (e.g. …) | Treated as a product-level contract difference |
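As a rough illustration of how the init-time and RSS-after-init rows can be sampled in Node, the sketch below times an async module factory and reads the resident set size afterwards. The `createModule` parameter is a hypothetical stand-in for whatever factory a given distribution exposes; it is not forge's or embind's actual entry point, and the real bench scripts may measure this differently.

```javascript
// Sketch only: `createModule` is a hypothetical async factory supplied by the caller,
// not the real forge or official embind entry point.
async function measureInit(createModule) {
  const MiB = 1024 * 1024;
  const rssBefore = process.memoryUsage().rss; // bytes, sampled before init
  const t0 = performance.now();
  const mod = await createModule();            // module instantiation is what we time
  const initMs = performance.now() - t0;
  const rssAfter = process.memoryUsage().rss;  // bytes, sampled right after init
  return {
    mod,
    initMs,
    rssAfterInitMiB: rssAfter / MiB,
    rssDeltaMiB: (rssAfter - rssBefore) / MiB,
  };
}
```

Peak RSS is not covered here; it would need periodic sampling while the workload runs, which is one reason Node is the more dependable place to collect memory numbers.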
Comparison matrix (what we compare)
| Label | Intended meaning | Threading policy |
|---|---|---|
| | Official embind baseline | As built upstream |
| | Official embind (3.5.0) under different effective pool sizing | Pthreads enabled; pool approximated via … |
| | Forge baseline | Single-threaded |
| | Forge single-threaded variant | Single-threaded |
| | Forge pthread variant | Pthreads enabled; pool defaults to 4 (override via …) |
For strict (thread-matched) comparisons, we primarily compare:
forge-3.5.0-pthreads (pool=4) vs official-3.5.0-hc4.
Summary findings (reference run)
The suite is designed to be rerun on different machines, so absolute numbers will vary. The key question is whether forge’s organization and flexibility introduce a performance penalty.
| Dimension | Observation (selected numbers) | Interpretation (what it means) | Caveats |
|---|---|---|---|
| Distribution size | | Minimal/flat surface can reduce shipping footprint | Not guaranteed across future upstreams |
| Init + memory | | Threading policy and pool sizing can dominate cost | RSS is platform-dependent; browsers differ |
| Plugin baseline | | Forge can treat plugin availability as part of the dist contract | Official plugin packaging may evolve upstream |
| | e.g. … | Forge organization does not inherently degrade simulation throughput | Small models can be dominated by measurement overhead |
| FFI microbench | | Flat handles reduce high-frequency boundary cost | Microbench != whole-app performance |
| TTFS (Simulate) | | End-to-end pthread benefit depends on model/load path | HUD metrics depend on environment/COI configuration |
| Reload / lifecycle | 50 iterations: 0.840 ms/iter (forge pthreads) vs 1.003 ms/iter (official hc4); RSS drift: +1.9 MiB vs +7.0 MiB | Lifecycle pressure can amplify binding overhead differences | Not a proxy for every workload |
Reference snapshot tables
Reference run environment (for the snapshot tables below): Windows x64 / Node v22.16.0 / 32 logical CPUs.
Unless noted otherwise:
Node metrics are medians of 5 runs.
Browser HUD metrics default to 3 runs (median).
Sizes and memory are reported in KiB/MiB (1024-based).
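The reporting conventions above (median over repeated runs, 1024-based units) can be sketched with a few small helpers. The function names are illustrative and are not taken from the bench scripts.

```javascript
// Illustrative helpers; names are assumptions, not identifiers from bench/node.

// Median of a numeric sample: middle value for odd length,
// mean of the two middle values for even length.
function median(values) {
  if (values.length === 0) throw new Error("median of empty sample");
  const sorted = [...values].sort((a, b) => a - b); // numeric sort on a copy
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// 1024-based unit conversions, matching the KiB/MiB convention used in the tables.
const bytesToKiB = (bytes) => bytes / 1024;
const bytesToMiB = (bytes) => bytes / (1024 * 1024);
```

With 5 Node runs or 3 browser runs the sample length is odd, so the median is always one of the observed values rather than an interpolated one.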
Snapshot (Node: bundle / init / memory)
| Label | JS (KiB) | WASM (MiB) | Init (ms) | RSS after init (MiB) | RSS peak (MiB) |
|---|---|---|---|---|---|
| | 256.1 | 3.33 | 19.4 | 59.8 | 208.8 |
| | 272.6 | 3.33 | 52.7 | 110.7 | 462.8 |
| | 296.7 | 8.24 | 84.4 | 125.5 | 659.2 |
| | 296.7 | 8.24 | 169.8 | 507.9 | 873.9 |
Model load + stepping (thread-matched, Node: pool=4 vs hc=4)
| Model | Forge compile/load (ms) | Official compile/load (ms) | Forge steady | Official steady |
|---|---|---|---|---|
| | 90.8 | 109.6 | 0.325 | 0.343 |
| | 303.4 | 384.3 | 0.086 | 0.103 |
| | 5.4 | 7.6 | 0.020 | 0.021 |
| | 59.5 | 102.5 | 0.426 | 0.465 |
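One way to read the compile/load columns is as a relative saving, computed as (official - forge) / official. The sketch below applies that to the four rows of the table above; the values are copied from the reference run, the model names are left out because they are not recoverable here, and the helper name is an assumption for illustration.

```javascript
// Compile/load latencies (ms) copied from the thread-matched reference table above.
const compileLoadMs = [
  { forge: 90.8, official: 109.6 },
  { forge: 303.4, official: 384.3 },
  { forge: 5.4, official: 7.6 },
  { forge: 59.5, official: 102.5 },
];

// Fraction of the official latency that forge saves (positive means forge is faster).
function relativeSaving(forge, official) {
  return (official - forge) / official;
}

for (const row of compileLoadMs) {
  const pct = (relativeSaving(row.forge, row.official) * 100).toFixed(1);
  console.log(`forge ${row.forge} ms vs official ${row.official} ms: ${pct}% lower`);
}
```

These ratios describe a single reference machine and should be recomputed from a fresh run before drawing conclusions about another environment.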
Functional baseline (Node: plugin-dependent models)
| Model | Forge 3.5.0 single | Official 3.5.0 (hc=4) |
|---|---|---|
| | ok | error |
| | ok | error |
FFI microbench (Node: apples-to-apples subset)
This is a “1e6 calls” microbench. It is meant for comparing relative organization/binding overhead, not for extrapolating whole-app throughput.
| Metric | Forge 3.5.0 pthreads ns/call (median) | Official 3.5.0 ns/call (median) |
|---|---|---|
| | 4.23 | 5.24 |
| | 14.71 | 33.56 |
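A "1e6 calls" microbench of this kind can be sketched as a warm-up phase followed by a timed loop whose elapsed time is divided by the call count. The `getter` argument below is a hypothetical stand-in for a lightweight binding call; it is not a real forge or embind export, and the actual bench harness may be structured differently.

```javascript
// Sketch only: `getter` is a caller-supplied stand-in for a cheap JS <-> WASM getter.
function nsPerCall(getter, calls = 1e6) {
  let sink = 0; // accumulate results so the calls cannot be optimized away as dead code

  // Short warm-up so the JIT has a chance to compile the call path before timing.
  for (let i = 0; i < 1000; i++) sink += getter();

  const t0 = performance.now();
  for (let i = 0; i < calls; i++) sink += getter();
  const elapsedMs = performance.now() - t0;

  // 1 ms = 1e6 ns, so ns/call = elapsedMs * 1e6 / calls.
  return { nsPerCall: (elapsedMs * 1e6) / calls, sink };
}
```

A single invocation is noisy; taking the median over several invocations, as the tables here do, gives a more stable ns/call figure.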
Simulate TTFS (Browser: forge variants only)
Browser numbers come from mujoco-wasm-play + Playwright HUD sampling (default: 3 runs, median). In some environments, the pthread variant may not expose the HUD CPU (ms/step) metric, but TTFS and FPS are still sampled.
| Model | Variant | TTFS ready->snapshot median (ms) | TTFS nav->snapshot median (ms) | FPS median |
|---|---|---|---|---|
| | single | 111 | 322 | 120 |
| | pthreads | 77 | 287 | 120 |
| | single | 225 | 499 | 120 |
| | pthreads | 505 | 791 | 120 |
| | single | 338 | 570 | 121 |
| | pthreads | 270 | 510 | 45.5 |
Reload / lifecycle (thread-matched, Node: pool=4 vs hc=4)
| Label | Iterations | ms/iter | RSS drift (MiB) | RSS peak (MiB) |
|---|---|---|---|---|
| | 50 | 0.840 | 1.9 | 462.8 |
| | 50 | 1.003 | 7.0 | 659.2 |
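The reload metrics above (ms/iter, RSS drift, RSS peak) can be approximated in Node with a loop like the following. The `load` and `free` callbacks are hypothetical placeholders for a distribution's model load and teardown calls; this is a sketch of the measurement shape under that assumption, not the bench's actual implementation.

```javascript
// Sketch only: `load` and `free` are caller-supplied placeholders for model load/teardown.
function reloadLoop(load, free, iterations = 50) {
  const MiB = 1024 * 1024;
  const rssStart = process.memoryUsage().rss; // bytes before the first reload
  let rssPeak = rssStart;

  const t0 = performance.now();
  for (let i = 0; i < iterations; i++) {
    const handle = load();
    free(handle);
    // Sampling RSS inside the loop adds a little overhead to ms/iter;
    // a stricter harness could sample from a separate timer instead.
    rssPeak = Math.max(rssPeak, process.memoryUsage().rss);
  }
  const totalMs = performance.now() - t0;

  const rssEnd = process.memoryUsage().rss;
  return {
    iterations,
    msPerIter: totalMs / iterations,
    rssDriftMiB: (rssEnd - rssStart) / MiB, // growth across the whole loop
    rssPeakMiB: rssPeak / MiB,
  };
}
```

RSS drift can be slightly negative or noisy when the garbage collector runs mid-loop, which is why a drift of a few MiB is read as a trend rather than an exact leak size.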
Limitations (what this bench may still miss)
Official embind and forge necessarily use different JS-level APIs, even when the underlying engine version is comparable.
Browser pthread deployments require COOP/COEP + SharedArrayBuffer; deployment constraints are part of the trade-off.
Some metrics (especially browser memory) are harder to measure precisely and are currently more reliable in Node.
Fastest possible is not the goal; the goal is forge capabilities without unacceptable regression.
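The COOP/COEP constraint can be made concrete with a small runtime check. Browsers only expose a usable `SharedArrayBuffer` to pages that are cross-origin isolated, which is typically achieved by serving `Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Embedder-Policy: require-corp`, and they report the result through the global `crossOriginIsolated` flag. The sketch below picks between a threaded and a single-threaded build on that basis; the returned strings are illustrative labels, not forge's actual dist names.

```javascript
// Sketch only: "pthreads" and "single" are illustrative labels, not real dist identifiers.
// `env` defaults to the global object so the check can be exercised with a fake in tests.
function pickVariant(env = globalThis) {
  const hasSharedMemory = typeof env.SharedArrayBuffer === "function";
  const isolated = env.crossOriginIsolated === true; // set by the browser when COOP/COEP are in effect
  return hasSharedMemory && isolated ? "pthreads" : "single";
}
```

Falling back to the single-threaded build when the page is not isolated keeps a demo usable on hosts where response headers cannot be configured.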
Reproducing the bench
See bench/README.md for the full procedure. The short version:
```powershell
powershell -ExecutionPolicy Bypass -File tools/build_official_embind.ps1 -Ref 3.4.0 -OutDir C:\dev\mjwf-bench\official\3.4.0 -Clean
powershell -ExecutionPolicy Bypass -File tools/build_official_embind.ps1 -Ref 3.5.0 -OutDir C:\dev\mjwf-bench\official\3.5.0
node bench/node/run_matrix.mjs --official340 C:/dev/mjwf-bench/official/3.4.0 --official350 C:/dev/mjwf-bench/official/3.5.0
node bench/node/report.mjs
powershell -ExecutionPolicy Bypass -File bench/browser/run_playwright_bench.ps1
```