Initial webgpu support #215
📝 Walkthrough
This PR implements WebGPU acceleration for wllama by adding type definitions for timing metrics, browser feature detection, JSPI runtime compatibility, async WASM function wrappers, worker parameter updates, build configuration for GPU libraries, test infrastructure, and example UI enhancements.
Great to see this getting integrated! Needing only one wasm blob is really nice, and the stub approach for browsers that don't support JSPI is neat.
As for browser performance, I've also found that Chrome > Safari > Firefox, although that was using asyncify in Safari/Firefox. I don't think this is necessarily wllama/webgpu specific either: on my M3 I also see really bad performance from Transformers.js and WebLLM on up-to-date Firefox browsers.
I'm happy to work on other aspects of the WebGPU integration too, depending on what you'd like to include in wllama. I might take a few days to get to anything, so I'm also happy to just give feedback, since it seems like you're moving pretty fast right now :).
The one thing that I think is pretty useful is loading models directly from OPFS into WebGPU buffers, since it can reduce memory overhead quite a bit and model splitting isn't needed. That design might need some thought if models are only partially offloaded, though. Otherwise, full browser support would be great, although you're probably right that most interested users are on Chromium-based browsers. Hopefully JSPI and 64-bit memory can be fully supported by browsers soon; it seems like JSPI is closer based on https://webassembly.org/features/. Not sure why Safari hasn't started working on Memory64 yet :(.
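The stub approach for browsers without JSPI, mentioned in the comment above, implies some kind of runtime capability check. A minimal sketch of such a check follows, assuming the standard JSPI surface (`WebAssembly.Suspending` and `WebAssembly.promising`) and the standard WebGPU entry point (`navigator.gpu`). The name `detectCapabilities` and the shape of its return value are hypothetical and are not taken from wllama.

```javascript
// Hypothetical helper, not wllama's actual API.
// `scope` defaults to globalThis so the check can also be run against fake objects.
function detectCapabilities(scope = globalThis) {
  const wasm = scope.WebAssembly;
  // JSPI exposes WebAssembly.Suspending (a class) and WebAssembly.promising (a function).
  const hasJSPI =
    Boolean(wasm) &&
    typeof wasm.Suspending === 'function' &&
    typeof wasm.promising === 'function';
  // WebGPU is exposed to pages and workers as navigator.gpu.
  const hasWebGPU = Boolean(scope.navigator && scope.navigator.gpu);
  // Assumption from this PR: the WebGPU path is only usable when JSPI is present.
  return { hasJSPI, hasWebGPU, canUseWebGPU: hasJSPI && hasWebGPU };
}
```

Passing the scope as a parameter keeps the check free of side effects; in a page or worker it would be called with no arguments.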
Yes, I'm already aware of this issue. We're planning to support loading GGUF from a callback (ggml-org/llama.cpp#22341), which will allow wiring OPFS directly to the GGUF loader in the near future.
Edit: and yes, feel free to give feedback / review this PR if you have time, thanks!
I recently tested Transformers.js 4.2 on an AMD 7840U (Ubuntu 24.04) with Firefox 150 vs Ungoogled-Chromium 147 (--enable-unsafe-webgpu --enable-features=Vulkan), using Qwen3-4B-ONNX:q4f16, and found Firefox to be 2 times slower than Chromium on the same code.
Actionable comments posted: 1
🧹 Nitpick comments (5)
examples/multimodal/index.html (1)
244-244: ⚡ Quick win
Avoid logging every streamed chunk in the hot path.
Line 244 logs each chunk during generation; this can reduce measured token throughput and adds noisy console output. Consider gating it behind a debug flag or removing it.
Suggested fix
+ const DEBUG_STREAM_CHUNKS = false;
  ...
- console.log('Received chunk:', chunk);
+ if (DEBUG_STREAM_CHUNKS) console.log('Received chunk:', chunk);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/multimodal/index.html` at line 244, the hot-path console.log('Received chunk:', chunk) inside the streaming callback is too noisy and harms throughput; remove that unconditional log or gate it behind a runtime debug flag (e.g., DEBUG or verbose) and only log when the flag is true so streaming performance isn't impacted. Locate the streaming callback that receives the chunk variable in examples/multimodal/index.html and replace the direct console.log with a conditional log controlled by the debug flag.
scripts/docker-compose.yml (1)
24-34: 💤 Low value
Consider cleanup and version documentation.
The Dawn package integration looks correct. A couple of minor suggestions:
- The `emdawn.zip` file remains after extraction. Consider adding `rm emdawn.zip` after line 32 to save disk space during builds.
- Consider adding a comment explaining how to update `DAWN_TAG` when newer versions are needed.
♻️ Optional cleanup
  python3 -c "import zipfile; zf=zipfile.ZipFile('emdawn.zip','r'); zf.extractall('/source/build/emdawn'); zf.close()"
+ rm emdawn.zip
  emcmake cmake .. -DGGML_WEBGPU=ON -DGGML_WEBGPU_JSPI=ON -DEMDAWNWEBGPU_DIR="$${EMDAWNWEBGPU_DIR}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/docker-compose.yml` around lines 24 - 34, add cleanup and a brief version-update note: after extracting emdawn.zip (the python3 zipfile.extractall call that unpacks emdawn.zip), remove the downloaded archive (rm emdawn.zip) to free build disk space, and add a comment above the DAWN_TAG declaration explaining how to update DAWN_TAG/EMDAWN_PKG (e.g., where to find new release tags or naming pattern) so future maintainers know how to bump the Dawn version and regenerate EMDAWN_PKG/EMDAWNWEBGPU_DIR.
vitest.config.ts (1)
29-47: 💤 Low value
Consider documenting that WebGPU tests require Chromium.
The `providerOptions` logic correctly handles WebGPU mode with Chromium-specific flags, but when `WEBGPU=1` is set alongside `BROWSER=safari`, the config will use Chromium launch args with the playwright provider anyway (since the WEBGPU check comes first). This is likely intentional since Safari doesn't support the `enable-unsafe-webgpu` flag, but a brief comment would clarify the expected behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@vitest.config.ts` around lines 29 - 47, add a short inline comment above the providerOptions conditional explaining that when WEBGPU is true the config intentionally uses Chromium-specific launch args (chromeArgsWebGPU) and will override BROWSER=safari/SAFARI because Safari does not support the enable-unsafe-webgpu flag; reference WEBGPU, SAFARI, and chromeArgsWebGPU so future readers understand this precedence and why Chromium is required for WebGPU tests.
src/wllama.ts (1)
473-474: 💤 Low value
Document the default behavior change for `n_gpu_layers`.
The default value of `n_gpu_layers: 99999` means WebGPU offloading is enabled by default when available. This is a significant behavior change that users should be aware of. Consider adding a note in the `LoadModelParams` type documentation or README to clarify that GPU offloading is now opt-out rather than opt-in.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/wllama.ts` around lines 473 - 474, update the documentation to call out that n_gpu_layers now defaults to 99999 (in the LoadModelParams type and any public API docs/README), explaining this enables WebGPU/GPU offloading by default when available (i.e., it is opt-out not opt-in); add a short JSDoc comment on the LoadModelParams type and/or the function that consumes it (where n_gpu_layers is used) noting the default and how to disable offloading (set n_gpu_layers to 0 or an explicit value), and update any README/usage examples to reflect the new default behavior.
src/workers-code/llama-cpp.js (1)
247-251: 💤 Low value
Minor: Consider adding explicit await for zero-arg async functions.
For consistency, the zero-arg branch (line 250) could explicitly await when `isAsync` is true, matching the two-arg branch pattern. While the current code works correctly due to async function semantics, explicit awaiting makes the intent clearer:
  if (args.length === 2) {
    result = isAsync ? await fn(action, req) : fn(action, req);
  } else {
-   result = fn();
+   result = isAsync ? await fn() : fn();
  }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/workers-code/llama-cpp.js` around lines 247 - 251, the zero-arg call path currently does not explicitly await async functions; update the else branch of the args.length === 2 check so that when isAsync is true you await fn() (e.g., replace result = fn() with result = isAsync ? await fn() : fn()), keeping the existing variables fn, isAsync and result and preserving the surrounding async context used where this args.length check occurs.
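The remark that the un-awaited call "works correctly due to async function semantics" can be illustrated with a small sketch. It assumes the result is ultimately returned from an async function, which is not confirmed by the excerpt above; `callWrapped` and its `explicitAwait` switch are hypothetical stand-ins, not the worker's real wrapper.

```javascript
// Hypothetical stand-in for the worker's call wrapper; names are illustrative only.
async function callWrapped(fn, isAsync, explicitAwait) {
  let result;
  if (explicitAwait) {
    // Variant suggested in the review: await only when the function is async.
    result = isAsync ? await fn() : fn();
  } else {
    // Original variant: `result` may still be a pending Promise here.
    result = fn();
  }
  // Returning a Promise from an async function adopts its state,
  // so callers that await callWrapped() see the same resolved value either way.
  return result;
}
```

The two variants differ only inside the wrapper: without the explicit `await`, any code between the call and the `return` would see a Promise rather than the resolved value, which is why making the await explicit states the intent more clearly.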
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/multimodal/index.html`:
- Around line 123-127: The preloading of bliss.png is currently awaited inside
main() (variables/expressions: fetch('./bliss.png'), response, imageData,
elemPreviewImage) which will reject main() if the network fails; wrap the
fetch/arrayBuffer/DOM assignment in a try/catch so failures are caught and
ignored (or handled with a fallback) and do not abort initialization, or else
perform the fetch asynchronously (fire-and-forget Promise) so the rest of main()
proceeds even if the image load fails; ensure any DOM updates to
elemPreviewImage occur only on success and that errors are logged but not
rethrown.
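The try/catch approach described in the inline comment above could look roughly like the following sketch. `preloadImage`, `fetchFn`, `onLoaded`, and `logger` are hypothetical names introduced for illustration; only `./bliss.png` and the idea of updating the preview element on success come from the review text.

```javascript
// Hypothetical helper; fetchFn, onLoaded and logger are injected so the
// sketch can run outside a browser. Not taken from examples/multimodal/index.html.
async function preloadImage(url, { fetchFn = globalThis.fetch, onLoaded, logger = console } = {}) {
  try {
    const response = await fetchFn(url);
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    const imageData = await response.arrayBuffer();
    // Touch the DOM (via the callback) only after the download succeeded.
    if (onLoaded) onLoaded(imageData);
    return imageData;
  } catch (err) {
    // Log and swallow the error so the caller's initialization continues.
    logger.warn('Preview image preload failed:', err);
    return null;
  }
}
```

Because the helper never rejects, `main()` could either await it or call it without awaiting (fire-and-forget) without risking an aborted initialization.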
⛔ Files ignored due to path filters (1)
- src/wasm/wllama.wasm is excluded by !**/*.wasm
📒 Files selected for processing (17)
- .github/workflows/ci.yml
- CMakeLists.txt
- cpp/wllama-context.h
- examples/multimodal/index.html
- llama.cpp
- package.json
- scripts/docker-compose.yml
- src/types/oai-compat.ts
- src/types/types.ts
- src/utils.ts
- src/wasm/wllama.js
- src/wllama.ts
- src/wllama.wgpu.test.ts
- src/worker.ts
- src/workers-code/generated.ts
- src/workers-code/llama-cpp.js
- vitest.config.ts
This PR is partially based on #201; huge thanks to @reeselevine.
The main goal of this PR is to have a single build that supports webgpu, single-threaded, and multi-threaded modes, where each configuration can be toggled at runtime.
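To make the runtime toggling concrete, here is a minimal sketch of how one build might choose its configuration from detected capabilities. `pickRuntimeConfig` and its field names are hypothetical and not wllama's API. The rule that the WebGPU path needs JSPI follows this PR's description; the rule that multi-threading needs cross-origin isolation is the general browser requirement for `SharedArrayBuffer`.

```javascript
// Hypothetical selection logic; function and field names are illustrative only.
function pickRuntimeConfig({ hasWebGPU, hasJSPI, crossOriginIsolated, hardwareConcurrency = 1 }) {
  // Assumption from this PR: WebGPU is only usable when JSPI is available;
  // otherwise the WebGPU path is stubbed and inference stays on the CPU.
  const useWebGPU = Boolean(hasWebGPU && hasJSPI);
  // Multi-threaded wasm relies on SharedArrayBuffer, which browsers only
  // expose on cross-origin isolated pages; fall back to one thread otherwise.
  const nThreads = crossOriginIsolated ? Math.max(1, hardwareConcurrency) : 1;
  return { useWebGPU, nThreads };
}
```

All three modes fall out of the same function: a Chromium page with JSPI gets `useWebGPU: true`, an isolated page without JSPI gets CPU multi-threading, and everything else falls back to a single thread.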
This is achieved by building on top of #214, with JSPI enabled for webgpu async `WaitAny`. On platforms that don't support JSPI, this will be stubbed, so wllama can still be used, just without webgpu support.
Being JSPI-only is the one compromise of this implementation. This is because binaryen's asyncify doesn't support wasm exceptions, resulting in huge overhead in both performance and binary size. The result is that firefox support requires manually enabling `javascript.options.wasm_js_promise_integration` in `about:config`, but I think this is an acceptable compromise as most users will use this from chromium-based browsers.
Performance (macbook M5):
TODO:
- `-fit` params to automatically determine `n_gpu_layers` and `n_ctx`
Summary by CodeRabbit
New Features
Tests