
Initial webgpu support #215

Merged
ngxson merged 6 commits into master from xsn/wgpu_init_support on May 11, 2026
ngxson (Owner) commented May 10, 2026

This PR is partially based on #201; huge thanks to @reeselevine.

The main goal of this PR is to have a single build that supports WebGPU, single-threaded, and multi-threaded modes, each of which can be toggled at runtime.

This is achieved by building on top of #214, with JSPI enabled for WebGPU's async WaitAny. On platforms that don't support JSPI, it is stubbed out, so wllama still works there, but without WebGPU support.

Being JSPI-only is the one compromise of this implementation. Binaryen's Asyncify doesn't support wasm exceptions, resulting in huge overhead in both performance and binary size. As a result, Firefox support requires manually enabling javascript.options.wasm_js_promise_integration in about:config, but I think this is an acceptable compromise, as most users will use this from Chromium-based browsers.
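
As a rough illustration of the runtime toggle described above, here is a minimal sketch of the capability checks. It is not the code from this PR: the names `hasJSPI`, `canUseWebGPU`, and the injectable `EnvLike` object are hypothetical and exist only so the logic can be exercised outside a browser. It assumes only that JSPI exposes `WebAssembly.Suspending` and that WebGPU exposes `navigator.gpu`.

```typescript
// Sketch only: hypothetical helpers, not the PR's isSupportJSPI()/isSupportWebGPU().
// `EnvLike` stands in for the global scope so the checks can run anywhere.
interface EnvLike {
  WebAssembly?: Record<string, unknown>;
  navigator?: Record<string, unknown>;
}

// JSPI is considered available when WebAssembly.Suspending is a constructor.
function hasJSPI(env: EnvLike): boolean {
  return typeof env.WebAssembly?.["Suspending"] === "function";
}

// WebGPU is only usable in this design when navigator.gpu exists AND JSPI exists.
function canUseWebGPU(env: EnvLike): boolean {
  return env.navigator?.["gpu"] != null && hasJSPI(env);
}
```

In a page or worker, the real globals would be passed in place of the test object; when `canUseWebGPU` returns false, the build described above would fall back to CPU-only execution.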

Performance (MacBook M5):

  • On the multimodal demo, I got 171 t/s for generation and 592 t/s for prompt processing (running on the latest Chrome)
  • On Firefox, the basic demo runs extremely slowly compared to single-threaded mode. Still not sure why
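
For reference, a t/s figure like the ones above is simply a token count divided by elapsed seconds. A minimal sketch, assuming a hypothetical `PhaseTiming` shape (the actual fields of the `ResultTimings` type added in this PR are not shown here and may differ):

```typescript
// Sketch only: `PhaseTiming` is a hypothetical shape, not the PR's ResultTimings.
interface PhaseTiming {
  tokens: number; // number of tokens processed in this phase
  ms: number; // elapsed wall-clock time in milliseconds
}

// Convert a phase timing into tokens per second, guarding against zero duration.
function tokensPerSecond(t: PhaseTiming): number {
  if (t.ms <= 0) return 0;
  return (t.tokens * 1000) / t.ms;
}
```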

TODO:

  • Expose params to control GPU layers
  • Add tests
  • (follow-up PR) Allow -fit params to automatically determine n_gpu_layers and n_ctx

Summary by CodeRabbit

  • New Features

    • Added WebGPU acceleration support with Flash Attention optimization
    • Display real-time generation timings in multimodal demo
    • Configurable GPU layer offloading (n_gpu_layers parameter)
    • Automatic browser capability detection for WebGPU/JSPI with Firefox warnings
    • Preload example images in multimodal demo
  • Tests

    • Added WebGPU test suite for validation

Review Change Stack

coderabbitai Bot (Contributor) commented May 10, 2026

📝 Walkthrough

This PR implements WebGPU GPU acceleration for wllama by adding type definitions for timing metrics, browser feature detection, JSPI runtime compatibility, async WASM function wrappers, worker parameter updates, build configuration for GPU libraries, test infrastructure, and example UI enhancements.

Changes

WebGPU Feature Implementation

| Layer | File(s) | Summary |
|---|---|---|
| Type System & API Surface | src/types/oai-compat.ts, src/types/types.ts | Added ResultTimings interface with cache/prompt/generation metrics; extended ChatCompletionChunk, RawCompletionResponse, and RawCompletionChunk with optional timings field. Added n_gpu_layers parameter to LoadModelParams. |
| Build Configuration | CMakeLists.txt, scripts/docker-compose.yml | Split Emscripten compile/link flags into separate entries; added JSPI exports for wllama_start and wllama_action. Docker Compose now downloads Dawn WebGPU library and enables GPU-related CMake flags (-DGGML_WEBGPU=ON, -DGGML_WEBGPU_JSPI=ON). |
| Browser Feature Detection | src/utils.ts | Added isSupportJSPI() to detect WebAssembly.Suspending, isSupportWebGPU() to check both GPU and JSPI availability, and isFirefox() user-agent helper. |
| JSPI Runtime Compatibility | src/worker.ts | Injected JSPI_STUB polyfill that defines WebAssembly.Suspending and WebAssembly.promising fallbacks when JSPI is unavailable. |
| Async WASM Function Wrappers | src/workers-code/llama-cpp.js | Updated callWrapper to accept isAsync flag and configure Module.cwrap with async mode. Wrapped wllamaStart and wllamaAction as async-capable cwrap calls. |
| Core Class Integration | src/wllama.ts | Imported isSupportJSPI for Firefox JSPI warnings in constructor. Added isSupportWebGPU() public method. Updated loadModel to pass n_gpu_layers with default params.n_gpu_layers ?? 99999 and n_ctx via nullish coalescing to worker. |
| C++ Backend | cpp/wllama-context.h | Added flash_attn flag handling in action_load to configure flash attention type. Reformatted reasoning and default_template_kwargs parsing and embedding task input/content selection without changing behavior. Updated llama.cpp subproject commit. |
| Test Infrastructure | vitest.config.ts, src/wllama.wgpu.test.ts | Vitest now conditionally filters tests by WEBGPU=1 env var, running only *.wgpu.test.* files with dedicated Chrome WebGPU launch args. Added test suite validating WebGPU support, model loading, and completion generation. |
| Multimodal Example UI | examples/multimodal/index.html | Added hidden #output_timings panel with prompt and generation rate fields. Preloads bliss.png on init. Streaming onData now extracts timing metrics and displays the timings panel. Added bottom padding to page body. |
| CI & Test Scripts | .github/workflows/ci.yml, package.json | Added test:wgpu npm script with WEBGPU=1 flag and Safari test script. CI workflow includes TODO comment on adding WebGPU tests once ShaderF16 support is available. |
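
To illustrate the async wrapper change summarized above, here is a minimal sketch of building a caller around Emscripten's `cwrap` with its `{ async: true }` option. The helper name `makeCaller` and the `CwrapFn` type are hypothetical, and the real callWrapper in src/workers-code/llama-cpp.js may be structured differently; the return and argument types passed in are for the caller to decide.

```typescript
// Sketch only: `makeCaller` and `CwrapFn` are hypothetical names.
// Emscripten's cwrap() accepts an options object; with { async: true } the
// wrapped function returns a Promise when the call suspends (Asyncify/JSPI).
type CwrapFn = (
  name: string,
  returnType: string | null,
  argTypes: string[],
  opts?: { async?: boolean }
) => (...args: unknown[]) => unknown;

function makeCaller(
  cwrap: CwrapFn,
  name: string,
  returnType: string | null,
  argTypes: string[],
  isAsync: boolean
): (...args: unknown[]) => Promise<unknown> {
  const fn = cwrap(name, returnType, argTypes, { async: isAsync });
  // Awaiting a non-Promise value is harmless, so one path serves both modes.
  return async (...args: unknown[]): Promise<unknown> => await fn(...args);
}
```

With a real Emscripten module, `cwrap` would be `Module.cwrap`, and exports such as wllama_start and wllama_action would be wrapped with `isAsync` set to true.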

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • ngxson/wllama#211: Modifies src/workers-code/llama-cpp.js async function handling and BigInt pointer/Memory64 logic in the same worker WASM binding layer that this PR updates with async cwrap wrappers.
  • ngxson/wllama#187: Updates the llama.cpp subproject commit reference, related to the same submodule bump in this PR.
  • ngxson/wllama#214: Changes build/worker initialization pipeline, CMake configuration, and unified WASM module handling that overlaps with this PR's worker.ts JSPI stub injection and build configuration updates.

Poem

🐰 GPU dreams bloom bright,
JSPI bridges the night,
Async wraps in flight—
WebGPU burning with might!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Initial webgpu support' clearly and directly summarizes the main objective of the changeset, which adds WebGPU support to the project across multiple files and components. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


reeselevine (Collaborator) commented May 11, 2026

Great to see this getting integrated! Needing only one wasm blob is really nice, and the stub approach for non-JSPI supporting browsers is neat.

As far as browser performance, I've also found that Chrome > Safari > Firefox, although that was using asyncify in Safari/Firefox. I don't think this is necessarily wllama/webgpu specific either, on my M3 I also see really bad performance by Transformers.js and WebLLM on up-to-date Firefox browsers.

I'm happy to work on other aspects of the WebGPU integration too depending on what you'd like to include in wllama. I might take a few days to get to anything, so also happy to just give feedback too since it seems like you're moving pretty fast right now :). The one thing that I think is pretty useful is loading models directly from OPFS into WebGPU buffers, since it can reduce memory overhead quite a bit and model splitting isn't needed. That design might need to be thought about though if models are only partially offloaded.

Otherwise full browser support would be great, although you're probably right most interested users are probably on Chromium-based browsers. Hopefully jspi and 64-bit memory can be fully supported soon by browsers, it seems like JSPI is closer based on https://webassembly.org/features/. Not sure why Safari hasn't started working on Memory64 yet :(.

ngxson (Owner, Author) commented May 11, 2026

The one thing that I think is pretty useful is loading models directly from OPFS into WebGPU buffers, since it can reduce memory overhead quite a bit and model splitting isn't needed.

Yes, I'm already aware of this issue. We're planning to support loading GGUF from a callback (ggml-org/llama.cpp#22341), which will allow wiring up OPFS directly to the GGUF loader in the near future.

Edit: and yes, feel free to give feedback / review this PR if you have time, thanks!

@thomas-0816

I also see really bad performance by Transformers.js and WebLLM on up-to-date Firefox browsers.

I've recently tested Transformers.js 4.2 using AMD 7840U, Ubuntu 24.04 on Firefox 150 vs Ungoogled-Chromium 147 (--enable-unsafe-webgpu --enable-features=Vulkan) with Qwen3-4B-ONNX:q4f16 and found Firefox 2 times slower than Chromium on the same code.
GPU usage on Chromium (radeontop): 90%
GPU usage on Firefox: 65-70%
GPU usage using llama.cpp is normally 95-97%

ngxson marked this pull request as ready for review May 11, 2026 16:18

coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
examples/multimodal/index.html (1)

244-244: ⚡ Quick win

Avoid logging every streamed chunk in the hot path.

Line [244] logs each chunk during generation; this can reduce measured token throughput and adds noisy console output. Consider gating behind a debug flag or removing it.

Suggested fix
+      const DEBUG_STREAM_CHUNKS = false;
...
-            console.log('Received chunk:', chunk);
+            if (DEBUG_STREAM_CHUNKS) console.log('Received chunk:', chunk);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/multimodal/index.html` at line 244, The hot-path
console.log('Received chunk:', chunk) inside the streaming callback is too noisy
and harms throughput; remove that unconditional log or gate it behind a runtime
debug flag (e.g., DEBUG or verbose) and only log when the flag is true so
streaming performance isn't impacted—locate the streaming callback that receives
the chunk variable in examples/multimodal/index.html and replace the direct
console.log with a conditional log controlled by the debug flag.
scripts/docker-compose.yml (1)

24-34: 💤 Low value

Consider cleanup and version documentation.

The Dawn package integration looks correct. A couple of minor suggestions:

  1. The emdawn.zip file remains after extraction. Consider adding rm emdawn.zip after line 32 to save disk space during builds.
  2. Consider adding a comment explaining how to update DAWN_TAG when newer versions are needed.
♻️ Optional cleanup
         python3 -c "import zipfile; zf=zipfile.ZipFile('emdawn.zip','r'); zf.extractall('/source/build/emdawn'); zf.close()"
+        rm emdawn.zip

         emcmake cmake .. -DGGML_WEBGPU=ON -DGGML_WEBGPU_JSPI=ON -DEMDAWNWEBGPU_DIR="$${EMDAWNWEBGPU_DIR}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/docker-compose.yml` around lines 24 - 34, Add cleanup and a brief
version-update note: after extracting emdawn.zip (the python3 zipfile.extractall
call that creates emdawn.zip), remove the downloaded archive (rm emdawn.zip) to
free build disk space, and add a comment above the DAWN_TAG declaration
explaining how to update DAWN_TAG/EMDAWN_PKG (e.g., where to find new release
tags or naming pattern) so future maintainers know how to bump the Dawn version
and regenerate EMDAWN_PKG/EMDAWNWEBGPU_DIR.
vitest.config.ts (1)

29-47: 💤 Low value

Consider documenting that WebGPU tests require Chromium.

The providerOptions logic correctly handles WebGPU mode with Chromium-specific flags, but when WEBGPU=1 is set alongside BROWSER=safari, the config will use Chromium launch args with the playwright provider anyway (since the WEBGPU check comes first). This is likely intentional since Safari doesn't support the enable-unsafe-webgpu flag, but a brief comment would clarify the expected behavior.
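
The precedence described above can be sketched as a small selection function. This is an illustration only: `pickLaunchOptions` and the exact argument lists are hypothetical, and the actual vitest.config.ts may be organized differently. The point is that the WEBGPU check runs first, so it wins over BROWSER=safari.

```typescript
// Sketch only: hypothetical helper showing the precedence, not the real config.
type Env = Record<string, string | undefined>;

const chromeArgsWebGPU: string[] = ["--enable-unsafe-webgpu"];

function pickLaunchOptions(env: Env): { browser: string; args: string[] } {
  // WEBGPU is checked first: WebGPU tests always run on Chromium,
  // even if BROWSER=safari is also set.
  if (env["WEBGPU"] === "1") {
    return { browser: "chromium", args: chromeArgsWebGPU };
  }
  // Playwright drives Safari's engine under the name "webkit".
  if (env["BROWSER"] === "safari") {
    return { browser: "webkit", args: [] };
  }
  return { browser: "chromium", args: [] };
}
```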

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@vitest.config.ts` around lines 29 - 47, Add a short inline comment above the
providerOptions conditional explaining that when WEBGPU is true the config
intentionally uses Chromium-specific launch args (chromeArgsWebGPU) and will
override BROWSER=safari/SAFARI because Safari does not support the
enable-unsafe-webgpu flag; reference WEBGPU, SAFARI, and chromeArgsWebGPU so
future readers understand this precedence and why Chromium is required for
WebGPU tests.
src/wllama.ts (1)

473-474: 💤 Low value

Document the default behavior change for n_gpu_layers.

The default value of n_gpu_layers: 99999 means WebGPU offloading is enabled by default when available. This is a significant behavior change that users should be aware of. Consider adding a note in the LoadModelParams type documentation or README to clarify that GPU offloading is now opt-out rather than opt-in.
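
A minimal sketch of the defaulting behavior discussed above, assuming a hypothetical `LoadParamsSketch` type that mirrors only the n_gpu_layers field of LoadModelParams. It shows why an explicit 0 still disables offloading: the `??` operator falls back only on null or undefined, not on 0.

```typescript
// Sketch only: `LoadParamsSketch` and `resolveGpuLayers` are hypothetical names.
interface LoadParamsSketch {
  n_gpu_layers?: number;
}

// Default named in the review comment: large enough to offload every layer.
const DEFAULT_GPU_LAYERS = 99999;

function resolveGpuLayers(params: LoadParamsSketch): number {
  // `??` keeps an explicit 0 (opt-out), unlike `||`, which would replace it.
  return params.n_gpu_layers ?? DEFAULT_GPU_LAYERS;
}
```

Under this sketch, omitting the field opts in to full offloading, while passing `{ n_gpu_layers: 0 }` keeps everything on the CPU.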

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/wllama.ts` around lines 473 - 474, Update the documentation to call out
that n_gpu_layers now defaults to 99999 (in the LoadModelParams type and any
public API docs/README), explaining this enables WebGPU/GPU offloading by
default when available (i.e., it is opt-out not opt-in); add a short JSDoc
comment on the LoadModelParams type and/or the function that consumes it (where
n_gpu_layers is used) noting the default and how to disable offloading (set
n_gpu_layers to 0 or an explicit value), and update any README/usage examples to
reflect the new default behavior.
src/workers-code/llama-cpp.js (1)

247-251: 💤 Low value

Minor: Consider adding explicit await for zero-arg async functions.

For consistency, the zero-arg branch (line 250) could explicitly await when isAsync is true, matching the two-arg branch pattern. While the current code works correctly due to async function semantics, explicit awaiting makes the intent clearer:

       if (args.length === 2) {
         result = isAsync ? await fn(action, req) : fn(action, req);
       } else {
-        result = fn();
+        result = isAsync ? await fn() : fn();
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/workers-code/llama-cpp.js` around lines 247 - 251, The zero-arg call path
currently does not explicitly await async functions; update the args.length ===
2 else branch so that when isAsync is true you await fn() (e.g., replace result
= fn() with result = isAsync ? await fn() : fn()), keeping the existing
variables fn, isAsync and result and preserving the surrounding async context
used where this args.length check occurs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/multimodal/index.html`:
- Around line 123-127: The preloading of bliss.png is currently awaited inside
main() (variables/expressions: fetch('./bliss.png'), response, imageData,
elemPreviewImage) which will reject main() if the network fails; wrap the
fetch/arrayBuffer/DOM assignment in a try/catch so failures are caught and
ignored (or handled with a fallback) and do not abort initialization, or else
perform the fetch asynchronously (fire-and-forget Promise) so the rest of main()
proceeds even if the image load fails; ensure any DOM updates to
elemPreviewImage occur only on success and that errors are logged but not
rethrown.
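
One way the suggestion above could look, as a minimal sketch: the helper `preloadImage` and its injected `fetchFn` parameter are hypothetical (injected so the logic can run without a network), and the real page code works with elemPreviewImage directly. Failures are logged and turned into `null` instead of rejecting.

```typescript
// Sketch only: `preloadImage` and `FetchLike` are hypothetical names.
type FetchLike = (
  url: string
) => Promise<{ ok: boolean; arrayBuffer(): Promise<ArrayBuffer> }>;

// Resolves with the image bytes, or null if the preload fails for any reason.
async function preloadImage(
  fetchFn: FetchLike,
  url: string
): Promise<ArrayBuffer | null> {
  try {
    const response = await fetchFn(url);
    if (!response.ok) return null; // HTTP error: skip the preview
    return await response.arrayBuffer();
  } catch (err) {
    // Log but do not rethrow, so initialization can continue.
    console.warn("Image preload failed, continuing without it:", err);
    return null;
  }
}
```

The caller would update the preview element only when the result is non-null, so a failed fetch of ./bliss.png no longer aborts main().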


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: faacd835-fef2-41a6-9555-d6192499dd05

📥 Commits

Reviewing files that changed from the base of the PR and between 9dd26ee and 46a3c3d.

⛔ Files ignored due to path filters (1)
  • src/wasm/wllama.wasm is excluded by !**/*.wasm
📒 Files selected for processing (17)
  • .github/workflows/ci.yml
  • CMakeLists.txt
  • cpp/wllama-context.h
  • examples/multimodal/index.html
  • llama.cpp
  • package.json
  • scripts/docker-compose.yml
  • src/types/oai-compat.ts
  • src/types/types.ts
  • src/utils.ts
  • src/wasm/wllama.js
  • src/wllama.ts
  • src/wllama.wgpu.test.ts
  • src/worker.ts
  • src/workers-code/generated.ts
  • src/workers-code/llama-cpp.js
  • vitest.config.ts

Comment thread examples/multimodal/index.html
ngxson merged commit 62aa7a8 into master on May 11, 2026
5 checks passed

3 participants