Initial webgpu support by ngxson · Pull Request #215 · ngxson/wllama

ngxson · 2026-05-10T21:53:01Z

This PR is partially based on #201; huge thanks to @reeselevine.

The main goal of this PR is to have one single build that supports WebGPU, single-threaded, and multi-threaded execution, where each config can be toggled at runtime.

This is achieved by building on top of #214, with JSPI enabled for WebGPU's async WaitAny. On platforms that don't support JSPI, this is stubbed, so wllama can still be used there, just without WebGPU support.

Being JSPI-only is the one compromise of this implementation. This is because binaryen's asyncify doesn't support wasm exceptions, resulting in huge overhead in both performance and binary size. The result is that Firefox support requires manually enabling javascript.options.wasm_js_promise_integration in about:config, but I think this is an acceptable compromise, as most users will use this from Chromium-based browsers.
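
The runtime toggle described above depends on feature detection. Here is a minimal sketch of what such a check might look like, assuming JSPI is detected through `WebAssembly.Suspending` and WebGPU through `navigator.gpu`. The `EnvLike` shape, the `pickBackend` helper, and the `SharedArrayBuffer` check for threads are hypothetical and are not taken from wllama's source:

```typescript
// Hypothetical view of the globals being inspected; passing it in keeps the sketch testable.
interface EnvLike {
  WebAssembly?: { Suspending?: unknown };
  navigator?: { gpu?: unknown };
  SharedArrayBuffer?: unknown;
}

type Backend = 'webgpu' | 'multi-thread' | 'single-thread';

// JSPI is considered available when WebAssembly.Suspending exists.
function supportsJSPI(env: EnvLike): boolean {
  return typeof env.WebAssembly?.Suspending === 'function';
}

// WebGPU is only usable in this design when JSPI is available as well.
function supportsWebGPU(env: EnvLike): boolean {
  return env.navigator?.gpu != null && supportsJSPI(env);
}

// Assumption: multi-threading needs SharedArrayBuffer to be exposed.
function pickBackend(env: EnvLike): Backend {
  if (supportsWebGPU(env)) return 'webgpu';
  if (typeof env.SharedArrayBuffer === 'function') return 'multi-thread';
  return 'single-thread';
}
```

In a browser, the argument would be the global object. A Firefox build without the JSPI flag would fall through to one of the CPU paths in this sketch.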

Performance (MacBook M5):

- On the multimodal demo, I got 171 t/s for generation and 592 t/s for prompt processing (running on the latest Chrome).
- On Firefox, the basic demo runs extremely slowly compared to single-thread. Still not sure why.
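
For reference, a t/s figure like the ones above is a token count divided by elapsed time. This is a minimal sketch, assuming each phase reports a token count and a duration in milliseconds. The `PhaseTimings` shape and its field names are hypothetical and may not match the `ResultTimings` interface added in this PR:

```typescript
// Hypothetical per-phase measurement (prompt processing or generation).
interface PhaseTimings {
  tokens: number; // number of tokens handled in this phase
  ms: number;     // wall-clock duration of the phase in milliseconds
}

// Converts a measurement to tokens per second; returns 0 for an empty or invalid duration.
function tokensPerSecond(t: PhaseTimings): number {
  if (t.ms <= 0) return 0;
  return (t.tokens * 1000) / t.ms;
}
```

With illustrative inputs, 342 tokens generated in 2000 ms works out to 171 t/s.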

TODO:

- Expose params to control GPU layers
- Add tests
- (follow-up PR) Allow -fit params to automatically determine n_gpu_layers and n_ctx

Summary by CodeRabbit

New Features
- Added WebGPU acceleration support with Flash Attention optimization
- Display real-time generation timings in multimodal demo
- Configurable GPU layer offloading (n_gpu_layers parameter)
- Automatic browser capability detection for WebGPU/JSPI with Firefox warnings
- Preload example images in multimodal demo
Tests
- Added WebGPU test suite for validation

coderabbitai · 2026-05-10T21:53:06Z

Walkthrough

This PR implements WebGPU GPU acceleration for wllama by adding type definitions for timing metrics, browser feature detection, JSPI runtime compatibility, async WASM function wrappers, worker parameter updates, build configuration for GPU libraries, test infrastructure, and example UI enhancements.

Changes

WebGPU Feature Implementation

| Layer / File(s) | Summary |
| --- | --- |
| Type System & API Surface: `src/types/oai-compat.ts`, `src/types/types.ts` | Added `ResultTimings` interface with cache/prompt/generation metrics; extended `ChatCompletionChunk`, `RawCompletionResponse`, and `RawCompletionChunk` with optional `timings` field. Added `n_gpu_layers` parameter to `LoadModelParams`. |
| Build Configuration: `CMakeLists.txt`, `scripts/docker-compose.yml` | Split Emscripten compile/link flags into separate entries; added JSPI exports for `wllama_start` and `wllama_action`. Docker Compose now downloads Dawn WebGPU library and enables GPU-related CMake flags (`-DGGML_WEBGPU=ON`, `-DGGML_WEBGPU_JSPI=ON`). |
| Browser Feature Detection: `src/utils.ts` | Added `isSupportJSPI()` to detect `WebAssembly.Suspending`, `isSupportWebGPU()` to check both GPU and JSPI availability, and `isFirefox()` user-agent helper. |
| JSPI Runtime Compatibility: `src/worker.ts` | Injected `JSPI_STUB` polyfill that defines `WebAssembly.Suspending` and `WebAssembly.promising` fallbacks when JSPI is unavailable. |
| Async WASM Function Wrappers: `src/workers-code/llama-cpp.js` | Updated `callWrapper` to accept `isAsync` flag and configure `Module.cwrap` with async mode. Wrapped `wllamaStart` and `wllamaAction` as async-capable cwrap calls. |
| Core Class Integration: `src/wllama.ts` | Imported `isSupportJSPI` for Firefox JSPI warnings in constructor. Added `isSupportWebGPU()` public method. Updated `loadModel` to pass `n_gpu_layers` with default `params.n_gpu_layers ?? 99999` and `n_ctx` via nullish coalescing to worker. |
| C++ Backend: `cpp/wllama-context.h` | Added `flash_attn` flag handling in `action_load` to configure flash attention type. Reformatted `reasoning` and `default_template_kwargs` parsing and embedding task input/content selection without changing behavior. Updated `llama.cpp` subproject commit. |
| Test Infrastructure: `vitest.config.ts`, `src/wllama.wgpu.test.ts` | Vitest now conditionally filters tests by `WEBGPU=1` env var, running only `.wgpu.test.` files with dedicated Chrome WebGPU launch args. Added test suite validating WebGPU support, model loading, and completion generation. |
| Multimodal Example UI: `examples/multimodal/index.html` | Added hidden `#output_timings` panel with prompt and generation rate fields. Preloads `bliss.png` on init. Streaming `onData` now extracts timing metrics and displays the timings panel. Added bottom padding to page body. |
| CI & Test Scripts: `.github/workflows/ci.yml`, `package.json` | Added `test:wgpu` npm script with `WEBGPU=1` flag and Safari test script. CI workflow includes TODO comment on adding WebGPU tests once ShaderF16 support is available. |
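
The `JSPI_STUB` idea from the table can be illustrated as follows. This is a sketch under assumptions, not the code injected by `src/worker.ts`: the `installJspiStub` name and the pass-through behavior of the fallbacks are hypothetical. The only part taken from the PR is that `WebAssembly.Suspending` and `WebAssembly.promising` get defined when JSPI is unavailable.

```typescript
// Minimal view of the WebAssembly namespace that the stub touches.
interface WasmNamespaceLike {
  Suspending?: unknown;
  promising?: unknown;
}

type AnyFn = (...args: unknown[]) => unknown;

// Installs pass-through fallbacks when JSPI is missing.
// Returns true if the stub was installed, false if both entry points already exist.
function installJspiStub(wasm: WasmNamespaceLike): boolean {
  if (typeof wasm.Suspending === 'function' && typeof wasm.promising === 'function') {
    return false;
  }
  // A constructor that returns an object (here, a function) makes
  // `new Suspending(fn)` evaluate to `fn` itself.
  wasm.Suspending = function (fn: AnyFn): AnyFn {
    return fn;
  };
  // Keep the "returns a Promise" contract of the real API.
  wasm.promising = (fn: AnyFn) => async (...args: unknown[]) => fn(...args);
  return true;
}
```

A stub of this kind cannot suspend WebAssembly execution, which is consistent with the description's point that WebGPU is unavailable on such platforms while the CPU paths keep working.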

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

ngxson/wllama#211: Modifies src/workers-code/llama-cpp.js async function handling and BigInt pointer/Memory64 logic in the same worker WASM binding layer that this PR updates with async cwrap wrappers.
ngxson/wllama#187: Updates the llama.cpp subproject commit reference, related to the same submodule bump in this PR.
ngxson/wllama#214: Changes build/worker initialization pipeline, CMake configuration, and unified WASM module handling that overlaps with this PR's worker.ts JSPI stub injection and build configuration updates.

Poem

🐰 GPU dreams bloom bright,
JSPI bridges the night,
Async wraps in flight—
WebGPU burning with might! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Initial webgpu support' clearly and directly summarizes the main objective of the changeset, which adds WebGPU support to the project across multiple files and components. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


reeselevine · 2026-05-11T04:06:15Z

Great to see this getting integrated! Needing only one wasm blob is really nice, and the stub approach for non-JSPI supporting browsers is neat.

As far as browser performance, I've also found that Chrome > Safari > Firefox, although that was using asyncify in Safari/Firefox. I don't think this is necessarily wllama/webgpu specific either, on my M3 I also see really bad performance by Transformers.js and WebLLM on up-to-date Firefox browsers.

I'm happy to work on other aspects of the WebGPU integration too depending on what you'd like to include in wllama. I might take a few days to get to anything, so also happy to just give feedback too since it seems like you're moving pretty fast right now :). The one thing that I think is pretty useful is loading models directly from OPFS into WebGPU buffers, since it can reduce memory overhead quite a bit and model splitting isn't needed. That design might need to be thought about though if models are only partially offloaded.

Otherwise full browser support would be great, although you're probably right most interested users are probably on Chromium-based browsers. Hopefully jspi and 64-bit memory can be fully supported soon by browsers, it seems like JSPI is closer based on https://webassembly.org/features/. Not sure why Safari hasn't started working on Memory64 yet :(.

ngxson · 2026-05-11T10:10:48Z

> The one thing that I think is pretty useful is loading models directly from OPFS into WebGPU buffers, since it can reduce memory overhead quite a bit and model splitting isn't needed.

Yes, I'm already aware of this issue. We're planning to support loading GGUF from a callback (ggml-org/llama.cpp#22341), which will allow wiring up OPFS directly to the GGUF loader in the near future.

Edit: and yes, feel free to give feedback / review this PR if you have time, thanks!

thomas-0816 · 2026-05-11T11:54:02Z

> I also see really bad performance by Transformers.js and WebLLM on up-to-date Firefox browsers.

I've recently tested Transformers.js 4.2 using AMD 7840U, Ubuntu 24.04 on Firefox 150 vs Ungoogled-Chromium 147 (--enable-unsafe-webgpu --enable-features=Vulkan) with Qwen3-4B-ONNX:q4f16 and found Firefox 2 times slower than Chromium on the same code.
GPU usage on Chromium (radeontop): 90%
GPU usage on Firefox: 65-70%
GPU usage using llama.cpp is normally 95-97%

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (5)

examples/multimodal/index.html (1)
244-244: ⚡ Quick win

Avoid logging every streamed chunk in the hot path.

Line [244] logs each chunk during generation; this can reduce measured token throughput and adds noisy console output. Consider gating behind a debug flag or removing it.
Suggested fix
+      const DEBUG_STREAM_CHUNKS = false;
...
-            console.log('Received chunk:', chunk);
+            if (DEBUG_STREAM_CHUNKS) console.log('Received chunk:', chunk);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/multimodal/index.html` at line 244, The hot-path
console.log('Received chunk:', chunk) inside the streaming callback is too noisy
and harms throughput; remove that unconditional log or gate it behind a runtime
debug flag (e.g., DEBUG or verbose) and only log when the flag is true so
streaming performance isn't impacted—locate the streaming callback that receives
the chunk variable in examples/multimodal/index.html and replace the direct
console.log with a conditional log controlled by the debug flag.
scripts/docker-compose.yml (1)
24-34: 💤 Low value

Consider cleanup and version documentation.

The Dawn package integration looks correct. A couple of minor suggestions:

The emdawn.zip file remains after extraction. Consider adding rm emdawn.zip after line 32 to save disk space during builds.

Consider adding a comment explaining how to update DAWN_TAG when newer versions are needed.
♻️ Optional cleanup
         python3 -c "import zipfile; zf=zipfile.ZipFile('emdawn.zip','r'); zf.extractall('/source/build/emdawn'); zf.close()"
+        rm emdawn.zip

         emcmake cmake .. -DGGML_WEBGPU=ON -DGGML_WEBGPU_JSPI=ON -DEMDAWNWEBGPU_DIR="$${EMDAWNWEBGPU_DIR}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/docker-compose.yml` around lines 24 - 34, Add cleanup and a brief
version-update note: after extracting emdawn.zip (the python3 zipfile.extractall
call that unpacks emdawn.zip), remove the downloaded archive (rm emdawn.zip) to
free build disk space, and add a comment above the DAWN_TAG declaration
explaining how to update DAWN_TAG/EMDAWN_PKG (e.g., where to find new release
tags or naming pattern) so future maintainers know how to bump the Dawn version
and regenerate EMDAWN_PKG/EMDAWNWEBGPU_DIR.
vitest.config.ts (1)
29-47: 💤 Low value

Consider documenting that WebGPU tests require Chromium.

The providerOptions logic correctly handles WebGPU mode with Chromium-specific flags, but when WEBGPU=1 is set alongside BROWSER=safari, the config will use Chromium launch args with the playwright provider anyway (since the WEBGPU check comes first). This is likely intentional since Safari doesn't support the enable-unsafe-webgpu flag, but a brief comment would clarify the expected behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@vitest.config.ts` around lines 29 - 47, Add a short inline comment above the
providerOptions conditional explaining that when WEBGPU is true the config
intentionally uses Chromium-specific launch args (chromeArgsWebGPU) and will
override BROWSER=safari/SAFARI because Safari does not support the
enable-unsafe-webgpu flag; reference WEBGPU, SAFARI, and chromeArgsWebGPU so
future readers understand this precedence and why Chromium is required for
WebGPU tests.
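
The precedence described in this comment can be sketched as a small function. This is an illustration under assumptions, not the contents of `vitest.config.ts`: the `pickBrowser` helper, the `BrowserChoice` shape, and the mapping of Safari to a `webkit` browser name are hypothetical, and `--enable-unsafe-webgpu` is the only launch flag named in the review.

```typescript
// Launch args for WebGPU runs; only this flag is mentioned in the review.
const chromeArgsWebGPU: string[] = ['--enable-unsafe-webgpu'];

// Hypothetical result shape for the sketch.
interface BrowserChoice {
  browser: 'chromium' | 'webkit';
  args: string[];
}

function pickBrowser(env: Record<string, string | undefined>): BrowserChoice {
  const WEBGPU = env['WEBGPU'] === '1';
  const SAFARI = env['BROWSER'] === 'safari';
  // WEBGPU is checked first, so it wins even when BROWSER=safari is also set:
  // the Chromium-specific flag has no Safari equivalent.
  if (WEBGPU) return { browser: 'chromium', args: chromeArgsWebGPU };
  if (SAFARI) return { browser: 'webkit', args: [] };
  return { browser: 'chromium', args: [] };
}
```

In a real config the argument would be `process.env`, so running with `WEBGPU=1 BROWSER=safari` would still select Chromium in this sketch.
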
src/wllama.ts (1)
473-474: 💤 Low value

Document the default behavior change for n_gpu_layers.

The default value of n_gpu_layers: 99999 means WebGPU offloading is enabled by default when available. This is a significant behavior change that users should be aware of. Consider adding a note in the LoadModelParams type documentation or README to clarify that GPU offloading is now opt-out rather than opt-in.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/wllama.ts` around lines 473 - 474, Update the documentation to call out
that n_gpu_layers now defaults to 99999 (in the LoadModelParams type and any
public API docs/README), explaining this enables WebGPU/GPU offloading by
default when available (i.e., it is opt-out not opt-in); add a short JSDoc
comment on the LoadModelParams type and/or the function that consumes it (where
n_gpu_layers is used) noting the default and how to disable offloading (set
n_gpu_layers to 0 or an explicit value), and update any README/usage examples to
reflect the new default behavior.
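
To make the opt-out behavior concrete, here is a minimal sketch built on the `params.n_gpu_layers ?? 99999` expression quoted in the walkthrough. The `LoadModelParamsLike` type and the `resolveGpuLayers` helper are hypothetical names used only for illustration:

```typescript
// Hypothetical subset of the load parameters relevant to this comment.
interface LoadModelParamsLike {
  n_gpu_layers?: number;
}

// Applies the default described in the review.
function resolveGpuLayers(params: LoadModelParamsLike): number {
  // `??` only falls back on null or undefined, so an explicit 0 is preserved
  // (with `||`, a value of 0 would be replaced by the default).
  return params.n_gpu_layers ?? 99999;
}
```

Under this reading, omitting the parameter requests offloading of all layers, while passing `n_gpu_layers: 0` keeps everything on the CPU.
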
src/workers-code/llama-cpp.js (1)
247-251: 💤 Low value

Minor: Consider adding explicit await for zero-arg async functions.

For consistency, the zero-arg branch (line 250) could explicitly await when isAsync is true, matching the two-arg branch pattern. While the current code works correctly due to async function semantics, explicit awaiting makes the intent clearer:
       if (args.length === 2) {
         result = isAsync ? await fn(action, req) : fn(action, req);
       } else {
-        result = fn();
+        result = isAsync ? await fn() : fn();
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/workers-code/llama-cpp.js` around lines 247 - 251, The zero-arg call path
currently does not explicitly await async functions; update the args.length ===
2 else branch so that when isAsync is true you await fn() (e.g., replace result
= fn() with result = isAsync ? await fn() : fn()), keeping the existing
variables fn, isAsync and result and preserving the surrounding async context
used where this args.length check occurs.
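
The reviewer's remark that the current code already works "due to async function semantics" can be checked with a simplified, hypothetical version of the wrapper. This sketch is not the `callWrapper` from `src/workers-code/llama-cpp.js`: it drops the `action`/`req` handling and only models the `isAsync` flag.

```typescript
type MaybeAsyncFn = (...args: number[]) => number | Promise<number>;

// Wraps a function so that callers always receive a Promise.
function callWrapper(fn: MaybeAsyncFn, isAsync: boolean) {
  return async (...args: number[]): Promise<number> => {
    // Awaiting is explicit when isAsync is set. Without it, a Promise returned
    // from an async function is still flattened for the caller.
    const result = isAsync ? await fn(...args) : fn(...args);
    return result;
  };
}
```

Both `callWrapper(asyncFn, true)` and `callWrapper(asyncFn, false)` resolve to the same value here, which is why the suggestion is about clarity of intent and not about behavior.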

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/multimodal/index.html`:
- Around line 123-127: The preloading of bliss.png is currently awaited inside
main() (variables/expressions: fetch('./bliss.png'), response, imageData,
elemPreviewImage) which will reject main() if the network fails; wrap the
fetch/arrayBuffer/DOM assignment in a try/catch so failures are caught and
ignored (or handled with a fallback) and do not abort initialization, or else
perform the fetch asynchronously (fire-and-forget Promise) so the rest of main()
proceeds even if the image load fails; ensure any DOM updates to
elemPreviewImage occur only on success and that errors are logged but not
rethrown.
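
The try/catch approach suggested for the `bliss.png` preload can be sketched as follows. The `preloadImage` helper and the `FetchLike` type are hypothetical; the example page may structure this differently:

```typescript
// Minimal fetch-like signature so the sketch can be exercised without a network.
type FetchLike = (url: string) => Promise<{ ok: boolean; arrayBuffer(): Promise<ArrayBuffer> }>;

// Resolves to the image bytes, or null when the request fails for any reason.
async function preloadImage(url: string, fetchFn: FetchLike): Promise<ArrayBuffer | null> {
  try {
    const response = await fetchFn(url);
    if (!response.ok) throw new Error(`unexpected response for ${url}`);
    return await response.arrayBuffer();
  } catch (err) {
    // Log and continue: a missing example image should not abort initialization.
    console.warn('Preload failed, continuing without a preview image:', err);
    return null;
  }
}
```

Inside `main()`, the result would be checked before touching the DOM, so `elemPreviewImage` is only updated when the returned value is not null.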



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: faacd835-fef2-41a6-9555-d6192499dd05

📥 Commits

Reviewing files that changed from the base of the PR and between 9dd26ee and 46a3c3d.

⛔ Files ignored due to path filters (1)

src/wasm/wllama.wasm is excluded by !**/*.wasm

📒 Files selected for processing (17)

.github/workflows/ci.yml
CMakeLists.txt
cpp/wllama-context.h
examples/multimodal/index.html
llama.cpp
package.json
scripts/docker-compose.yml
src/types/oai-compat.ts
src/types/types.ts
src/utils.ts
src/wasm/wllama.js
src/wllama.ts
src/wllama.wgpu.test.ts
src/worker.ts
src/workers-code/generated.ts
src/workers-code/llama-cpp.js
vitest.config.ts

initial webgpu support

8b90496

ngxson added 2 commits May 11, 2026 13:05

fix multi-thread on firefox

f6c2952

format

2dd8e93

ngxson added 2 commits May 11, 2026 17:00

clean up

f9e0f63

add test

46a3c3d

ngxson marked this pull request as ready for review May 11, 2026 16:18

coderabbitai Bot reviewed May 11, 2026

Comment thread examples/multimodal/index.html

update docs

832615e

ngxson merged commit 62aa7a8 into master May 11, 2026
5 checks passed

This was referenced May 11, 2026

change CONFIG_PATH 'wllama.wasm' to 'default' #216

Merged

Add support for async file read #221

Merged

ngxson merged 6 commits into master from xsn/wgpu_init_support

3 participants
