Overview

The WebLLM provider enables local LLM inference directly in the browser using WebGPU acceleration through the MLC engine. Models are downloaded and cached locally, providing privacy-focused chat capabilities without server dependencies.

Architecture

WebLLMEngineManager

Singleton manager that handles model initialization, loading, and inference streaming.
import { getWebLLMEngineManager } from './llm/webllm/engine';

const manager = getWebLLMEngineManager();

Configuration

Provider Definition

From src/llm/providers.ts:180-186:
{
  id: "webllm",
  label: "Local (Browser WebLLM)",
  kind: "local",
  defaultBaseUrl: "",
  defaultModel: "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  requiresApiKey: false,
}
  • id (string, required) - Provider identifier: "webllm"
  • kind (string) - Provider type: "local" (runs in the browser)
  • defaultModel (string) - Default model: "Llama-3.2-1B-Instruct-q4f16_1-MLC"
  • requiresApiKey (boolean) - API key required: false

Model Catalog

Available Models

From src/llm/webllm/modelCatalog.ts:12-56:
  • Llama-3.2-1B-Instruct-q4f16_1-MLC - tier: balanced, size: ~700 MB. Primary default for strong desktops.
  • SmolLM2-360M-Instruct-q4f16_1-MLC - tier: ultra-low, size: ~250 MB. Mobile and weak-device fallback.
  • Qwen2.5-1.5B-Instruct-q4f16_1-MLC - tier: quality, size: ~1100 MB. Strong for technical summaries.
  • Gemma-2-2B-Instruct-q4f16_1-MLC - tier: quality, size: ~1400 MB. Polished README summarization.
  • Llama-3.1-3B-Instruct-q4f16_1-MLC - tier: quality, size: ~1900 MB. Fallback substitute when Hermes is unavailable.

Model Selection Functions

import {
  getWebLLMSelectableModels,
  getWebLLMModelProfile,
  isWebLLMModelIdSupported,
} from './llm/webllm/modelCatalog';

// Get all available models
const models = getWebLLMSelectableModels();
// WebLLMModelProfile[]

// Get specific model info
const profile = getWebLLMModelProfile('Llama-3.2-1B-Instruct-q4f16_1-MLC');
// { id, label, tier, approxDownloadMB, notes }

// Check if model is supported
const supported = isWebLLMModelIdSupported('custom-model');
// boolean
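The catalog behind these helpers can be pictured as a flat array of profiles with lookup functions over it. The sketch below is hypothetical (the labels and the two entries are illustrative, not the full catalog from modelCatalog.ts):

```typescript
// Hypothetical sketch of a model catalog and its lookup helpers.
type ModelProfile = {
  id: string;
  tier: 'ultra-low' | 'balanced' | 'quality';
  approxDownloadMB: number;
};

const CATALOG: ModelProfile[] = [
  { id: 'Llama-3.2-1B-Instruct-q4f16_1-MLC', tier: 'balanced', approxDownloadMB: 700 },
  { id: 'SmolLM2-360M-Instruct-q4f16_1-MLC', tier: 'ultra-low', approxDownloadMB: 250 },
];

// Return the profile for a model ID, or undefined when unknown.
function getModelProfile(id: string): ModelProfile | undefined {
  return CATALOG.find((profile) => profile.id === id);
}

// A model is "supported" exactly when it appears in the catalog.
function isModelIdSupported(id: string): boolean {
  return getModelProfile(id) !== undefined;
}
```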

Capability Detection

WebLLMCapability

From src/llm/webllm/capability.ts:8-14:
  • isMobile (boolean) - Whether the device is mobile
  • hasWebGPU (boolean) - Whether WebGPU is available in the browser
  • hardwareConcurrency (number) - Number of logical CPU cores
  • deviceMemoryGB (number | null) - Available device memory in GB (null if unavailable)
  • perfScore (number | null) - Performance benchmark score (null if the probe failed)
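These fields map to standard browser globals (navigator.gpu, navigator.hardwareConcurrency, navigator.deviceMemory). A minimal sketch of how they could be gathered, with the navigator object injected so the function is testable outside a browser (perfScore is omitted since it requires running a benchmark probe):

```typescript
// Sketch: derive capability fields from a navigator-like object.
// Field names mirror the WebLLMCapability shape documented above.
type Capability = {
  isMobile: boolean;
  hasWebGPU: boolean;
  hardwareConcurrency: number;
  deviceMemoryGB: number | null;
};

function detectCapability(nav: {
  userAgent: string;
  gpu?: unknown;            // navigator.gpu exists only when WebGPU is available
  hardwareConcurrency?: number;
  deviceMemory?: number;    // Device Memory API; undefined in some browsers
}): Capability {
  return {
    isMobile: /Android|iPhone|iPad|Mobile/i.test(nav.userAgent),
    hasWebGPU: nav.gpu !== undefined,
    hardwareConcurrency: nav.hardwareConcurrency ?? 1,
    deviceMemoryGB: nav.deviceMemory ?? null,
  };
}
```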

Automatic Model Recommendation

import { recommendWebLLMModel } from './llm/webllm/capability';

const recommendation = await recommendWebLLMModel();
// {
//   modelId: string,
//   reason: 'mobile' | 'no-webgpu' | 'strong-desktop' | 'weak-desktop' | 'probe-failed',
//   score: number | null,
//   threshold: number | null,
//   capability: WebLLMCapability | null
// }
Recommendation Logic:
  • Mobile devices: SmolLM2-360M-Instruct-q4f16_1-MLC (250 MB)
  • No WebGPU: SmolLM2-360M-Instruct-q4f16_1-MLC (250 MB)
  • Strong desktop (score ≥ 5): Llama-3.2-1B-Instruct-q4f16_1-MLC (700 MB)
  • Weak desktop (score < 5): SmolLM2-360M-Instruct-q4f16_1-MLC (250 MB)
The scoring algorithm evaluates:
  • CPU cores (6+, 8+, 10+ add points)
  • Device memory (6GB+, 8GB+, 16GB+ add points)
  • Performance benchmark (800+, 1100+, 1600+ add points)
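The tiered scoring described above can be sketched as follows. The one-point-per-threshold weighting and the threshold of 5 are assumptions inferred from the listed tiers, not the exact constants in capability.ts:

```typescript
// Sketch of the capability scoring heuristic: each threshold crossed
// adds a point; null inputs (unavailable probes) contribute nothing.
function scoreDevice(
  cores: number,
  memoryGB: number | null,
  benchmark: number | null,
): number {
  let score = 0;
  for (const t of [6, 8, 10]) if (cores >= t) score += 1;
  if (memoryGB !== null) for (const t of [6, 8, 16]) if (memoryGB >= t) score += 1;
  if (benchmark !== null) for (const t of [800, 1100, 1600]) if (benchmark >= t) score += 1;
  return score;
}

// Strong desktops get the 1B default; everything else falls back.
function recommendModelId(score: number, threshold = 5): string {
  return score >= threshold
    ? 'Llama-3.2-1B-Instruct-q4f16_1-MLC'
    : 'SmolLM2-360M-Instruct-q4f16_1-MLC';
}
```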

Engine Initialization

ensureReady

From src/llm/webllm/engine.ts:48-102:
import { getWebLLMEngineManager } from './llm/webllm/engine';

const manager = getWebLLMEngineManager();

await manager.ensureReady(modelId, {
  allowDownload: true,
  onProgress: (progress, text) => {
    console.log(`${Math.round(progress * 100)}%: ${text}`);
  },
});
  • modelId (string, required) - MLC model ID (e.g., "Llama-3.2-1B-Instruct-q4f16_1-MLC")
  • options.allowDownload (boolean, required) - Whether to allow model download. Must be true for initialization.
  • options.onProgress ((progress: number, text: string) => void) - Progress callback; progress is 0-1, text is a status message.
Throws:
  • WebLLMProviderError("WEBLLM_UNSUPPORTED") - WebGPU not available
  • WebLLMProviderError("WEBLLM_DOWNLOAD_REQUIRED") - User consent needed
  • WebLLMProviderError("WEBLLM_INIT_FAILED") - Model loading failed
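The WEBLLM_DOWNLOAD_REQUIRED code implies a consent flow: attempt initialization without download, and retry with consent only when the user agrees. A sketch of that pattern, with the real ensureReady and the consent prompt injected as stand-ins so the flow is self-contained:

```typescript
// Sketch of a download-consent flow. `ensureReady` and `askUser` are
// hypothetical stand-ins for manager.ensureReady and a UI prompt.
async function ensureWithConsent(
  ensureReady: (allowDownload: boolean) => Promise<void>,
  askUser: () => Promise<boolean>,
): Promise<boolean> {
  try {
    await ensureReady(false); // succeeds when the model is already cached
    return true;
  } catch (err) {
    if ((err as { code?: string }).code !== 'WEBLLM_DOWNLOAD_REQUIRED') {
      throw err; // unrelated failure: surface it
    }
    if (!(await askUser())) return false; // user declined the download
    await ensureReady(true);
    return true;
  }
}
```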

Streaming Inference

stream

From src/llm/webllm/engine.ts:104-143:
const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Explain WebGPU.' },
];

const controller = new AbortController();
let output = '';

await manager.stream(
  modelId,
  messages,
  controller.signal,
  (token) => {
    output += token;
  },
);
  • modelId (string, required) - Model ID matching the currently loaded model
  • messages (WebLLMMessage[], required) - Chat messages with role (system/user/assistant) and content
  • signal (AbortSignal, required) - Abort signal to cancel generation
  • onToken ((token: string) => void, required) - Callback invoked for each generated token
Stream Parameters (hardcoded in engine.ts:122-124):
  • temperature: 0.2
  • max_tokens: 700
  • stream: true
Throws:
  • WebLLMProviderError("WEBLLM_INIT_FAILED") - Engine not initialized
  • WebLLMProviderError("WEBLLM_STREAM_FAILED") - Generation failed
  • DOMException("AbortError") - User cancelled
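Since a user-initiated abort surfaces as a DOMException named "AbortError" rather than a WebLLMProviderError, callers typically want to distinguish it from real failures. A small sketch (the helper name is hypothetical; it duck-types on the name so it also matches abort errors from non-DOM environments):

```typescript
// Sketch: treat any error named "AbortError" as a user cancellation,
// so it can be swallowed instead of shown as a failure.
function isUserCancellation(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { name?: string }).name === 'AbortError'
  );
}
```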

Provider Integration

Using via getProviderById

From src/llm/providers.ts:260-281:
import { getProviderById } from './llm/providers';
import type { LLMProviderConfig, LLMStreamRequest } from './llm/types';

const provider = getProviderById('webllm');

const config: LLMProviderConfig = {
  baseUrl: '',
  model: 'Llama-3.2-1B-Instruct-q4f16_1-MLC',
  allowModelDownload: true,
};

const request: LLMStreamRequest = {
  prompt: 'What is semantic search?',
  contextSnippets: ['Context 1: ...', 'Context 2: ...'],
  signal: new AbortController().signal,
  onToken: (token) => console.log(token),
  onInitProgress: (progress, text) => {
    console.log(`Init: ${Math.round(progress * 100)}% - ${text}`);
  },
};

await provider.stream(config, request);

Error Handling

WebLLMProviderError

From src/llm/webllm/engine.ts:8-22:
import { WebLLMProviderError } from './llm/webllm/engine';

try {
  await manager.ensureReady(modelId, { allowDownload: false });
} catch (error) {
  if (error instanceof WebLLMProviderError) {
    console.error(error.code); // "WEBLLM_DOWNLOAD_REQUIRED"
    console.error(error.message);
  }
}
Error Codes:
  • WEBLLM_UNSUPPORTED - WebGPU not available in the browser
  • WEBLLM_DOWNLOAD_REQUIRED - Model download requires user consent
  • WEBLLM_INIT_FAILED - Model initialization failed
  • WEBLLM_STREAM_FAILED - Inference streaming failed

Format Provider Error

From src/llm/providers.ts:295-319:
import { formatProviderError } from './llm/providers';

try {
  await provider.stream(config, request);
} catch (error) {
  const userMessage = formatProviderError(error, 'local');
  console.error(userMessage);
  // "WebLLM model download requires explicit consent."
}

Feature Flag

import { isWebLLMEnabled } from './llm/providers';

if (isWebLLMEnabled()) {
  // VITE_WEBLLM_ENABLED=1 or VITE_WEBLLM_ENABLED=true
}
Set environment variable:
VITE_WEBLLM_ENABLED=1
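The flag check boils down to accepting "1" or "true" (and nothing else) from the environment. A sketch of that parsing, assuming a case-insensitive, whitespace-tolerant reading of the variable:

```typescript
// Sketch: parse an enable flag the way isWebLLMEnabled is described,
// accepting only "1" or "true" (case-insensitive).
function parseEnableFlag(raw: string | undefined): boolean {
  if (raw === undefined) return false;
  const normalized = raw.trim().toLowerCase();
  return normalized === '1' || normalized === 'true';
}
```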

Cache Management

Unload Model

await manager.unload();
// Releases engine and clears active model

Clear Runtime Caches

From src/pages/UsagePage.tsx:552-566:
async function clearWebLLMRuntimeCaches(): Promise<void> {
  if (!('caches' in globalThis)) {
    return;
  }

  const keys = await caches.keys();
  await Promise.all(
    keys
      .filter((key) => {
        const lower = key.toLowerCase();
        return lower.includes('webllm') || lower.includes('mlc') || lower.includes('model');
      })
      .map((key) => caches.delete(key)),
  );
}
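The filter inside clearWebLLMRuntimeCaches can be isolated as a pure predicate, which makes the matching rule testable without a browser Cache API (the helper name is hypothetical):

```typescript
// Sketch: the cache-key match used when clearing WebLLM runtime caches.
// A key is cleared if it mentions webllm, mlc, or model (case-insensitive).
function isWebLLMCacheKey(key: string): boolean {
  const lower = key.toLowerCase();
  return lower.includes('webllm') || lower.includes('mlc') || lower.includes('model');
}
```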

Type Definitions

WebLLMModelProfile

type WebLLMModelProfile = {
  id: string;
  label: string;
  tier: 'ultra-low' | 'balanced' | 'quality';
  approxDownloadMB: number;
  notes?: string;
  experimental?: boolean;
};

WebLLMMessage

type WebLLMMessage = {
  role: 'system' | 'user' | 'assistant';
  content: string;
};

WebLLMRecommendation

type WebLLMRecommendation = {
  modelId: string;
  reason: 'mobile' | 'no-webgpu' | 'strong-desktop' | 'weak-desktop' | 'probe-failed';
  score: number | null;
  threshold: number | null;
  capability: WebLLMCapability | null;
};

Best Practices

  1. Always check WebGPU availability before initializing
  2. Request download consent from users (models are large)
  3. Use model recommendations based on device capabilities
  4. Handle AbortError for user cancellations
  5. Remember that cached models persist across sessions (stored via the Cache API)
  6. Monitor progress during multi-GB downloads
  7. Fall back to smaller models on weak hardware