Overview

The WebLLM provider enables local LLM inference directly in the browser using WebGPU acceleration through the MLC engine. Models are downloaded and cached locally, providing privacy-focused chat capabilities without server dependencies.

Architecture

WebLLMEngineManager

Singleton manager that handles model initialization, loading, and inference streaming.
import { getWebLLMEngineManager } from './llm/webllm/engine';

const manager = getWebLLMEngineManager();

Configuration

Provider Definition

From src/llm/providers.ts:180-186:
{
  id: "webllm",
  label: "Local (Browser WebLLM)",
  kind: "local",
  defaultBaseUrl: "",
  defaultModel: "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  requiresApiKey: false,
}
  • id (string, required) - Provider identifier: "webllm"
  • kind (string) - Provider type: "local" (runs in the browser)
  • defaultModel (string) - Default model: "Llama-3.2-1B-Instruct-q4f16_1-MLC"
  • requiresApiKey (boolean) - API key required: false

Model Catalog

Available Models

From src/llm/webllm/modelCatalog.ts:12-56:
  • Llama-3.2-1B-Instruct-q4f16_1-MLC - tier: balanced, size: ~700 MB. Primary default for strong desktops.
  • SmolLM2-360M-Instruct-q4f16_1-MLC - tier: ultra-low, size: ~250 MB. Mobile and weak-device fallback.
  • Qwen2.5-1.5B-Instruct-q4f16_1-MLC - tier: quality, size: ~1100 MB. Strong for technical summaries.
  • Gemma-2-2B-Instruct-q4f16_1-MLC - tier: quality, size: ~1400 MB. Polished README summarization.
  • Llama-3.1-3B-Instruct-q4f16_1-MLC - tier: quality, size: ~1900 MB. Fallback substitute when Hermes is unavailable.

Model Selection Functions

import {
  getWebLLMSelectableModels,
  getWebLLMModelProfile,
  isWebLLMModelIdSupported,
} from './llm/webllm/modelCatalog';

// Get all available models
const models = getWebLLMSelectableModels();
// WebLLMModelProfile[]

// Get specific model info
const profile = getWebLLMModelProfile('Llama-3.2-1B-Instruct-q4f16_1-MLC');
// { id, label, tier, approxDownloadMB, notes }

// Check if model is supported
const supported = isWebLLMModelIdSupported('custom-model');
// boolean
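The catalog behind these helpers can be pictured as a flat array of profiles with lookup functions over it. The sketch below is hypothetical (the labels and the two entries are illustrative, not the full catalog from modelCatalog.ts):

```typescript
// Hypothetical sketch of a model catalog and its lookup helpers.
type ModelProfile = {
  id: string;
  tier: 'ultra-low' | 'balanced' | 'quality';
  approxDownloadMB: number;
};

const CATALOG: ModelProfile[] = [
  { id: 'Llama-3.2-1B-Instruct-q4f16_1-MLC', tier: 'balanced', approxDownloadMB: 700 },
  { id: 'SmolLM2-360M-Instruct-q4f16_1-MLC', tier: 'ultra-low', approxDownloadMB: 250 },
];

// Return the profile for a model ID, or undefined when unknown.
function getModelProfile(id: string): ModelProfile | undefined {
  return CATALOG.find((profile) => profile.id === id);
}

// A model is "supported" exactly when it appears in the catalog.
function isModelIdSupported(id: string): boolean {
  return getModelProfile(id) !== undefined;
}
```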

Capability Detection

WebLLMCapability

From src/llm/webllm/capability.ts:8-14:
  • isMobile (boolean) - Whether the device is mobile
  • hasWebGPU (boolean) - Whether WebGPU is available in the browser
  • hardwareConcurrency (number) - Number of logical CPU cores
  • deviceMemoryGB (number | null) - Available device memory in GB (null if unavailable)
  • perfScore (number | null) - Performance benchmark score (null if the probe failed)
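These fields map to standard browser globals (navigator.gpu, navigator.hardwareConcurrency, navigator.deviceMemory). A minimal sketch of how they could be gathered, with the navigator object injected so the function is testable outside a browser (perfScore is omitted since it requires running a benchmark probe):

```typescript
// Sketch: derive capability fields from a navigator-like object.
// Field names mirror the WebLLMCapability shape documented above.
type Capability = {
  isMobile: boolean;
  hasWebGPU: boolean;
  hardwareConcurrency: number;
  deviceMemoryGB: number | null;
};

function detectCapability(nav: {
  userAgent: string;
  gpu?: unknown;            // navigator.gpu exists only when WebGPU is available
  hardwareConcurrency?: number;
  deviceMemory?: number;    // Device Memory API; undefined in some browsers
}): Capability {
  return {
    isMobile: /Android|iPhone|iPad|Mobile/i.test(nav.userAgent),
    hasWebGPU: nav.gpu !== undefined,
    hardwareConcurrency: nav.hardwareConcurrency ?? 1,
    deviceMemoryGB: nav.deviceMemory ?? null,
  };
}
```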

Automatic Model Recommendation

import { recommendWebLLMModel } from './llm/webllm/capability';

const recommendation = await recommendWebLLMModel();
// {
//   modelId: string,
//   reason: 'mobile' | 'no-webgpu' | 'strong-desktop' | 'weak-desktop' | 'probe-failed',
//   score: number | null,
//   threshold: number | null,
//   capability: WebLLMCapability | null
// }
Recommendation Logic:
  • Mobile devices: SmolLM2-360M-Instruct-q4f16_1-MLC (250 MB)
  • No WebGPU: SmolLM2-360M-Instruct-q4f16_1-MLC (250 MB)
  • Strong desktop (score ≥ 5): Llama-3.2-1B-Instruct-q4f16_1-MLC (700 MB)
  • Weak desktop (score < 5): SmolLM2-360M-Instruct-q4f16_1-MLC (250 MB)
The scoring algorithm evaluates:
  • CPU cores (6+, 8+, 10+ add points)
  • Device memory (6GB+, 8GB+, 16GB+ add points)
  • Performance benchmark (800+, 1100+, 1600+ add points)
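The tiered scoring described above can be sketched as follows. The one-point-per-threshold weighting and the threshold of 5 are assumptions inferred from the listed tiers, not the exact constants in capability.ts:

```typescript
// Sketch of the capability scoring heuristic: each threshold crossed
// adds a point; null inputs (unavailable probes) contribute nothing.
function scoreDevice(
  cores: number,
  memoryGB: number | null,
  benchmark: number | null,
): number {
  let score = 0;
  for (const t of [6, 8, 10]) if (cores >= t) score += 1;
  if (memoryGB !== null) for (const t of [6, 8, 16]) if (memoryGB >= t) score += 1;
  if (benchmark !== null) for (const t of [800, 1100, 1600]) if (benchmark >= t) score += 1;
  return score;
}

// Strong desktops get the 1B default; everything else falls back.
function recommendModelId(score: number, threshold = 5): string {
  return score >= threshold
    ? 'Llama-3.2-1B-Instruct-q4f16_1-MLC'
    : 'SmolLM2-360M-Instruct-q4f16_1-MLC';
}
```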

Engine Initialization

ensureReady

From src/llm/webllm/engine.ts:48-102:
import { getWebLLMEngineManager } from './llm/webllm/engine';

const manager = getWebLLMEngineManager();

await manager.ensureReady(modelId, {
  allowDownload: true,
  onProgress: (progress, text) => {
    console.log(`${Math.round(progress * 100)}%: ${text}`);
  },
});
  • modelId (string, required) - MLC model ID (e.g., "Llama-3.2-1B-Instruct-q4f16_1-MLC")
  • options.allowDownload (boolean, required) - Whether to allow model download. Must be true for initialization.
  • options.onProgress ((progress: number, text: string) => void) - Progress callback; progress is 0-1, text is a status message.
Throws:
  • WebLLMProviderError("WEBLLM_UNSUPPORTED") - WebGPU not available
  • WebLLMProviderError("WEBLLM_DOWNLOAD_REQUIRED") - User consent needed
  • WebLLMProviderError("WEBLLM_INIT_FAILED") - Model loading failed
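The WEBLLM_DOWNLOAD_REQUIRED code implies a consent flow: attempt initialization without download, and retry with consent only when the user agrees. A sketch of that pattern, with the real ensureReady and the consent prompt injected as stand-ins so the flow is self-contained:

```typescript
// Sketch of a download-consent flow. `ensureReady` and `askUser` are
// hypothetical stand-ins for manager.ensureReady and a UI prompt.
async function ensureWithConsent(
  ensureReady: (allowDownload: boolean) => Promise<void>,
  askUser: () => Promise<boolean>,
): Promise<boolean> {
  try {
    await ensureReady(false); // succeeds when the model is already cached
    return true;
  } catch (err) {
    if ((err as { code?: string }).code !== 'WEBLLM_DOWNLOAD_REQUIRED') {
      throw err; // unrelated failure: surface it
    }
    if (!(await askUser())) return false; // user declined the download
    await ensureReady(true);
    return true;
  }
}
```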

Streaming Inference

stream

From src/llm/webllm/engine.ts:104-143:
const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Explain WebGPU.' },
];

const controller = new AbortController();
let output = '';

await manager.stream(
  modelId,
  messages,
  controller.signal,
  (token) => {
    output += token;
  },
);
  • modelId (string, required) - Model ID matching the currently loaded model
  • messages (WebLLMMessage[], required) - Chat messages with role (system/user/assistant) and content
  • signal (AbortSignal, required) - Abort signal to cancel generation
  • onToken ((token: string) => void, required) - Callback invoked for each generated token
Stream Parameters (hardcoded in engine.ts:122-124):
  • temperature: 0.2
  • max_tokens: 700
  • stream: true
Throws:
  • WebLLMProviderError("WEBLLM_INIT_FAILED") - Engine not initialized
  • WebLLMProviderError("WEBLLM_STREAM_FAILED") - Generation failed
  • DOMException("AbortError") - User cancelled
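Since a user-initiated abort surfaces as a DOMException named "AbortError" rather than a WebLLMProviderError, callers typically want to distinguish it from real failures. A small sketch (the helper name is hypothetical; it duck-types on the name so it also matches abort errors from non-DOM environments):

```typescript
// Sketch: treat any error named "AbortError" as a user cancellation,
// so it can be swallowed instead of shown as a failure.
function isUserCancellation(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { name?: string }).name === 'AbortError'
  );
}
```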

Provider Integration

Using via getProviderById

From src/llm/providers.ts:260-281:
import { getProviderById } from './llm/providers';
import type { LLMProviderConfig, LLMStreamRequest } from './llm/types';

const provider = getProviderById('webllm');

const config: LLMProviderConfig = {
  baseUrl: '',
  model: 'Llama-3.2-1B-Instruct-q4f16_1-MLC',
  allowModelDownload: true,
};

const request: LLMStreamRequest = {
  prompt: 'What is semantic search?',
  contextSnippets: ['Context 1: ...', 'Context 2: ...'],
  signal: new AbortController().signal,
  onToken: (token) => console.log(token),
  onInitProgress: (progress, text) => {
    console.log(`Init: ${Math.round(progress * 100)}% - ${text}`);
  },
};

await provider.stream(config, request);

Error Handling

WebLLMProviderError

From src/llm/webllm/engine.ts:8-22:
import { WebLLMProviderError } from './llm/webllm/engine';

try {
  await manager.ensureReady(modelId, { allowDownload: false });
} catch (error) {
  if (error instanceof WebLLMProviderError) {
    console.error(error.code); // "WEBLLM_DOWNLOAD_REQUIRED"
    console.error(error.message);
  }
}
Error Codes:
  • WEBLLM_UNSUPPORTED - WebGPU not available in the browser
  • WEBLLM_DOWNLOAD_REQUIRED - Model download requires user consent
  • WEBLLM_INIT_FAILED - Model initialization failed
  • WEBLLM_STREAM_FAILED - Inference streaming failed

Format Provider Error

From src/llm/providers.ts:295-319:
import { formatProviderError } from './llm/providers';

try {
  await provider.stream(config, request);
} catch (error) {
  const userMessage = formatProviderError(error, 'local');
  console.error(userMessage);
  // "WebLLM model download requires explicit consent."
}

Feature Flag

import { isWebLLMEnabled } from './llm/providers';

if (isWebLLMEnabled()) {
  // VITE_WEBLLM_ENABLED=1 or VITE_WEBLLM_ENABLED=true
}
Set environment variable:
VITE_WEBLLM_ENABLED=1
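The flag check boils down to accepting "1" or "true" (and nothing else) from the environment. A sketch of that parsing, assuming a case-insensitive, whitespace-tolerant reading of the variable:

```typescript
// Sketch: parse an enable flag the way isWebLLMEnabled is described,
// accepting only "1" or "true" (case-insensitive).
function parseEnableFlag(raw: string | undefined): boolean {
  if (raw === undefined) return false;
  const normalized = raw.trim().toLowerCase();
  return normalized === '1' || normalized === 'true';
}
```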

Cache Management

Unload Model

await manager.unload();
// Releases engine and clears active model

Clear Runtime Caches

From src/pages/UsagePage.tsx:552-566:
async function clearWebLLMRuntimeCaches(): Promise<void> {
  if (!('caches' in globalThis)) {
    return;
  }

  const keys = await caches.keys();
  await Promise.all(
    keys
      .filter((key) => {
        const lower = key.toLowerCase();
        return lower.includes('webllm') || lower.includes('mlc') || lower.includes('model');
      })
      .map((key) => caches.delete(key)),
  );
}
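The filter inside clearWebLLMRuntimeCaches can be isolated as a pure predicate, which makes the matching rule testable without a browser Cache API (the helper name is hypothetical):

```typescript
// Sketch: the cache-key match used when clearing WebLLM runtime caches.
// A key is cleared if it mentions webllm, mlc, or model (case-insensitive).
function isWebLLMCacheKey(key: string): boolean {
  const lower = key.toLowerCase();
  return lower.includes('webllm') || lower.includes('mlc') || lower.includes('model');
}
```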

Type Definitions

WebLLMModelProfile

type WebLLMModelProfile = {
  id: string;
  label: string;
  tier: 'ultra-low' | 'balanced' | 'quality';
  approxDownloadMB: number;
  notes?: string;
  experimental?: boolean;
};

WebLLMMessage

type WebLLMMessage = {
  role: 'system' | 'user' | 'assistant';
  content: string;
};

WebLLMRecommendation

type WebLLMRecommendation = {
  modelId: string;
  reason: 'mobile' | 'no-webgpu' | 'strong-desktop' | 'weak-desktop' | 'probe-failed';
  score: number | null;
  threshold: number | null;
  capability: WebLLMCapability | null;
};

Best Practices

  1. Always check WebGPU availability before initializing
  2. Request download consent from users (models are large)
  3. Use model recommendations based on device capabilities
  4. Handle AbortError for user cancellations
  5. Remember that cached models persist across sessions (stored via the Cache API)
  6. Monitor progress during multi-GB downloads
  7. Fall back to smaller models on weak hardware