Overview
The WebLLM provider enables local LLM inference directly in the browser using WebGPU acceleration through the MLC engine. Models are downloaded and cached locally, providing privacy-focused chat capabilities without server dependencies.

Architecture
WebLLMEngineManager
Singleton manager that handles model initialization, loading, and inference streaming.

Configuration
Provider Definition
From src/llm/providers.ts:180-186:
- Provider identifier: "webllm"
- Provider type: "local" (runs in browser)
- Default model: "Llama-3.2-1B-Instruct-q4f16_1-MLC"
- API key required: false

Model Catalog
Available Models
From src/llm/webllm/modelCatalog.ts:12-56:
- Tier: balanced, size: ~700 MB. Primary default for strong desktops.
- Tier: ultra-low, size: ~250 MB. Mobile and weak-device fallback.
- Tier: quality, size: ~1100 MB. Strong for technical summaries.
- Tier: quality, size: ~1400 MB. Polished README summarization.
- Tier: quality, size: ~1900 MB. Fallback substitute when Hermes is unavailable.
Model Selection Functions
Capability Detection
WebLLMCapability
From src/llm/webllm/capability.ts:8-14:
- Whether the device is mobile
- Whether WebGPU is available in the browser
- Number of logical CPU cores
- Available device memory in GB (null if unavailable)
- Performance benchmark score (null if probe failed)
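A minimal sketch of what this capability shape likely looks like; the exact field names in src/llm/webllm/capability.ts are assumptions inferred from the descriptions above.

```typescript
// Hypothetical shape of the capability snapshot; field names are assumptions.
interface WebLLMCapability {
  isMobile: boolean;             // whether the device is mobile
  hasWebGPU: boolean;            // whether WebGPU is available in the browser
  cpuCores: number;              // number of logical CPU cores
  deviceMemoryGB: number | null; // available device memory in GB (null if unavailable)
  benchmarkScore: number | null; // performance benchmark score (null if probe failed)
}

// Example snapshot for a mid-range desktop.
const exampleCapability: WebLLMCapability = {
  isMobile: false,
  hasWebGPU: true,
  cpuCores: 8,
  deviceMemoryGB: 16,
  benchmarkScore: 1200,
};
```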
Automatic Model Recommendation
- Mobile devices: SmolLM2-360M-Instruct-q4f16_1-MLC (250 MB)
- No WebGPU: SmolLM2-360M-Instruct-q4f16_1-MLC (250 MB)
- Strong desktop (score ≥ 5): Llama-3.2-1B-Instruct-q4f16_1-MLC (700 MB)
- Weak desktop (score < 5): SmolLM2-360M-Instruct-q4f16_1-MLC (250 MB)
The desktop capability score accumulates points from:
- CPU cores (6+, 8+, 10+ add points)
- Device memory (6 GB+, 8 GB+, 16 GB+ add points)
- Performance benchmark (800+, 1100+, 1600+ add points)
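The selection rules above can be sketched as a pure function. The function and field names are assumptions; only the model IDs, thresholds, and the score ≥ 5 cutoff come from this section.

```typescript
type Recommendation = { modelId: string; reason: string };

// Hedged sketch of the recommendation logic; one point per threshold crossed.
function recommendModel(cap: {
  isMobile: boolean;
  hasWebGPU: boolean;
  cpuCores: number;
  deviceMemoryGB: number | null;
  benchmarkScore: number | null;
}): Recommendation {
  const SMALL = "SmolLM2-360M-Instruct-q4f16_1-MLC";
  const BALANCED = "Llama-3.2-1B-Instruct-q4f16_1-MLC";
  if (cap.isMobile) return { modelId: SMALL, reason: "mobile device" };
  if (!cap.hasWebGPU) return { modelId: SMALL, reason: "no WebGPU" };

  let score = 0;
  for (const t of [6, 8, 10]) if (cap.cpuCores >= t) score++;
  for (const t of [6, 8, 16]) if ((cap.deviceMemoryGB ?? 0) >= t) score++;
  for (const t of [800, 1100, 1600]) if ((cap.benchmarkScore ?? 0) >= t) score++;

  return score >= 5
    ? { modelId: BALANCED, reason: "strong desktop" }
    : { modelId: SMALL, reason: "weak desktop" };
}
```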
Engine Initialization
ensureReady
From src/llm/webllm/engine.ts:48-102:
Parameters:
- MLC model ID (e.g., "Llama-3.2-1B-Instruct-q4f16_1-MLC")
- Whether to allow model download; must be true for initialization
- Progress callback: progress is 0-1, text is a status message

Throws:
- WebLLMProviderError("WEBLLM_UNSUPPORTED"): WebGPU not available
- WebLLMProviderError("WEBLLM_DOWNLOAD_REQUIRED"): user consent needed
- WebLLMProviderError("WEBLLM_INIT_FAILED"): model loading failed
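A hedged usage sketch: the signature below is an assumption based on the parameters and errors listed above, and the `ensureReady` stub stands in for the real engine manager so the example is self-contained.

```typescript
type ProgressFn = (progress: number, text: string) => void;

// Stub mirroring the documented behavior: refuse without download consent.
async function ensureReady(
  modelId: string,
  allowDownload: boolean,
  onProgress: ProgressFn
): Promise<void> {
  if (!allowDownload) throw new Error("WEBLLM_DOWNLOAD_REQUIRED");
  onProgress(1, "ready");
}

async function initWithConsent(): Promise<string> {
  try {
    await ensureReady("Llama-3.2-1B-Instruct-q4f16_1-MLC", true, (p, text) =>
      console.log(`${Math.round(p * 100)}% ${text}`)
    );
    return "ready";
  } catch (err) {
    // Surface consent/support errors to the UI instead of crashing.
    return (err as Error).message;
  }
}
```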
Streaming Inference
stream
From src/llm/webllm/engine.ts:104-143:

Parameters:
- Model ID matching the currently loaded model
- Chat messages with role (system/user/assistant) and content
- Abort signal to cancel generation
- Callback invoked for each generated token

Generation settings: temperature: 0.2, max_tokens: 700, stream: true

Throws:
- WebLLMProviderError("WEBLLM_INIT_FAILED"): engine not initialized
- WebLLMProviderError("WEBLLM_STREAM_FAILED"): generation failed
- DOMException("AbortError"): user cancelled
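A hedged sketch of consuming the streaming API with cancellation. The real `stream()` signature in src/llm/webllm/engine.ts may differ; the stub below just mirrors the parameters listed above and emits fixed tokens.

```typescript
type Msg = { role: "system" | "user" | "assistant"; content: string };

// Stub stream: yields tokens unless the signal has been aborted.
async function streamStub(
  modelId: string,
  messages: Msg[],
  signal: AbortSignal,
  onToken: (t: string) => void
): Promise<void> {
  for (const tok of ["Hello", " ", "world"]) {
    if (signal.aborted) throw new DOMException("Aborted", "AbortError");
    onToken(tok);
  }
}

async function runStream(): Promise<string> {
  const ac = new AbortController(); // call ac.abort() to cancel generation
  let out = "";
  try {
    await streamStub(
      "Llama-3.2-1B-Instruct-q4f16_1-MLC",
      [{ role: "user", content: "Say hello" }],
      ac.signal,
      (t) => (out += t)
    );
  } catch (err) {
    // AbortError means the user cancelled; keep whatever partial output exists.
    if ((err as DOMException).name !== "AbortError") throw err;
  }
  return out;
}
```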
Provider Integration
Using via getProviderById
From src/llm/providers.ts:260-281:
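The lookup code itself is not reproduced here; a minimal sketch of what resolving the provider might look like. The registry shape and field names are assumptions, while the id "webllm", type "local", and default model come from the provider definition above.

```typescript
type Provider = {
  id: string;
  type: "local" | "remote";
  defaultModel: string;
  requiresApiKey: boolean;
};

// Hypothetical registry containing the WebLLM entry documented above.
const providers: Provider[] = [
  {
    id: "webllm",
    type: "local",
    defaultModel: "Llama-3.2-1B-Instruct-q4f16_1-MLC",
    requiresApiKey: false,
  },
];

function getProviderById(id: string): Provider | undefined {
  return providers.find((p) => p.id === id);
}
```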
Error Handling
WebLLMProviderError
From src/llm/webllm/engine.ts:8-22:
- WEBLLM_UNSUPPORTED: WebGPU not available in the browser
- WEBLLM_DOWNLOAD_REQUIRED: model download requires user consent
- WEBLLM_INIT_FAILED: model initialization failed
- WEBLLM_STREAM_FAILED: inference streaming failed
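A sketch of an error class carrying the four codes documented above; the actual class in src/llm/webllm/engine.ts may store additional context.

```typescript
type WebLLMErrorCode =
  | "WEBLLM_UNSUPPORTED"
  | "WEBLLM_DOWNLOAD_REQUIRED"
  | "WEBLLM_INIT_FAILED"
  | "WEBLLM_STREAM_FAILED";

// Typed error so callers can branch on `code` instead of parsing messages.
class WebLLMProviderError extends Error {
  constructor(public readonly code: WebLLMErrorCode, message?: string) {
    super(message ?? code);
    this.name = "WebLLMProviderError";
  }
}
```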
Format Provider Error
From src/llm/providers.ts:295-319:
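The formatter's body is not reproduced here; a hedged sketch of mapping the codes above to user-facing text. The real function in src/llm/providers.ts:295-319 may word these messages differently.

```typescript
// Hypothetical formatter: translates provider error codes into UI strings.
function formatProviderError(code: string): string {
  switch (code) {
    case "WEBLLM_UNSUPPORTED":
      return "This browser does not support WebGPU.";
    case "WEBLLM_DOWNLOAD_REQUIRED":
      return "Downloading this model requires your consent.";
    case "WEBLLM_INIT_FAILED":
      return "The local model failed to load.";
    case "WEBLLM_STREAM_FAILED":
      return "Local generation failed.";
    default:
      return "Unknown provider error.";
  }
}
```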
Feature Flag
Cache Management
Unload Model
Clear Runtime Caches
From src/pages/UsagePage.tsx:552-566:
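Since downloaded models live in the browser Cache API (see Best Practices below), clearing them can be sketched as follows. The "webllm" cache-name filter is an assumption; the store is injected so the logic also runs outside a browser, where you would pass the global `caches`.

```typescript
// Minimal interface matching the parts of CacheStorage we need.
interface CacheStore {
  keys(): Promise<string[]>;
  delete(name: string): Promise<boolean>;
}

// Deletes caches whose names look WebLLM-related and reports what was removed.
async function clearWebLLMCaches(store: CacheStore): Promise<string[]> {
  const names = await store.keys();
  const targets = names.filter((n) => n.includes("webllm"));
  await Promise.all(targets.map((n) => store.delete(n)));
  return targets;
}
```

In the browser this would be called as `clearWebLLMCaches(caches)`, since `CacheStorage.keys()` and `CacheStorage.delete()` have matching signatures.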
Type Definitions
WebLLMModelProfile
WebLLMMessage
WebLLMRecommendation
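Hedged sketches of the three types named above; the field names are assumptions inferred from the model catalog and recommendation sections, not the actual declarations.

```typescript
interface WebLLMModelProfile {
  modelId: string;      // MLC model identifier
  tier: "ultra-low" | "balanced" | "quality";
  approxSizeMB: number; // approximate download size
  notes: string;
}

interface WebLLMMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface WebLLMRecommendation {
  modelId: string;
  reason: string; // why this model suits the detected device
}

// Example catalog entry matching the balanced tier documented above.
const balancedProfile: WebLLMModelProfile = {
  modelId: "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  tier: "balanced",
  approxSizeMB: 700,
  notes: "Primary default for strong desktops",
};
```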
Best Practices
- Always check WebGPU availability before initializing
- Request download consent from users (models are large)
- Use model recommendations based on device capabilities
- Handle AbortError for user cancellations
- Cached models persist across sessions (stored in the Cache API)
- Monitor progress during multi-GB downloads
- Fall back to smaller models on weak hardware