GitStarRecall can run large language models directly in your browser using WebLLM. No API keys, no cloud costs, and complete privacy.

Overview

WebLLM enables running quantized LLMs in the browser using WebGPU acceleration:

Privacy

All inference happens locally. Your data never leaves your device.

No Cost

One-time model download. Unlimited usage after that.

Offline

Works without internet after download. Perfect for air-gapped environments.

WebGPU Accelerated

GPU acceleration via WebGPU. Fast inference on compatible devices.

Supported Models

GitStarRecall includes a curated set of WebLLM models optimized for repository Q&A:
Model Catalog
export type WebLLMModelTier = "ultra-low" | "balanced" | "quality";

export type WebLLMModelProfile = {
  id: string;
  label: string;
  tier: WebLLMModelTier;
  approxDownloadMB: number;
  notes?: string;
  experimental?: boolean;
};

Available Models

Llama 3.2 1B Instruct (Llama-3.2-1B-Instruct-q4f16_1-MLC)
  • Download: ~700 MB
  • Best for: Desktop and laptop devices
  • Quality: Good instruction following
  • Speed: Fast inference
This is the default model for most users.
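As a concrete illustration, a catalog entry for the default model might look like this. The type is repeated from above so the snippet stands alone; the field values beyond the id and download size are assumptions based on the description above.

```typescript
type WebLLMModelTier = "ultra-low" | "balanced" | "quality";

type WebLLMModelProfile = {
  id: string;
  label: string;
  tier: WebLLMModelTier;
  approxDownloadMB: number;
  notes?: string;
  experimental?: boolean;
};

// Hypothetical profile for the default model described above.
const LLAMA_3_2_1B: WebLLMModelProfile = {
  id: "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  label: "Llama 3.2 1B Instruct",
  tier: "balanced",            // assumed tier for the default model
  approxDownloadMB: 700,       // ~700 MB download
  notes: "Good instruction following, fast inference; default for most users."
};
```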

Model Selection Logic

Automatic Selection
export function recommendWebLLMModel(): WebLLMRecommendation {
  const memory = (navigator as NavigatorWithMemory).deviceMemory;
  
  // Low memory devices (≤4GB) → SmolLM2
  if (memory && memory <= 4) {
    return {
      modelId: WEBLLM_FALLBACK_MODEL_ID,
      reason: "low-memory-device"
    };
  }
  
  // Default → Llama 3.2 1B
  return {
    modelId: WEBLLM_PRIMARY_MODEL_ID,
    reason: "default"
  };
}

Engine Management

WebLLM uses a singleton engine manager for efficiency:
Engine Manager
class WebLLMEngineManager {
  private engine: MLCEngineInterface | null = null;
  private activeModelId: string | null = null;
  private loadingPromise: Promise<void> | null = null;

  async ensureReady(
    modelId: string,
    options: WebLLMEnsureReadyOptions
  ): Promise<void> {
    // Check WebGPU support
    if (!this.supportsWebGPU()) {
      throw new WebLLMProviderError(
        "WEBLLM_UNSUPPORTED",
        "WebGPU is not available in this browser."
      );
    }

    // Wait for any in-flight load before inspecting state,
    // so concurrent callers don't start a second load
    if (this.loadingPromise) {
      await this.loadingPromise;
    }

    // Already loaded?
    if (this.engine && this.activeModelId === modelId) {
      return;
    }

    // Require download consent
    if (!options.allowDownload) {
      throw new WebLLMProviderError(
        "WEBLLM_DOWNLOAD_REQUIRED",
        "Model download consent is required."
      );
    }

    // Initialize or reload
    this.loadingPromise = (async () => {
      if (!this.engine) {
        this.engine = await CreateMLCEngine(modelId, {
          initProgressCallback: (report) => {
            // Clamp progress to [0, 1] before reporting
            const progress = Math.max(0, Math.min(1, report.progress));
            options.onProgress?.(progress, report.text || "Preparing model");
          },
          logLevel: "INFO"
        });
      } else {
        await this.engine.reload(modelId);
      }
      this.activeModelId = modelId;
    })();

    try {
      await this.loadingPromise;
    } finally {
      // Reset even on failure so a retry can start a fresh load
      this.loadingPromise = null;
    }
  }
}
Users must explicitly consent to model downloads:
  1. Trigger Download: the user selects the WebLLM provider and sends a message
  2. Show Dialog: a download consent dialog appears with the model size and details
  3. Download Progress: real-time progress updates are shown during download and initialization
  4. Ready to Use: the model is loaded and ready for inference
Model downloads are large (250 MB to 1.9 GB). Use Wi-Fi for the best experience.
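The consent flow above can be sketched as a small state machine. The state and event names here are illustrative, not the app's actual implementation.

```typescript
// Illustrative states for the download consent flow.
type DownloadFlowState = "idle" | "awaiting-consent" | "downloading" | "ready";

// Advance the flow one step; unknown events leave the state unchanged.
function nextState(state: DownloadFlowState, event: string): DownloadFlowState {
  switch (state) {
    case "idle":
      return event === "send-message" ? "awaiting-consent" : state;
    case "awaiting-consent":
      if (event === "consent-granted") return "downloading";
      if (event === "consent-denied") return "idle";
      return state;
    case "downloading":
      return event === "download-complete" ? "ready" : state;
    default:
      return state;
  }
}
```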

Streaming Inference

WebLLM responses stream token-by-token:
Streaming
async stream(
  modelId: string,
  messages: WebLLMMessage[],
  signal: AbortSignal,
  onToken: (token: string) => void
): Promise<void> {
  if (!this.engine || this.activeModelId !== modelId) {
    throw new WebLLMProviderError(
      "WEBLLM_INIT_FAILED",
      "WebLLM engine is not initialized."
    );
  }

  try {
    const chunkStream = await this.engine.chat.completions.create({
      model: modelId,
      messages,
      stream: true,
      temperature: 0.2,
      max_tokens: 700
    });

    for await (const chunk of chunkStream as AsyncIterable<ChatCompletionChunk>) {
      if (signal.aborted) {
        throw new DOMException("aborted", "AbortError");
      }
      const token = chunk.choices?.[0]?.delta?.content ?? "";
      if (token.length > 0) {
        onToken(token);
      }
    }
  } catch (error) {
    if (error instanceof DOMException && error.name === "AbortError") {
      throw error;
    }
    // `error` is typed `unknown` in a catch clause; narrow before reading .message
    const detail = error instanceof Error ? error.message : String(error);
    throw new WebLLMProviderError(
      "WEBLLM_STREAM_FAILED",
      `WebLLM stream failed: ${detail}`
    );
  }
}

Inference Parameters

Generation Config
{
  temperature: 0.2,      // Low temperature for consistent, factual responses
  max_tokens: 700,       // Limit response length for speed
  stream: true           // Token-by-token streaming
}
Lower temperature (0.2) reduces hallucination and keeps responses grounded in provided context.

Browser Compatibility

WebLLM requires WebGPU support:

Supported Browsers

Browser         | Platform    | Status
Chrome 113+     | Desktop     | ✅ Full support
Edge 113+       | Desktop     | ✅ Full support
Brave 113+      | Desktop     | ✅ Full support
Safari 18+      | macOS       | ⚠️ Experimental
Firefox         | Desktop     | ❌ No WebGPU yet
Mobile browsers | iOS/Android | ❌ Limited support
WebLLM is not recommended for mobile devices due to memory and performance constraints. Use remote or Ollama providers instead.

WebGPU Detection

Capability Check
private supportsWebGPU(): boolean {
  const nav = navigator as Navigator & { gpu?: object };
  return typeof nav.gpu !== "undefined";
}
If WebGPU is unavailable, GitStarRecall will:
  1. Show an error message
  2. Suggest using a remote provider (OpenAI)
  3. Suggest using Ollama for local inference
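That fallback behavior can be sketched as a provider filter. The `Provider` union and the function name are hypothetical, not GitStarRecall's actual API.

```typescript
// Hypothetical provider identifiers for the three backends discussed here.
type Provider = "webllm" | "openai" | "ollama";

// WebLLM is only offered when the browser exposes WebGPU;
// remote (OpenAI) and Ollama remain available either way.
function availableProviders(hasWebGPU: boolean): Provider[] {
  return hasWebGPU ? ["webllm", "openai", "ollama"] : ["openai", "ollama"];
}
```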

Download Dialog

The WebLLM download dialog provides transparency:
Download Dialog Component
<WebLLMDownloadDialog
  open={showWebLLMDownload}
  onOpenChange={setShowWebLLMDownload}
  onConfirm={() => {
    setWebLLMDownloadConsent(true);
    setShowWebLLMDownload(false);
  }}
  onCancel={() => {
    setShowWebLLMDownload(false);
  }}
  modelProfile={selectedWebLLMModel}
  downloadProgress={webllmDownloadProgress}
  downloadStatus={webllmDownloadStatus}
/>

Download States

Downloading
  • Real-time progress percentage
  • Current operation description
  • Cannot be cancelled (browser limitation)

Initializing
  • Loading model into GPU memory
  • Compiling model shaders
  • Final preparation steps

Ready
  • Model loaded and warmed up
  • Ready to accept chat requests
  • Dialog closes automatically
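The download states above suggest a simple status union for the dialog. The names and label wording here are illustrative.

```typescript
// Illustrative status union for the three dialog phases.
type WebLLMDownloadStatus = "downloading" | "initializing" | "ready";

// Map a status (plus download progress in [0, 1]) to a user-facing label.
function statusLabel(status: WebLLMDownloadStatus, progress: number): string {
  switch (status) {
    case "downloading":
      return `Downloading model... ${Math.round(progress * 100)}%`;
    case "initializing":
      return "Loading model into GPU memory...";
    case "ready":
      return "Model ready";
  }
}
```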

Performance Characteristics

Inference Speed

Expected tokens per second by model and hardware:
Model        | Desktop GPU | Desktop CPU | Laptop
SmolLM2 360M | 40-60 t/s   | 20-30 t/s   | 15-25 t/s
Llama 3.2 1B | 25-40 t/s   | 12-20 t/s   | 10-15 t/s
Qwen2.5 1.5B | 20-30 t/s   | 10-15 t/s   | 8-12 t/s
Gemma 2 2B   | 15-25 t/s   | 8-12 t/s    | 6-10 t/s
Llama 3.1 3B | 12-20 t/s   | 6-10 t/s    | 4-8 t/s
Performance varies significantly based on GPU, browser, and OS. These are approximate ranges.
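To measure tokens/second yourself, a small meter can wrap the provider's onToken callback. This is a sketch, not part of GitStarRecall's API; the clock is injectable for testability.

```typescript
// Create a throughput meter; call meter.onToken() for each streamed token,
// then read meter.tokensPerSecond() at any point during generation.
function makeThroughputMeter(now: () => number = Date.now) {
  let tokens = 0;
  let startMs: number | null = null;
  return {
    onToken(): void {
      if (startMs === null) startMs = now(); // clock starts at the first token
      tokens += 1;
    },
    tokensPerSecond(): number {
      if (startMs === null) return 0; // nothing streamed yet
      const elapsedSec = (now() - startMs) / 1000;
      return elapsedSec > 0 ? tokens / elapsedSec : 0;
    }
  };
}
```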

Memory Usage

Peak memory consumption during inference:
  • SmolLM2 360M: ~1 GB RAM
  • Llama 3.2 1B: ~2 GB RAM
  • Qwen2.5 1.5B: ~3 GB RAM
  • Gemma 2 2B: ~4 GB RAM
  • Llama 3.1 3B: ~5 GB RAM
Ensure your device has sufficient available memory before downloading larger models.
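A pre-download check against these figures might use the non-standard `navigator.deviceMemory` hint (Chromium-only). The helper name and model keys below are illustrative.

```typescript
// Approximate peak RAM per model, taken from the list above (in GB).
const PEAK_MEMORY_GB: Record<string, number> = {
  "SmolLM2-360M": 1,
  "Llama-3.2-1B": 2,
  "Qwen2.5-1.5B": 3,
  "Gemma-2-2B": 4,
  "Llama-3.1-3B": 5
};

// Returns false only when we know the device is too small; if the
// memory hint or the model figure is missing, optimistically allow.
function hasEnoughMemory(
  modelKey: string,
  deviceMemoryGB: number | undefined
): boolean {
  const needed: number | undefined = PEAK_MEMORY_GB[modelKey];
  if (needed === undefined || deviceMemoryGB === undefined) return true;
  return deviceMemoryGB >= needed;
}
```

In the browser this would be called with `(navigator as Navigator & { deviceMemory?: number }).deviceMemory`, which Safari and Firefox do not expose.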

Error Handling

WebLLM errors are categorized for better UX:
Error Codes
export type WebLLMErrorCode =
  | "WEBLLM_UNSUPPORTED"       // WebGPU not available
  | "WEBLLM_DOWNLOAD_REQUIRED"  // Consent needed
  | "WEBLLM_INIT_FAILED"        // Initialization error
  | "WEBLLM_STREAM_FAILED";     // Inference error

export class WebLLMProviderError extends Error {
  readonly code: WebLLMErrorCode;

  constructor(code: WebLLMErrorCode, message: string) {
    super(message);
    this.name = "WebLLMProviderError";
    this.code = code;
  }
}

Error Messages

WEBLLM_UNSUPPORTED: WebLLM is unavailable in this browser or on this device.
Solution: Use Chrome 113+, Edge 113+, or Brave 113+ on desktop.
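A usage sketch for surfacing these errors to the user. The error class is repeated from above so the snippet stands alone; the hint wording for the other codes is illustrative.

```typescript
type WebLLMErrorCode =
  | "WEBLLM_UNSUPPORTED"
  | "WEBLLM_DOWNLOAD_REQUIRED"
  | "WEBLLM_INIT_FAILED"
  | "WEBLLM_STREAM_FAILED";

class WebLLMProviderError extends Error {
  readonly code: WebLLMErrorCode;
  constructor(code: WebLLMErrorCode, message: string) {
    super(message);
    this.name = "WebLLMProviderError";
    this.code = code;
  }
}

// Map a caught error to a user-facing suggestion (hypothetical wording).
function userHint(error: unknown): string {
  if (error instanceof WebLLMProviderError) {
    switch (error.code) {
      case "WEBLLM_UNSUPPORTED":
        return "Use Chrome 113+, Edge 113+, or Brave 113+ on desktop.";
      case "WEBLLM_DOWNLOAD_REQUIRED":
        return "Approve the model download to continue.";
      case "WEBLLM_INIT_FAILED":
        return "Reload the page and try again.";
      case "WEBLLM_STREAM_FAILED":
        return "Retry the request; free up memory if it persists.";
    }
  }
  return "Unexpected error.";
}
```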

Comparison with Alternatives

WebLLM (Browser)

Pros:
  • No setup
  • Cross-platform
  • Completely private
  • One-time download
Cons:
  • Requires WebGPU
  • Large downloads
  • Limited models
  • Slower than desktop

Ollama (Local)

Pros:
  • Better performance
  • More model options
  • Lower memory usage
  • Works offline
Cons:
  • Requires installation
  • Desktop only
  • Manual setup

OpenAI (Remote)

Pros:
  • Best quality
  • Fast responses
  • Latest models
  • No setup
Cons:
  • Requires API key
  • Usage costs
  • Not private
  • Needs internet

Best Practices

  1. Choose Appropriate Model: start with Llama 3.2 1B (the default). Only use larger models if you need higher quality and have sufficient memory.
  2. Download on Wi-Fi: models range from 250 MB to 1.9 GB. Use Wi-Fi to avoid mobile data charges.
  3. Close Other Tabs: free up memory before loading models. Each tab consumes memory independently.
  4. Monitor Performance: check tokens/second during generation. Slow performance may indicate memory pressure.

Feature Flags

WebLLM can be disabled via environment variable:
.env
VITE_WEBLLM_ENABLED=true  # Enable WebLLM (default)
VITE_WEBLLM_ENABLED=false # Disable WebLLM
Feature Check
export function isWebLLMEnabled(): boolean {
  const raw = import.meta.env.VITE_WEBLLM_ENABLED;
  // Enabled by default when the variable is unset
  if (raw === undefined) return true;
  return raw === "1" || raw === "true";
}
When disabled:
  • WebLLM option not shown in provider selector
  • Model catalog not loaded
  • Download dialog not rendered