GitStarRecall can run large language models directly in your browser using WebLLM. No API keys, no cloud costs, and complete privacy.

Overview

WebLLM enables running quantized LLMs in the browser using WebGPU acceleration:

Privacy

All inference happens locally. Your data never leaves your device.

No Cost

One-time model download. Unlimited usage after that.

Offline

Works without internet after download. Perfect for air-gapped environments.

WebGPU Accelerated

GPU acceleration via WebGPU. Fast inference on compatible devices.

Supported Models

GitStarRecall includes a curated set of WebLLM models optimized for repository Q&A:
Model Catalog
export type WebLLMModelTier = "ultra-low" | "balanced" | "quality";

export type WebLLMModelProfile = {
  id: string;
  label: string;
  tier: WebLLMModelTier;
  approxDownloadMB: number;
  notes?: string;
  experimental?: boolean;
};

Available Models

Llama 3.2 1B Instruct (Llama-3.2-1B-Instruct-q4f16_1-MLC)
  • Download: ~700 MB
  • Best for: Desktop and laptop devices
  • Quality: Good instruction following
  • Speed: Fast inference
This is the default model for most users.
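As a concrete illustration, a catalog entry for the default model might look like this. The type is repeated from above so the snippet stands alone; the field values beyond the id and download size are assumptions based on the description above.

```typescript
type WebLLMModelTier = "ultra-low" | "balanced" | "quality";

type WebLLMModelProfile = {
  id: string;
  label: string;
  tier: WebLLMModelTier;
  approxDownloadMB: number;
  notes?: string;
  experimental?: boolean;
};

// Hypothetical profile for the default model described above.
const LLAMA_3_2_1B: WebLLMModelProfile = {
  id: "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  label: "Llama 3.2 1B Instruct",
  tier: "balanced",            // assumed tier for the default model
  approxDownloadMB: 700,       // ~700 MB download
  notes: "Good instruction following, fast inference; default for most users."
};
```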

Model Selection Logic

Automatic Selection
export function recommendWebLLMModel(): WebLLMRecommendation {
  const memory = (navigator as NavigatorWithMemory).deviceMemory;
  
  // Low memory devices (≤4GB) → SmolLM2
  if (memory && memory <= 4) {
    return {
      modelId: WEBLLM_FALLBACK_MODEL_ID,
      reason: "low-memory-device"
    };
  }
  
  // Default → Llama 3.2 1B
  return {
    modelId: WEBLLM_PRIMARY_MODEL_ID,
    reason: "default"
  };
}

Engine Management

WebLLM uses a singleton engine manager for efficiency:
Engine Manager
class WebLLMEngineManager {
  private engine: MLCEngineInterface | null = null;
  private activeModelId: string | null = null;
  private loadingPromise: Promise<void> | null = null;

  async ensureReady(
    modelId: string,
    options: WebLLMEnsureReadyOptions
  ): Promise<void> {
    // Check WebGPU support
    if (!this.supportsWebGPU()) {
      throw new WebLLMProviderError(
        "WEBLLM_UNSUPPORTED",
        "WebGPU is not available in this browser."
      );
    }

    // Wait for any in-flight load before inspecting state,
    // so concurrent callers don't start a second load
    if (this.loadingPromise) {
      await this.loadingPromise;
    }

    // Already loaded?
    if (this.engine && this.activeModelId === modelId) {
      return;
    }

    // Require download consent
    if (!options.allowDownload) {
      throw new WebLLMProviderError(
        "WEBLLM_DOWNLOAD_REQUIRED",
        "Model download consent is required."
      );
    }

    // Initialize or reload
    this.loadingPromise = (async () => {
      if (!this.engine) {
        this.engine = await CreateMLCEngine(modelId, {
          initProgressCallback: (report) => {
            // Clamp progress to [0, 1] before reporting
            const progress = Math.max(0, Math.min(1, report.progress));
            options.onProgress?.(progress, report.text || "Preparing model");
          },
          logLevel: "INFO"
        });
      } else {
        await this.engine.reload(modelId);
      }
      this.activeModelId = modelId;
    })();

    try {
      await this.loadingPromise;
    } finally {
      // Reset even on failure so a retry can start a fresh load
      this.loadingPromise = null;
    }
  }
}
Users must explicitly consent to model downloads:
  1. Trigger Download: the user selects the WebLLM provider and sends a message
  2. Show Dialog: a download consent dialog appears with the model size and details
  3. Download Progress: real-time progress updates are shown during download and initialization
  4. Ready to Use: the model is loaded and ready for inference
Model downloads are large (250 MB to 1.9 GB). Use Wi-Fi for the best experience.
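The consent flow above can be sketched as a small state machine. The state and event names here are illustrative, not the app's actual implementation.

```typescript
// Illustrative states for the download consent flow.
type DownloadFlowState = "idle" | "awaiting-consent" | "downloading" | "ready";

// Advance the flow one step; unknown events leave the state unchanged.
function nextState(state: DownloadFlowState, event: string): DownloadFlowState {
  switch (state) {
    case "idle":
      return event === "send-message" ? "awaiting-consent" : state;
    case "awaiting-consent":
      if (event === "consent-granted") return "downloading";
      if (event === "consent-denied") return "idle";
      return state;
    case "downloading":
      return event === "download-complete" ? "ready" : state;
    default:
      return state;
  }
}
```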

Streaming Inference

WebLLM responses stream token-by-token:
Streaming
async stream(
  modelId: string,
  messages: WebLLMMessage[],
  signal: AbortSignal,
  onToken: (token: string) => void
): Promise<void> {
  if (!this.engine || this.activeModelId !== modelId) {
    throw new WebLLMProviderError(
      "WEBLLM_INIT_FAILED",
      "WebLLM engine is not initialized."
    );
  }

  try {
    const chunkStream = await this.engine.chat.completions.create({
      model: modelId,
      messages,
      stream: true,
      temperature: 0.2,
      max_tokens: 700
    });

    for await (const chunk of chunkStream as AsyncIterable<ChatCompletionChunk>) {
      if (signal.aborted) {
        throw new DOMException("aborted", "AbortError");
      }
      const token = chunk.choices?.[0]?.delta?.content ?? "";
      if (token.length > 0) {
        onToken(token);
      }
    }
  } catch (error) {
    if (error instanceof DOMException && error.name === "AbortError") {
      throw error;
    }
    // `error` is typed `unknown` in a catch clause; narrow before reading .message
    const detail = error instanceof Error ? error.message : String(error);
    throw new WebLLMProviderError(
      "WEBLLM_STREAM_FAILED",
      `WebLLM stream failed: ${detail}`
    );
  }
}

Inference Parameters

Generation Config
{
  temperature: 0.2,      // Low temperature for consistent, factual responses
  max_tokens: 700,       // Limit response length for speed
  stream: true           // Token-by-token streaming
}
Lower temperature (0.2) reduces hallucination and keeps responses grounded in provided context.

Browser Compatibility

WebLLM requires WebGPU support:

Supported Browsers

Browser         | Platform    | Status
Chrome 113+     | Desktop     | ✅ Full support
Edge 113+       | Desktop     | ✅ Full support
Brave 113+      | Desktop     | ✅ Full support
Safari 18+      | macOS       | ⚠️ Experimental
Firefox         | Desktop     | ❌ No WebGPU yet
Mobile browsers | iOS/Android | ❌ Limited support
WebLLM is not recommended for mobile devices due to memory and performance constraints. Use remote or Ollama providers instead.

WebGPU Detection

Capability Check
private supportsWebGPU(): boolean {
  const nav = navigator as Navigator & { gpu?: object };
  return typeof nav.gpu !== "undefined";
}
If WebGPU is unavailable, GitStarRecall will:
  1. Show an error message
  2. Suggest using a remote provider (OpenAI)
  3. Suggest using Ollama for local inference
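That fallback behavior can be sketched as a provider filter. The `Provider` union and the function name are hypothetical, not GitStarRecall's actual API.

```typescript
// Hypothetical provider identifiers for the three backends discussed here.
type Provider = "webllm" | "openai" | "ollama";

// WebLLM is only offered when the browser exposes WebGPU;
// remote (OpenAI) and Ollama remain available either way.
function availableProviders(hasWebGPU: boolean): Provider[] {
  return hasWebGPU ? ["webllm", "openai", "ollama"] : ["openai", "ollama"];
}
```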

Download Dialog

The WebLLM download dialog provides transparency:
Download Dialog Component
<WebLLMDownloadDialog
  open={showWebLLMDownload}
  onOpenChange={setShowWebLLMDownload}
  onConfirm={() => {
    setWebLLMDownloadConsent(true);
    setShowWebLLMDownload(false);
  }}
  onCancel={() => {
    setShowWebLLMDownload(false);
  }}
  modelProfile={selectedWebLLMModel}
  downloadProgress={webllmDownloadProgress}
  downloadStatus={webllmDownloadStatus}
/>

Download States

Downloading
  • Real-time progress percentage
  • Current operation description
  • Cannot be cancelled (browser limitation)

Initializing
  • Loading model into GPU memory
  • Compiling model shaders
  • Final preparation steps

Ready
  • Model loaded and warmed up
  • Ready to accept chat requests
  • Dialog closes automatically
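The download states above suggest a simple status union for the dialog. The names and label wording here are illustrative.

```typescript
// Illustrative status union for the three dialog phases.
type WebLLMDownloadStatus = "downloading" | "initializing" | "ready";

// Map a status (plus download progress in [0, 1]) to a user-facing label.
function statusLabel(status: WebLLMDownloadStatus, progress: number): string {
  switch (status) {
    case "downloading":
      return `Downloading model... ${Math.round(progress * 100)}%`;
    case "initializing":
      return "Loading model into GPU memory...";
    case "ready":
      return "Model ready";
  }
}
```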

Performance Characteristics

Inference Speed

Expected tokens per second by model and hardware:
Model        | Desktop GPU | Desktop CPU | Laptop
SmolLM2 360M | 40-60 t/s   | 20-30 t/s   | 15-25 t/s
Llama 3.2 1B | 25-40 t/s   | 12-20 t/s   | 10-15 t/s
Qwen2.5 1.5B | 20-30 t/s   | 10-15 t/s   | 8-12 t/s
Gemma 2 2B   | 15-25 t/s   | 8-12 t/s    | 6-10 t/s
Llama 3.1 3B | 12-20 t/s   | 6-10 t/s    | 4-8 t/s
Performance varies significantly based on GPU, browser, and OS. These are approximate ranges.
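To measure tokens/second yourself, a small meter can wrap the provider's onToken callback. This is a sketch, not part of GitStarRecall's API; the clock is injectable for testability.

```typescript
// Create a throughput meter; call meter.onToken() for each streamed token,
// then read meter.tokensPerSecond() at any point during generation.
function makeThroughputMeter(now: () => number = Date.now) {
  let tokens = 0;
  let startMs: number | null = null;
  return {
    onToken(): void {
      if (startMs === null) startMs = now(); // clock starts at the first token
      tokens += 1;
    },
    tokensPerSecond(): number {
      if (startMs === null) return 0; // nothing streamed yet
      const elapsedSec = (now() - startMs) / 1000;
      return elapsedSec > 0 ? tokens / elapsedSec : 0;
    }
  };
}
```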

Memory Usage

Peak memory consumption during inference:
  • SmolLM2 360M: ~1 GB RAM
  • Llama 3.2 1B: ~2 GB RAM
  • Qwen2.5 1.5B: ~3 GB RAM
  • Gemma 2 2B: ~4 GB RAM
  • Llama 3.1 3B: ~5 GB RAM
Ensure your device has sufficient available memory before downloading larger models.
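A pre-download check against these figures might use the non-standard `navigator.deviceMemory` hint (Chromium-only). The helper name and model keys below are illustrative.

```typescript
// Approximate peak RAM per model, taken from the list above (in GB).
const PEAK_MEMORY_GB: Record<string, number> = {
  "SmolLM2-360M": 1,
  "Llama-3.2-1B": 2,
  "Qwen2.5-1.5B": 3,
  "Gemma-2-2B": 4,
  "Llama-3.1-3B": 5
};

// Returns false only when we know the device is too small; if the
// memory hint or the model figure is missing, optimistically allow.
function hasEnoughMemory(
  modelKey: string,
  deviceMemoryGB: number | undefined
): boolean {
  const needed: number | undefined = PEAK_MEMORY_GB[modelKey];
  if (needed === undefined || deviceMemoryGB === undefined) return true;
  return deviceMemoryGB >= needed;
}
```

In the browser this would be called with `(navigator as Navigator & { deviceMemory?: number }).deviceMemory`, which Safari and Firefox do not expose.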

Error Handling

WebLLM errors are categorized for better UX:
Error Codes
export type WebLLMErrorCode =
  | "WEBLLM_UNSUPPORTED"       // WebGPU not available
  | "WEBLLM_DOWNLOAD_REQUIRED"  // Consent needed
  | "WEBLLM_INIT_FAILED"        // Initialization error
  | "WEBLLM_STREAM_FAILED";     // Inference error

export class WebLLMProviderError extends Error {
  readonly code: WebLLMErrorCode;

  constructor(code: WebLLMErrorCode, message: string) {
    super(message);
    this.name = "WebLLMProviderError";
    this.code = code;
  }
}

Error Messages

WEBLLM_UNSUPPORTED: WebLLM is unavailable in this browser or on this device.
Solution: Use Chrome 113+, Edge 113+, or Brave 113+ on desktop.
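A usage sketch for surfacing these errors to the user. The error class is repeated from above so the snippet stands alone; the hint wording for the other codes is illustrative.

```typescript
type WebLLMErrorCode =
  | "WEBLLM_UNSUPPORTED"
  | "WEBLLM_DOWNLOAD_REQUIRED"
  | "WEBLLM_INIT_FAILED"
  | "WEBLLM_STREAM_FAILED";

class WebLLMProviderError extends Error {
  readonly code: WebLLMErrorCode;
  constructor(code: WebLLMErrorCode, message: string) {
    super(message);
    this.name = "WebLLMProviderError";
    this.code = code;
  }
}

// Map a caught error to a user-facing suggestion (hypothetical wording).
function userHint(error: unknown): string {
  if (error instanceof WebLLMProviderError) {
    switch (error.code) {
      case "WEBLLM_UNSUPPORTED":
        return "Use Chrome 113+, Edge 113+, or Brave 113+ on desktop.";
      case "WEBLLM_DOWNLOAD_REQUIRED":
        return "Approve the model download to continue.";
      case "WEBLLM_INIT_FAILED":
        return "Reload the page and try again.";
      case "WEBLLM_STREAM_FAILED":
        return "Retry the request; free up memory if it persists.";
    }
  }
  return "Unexpected error.";
}
```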

Comparison with Alternatives

WebLLM (Browser)

Pros:
  • No setup
  • Cross-platform
  • Completely private
  • One-time download
Cons:
  • Requires WebGPU
  • Large downloads
  • Limited models
  • Slower than desktop

Ollama (Local)

Pros:
  • Better performance
  • More model options
  • Lower memory usage
  • Works offline
Cons:
  • Requires installation
  • Desktop only
  • Manual setup

OpenAI (Remote)

Pros:
  • Best quality
  • Fast responses
  • Latest models
  • No setup
Cons:
  • Requires API key
  • Usage costs
  • Not private
  • Needs internet

Best Practices

  1. Choose Appropriate Model: start with Llama 3.2 1B (the default). Only use larger models if you need higher quality and have sufficient memory.
  2. Download on Wi-Fi: models range from 250 MB to 1.9 GB. Use Wi-Fi to avoid mobile data charges.
  3. Close Other Tabs: free up memory before loading models. Each tab consumes memory independently.
  4. Monitor Performance: check tokens/second during generation. Slow performance may indicate memory pressure.

Feature Flags

WebLLM can be disabled via environment variable:
.env
VITE_WEBLLM_ENABLED=true  # Enable WebLLM (default)
VITE_WEBLLM_ENABLED=false # Disable WebLLM
Feature Check
export function isWebLLMEnabled(): boolean {
  const raw = import.meta.env.VITE_WEBLLM_ENABLED;
  // Enabled by default when the variable is unset
  if (raw === undefined) return true;
  return raw === "1" || raw === "true";
}
When disabled:
  • WebLLM option not shown in provider selector
  • Model catalog not loaded
  • Download dialog not rendered