
Overview

GitStarRecall uses a multi-layered storage strategy designed for privacy, performance, and browser compatibility. All data is stored client-side by default.

Storage Layers

Primary Storage: SQLite WASM

The main database uses sql.js, a SQLite WASM implementation that runs entirely in the browser. Key Features:
  • Full SQL database in browser memory
  • ACID transactions
  • Foreign key constraints
  • Efficient indexing
  • Export/import capability
Implementation:
import initSqlJs from "sql.js";
import wasmUrl from "sql.js/dist/sql-wasm.wasm?url";

const SQL = await initSqlJs({
  locateFile: () => wasmUrl,
});

Persistence Layer: OPFS

Origin Private File System (OPFS) provides fast, private file storage when available. Benefits:
  • Persistent across sessions
  • Better performance than localStorage
  • Larger storage quota
  • Private to origin
Availability Check:
function isOpfsSupported(): boolean {
  return typeof navigator !== "undefined" && 
         Boolean(navigator.storage?.getDirectory);
}
File Operations:
  • Database file: gitstarrecall.sqlite
  • Written on checkpoint (periodic + on completion)
  • Read on app initialization
  • Atomic write with createWritable()
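The write path above can be sketched as follows; `writeDatabaseToOpfs` is a hypothetical name, and the `gitstarrecall.sqlite` filename is taken from the list above:

```typescript
// Hedged sketch: persist the exported SQLite bytes to OPFS.
// createWritable() stages writes in a temporary file and swaps it in
// on close(), which is what makes the write atomic.
async function writeDatabaseToOpfs(bytes: Uint8Array): Promise<void> {
  const storage = (globalThis as any).navigator?.storage;
  const root = await storage.getDirectory();
  const handle = await root.getFileHandle("gitstarrecall.sqlite", {
    create: true,
  });
  const writable = await handle.createWritable();
  await writable.write(bytes);
  await writable.close();
}
```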

Fallback: localStorage

When OPFS is unavailable, the database is persisted to localStorage as Base64-encoded bytes.
Key: gitstarrecall.sqlite.base64
Limitations:
  • ~5-10MB quota (browser-dependent)
  • Slower than OPFS
  • Can fail on quota exceeded
Encoding:
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}
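On load, the inverse decoding is needed; a minimal sketch (`fromBase64` is an assumed name):

```typescript
// Hedged sketch: decode the Base64 string stored in localStorage
// back into the raw database bytes.
function fromBase64(base64: string): Uint8Array {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}
```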

Fallback: Memory-Only Mode

When both OPFS and localStorage fail (for example, when the storage quota is exhausted), the database runs in memory only. Behavior:
  • Data persists within the tab session
  • Lost on tab close/refresh
  • Persistence writes are skipped, so no write errors surface
  • Storage mode indicated in UI

Storage Mode Priority

  1. OPFS (preferred)
  2. localStorage (fallback)
  3. memory (last resort)
The app automatically degrades gracefully based on browser capabilities and quota.
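The cascade reduces to a small selection function; a sketch with the capability checks factored out as booleans (the flags stand in for the OPFS support check above and a localStorage write probe):

```typescript
type StorageMode = "opfs" | "local-storage" | "memory";

// Hedged sketch of the priority cascade: OPFS preferred,
// localStorage as fallback, memory as last resort.
function pickStorageMode(opfsOk: boolean, localStorageOk: boolean): StorageMode {
  if (opfsOk) return "opfs";
  if (localStorageOk) return "local-storage";
  return "memory";
}
```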

Database Schema

Tables

repos

Stores GitHub repository metadata and README content.
CREATE TABLE IF NOT EXISTS repos (
  id INTEGER PRIMARY KEY,
  full_name TEXT NOT NULL,
  name TEXT NOT NULL,
  description TEXT,
  topics_json TEXT NOT NULL DEFAULT '[]',
  language TEXT,
  html_url TEXT NOT NULL,
  stars INTEGER NOT NULL DEFAULT 0,
  forks INTEGER NOT NULL DEFAULT 0,
  updated_at TEXT NOT NULL,
  readme_url TEXT,
  readme_text TEXT,
  readme_etag TEXT,
  readme_last_modified TEXT,
  checksum TEXT,
  last_synced_at INTEGER NOT NULL
);
Key Fields:
  • id: GitHub repository ID
  • checksum: SHA-256 hash of metadata + README for diff-based sync
  • readme_etag: HTTP ETag for conditional README fetch
  • readme_last_modified: HTTP Last-Modified header
  • topics_json: JSON array of repository topics
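Since topics_json is stored as a JSON string, reads benefit from a defensive parse; a sketch (`parseTopics` is a hypothetical helper):

```typescript
// Hedged sketch: parse the topics_json column, tolerating bad data.
function parseTopics(topicsJson: string): string[] {
  try {
    const parsed = JSON.parse(topicsJson);
    return Array.isArray(parsed) ? parsed.map(String) : [];
  } catch {
    // Malformed JSON falls back to the column's default: an empty list
    return [];
  }
}
```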

chunks

Stores text chunks generated from README content.
CREATE TABLE IF NOT EXISTS chunks (
  id TEXT PRIMARY KEY,
  repo_id INTEGER NOT NULL,
  chunk_id TEXT NOT NULL,
  text TEXT NOT NULL,
  source TEXT NOT NULL,
  created_at INTEGER NOT NULL,
  FOREIGN KEY (repo_id) REFERENCES repos(id) ON DELETE CASCADE
);

CREATE INDEX idx_chunks_repo_id ON chunks(repo_id);
CREATE INDEX idx_chunks_created_at ON chunks(created_at);
Chunking Strategy:
  • Simple tokenizer with overlap
  • Target: 500-800 characters per chunk
  • Overlap: 80-120 characters
  • Source: readme or metadata
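A minimal sketch of such an overlap chunker, using mid-range defaults (650 characters per chunk, 100 characters of overlap) as assumptions:

```typescript
// Hedged sketch: slide a fixed-size window over the text, stepping
// by (size - overlap) so consecutive chunks share a tail/head region.
function chunkText(text: string, size = 650, overlap = 100): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // final (possibly short) chunk
    start += size - overlap;
  }
  return chunks;
}
```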

embeddings

Stores vector embeddings for semantic search.
CREATE TABLE IF NOT EXISTS embeddings (
  id TEXT PRIMARY KEY,
  chunk_id TEXT NOT NULL,
  model TEXT NOT NULL,
  dimension INTEGER NOT NULL,
  vector_blob BLOB NOT NULL,
  created_at INTEGER NOT NULL,
  FOREIGN KEY (chunk_id) REFERENCES chunks(id) ON DELETE CASCADE
);

CREATE INDEX idx_embeddings_chunk_id ON embeddings(chunk_id);
Vector Storage:
  • Format: Float32Array stored as BLOB
  • Dimension: 384 (for all-MiniLM-L6-v2)
  • Model: Tracked for compatibility checks
  • Normalization: L2-normalized before storage
Conversion:
// Float32Array to Uint8Array for BLOB storage (zero-copy view)
const vectorBlob = new Uint8Array(
  vector.buffer,
  vector.byteOffset,
  vector.byteLength
);

// Uint8Array back to Float32Array for search.
// Note: this view requires blob.byteOffset to be a multiple of 4;
// copy the bytes into a fresh buffer first if alignment is not guaranteed.
const vector = new Float32Array(
  blob.buffer,
  blob.byteOffset,
  blob.byteLength / 4
);
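The L2 normalization mentioned above can be sketched as:

```typescript
// Hedged sketch: scale a vector to unit length before storage,
// so cosine similarity reduces to a dot product at query time.
function l2Normalize(vector: Float32Array): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < vector.length; i++) sumSq += vector[i] * vector[i];
  const norm = Math.sqrt(sumSq);
  if (norm === 0) return vector; // zero vector cannot be normalized
  const out = new Float32Array(vector.length);
  for (let i = 0; i < vector.length; i++) out[i] = vector[i] / norm;
  return out;
}
```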

chat_sessions

Stores chat session metadata.
CREATE TABLE IF NOT EXISTS chat_sessions (
  id TEXT NOT NULL PRIMARY KEY,
  query TEXT NOT NULL,
  created_at INTEGER NOT NULL,
  updated_at INTEGER NOT NULL
);
Session Management:
  • One session per initial query
  • Can have multiple follow-up messages
  • Sorted by updated_at DESC in UI
  • Updated on new message

chat_messages

Stores individual chat messages.
CREATE TABLE IF NOT EXISTS chat_messages (
  id TEXT NOT NULL PRIMARY KEY,
  session_id TEXT NOT NULL,
  role TEXT NOT NULL CHECK (role IN ('user','assistant','system')),
  content TEXT NOT NULL,
  sequence INTEGER NOT NULL,
  created_at INTEGER NOT NULL,
  FOREIGN KEY (session_id) REFERENCES chat_sessions(id) ON DELETE CASCADE
);

CREATE INDEX idx_chat_messages_session ON chat_messages(session_id);
CREATE INDEX idx_chat_messages_order ON chat_messages(session_id, created_at, sequence);
Message Ordering:
  • Primary: created_at ASC
  • Secondary: sequence ASC (for same timestamp)
  • Role: user, assistant, or system
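The same ordering expressed as an in-memory comparator (a sketch; the real query orders in SQL via the index above):

```typescript
interface MessageOrder {
  createdAt: number;
  sequence: number;
}

// Hedged sketch: created_at ASC first, then sequence ASC as tiebreaker.
function compareMessages(a: MessageOrder, b: MessageOrder): number {
  return a.createdAt - b.createdAt || a.sequence - b.sequence;
}
```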

index_meta

Stores indexing metadata and resume state.
CREATE TABLE IF NOT EXISTS index_meta (
  key TEXT PRIMARY KEY,
  value TEXT NOT NULL,
  updated_at INTEGER NOT NULL
);
Key Metadata:
  • embedding_backend: webgpu or wasm
  • embedding_pool_size: Number of workers used
  • checkpoint_policy_version: Tracking policy changes
  • last_checkpoint_at: Timestamp of last checkpoint
  • embedding_perf_last_run: JSON performance summary
  • large_library_cursor: Resume position for interrupted indexing
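An in-memory model of the key-value accessors (a sketch; the real implementation would issue the equivalent INSERT ... ON CONFLICT against index_meta):

```typescript
// Hedged sketch: index_meta modeled as a last-write-wins key-value store.
class IndexMeta {
  private rows = new Map<string, { value: string; updatedAt: number }>();

  set(key: string, value: string, now = Date.now()): void {
    // Upsert: overwrite the value and refresh updated_at
    this.rows.set(key, { value, updatedAt: now });
  }

  get(key: string): string | null {
    return this.rows.get(key)?.value ?? null;
  }
}
```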

Chat Backup System

Chat sessions and messages are additionally backed up to IndexedDB (with localStorage fallback) to provide extra durability.

IndexedDB Structure

Database: gitstarrecall-chat-backup (version 1)
Object Stores:
  1. chat_sessions
    • keyPath: id
    • Stores: ChatSessionRecord[]
  2. chat_messages
    • keyPath: id
    • Index: by_session_id on sessionId (non-unique)
    • Stores: ChatMessageRecord[]

Backup Strategy

On Session Write:
await backupChatSession(session);
On Message Write:
await backupChatMessage(message);
Backup Priority:
  1. Try IndexedDB
  2. Fall back to localStorage
  3. Silent failure (chat still in SQLite)
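The cascade can be expressed generically; `writeWithFallback` is a hypothetical helper, with the IndexedDB and localStorage writers passed in:

```typescript
// Hedged sketch: try each storage layer in order; swallow failures
// because the data is still durable in SQLite.
async function writeWithFallback(
  writers: Array<() => Promise<void>>
): Promise<boolean> {
  for (const write of writers) {
    try {
      await write();
      return true;
    } catch {
      // e.g. quota exceeded; fall through to the next layer
    }
  }
  return false; // silent failure: chat remains in SQLite
}
```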

Backup Limits

  • Sessions: Max 200 (keep most recent)
  • Messages: Max 5000 (keep most recent by created_at)
  • Auto-pruning: After each backup write
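The pruning rule can be sketched as a pure function over the backed-up records:

```typescript
// Hedged sketch: keep only the most recent `max` records by created_at.
function pruneByCreatedAt<T extends { createdAt: number }>(
  records: T[],
  max: number
): T[] {
  return [...records]
    .sort((a, b) => b.createdAt - a.createdAt) // newest first
    .slice(0, max);
}
```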

localStorage Backup Keys

  • Sessions: gitstarrecall.chat.backup.sessions.v1
  • Messages: gitstarrecall.chat.backup.messages.v1

Recovery Flow

  1. On app load, check IndexedDB for backup
  2. If IndexedDB has data, use it
  3. Otherwise, check localStorage
  4. Merge backup into SQLite if SQLite is empty or corrupt

Checkpointing Strategy

Embedding writes are batched and checkpointed periodically to balance performance and durability.

Policy

Checkpoint Triggers:
  • Every 256 embeddings (configurable)
  • Every 3000ms (configurable)
  • On completion of indexing run
  • On manual flush
Environment Variables:
VITE_DB_CHECKPOINT_EVERY_EMBEDDINGS=256
VITE_DB_CHECKPOINT_EVERY_MS=3000

Implementation

class LocalDatabase {
  private pendingEmbeddingsSinceCheckpoint = 0;
  private lastEmbeddingCheckpointAt: number | null = null;

  private shouldCheckpointEmbeddings(now: number): boolean {
    if (this.pendingEmbeddingsSinceCheckpoint >= policy.everyEmbeddings) {
      return true;
    }
    const elapsed = now - (this.lastEmbeddingCheckpointAt ?? now);
    return elapsed >= policy.everyMs;
  }

  async upsertEmbeddings(embeddings: EmbeddingRecord[]): Promise<void> {
    // Write to DB
    this.runEmbeddingUpsert(embeddings);
    
    // Track pending
    this.noteEmbeddingWrites(embeddings.length);
    
    // Checkpoint if needed
    if (this.shouldCheckpointEmbeddings(Date.now())) {
      await this.flushPendingEmbeddingCheckpoint();
    }
  }
}

Benefits

  • Performance: Fewer disk writes
  • Durability: Regular checkpoints limit data loss
  • Tunability: Configurable based on device capabilities

Vector Search Implementation

GitStarRecall uses brute-force cosine similarity search with an in-memory cache.

Vector Index Cache

private vectorIndexCache: Array<{ 
  chunkId: string; 
  vector: Float32Array 
}> | null = null;
private vectorIndexCacheCount = -1;
Cache Invalidation:
  • On new embeddings written
  • On chunks deleted
  • When embedding count changes
Cache Rebuild:
const result = db.exec(`
  SELECT e.chunk_id, e.vector_blob
  FROM embeddings e
  INNER JOIN chunks c ON c.id = e.chunk_id;
`);

this.vectorIndexCache = result[0].values.map((row) => {
  const blob = row[1] as Uint8Array;
  return {
    chunkId: String(row[0]),
    vector: new Float32Array(
      blob.buffer,
      blob.byteOffset,
      blob.byteLength / 4
    ),
  };
});

Search Flow:
async findSimilarChunks(
  queryVector: Float32Array, 
  limit: number
): Promise<SearchResult[]> {
  // 1. Ensure cache is current
  const vectors = this.ensureVectorIndexCache();
  
  // 2. Compute similarity for all vectors
  const scores = vectors.map(({ chunkId, vector }) => ({
    chunkId,
    score: cosineSimilarity(queryVector, vector)
  }));
  
  // 3. Sort and slice top-K
  scores.sort((a, b) => b.score - a.score);
  const topChunks = scores.slice(0, limit);
  
  // 4. Hydrate with text and repo metadata via a SQL join
  //    against chunks and repos (details elided)
  const results = this.hydrateChunks(topChunks); // hydration helper (elided)

  return results;
}

Cosine Similarity

function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, normA = 0, normB = 0;
  
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
Normalization:
  • All embeddings are L2-normalized before storage
  • Query vectors are L2-normalized before search
  • This ensures cosine similarity works correctly
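Because both sides are already unit-length, the denominator is ~1 and the similarity can be computed as a plain dot product; a hedged fast-path sketch:

```typescript
// Hedged sketch: for L2-normalized vectors, cosine similarity
// equals the dot product, saving two norm computations per pair.
function dotProduct(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}
```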

Sync and Diff Strategy

Checksum Generation

// Runs in the browser, so this uses the Web Crypto API rather than
// Node's "crypto" module.
async function computeRepoChecksum(repo: RepoMetadata): Promise<string> {
  const canonical = JSON.stringify({
    id: repo.id,
    fullName: repo.fullName,
    description: repo.description ?? "",
    topics: [...repo.topics].sort(), // copy before sorting to avoid mutating the repo
    language: repo.language ?? "",
    updatedAt: repo.updatedAt,
    readmeText: repo.readmeText ?? ""
  });

  const digest = await crypto.subtle.digest(
    "SHA-256",
    new TextEncoder().encode(canonical)
  );
  return Array.from(new Uint8Array(digest))
    .map((byte) => byte.toString(16).padStart(2, "0"))
    .join("");
}

Diff-Based Sync

On Fetch Stars:
  1. Get current repo sync state from DB
  2. Fetch latest stars from GitHub
  3. Compute checksums for fetched repos
  4. Compare:
    • New: Not in local DB
    • Changed: Checksum differs
    • Unchanged: Checksum matches
    • Removed: In local DB but not in fetched stars
  5. Update/insert changed and new repos
  6. Delete removed repos (cascades to chunks and embeddings)
  7. Generate chunks only for new/changed repos
  8. Queue chunks for embedding
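Step 4 can be sketched as a pure classification over checksum maps (`diffRepos` and its shape are assumptions):

```typescript
interface RepoDiff {
  added: number[];
  changed: number[];
  unchanged: number[];
  removed: number[];
}

// Hedged sketch: classify repos by presence and checksum equality.
function diffRepos(
  local: Map<number, string>,   // repo id -> stored checksum
  fetched: Map<number, string>  // repo id -> freshly computed checksum
): RepoDiff {
  const diff: RepoDiff = { added: [], changed: [], unchanged: [], removed: [] };
  for (const [id, checksum] of fetched) {
    const existing = local.get(id);
    if (existing === undefined) diff.added.push(id);
    else if (existing !== checksum) diff.changed.push(id);
    else diff.unchanged.push(id);
  }
  for (const id of local.keys()) {
    if (!fetched.has(id)) diff.removed.push(id); // unstarred since last sync
  }
  return diff;
}
```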

README Caching

ETag and Last-Modified Headers:
const headers: Record<string, string> = {};
if (repo.readmeEtag) {
  headers["If-None-Match"] = repo.readmeEtag;
}
if (repo.readmeLastModified) {
  headers["If-Modified-Since"] = repo.readmeLastModified;
}

const response = await fetch(readmeUrl, { headers });

if (response.status === 304) {
  // Not modified, skip README update
  return;
}

// Update README and store new ETag/Last-Modified
repo.readmeEtag = response.headers.get("ETag");
repo.readmeLastModified = response.headers.get("Last-Modified");
Benefits:
  • Reduces GitHub API calls
  • Faster sync for unchanged READMEs
  • Respects rate limits

Data Cleanup

Clear All Data

// Clear SQLite
await clearOpfsFile();
clearLocalStorageBytes();

// Clear chat backup
await clearChatBackup();

// Reinitialize database
const db = await initializeDatabase();
User Triggers:
  • “Delete local data” button in settings
  • Clears all repos, chunks, embeddings, and chat data
  • Does not clear GitHub token

Clear Token

// Clear from memory
setToken(null);

// Clear from encrypted storage (if enabled)
await clearEncryptedToken();
User Triggers:
  • “Clear token” button in settings
  • Logs user out
  • Does not clear local data

Storage Diagnostics

Storage Mode Detection

const db = await getDatabase();
const mode = db.storageMode; // "opfs" | "local-storage" | "memory"
Displayed in UI:
  • Settings page shows current storage mode
  • Warning if in memory-only mode

Quota Estimation

if (navigator.storage?.estimate) {
  const estimate = await navigator.storage.estimate();
  const usedMB = (estimate.usage ?? 0) / (1024 * 1024);
  const quotaMB = (estimate.quota ?? 0) / (1024 * 1024);
  const percentUsed = quotaMB > 0 ? (usedMB / quotaMB) * 100 : 0;
}

Database Size

const bytes = db.export();
const sizeMB = bytes.byteLength / (1024 * 1024);