Code indexing is the pipeline that turns your workspace into a searchable vector database. Once indexed, both Ask and Agent modes automatically retrieve relevant code as context, and you can search explicitly via the search_vector_db tool.
For how the search pipeline queries the index, see Semantic Search.
Quick start
Run CodeBuddy: Index Workspace for Semantic Search.
This scans your workspace, chunks every supported file, generates embeddings, and stores everything in a local SQLite database. A progress notification tracks file count, chunk count, and whether embeddings are available.
Indexing pipeline
File discovery
Supported languages
The indexer scans for files matching these extensions:
.ts .tsx .js .jsx .py .java .go .rs .cpp .c .h .cs .rb .php
Exclusions
Files are excluded at multiple levels:
| Layer | What’s excluded |
|---|---|
| Base excludes | node_modules, .git, dist, out, build, coverage, .codebuddy |
| `.codebuddyignore` | Custom patterns (.gitignore syntax: globs, `**`, negation with `!`, directory patterns with trailing `/`) |
| Sync excludes | *.min.js, *.bundle.js, *.d.ts, .DS_Store, .vscode-test |
| File size | Varies by performance mode (see below) |
Run CodeBuddy: Init .codebuddyignore to create a starter ignore file.
The .codebuddyignore watcher automatically reloads patterns when the file changes — no restart required.
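As a rough illustration of how the extension list, base excludes, and sync excludes combine, here is a minimal sketch. The function and constant names are hypothetical, and it assumes excluded directory names apply at any depth; `.codebuddyignore` patterns and file size limits are left out.

```typescript
// Values copied from the tables above; all names here are hypothetical.
const SUPPORTED_EXTENSIONS = new Set([
  ".ts", ".tsx", ".js", ".jsx", ".py", ".java", ".go",
  ".rs", ".cpp", ".c", ".h", ".cs", ".rb", ".php",
]);
const BASE_EXCLUDED_DIRS = new Set([
  "node_modules", ".git", "dist", "out", "build", "coverage", ".codebuddy",
]);
const SYNC_EXCLUDED_NAMES = new Set([".DS_Store", ".vscode-test"]);
const SYNC_EXCLUDED_SUFFIXES = [".min.js", ".bundle.js", ".d.ts"];

// Decide whether a workspace-relative path (with "/" separators) is a candidate.
function isCandidate(relativePath: string): boolean {
  const segments = relativePath.split("/");
  const fileName = segments[segments.length - 1];
  // Assumption: an excluded directory anywhere in the path rules the file out.
  if (segments.slice(0, -1).some((s) => BASE_EXCLUDED_DIRS.has(s))) return false;
  if (segments.some((s) => SYNC_EXCLUDED_NAMES.has(s))) return false;
  if (SYNC_EXCLUDED_SUFFIXES.some((suffix) => fileName.endsWith(suffix))) return false;
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return false; // no extension, or a dotfile
  return SUPPORTED_EXTENSIONS.has(fileName.slice(dot));
}
```

For example, `isCandidate("src/app.ts")` is true, while `isCandidate("src/types.d.ts")` and `isCandidate("node_modules/x/index.js")` are both false.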
.codebuddyignore syntax
The file uses .gitignore syntax. Place it in the workspace root.
```
# Ignore test fixtures
tests/fixtures/**

# Ignore generated code
src/generated/

# Ignore large data files
*.csv
*.parquet

# Re-include a specific fixture needed for indexing
!tests/fixtures/sample.ts

# Ignore a specific directory (trailing / = directory only)
tmp/

# Anchored pattern (only matches at root, not nested)
/scripts/legacy/
```
| Syntax | Meaning | Example |
|---|---|---|
| `*` | Match any characters except `/` | `*.log` |
| `**` | Match any path depth | `tests/**/snap*` |
| `!` | Negation: re-include a path excluded by an earlier pattern | `!src/important.ts` |
| `/` (trailing) | Match directories only | `build/` |
| `/` (leading) | Anchored to root; won't match in subdirectories | `/scripts/` |
| `#` | Comment line | `# Ignore logs` |
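To make the rules in the table concrete, the sketch below shows one simplified way to evaluate .gitignore-style patterns, with later rules overriding earlier ones. It is not CodeBuddy's matcher: `compileRule` and `isIgnored` are hypothetical names, `?` and character classes are treated as literals, and git's rule that a file cannot be re-included when its parent directory is excluded is not modelled.

```typescript
interface Rule {
  negated: boolean; // pattern started with "!"
  dirOnly: boolean; // pattern ended with "/"
  regex: RegExp;
}

// Turn one ignore-file line into a rule; comments and blank lines yield null.
function compileRule(line: string): Rule | null {
  let pattern = line.trim();
  if (pattern === "" || pattern.startsWith("#")) return null;
  const negated = pattern.startsWith("!");
  if (negated) pattern = pattern.slice(1);
  const dirOnly = pattern.endsWith("/");
  if (dirOnly) pattern = pattern.slice(0, -1);
  // A slash at the start or in the middle anchors the pattern to the root.
  const anchored = pattern.includes("/");
  if (pattern.startsWith("/")) pattern = pattern.slice(1);
  const source = pattern
    .replace(/[.+^${}()|[\]\\?]/g, "\\$&") // escape regex metacharacters (not *)
    .replace(/\*\*\//g, "\u0000") // placeholder for "**/"
    .replace(/\*\*/g, "\u0001") // placeholder for any other "**"
    .replace(/\*/g, "[^/]*") // "*" stays within one path segment
    .replace(/\u0000/g, "(?:.*/)?") // "**/" spans zero or more directories
    .replace(/\u0001/g, ".*");
  const prefix = anchored ? "^" : "^(?:.*/)?";
  return { negated, dirOnly, regex: new RegExp(prefix + source + "$") };
}

// Check a workspace-relative file path against the contents of an ignore file.
function isIgnored(relativePath: string, ignoreFile: string): boolean {
  const rules = ignoreFile
    .split("\n")
    .map((line) => compileRule(line))
    .filter((r): r is Rule => r !== null);
  const segments = relativePath.split("/");
  let ignored = false;
  for (const rule of rules) {
    // Test every ancestor directory and the file itself; the last match wins.
    for (let depth = 1; depth <= segments.length; depth++) {
      const isFile = depth === segments.length;
      if (rule.dirOnly && isFile) continue;
      if (rule.regex.test(segments.slice(0, depth).join("/"))) {
        ignored = !rule.negated;
        break;
      }
    }
  }
  return ignored;
}
```

With the example file above, `tests/fixtures/big.json` is ignored, `tests/fixtures/sample.ts` is re-included by the negation rule, and `packages/a/scripts/legacy/old.js` is not ignored because `/scripts/legacy/` is anchored to the root.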
File size limits
| Performance mode | Max file size |
|---|---|
| `balanced` (default) | 1 MB |
| `performance` | 2 MB |
| `memory` | 512 KB |
Set via codebuddy.vectorDb.performanceMode.
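The size check can be pictured as a lookup keyed by performance mode. This is a sketch only: the names are hypothetical, and it assumes 1 MB means 1,024 × 1,024 bytes and that a file exactly at the limit is still indexed.

```typescript
type PerformanceMode = "balanced" | "performance" | "memory";

// Limits from the table above (assumption: binary megabytes and kilobytes).
const MAX_FILE_SIZE_BYTES: Record<PerformanceMode, number> = {
  balanced: 1024 * 1024,
  performance: 2 * 1024 * 1024,
  memory: 512 * 1024,
};

// Returns true when a file is small enough to index in the given mode.
function withinSizeLimit(sizeInBytes: number, mode: PerformanceMode = "balanced"): boolean {
  return sizeInBytes <= MAX_FILE_SIZE_BYTES[mode];
}
```

A 1.5 MB file would therefore be skipped in `balanced` and `memory` modes but indexed in `performance` mode.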
Chunking
Chunking runs in a dedicated worker thread to avoid blocking the editor UI.
| Parameter | Value |
|---|---|
| Chunk size | 1,000 characters |
| Overlap | 200 characters |
| Minimum chunk | 50 characters (smaller chunks are discarded) |
Each chunk records:
- ID: `{filePath}::{charOffset}`
- Text: the chunk content
- Line range: approximate start/end line numbers
- Type: `text_chunk`, `function`, `class`, `method`, or `block`
- Language: detected from the file extension
The worker uses Tree-sitter for AST-aware chunking with a text-based fallback splitter. Language detection maps extensions to parsers: .ts/.tsx → TypeScript, .py → Python, .java → Java, .go → Go, .rs → Rust, .cpp → C++, .c/.h → C, .cs → C#, .rb → Ruby, .php → PHP.
Tree-sitter AST analysis
CodeBuddy ships 8 pre-compiled WASM grammars for AST-aware chunking:
| Grammar | Extensions | What it extracts |
|---|---|---|
| TypeScript | .ts, .tsx | Functions, classes, methods, interfaces, type aliases |
| JavaScript | .js, .jsx | Functions, classes, arrow functions, exports |
| Python | .py | Functions, classes, methods, decorators, docstrings |
| Java | .java | Classes, methods, interfaces, annotations |
| Go | .go | Functions, methods, structs, interfaces |
| Rust | .rs | Functions, impl blocks, structs, enums, traits |
| PHP | .php | Functions, classes, methods, namespaces |
| C/C++ | .c, .h, .cpp | Functions, structs, classes, macros |
For languages with dedicated analyzers (TypeScript, JavaScript, Python), the AST parser extracts:
- Function boundaries — splits at function/method definitions instead of arbitrary character offsets
- Class grouping — keeps class bodies together when they fit within a chunk
- Import regions — groups import statements into a single chunk
- Docstring preservation — attaches docstrings/JSDoc comments to their associated function
Languages without a WASM grammar fall back to the text-based splitter (character offset + overlap).
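The text-based fallback can be sketched as a sliding window using the chunk size, overlap, and minimum chunk values from the Chunking table. The function name, the `TextChunk` shape, and the line-number estimate are assumptions made for illustration, not the worker's actual code.

```typescript
interface TextChunk {
  id: string; // `${filePath}::${charOffset}`
  text: string;
  startLine: number; // approximate, 1-based
  endLine: number; // approximate, 1-based
  type: "text_chunk";
}

const CHUNK_SIZE = 1000; // characters
const OVERLAP = 200; // characters shared with the previous chunk
const MIN_CHUNK = 50; // smaller chunks are discarded

function splitText(filePath: string, content: string): TextChunk[] {
  const chunks: TextChunk[] = [];
  const step = CHUNK_SIZE - OVERLAP; // each window starts 800 characters after the last
  for (let offset = 0; offset < content.length; offset += step) {
    const text = content.slice(offset, offset + CHUNK_SIZE);
    if (text.length >= MIN_CHUNK) {
      // Count the newlines before the offset to estimate the starting line.
      const startLine = content.slice(0, offset).split("\n").length;
      const endLine = startLine + text.split("\n").length - 1;
      chunks.push({ id: `${filePath}::${offset}`, text, startLine, endLine, type: "text_chunk" });
    }
    if (offset + CHUNK_SIZE >= content.length) break; // this window reached the end
  }
  return chunks;
}
```

Under these assumptions a 2,000-character file yields three chunks starting at offsets 0, 800, and 1,600, and a 30-character file yields none.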
Embedding generation
Providers
| Provider | Model | Notes |
|---|---|---|
| Gemini (default) | text-embedding-004 | Default when codebuddy.vectorDb.embeddingModel = "gemini" |
| OpenAI | text-embedding-3-small | OpenAI-compatible endpoint |
| Local | Configurable | Uses local server’s /embeddings endpoint (e.g., nomic-embed-text) |
| Deepseek / Groq | OpenAI-compatible | Same SDK, different base URL |
Anthropic does not provide an embeddings endpoint, so CodeBuddy falls back to Gemini automatically.
Batching and rate limiting
| Parameter | Default |
|---|---|
| Batch size | 5 chunks per batch |
| Rate limit | 1,500 requests/min (40ms minimum interval) |
| Retries | 3, with linear backoff (delay × retryCount) |
| Retry delay | 1,000ms base |
Between batches, the indexer yields to the event loop via setImmediate() to keep the editor responsive.
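The defaults in the table translate into a few small helpers. The sketch below is illustrative: every name is hypothetical, `embed` stands in for whichever provider call is configured, and it assumes "3 retries" means three further attempts after the first failure, with the wait computed as delay × retryCount as the table states.

```typescript
const BATCH_SIZE = 5; // chunks per batch
const REQUESTS_PER_MINUTE = 1500;
const MAX_RETRIES = 3;
const BASE_RETRY_DELAY_MS = 1000;

// 60,000 ms / 1,500 requests = 40 ms minimum gap between requests.
const MIN_INTERVAL_MS = 60_000 / REQUESTS_PER_MINUTE;

// Split a list of chunks into batches of at most `size` items.
function toBatches<T>(items: T[], size: number = BATCH_SIZE): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) batches.push(items.slice(i, i + size));
  return batches;
}

// Wait before the nth retry: 1,000 ms, then 2,000 ms, then 3,000 ms.
function retryDelayMs(retryCount: number): number {
  return BASE_RETRY_DELAY_MS * retryCount;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Call `embed` for one batch, retrying on failure up to MAX_RETRIES times.
async function embedWithRetry<T>(
  batch: string[],
  embed: (texts: string[]) => Promise<T>,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    if (attempt > 0) await sleep(retryDelayMs(attempt));
    try {
      return await embed(batch);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

A caller would loop over `toBatches(chunks)`, await `embedWithRetry` for each batch, and wait at least `MIN_INTERVAL_MS` between requests.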
Smart embedding phases
The embedding system uses different configurations depending on context:
| Phase | Batch size | Max files | Delay between batches | Timeout | Retries |
|---|---|---|---|---|---|
| Immediate (on-save) | 5 | 20 | 100ms | 30s | 3 |
| On-demand (user query) | 3 | 15 | 200ms | 20s | 2 |
| Background (idle) | 10 | 100 | 1,000ms | 60s | 1 |
| Bulk (full index) | 20 | Unlimited | 500ms | 120s | 2 |
Pre-flight check
Before bulk indexing, the system runs a test embedding to verify the API is reachable. If the check fails, indexing continues in text-only mode: chunks are stored without vectors, which enables keyword search but not semantic search. Re-run the index command after fixing your API key to generate embeddings.
Incremental indexing
CodeBuddy uses content hashing for efficient incremental updates:
- On file save, the `onDidSaveTextDocument` listener fires
- A SHA-256 hash of the file content is compared against the `file_metadata` table
- If unchanged → skip (no work done)
- If changed → remove old chunks for this file → re-chunk → re-embed → persist
This means saving a file you didn’t actually change costs almost nothing — just a hash comparison.
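The hash comparison can be sketched with Node's built-in `crypto` module. In this illustration a `Map` stands in for the `file_metadata` table, and `indexOnSave` and its `reindex` callback are hypothetical names for the remove, re-chunk, re-embed, and persist steps.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 digest of the file content.
function sha256(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// In-memory stand-in for the file_metadata table (file_path -> file_hash).
const fileMetadata = new Map<string, string>();

// Returns true if the file was re-indexed, false if it was skipped as unchanged.
function indexOnSave(
  filePath: string,
  content: string,
  reindex: (filePath: string, content: string) => void,
): boolean {
  const hash = sha256(content);
  if (fileMetadata.get(filePath) === hash) return false; // unchanged: skip
  reindex(filePath, content); // remove old chunks, re-chunk, re-embed, persist
  fileMetadata.set(filePath, hash); // record the new hash only after indexing
  return true;
}
```

Saving the same content twice triggers `reindex` once; the second save costs only the hash computation and a lookup.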
What triggers re-indexing
| Trigger | Scope | Behavior |
|---|---|---|
| File save | Single file | Immediate incremental index |
| Index Workspace command | Entire workspace | Bulk index, skips unchanged files |
| Background processing | Changed files | Debounced (default 1,000ms) |
Filtered out automatically: git commit messages, log files, and paths containing node_modules, .git, or .codebuddy.
Storage
SQLite database
The index is stored in a SQLite database powered by sql.js (SQLite compiled to WASM). Location: <workspace>/.codebuddy/vector_store.db (falls back to the editor’s global storage).
Schema:
```sql
-- Chunk storage with optional vector embeddings
CREATE TABLE chunks (
  id TEXT PRIMARY KEY,
  text TEXT NOT NULL,
  vector BLOB,  -- Float32Array as binary (NULL in text-only mode)
  file_path TEXT NOT NULL,
  start_line INTEGER NOT NULL,
  end_line INTEGER NOT NULL,
  chunk_type TEXT NOT NULL DEFAULT 'text_chunk',
  language TEXT NOT NULL DEFAULT '',
  indexed_at TEXT NOT NULL
);

-- File change tracking
CREATE TABLE file_metadata (
  file_path TEXT PRIMARY KEY,
  file_hash TEXT NOT NULL,  -- SHA-256
  chunk_count INTEGER NOT NULL DEFAULT 0,
  indexed_at TEXT NOT NULL
);
```
FTS4 full-text index
An FTS4 virtual table is kept in sync with the chunks table via SQL triggers:
- INSERT trigger — adds chunk text to FTS on insert
- DELETE trigger — removes from FTS on delete
- UPDATE trigger — delete + re-insert on update
On startup, if the FTS row count falls behind the chunks table (e.g., after a crash), a back-fill runs automatically. Small gaps (< 100 rows) use an anti-join INSERT; larger gaps use FTS4’s rebuild command.
Persistence
The database uses a dirty flag with a 5-second debounce save timer. Changes accumulate in memory and flush to disk periodically. On extension deactivation, any pending changes are saved immediately.
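A dirty flag combined with a debounced timer might look like the sketch below. The class and method names are hypothetical, and `save` stands in for whatever writes the database file to disk.

```typescript
class DebouncedSaver {
  private dirty = false;
  private timer: ReturnType<typeof setTimeout> | undefined;

  constructor(
    private readonly save: () => void,
    private readonly delayMs: number = 5000, // 5-second debounce
  ) {}

  // Call after every write: marks the store dirty and restarts the timer.
  markDirty(): void {
    this.dirty = true;
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), this.delayMs);
  }

  // Save immediately if anything is pending, e.g. on extension deactivation.
  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (!this.dirty) return;
    this.dirty = false;
    this.save();
  }
}
```

Several `markDirty()` calls in quick succession result in a single save, and calling `flush()` when nothing has changed does no work.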
Initialization order
When the editor starts with CodeBuddy:
1. `SqliteVectorStore` singleton: loads or creates the SQLite database
2. `HybridSearchService` singleton: creates and initializes the FTS4 virtual table
3. `AstIndexingService` singleton: spawns the worker thread, wires the embedding service
4. `ContextRetriever`: wires the search pipeline
5. `onDidSaveTextDocument` listener: enables incremental indexing on file save
6. `codebuddy.indexWorkspace` command: registered for manual full re-index
Settings
All indexing settings are documented in the Settings Reference under Vector database and Hybrid search.
Key settings for indexing behavior:
| Setting | Default | Effect |
|---|---|---|
| `codebuddy.vectorDb.enabled` | true | Master toggle for the entire indexing system |
| `codebuddy.vectorDb.performanceMode` | "balanced" | Controls file size limits and resource usage |
| `codebuddy.vectorDb.enableBackgroundProcessing` | true | Index changes in the background |
| `codebuddy.vectorDb.debounceDelay` | 1000 | Milliseconds to wait before re-indexing a changed file |
| `codebuddy.vectorDb.batchSize` | 10 | Files per embedding batch |
| `codebuddy.indexCodebase` | false | Feature flag for automatic indexing on startup |