Code indexing is the pipeline that turns your workspace into a searchable vector database. Once indexed, both Ask and Agent modes automatically retrieve relevant code as context, and you can search explicitly via the search_vector_db tool.
For how the search pipeline queries the index, see Semantic Search.
Quick start
Section intitulée « Quick start »CodeBuddy: Index Workspace for Semantic SearchThis scans your workspace, chunks every supported file, generates embeddings, and stores everything in a local SQLite database. A progress notification tracks file count, chunk count, and whether embeddings are available.
Indexing pipeline
Section intitulée « Indexing pipeline »File discovery
Section intitulée « File discovery »Supported languages
Section intitulée « Supported languages »The indexer scans for files matching these extensions:
.ts .tsx .js .jsx .py .java .go .rs .cpp .c .h .cs .rb .php
Exclusions
Section intitulée « Exclusions »Files are excluded at multiple levels:
| Layer | What’s excluded |
|---|---|
| Base excludes | node_modules, .git, dist, out, build, coverage, .codebuddy |
.codebuddyignore | Custom patterns (.gitignore syntax — globs, **, negation with !, directory patterns with trailing /) |
| Sync excludes | *.min.js, *.bundle.js, *.d.ts, .DS_Store, .vscode-test |
| File size | Varies by performance mode (see below) |
Run CodeBuddy: Init .codebuddyignore to create a starter ignore file.
The .codebuddyignore watcher automatically reloads patterns when the file changes — no restart required.
.codebuddyignore syntax
Section intitulée « .codebuddyignore syntax »The file uses .gitignore syntax. Place it in the workspace root.
# Ignore test fixturestests/fixtures/**
# Ignore generated codesrc/generated/
# Ignore large data files*.csv*.parquet
# Re-include a specific fixture needed for indexing!tests/fixtures/sample.ts
# Ignore a specific directory (trailing / = directory only)tmp/
# Anchored pattern (only matches at root, not nested)/scripts/legacy/| Syntax | Meaning | Example |
|---|---|---|
* | Match any characters except / | *.log |
** | Match any path depth | tests/**/snap* |
! | Negation — re-include a previously excluded pattern | !src/important.ts |
/ (trailing) | Match directories only | build/ |
/ (leading) | Anchored to root — won’t match in subdirectories | /scripts/ |
# | Comment line | # Ignore logs |
File size limits
Section intitulée « File size limits »| Performance mode | Max file size |
|---|---|
balanced (default) | 1 MB |
performance | 2 MB |
memory | 512 KB |
Set via codebuddy.vectorDb.performanceMode.
Chunking
Section intitulée « Chunking »Chunking runs in a dedicated worker thread to avoid blocking the editor UI.
| Parameter | Value |
|---|---|
| Chunk size | 1,000 characters |
| Overlap | 200 characters |
| Minimum chunk | 50 characters (smaller chunks are discarded) |
Each chunk records:
- ID —
{filePath}::{charOffset} - Text — the chunk content
- Line range — approximate start/end line numbers
- Type —
text_chunk,function,class,method, orblock - Language — detected from the file extension
The worker uses Tree-sitter for AST-aware chunking with a text-based fallback splitter. Language detection maps extensions to parsers: .ts/.tsx → TypeScript, .py → Python, .java → Java, .go → Go, .rs → Rust, .cpp → C++, .c/.h → C, .cs → C#, .rb → Ruby, .php → PHP.
Tree-sitter AST analysis
Section intitulée « Tree-sitter AST analysis »CodeBuddy ships 8 pre-compiled WASM grammars for AST-aware chunking:
| Grammar | Extensions | What it extracts |
|---|---|---|
| TypeScript | .ts, .tsx | Functions, classes, methods, interfaces, type aliases |
| JavaScript | .js, .jsx | Functions, classes, arrow functions, exports |
| Python | .py | Functions, classes, methods, decorators, docstrings |
| Java | .java | Classes, methods, interfaces, annotations |
| Go | .go | Functions, methods, structs, interfaces |
| Rust | .rs | Functions, impl blocks, structs, enums, traits |
| PHP | .php | Functions, classes, methods, namespaces |
| C/C++ | .c, .h, .cpp | Functions, structs, classes, macros |
For languages with dedicated analyzers (TypeScript, JavaScript, Python), the AST parser extracts:
- Function boundaries — splits at function/method definitions instead of arbitrary character offsets
- Class grouping — keeps class bodies together when they fit within a chunk
- Import regions — groups import statements into a single chunk
- Docstring preservation — attaches docstrings/JSDoc comments to their associated function
Languages without a WASM grammar fall back to the text-based splitter (character offset + overlap).
Embedding generation
Section intitulée « Embedding generation »Providers
Section intitulée « Providers »| Provider | Model | Notes |
|---|---|---|
| Gemini (default) | text-embedding-004 | Default when codebuddy.vectorDb.embeddingModel = "gemini" |
| OpenAI | text-embedding-3-small | OpenAI-compatible endpoint |
| Local | Configurable | Uses local server’s /embeddings endpoint (e.g., nomic-embed-text) |
| Deepseek / Groq | OpenAI-compatible | Same SDK, different base URL |
Anthropic does not support embeddings — automatically falls back to Gemini.
Batching and rate limiting
Section intitulée « Batching and rate limiting »| Parameter | Default |
|---|---|
| Batch size | 5 chunks per batch |
| Rate limit | 1,500 requests/min (40ms minimum interval) |
| Retries | 3, with exponential backoff (delay × retryCount) |
| Retry delay | 1,000ms base |
Between batches, the indexer yields to the event loop via setImmediate() to keep the editor responsive.
Smart embedding phases
Section intitulée « Smart embedding phases »The embedding system uses different configurations depending on context:
| Phase | Batch size | Max files | Delay between batches | Timeout | Retries |
|---|---|---|---|---|---|
| Immediate (on-save) | 5 | 20 | 100ms | 30s | 3 |
| On-demand (user query) | 3 | 15 | 200ms | 20s | 2 |
| Background (idle) | 10 | 100 | 1,000ms | 60s | 1 |
| Bulk (full index) | 20 | Unlimited | 500ms | 120s | 2 |
Pre-flight check
Section intitulée « Pre-flight check »Before bulk indexing, the system runs a test embedding to verify the API is reachable. If the check fails, indexing continues in text-only mode — chunks are stored without vectors, enabling keyword search but not semantic search. Re-run the index command after fixing your API key to generate embeddings.
Incremental indexing
Section intitulée « Incremental indexing »CodeBuddy uses content hashing for efficient incremental updates:
- On file save, the
onDidSaveTextDocumentlistener fires - SHA-256 hash of the file content is compared against the
file_metadatatable - If unchanged → skip (no work done)
- If changed → remove old chunks for this file → re-chunk → re-embed → persist
This means saving a file you didn’t actually change costs almost nothing — just a hash comparison.
What triggers re-indexing
Section intitulée « What triggers re-indexing »| Trigger | Scope | Behavior |
|---|---|---|
| File save | Single file | Immediate incremental index |
| Index Workspace command | Entire workspace | Bulk index, skips unchanged files |
| Background processing | Changed files | Debounced (default 1,000ms) |
Filtered out automatically: git commit messages, log files, and paths containing node_modules, .git, or .codebuddy.
SQLite database
Section intitulée « SQLite database »The index is stored in a SQLite database powered by sql.js (SQLite compiled to WASM). Location: <workspace>/.codebuddy/vector_store.db (falls back to the editor’s global storage).
Schema:
-- Chunk storage with optional vector embeddingsCREATE TABLE chunks ( id TEXT PRIMARY KEY, text TEXT NOT NULL, vector BLOB, -- Float32Array as binary (NULL in text-only mode) file_path TEXT NOT NULL, start_line INTEGER NOT NULL, end_line INTEGER NOT NULL, chunk_type TEXT NOT NULL DEFAULT 'text_chunk', language TEXT NOT NULL DEFAULT '', indexed_at TEXT NOT NULL);
-- File change trackingCREATE TABLE file_metadata ( file_path TEXT PRIMARY KEY, file_hash TEXT NOT NULL, -- SHA-256 chunk_count INTEGER NOT NULL DEFAULT 0, indexed_at TEXT NOT NULL);FTS4 full-text index
Section intitulée « FTS4 full-text index »A FTS4 virtual table is maintained in sync with the chunks table via SQL triggers:
- INSERT trigger — adds chunk text to FTS on insert
- DELETE trigger — removes from FTS on delete
- UPDATE trigger — delete + re-insert on update
On startup, if the FTS row count falls behind the chunks table (e.g., after a crash), a back-fill runs automatically. Small gaps (< 100 rows) use an anti-join INSERT; larger gaps use FTS4’s rebuild command.
Persistence
Section intitulée « Persistence »The database uses a dirty flag with a 5-second debounce save timer. Changes accumulate in memory and flush to disk periodically. On extension deactivation, any pending changes are saved immediately.
Initialization order
Section intitulée « Initialization order »When the editor starts with CodeBuddy:
SqliteVectorStoresingleton — loads or creates the SQLite databaseHybridSearchServicesingleton — creates and initializes the FTS4 virtual tableAstIndexingServicesingleton — spawns the worker thread, wires the embedding serviceContextRetriever— wires the search pipelineonDidSaveTextDocumentlistener — enables incremental indexing on file savecodebuddy.indexWorkspacecommand — registered for manual full re-index
Settings
Section intitulée « Settings »All indexing settings are documented in the Settings Reference under Vector database and Hybrid search.
Key settings for indexing behavior:
| Setting | Default | Effect |
|---|---|---|
codebuddy.vectorDb.enabled | true | Master toggle for the entire indexing system |
codebuddy.vectorDb.performanceMode | "balanced" | Controls file size limits and resource usage |
codebuddy.vectorDb.enableBackgroundProcessing | true | Index changes in the background |
codebuddy.vectorDb.debounceDelay | 1000 | Milliseconds to wait before re-indexing a changed file |
codebuddy.vectorDb.batchSize | 10 | Files per embedding batch |
codebuddy.indexCodebase | false | Feature flag for automatic indexing on startup |