
Local Models

CodeBuddy supports local models as a first-class provider. The local provider uses the OpenAI-compatible API protocol, which means it works with Ollama, Docker Model Runner, LM Studio, vLLM, and any other server that implements the same endpoint format — no cloud API keys required.

```mermaid
graph TB
  subgraph Ext["CodeBuddy Extension"]
    Ask["Ask Mode<br/>(LocalLLM)"]
    Agent["Agent Mode<br/>(ChatOpenAI)"]
    Comp["Completion<br/>(LocalLLM)"]
    SDK["OpenAI SDK<br/>(HTTP client)"]
  end
  Ask --> SDK
  Agent --> SDK
  Comp --> SDK
  SDK --> Ollama["Port 11434<br/>Ollama<br/>(native or Docker)"]
  SDK --> Docker["Port 12434<br/>Docker Model Runner<br/>(llama.cpp engine)"]
```

Ask mode and inline completion use LocalLLM, which wraps the OpenAI Node.js SDK pointed at a local endpoint. Agent mode uses LangChain’s ChatOpenAI class with a custom baseURL, giving the LangGraph pipeline full tool-calling support through the same local server.
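Concretely, both paths boil down to the same HTTP request shape against an OpenAI-compatible endpoint. A minimal sketch (the helper name and trimmed-down payload are illustrative, not CodeBuddy's actual code):

```typescript
// Illustrative sketch: the request shape an OpenAI-compatible client sends.
// Any server in the table below accepts this payload at /chat/completions.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function buildChatRequest(
  baseUrl: string, // e.g. "http://localhost:11434/v1"
  model: string, // e.g. "qwen2.5-coder"
  messages: ChatMessage[],
): { url: string; body: string } {
  return {
    // strip a trailing slash so the path joins cleanly
    url: `${baseUrl.replace(/\/$/, "")}/chat/completions`,
    body: JSON.stringify({ model, messages, stream: false }),
  };
}

const req = buildChatRequest(
  "http://localhost:11434/v1",
  "qwen2.5-coder",
  [{ role: "user", content: "Explain this function" }],
);
// req.url is "http://localhost:11434/v1/chat/completions"
```

Because the wire format is identical everywhere, swapping runtimes only means changing the base URL.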

| Runtime | Default port | How CodeBuddy connects |
| --- | --- | --- |
| Ollama (native) | 11434 | `http://localhost:11434/v1` |
| Ollama (Docker) | 11434 | Same; started via the bundled `docker-compose.yml` |
| Docker Model Runner | 12434 | `http://localhost:12434/engines/llama.cpp/v1` |
| LM Studio | 1234 | Set `local.baseUrl` to `http://localhost:1234/v1` |
| vLLM | 8000 | Set `local.baseUrl` to `http://localhost:8000/v1` |
| Any OpenAI-compatible | varies | Set `local.baseUrl` to the server's endpoint |

Option 1: Native Ollama

Install and start Ollama, then pull a model:

```sh
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull the recommended coding model
ollama pull qwen2.5-coder

# Ollama starts automatically on port 11434
```

In CodeBuddy settings, select Local as the provider. The default base URL (http://localhost:11434/v1) points to Ollama out of the box.

Option 2: Docker Compose (managed by CodeBuddy)

CodeBuddy bundles a docker-compose.yml that starts Ollama in a Docker container with persistent storage:

  1. Open Settings → Models
  2. Click Start Server under the Ollama section
  3. Select a model to pull from the predefined list
  4. Click Use to activate it

This runs:

```sh
docker compose -f <extension-path>/docker-compose.yml up -d
```

The container is configured with a 32 GB memory limit and persistent volume (ollama_data). GPU support (NVIDIA) can be enabled by uncommenting the deploy section in the compose file.
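For reference, a Compose GPU reservation typically has the following shape; the exact contents of the bundled file may differ, so treat this as a sketch of the standard Compose syntax rather than a copy of it:

```yaml
# Standard Compose GPU reservation (the bundled file's deploy section
# should look roughly like this once uncommented)
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```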

Docker Desktop 4.37+ includes a built-in model runner with a llama.cpp engine:

  1. Open Settings → Models
  2. Click Enable Docker Model Runner
  3. Pull models directly through the Docker model registry

This exposes models at http://localhost:12434/engines/llama.cpp/v1. Model names are prefixed with ai/ (e.g., ai/qwen2.5-coder).

Port fallback: If Docker Model Runner on port 12434 is unreachable, CodeBuddy automatically falls back to Ollama on port 11434 and updates your configuration.
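The fallback order can be sketched as a probe over candidate endpoints; names are hypothetical, and the probe is injected so the selection logic is testable without a live server:

```typescript
// Sketch of the fallback described above: try Docker Model Runner first,
// then fall back to Ollama. Not CodeBuddy's actual implementation.
type Probe = (baseUrl: string) => Promise<boolean>;

const CANDIDATES = [
  "http://localhost:12434/engines/llama.cpp/v1", // Docker Model Runner
  "http://localhost:11434/v1", // Ollama
];

async function pickBaseUrl(
  probe: Probe,
  candidates: string[] = CANDIDATES,
): Promise<string> {
  for (const url of candidates) {
    if (await probe(url)) return url; // first reachable endpoint wins
  }
  // nothing reachable: keep the Ollama default so the config stays valid
  return "http://localhost:11434/v1";
}
```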

| Model | Size | Best for |
| --- | --- | --- |
| Qwen 2.5 Coder (7B) | ~4.7 GB | Code tasks; recommended default |
| Qwen 2.5 Coder (3B) | ~2 GB | Faster, lighter coding model |
| DeepSeek Coder | ~6.7 GB | Strong code completion benchmarks |
| CodeLlama (7B) | ~3.8 GB | Meta's code-focused model |
| Llama 3.2 (3B) | ~2 GB | Efficient general-purpose model |

The default model is qwen2.5-coder. You can use any model available in your local runtime — these are just the ones shown in the UI.

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| `local.model` | string | `"qwen2.5-coder"` | Model name (must match what's pulled) |
| `local.baseUrl` | string | `"http://localhost:11434/v1"` | API endpoint for the local server |
| `local.apiKey` | string | `"not-needed"` | API key (not required for local models) |
| `generativeAi.option` | enum | `"Groq"` | Set to `"Local"` to use local models |
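Put together, switching to the local provider with the default Ollama setup is a small settings fragment (shown here as an example, not the only valid combination):

```json
{
  "generativeAi.option": "Local",
  "local.model": "qwen2.5-coder",
  "local.baseUrl": "http://localhost:11434/v1"
}
```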

Local models are the default provider for inline code completion (ghost text). This gives you fast, private completions without cloud API calls.

| Setting | Default | Description |
| --- | --- | --- |
| `codebuddy.completion.provider` | `"Local"` | Set to `"Local"` for local completions |
| `codebuddy.completion.model` | `"qwen2.5-coder"` | Model used for completions |
| `codebuddy.completion.debounceMs` | `300` | Trigger delay in milliseconds |
| `codebuddy.completion.maxTokens` | `128` | Maximum tokens per completion |
| `codebuddy.completion.triggerMode` | `"automatic"` | `automatic` (as you type) or `manual` |
| `codebuddy.completion.multiLine` | `true` | Allow multi-line completions |

The completion engine tries two strategies:

  1. Completion API (standard Fill-in-the-Middle) — used for models with FIM support
  2. Chat API fallback — used if the model only supports chat
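The two strategies differ mainly in how the prompt is packaged. A sketch, assuming Qwen 2.5 Coder's FIM sentinel tokens (other models use different sentinels, and the fallback phrasing here is hypothetical):

```typescript
// Strategy 1: FIM prompt using Qwen 2.5 Coder's sentinel tokens.
// Other model families use different tokens; check the model card.
function buildFimPrompt(prefix: string, suffix: string): string {
  return `<|fim_prefix|>${prefix}<|fim_suffix|>${suffix}<|fim_middle|>`;
}

// Strategy 2: chat fallback for models without FIM support.
// The instruction wording is illustrative only.
function buildChatFallback(prefix: string, suffix: string) {
  return [
    {
      role: "system",
      content:
        "Complete the code between the given prefix and suffix. Reply with code only.",
    },
    { role: "user", content: `Prefix:\n${prefix}\n\nSuffix:\n${suffix}` },
  ];
}
```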

Local models can power the vector database for semantic code search:

```json
{
  "codebuddy.vectorDb.embeddingModel": "local"
}
```

This calls the local server’s /embeddings endpoint. The default embedding model is text-embedding-v1. Not all local models support embeddings — if your model doesn’t expose this endpoint, use "gemini" (default) or "openai" instead.
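Once the `/embeddings` endpoint returns vectors, semantic search reduces to ranking code chunks by similarity. A minimal cosine-similarity helper (a sketch of the standard formula, not CodeBuddy's actual implementation):

```typescript
// Cosine similarity between two embedding vectors:
// 1 means identical direction, 0 means orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([1, 0], [1, 0]); // → 1
cosineSimilarity([1, 0], [0, 1]); // → 0
```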

When you select the Local provider and use Agent mode, the LangGraph pipeline uses LangChain’s ChatOpenAI class with your local endpoint:

```ts
new ChatOpenAI({
  openAIApiKey: "not-needed",
  modelName: "qwen2.5-coder",
  configuration: {
    baseURL: "http://localhost:11434/v1",
  },
});
```

This gives you full agent capabilities — tool calling, subagent delegation, multi-step reasoning — powered by your local hardware. The system prompt explicitly prevents local models from hallucinating tool calls when not in agent context.

Performance note: Agent mode involves multiple sequential LLM calls (reasoning → tool selection → execution → reasoning). Expect slower responses with local models compared to cloud providers, especially with 3B–7B parameter models. Larger models (13B+) or GPU acceleration significantly improve agent performance.

The Settings → Models page provides a visual interface for managing local models:

  • Status indicators: Green/red badges showing whether Ollama or Docker Model Runner is running
  • Model cards: Each predefined model shows Pull / Use / Delete buttons
  • Pull progress: Loading states during model downloads
  • Active model: The currently configured model is highlighted
  • Docker controls: Enable Docker Model Runner, start Ollama via Docker Compose

The model selector pill in the sidebar header shows the active model name and polls every 30 seconds for local runtime status.

Since LocalLLM uses the standard OpenAI SDK, you can point it at any compatible server:

```json
{
  "local.baseUrl": "http://localhost:8080/v1",
  "local.model": "my-custom-model",
  "local.apiKey": "not-needed",
  "generativeAi.option": "Local"
}
```

Compatible servers include Ollama, LM Studio, vLLM, text-generation-webui (with --api), KoboldCpp, LocalAI, and any server implementing the OpenAI chat completions endpoint (/v1/chat/completions).