Claude Code on Ollama: How to Run a Local Coding Agent Without Burning API Credits

June 27, 2026


How to Run Claude Code on Ollama

Install Ollama through Homebrew or the official install script and start the server.
Pull a coding model such as qwen2.5-coder:14b with ollama pull.
Verify Ollama’s OpenAI-compatible endpoint responds at localhost:11434/v1.
Install Claude Code globally through npm install -g @anthropic-ai/claude-code.
Unset any existing ANTHROPIC_API_KEY to prevent accidental API billing.
Export environment variables pointing Claude Code to the local Ollama endpoint.
Launch Claude Code in your project directory and confirm the local model name appears.
Confirm local routing by checking for active connections to port 11434 during a session.

Running Claude Code against Anthropic’s API gets expensive fast. Run Claude Code against a local model through Ollama and you pay zero marginal cost per query. This tutorial walks through the entire setup, from installing Ollama and pulling an appropriate coding model, to configuring Claude Code’s environment variables, to running real coding tasks against a React and Node.js project.


Why Your Claude Code API Bill Is a Problem

Running Claude Code against Anthropic’s API gets expensive fast. Developers on the Anthropic subreddit and various forums have reported spending between $100 and $200 in a single day of heavy agentic coding sessions. One widely cited, self-reported community account described burning through $175 in just four hours while refactoring a medium-sized codebase (results will vary significantly by task type and codebase size). Even conservative usage patterns, involving periodic prompts for code reviews, test generation, and debugging, can easily generate monthly bills exceeding $500 according to similar anecdotal reports. The token-intensive nature of agentic workflows, where Claude Code reads entire files, reasons across multiple steps, and writes back changes, compounds the cost far beyond what a single chat-style API call would.

Run Claude Code against a local model through Ollama and you pay zero marginal cost per query. The model runs on hardware already sitting on the desk.

This tutorial walks through the entire setup, from installing Ollama and pulling an appropriate coding model, to configuring Claude Code’s environment variables, to running real coding tasks against a React and Node.js project. The target reader is a developer with intermediate familiarity with CLI tools, Node.js, and local development environments.

Claude Code version compatibility: Claude Code is under rapid development and its configuration interface, including supported environment variables, may change between releases. This guide documents one approach to local model routing through OpenAI-compatible endpoints. After installation, run claude --version and consult Anthropic’s current documentation or claude --help to confirm the exact environment variable names supported by your installed version. If variable names have changed, adapt the instructions accordingly.

What Is Claude Code and Why Go Local?

Claude Code in 60 Seconds

Claude Code is Anthropic’s agentic command-line coding tool. Unlike GitHub Copilot, which operates primarily as an inline autocomplete engine, or Cursor, which embeds AI inside a custom IDE fork, Claude Code functions as a standalone CLI agent. It reads project files, reasons about codebases, writes and edits code across multiple files, runs shell commands, and iterates on its own output. Its default operating model requires an Anthropic API key, routing all requests to Claude Sonnet 4 or Claude Opus, with costs determined by token consumption. A typical multi-step agentic task can consume tens of thousands of tokens per interaction.

The Case for Local Models

Running Claude Code against a local model solves three problems. Privacy and data sovereignty come first: source code never leaves the developer’s machine, which matters for proprietary codebases and organizations with strict data handling policies. You also eliminate per-query costs after the one-time hardware investment. And the setup works without an internet connection, so you keep working when connectivity drops.

The trade-offs deserve honest acknowledgment. Local models, even the best open-weight coding models in the 7B to 16B parameter range, do not match Claude Sonnet 4 or Opus in complex multi-file reasoning, nuanced architectural decisions, or large-context understanding. For straightforward tasks like boilerplate generation, refactoring, and test scaffolding, local models produce usable output on the first attempt for single-file edits. For tasks requiring deep contextual reasoning across thousands of lines, the quality gap remains significant.

Understanding the Architecture: Claude Code + Ollama + OpenAI-Compatible APIs

How the Pieces Fit Together

Claude Code supports third-party model providers through OpenAI-compatible API endpoints. This is the mechanism that makes local usage possible. Ollama, a local model server, exposes exactly such an endpoint at localhost:11434/v1. When you configure the right environment variables, Claude Code sends its requests to this local endpoint instead of Anthropic’s servers.

The request flow is straightforward:

Claude Code CLI  →  http://localhost:11434/v1/chat/completions  →  Ollama Server  →  Local LLM (e.g., qwen2.5-coder:14b)
     [prompt]           [OpenAI-compatible API]                      [inference]         [response]

Claude Code constructs its prompts and tool-use payloads in the OpenAI chat completions format. Ollama receives these, runs inference on the specified local model, and returns the completion. From Claude Code’s perspective, it talks to an OpenAI-compatible provider. From the model’s perspective, it handles standard chat completion requests.
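To make the request flow concrete, here is a minimal sketch of the kind of JSON body an OpenAI-style chat completions client posts to Ollama. The helper name buildChatRequest is hypothetical, and the payloads Claude Code actually sends (system prompts, tool definitions) are considerably larger; only the endpoint path, the model tag, and the message shape come from this guide.

```javascript
// Hypothetical helper: builds a minimal OpenAI-style chat completions request.
// The URL and model tag match this guide; everything else is illustrative.
function buildChatRequest(model, userPrompt, baseUrl = 'http://localhost:11434/v1') {
  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model,
        stream: false,
        messages: [{ role: 'user', content: userPrompt }],
      }),
    },
  };
}

const request = buildChatRequest('qwen2.5-coder:14b', 'Write a hello world function in JavaScript');
console.log(request.url); // http://localhost:11434/v1/chat/completions
```

On Node.js 18 or later, the pair can be handed to the built-in fetch as fetch(request.url, request.init).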

Prerequisites and System Requirements

Hardware Considerations

Local LLM inference is memory-bound. The RAM figures below refer to available (free) RAM, not total installed RAM. For 7B parameter models at Q4 quantization, you need at least 16GB of available RAM. Running 13B or 14B parameter models comfortably requires 32GB or more, and models with 30B+ parameters typically demand 64GB of available RAM or a GPU with substantial VRAM. Higher-precision quantization levels (e.g., Q8) roughly double the RAM requirement compared with Q4 variants.

For GPU acceleration, Ollama supports NVIDIA GPUs through CUDA, Apple Silicon through Metal (automatic on macOS), and AMD GPUs through ROCm on Linux. Disk space requirements vary by model: expect 4GB to 10GB per quantized model file.
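As a rough sanity check on these figures, the weights of a quantized model occupy about (parameters × bits per weight ÷ 8) bytes. The sketch below applies that arithmetic. The function name is hypothetical, and the estimate deliberately ignores the KV cache, runtime overhead, and the fact that real Q4 formats store slightly more than 4 bits per weight, which is why the guidance above leaves generous headroom.

```javascript
// Rough estimate of weight memory only (no KV cache, no runtime overhead).
// Assumption: every weight is stored at exactly `bitsPerWeight` bits.
function estimateWeightGB(paramsBillions, bitsPerWeight) {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return bytes / 1e9; // decimal gigabytes
}

console.log(estimateWeightGB(14, 4)); // 7  (14B parameters at 4 bits)
console.log(estimateWeightGB(14, 8)); // 14 (8 bits doubles the weight footprint)
```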

Software Requirements

The setup requires Node.js 18 or later (with npm), Ollama installed and running as a local server, and the Claude Code CLI installed globally through npm.

Step 1: Install and Configure Ollama

Installing Ollama

On macOS and Linux, Ollama installs with a single command. Windows users can download the installer from the Ollama website.

# macOS (Homebrew)
brew install ollama

# Linux (official install script)
curl -fsSL https://ollama.com/install.sh | sh

# Verify the installation
ollama --version

# Start the server if it is not already running
ollama serve

On macOS, Ollama typically runs as a background service after a Homebrew install (start it with brew services start ollama if it is not already running). On Linux, ollama serve starts the server process. Verify it’s running by checking that port 11434 is listening.
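One quick programmatic check is to request the server’s base URL and see whether anything answers. The sketch below assumes Node.js 18 or later (for the built-in fetch); the function name isOllamaUp and the injectable fetchImpl parameter are illustrative choices, not part of Ollama.

```javascript
// Returns true when something answers with an HTTP 2xx at the Ollama base URL.
// `fetchImpl` is injectable so the check can be exercised without a running server.
async function isOllamaUp(baseUrl = 'http://localhost:11434', fetchImpl = fetch) {
  try {
    const res = await fetchImpl(baseUrl);
    return res.ok;
  } catch {
    // Connection refused, DNS failure, and similar errors: treat as "not running".
    return false;
  }
}

// Usage:
// isOllamaUp().then((up) => console.log(up ? 'Ollama is reachable' : 'Ollama is not reachable'));
```

A true result only shows that a server is listening on the port; it does not confirm that a model has been pulled.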

Pulling the Right Model

Not all models handle code generation equally. The following models are well-suited for coding tasks through Claude Code:

For the best balance of quality and resource usage, pull qwen2.5-coder:14b. It handles multi-file edits in Python, TypeScript, and Go with fewer syntax errors than other models in this parameter range.
deepseek-coder-v2:16b generates syntactically valid Python and JavaScript in single-file tasks (performance varies by task; evaluate against your own workload).
Meta’s codellama:13b is a purpose-built coding model released in 2023. It is based on the older Llama 2 architecture, so the newer options above generally produce better results.
When RAM is tight, llama3.1:8b offers a lighter-weight general-purpose option.

Model choice directly affects output quality. Purpose-built coding models like Qwen 2.5 Coder produce noticeably better structured code, handle edge cases more reliably, and follow coding conventions more consistently than general-purpose models of equal size.

# Download the model weights
ollama pull qwen2.5-coder:14b

# Confirm the model is available locally
ollama list

The ollama list command should show the model name, size, and modification date, confirming the weights are downloaded and ready.
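The same check can be scripted against Ollama’s native GET /api/tags endpoint, which lists locally available models. The sketch below only covers the matching step: it assumes the parsed response has the shape { models: [{ name: 'model:tag' }] }, and the helper name hasModel and the sample data are illustrative.

```javascript
// `tags` is the parsed JSON from Ollama's GET /api/tags endpoint,
// assumed here to have the shape { models: [{ name: 'model:tag', ... }] }.
function hasModel(tags, wanted) {
  const models = Array.isArray(tags?.models) ? tags.models : [];
  return models.some((m) => m.name === wanted);
}

// Sample data for illustration only, not real server output.
const sample = { models: [{ name: 'qwen2.5-coder:14b' }, { name: 'llama3.1:8b' }] };
console.log(hasModel(sample, 'qwen2.5-coder:14b')); // true
console.log(hasModel(sample, 'qwen2.5-coder'));     // false (the tag must match too)
```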

Verifying the Local API

Before configuring Claude Code, confirm that Ollama’s OpenAI-compatible endpoint is responding:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer not-a-real-key-local-ollama-only" \
  -d '{
    "model": "qwen2.5-coder:14b",
    "stream": false,
    "messages": [{"role": "user", "content": "Write a hello world function in JavaScript"}]
  }'

A successful response returns a single JSON object containing the model’s completion. If this command fails with “connection refused,” Ollama is not running. If it returns a model-not-found error, the model name does not match what was pulled.
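When scripting this check, the completion text sits at choices[0].message.content in the standard OpenAI chat completions response shape. The sketch below pulls it out and fails loudly otherwise. The helper name extractCompletion is hypothetical, and the assumption that failures arrive as an object with an error field should be verified against your Ollama version.

```javascript
// Extracts the assistant text from an OpenAI-style chat completion object.
// Throws a descriptive error if the object carries an error or has an unexpected shape.
function extractCompletion(response) {
  if (response?.error) {
    const msg = response.error.message ?? String(response.error);
    throw new Error(`Server returned an error: ${msg}`);
  }
  const content = response?.choices?.[0]?.message?.content;
  if (typeof content !== 'string') {
    throw new Error('Unexpected response shape: no choices[0].message.content');
  }
  return content;
}

// Illustrative object, not captured server output.
const example = { choices: [{ message: { role: 'assistant', content: 'console.log("hi");' } }] };
console.log(extractCompletion(example)); // console.log("hi");
```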

Step 2: Install and Configure Claude Code for Local Use

Installing Claude Code CLI

Install Claude Code globally through npm:

npm install -g @anthropic-ai/claude-code

# Verify the installation
claude --version

This installs the claude command globally. The CLI requires Node.js 18 or later. Note the version number displayed, because the environment variables described below are version-dependent. Run claude --help to confirm the supported configuration options for your version.

Configuring Claude Code to Use Ollama

First: if you have ANTHROPIC_API_KEY set in your environment, unset it. Leaving it set may cause Claude Code to route requests to Anthropic’s API instead of Ollama, silently incurring costs.

unset ANTHROPIC_API_KEY

You configure Claude Code’s third-party provider support with environment variables. The exact variable names depend on your Claude Code version. Run claude --help to confirm the correct names. The variables below represent one documented configuration approach; verify them against the current Anthropic documentation for your installed version:




export OPENAI_API_KEY="not-a-real-key-local-ollama-only"
export ANTHROPIC_BASE_URL="http://localhost:11434/v1"
export CLAUDE_CODE_USE_OPENAI=1
export CLAUDE_MODEL="qwen2.5-coder:14b"

Version-dependent variables: The variable names CLAUDE_CODE_USE_OPENAI, CLAUDE_MODEL, and the choice between ANTHROPIC_BASE_URL and OPENAI_BASE_URL may differ across Claude Code releases. Confirm them with claude --help or the Anthropic documentation for your version. If the variables are incorrect, Claude Code may silently fall back to the Anthropic API, incurring costs.

You set OPENAI_API_KEY to a placeholder string because Ollama does not require authentication, but Claude Code refuses to start without a non-empty key value. ANTHROPIC_BASE_URL points to the local Ollama server’s OpenAI-compatible API path. CLAUDE_CODE_USE_OPENAI signals Claude Code to use the OpenAI-compatible provider path rather than the Anthropic API. CLAUDE_MODEL specifies which Ollama model to use and must match the model name exactly as shown by ollama list, including the tag (e.g., :14b).
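Because a misconfigured variable can silently send traffic to the paid API, a small pre-flight check can help. The sketch below is a hypothetical helper that inspects an environment object using the variable names from this guide; those names are this guide’s assumptions and may not match your Claude Code version, so adjust the list after checking claude --help.

```javascript
// Checks an environment object against the variable names used in this guide.
// These names are the guide's assumptions and may not match your Claude Code version.
function checkLocalEnv(env) {
  const problems = [];
  if (env.ANTHROPIC_API_KEY) {
    problems.push('ANTHROPIC_API_KEY is set; unset it to avoid accidental API billing');
  }
  for (const name of ['OPENAI_API_KEY', 'ANTHROPIC_BASE_URL', 'CLAUDE_CODE_USE_OPENAI', 'CLAUDE_MODEL']) {
    if (!env[name]) problems.push(`${name} is not set`);
  }
  if (env.ANTHROPIC_BASE_URL && !/^https?:\/\/(localhost|127\.0\.0\.1):11434\b/.test(env.ANTHROPIC_BASE_URL)) {
    problems.push('ANTHROPIC_BASE_URL does not point at a local Ollama server on port 11434');
  }
  if (env.CLAUDE_MODEL && !env.CLAUDE_MODEL.includes(':')) {
    problems.push('CLAUDE_MODEL has no tag (expected something like qwen2.5-coder:14b)');
  }
  return problems;
}

// Usage: pass process.env in a real shell session.
console.log(checkLocalEnv({
  OPENAI_API_KEY: 'not-a-real-key-local-ollama-only',
  ANTHROPIC_BASE_URL: 'http://localhost:11434/v1',
  CLAUDE_CODE_USE_OPENAI: '1',
  CLAUDE_MODEL: 'qwen2.5-coder:14b',
})); // []
```

An empty array means nothing obviously wrong was found; it does not prove that requests are routed locally, so the connection check during a session is still worth running.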

For persistence, add these exports to ~/.bashrc, ~/.zshrc, or a project-level .env file. If using a project-level .env file, ensure it’s listed in .gitignore to prevent accidental commits.

Windows users (PowerShell):

$env:OPENAI_API_KEY="not-a-real-key-local-ollama-only"
$env:ANTHROPIC_BASE_URL="http://localhost:11434/v1"
$env:CLAUDE_CODE_USE_OPENAI="1"
$env:CLAUDE_MODEL="qwen2.5-coder:14b"

Launching Claude Code in Local Mode

With the environment variables set, start Claude Code in any project directory:

cd /path/to/your/project
claude

On startup, Claude Code should display the configured model name (e.g., qwen2.5-coder:14b) rather than a Claude Sonnet or Opus identifier. This is an initial indicator that the configuration was applied, but displaying the model name alone does not guarantee local routing: the configured variable value may be shown even when routing fails. To definitively confirm that requests reach Ollama, monitor connections during a session:


lsof -i :11434 | grep ESTABLISHED

You should see an active TCP connection to 127.0.0.1:11434. If no connection is shown, requests may be going to Anthropic’s servers.

Step 3: Take It for a Spin with a React + Node.js Project

Scaffolding a Test Project

Create a minimal project that gives Claude Code real files to work with:

npm create vite@latest test-project -- --template react
cd test-project
npm install
npm install express

Add a minimal Express server at the project root. Because the Vite scaffold creates an ES module project ("type": "module" in package.json), the CommonJS require() syntax will not work by default. Either name the file server.cjs, or add "type": "commonjs" to a separate root-level package.json, or rewrite using ES module import syntax. The example below uses the .cjs approach:


// server.cjs
const express = require('express');
const app = express();
const PORT = process.env.PORT ?? 3001;

app.use(express.json());

// Root route: simple liveness message
app.get('/', (req, res) => {
  res.json({ message: 'Server is running' });
});

const server = app.listen(PORT, () => {
  console.log(`Server listening on port ${PORT}`);
});

// Exit with a clear message if the port is taken or startup fails
server.on('error', (err) => {
  if (err.code === 'EADDRINUSE') {
    console.error(`Port ${PORT} is already in use. Set PORT env var to use a different port.`);
  } else {
    console.error('Server failed to start:', err);
  }
  process.exit(1);
});

This provides both a React frontend and a Node.js backend for Claude Code to operate on.

Running Real Coding Tasks

With Claude Code running in the project directory, issue a realistic prompt:

Add a /api/health endpoint to server.cjs that returns { status: "healthy", uptime: process.uptime() }
and create a React component called HealthStatus that fetches and displays this data.

With qwen2.5-coder:14b, expect output structured like this (your results will vary based on prompt phrasing and model state):


// Added to server.cjs
app.get('/api/health', (req, res) => {
  res.json({
    status: 'healthy',
    uptime: process.uptime(),
    timestamp: new Date().toISOString()
  });
});


import { useState, useEffect } from 'react';

// API origin: configurable through VITE_API_BASE, defaults to the local Express server
const API_BASE = import.meta.env.VITE_API_BASE ?? 'http://localhost:3001';

function HealthStatus() {
  const [health, setHealth] = useState(null);
  const [loading, setLoading] = useState(true);
  const [error, setError] = useState(null);

  useEffect(() => {
    // Abort the request if the component unmounts before it completes
    const controller = new AbortController();

    fetch(`${API_BASE}/api/health`, { signal: controller.signal })
      .then((res) => {
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return res.json();
      })
      .then((data) => {
        setHealth(data);
        setLoading(false);
      })
      .catch((err) => {
        if (err.name === 'AbortError') return;
        console.error('Failed to fetch health status:', err);
        setError(err.message);
        setLoading(false);
      });

    return () => controller.abort();
  }, []);

  if (loading) return <p>Loading health status...</p>;
  if (error) return <p>Error: {error}</p>;
  if (!health) return <p>Unable to reach server.</p>;

  return (
    <div>
      <h2>Server Health</h2>
      <p>Status: {health.status}</p>
      <p>Uptime: {Math.round(health.uptime)}s</p>
    </div>
  );
}

export default HealthStatus;

Note on fetch URLs: The React frontend runs on the Vite dev server (typically port 5173), while the Express backend runs on port 3001. The component above uses the VITE_API_BASE environment variable to configure the API origin, falling back to http://localhost:3001 for local development. Because the two servers are on different origins, the browser will block the direct request unless the Express server sends CORS headers. For production or containerised deployments, set VITE_API_BASE to the appropriate backend URL. Alternatively, configure a Vite proxy by adding server: { proxy: { '/api': 'http://localhost:3001' } } to vite.config.js and use relative fetch paths, which avoids the cross-origin request in development.

Claude Code’s agentic capabilities mean it reads the existing server.cjs, identifies where to insert the new endpoint, writes the changes, creates the new component file, and can even update imports in App.jsx if prompted.

Evaluating Output Quality

Local models in the 7B to 14B range handle boilerplate code, CRUD endpoint generation, simple component creation, test scaffolding, and straightforward refactoring well. For single-endpoint handlers and isolated component files, they produce usable output on the first attempt without manual correction.

Where local models fall short is in complex multi-file reasoning: tracing a bug across several interconnected modules, making architectural decisions that require understanding a full codebase’s patterns, or producing correct output when the context window fills up. Claude Sonnet 4 handles these scenarios with noticeably higher accuracy. For example, Sonnet correctly traces cross-module type errors that qwen2.5-coder:14b misses after several attempts, and it maintains coherence across longer context windows.

Performance Tuning and Optimization

Ollama Configuration for Better Performance

Ollama exposes several environment variables and configuration options that affect inference speed:




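# Allow Ollama to serve two requests concurrently
export OLLAMA_NUM_PARALLEL=2

# Inspect the model's Modelfile for an explicit context-length (num_ctx) setting
ollama show qwen2.5-coder:14b --modelfile | grep num_ctx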

Setting OLLAMA_NUM_PARALLEL above 1 allows concurrent request handling, which matters less for single-user Claude Code sessions but helps if other tools share the same Ollama instance. Increasing the context length lets the model reason over more code at once, but increases memory consumption significantly; very long contexts can consume substantially more memory than the base model load.
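For tools that call Ollama’s native /api/chat endpoint directly, the context length can also be set per request through the options field. The sketch below builds such a request body; the helper name buildNativeChatRequest is hypothetical and the 8192 value is purely illustrative. Whether requests arriving through the OpenAI-compatible /v1 path honour the same option should be checked against the Ollama documentation for your version.

```javascript
// Builds a request for Ollama's native /api/chat endpoint with an explicit context length.
// `numCtx` is in tokens; larger values use more memory.
function buildNativeChatRequest(model, messages, numCtx) {
  if (!Number.isInteger(numCtx) || numCtx <= 0) {
    throw new RangeError('numCtx must be a positive integer');
  }
  return {
    url: 'http://localhost:11434/api/chat',
    body: { model, messages, stream: false, options: { num_ctx: numCtx } },
  };
}

const req = buildNativeChatRequest(
  'qwen2.5-coder:14b',
  [{ role: 'user', content: 'Summarize server.cjs' }],
  8192, // illustrative value, not a recommendation
);
console.log(JSON.stringify(req.body.options)); // {"num_ctx":8192}
```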

Choosing the Right Model for the Task

A practical strategy is to keep several models pulled and switch between them. Use a smaller model like llama3.1:8b for quick completions and simple edits where speed matters. Switch to qwen2.5-coder:14b or deepseek-coder-v2:16b for tasks requiring higher code quality. Switching models requires only changing the CLAUDE_MODEL environment variable (or the equivalent for your Claude Code version) and restarting Claude Code.

Full Implementation Checklist and Model Comparison Table

Setup Checklist

Install Ollama (brew install ollama or curl install script) and verify with ollama --version
Start Ollama server (ollama serve or brew services start ollama on macOS) and confirm port 11434 is listening
Pull a coding model (ollama pull qwen2.5-coder:14b) and verify with ollama list
Test the API endpoint with curl http://localhost:11434/v1/chat/completions (include "stream": false in the request body)
Install Claude Code (npm install -g @anthropic-ai/claude-code) and verify with claude --version
Unset ANTHROPIC_API_KEY if present (unset ANTHROPIC_API_KEY)
Check claude --help to confirm the correct environment variable names for your version
Set environment variables (OPENAI_API_KEY, ANTHROPIC_BASE_URL, CLAUDE_CODE_USE_OPENAI, CLAUDE_MODEL), adapting variable names if your version differs
Launch Claude Code in a project directory and confirm the model name in startup output
Run lsof -i :11434 (or netstat -ano | findstr :11434 on Windows) during a session to verify local routing
Run a test prompt and verify the response comes from the local model

Local Coding Model Comparison Table

Model	Size	Min. Free RAM (Q4)	Coding Quality*	Speed	Best For
`llama3.1:8b`	~4.7GB	16GB	Moderate	Fast	Quick completions, simple edits
`codellama:13b`	~7.4GB	32GB**	Good	Moderate	General code generation
`qwen2.5-coder:14b`	~8.9GB	32GB	Very Good	Moderate	Best overall for coding tasks
`deepseek-coder-v2:16b`	~9.1GB	32GB	Very Good	Moderate	Complex code generation
`codellama:34b`	~19GB	64GB	Excellent	Slow	Maximum local quality
`llama3.1:70b`	~40GB	64GB+	Excellent	Very Slow	Near-API quality (if hardware allows)

*Coding Quality ratings reflect informal single-file pass rates on HumanEval-style tasks. “Moderate” = frequent manual fixes needed; “Good” = occasional fixes; “Very Good” = first-attempt success on most single-file tasks; “Excellent” = consistent first-attempt success including multi-function files.

**16GB is the technical minimum for codellama:13b; 32GB is recommended for stable inference without swapping. Sizes and RAM figures assume Q4 quantization; Q8 quantization roughly doubles RAM requirements. Verify actual on-disk size with ollama list after pulling.

Best overall pick: qwen2.5-coder:14b offers the strongest balance of code generation quality, reasonable resource requirements, and practical inference speed for iterative development workflows.
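The table can also be turned into a quick filter that lists which models fit a given amount of free memory. The sketch below copies the minimum free RAM column from the table (treating “64GB+” as 64); the names TABLE and modelsThatFit are hypothetical.

```javascript
// Minimum free RAM figures (GB, Q4) copied from the comparison table above.
const TABLE = [
  { name: 'llama3.1:8b', minFreeRamGB: 16 },
  { name: 'codellama:13b', minFreeRamGB: 32 },
  { name: 'qwen2.5-coder:14b', minFreeRamGB: 32 },
  { name: 'deepseek-coder-v2:16b', minFreeRamGB: 32 },
  { name: 'codellama:34b', minFreeRamGB: 64 },
  { name: 'llama3.1:70b', minFreeRamGB: 64 }, // table lists "64GB+"
];

// Returns the names of models whose minimum fits within the given free RAM.
function modelsThatFit(freeRamGB) {
  return TABLE.filter((m) => m.minFreeRamGB <= freeRamGB).map((m) => m.name);
}

console.log(modelsThatFit(16)); // [ 'llama3.1:8b' ]
console.log(modelsThatFit(8));  // []
```

In Node.js, os.freemem() from the built-in os module reports free memory in bytes, which can be divided by 1e9 to feed this function.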

Troubleshooting Common Issues

Connection Refused or Model Not Found

If Claude Code reports connection errors, verify that ollama serve is running and that http://localhost:11434 responds to requests. On macOS, check whether the Homebrew service is already running with brew services list; running ollama serve manually while the service is active causes a port conflict. A “model not found” error means the value in CLAUDE_MODEL does not exactly match the model name shown by ollama list, including the tag (e.g., :14b).

Slow Responses or Out-of-Memory Errors

If inference is unacceptably slow or the system runs out of memory, reduce the context window (through the Modelfile PARAMETER num_ctx or the per-request options field), switch to a smaller quantized model, or verify that GPU offloading is active. On NVIDIA systems, nvidia-smi confirms whether Ollama is using the GPU. On Apple Silicon, Metal acceleration is automatic.

Claude Code Ignoring Local Config

Environment variables can override one another in ways that cause routing errors. If you have an ANTHROPIC_API_KEY set in the shell environment or in a global configuration file, Claude Code may prioritize the Anthropic provider over the OpenAI-compatible path. Unset the Anthropic API key (unset ANTHROPIC_API_KEY) before launching Claude Code in local mode. Additionally, verify that the environment variable names you are using match those supported by your installed Claude Code version; run claude --help to confirm.

Warning: If environment variables are misconfigured, Claude Code may silently route requests to Anthropic’s API, incurring unexpected costs. Always verify local routing by checking for active connections to localhost:11434 during your session.

When to Use Local vs. API: A Practical Framework

Use local models for iterative development, boilerplate generation, test writing, refactoring, and work on private or proprietary codebases where data must not leave the machine. Use the Anthropic API for complex architectural reasoning, large-context multi-file changes that exceed local model capabilities, and code that ships to production without additional human review.

The most practical approach is a hybrid one: default to local for the bulk of daily coding tasks and switch to the API selectively for heavy lifts. This pattern captures the majority of cost savings while preserving access to frontier model quality when it matters.

What Comes Next

This setup eliminates API costs for the majority of routine coding agent interactions. Developers who previously spent $100 or more per day on Anthropic API credits can reserve that spend for tasks that genuinely require frontier model capabilities. Actual savings depend on individual workflow composition and the ratio of local-suitable tasks to those requiring frontier models.

From here, the natural next steps are experimenting with additional models as the open-weight ecosystem evolves and creating task-specific Modelfile configurations tuned for particular programming languages or frameworks. Beyond that, you can integrate local Claude Code sessions into CI workflows for automated code review on private repositories.
