
Free local AI assistant on macOS with no API keys or subscriptions

Updated April 14, 2026

Category: AI Development

Local AI assistant running on macOS with Ollama

You do not need an API key to run a capable AI assistant in 2026. On a modern Mac, you can pull down an open source model, run it locally, and have a chat interface ready in under ten minutes. Nothing leaves your machine. There is no subscription, no rate limit, and no vendor deciding what the model is allowed to say.

This guide walks through the full setup using Ollama, which has become the standard way to manage local models on macOS. By the end you will have a working local AI assistant, an optional chat UI, and a simple HTTP API you can call from your own code.

If you are already building AI agent systems and want to replace cloud inference with something private and free, this is a practical starting point.


Why run AI locally on your Mac?

The answer depends on what matters to you. Here are the reasons that come up most often:

Privacy. Nothing is sent to a third party server. If you paste in sensitive code, a draft email, or a private document, it stays on your machine.

Cost. Once the model is downloaded, inference is free. No per-token pricing, no monthly bill.

Availability. It works offline. No outages, no API rate limits during peak hours.

Experimentation. You can swap models freely, try quantized variants, and tune prompts without worrying about burning through credits.

The tradeoff is speed and capability. A model running on your MacBook will be slower than GPT-4 or Claude Sonnet and the largest models will not fit in RAM on most consumer hardware. But for everyday tasks like summarization, drafting, coding help, and Q&A, a 7B or 8B parameter model is genuinely useful.


What you need before you start

You need a Mac running macOS 12 Monterey or later. Homebrew is optional but makes things easier. You do not need an Nvidia GPU: Ollama has native support for Apple Silicon and runs inference on the GPU via Metal.

Recommended minimum specs:

Spec    | Minimum       | Better
RAM     | 8 GB          | 16 GB or more
Storage | 5 GB free     | 20 GB free
Chip    | Intel Core i5 | Apple Silicon (M1 or later)

Apple Silicon machines (M1 through M4) get much better performance because Ollama uses Metal to run inference on the GPU cores directly.


Step 1: Install Ollama

Ollama is an open source tool that handles model downloads, runtime management, and a local HTTP API. It runs as a background service.

The quickest install is from the official site:

→ Download Ollama for macOS

Or install it via Homebrew:

Terminal
brew install ollama

Once installed, start the Ollama service:

Terminal
ollama serve

You should see output confirming the server is running on http://localhost:11434. Keep this terminal open or run it as a background service (covered in Step 5).
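If you want to confirm the server is reachable from code before moving on, here is a minimal sketch in Python. The `is_ollama_running` helper is written for this post, not part of Ollama; it just checks whether anything answers HTTP at the default address.

```python
import urllib.request
import urllib.error

def is_ollama_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            # Ollama's root path replies with a short "Ollama is running" page.
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(is_ollama_running())  # True once `ollama serve` is up
```

This is handy in scripts that should fail fast with a clear message instead of a connection error deep inside a request.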


Step 2: Pull a model

Ollama uses a model registry similar to Docker Hub. You pull a model by name and it downloads the weights locally.

Terminal
ollama pull llama3.2

This pulls Meta’s Llama 3.2 model at roughly 2 GB for the default 3B variant. Once the download finishes, the model is stored at ~/.ollama/models.

Which model should you pick?

Here is a practical breakdown of what runs well on a Mac in 2026:

Model       | Size on disk | Best for                          | Minimum RAM
llama3.2:3b | 2 GB         | Quick answers, coding help        | 8 GB
llama3.2:8b | 4.7 GB       | General assistant, longer context | 8 GB
gemma3:4b   | 3.3 GB       | Instruction following, chat       | 8 GB
qwen2.5:7b  | 4.4 GB       | Code, reasoning, multilingual     | 8 GB
mistral:7b  | 4.1 GB       | Writing, summarization            | 8 GB

If you are on Apple Silicon with 16 GB or more of RAM, start with llama3.2:8b or qwen2.5:7b. Both produce solid results for everyday developer tasks.

Terminal
# Pull the 8B variant explicitly
ollama pull llama3.2:8b

# Or try Qwen 2.5 for coding tasks
ollama pull qwen2.5:7b

You can have multiple models installed at the same time and switch between them freely.
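You can also list what is installed programmatically: `GET /api/tags` returns the installed models as JSON. A sketch of extracting the names, run here against a canned response rather than a live server (the sample is abridged; real entries carry more fields):

```python
import json

# Abridged shape of a GET http://localhost:11434/api/tags response.
sample = json.dumps({
    "models": [
        {"name": "llama3.2:8b", "size": 4700000000},
        {"name": "qwen2.5:7b", "size": 4400000000},
    ]
})

def installed_models(tags_json: str) -> list[str]:
    """Pull the model names out of an /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json)["models"]]

print(installed_models(sample))  # -> ['llama3.2:8b', 'qwen2.5:7b']
```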


Step 3: Run your first conversation

Once a model is pulled, you can chat with it directly in the terminal:

Terminal
ollama run llama3.2:8b

This drops you into an interactive prompt. Type your message and press Enter. To exit, type /bye.

Terminal
>>> Explain the difference between a process and a thread in simple terms.

A process is an independent program running in its own memory space...

The first response after loading may take a few seconds while the model weights are read into memory. Subsequent responses in the same session are faster.


Step 4: Add a chat interface

The terminal works fine, but most people prefer a proper chat UI. There are two good options depending on what you want.

Option A: Open WebUI

Open WebUI is a self-hosted chat interface that connects to your local Ollama server. It looks similar to ChatGPT but runs entirely on your machine.

→ Open WebUI on GitHub

Install and run it with Docker:

Terminal
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 in your browser. You will see a full chat UI where you can select any installed Ollama model from a dropdown and start a conversation.

Open WebUI also supports document uploads, image input (with vision models), and basic prompt management. It is a significant upgrade from the terminal if you plan to use the assistant regularly.

Option B: A floating macOS app

If you want something lighter that lives outside the browser, there are several native macOS apps that wrap Ollama. The one that got attention on r/SideProject recently is a floating assistant that sits above your other windows and responds to a keyboard shortcut.

The general pattern for this type of app:

  1. Download the .dmg from the project’s GitHub releases page
  2. Open it and allow the app in System Settings under Privacy & Security (System Preferences on macOS 12)
  3. Grant Accessibility permissions so it can read selected text
  4. Set a hotkey to summon it from anywhere

This type of interface is useful when you want to ask the model about something you are reading or drafting without switching apps.


Step 5: Keep it running headlessly

If you want Ollama to start automatically when you log in and run in the background without keeping a terminal open, you can register it as a macOS launch agent.

Create the plist file:

Terminal
cat > ~/Library/LaunchAgents/com.ollama.server.plist << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.server</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/ollama</string>
        <string>serve</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/ollama.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/ollama.err</string>
</dict>
</plist>
EOF

The binary path depends on how you installed Ollama: Homebrew on Apple Silicon puts it at /opt/homebrew/bin/ollama, while the macOS app installer links it at /usr/local/bin/ollama. Check with which ollama and adjust the plist if needed.

Load the agent:

Terminal
launchctl load ~/Library/LaunchAgents/com.ollama.server.plist

Ollama will now start automatically on login and restart if it crashes. You can check logs at /tmp/ollama.log.


Using the Ollama API from your own code

Ollama exposes a local REST API on port 11434. You can call it from any language. This is useful if you want to add AI features to a script or app without paying for cloud inference.

Generate a completion in Python:

Python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:8b",
        "prompt": "Summarize the key ideas in the Unix philosophy in three bullet points.",
        "stream": False
    }
)

print(response.json()["response"])
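With `"stream": False` you get a single JSON object back. With streaming enabled, Ollama instead sends newline-delimited JSON chunks, each carrying a piece of the response. A sketch of reassembling them, shown here on canned chunks rather than a live request:

```python
import json

def join_stream(lines):
    """Concatenate the 'response' field of each NDJSON chunk until 'done'."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Example chunks in the shape /api/generate streams them.
chunks = [
    '{"response": "Small ", "done": false}',
    '{"response": "tools, ", "done": false}',
    '{"response": "composed well.", "done": true}',
]
print(join_stream(chunks))  # -> Small tools, composed well.
```

Streaming matters for interactive use: tokens appear as they are generated instead of after a long pause.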

Chat completions in JavaScript:

JavaScript
const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
        model: "llama3.2:8b",
        messages: [
            { role: "user", content: "What is a tail call optimization?" }
        ],
        stream: false
    })
});

const data = await response.json();
console.log(data.message.content);

The API is compatible with the OpenAI client library format for chat completions, so you can often swap in the Ollama base URL to test your existing OpenAI-powered code locally at zero cost.

Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by the client but not checked
)

completion = client.chat.completions.create(
    model="llama3.2:8b",
    messages=[{"role": "user", "content": "What is memoization?"}]
)

print(completion.choices[0].message.content)

This compatibility layer makes it easy to build something locally and switch to a cloud model later if you need more capability at scale.
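One way to exploit that compatibility is to read the base URL and model from environment variables, so the same code targets local Ollama by default and a cloud endpoint when configured. A hypothetical sketch (the variable names here are this post's convention, not a standard):

```python
import os
from dataclasses import dataclass

@dataclass
class LLMConfig:
    base_url: str
    model: str
    api_key: str

def load_llm_config(env=os.environ) -> LLMConfig:
    """Default to local Ollama; override via environment for cloud use."""
    return LLMConfig(
        base_url=env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        model=env.get("LLM_MODEL", "llama3.2:8b"),
        api_key=env.get("LLM_API_KEY", "ollama"),  # placeholder; Ollama ignores it
    )

cfg = load_llm_config({})
print(cfg.base_url)  # -> http://localhost:11434/v1
```

Feed `cfg.base_url`, `cfg.api_key`, and `cfg.model` into the OpenAI client shown above, and switching backends becomes a deployment setting rather than a code change.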


Hardware reality check

Here is what to expect on common Mac configurations:

Hardware                        | Model        | Tokens per second
MacBook Air M1 (8 GB RAM)       | llama3.2:3b  | 35 to 50
MacBook Pro M2 (16 GB RAM)      | llama3.2:8b  | 40 to 60
Mac Studio M2 Ultra (64 GB RAM) | llama3.3:70b | 20 to 35
MacBook Pro M4 (32 GB RAM)      | qwen2.5:14b  | 30 to 50

For comparison, a typical cloud API response streams at around 50 to 80 tokens per second. Local models on Apple Silicon are in the same ballpark for 7B to 8B models, and the latency is lower because there is no network round trip.
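To translate those rates into felt latency, the arithmetic is just tokens divided by throughput: at 40 tokens per second, a typical 300-token answer streams in seven to eight seconds.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream a response of the given length."""
    return tokens / tokens_per_second

print(seconds_for(300, 40))  # -> 7.5  (roughly an 8B model on an M2)
print(seconds_for(300, 10))  # -> 30.0 (roughly a 3B model on an Intel Mac)
```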

On an Intel Mac with 8 GB RAM, you will get usable but slow responses. Expect 5 to 15 tokens per second on a 3B model. It works, but the experience is noticeably sluggish compared to Apple Silicon.


What this setup cannot do

It is worth being honest about the gaps:

Context length. Consumer models running locally typically handle 4K to 8K tokens well, and Ollama's default context window is modest unless you raise it (the num_ctx parameter). Long documents or large codebases will hit limits faster than GPT-4 class models.

Multimodal input. Some models support images (llava, moondream), but vision quality is well below GPT-4o on most tasks. It is getting better with each generation.

Tool use and function calling. Ollama supports function calling in newer models, but reliability varies. If you need robust agent behavior with tool use, a cloud model is still more dependable. See the AI agent frameworks comparison for a broader look at what each framework expects from the underlying model.

Memory across sessions. Each new ollama run session starts fresh. Persistent memory requires an application layer built on top of the API.
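That application layer can be as small as a list of messages you resend with every request. A minimal sketch of such a wrapper (the class is illustrative, not part of Ollama; `send_fn` stands in for a real POST to /api/chat):

```python
class Conversation:
    """Accumulates chat history so each request carries the full context."""

    def __init__(self, model: str, send_fn):
        self.model = model
        self.send_fn = send_fn  # callable(model, messages) -> assistant reply text
        self.messages = []

    def ask(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        reply = self.send_fn(self.model, self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

# Fake transport for demonstration; swap in a real call to /api/chat.
echo = lambda model, messages: f"({len(messages)} messages seen)"
chat = Conversation("llama3.2:8b", echo)
chat.ask("Hello")
print(chat.ask("Do you remember me?"))  # -> (3 messages seen)
```

Persisting `chat.messages` to disk between runs is all it takes to get memory across sessions; the model itself stays stateless.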

None of these are blockers for most everyday use cases. For writing help, code explanation, summarization, and general Q&A, a local 8B model is genuinely capable.


The combination of Apple Silicon and well-quantized open source models has made this setup practical for the first time. A year ago, running a useful model locally on a laptop required careful compromises. Today it is straightforward enough that the main reason not to try it is not knowing it exists.

If you are curious about privacy-focused inference at a larger scale or want to explore how confidential compute fits into this picture, Cocoon’s decentralized inference network is another approach worth reading about.


Launching an AI tool or side project? Check out the list of AI directories you can submit to, and the basic SEO guide for getting your post-launch content right.
