Free local AI assistant on macOS with no API keys or subscriptions
Updated April 14, 2026
*Local AI assistant running on macOS with Ollama*
You do not need an API key to run a capable AI assistant in 2026. On a modern Mac, you can pull down an open source model, run it locally, and have a chat interface ready in under ten minutes. Nothing leaves your machine. There is no subscription, no rate limit, and no vendor deciding what the model is allowed to say.
This guide walks through the full setup using Ollama, which has become the standard way to manage local models on macOS. By the end you will have a working local AI assistant, an optional chat UI, and a simple HTTP API you can call from your own code.
If you are already building AI agent systems and want to replace cloud inference with something private and free, this is a practical starting point.
Why run AI locally on your Mac?
The answer depends on what matters to you. Here are the reasons that come up most often:
Privacy. Nothing is sent to a third party server. If you paste in sensitive code, a draft email, or a private document, it stays on your machine.
Cost. Once the model is downloaded, inference is free. No per-token pricing, no monthly bill.
Availability. It works offline. No outages, no API rate limits during peak hours.
Experimentation. You can swap models freely, try quantized variants, and tune prompts without worrying about burning through credits.
The tradeoff is speed and capability. A model running on your MacBook will be slower than GPT-4 or Claude Sonnet, and the largest models will not fit in RAM on most consumer hardware. But for everyday tasks like summarization, drafting, coding help, and Q&A, a 7B or 8B parameter model is genuinely useful.
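A back-of-the-envelope memory estimate shows why 7B to 8B is the sweet spot. The sketch below uses rough assumed figures (about 4.5 bits per weight for a typical quantized model, plus a flat overhead for the KV cache and runtime buffers), not measured numbers:

```python
def approx_ram_gb(params_billion: float, bits_per_weight: float = 4.5,
                  overhead_gb: float = 1.0) -> float:
    """Rough RAM needed for a quantized model: weight storage plus
    runtime overhead (KV cache, buffers). Assumed figures, not measured."""
    weights_gb = params_billion * bits_per_weight / 8  # bits -> bytes
    return round(weights_gb + overhead_gb, 1)

print(approx_ram_gb(3))   # 2.7 -- a 3B model fits comfortably in 8 GB
print(approx_ram_gb(8))   # 5.5 -- an 8B model fits in 8 GB, barely
print(approx_ram_gb(70))  # 40.4 -- a 70B model needs a Mac Studio
```

This is why the table below lists 8 GB as the floor for every 7B or 8B model: the weights fit, but there is little headroom left for the OS and other apps.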
What you need before you start
You need a Mac running macOS 12 Monterey or later. Homebrew is optional but makes things easier. You do not need an Nvidia GPU: Ollama has native support for Apple Silicon and runs inference on the GPU via Metal.
Recommended minimum specs:
| Spec | Minimum | Better |
|---|---|---|
| RAM | 8 GB | 16 GB or more |
| Storage | 5 GB free | 20 GB free |
| Chip | Intel Core i5 | Apple Silicon (M1 or later) |
Apple Silicon machines (M1 through M4) get much better performance because Ollama uses Metal to run inference on the GPU cores directly.
Step 1: Install Ollama
Ollama is an open source tool that handles model downloads, runtime management, and a local HTTP API. It runs as a background service.
The quickest install is from the official site:
→ Download Ollama for macOS

Or install it via Homebrew:
brew install ollama
Once installed, start the Ollama service:
ollama serve
You should see output confirming the server is running on http://localhost:11434. Keep this terminal open or run it as a background service (covered in Step 5).
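You can also verify the server is reachable from code. A minimal standard-library sketch against Ollama's `/api/version` endpoint (the version string in the sample body is illustrative):

```python
import json
import urllib.request

def parse_version(body: bytes) -> str:
    """Extract the version string from a /api/version response body."""
    return json.loads(body)["version"]

def server_version(base_url: str = "http://localhost:11434") -> str:
    """Ask a running Ollama server for its version; raises if it is not up."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=2) as resp:
        return parse_version(resp.read())

# Parsing shown on a sample body; call server_version() once ollama serve is running.
print(parse_version(b'{"version":"0.6.5"}'))  # 0.6.5
```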
Step 2: Pull a model
Ollama uses a model registry similar to Docker Hub. You pull a model by name and it downloads the weights locally.
ollama pull llama3.2
This pulls Meta’s Llama 3.2 model at roughly 2 GB for the default 3B variant. Once the download finishes, the model is stored at ~/.ollama/models.
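Model weights add up quickly, so it is worth checking how much disk they occupy. A small sketch that walks the models directory mentioned above and totals file sizes (nothing here is Ollama-specific):

```python
from pathlib import Path

def dir_size_gb(path: Path) -> float:
    """Total size of all files under a directory, in decimal GB."""
    total = sum(f.stat().st_size for f in path.rglob("*") if f.is_file())
    return round(total / 1e9, 2)

models_dir = Path.home() / ".ollama" / "models"
if models_dir.exists():
    print(f"Installed models occupy {dir_size_gb(models_dir)} GB")
```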
Which model should you pick?
Here is a practical breakdown of what runs well on a Mac in 2026:
| Model | Size on disk | Best for | Minimum RAM |
|---|---|---|---|
| llama3.2:3b | 2 GB | Quick answers, coding help | 8 GB |
| llama3.1:8b | 4.7 GB | General assistant, longer context | 8 GB |
| gemma3:4b | 3.3 GB | Instruction following, chat | 8 GB |
| qwen2.5:7b | 4.4 GB | Code, reasoning, multilingual | 8 GB |
| mistral:7b | 4.1 GB | Writing, summarization | 8 GB |
If you are on Apple Silicon with 16 GB or more of RAM, start with llama3.1:8b or qwen2.5:7b. Both produce solid results for everyday developer tasks.
# Pull the 8B model explicitly
ollama pull llama3.1:8b
# Or try Qwen 2.5 for coding tasks
ollama pull qwen2.5:7b
You can have multiple models installed at the same time and switch between them freely.
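You can list the installed models programmatically via GET /api/tags, which returns them as JSON. A sketch with the parse helper exercised against a sample payload (the model names in the sample are illustrative):

```python
import json
import urllib.request

def model_names(tags_body: bytes) -> list[str]:
    """Pull model names out of a /api/tags response body."""
    return [m["name"] for m in json.loads(tags_body)["models"]]

def installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    """List models known to a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
        return model_names(resp.read())

sample = b'{"models":[{"name":"llama3.1:8b"},{"name":"qwen2.5:7b"}]}'
print(model_names(sample))  # ['llama3.1:8b', 'qwen2.5:7b']
```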
Step 3: Run your first conversation
Once a model is pulled, you can chat with it directly in the terminal:
ollama run llama3.1:8b
This drops you into an interactive prompt. Type your message and press Enter. To exit, type /bye.
>>> Explain the difference between a process and a thread in simple terms.
A process is an independent program running in its own memory space...
The first response after loading may take a few seconds while the model weights are read into memory. Subsequent responses in the same session are faster.
Step 4: Add a chat interface
The terminal works fine, but most people prefer a proper chat UI. There are two good options depending on what you want.
Option A: Open WebUI
Open WebUI is a self-hosted chat interface that connects to your local Ollama server. It looks similar to ChatGPT but runs entirely on your machine.
→ Open WebUI on GitHub

Install and run it with Docker:
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Then open http://localhost:3000 in your browser. You will see a full chat UI where you can select any installed Ollama model from a dropdown and start a conversation.
Open WebUI also supports document uploads, image input (with vision models), and basic prompt management. It is a significant upgrade from the terminal if you plan to use the assistant regularly.
Option B: A floating macOS app
If you want something lighter that lives outside the browser, there are several native macOS apps that wrap Ollama. The one that got attention on r/SideProject recently is a floating assistant that sits above your other windows and responds to a keyboard shortcut.
The general pattern for this type of app:
- Download the .dmg from the project’s GitHub releases page
- Open it and allow it in System Preferences under Privacy and Security
- Grant Accessibility permissions so it can read selected text
- Set a hotkey to summon it from anywhere
This type of interface is useful when you want to ask the model about something you are reading or drafting without switching apps.
Step 5: Keep it running headlessly
If you want Ollama to start automatically when you log in and run in the background without keeping a terminal open, you can register it as a macOS launch agent.
Create the plist file:
cat > ~/Library/LaunchAgents/com.ollama.server.plist << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.ollama.server</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/ollama</string>
<string>serve</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/tmp/ollama.log</string>
<key>StandardErrorPath</key>
<string>/tmp/ollama.err</string>
</dict>
</plist>
EOF
Depending on how you installed Ollama, the binary path may differ: Homebrew on Apple Silicon uses /opt/homebrew/bin/ollama, while the macOS app installer links it at /usr/local/bin/ollama. Check with which ollama first and adjust the plist to match.
Load the agent:
launchctl load ~/Library/LaunchAgents/com.ollama.server.plist
Ollama will now start automatically on login and restart if it crashes. You can check logs at /tmp/ollama.log.
Using the Ollama API from your own code
Ollama exposes a local REST API on port 11434. You can call it from any language. This is useful if you want to add AI features to a script or app without paying for cloud inference.
Generate a completion in Python:
import requests
response = requests.post(
"http://localhost:11434/api/generate",
json={
"model": "llama3.1:8b",
"prompt": "Summarize the key ideas in the Unix philosophy in three bullet points.",
"stream": False
}
)
print(response.json()["response"])
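The example above sets "stream": False and waits for the whole completion. Left at its streaming default, /api/generate sends one JSON object per line so you can show tokens as they arrive. A standard-library sketch; the joining helper is demonstrated on canned chunks, and a live server response (which iterates line by line) can be passed to it the same way:

```python
import json
import urllib.request

def collect_stream(lines) -> str:
    """Join the "response" fragments from a streaming /api/generate reply,
    which arrives as newline-delimited JSON."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        if chunk.get("done"):
            break
        parts.append(chunk["response"])
    return "".join(parts)

def stream_generate(prompt: str, model: str = "llama3.1:8b",
                    base_url: str = "http://localhost:11434") -> str:
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return collect_stream(resp)  # file objects iterate line by line

sample = [b'{"response":"Hel"}', b'{"response":"lo."}', b'{"done":true}']
print(collect_stream(sample))  # Hello.
```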
Chat completions in JavaScript:
const response = await fetch("http://localhost:11434/api/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "llama3.1:8b",
messages: [
{ role: "user", content: "What is a tail call optimization?" }
],
stream: false
})
});
const data = await response.json();
console.log(data.message.content);
The API is compatible with the OpenAI client library format for chat completions, so you can often swap in the Ollama base URL to test your existing OpenAI-powered code locally at zero cost.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # required by the client but not checked
)
completion = client.chat.completions.create(
model="llama3.1:8b",
messages=[{"role": "user", "content": "What is memoization?"}]
)
print(completion.choices[0].message.content)
This compatibility layer makes it easy to build something locally and switch to a cloud model later if you need more capability at scale.
Hardware reality check
Here is what to expect on common Mac configurations:
| Hardware | Model | Tokens per second |
|---|---|---|
| MacBook Air M1 (8 GB RAM) | llama3.2:3b | 35 to 50 |
| MacBook Pro M2 (16 GB RAM) | llama3.1:8b | 40 to 60 |
| Mac Studio M2 Ultra (64 GB RAM) | llama3.3:70b | 20 to 35 |
| MacBook Pro M4 (32 GB RAM) | qwen2.5:14b | 30 to 50 |
For comparison, a typical cloud API response streams at around 50 to 80 tokens per second. Local models on Apple Silicon are in the same ballpark for 7B to 8B models, and the latency is lower because there is no network round trip.
On an Intel Mac with 8 GB RAM, you will get usable but slow responses. Expect 5 to 15 tokens per second on a 3B model. It works, but the experience is noticeably sluggish compared to Apple Silicon.
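You can measure your own machine rather than trusting a table. Ollama's final response object reports eval_count (tokens generated) and eval_duration (nanoseconds), which give tokens per second directly. A sketch with illustrative numbers:

```python
def tokens_per_second(stats: dict) -> float:
    """Compute generation speed from Ollama's eval_count / eval_duration fields."""
    return round(stats["eval_count"] / (stats["eval_duration"] / 1e9), 1)

# Illustrative stats, shaped like the final /api/generate response object:
stats = {"eval_count": 240, "eval_duration": 5_000_000_000}  # 240 tokens in 5 s
print(tokens_per_second(stats))  # 48.0
```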
What this setup cannot do
It is worth being honest about the gaps:
Context length. Consumer models running locally typically handle 4K to 8K tokens well. Long documents or large codebases will hit limits faster than GPT-4 class models.
Multimodal input. Some models support images (llava, moondream), but vision quality is well below GPT-4o on most tasks. It is getting better with each generation.
Tool use and function calling. Ollama supports function calling in newer models, but reliability varies. If you need robust agent behavior with tool use, a cloud model is still more dependable. See the AI agent frameworks comparison for a broader look at what each framework expects from the underlying model.
Memory across sessions. Each new ollama run session starts fresh. Persistent memory requires an application layer built on top of the API.
None of these are blockers for most everyday use cases. For writing help, code explanation, summarization, and general Q&A, a local 8B model is genuinely capable.
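That application layer can be thin. A minimal sketch of session memory: keep the message list and resend it with every /api/chat call so the stateless API appears to remember earlier turns. The `send` parameter is a hypothetical hook added here so the history logic can be exercised without a running server:

```python
import json
import urllib.request

class ChatSession:
    """Fakes persistent memory over a stateless API by resending history."""

    def __init__(self, model: str = "llama3.1:8b",
                 base_url: str = "http://localhost:11434"):
        self.model = model
        self.base_url = base_url
        self.messages: list[dict] = []

    def ask(self, text: str, send=None) -> str:
        """Append the user turn, get a reply, remember both."""
        self.messages.append({"role": "user", "content": text})
        reply = (send or self._post)(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

    def _post(self, messages) -> str:
        req = urllib.request.Request(
            f"{self.base_url}/api/chat",
            data=json.dumps({"model": self.model, "messages": messages,
                             "stream": False}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["message"]["content"]

# History grows turn by turn, so the model sees earlier context on every call.
# (The lambda stands in for a live server here.)
s = ChatSession()
s.ask("My name is Ada.", send=lambda msgs: "Nice to meet you, Ada.")
print(len(s.messages))  # 2
```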
The combination of Apple Silicon and well-quantized open source models has made this setup practical for the first time. A year ago, running a useful model locally on a laptop required careful compromises. Today it is straightforward enough that the main reason not to try it is not knowing it exists.
If you are curious about privacy-focused inference at a larger scale or want to explore how confidential compute fits into this picture, Cocoon’s decentralized inference network is another approach worth reading about.
Launching an AI tool or side project? Check out the list of AI directories you can submit to, and the basic SEO guide for getting your post-launch content right.