
Free local AI assistant on macOS with no API keys or subscriptions

Updated April 14, 2026

Category: AI Development

Local AI assistant running on macOS with Ollama

You do not need an API key to run a capable AI assistant in 2026. On a modern Mac, you can pull down an open source model, run it locally, and have a chat interface ready in under ten minutes. Nothing leaves your machine. There is no subscription, no rate limit, and no vendor deciding what the model is allowed to say.

This guide walks through the full setup using Ollama, which has become the standard way to manage local models on macOS. By the end you will have a working local AI assistant, an optional chat UI, and a simple HTTP API you can call from your own code.

If you are already building AI agent systems and want to replace cloud inference with something private and free, this is a practical starting point.


Why run AI locally on your Mac?

The answer depends on what matters to you. Here are the reasons that come up most often:

Privacy. Nothing is sent to a third party server. If you paste in sensitive code, a draft email, or a private document, it stays on your machine.

Cost. Once the model is downloaded, inference is free. No per-token pricing, no monthly bill.

Availability. It works offline. No outages, no API rate limits during peak hours.

Experimentation. You can swap models freely, try quantized variants, and tune prompts without worrying about burning through credits.

The tradeoff is speed and capability. A model running on your MacBook will be slower than GPT-4 or Claude Sonnet and the largest models will not fit in RAM on most consumer hardware. But for everyday tasks like summarization, drafting, coding help, and Q&A, a 7B or 8B parameter model is genuinely useful.


What you need before you start

You need a Mac running macOS 12 Monterey or later. Homebrew is optional but makes things easier. You do not need an Nvidia GPU: Ollama has native support for Apple Silicon and runs inference on the GPU via Metal.

Recommended minimum specs:

Spec    | Minimum       | Better
RAM     | 8 GB          | 16 GB or more
Storage | 5 GB free     | 20 GB free
Chip    | Intel Core i5 | Apple Silicon (M1 or later)

Apple Silicon machines (M1 through M4) get much better performance because Ollama uses Metal to run inference on the GPU cores directly.


Step 1: Install Ollama

Ollama is an open source tool that handles model downloads, runtime management, and a local HTTP API. It runs as a background service.

The quickest install is from the official site:

→ Download Ollama for macOS

Or install it via Homebrew:

Terminal
brew install ollama

Once installed, start the Ollama service:

Terminal
ollama serve

You should see output confirming the server is running on http://localhost:11434. Keep this terminal open or run it as a background service (covered in Step 5).
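If you want to confirm the server is reachable from code before moving on, here is a minimal sketch in Python. The `is_ollama_running` helper is written for this post, not part of Ollama; it just checks whether anything answers HTTP at the default address.

```python
import urllib.request
import urllib.error

def is_ollama_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            # Ollama's root path replies with a short "Ollama is running" page.
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(is_ollama_running())  # True once `ollama serve` is up
```

This is handy in scripts that should fail fast with a clear message instead of a connection error deep inside a request.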


Step 2: Pull a model

Ollama uses a model registry similar to Docker Hub. You pull a model by name and it downloads the weights locally.

Terminal
ollama pull llama3.2

This pulls Meta’s Llama 3.2 model at roughly 2 GB for the default 3B variant. Once the download finishes, the model is stored at ~/.ollama/models.

Which model should you pick?

Here is a practical breakdown of what runs well on a Mac in 2026:

Model       | Size on disk | Best for                          | Minimum RAM
llama3.2:3b | 2 GB         | Quick answers, coding help        | 8 GB
llama3.2:8b | 4.7 GB       | General assistant, longer context | 8 GB
gemma3:4b   | 3.3 GB       | Instruction following, chat       | 8 GB
qwen2.5:7b  | 4.4 GB       | Code, reasoning, multilingual     | 8 GB
mistral:7b  | 4.1 GB       | Writing, summarization            | 8 GB

If you are on Apple Silicon with 16 GB or more of RAM, start with llama3.2:8b or qwen2.5:7b. Both produce solid results for everyday developer tasks.

Terminal
# Pull the 8B variant explicitly
ollama pull llama3.2:8b

# Or try Qwen 2.5 for coding tasks
ollama pull qwen2.5:7b

You can have multiple models installed at the same time and switch between them freely.
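You can also list what is installed programmatically: `GET /api/tags` returns the installed models as JSON. A sketch of extracting the names, run here against a canned response rather than a live server (the sample is abridged; real entries carry more fields):

```python
import json

# Abridged shape of a GET http://localhost:11434/api/tags response.
sample = json.dumps({
    "models": [
        {"name": "llama3.2:8b", "size": 4700000000},
        {"name": "qwen2.5:7b", "size": 4400000000},
    ]
})

def installed_models(tags_json: str) -> list[str]:
    """Pull the model names out of an /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json)["models"]]

print(installed_models(sample))  # -> ['llama3.2:8b', 'qwen2.5:7b']
```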


Step 3: Run your first conversation

Once a model is pulled, you can chat with it directly in the terminal:

Terminal
ollama run llama3.2:8b

This drops you into an interactive prompt. Type your message and press Enter. To exit, type /bye.

Terminal
>>> Explain the difference between a process and a thread in simple terms.

A process is an independent program running in its own memory space...

The first response after loading may take a few seconds while the model weights are read into memory. Subsequent responses in the same session are faster.


Step 4: Add a chat interface

The terminal works fine, but most people prefer a proper chat UI. There are two good options depending on what you want.

Option A: Open WebUI

Open WebUI is a self-hosted chat interface that connects to your local Ollama server. It looks similar to ChatGPT but runs entirely on your machine.

→ Open WebUI on GitHub

Install and run it with Docker:

Terminal
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 in your browser. You will see a full chat UI where you can select any installed Ollama model from a dropdown and start a conversation.

Open WebUI also supports document uploads, image input (with vision models), and basic prompt management. It is a significant upgrade from the terminal if you plan to use the assistant regularly.

Option B: A floating macOS app

If you want something lighter that lives outside the browser, there are several native macOS apps that wrap Ollama. The one that got attention on r/SideProject recently is a floating assistant that sits above your other windows and responds to a keyboard shortcut.

The general pattern for this type of app:

  1. Download the .dmg from the project’s GitHub releases page
  2. Open it and allow the app in System Settings under Privacy & Security (System Preferences on macOS 12)
  3. Grant Accessibility permissions so it can read selected text
  4. Set a hotkey to summon it from anywhere

This type of interface is useful when you want to ask the model about something you are reading or drafting without switching apps.


Step 5: Keep it running headlessly

If you want Ollama to start automatically when you log in and run in the background without keeping a terminal open, you can register it as a macOS launch agent.

Create the plist file:

Terminal
cat > ~/Library/LaunchAgents/com.ollama.server.plist << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.server</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/ollama</string>
        <string>serve</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/ollama.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/ollama.err</string>
</dict>
</plist>
EOF

The binary path depends on how you installed Ollama: Homebrew on Apple Silicon puts it at /opt/homebrew/bin/ollama, while the macOS app installer links it at /usr/local/bin/ollama. Check with which ollama and adjust the plist if needed.

Load the agent:

Terminal
launchctl load ~/Library/LaunchAgents/com.ollama.server.plist

Ollama will now start automatically on login and restart if it crashes. You can check logs at /tmp/ollama.log.


Using the Ollama API from your own code

Ollama exposes a local REST API on port 11434. You can call it from any language. This is useful if you want to add AI features to a script or app without paying for cloud inference.

Generate a completion in Python:

Python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:8b",
        "prompt": "Summarize the key ideas in the Unix philosophy in three bullet points.",
        "stream": False
    }
)

print(response.json()["response"])
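With `"stream": False` you get a single JSON object back. With streaming enabled, Ollama instead sends newline-delimited JSON chunks, each carrying a piece of the response. A sketch of reassembling them, shown here on canned chunks rather than a live request:

```python
import json

def join_stream(lines):
    """Concatenate the 'response' field of each NDJSON chunk until 'done'."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Example chunks in the shape /api/generate streams them.
chunks = [
    '{"response": "Small ", "done": false}',
    '{"response": "tools, ", "done": false}',
    '{"response": "composed well.", "done": true}',
]
print(join_stream(chunks))  # -> Small tools, composed well.
```

Streaming matters for interactive use: tokens appear as they are generated instead of after a long pause.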

Chat completions in JavaScript:

JavaScript
const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
        model: "llama3.2:8b",
        messages: [
            { role: "user", content: "What is a tail call optimization?" }
        ],
        stream: false
    })
});

const data = await response.json();
console.log(data.message.content);

The API is compatible with the OpenAI client library format for chat completions, so you can often swap in the Ollama base URL to test your existing OpenAI-powered code locally at zero cost.

Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by the client but not checked
)

completion = client.chat.completions.create(
    model="llama3.2:8b",
    messages=[{"role": "user", "content": "What is memoization?"}]
)

print(completion.choices[0].message.content)

This compatibility layer makes it easy to build something locally and switch to a cloud model later if you need more capability at scale.
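One way to exploit that compatibility is to read the base URL and model from environment variables, so the same code targets local Ollama by default and a cloud endpoint when configured. A hypothetical sketch (the variable names here are this post's convention, not a standard):

```python
import os
from dataclasses import dataclass

@dataclass
class LLMConfig:
    base_url: str
    model: str
    api_key: str

def load_llm_config(env=os.environ) -> LLMConfig:
    """Default to local Ollama; override via environment for cloud use."""
    return LLMConfig(
        base_url=env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        model=env.get("LLM_MODEL", "llama3.2:8b"),
        api_key=env.get("LLM_API_KEY", "ollama"),  # placeholder; Ollama ignores it
    )

cfg = load_llm_config({})
print(cfg.base_url)  # -> http://localhost:11434/v1
```

Feed `cfg.base_url`, `cfg.api_key`, and `cfg.model` into the OpenAI client shown above, and switching backends becomes a deployment setting rather than a code change.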


Hardware reality check

Here is what to expect on common Mac configurations:

Hardware                        | Model        | Tokens per second
MacBook Air M1 (8 GB RAM)       | llama3.2:3b  | 35 to 50
MacBook Pro M2 (16 GB RAM)      | llama3.2:8b  | 40 to 60
Mac Studio M2 Ultra (64 GB RAM) | llama3.3:70b | 20 to 35
MacBook Pro M4 (32 GB RAM)      | qwen2.5:14b  | 30 to 50

For comparison, a typical cloud API response streams at around 50 to 80 tokens per second. Local models on Apple Silicon are in the same ballpark for 7B to 8B models, and the latency is lower because there is no network round trip.
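To translate those rates into felt latency, the arithmetic is just tokens divided by throughput: at 40 tokens per second, a typical 300-token answer streams in seven to eight seconds.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream a response of the given length."""
    return tokens / tokens_per_second

print(seconds_for(300, 40))  # -> 7.5  (roughly an 8B model on an M2)
print(seconds_for(300, 10))  # -> 30.0 (roughly a 3B model on an Intel Mac)
```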

On an Intel Mac with 8 GB RAM, you will get usable but slow responses. Expect 5 to 15 tokens per second on a 3B model. It works, but the experience is noticeably sluggish compared to Apple Silicon.


What this setup cannot do

It is worth being honest about the gaps:

Context length. Consumer models running locally typically handle 4K to 8K tokens well, and Ollama's default context window is modest unless you raise it (the num_ctx parameter). Long documents or large codebases will hit limits faster than GPT-4 class models.

Multimodal input. Some models support images (llava, moondream), but vision quality is well below GPT-4o on most tasks. It is getting better with each generation.

Tool use and function calling. Ollama supports function calling in newer models, but reliability varies. If you need robust agent behavior with tool use, a cloud model is still more dependable. See the AI agent frameworks comparison for a broader look at what each framework expects from the underlying model.

Memory across sessions. Each new ollama run session starts fresh. Persistent memory requires an application layer built on top of the API.
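That application layer can be as small as a list of messages you resend with every request. A minimal sketch of such a wrapper (the class is illustrative, not part of Ollama; `send_fn` stands in for a real POST to /api/chat):

```python
class Conversation:
    """Accumulates chat history so each request carries the full context."""

    def __init__(self, model: str, send_fn):
        self.model = model
        self.send_fn = send_fn  # callable(model, messages) -> assistant reply text
        self.messages = []

    def ask(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        reply = self.send_fn(self.model, self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

# Fake transport for demonstration; swap in a real call to /api/chat.
echo = lambda model, messages: f"({len(messages)} messages seen)"
chat = Conversation("llama3.2:8b", echo)
chat.ask("Hello")
print(chat.ask("Do you remember me?"))  # -> (3 messages seen)
```

Persisting `chat.messages` to disk between runs is all it takes to get memory across sessions; the model itself stays stateless.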

None of these are blockers for most everyday use cases. For writing help, code explanation, summarization, and general Q&A, a local 8B model is genuinely capable.


The combination of Apple Silicon and well-quantized open source models has made this setup practical for the first time. A year ago, running a useful model locally on a laptop required careful compromises. Today it is straightforward enough that the main reason not to try it is not knowing it exists.

If you are curious about privacy-focused inference at a larger scale or want to explore how confidential compute fits into this picture, Cocoon’s decentralized inference network is another approach worth reading about.


Launching an AI tool or side project? Check out the list of AI directories you can submit to, and the basic SEO guide for getting your post-launch content right.
