Step-by-step guide for installing cura-llm-native on Apple Silicon. The TL;DR is in the README — this page explains what each step does, how to verify it, and what the common failure modes look like.
Windows users: the PowerShell variant follows the same logic. Replace `*.sh` with `*.ps1` throughout.
| Required | Minimum | Verify |
|---|---|---|
| macOS | 12+ on Apple Silicon (M1/M2/M3/M4) | `uname -sm` → `Darwin arm64` |
| Homebrew | any recent version | `brew --version` |
| Disk | 10 GB free per model (q4 8B ≈ 5 GB; q4 70B ≈ 40 GB) | `df -h ~` |
| RAM | 16 GB minimum, 32 GB+ for 13B–34B, 64 GB+ for 70B | `sysctl -n hw.memsize` |
If brew is missing, install from brew.sh first. The install script will refuse to continue without it.
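To check the RAM row without converting bytes by hand, a one-liner with plain shell arithmetic (nothing repo-specific) prints unified memory in GiB:

```bash
# hw.memsize reports bytes; convert to GiB to compare against the RAM row above
echo "$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 )) GiB"
```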
```bash
git clone https://github.com/chevp/cura-llm-native.git
cd cura-llm-native
cp .env.example .env
```
The .env file controls the runtime. Defaults are sensible — only edit if you hit a port conflict (see §3) or want to bind on the LAN (OLLAMA_HOST=0.0.0.0:11434).
| Variable | Default | When to change |
|---|---|---|
| `OLLAMA_HOST` | `127.0.0.1:11434` | Conflict with cura-llm-local (Docker) → set `:11435` |
| `OLLAMA_KEEP_ALIVE` | `24h` | Lower if you want models unloaded after idle |
| `OLLAMA_MAX_LOADED_MODELS` | `2` | Raise only if you have headroom for chat + embed + extras |
| `OLLAMA_NUM_PARALLEL` | `1` | Raise for a multi-user setup |
| `OLLAMA_FLASH_ATTENTION` | `1` | Set `0` only when debugging unexpected output |
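If you want to mirror what the scripts do when they read this file, one minimal way (a sketch, assuming `.env` holds plain `KEY=VALUE` lines as in `.env.example`) is the standard auto-export pattern:

```bash
# Export every KEY=VALUE pair from .env into the current shell
set -a            # auto-export all variables defined from here on
source .env
set +a
echo "$OLLAMA_HOST"   # sanity check: 127.0.0.1:11434 by default
```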
```bash
./scripts/install.sh
```
What happens:
1. Checks the platform is Darwin (arm64).
2. Checks `brew` is on PATH.
3. Prints `✓ Ollama already installed` and exits, or runs `brew install ollama` (pulls in mlx, mlx-c, python@3.14, sqlite, openssl@3 as dependencies).

Verify:

```bash
ollama --version   # ollama version is 0.21.2 (or newer)
which ollama       # /opt/homebrew/bin/ollama → confirms arm64 build
```
If `which ollama` resolves under /usr/local/... you have the Intel build under Rosetta — re-install under arm64:

```bash
arch -arm64 brew reinstall ollama
```
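An independent way to confirm the architecture of the installed binary, using only standard macOS tooling:

```bash
file "$(which ollama)"
# Mach-O 64-bit executable arm64  ← native build
# (a universal binary may list several architectures; arm64 must be present,
#  and x86_64 alone means the Rosetta build)
```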
```bash
./scripts/start.sh
```
What happens:
1. Loads `.env` and reads `OLLAMA_HOST`.
2. If a server is already listening, prints `✓ Ollama already listening` and exits.
3. Otherwise starts `ollama serve` via `nohup`, writes the PID to `.run/ollama.pid` and stdout/stderr to `.run/ollama.log`.

Verify:

```bash
curl -s http://127.0.0.1:11434/api/tags | jq .
# {"models": []}   ← fresh install, no models yet
```
Port conflict with `cura-llm-local` (Docker variant)

The most common failure on cura developer machines:
```
Error: listen tcp 127.0.0.1:11434: bind: address already in use
```
Both cura-llm-local (Docker) and cura-llm-native default to port 11434. They cannot run side-by-side on the same port. Pick one:
Option A — keep both, native on 11435 (recommended for cura devs)
```bash
# .env
OLLAMA_HOST=127.0.0.1:11435
```
Then restart: `./scripts/stop.sh && ./scripts/start.sh`.
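After the restart, repeat the earlier verification against the new port:

```bash
curl -s http://127.0.0.1:11435/api/tags | jq .
```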
Option B — stop the Docker variant
```bash
docker compose -f /path/to/cura-llm-local/docker-compose.yml down
./scripts/start.sh
```
Identify which Ollama is bound to the port:
```bash
lsof -iTCP:11434 -sTCP:LISTEN
# COMMAND     PID  USER ...
# com.docke   ...       ← Docker variant
# ollama      ...       ← native variant
```
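If the listener is a stray native instance (say, the PID file under `.run/` was deleted), `lsof -t` prints just the PID, which makes freeing the port a one-liner. Only do this for the native `ollama` process; the Docker variant should be stopped with `docker compose ... down` as above:

```bash
# Terminate whatever process is listening on 11434
kill "$(lsof -tiTCP:11434 -sTCP:LISTEN)"
```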
```bash
./scripts/pull-model.sh llama3.1:8b-instruct-q4_K_M
```
What happens: a streaming download from the Ollama registry into ~/.ollama/models/. The default 8B model is ~5 GB — expect a few minutes on a fast connection.
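The script is presumably a thin wrapper over the standard Ollama CLI; if you ever need to bypass it, the direct pull targets the same registry and the same `~/.ollama/models/` directory:

```bash
ollama pull llama3.1:8b-instruct-q4_K_M
```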
Pick the right model for your machine:
| Unified memory | Recommended max model | Disk |
|---|---|---|
| 16 GB | `llama3.1:8b-instruct-q4_K_M` | ~5 GB |
| 32 GB | `mixtral:8x7b-instruct-q4_K_M` | ~26 GB |
| 64 GB+ | `llama3.1:70b-instruct-q4_K_M` | ~40 GB |
| any | `nomic-embed-text` (embeddings) | ~270 MB |
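The table reduces to a small shell check if you want to script the choice; the cut-offs below simply mirror the rows above:

```bash
# Pick the largest recommended chat model for this machine's unified memory
gib=$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
if   [ "$gib" -ge 64 ]; then model="llama3.1:70b-instruct-q4_K_M"
elif [ "$gib" -ge 32 ]; then model="mixtral:8x7b-instruct-q4_K_M"
else                         model="llama3.1:8b-instruct-q4_K_M"
fi
./scripts/pull-model.sh "$model"
```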
Verify:
```bash
ollama list
# NAME                          SIZE    MODIFIED
# llama3.1:8b-instruct-q4_K_M   4.7 GB  1 minute ago
```
```bash
./scripts/test-prompt.sh llama3.1:8b-instruct-q4_K_M "What is Apple Metal?"
```
A correct answer is a paragraph of text. If you get null, see Troubleshooting → null response.
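The same smoke test without the helper script, straight against Ollama's HTTP API (`/api/generate` and its fields are standard Ollama, nothing repo-specific):

```bash
curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model": "llama3.1:8b-instruct-q4_K_M",
       "prompt": "What is Apple Metal?",
       "stream": false}' | jq -r .response
```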
Verify GPU utilization while the model is loaded:
```bash
ollama ps
# NAME                          PROCESSOR
# llama3.1:8b-instruct-q4_K_M   100% GPU   ← Metal active
```
If ollama ps shows 100% CPU, the model is running on CPU — almost always because the wrong arch was installed (see §2).
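For a scripted version of this check, grepping the `ollama ps` output is enough, though the column layout is not a stable interface, so treat it as a heuristic:

```bash
if ollama ps | grep -q "100% GPU"; then
  echo "Metal active"
else
  echo "CPU fallback: re-check the installed arch (see §2)"
fi
```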
For chat + embeddings (cura’s default RAG configuration):
```bash
./scripts/setup-rag.sh
```
This pulls one chat model and nomic-embed-text so the server can keep both warm (OLLAMA_MAX_LOADED_MODELS=2).
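To confirm the embedding model answers, call Ollama's standard embeddings endpoint directly; 768 is the documented dimension of `nomic-embed-text`:

```bash
curl -s http://127.0.0.1:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "smoke test"}' \
  | jq '.embedding | length'
# 768
```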
For null responses, CPU fallback, and OOM, see Troubleshooting.