Direct Apple Metal on macOS. Direct CUDA on Windows. No container overhead, no GPU passthrough headaches.
Sister repo to cura-llm-local (Docker variant). Same Ollama HTTP API on localhost:11434 — the rest of the cura stack is interchangeable between the two.
MIT License · Submodule of cura at infrastructure/llm-native/
cura-llm-local runs Ollama in Docker for prod parity (production runs Linux/Docker on AWS). The trade-off: Docker on macOS does not expose the Apple Silicon GPU, so the container is CPU-only — fine for 7B-q4 models, painful for anything larger.
| | cura-llm-local (Docker) | cura-llm-native (this repo) |
|---|---|---|
| Prod parity | identical to AWS | host-specific |
| macOS GPU (Metal) | CPU-only | full Metal |
| Windows GPU (CUDA) | via WSL2 passthrough | direct |
| Larger models (13B+) | slow | fast |
| Setup | docker compose up | OS installer + ollama serve |
Ollama auto-detects Metal on Apple Silicon and CUDA on NVIDIA. No container layer between the model and the silicon.
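A quick sanity check that the GPU is actually in use — run a prompt, then ask Ollama where the model landed; the PROCESSOR column should report GPU:

```bash
# After the first prompt, PROCESSOR should read "100% GPU" when Metal/CUDA is active
ollama run llama3.1:8b-instruct-q4_K_M "ping"
ollama ps
```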
Tuned for the 13B–70B range — long German reports, anamneses, RAG over large sources, multilingual patient data.
Same Ollama HTTP API on 11434. Swap in either repo and the rest of cura keeps working unchanged.
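One way to confirm that whichever variant is running answers on the shared port is Ollama's version endpoint:

```bash
# Identical response path for the Docker and native variants
curl http://localhost:11434/api/version
```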
Clone, install, pull a model, prompt it.
```bash
git clone https://github.com/chevp/cura-llm-native.git
cd cura-llm-native
cp .env.example .env
./scripts/install.sh
./scripts/start.sh
./scripts/pull-model.sh llama3.1:8b-instruct-q4_K_M
./scripts/test-prompt.sh llama3.1:8b-instruct-q4_K_M "Was ist Metal?"
```
Installs Ollama via Homebrew, starts the service, pulls the default 8B chat model.
```powershell
git clone https://github.com/chevp/cura-llm-native.git
cd cura-llm-native
Copy-Item .env.example .env
.\scripts\install.ps1
.\scripts\start.ps1
.\scripts\pull-model.ps1 llama3.1:8b-instruct-q4_K_M
.\scripts\test-prompt.ps1 llama3.1:8b-instruct-q4_K_M "Was ist CUDA?"
```
Downloads the official Ollama Windows installer and runs it. Accept the UAC prompt. Requires NVIDIA driver (CUDA 12+) for GPU acceleration.
```bash
# HTTP API (same on both platforms)
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b-instruct-q4_K_M","prompt":"Hello","stream":false}'

# Or interactive shell
ollama run llama3.1:8b-instruct-q4_K_M
```
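For multi-turn use, Ollama also exposes /api/chat, which takes a message list instead of a single prompt — a minimal sketch:

```bash
# Chat endpoint — roles: system / user / assistant
curl http://localhost:11434/api/chat \
  -d '{"model":"llama3.1:8b-instruct-q4_K_M","messages":[{"role":"user","content":"Summarize Metal in one sentence."}],"stream":false}'
```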
Want a basic RAG stack?
Pulls a chat model and an embedding model in one go:
```bash
# macOS
./scripts/setup-rag.sh
```

```powershell
# Windows
.\scripts\setup-rag.ps1
```
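Once the script finishes, a quick embedding smoke test against the default embedding model (nomic-embed-text, see the table below):

```bash
# Returns a JSON object with an "embedding" vector
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"Patient reports recurring headaches."}'
```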
Two tiers: a snappy default set that runs anywhere, and a heavyweight set that justifies the native install.
Comfortable on a 16 GB Mac or an 8 GB NVIDIA GPU. Use these as defaults; reach for the heavyweight tier when quality, vision, or context length actually matters.
| Model | Disk | Strength | Use case |
|---|---|---|---|
| llama3.1:8b-instruct-q4_K_M (default) | ~5 GB | all-rounder | Well-rounded chat, the cura-llm-native default |
| qwen2.5:7b-instruct | ~4.5 GB | DE text | Noticeably stronger on German prose & multilingual content than Llama-3.1 |
| llama3:8b-instruct | ~5 GB | stable | Stable, well-formatted reports — the original Llama-3.0 line |
| mistral:7b-instruct-q4_K_M | ~4 GB | snappy | Snappy fallback, matches the cura-llm-local default |
| nomic-embed-text | ~270 MB | embed | Embeddings for RAG over reports & anamneses |
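Any row can be pulled via the helper script from Quick Start (assuming it forwards its argument to ollama pull) or with the Ollama CLI directly:

```bash
./scripts/pull-model.sh qwen2.5:7b-instruct
# equivalent, without the script:
ollama pull qwen2.5:7b-instruct
```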
Which 7B/8B for what?
- qwen2.5:7b-instruct — noticeably more precise than Llama on German
- llama3:8b-instruct — stable and well-formatted
- mistral:7b-instruct — snappy fallback, matches the cura-llm-local default
- Need more headroom? mixtral:8x7b (see heavyweight tier, ~26 GB)

For long German reports, anamneses, multilingual patient data, and RAG over large sources. Realistic on Apple Silicon M-Pro/Max with 32 GB+ unified memory, or an NVIDIA GPU with ≥ 16 GB VRAM. The 70B/72B rows need 64 GB+ unified memory or 48 GB+ VRAM.
| Model | Disk | Strength | Use case |
|---|---|---|---|
| mistral-nemo:12b-instruct-q4_K_M | ~7 GB | long-ctx · DE | 128k context — ideal for long patient reports and longitudinal records |
| qwen2.5:14b-instruct | ~9 GB | DE text | Stronger than Llama on German prose — cleanly structured reports |
| llama3.1:13b-instruct-q4_K_M | ~8 GB | stable | Solid step up from 8B, fits 12 GB VRAM / 16 GB unified |
| gemma2:27b-instruct-q4_K_M | ~16 GB | DE text | Google Gemma 2 — very solid German output, good report phrasing |
| qwen2.5:32b-instruct-q4_K_M | ~20 GB | DE top mid | Best German quality in the mid-size class — precise, formal register |
| command-r:35b | ~20 GB | RAG · DE | Cohere Command-R — RAG-optimized, strong German, built for citations |
| mixtral:8x7b-instruct-q4_K_M | ~26 GB | MoE · multi-lang | Very good when hardware allows — broad knowledge, fast for its size, solid German |
| llama3.3:70b-instruct-q4_K_M | ~43 GB | top-tier | Latest 70B Llama — best instruction following, very good report prose |
| qwen2.5:72b-instruct-q4_K_M | ~44 GB | top-tier DE | Best open model for German reports — top pick when 64 GB+ is available |
| Unified memory (Apple Silicon) | Recommended max |
|---|---|
| 16 GB | llama3.1:8b · qwen2.5:7b · llama3:8b |
| 32 GB | qwen2.5:32b · gemma2:27b · mixtral:8x7b · command-r:35b |
| 64 GB+ | qwen2.5:72b · llama3.3:70b |
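Not sure which row you are in? Total unified memory on macOS:

```bash
# hw.memsize is in bytes; convert to GB
echo "$(($(sysctl -n hw.memsize) / 1024 / 1024 / 1024)) GB"
```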
| VRAM | Recommended max |
|---|---|
| 8 GB | llama3.1:8b · qwen2.5:7b · llama3:8b |
| 12 GB | mistral-nemo:12b · qwen2.5:14b · llama3.1:13b |
| 16–24 GB | qwen2.5:32b · gemma2:27b · command-r:35b · mixtral:8x7b (partial offload) |
| 48 GB+ | qwen2.5:72b · llama3.3:70b |
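And the NVIDIA equivalent:

```bash
# Total VRAM per GPU
nvidia-smi --query-gpu=name,memory.total --format=csv
```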
Picking from the heavyweight tier
- Long context / longitudinal records → mistral-nemo:12b (128k context)
- RAG with citations → command-r:35b
- Best German in the mid-size class → qwen2.5:32b-instruct or gemma2:27b
- Tight memory (12 GB VRAM / 16 GB unified) → qwen2.5:14b-instruct or llama3.1:13b
- Best German overall → qwen2.5:72b-instruct-q4_K_M
- Best instruction following → llama3.3:70b-instruct-q4_K_M
- Broad knowledge, fast for its size → mixtral:8x7b (MoE)

.env is read by the start scripts and exported into the Ollama process.
| Variable | Default | Meaning |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Bind address |
| OLLAMA_KEEP_ALIVE | 24h | How long a model stays in (V)RAM after the last request |
| OLLAMA_MAX_LOADED_MODELS | 2 | Parallel loaded models (e.g. chat + embed) |
| OLLAMA_NUM_PARALLEL | 1 | Concurrent requests per model |
| OLLAMA_FLASH_ATTENTION | 1 | Faster attention kernels (Metal + CUDA) |
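For reference, a .env that restates exactly these defaults — adjust per machine:

```bash
# .env — values shown are the documented defaults
OLLAMA_HOST=127.0.0.1:11434
OLLAMA_KEEP_ALIVE=24h
OLLAMA_MAX_LOADED_MODELS=2
OLLAMA_NUM_PARALLEL=1
OLLAMA_FLASH_ATTENTION=1
```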
What to expect on each OS.
Metal acceleration auto-detected. Watch unified-memory pressure with mactop or Activity Monitor — if you see swapping, drop a tier.
Best models: 8B-q4 default, 27B–35B class on M-Pro/Max with 32 GB+, 70B–72B-q4 on 64 GB+.
Ollama detects CUDA on launch. Verify with ollama ps — the PROCESSOR column should show GPU. If the model runs 100% on CPU, the driver is too old or the model is too big.
Best models: sized to VRAM — see the table above.
ROCm on Windows is experimental — expect CPU fallback in most cases.
Recommendation: for AMD GPUs, prefer the Linux + Docker path in cura-llm-local.
Remove a single model, or wipe everything.
```bash
# stop the server
./scripts/stop.sh

# remove a single model
ollama rm <model>

# full removal incl. models
brew uninstall ollama
rm -rf ~/.ollama
```
```powershell
.\scripts\stop.ps1
ollama rm <model>

# Full removal: uninstall via Apps & Features, then:
Remove-Item -Recurse -Force "$env:USERPROFILE\.ollama"
```
The five things that usually go wrong on a native install.
Another Ollama instance is running — very often the Docker variant from cura-llm-local. Stop it, or move this instance to a different port:
```bash
# Move this instance to 11435 (add to .env)
OLLAMA_HOST=127.0.0.1:11435
```
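To find out what is holding the port first (macOS; on Windows, netstat -ano | findstr 11434 does the same job):

```bash
# List the process bound to 11434
lsof -i :11434
```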
Confirm with ollama ps. If it shows 100% CPU, you likely installed an Intel build of Ollama. Reinstall under arm64:
```bash
arch -arm64 brew reinstall ollama
```
Run nvidia-smi. If it fails, install or update the NVIDIA driver and reboot. After driver updates, restart Ollama.
A 70B model at q4 needs ~46 GB of (V)RAM. On a 32 GB Mac, fall back one tier:
```bash
ollama pull qwen2.5:32b-instruct-q4_K_M
# or
ollama pull mixtral:8x7b-instruct-q4_K_M
```
Cold start — the model is loading into (V)RAM. Increase OLLAMA_KEEP_ALIVE in .env to keep it warm longer between requests.
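Per the Ollama API docs, a generate request without a prompt just loads the model — handy for pre-warming after a restart:

```bash
# Loads the model into (V)RAM without generating any tokens
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b-instruct-q4_K_M"}'
```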
Step-by-step guides for the parts the README only summarises.
- cura — parent project; this repo is a submodule at infrastructure/llm-native/.
- cura-llm-local — Docker variant. Prod parity, CPU-only on macOS.
- Backend that talks to Ollama on 11434.
- Frontend chat UI.
Either LLM repo exposes the same Ollama HTTP API on localhost:11434 — the rest of the stack is interchangeable.