Direct Apple Metal on macOS. Direct CUDA on Windows. No container overhead, no GPU passthrough headaches.
Sister repo to cura-llm-local (Docker variant). Same Ollama HTTP API on localhost:11434 — the rest of the cura stack is interchangeable between the two.
MIT License · Submodule of cura at infrastructure/llm-native/
cura-llm-local runs Ollama in Docker for prod parity (production runs Linux/Docker on AWS). The trade-off: Docker on macOS does not expose the Apple Silicon GPU, so the container is CPU-only — fine for 7B-q4 models, painful for anything larger.
| | cura-llm-local (Docker) | cura-llm-native (this repo) |
|---|---|---|
| Prod parity | identical to AWS | host-specific |
| macOS GPU (Metal) | CPU-only | full Metal |
| Windows GPU (CUDA) | via WSL2 passthrough | direct |
| Larger models (13B+) | slow | fast |
| Setup | docker compose up | OS installer + ollama serve |
Ollama auto-detects Metal on Apple Silicon and CUDA on NVIDIA. No container layer between the model and the silicon.
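A quick sanity check that the GPU is actually in use — run a prompt, then ask Ollama where the model landed; the PROCESSOR column should report GPU:

```bash
# After the first prompt, PROCESSOR should read "100% GPU" when Metal/CUDA is active
ollama run llama3.1:8b-instruct-q4_K_M "ping"
ollama ps
```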
Tuned for the 13B–70B range — long German reports, anamneses, RAG over large sources, multilingual patient data.
Same Ollama HTTP API on 11434. Swap in either repo and the rest of cura keeps working unchanged.
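One way to confirm that whichever variant is running answers on the shared port is Ollama's version endpoint:

```bash
# Identical response path for the Docker and native variants
curl http://localhost:11434/api/version
```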
Clone, install, pull a model, prompt it.
```bash
git clone https://github.com/chevp/cura-llm-native.git
cd cura-llm-native
cp .env.example .env
./scripts/install.sh
./scripts/start.sh
./scripts/pull-model.sh llama3.1:8b-instruct-q4_K_M
./scripts/test-prompt.sh llama3.1:8b-instruct-q4_K_M "Was ist Metal?"
```
Installs Ollama via Homebrew, starts the service, pulls the default 8B chat model.
```powershell
git clone https://github.com/chevp/cura-llm-native.git
cd cura-llm-native
Copy-Item .env.example .env
.\scripts\install.ps1
.\scripts\start.ps1
.\scripts\pull-model.ps1 llama3.1:8b-instruct-q4_K_M
.\scripts\test-prompt.ps1 llama3.1:8b-instruct-q4_K_M "Was ist CUDA?"
```
Downloads the official Ollama Windows installer and runs it. Accept the UAC prompt. Requires NVIDIA driver (CUDA 12+) for GPU acceleration.
```bash
# HTTP API (same on both platforms)
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b-instruct-q4_K_M","prompt":"Hello","stream":false}'

# Or interactive shell
ollama run llama3.1:8b-instruct-q4_K_M
```
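For multi-turn use, Ollama also exposes /api/chat, which takes a message list instead of a single prompt — a minimal sketch:

```bash
# Chat endpoint — roles: system / user / assistant
curl http://localhost:11434/api/chat \
  -d '{"model":"llama3.1:8b-instruct-q4_K_M","messages":[{"role":"user","content":"Summarize Metal in one sentence."}],"stream":false}'
```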
Want a basic RAG stack?
Pulls a chat model and an embedding model in one go:
```bash
# macOS
./scripts/setup-rag.sh
```

```powershell
# Windows
.\scripts\setup-rag.ps1
```
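Once the script finishes, a quick embedding smoke test against the default embedding model (nomic-embed-text, see the table below):

```bash
# Returns a JSON object with an "embedding" vector
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"Patient reports recurring headaches."}'
```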
Two tiers: a snappy default set that runs anywhere, and a heavyweight set that justifies the native install.
Comfortable on a 16 GB Mac or an 8 GB NVIDIA GPU. Use these as defaults; reach for the heavyweight tier when quality, vision, or context length actually matters.
| Model | Disk | Strength | Use case |
|---|---|---|---|
| llama3.1:8b-instruct-q4_K_M (default) | ~5 GB | all-rounder | Well-rounded chat, the cura-llm-native default |
| qwen2.5:7b-instruct | ~4.5 GB | DE text | Noticeably stronger on German prose & multilingual content than Llama-3.1 |
| llama3:8b-instruct | ~5 GB | stable | Stable, well-formatted reports — the original Llama-3.0 line |
| mistral:7b-instruct-q4_K_M | ~4 GB | snappy | Snappy fallback, matches the cura-llm-local default |
| nomic-embed-text | ~270 MB | embed | Embeddings for RAG over reports & anamneses |
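Any row can be pulled via the helper script from Quick Start (assuming it forwards its argument to ollama pull) or with the Ollama CLI directly:

```bash
./scripts/pull-model.sh qwen2.5:7b-instruct
# equivalent, without the script:
ollama pull qwen2.5:7b-instruct
```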
Which 7B/8B for what?
- qwen2.5:7b-instruct — noticeably more precise than Llama on German
- llama3:8b-instruct — stable and well-formatted
- mistral:7b-instruct — snappy fallback, matches the cura-llm-local default
- Need more headroom? mixtral:8x7b (see heavyweight tier, ~26 GB)

For long German reports, anamneses, multilingual patient data, and RAG over large sources. Realistic on Apple Silicon M-Pro/Max with 32 GB+ unified memory, or an NVIDIA GPU with ≥ 16 GB VRAM. The 70B/72B rows need 64 GB+ unified memory or 48 GB+ VRAM.
| Model | Disk | Strength | Use case |
|---|---|---|---|
| mistral-nemo:12b-instruct-q4_K_M | ~7 GB | long-ctx · DE | 128k context — ideal for long patient reports and longitudinal records |
| qwen2.5:14b-instruct | ~9 GB | DE text | Stronger than Llama on German prose — cleanly structured reports |
| llama3.1:13b-instruct-q4_K_M | ~8 GB | stable | Solid step up from 8B, fits 12 GB VRAM / 16 GB unified |
| gemma2:27b-instruct-q4_K_M | ~16 GB | DE text | Google Gemma 2 — very solid German output, good report phrasing |
| qwen2.5:32b-instruct-q4_K_M | ~20 GB | DE top mid | Best German quality in the mid-size class — precise, formal register |
| command-r:35b | ~20 GB | RAG · DE | Cohere Command-R — RAG-optimized, strong German, built for citations |
| mixtral:8x7b-instruct-q4_K_M | ~26 GB | MoE · multi-lang | Very good when hardware allows — broad knowledge, fast for its size, solid German |
| llama3.3:70b-instruct-q4_K_M | ~43 GB | top-tier | Latest 70B Llama — best instruction following, very good report prose |
| qwen2.5:72b-instruct-q4_K_M | ~44 GB | top-tier DE | Best open model for German reports — top pick when 64 GB+ is available |
| Unified memory (Apple Silicon) | Recommended max |
|---|---|
| 16 GB | llama3.1:8b · qwen2.5:7b · llama3:8b |
| 32 GB | qwen2.5:32b · gemma2:27b · mixtral:8x7b · command-r:35b |
| 64 GB+ | qwen2.5:72b · llama3.3:70b |
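Not sure which row you are in? Total unified memory on macOS:

```bash
# hw.memsize is in bytes; convert to GB
echo "$(($(sysctl -n hw.memsize) / 1024 / 1024 / 1024)) GB"
```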
| VRAM | Recommended max |
|---|---|
| 8 GB | llama3.1:8b · qwen2.5:7b · llama3:8b |
| 12 GB | mistral-nemo:12b · qwen2.5:14b · llama3.1:13b |
| 16–24 GB | qwen2.5:32b · gemma2:27b · command-r:35b · mixtral:8x7b (partial offload) |
| 48 GB+ | qwen2.5:72b · llama3.3:70b |
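And the NVIDIA equivalent:

```bash
# Total VRAM per GPU
nvidia-smi --query-gpu=name,memory.total --format=csv
```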
Picking from the heavyweight tier
- Long context / longitudinal records → mistral-nemo:12b (128k context)
- RAG with citations → command-r:35b
- Best German in the mid-size class → qwen2.5:32b-instruct or gemma2:27b
- Tight memory (12 GB VRAM / 16 GB unified) → qwen2.5:14b-instruct or llama3.1:13b
- Best German overall → qwen2.5:72b-instruct-q4_K_M
- Best instruction following → llama3.3:70b-instruct-q4_K_M
- Broad knowledge, fast for its size → mixtral:8x7b (MoE)

.env is read by the start scripts and exported into the Ollama process.
| Variable | Default | Meaning |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Bind address |
| OLLAMA_KEEP_ALIVE | 24h | How long a model stays in (V)RAM after the last request |
| OLLAMA_MAX_LOADED_MODELS | 2 | Parallel loaded models (e.g. chat + embed) |
| OLLAMA_NUM_PARALLEL | 1 | Concurrent requests per model |
| OLLAMA_FLASH_ATTENTION | 1 | Faster attention kernels (Metal + CUDA) |
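For reference, a .env that restates exactly these defaults — adjust per machine:

```bash
# .env — values shown are the documented defaults
OLLAMA_HOST=127.0.0.1:11434
OLLAMA_KEEP_ALIVE=24h
OLLAMA_MAX_LOADED_MODELS=2
OLLAMA_NUM_PARALLEL=1
OLLAMA_FLASH_ATTENTION=1
```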
What to expect on each OS.
Metal acceleration auto-detected. Watch unified-memory pressure with mactop or Activity Monitor — if you see swapping, drop a tier.
Best models: 8B-q4 default, 27B–35B class on M-Pro/Max with 32 GB+, 70B–72B-q4 on 64 GB+.
Ollama detects CUDA on launch. Verify with ollama ps — the PROCESSOR column should show GPU. If the model runs 100% on CPU, the driver is too old or the model is too big.
Best models: sized to VRAM — see the table above.
ROCm on Windows is experimental — expect CPU fallback in most cases.
Recommendation: for AMD GPUs, prefer the Linux + Docker path in cura-llm-local.
Remove a single model, or wipe everything.
```bash
# stop the server
./scripts/stop.sh

# remove a single model
ollama rm <model>

# full removal incl. models
brew uninstall ollama
rm -rf ~/.ollama
```
```powershell
.\scripts\stop.ps1
ollama rm <model>

# Full removal: uninstall via Apps & Features, then:
Remove-Item -Recurse -Force "$env:USERPROFILE\.ollama"
```
The five things that usually go wrong on a native install.
Another Ollama instance is running — very often the Docker variant from cura-llm-local. Stop it, or move this instance to a different port:
```bash
# Move this instance to 11435 (add to .env)
OLLAMA_HOST=127.0.0.1:11435
```
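To find out what is holding the port first (macOS; on Windows, netstat -ano | findstr 11434 does the same job):

```bash
# List the process bound to 11434
lsof -i :11434
```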
Confirm with ollama ps. If it shows 100% CPU, you likely installed an Intel build of Ollama. Reinstall under arm64:
```bash
arch -arm64 brew reinstall ollama
```
Run nvidia-smi. If it fails, install or update the NVIDIA driver and reboot. After driver updates, restart Ollama.
A 70B model at q4 needs ~46 GB of (V)RAM. On a 32 GB Mac, fall back one tier:
```bash
ollama pull qwen2.5:32b-instruct-q4_K_M
# or
ollama pull mixtral:8x7b-instruct-q4_K_M
```
Cold start — the model is loading into (V)RAM. Increase OLLAMA_KEEP_ALIVE in .env to keep it warm longer between requests.
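Per the Ollama API docs, a generate request without a prompt just loads the model — handy for pre-warming after a restart:

```bash
# Loads the model into (V)RAM without generating any tokens
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b-instruct-q4_K_M"}'
```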
Step-by-step guides for the parts the README only summarises.
- cura — parent project; this repo is a submodule at infrastructure/llm-native/.
- cura-llm-local — Docker variant. Prod parity, CPU-only on macOS.
- Backend that talks to Ollama on 11434.
- Frontend chat UI.
Either LLM repo exposes the same Ollama HTTP API on localhost:11434 — the rest of the stack is interchangeable.