
cura-llm-native

Ollama · Apple Metal · NVIDIA CUDA · 13B – 70B ready

Run an LLM natively.
Full GPU, larger models.

Direct Apple Metal on macOS. Direct CUDA on Windows. No container overhead, no GPU passthrough headaches.

Sister repo to cura-llm-local (Docker variant). Same Ollama HTTP API on localhost:11434 — the rest of the cura stack is interchangeable between the two.


MIT License · Submodule of cura at infrastructure/llm-native/

Why native — and not Docker?

cura-llm-local runs Ollama in Docker for prod parity (production runs Linux/Docker on AWS). The trade-off: Docker on macOS does not expose the Apple Silicon GPU, so the container is CPU-only — fine for 7B-q4 models, painful for anything larger.

                      cura-llm-local (Docker)     cura-llm-native (this repo)
Prod parity           identical to AWS            host-specific
macOS GPU (Metal)     CPU-only                    full Metal
Windows GPU (CUDA)    via WSL2 passthrough        direct
Larger models (13B+)  slow                        fast
Setup                 docker compose up           OS installer + ollama serve

Native install, full GPU

Ollama auto-detects Metal on Apple Silicon and CUDA on NVIDIA. No container layer between the model and the silicon.


Heavier models welcome

Tuned for the 13B–70B range — long German reports, anamneses, RAG over large sources, multilingual patient data.


API-compatible with Docker

Same Ollama HTTP API on 11434. Swap in either repo and the rest of cura keeps working unchanged.
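
Either backend answers the same standard Ollama endpoints, so downstream services never need to know which repo is serving. A quick check:

# works identically against the native install and the Docker variant
curl http://localhost:11434/api/version   # Ollama version
curl http://localhost:11434/api/tags      # models available locally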

Quick start

Clone, install, pull a model, prompt it.


macOS (Apple Silicon)

git clone https://github.com/chevp/cura-llm-native.git
cd cura-llm-native
cp .env.example .env
./scripts/install.sh
./scripts/start.sh
./scripts/pull-model.sh llama3.1:8b-instruct-q4_K_M
./scripts/test-prompt.sh llama3.1:8b-instruct-q4_K_M "Was ist Metal?"

Installs Ollama via Homebrew, starts the service, pulls the default 8B chat model.
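
Two quick checks to confirm the install worked (exact ollama ps columns vary by Ollama version):

# server reachable?
curl http://localhost:11434/api/tags
# model resident and on the GPU? (PROCESSOR should report GPU, not CPU)
ollama ps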


Windows (PowerShell)

git clone https://github.com/chevp/cura-llm-native.git
cd cura-llm-native
Copy-Item .env.example .env
.\scripts\install.ps1
.\scripts\start.ps1
.\scripts\pull-model.ps1 llama3.1:8b-instruct-q4_K_M
.\scripts\test-prompt.ps1 llama3.1:8b-instruct-q4_K_M "Was ist CUDA?"

Downloads the official Ollama Windows installer and runs it. Accept the UAC prompt. Requires NVIDIA driver (CUDA 12+) for GPU acceleration.
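
To confirm the GPU is actually in play before digging further (output formats vary by driver and Ollama version):

# driver and CUDA runtime visible?
nvidia-smi
# watch GPU memory fill while a prompt is running
nvidia-smi -l 1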


Hit the API directly

# HTTP API (same on both platforms)
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b-instruct-q4_K_M","prompt":"Hello","stream":false}'
# Or interactive shell
ollama run llama3.1:8b-instruct-q4_K_M

Want a basic RAG stack?

Pulls a chat model and an embedding model in one go:

# macOS
./scripts/setup-rag.sh
# Windows
.\scripts\setup-rag.ps1
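
Once both models are pulled, embeddings come from the same HTTP API. A minimal sketch, assuming the script pulled nomic-embed-text as the embedding model:

# one embedding vector for a text snippet (response contains an "embedding" array)
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"Anamnese: Patientin klagt über Kopfschmerzen."}'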

Recommended models

Two tiers: a snappy default set that runs anywhere, and a heavyweight set that justifies the native install.

Lightweight tier — fast everywhere

Comfortable on a 16 GB Mac or an 8 GB NVIDIA GPU. Use these as defaults; reach for the heavyweight tier when quality, vision, or context length actually matters.

Model                         Disk      Strength      Use case
llama3.1:8b-instruct-q4_K_M   ~5 GB     all-rounder   Well-rounded chat, the cura-llm-native default
qwen2.5:7b-instruct           ~4.5 GB   DE text       Noticeably stronger on German prose & multilingual content than Llama-3.1
llama3:8b-instruct            ~5 GB     stable        Stable, well-formatted reports — the original Llama-3.0 line
mistral:7b-instruct-q4_K_M    ~4 GB     snappy        Snappy fallback, matches the cura-llm-local default
nomic-embed-text              ~270 MB   embed         Embeddings for RAG over reports & anamneses

Which 7B/8B for what?

  • German prose / patient data / anamneses: qwen2.5:7b-instruct — noticeably more precise than Llama on DE
  • Reports, lists, structured JSON output: llama3:8b-instruct — stable and well-formatted
  • Fast iteration / short chat: mistral:7b-instruct
  • Best quality if hardware allows: mixtral:8x7b (see heavyweight tier, ~26 GB)

Heavyweight tier — native install pays off here

For long German reports, anamneses, multilingual patient data, and RAG over large sources. Realistic on Apple Silicon M-Pro/Max with 32 GB+ unified memory, or an NVIDIA GPU with ≥ 16 GB VRAM. The 70B/72B rows need 64 GB+ unified memory or 48 GB+ VRAM.

Model                               Disk     Strength           Use case
mistral-nemo:12b-instruct-q4_K_M    ~7 GB    long-ctx · DE      128k context — ideal for long patient reports and longitudinal records
qwen2.5:14b-instruct                ~9 GB    DE text            Stronger than Llama on German prose — cleanly structured reports
llama3.1:13b-instruct-q4_K_M        ~8 GB    stable             Solid step up from 8B, fits 12 GB VRAM / 16 GB unified
gemma2:27b-instruct-q4_K_M          ~16 GB   DE text            Google Gemma 2 — very solid German output, good report phrasing
qwen2.5:32b-instruct-q4_K_M         ~20 GB   DE top mid         Best German quality in the mid-size class — precise, formal register
command-r:35b                       ~20 GB   RAG · DE           Cohere Command-R — RAG-optimized, strong German, built for citations
mixtral:8x7b-instruct-q4_K_M        ~26 GB   MoE · multi-lang   Very good when hardware allows — broad knowledge, fast for its size, solid German
llama3.3:70b-instruct-q4_K_M        ~43 GB   top-tier           Latest 70B Llama — best instruction following, very good report prose
qwen2.5:72b-instruct-q4_K_M         ~44 GB   top-tier DE        Best open model for German reports — top pick when 64 GB+ is available

Apple Silicon — sized to unified memory

Memory   Recommended max
16 GB    llama3.1:8b · qwen2.5:7b · llama3:8b
32 GB    qwen2.5:32b · gemma2:27b · mixtral:8x7b · command-r:35b
64 GB+   qwen2.5:72b · llama3.3:70b

Windows NVIDIA — sized to VRAM

VRAM       Recommended max
8 GB       llama3.1:8b · qwen2.5:7b · llama3:8b
12 GB      mistral-nemo:12b · qwen2.5:14b · llama3.1:13b
16–24 GB   qwen2.5:32b · gemma2:27b · command-r:35b · mixtral:8x7b (partial offload)
48 GB+     qwen2.5:72b · llama3.3:70b

Picking from the heavyweight tier

  • Long reports / RAG over large sources: mistral-nemo:12b (128k context)
  • RAG with citations over German documents: command-r:35b
  • Best German report quality up to 32B: qwen2.5:32b-instruct or gemma2:27b
  • Stable German reports on moderate hardware: qwen2.5:14b-instruct or llama3.1:13b
  • Best open-source German quality overall: qwen2.5:72b-instruct-q4_K_M
  • Best general quality overall: llama3.3:70b-instruct-q4_K_M
  • Best knowledge/speed ratio (if hardware allows: 32 GB+ unified / 24 GB+ VRAM): mixtral:8x7b (MoE)

Configuration

.env is read by the start scripts and exported into the Ollama process.

Variable                   Default           Meaning
OLLAMA_HOST                127.0.0.1:11434   Bind address
OLLAMA_KEEP_ALIVE          24h               How long a model stays in (V)RAM after the last request
OLLAMA_MAX_LOADED_MODELS   2                 Models kept loaded in parallel (e.g. chat + embed)
OLLAMA_NUM_PARALLEL        1                 Concurrent requests per model
OLLAMA_FLASH_ATTENTION     1                 Faster attention kernels (Metal + CUDA)
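
In .env form, the defaults above look like this (a sketch; the repo's .env.example is the authoritative template):

# .env, exported into the Ollama process by scripts/start.sh / start.ps1
OLLAMA_HOST=127.0.0.1:11434
OLLAMA_KEEP_ALIVE=24h
OLLAMA_MAX_LOADED_MODELS=2      # e.g. chat model + embedding model
OLLAMA_NUM_PARALLEL=1
OLLAMA_FLASH_ATTENTION=1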

Platform notes

What to expect on each OS.


macOS Apple Silicon

Metal acceleration auto-detected. Watch unified-memory pressure with mactop or Activity Monitor — if you see swapping, drop a tier.
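
A rough terminal-side check for the same swapping, using a tool that ships with macOS (mactop gives a nicer live view):

# pageouts/swapins climbing while a model generates means it is swapping
vm_stat | grep -E 'Pageouts|Swapins'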

Best models: 8B-q4 default, 27B–35B class on M-Pro/Max with 32 GB+, 70B–72B-q4 on 64 GB+.


Windows NVIDIA

Ollama detects CUDA on launch. Verify with ollama ps — the GPU layer count should be non-zero. If 0 layers on GPU, the driver is too old or the model is too big.

Best models: sized to VRAM — see the table above.


Windows AMD

ROCm on Windows is experimental — expect CPU fallback in most cases.

Recommendation: for AMD GPUs, prefer the Linux + Docker path in cura-llm-local.

Stop & clean up

Remove a single model, or wipe everything.

macOS

# stop the server
./scripts/stop.sh

# remove a single model
ollama rm <model>

# full removal incl. models
brew uninstall ollama
rm -rf ~/.ollama

Windows

.\scripts\stop.ps1
ollama rm <model>

# Full removal: uninstall via Apps & Features, then:
Remove-Item -Recurse -Force "$env:USERPROFILE\.ollama"

Troubleshooting

The five things that usually go wrong on a native install.

Port 11434 already in use

Another Ollama instance is running — very often the Docker variant from cura-llm-local. Stop it, or move this instance to a different port:

# Move this instance to 11435
OLLAMA_HOST=127.0.0.1:11435   # add to .env
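
If it is unclear what is holding the port, check before killing anything (macOS shown; on Windows, Get-NetTCPConnection -LocalPort 11434 lists the owning process):

# who is listening on 11434?
lsof -i :11434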
macOS: model loads on CPU instead of Metal

Confirm with ollama ps. If it shows 100% CPU, you likely installed an Intel build of Ollama. Reinstall under arm64:

arch -arm64 brew reinstall ollama
Windows: CUDA not detected

Run nvidia-smi. If it fails, install or update the NVIDIA driver and reboot. After driver updates, restart Ollama.

Out of memory loading a 70B model

A 70B model at q4 needs roughly 46 GB of memory. On a 32 GB Mac, fall back one tier:

ollama pull qwen2.5:32b-instruct-q4_K_M
# or
ollama pull mixtral:8x7b-instruct-q4_K_M
Slow first response, fast afterwards

Cold start — the model is loading into (V)RAM. Increase OLLAMA_KEEP_ALIVE in .env to keep it warm longer between requests.
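
Alternatively, warm the model explicitly after startup: Ollama loads a model when it receives a generate request without a prompt, and a per-request keep_alive overrides the .env default.

# preload the default model and keep it resident for an hour
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b-instruct-q4_K_M","keep_alive":"1h"}'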

Deep dive

Step-by-step guides for the parts the README only summarises.


Installation walkthrough

Every step explained, with commands to verify each one worked.


Troubleshooting

null responses, port conflicts with the Docker variant, CPU fallback, OOM.


Architecture

How the two cura LLM backends fit together — and when to pick which.

Relation to cura

Submodule of cura at infrastructure/llm-native/.


cura-llm-local

Docker variant. Prod-parity, CPU-only on macOS.


cura-server

Backend that talks to Ollama on 11434.


cura-app

Frontend chat UI.

Either LLM repo exposes the same Ollama HTTP API on localhost:11434 — the rest of the stack is interchangeable.