One docker compose up. Persistent models. CPU or GPU.
Same container locally and on AWS — the model you debug is the model that ships. macOS, Linux and Windows supported. NVIDIA GPU optional.
MIT License
A minimal, opinionated Ollama-in-Docker setup — ready for chat and RAG.
- Copy `.env.example`, run `docker compose up -d`. Healthcheck included, restarts automatically.
- Models live in a named Docker volume. `compose down` keeps them; `down -v` wipes them on purpose (see the sketch after this list).
- Two slots warm by default: one chat model, one embedding model. `setup-rag.sh` pulls both.
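A quick way to see that volume behaviour, as a sketch; it assumes the compose service is named `ollama` (check `docker compose ps` if yours differs):

```bash
./scripts/pull-model.sh llama3.2:3b       # pull once
docker compose down && docker compose up -d
docker compose exec ollama ollama list    # the model survived the restart

docker compose down -v                    # only -v deletes the model volume
```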
From zero to a working LLM in three commands.
```bash
git clone https://github.com/chevp/cura-llm-local.git
cd cura-llm-local
cp .env.example .env
```
```bash
# CPU (default; works on macOS, Linux, Windows)
docker compose up -d

# Linux + NVIDIA GPU
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
```
```bash
./scripts/pull-model.sh mistral:7b-instruct-q4_K_M
./scripts/test-prompt.sh mistral:7b-instruct-q4_K_M 'What is Docker?'

# Or hit the HTTP API directly
curl http://localhost:11434/api/generate \
  -d '{"model":"mistral:7b-instruct-q4_K_M","prompt":"Hello","stream":false}'
```
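For multi-turn conversations, Ollama's standard `/api/chat` endpoint works the same way against this container; a minimal sketch:

```bash
curl http://localhost:11434/api/chat \
  -d '{
    "model": "mistral:7b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "What is a Docker volume?"}],
    "stream": false
  }'
```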
CPU-only on Apple Silicon — these are the ones that actually feel fast.
| Model | Disk | Use case |
|---|---|---|
| llama3.2:3b | ~2 GB | Fast dev loop, short answers |
| phi3:mini | ~2 GB | Very fast, small reasoning tasks |
| mistral:7b-instruct-q4_K_M (default) | ~4 GB | Well-rounded chat |
| llama3.1:8b-instruct-q4_K_M | ~5 GB | Slightly higher quality |
| nomic-embed-text | ~270 MB | Embeddings (RAG) |
Want a basic RAG stack?
Pull a chat model and an embedding model in one go:
```bash
./scripts/setup-rag.sh
```
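With both models pulled, the embedding half of the stack is reachable through Ollama's standard `/api/embeddings` endpoint; the prompt below is just illustrative:

```bash
# Returns a JSON object with an "embedding" array of floats
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"Docker volumes persist model weights."}'
```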
When you have a GPU and you want the good stuff.
GPU required.
These models target Linux / Windows hosts with an NVIDIA GPU and ≥ 24 GB VRAM.
Use `docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d`.
**macOS (Apple Silicon):** do not attempt these locally. The Apple GPU is not exposed to Docker; inference falls back to CPU and these models will swap your machine into the ground.
| Model | Disk | VRAM | Use case |
|---|---|---|---|
| llama3.1:70b-instruct-q4_K_M | ~40 GB | ≥ 48 GB | Flagship general & reasoning |
| qwen2.5:32b-instruct-q4_K_M | ~20 GB | ≥ 24 GB | Strong coding & multilingual |
| mixtral:8x7b-instruct-v0.1-q4_K_M | ~26 GB | ≥ 32 GB | MoE — fast inference for its size |
| deepseek-r1:32b | ~20 GB | ≥ 24 GB | Reasoning specialist |
`q4_K_M` is the documented default; pull `q5_K_M` or `q8_0` if you have VRAM headroom and want quality over throughput.
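For example (the exact tag is an assumption; check which quants the Ollama library publishes for each model before pulling):

```bash
# Illustrative tag: verify it exists on ollama.com/library first
./scripts/pull-model.sh qwen2.5:32b-instruct-q5_K_M
```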
What to expect on each OS.
**macOS (Apple Silicon):** Native ARM64 image, no Rosetta. CPU-only inside the container; the Apple GPU is not exposed to Docker. Best models: q4-quantised 7B–8B.

**Linux:** Full GPU support via nvidia-container-toolkit. Use the GPU compose override. Best models: any; a GPU handles 13B+ easily.

**Windows:** WSL2 + Docker Desktop. The NVIDIA GPU works through WSL CUDA with a current driver. Best models: same as on Linux, depending on whether you run CPU-only or with the NVIDIA GPU.
The five things that usually go wrong, and how to fix them.
**Port 11434 is already in use:** native Ollama is running. Either stop it or change the port:

```bash
# macOS
brew services stop ollama

# Or pick a different port in .env
OLLAMA_PORT=11435
```
**`could not select device driver "nvidia"` on Linux:** install nvidia-container-toolkit:

```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify with `docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi`.
**The container doesn't use the GPU:** run `docker compose exec ollama nvidia-smi`. If that doesn't list the GPU, the override isn't active or the host driver is missing. Check `nvidia-smi` on the host first, then re-up with the GPU override.
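If the host looks fine, re-creating the container with the override named explicitly is usually enough:

```bash
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d --force-recreate
docker compose exec ollama nvidia-smi   # should list the GPU now
```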
**Container shows `unhealthy`:** the first 20 s is the configured `start_period`. If the issue persists, look at the logs:

```bash
docker compose logs ollama
```
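Two extra probes that are independent of this repo's healthcheck (assuming the service is named `ollama`; the root-path reply is stock Ollama behaviour):

```bash
# Health state and last probe output, straight from Docker
docker inspect --format '{{json .State.Health}}' "$(docker compose ps -q ollama)"

# The server answers on the root path once it's up
curl -s http://localhost:11434/    # -> "Ollama is running"
```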
**Inference is slow:** expected without a GPU. Switch to a smaller, quantised model:

```bash
./scripts/pull-model.sh llama3.2:3b
./scripts/pull-model.sh phi3:mini
```
If you're on macOS and just want speed, install Ollama natively (`brew install ollama`); it uses the Apple GPU via Metal. You lose dev/prod parity, though.
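A minimal sketch of that native route, assuming Homebrew:

```bash
brew install ollama
brew services start ollama        # native daemon, Metal-accelerated
ollama run llama3.2:3b 'What is Docker?'
```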