One docker compose up. Persistent models. CPU or GPU.
Same container locally and on AWS — the model you debug is the model that ships. macOS, Linux and Windows supported. NVIDIA GPU optional.
MIT License
A minimal, opinionated Ollama-in-Docker setup — ready for chat and RAG.
- Copy `.env.example`, run `docker compose up -d`. Healthcheck included, restarts automatically.
- Models live in a named Docker volume. `compose down` keeps them; `down -v` wipes them on purpose (see the sketch after this list).
- Two slots warm by default: one chat model, one embedding model. `setup-rag.sh` pulls both.
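A quick way to see that volume behaviour, as a sketch; it assumes the compose service is named `ollama` (check `docker compose ps` if yours differs):

```bash
./scripts/pull-model.sh llama3.2:3b       # pull once
docker compose down && docker compose up -d
docker compose exec ollama ollama list    # the model survived the restart

docker compose down -v                    # only -v deletes the model volume
```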
From zero to a working LLM in three commands.
```bash
git clone https://github.com/chevp/cura-llm-local.git
cd cura-llm-local
cp .env.example .env
```
```bash
# CPU (default; works on macOS, Linux, Windows)
docker compose up -d

# Linux + NVIDIA GPU
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
```
```bash
./scripts/pull-model.sh mistral:7b-instruct-q4_K_M
./scripts/test-prompt.sh mistral:7b-instruct-q4_K_M 'What is Docker?'

# Or hit the HTTP API directly
curl http://localhost:11434/api/generate \
  -d '{"model":"mistral:7b-instruct-q4_K_M","prompt":"Hello","stream":false}'
```
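For multi-turn conversations, Ollama's standard `/api/chat` endpoint works the same way against this container; a minimal sketch:

```bash
curl http://localhost:11434/api/chat \
  -d '{
    "model": "mistral:7b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "What is a Docker volume?"}],
    "stream": false
  }'
```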
CPU-only on Apple Silicon — these are the ones that actually feel fast.
| Model | Disk | Use case |
|---|---|---|
| llama3.2:3b | ~2 GB | Fast dev loop, short answers |
| phi3:mini | ~2 GB | Very fast, small reasoning tasks |
| mistral:7b-instruct-q4_K_M (default) | ~4 GB | Well-rounded chat |
| llama3.1:8b-instruct-q4_K_M | ~5 GB | Slightly higher quality |
| nomic-embed-text | ~270 MB | Embeddings (RAG) |
Want a basic RAG stack?
Pull a chat model and an embedding model in one go:
```bash
./scripts/setup-rag.sh
```
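With both models pulled, the embedding half of the stack is reachable through Ollama's standard `/api/embeddings` endpoint; the prompt below is just illustrative:

```bash
# Returns a JSON object with an "embedding" array of floats
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"Docker volumes persist model weights."}'
```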
When you have a GPU and you want the good stuff.
GPU required.
These models target Linux / Windows hosts with an NVIDIA GPU and ≥ 24 GB VRAM.
Use `docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d`.
**macOS (Apple Silicon):** do not attempt these locally. The Apple GPU is not exposed to Docker; inference falls back to CPU and these models will swap your machine into the ground.
| Model | Disk | VRAM | Use case |
|---|---|---|---|
| llama3.1:70b-instruct-q4_K_M | ~40 GB | ≥ 48 GB | Flagship general & reasoning |
| qwen2.5:32b-instruct-q4_K_M | ~20 GB | ≥ 24 GB | Strong coding & multilingual |
| mixtral:8x7b-instruct-v0.1-q4_K_M | ~26 GB | ≥ 32 GB | MoE — fast inference for its size |
| deepseek-r1:32b | ~20 GB | ≥ 24 GB | Reasoning specialist |
`q4_K_M` is the documented default; pull `q5_K_M` or `q8_0` if you have VRAM headroom and want quality over throughput.
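For example (the exact tag is an assumption; check which quants the Ollama library publishes for each model before pulling):

```bash
# Illustrative tag: verify it exists on ollama.com/library first
./scripts/pull-model.sh qwen2.5:32b-instruct-q5_K_M
```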
What to expect on each OS.
**macOS (Apple Silicon):** Native ARM64 image, no Rosetta. CPU-only inside the container; the Apple GPU is not exposed to Docker. Best models: q4-quantised 7B–8B.

**Linux:** Full GPU support via nvidia-container-toolkit. Use the GPU compose override. Best models: any; a GPU handles 13B+ easily.

**Windows:** WSL2 + Docker Desktop. The NVIDIA GPU works through WSL CUDA with a current driver. Best models: same as on Linux, depending on whether you run CPU-only or with the NVIDIA GPU.
The five things that usually go wrong, and how to fix them.
**Port 11434 is already in use:** native Ollama is running. Either stop it or change the port:

```bash
# macOS
brew services stop ollama

# Or pick a different port in .env
OLLAMA_PORT=11435
```
**`could not select device driver "nvidia"` on Linux:** install nvidia-container-toolkit:

```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify with `docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi`.
**The container doesn't use the GPU:** run `docker compose exec ollama nvidia-smi`. If that doesn't list the GPU, the override isn't active or the host driver is missing. Check `nvidia-smi` on the host first, then re-up with the GPU override.
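If the host looks fine, re-creating the container with the override named explicitly is usually enough:

```bash
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d --force-recreate
docker compose exec ollama nvidia-smi   # should list the GPU now
```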
**Container shows `unhealthy`:** the first 20 s is the configured `start_period`. If the issue persists, look at the logs:

```bash
docker compose logs ollama
```
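Two extra probes that are independent of this repo's healthcheck (assuming the service is named `ollama`; the root-path reply is stock Ollama behaviour):

```bash
# Health state and last probe output, straight from Docker
docker inspect --format '{{json .State.Health}}' "$(docker compose ps -q ollama)"

# The server answers on the root path once it's up
curl -s http://localhost:11434/    # -> "Ollama is running"
```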
**Inference is slow:** expected without a GPU. Switch to a smaller, quantised model:

```bash
./scripts/pull-model.sh llama3.2:3b
./scripts/pull-model.sh phi3:mini
```
If you're on macOS and just want speed, install Ollama natively (`brew install ollama`); it uses the Apple GPU via Metal. You lose dev/prod parity, though.
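A minimal sketch of that native route, assuming Homebrew:

```bash
brew install ollama
brew services start ollama        # native daemon, Metal-accelerated
ollama run llama3.2:3b 'What is Docker?'
```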