RAG (Retrieval Augmented Generation) Tutorial
Introduction
Retrieval Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by combining them with external knowledge retrieval systems. Instead of relying solely on the model's training data, RAG allows LLMs to access and reason over specific documents, databases, or knowledge bases in real-time.
This tutorial covers the concepts, architecture, and practical implementation of RAG systems using Java and LangChain4j.
What is RAG?
RAG (Retrieval Augmented Generation) is a three-step process:
- Retrieval: Find relevant information from a knowledge base
- Augmentation: Provide this information as context to the LLM
- Generation: LLM generates a response based on the retrieved context
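The three steps can be sketched end-to-end in plain Java with a toy keyword "retriever" and a stubbed "generator" (both hypothetical stand-ins; no real embedding model or LLM is involved):

```java
import java.util.Arrays;
import java.util.List;

public class ToyRag {

    // Retrieval: pick the document sharing the most words with the question
    static String retrieve(String question, List<String> docs) {
        String[] words = question.toLowerCase().split("\\s+");
        String best = "";
        long bestScore = -1;
        for (String doc : docs) {
            String text = doc.toLowerCase();
            long score = Arrays.stream(words).filter(text::contains).count();
            if (score > bestScore) {
                bestScore = score;
                best = doc;
            }
        }
        return best;
    }

    // Augmentation: combine retrieved context with the question
    static String augment(String context, String question) {
        return "Context:\n" + context + "\n\nQuestion: " + question;
    }

    // Generation: a stub standing in for the LLM call
    static String generate(String prompt) {
        return "Answer based on:\n" + prompt;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "Employees get 25 vacation days per year.",
                "The office dress code is business casual.");
        String question = "How many vacation days do employees get?";
        System.out.println(generate(augment(retrieve(question, docs), question)));
    }
}
```

Real systems replace the keyword match with vector similarity search and the stub with an actual LLM call, but the retrieve-augment-generate shape stays the same.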
Why Use RAG?
Problems RAG Solves:
- Hallucinations: LLMs sometimes generate false information
- Outdated Knowledge: Training data has a cutoff date
- Domain-Specific Knowledge: LLMs lack specialized information
- Unverifiable Answers: plain LLM responses cannot be traced to a source; RAG provides cited, traceable information
Benefits:
- Access to current information
- Domain-specific expertise
- Reduced hallucinations
- Cost-effective (no model retraining needed)
- Source attribution
RAG Architecture
Traditional LLM Flow
User Question → LLM → Response
RAG Flow
User Question
↓
Question Embedding
↓
Vector Database Search (Similarity)
↓
Retrieve Relevant Documents
↓
Combine Question + Retrieved Context
↓
LLM (with context)
↓
Response (based on retrieved knowledge)
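The "Vector Database Search (Similarity)" step typically ranks stored embeddings by cosine similarity to the question embedding. A minimal plain-Java sketch of the metric itself (no vector database involved; the example vectors are made up for illustration):

```java
public class CosineSimilarity {

    // Cosine similarity: dot(a, b) / (|a| * |b|), ranges over [-1, 1]
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] question = {0.9, 0.1, 0.0};
        double[] chunkAboutVacation = {0.8, 0.2, 0.1};
        double[] chunkAboutTech = {0.0, 0.1, 0.9};
        System.out.println(cosine(question, chunkAboutVacation)); // high: similar direction
        System.out.println(cosine(question, chunkAboutTech));     // low: different direction
    }
}
```

Vector stores index these comparisons (e.g. with approximate nearest-neighbor structures) so that retrieval stays fast even over millions of chunks.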
Core Components
1. Document Loader
Loads and processes documents from various sources.
2. Text Splitter
Breaks documents into manageable chunks.
3. Embedding Model
Converts text chunks into vector representations.
4. Vector Store
Stores and indexes embeddings for fast similarity search.
5. Retriever
Finds relevant chunks based on query similarity.
6. LLM
Generates responses using retrieved context.
Implementation with LangChain4j
Prerequisites
<dependencies>
<!-- LangChain4j Core -->
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j</artifactId>
<version>0.34.0</version>
</dependency>
<!-- Embeddings -->
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-embeddings</artifactId>
<version>0.34.0</version>
</dependency>
<!-- PDF Document Parser (Apache PDFBox), used for PDF loading below -->
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-document-parser-apache-pdfbox</artifactId>
<version>0.34.0</version>
</dependency>
<!-- OpenAI -->
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-open-ai</artifactId>
<version>0.34.0</version>
</dependency>
<!-- In-Memory Vector Store (for development) -->
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-embedding-store-inmemory</artifactId>
<version>0.34.0</version>
</dependency>
</dependencies>
Basic RAG Example
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.TextDocumentParser;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;
import dev.langchain4j.store.embedding.EmbeddingSearchResult;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
public class SimpleRAGExample {
public static void main(String[] args) {
// 1. Load documents
Path documentPath = Path.of("documents/company-policies.txt");
Document document = FileSystemDocumentLoader.loadDocument(documentPath);
// 2. Split document into chunks
DocumentSplitter splitter = DocumentSplitters.recursive(
300, // chunk size in tokens
20 // overlap between chunks
);
List<TextSegment> segments = splitter.split(document);
// 3. Create embedding model
EmbeddingModel embeddingModel = OpenAiEmbeddingModel.builder()
.apiKey(System.getenv("OPENAI_API_KEY"))
.modelName("text-embedding-ada-002")
.build();
// 4. Create embeddings and store them
EmbeddingStore<TextSegment> embeddingStore =
new InMemoryEmbeddingStore<>();
for (TextSegment segment : segments) {
Embedding embedding = embeddingModel.embed(segment).content();
embeddingStore.add(embedding, segment);
}
// 5. User asks a question
String question = "What is the vacation policy?";
// 6. Embed the question
Embedding questionEmbedding = embeddingModel.embed(question).content();
// 7. Find relevant document segments
EmbeddingSearchRequest searchRequest = EmbeddingSearchRequest.builder()
.queryEmbedding(questionEmbedding)
.maxResults(3)
.minScore(0.7)
.build();
EmbeddingSearchResult<TextSegment> searchResult =
embeddingStore.search(searchRequest);
// 8. Build context from retrieved segments
String context = searchResult.matches().stream()
.map(match -> match.embedded().text())
.collect(Collectors.joining("\n\n"));
// 9. Create prompt with context
String prompt = "Based on the following context:\n\n" +
context +
"\n\nQuestion: " + question +
"\n\nPlease answer the question based only on the provided context.";
// 10. Generate answer using LLM
ChatLanguageModel chatModel = OpenAiChatModel.builder()
.apiKey(System.getenv("OPENAI_API_KEY"))
.modelName("gpt-3.5-turbo")
.build();
String answer = chatModel.generate(prompt);
System.out.println("Question: " + question);
System.out.println("\nAnswer: " + answer);
System.out.println("\nSources:");
searchResult.matches().forEach(match -> {
System.out.println("- Score: " + match.score());
System.out.println(" Text: " + match.embedded().text());
});
}
}
Document Processing
Loading Different Document Types
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.TextDocumentParser;
import dev.langchain4j.data.document.parser.apache.pdfbox.ApachePdfBoxDocumentParser;
public class DocumentLoaderExample {
public void loadTextFile() {
Document doc = FileSystemDocumentLoader.loadDocument(
Path.of("document.txt"),
new TextDocumentParser()
);
}
public void loadPdfFile() {
Document doc = FileSystemDocumentLoader.loadDocument(
Path.of("document.pdf"),
new ApachePdfBoxDocumentParser()
);
}
public void loadMultipleFiles() {
List<Document> docs = FileSystemDocumentLoader.loadDocuments(
Path.of("documents/"),
new TextDocumentParser()
);
}
}
Text Splitting Strategies
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import java.util.ArrayList;
import java.util.List;
public class TextSplittingExample {
// 1. Fixed Token Size Splitter
public void fixedTokenSplitter() {
DocumentSplitter splitter = DocumentSplitters.recursive(
500, // maximum chunk size in tokens
50 // overlap between chunks
);
List<TextSegment> segments = splitter.split(document);
}
// 2. Paragraph-based Splitter
public void paragraphSplitter() {
DocumentSplitter splitter = DocumentSplitters.recursive(
300,
0 // no overlap
);
}
// 3. Custom Splitter
public List<TextSegment> customSplit(Document document, int maxChunkSize) {
String text = document.text();
List<TextSegment> segments = new ArrayList<>();
// Split by sentences
String[] sentences = text.split("\\. ");
StringBuilder currentChunk = new StringBuilder();
for (String sentence : sentences) {
if (currentChunk.length() + sentence.length() > maxChunkSize) {
segments.add(TextSegment.from(currentChunk.toString()));
currentChunk = new StringBuilder();
}
currentChunk.append(sentence).append(". ");
}
if (currentChunk.length() > 0) {
segments.add(TextSegment.from(currentChunk.toString()));
}
return segments;
}
}
Why Chunking Matters
// ❌ BAD: Entire document is too large for context window
String entireDocument = loadEntireDocument(); // 50,000 tokens
String prompt = entireDocument + "\n\nQuestion: " + question;
// This exceeds most LLM context windows!
// ✅ GOOD: Split into chunks and retrieve only relevant parts
List<TextSegment> chunks = splitDocument(document, 300); // 300 tokens each
List<TextSegment> relevantChunks = findRelevant(chunks, question, 3); // 900 tokens
String context = buildContext(relevantChunks);
String prompt = context + "\n\nQuestion: " + question; // Fits in context window
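With a chunk size S and overlap O, each new chunk advances S − O tokens, so a document of N tokens yields roughly ceil((N − O) / (S − O)) chunks. A small sketch of that arithmetic (illustrative only; real splitters also respect paragraph and sentence boundaries, so actual counts vary):

```java
public class ChunkMath {

    // Approximate chunk count for N tokens, chunk size S, overlap O (requires O < S)
    public static int chunkCount(int totalTokens, int chunkSize, int overlap) {
        if (totalTokens <= chunkSize) {
            return 1;
        }
        int stride = chunkSize - overlap; // each chunk advances this many tokens
        return (int) Math.ceil((double) (totalTokens - overlap) / stride);
    }

    public static void main(String[] args) {
        // The 50,000-token document above, with 300-token chunks and 20-token overlap
        System.out.println(chunkCount(50_000, 300, 20)); // 179 chunks indexed, but only ~3 retrieved per query
    }
}
```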
Advanced RAG Patterns
RAG with Metadata Filtering
import dev.langchain4j.data.document.Metadata;
import static dev.langchain4j.store.embedding.filter.MetadataFilterBuilder.metadataKey;
public class MetadataFilteringExample {
public void addDocumentsWithMetadata() {
// Add metadata to documents
TextSegment segment1 = TextSegment.from(
"Employee benefits include health insurance...",
Metadata.from("category", "HR")
.put("department", "Human Resources")
.put("year", "2024")
);
TextSegment segment2 = TextSegment.from(
"Our tech stack includes Java, Spring Boot...",
Metadata.from("category", "Engineering")
.put("department", "IT")
.put("year", "2024")
);
Embedding emb1 = embeddingModel.embed(segment1).content();
Embedding emb2 = embeddingModel.embed(segment2).content();
embeddingStore.add(emb1, segment1);
embeddingStore.add(emb2, segment2);
}
public void searchWithMetadataFilter(String question) {
Embedding questionEmbedding = embeddingModel.embed(question).content();
// Search only in HR documents (filter built via MetadataFilterBuilder.metadataKey)
EmbeddingSearchRequest request = EmbeddingSearchRequest.builder()
.queryEmbedding(questionEmbedding)
.maxResults(3)
.minScore(0.7)
.filter(metadataKey("category").isEqualTo("HR"))
.build();
EmbeddingSearchResult<TextSegment> result = embeddingStore.search(request);
}
}
Hybrid Search (Keyword + Semantic)
public class HybridSearchExample {
public List<TextSegment> hybridSearch(String query, List<TextSegment> allSegments) {
// 1. Keyword search (BM25 or simple text matching)
List<TextSegment> keywordResults = keywordSearch(query, allSegments);
// 2. Semantic search (vector similarity)
Embedding queryEmbedding = embeddingModel.embed(query).content();
EmbeddingSearchResult<TextSegment> semanticResults =
embeddingStore.search(EmbeddingSearchRequest.builder()
.queryEmbedding(queryEmbedding)
.maxResults(10)
.build());
// 3. Combine and rerank results
return rerank(keywordResults, semanticResults);
}
private List<TextSegment> keywordSearch(String query, List<TextSegment> segments) {
String[] keywords = query.toLowerCase().split("\\s+");
return segments.stream()
.filter(segment -> {
String text = segment.text().toLowerCase();
return Arrays.stream(keywords).anyMatch(text::contains);
})
.collect(Collectors.toList());
}
private List<TextSegment> rerank(
List<TextSegment> keywordResults,
EmbeddingSearchResult<TextSegment> semanticResults) {
// Combine scores and return top results
Map<TextSegment, Double> combinedScores = new HashMap<>();
// Add keyword scores
for (int i = 0; i < keywordResults.size(); i++) {
combinedScores.merge(
keywordResults.get(i),
1.0 / (i + 1), // Position-based score
Double::sum
);
}
// Add semantic scores
for (EmbeddingMatch<TextSegment> match : semanticResults.matches()) {
combinedScores.merge(
match.embedded(),
match.score(),
Double::sum
);
}
// Sort by combined score
return combinedScores.entrySet().stream()
.sorted(Map.Entry.<TextSegment, Double>comparingByValue().reversed())
.limit(5)
.map(Map.Entry::getKey)
.collect(Collectors.toList());
}
}
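A common alternative to summing raw scores, as `rerank` does above, is Reciprocal Rank Fusion (RRF): it combines ranked lists using only positions, so it stays robust when keyword and semantic scores live on incomparable scales. A standalone sketch over plain strings:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReciprocalRankFusion {

    // RRF: each list contributes 1 / (k + rank) per item; k (often 60) damps top-rank dominance
    public static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int rank = 0; rank < ranking.size(); rank++) {
                scores.merge(ranking.get(rank), 1.0 / (k + rank + 1), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        fused.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return fused;
    }

    public static void main(String[] args) {
        List<String> keyword = List.of("doc1", "doc3", "doc2");
        List<String> semantic = List.of("doc2", "doc1", "doc4");
        // Documents ranked highly in both lists rise to the top
        System.out.println(fuse(List.of(keyword, semantic), 60));
    }
}
```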
Query Rewriting
public class QueryRewritingExample {
private final ChatLanguageModel llm;
public String improveQuery(String userQuery) {
String prompt = "Rewrite the following query to be more specific " +
"and suitable for searching documentation:\n\n" +
"Original query: " + userQuery + "\n\n" +
"Improved query:";
return llm.generate(prompt);
}
public List<String> generateMultipleQueries(String userQuery) {
String prompt = "Generate 3 different search queries based on this question:\n" +
userQuery + "\n\n" +
"Return only the queries, one per line.";
String response = llm.generate(prompt);
return Arrays.asList(response.split("\n"));
}
public String answerWithQueryRewriting(String userQuery) {
// 1. Rewrite query for better retrieval
List<String> queries = generateMultipleQueries(userQuery);
// 2. Search with all queries
Set<TextSegment> allResults = new HashSet<>();
for (String query : queries) {
Embedding embedding = embeddingModel.embed(query).content();
EmbeddingSearchResult<TextSegment> result = embeddingStore.search(
EmbeddingSearchRequest.builder()
.queryEmbedding(embedding)
.maxResults(3)
.build()
);
result.matches().forEach(match -> allResults.add(match.embedded()));
}
// 3. Generate answer with combined context
String context = allResults.stream()
.map(TextSegment::text)
.collect(Collectors.joining("\n\n"));
String prompt = "Context:\n" + context +
"\n\nQuestion: " + userQuery +
"\n\nAnswer:";
return llm.generate(prompt);
}
}
RAG with Conversation Memory
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.SystemMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.memory.ChatMemory;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
public class ConversationalRAGExample {
private final ChatLanguageModel llm;
private final EmbeddingModel embeddingModel;
private final EmbeddingStore<TextSegment> embeddingStore;
private final ChatMemory chatMemory;
public ConversationalRAGExample() {
this.llm = OpenAiChatModel.builder()
.apiKey(System.getenv("OPENAI_API_KEY"))
.build();
this.embeddingModel = OpenAiEmbeddingModel.builder()
.apiKey(System.getenv("OPENAI_API_KEY"))
.build();
this.embeddingStore = new InMemoryEmbeddingStore<>();
this.chatMemory = MessageWindowChatMemory.withMaxMessages(10);
}
public String chat(String userMessage) {
// 1. Retrieve relevant context
Embedding questionEmbedding = embeddingModel.embed(userMessage).content();
EmbeddingSearchResult<TextSegment> searchResult = embeddingStore.search(
EmbeddingSearchRequest.builder()
.queryEmbedding(questionEmbedding)
.maxResults(3)
.build()
);
String context = searchResult.matches().stream()
.map(match -> match.embedded().text())
.collect(Collectors.joining("\n\n"));
// 2. Build prompt with context and conversation history
SystemMessage systemMessage = SystemMessage.from(
"You are a helpful assistant. Answer questions based on the provided context. " +
"If the context doesn't contain relevant information, say so."
);
UserMessage currentMessage = UserMessage.from(
"Context:\n" + context +
"\n\nQuestion: " + userMessage
);
// 3. Add to memory
if (chatMemory.messages().isEmpty()) {
chatMemory.add(systemMessage);
}
chatMemory.add(currentMessage);
// 4. Generate response (generate(List<ChatMessage>) returns Response<AiMessage>)
String response = llm.generate(chatMemory.messages()).content().text();
// 5. Add assistant response to memory
chatMemory.add(AiMessage.from(response));
return response;
}
}
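`MessageWindowChatMemory` above keeps only the most recent N messages. Its eviction behavior can be sketched with a plain `Deque` (a simplified stand-in; the real implementation also handles system-message retention specially):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class MessageWindow {

    private final int maxMessages;
    private final Deque<String> messages = new ArrayDeque<>();

    public MessageWindow(int maxMessages) {
        this.maxMessages = maxMessages;
    }

    // Add a message; evict the oldest one when the window is full
    public void add(String message) {
        if (messages.size() == maxMessages) {
            messages.removeFirst();
        }
        messages.addLast(message);
    }

    public List<String> messages() {
        return new ArrayList<>(messages);
    }

    public static void main(String[] args) {
        MessageWindow memory = new MessageWindow(3);
        for (String m : List.of("m1", "m2", "m3", "m4")) {
            memory.add(m);
        }
        System.out.println(memory.messages()); // oldest message m1 has been evicted
    }
}
```

Bounding the window this way keeps the prompt within the model's context limit as the conversation grows.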
Production RAG System
Complete RAG Service
import org.springframework.stereotype.Service;
@Service
public class RAGService {
private final ChatLanguageModel chatModel;
private final EmbeddingModel embeddingModel;
private final EmbeddingStore<TextSegment> embeddingStore;
private final Map<String, ChatMemory> userSessions;
public RAGService(
ChatLanguageModel chatModel,
EmbeddingModel embeddingModel,
EmbeddingStore<TextSegment> embeddingStore) {
this.chatModel = chatModel;
this.embeddingModel = embeddingModel;
this.embeddingStore = embeddingStore;
this.userSessions = new ConcurrentHashMap<>();
}
/**
* Index documents for RAG
*/
public void indexDocuments(List<Document> documents) {
// Split documents
DocumentSplitter splitter = DocumentSplitters.recursive(300, 20);
for (Document document : documents) {
List<TextSegment> segments = splitter.split(document);
// Create embeddings and store
for (TextSegment segment : segments) {
Embedding embedding = embeddingModel.embed(segment).content();
embeddingStore.add(embedding, segment);
}
}
}
/**
* Answer question using RAG
*/
public RAGResponse ask(String userId, String question) {
// Get or create user session
ChatMemory memory = userSessions.computeIfAbsent(
userId,
k -> MessageWindowChatMemory.withMaxMessages(10)
);
// Retrieve relevant context
Embedding questionEmbedding = embeddingModel.embed(question).content();
EmbeddingSearchResult<TextSegment> searchResult = embeddingStore.search(
EmbeddingSearchRequest.builder()
.queryEmbedding(questionEmbedding)
.maxResults(5)
.minScore(0.7)
.build()
);
if (searchResult.matches().isEmpty()) {
return new RAGResponse(
"I couldn't find relevant information to answer your question.",
List.of()
);
}
// Build context
String context = searchResult.matches().stream()
.map(match -> match.embedded().text())
.collect(Collectors.joining("\n\n"));
// Create prompt
String prompt = String.format("""
Based on the following context, answer the user's question.
If the context doesn't contain enough information, say so.
Context:
%s
Question: %s
Answer:
""", context, question);
// Generate answer
UserMessage userMessage = UserMessage.from(prompt);
memory.add(userMessage);
String answer = chatModel.generate(memory.messages()).content().text();
memory.add(AiMessage.from(answer));
// Build response with sources
List<Source> sources = searchResult.matches().stream()
.map(match -> new Source(
match.embedded().text(),
match.score(),
match.embedded().metadata()
))
.collect(Collectors.toList());
return new RAGResponse(answer, sources);
}
/**
* Clear user session
*/
public void clearSession(String userId) {
userSessions.remove(userId);
}
// Response DTOs
public record RAGResponse(String answer, List<Source> sources) {}
public record Source(String text, double score, Metadata metadata) {}
}
REST Controller for RAG
@RestController
@RequestMapping("/api/rag")
public class RAGController {
private final RAGService ragService;
public RAGController(RAGService ragService) {
this.ragService = ragService;
}
@PostMapping("/ask")
public ResponseEntity<RAGResponse> ask(
@RequestHeader("User-Id") String userId,
@RequestBody QuestionRequest request) {
RAGResponse response = ragService.ask(userId, request.question());
return ResponseEntity.ok(response);
}
@PostMapping("/index")
public ResponseEntity<Void> indexDocuments(@RequestBody List<String> filePaths) {
List<Document> documents = filePaths.stream()
.map(path -> FileSystemDocumentLoader.loadDocument(Path.of(path)))
.collect(Collectors.toList());
ragService.indexDocuments(documents);
return ResponseEntity.ok().build();
}
@DeleteMapping("/session/{userId}")
public ResponseEntity<Void> clearSession(@PathVariable String userId) {
ragService.clearSession(userId);
return ResponseEntity.ok().build();
}
record QuestionRequest(String question) {}
}
Best Practices
1. Chunk Size Optimization
// Small chunks (100-200 tokens): Better precision, may miss context
// Medium chunks (300-500 tokens): Good balance
// Large chunks (500-1000 tokens): More context, less precise
DocumentSplitter splitter = DocumentSplitters.recursive(
300, // Recommended: 300-500 tokens
20 // 10-20% overlap
);
2. Number of Retrieved Chunks
// Too few: Miss relevant information
// Too many: Noise and context window issues
// Recommended: 3-5 chunks
EmbeddingSearchRequest.builder()
.queryEmbedding(queryEmbedding)
.maxResults(5) // Typically 3-5
.minScore(0.7) // Filter low-quality matches
.build();
3. Prompt Engineering for RAG
String goodPrompt = """
Based ONLY on the following context, answer the question.
If the context doesn't contain the answer, say "I don't have enough information."
Context:
%s
Question: %s
Answer:
""";
// ❌ Bad: Allows hallucination
String badPrompt = "Answer this question: " + question;
4. Error Handling
public RAGResponse robustAsk(String question) {
try {
// Retrieve context
EmbeddingSearchResult<TextSegment> result = retrieveContext(question);
if (result.matches().isEmpty()) {
return new RAGResponse(
"No relevant information found.",
List.of()
);
}
// Generate answer
String answer = generateAnswer(question, result);
return new RAGResponse(answer, extractSources(result));
} catch (Exception e) {
log.error("RAG error", e);
return new RAGResponse(
"Sorry, I encountered an error processing your question.",
List.of()
);
}
}
5. Caching
@Service
public class CachedRAGService {
private final LoadingCache<String, Embedding> embeddingCache;
public CachedRAGService() {
this.embeddingCache = Caffeine.newBuilder()
.maximumSize(10_000)
.expireAfterWrite(1, TimeUnit.HOURS)
.build(question -> embeddingModel.embed(question).content());
}
public RAGResponse ask(String question) {
// Use cached embedding if available
Embedding questionEmbedding = embeddingCache.get(question);
// Continue with RAG...
}
}
Common Pitfalls and Solutions
Problem 1: Irrelevant Results
Solution: Adjust similarity threshold and improve chunking
EmbeddingSearchRequest.builder()
.queryEmbedding(questionEmbedding)
.maxResults(5)
.minScore(0.75) // Increase threshold (0.7 → 0.75)
.build();
Problem 2: Context Too Large
Solution: Retrieve more chunks but use only top results
// Retrieve 10 candidates
EmbeddingSearchResult<TextSegment> candidates = embeddingStore.search(
EmbeddingSearchRequest.builder()
.queryEmbedding(questionEmbedding)
.maxResults(10)
.build()
);
// Use only top 3 for final context
String context = candidates.matches().stream()
.limit(3)
.map(match -> match.embedded().text())
.collect(Collectors.joining("\n\n"));
Problem 3: Slow Performance
Solution: Use batch embedding and async processing
public void indexDocumentsAsync(List<Document> documents) {
CompletableFuture.runAsync(() -> {
// Batch embed segments
List<TextSegment> allSegments = new ArrayList<>();
for (Document doc : documents) {
allSegments.addAll(splitter.split(doc));
}
// Embed in batches of 100
for (int i = 0; i < allSegments.size(); i += 100) {
List<TextSegment> batch = allSegments.subList(
i,
Math.min(i + 100, allSegments.size())
);
List<Embedding> embeddings = embeddingModel.embedAll(batch).content();
for (int j = 0; j < batch.size(); j++) {
embeddingStore.add(embeddings.get(j), batch.get(j));
}
}
});
}
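The batching logic above — slicing a list into fixed-size sublists — is easy to get off-by-one wrong, so here it is isolated as a standalone helper (plain Java, no embedding calls):

```java
import java.util.ArrayList;
import java.util.List;

public class Batching {

    // Split a list into consecutive batches of at most batchSize elements
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> items = List.of(1, 2, 3, 4, 5, 6, 7);
        // Last batch holds the remainder and may be smaller than batchSize
        System.out.println(partition(items, 3)); // [[1, 2, 3], [4, 5, 6], [7]]
    }
}
```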
Evaluation and Monitoring
Metrics to Track
public class RAGMetrics {
// 1. Retrieval Precision
public double calculateRetrievalPrecision(
List<TextSegment> retrieved,
List<TextSegment> relevant) {
long correctRetrievals = retrieved.stream()
.filter(relevant::contains)
.count();
return retrieved.isEmpty() ? 0.0 : (double) correctRetrievals / retrieved.size();
}
// 2. Answer Quality (requires human evaluation)
public void logAnswerQuality(String question, String answer, int rating) {
// Log for analysis
log.info("Question: {}, Rating: {}", question, rating);
}
// 3. Response Time
public void measureResponseTime() {
long start = System.currentTimeMillis();
String answer = ragService.ask(userId, question);
long duration = System.currentTimeMillis() - start;
metrics.recordResponseTime(duration);
}
}
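Retrieval precision, as computed above, is the fraction of retrieved chunks that are relevant; it pairs naturally with recall, the fraction of relevant chunks that were actually retrieved. A standalone sketch of both over plain strings:

```java
import java.util.List;
import java.util.Set;

public class RetrievalEval {

    // Precision: |retrieved ∩ relevant| / |retrieved|
    public static double precision(List<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) return 0.0;
        long hits = retrieved.stream().filter(relevant::contains).count();
        return (double) hits / retrieved.size();
    }

    // Recall: |retrieved ∩ relevant| / |relevant|
    public static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        long hits = relevant.stream().filter(retrieved::contains).count();
        return (double) hits / relevant.size();
    }

    public static void main(String[] args) {
        List<String> retrieved = List.of("c1", "c2", "c3", "c4");
        Set<String> relevant = Set.of("c1", "c2", "c5");
        System.out.println(precision(retrieved, relevant)); // 2 of 4 retrieved are relevant
        System.out.println(recall(retrieved, relevant));    // 2 of 3 relevant were retrieved
    }
}
```

Raising `maxResults` tends to trade precision for recall, which is why the candidates-then-truncate pattern from "Problem 2" above is useful.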
Next Steps
- Vector Databases: Learn about specialized databases for embeddings
- Fine-tuning Embeddings: Improve retrieval accuracy for your domain
- Multi-modal RAG: Handle images, tables, and other formats
- Agent Systems: Combine RAG with tool calling
- Production Deployment: Scale RAG systems for high traffic
Conclusion
RAG is a powerful technique to enhance LLMs with external knowledge:
- Reduces hallucinations by grounding responses in facts
- Enables up-to-date information without retraining
- Provides source attribution for verifiable answers
- Cost-effective compared to fine-tuning models
Key implementation steps:
- Load and chunk documents
- Create embeddings
- Store in vector database
- Retrieve relevant context
- Generate augmented responses
References
This tutorial is based on the excellent workshop "KI Anwendungen im Unternehmen" presented at BaselOne 2025. Special thanks to David Beisert (beisdog) for creating comprehensive and practical workshop materials that demonstrate how to implement production-ready RAG systems using Java, LangChain4j, and vector databases.
Original Workshop Materials: BaselOne AI Workshop on GitHub
The workshop provides detailed examples of RAG implementation patterns, including document processing, embedding strategies, and integration with PostgreSQL pgvector for enterprise applications.
Content Review
The content in this tutorial has been reviewed and curated by chevp, focusing on accuracy, clarity, and practical applicability for developers implementing RAG systems with Java and LangChain4j.