Retrieval-Augmented Generation (RAG)¶
RAG combines document retrieval with LLM generation to answer questions based on retrieved context. For each query, the system retrieves the top-k most relevant document chunks from a vector database and prompts the model to generate an answer using only that context.
ALUE provides a complete RAG pipeline: dataset preparation → vector database setup → retrieval → inference → evaluation.
Dataset Format¶
RAG expects a JSON file with few-shot examples and SQuAD-style questions:
{
"examples": [
{
"query": "What is the purpose of FAA Order 8040.1C?",
"context": "This order describes the Federal Aviation Administration's (FAA) authority and assigns responsibility for the development and issuance of Airworthiness Directives (AD) in accordance with applicable statutes and regulations.",
"answer": "To describe the FAA's authority and assign responsibility for developing and issuing Airworthiness Directives in accordance with applicable statutes and regulations."
}
],
"data": [
{
"title": "FAA Order 8040.1C - Airworthiness Directives",
"paragraphs": [
{
"qas": [
{
"id": "0",
"question": "What is the effective date of FAA Order 8040.1C on Airworthiness Directives?",
"answers": [
{
"text": "10/03/07",
"document_id": ["faa_order_8040_1c"]
}
]
}
]
}
]
}
]
}
Key fields:
- examples — Few-shot examples for prompt construction (optional but recommended)
- data[].paragraphs[].qas[] — Evaluation questions
- answers[].text — Ground-truth answer text
- answers[].document_id — Chunk IDs in your vector database that contain the answer (optional in general, but required for Recall@k)
Important: If you want to compute Recall@k, the document_id values must exactly match the chunk IDs stored in your ChromaDB collection.
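The nesting above is easy to get wrong, so here is a minimal standard-library sketch that loads a file in this format and flattens it for iteration (the function names are illustrative, not part of ALUE's API):

```python
import json

def load_rag_dataset(path):
    """Load a RAG dataset file and flatten the SQuAD-style questions."""
    with open(path) as f:
        dataset = json.load(f)

    questions = []
    for doc in dataset.get("data", []):
        for paragraph in doc.get("paragraphs", []):
            for qa in paragraph.get("qas", []):
                questions.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "answers": [a["text"] for a in qa["answers"]],
                    # document_id is optional; only needed for Recall@k
                    "doc_ids": [
                        d for a in qa["answers"]
                        for d in a.get("document_id", [])
                    ],
                })
    return dataset.get("examples", []), questions
```

A quick pass with a loader like this catches missing fields before a long inference run does.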
Quick Start¶
Prerequisites¶
Ensure you have configured your environment (see Getting Started):
- Inference backend (e.g., OpenAI, vLLM, Ollama)
- LLM judge for evaluation (recommended: a different model from the inference model)
- Embedding provider for the vector database
Minimal Working Example¶
Assuming you have an existing ChromaDB database:
# Run inference + evaluation
python -m scripts.rag both \
-i data/ASRS_rag/rag_qa.json \
-o runs/rag \
-m gpt-4o-mini \
--database-path ./chroma_db \
--collection-name documents \
--top-k 5 \
--num-examples 3 \
--llm_judge_model_name gpt-4o-mini \
--evaluate_retrieval \
--evaluate_generation
This will:
1. Retrieve the top-5 chunks for each question
2. Generate answers using the inference model
3. Evaluate retrieval quality (Context Relevancy)
4. Evaluate generation quality (Composite Correctness)
Results are saved to runs/rag_<timestamp>/.
Running RAG¶
The RAG script supports three modes: inference, evaluation, or both.
Inference Only¶
Generate answers without evaluation:
python -m scripts.rag inference \
-i data/ASRS_rag/rag_qa.json \
-o runs/rag \
-m gpt-4o-mini \
--database-path ./chroma_db \
--collection-name documents \
--top-k 5 \
--num-examples 3 \
--task_type rag \
--temperature 0.1 \
--max_tokens 150
Output:
- predictions.json — Contains answer, ground_truth_answer, predicted_doc_ids, question for each item
- results.json — Run parameters and summary
Evaluation Only¶
Evaluate existing predictions:
python -m scripts.rag evaluation \
-i data/ASRS_rag/rag_qa.json \
-o runs/rag_eval \
--predictions_file runs/rag_<timestamp>/predictions.json \
--llm_judge_model_name gpt-4o-mini \
--database-path ./chroma_db \
--collection-name documents \
--top-k 5 \
--evaluate_retrieval \
--evaluate_generation \
--use_recall_k
Outputs:
- rag_evaluation_summary.json — Overall metrics
- context_relevancy.json — Per-chunk relevancy scores (if --evaluate_retrieval)
- doc_retrieval.json — Recall@k scores (if --use_recall_k and ground-truth chunk IDs exist)
- composite_correctness.json — Claim-level scoring (if --evaluate_generation)
Note: --use_recall_k requires document_id annotations in your dataset.
Inference + Evaluation¶
Run both steps in sequence:
python -m scripts.rag both \
-i data/ASRS_rag/rag_qa.json \
-o runs/rag \
-m gpt-4o-mini \
--database-path ./chroma_db \
--collection-name documents \
--top-k 5 \
--num-examples 3 \
--llm_judge_model_name gpt-4o-mini \
--evaluate_retrieval \
--evaluate_generation
Templates and Variables¶
RAG templates are located in templates/rag/:
system.jinja2
You are an aviation safety analyst. Answer questions using only the provided context.
{% if examples %}
Here are some examples:
{% for example in examples %}
Question: {{ example.query }}
Context: {{ example.context }}
Answer: {{ example.answer }}
{% endfor %}
{% endif %}
user.jinja2
Question: {{ query }}
Context: {{ context }}
Expected Template Variables¶
Templates must reference:
| Variable | Description |
|---|---|
| examples | Few-shot examples (list of {query, context, answer}) |
| query | The user's question |
| context | Retrieved document chunks (concatenated) |
Note: RAG typically does not require structured generation schemas. Leave --schema_class unset unless you define a custom Pydantic schema.
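To see how the variables line up, here is a sketch that renders inline copies of the two templates with the jinja2 library. ALUE itself loads the files from templates/rag/ at runtime; the inline strings below are only for illustration:

```python
from jinja2 import Template

# Inline copies of system.jinja2 and user.jinja2, for illustration only.
system_src = (
    "You are an aviation safety analyst. "
    "Answer questions using only the provided context.\n"
    "{% if examples %}Here are some examples:\n"
    "{% for example in examples %}"
    "Question: {{ example.query }}\n"
    "Context: {{ example.context }}\n"
    "Answer: {{ example.answer }}\n"
    "{% endfor %}{% endif %}"
)
user_src = "Question: {{ query }}\nContext: {{ context }}"

examples = [{"query": "What is an AD?",
             "context": "ADs are legally enforceable rules issued by the FAA.",
             "answer": "An Airworthiness Directive."}]

# examples may be an empty list, in which case the few-shot block is skipped
system_prompt = Template(system_src).render(examples=examples)
user_prompt = Template(user_src).render(
    query="What is the effective date?",
    context="Effective date: 10/03/07.",
)
```

Jinja2 resolves example.query on plain dicts, so the few-shot examples from the dataset file can be passed through unchanged.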
Building a Vector Database¶
If you don't have an existing ChromaDB, you can build one from PDFs using alue.rag_utils.
From PDF Documents¶
python -m alue.rag_utils \
--document-directory ./docs_pdfs \
--database-path ./chroma_db \
--collection-name documents \
--output-path ./artifacts \
--partition-strategy hi_res \
--chunk-hard-max 1200 \
--chunk-soft-max 700 \
--overlap-size 50
Parameters:
- --document-directory — Folder containing PDF files
- --database-path — Output path for ChromaDB
- --collection-name — Name for the collection
- --partition-strategy — hi_res (higher quality, uses CV model) or fast (faster, no CV)
- --chunk-hard-max — Maximum chunk size in characters
- --chunk-soft-max — Target chunk size (will try to split at sentence boundaries)
- --overlap-size — Number of characters to overlap between chunks
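The interplay of the soft limit, hard limit, and overlap can be sketched as a simplified greedy chunker. This is an illustrative stand-in under the stated semantics, not the actual alue.rag_utils implementation:

```python
import re

def chunk_text(text, soft_max=700, hard_max=1200, overlap=50):
    """Greedy character-based chunker: prefers sentence boundaries near
    soft_max, never emits a chunk longer than hard_max, and carries
    `overlap` trailing characters into the next chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        candidate = f"{current} {sent}".strip() if current else sent
        if len(candidate) <= soft_max:
            current = candidate
            continue
        if current:  # close the current chunk at a sentence boundary
            chunks.append(current)
            current = current[-overlap:] + " " + sent
        else:
            current = sent
        while len(current) > hard_max:  # oversized sentence: hard split
            chunks.append(current[:hard_max])
            current = current[hard_max - overlap:]
    if current:
        chunks.append(current)
    return chunks
```

The soft limit is where splitting is attempted at sentence boundaries; the hard limit only kicks in for pathological inputs such as a single sentence longer than hard_max.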
Chunking Strategies¶
hi_res (recommended for quality):
- Uses a computer vision model for better document structure detection
- Downloads a detector model on first use (~300MB)
- Better handling of tables, figures, and complex layouts
fast (recommended for speed):
- No CV model required
- Faster processing
- Good for simple text documents
Embedding Configuration¶
The embedding provider is controlled by your .env file:
EMBEDDING_ENDPOINT_TYPE=local # or openai, ollama, hf, openai-compatible
See Configuration Reference for embedding setup details.
Chunk IDs for Recall@k¶
When building a database with rag_utils, chunks are assigned stable IDs based on the source document and chunk position. To use Recall@k evaluation:
1. Note the chunk IDs during database creation (they are logged)
2. Add these IDs to your dataset's answers[].document_id fields
3. Use the --use_recall_k flag during evaluation
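As an illustration of the annotation step, the sketch below attaches chunk IDs to a question's answers. The make_chunk_id scheme shown here is hypothetical; copy the actual IDs from the rag_utils logs rather than reconstructing them:

```python
def make_chunk_id(doc_name, index):
    """Hypothetical ID scheme ("<document stem>_<chunk index>"). Check the
    IDs logged by rag_utils during database creation for the exact format
    used in your build."""
    stem = doc_name.rsplit(".", 1)[0].lower().replace(" ", "_")
    return f"{stem}_{index}"

def annotate_answer(qa, doc_name, chunk_indices):
    """Attach ground-truth chunk IDs to every answer of a question."""
    ids = [make_chunk_id(doc_name, i) for i in chunk_indices]
    for answer in qa["answers"]:
        answer["document_id"] = ids
    return qa
```

Whatever scheme your build used, the values written here must match the ChromaDB chunk IDs character for character or Recall@k will report zero.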
Evaluation Metrics¶
Recall@k (Retrieval Supervision)¶
What it measures: Fraction of ground-truth chunk IDs retrieved among top-k results.
Requirements:
- Dataset must include document_id annotations
- Chunk IDs must match exactly with ChromaDB collection
- Enable with --use_recall_k flag
Formula: Recall@k = (# ground-truth chunks retrieved) / (# total ground-truth chunks)
Reported as overall average across all questions.
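The formula and the overall average are straightforward to compute; a minimal sketch:

```python
def recall_at_k(retrieved_ids, ground_truth_ids):
    """Recall@k = (# ground-truth chunks retrieved) / (# ground-truth chunks)."""
    truth = set(ground_truth_ids)
    if not truth:
        return None  # no supervision for this question
    hits = truth & set(retrieved_ids)
    return len(hits) / len(truth)

def average_recall(per_question_scores):
    """Overall score: mean over questions that have ground-truth IDs."""
    scores = [s for s in per_question_scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0
```

Questions without document_id annotations contribute nothing to the average rather than dragging it down.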
Context Relevancy (LLM-as-Judge)¶
What it measures: Whether each retrieved chunk is relevant to answering the question.
How it works:
- The LLM judge assigns {0, 1} to each retrieved chunk
- Per-question score = average across retrieved chunks
- Final score = average across all questions
Requirements:
- Requires --database-path and --collection-name (to fetch chunk content)
- Enable with --evaluate_retrieval flag
- Does not require ground-truth chunk IDs
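The two-level averaging of the judge's labels can be sketched as:

```python
def context_relevancy(per_chunk_judgments):
    """per_chunk_judgments: one list of {0, 1} judge labels per question.
    Per-question score is the mean over its retrieved chunks; the final
    score is the mean over questions."""
    per_question = [
        sum(labels) / len(labels)
        for labels in per_chunk_judgments
        if labels
    ]
    return sum(per_question) / len(per_question) if per_question else 0.0
```

Averaging per question first means a question with many retrieved chunks does not outweigh one with few.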
Composite Correctness (LLM-as-Judge)¶
What it measures: Whether the generated answer is factually correct and grounded in retrieved context.
How it works:
1. The generated answer is decomposed into atomic claims
2. Each claim is checked for:
   - Containment in the reference answer
   - Contradiction with the reference
   - Support from relevant retrieved context
3. If at least one main claim correctly answers the question, all claim scores are averaged; otherwise the response scores 0
Requirements:
- Enable with --evaluate_generation flag
- Automatically uses retrieved context from predicted_doc_ids
Reported metrics:
- average_composite_correctness — Overall score across questions
- Per-claim breakdowns in composite_correctness.json
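The gate-then-average logic described above can be sketched as follows (a simplified illustration of the scoring rule, not ALUE's implementation):

```python
def composite_correctness(claim_scores, any_main_claim_correct):
    """claim_scores: per-claim scores in [0, 1] from the judge.
    If no main claim correctly answers the question, the whole
    response scores 0; otherwise the claim scores are averaged."""
    if not any_main_claim_correct or not claim_scores:
        return 0.0
    return sum(claim_scores) / len(claim_scores)
```

The gate makes the metric strict: a fluent answer full of supported side claims still scores 0 if none of its main claims actually answers the question.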
Configuration Notes¶
Required Settings¶
For inference:
ALUE_ENDPOINT_TYPE=openai
ALUE_OPENAI_API_KEY=sk-...
EMBEDDING_ENDPOINT_TYPE=local
For evaluation:
ALUE_LLM_JUDGE_ENDPOINT_TYPE=openai
ALUE_LLM_JUDGE_OPENAI_API_KEY=sk-...
Recommendation: Use a different model for the LLM judge than for inference to reduce evaluation bias. Mixing backends is supported.
When to Use Each Evaluation Metric¶
| Metric | Use When | Requires |
|---|---|---|
| Recall@k | You have ground-truth chunk annotations | document_id in dataset |
| Context Relevancy | You want to evaluate retrieval quality without annotations | ChromaDB access |
| Composite Correctness | You want to evaluate generation quality | LLM judge |
You can enable any combination with --evaluate_retrieval, --evaluate_generation, and --use_recall_k flags.
Using an Existing ChromaDB¶
If you already have a ChromaDB collection from another source:
- Ensure your embedding configuration matches the embeddings in the database
- Provide the database path and collection name to the RAG script
- For Recall@k, manually add chunk IDs to your dataset's document_id fields
Example with external database:
python -m scripts.rag inference \
-i data/ASRS_rag/rag_qa.json \
-o runs/rag \
-m gpt-4o-mini \
--database-path /path/to/external/chroma_db \
--collection-name my_collection \
--top-k 5
See Also¶
- Configuration Reference — Embedding and LLM judge setup
- Creating Datasets — RAG dataset format specification
- Models & Backends — Supported inference engines