
Retrieval-Augmented Generation (RAG)

RAG combines document retrieval with LLM generation to answer questions based on retrieved context. For each query, the system retrieves the top-k most relevant document chunks from a vector database and prompts the model to generate an answer using only that context.
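The retrieve-then-prompt loop can be sketched in a few lines of Python. This is a toy illustration only: term overlap stands in for real embedding similarity, and none of the names below are part of ALUE's API.

```python
# Toy sketch of the RAG loop: score chunks against the query, keep the
# top-k, and build a context-only prompt. Real systems use embedding
# similarity against a vector database instead of term overlap.
def retrieve(query, chunks, top_k):
    query_terms = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(query_terms & set(c.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query, retrieved):
    context = "\n".join(retrieved)
    return f"Answer using only the provided context.\nQuestion: {query}\nContext: {context}"

chunks = [
    "FAA Order 8040.1C assigns responsibility for issuing Airworthiness Directives.",
    "The cafeteria menu changes weekly.",
]
top = retrieve("Who issues Airworthiness Directives?", chunks, top_k=1)
prompt = build_prompt("Who issues Airworthiness Directives?", top)
```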

ALUE provides a complete RAG pipeline: dataset preparation → vector database setup → retrieval → inference → evaluation.


Dataset Format

RAG expects a JSON file with few-shot examples and SQuAD-style questions:

{
  "examples": [
    {
      "query": "What is the purpose of FAA Order 8040.1C?",
      "context": "This order describes the Federal Aviation Administration's (FAA) authority and assigns responsibility for the development and issuance of Airworthiness Directives (AD) in accordance with applicable statutes and regulations.",
      "answer": "To describe the FAA's authority and assign responsibility for developing and issuing Airworthiness Directives in accordance with applicable statutes and regulations."
    }
  ],
  "data": [
    {
      "title": "FAA Order 8040.1C - Airworthiness Directives",
      "paragraphs": [
        {
          "qas": [
            {
              "id": "0",
              "question": "What is the effective date of FAA Order 8040.1C on Airworthiness Directives?",
              "answers": [
                {
                  "text": "10/03/07",
                  "document_id": ["faa_order_8040_1c"]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

Key fields:

- examples — few-shot examples for prompt construction (optional but recommended)
- data[].paragraphs[].qas[] — evaluation questions
- answers[].text — ground-truth answer text
- answers[].document_id — optional in general, but required for Recall@k: chunk IDs in your vector database that contain the answer
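As a sanity check, the nested format can be walked with plain Python to collect questions, ground-truth answers, and chunk IDs. The field names follow the JSON schema above; the dataset dict is inlined here for illustration.

```python
# Walk the SQuAD-style structure and collect (question, answer, chunk IDs)
# triples. The dataset is inlined for illustration; normally you would
# json.load() it from the input file.
dataset = {
    "examples": [],
    "data": [{
        "title": "FAA Order 8040.1C - Airworthiness Directives",
        "paragraphs": [{
            "qas": [{
                "id": "0",
                "question": "What is the effective date of FAA Order 8040.1C?",
                "answers": [{"text": "10/03/07", "document_id": ["faa_order_8040_1c"]}],
            }]
        }],
    }],
}

items = []
for article in dataset["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            answer = qa["answers"][0]
            # document_id is optional, so default to an empty list
            items.append((qa["question"], answer["text"], answer.get("document_id", [])))
```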

Important: If you want to compute Recall@k, the document_id values must exactly match the chunk IDs stored in your ChromaDB collection.


Quick Start

Prerequisites

Ensure you have configured your environment (see Getting Started):

- Inference backend (e.g., OpenAI, vLLM, Ollama)
- LLM Judge for evaluation (recommended: a model different from the inference model)
- Embedding provider for the vector database

Minimal Working Example

Assuming you have an existing ChromaDB database:

# Run inference + evaluation
python -m scripts.rag both \
  -i data/ASRS_rag/rag_qa.json \
  -o runs/rag \
  -m gpt-4o-mini \
  --database-path ./chroma_db \
  --collection-name documents \
  --top-k 5 \
  --num-examples 3 \
  --llm_judge_model_name gpt-4o-mini \
  --evaluate_retrieval \
  --evaluate_generation

This will:

1. Retrieve the top-5 chunks for each question
2. Generate answers using the inference model
3. Evaluate retrieval quality (Context Relevancy)
4. Evaluate generation quality (Composite Correctness)

Results are saved to runs/rag_<timestamp>/.


Running RAG

The RAG script supports three modes: inference, evaluation, or both.

Inference Only

Generate answers without evaluation:

python -m scripts.rag inference \
  -i data/ASRS_rag/rag_qa.json \
  -o runs/rag \
  -m gpt-4o-mini \
  --database-path ./chroma_db \
  --collection-name documents \
  --top-k 5 \
  --num-examples 3 \
  --task_type rag \
  --temperature 0.1 \
  --max_tokens 150

Output:

- predictions.json — contains answer, ground_truth_answer, predicted_doc_ids, and question for each item
- results.json — run parameters and summary

Evaluation Only

Evaluate existing predictions:

python -m scripts.rag evaluation \
  -i data/ASRS_rag/rag_qa.json \
  -o runs/rag_eval \
  --predictions_file runs/rag_<timestamp>/predictions.json \
  --llm_judge_model_name gpt-4o-mini \
  --database-path ./chroma_db \
  --collection-name documents \
  --top-k 5 \
  --evaluate_retrieval \
  --evaluate_generation \
  --use_recall_k

Outputs:

- rag_evaluation_summary.json — overall metrics
- context_relevancy.json — per-chunk relevancy scores (if --evaluate_retrieval)
- doc_retrieval.json — Recall@k scores (if --use_recall_k and ground-truth chunk IDs exist)
- composite_correctness.json — claim-level scoring (if --evaluate_generation)

Note: --use_recall_k requires document_id annotations in your dataset.

Inference + Evaluation

Run both steps in sequence:

python -m scripts.rag both \
  -i data/ASRS_rag/rag_qa.json \
  -o runs/rag \
  -m gpt-4o-mini \
  --database-path ./chroma_db \
  --collection-name documents \
  --top-k 5 \
  --num-examples 3 \
  --llm_judge_model_name gpt-4o-mini \
  --evaluate_retrieval \
  --evaluate_generation

Templates and Variables

RAG templates are located in templates/rag/:

system.jinja2

You are an aviation safety analyst. Answer questions using only the provided context.

{% if examples %}
Here are some examples:
{% for example in examples %}
    Question: {{ example.query }}
    Context: {{ example.context }}
    Answer: {{ example.answer }}
{% endfor %}
{% endif %}

user.jinja2

Question: {{ query }}
Context: {{ context }}

Expected Template Variables

Templates must reference:

| Variable | Description |
| --- | --- |
| examples | Few-shot examples (list of {query, context, answer}) |
| query | The user's question |
| context | Retrieved document chunks (concatenated) |
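For reference, the user template renders as follows. This sketch assumes the jinja2 package is installed; the template string mirrors user.jinja2 above, and the values are illustrative.

```python
# Render the user template with jinja2 (template string copied from
# user.jinja2; query and context values are illustrative).
from jinja2 import Template

user_template = Template("Question: {{ query }}\nContext: {{ context }}")
prompt = user_template.render(
    query="What is the effective date of FAA Order 8040.1C?",
    context="Effective date: 10/03/07.",
)
```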

Note: RAG typically does not require structured generation schemas. Leave --schema_class unset unless you define a custom Pydantic schema.


Building a Vector Database

If you don't have an existing ChromaDB, you can build one from PDFs using alue.rag_utils.

From PDF Documents

python -m alue.rag_utils \
  --document-directory ./docs_pdfs \
  --database-path ./chroma_db \
  --collection-name documents \
  --output-path ./artifacts \
  --partition-strategy hi_res \
  --chunk-hard-max 1200 \
  --chunk-soft-max 700 \
  --overlap-size 50

Parameters:

- --document-directory — folder containing PDF files
- --database-path — output path for ChromaDB
- --collection-name — name for the collection
- --partition-strategy — hi_res (higher quality, uses a CV model) or fast (faster, no CV)
- --chunk-hard-max — maximum chunk size in characters
- --chunk-soft-max — target chunk size (the chunker tries to split at sentence boundaries)
- --overlap-size — number of characters to overlap between chunks
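The interplay of the soft max, hard max, and overlap can be illustrated with a simplified chunker. This is a sketch only; the real implementation in alue.rag_utils handles PDFs and document structure.

```python
# Simplified chunking sketch: pack sentences up to chunk-soft-max
# characters, hard-cap each emitted chunk at chunk-hard-max, and carry
# overlap-size trailing characters into the next chunk.
# Not the actual alue.rag_utils logic.
def chunk_text(text, soft_max=700, hard_max=1200, overlap=50):
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > soft_max:
            chunks.append(current[:hard_max])  # hard cap on emitted chunk
            current = current[-overlap:]       # overlap with previous chunk
        current += (" " if current else "") + sentence
    if current:
        chunks.append(current[:hard_max])
    return chunks

chunks = chunk_text(
    "Alpha sentence one. Beta sentence two. Gamma sentence three.",
    soft_max=25, hard_max=1200, overlap=5,
)
```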

Chunking Strategies

hi_res (recommended for quality):

- Uses a computer vision model for better document structure detection
- Downloads a detector model on first use (~300 MB)
- Better handling of tables, figures, and complex layouts

fast (recommended for speed):

- No CV model required
- Faster processing
- Good for simple text documents

Embedding Configuration

The embedding provider is controlled by your .env file:

EMBEDDING_ENDPOINT_TYPE=local  # or openai, ollama, hf, openai-compatible

See Configuration Reference for embedding setup details.

Chunk IDs for Recall@k

When building a database with rag_utils, chunks are assigned stable IDs based on the source document and chunk position. To use Recall@k evaluation:

  1. Note the chunk IDs during database creation (they're logged)
  2. Add these IDs to your dataset's answers[].document_id fields
  3. Use --use_recall_k flag during evaluation
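Before enabling --use_recall_k, it may be worth checking that every annotated ID actually exists in the collection, since IDs must match exactly. A minimal sketch (all IDs below are made up for illustration):

```python
# Sketch: flag dataset document_id values that have no matching chunk ID
# in the collection. In practice, collection_ids would come from your
# ChromaDB collection; these IDs are illustrative.
collection_ids = {"faa_order_8040_1c_chunk_0", "faa_order_8040_1c_chunk_1"}
dataset_doc_ids = ["faa_order_8040_1c_chunk_0", "faa_order_8040_1c_chunk_9"]

missing = [doc_id for doc_id in dataset_doc_ids if doc_id not in collection_ids]
```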

Evaluation Metrics

Recall@k (Retrieval Supervision)

What it measures: Fraction of ground-truth chunk IDs retrieved among top-k results.

Requirements:

- Dataset must include document_id annotations
- Chunk IDs must exactly match those in the ChromaDB collection
- Enable with the --use_recall_k flag

Formula: Recall@k = (# ground-truth chunks retrieved) / (# total ground-truth chunks)

Reported as overall average across all questions.
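The formula above, written as a small function (a sketch, not ALUE's implementation):

```python
# Recall@k for one question: fraction of ground-truth chunk IDs found
# among the top-k retrieved IDs.
def recall_at_k(retrieved_ids, ground_truth_ids, k):
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for doc_id in ground_truth_ids if doc_id in top_k)
    return hits / len(ground_truth_ids)

# 1 of 2 ground-truth chunks appears in the top-3 retrieved
score = recall_at_k(["c1", "c3", "c7"], ["c1", "c2"], k=3)
```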

Context Relevancy (LLM-as-Judge)

What it measures: Whether each retrieved chunk is relevant to answering the question.

How it works:

- LLM judge assigns {0, 1} to each retrieved chunk
- Per-question score = average across retrieved chunks
- Final score = average across all questions

Requirements:

- Requires --database-path and --collection-name (to fetch chunk content)
- Enable with the --evaluate_retrieval flag
- Does not require ground-truth chunk IDs
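The two-level averaging can be written out as follows. This is a sketch of the aggregation only; the {0, 1} labels are produced by the LLM judge in practice.

```python
# Context Relevancy aggregation: average {0, 1} judge labels per question,
# then average the per-question scores across all questions.
def context_relevancy(labels_per_question):
    per_question = [sum(labels) / len(labels) for labels in labels_per_question]
    return sum(per_question) / len(per_question)

# Two questions: 3 of 5 chunks relevant (0.6), 2 of 4 relevant (0.5)
score = context_relevancy([[1, 1, 1, 0, 0], [1, 0, 1, 0]])
```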

Composite Correctness (LLM-as-Judge)

What it measures: Whether the generated answer is factually correct and grounded in retrieved context.

How it works:

1. The generated answer is decomposed into atomic claims
2. Each claim is checked for:
   - Containment in the reference answer
   - Contradiction with the reference
   - Support from relevant retrieved context
3. If at least one main claim correctly answers the question, all claim scores are averaged; otherwise the response scores 0

Requirements:

- Enable with the --evaluate_generation flag
- Automatically uses retrieved context from predicted_doc_ids

Reported metrics:

- average_composite_correctness — overall score across questions
- Per-claim breakdowns in composite_correctness.json
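The all-or-nothing gate combined with claim averaging looks like this (a sketch of the rule described above, not the actual scorer):

```python
# Composite Correctness scoring rule: if no main claim correctly answers
# the question, the response scores 0; otherwise the per-claim scores
# are averaged.
def composite_correctness(claim_scores, any_main_claim_correct):
    if not any_main_claim_correct:
        return 0.0
    return sum(claim_scores) / len(claim_scores)

# Two of three claims hold, and a main claim answers the question
score = composite_correctness([1.0, 1.0, 0.0], any_main_claim_correct=True)
```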


Configuration Notes

Required Settings

For inference:

ALUE_ENDPOINT_TYPE=openai
ALUE_OPENAI_API_KEY=sk-...
EMBEDDING_ENDPOINT_TYPE=local

For evaluation:

ALUE_LLM_JUDGE_ENDPOINT_TYPE=openai
ALUE_LLM_JUDGE_OPENAI_API_KEY=sk-...

Recommendation: Use a different model for the LLM judge than for inference to reduce evaluation bias. Mixing backends is supported.

When to Use Each Evaluation Metric

| Metric | Use When | Requires |
| --- | --- | --- |
| Recall@k | You have ground-truth chunk annotations | document_id in dataset |
| Context Relevancy | You want to evaluate retrieval quality without annotations | ChromaDB access |
| Composite Correctness | You want to evaluate generation quality | LLM judge |

You can enable any combination with --evaluate_retrieval, --evaluate_generation, and --use_recall_k flags.


Using an Existing ChromaDB

If you already have a ChromaDB collection from another source:

  1. Ensure your embedding configuration matches the embeddings in the database
  2. Provide the database path and collection name to the RAG script
  3. For Recall@k, manually add chunk IDs to your dataset's document_id fields

Example with external database:

python -m scripts.rag inference \
  -i data/ASRS_rag/rag_qa.json \
  -o runs/rag \
  -m gpt-4o-mini \
  --database-path /path/to/external/chroma_db \
  --collection-name my_collection \
  --top-k 5

See Also