Extractive Question Answering (Extractive QA)¶
Extractive QA evaluates a model’s ability to identify text spans from a transcript or document that directly answer a given question.
This task is especially relevant for NTSB transcripts or structured regulatory text where answers must be extracted verbatim.
Dataset Format¶
Datasets follow a SQuAD-style format with `examples` (few-shot examples for prompt construction) and `paragraphs` (test/eval instances):

```json
{
  "version": "squad_v1",
  "data": [
    {
      "title": "ntsb_extractive_qa",
      "task": "Extract the exact span ...",
      "examples": [
        {
          "question": "Is a tail number mentioned ...?",
          "transcript": "On March 18, 2008 ... a Robinson R22 beta, M3056T, ...",
          "answer": "[M3056T]"
        }
      ],
      "paragraphs": [
        {
          "context": "The pilot reported while making an approach ...",
          "qas": [
            {
              "id": "e80026dc-c640-484f-94c6-4f7597cfce1f",
              "question": "Is a tail number mentioned ...?",
              "answers": [{ "text": "['NONE']" }]
            }
          ]
        }
      ]
    }
  ]
}
```
* `examples` → Few-shot examples for prompt construction
* `paragraphs` → Contexts and QA pairs for inference/evaluation
* `answers.text` → Gold-standard text span(s)
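To make the layout concrete, the structure can be walked like this (a plain-Python sketch with the sample above inlined; ALUE's own loaders may differ):

```python
import json

# Minimal inline sample in the SQuAD-style format shown above.
raw = """{
  "version": "squad_v1",
  "data": [{
    "title": "ntsb_extractive_qa",
    "task": "Extract the exact span ...",
    "examples": [{"question": "Is a tail number mentioned ...?",
                  "transcript": "... a Robinson R22 beta, M3056T, ...",
                  "answer": "[M3056T]"}],
    "paragraphs": [{
      "context": "The pilot reported while making an approach ...",
      "qas": [{"id": "e80026dc-c640-484f-94c6-4f7597cfce1f",
               "question": "Is a tail number mentioned ...?",
               "answers": [{"text": "['NONE']"}]}]
    }]
  }]
}"""

dataset = json.loads(raw)
for task in dataset["data"]:
    few_shot = task["examples"]                    # feeds prompt construction
    for paragraph in task["paragraphs"]:
        context = paragraph["context"]             # passage the span comes from
        for qa in paragraph["qas"]:
            gold = [a["text"] for a in qa["answers"]]  # gold spans for eval
            print(qa["id"], "->", gold)
```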
Prompt Templates¶
Example prompt for **tail number extraction** (`templates/extract_tail_number/`):
**system.jinja2**
```jinja2
You are a helpful AI assistant that identifies tail numbers from NTSB transcripts, if a tail number exists.
{% if examples %}
{% for example in examples %}
Question: {{ example.question }}
Context: {{ example.transcript }}
Tail Number: {{ example.answer }}
{% endfor %}
{% endif %}
Do not provide sentence-long spans or explanations in your answers. Just provide the tail numbers formatted as a list.
```

**user.jinja2**

```jinja2
Question: {{query}}
Context: {{context}}
```
Template Variables¶
For Extractive QA, templates must reference:
| Variable | Description |
|---|---|
| `query` | The input question |
| `context` | The transcript or document passage |
| `examples` | Few-shot QA pairs for in-context learning |
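As a sketch of how these variables flow through rendering (calling the `jinja2` package directly with inline copies of the templates above; ALUE's actual prompt-construction code may differ):

```python
from jinja2 import Template

# Inline, abbreviated copies of system.jinja2 and user.jinja2.
system_t = Template(
    "You are a helpful AI assistant that identifies tail numbers from NTSB transcripts.\n"
    "{% for example in examples %}"
    "Question: {{ example.question }}\n"
    "Context: {{ example.transcript }}\n"
    "Tail Number: {{ example.answer }}\n"
    "{% endfor %}"
)
user_t = Template("Question: {{query}}\nContext: {{context}}")

# `examples` is the few-shot list from the dataset; `query`/`context` come
# from the QA pair under evaluation.
examples = [{"question": "Is a tail number mentioned ...?",
             "transcript": "... a Robinson R22 beta, M3056T, ...",
             "answer": "[M3056T]"}]
system_msg = system_t.render(examples=examples)
user_msg = user_t.render(query="Is a tail number mentioned ...?",
                         context="The pilot reported ...")
```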
Running Extractive QA¶
The entrypoint script `scripts/extractive_qa.py` supports three modes: `inference`, `evaluation`, and `both`.
Example: Run Inference Only¶
```bash
python -m scripts.extractive_qa inference \
    -i data/extractive_qa/tail_number.json \
    -o runs/extractive_qa \
    -m gpt-4o-mini \
    --task_type extract_tail_number \
    --num_examples 3
```
Example: Run Evaluation Only¶
```bash
python -m scripts.extractive_qa evaluation \
    -i data/extractive_qa/tail_number.json \
    -o runs/extractive_qa \
    --predictions_file runs/extractive_qa/predictions.json \
    --normalizer_func normalize_tail_extraction_predictions
```
Example: Run Inference + Evaluation¶
```bash
python -m scripts.extractive_qa both \
    -i data/extractive_qa/tail_number.json \
    -o runs/extractive_qa \
    -m gpt-4o-mini \
    --task_type extract_tail_number \
    --num_examples 3 \
    --normalizer_func normalize_tail_extraction_predictions
```
Structured Output with Schemas (Optional)¶
ALUE supports structured generation using Pydantic schemas. For example, the tail number extraction task can enforce output via:
```python
from pydantic import BaseModel
from typing import List

class TailNumberResponse(BaseModel):
    tail_numbers: List[str]
```
Run with schema enforcement:
```bash
python -m scripts.extractive_qa inference \
    -i data/extractive_qa/tail_number.json \
    -o runs/extractive_qa \
    -m gpt-4o-mini \
    --task_type extract_tail_number \
    --num_examples 3 \
    --schema_class TailNumberResponse \
    --field_to_extract tail_numbers
```
Result example:

```json
{
  "e80026dc-c640-484f-94c6-4f7597cfce1f": ["N404EH"],
  "7a661465-9721-4ab0-b27f-f935c4e166a0": ["NONE"]
}
```
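On the consumer side, a raw JSON response can be validated against the schema (a sketch assuming Pydantic v2's `model_validate_json`):

```python
from typing import List
from pydantic import BaseModel

class TailNumberResponse(BaseModel):
    tail_numbers: List[str]

# Parse and validate a raw model response; malformed output raises a
# ValidationError instead of silently producing a bad prediction.
resp = TailNumberResponse.model_validate_json('{"tail_numbers": ["N404EH"]}')
print(resp.tail_numbers)
```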
Normalization Functions¶
Since models may output answers in varied formats (lists, strings, or raw spans), ALUE provides normalization utilities.
Example: `normalize_tail_extraction_predictions`

```python
def normalize_tail_extraction_predictions(predictions: Dict[str, Any]) -> Dict[str, Any]:
    """
    Normalize tail number predictions.

    []          -> "[NONE]"
    ["N66372"]  -> "[N66372]"
    "[NONE]"    -> "[NONE]"
    "[N66372]"  -> "[N66372]"
    """
```
Call it via the CLI with:

```bash
--normalizer_func normalize_tail_extraction_predictions
```
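The behavior documented in the docstring can be sketched as follows (a minimal reimplementation for illustration, not ALUE's actual code):

```python
from typing import Any, Dict

def normalize(predictions: Dict[str, Any]) -> Dict[str, Any]:
    """Lists become bracketed strings, empty lists become "[NONE]",
    already-bracketed strings pass through unchanged."""
    out = {}
    for qid, pred in predictions.items():
        if isinstance(pred, list):
            out[qid] = "[NONE]" if not pred else "[" + ", ".join(pred) + "]"
        else:
            out[qid] = pred  # e.g. "[NONE]" or "[N66372]" already normalized
    return out
```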
Metrics¶
Extractive QA uses SQuAD-style evaluation:
- Exact Match (EM) → % of answers exactly matching gold spans
- F1 Score → Token-level overlap between prediction and gold span(s)
- Handles multiple spans, no-answer cases, and normalized predictions
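The first two metrics follow the standard SQuAD definitions, which can be sketched minimally (ALUE's implementation may layer multi-span and no-answer handling on top):

```python
import re
import string
from collections import Counter

def _normalize(s: str) -> str:
    # SQuAD-style answer normalization: lowercase, drop punctuation,
    # drop articles, collapse whitespace.
    s = "".join(ch for ch in s.lower() if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> bool:
    # EM: normalized prediction equals normalized gold span.
    return _normalize(pred) == _normalize(gold)

def f1(pred: str, gold: str) -> float:
    # Token-level F1 between prediction and gold span.
    p, g = _normalize(pred).split(), _normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```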
Configuration Notes¶
- LLM Judge is not required for Extractive QA (unlike RAG or Summarization).
- Embedding configuration is not required (no retrieval step involved).
- Schema and normalization are optional.