
# Extractive Question Answering (Extractive QA)

Extractive QA evaluates a model’s ability to identify text spans from a transcript or document that directly answer a given question.
This task is especially relevant for NTSB transcripts or structured regulatory text where answers must be extracted verbatim.


## Dataset Format

Datasets follow a SQuAD-style format, with `examples` (few-shot items for prompt construction) and `paragraphs` (test/evaluation instances):

```json
{
  "version": "squad_v1",
  "data": [
    {
      "title": "ntsb_extractive_qa",
      "task": "Extract the exact span ...",
      "examples": [
        {
          "question": "Is a tail number mentioned ...?",
          "transcript": "On March 18, 2008 ... a Robinson R22 beta, M3056T, ...",
          "answer": "[M3056T]"
        }
      ],
      "paragraphs": [
        {
          "context": "The pilot reported while making an approach ...",
          "qas": [
            {
              "id": "e80026dc-c640-484f-94c6-4f7597cfce1f",
              "question": "Is a tail number mentioned ...?",
              "answers": [{ "text": "['NONE']" }]
            }
          ]
        }
      ]
    }
  ]
}
```

* `examples`: Few-shot examples for prompt construction
* `paragraphs`: Contexts and QA pairs for inference/evaluation
* `answers.text`: Gold-standard text span(s)

---

## Prompt Templates

Example prompt for **tail number extraction** (`templates/extract_tail_number/`):

**system.jinja2**

```jinja2
You are a helpful AI assistant that identifies tail numbers from NTSB transcripts, if a tail number exists.

{% if examples %}
{% for example in examples %}
Question: {{ example.question }}
Context: {{ example.transcript }}
Tail Number: {{ example.answer }}

{% endfor %}
{% endif %}

Do not provide sentence-long spans or explanations in your answers. Just provide the tail numbers formatted as a list.

```

**user.jinja2**

```jinja2
Question: {{query}}
Context: {{context}}
```

## Template Variables

For Extractive QA, templates must reference:

| Variable | Description |
| --- | --- |
| `query` | The input question |
| `context` | The transcript or document passage |
| `examples` | Few-shot QA pairs for in-context learning |
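
These variables are filled by the Jinja2 renderer at prompt-construction time; a stdlib-only sketch of the equivalent substitution for `user.jinja2` (using `string.Template` as a stand-in for Jinja2, with `build_user_prompt` being an illustrative name, not an ALUE function):

```python
from string import Template

# Plain-string stand-in for user.jinja2; ALUE itself renders Jinja2 templates.
USER_TEMPLATE = Template("Question: $query\nContext: $context")

def build_user_prompt(query: str, context: str) -> str:
    """Fill the user template with the question and the passage."""
    return USER_TEMPLATE.substitute(query=query, context=context)
```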

## Running Extractive QA

The entrypoint script `scripts/extractive_qa.py` supports three modes: `inference`, `evaluation`, and `both`.

**Example: Run Inference Only**

```shell
python -m scripts.extractive_qa inference \
  -i data/extractive_qa/tail_number.json \
  -o runs/extractive_qa \
  -m gpt-4o-mini \
  --task_type extract_tail_number \
  --num_examples 3
```

**Example: Run Evaluation Only**

```shell
python -m scripts.extractive_qa evaluation \
  -i data/extractive_qa/tail_number.json \
  -o runs/extractive_qa \
  --predictions_file runs/extractive_qa/predictions.json \
  --normalizer_func normalize_tail_extraction_predictions
```

**Example: Run Inference + Evaluation**

```shell
python -m scripts.extractive_qa both \
  -i data/extractive_qa/tail_number.json \
  -o runs/extractive_qa \
  -m gpt-4o-mini \
  --task_type extract_tail_number \
  --num_examples 3 \
  --normalizer_func normalize_tail_extraction_predictions
```

## Structured Output with Schemas (Optional)

ALUE supports structured generation using Pydantic schemas. For example, the tail number extraction task can enforce output via:

```python
from pydantic import BaseModel
from typing import List

class TailNumberResponse(BaseModel):
    tail_numbers: List[str]
```

Run with schema enforcement:

```shell
python -m scripts.extractive_qa inference \
  -i data/extractive_qa/tail_number.json \
  -o runs/extractive_qa \
  -m gpt-4o-mini \
  --task_type extract_tail_number \
  --num_examples 3 \
  --schema_class TailNumberResponse \
  --field_to_extract tail_numbers
```

**Result Example:**

```json
{
  "e80026dc-c640-484f-94c6-4f7597cfce1f": ["N404EH"],
  "7a661465-9721-4ab0-b27f-f935c4e166a0": ["NONE"]
}
```
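
Downstream code can sanity-check such a result against the expected shape; a stdlib-only sketch, assuming each value is a list of strings as in the example above (`check_tail_number_result` is an illustrative name, not an ALUE function):

```python
import json
from typing import Dict, List

def check_tail_number_result(raw: str) -> Dict[str, List[str]]:
    """Parse a result JSON string and verify every entry is a list of strings."""
    result = json.loads(raw)
    for qa_id, spans in result.items():
        if not isinstance(spans, list) or not all(isinstance(s, str) for s in spans):
            raise ValueError(f"entry {qa_id!r} does not match the expected schema")
    return result
```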

## Normalization Functions

Since models may output answers in varied formats (lists, strings, or raw spans), ALUE provides normalization utilities.

**Example: `normalize_tail_extraction_predictions`**

```python
from typing import Any, Dict

def normalize_tail_extraction_predictions(predictions: Dict[str, Any]) -> Dict[str, Any]:
    """
    Normalize tail number predictions.

    [] -> "[NONE]"
    ["N66372"] -> "[N66372]"
    "[NONE]" -> "[NONE]"
    "[N66372]" -> "[N66372]"
    """
    # Sketch of the documented mapping; the shipped implementation may differ.
    normalized = {}
    for qa_id, pred in predictions.items():
        if isinstance(pred, list):
            pred = f"[{', '.join(pred)}]" if pred else "[NONE]"
        normalized[qa_id] = pred
    return normalized
```

Call via CLI with:

```shell
--normalizer_func normalize_tail_extraction_predictions
```

## Metrics

Extractive QA uses SQuAD-style evaluation:

* **Exact Match (EM)**: percentage of predictions that exactly match a gold span
* **F1 Score**: token-level overlap between the prediction and the gold span(s)
* Handles multiple spans, no-answer cases, and normalized predictions
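
The two scores can be sketched as follows (a simplified version of the standard SQuAD metric that splits on whitespace and omits the official answer normalization):

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> bool:
    """Strict string equality after trimming surrounding whitespace."""
    return prediction.strip() == gold.strip()

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted span and a gold span."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```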

## Configuration Notes

* LLM Judge is not required for Extractive QA (unlike RAG or Summarization).
* Embedding configuration is not required (no retrieval step is involved).
* Schema enforcement and normalization are optional.