CIS Benchmark CLI - Data Model and JSON Schema¶
Documentation Path
You are here: Technical Reference > Data Model
- For architecture: See Architecture Overview
- For data flow: See Data Flow Pipeline
Overview¶
This document defines our canonical data format for CIS benchmarks. All data flows through this format.
Data Flow Architecture¶
flowchart LR
subgraph Input["Input Sources"]
WB["CIS WorkBench<br/>(HTML)"]
API["CIS API<br/>(JSON navtree)"]
SCAP["Future: SCAP"]
EXCEL["Future: Excel"]
PDF["Future: PDF"]
end
subgraph Canonical["Canonical Format"]
SCHEMA["JSON Schema<br/>(Our Format)<br/>VALIDATED"]
end
subgraph Output["Output Formats"]
XCCDF["XCCDF XML"]
YAML["YAML"]
CSV["CSV"]
MD["Markdown"]
JSON["JSON"]
FUTURE["Future formats"]
end
WB --> SCHEMA
API --> SCHEMA
SCAP --> SCHEMA
EXCEL --> SCHEMA
PDF --> SCHEMA
SCHEMA --> XCCDF
SCHEMA --> YAML
SCHEMA --> CSV
SCHEMA --> MD
SCHEMA --> JSON
SCHEMA --> FUTURE
style SCHEMA fill:#90EE90,stroke:#006400,stroke-width:3px
style Input fill:#E8F4F8,stroke:#0066CC
style Output fill:#FFF4E6,stroke:#CC6600
KEY PRINCIPLE: All data MUST validate against our JSON Schema before it can be exported to any format.
Why JSON Schema as Canonical Format?¶
Benefits¶
Single Source of Truth
- All exporters map from ONE format
- No exporter-specific logic in scraper
- Clear contract between components
Validation
- Ensure scraped data is complete
- Catch extraction errors early
- Verify required fields present
Documentation
- Self-documenting data format
- Clear field definitions
- Type information
Versioning
- Schema v1.0, v2.0 as needs evolve
- Backward compatibility tracking
- Migration paths
Code Generation
- Can generate Python dataclasses from schema
- Type hints from schema
- Validation code auto-generated
Testing
- Validate test fixtures
- Ensure exporters get expected input
- Clear test assertions
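To make the testing benefit concrete, here is a minimal pytest sketch that validates every JSON fixture against the schema (the test file name and the tests/fixtures/ directory are assumptions, not the repository's actual layout):

# tests/test_fixture_validation.py (hypothetical path)
import json
from pathlib import Path

import jsonschema
import pytest

SCHEMA = json.loads(Path("cis_bench/models/schema.json").read_text())
FIXTURES = sorted(Path("tests/fixtures").glob("*.json"))  # assumed fixture location

@pytest.mark.parametrize("fixture_path", FIXTURES, ids=lambda p: p.name)
def test_fixture_matches_schema(fixture_path):
    data = json.loads(fixture_path.read_text())
    # Fails the test with a clear ValidationError if a fixture drifts from the schema
    jsonschema.validate(instance=data, schema=SCHEMA)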
Our JSON Schema¶
Location¶
cis_bench/models/schema.json
Structure¶
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "https://github.com/yourusername/cis-benchmark-cli/schemas/benchmark-v1.0.json",
"title": "CIS Benchmark",
"description": "Canonical data format for CIS Benchmark downloaded from CIS WorkBench",
"type": "object",
"required": ["title", "benchmark_id", "url", "version", "recommendations"],
"properties": {
"title": {
"type": "string",
"description": "Full benchmark title (e.g., 'CIS Amazon EKS Benchmark v1.8.0')",
"examples": ["CIS Amazon EKS Benchmark v1.8.0", "CIS NGINX Benchmark vNEXT"]
},
"benchmark_id": {
"type": "string",
"description": "CIS WorkBench benchmark ID",
"pattern": "^[0-9]+$",
"examples": ["22605", "18528"]
},
"url": {
"type": "string",
"format": "uri",
"description": "Source URL of the benchmark"
},
"version": {
"type": "string",
"description": "Benchmark version (extracted from title or metadata)",
"examples": ["v1.8.0", "vNEXT", "v1.0.0"]
},
"downloaded_at": {
"type": "string",
"format": "date-time",
"description": "ISO 8601 timestamp when benchmark was downloaded"
},
"scraper_version": {
"type": "string",
"description": "Scraper strategy version used",
"examples": ["v1_2025_10"]
},
"total_recommendations": {
"type": "integer",
"minimum": 0,
"description": "Total number of recommendations in benchmark"
},
"recommendations": {
"type": "array",
"description": "Array of benchmark recommendations",
"items": {
"$ref": "#/$defs/recommendation"
}
}
},
"$defs": {
"recommendation": {
"type": "object",
"required": ["ref", "title", "url"],
"properties": {
"ref": {
"type": "string",
"description": "Recommendation reference number",
"pattern": "^[0-9]+(\\.[0-9]+)*$",
"examples": ["3.1.1", "5.2.3"]
},
"title": {
"type": "string",
"description": "Recommendation title",
"minLength": 1
},
"url": {
"type": "string",
"format": "uri",
"description": "Direct URL to recommendation page"
},
"assessment": {
"type": ["string", "null"],
"description": "Automated scoring/assessment information (may contain HTML)"
},
"description": {
"type": ["string", "null"],
"description": "Detailed description of the recommendation (may contain HTML)"
},
"rationale": {
"type": ["string", "null"],
"description": "Rationale/justification for the recommendation (may contain HTML)"
},
"impact": {
"type": ["string", "null"],
"description": "Impact statement (may contain HTML)"
},
"audit": {
"type": ["string", "null"],
"description": "Audit procedure to verify compliance (may contain HTML)"
},
"remediation": {
"type": ["string", "null"],
"description": "Remediation steps to achieve compliance (may contain HTML)"
},
"default_value": {
"type": ["string", "null"],
"description": "Default configuration value (may contain HTML)"
},
"artifact_eq": {
"type": ["string", "null"],
"description": "Artifact equation (may contain HTML)"
},
"mitre_mapping": {
"type": ["string", "null"],
"description": "MITRE ATT&CK framework mappings (may contain HTML)"
},
"references": {
"type": ["string", "null"],
"description": "External references and citations (may contain HTML)"
}
},
"additionalProperties": false
}
}
}
Using JSON Schema¶
Validation¶
import json
import jsonschema

# Load schema
with open('cis_bench/models/schema.json') as f:
    schema = json.load(f)

# Validate scraped data
try:
    jsonschema.validate(instance=benchmark_data, schema=schema)
    print("✅ Data is valid")
except jsonschema.ValidationError as e:
    print(f"❌ Validation failed: {e.message}")
    print(f"   Path: {' -> '.join(str(p) for p in e.path)}")
Code Generation from Schema¶
We can also use tools like datamodel-code-generator to create Python dataclasses from our schema:
pip install datamodel-code-generator
datamodel-codegen \
--input cis_bench/models/schema.json \
--output cis_bench/models/benchmark.py \
--input-file-type jsonschema
This generates type-safe Python classes that match our schema exactly!
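The exact output depends on the tool version and options (the default output is pydantic models; recent versions accept --output-model-type dataclasses.dataclass for plain dataclasses). A rough sketch of the kind of classes it emits for our schema, not the literal generated file:

# Sketch of generated models -- the real file comes from datamodel-codegen
from __future__ import annotations

from typing import List, Optional

from pydantic import BaseModel

class Recommendation(BaseModel):
    ref: str
    title: str
    url: str
    assessment: Optional[str] = None
    description: Optional[str] = None
    # ... the remaining nullable fields follow the same pattern

class Benchmark(BaseModel):
    title: str
    benchmark_id: str
    url: str
    version: str
    downloaded_at: Optional[str] = None
    scraper_version: Optional[str] = None
    total_recommendations: Optional[int] = None
    recommendations: List[Recommendation]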
Exporter Mapping¶
Each exporter knows how to map from our canonical format to its target format:
JSON Exporter¶
# Simple pass-through (already in our format)
def export(self, data, output_path):
    with open(output_path, 'w') as f:
        json.dump(data, f, indent=2)
YAML Exporter¶
# Direct mapping (structure is the same)
def export(self, data, output_path):
    with open(output_path, 'w') as f:
        yaml.dump(data, f)
CSV Exporter¶
# Flatten recommendations array
def export(self, data, output_path):
    rows = []
    for rec in data['recommendations']:
        row = {
            'benchmark_title': data['title'],
            'benchmark_id': data['benchmark_id'],
            'ref': rec['ref'],
            'title': rec['title'],
            # ... flatten all fields
        }
        rows.append(row)

    # Write CSV (newline='' avoids blank lines on Windows)
    with open(output_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
XCCDF Exporter¶
# Map to xsdata XCCDF models
def export(self, data, output_path):
    from cis_bench.models.xccdf import Benchmark, Rule, Status

    # Create XCCDF Benchmark
    xccdf_benchmark = Benchmark(
        id=f"cis_benchmark_{data['benchmark_id']}",
        status=[Status(value="draft")],
        title=data['title'],
        version=data['version']
    )

    # Map each recommendation to an XCCDF Rule
    for rec in data['recommendations']:
        rule = Rule(
            id=f"rule_{rec['ref'].replace('.', '_')}",
            title=rec['title'],
            description=strip_html(rec['description']),
            # ... map other fields per XCCDF spec
        )
        xccdf_benchmark.rule.append(rule)

    # Serialize to XML (assumes a to_xml helper on our model; raw xsdata
    # dataclasses would instead go through xsdata's XmlSerializer)
    xml_output = xccdf_benchmark.to_xml(pretty_print=True)
    with open(output_path, 'w') as f:
        f.write(xml_output)
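strip_html is not defined in the snippets above. A minimal sketch using the standard library's html.parser; the real implementation could just as well use BeautifulSoup or similar:

from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text content of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(value):
    """Return the plain text of an HTML fragment; pass None through unchanged."""
    if value is None:
        return None
    extractor = _TextExtractor()
    extractor.feed(value)
    return "".join(extractor.parts).strip()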
Field Definitions¶
Benchmark Level¶
| Field | Type | Required | Description |
|---|---|---|---|
| title | string | Yes | Full benchmark title |
| benchmark_id | string | Yes | CIS WorkBench ID |
| url | string (URI) | Yes | Source URL |
| version | string | Yes | Benchmark version |
| downloaded_at | string (ISO 8601) | No | Download timestamp |
| scraper_version | string | No | Strategy version used |
| total_recommendations | integer | No | Count of recommendations |
| recommendations | array | Yes | Array of recommendation objects |
Recommendation Level¶
| Field | Type | Required | Description | HTML Allowed |
|---|---|---|---|---|
| ref | string | Yes | Reference number (e.g., "3.1.1") | No |
| title | string | Yes | Recommendation title | No |
| url | string (URI) | Yes | Direct link to recommendation | No |
| assessment | string/null | No | Automated scoring info | Yes |
| description | string/null | No | Detailed description | Yes |
| rationale | string/null | No | Justification | Yes |
| impact | string/null | No | Impact statement | Yes |
| audit | string/null | No | Audit procedure | Yes |
| remediation | string/null | No | Remediation steps | Yes |
| default_value | string/null | No | Default config value | Yes |
| artifact_eq | string/null | No | Artifact equation | Yes |
| mitre_mapping | string/null | No | MITRE ATT&CK mappings | Yes |
| references | string/null | No | External references | Yes |
HTML Allowed: Fields may contain HTML markup that should be:
- Preserved in JSON/YAML export
- Stripped for CSV export
- Converted to XCCDF-safe format for XCCDF export
Validation Strategy¶
When to Validate¶
- After scraping - Before saving JSON
- Before exporting - Ensure input is valid
- In tests - Validate all fixtures
Validation Levels¶
Strict (for our saved JSON):
- All required fields present
- Types match schema
- Additional properties not allowed
Lenient (for backward compatibility):
- Required fields present
- Types match when present
- Additional properties allowed (ignored)
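One way to implement the two levels, assuming the shipped schema is the strict one (it sets additionalProperties: false on recommendations): the lenient pass validates against a copy with that constraint removed. This is a sketch, not existing code in the repository:

import copy
import json

import jsonschema

with open("cis_bench/models/schema.json") as f:
    strict_schema = json.load(f)

def make_lenient(schema: dict) -> dict:
    """Return a copy of the schema that tolerates unknown recommendation fields."""
    relaxed = copy.deepcopy(schema)
    relaxed["$defs"]["recommendation"].pop("additionalProperties", None)
    return relaxed

# Strict: data we save ourselves must match the schema exactly
jsonschema.validate(instance=benchmark_data, schema=strict_schema)

# Lenient: e.g. reading data produced by a newer scraper version
jsonschema.validate(instance=benchmark_data, schema=make_lenient(strict_schema))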
Schema Evolution¶
Version 1.0.0 (Current)¶
- Initial schema
- Based on current CIS WorkBench fields
- Includes all 10 recommendation fields
Version 1.1.0 (Future)¶
- Add new fields as CIS WorkBench evolves
- Maintain backward compatibility
- Use oneOf for version-specific fields
Version 2.0.0 (Future)¶
- Breaking changes if needed
- Migration guide provided
- Tool supports both v1 and v2
Tools and Libraries¶
JSON Schema Validation:
- jsonschema (used in the validation examples above)
Code Generation from Schema:
- datamodel-code-generator (see the command above)
Schema Documentation:
- json-schema-for-humans
pip install json-schema-for-humans
generate-schema-doc cis_bench/models/schema.json docs/schema.html
Example: Complete Benchmark¶
{
"title": "CIS Amazon Elastic Kubernetes Service (EKS) Benchmark",
"benchmark_id": "22605",
"url": "https://workbench.cisecurity.org/benchmarks/22605",
"version": "v1.8.0",
"downloaded_at": "2025-10-17T21:48:00Z",
"scraper_version": "v1_2025_10",
"total_recommendations": 50,
"recommendations": [
{
"ref": "3.1.1",
"title": "Ensure that the kubeconfig file permissions are set to 644 or more restrictive",
"url": "https://workbench.cisecurity.org/sections/3511915/recommendations/5772605",
"assessment": "<p>Automated assessment available</p>",
"description": "<p>If kubelet is running, and if it is configured by a kubeconfig file...</p>",
"rationale": "<p>Improper access permissions could allow...</p>",
"impact": "<p>None expected</p>",
"audit": "<p>Run the following command: <code>stat -c %a /var/lib/kubelet/kubeconfig</code></p>",
"remediation": "<p>Run: <code>chmod 644 /var/lib/kubelet/kubeconfig</code></p>",
"default_value": null,
"artifact_eq": null,
"mitre_mapping": "<p>T1574.006 - Hijack Execution Flow: Dynamic Linker Hijacking</p>",
"references": "<ul><li>Kubernetes documentation</li></ul>"
}
// ... 49 more recommendations
]
}
Integration with Exporters¶
BaseExporter Contract¶
from abc import ABC, abstractmethod

class BaseExporter(ABC):
    @abstractmethod
    def export(self, data: dict, output_path: str) -> str:
        """Export benchmark data.

        Args:
            data: Dictionary matching cis_bench/models/schema.json
            output_path: Where to write output

        Returns:
            Path to created file

        Raises:
            jsonschema.ValidationError: If data doesn't match schema
            IOError: If file cannot be written
        """
        # All exporters should validate input first
        self._validate_input(data)

        # Then perform export
        # ...
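A hypothetical exporter built on this contract, for the Markdown output format shown in the diagram above (the class below is illustrative, not existing code):

class MarkdownExporter(BaseExporter):
    def export(self, data: dict, output_path: str) -> str:
        # Fail fast if the input drifts from the schema
        self._validate_input(data)

        lines = [f"# {data['title']}", ""]
        for rec in data["recommendations"]:
            lines.append(f"## {rec['ref']} {rec['title']}")
            if rec.get("description"):
                # HTML is preserved here; strip or convert it if plain text is wanted
                lines.append(rec["description"])
            lines.append("")

        with open(output_path, "w") as f:
            f.write("\n".join(lines))
        return output_path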
Mapping Examples¶
JSON Schema Field → XCCDF Field
| Our Schema | XCCDF Element | Notes |
|---|---|---|
| ref | Rule/@id | Convert "3.1.1" → "rule_3_1_1" |
| title | Rule/title | Direct mapping |
| description | Rule/description | Strip HTML tags |
| rationale | Rule/rationale | Strip HTML tags |
| audit | Rule/check/check-content | Strip HTML, wrap in check |
| remediation | Rule/fixtext | Strip HTML tags |
| mitre_mapping | Rule/reference | Parse and map to reference elements |
JSON Schema Field → CSV Column
All fields are flattened into one row per recommendation:
benchmark_title,benchmark_id,ref,title,description,audit,...
"CIS Amazon EKS","22605","3.1.1","Ensure...","If kubelet...","Run...",...
Validation Implementation¶
In WorkbenchScraper¶
import json
import jsonschema

class WorkbenchScraper:
    def __init__(self, session, schema_path='cis_bench/models/schema.json'):
        self.session = session

        # Load schema
        with open(schema_path) as f:
            self.schema = json.load(f)

    def download_benchmark(self, benchmark_url):
        # ... scraping logic ...

        benchmark_data = {
            'title': title,
            'benchmark_id': benchmark_id,
            'recommendations': recommendations,
            # ...
        }

        # VALIDATE before returning
        try:
            jsonschema.validate(instance=benchmark_data, schema=self.schema)
        except jsonschema.ValidationError as e:
            raise ValueError(f"Scraped data validation failed: {e.message}") from e

        return benchmark_data
In Exporters¶
class BaseExporter(ABC):
    def _validate_input(self, data: dict):
        """Validate input data against schema."""
        with open('cis_bench/models/schema.json') as f:
            schema = json.load(f)

        jsonschema.validate(instance=data, schema=schema)
Schema Files¶
We'll maintain multiple schema versions:
cis_bench/models/
├── schema.json # Current schema (symlink to v1.0.0)
├── schemas/
│ ├── v1.0.0.json # Version 1.0.0
│ ├── v1.1.0.json # Version 1.1.0 (future)
│ └── v2.0.0.json # Version 2.0.0 (future)
└── schema_validator.py # Validation utilities
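The contents of schema_validator.py are not pinned down here; a plausible sketch of the utilities it could expose (the function names and the versioned-schema lookup are assumptions):

# cis_bench/models/schema_validator.py (sketch)
import json
from functools import lru_cache
from pathlib import Path

import jsonschema

SCHEMA_DIR = Path(__file__).parent / "schemas"

@lru_cache(maxsize=None)
def load_schema(version: str = "v1.0.0") -> dict:
    """Load and cache one schema version from cis_bench/models/schemas/."""
    return json.loads((SCHEMA_DIR / f"{version}.json").read_text())

def validate_benchmark(data: dict, version: str = "v1.0.0") -> None:
    """Raise jsonschema.ValidationError if data does not match the given version."""
    jsonschema.validate(instance=data, schema=load_schema(version))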
Benefits in Practice¶
Before (No Schema)¶
flowchart LR
A[Scraper extracts data] --> B[Export to XCCDF]
B --> C[❌ Missing field!<br/>XCCDF invalid!<br/>Debug for hours...]
style C fill:#FFE0E0,stroke:#CC0000,stroke-width:2px
After (With Schema)¶
flowchart LR
A[Scraper extracts data] --> B{Validate against schema}
B -->|Invalid| C[❌ Missing 'title' field!<br/>Validation fails immediately<br/>Fix scraper in minutes]
C --> D[Re-scrape]
D --> E{Validate again}
E -->|Valid| F[Export to XCCDF]
F --> G[✅ Valid XCCDF]
style C fill:#FFE0E0,stroke:#CC0000,stroke-width:2px
style G fill:#E0FFE0,stroke:#00CC00,stroke-width:2px
style E fill:#E8F4F8,stroke:#0066CC,stroke-width:2px
style B fill:#E8F4F8,stroke:#0066CC,stroke-width:2px