CIS Benchmark CLI - Data Model and JSON Schema¶
Documentation Path
You are here: Technical Reference > Data Model
- For architecture: See Architecture Overview
- For data flow: See Data Flow Pipeline
Overview¶
This document defines our canonical data format for CIS benchmarks. All data flows through this format.
Data Flow Architecture¶
flowchart LR
subgraph Input["Input Sources"]
WB["CIS WorkBench<br/>(HTML)"]
API["CIS API<br/>(JSON navtree)"]
SCAP["Future: SCAP"]
EXCEL["Future: Excel"]
PDF["Future: PDF"]
end
subgraph Canonical["Canonical Format"]
SCHEMA["JSON Schema<br/>(Our Format)<br/>VALIDATED"]
end
subgraph Output["Output Formats"]
XCCDF["XCCDF XML"]
YAML["YAML"]
CSV["CSV"]
MD["Markdown"]
JSON["JSON"]
FUTURE["Future formats"]
end
WB --> SCHEMA
API --> SCHEMA
SCAP --> SCHEMA
EXCEL --> SCHEMA
PDF --> SCHEMA
SCHEMA --> XCCDF
SCHEMA --> YAML
SCHEMA --> CSV
SCHEMA --> MD
SCHEMA --> JSON
SCHEMA --> FUTURE
style SCHEMA fill:#90EE90,stroke:#006400,stroke-width:3px
style Input fill:#E8F4F8,stroke:#0066CC
style Output fill:#FFF4E6,stroke:#CC6600
KEY PRINCIPLE: All data MUST validate against our JSON Schema before it can be exported to any format.
Why JSON Schema as Canonical Format?¶
Benefits¶
Single Source of Truth
- All exporters map from ONE format
- No exporter-specific logic in scraper
- Clear contract between components
Validation
- Ensure scraped data is complete
- Catch extraction errors early
- Verify required fields present
Documentation
- Self-documenting data format
- Clear field definitions
- Type information
Versioning
- Schema v1.0, v2.0 as needs evolve
- Backward compatibility tracking
- Migration paths
Code Generation
- Can generate Python dataclasses from schema
- Type hints from schema
- Validation code auto-generated
Testing
- Validate test fixtures
- Ensure exporters get expected input
- Clear test assertions
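To make the testing benefit concrete, here is a minimal pytest sketch that validates every JSON fixture against the schema (the test file name and the tests/fixtures/ directory are assumptions, not the repository's actual layout):

# tests/test_fixture_validation.py (hypothetical path)
import json
from pathlib import Path

import jsonschema
import pytest

SCHEMA = json.loads(Path("cis_bench/models/schema.json").read_text())
FIXTURES = sorted(Path("tests/fixtures").glob("*.json"))  # assumed fixture location

@pytest.mark.parametrize("fixture_path", FIXTURES, ids=lambda p: p.name)
def test_fixture_matches_schema(fixture_path):
    data = json.loads(fixture_path.read_text())
    # Fails the test with a clear ValidationError if a fixture drifts from the schema
    jsonschema.validate(instance=data, schema=SCHEMA)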
Our JSON Schema¶
Location¶
cis_bench/models/schema.json
Structure¶
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "https://github.com/yourusername/cis-benchmark-cli/schemas/benchmark-v1.0.json",
"title": "CIS Benchmark",
"description": "Canonical data format for CIS Benchmark downloaded from CIS WorkBench",
"type": "object",
"required": ["title", "benchmark_id", "url", "version", "recommendations"],
"properties": {
"title": {
"type": "string",
"description": "Full benchmark title (e.g., 'CIS Amazon EKS Benchmark v1.8.0')",
"examples": ["CIS Amazon EKS Benchmark v1.8.0", "CIS NGINX Benchmark vNEXT"]
},
"benchmark_id": {
"type": "string",
"description": "CIS WorkBench benchmark ID",
"pattern": "^[0-9]+$",
"examples": ["22605", "18528"]
},
"url": {
"type": "string",
"format": "uri",
"description": "Source URL of the benchmark"
},
"version": {
"type": "string",
"description": "Benchmark version (extracted from title or metadata)",
"examples": ["v1.8.0", "vNEXT", "v1.0.0"]
},
"downloaded_at": {
"type": "string",
"format": "date-time",
"description": "ISO 8601 timestamp when benchmark was downloaded"
},
"scraper_version": {
"type": "string",
"description": "Scraper strategy version used",
"examples": ["v1_2025_10"]
},
"total_recommendations": {
"type": "integer",
"minimum": 0,
"description": "Total number of recommendations in benchmark"
},
"recommendations": {
"type": "array",
"description": "Array of benchmark recommendations",
"items": {
"$ref": "#/$defs/recommendation"
}
}
},
"$defs": {
"recommendation": {
"type": "object",
"required": ["ref", "title", "url"],
"properties": {
"ref": {
"type": "string",
"description": "Recommendation reference number",
"pattern": "^[0-9]+(\\.[0-9]+)*$",
"examples": ["3.1.1", "5.2.3"]
},
"title": {
"type": "string",
"description": "Recommendation title",
"minLength": 1
},
"url": {
"type": "string",
"format": "uri",
"description": "Direct URL to recommendation page"
},
"assessment": {
"type": ["string", "null"],
"description": "Automated scoring/assessment information (may contain HTML)"
},
"description": {
"type": ["string", "null"],
"description": "Detailed description of the recommendation (may contain HTML)"
},
"rationale": {
"type": ["string", "null"],
"description": "Rationale/justification for the recommendation (may contain HTML)"
},
"impact": {
"type": ["string", "null"],
"description": "Impact statement (may contain HTML)"
},
"audit": {
"type": ["string", "null"],
"description": "Audit procedure to verify compliance (may contain HTML)"
},
"remediation": {
"type": ["string", "null"],
"description": "Remediation steps to achieve compliance (may contain HTML)"
},
"default_value": {
"type": ["string", "null"],
"description": "Default configuration value (may contain HTML)"
},
"artifact_eq": {
"type": ["string", "null"],
"description": "Artifact equation (may contain HTML)"
},
"mitre_mapping": {
"type": ["string", "null"],
"description": "MITRE ATT&CK framework mappings (may contain HTML)"
},
"references": {
"type": ["string", "null"],
"description": "External references and citations (may contain HTML)"
}
},
"additionalProperties": false
}
}
}
Using JSON Schema¶
Validation¶
import json
import jsonschema

# Load schema
with open('cis_bench/models/schema.json') as f:
    schema = json.load(f)

# Validate scraped data
try:
    jsonschema.validate(instance=benchmark_data, schema=schema)
    print("✅ Data is valid")
except jsonschema.ValidationError as e:
    print(f"❌ Validation failed: {e.message}")
    print(f"   Path: {' -> '.join(str(p) for p in e.path)}")
Code Generation from Schema¶
We can also use tools like datamodel-code-generator to create Python dataclasses from our schema:
pip install datamodel-code-generator
datamodel-codegen \
--input cis_bench/models/schema.json \
--output cis_bench/models/benchmark.py \
--input-file-type jsonschema
This generates type-safe Python classes that match our schema exactly!
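The exact output depends on the tool version and options (the default output is pydantic models; recent versions accept --output-model-type dataclasses.dataclass for plain dataclasses). A rough sketch of the kind of classes it emits for our schema, not the literal generated file:

# Sketch of generated models -- the real file comes from datamodel-codegen
from __future__ import annotations

from typing import List, Optional

from pydantic import BaseModel

class Recommendation(BaseModel):
    ref: str
    title: str
    url: str
    assessment: Optional[str] = None
    description: Optional[str] = None
    # ... the remaining nullable fields follow the same pattern

class Benchmark(BaseModel):
    title: str
    benchmark_id: str
    url: str
    version: str
    downloaded_at: Optional[str] = None
    scraper_version: Optional[str] = None
    total_recommendations: Optional[int] = None
    recommendations: List[Recommendation]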
Exporter Mapping¶
Each exporter knows how to map from our canonical format to its target format:
JSON Exporter¶
# Simple pass-through (already in our format)
def export(self, data, output_path):
    with open(output_path, 'w') as f:
        json.dump(data, f, indent=2)
YAML Exporter¶
# Direct mapping (structure is the same)
def export(self, data, output_path):
    with open(output_path, 'w') as f:
        yaml.dump(data, f)
CSV Exporter¶
# Flatten recommendations array
def export(self, data, output_path):
    rows = []
    for rec in data['recommendations']:
        row = {
            'benchmark_title': data['title'],
            'benchmark_id': data['benchmark_id'],
            'ref': rec['ref'],
            'title': rec['title'],
            # ... flatten all fields
        }
        rows.append(row)

    # Write CSV (newline='' avoids blank lines on Windows)
    with open(output_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
XCCDF Exporter¶
# Map to xsdata XCCDF models
def export(self, data, output_path):
    from cis_bench.models.xccdf import Benchmark, Rule, Status

    # Create XCCDF Benchmark
    xccdf_benchmark = Benchmark(
        id=f"cis_benchmark_{data['benchmark_id']}",
        status=[Status(value="draft")],
        title=data['title'],
        version=data['version']
    )

    # Map each recommendation to an XCCDF Rule
    for rec in data['recommendations']:
        rule = Rule(
            id=f"rule_{rec['ref'].replace('.', '_')}",
            title=rec['title'],
            description=strip_html(rec['description']),
            # ... map other fields per XCCDF spec
        )
        xccdf_benchmark.rule.append(rule)

    # Serialize to XML (assumes a to_xml helper on our model; raw xsdata
    # dataclasses would instead go through xsdata's XmlSerializer)
    xml_output = xccdf_benchmark.to_xml(pretty_print=True)
    with open(output_path, 'w') as f:
        f.write(xml_output)
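strip_html is not defined in the snippets above. A minimal sketch using the standard library's html.parser; the real implementation could just as well use BeautifulSoup or similar:

from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text content of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(value):
    """Return the plain text of an HTML fragment; pass None through unchanged."""
    if value is None:
        return None
    extractor = _TextExtractor()
    extractor.feed(value)
    return "".join(extractor.parts).strip()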
Field Definitions¶
Benchmark Level¶
| Field | Type | Required | Description |
|---|---|---|---|
| title | string | Yes | Full benchmark title |
| benchmark_id | string | Yes | CIS WorkBench ID |
| url | string (URI) | Yes | Source URL |
| version | string | Yes | Benchmark version |
| downloaded_at | string (ISO 8601) | No | Download timestamp |
| scraper_version | string | No | Strategy version used |
| total_recommendations | integer | No | Count of recommendations |
| recommendations | array | Yes | Array of recommendation objects |
Recommendation Level¶
| Field | Type | Required | Description | HTML Allowed |
|---|---|---|---|---|
| ref | string | Yes | Reference number (e.g., "3.1.1") | No |
| title | string | Yes | Recommendation title | No |
| url | string (URI) | Yes | Direct link to recommendation | No |
| assessment | string/null | No | Automated scoring info | Yes |
| description | string/null | No | Detailed description | Yes |
| rationale | string/null | No | Justification | Yes |
| impact | string/null | No | Impact statement | Yes |
| audit | string/null | No | Audit procedure | Yes |
| remediation | string/null | No | Remediation steps | Yes |
| default_value | string/null | No | Default config value | Yes |
| artifact_eq | string/null | No | Artifact equation | Yes |
| mitre_mapping | string/null | No | MITRE ATT&CK mappings | Yes |
| references | string/null | No | External references | Yes |
HTML Allowed: Fields may contain HTML markup that should be:
- Preserved in JSON/YAML export
- Stripped for CSV export
- Converted to XCCDF-safe format for XCCDF export
Validation Strategy¶
When to Validate¶
- After scraping - Before saving JSON
- Before exporting - Ensure input is valid
- In tests - Validate all fixtures
Validation Levels¶
Strict (for our saved JSON):
- All required fields present
- Types match schema
- Additional properties not allowed
Lenient (for backward compatibility):
- Required fields present
- Types match when present
- Additional properties allowed (ignored)
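One way to implement the two levels, assuming the shipped schema is the strict one (it sets additionalProperties: false on recommendations): the lenient pass validates against a copy with that constraint removed. This is a sketch, not existing code in the repository:

import copy
import json

import jsonschema

with open("cis_bench/models/schema.json") as f:
    strict_schema = json.load(f)

def make_lenient(schema: dict) -> dict:
    """Return a copy of the schema that tolerates unknown recommendation fields."""
    relaxed = copy.deepcopy(schema)
    relaxed["$defs"]["recommendation"].pop("additionalProperties", None)
    return relaxed

# Strict: data we save ourselves must match the schema exactly
jsonschema.validate(instance=benchmark_data, schema=strict_schema)

# Lenient: e.g. reading data produced by a newer scraper version
jsonschema.validate(instance=benchmark_data, schema=make_lenient(strict_schema))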
Schema Evolution¶
Version 1.0.0 (Current)¶
- Initial schema
- Based on current CIS WorkBench fields
- Includes all 10 recommendation fields
Version 1.1.0 (Future)¶
- Add new fields as CIS WorkBench evolves
- Maintain backward compatibility
- Use oneOf for version-specific fields
Version 2.0.0 (Future)¶
- Breaking changes if needed
- Migration guide provided
- Tool supports both v1 and v2
Tools and Libraries¶
JSON Schema Validation:
- jsonschema (used in the validation examples above)
Code Generation from Schema:
- datamodel-code-generator (see the command above)
Schema Documentation:
- json-schema-for-humans
pip install json-schema-for-humans
generate-schema-doc cis_bench/models/schema.json docs/schema.html
Example: Complete Benchmark¶
{
"title": "CIS Amazon Elastic Kubernetes Service (EKS) Benchmark",
"benchmark_id": "22605",
"url": "https://workbench.cisecurity.org/benchmarks/22605",
"version": "v1.8.0",
"downloaded_at": "2025-10-17T21:48:00Z",
"scraper_version": "v1_2025_10",
"total_recommendations": 50,
"recommendations": [
{
"ref": "3.1.1",
"title": "Ensure that the kubeconfig file permissions are set to 644 or more restrictive",
"url": "https://workbench.cisecurity.org/sections/3511915/recommendations/5772605",
"assessment": "<p>Automated assessment available</p>",
"description": "<p>If kubelet is running, and if it is configured by a kubeconfig file...</p>",
"rationale": "<p>Improper access permissions could allow...</p>",
"impact": "<p>None expected</p>",
"audit": "<p>Run the following command: <code>stat -c %a /var/lib/kubelet/kubeconfig</code></p>",
"remediation": "<p>Run: <code>chmod 644 /var/lib/kubelet/kubeconfig</code></p>",
"default_value": null,
"artifact_eq": null,
"mitre_mapping": "<p>T1574.006 - Hijack Execution Flow: Dynamic Linker Hijacking</p>",
"references": "<ul><li>Kubernetes documentation</li></ul>"
}
// ... 49 more recommendations
]
}
Integration with Exporters¶
BaseExporter Contract¶
from abc import ABC, abstractmethod

class BaseExporter(ABC):
    @abstractmethod
    def export(self, data: dict, output_path: str) -> str:
        """Export benchmark data.

        Args:
            data: Dictionary matching cis_bench/models/schema.json
            output_path: Where to write output

        Returns:
            Path to created file

        Raises:
            jsonschema.ValidationError: If data doesn't match schema
            IOError: If file cannot be written
        """
        # All exporters should validate input first
        self._validate_input(data)

        # Then perform export
        # ...
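A hypothetical exporter built on this contract, for the Markdown output format shown in the diagram above (the class below is illustrative, not existing code):

class MarkdownExporter(BaseExporter):
    def export(self, data: dict, output_path: str) -> str:
        # Fail fast if the input drifts from the schema
        self._validate_input(data)

        lines = [f"# {data['title']}", ""]
        for rec in data["recommendations"]:
            lines.append(f"## {rec['ref']} {rec['title']}")
            if rec.get("description"):
                # HTML is preserved here; strip or convert it if plain text is wanted
                lines.append(rec["description"])
            lines.append("")

        with open(output_path, "w") as f:
            f.write("\n".join(lines))
        return output_path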
Mapping Examples¶
JSON Schema Field → XCCDF Field
| Our Schema | XCCDF Element | Notes |
|---|---|---|
| ref | Rule/@id | Convert "3.1.1" → "rule_3_1_1" |
| title | Rule/title | Direct mapping |
| description | Rule/description | Strip HTML tags |
| rationale | Rule/rationale | Strip HTML tags |
| audit | Rule/check/check-content | Strip HTML, wrap in check |
| remediation | Rule/fixtext | Strip HTML tags |
| mitre_mapping | Rule/reference | Parse and map to reference elements |
JSON Schema Field → CSV Column
All fields are flattened into one row per recommendation:
benchmark_title,benchmark_id,ref,title,description,audit,...
"CIS Amazon EKS","22605","3.1.1","Ensure...","If kubelet...","Run...",...
Validation Implementation¶
In WorkbenchScraper¶
import json
import jsonschema

class WorkbenchScraper:
    def __init__(self, session, schema_path='cis_bench/models/schema.json'):
        self.session = session

        # Load schema
        with open(schema_path) as f:
            self.schema = json.load(f)

    def download_benchmark(self, benchmark_url):
        # ... scraping logic ...

        benchmark_data = {
            'title': title,
            'benchmark_id': benchmark_id,
            'recommendations': recommendations,
            # ...
        }

        # VALIDATE before returning
        try:
            jsonschema.validate(instance=benchmark_data, schema=self.schema)
        except jsonschema.ValidationError as e:
            raise ValueError(f"Scraped data validation failed: {e.message}") from e

        return benchmark_data
In Exporters¶
class BaseExporter(ABC):
    def _validate_input(self, data: dict):
        """Validate input data against schema."""
        with open('cis_bench/models/schema.json') as f:
            schema = json.load(f)

        jsonschema.validate(instance=data, schema=schema)
Schema Files¶
We'll maintain multiple schema versions:
cis_bench/models/
├── schema.json # Current schema (symlink to v1.0.0)
├── schemas/
│ ├── v1.0.0.json # Version 1.0.0
│ ├── v1.1.0.json # Version 1.1.0 (future)
│ └── v2.0.0.json # Version 2.0.0 (future)
└── schema_validator.py # Validation utilities
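The contents of schema_validator.py are not pinned down here; a plausible sketch of the utilities it could expose (the function names and the versioned-schema lookup are assumptions):

# cis_bench/models/schema_validator.py (sketch)
import json
from functools import lru_cache
from pathlib import Path

import jsonschema

SCHEMA_DIR = Path(__file__).parent / "schemas"

@lru_cache(maxsize=None)
def load_schema(version: str = "v1.0.0") -> dict:
    """Load and cache one schema version from cis_bench/models/schemas/."""
    return json.loads((SCHEMA_DIR / f"{version}.json").read_text())

def validate_benchmark(data: dict, version: str = "v1.0.0") -> None:
    """Raise jsonschema.ValidationError if data does not match the given version."""
    jsonschema.validate(instance=data, schema=load_schema(version))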
Benefits in Practice¶
Before (No Schema)¶
flowchart LR
A[Scraper extracts data] --> B[Export to XCCDF]
B --> C[❌ Missing field!<br/>XCCDF invalid!<br/>Debug for hours...]
style C fill:#FFE0E0,stroke:#CC0000,stroke-width:2px
After (With Schema)¶
flowchart LR
A[Scraper extracts data] --> B{Validate against schema}
B -->|Invalid| C[❌ Missing 'title' field!<br/>Validation fails immediately<br/>Fix scraper in minutes]
C --> D[Re-scrape]
D --> E{Validate again}
E -->|Valid| F[Export to XCCDF]
F --> G[✅ Valid XCCDF]
style C fill:#FFE0E0,stroke:#CC0000,stroke-width:2px
style G fill:#E0FFE0,stroke:#00CC00,stroke-width:2px
style E fill:#E8F4F8,stroke:#0066CC,stroke-width:2px
style B fill:#E8F4F8,stroke:#0066CC,stroke-width:2px