Monitoring and Maintenance
This guide provides detailed information on monitoring and maintaining the Secure CINC Auditor Kubernetes Container Scanning solution.
Overview
Proper monitoring and maintenance are essential for ensuring the reliability and effectiveness of your container scanning solution. This guide covers health monitoring, version management, backup strategies, and other operational aspects.
Health Monitoring
Configure comprehensive monitoring to track the health and performance of scanner components:
| # monitoring-values.yaml
monitoring:
enabled: true
serviceMonitor:
enabled: true
additionalLabels:
release: prometheus
healthCheck:
liveness:
enabled: true
initialDelaySeconds: 30
periodSeconds: 10
readiness:
enabled: true
initialDelaySeconds: 5
periodSeconds: 10
dashboard:
enabled: true
|
Prometheus Integration
Configure Prometheus metrics for detailed monitoring:
| # prometheus-values.yaml
prometheus:
metrics:
enabled: true
port: 9090
path: /metrics
rules:
- name: scanner-alerts
groups:
- name: scanner
rules:
- alert: ScannerDown
expr: up{job="scanner"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Scanner is down"
description: "Scanner has been down for more than 5 minutes"
|
Grafana Dashboards
Deploy pre-configured Grafana dashboards for visualization:
| # grafana-values.yaml
grafana:
dashboards:
- name: scanner-overview
json: |
{
"title": "Scanner Overview",
"panels": [
{
"title": "Active Scans",
"type": "graph",
"datasource": "Prometheus"
},
{
"title": "Scan Duration",
"type": "graph",
"datasource": "Prometheus"
}
]
}
|
Version Updates
Manage scanner component versions and updates:
| # version-management-values.yaml
versionManagement:
updateStrategy: RollingUpdate
maxUnavailable: 1
maxSurge: 1
imageUpdateAutomation:
enabled: true
schedule: "0 0 * * 0" # Weekly on Sundays
testBeforePromotion: true
|
Automated Updates
Configure automated update workflows:
| # automated-updates-values.yaml
updates:
automated:
enabled: true
schedule: "0 0 * * 0" # Weekly on Sundays
timeWindow:
start: "01:00"
end: "05:00"
notifyOn:
- success
- failure
versioning:
policy: semver
allowMajorUpgrades: false
pinMinorVersion: true
|
Canary Deployments
Implement canary deployments for safer updates:
| # canary-values.yaml
deployment:
strategy:
type: Canary
canary:
steps:
- setWeight: 20
- pause: {duration: 10m}
- setWeight: 40
- pause: {duration: 10m}
- setWeight: 60
- pause: {duration: 10m}
- setWeight: 80
- pause: {duration: 10m}
|
Backup and Recovery
Configure backup strategies for critical configurations and data:
| # backup-values.yaml
backup:
enabled: true
schedule: "0 0 * * *" # Daily at midnight
retention:
count: 7
include:
- configs
- results
- profiles
storage:
type: s3
bucket: scanner-backups
|
Disaster Recovery
Implement disaster recovery procedures:
| # disaster-recovery-values.yaml
disasterRecovery:
enabled: true
backupLocation: s3://scanner-backups
restore:
fromBackup: true
backupId: latest # or specific backup ID
restoreOptions:
includeConfigs: true
includeResults: true
includeProfiles: true
testing:
schedule: "0 0 * * 0" # Weekly on Sundays
environment: dr-test
|
Log Management
Configure comprehensive logging:
| # logging-values.yaml
logging:
level: info # debug, info, warn, error
format: json # or text
storage:
retention:
days: 30
rotation:
maxSize: 100MB
maxFiles: 10
exporters:
- type: elasticsearch
enabled: true
host: elasticsearch.example.com
index: scanner-logs
|
Resource Cleanup
Implement regular resource cleanup to prevent clutter:
| # cleanup-values.yaml
cleanup:
enabled: true
schedule: "0 0 * * *" # Daily at midnight
resources:
- type: pods
selector: app=scanner,status=completed
olderThan: 7d
- type: scans
olderThan: 30d
includeSuccessful: true
includeFailed: false
|
Alerting and Notifications
Configure alerting for critical events:
| # alerting-values.yaml
alerting:
enabled: true
providers:
- name: slack
enabled: true
channel: "#security-alerts"
severities:
- critical
- error
- name: email
enabled: true
recipients:
- security-team@example.com
severities:
- critical
- error
- warning
alerts:
- name: ScanFailure
description: "Scan failed to complete"
severity: error
- name: CriticalVulnerability
description: "Critical vulnerability detected"
severity: critical
|
Monitor scanner performance metrics:
| # performance-monitoring-values.yaml
performanceMonitoring:
enabled: true
metrics:
- name: scanDuration
description: "Time taken to complete a scan"
- name: scanQueueLength
description: "Number of scans waiting in queue"
- name: scannerCPUUsage
description: "CPU usage of scanner pods"
thresholds:
- metric: scanDuration
warning: 300 # seconds
critical: 600 # seconds
- metric: scanQueueLength
warning: 10
critical: 20
|
Cluster-Wide Monitoring
Implement cluster-wide monitoring for scanner infrastructure:
| # cluster-monitoring-values.yaml
clusterMonitoring:
enabled: true
components:
- name: scanner-infrastructure
selector: app=scanner-infrastructure
- name: standard-scanner
selector: app=standard-scanner
- name: distroless-scanner
selector: app=distroless-scanner
resourceUsage:
pods:
enabled: true
nodes:
enabled: true
|