Monitoring and Maintenance¶

This guide provides detailed information on monitoring and maintaining the Secure CINC Auditor Kubernetes Container Scanning solution.

Overview¶

Proper monitoring and maintenance are essential for ensuring the reliability and effectiveness of your container scanning solution. This guide covers health monitoring, version management, backup strategies, and other operational aspects.

Health Monitoring¶

Configure comprehensive monitoring to track the health and performance of scanner components:

# monitoring-values.yaml
monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: prometheus
  healthCheck:
    liveness:
      enabled: true
      initialDelaySeconds: 30
      periodSeconds: 10
    readiness:
      enabled: true
      initialDelaySeconds: 5
      periodSeconds: 10
  dashboard:
    enabled: true

Prometheus Integration¶

Configure Prometheus metrics for detailed monitoring:

# prometheus-values.yaml
prometheus:
  metrics:
    enabled: true
    port: 9090
    path: /metrics

  rules:
    - name: scanner-alerts
      groups:
        - name: scanner
          rules:
            - alert: ScannerDown
              expr: up{job="scanner"} == 0
              for: 5m
              labels:
                severity: critical
              annotations:
                summary: "Scanner is down"
                description: "Scanner has been down for more than 5 minutes"

Grafana Dashboards¶

Deploy pre-configured Grafana dashboards for visualization:

# grafana-values.yaml
grafana:
  dashboards:
    - name: scanner-overview
      json: |
        {
          "title": "Scanner Overview",
          "panels": [
            {
              "title": "Active Scans",
              "type": "graph",
              "datasource": "Prometheus"
            },
            {
              "title": "Scan Duration",
              "type": "graph",
              "datasource": "Prometheus"
            }
          ]
        }

Version Updates¶

Manage scanner component versions and updates:

# version-management-values.yaml
versionManagement:
  updateStrategy: RollingUpdate
  maxUnavailable: 1
  maxSurge: 1
  imageUpdateAutomation:
    enabled: true
    schedule: "0 0 * * 0"  # Weekly on Sundays
    testBeforePromotion: true

Automated Updates¶

Configure automated update workflows:

# automated-updates-values.yaml
updates:
  automated:
    enabled: true
    schedule: "0 0 * * 0"  # Weekly on Sundays
    timeWindow:
      start: "01:00"
      end: "05:00"
    notifyOn:
      - success
      - failure

  versioning:
    policy: semver
    allowMajorUpgrades: false
    pinMinorVersion: true

Canary Deployments¶

Implement canary deployments for safer updates:

# canary-values.yaml
deployment:
  strategy:
    type: Canary
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 40
        - pause: {duration: 10m}
        - setWeight: 60
        - pause: {duration: 10m}
        - setWeight: 80
        - pause: {duration: 10m}

Backup and Recovery¶

Configure backup strategies for critical configurations and data:

# backup-values.yaml
backup:
  enabled: true
  schedule: "0 0 * * *"  # Daily at midnight
  retention:
    count: 7
  include:
    - configs
    - results
    - profiles
  storage:
    type: s3
    bucket: scanner-backups

Disaster Recovery¶

Implement disaster recovery procedures:

# disaster-recovery-values.yaml
disasterRecovery:
  enabled: true
  backupLocation: s3://scanner-backups

  restore:
    fromBackup: true
    backupId: latest  # or specific backup ID
    restoreOptions:
      includeConfigs: true
      includeResults: true
      includeProfiles: true

  testing:
    schedule: "0 0 * * 0"  # Weekly on Sundays
    environment: dr-test

Log Management¶

Configure comprehensive logging:

# logging-values.yaml
logging:
  level: info  # debug, info, warn, error
  format: json  # or text

  storage:
    retention:
      days: 30
    rotation:
      maxSize: 100MB
      maxFiles: 10

  exporters:
    - type: elasticsearch
      enabled: true
      host: elasticsearch.example.com
      index: scanner-logs

Resource Cleanup¶

Implement regular resource cleanup to prevent clutter:

# cleanup-values.yaml
cleanup:
  enabled: true
  schedule: "0 0 * * *"  # Daily at midnight

  resources:
    - type: pods
      selector: app=scanner,status=completed
      olderThan: 7d

    - type: scans
      olderThan: 30d
      includeSuccessful: true
      includeFailed: false

Alerting and Notifications¶

Configure alerting for critical events:

# alerting-values.yaml
alerting:
  enabled: true
  providers:
    - name: slack
      enabled: true
      channel: "#security-alerts"
      severities:
        - critical
        - error

    - name: email
      enabled: true
      recipients:
        - security-team@example.com
      severities:
        - critical
        - error
        - warning

  alerts:
    - name: ScanFailure
      description: "Scan failed to complete"
      severity: error

    - name: CriticalVulnerability
      description: "Critical vulnerability detected"
      severity: critical

Performance Monitoring¶

Monitor scanner performance metrics:

# performance-monitoring-values.yaml
performanceMonitoring:
  enabled: true
  metrics:
    - name: scanDuration
      description: "Time taken to complete a scan"

    - name: scanQueueLength
      description: "Number of scans waiting in queue"

    - name: scannerCPUUsage
      description: "CPU usage of scanner pods"

  thresholds:
    - metric: scanDuration
      warning: 300  # seconds
      critical: 600  # seconds

    - metric: scanQueueLength
      warning: 10
      critical: 20

Cluster-Wide Monitoring¶

Implement cluster-wide monitoring for scanner infrastructure:

# cluster-monitoring-values.yaml
clusterMonitoring:
  enabled: true

  components:
    - name: scanner-infrastructure
      selector: app=scanner-infrastructure

    - name: standard-scanner
      selector: app=standard-scanner

    - name: distroless-scanner
      selector: app=distroless-scanner

  resourceUsage:
    pods:
      enabled: true

    nodes:
      enabled: true