[Phase 4] Implement auto-remediation workflow with Kestra #19

Open
opened 2025-12-21 13:04:22 +00:00 by Damien · 0 comments
Owner

Description

⚠️ Update: With Kestra, auto-remediation is handled by a dedicated workflow that a Flow Trigger starts when drift is detected, with support for approval policies.

Add auto-remediation so that detected configuration drift is corrected automatically.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│              Drift Detection Service                             │
│              (gNMI Subscribe)                                    │
└──────────────────────────┬──────────────────────────────────────┘
                           │ POST webhook
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│              Kestra: drift-detected.yml                          │
│              → Log, Notify, Emit internal event                  │
└──────────────────────────┬──────────────────────────────────────┘
                           │ Flow Trigger
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│              Kestra: drift-remediation.yml                       │
│              → Check policy                                      │
│              → Auto-approve OR wait for human approval           │
│              → Trigger fabric-reconcile                          │
└─────────────────────────────────────────────────────────────────┘
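
A minimal sketch, in Python, of the `POST webhook` step at the top of the diagram. It assumes that `drift-detected` exposes a Kestra Webhook trigger at `/api/v1/executions/webhook/{namespace}/{flowId}/{key}` (the exact path may include a tenant segment depending on the Kestra version) and that the JSON body carries the same three fields as the remediation inputs. The function name, base URL, and key are hypothetical.

```python
import json
from urllib.request import Request


def build_drift_webhook(base_url: str, key: str, device: str,
                        path: str, drift_type: str) -> Request:
    """Build (but do not send) the POST that reports a drift to Kestra.

    Assumption: the drift-detected flow has a Webhook trigger whose key is `key`.
    """
    url = (
        f"{base_url.rstrip('/')}/api/v1/executions/webhook/"
        f"network.fabric/drift-detected/{key}"
    )
    # Field names mirror the inputs of drift-remediation (device, path, drift_type).
    body = json.dumps(
        {"device": device, "path": path, "drift_type": drift_type}
    ).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request; keeping construction separate makes the payload easy to check without a running Kestra instance.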

Workflow: drift-remediation.yml

id: drift-remediation
namespace: network.fabric
description: Auto-remediation workflow triggered by drift detection

inputs:
  - id: device
    type: STRING
    required: true
  - id: path
    type: STRING
    required: true
  - id: drift_type
    type: STRING
    description: "Type of drift: vlan, bgp, mlag, vrf, interface"

tasks:
  - id: check_cooldown
    type: io.kestra.plugin.core.kv.Get
    key: "remediation_cooldown_{{ inputs.device }}"
    errorOnMissing: false

  - id: skip_if_cooldown
    type: io.kestra.plugin.core.flow.If
    condition: "{{ outputs.check_cooldown.value != null }}"
    then:
      - id: log_cooldown_skip
        type: io.kestra.plugin.core.log.Log
        message: "⏸️ Skipping remediation for {{ inputs.device }} - cooldown active"
      # Assumes Kestra 0.21+, where the Exit task ends the execution early.
      - id: end_cooldown
        type: io.kestra.plugin.core.execution.Exit
        state: SUCCESS

  - id: get_policy
    type: io.kestra.plugin.core.http.Request
    uri: "{{ secret('POLICY_API_URL') }}/policies/{{ inputs.drift_type }}"
    method: GET

  - id: check_auto_approve
    type: io.kestra.plugin.core.flow.Switch
    # The HTTP response body is a string, so parse it before reading a field.
    value: "{{ fromJson(outputs.get_policy.body).auto_approve }}"
    cases:
      "true":
        - id: auto_remediate
          type: io.kestra.plugin.core.flow.Subflow
          namespace: network.fabric
          flowId: fabric-reconcile
          inputs:
            device: "{{ inputs.device }}"
            auto_apply: true
          wait: true

      "false":
        - id: request_approval
          type: io.kestra.plugin.core.flow.Pause
          timeout: PT24H
          onResume:
            - id: approval_status
              type: STRING
              defaults: "approved"

        - id: check_approval
          type: io.kestra.plugin.core.flow.If
          # Values entered on resume are exposed under the Pause task's onResume output.
          condition: "{{ outputs.request_approval.onResume.approval_status == 'approved' }}"
          then:
            - id: approved_remediate
              type: io.kestra.plugin.core.flow.Subflow
              namespace: network.fabric
              flowId: fabric-reconcile
              inputs:
                device: "{{ inputs.device }}"
                auto_apply: true

  - id: set_cooldown
    type: io.kestra.plugin.core.kv.Set
    key: "remediation_cooldown_{{ inputs.device }}"
    value: "{{ now() }}"
    ttl: PT5M  # 5 minutes cooldown

  - id: notify_remediation
    type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
    url: "{{ secret('SLACK_WEBHOOK') }}"
    payload: |
      {
        "text": "🔧 *Auto-Remediation Completed*",
        "attachments": [{
          "color": "good",
          "fields": [
            {"title": "Device", "value": "{{ inputs.device }}", "short": true},
            {"title": "Drift Type", "value": "{{ inputs.drift_type }}", "short": true},
            {"title": "Path", "value": "{{ inputs.path }}", "short": false}
          ]
        }]
      }

triggers:
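  - id: on_drift_detected
    type: io.kestra.plugin.core.trigger.Flow
    # Assumes drift-detected declares flow outputs named device, path, and drift_type.
    inputs:
      device: "{{ trigger.outputs.device }}"
      path: "{{ trigger.outputs.path }}"
      drift_type: "{{ trigger.outputs.drift_type }}"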
    conditions:
      - type: io.kestra.plugin.core.condition.ExecutionStatusCondition
        in:
          - SUCCESS
      - type: io.kestra.plugin.core.condition.ExecutionNamespaceCondition
        namespace: network.fabric
        comparison: EQUALS
      - type: io.kestra.plugin.core.condition.ExecutionFlowCondition
        flowId: drift-detected

errors:
  - id: notify_failure
    type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
    url: "{{ secret('SLACK_WEBHOOK') }}"
    payload: |
      {
        "text": "❌ *Auto-Remediation Failed*",
        "attachments": [{
          "color": "danger",
          "fields": [
            {"title": "Device", "value": "{{ inputs.device }}", "short": true},
            {"title": "Error", "value": "{{ errorMessage }}", "short": false}
          ]
        }]
      }

Remediation policies

Policies are stored in a YAML file or in the Kestra KV store:

# policies/remediation.yml
policies:
  vlan:
    enabled: true
    auto_approve: true
    max_attempts: 3
    
  bgp:
    enabled: true
    auto_approve: false  # Require human approval
    max_attempts: 2
    
  mlag:
    enabled: false  # Never auto-remediate MLAG
    
  vrf:
    enabled: true
    auto_approve: false
    max_attempts: 2
    
  interface:
    enabled: true
    auto_approve: true
    max_attempts: 3
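
A minimal sketch, in Python, of how such a policy could be turned into a decision, including the `enabled` flag and `max_attempts`. The `Action` enum, the `decide` function, and the rule that unknown drift types are skipped are assumptions made for illustration; the dictionary simply mirrors the YAML above.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip"
    AUTO_REMEDIATE = "auto_remediate"
    REQUIRE_APPROVAL = "require_approval"


# Same content as policies/remediation.yml, inlined to avoid a YAML dependency.
POLICIES = {
    "vlan": {"enabled": True, "auto_approve": True, "max_attempts": 3},
    "bgp": {"enabled": True, "auto_approve": False, "max_attempts": 2},
    "mlag": {"enabled": False},
    "vrf": {"enabled": True, "auto_approve": False, "max_attempts": 2},
    "interface": {"enabled": True, "auto_approve": True, "max_attempts": 3},
}


def decide(policies: dict, drift_type: str, attempts: int = 0) -> Action:
    """Return what to do for one drift, given how many attempts were already made."""
    policy = policies.get(drift_type)
    # Assumption: unknown or disabled drift types are never remediated.
    if policy is None or not policy.get("enabled", False):
        return Action.SKIP
    # Stop once the attempt budget for this drift type is used up.
    if attempts >= policy.get("max_attempts", 0):
        return Action.SKIP
    if policy.get("auto_approve", False):
        return Action.AUTO_REMEDIATE
    return Action.REQUIRE_APPROVAL
```

For example, `decide(POLICIES, "bgp")` asks for human approval, while `decide(POLICIES, "mlag")` skips remediation because the policy is disabled.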

Safety Features

| Feature | Implementation |
|---------|----------------|
| Cooldown | KV store entry with a 5-minute TTL per device |
| Max attempts | Counter in the KV store, reset after a success |
| Human approval | io.kestra.plugin.core.flow.Pause with a 24-hour timeout |
| Circuit breaker | Automatic disabling after 3 consecutive failures |
| Audit trail | Kestra logs + Slack notifications |
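
A minimal in-memory sketch, in Python, of how the cooldown, the failure counter, and the circuit breaker could interact. It is not the Kestra implementation: the `RemediationGuard` class is hypothetical, state lives in plain dictionaries rather than the KV store, and tracking failures per device is an assumption.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class RemediationGuard:
    cooldown_seconds: float = 300.0   # 5-minute cooldown per device
    breaker_threshold: int = 3        # consecutive failures before disabling
    clock: Callable[[], float] = time.monotonic
    _last_run: Dict[str, float] = field(default_factory=dict)
    _failures: Dict[str, int] = field(default_factory=dict)

    def allowed(self, device: str) -> bool:
        """True if a remediation may start for this device right now."""
        # Circuit breaker: stay closed to new runs after too many failures.
        if self._failures.get(device, 0) >= self.breaker_threshold:
            return False
        last = self._last_run.get(device)
        # Cooldown: require a minimum delay since the previous run.
        return last is None or self.clock() - last >= self.cooldown_seconds

    def record(self, device: str, success: bool) -> None:
        """Store the outcome of a run; a success resets the failure counter."""
        self._last_run[device] = self.clock()
        if success:
            self._failures[device] = 0
        else:
            self._failures[device] = self._failures.get(device, 0) + 1
```

Injecting `clock` keeps the sketch testable without sleeping. In the flow itself, the KV store TTL would play the role of the cooldown timer.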

Tasks

  • Create kestra/flows/drift-remediation.yml
  • Implement the policy system (YAML or API)
  • Configure the cooldown with the KV store
  • Add the Flow Trigger from drift-detected
  • Implement the human approval flow
  • Add notifications (Slack/Discord)
  • Test with different drift scenarios
  • Document the policies

Dependencies

  • #16 (Drift detection service)
  • #10 (fabric-reconcile workflow)

Output

  • kestra/flows/drift-remediation.yml
  • kestra/namespace-files/policies/remediation.yml
Damien added the phase-4-event-driven label 2025-12-21 13:04:27 +00:00
Damien changed title from [Phase 4] Add auto-remediation option for drift to [Phase 4] Implement auto-remediation workflow with Kestra 2026-01-10 13:16:22 +00:00