Skills Reference

The DevOps Execution Engine ships with 11 DevOps skills, grouped below into Kubernetes, cloud, infrastructure, and operations.


Kubernetes Skills

k8s-debug

Purpose: Troubleshoot Kubernetes pods, deployments, and nodes

Use cases:

  • Diagnose CrashLoopBackOff pods
  • Investigate OOMKilled containers
  • Check resource usage
  • Analyze pod events
  • Review logs across multiple pods

Example commands:

"Debug pods in production namespace"
"Why is api-service crashing?"
"Check resource usage for worker nodes"
"Show me logs for all api pods"

Risk level: LOW (read-only diagnosis)
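
As a rough illustration of the first two use cases, the sketch below scans the JSON printed by `kubectl get pods -o json` and flags containers that are waiting in CrashLoopBackOff or were last terminated with OOMKilled. The function name `find_unhealthy_containers` and the sample data are hypothetical; this is not the skill's actual implementation, only one way such a read-only check could be written.

```python
import json

def find_unhealthy_containers(pod_list_json: str) -> list[tuple[str, str, str]]:
    """Return (pod, container, reason) for crash-looping or OOM-killed containers.

    Expects the JSON document printed by `kubectl get pods -o json`.
    """
    findings = []
    for pod in json.loads(pod_list_json).get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            waiting = status.get("state", {}).get("waiting") or {}
            terminated = status.get("lastState", {}).get("terminated") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                findings.append((pod_name, status["name"], "CrashLoopBackOff"))
            if terminated.get("reason") == "OOMKilled":
                findings.append((pod_name, status["name"], "OOMKilled"))
    return findings


# Hypothetical sample shaped like kubectl output (heavily trimmed).
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"name": "api-7d9f"},
            "status": {"containerStatuses": [{
                "name": "api",
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled"}},
            }]},
        },
        {
            "metadata": {"name": "worker-1"},
            "status": {"containerStatuses": [{
                "name": "worker",
                "state": {"running": {}},
                "lastState": {},
            }]},
        },
    ]
})

for finding in find_unhealthy_containers(SAMPLE):
    print(finding)
```

In practice the input would come from the cluster rather than a literal, for example by capturing the output of `kubectl get pods -n production -o json`.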


k8s-deploy

Purpose: Safe Kubernetes deployment workflows with rollback

Use cases:

  • Deploy new versions
  • Rollback failed deployments
  • Blue-green deployments
  • Canary releases
  • Update deployment configs

Example commands:

"Deploy api-service v2.1.0 to production"
"Rollback the last deployment"
"Show deployment history for api-service"

Risk level: MEDIUM to HIGH (depending on environment)


argocd-gitops

Purpose: GitOps workflows with ArgoCD

Use cases:

  • Check ArgoCD sync status
  • Trigger manual sync
  • Review application health
  • Investigate sync failures
  • Manage ArgoCD applications

Example commands:

"Check ArgoCD sync status"
"Sync the api-service app"
"Why did the sync fail?"

Risk level: MEDIUM


Cloud Skills

aws-ops

Purpose: AWS operations, queries, and resource management

Use cases:

  • List EC2 instances
  • Check RDS status
  • Review S3 buckets
  • Analyze CloudWatch metrics
  • Manage IAM resources

Example commands:

"List all EC2 instances"
"Check RDS database status"
"Show S3 bucket sizes"
"What's the CloudWatch alarm status?"

Risk level: LOW (reads) to HIGH (modifications)


cost-optimization

Purpose: Cloud cost analysis and optimization

Use cases:

  • Analyze AWS spending
  • Find idle resources
  • Identify oversized instances
  • Suggest cost savings
  • Track cost trends

Example commands:

"Analyze AWS costs this month"
"Find underutilized resources"
"Suggest cost optimizations"
"Show me the top 10 expensive resources"

Risk level: LOW (analysis) to MEDIUM (recommendations)


Infrastructure Skills

terraform-workflow

Purpose: Infrastructure as Code workflows and best practices

Use cases:

  • Plan Terraform changes
  • Review state
  • Validate configurations
  • Manage workspaces
  • Detect drift

Example commands:

"Run terraform plan"
"Check terraform state"
"Validate terraform config"
"Detect infrastructure drift"

Risk level: LOW (plan) to CRITICAL (apply/destroy)
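
The risk range above could be encoded as a lookup from Terraform subcommand to risk level. This is a hypothetical sketch: only the pairings of plan with LOW and apply/destroy with CRITICAL come from this page. The `RISK_BY_SUBCOMMAND` table, the LOW rating for `validate` and `show`, and the choice to treat anything unrecognized as CRITICAL are assumptions.

```python
# Risk per Terraform subcommand. plan/apply/destroy follow the risk level
# stated for terraform-workflow; validate and show are assumed to be LOW
# because they do not change infrastructure.
RISK_BY_SUBCOMMAND = {
    "plan": "LOW",
    "validate": "LOW",   # assumption
    "show": "LOW",       # assumption
    "apply": "CRITICAL",
    "destroy": "CRITICAL",
}

def classify_terraform_risk(command: str) -> str:
    """Classify a command line such as 'terraform plan -out=tfplan'."""
    parts = command.split()
    if len(parts) < 2 or parts[0] != "terraform":
        raise ValueError(f"not a terraform command: {command!r}")
    # Unknown subcommands fall back to the most restrictive level (assumption).
    return RISK_BY_SUBCOMMAND.get(parts[1], "CRITICAL")

print(classify_terraform_risk("terraform plan -out=tfplan"))
print(classify_terraform_risk("terraform destroy"))
```

Defaulting to the most restrictive level means a command the table does not know about is never treated as safe by accident.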


docker-ops

Purpose: Docker container operations and debugging

Use cases:

  • List running containers
  • Check container logs
  • Inspect container configs
  • Analyze image sizes
  • Debug networking

Example commands:

"List all running containers"
"Show logs for api container"
"Inspect the nginx image"
"Check container resource usage"

Risk level: LOW (inspect) to MEDIUM (restart/remove)


Operations Skills

incident-response

Purpose: Structured incident response playbooks

Use cases:

  • SEV1/SEV2 incident handling
  • Service outage response
  • High error rate investigation
  • Performance degradation
  • Security incident response

Example commands:

"We have a SEV1 - API is down"
"High error rates in payment service"
"Database is slow"
"Security incident detected"

Risk level: Varies (diagnosis is LOW, mitigation is HIGH)

Workflow:

  1. Triage - Assess severity and impact
  2. Diagnose - Identify root cause
  3. Mitigate - Generate action plan
  4. Approve - Human approves mitigation
  5. Execute - Apply fixes
  6. Verify - Confirm resolution
  7. Document - Create incident report
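
The seven steps above can be read as a strictly ordered pipeline in which Execute is unreachable without explicit human approval. A minimal sketch of that ordering follows; the names `Stage`, `Incident`, and `advance` are hypothetical and not taken from the engine.

```python
from dataclasses import dataclass
from enum import IntEnum

class Stage(IntEnum):
    TRIAGE = 1
    DIAGNOSE = 2
    MITIGATE = 3
    APPROVE = 4
    EXECUTE = 5
    VERIFY = 6
    DOCUMENT = 7

@dataclass
class Incident:
    title: str
    stage: Stage = Stage.TRIAGE
    approved: bool = False

    def advance(self) -> Stage:
        """Move to the next stage; refuse to leave APPROVE without approval."""
        if self.stage is Stage.DOCUMENT:
            raise RuntimeError("workflow already complete")
        if self.stage is Stage.APPROVE and not self.approved:
            raise PermissionError("mitigation plan has not been approved")
        self.stage = Stage(self.stage + 1)
        return self.stage

incident = Incident("API is down")
for _ in range(3):          # Triage -> Diagnose -> Mitigate -> Approve
    incident.advance()
try:
    incident.advance()      # blocked: no human approval yet
except PermissionError as err:
    print(f"blocked: {err}")
incident.approved = True    # a human signs off on the mitigation plan
print(incident.advance().name)
```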

log-analysis

Purpose: Cross-platform log analysis patterns

Use cases:

  • Parse and analyze logs
  • Find error patterns
  • Correlate events
  • Extract metrics
  • Identify anomalies

Example commands:

"Analyze logs for errors"
"Show me 5xx responses in the last hour"
"Find slow queries in postgres logs"
"Correlate API errors with database issues"

Risk level: LOW (read-only)
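
To illustrate a query like "Show me 5xx responses in the last hour", the sketch below filters access-log lines by status code and timestamp. It assumes the common/combined access log format that nginx and Apache use by default; the function name `recent_5xx` and the sample lines are hypothetical, and the skill itself may work quite differently.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp and status code of a common/combined log format line,
# e.g. 10.0.0.1 - - [12/Mar/2024:10:15:00 +0000] "GET /x HTTP/1.1" 502 12
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3}) ')
TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def recent_5xx(lines, now, window=timedelta(hours=1)):
    """Return the lines with a 5xx status logged within `window` before `now`."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        logged_at = datetime.strptime(match["ts"], TS_FORMAT)
        if match["status"].startswith("5") and now - window <= logged_at <= now:
            hits.append(line)
    return hits

# Made-up sample lines for demonstration.
SAMPLE_LINES = [
    '10.0.0.1 - - [12/Mar/2024:10:15:00 +0000] "GET /api/users HTTP/1.1" 502 12',
    '10.0.0.2 - - [12/Mar/2024:10:20:00 +0000] "GET /api/users HTTP/1.1" 200 512',
    '10.0.0.3 - - [12/Mar/2024:08:00:00 +0000] "POST /api/pay HTTP/1.1" 500 0',
]
NOW = datetime.strptime("12/Mar/2024:10:30:00 +0000", TS_FORMAT)

for hit in recent_5xx(SAMPLE_LINES, NOW):
    print(hit)
```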


system-health

Purpose: Quick system health checks (disk, memory, CPU, processes)

Use cases:

  • Cluster health overview
  • Node resource usage
  • Disk space monitoring
  • Memory pressure detection
  • Process monitoring

Example commands:

"Check system health"
"Show disk usage across nodes"
"Check memory usage"
"Is any node under pressure?"

Risk level: LOW (monitoring only)
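
A disk-space check of the kind listed above can be sketched with the Python standard library alone. The 90% warning threshold and the function name `disk_usage_report` are assumptions for illustration; the skill's own checks and thresholds are not documented on this page.

```python
import shutil

def disk_usage_report(path: str = "/", warn_percent: float = 90.0) -> dict:
    """Report used disk space for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_percent = 100.0 * usage.used / usage.total
    return {
        "path": path,
        "used_percent": round(used_percent, 1),
        "free_gib": round(usage.free / 2**30, 1),
        "warning": used_percent >= warn_percent,  # threshold is an assumption
    }

print(disk_usage_report("."))
```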


git-workflow

Purpose: Git workflows, branching strategies, and DevOps practices

Use cases:

  • Check git status
  • Review recent commits
  • Manage branches
  • Resolve merge conflicts
  • CI/CD integration

Example commands:

"Show git status"
"What changed in the last deploy?"
"Check recent commits"
"Create a feature branch"

Risk level: LOW (read) to MEDIUM (commits/pushes)


How Skills Work Together

Example: Complete Incident Response

You: API is returning 500 errors

Clawd uses:
1. system-health - Check cluster nodes
2. k8s-debug - Inspect API pods
3. log-analysis - Analyze error logs
4. incident-response - Structure the response

Clawd: I found the issue:
- Database connection pool exhausted
- API pods hitting memory limits

Would you like me to generate a mitigation plan?

You: yes

Clawd uses:
- k8s-deploy - Generate an execution plan that patches the deployment to increase resources

[You approve, Clawd executes]

Clawd uses:
- k8s-debug - Verify fix
- log-analysis - Confirm errors stopped

Example: Cost Optimization Audit

You: Analyze our AWS costs

Clawd uses:
1. aws-ops - List all resources
2. cost-optimization - Analyze usage patterns
3. system-health - Check resource utilization

Clawd: Found potential savings:
- 5 idle EC2 instances ($800/month)
- 3 oversized RDS databases ($1200/month)
- Unattached EBS volumes ($150/month)

Total potential savings: $2150/month

Would you like detailed recommendations?

Skill Configuration

Enable Specific Skills

In config.yaml:

enabled_skills:
  - k8s-debug
  - k8s-deploy
  - incident-response
  - system-health

Leave enabled_skills empty to enable all skills.
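
The rule above (an empty list enables everything) can be sketched as follows. `ALL_SKILLS` lists the 11 skills documented on this page; the function name, the treatment of a missing list as empty, and the decision to reject unknown names are assumptions rather than documented behavior.

```python
ALL_SKILLS = [
    "k8s-debug", "k8s-deploy", "argocd-gitops", "aws-ops",
    "cost-optimization", "terraform-workflow", "docker-ops",
    "incident-response", "log-analysis", "system-health", "git-workflow",
]

def resolve_enabled_skills(enabled_skills=None):
    """Return the skills to load; an empty (or missing) list means all of them."""
    if not enabled_skills:
        return list(ALL_SKILLS)
    # Rejecting unknown names is an assumption; it surfaces typos early.
    unknown = [name for name in enabled_skills if name not in ALL_SKILLS]
    if unknown:
        raise ValueError(f"unknown skills: {', '.join(unknown)}")
    return list(enabled_skills)

print(resolve_enabled_skills(["k8s-debug", "system-health"]))
print(len(resolve_enabled_skills([])))
```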

Skill-Specific Config

Some skills support additional configuration:

skills:
  k8s-debug:
    default_namespace: production
    log_tail_lines: 100
    
  aws-ops:
    default_region: us-east-1
    profile: production
    
  cost-optimization:
    savings_threshold: 100  # Only suggest if >$100/month
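
As an illustration of how a setting like `savings_threshold` might be applied, the sketch below filters findings by estimated monthly saving and totals what remains. The figures are the ones from the cost optimization audit example earlier on this page; the function name `filter_savings` and the strict greater-than comparison (read from the ">$100/month" comment) are assumptions about the skill's behavior.

```python
def filter_savings(findings, savings_threshold=100):
    """Keep findings whose monthly saving exceeds the threshold (in USD)."""
    return {name: usd for name, usd in findings.items() if usd > savings_threshold}

# Monthly savings in USD, from the cost optimization audit example.
FINDINGS = {
    "idle EC2 instances": 800,
    "oversized RDS databases": 1200,
    "unattached EBS volumes": 150,
}

kept = filter_savings(FINDINGS, savings_threshold=100)
print(sum(kept.values()))  # 2150: all three findings exceed $100/month
print(filter_savings(FINDINGS, savings_threshold=500))
```

Raising the threshold to 500 drops the EBS finding, which is the intended effect: small savings stop cluttering the recommendations.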

Adding Custom Skills

  1. Create a directory for the skill under skills/
  2. Add a SKILL.md that documents the skill
  3. Create execution plan templates
  4. Test the skill in isolation
  5. Submit a PR

See CONTRIBUTING.md for details.
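
Before submitting a PR, a quick local check that the new skill directory has the expected shape can catch omissions. The sketch below verifies only what this page states (a directory under skills/ that contains SKILL.md); the function name `check_skill_dir` and the skill name `my-skill` are hypothetical, and CONTRIBUTING.md remains the authority on the full requirements.

```python
from pathlib import Path
import tempfile

def check_skill_dir(skill_dir: Path) -> list[str]:
    """Return a list of problems found in a skill directory (empty if none)."""
    if not skill_dir.is_dir():
        return [f"{skill_dir} is not a directory"]
    problems = []
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.is_file():
        problems.append("missing SKILL.md")
    elif not skill_md.read_text(encoding="utf-8").strip():
        problems.append("SKILL.md is empty")
    return problems

# Demonstration against a throwaway directory; "my-skill" is a made-up name.
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "skills" / "my-skill"
    skill.mkdir(parents=True)
    print(check_skill_dir(skill))  # SKILL.md has not been created yet
    (skill / "SKILL.md").write_text("# my-skill\n", encoding="utf-8")
    print(check_skill_dir(skill))
```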


Skill Maturity

Skill               Maturity  Test Coverage  Documentation
k8s-debug           Stable    High           Complete
k8s-deploy          Stable    High           Complete
argocd-gitops       Stable    Medium         Complete
aws-ops             Stable    Medium         Complete
cost-optimization   Beta      Medium         Complete
terraform-workflow  Stable    Medium         Complete
docker-ops          Stable    High           Complete
incident-response   Stable    High           Complete
log-analysis        Stable    Medium         Complete
system-health       Stable    High           Complete
git-workflow        Stable    Medium         Complete

Next Steps

  • Try each skill in read-only mode first
  • Review generated execution plans
  • Start with LOW risk operations
  • Build confidence over time
  • See EXAMPLES.md for detailed usage examples

Each skill follows the same safety model: Plan → Approve → Execute