envoy-xds-controller

Troubleshooting Guide: Envoy XDS Controller

This document provides guidance for diagnosing and resolving common issues with the Envoy XDS Controller.

Table of Contents

  1. Common Issues
  2. Debugging Techniques
  3. Logs and Monitoring
  4. Known Issues
  5. Getting Help

Common Issues

Controller Not Starting

Symptoms:

Possible Causes and Solutions:

  1. Invalid Configuration
    • Check the controller’s configuration in the ConfigMap or Helm values
    • Verify that all required environment variables are set correctly
    • Solution: Correct the configuration and redeploy
  2. Missing Permissions
    • Controller may not have the necessary RBAC permissions
    • Solution: Verify and correct the ClusterRole and ClusterRoleBinding
  3. Webhook Certificate Issues
    • If using the validating webhook, certificate issues can prevent startup
    • Solution: Check the webhook certificate and secret
      kubectl get secret -n envoy-xds-controller envoy-xds-controller-webhook-cert
      

Envoy Not Receiving Configuration

Symptoms:

Possible Causes and Solutions:

  1. Network Connectivity
    • Ensure Envoy can reach the xDS server
    • Check network policies and firewall rules
    • Solution: Verify connectivity with a simple test
      kubectl exec -it <envoy-pod> -- curl -v <xds-server-address>:<port>
      
  2. Node ID Mismatch
    • Envoy’s node ID must match what the controller expects
    • Solution: Verify Envoy’s node ID configuration and the controller’s nodeIds list
  3. TLS Configuration
    • If TLS is enabled, certificate issues can prevent connections
    • Solution: Check TLS certificates and verify Envoy’s TLS configuration

Custom Resources Not Applied

Symptoms:

Possible Causes and Solutions:

  1. Validation Errors
    • CRs may not pass validation
    • Solution: Check controller logs for validation errors and fix the CR definition
  2. Incorrect Node ID
    • CRs must be associated with the correct node ID
    • Solution: Verify the node ID annotation on the CR
      metadata:
      annotations:
        envoy.kaasops.io/node-id: "node1"
      
  3. Controller Not Watching Namespace
    • Controller may not be watching the namespace where CRs are created
    • Solution: Check the watchNamespaces configuration

Debugging Techniques

Enabling Debug Logs

To enable debug logs for the controller:

# For Helm deployment
helm upgrade envoy-xds-controller \
  --namespace envoy-xds-controller \
  --set envs.LOG_LEVEL=debug \
  helm/charts/envoy-xds-controller

# For manual deployment
kubectl set env deployment/envoy-xds-controller -n envoy-xds-controller LOG_LEVEL=debug

Checking Controller Status

kubectl get pods -n envoy-xds-controller
kubectl describe pod -n envoy-xds-controller <controller-pod>
kubectl logs -n envoy-xds-controller <controller-pod>

Verifying Custom Resources

# List all VirtualServices
kubectl get virtualservices.envoy.kaasops.io --all-namespaces

# Describe a specific VirtualService
kubectl describe virtualservice.envoy.kaasops.io <name> -n <namespace>

Checking Envoy Configuration

To check Envoy’s current configuration:

# Using Envoy's admin interface
kubectl exec -it <envoy-pod> -- curl localhost:9901/config_dump

# Check specific configuration
kubectl exec -it <envoy-pod> -- curl localhost:9901/clusters
kubectl exec -it <envoy-pod> -- curl localhost:9901/listeners

Logs and Monitoring

Important Log Patterns

Look for these patterns in the controller logs:

Prometheus Metrics

The controller exposes Prometheus metrics at /metrics endpoint. Key metrics to monitor:

Known Issues

Template Processing Performance

Issue: Processing large templates with many substitutions can be slow.

Workaround: Break down large templates into smaller ones or reduce the number of substitutions.

Webhook Timeout

Issue: Webhook validation may time out for large resources.

Workaround: Increase the webhook timeout in the ValidatingWebhookConfiguration.

timeoutSeconds: 30  # Increase from default 10

Multiple Node IDs

Issue: When using multiple node IDs, resources may be incorrectly assigned.

Workaround: Ensure each resource has the correct node ID annotation and verify the node ID list in the configuration.

Getting Help

If you encounter issues not covered in this guide:

  1. Check the GitHub Issues for similar problems
  2. Search the project documentation
  3. Open a new issue with:
    • Detailed description of the problem
    • Steps to reproduce
    • Relevant logs and configuration
    • Environment details (Kubernetes version, Envoy version, etc.)