This document provides guidance for diagnosing and resolving common issues with the Envoy XDS Controller.
Symptoms:
Possible Causes and Solutions:
kubectl get secret -n envoy-xds-controller envoy-xds-controller-webhook-cert
Symptoms:
Possible Causes and Solutions:
kubectl exec -it <envoy-pod> -- curl -v <xds-server-address>:<port>
Symptoms:
Possible Causes and Solutions:
metadata:
  annotations:
    envoy.kaasops.io/node-id: "node1"
watchNamespaces configuration
To enable debug logs for the controller:
# For Helm deployment
helm upgrade envoy-xds-controller \
--namespace envoy-xds-controller \
--set envs.LOG_LEVEL=debug \
helm/charts/envoy-xds-controller
# For manual deployment
kubectl set env deployment/envoy-xds-controller -n envoy-xds-controller LOG_LEVEL=debug
kubectl get pods -n envoy-xds-controller
kubectl describe pod -n envoy-xds-controller <controller-pod>
kubectl logs -n envoy-xds-controller <controller-pod>
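As a rough sketch, the output of the last command can be piped through a small filter that counts the log patterns this guide highlights (ERROR, Reconciling, Validation failed, Updated snapshot). The function name count_log_patterns is hypothetical and not part of the controller, and the exact log wording may differ between versions.

```shell
# Count how many log lines contain each pattern of interest.
# Reads log text on stdin and prints "<pattern>: <count>" for each pattern.
# count_log_patterns is an illustrative helper, not part of the controller.
count_log_patterns() {
  log=$(cat)
  for pattern in "ERROR" "Reconciling" "Validation failed" "Updated snapshot"; do
    # grep -c prints 0 but exits non-zero when nothing matches; ignore the status.
    count=$(printf '%s\n' "$log" | grep -c -F -- "$pattern" || true)
    printf '%s: %s\n' "$pattern" "$count"
  done
}

# Usage (placeholder pod name, as elsewhere in this guide):
#   kubectl logs -n envoy-xds-controller <controller-pod> | count_log_patterns
```

A high ERROR or Validation failed count relative to Updated snapshot suggests that resources are being rejected rather than applied.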
# List all VirtualServices
kubectl get virtualservices.envoy.kaasops.io --all-namespaces
# Describe a specific VirtualService
kubectl describe virtualservice.envoy.kaasops.io <name> -n <namespace>
To check Envoy’s current configuration:
# Using Envoy's admin interface
kubectl exec -it <envoy-pod> -- curl localhost:9901/config_dump
# Check specific configuration
kubectl exec -it <envoy-pod> -- curl localhost:9901/clusters
kubectl exec -it <envoy-pod> -- curl localhost:9901/listeners
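The /clusters output can be long. As a sketch, assuming Envoy's plain-text format in which each host reports a line of the form cluster::address::health_flags::value, the filter below prints only hosts whose flags are something other than healthy. The name unhealthy_hosts is hypothetical, and IPv6 addresses containing "::" would not parse correctly with this simple split.

```shell
# Print "<cluster> <address> <flags>" for every host not reported as healthy.
# Reads the plain-text output of Envoy's /clusters admin endpoint on stdin.
# unhealthy_hosts is an illustrative helper, not an Envoy or controller command.
unhealthy_hosts() {
  awk -F'::' '$3 == "health_flags" && $4 != "healthy" { print $1, $2, $4 }'
}

# Usage (placeholder pod name, as elsewhere in this guide):
#   kubectl exec <envoy-pod> -- curl -s localhost:9901/clusters | unhealthy_hosts
```

Empty output means every host listed by Envoy reported a healthy flag; it does not confirm that the expected clusters are present.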
Look for these patterns in the controller logs:
ERROR - Critical errors that need attention
Reconciling - Shows reconciliation of resources
Validation failed - Indicates validation issues with CRs
Updated snapshot - Indicates successful configuration updates
The controller exposes Prometheus metrics at the /metrics endpoint. Key metrics to monitor:
controller_runtime_reconcile_total - Total number of reconciliations
controller_runtime_reconcile_errors_total - Total number of reconciliation errors
xds_cache_updates_total - Number of xDS cache updates
xds_cache_update_errors_total - Number of xDS cache update errors
Issue: Processing large templates with many substitutions can be slow.
Workaround: Break down large templates into smaller ones or reduce the number of substitutions.
Issue: Webhook validation may time out for large resources.
Workaround: Increase the webhook timeout in the ValidatingWebhookConfiguration. Kubernetes accepts values from 1 to 30 seconds.
timeoutSeconds: 30 # Increase from default 10
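One way to apply this setting without editing the object by hand is a JSON Patch. The sketch below only builds the patch string; webhook_timeout_patch is a hypothetical helper, and the webhook index and configuration name are assumptions that depend on how the controller was installed. If the configuration is managed by Helm, a later upgrade may overwrite the change.

```shell
# Build a JSON Patch that sets timeoutSeconds on one webhook entry.
# $1 = index of the webhook in .webhooks, $2 = timeout in seconds.
# webhook_timeout_patch is an illustrative helper, not part of the controller.
webhook_timeout_patch() {
  index=$1
  seconds=$2
  # Kubernetes only accepts timeoutSeconds between 1 and 30.
  if [ "$seconds" -lt 1 ] || [ "$seconds" -gt 30 ]; then
    echo "timeoutSeconds must be between 1 and 30" >&2
    return 1
  fi
  # "add" sets the field whether or not it is already present.
  printf '[{"op":"add","path":"/webhooks/%s/timeoutSeconds","value":%s}]\n' "$index" "$seconds"
}

# Usage (the configuration name is a placeholder; look it up with
# `kubectl get validatingwebhookconfigurations`):
#   kubectl patch validatingwebhookconfiguration <webhook-config-name> \
#     --type=json -p "$(webhook_timeout_patch 0 30)"
```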
Issue: When using multiple node IDs, resources may be incorrectly assigned.
Workaround: Ensure each resource has the correct node ID annotation and verify the node ID list in the configuration.
If you encounter issues not covered in this guide: