How to debug failures in an Istio service mesh

By Chuning Song on Apr 27, 2022

Disclaimer: this blog post assumes you are proficient with Kubernetes and Istio.


Istio is great…if it plays ball.

This is one of the first things I heard from my Kubernetes instructor, way back when I started working as a DevOps engineer.

I didn’t fully grasp what he meant by that. Surely, I thought, Kubernetes is already hard enough to befriend; Istio can’t be more cryptic than that?

I couldn’t be more wrong.

Why would Istio be hard to work with?

A couple of reasons pop into my mind:

  1. It doesn’t have a static syntax checker.

    Most modern IDEs and text editors have plugins that lint and offer suggestions for Kubernetes manifests. The one I use (IntelliJ IDEA) even tries to do semantic cross-checks for you. For instance, if you refer to a combination of labels in your Service definition but the IDE can’t find that combination in the Pod template defined in your Deployment or Job, it will underline it with red wavy lines. It will even pre-populate the essential fields for you.
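
    For example, here is the kind of mismatch such a cross-check catches (a minimal sketch with illustrative names): the Service selects app: web, but the pod template only carries app: frontend, so the Service would never select any pods.

    apiVersion: v1
    kind: Service
    metadata:
      name: web
    spec:
      selector:
        app: web             # the IDE cross-checks this selector...
      ports:
        - port: 80
          targetPort: 80
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      selector:
        matchLabels:
          app: frontend
      template:
        metadata:
          labels:
            app: frontend    # ...against these labels and flags the mismatch
        spec:
          containers:
            - name: web
              image: nginx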

    Istio, on the other hand, is a bit lacking in this regard. I’ve yet to see a plugin that can intelligently proofread or lint Istio manifests for me. In fact, my IDE doesn’t even recognise the resources under *.istio.io/v1beta1; it marks them up as if I were referring to non-existent resources.

  2. It doesn’t have an effective way to report misconfigurations.

    One thing that sets Kubernetes apart is the abundance of techniques and toolkits that help you debug an error when it arises. The most obvious errors won’t even be accepted by the API server. The less obvious ones get rejected by admission webhooks (i.e. the API server accepts the request, but an admission controller rejects it because it violates certain conditions). If a resource is accepted but not behaving as expected, you can always kubectl describe the resource, kubectl get events for recent events, or kubectl logs to look at the logs. There will be some hints. This process is ingrained in the mind of every Kubernetes practitioner, almost like a boxer’s one-two combo.
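
    In shell form, that combo looks roughly like this (resource names are placeholders):

    # 1. Did the API server and admission webhooks even accept it?
    kubectl apply -f manifest.yaml

    # 2. Is the resource in the state you expect?
    kubectl describe deploy/my-app

    # 3. What has the cluster been doing with it lately?
    kubectl get events --sort-by=.metadata.creationTimestamp

    # 4. What does the workload itself say?
    kubectl logs deploy/my-app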

    Istio, on the other hand, is again a bit lacking. The API server doesn’t speak Istio, so it treats Istio resources like any other CRDs and accepts whatever you give it. A few admission webhooks have been added, but their coverage is no match for native Kubernetes. Most Istio resources (such as VirtualService or ServiceEntry) don’t report events, so kubectl describe won’t yield anything. And although istiod usually logs some hints about why something doesn’t work, those logs live in istio-system, and you don’t always have access to that namespace. Plus, the control plane is usually shared by quite a lot of namespaces, so a useful line of log can drown in a wall of text.
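
    If you do have access to istio-system, the control-plane logs are still worth a look. A minimal sketch, assuming a default installation where the control plane runs as the istiod deployment:

    # Tail the control plane and filter for your namespace (placeholder)
    kubectl logs -n istio-system deploy/istiod --tail=200 | grep -i {{YOUR_NAMESPACE}}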

  3. Envoy is cryptic.

    The open-source community has a long tradition of favouring simple but effective solutions over ones with a charming UI. Envoy is no exception to that rule. The key to resolving an issue with Istio might be hidden in a log entry like this:

    {"upstream_transport_failure_reason":"delayed connect error: 111","downstream_remote_address":"xx.xx.xx.xx:56494", ... "downstream_local_address":"xx.xx.xx.xx:5557","protocol":"HTTP/2","user_agent":"Wget","upstream_cluster":"inbound|5557||","response_flags":"UF"}

    We’ll get to it later in this blog post.

1-2 in Istio

Can we have some sort of combo that would prove to be useful when we run into issues with Istio? Absolutely.

You should familiarise yourself with the demo project located at songford/istio-debugging.

  1. istioctl analyze

    istioctl has come a long way from its early days, when it was just an in-place editor that added sidecars to Kubernetes manifests. One key feature that graduated in version 1.5 is istioctl analyze.

    Please take a close look at 1.yaml in the demo repository. I have intentionally planted something dodgy in this manifest. Can you spot it? An avid reader like you might notice the issues immediately, but kindly bear with me.

    Now let’s run kubectl apply -f 1.yaml on your practice cluster. The cluster should take it without any complaint. At this stage, we know both the API server and admission controller failed to spot the error.

    You can attach to debugger by running kubectl exec -it deploy/debugger -- /bin/sh. Run wget -O- http://nginx in that session and you should get

    / # wget -O- http://nginx
    Connecting to nginx (172.20.23.127:80)
    wget: server returned error: HTTP/1.1 502 Bad Gateway
    

    Hmm, less than ideal, isn’t it?

    Let’s try running istioctl analyze.

    N.B. You might need to supply a longer timeout if your cluster is quite large, e.g. istioctl analyze --timeout 600s (the default is 30s).

    On a completely fresh Istio installation, here is the output I get:

    Error [IST0101] (VirtualService default/nginx) Referenced host+subset in destinationrule not found: "nginx+nginxs"

    Info [IST0102] (Namespace default) The namespace is not enabled for Istio injection. Run 'kubectl label namespace default istio-injection=enabled' to enable it, or 'kubectl label namespace default istio-injection=disabled' to explicitly mark it as not needing injection.

    Info [IST0118] (Service default/nginx) Port name 8080-8080 (port: 8080, targetPort: 8080) doesn't follow the naming convention of Istio port.

    Error: Analyzers found issues when analyzing namespace: default.

    See https://istio.io/v1.13/docs/reference/config/analysis for more information about causes and resolutions.

    Voila! It turns out the subset referenced in the VirtualService can’t be matched to any subset defined in the DestinationRule. On this occasion it sounds trivial, but this kind of issue can be quite nasty to resolve in production, when your manifests have grown large and it catches you off guard.

    istioctl also reports that the namespace isn’t labelled for automatic injection. This isn’t necessarily an error (hence Info, not Error), since sometimes manual injection on a case-by-case basis is needed to work around certain legacy workloads.

    The third finding, [IST0118], is also quite interesting. The 8080-8080 style is a common default for naming a port on a Kubernetes Service, following {{port}}-{{targetPort}}, but it doesn’t agree with Istio’s port naming convention, in which the name should signify the protocol the port carries (e.g. http-8080 or grpc-8080); that name decides how Istio treats the traffic flowing through it. Again, this doesn’t necessarily result in errors, since Istio can often recognise HTTP/HTTP2 traffic automatically and falls back to TCP otherwise. Oftentimes that works, but sometimes it doesn’t. More on this later.
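
    To make the first and third findings concrete, here is a minimal sketch of the same class of mistakes (illustrative only, not the literal contents of 1.yaml): the VirtualService routes to a subset name the DestinationRule never defines, and the Service port name carries no protocol prefix.

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: nginx
    spec:
      hosts:
        - nginx
      http:
        - route:
            - destination:
                host: nginx
                subset: nginxs     # no such subset is defined -> IST0101
    ---
    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: nginx
    spec:
      host: nginx
      subsets:
        - name: v1                 # the only subset actually defined
          labels:
            version: v1
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx
    spec:
      selector:
        app: nginx
      ports:
        - name: 8080-8080          # IST0118: rename to http-8080 so Istio knows the protocol
          port: 8080
          targetPort: 8080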

    Fix the issues in 1.yaml until istioctl analyze yields

    ✔ No validation issues found when analyzing namespace: default.

    You can use 2.yaml if you couldn’t be bothered. After the issues are fixed, attach to debugger and run wget -O- http://nginx again. Now you should see

    Connecting to nginx (172.20.23.127:80)
    ...
    <title>Welcome to nginx!</title>
    ...
    
  2. istioctl x describe pod/svc

    The capability to describe a pod or a service with istioctl was added in version 1.3. This feature hasn’t graduated yet, hence the x (experimental) in the command.

    Let’s re-apply the bad manifest by kubectl apply -f 1.yaml. Now let’s run istioctl x describe svc nginx. It should report

    Service: nginx
    Port: 80-80 80/UnsupportedProtocol targets pod port 80
    DestinationRule: nginx for "nginx"
    WARNING POD DOES NOT MATCH ANY SUBSETS.  (Non matching subsets v1)
    Matching subsets:
       (Non-matching subsets v1)
    No Traffic Policy
    VirtualService: nginx
    WARNING: No destinations match pod subsets (checked 1 HTTP routes)
    Warning: Route to UNKNOWN subset nginx; check DestinationRule nginx
    

    Pretty neat, isn’t it?

    Now let’s try applying 2-without-subset-label.yaml. This is basically 2.yaml, except that I’ve conveniently left out version: v1 from the labels of the pod template, so the DestinationRule can’t select any pods for its subset. istioctl analyze doesn’t actually report anything on this occasion, since the mishap sits in the native Deployment rather than in Istio-specific resources, yet istioctl x describe svc nginx still dutifully reports the error.
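
    For reference, the gap looks roughly like this (a sketch, not the literal contents of 2-without-subset-label.yaml): the DestinationRule subset selects on version: v1, but the pod template never carries that label.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
    spec:
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
            # version: v1 is missing, so the subset below matches no pods
        spec:
          containers:
            - name: nginx
              image: nginx
              ports:
                - containerPort: 80
    ---
    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: nginx
    spec:
      host: nginx
      subsets:
        - name: v1
          labels:
            version: v1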

    While we are at it, let’s run istioctl x describe pod {{NGINX_POD_NAME}} (please get NGINX_POD_NAME via kubectl get pod). It should say

    Pod: nginx-5dcfbc5f7-wkgm7
    Pod Ports: 80 (nginx), 15090 (istio-proxy)
    Suggestion: add 'version' label to pod for Istio telemetry.
    --------------------
    Service: nginx
    Port: http-80 80/HTTP targets pod port 80
    DestinationRule: nginx for "nginx"
    WARNING POD DOES NOT MATCH ANY SUBSETS.  (Non matching subsets v1)
    Matching subsets:
       (Non-matching subsets v1)
    No Traffic Policy
    VirtualService: nginx
    WARNING: No destinations match pod subsets (checked 1 HTTP routes)
       Route to non-matching subset v1 for (everything)
    --------------------
    Effective PeerAuthentication:
    Workload mTLS: STRICT
    Applied PeerAuthentication:
    default.default
    
  3. istioctl pc

    pc here stands for proxy-config. This command can be used when you want to get to the nitty-gritty bits of debugging. Some of the most commonly used sub-commands are:

    • istioctl pc l: l for listener.
    • istioctl pc ep: ep for endpoint.
    • istioctl pc c: c for cluster.

    cluster, listener and endpoint are Envoy-specific terminology, so we won’t go into detail here. You are welcome to read more about them here.
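
    For reference, the short aliases map onto these full subcommands:

    # Listeners: the ports and filter chains the sidecar is serving traffic on
    istioctl proxy-config listener {{POD_NAME}}

    # Endpoints: the actual pod IPs sitting behind each cluster
    istioctl proxy-config endpoint {{POD_NAME}}

    # Clusters: the logical upstream services the sidecar knows about
    istioctl proxy-config cluster {{POD_NAME}}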

    To illustrate what they can do, please apply 2.yaml.

    We can start by running istioctl pc c {{NGINX_POD_NAME}}. This prints all the clusters registered in istiod and, in turn, propagated to the sidecar injected into {{NGINX_POD_NAME}}. We are specifically interested in these two lines:

    ...
    
    nginx.default.svc.cluster.local                         80        - 
    
    nginx.default.svc.cluster.local                         80        v1
    ...
    

    Once we’ve verified the clusters look good, we can go ahead and examine the endpoints. Run

    istioctl pc ep {{DEBUGGER_POD_NAME}} --cluster 'outbound|80||nginx.default.svc.cluster.local'

    Hint: it shouldn’t make any difference whether you run this command against {{DEBUGGER_POD_NAME}} or {{NGINX_POD_NAME}}. istiod pushes these configs to all sidecars and they should be in sync at all times. If they aren’t, we have a problem.

    It should return

    ENDPOINT          STATUS      OUTLIER CHECK     CLUSTER
    10.0.2.237:80     HEALTHY     OK                outbound|80||nginx.default.svc.cluster.local
    

    This ENDPOINT IP address should be the same as the IP address of your nginx pod.

    Hint: you can check it via kubectl get pod {{NGINX_POD_NAME}} --template={{.status.podIP}}

    Now let’s scale the pod count of nginx to 0 by running kubectl scale deploy/nginx --replicas=0.

    If we recheck the endpoint for outbound|80||nginx.default.svc.cluster.local, there should be none since there’s no pod available for that cluster.
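
    Putting the last few steps together (pod names are placeholders):

    # With one nginx replica, the cluster shows exactly one healthy endpoint
    istioctl pc ep {{DEBUGGER_POD_NAME}} --cluster 'outbound|80||nginx.default.svc.cluster.local'

    # Scale nginx away...
    kubectl scale deploy/nginx --replicas=0

    # ...and the same query should now return nothing
    istioctl pc ep {{DEBUGGER_POD_NAME}} --cluster 'outbound|80||nginx.default.svc.cluster.local'

    # Restore the replica before moving on
    kubectl scale deploy/nginx --replicas=1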

  4. istioctl ps

    Most of the time, we have to admit, it is we who made mistakes in the manifests that lead to Istio not behaving as it should. But unlikely as it is, Istio, or even Kubernetes itself, can and will fail. istioctl ps, or istioctl proxy-status, is built specifically for debugging these kinds of errors.

    Running this command gives you a list of the Istio-managed proxies in your cluster. Feel free to read up on what CDS/LDS/EDS/RDS mean; they are Envoy-specific jargon.

    Ideally, you should see that most of the statuses are either SYNCED or NOT SENT. SYNCED means the config stored on that pod’s sidecar is in sync with the config on istiod, and NOT SENT means there’s nothing from istiod that needs to be sent to this sidecar, typically indicating that no changes have been detected.

    Be extra vigilant if you see STALE in the list of statuses. It means the sidecar is out of sync with the registration centre (istiod). Sometimes the issue is transient: the sync re-establishes and it resolves itself spontaneously. But sometimes it points to network connectivity and visibility issues.
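
    The output is a simple table, roughly of this shape (an illustrative sketch; the exact columns vary slightly between Istio versions and the values below are made up):

    $ istioctl ps
    NAME                                CDS      LDS      EDS      RDS      ISTIOD                     VERSION
    debugger-6c9d7c5b8-x2k4p.default    SYNCED   SYNCED   SYNCED   SYNCED   istiod-6d9f7c7b9b-abcde    1.13.2
    nginx-5dcfbc5f7-wkgm7.default       SYNCED   SYNCED   SYNCED   STALE    istiod-6d9f7c7b9b-abcde    1.13.2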

  5. Deciphering envoy logs

    Personally, I would refrain from interacting with istio-proxy directly. For one, the proxies are meant to be managed via istiod, which is the whole point of a service mesh. For another, they are not the most user-friendly things in the world. However, they usually sit closest to the root of an issue, and their logs usually contain clues about what went wrong.

    Apply 2-wrong-port-name.yaml. In this manifest, I’ve intentionally renamed the port on the nginx Service to grpc-80. Istio will assume the traffic between the client and the service is gRPC, when it is in fact HTTP.
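
    The relevant part of the manifest looks something like this (a sketch; only the port name differs from 2.yaml):

    apiVersion: v1
    kind: Service
    metadata:
      name: nginx
    spec:
      selector:
        app: nginx
      ports:
        - name: grpc-80     # should be http-80; Istio now assumes this port carries gRPC
          port: 80
          targetPort: 80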

    You can attach to debugger and wget http://nginx again. This time wget reports 502 Bad Gateway, which is not particularly helpful. None of the combos we’ve covered in this post will help either, since as far as Istio is concerned, everything is configured correctly. Istio has no idea that a service declared as gRPC is in fact speaking HTTP.

    This matters because the declared port name decides how traffic is carried inside the mesh. istio-proxys are pretty clever: they let each party believe it is being spoken to in its own tongue, while in between, the proxies carry the traffic in the protocol the port declares and translate it back on receipt. In our case, debugger assumes it’s speaking plain HTTP to nginx, but in fact the request travels like

    debugger (HTTP) -> istio-proxy / debugger (gRPC) -> istio-proxy / nginx (gRPC) -> nginx (HTTP)

    Misnaming the port on the nginx Service causes istio-proxy / nginx to hand the HTTP request from debugger to nginx as gRPC, which nginx doesn’t speak.

    By default, istio-proxy logs every request it sends and receives. Let’s monitor the log streams on debugger and nginx with kubectl logs -f {{NGINX_POD_NAME}}/{{DEBUGGER_POD_NAME}} -c istio-proxy.

    As soon as wget http://nginx is run on debugger, istio-proxy on debugger says:

    [2022-03-16T02:24:35.586Z] "GET / HTTP/1.1" 502 - via_upstream - "-" 0 87 20 19 "-" "Wget" "16f7041e-8304-98c2-a33c-1abe428b070c" "nginx" "10.0.0.54:80" outbound|80|v1|nginx.default.svc.cluster.local 10.0.0.100:38110 172.20.159.97:80 10.0.0.100:57600 - -

    This doesn’t look too bad: the cluster looks right, the ports on the proxy look right, and the target IP looks right.

    So we take a look at the corresponding log on nginx:

    [2022-03-16T02:24:35.590Z] "GET / HTTP/2" 502 UPE upstream_reset_before_response_started{protocol_error} - "-" 0 87 0 - "-" "Wget" "16f7041e-8304-98c2-a33c-1abe428b070c" "nginx" "10.0.0.54:80" inbound|80|| 127.0.0.6:55663 10.0.0.54:80 10.0.0.100:38110 outbound_.80_.v1_.nginx.default.svc.cluster.local default

    That doesn’t look correct. It is HTTP/1.1 from debugger, but HTTP/2 on nginx. upstream_reset_before_response_started{protocol_error} is also a dead giveaway.

    If all of the previous combos fail you, take a look at the logs of istio-proxy. Usually, you will be able to find something.

    UPE stands for upstream protocol error, one of a handful of cryptic response flags Envoy uses. You can read more about them here.
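
    For quick reference, the two flags we saw in this post decode as follows (per Envoy’s access-log documentation):

    UF    upstream connection failure, e.g. the "delayed connect error: 111" (errno 111, connection refused) in the log entry near the top of this post
    UPE   upstream protocol error: the upstream replied in a protocol the proxy did not expect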

Summary

As the development and operation of software continues to be revolutionised by containers and serverless computing, cloud native platforms such as Kubernetes and Istio are increasingly being deployed across organisations.

While Kubernetes has proven to be the technology winner, it is still a new technology that has yet to enjoy the same level of deployment as other mission critical solutions organisations are familiar with.

More advanced organisations are beginning to deploy point solutions using service meshes and are looking for experienced partners to productise and scale.

The major cloud vendors have made it easy to deploy a minimal Kubernetes environment with a service mesh enabled; however, simply deploying and running containerised applications is only the beginning.

A complete and integrated Kubernetes platform that encompasses CI/CD, security scanning, networking, storage, and monitoring based upon the stateless and repeatable processes inherent in cloud computing is still a significant project, and most organisations simply do not have the skills and experience to undertake it.

Innablr is a leading services provider for cloud native, Kubernetes, and serverless technologies. Frequently leading community events, delivering thought leadership and leading practices, Innablr is recognised in the Australian market as one of the most experienced providers of Kubernetes solutions.

Continuing our successful approach of building repeatable and extensible frameworks, Innablr has built a blueprint for Google Cloud and Amazon Web Services Kubernetes deployment whether it is Google Kubernetes Engine (GKE) or Elastic Kubernetes Service (EKS).
