all and sundry: Service to Service call patterns - GKE with Anthos Service Mesh on a single cluster

Thursday, November 18, 2021

This is the second in a series of posts exploring service to service call patterns in some of the application runtimes on Google Cloud. The first in the series explored service to service call patterns in GKE.

This post expands on it by adding in a service mesh, specifically Anthos Service Mesh, and explores how the service to service patterns change in the presence of a mesh. The service to service call will be across services in a single cluster. The next post will explore services deployed to multiple GKE clusters.

Set-Up

The steps to set up a GKE cluster and install Anthos Service Mesh on top of it are described in this document - https://cloud.google.com/service-mesh/docs/unified-install/install. In brief, these are the commands that I had to run in my GCP project to get a cluster running:

If the installation of the cluster and the mesh has run through cleanly, a good way to verify the installation is to see if the cluster gets registered as an Anthos managed cluster in the Google Cloud Console.

The services that I will be installing are fairly simple and look like this:

Using a UI, the caller can make the producer behave in certain ways:

Introduce response time delays
Respond with certain status codes

This will help check how the mesh environment responds to these behaviors.

The codebase for the "caller" and "producer" is in this repository - https://github.com/bijukunjummen/sample-service-to-service. There are Kubernetes manifests available in the repository to bring up these services.

Behavior 1 - Mutual TLS

The first behavior that I want to see is for the caller and the producer to verify each other's identities by presenting and validating their certificates.

This can be done by adding in an Istio DestinationRule for the producer, along these lines:
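A minimal sketch of such rules, assuming the two workloads are exposed as Kubernetes services named `producer` and `caller` (hypothetical names and resource names; the manifests in the repository may differ). The `ISTIO_MUTUAL` mode tells the sidecar to originate mTLS using the certificates that the mesh itself issues:

```yaml
# Assumed service names: "producer" and "caller" (illustrative only).
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: producer-destination-rule   # hypothetical resource name
spec:
  host: producer
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL            # mTLS with mesh-issued certificates
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: caller-destination-rule     # hypothetical resource name
spec:
  host: caller
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
```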

A DestinationRule is also added for the caller, because the caller gets the call from the browser via an Ingress Gateway, and even this call needs to be authenticated using mTLS.

Now that the set-up is in place, the following is what gets captured as the request flows from the Browser to the Ingress Gateway to the Caller to the Producer.

The sign that mTLS works is the "x-forwarded-client-cert" header, which shows up both in the Caller's headers coming in from the Ingress Gateway and in the Producer's headers coming in from the Caller.

Behavior 2 - Timeout

The second behavior that I want to explore is timeouts. A request timeout can be set for the call from the Caller to the Producer by creating a VirtualService for the Producer with the timeout value set, along these lines:
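A minimal sketch of such a VirtualService, assuming the producer is reachable as a Kubernetes service named `producer` and assuming a timeout of 5 seconds. Both are illustrative values; the only thing the scenario requires is a timeout shorter than the delay requested from the producer:

```yaml
# Assumed host "producer" and assumed timeout of 5s (illustrative only).
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: producer-virtual-service    # hypothetical resource name
spec:
  hosts:
    - producer
  http:
    - route:
        - destination:
            host: producer
      timeout: 5s                   # the sidecar gives up on the request after this long
```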

With this configuration in place, a request from the caller with a delay of 6 seconds causes the mesh to time out and present an error that looks like this:

The mesh responds with an HTTP status code of 504 and a message of "Upstream timed out".

Behavior 3 - Circuit Breaker

A circuit breaker is implemented using a DestinationRule resource.

Here I have a configuration which breaks the circuit if 3 consecutive 5XX responses are received from the Producer in a 15 second interval, and then does not make a request for another 15 seconds.

With this configuration in place, a request made while the circuit is broken looks like this:

The mesh responds with an HTTP status code of 503 and a message of "no healthy upstream".
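A minimal sketch of a rule with those numbers, assuming the producer is reachable as a Kubernetes service named `producer` (a hypothetical name). The `maxEjectionPercent` value is my own assumption, added so that a single-instance producer can actually be ejected:

```yaml
# Assumed host "producer"; thresholds mirror the 3 errors / 15s / 15s described above.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: producer-destination-rule   # hypothetical resource name
spec:
  host: producer
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 3       # eject after 3 consecutive 5XX responses
      interval: 15s                 # how often hosts are evaluated for ejection
      baseEjectionTime: 15s         # minimum time an ejected host stays out
      maxEjectionPercent: 100       # assumption: allow every instance to be ejected
```

If a DestinationRule with mTLS settings already exists for the same host, it is usually cleaner to put the `tls` and `outlierDetection` settings under a single `trafficPolicy` than to keep two rules for one host.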

Conclusion

The neat thing is that in all scenarios so far, the way the Caller calls the Producer remains exactly the same. It is the mesh which injects the appropriate security controls through mTLS, and the resilience of the calling service through timeouts and the circuit breaker.
