Sunday, January 16, 2022

Service to Service Call Pattern - Multi-Cluster Ingress

Multi-Cluster Ingress is a neat feature of Anthos and GKE (Google Kubernetes Engine), whereby a user accessing an application that is hosted on multiple GKE clusters, in different zones is directed to the right cluster that is nearest to the user!

So for eg. consider two GKE clusters, one in us-west1, based out of Oregon, USA and another in europe-north1, based out of Finland. An application is installed to these two clusters. Now, a user accessing the application from US will be lead to the GKE cluster in us-west1 and a user coming in from Europe will be lead to the GKE cluster in europe-north1. Multi-cluster Ingress enables this easily!



Enabling Multi-Cluster Ingress

Alright, so how does this work. 

Let me once again assume that I have two clusters available in my GCP project, one in us-west1-a zone and another in europe-north1-a, and an app called "Caller" deployed to these two clusters. For a cluster, the way to get traffic into the cluster from a user outside of it is typically done using an "Ingress"


This works great for a single cluster, however not so for a bunch of clusters. A different kind of an Ingress resource is required that spans GKE clusters and this is where a Multi-Cluster ingress comes in - an ingress that spans clusters.

Multi-Cluster Ingress is a Custom resource provided by GKE and looks something like this:



It is defined in one of the clusters, designated as a "config" cluster. 
See how there is a a reference to "sample-caller-mcs" above, that is pointing to a "MultiClusterService" resource, which is again a custom resource that will work only in the context of a GKE project. A definition for such a resource, looks almost like a Service and here is the one for "sample-caller-mcs"


Now that there is a MultiClusterIngress defined pointing to a MultiClusterService, what all happens under the covers:
1. A load balancer is created which uses an ip advertised using anycast - better details are here. These anycast ip's help get the request through to the cluster closest to the user.
2. A Network Endpoint Group(NEG) is created for every cluster that matches the definition of MultiClusterService. These NEG's are used as the backend of the loadbalancer.

Sample Application

I have a sample set of applications and deployment manifests available here that demonstrates Multi-Cluster Ingress. There are instructions to go with it here. This brings up an environment which looks like this:



Now to simulate a request coming in from us-west1-a is easy for me since I am in US, another approach is to simply spin up an instance in us-west1-a and use that to make a request the following way:



And the "caller" invoked should be the one in us-west1-a, similarly if the request is made from an instance in europe-north1-a:


The "caller" invoked will be the one in europe-north1-a!!

Conclusion

This really boggles my mind, being able to spin up two clusters on two different continents, and having a request from the user directed to the one closest to them, in a fairly simple way. There is a lot going on under the covers, however this is abstracted out using the resource types of MultiClusterIngress and MultiClusterService. 

Tuesday, January 4, 2022

Service to Service call pattern - Using Anthos Service Mesh

Anthos Service Mesh makes it very simple for a service in one cluster to call service in another cluster. Not just calling the service but also doing so securely, with fault tolerance and observability built-in.


This is a fourth in a series of posts on service to service call patterns in Google Cloud. 

The first post explored Service to Service call pattern in a GKE runtime using a Kubernetes Service abstraction

The second post explored Service to Service call pattern in a GKE runtime with Anthos Service mesh

The third post explored the call pattern across multiple GKE runtimes with Multi-Cluster Service

Target Call Pattern

There are two services deployed to two different clusters. The "caller" in "cluster1" invokes the "producer" in "cluster2".


Creating Clusters and Anthos Service Mesh

The entire script to create the cluster is here. The script:
1. Spins up two GKE standard clusters
2. Adds firewall rules to enable ip's in one cluster to reach the other cluster
3. Installs service mesh on each of the clusters

Caller and Producer Installation

The caller and the producer is deployed using the normal kubernetes deployment descriptors, no additional special resource is required to get the set-up to work, so for eg, the callers deployment looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-caller-v1
  labels:
    app: sample-caller
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-caller
      version: v1
  template:
    metadata:
      labels:
        app: sample-caller
        version: v1
    spec:
      serviceAccountName: sample-caller-sa
      containers:
        - name: sample-caller
          image: us-docker.pkg.dev/biju-altostrat-demo/docker-repo/sample-caller:latest
          ports:
            - containerPort: 8080
		....            

Caller to Producer Call

The neat thing with this entire set-up is that from the callers perspective a call continues to be made to the dns name of a service representing the producer. So assuming that the producer's service is deployed to the same namespace, then a  dns name of "producer" should just work.

So with this in place, a call from the caller to producer looks something like this:

The call fails, with a message that the "sample-producer" host name in cluster1 cannot be resolved. This is perfectly okay as such a service has not been created in cluster1. Creating such a service:


resolves the issue and a call cleanly goes through!! This is magical, see how the service in cluster 1 resolves the pods in cluster2!

Additionally the presence of x-forwarded-client-cert header in the producer indicates that the mTLS is being used during the call. 

Fault Tolerance

So security via mTLS is accounted for, now I want to layer in some level of fault tolerance. This can be done by ensuring that the calls timeout instead of just hanging, and not making repeated calls to producer if it starts to be non-responsive. This is typically done using istio configuration. Since Anthos service mesh is essentially a managed istio, the configuration for timeout looks something like this, using a VirtualService configuration


And circuit breaker, using a Destination Rule which looks like this:

All of it is just straight kubernetes configuration and it just works across multiple clusters.

Conclusion

The fact that I can treat multiple clusters as if they were a single cluster is I believe the real value proposition of Anthos Service Mesh, all the work around how to enable such a communication securely with fault tolerance is what the Mesh brings to the table.

My repository has all the sample that I have used for the post - https://github.com/bijukunjummen/sample-service-to-service