Part Four – Adopting Istio Service Mesh
In the previous articles in this series, we covered the advantages of adopting a microservice architecture, how to develop and deploy microservice-based applications using Docker and Kubernetes, and how we created our Kubernetes clusters at Yahoo Small Business.
During the Kubernetes cluster creation process we were in constant touch with our security team. One of their requirements was that we use an up-and-coming network security model known as “zero trust network” for this new project. In this security model, the traditional way of trusting the company’s internal network no longer applies. This new model will help us to prevent security breaches, even ones that may originate from within the internal network. In a zero trust network we encrypt all of the traffic flowing through the internal network so that a third party cannot eavesdrop on the traffic even while it is in transit within the internal network.
This is usually implemented using a methodology known as mTLS, short for Mutual Transport Layer Security. As you may already know, TLS (Transport Layer Security) is used by the majority of internet communication channels to provide an encrypted connection, with server-side TLS certificates used for authentication and encryption. With mutual TLS, authentication happens in both directions: the server presents its certificate to the client for verification, and the client presents its own certificate to the server. In the microservices world, this means that each microservice in our infrastructure needs its own certificate.
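To make the "both directions" part concrete, here is a minimal sketch using Python's standard `ssl` module. It is only an illustration of the handshake settings, not how our services are written; in a service mesh this work is done by the proxy rather than the application. The function names and the optional file path parameters are hypothetical.

```python
import ssl


def make_server_context(cert_file=None, key_file=None, ca_file=None):
    """Server side of mTLS: present our certificate and demand one from the client."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Plain TLS servers leave this at CERT_NONE; requiring a client
    # certificate is what makes the connection "mutual".
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        # CA used to verify the certificates that clients present.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


def make_client_context(cert_file=None, key_file=None, ca_file=None):
    """Client side of mTLS: verify the server and present our own certificate."""
    # SERVER_AUTH contexts already verify the server certificate and hostname.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    if cert_file:
        # The client certificate is the extra piece compared with one-way TLS.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

Both contexts end up with certificate verification required, which is the defining property of mutual TLS. The certificate files are left optional here only so the sketch can be loaded without real key material.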
Along with mutual TLS, we were also looking at many other features to help make our lives easier as we moved more and more microservices to the Kubernetes platform. Intelligent traffic routing, smarter canary releases, rate limiting and detailed observability were among the features we were considering.
Many of these capabilities could be implemented in code within each microservice, but doing so would add complexity to the applications’ business logic and make them harder to maintain. In addition, because microservices can be written in different programming languages, each with its own way of implementing these features, it would be difficult to standardise the implementation at the code level.
The industry had already recognized these requirements and the challenges associated with them, and was moving towards a solution called a service mesh manager. A “service mesh” can be defined as the network space between the microservices running on a platform such as Kubernetes. A service mesh manager (often referred to as just “service mesh”) is a software component which sits in this space and provides the ability to intelligently route, secure and observe the traffic.
When we decided to implement a service mesh in our Kubernetes cluster, there were two main contenders in the market: Istio and Linkerd. Istio is an open source service mesh project initially created through the combined effort of Google, Lyft and IBM. It was initially built to run on top of the Kubernetes platform, which makes integration with Kubernetes easier (it now supports multiple platforms). We ran an internal proof of concept to evaluate the features and usability, and everyone on the team was happy with it, so we adopted it as our service mesh solution.
As a service mesh, Istio provides end-to-end encryption, service discovery, load balancing, failure recovery, metrics, and monitoring. It also supports A/B testing, fine-grained canary rollouts, rate limiting, and access control. All these features are provided without many changes to the application code. Istio lives close to the application itself, and most of these functionalities are transparent to the application developer. Istio achieves this transparent operation by working in multiple planes within the orchestration platform. There is a distributed “data plane” which sits close to the individual microservices, and a centralized “control plane” which controls the overall operations of the service mesh.
The data plane consists of a proxy server known as the Istio proxy, which is based on an open source lightweight proxy server called Envoy. In a Kubernetes context, when you deploy a microservice pod with an application Docker container, Istio adds the Istio proxy container to the same pod as a sidecar. From then on, all the traffic to and from the application container is routed through this proxy. When this happens for all the microservice pods within the network, the entire traffic of the microservice mesh goes through a proxy layer made up of multiple Envoy proxy instances. The traffic flowing through this mesh can then be controlled, routed, encrypted and monitored using the configuration provided by the centralized Istio control plane.
The control plane mainly consists of two components, Pilot and Citadel. Pilot is responsible for configuring the Envoy sidecar proxies running within the mesh. It allows us to specify the rules for traffic routing, load balancing, timeouts, retries and circuit breakers, then converts them to Envoy configuration and pushes them to the sidecar proxies. Citadel is the component that allows developers to build zero-trust environments based on service identity rather than network controls. It is responsible for assigning certificates to each service and can also accept external certificate authority keys when needed.
Istio Architecture Diagram
As you have seen above, Istio provides a lot of functionality to control, secure and monitor traffic, and it is very easy to get overwhelmed. At YSB, we started with a few major features: security using mTLS, some of the traffic control features and some of the observability features.
For cluster wide encryption, Istio provided us with an option to enable cluster wide mTLS during the installation by allowing us to specify authentication policies.
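As an illustration of what such an authentication policy can look like, the sketch below builds a mesh-wide strict mTLS policy as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. It assumes the `PeerAuthentication` resource from newer Istio releases and the common `istio-system` root namespace; the Istio version we installed at the time may have used a different resource kind, so treat this as an example rather than our exact configuration. The function name is hypothetical.

```python
import json


def mesh_wide_strict_mtls(root_namespace="istio-system"):
    """Build a policy that requires mTLS for every workload in the mesh.

    Assumption: a PeerAuthentication named "default" placed in the mesh's
    root namespace applies mesh-wide.
    """
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": root_namespace},
        # STRICT rejects plain-text traffic; PERMISSIVE would accept both
        # plain-text and mTLS, which is useful while migrating services.
        "spec": {"mtls": {"mode": "STRICT"}},
    }


if __name__ == "__main__":
    print(json.dumps(mesh_wide_strict_mtls(), indent=2))
```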
For traffic routing, we enabled an Istio component called an ingress gateway, which created a cloud load balancer and an Istio proxy deployment behind it to handle all the incoming traffic to the cluster. At this layer, we handle the TLS certificates for the microservices which are exposed outside the cluster. Also at this layer, we use Istio virtual services to control and redirect the incoming traffic as needed.
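The two pieces at this layer can be sketched as a pair of resources: a `Gateway` that terminates TLS on the ingress proxy, and a `VirtualService` bound to that gateway which forwards requests to a service inside the cluster. The host name, gateway name, secret name, service name and port below are all hypothetical placeholders, not values from our environment.

```python
import json


def https_gateway(name, host, tls_secret):
    """Gateway that terminates TLS for `host` on the default ingress gateway pods."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "Gateway",
        "metadata": {"name": name},
        "spec": {
            # Assumption: the ingress gateway deployment carries this label.
            "selector": {"istio": "ingressgateway"},
            "servers": [{
                "port": {"number": 443, "name": "https", "protocol": "HTTPS"},
                # SIMPLE is ordinary one-way TLS towards external clients.
                "tls": {"mode": "SIMPLE", "credentialName": tls_secret},
                "hosts": [host],
            }],
        },
    }


def ingress_route(name, host, gateway, service, port):
    """VirtualService that sends traffic arriving at `gateway` to `service`."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": name},
        "spec": {
            "hosts": [host],
            "gateways": [gateway],
            "http": [{
                "route": [{
                    "destination": {"host": service, "port": {"number": port}},
                }],
            }],
        },
    }


if __name__ == "__main__":
    gw = https_gateway("public-gateway", "shop.example.com", "shop-tls")
    vs = ingress_route("shop", "shop.example.com", "public-gateway", "shop-frontend", 8080)
    print(json.dumps([gw, vs], indent=2))
```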
Another useful traffic control feature we were very interested in was the ability to do production software deployments using a pattern called a canary deployment. The term canary deployment comes from the practice of using canary birds in coal mines to detect toxic gases. In the deployment context, this meant releasing a new software version to the production environment in a very gradual and controlled way. Initially only a very small percentage of users would be served using the new version of the service. Then we would observe the metrics for the service to make sure that things are working as expected and user experience is not degraded. Gradually we would roll out the changes to more users and eventually the new version would be available for the entire user base of the service. This deployment approach allows us to detect any issues with the new code early and it complements our thorough QA and Stage environment quality testing. This contributes to the increased uptime of our production services overall.
To implement canary deployments, we used an Istio resource called a “destination rule” to group different versions of a microservice. Then we used another resource called “virtual service” to change the percentage of traffic going to the new version. The full implementation uses a continuous deployment tool called Spinnaker which will be discussed in detail in the next article.
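The combination of the two resources can be sketched as follows. This is a simplified illustration, not our Spinnaker pipeline: the service name `checkout`, the version labels and the helper names are hypothetical, and it assumes the pods of each version carry a `version` label.

```python
import json


def destination_rule(service, versions):
    """Group the pods of `service` into one named subset per version label."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": service},
        "spec": {
            "host": service,
            "subsets": [{"name": v, "labels": {"version": v}} for v in versions],
        },
    }


def canary_virtual_service(service, stable, canary, canary_percent):
    """Split traffic between the stable and canary subsets by percentage."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": service},
        "spec": {
            "hosts": [service],
            "http": [{
                "route": [
                    {"destination": {"host": service, "subset": stable},
                     "weight": 100 - canary_percent},
                    {"destination": {"host": service, "subset": canary},
                     "weight": canary_percent},
                ],
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(destination_rule("checkout", ["v1", "v2"]), indent=2))
    # Start by sending 10% of requests to the new version.
    print(json.dumps(canary_virtual_service("checkout", "v1", "v2", 10), indent=2))
```

Rolling the canary forward then amounts to re-applying the virtual service with a higher `canary_percent` until it reaches 100, while the destination rule stays unchanged.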
YSB also made use of Istio’s observability features. Because all the traffic within the mesh goes through a layer of Istio proxies, the proxies can provide very valuable data for observation and monitoring, such as latency and error rates. This data is exposed in a metrics format that can be consumed by Prometheus, a tool which provides an elegant solution for metrics-based monitoring.
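As an example of how these proxy metrics can be used, the sketch below builds a Prometheus query for the share of requests to a service that returned a 5xx response. It assumes Istio's standard `istio_requests_total` counter with its `destination_service_name` and `response_code` labels; the service name and the helper function are hypothetical, and this is not necessarily the query we run in production.

```python
def error_rate_query(service, window="5m"):
    """Return a PromQL expression for the 5xx error ratio of `service`."""
    selector = f'destination_service_name="{service}"'
    # Per-second rate of requests that ended in a 5xx response code.
    errors = (
        f'sum(rate(istio_requests_total{{{selector},response_code=~"5.."}}[{window}]))'
    )
    # Per-second rate of all requests to the same service.
    total = f'sum(rate(istio_requests_total{{{selector}}}[{window}]))'
    return f"{errors} / {total}"


if __name__ == "__main__":
    print(error_rate_query("checkout"))
```

A query like this is also a natural signal to watch during a canary rollout, since it can be narrowed to a single version by adding a further label to the selector.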
There are many other Istio features we are continuously working to implement in our environment.
As we have seen so far, we have containerized our microservices using Docker, orchestrated them using Kubernetes and implemented the service mesh for easier management and improved security. In the next part of the series, we will talk more about our continuous deployment system using a tool called Spinnaker.