Intro to Distributed Systems and traceability using Jaeger – Pt.1

Published by Hesham Abo El-Magd

“If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute resolving it” 

This quote is often attributed to Albert Einstein.

Now imagine that the planet in this scenario is your system: you are responsible for operating it and for maintaining its availability and durability. Such a system probably consists of several functions that communicate with each other, and you can’t easily identify what went wrong or where. Imagine you have just received a report that a service request (a UI request, for instance) was not fulfilled, and you need to debug it.

Let’s go back a bit in the history of distributed systems and look at how they evolved from monolithic applications to service-oriented architecture (SOA) and, more recently, to microservices.

Let’s take Amazon as an example, a website that almost everyone on this planet knows.

The Amazon website started as a monolithic application, even though it contained many services: product search, product descriptions, product ratings, product reviews, the shopping cart, etc.

With so many services inside one huge monolithic application, Jeff Bezos came up with the two-pizza rule: a team shouldn’t be bigger than 6-8 members, or in other words, no bigger than can be fed by two pizzas. Dividing teams by this rule means splitting them by business or functional capabilities. Each team takes ownership of the design, architecture, and operation of its function. One of each team’s responsibilities is to provide the other teams with an interface that governs the communication from one function to another (or, if you like, from one microservice to another). This is one of the advantages of a distributed architecture, and microservices are one implementation of it. By choosing a distributed architecture, you are in fact choosing to distribute decision-making across all the business and functional teams. One team can easily choose a different technology from another team, as long as it exposes a clear interface (API) that other teams can call from their microservices, and vice versa.

This freedom is good, since there is no centralized authority forcing your team to use one technology over another. It also comes at a cost: the problems are now spread across the distributed system that you and the other teams are responsible for operating.

Basically, the problems are now distributed across all the microservices, and that brings us to the big question of this post:

Can you trace it?! 

So, you know your problem and the question to answer: how can I trace a service request that has to interact with other services in my system? Coming back to our example, the Amazon website: you open amazon.com, search for a PlayStation 5, and find it. You want to buy it quickly and add it to your shopping cart, but something goes wrong in the process and you can’t find it in your shopping cart. Knowing how efficient the engineering team at Amazon is, and of course how good their detection mechanism (tracing) is, such an error wouldn’t make it into production.

However, let’s say that, hypothetically, it happened, and the engineer tries to trace where the service request failed and what went wrong. The service request went from one service to another, updating DynamoDB with the shopping cart items along the way. You need to detect where the service request stopped and then debug that service using one of your logging toolchains, such as Loki.

In essence, you need a higher level of observability (traceability), which helps you debug any incoming request and keep track of a request as it traverses from one service to another until it has been fulfilled.
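To make the idea concrete, here is a minimal sketch in plain Python of how a request can be followed across services. It is not how Jaeger or Amazon implements tracing; the service names (`web-frontend`, `cart-service`, `cart-db`), the `Span` class, and the in-memory `COLLECTED` list are all hypothetical stand-ins. The point is only that every hop records a span carrying the same trace ID, so the place where the request stopped can be found afterwards.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One unit of work done by one service on behalf of a request."""
    trace_id: str             # shared by every span of the same request
    span_id: str              # unique per span
    parent_id: Optional[str]  # span that triggered this one (None for the root)
    service: str
    operation: str
    error: Optional[str] = None


# Hypothetical in-memory stand-in for a tracing backend.
COLLECTED: List[Span] = []


def start_span(service: str, operation: str, parent: Optional[Span] = None) -> Span:
    """Create a span; reuse the parent's trace ID so the request stays linked."""
    span = Span(
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        service=service,
        operation=operation,
    )
    COLLECTED.append(span)
    return span


def cart_db_write(parent: Span, item: str) -> None:
    span = start_span("cart-db", "put_item", parent)
    span.error = "write timed out"  # simulated failure for this example
    raise RuntimeError(span.error)


def cart_service(parent: Span, item: str) -> None:
    span = start_span("cart-service", "add_to_cart", parent)
    try:
        cart_db_write(span, item)
    except RuntimeError as exc:
        span.error = f"downstream failure: {exc}"
        raise


def web_frontend(item: str) -> str:
    """Entry point of the request; returns the trace ID for later lookup."""
    span = start_span("web-frontend", "POST /cart")
    try:
        cart_service(span, item)
    except RuntimeError as exc:
        span.error = f"downstream failure: {exc}"
    return span.trace_id


def root_cause(trace_id: str) -> Optional[Span]:
    """Return the deepest failing span: an errored span with no errored child."""
    errored = [s for s in COLLECTED if s.trace_id == trace_id and s.error]
    errored_parents = {s.parent_id for s in errored}
    leaves = [s for s in errored if s.span_id not in errored_parents]
    return leaves[0] if leaves else None


if __name__ == "__main__":
    tid = web_frontend("PlayStation 5")
    culprit = root_cause(tid)
    print(f"trace {tid}: request stopped in {culprit.service} ({culprit.error})")
```

Once the failing service is known from the trace, its logs are the next place to look.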

And that is what Jaeger offers. Jaeger is an open-source, end-to-end distributed tracing tool that helps you monitor and troubleshoot transactions in complex distributed systems. It can help identify performance and latency issues, support root cause analysis, and create a full map of service dependencies inside a distributed system.
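As a rough illustration of how a service dependency map can be derived from trace data, the sketch below walks a list of spans and records a "caller to callee" edge whenever a span's parent belongs to a different service. This is only a simplified model of the idea, not Jaeger's actual implementation; the span dictionaries and service names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical spans from one trace; a real backend would store many more fields.
SPANS: List[Dict[str, Optional[str]]] = [
    {"span_id": "a", "parent_id": None, "service": "web-frontend"},
    {"span_id": "b", "parent_id": "a", "service": "search-service"},
    {"span_id": "c", "parent_id": "a", "service": "cart-service"},
    {"span_id": "d", "parent_id": "c", "service": "cart-db"},
    {"span_id": "e", "parent_id": "c", "service": "cart-service"},  # internal step
]


def dependency_map(spans: List[Dict[str, Optional[str]]]) -> Dict[str, List[str]]:
    """Map each calling service to the sorted list of services it calls."""
    service_of = {s["span_id"]: s["service"] for s in spans}
    edges = defaultdict(set)
    for span in spans:
        parent = span["parent_id"]
        if parent is None or parent not in service_of:
            continue  # root span, or parent not collected
        caller, callee = service_of[parent], span["service"]
        if caller != callee:  # ignore spans that stay inside one service
            edges[caller].add(callee)
    return {caller: sorted(callees) for caller, callees in edges.items()}


if __name__ == "__main__":
    for caller, callees in dependency_map(SPANS).items():
        print(f"{caller} -> {', '.join(callees)}")
```

Aggregating such edges over many traces gives a picture of which services depend on which.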

Below is the architecture diagram of how Jaeger works:

In the next post in this series, I will explain how to deploy and use Jaeger in your Kubernetes cluster. Stay tuned, and follow us on LinkedIn or Twitter so you don’t miss any of our blog posts.

Reference list: https://www.jaegertracing.io/docs/1.22/architecture/


Hesham Abo El-Magd

Hesham is interested in the field of microservices and service architecture, focusing on containers and container orchestration, especially Kubernetes and cloud-native solutions.