Return

Pattern: Observability

Cloud native distributed systems require constant insight into the behavior of all running services in order to understand the system’s behavior and to predict potential problems or incidents

Observability

Teams are moving to MS, and there are more and more pieces—the number of components is growing. Traditional responsive monitoring cannot recognize service failures.

In This Context

Traditional systems practice assumes the goal is for every system to be 100% up and running, so monitoring is reactive—i.e., aiming to ensure nothing has happened to any of these components. It alerts when a problem occurs. In traditional monitoring if a server fails, you will have an event; response, even if automatic, is manually triggered. This assumption is not valid for distributed systems.

Therefore

Put in logging, tracing, alerting, and metrics to always collect information about all the running services in a system.

Consequently

There is a continuous overview of the state of the system, visible to anyone for whom this is relevant information.