Postmortem Chronicles – Prometheus is Good, But Be Careful!

On August 6, we detected and fixed a production issue that showed the wrong cluster status in the Hazelcast Cloud Console: when you clicked the Stop Cluster button, the cluster stayed in the Stopping state even after it had shut down. I was a bit confused, since this system had been running for almost two years, so why did the issue appear only now? In this article, I describe how we use Prometheus, then go into the root cause of the issue and how we fixed it.

Alert Management Architecture

Hazelcast Cloud is based on Kubernetes, so we use Prometheus to monitor the critical components because it integrates easily with Kubernetes. We collect metrics both to display them in dashboards such as Grafana and to trigger events based on predefined rules. To trigger events:

  • We define rules based on Hazelcast cluster metrics.
  • We define receivers that collect these alerts and route them to a destination, which in our case is a REST endpoint. You can also see triggered alerts on the Prometheus Alerts page even if you don’t define a receiver.

Defining Alert Management Rules

One of the rules in Hazelcast Cloud is FreeIdleCluster, which stops free clusters that have been idle for more than 16 hours.

groups:
- name: rules
  rules:
  - alert: FreeIdleCluster
    expr: rate(hazelcast_cluster_stats{cluster_type="FREE",statname="total_operation_count"}[1m]) == 0
    for: 16h
    annotations:
        summary: 'Cluster [{{ $labels.cluster_id}}] is idle cluster'
        clusterid: '{{ $labels.cluster_id}}'
        customerid: '{{ $labels.customer_id}}'
        clusterName: '{{ $labels.cluster_name}}'
        stack: '{{ $labels.environment}}'
        eventtype: FREE_IDLE_CLUSTER
    labels:
        urgency: medium
        eventtype: FREE_IDLE_CLUSTER
        environment: '{{ $labels.environment }}'

If expr holds true continuously for 16 hours (the for duration), Prometheus fires the alert. The expression is based on the total_operation_count statistic coming from the Hazelcast metrics: a one-minute rate of zero means no operations are running on the cluster. Annotations and labels are important because their values are shown on the Alerts page and are also included in the REST request payload. Once the alert fires, it is handed to the Prometheus Alertmanager component, which handles grouping and delivery.
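For reference, here is a minimal sketch, in Go, of the payload shape Alertmanager POSTs to a webhook receiver. Only the fields relevant here are modeled; the full format has a few more. The labels and annotations defined in the rule above arrive in these maps.

// Package event is a sketch; it models only part of the standard
// Alertmanager webhook payload.
package event

// Alert carries one firing or resolved alert.
type Alert struct {
	Status      string            `json:"status"`      // "firing" or "resolved"
	Labels      map[string]string `json:"labels"`      // e.g. eventtype, environment, urgency
	Annotations map[string]string `json:"annotations"` // e.g. clusterid, customerid, summary
}

// WebhookMessage is the top-level body sent to the receiver URL.
type WebhookMessage struct {
	Version  string  `json:"version"`
	Status   string  `json:"status"`
	Receiver string  `json:"receiver"`
	Alerts   []Alert `json:"alerts"`
}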

Sending Alerts to Specific Destinations

We define receivers and routes to send these alerts to external services that trigger events. A receiver defines the destination (email, Slack, a REST endpoint, etc.), while a route matches alerts based on their labels and forwards them to a receiver.

receivers:
  - name: event-service
    webhook_configs:
      - url: https://event.service/prometheus
        http_config:
          bearer_token: '<jwt_token>'

If your service is restricted, you can add a bearer token and Alertmanager will inject it into the request headers. To see which fields are sent in the request payload, you can point the receiver at a mocked endpoint that simply prints the payload; a request-inspection tool is very handy for that.
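For example, a throwaway receiver like the following sketch (the port and path are arbitrary) just dumps whatever Alertmanager sends, which is enough to see the field names before wiring up the real service.

// A throwaway webhook receiver that prints the Alertmanager payload.
package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/prometheus", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.Printf("alert payload: %s", body) // inspect the raw JSON here
		w.WriteHeader(http.StatusOK)          // tell Alertmanager the delivery succeeded
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}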

route:
  group_by: ['stack_name', 'alertname']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 3h
  routes:
    - match_re:
        environment: dev
      receiver: event-service
      continue: true

According to the above definition, every alert whose environment label matches dev will be routed to the event-service receiver; continue: true lets the alert keep matching any later routes as well.

Some Traces of the Production Issue

The number of clusters in the Stopping state started to increase, and at the same time we started to get exceptions from our microservices. One of the first messages, coming from our notification service, was "Channel is not found". When I checked our Slack management page, I realized that the webhook we use for this operation had been disabled! The owner of the webhook is no longer involved with the project, and Slack disabled it. Let's have a look at the final state after this problem:

The Event Service successfully stops the cluster but cannot send the message because the Slack webhook is missing. As a result, the Event Service reports a failure to Prometheus Alertmanager. When Alertmanager does not get an OK response from the external endpoint, it does not consider the notification delivered and tries to send it again… and again. On the repeated call, the cluster enters the Stopping state again but never reaches the Stopped state because of the same failure.
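In code, the failure mode boils down to something like the following sketch (the handler and helper names are hypothetical, not our actual implementation): because the cluster stop and the Slack notification share one response, a dead webhook turns an already-processed alert into a "failed" delivery that Alertmanager keeps retrying.

// A simplified sketch of the failure mode; stopCluster and notifySlack are
// hypothetical stand-ins for the real Event Service logic.
package main

import (
	"errors"
	"log"
	"net/http"
)

func stopCluster(r *http.Request) error { return nil } // the cluster really is stopped here
func notifySlack(r *http.Request) error { return errors.New("channel is not found") }

func handleFreeIdleCluster(w http.ResponseWriter, r *http.Request) {
	if err := stopCluster(r); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	if err := notifySlack(r); err != nil {
		// The cluster is already stopped, but this non-2xx response makes
		// Alertmanager retry the whole alert, re-triggering the stop flow.
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/prometheus", handleFreeIdleCluster)
	log.Fatal(http.ListenAndServe(":8080", nil))
}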

Actual Fix

As a quick fix, we renewed the webhook URL. As a long-term solution, we will apply a finite-state machine to our cluster status change mechanism so that a cluster status can never be updated from Stopped back to Stopping.
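A minimal sketch of that guard might look like this (the state names and transition table are assumptions about the planned design, not the final implementation):

// A sketch of a finite-state machine guard for cluster status changes.
package main

import "fmt"

type ClusterState string

const (
	Running  ClusterState = "RUNNING"
	Stopping ClusterState = "STOPPING"
	Stopped  ClusterState = "STOPPED"
)

// allowedTransitions lists the legal next states for each state.
var allowedTransitions = map[ClusterState][]ClusterState{
	Running:  {Stopping},
	Stopping: {Stopped},
	Stopped:  {Running}, // resuming is assumed possible, but Stopped never goes back to Stopping
}

// transition returns the new state, or an error if the change is illegal,
// so a retried alert can no longer flip a Stopped cluster back to Stopping.
func transition(current, next ClusterState) (ClusterState, error) {
	for _, allowed := range allowedTransitions[current] {
		if allowed == next {
			return next, nil
		}
	}
	return current, fmt.Errorf("illegal cluster state transition: %s -> %s", current, next)
}

func main() {
	if _, err := transition(Stopped, Stopping); err != nil {
		fmt.Println(err) // illegal cluster state transition: STOPPED -> STOPPING
	}
}

With a guard like this in the status update path, a duplicate alert delivery becomes a no-op instead of silently corrupting the cluster state shown in the console.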