Notifications

Notifications are essential for keeping your team informed and aiding troubleshooting. Be sure to add them in the "Notifications" section when setting up your checks.

The currently available notification channels are Email, Slack, and Webhook.

Notification Details

  • Summary: A brief, readable summary of the issue, like "High CPU usage detected on server-1."
  • Description: Detailed information about the issue, such as "CPU usage has exceeded 90% for more than 5 minutes on instance server-1."

Configure Notification Channels

When setting up checks, you can configure notification channels to receive alerts if a check fails. If you already have existing notification channels, simply select one from the list. Alternatively, you can create a new notification channel tailored to your needs.

Email

  • Details Required:
    • Name: Specify the name for this notification channel.
    • Email Address: Provide the email address where notifications should be sent.
  • Purpose: Receive email notifications whenever a check fails.

Webhook

  • Details Required:
    • Name: Specify the name for this webhook notification.
    • URL: Provide the webhook URL where notifications will be sent.
    • Additional HTTP Headers (Optional): Add headers to authenticate the request or include extra information for downstream integration.
  • Purpose: Use this channel to send notifications to a generic webhook, allowing integration with various systems.
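
If an additional header is used to authenticate requests, the receiving service needs to check it. The following is a minimal sketch of such a check, not part of the product itself. The header name `X-Webhook-Token`, the secret value, and the function name are hypothetical; use whatever header and value you configured on the channel.

```python
import hmac

# Hypothetical header name and secret. Configure the same pair under
# "Additional HTTP Headers" on the webhook channel.
AUTH_HEADER = "X-Webhook-Token"
SHARED_SECRET = "change-me"


def is_authorized(headers: dict[str, str], secret: str = SHARED_SECRET) -> bool:
    """Return True if the request carries the expected shared-secret header."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get(AUTH_HEADER.lower(), "")
    # compare_digest avoids leaking the secret through timing differences.
    return hmac.compare_digest(supplied.encode(), secret.encode())
```

A receiver would typically reject the request (for example with HTTP 401) when this check returns False.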

Example Payload

{
  "type": "alert.ongoing",
  "data": {
    "issue": {
      "id": "a53b18cd-2f45-4896-90f5-2b6c3e9b0479",
      "issueIdentifier": "742195867438912365",
      "dataset": "production",
      "start": "2024-10-31T11:23:05.123456789Z",
      "summary": "High CPU Usage Detected",
      "description": "CPU usage on server cluster 'prod-cluster-01' has exceeded the threshold, indicating potential resource bottleneck.",
      "labels": [
        {
          "key": "environment",
          "value": {
            "stringValue": "production"
          }
        },
        {
          "key": "region",
          "value": {
            "stringValue": "us-west-2"
          }
        }
      ],
      "annotations": [
        {
          "key": "impacted_service",
          "value": {
            "stringValue": "web-server"
          }
        }
      ],
      "checkrules": [
        {
          "id": "d9e4a500-1cfc-4fa7-a84c-f7b2b2d97c42",
          "version": 5,
          "name": "High CPU Usage Check",
          "expression": "avg({host_cpu_usage > $__threshold})",
          "thresholds": {
            "degraded": 0.75,
            "failed": 0.9
          },
          "interval": "5m0s",
          "for": "2m0s",
          "keepFiringFor": "10m0s",
          "summary": "Alert on high CPU usage across production servers",
          "description": "Triggered when average CPU usage exceeds specified thresholds over 5 minutes.",
          "labels": {
            "severity": "critical"
          },
          "annotations": {
            "team": "SRE"
          },
          "modes": [
            "performance",
            "threshold",
            "alert"
          ],
          "url": "<LINK_TO_CHECK_RULE>"
        }
      ],
      "url": "<LINK_TO_FAILED_CHECK>"
    }
  }
}
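
A receiving service usually only needs a few fields from this payload. The sketch below shows one way to extract them, assuming the request body has the shape of the example above. The function name `summarize_alert` and the selection of fields are illustrative assumptions; only the key names are taken from the example.

```python
import json


def summarize_alert(body: str) -> dict:
    """Extract the main fields from a webhook body shaped like the example."""
    event = json.loads(body)
    issue = event["data"]["issue"]
    return {
        "type": event["type"],
        "dataset": issue["dataset"],
        "summary": issue["summary"],
        "start": issue["start"],
        # An issue can reference several check rules; keep only their names.
        "rules": [rule["name"] for rule in issue.get("checkrules", [])],
    }


# Usage with a trimmed-down version of the example payload.
sample_body = json.dumps({
    "type": "alert.ongoing",
    "data": {
        "issue": {
            "dataset": "production",
            "start": "2024-10-31T11:23:05.123456789Z",
            "summary": "High CPU Usage Detected",
            "checkrules": [{"name": "High CPU Usage Check"}],
        }
    },
})
alert = summarize_alert(sample_body)
print(alert["summary"], alert["rules"])
```

Fields that may be absent in practice are not handled here beyond `checkrules`; a production receiver would want to validate the body before indexing into it.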

Slack

  • Details Required:
    • Name: Specify the name for this Slack notification.
    • Webhook URL: Provide the Slack-specific webhook URL.
    • Slack Channel: Indicate the Slack channel where notifications should be sent.
  • Purpose: Receive notifications directly within a Slack channel for convenient, real-time alerts.

Labels & Annotations

Labels

  • Purpose: Labels are key-value pairs that categorize and provide metadata for the alert. They define important aspects like the source of the alert, its severity, and contextual information that can help in filtering, routing, and silencing alerts when using Alertmanager.
  • Examples: Common labels might include:
    • severity: Defines the urgency level of the alert, such as critical, warning, or info.
    • alertname: A unique name for the alert rule, like HighCPUUsage or MemoryLeak.
    • instance: Identifies the instance where the alert originated, like server-1 or node-xyz.
    • service: Indicates the job or service name associated with the alert, like web-service or database-service.
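
To illustrate how labels can drive filtering or routing on the receiving side, here is a minimal sketch. It assumes issue-level labels arrive as a list of key/value objects with a `stringValue`, as in the Example Payload. The helper names and the selector are hypothetical and not part of any product API.

```python
def flatten_labels(labels: list[dict]) -> dict[str, str]:
    """Turn [{"key": k, "value": {"stringValue": v}}, ...] into {k: v}."""
    return {item["key"]: item["value"]["stringValue"] for item in labels}


def matches(labels: dict[str, str], selector: dict[str, str]) -> bool:
    """Return True if every key/value pair in the selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Labels in the shape used by the Example Payload.
issue_labels = flatten_labels([
    {"key": "environment", "value": {"stringValue": "production"}},
    {"key": "region", "value": {"stringValue": "us-west-2"}},
])

# Hypothetical routing rule: only production issues are forwarded to on-call.
forward_to_on_call = matches(issue_labels, {"environment": "production"})
```

An empty selector matches every issue, which is a convenient default for a catch-all route.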

Annotations

  • Purpose: Annotations provide descriptive, human-readable information about the alert. They contain details that aid in understanding and troubleshooting the alert, often presented to the user in notification messages.
  • Examples: Common annotations might include:
    • message: Detailed information about the issue, such as "CPU usage has exceeded 90% for more than 5 minutes on instance server-1."
    • runbook_url: A link to documentation or a playbook on how to respond to the alert.
  • Note: The configurable Summary and Description fields are effectively annotations.
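
As an illustration of how annotations end up in a human-readable message, the sketch below assembles a plain-text notification from a summary, a description, and an optional `runbook_url` annotation. This is an assumption about what a receiving system might do, not a description of how notifications are rendered by the product; the function name is hypothetical.

```python
def render_notification(summary: str, description: str,
                        annotations: dict[str, str]) -> str:
    """Build a plain-text message from summary, description, and annotations."""
    lines = [summary, "", description]
    runbook = annotations.get("runbook_url")
    if runbook:
        # Only add the runbook line when the annotation is actually set.
        lines += ["", f"Runbook: {runbook}"]
    return "\n".join(lines)


message = render_notification(
    "High CPU usage detected on server-1.",
    "CPU usage has exceeded 90% for more than 5 minutes on instance server-1.",
    {"runbook_url": "https://example.com/runbooks/high-cpu"},  # placeholder URL
)
print(message)
```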

Last updated: November 19, 2024