Custom Alarms

📋 Table of Contents

  1. Overview
  2. Custom Metrics
  3. Custom Alarms
  4. Anomaly Detection
  5. Best Practices
  6. Examples from Our Infrastructure

🌟 Overview

This document provides detailed information about custom alarms and metrics in CloudWatch, which are crucial for monitoring aspects of our infrastructure not covered by default AWS metrics.

Note: Custom metrics and alarms allow us to tailor our monitoring strategy to our specific needs, but they require careful management to avoid unnecessary costs and complexity.

📊 Custom Metrics

Custom metrics allow us to track data points specific to our application or infrastructure that are not provided by default AWS metrics.

Key Concepts

  1. Namespace: A container for CloudWatch metrics. Use a unique namespace for your custom metrics.
  2. Metric Name: The name of your metric (e.g., DatabaseConnections).
  3. Dimensions: Key-value pairs that further identify your metric (e.g., {Service: "UserAPI", Environment: "Production"}).

Publishing Custom Metrics

Publish custom metrics with the PutMetricData API, either through an AWS SDK or the AWS CLI. Here's a basic example using the AWS CLI:

aws cloudwatch put-metric-data --namespace "MyApplication" --metric-name "DatabaseConnections" --value 42 --dimensions Service=UserAPI,Environment=Production
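When publishing from application code rather than the CLI, the same data point is sent as a PutMetricData request body. The sketch below only builds that body, so it runs without the AWS SDK installed. The field names (Namespace, MetricData, MetricName, Dimensions, Value, Unit) follow the PutMetricData request shape; the helper names toDimensions and buildPutMetricDataInput, and the "Count" unit, are assumptions made for illustration and do not come from our codebase.

```typescript
// Local types mirroring the shape of a PutMetricData request.
// Declared here so the sketch has no dependency on the AWS SDK.
interface Dimension {
  Name: string;
  Value: string;
}

interface MetricDatum {
  MetricName: string;
  Dimensions: Dimension[];
  Value: number;
  Unit?: string;
}

interface PutMetricDataInput {
  Namespace: string;
  MetricData: MetricDatum[];
}

// Hypothetical helper: converts a plain key-value object into the
// Name/Value list the API expects for dimensions.
function toDimensions(pairs: Record<string, string>): Dimension[] {
  return Object.keys(pairs).map((key) => ({ Name: key, Value: pairs[key] }));
}

// Hypothetical helper: builds a request body for a single data point.
function buildPutMetricDataInput(
  namespace: string,
  metricName: string,
  value: number,
  dimensions: Record<string, string>,
  unit?: string
): PutMetricDataInput {
  const datum: MetricDatum = {
    MetricName: metricName,
    Dimensions: toDimensions(dimensions),
    Value: value,
  };
  if (unit !== undefined) {
    datum.Unit = unit;
  }
  return { Namespace: namespace, MetricData: [datum] };
}

// Same data point as the CLI example above; the unit is an assumption.
const input = buildPutMetricDataInput(
  "MyApplication",
  "DatabaseConnections",
  42,
  { Service: "UserAPI", Environment: "Production" },
  "Count"
);
console.log(JSON.stringify(input, null, 2));
```

In a real service, an object of this shape would be handed to the SDK's PutMetricData call; that wiring is left out here on purpose.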

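Important: Be mindful of costs when publishing custom metrics. Each unique combination of namespace, metric name, and dimension values counts as a separate metric and is billed separately, so avoid dimensions with many distinct values (for example, per-user or per-request IDs).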
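To make the billing rule concrete, here is a minimal sketch that counts how many distinct metrics a batch of data points would create. The MetricIdentity type and both function names are hypothetical, and the sketch models only the identity rule described above, not actual CloudWatch pricing.

```typescript
// Hypothetical description of what identifies one metric.
interface MetricIdentity {
  namespace: string;
  metricName: string;
  dimensions: Record<string, string>;
}

// Builds a canonical key for a metric. Dimension names are sorted so that
// the order in which dimensions were written does not change the key.
function metricKey(m: MetricIdentity): string {
  const dims = Object.keys(m.dimensions)
    .sort()
    .map((name) => name + "=" + m.dimensions[name])
    .join(",");
  return m.namespace + "|" + m.metricName + "|" + dims;
}

// Counts distinct metric identities in a list of published data points.
function countUniqueMetrics(metrics: MetricIdentity[]): number {
  const seen: Record<string, boolean> = {};
  let count = 0;
  for (const m of metrics) {
    const key = metricKey(m);
    if (!seen[key]) {
      seen[key] = true;
      count++;
    }
  }
  return count;
}

const published: MetricIdentity[] = [
  { namespace: "MyApplication", metricName: "DatabaseConnections",
    dimensions: { Service: "UserAPI", Environment: "Production" } },
  // Same identity as the first entry, only the dimension order differs.
  { namespace: "MyApplication", metricName: "DatabaseConnections",
    dimensions: { Environment: "Production", Service: "UserAPI" } },
  // A different Environment value creates a second metric.
  { namespace: "MyApplication", metricName: "DatabaseConnections",
    dimensions: { Service: "UserAPI", Environment: "Staging" } },
];

const uniqueMetricCount = countUniqueMetrics(published);
console.log(uniqueMetricCount);
```

Running a check like this over planned dimension values before rollout gives a rough idea of how many metrics a new dimension will add.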

🚨 Custom Alarms

Custom alarms allow us to trigger notifications based on the behavior of our custom metrics or complex conditions involving multiple metrics.

📢 Alarm Notifications

When our custom alarms are triggered, they send notifications to a Discord channel using a dedicated Lambda function. This function processes the alarm data and formats it for Discord. For more details on this process, see the Resource Alarm Discord Lambda Function Documentation.

Key Features

  1. Math Expressions: Perform calculations on metrics before evaluating the alarm condition.
  2. Composite Alarms: Combine the states of multiple alarms using AND, OR, and NOT logic.
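A composite alarm is defined by a rule expression that references other alarms by name. The sketch below assembles such an expression as a string, using the ALARM("name") function and the AND, OR, and NOT operators of the composite alarm rule syntax. The helper names and the DeploymentInProgress alarm are hypothetical and serve only to illustrate the idea.

```typescript
// A rule is kept as a plain string in this sketch.
type Rule = string;

// References the ALARM state of another alarm by name.
function inAlarm(alarmName: string): Rule {
  return 'ALARM("' + alarmName + '")';
}

// Joins rules with AND and wraps the result in parentheses.
function allOf(...rules: Rule[]): Rule {
  return "(" + rules.join(" AND ") + ")";
}

// Joins rules with OR and wraps the result in parentheses.
function anyOf(...rules: Rule[]): Rule {
  return "(" + rules.join(" OR ") + ")";
}

// Negates a rule.
function not(rule: Rule): Rule {
  return "NOT " + rule;
}

// Notify only when connections are high AND no deployment is in progress.
// "DeploymentInProgress" is a hypothetical alarm name.
const compositeRule = allOf(
  inAlarm("HighDatabaseConnections"),
  not(inAlarm("DeploymentInProgress"))
);
console.log(compositeRule);
```

The resulting string is what would be supplied as the rule of a composite alarm; suppressing notifications during deployments is one common reason to combine alarms this way.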

Creating a Custom Alarm

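Here's an example of creating a custom alarm using the AWS CLI. The alarm specifies the same dimensions the metric was published with, because CloudWatch treats each dimension combination as a separate metric and an alarm without them would watch a metric that receives no data:

aws cloudwatch put-metric-alarm --alarm-name "HighDatabaseConnections" --alarm-description "Database connections exceed threshold" --metric-name "DatabaseConnections" --namespace "MyApplication" --dimensions Name=Service,Value=UserAPI Name=Environment,Value=Production --statistic "Average" --period 300 --threshold 100 --comparison-operator "GreaterThanThreshold" --evaluation-periods 3 --alarm-actions "arn:aws:sns:us-east-1:123456789012:MyAlarmTopic"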

Tip: Always set a clear and descriptive alarm name and description to aid in troubleshooting and maintenance.
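The period, threshold, and evaluation-periods settings in the command above interact: with a 300-second period and 3 evaluation periods, the 5-minute average has to stay above 100 for 15 minutes before the alarm fires. The sketch below is a simplified model of that rule. It assumes every period has a data point and that all evaluation periods must breach; it ignores missing-data handling and the optional "datapoints to alarm" setting. The function name is hypothetical.

```typescript
// Simplified model of a GreaterThanThreshold alarm: returns true when the
// most recent `evaluationPeriods` datapoints are all strictly above the
// threshold. Returns false when there are not enough datapoints yet.
function breachesThreshold(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number
): boolean {
  if (datapoints.length < evaluationPeriods) {
    return false;
  }
  const recent = datapoints.slice(datapoints.length - evaluationPeriods);
  return recent.every((value) => value > threshold);
}

const periodSeconds = 300;
const evaluationPeriods = 3;
const threshold = 100;

// Minimum time the metric must stay above the threshold before alarming.
const minimumBreachSeconds = periodSeconds * evaluationPeriods;

// 5-minute averages, oldest first. The last three are all above 100.
const sustained = breachesThreshold([90, 120, 130, 101], threshold, evaluationPeriods);

// The latest value equals the threshold, which is not "greater than".
const atThreshold = breachesThreshold([120, 130, 100], threshold, evaluationPeriods);

console.log(minimumBreachSeconds, sustained, atThreshold);
```

Raising evaluation-periods makes the alarm slower but less sensitive to short spikes; lowering it does the opposite.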

🧠 Anomaly Detection

CloudWatch Anomaly Detection uses machine learning to create a model of expected metric values based on historical data.

Key Benefits

  1. Dynamic Thresholds: Adjusts to changing patterns in your metrics.
  2. Seasonality Awareness: Accounts for daily, weekly, or other cyclical patterns.

Setting Up Anomaly Detection

To set up anomaly detection for a metric:

  1. Open the CloudWatch console.
  2. Find your metric, select it, and open the "Graphed metrics" tab.
  3. Under the "Actions" column, choose "Enable anomaly detection".

Note: Anomaly detection works best with at least 2 weeks of historical data.

📝 Best Practices

  1. Meaningful Metrics: Only create custom metrics that provide actionable insights.
  2. Consistent Naming: Use a consistent naming convention for custom metrics and alarms.
  3. Regular Review: Periodically review custom metrics and alarms to ensure they're still relevant.
  4. Cost Management: Monitor the number of custom metrics to control CloudWatch costs.
  5. Use Anomaly Detection: For metrics with variable patterns, prefer anomaly detection over static thresholds.
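One way to enforce consistent naming is to generate alarm names through a single helper instead of writing them by hand. The sketch below follows the gh-discord-{stack}-cw-{region}-{suffix} pattern that appears in the infrastructure examples in this document. The helper itself, its kebab-case check, and the "prod" and "us-east-1" values are assumptions for illustration.

```typescript
// Lowercase words and digits separated by single hyphens.
const KEBAB_CASE = /^[a-z0-9]+(-[a-z0-9]+)*$/;

// Hypothetical helper: builds an alarm name following the
// gh-discord-{stack}-cw-{region}-{suffix} pattern and rejects
// suffixes that are not kebab-case.
function alarmName(stack: string, region: string, suffix: string): string {
  if (!KEBAB_CASE.test(suffix)) {
    throw new Error("Alarm suffix must be kebab-case: " + suffix);
  }
  return "gh-discord-" + stack + "-cw-" + region + "-" + suffix;
}

// "prod" and "us-east-1" are illustrative values.
const rdsAlarmName = alarmName("prod", "us-east-1", "rds-storage-alarm");
console.log(rdsAlarmName);
```

Centralizing the pattern also means a future change to the convention is a one-line edit.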

🛠 Examples from Our Infrastructure

Custom Alarm: RDS Storage Percentage

In our rds.resource-alarms.ts file, we have a custom alarm for RDS storage:

RDS Storage Percentage Alarm
const rdsStorageAlarm = new aws.cloudwatch.MetricAlarm(`gh-discord-${stack}-cw-${region}-rds-storage-alarm`, {
  // ...
  metricQueries: [
    {
      id: "freeStorage",
      metric: {
        metricName: "FreeStorageSpace",
        namespace: "AWS/RDS",
        dimensions: {
          DBInstanceIdentifier: postgresDb.identifier,
        },
        period: 300,
        stat: "Average",
      },
      returnData: false,
    },
    {
      id: "freeStoragePercent",
      expression: `(freeStorage / (${rdsStorageSize} * 1024 * 1024 * 1024)) * 100`,
      label: "Free Storage Percentage",
      returnData: true,
    },
  ],
  threshold: 20,
  // ...
});

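This alarm converts FreeStorageSpace, which RDS reports in bytes, into a percentage of the allocated storage (rdsStorageSize, expressed in GiB) and triggers when free storage falls below 20%. Only the computed percentage has returnData set to true, so the alarm evaluates the expression rather than the raw metric.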
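The arithmetic in the math expression above can be checked locally. The sketch below reproduces the same formula as a plain function. The 64 GiB allocation and the free-space figures are made-up values, and the "below 20" comparison is taken from the description of the alarm, since the comparison operator is elided in the excerpt.

```typescript
const BYTES_PER_GIB = 1024 * 1024 * 1024;

// Same formula as the alarm's math expression:
// (freeStorage / (allocatedGiB * 1024^3)) * 100
function freeStoragePercent(freeStorageBytes: number, allocatedGiB: number): number {
  return (freeStorageBytes / (allocatedGiB * BYTES_PER_GIB)) * 100;
}

// Assumed comparison: the alarm fires when the percentage is below the threshold.
function isBelowThreshold(percent: number, threshold: number): boolean {
  return percent < threshold;
}

const allocatedGiB = 64; // hypothetical allocation

// 8 GiB free out of 64 GiB.
const lowPercent = freeStoragePercent(8 * BYTES_PER_GIB, allocatedGiB);

// 16 GiB free out of 64 GiB.
const healthyPercent = freeStoragePercent(16 * BYTES_PER_GIB, allocatedGiB);

console.log(lowPercent, isBelowThreshold(lowPercent, 20));
console.log(healthyPercent, isBelowThreshold(healthyPercent, 20));
```

Note that the expression hardcodes the allocated size at deploy time, so the percentage is only accurate as long as rdsStorageSize matches the instance's actual allocated storage.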

Anomaly Detection: RabbitMQ Message Count

In our rmq.resource-alarms.ts file, we use anomaly detection for monitoring RabbitMQ message count:

RabbitMQ Message Count Anomaly Detection
const rabbitMQMessageCountHighAlarm = new aws.cloudwatch.MetricAlarm(`gh-discord-${stack}-cw-${region}-rabbitmq-message-count-high-alarm`, {
  // ...
  comparisonOperator: "GreaterThanUpperThreshold",
  thresholdMetricId: "ad1",
  metricQueries: [
    {
      id: "m1",
      metric: {
        metricName: "MessageCount",
        namespace: "AWS/AmazonMQ",
        period: 300,
        stat: "Sum",
        dimensions: {
          Broker: rabbitmqBroker.brokerName,
        },
      },
      returnData: true,
    },
    {
      id: "ad1",
      expression: "ANOMALY_DETECTION_BAND(m1, 10)",
      label: "MessageCount (Anomaly Detection Band)",
      returnData: true,
    },
  ],
  // ...
});

This alarm uses anomaly detection to identify unusual spikes in the RabbitMQ message count.

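Pro Tip: When using anomaly detection, adjust the band width (the 10 in the example above, which is the number of standard deviations around the expected value) to balance between catching real issues and avoiding false alarms. A lower value gives a tighter band and a more sensitive alarm; a higher value tolerates larger deviations.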
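The effect of the band width can be illustrated with a small sketch. CloudWatch derives the expected value and its variability from a trained model, which is not reproduced here; the sketch simply takes an assumed expected value and standard deviation and shows how the width parameter scales the band. All names and numbers are hypothetical.

```typescript
interface Band {
  lower: number;
  upper: number;
}

// Illustrative band: expected value plus or minus `width` standard deviations.
// This is a simplification, not CloudWatch's actual model.
function anomalyBand(expected: number, stdDev: number, width: number): Band {
  return {
    lower: expected - width * stdDev,
    upper: expected + width * stdDev,
  };
}

// Mirrors the GreaterThanUpperThreshold comparison used in the alarm.
function exceedsUpper(value: number, band: Band): boolean {
  return value > band.upper;
}

// Assumed model output for one period: 1000 messages expected, deviation of 50.
const narrowBand = anomalyBand(1000, 50, 2);
const wideBand = anomalyBand(1000, 50, 10);

// The same observed spike breaches the narrow band but not the wide one.
const observed = 1300;
console.log(narrowBand, exceedsUpper(observed, narrowBand));
console.log(wideBand, exceedsUpper(observed, wideBand));
```

With a width of 10, only very large spikes breach the band, which suits a noisy queue metric where small excursions are not actionable.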

Remember to keep this document updated as we add or modify custom metrics and alarms in our infrastructure. Regular reviews keep our monitoring strategy effective and aligned with our operational needs.