
Alarms

📋 Table of Contents

  1. Overview
  2. Alarm Structure
  3. Resource Alarms
  4. Best Practices
  5. Useful Links
  6. Custom Alarms and Metrics

🌟 Overview

This document outlines the CloudWatch alarms configured to monitor various AWS resources in our infrastructure. These alarms play a crucial role in maintaining the health, performance, and reliability of our services.

Note: All alarms are configured to send notifications to a Discord channel via Amazon SNS when triggered. Ensure that the SNS topic and Discord integration are properly set up.

🏗️ Alarm Structure

Each alarm is defined with the following properties:

| Property | Description |
|---|---|
| Name | The unique identifier for the alarm in CloudWatch |
| Description | A brief explanation of what the alarm monitors |
| Metric | The specific data point being measured |
| Threshold | The condition that triggers the alarm |
| Period | The time frame over which the metric is evaluated |
| Evaluation Periods | The number of consecutive periods the threshold must be breached to trigger the alarm |
| Actions | What happens when the alarm is triggered (typically an SNS notification) |
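
As a rough illustration of how these properties could map onto the parameter names of the CloudWatch PutMetricAlarm API, here is a minimal Python sketch. The `AlarmDefinition` class, the `namespace`, `statistic`, and `comparison` defaults, and the SNS topic ARN are assumptions made for this example; they are not taken from our actual configuration.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmDefinition:
    """One alarm, described with the properties listed above."""
    name: str
    description: str
    metric: str
    threshold: float
    period_seconds: int = 300
    evaluation_periods: int = 1
    actions: list = field(default_factory=list)
    # Assumptions: not listed as properties above, but PutMetricAlarm needs them.
    namespace: str = "AWS/RDS"
    statistic: str = "Average"
    comparison: str = "GreaterThanThreshold"

    def to_put_metric_alarm_kwargs(self) -> dict:
        """Map the fields onto PutMetricAlarm parameter names."""
        return {
            "AlarmName": self.name,
            "AlarmDescription": self.description,
            "Namespace": self.namespace,
            "MetricName": self.metric,
            "Statistic": self.statistic,
            "ComparisonOperator": self.comparison,
            "Threshold": self.threshold,
            "Period": self.period_seconds,
            "EvaluationPeriods": self.evaluation_periods,
            "AlarmActions": self.actions,
        }


# Hypothetical SNS topic ARN, for illustration only.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:example-alarm-topic"

rds_cpu = AlarmDefinition(
    name="rdsCPUAlarm",
    description="Monitors CPU usage of the RDS instance",
    metric="CPUUtilization",
    threshold=80.0,
    actions=[TOPIC_ARN],
)
kwargs = rds_cpu.to_put_metric_alarm_kwargs()
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`. A real alarm would also need `Dimensions` to target a specific resource.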

🔔 Alarm Notifications

Our CloudWatch alarms are configured to send notifications to a Discord channel using a Lambda function. For detailed information on how these notifications are processed and sent, please refer to the Resource Alarm Discord Lambda Function Documentation.
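
To give a feel for the notification flow, the sketch below shows one way a Lambda function might turn an SNS-delivered alarm event into a Discord webhook body. This is not the actual Lambda implementation; the function name `build_discord_payload`, the message layout, and the sample event are hypothetical. It relies only on the standard SNS event shape and the alarm fields CloudWatch includes in its notification message.

```python
import json


def build_discord_payload(sns_event: dict) -> dict:
    """Turn an SNS-delivered CloudWatch alarm event into a Discord webhook body."""
    record = sns_event["Records"][0]["Sns"]
    # CloudWatch publishes the alarm details as a JSON string in the SNS message.
    alarm = json.loads(record["Message"])
    content = (
        f"[{alarm['NewStateValue']}] {alarm['AlarmName']}\n"
        f"{alarm['NewStateReason']}"
    )
    # Discord webhooks accept a JSON body with a "content" field,
    # limited to 2000 characters.
    return {"content": content[:2000]}


# Hypothetical event, trimmed to the fields used above.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "rdsCPUAlarm",
                        "NewStateValue": "ALARM",
                        "NewStateReason": "Threshold Crossed: 1 datapoint was greater than 80.0.",
                    }
                )
            }
        }
    ]
}

payload = build_discord_payload(sample_event)
```

Sending the payload (for example with `urllib.request` to a webhook URL stored as a secret) is left out so the formatting logic can be tested without network access.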

📊 Resource Alarms

💾 RDS Resource Alarms

These alarms monitor the health and performance of our Relational Database Service (RDS) instances.

RDS CPU Utilization Alarm
  • Name: rdsCPUAlarm
  • Description: Monitors CPU usage of the RDS instance
  • Metric: CPUUtilization
  • Threshold: > 80%
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
RDS RAM Utilization Alarm
  • Name: rdsRamAlarm
  • Description: Monitors RAM usage of the RDS instance using anomaly detection
  • Metric: FreeableMemory
  • Threshold: Below lower bound of anomaly detection band
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
RDS Storage Alarm
  • Name: rdsStorageAlarm
  • Description: Monitors available storage space in the RDS instance
  • Metric: FreeStorageSpace
  • Threshold: ≤ 20% of total storage
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
RDS Connection Pool Alarm
  • Name: rdsConnectionPoolAlarm
  • Description: Monitors the number of database connections
  • Metric: DatabaseConnections
  • Threshold: > 20 connections
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
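
FreeStorageSpace is reported in bytes, so a "20% of total storage" threshold has to be converted into an absolute byte count. The sketch below shows that arithmetic; the 100 GiB allocated size and the helper names are hypothetical and only serve the example.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def free_storage_threshold_bytes(allocated_gib: int, percent: int = 20) -> int:
    """Return the FreeStorageSpace value (bytes) equal to `percent` of total storage."""
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return allocated_gib * GIB * percent // 100


def is_storage_low(free_bytes: int, allocated_gib: int, percent: int = 20) -> bool:
    """True when free space is at or below the threshold (the alarm condition is <=)."""
    return free_bytes <= free_storage_threshold_bytes(allocated_gib, percent)


# Hypothetical instance with 100 GiB of allocated storage.
threshold = free_storage_threshold_bytes(100)
```

For a 100 GiB instance this gives a threshold of 20 GiB, so the alarm condition holds once free space drops to 20 GiB or less.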

🌐 Web Application Resource Alarms

These alarms monitor the performance and health of our web application running on ECS.

Web Application CPU Utilization Alarm
  • Name: webappCpuAlarm
  • Description: Monitors CPU usage of the web application ECS service
  • Metric: CPUUtilization
  • Threshold: > 80%
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
Web Application RAM Utilization Alarm
  • Name: webappRamAlarm
  • Description: Monitors RAM usage of the web application ECS service
  • Metric: MemoryUtilization
  • Threshold: > 80%
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
Unauthorized Requests Alarm
  • Name: unauthorizedRequestsAlarm
  • Description: Monitors the number of 4XX HTTP response codes
  • Metric: HTTPCode_Target_4XX_Count
  • Threshold: > 100 in 5 minutes
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
Success/Failure Ratio Alarm
  • Name: successFailureRatioAlarm
  • Description: Monitors the ratio of successful requests to total requests
  • Metric: Custom metric (success / total requests)
  • Threshold: < 95%
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
Failure Requests Alarm
  • Name: failureRequestsAlarm
  • Description: Monitors the number of 5XX HTTP response codes
  • Metric: HTTPCode_Target_5XX_Count
  • Threshold: > 10 in 5 minutes
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
Request Latency Alarm
  • Name: requestLatencyAlarm
  • Description: Monitors the average response time of requests
  • Metric: TargetResponseTime
  • Threshold: > 1.5 seconds (production) or > 5 seconds (staging)
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
Service Health Check Alarm
  • Name: serviceHealthCheckAlarm
  • Description: Monitors the health of the ECS service
  • Metric: HealthyHostCount
  • Threshold: < 1 healthy host
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
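
The success ratio used by successFailureRatioAlarm can be illustrated with a small calculation. This is a sketch under assumptions: the function names are hypothetical, and treating a period with no traffic as healthy is a design choice made here, not something the alarm definition states.

```python
def success_ratio_percent(successful_requests: int, total_requests: int) -> float:
    """Return successful / total as a percentage."""
    if total_requests == 0:
        # Assumption: a period with no traffic is treated as fully healthy.
        return 100.0
    return 100.0 * successful_requests / total_requests


def breaches_success_threshold(
    successful_requests: int, total_requests: int, threshold_percent: float = 95.0
) -> bool:
    """True when the ratio falls below the threshold (the alarm condition is <)."""
    ratio = success_ratio_percent(successful_requests, total_requests)
    return ratio < threshold_percent


healthy = breaches_success_threshold(960, 1000)    # 96% success
unhealthy = breaches_success_threshold(940, 1000)  # 94% success
```

With 1,000 requests in a 5-minute period, more than 50 unsuccessful requests would push the ratio below 95%.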

🐳 ECS Cluster Resource Alarms

These alarms monitor the overall health and performance of our ECS cluster.

ECS Memory Utilization Alarm
  • Name: memoryUtilizationAlarm
  • Description: Monitors memory usage of the ECS cluster
  • Metric: MemoryUtilization (custom metric)
  • Threshold: > 80%
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
ECS CPU Utilization Alarm
  • Name: cpuUtilizationAlarm
  • Description: Monitors CPU usage of the ECS cluster
  • Metric: CPUUtilization (custom metric)
  • Threshold: > 80%
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
ECS Ephemeral Storage Utilization Alarm
  • Name: ephemeralStorageUtilizationAlarm
  • Description: Monitors ephemeral storage usage of the ECS cluster
  • Metric: EphemeralStorageUtilization (custom metric)
  • Threshold: > 80%
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
ECS Network Transmit Bytes Alarm
  • Name: networkTxBytesAlarm
  • Description: Monitors network transmit bytes of the ECS cluster using anomaly detection
  • Metric: NetworkTxBytes
  • Threshold: Above upper bound of anomaly detection band
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
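
Anomaly detection alarms such as networkTxBytesAlarm compare a metric against a band rather than a fixed number. The sketch below builds the kind of parameter set PutMetricAlarm expects for that case. The namespace `ECS/ContainerInsights`, the cluster name `example-cluster`, the `Average` statistic, and the band width of 2 standard deviations are all assumptions for illustration.

```python
def anomaly_alarm_kwargs(
    name: str,
    namespace: str,
    metric: str,
    dimensions: list,
    band_width: int = 2,
    period: int = 300,
    upper: bool = True,
) -> dict:
    """Build PutMetricAlarm parameters for an anomaly detection alarm."""
    comparison = "GreaterThanUpperThreshold" if upper else "LessThanLowerThreshold"
    return {
        "AlarmName": name,
        "ComparisonOperator": comparison,
        "EvaluationPeriods": 1,
        # The alarm compares metric "m1" against the band defined in "band".
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric,
                        "Dimensions": dimensions,
                    },
                    "Period": period,
                    "Stat": "Average",  # assumption
                },
            },
            {
                "Id": "band",
                "ReturnData": True,
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric} (expected)",
            },
        ],
    }


# Hypothetical namespace and cluster name.
network_tx = anomaly_alarm_kwargs(
    name="networkTxBytesAlarm",
    namespace="ECS/ContainerInsights",
    metric="NetworkTxBytes",
    dimensions=[{"Name": "ClusterName", "Value": "example-cluster"}],
)
```

Passing `upper=False` would produce the "below lower bound" variant, which is the shape an alarm like rdsRamAlarm on FreeableMemory would need.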

🐰 RabbitMQ Resource Alarms

These alarms monitor the health and performance of our RabbitMQ message broker.

RabbitMQ CPU Utilization Alarm
  • Name: rabbitMQCpuAlarm
  • Description: Monitors CPU usage of the RabbitMQ broker
  • Metric: SystemCpuUtilization
  • Threshold: > 80%
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
RabbitMQ RAM Utilization Alarm
  • Name: rabbitMQRamAlarm
  • Description: Monitors RAM usage of the RabbitMQ broker
  • Metric: RabbitMQMemUsed
  • Threshold: > 7GB (production) or > 800MB (staging)
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
RabbitMQ Storage Alarm
  • Name: rabbitMQStorageAlarm
  • Description: Monitors available disk space of the RabbitMQ broker
  • Metric: Custom metric (RabbitMQDiskFree / RabbitMQDiskFreeLimit)
  • Threshold: < 20%
  • Period: 60 seconds (1 minute)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
RabbitMQ Message Count High Alarm
  • Name: rabbitMQMessageCountHighAlarm
  • Description: Monitors the number of messages in RabbitMQ queues using anomaly detection
  • Metric: MessageCount
  • Threshold: Above upper bound of anomaly detection band
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
RabbitMQ Consumer Health Alarm
  • Name: rabbitMQConsumerHealthAlarm
  • Description: Monitors the number of active consumers
  • Metric: ConsumerCount
  • Threshold: < 2 consumers
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
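
Some thresholds differ between production and staging, and RabbitMQMemUsed is reported in bytes, so the 7GB and 800MB limits need converting. The sketch below shows one way to keep those values in a single lookup. It assumes binary units (GiB and MiB); if the limits are meant as decimal GB and MB, the constants would differ. The names used are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Per-environment RabbitMQMemUsed thresholds, in bytes.
# Assumption: "7GB" and "800MB" are interpreted as 7 GiB and 800 MiB.
RABBITMQ_MEM_THRESHOLD_BYTES = {
    "production": 7 * GIB,
    "staging": 800 * MIB,
}


def rabbitmq_mem_threshold_bytes(environment: str) -> int:
    """Return the RAM alarm threshold in bytes for the given environment."""
    try:
        return RABBITMQ_MEM_THRESHOLD_BYTES[environment]
    except KeyError:
        raise ValueError(f"Unknown environment: {environment!r}") from None


production_threshold = rabbitmq_mem_threshold_bytes("production")
staging_threshold = rabbitmq_mem_threshold_bytes("staging")
```

Failing loudly on an unknown environment name avoids silently creating an alarm with the wrong limit.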

🥬 Celery Resource Alarms

These alarms monitor the performance of our Celery task queue workers and scheduler.

Celery Scheduler CPU Utilization Alarm
  • Name: celerySchedulerCpuAlarm
  • Description: Monitors CPU usage of the Celery scheduler ECS service
  • Metric: CPUUtilization
  • Threshold: > 80%
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
Celery Worker CPU Utilization Alarm
  • Name: celeryWorkerCpuAlarm
  • Description: Monitors CPU usage of the Celery worker ECS service
  • Metric: CPUUtilization
  • Threshold: > 80%
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
Celery Scheduler RAM Utilization Alarm
  • Name: celerySchedulerRamAlarm
  • Description: Monitors RAM usage of the Celery scheduler ECS service
  • Metric: MemoryUtilization
  • Threshold: > 80%
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
Celery Worker RAM Utilization Alarm
  • Name: celeryWorkerRamAlarm
  • Description: Monitors RAM usage of the Celery worker ECS service
  • Metric: MemoryUtilization
  • Threshold: > 80%
  • Period: 300 seconds (5 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS
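
The four Celery alarms share the same threshold, period, and evaluation settings and differ only in service and metric, so they could be generated from a small loop instead of being written out by hand. The following is a sketch of that idea; the dictionary layout and function name are hypothetical and do not reflect how the alarms are actually provisioned.

```python
SERVICES = ["Scheduler", "Worker"]
METRICS = {"Cpu": "CPUUtilization", "Ram": "MemoryUtilization"}


def celery_alarm_definitions() -> list:
    """Generate one alarm definition per Celery service and metric."""
    definitions = []
    for service in SERVICES:
        for suffix, metric in METRICS.items():
            definitions.append(
                {
                    "name": f"celery{service}{suffix}Alarm",
                    "metric": metric,
                    "threshold_percent": 80,
                    "period_seconds": 300,
                    "evaluation_periods": 1,
                }
            )
    return definitions


alarms = celery_alarm_definitions()
alarm_names = sorted(alarm["name"] for alarm in alarms)
```

Generating the definitions keeps the shared settings in one place, so a threshold change applies to all four alarms at once.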

📁 EFS Resource Alarms

These alarms monitor the usage and performance of our Elastic File System (EFS).

EFS Usage Alarm
  • Name: efsUsageAlarm
  • Description: Monitors storage usage of the EFS file system
  • Metric: StorageBytes
  • Threshold: > 1GB
  • Period: 900 seconds (15 minutes)
  • Evaluation Periods: 1
  • Actions: Notification sent to Discord via SNS

📝 Best Practices

  1. Regular Review: Periodically review and adjust alarm thresholds based on your application's evolving needs and performance characteristics.

  2. Actionable Alarms: Ensure that each alarm is actionable. An alarm should indicate a specific issue that requires attention.

  3. Avoid Alarm Fatigue: Balance sensitivity with practicality to avoid too many false positives, which can lead to ignored alarms.

  4. Documentation: Keep this document updated as you add or modify alarms. Include reasons for threshold choices where applicable.

  5. Testing: Regularly test your alarms to ensure they're functioning as expected and that notifications are being received.

  6. Escalation Plan: Have a clear escalation plan for critical alarms that require immediate attention.

  7. Composite Alarms: Consider using composite alarms for complex conditions that depend on multiple metrics.

  8. Use of Anomaly Detection: Leverage anomaly detection for metrics with variable patterns that are hard to define with static thresholds.
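
To make the alarm fatigue point concrete, the sketch below simulates how the number of evaluation periods affects how often an alarm fires on a noisy metric. It is a simplified model: real CloudWatch alarms also have an INSUFFICIENT_DATA state, missing-data handling, and an optional "M out of N" setting, none of which are modeled here. The sample CPU values are made up.

```python
def alarm_states(datapoints: list, threshold: float, evaluation_periods: int) -> list:
    """Return "ALARM" or "OK" after each datapoint.

    The alarm fires only when the most recent `evaluation_periods`
    datapoints are all above the threshold.
    """
    states = []
    for i in range(len(datapoints)):
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaching = len(window) == evaluation_periods and all(
            value > threshold for value in window
        )
        states.append("ALARM" if breaching else "OK")
    return states


# Made-up CPU percentages, one per 5-minute period.
cpu = [50, 85, 60, 82, 88, 91, 70]

one_period = alarm_states(cpu, threshold=80, evaluation_periods=1)
three_periods = alarm_states(cpu, threshold=80, evaluation_periods=3)
```

With one evaluation period, the single spike to 85 already triggers a notification. With three, only the sustained run of 82, 88, 91 does, at the cost of a slower reaction to a genuine problem.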

🔧 Custom Alarms and Metrics

For detailed information on our custom alarms and metrics, please refer to the Custom Alarms and Metrics Documentation.


Remember to keep this document up to date as the infrastructure evolves. Regular reviews and updates ensure that the monitoring strategy remains effective and aligned with operational needs.