Alarms
🌟 Overview
This document describes the CloudWatch alarms that monitor the AWS resources in our infrastructure. These alarms help maintain the health, performance, and reliability of our services.
Note: All alarms are configured to send notifications to a Discord channel via Amazon SNS when triggered. Ensure that the SNS topic and Discord integration are properly set up.
🏗️ Alarm Structure
Each alarm is defined with the following properties:
| Property | Description |
|---|---|
| Name | The unique identifier for the alarm in CloudWatch |
| Description | A brief explanation of what the alarm monitors |
| Metric | The specific data point being measured |
| Threshold | The condition that triggers the alarm |
| Period | The time frame over which the metric is evaluated |
| Evaluation Periods | The number of consecutive periods the threshold must be breached to trigger the alarm |
| Actions | What happens when the alarm is triggered (typically an SNS notification) |
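As a rough illustration, the properties in the table map onto the parameters of CloudWatch's `PutMetricAlarm` API approximately as shown below. This is a sketch rather than our deployed configuration: the `AlarmSpec` class, the `AWS/RDS` namespace, the `Average` statistic, and the SNS topic ARN are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmSpec:
    """Hypothetical container mirroring the properties in the table above."""
    name: str
    description: str
    namespace: str              # assumption: namespaces are not listed in this document
    metric: str
    threshold: float
    comparison: str             # e.g. "GreaterThanThreshold"
    period_seconds: int = 300
    evaluation_periods: int = 1
    statistic: str = "Average"  # assumption: the statistic is not documented here
    actions: list = field(default_factory=list)

    def to_put_metric_alarm_params(self) -> dict:
        """Build keyword arguments in the shape PutMetricAlarm expects."""
        return {
            "AlarmName": self.name,
            "AlarmDescription": self.description,
            "Namespace": self.namespace,
            "MetricName": self.metric,
            "Statistic": self.statistic,
            "Period": self.period_seconds,
            "EvaluationPeriods": self.evaluation_periods,
            "Threshold": self.threshold,
            "ComparisonOperator": self.comparison,
            "AlarmActions": list(self.actions),
        }


# Placeholder ARN; substitute the real SNS topic.
EXAMPLE_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:example-alarm-topic"

rds_cpu = AlarmSpec(
    name="rdsCPUAlarm",
    description="Monitors CPU usage of the RDS instance",
    namespace="AWS/RDS",
    metric="CPUUtilization",
    threshold=80.0,
    comparison="GreaterThanThreshold",
    actions=[EXAMPLE_TOPIC_ARN],
)
params = rds_cpu.to_put_metric_alarm_params()
```

With boto3, a dictionary like this could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. A real alarm would also need `Dimensions` identifying the monitored resource.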
🔔 Alarm Notifications
Our CloudWatch alarms publish to Amazon SNS, and a Lambda function delivers the notifications to a Discord channel. For details on how these notifications are processed and sent, see the Resource Alarm Discord Lambda Function Documentation.
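The actual Lambda function is covered in the linked documentation. Purely to illustrate the general shape of such a handler, the sketch below turns an SNS-delivered alarm notification into a Discord webhook body. The function name `build_discord_payload` and the message layout are hypothetical, and the sample event is trimmed to the fields the sketch reads.

```python
import json


def build_discord_payload(sns_event: dict) -> dict:
    """Turn an SNS-delivered CloudWatch alarm notification into a Discord webhook body."""
    record = sns_event["Records"][0]["Sns"]
    # CloudWatch publishes the alarm details as a JSON string in the SNS message.
    alarm = json.loads(record["Message"])
    content = (
        f"[{alarm['NewStateValue']}] {alarm['AlarmName']}\n"
        f"{alarm.get('AlarmDescription') or ''}\n"
        f"Reason: {alarm['NewStateReason']}"
    )
    # Discord limits message content to 2000 characters.
    return {"content": content[:2000]}


# Trimmed sample of what the function would receive from SNS.
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "AlarmName": "rdsCPUAlarm",
                "AlarmDescription": "Monitors CPU usage of the RDS instance",
                "NewStateValue": "ALARM",
                "NewStateReason": "Threshold Crossed: 1 datapoint was greater than the threshold (80.0).",
            })
        }
    }]
}
payload = build_discord_payload(sample_event)
```

The resulting dictionary would be sent as JSON in an HTTP POST to the Discord webhook URL. That step is omitted here because it depends on how the webhook secret is stored.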
📊 Resource Alarms
💾 RDS Resource Alarms
These alarms monitor the health and performance of our Relational Database Service (RDS) instances.
RDS CPU Utilization Alarm
- Name: rdsCPUAlarm
- Description: Monitors CPU usage of the RDS instance
- Metric: CPUUtilization
- Threshold: > 80%
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
RDS RAM Utilization Alarm
- Name: rdsRamAlarm
- Description: Monitors RAM usage of the RDS instance using anomaly detection
- Metric: FreeableMemory
- Threshold: Below lower bound of anomaly detection band
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
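Anomaly detection alarms such as the RDS RAM alarm above have no fixed threshold. Instead they compare the metric against a band expression. The sketch below shows roughly how the `PutMetricAlarm` parameters could look. The helper name, the `Average` statistic, and the band width of 2 are assumptions, and a real alarm would also need dimensions and actions.

```python
def anomaly_alarm_params(name, namespace, metric, direction, band_width=2, period=300):
    """Build PutMetricAlarm-style parameters for an anomaly detection alarm.

    direction: "below" alarms under the lower bound, "above" over the upper bound.
    """
    operators = {
        "below": "LessThanLowerThreshold",
        "above": "GreaterThanUpperThreshold",
    }
    return {
        "AlarmName": name,
        "ComparisonOperator": operators[direction],
        "EvaluationPeriods": 1,
        # Anomaly alarms point at the band expression instead of a fixed Threshold.
        "ThresholdMetricId": "ad1",
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric},
                    "Period": period,
                    "Stat": "Average",  # assumption: statistic is not documented here
                },
            },
            {
                "Id": "ad1",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric} (expected)",
            },
        ],
    }


rds_ram = anomaly_alarm_params("rdsRamAlarm", "AWS/RDS", "FreeableMemory", "below")
```

Alarms that fire above the upper bound would use the same structure with `direction="above"`.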
RDS Storage Alarm
- Name: rdsStorageAlarm
- Description: Monitors available storage space in the RDS instance
- Metric: FreeStorageSpace
- Threshold: ≤ 20% of total storage
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
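CloudWatch reports `FreeStorageSpace` in bytes, so a percentage threshold has to be converted into an absolute byte value based on the instance's allocated storage. A minimal sketch of that arithmetic follows. The helper name and the 100 GiB instance size are hypothetical.

```python
GIB = 1024 ** 3


def free_storage_threshold_bytes(allocated_gib: int, free_percent: int = 20) -> int:
    """Return the FreeStorageSpace value (bytes) equal to free_percent of allocated storage."""
    if not 0 < free_percent < 100:
        raise ValueError("free_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return allocated_gib * GIB * free_percent // 100


# Hypothetical 100 GiB instance: alarm when free space is at or below 20 GiB.
threshold = free_storage_threshold_bytes(100)
```

The alarm would then compare the metric against this value with the `LessThanOrEqualToThreshold` operator. The threshold must be recalculated whenever allocated storage changes.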
RDS Connection Pool Alarm
- Name: rdsConnectionPoolAlarm
- Description: Monitors the number of database connections
- Metric: DatabaseConnections
- Threshold: > 20 connections
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
🌐 Web Application Resource Alarms
These alarms monitor the performance and health of our web application running on ECS.
Web Application CPU Utilization Alarm
- Name: webappCpuAlarm
- Description: Monitors CPU usage of the web application ECS service
- Metric: CPUUtilization
- Threshold: > 80%
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
Web Application RAM Utilization Alarm
- Name: webappRamAlarm
- Description: Monitors RAM usage of the web application ECS service
- Metric: MemoryUtilization
- Threshold: > 80%
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
Unauthorized Requests Alarm
- Name: unauthorizedRequestsAlarm
- Description: Monitors the number of 4XX HTTP response codes
- Metric: HTTPCode_Target_4XX_Count
- Threshold: > 100 in 5 minutes
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
Success/Failure Ratio Alarm
- Name: successFailureRatioAlarm
- Description: Monitors the ratio of successful requests to total requests
- Metric: Custom metric (success / total requests)
- Threshold: < 95%
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
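The success ratio is a derived value, so it helps to spell out the calculation and the edge case where no requests arrive. The sketch below is illustrative only. The function names are hypothetical, and treating an idle period as "not breaching" is an assumption (in CloudWatch this is governed by the alarm's missing-data setting).

```python
from typing import Optional


def success_ratio_percent(success: int, total: int) -> Optional[float]:
    """Return successful requests as a percentage of all requests, or None if idle."""
    if total == 0:
        return None  # no datapoint; avoids dividing by zero on idle periods
    return 100.0 * success / total


def breaches(ratio: Optional[float], threshold: float = 95.0) -> bool:
    """True when the ratio falls below the threshold; missing data does not breach here."""
    return ratio is not None and ratio < threshold


# 940 successes out of 1000 requests is 94%, which is below the 95% threshold.
example = breaches(success_ratio_percent(940, 1000))
```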
Failure Requests Alarm
- Name: failureRequestsAlarm
- Description: Monitors the number of 5XX HTTP response codes
- Metric: HTTPCode_Target_5XX_Count
- Threshold: > 10 in 5 minutes
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
Request Latency Alarm
- Name: requestLatencyAlarm
- Description: Monitors the average response time of requests
- Metric: TargetResponseTime
- Threshold: > 1.5 seconds (production) or > 5 seconds (staging)
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
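Because the latency threshold differs between environments, one way to keep the values in a single place is a small lookup keyed by environment name. This is a sketch under assumptions: the names `LATENCY_THRESHOLD_SECONDS` and `latency_threshold` are hypothetical, and the environment labels are taken from the threshold line above.

```python
# TargetResponseTime is reported in seconds.
LATENCY_THRESHOLD_SECONDS = {
    "production": 1.5,
    "staging": 5.0,
}


def latency_threshold(environment: str) -> float:
    """Look up the latency threshold, failing loudly for unknown environments."""
    try:
        return LATENCY_THRESHOLD_SECONDS[environment]
    except KeyError:
        raise ValueError(f"no latency threshold defined for {environment!r}") from None


prod_threshold = latency_threshold("production")
```

Raising on an unknown environment, rather than falling back to a default, keeps a typo in a deployment name from silently creating an alarm with the wrong threshold.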
Service Health Check Alarm
- Name: serviceHealthCheckAlarm
- Description: Monitors the health of the ECS service
- Metric: HealthyHostCount
- Threshold: < 1 healthy host
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
🐳 ECS Cluster Resource Alarms
These alarms monitor the overall health and performance of our ECS cluster.
ECS Memory Utilization Alarm
- Name: memoryUtilizationAlarm
- Description: Monitors memory usage of the ECS cluster
- Metric: MemoryUtilization (custom metric)
- Threshold: > 80%
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
ECS CPU Utilization Alarm
- Name: cpuUtilizationAlarm
- Description: Monitors CPU usage of the ECS cluster
- Metric: CPUUtilization (custom metric)
- Threshold: > 80%
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
ECS Ephemeral Storage Utilization Alarm
- Name: ephemeralStorageUtilizationAlarm
- Description: Monitors ephemeral storage usage of the ECS cluster
- Metric: EphemeralStorageUtilization (custom metric)
- Threshold: > 80%
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
ECS Network Transmit Bytes Alarm
- Name: networkTxBytesAlarm
- Description: Monitors network transmit bytes of the ECS cluster using anomaly detection
- Metric: NetworkTxBytes
- Threshold: Above upper bound of anomaly detection band
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
🐰 RabbitMQ Resource Alarms
These alarms monitor the health and performance of our RabbitMQ message broker.
RabbitMQ CPU Utilization Alarm
- Name: rabbitMQCpuAlarm
- Description: Monitors CPU usage of the RabbitMQ broker
- Metric: SystemCpuUtilization
- Threshold: > 80%
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
RabbitMQ RAM Utilization Alarm
- Name: rabbitMQRamAlarm
- Description: Monitors RAM usage of the RabbitMQ broker
- Metric: RabbitMQMemUsed
- Threshold: > 7GB (production) or > 800MB (staging)
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
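`RabbitMQMemUsed` is reported in bytes, so the thresholds above need to be expressed as byte counts when the alarm is defined. The sketch below shows the conversion. It assumes decimal units (1 GB = 10^9 bytes), which this document does not specify. If binary units were intended, the `GiB` and `MiB` entries apply instead. The helper and constant names are hypothetical.

```python
UNITS = {"MB": 10 ** 6, "GB": 10 ** 9, "MiB": 2 ** 20, "GiB": 2 ** 30}


def to_bytes(value: int, unit: str) -> int:
    """Convert a size in the given unit to bytes."""
    return value * UNITS[unit]


# Assumption: decimal units; swap in "GiB"/"MiB" if binary units were intended.
RABBITMQ_MEM_THRESHOLD_BYTES = {
    "production": to_bytes(7, "GB"),
    "staging": to_bytes(800, "MB"),
}
```

The two interpretations differ by roughly 7% at the gigabyte scale, which is enough to matter for a memory alarm.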
RabbitMQ Storage Alarm
- Name: rabbitMQStorageAlarm
- Description: Monitors available disk space of the RabbitMQ broker
- Metric: Custom metric (RabbitMQDiskFree / RabbitMQDiskFreeLimit)
- Threshold: < 20%
- Period: 60 seconds (1 minute)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
RabbitMQ Message Count High Alarm
- Name: rabbitMQMessageCountHighAlarm
- Description: Monitors the number of messages in RabbitMQ queues using anomaly detection
- Metric: MessageCount
- Threshold: Above upper bound of anomaly detection band
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
RabbitMQ Consumer Health Alarm
- Name: rabbitMQConsumerHealthAlarm
- Description: Monitors the number of active consumers
- Metric: ConsumerCount
- Threshold: < 2 consumers
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
🥬 Celery Resource Alarms
These alarms monitor the performance of our Celery task queue workers and scheduler.
Celery Scheduler CPU Utilization Alarm
- Name: celerySchedulerCpuAlarm
- Description: Monitors CPU usage of the Celery scheduler ECS service
- Metric: CPUUtilization
- Threshold: > 80%
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
Celery Worker CPU Utilization Alarm
- Name: celeryWorkerCpuAlarm
- Description: Monitors CPU usage of the Celery worker ECS service
- Metric: CPUUtilization
- Threshold: > 80%
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
Celery Scheduler RAM Utilization Alarm
- Name: celerySchedulerRamAlarm
- Description: Monitors RAM usage of the Celery scheduler ECS service
- Metric: MemoryUtilization
- Threshold: > 80%
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
Celery Worker RAM Utilization Alarm
- Name: celeryWorkerRamAlarm
- Description: Monitors RAM usage of the Celery worker ECS service
- Metric: MemoryUtilization
- Threshold: > 80%
- Period: 300 seconds (5 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
📁 EFS Resource Alarms
These alarms monitor the usage and performance of our Elastic File System (EFS).
EFS Usage Alarm
- Name: efsUsageAlarm
- Description: Monitors storage usage of the EFS file system
- Metric: StorageBytes
- Threshold: > 1GB
- Period: 900 seconds (15 minutes)
- Evaluation Periods: 1
- Actions: Notification sent to Discord via SNS
📝 Best Practices
- Regular Review: Periodically review and adjust alarm thresholds based on your application's evolving needs and performance characteristics.
- Actionable Alarms: Ensure that each alarm is actionable. An alarm should indicate a specific issue that requires attention.
- Avoid Alarm Fatigue: Balance sensitivity with practicality to avoid too many false positives, which can lead to ignored alarms.
- Documentation: Keep this document updated as you add or modify alarms. Include reasons for threshold choices where applicable.
- Testing: Regularly test your alarms to ensure they're functioning as expected and that notifications are being received.
- Escalation Plan: Have a clear escalation plan for critical alarms that require immediate attention.
- Composite Alarms: Consider using composite alarms for complex conditions that depend on multiple metrics.
- Use of Anomaly Detection: Leverage anomaly detection for metrics with variable patterns that are hard to define with static thresholds.
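To make the composite alarm suggestion concrete, the sketch below builds a CloudWatch composite alarm rule that fires only when several existing alarms are in the ALARM state at once. Pairing `webappCpuAlarm` with `webappRamAlarm` is just an example, not an alarm we currently have, and the helper name is hypothetical.

```python
def all_in_alarm(*alarm_names: str) -> str:
    """Build a composite AlarmRule that is true only when every named alarm is in ALARM."""
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


# Example: only notify when the web application is under both CPU and memory pressure.
rule = all_in_alarm("webappCpuAlarm", "webappRamAlarm")
```

The resulting string is what the `AlarmRule` parameter of `PutCompositeAlarm` expects. Requiring both conditions can reduce noise from short spikes in a single metric.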
🔧 Custom Alarms and Metrics
For detailed information on our custom alarms and metrics, please refer to the Custom Alarms and Metrics Documentation.
🔗 Useful Links
- AWS CloudWatch Documentation
- Best Practices for Monitoring
- Setting Up SNS Notifications
- CloudWatch Anomaly Detection
- Creating Composite Alarms
Keep this document up to date as the infrastructure evolves. Regular reviews ensure that the monitoring strategy remains effective and aligned with operational needs.