Alarms

🌟 Overview

This document describes the CloudWatch alarms that monitor the AWS resources in our infrastructure. The alarms flag high resource usage, error spikes, and unhealthy services so that we can respond before reliability is affected.

Note: When triggered, every alarm publishes to an Amazon SNS topic, and the notification is forwarded to a Discord channel. Make sure the SNS topic and the Discord integration are set up correctly before relying on these alarms.

🏗️ Alarm Structure

Each alarm is defined with the following properties:

Property	Description
Name	The unique identifier for the alarm in CloudWatch
Description	A brief explanation of what the alarm monitors
Metric	The specific data point being measured
Threshold	The condition that triggers the alarm
Period	The time frame over which the metric is evaluated
Evaluation Periods	The number of consecutive periods the threshold must be breached to trigger the alarm
Actions	What happens when the alarm is triggered (typically an SNS notification)
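
As an illustration, the properties in the table map fairly directly onto the parameters of CloudWatch's PutMetricAlarm API. The sketch below builds that parameter dictionary for the rdsCPUAlarm values documented on this page. The `AlarmDefinition` class, the `Average` statistic, the DB instance identifier, and the SNS topic ARN are hypothetical placeholders, not our actual configuration.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmDefinition:
    """Hypothetical container mirroring the properties in the table above."""
    name: str
    description: str
    namespace: str
    metric: str
    threshold: float
    comparison: str                      # e.g. "GreaterThanThreshold"
    period_seconds: int = 300
    evaluation_periods: int = 1
    actions: list = field(default_factory=list)      # SNS topic ARNs
    dimensions: dict = field(default_factory=dict)   # dimension name -> value

    def to_put_metric_alarm_params(self) -> dict:
        """Return keyword arguments in the shape PutMetricAlarm expects."""
        return {
            "AlarmName": self.name,
            "AlarmDescription": self.description,
            "Namespace": self.namespace,
            "MetricName": self.metric,
            "Dimensions": [
                {"Name": key, "Value": value}
                for key, value in self.dimensions.items()
            ],
            "Statistic": "Average",  # assumption: the statistic is not documented here
            "Period": self.period_seconds,
            "EvaluationPeriods": self.evaluation_periods,
            "Threshold": self.threshold,
            "ComparisonOperator": self.comparison,
            "AlarmActions": list(self.actions),
        }


rds_cpu = AlarmDefinition(
    name="rdsCPUAlarm",
    description="Monitors CPU usage of the RDS instance",
    namespace="AWS/RDS",
    metric="CPUUtilization",
    threshold=80.0,
    comparison="GreaterThanThreshold",
    actions=["arn:aws:sns:us-east-1:123456789012:example-alarm-topic"],  # placeholder ARN
    dimensions={"DBInstanceIdentifier": "example-db"},                   # placeholder instance
)
params = rds_cpu.to_put_metric_alarm_params()
```

With boto3, a dictionary like this could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.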

🔔 Alarm Notifications

A Lambda function receives the alarm notifications and posts them to the Discord channel. For details on how these notifications are processed and sent, refer to the Resource Alarm Discord Lambda Function Documentation.
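
To give a rough idea of what such a function deals with, the sketch below converts an SNS-delivered CloudWatch alarm event into a Discord webhook body. It is not the actual Lambda code (see the linked documentation for that): the function name `build_discord_payload` and the message layout are assumptions, and only a few fields of the alarm message are used.

```python
import json


def build_discord_payload(sns_event: dict) -> dict:
    """Turn an SNS-delivered CloudWatch alarm event into a Discord webhook body."""
    record = sns_event["Records"][0]["Sns"]
    # CloudWatch places the alarm details in the SNS Message field as a JSON string.
    alarm = json.loads(record["Message"])
    state = alarm["NewStateValue"]  # "ALARM", "OK" or "INSUFFICIENT_DATA"
    content = f'[{state}] {alarm["AlarmName"]}: {alarm["NewStateReason"]}'
    # Discord rejects message content longer than 2000 characters.
    return {"content": content[:2000]}


# Trimmed-down example event; real events carry many more fields.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "rdsCPUAlarm",
                        "NewStateValue": "ALARM",
                        "NewStateReason": "Threshold Crossed",
                    }
                )
            }
        }
    ]
}
payload = build_discord_payload(sample_event)
```

The resulting dictionary would then be sent as a JSON POST body to the Discord webhook URL.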

📊 Resource Alarms

💾 RDS Resource Alarms

These alarms monitor the health and performance of our Relational Database Service (RDS) instances.

RDS CPU Utilization Alarm

Name: rdsCPUAlarm
Description: Monitors CPU usage of the RDS instance
Metric: CPUUtilization
Threshold: > 80%
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

RDS RAM Utilization Alarm

Name: rdsRamAlarm
Description: Monitors available memory on the RDS instance using anomaly detection; FreeableMemory dropping below the expected band indicates unusually high RAM usage
Metric: FreeableMemory
Threshold: Below lower bound of anomaly detection band
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS
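
Anomaly detection alarms are defined differently from static-threshold alarms: instead of a numeric threshold, the alarm references a band expression. The sketch below shows one way the PutMetricAlarm parameters for this alarm might look. The band width of 2 standard deviations, the `Average` statistic, the instance identifier, and the helper name `anomaly_alarm_params` are assumptions for illustration.

```python
def anomaly_alarm_params(name, namespace, metric, dimensions, comparison,
                         band_width=2, period=300):
    """Build PutMetricAlarm parameters for an anomaly detection alarm."""
    return {
        "AlarmName": name,
        # For band alarms: "LessThanLowerThreshold", "GreaterThanUpperThreshold",
        # or "LessThanLowerOrGreaterThanUpperThreshold".
        "ComparisonOperator": comparison,
        "EvaluationPeriods": 1,
        # Points at the band expression below instead of a numeric Threshold.
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric,
                        "Dimensions": [
                            {"Name": key, "Value": value}
                            for key, value in dimensions.items()
                        ],
                    },
                    "Period": period,
                    "Stat": "Average",  # assumption
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric} (expected)",
            },
        ],
    }


rds_ram = anomaly_alarm_params(
    name="rdsRamAlarm",
    namespace="AWS/RDS",
    metric="FreeableMemory",
    dimensions={"DBInstanceIdentifier": "example-db"},  # placeholder instance
    comparison="LessThanLowerThreshold",
)
```

The same helper would cover the "above upper bound" alarms on this page by passing `GreaterThanUpperThreshold` as the comparison.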

RDS Storage Alarm

Name: rdsStorageAlarm
Description: Monitors available storage space in the RDS instance
Metric: FreeStorageSpace
Threshold: ≤ 20% of total storage
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS
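
FreeStorageSpace is reported in bytes, so a percentage threshold has to be converted using the instance's allocated storage. A small worked example, assuming a hypothetical instance with 100 GiB allocated (the function name is also illustrative):

```python
GIB = 1024 ** 3  # bytes per GiB


def free_storage_threshold_bytes(allocated_gib: int, min_free_percent: int = 20) -> int:
    """Return the FreeStorageSpace value (bytes) equal to min_free_percent of the allocation."""
    # Integer arithmetic keeps the result exact.
    return allocated_gib * GIB * min_free_percent // 100


threshold = free_storage_threshold_bytes(100)  # 20 GiB expressed in bytes
```

For the assumed 100 GiB instance, the alarm would fire once free space is at or below 20 GiB.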

RDS Connection Pool Alarm

Name: rdsConnectionPoolAlarm
Description: Monitors the number of database connections
Metric: DatabaseConnections
Threshold: > 20 connections
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

🌐 Web Application Resource Alarms

These alarms monitor the performance and health of our web application running on ECS.

Web Application CPU Utilization Alarm

Name: webappCpuAlarm
Description: Monitors CPU usage of the web application ECS service
Metric: CPUUtilization
Threshold: > 80%
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

Web Application RAM Utilization Alarm

Name: webappRamAlarm
Description: Monitors RAM usage of the web application ECS service
Metric: MemoryUtilization
Threshold: > 80%
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

Unauthorized Requests Alarm

Name: unauthorizedRequestsAlarm
Description: Monitors the number of 4XX HTTP response codes (client errors, which include unauthorized 401 and 403 responses)
Metric: HTTPCode_Target_4XX_Count
Threshold: > 100 in 5 minutes
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

Success/Failure Ratio Alarm

Name: successFailureRatioAlarm
Description: Monitors the ratio of successful requests to total requests
Metric: Custom metric (success / total requests)
Threshold: < 95%
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS
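
The ratio and its threshold check can be sketched as follows. This is only an illustration of the arithmetic: which requests count as failures is defined by the custom metric itself, and the function names and the decision to skip periods without traffic are assumptions.

```python
from typing import Optional


def success_ratio_percent(total_requests: int, failed_requests: int) -> Optional[float]:
    """Return successful requests as a percentage of all requests."""
    if total_requests == 0:
        return None  # no traffic in the period, so no ratio to evaluate
    return 100.0 * (total_requests - failed_requests) / total_requests


def breaches_threshold(ratio: Optional[float], threshold: float = 95.0) -> bool:
    """True when the ratio is strictly below the alarm threshold."""
    return ratio is not None and ratio < threshold


ratio = success_ratio_percent(total_requests=1000, failed_requests=60)
alarm_fires = breaches_threshold(ratio)
```

With 60 failures out of 1000 requests the ratio is 94%, which is below the 95% threshold, so the alarm would fire.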

Failure Requests Alarm

Name: failureRequestsAlarm
Description: Monitors the number of 5XX HTTP response codes
Metric: HTTPCode_Target_5XX_Count
Threshold: > 10 in 5 minutes
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

Request Latency Alarm

Name: requestLatencyAlarm
Description: Monitors the average response time of requests
Metric: TargetResponseTime
Threshold: > 1.5 seconds (production) or > 5 seconds (staging)
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS
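
Because this threshold differs between environments, it has to be resolved per deployment. A minimal sketch, assuming the environments are identified by the strings `production` and `staging` (the constant and function names are hypothetical):

```python
# TargetResponseTime is reported in seconds.
LATENCY_THRESHOLD_SECONDS = {
    "production": 1.5,
    "staging": 5.0,
}


def latency_threshold(environment: str) -> float:
    """Return the latency alarm threshold for the given environment."""
    try:
        return LATENCY_THRESHOLD_SECONDS[environment]
    except KeyError:
        # Fail loudly rather than silently falling back to a default threshold.
        raise ValueError(f"No latency threshold defined for {environment!r}") from None
```

Raising on an unknown environment avoids deploying an alarm with an unintended threshold.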

Service Health Check Alarm

Name: serviceHealthCheckAlarm
Description: Monitors the health of the ECS service by counting the healthy targets behind the load balancer
Metric: HealthyHostCount
Threshold: < 1 healthy host
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

🐳 ECS Cluster Resource Alarms

These alarms monitor the overall health and performance of our ECS cluster.

ECS Memory Utilization Alarm

Name: memoryUtilizationAlarm
Description: Monitors memory usage of the ECS cluster
Metric: MemoryUtilization (custom metric)
Threshold: > 80%
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

ECS CPU Utilization Alarm

Name: cpuUtilizationAlarm
Description: Monitors CPU usage of the ECS cluster
Metric: CPUUtilization (custom metric)
Threshold: > 80%
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

ECS Ephemeral Storage Utilization Alarm

Name: ephemeralStorageUtilizationAlarm
Description: Monitors ephemeral storage usage of the ECS cluster
Metric: EphemeralStorageUtilization (custom metric)
Threshold: > 80%
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

ECS Network Transmit Bytes Alarm

Name: networkTxBytesAlarm
Description: Monitors network transmit bytes of the ECS cluster using anomaly detection
Metric: NetworkTxBytes
Threshold: Above upper bound of anomaly detection band
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

🐰 RabbitMQ Resource Alarms

These alarms monitor the health and performance of our RabbitMQ message broker.

RabbitMQ CPU Utilization Alarm

Name: rabbitMQCpuAlarm
Description: Monitors CPU usage of the RabbitMQ broker
Metric: SystemCpuUtilization
Threshold: > 80%
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

RabbitMQ RAM Utilization Alarm

Name: rabbitMQRamAlarm
Description: Monitors RAM usage of the RabbitMQ broker
Metric: RabbitMQMemUsed
Threshold: > 7GB (production) or > 800MB (staging)
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

RabbitMQ Storage Alarm

Name: rabbitMQStorageAlarm
Description: Monitors available disk space of the RabbitMQ broker
Metric: Custom metric (RabbitMQDiskFree / RabbitMQDiskFreeLimit)
Threshold: < 20%
Period: 60 seconds (1 minute)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

RabbitMQ Message Count High Alarm

Name: rabbitMQMessageCountHighAlarm
Description: Monitors the number of messages in RabbitMQ queues using anomaly detection
Metric: MessageCount
Threshold: Above upper bound of anomaly detection band
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

RabbitMQ Consumer Health Alarm

Name: rabbitMQConsumerHealthAlarm
Description: Monitors the number of active consumers
Metric: ConsumerCount
Threshold: < 2 consumers
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

🥬 Celery Resource Alarms

These alarms monitor the performance of our Celery task queue workers and scheduler.

Celery Scheduler CPU Utilization Alarm

Name: celerySchedulerCpuAlarm
Description: Monitors CPU usage of the Celery scheduler ECS service
Metric: CPUUtilization
Threshold: > 80%
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

Celery Worker CPU Utilization Alarm

Name: celeryWorkerCpuAlarm
Description: Monitors CPU usage of the Celery worker ECS service
Metric: CPUUtilization
Threshold: > 80%
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

Celery Scheduler RAM Utilization Alarm

Name: celerySchedulerRamAlarm
Description: Monitors RAM usage of the Celery scheduler ECS service
Metric: MemoryUtilization
Threshold: > 80%
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

Celery Worker RAM Utilization Alarm

Name: celeryWorkerRamAlarm
Description: Monitors RAM usage of the Celery worker ECS service
Metric: MemoryUtilization
Threshold: > 80%
Period: 300 seconds (5 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

📁 EFS Resource Alarms

These alarms monitor the usage and performance of our Elastic File System (EFS).

EFS Usage Alarm

Name: efsUsageAlarm
Description: Monitors storage usage of the EFS file system
Metric: StorageBytes
Threshold: > 1GB
Period: 900 seconds (15 minutes)
Evaluation Periods: 1
Actions: Notification sent to Discord via SNS

📝 Best Practices

Regular Review: Periodically review and adjust alarm thresholds based on your application's evolving needs and performance characteristics.
Actionable Alarms: Ensure that each alarm is actionable. An alarm should indicate a specific issue that requires attention.
Avoid Alarm Fatigue: Balance sensitivity with practicality to avoid too many false positives, which can lead to ignored alarms.
Documentation: Keep this document updated as you add or modify alarms. Include reasons for threshold choices where applicable.
Testing: Regularly test your alarms to ensure they're functioning as expected and that notifications are being received.
Escalation Plan: Have a clear escalation plan for critical alarms that require immediate attention.
Composite Alarms: Consider using composite alarms for complex conditions that depend on multiple metrics.
Use of Anomaly Detection: Leverage anomaly detection for metrics with variable patterns that are hard to define with static thresholds.
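
As an example of the composite alarm practice, the sketch below builds the parameters for CloudWatch's PutCompositeAlarm API so that a notification is sent only when both web application alarms are in the ALARM state. The composite alarm name, the helper name, and the topic ARN are hypothetical; no such alarm is currently documented on this page.

```python
def composite_alarm_params(name: str, child_alarms: list, topic_arn: str) -> dict:
    """Build PutCompositeAlarm parameters that require every child alarm to be in ALARM."""
    rule = " AND ".join(f'ALARM("{child}")' for child in child_alarms)
    return {
        "AlarmName": name,
        "AlarmRule": rule,
        "AlarmActions": [topic_arn],
    }


composite = composite_alarm_params(
    name="webappSaturationAlarm",  # hypothetical name
    child_alarms=["webappCpuAlarm", "webappRamAlarm"],
    topic_arn="arn:aws:sns:us-east-1:123456789012:example-alarm-topic",  # placeholder ARN
)
```

With boto3, the dictionary could be passed as `boto3.client("cloudwatch").put_composite_alarm(**composite)`.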

🔧 Custom Alarms and Metrics

For detailed information on our custom alarms and metrics, please refer to the Custom Alarms and Metrics Documentation.

🔗 Useful Links

Remember to keep this document up-to-date as your infrastructure evolves. Regular reviews and updates will ensure that your monitoring strategy remains effective and aligned with your operational needs.
