Monitor AWS ECS Agent & Automatically Restart Agent on Failure
Amazon EC2 Container Service (ECS) is a container management service that makes it easy to manage Docker containers on EC2 instances. With AWS ECS you create task definitions that define container configuration such as memory, CPU, environment variables, and mount points, and services that scale Docker containers.
Use Case: In one of our projects we set up a complete QA environment on AWS ECS, and after a few days we observed that the ECS agent frequently disconnected from the AWS ECS service. As a result, the AWS ECS service could not communicate with the ECS agent, so no new tasks were scheduled and the status of the existing containers was unavailable.
Note: We are using the AWS ECS-optimized AMI, i.e. Amazon Linux AMI. If you are using an AMI with a different OS, a few steps may change, e.g. installing the AWS CLI and fetching instance metadata.
Steps to set up the monitoring script on ECS nodes:
1. Set up an SNS topic for receiving notifications
In the AWS console, create an SNS topic and add the notification email address as a subscriber, then confirm the subscription using the email you receive from the SNS service.
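The same topic and subscription can also be created from the command line. The sketch below assumes the AWS CLI is already configured with SNS permissions on whichever machine runs it; the function name create_alert_topic, the topic name, and the email address are placeholders and not part of the original setup.

```shell
# Hypothetical helper: create an SNS topic and subscribe an email address.
# Prints the topic ARN so it can be pasted into the monitoring script.
create_alert_topic() {
  topic_name="$1"
  email="$2"
  # create-topic is idempotent: it returns the existing ARN if the topic exists
  topic_arn=$(aws sns create-topic --name "$topic_name" \
    --query 'TopicArn' --output text) || return 1
  # The subscription stays pending until the recipient confirms it by email
  aws sns subscribe --topic-arn "$topic_arn" --protocol email \
    --notification-endpoint "$email" >/dev/null || return 1
  echo "$topic_arn"
}

# Example (placeholder values):
# create_alert_topic ecs-agent-alerts ops@example.com
```

The confirmation email still has to be accepted by hand, exactly as with the console flow.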
2. Install AWS CLI
Our script uses the AWS CLI to query AWS for the container instance ARNs and the agent status through the aws ecs subcommands.
[js]yum install -y aws-cli[/js]
3. Set up IAM policies for SNS and ECS
a. AWS SNS IAM Policy: The policy below allows the IAM instance role to publish messages to the SNS topic we created earlier, so that we are notified when the agent fails.
[js]
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Stmt1460976768000",
"Effect": "Allow",
"Action": [
"sns:GetEndpointAttributes",
"sns:GetPlatformApplicationAttributes",
"sns:GetSubscriptionAttributes",
"sns:GetTopicAttributes",
"sns:ListEndpointsByPlatformApplication",
"sns:ListPlatformApplications",
"sns:ListSubscriptions",
"sns:ListSubscriptionsByTopic",
"sns:ListTopics",
"sns:Publish"
],
"Resource": [
"arn:aws:sns:ap-southeast-1:<aws-account-id>:<topic-name>"
]
}
]
}
[/js]
b. AWS ECS IAM Policy: The policy below allows the IAM instance role to query the AWS ECS API to list container instances and check the agent connectivity status.
[js]
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Stmt1460960788000",
"Effect": "Allow",
"Action": [
"ecs:DescribeClusters",
"ecs:DescribeContainerInstances",
"ecs:DescribeServices",
"ecs:DescribeTaskDefinition",
"ecs:DescribeTasks",
"ecs:DiscoverPollEndpoint",
"ecs:ListClusters",
"ecs:ListContainerInstances",
"ecs:ListServices",
"ecs:ListTaskDefinitionFamilies",
"ecs:ListTaskDefinitions",
"ecs:ListTasks",
"ecs:Poll"
],
"Resource": [
"arn:aws:ecs:ap-southeast-1:<aws-account-id>:cluster/<cluster-name>"
]
}
]
}
[/js]
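One way to attach the two policy documents is as inline policies on the instance role. This is only a sketch: the helper name attach_inline_policy, the role name, the policy names, and the file paths are all assumptions, and it presumes each JSON document above has been saved to its own file.

```shell
# Hypothetical helper: attach a JSON policy file to a role as an inline policy.
attach_inline_policy() {
  role_name="$1"      # instance role used by the ECS nodes (placeholder)
  policy_name="$2"    # any descriptive name
  policy_file="$3"    # path to one of the JSON documents above
  [ -f "$policy_file" ] || { echo "missing $policy_file" >&2; return 1; }
  aws iam put-role-policy --role-name "$role_name" \
    --policy-name "$policy_name" \
    --policy-document "file://$policy_file"
}

# Example (placeholder values):
# attach_inline_policy ecsInstanceRole ecs-agent-sns-publish ./sns-policy.json
# attach_inline_policy ecsInstanceRole ecs-agent-status-read ./ecs-policy.json
```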
4. Monitoring Script
The script below checks the ECS agent's connectivity with the ECS service. It first fetches the ARNs of all container instances in the cluster and the ID of the current EC2 instance (from the instance metadata). For each container instance ARN it then reads the agent status and checks whether that container instance is the current EC2 instance. If the ECS agent on the current instance is disconnected, the script sends a notification and restarts the ecs service on the instance.
[js]
#!/bin/bash
# Sourcing the ecs.config file for using the cluster name
source /etc/ecs/ecs.config
# List every container instance ARN registered in the cluster
CONTAINERS_ID=$(aws ecs list-container-instances --cluster "$ECS_CLUSTER" --output text --query 'containerInstanceArns')
# EC2 instance ID of this host, read from the instance metadata service
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
DATE=$(date +%Y-%m-%d-%H:%M)
TOPIC="arn:aws:sns:ap-southeast-1:<aws-account-id>:<topic-name>"
for container in $CONTAINERS_ID
do
  STATUS=$(aws ecs describe-container-instances --container-instances "$container" --cluster "$ECS_CLUSTER" --output json --query 'containerInstances[0].agentConnected')
  CHECK_INSTANCE_ID=$(aws ecs describe-container-instances --container-instances "$container" --cluster "$ECS_CLUSTER" --output text --query 'containerInstances[0].ec2InstanceId')
  # Only act on the entry that belongs to this host
  if [ "$INSTANCE_ID" == "$CHECK_INSTANCE_ID" ]
  then
    if [ "$STATUS" == "false" ]
    then
      echo "Agent Disconnected $DATE" >> /var/log/script.log
      aws sns publish --message "AWS ECS Agent Failed $INSTANCE_ID $DATE" --topic-arn "$TOPIC"
      # Restart the agent service
      sudo stop ecs
      sudo start ecs
    else
      echo "Agent Connected $DATE" >> /var/log/script.log
    fi
  fi
done[/js]
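The loop above makes two describe calls for every container instance in the cluster. As a possible variation (not the method used in this post), a single describe call with a JMESPath filter can return the agent status for the current instance only. The function name agent_state is hypothetical, and the sketch assumes the cluster is small enough for all ARNs to be passed in one describe-container-instances call.

```shell
# Hypothetical helper: report the agent state for one EC2 instance.
# Prints "true", "false", or "null" (no matching container instance found).
agent_state() {
  cluster="$1"
  instance_id="$2"
  arns=$(aws ecs list-container-instances --cluster "$cluster" \
    --output text --query 'containerInstanceArns') || return 1
  # Nothing to describe if the list came back empty
  [ -n "$arns" ] || { echo "null"; return 0; }
  # $arns is left unquoted on purpose so each ARN becomes its own argument
  aws ecs describe-container-instances --cluster "$cluster" \
    --container-instances $arns --output json \
    --query "containerInstances[?ec2InstanceId=='$instance_id'].agentConnected | [0]"
}

# Example, reusing the variables from the script above:
# state=$(agent_state "$ECS_CLUSTER" "$INSTANCE_ID")
# [ "$state" = "false" ] && echo "agent disconnected"
```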
5. Set up cron to run every 5 minutes
After monitoring the QA environment for more than a week, we found that the ECS agent gets disconnected almost twice a day, so I chose to set up a cron job that runs every 5 minutes and writes its output and errors to /var/log/monitor-agent-logs.txt
[js]*/5 * * * * bash /home/ec2-user/monitor_agent.sh >> /var/log/monitor-agent-logs.txt 2>&1[/js]
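If the cron entry is installed from a provisioning script rather than by hand, it helps to make that step repeatable. The sketch below is one way to do it; the helper name install_cron_entry is an assumption, and it edits the crontab of whichever user runs it.

```shell
# Hypothetical helper: add a crontab line, replacing any older line that
# mentions the same script so repeated runs do not create duplicates.
install_cron_entry() {
  script_path="$1"
  log_path="$2"
  entry="*/5 * * * * bash $script_path >> $log_path 2>&1"
  {
    # Keep every existing line except ones that already call the script
    crontab -l 2>/dev/null | grep -Fv "$script_path" || true
    echo "$entry"
  } | crontab -
}

# Example:
# install_cron_entry /home/ec2-user/monitor_agent.sh /var/log/monitor-agent-logs.txt
```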
6. Create AMI and Update ECS Auto Scaling Groups Launch Configuration
Once you have created an AMI of the running instance, copy the existing launch configuration (i.e. the one created by the AWS ECS CloudFormation stack), change the AMI, and save it as a new launch configuration. Then update the Auto Scaling group to use the newly created launch configuration.
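Parts of this step can also be scripted with the AWS CLI. The sketch below covers only creating the AMI and switching the Auto Scaling group; it assumes the new launch configuration has already been created as a copy of the old one, and every name passed in (instance ID, AMI name, group name, launch configuration name) is a placeholder.

```shell
# Hypothetical helpers for step 6; all names passed in are placeholders.

# Create an AMI from a running instance and print the new image ID.
# Note: by default create-image reboots the instance unless --no-reboot is given.
create_node_ami() {
  instance_id="$1"
  ami_name="$2"
  aws ec2 create-image --instance-id "$instance_id" --name "$ami_name" \
    --query 'ImageId' --output text
}

# Point an Auto Scaling group at an already-created launch configuration.
use_launch_configuration() {
  asg_name="$1"
  launch_config_name="$2"
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$asg_name" \
    --launch-configuration-name "$launch_config_name"
}

# Example (placeholder values):
# create_node_ami i-0123456789abcdef0 ecs-node-with-agent-monitor
# use_launch_configuration my-ecs-asg my-ecs-launch-config-v2
```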
The instance-id query appears to have been cut off in the script as originally posted; the full line is:
INSTANCE_ID=$(curl http://169.254.169.254/latest/meta-data/instance-id)
You solved this well before most people started struggling with it.