13 Must-Know Automation Scripts for DevOps Monitoring & Logging

12 / Mar / 2025 by Ahmad Ali

Introduction

Monitoring and logging are crucial for maintaining a reliable system. Whether you’re managing cloud infrastructure, microservices, servers, or CI/CD pipelines, automation scripts provide a strong foundation and play a key role in preventing late-night production issues.

In this article, we will look at 13 essential automation scripts that every DevOps engineer should know. These scripts will help in tracking system performance, identifying bottlenecks, and streamlining log management, making your workflow more efficient.

This guide is perfect for beginners, but experienced DevOps professionals will also find valuable insights. Let’s dive in and discover how these scripts can enhance your operations with real-world applications!

Objective

This blog will provide DevOps engineers with essential automation scripts to enhance system reliability, performance monitoring, and log management. DevOps teams can prevent critical issues, reduce manual effort, and improve operational efficiency by automating key tasks.

Prerequisites

  • A Linux server
  • Familiarity with Bash and Python scripting

1. Monitor Disk Space

Running out of disk space can cause database crashes, application or server failures, broken build pipelines, or even data loss.

Solution: A Bash script checks disk usage on your server and prints an alert when usage exceeds a defined threshold, in this example 80%.

Sample Script:

#!/bin/bash
THRESHOLD=80
# Skip the df header line and flag any filesystem above the threshold
df -h | awk -v threshold="$THRESHOLD" 'NR > 1 && $5+0 > threshold {print "Disk usage alert:", $0}'
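
As a sketch of an alternative, the same threshold check can be done in Python with the standard library’s shutil.disk_usage, which avoids parsing df output entirely; the path and 80% threshold mirror the Bash example and are, of course, adjustable.

```python
import shutil

THRESHOLD = 80  # percent, same threshold as the Bash example


def disk_usage_percent(path="/"):
    """Return used disk space for `path` as a percentage of total."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


if __name__ == "__main__":
    pct = disk_usage_percent("/")
    if pct > THRESHOLD:
        print(f"Disk usage alert: {pct:.1f}% used")
```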

2. Monitor CPU Usage

High CPU usage can indicate that an application isn’t scaling properly or that a runaway process is hogging resources.
Solution: This Python script uses psutil to monitor CPU usage and prints an alert when it exceeds a threshold.

Sample Script:

import psutil

# Sample CPU usage over one second; alert above 80%
if psutil.cpu_percent(interval=1) > 80:
    print("High CPU Usage Alert!!")
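
The script above only prints to stdout. To actually deliver the alert, one option is the standard library’s smtplib. This is a minimal sketch: the SMTP host, sender, and recipient are placeholders you would replace with your own mail relay details.

```python
import smtplib
from email.message import EmailMessage

SMTP_HOST = "smtp.example.com"     # placeholder: your mail relay
ALERT_FROM = "alerts@example.com"  # placeholder sender address
ALERT_TO = "oncall@example.com"    # placeholder recipient address


def build_alert(cpu_pct, threshold=80):
    """Build the alert email for a given CPU usage percentage."""
    msg = EmailMessage()
    msg["Subject"] = f"High CPU Usage Alert: {cpu_pct:.1f}%"
    msg["From"] = ALERT_FROM
    msg["To"] = ALERT_TO
    msg.set_content(
        f"CPU usage is at {cpu_pct:.1f}%, above the {threshold}% threshold."
    )
    return msg


def send_alert(msg):
    """Deliver the message via the (placeholder) SMTP relay."""
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)
```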

3. Automate Log Rotation

Large logs can quickly fill storage and become unmanageable. Automating log rotation helps keep them under control.
Solution: A cron job paired with this Bash script compresses old logs into a dated archive, then removes the originals.
Sample Script:

#!/bin/bash
LOG_DIR="/var/log/app"
# Archive today's logs; remove the originals only if archiving succeeded
tar -czf "$LOG_DIR/archive_$(date +%F).tar.gz" "$LOG_DIR"/*.log && rm "$LOG_DIR"/*.log
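
To run the rotation automatically, pair the script with a cron entry such as the one below; in production, the standard logrotate tool covers the same ground with more options. The schedule and install path here are illustrative placeholders.

```shell
# Illustrative /etc/cron.d entry: run the rotation script daily at 00:30.
# /usr/local/bin/rotate-app-logs.sh is a placeholder for wherever you
# install the script above.
30 0 * * * root /usr/local/bin/rotate-app-logs.sh
```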

4. Service Uptime Monitoring

Immediate detection of service downtime is crucial for maintaining application reliability.
Solution: A Python script pings a service endpoint and sends alerts if it’s unreachable.
Sample Script:

import requests

try:
    # A timeout keeps the check from hanging on an unresponsive service
    response = requests.get("http://my-service-url.com", timeout=5)
    if response.status_code != 200:
        print("Service Down Alert!")
except requests.RequestException:
    print("Service Unreachable!")

5. Monitor Memory Usage

Low server memory can lead to application crashes and service disruption.
Solution: This script checks available memory and alerts when it drops below a specified threshold, in this example 600 MB.
Sample Script:

#!/bin/bash
THRESHOLD=600  # in MB
# Column 7 of `free -m` is the "available" memory
AVAILABLE=$(free -m | awk '/Mem/ {print $7}')
if [ "$AVAILABLE" -lt "$THRESHOLD" ]; then
    echo "Low Memory Alert!!"
fi

6. Docker Container Health Check

Container health is crucial for ensuring the smooth operation of microservices.
Solution: This script checks the health of running Docker containers.
Sample Script:

#!/bin/bash
# Report health for each running container. Containers without a
# HEALTHCHECK have no .State.Health, so fall back to a placeholder.
for container in $(docker ps --format "{{.Names}}"); do
    STATUS=$(docker inspect --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}no healthcheck{{end}}' "$container")
    echo "Container: $container - Status: $STATUS"
done

7. Kubernetes Pod Monitoring

Pods are the smallest deployable units in Kubernetes. Monitoring their status ensures applications are running smoothly.
Solution: This script uses kubectl commands to check pod statuses.
Sample Script:

#!/bin/bash
# List pods that are not Running; exclude the header and Completed jobs,
# which are expected to have finished
kubectl get pods --all-namespaces --no-headers | grep -vE 'Running|Completed'

8. Application Log Parsing Script

Log parsing helps detect hidden issues, but manual filtering can be time-consuming.
Solution: A Bash script filters application logs for ERROR messages or other patterns.
Sample Script:

#!/bin/bash
LOG_FILE="/var/log/app.log"
# Case-insensitive search for error entries
grep -i "error" "$LOG_FILE"
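
Going a step beyond a plain grep, a short Python sketch can count how often each error line recurs, which helps surface repeating issues. The log path and the simple line-based format are assumptions; adapt them to your application.

```python
import re
from collections import Counter


def count_errors(lines):
    """Count occurrences of lines containing 'error' (case-insensitive)."""
    pattern = re.compile(r"error", re.IGNORECASE)
    return Counter(line.strip() for line in lines if pattern.search(line))


if __name__ == "__main__":
    # Assumed log location, matching the Bash example above
    with open("/var/log/app.log") as f:
        for line, n in count_errors(f).most_common(10):
            print(n, line)
```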

9. Website Availability Check Script

If your website goes offline, it’s critical to detect the outage immediately.
Solution: A Python script checks if your website is up and alerts you if it’s down.
Sample Script:

import requests

try:
    response = requests.get("http://myexample.com", timeout=5)
    if response.status_code == 200:
        print("Website is up!")
    else:
        print("Website is down!")
except requests.RequestException:
    print("Website Unreachable!")

10. Script for Cleaning Temp Files

Temporary files can clutter the filesystem and eat into disk space on the server, so cleaning them up regularly is a necessary step.
Sample Script:

import os
import shutil
import time

TEMP_DIR = '/tmp'
AGE_THRESHOLD = 7 * 24 * 60 * 60  # 7 days in seconds

def clean_temp_files():
    now = time.time()
    for filename in os.listdir(TEMP_DIR):
        file_path = os.path.join(TEMP_DIR, filename)
        try:
            # lstat so broken symlinks are aged by the link itself
            if os.lstat(file_path).st_mtime < now - AGE_THRESHOLD:
                if os.path.isfile(file_path) or os.path.islink(file_path):
                    os.remove(file_path)
                    print(f'Removed file: {file_path}')
                elif os.path.isdir(file_path):
                    shutil.rmtree(file_path)
                    print(f'Removed directory: {file_path}')
        except OSError as e:
            # Files owned by other users or removed mid-scan are skipped
            print(f'Skipped {file_path}: {e}')

if __name__ == "__main__":
    clean_temp_files()

11. Track DNS Resolution Time

Monitoring DNS resolution speeds helps troubleshoot slow application performance.
Sample Script:

#!/bin/bash
DNS_SERVER="8.8.8.8"
DOMAIN="example.com"
dig @"$DNS_SERVER" "$DOMAIN" +stats | grep "Query time"
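
An equivalent check can be sketched in Python with the standard library. Note the caveats: unlike the dig example, socket.getaddrinfo cannot target a specific DNS server, and the measured time is affected by local resolver caching.

```python
import socket
import time


def resolution_time_ms(domain):
    """Time a DNS lookup for `domain` via the system resolver, in ms."""
    start = time.perf_counter()
    socket.getaddrinfo(domain, None)
    return (time.perf_counter() - start) * 1000


if __name__ == "__main__":
    print(f"example.com resolved in {resolution_time_ms('example.com'):.1f} ms")
```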

12. Automated Backup Verification Script

Backups must be verified to ensure they’re actually restorable in the event of a system failure.
Sample Script:

#!/bin/bash
BACKUP_DIR="/backups"
shopt -s nullglob  # skip the loop entirely if no archives exist
for file in "$BACKUP_DIR"/*.tar.gz; do
    # -t lists the archive contents; failure indicates corruption
    if ! tar -tzf "$file" > /dev/null; then
        echo "Corrupted backup file: $file"
    fi
done
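
The same verification can be sketched with the standard library’s tarfile module. Reading every member fully also catches truncated archives that a bare listing might miss; the backup directory is whatever your backup location happens to be.

```python
import tarfile
from pathlib import Path


def corrupted_backups(backup_dir):
    """Return paths of .tar.gz archives in backup_dir that fail a full read."""
    bad = []
    for path in sorted(Path(backup_dir).glob("*.tar.gz")):
        try:
            with tarfile.open(path, "r:gz") as tar:
                for member in tar.getmembers():
                    extracted = tar.extractfile(member)
                    if extracted is not None:
                        extracted.read()  # read fully to catch truncation
        except (tarfile.TarError, OSError, EOFError) as exc:
            bad.append(path)
            print(f"Corrupted backup file: {path} ({exc})")
    return bad
```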

13. AWS Cloud Cost Monitoring Script

Managing cloud costs ensures efficient resource allocation and prevents unexpected billing spikes.
Sample Script:

#!/bin/bash
THRESHOLD=10  # Set cost alert threshold
LOG_FILE="/var/log/aws_cost.log"
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/XXXXXXX"

# Fetch today's AWS cost. Cost Explorer treats the End date as exclusive,
# so the period must run from today to tomorrow (GNU date syntax).
COST=$(aws ce get-cost-and-usage \
  --time-period Start=$(date +%Y-%m-%d),End=$(date -d tomorrow +%Y-%m-%d) \
  --granularity DAILY \
  --metrics "BlendedCost" \
  --query 'ResultsByTime[0].Total.BlendedCost.Amount' \
  --output text 2>/dev/null)

# Ensure COST is not empty/null; if empty, set to zero
COST=${COST:-0}

# Validate COST as a numeric value
if ! [[ "$COST" =~ ^[0-9]+(\.[0-9]+)?$ ]]; then
    echo "$(date): ERROR - AWS cost data is invalid: '$COST'" >> "$LOG_FILE"
    exit 2
fi

# Log the cost
echo "$(date): AWS Cost Today: $COST USD" >> "$LOG_FILE"

# Function to send Slack notification
send_slack_alert() {
    MESSAGE="🚨 *AWS Cost Alert!* 🚨\n\nToday's AWS cost: *\$${COST} USD* 💰\nThreshold: *\$${THRESHOLD} USD*\n\n🔍 Check AWS Cost Explorer for more details."

    PAYLOAD=$(jq -n --arg text "$MESSAGE" '{text: $text}')

    curl -X POST -H 'Content-type: application/json' \
        --data "$PAYLOAD" "$SLACK_WEBHOOK_URL"
}

# Check if cost exceeds threshold using awk
if awk "BEGIN {exit !($COST > $THRESHOLD)}"; then
    send_slack_alert
fi

Conclusion

These scripts provide a solid foundation for improving logging and monitoring across your infrastructure. Each one is a practical starting point and can be modified further as needed.

For more DevOps and automation content, do subscribe to our blogs.
