Optimizing Cloud Infrastructure Spending for a Major Financial Enterprise

16 / Sep / 2024 by Vivek Tiwary

Cost optimization in cloud environments is not just a best practice; it’s a necessity for businesses aiming to maximize their return on investment in cloud services. AWS provides a wealth of tools and services to help manage costs, but without proper implementation and monitoring, expenses can spiral out of control. In this blog, I will outline a recent project aimed at enhancing cost efficiency in an AWS environment by identifying and addressing key inefficiencies.

Problem Statement

One of our financial enterprise clients was experiencing escalating cloud costs, particularly across their AWS infrastructure supporting critical financial services. After conducting a comprehensive review of their AWS account, we uncovered several inefficiencies that were driving these increased expenses. By implementing targeted cost optimization strategies, we were able to help the client significantly reduce their cloud spending while ensuring the performance, security, and scalability required for their financial operations.

Discovering Inefficiencies

At the outset of the project, we identified the following inefficiencies contributing to rising cloud costs:

  • Lack of Proper Tagging on Services: Without consistent tagging, tracking resource usage and allocating costs becomes challenging.
  • Absence of Budget Alerts: No budget alerts can lead to uncontrolled spending.
  • No Budget Allocation by Environment: Without separate budgets for different environments, it’s difficult to manage and track costs accurately.
  • Lack of CloudWatch Logs Retention Policy: Logs stored indefinitely increase storage costs.
  • Absence of S3 Intelligent-Tiering: Not using Intelligent-Tiering results in higher storage costs for infrequently accessed data.
  • Underutilization of Graviton-based Instances: Not using Graviton-based instances means missing out on cost savings and performance benefits.
  • Not Utilizing Spot Instances in Lower Environments: Missing the opportunity to use Spot Instances in non-production environments leads to higher costs.
  • Dev and UAT Environments Running During Non-Business Hours: Running these environments 24/7 incurs unnecessary costs.

Discussion, Resolution, and Actionable Steps

1. Lack of Proper Tagging on Services:

One of the first issues we encountered was the lack of proper tagging across our AWS services. Tags are critical for identifying resources, managing costs, and maintaining security. Without consistent tagging, it’s difficult to allocate costs to specific projects, teams, or departments, leading to inaccurate financial tracking and accountability.

Resolution

To address this, we implemented a comprehensive tagging strategy. Each service was tagged with details such as environment (Dev, UAT, Production), owner, project name, and cost center. AWS Config rules were also set up to enforce tagging compliance, ensuring that any untagged resources were flagged for remediation. This not only improved visibility into our spending but also made it easier to generate cost allocation reports.
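
As an illustration of what the remediation side can look like, here is a minimal boto3 sketch that scans an account for resources missing mandatory tags and applies placeholder defaults via the Resource Groups Tagging API. The tag keys and fallback values are hypothetical, not our actual policy, and not every resource type supports tagging through this API:

    import boto3

    # Hypothetical required tags and fallback values; adjust to your policy.
    REQUIRED_TAGS = {"Environment": "unassigned", "CostCenter": "unassigned"}

    tagging = boto3.client("resourcegroupstaggingapi")

    paginator = tagging.get_paginator("get_resources")
    for page in paginator.paginate():
        for resource in page["ResourceTagMappingList"]:
            existing = {t["Key"] for t in resource["Tags"]}
            missing = {k: v for k, v in REQUIRED_TAGS.items() if k not in existing}
            if missing:
                # Apply placeholder values so the resource is at least
                # traceable; owners can correct them later.
                tagging.tag_resources(
                    ResourceARNList=[resource["ResourceARN"]],
                    Tags=missing,
                )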

Actionable Steps
  • Define a clear tagging policy that aligns with your organizational structure.
  • Use AWS Tag Editor to tag existing resources.
  • Automate tagging during resource creation using AWS CloudFormation or Terraform.

2. Absence of Budget Alerts:

In many organizations, cloud costs often go unchecked until the bill arrives, which can be too late to take corrective action. This was the case in our environment, where there was no mechanism to alert us when budgets were exceeded.

Resolution

AWS Budgets allowed us to create custom cost and usage budgets and receive alerts when thresholds are exceeded. We set up budgets for each environment and linked them with SNS notifications to alert stakeholders when spending reached critical levels. This proactive approach enabled us to take corrective actions, such as reducing resource usage or optimizing services, before costs spiraled out of control.
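
For reference, such a budget can also be created programmatically. The sketch below uses boto3 with a placeholder account ID, budget amount, and SNS topic ARN, alerting when actual spend crosses 80% of the limit:

    import boto3

    budgets = boto3.client("budgets")

    # Placeholder account ID, amount, and SNS topic ARN -- substitute your own.
    budgets.create_budget(
        AccountId="123456789012",
        Budget={
            "BudgetName": "dev-monthly-cost",
            "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=[
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 80.0,           # alert at 80% of the budget
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "SNS",
                     "Address": "arn:aws:sns:us-east-1:123456789012:budget-alerts"},
                ],
            },
        ],
    )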

Actionable Steps
  • Set up AWS Budgets for each account and environment.
  • Integrate with SNS to send alerts via email, SMS, or Slack.
  • Review budget reports regularly to ensure spending aligns with expectations.

3. No Budget Allocation by Environment:

In our AWS setup, there was no clear separation of budgets between different environments, such as Dev, UAT, and Production. This made it difficult to track which environment was contributing most to the costs and to enforce spending limits.

Resolution

We established separate budgets for each environment, allowing us to monitor spending more granularly. This separation also helped identify which environments were consuming resources unnecessarily, enabling us to optimize them accordingly. For example, we discovered that the UAT environment was consuming more resources than expected, leading us to investigate and optimize its usage.
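
A quick way to see per-environment spend, assuming an Environment cost allocation tag has been activated in the Billing console, is to query Cost Explorer grouped by that tag. A minimal boto3 sketch:

    import boto3

    # Cost Explorer is served from the us-east-1 endpoint.
    ce = boto3.client("ce", region_name="us-east-1")

    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2024-08-01", "End": "2024-09-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "Environment"}],
    )

    # Print one line of spend per Environment tag value.
    for result in response["ResultsByTime"]:
        for group in result["Groups"]:
            cost = group["Metrics"]["UnblendedCost"]
            print(group["Keys"][0], cost["Amount"], cost["Unit"])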

Actionable Steps
  • Create environment-specific AWS Budgets.
  • Use AWS Cost Explorer to analyze spending trends across environments.
  • Adjust resource allocation and usage policies based on budget performance.

4. Lack of a CloudWatch Logs Retention Policy:

AWS CloudWatch logs can accumulate quickly, leading to unnecessary storage costs if not managed properly. In our environment, there was no retention policy in place, meaning logs were stored indefinitely, even when they were no longer needed.

Resolution

We implemented log retention policies to automatically delete logs after a certain period, depending on the criticality of the data. For example, logs from non-production environments were set to expire after 30 days, while production logs were retained for 90 days. This simple change reduced our storage costs significantly.
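
A sketch of how this can be automated with boto3, assuming a hypothetical naming convention in which log group prefixes identify the environment:

    import boto3

    logs = boto3.client("logs")

    # Hypothetical convention: log groups are prefixed with the
    # environment name, e.g. /dev/... or /uat/...
    RETENTION = {"/dev": 30, "/uat": 30, "/prod": 90}

    paginator = logs.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            name = group["logGroupName"]
            for prefix, days in RETENTION.items():
                if name.startswith(prefix):
                    # Expire log events after the configured number of days.
                    logs.put_retention_policy(
                        logGroupName=name, retentionInDays=days
                    )
                    break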

Actionable Steps
  • Review existing CloudWatch log groups and set appropriate retention periods.
  • Automate log retention policies using AWS CLI or SDKs.
  • Regularly audit log usage and adjust retention periods as needed.

5. Absence of S3 Intelligent-Tiering:

In our S3 setup, all objects were stored in the Standard storage class, regardless of access patterns. This resulted in higher storage costs, especially for infrequently accessed data that could have been stored more cost-effectively.

Resolution

We enabled S3 Intelligent-Tiering for buckets where data access patterns were unpredictable. S3 Intelligent-Tiering automatically moves objects between access tiers (Frequent Access, Infrequent Access, and Archive Instant Access) based on changing access patterns, optimizing storage costs. For data that was accessed even less frequently, we moved it to Glacier or Glacier Deep Archive for long-term storage at a fraction of the cost.
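
As a sketch, both steps can be expressed in a single lifecycle configuration. The bucket name and day thresholds below are illustrative placeholders, not our exact policy:

    import boto3

    s3 = boto3.client("s3")

    # Placeholder bucket name; thresholds are examples only.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-analytics-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-then-archive",
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},  # apply to the whole bucket
                    "Transitions": [
                        # Move new objects into Intelligent-Tiering immediately.
                        {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"},
                        # Archive anything older than a year to Deep Archive.
                        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                    ],
                },
            ],
        },
    )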

Actionable Steps
  • Identify S3 buckets where Intelligent-Tiering can be applied.
  • Enable Intelligent-Tiering through the AWS Management Console or CLI.
  • Use S3 Lifecycle policies to move data to Glacier for long-term storage.

6. Underutilization of Graviton-based Instances:

Despite AWS Graviton-based instances offering better price-performance ratios, none were being utilized in our environments. Graviton instances are ARM-based and can provide significant cost savings for workloads that are compatible with the architecture.

Resolution

After assessing our workloads, we identified several that were suitable for migration to Graviton instances. By switching to Graviton-based instances, we achieved cost savings of up to 20% while maintaining performance. This was particularly effective for compute-intensive tasks and applications that could be easily recompiled or containerized for ARM architecture.
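
To shortlist candidates, the EC2 API can enumerate current-generation ARM64 (Graviton) instance types; matching them to existing workloads remains a manual assessment. A minimal boto3 sketch:

    import boto3

    ec2 = boto3.client("ec2")

    # List current-generation ARM64 instance types with their size,
    # as a starting point for picking Graviton equivalents.
    paginator = ec2.get_paginator("describe_instance_types")
    pages = paginator.paginate(
        Filters=[
            {"Name": "processor-info.supported-architecture",
             "Values": ["arm64"]},
            {"Name": "current-generation", "Values": ["true"]},
        ]
    )
    for page in pages:
        for itype in page["InstanceTypes"]:
            vcpus = itype["VCpuInfo"]["DefaultVCpus"]
            mem_gib = itype["MemoryInfo"]["SizeInMiB"] // 1024
            print(f'{itype["InstanceType"]}: {vcpus} vCPU, {mem_gib} GiB')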

Actionable Steps
  • Identify workloads that can be migrated to Graviton instances.
  • Test and benchmark performance on Graviton-based instances.
  • Implement Graviton instances in both development and production environments.

7. Not Utilizing Spot Instances in Lower Environments:

Spot Instances offer significant cost savings but were not being utilized in our non-production environments. These environments, such as Dev and UAT, are typically more tolerant of interruptions, making them ideal candidates for Spot Instances.

Resolution

We replaced On-Demand instances in our lower environments with Spot Instances, leading to cost savings of up to 70%. To ensure minimal disruption, we used AWS Auto Scaling with Spot Fleet, which automatically replaces interrupted Spot Instances, maintaining continuous availability.
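
A closely related way to express this setup is an EC2 Auto Scaling group with a mixed instances policy running entirely on Spot; the launch template, subnets, and instance types below are placeholders. Offering several instance types deepens the Spot capacity pools the group can draw from:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Placeholder launch template, subnets, and instance types.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="dev-spot-asg",
        MinSize=1,
        MaxSize=4,
        DesiredCapacity=2,
        VPCZoneIdentifier="subnet-0abc,subnet-0def",
        MixedInstancesPolicy={
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": "dev-app-template",
                    "Version": "$Latest",
                },
                "Overrides": [
                    {"InstanceType": "m6g.large"},
                    {"InstanceType": "m7g.large"},
                    {"InstanceType": "m5.large"},
                ],
            },
            "InstancesDistribution": {
                "OnDemandPercentageAboveBaseCapacity": 0,  # 100% Spot
                "SpotAllocationStrategy": "price-capacity-optimized",
            },
        },
    )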

Actionable Steps
  • Replace On-Demand instances in non-production environments with Spot Instances.
  • Use Spot Fleets and Auto Scaling to manage Spot Instances.
  • Monitor Spot Instance availability and performance to adjust configurations as needed.

8. Dev and UAT Environments Running During Non-Business Hours:

In many organizations, including ours, Dev and UAT environments often run 24/7, even when not in use. This results in unnecessary costs, especially when these environments are only needed during business hours.

Resolution

We implemented automated shutdown and startup schedules for our Dev and UAT environments using AWS Lambda and EventBridge. This ensured that these environments were only running during business hours, significantly reducing costs without impacting productivity. For example, shutting down instances at 7 PM and starting them at 7 AM on weekdays resulted in nearly 60% savings in our lower environments.
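
A minimal sketch of the Lambda side, assuming a hypothetical Environment tag convention and two EventBridge schedules that pass an action field in the event payload:

    import boto3

    ec2 = boto3.client("ec2")

    def handler(event, context):
        """Stop or start tagged instances; invoked by two EventBridge
        schedules that pass {"action": "stop"} at 7 PM and
        {"action": "start"} at 7 AM on weekdays."""
        action = event.get("action", "stop")
        state = "running" if action == "stop" else "stopped"
        response = ec2.describe_instances(
            Filters=[
                # Hypothetical tag convention for schedulable environments.
                {"Name": "tag:Environment", "Values": ["Dev", "UAT"]},
                {"Name": "instance-state-name", "Values": [state]},
            ]
        )
        ids = [
            i["InstanceId"]
            for r in response["Reservations"]
            for i in r["Instances"]
        ]
        if ids:
            if action == "stop":
                ec2.stop_instances(InstanceIds=ids)
            else:
                ec2.start_instances(InstanceIds=ids)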

Actionable Steps
  • Set up automated schedules for Dev and UAT environments using AWS Lambda and EventBridge.
  • Implement a start-stop policy that aligns with your organization’s working hours.
  • Monitor usage and adjust schedules based on actual usage patterns.

Conclusion

Optimizing cloud costs is crucial for maintaining an efficient AWS environment. Our project achieved a notable cost reduction of up to 20% by addressing inefficiencies like improper tagging, lack of budget alerts, and inefficient resource usage. Implementing comprehensive tagging, budget alerts, and environment-specific budgets improved expense tracking and control.

Leveraging AWS features such as S3 Intelligent-Tiering, Graviton-based instances, and Spot Instances, along with automated shutdown schedules, enabled significant savings and optimized resource use. Additionally, fostering cost awareness among teams and regularly reviewing cloud usage ensures ongoing efficiency.

By applying these strategies, we achieved substantial savings and established effective, ongoing cost management in our AWS environment.
