Scaling Smartly: Utilizing ECS Fargate Spot Instances to Reduce Production Costs
Introduction
Optimizing the cost of running production workloads is essential in today’s cloud-driven world. By running Amazon ECS services on Fargate Spot, you can significantly reduce your cloud costs while keeping your applications reliable and functional. Fargate Spot runs the same tasks as On-Demand Fargate at a fraction of the price, in exchange for the possibility that AWS reclaims the capacity. With careful planning and architectural design, you can capture those savings and maintain smooth operations, even in a dynamic and scalable production environment.
Why Fargate Spot?
One of the big perks of using Fargate Spot is how much money it can save you. AWS prices Fargate Spot at up to 70% below regular On-Demand Fargate (the often-quoted 90% figure applies to EC2 Spot Instances), which can really help lower your cloud expenses, especially if you’re working with large applications.
Although there’s a chance of interruptions with Spot instances, the cost savings often make it worthwhile. By designing your system to handle these interruptions efficiently, you can ensure that your applications keep running smoothly, minimizing any negative impact from the occasional disruption.
Challenges of Running Production Workloads on ECS Fargate Spot
While the cost savings with Fargate Spot are substantial, it’s essential to be aware of the challenges that come with running production workloads on Spot Instances:
- Interruption Handling: Fargate Spot tasks can be interrupted with a two-minute warning when AWS needs the capacity back. Handling this requires careful planning so that your application can drain in-flight work and shift load to other tasks within that two-minute window, without data loss or downtime.
- Capacity Availability: Spot instances are not always available, especially during periods of peak demand. It’s important to have a fallback strategy, such as using On-Demand instances or configuring your service to maintain a minimum number of On-Demand instances to ensure that critical work is always running.
- Complexity in Infrastructure: Combining Spot with existing On-Demand capacity adds complexity to your infrastructure. You will need to manage scaling policies, health checks, and load balancing so that both the Spot and On-Demand services perform properly.
- Application Suitability: Not every app is a great match for Spot Instances. Apps that are flexible, stateless, and able to handle interruptions tend to do a lot better. But, if you’re working with apps that are stateful or sensitive to delays, you’ll have to take extra care to avoid issues like data loss or performance dips.
Even though there are some challenges, with careful planning and a solid system in place, you can still run production workloads on Fargate Spot and enjoy some pretty noticeable cost savings.
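To make the interruption-handling point concrete, here is a minimal sketch of a worker that stops taking new work once it receives SIGTERM, which is the signal ECS sends to a container when its task is being stopped. The names `ShutdownFlag` and `process_until_stopped` are hypothetical and not part of our setup. The sketch also assumes the container's `stopTimeout` is long enough for the current job to finish (on Fargate it can be raised to at most 120 seconds).

```python
import signal
import threading


class ShutdownFlag:
    """Remembers that a stop was requested so the work loop can exit cleanly."""

    def __init__(self):
        self._event = threading.Event()

    def install(self):
        # Call once from the main thread at startup. ECS delivers SIGTERM
        # before SIGKILL when a task is stopped, including Spot interruptions.
        signal.signal(signal.SIGTERM, self.request_stop)

    def request_stop(self, signum=None, frame=None):
        self._event.set()

    @property
    def stopping(self):
        return self._event.is_set()


def process_until_stopped(jobs, handle, flag):
    """Handle jobs in order; return the jobs left untouched after a stop request."""
    remaining = list(jobs)
    while remaining and not flag.stopping:
        handle(remaining.pop(0))
    return remaining
```

The job being handled when the signal arrives is allowed to finish. Whatever `process_until_stopped` returns was never started, so it can be left on the queue for another task to pick up.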
Implementation
- The setup consists of two ECS services: a primary service running Spot containers and a secondary service running On-Demand containers.
- Both services receive traffic from a single target group.
- The Spot service is set to scale at a lower threshold than the On-Demand service, so the Spot service scales out before the On-Demand service does.
- For example, the Spot service scales at 10,000 requests per target, and the On-Demand service scales at 14,000 requests per target.
- As traffic increases, the Spot service scales out first because of its lower threshold. The On-Demand service scales out only when the request count per target reaches the On-Demand threshold, which usually happens after Spot capacity is lost or under very high traffic.
- During downscaling, the On-Demand service scales in first when traffic decreases, allowing the Spot service to continue handling most of the traffic.
- When Spot capacity is unavailable, a Lambda function automatically starts a predetermined number of On-Demand containers upon receiving a “SERVICE_TASK_PLACEMENT_FAILURE” event for the Spot service. This keeps traffic flowing even when Spot tasks cannot be placed.
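The fallback Lambda from the last bullet could look roughly like the sketch below. It is an illustration, not our exact function: the service names and the floor of 4 tasks are hypothetical, and it reads "a predetermined number" as "raise the On-Demand desired count to at least a fixed floor" so that repeated events do not keep adding tasks. The ECS client is passed in so the logic can be exercised without AWS access.

```python
SPOT_SERVICE = "web-spot"            # hypothetical Spot service name
FALLBACK_SERVICE = "web-on-demand"   # hypothetical On-Demand service name
FALLBACK_MIN_COUNT = 4               # hypothetical predetermined task count


def is_spot_placement_failure(event):
    """True for an ECS Service Action placement failure on the Spot service."""
    detail = event.get("detail", {})
    return (
        event.get("detail-type") == "ECS Service Action"
        and detail.get("eventName") == "SERVICE_TASK_PLACEMENT_FAILURE"
        and any(arn.endswith("/" + SPOT_SERVICE) for arn in event.get("resources", []))
    )


def fallback_handler(event, context=None, ecs=None):
    if not is_spot_placement_failure(event):
        return {"scaled": False}
    if ecs is None:
        import boto3  # imported lazily so the module loads without boto3
        ecs = boto3.client("ecs")
    cluster = event["detail"]["clusterArn"]
    described = ecs.describe_services(cluster=cluster, services=[FALLBACK_SERVICE])
    current = described["services"][0]["desiredCount"]
    desired = max(current, FALLBACK_MIN_COUNT)
    if desired != current:
        ecs.update_service(cluster=cluster, service=FALLBACK_SERVICE, desiredCount=desired)
    return {"scaled": desired != current, "desiredCount": desired}
```

If the On-Demand service is also managed by auto scaling, its minimum capacity may need raising too, otherwise a scale-in could undo the change.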
This setup provides a cost-efficient system where Spot services handle the majority of traffic, with On-Demand services as a reliable fallback during peak traffic or Spot interruptions. The scaling thresholds are tested and optimized over time, but the core principle remains to prioritize Spot instances for cost savings while using On-Demand instances to maintain service availability.
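As a rough illustration of the two thresholds, the sketch below builds the arguments for two Application Auto Scaling policies on the `ALBRequestCountPerTarget` metric, using the example values of 10,000 and 14,000. It assumes target tracking policies, which is only one way to express "scales at N requests per target". The cluster name, service names, and ALB resource label are placeholders.

```python
def request_count_policy(cluster, service, target_value, alb_resource_label):
    """Build keyword arguments for Application Auto Scaling's put_scaling_policy."""
    return {
        "PolicyName": f"{service}-request-count",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_value),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ALBRequestCountPerTarget",
                "ResourceLabel": alb_resource_label,
            },
        },
    }


# Placeholder label in the form app/<alb-name>/<id>/targetgroup/<tg-name>/<id>.
LABEL = "app/my-alb/0123456789abcdef/targetgroup/my-tg/fedcba9876543210"

# The lower target makes the Spot service react to rising traffic first.
spot_policy = request_count_policy("prod", "web-spot", 10_000, LABEL)
on_demand_policy = request_count_policy("prod", "web-on-demand", 14_000, LABEL)

# To apply (requires boto3 and AWS credentials):
#   boto3.client("application-autoscaling").put_scaling_policy(**spot_policy)
```

Both policies watch the same target group, so the only difference between them is the target value.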
SPOT vs On-Demand RunningTaskCount over the past few months
The graph shows the RunningTaskCount for the Spot and On-Demand services over the past few months. The orange regions, which represent Spot tasks, dominate, indicating that Spot handled most of the workload, consistent with our cost-saving strategy. The blue lines, which represent On-Demand tasks, stay small and rise only when Spot capacity is unavailable or traffic spikes. This shows Spot carrying the business workload for the most part, with the On-Demand service stepping in as a reliable backup when needed.
Monitoring
We have also created a Lambda function that triggers on ECS Spot task interruptions and logs a data point to a custom CloudWatch metric for each ECS service. This lets us track the number of containers lost to Spot interruptions over time and gives us valuable insight into the performance of the Spot-based infrastructure.
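A function of this kind might be sketched as follows. It assumes the Lambda is subscribed to "ECS Task State Change" events and treats a task as Spot-interrupted when the event's `stopCode` is `SpotInterruption`. The namespace, metric name, and dimension name are hypothetical. Counting only events whose `lastStatus` is `STOPPED` is a design choice made here so that a task passing through several states is counted once.

```python
NAMESPACE = "Custom/ECS"             # hypothetical CloudWatch namespace
METRIC_NAME = "SpotInterruptions"    # hypothetical metric name


def interruption_metric(event):
    """Return a MetricData entry for a Spot-interrupted service task, else None."""
    if event.get("detail-type") != "ECS Task State Change":
        return None
    detail = event.get("detail", {})
    if detail.get("stopCode") != "SpotInterruption":
        return None
    if detail.get("lastStatus") != "STOPPED":
        return None
    group = detail.get("group", "")  # "service:<name>" for tasks owned by a service
    if not group.startswith("service:"):
        return None
    return {
        "MetricName": METRIC_NAME,
        "Dimensions": [{"Name": "ServiceName", "Value": group[len("service:"):]}],
        "Value": 1,
        "Unit": "Count",
    }


def interruption_handler(event, context=None, cloudwatch=None):
    datum = interruption_metric(event)
    if datum is None:
        return False
    if cloudwatch is None:
        import boto3  # imported lazily so the module loads without boto3
        cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(Namespace=NAMESPACE, MetricData=[datum])
    return True
```

Summing this metric per service over a day or a week gives the interruption trend described above.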
Savings
The cost savings from adopting Spot containers have been substantial. Below is a summary of our monthly ECS cost over this period.
Month | Dec 2023 | Jan 2024 | Apr 2024 | Jul 2024 | Aug 2024 |
ECS Cost | $148,665.20 | $114,588.88 | $83,029.26 | $42,966.87 | $36,287.29 |
Using Spot services can cut infrastructure costs by a wide margin without sacrificing the performance you need. In the figures above, monthly ECS cost fell from $148,665.20 in December 2023 to $36,287.29 in August 2024, a reduction of about 76%, all while keeping the services scalable and available.
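To put the table in relative terms, the short script below computes each month's reduction against the December 2023 baseline. The figures are copied from the table. The helper name `reduction_pct` is ours for illustration.

```python
from decimal import Decimal

# Monthly ECS cost in USD, taken from the table above.
ECS_COST = {
    "Dec 2023": Decimal("148665.20"),
    "Jan 2024": Decimal("114588.88"),
    "Apr 2024": Decimal("83029.26"),
    "Jul 2024": Decimal("42966.87"),
    "Aug 2024": Decimal("36287.29"),
}


def reduction_pct(baseline, current):
    """Percentage by which `current` is lower than `baseline`."""
    return (baseline - current) / baseline * 100


baseline = ECS_COST["Dec 2023"]
for month, cost in ECS_COST.items():
    print(f"{month}: ${cost:,.2f} ({reduction_pct(baseline, cost):.1f}% below Dec 2023)")
```

By this measure, the August 2024 bill is roughly three quarters lower than the December 2023 bill.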
Conclusion
Running ECS production workloads on Fargate Spot instances has been a real game changer, combining cost efficiency with operational flexibility. At TO THE NEW, we’ve mastered the art of designing infrastructure that smoothly handles Spot interruptions and scaling, ensuring big savings without losing out on reliability. Our approach uses Spot as the primary service, with On-Demand as a fallback, so we’re always ready for any changes in demand or capacity. With smart monitoring and a well-planned scaling strategy, we manage to cut costs while still delivering the performance and availability required in production. Stay tuned for more updates as we keep pushing the boundaries of cloud cost optimization and operational excellence.