Spot.IM is one of the earliest clients in our portfolio. The very successful startup, based in NYC and TLV, powers the conversation platforms of some of the busiest media providers on the internet. When giants like FoxNews, Diply, AOL, PlayBoy, HuffPost and many more rely on your production system to back their conversations, it needs to be stable, resilient, scalable and performant, all at once. Conversations are the second major way users interact with media websites, which makes them one of the two major revenue producers. As such, every outage immediately translates to 💰 loss, and lots of it.

Motivation

The company's architecture had always strived toward microservices, and services were deployed in containers from very early on. Yet for a long period of time they were deployed "hard" onto dedicated hosts, which made them inflexible, slow to scale and inefficient. We came to the understanding that a re-architecture was required and that orchestrating the services was the natural way to go. Keeping in mind the implications of any mistake, it had to be planned thoroughly.

Preparation

Migrating a live production system that is actively serving customers to a new platform requires designing, planning and zooming into specific details. Even then, the process has to involve short feedback cycles, since designs and plans are never final in their first draft. To this end we used CloudFormation to template the infrastructure from the VPC level all the way down to ECS TaskDefinition attributes. After a multilevel iteration phase that included additional networking and service encapsulation, we turned to implementing the entire monitoring and alerting system as templates as well: CloudWatch Metrics, Alarms, Logs and Dashboards were all deployed as CloudFormation stacks. Once the templates were ready to work with, we started our POC.

  • Side note — the iterations over the environment architecture never really end; additional features, tweaks and fine-tuning are required throughout the production lifetime, and the templates have to be kept up to date so they are ready for a full deployment in case of disaster.
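Because the templates are the source of truth, every environment change went through a stack create or update. As a minimal sketch, assuming a hypothetical monitoring template and stack name (not the actual Spot.IM templates), a small Boto3 helper can deploy a CloudFormation stack and wait for it to settle:

```python
import boto3
from botocore.exceptions import ClientError

cfn = boto3.client("cloudformation", region_name="us-east-1")

def deploy_stack(stack_name: str, template_path: str, parameters: dict) -> None:
    """Create the stack if it doesn't exist, otherwise update it in place."""
    with open(template_path) as f:
        template_body = f.read()

    params = [{"ParameterKey": k, "ParameterValue": v} for k, v in parameters.items()]
    try:
        cfn.create_stack(
            StackName=stack_name,
            TemplateBody=template_body,
            Parameters=params,
            Capabilities=["CAPABILITY_NAMED_IAM"],
        )
        waiter = cfn.get_waiter("stack_create_complete")
    except ClientError as err:
        # Stack already exists -> update it instead.
        if err.response["Error"]["Code"] != "AlreadyExistsException":
            raise
        cfn.update_stack(
            StackName=stack_name,
            TemplateBody=template_body,
            Parameters=params,
            Capabilities=["CAPABILITY_NAMED_IAM"],
        )
        waiter = cfn.get_waiter("stack_update_complete")

    waiter.wait(StackName=stack_name)

# Example: a stack holding CloudWatch alarms and dashboards for staging.
deploy_stack("staging-monitoring", "templates/monitoring.yaml", {"Environment": "staging"})
```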

Phase 1 | Proof of Concept

To run the POC, we decided to implement the architecture by deploying one specific core service of the company into a staging environment. If all went well, we would proceed to a full staging cluster.

Load Test

The first deploy went well, and the service was successfully serving its purpose from its new location. However, during the initial load test we identified that the ECS-optimized AMIs limit the number of open files on the host by default. Since the hosted containers inherit the maximum values from their underlying machine, we ended up allowing only ~1000 open files to our service, which in our specific use case was unacceptable. Tweaking the AMI to allow inheritance of a larger open-files limit improved the test results considerably.

  • TIP: The open-files limit is something you can handle by changing the AMI, or as simply as with a CloudFormation init script on the instance. Note — being a host-level configuration, this is something you cannot achieve with Fargate; definitely something to consider.
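For illustration, here is a minimal sketch of the init-script route, assuming the Amazon Linux ECS-optimized AMI of that period (the /etc/sysconfig/docker path, the limit values and every resource name below are assumptions, not our production setup). The user data raises the Docker daemon's default nofile ulimit so containers inherit a higher ceiling:

```python
import boto3

# User data executed at instance boot: raise the Docker daemon's default
# open-files ulimit, restart Docker, then join the (hypothetical) cluster.
# The /etc/sysconfig/docker path applies to the Amazon Linux ECS-optimized AMI.
USER_DATA = """#!/bin/bash
echo 'OPTIONS="--default-ulimit nofile=262144:262144"' >> /etc/sysconfig/docker
service docker restart
echo ECS_CLUSTER=staging-cluster >> /etc/ecs/ecs.config
start ecs
"""

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_launch_configuration(
    LaunchConfigurationName="ecs-staging-lc",   # hypothetical name
    ImageId="ami-0123456789abcdef0",            # placeholder ECS-optimized AMI id
    InstanceType="m5.large",
    IamInstanceProfile="ecsInstanceRole",
    UserData=USER_DATA,
)
```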

Phase 2 | Non-production migration

After a successful POC, passing the load tests and yielding good monitoring metrics, we went on to a full staging cluster deployment. At this stage we would be utilizing a full ECS cluster for the first time, hosting ~60% of the company's backend services. We started by deploying the core service: a cache server that handles 100% of the incoming traffic and is the most complex of them all. The deployment was successful for all services, but that was no surprise. What we did encounter for the first time was scale activity projected onto two different planes:

  1. Container level: the more common and well-understood plane of a scalable component, as it reflects the lowest granularity level of our infrastructure. A container is an instance of a service that needs to be duplicated in order to serve a larger number of requests or any other intensive workload (see the scaling-policy sketch after this list).

  2. Instance level: for the first time, we had to deal with a scale trigger that wasn't tied directly to a specific workload, but to the growing demand of the infrastructure as a whole. Confused? So was I. All of a sudden, resource management becomes a whole new level of infra deployment that we had never dealt with before. I'm practically being Amazon myself, except for the fact that I'm already on Amazon. Handling instance-level scale is a relatively easy task when scaling out (i.e. adding resources), but scaling in is a different level of complexity that requires a bit of intervention (so I've learned). To date, the AWS ECS team still hasn't released a good solution for scaling down the underlying ECS infrastructure. Whether that's because scale-down doesn't produce revenue or because it makes a very good selling point for the new Fargate service, at the end of the day someone needs to take care of it. You are more than welcome to find my solution here and here; a simplified sketch of the drain-before-terminate idea appears after this list.
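On the container plane, scaling comes down to attaching a scaling policy to each ECS service. As a minimal sketch, assuming Application Auto Scaling with a CPU target-tracking policy (the cluster name, service name and thresholds are hypothetical; our real services were tuned against per-service CloudWatch alarm thresholds, as described later in Phase 3):

```python
import boto3

appscaling = boto3.client("application-autoscaling", region_name="us-east-1")

# Register the ECS service's desired count as a scalable dimension.
appscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/staging-cluster/cache-service",   # hypothetical cluster/service
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=50,
)

# Track average CPU: add containers above ~60%, remove them when usage drops.
appscaling.put_scaling_policy(
    PolicyName="cache-service-cpu-target",
    ServiceNamespace="ecs",
    ResourceId="service/staging-cluster/cache-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```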
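On the instance plane, the tricky part is scaling in without killing tasks mid-flight. One common pattern is to hold the Auto Scaling termination with a lifecycle hook, set the container instance to DRAINING, and only release the termination once ECS has rescheduled its tasks. A hedged sketch of that idea (in practice this would run inside a Lambda triggered by the lifecycle hook; all names are hypothetical):

```python
import time
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

def drain_and_release(cluster: str, container_instance_arn: str,
                      asg_name: str, lifecycle_hook: str, instance_id: str) -> None:
    """Drain an ECS instance before letting the Auto Scaling group terminate it."""
    # Stop placing new tasks and ask ECS to move the running ones elsewhere.
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )

    # Poll until no tasks remain on the instance.
    while True:
        desc = ecs.describe_container_instances(
            cluster=cluster, containerInstances=[container_instance_arn]
        )["containerInstances"][0]
        if desc["runningTasksCount"] == 0:
            break
        time.sleep(15)

    # Tell the Auto Scaling group it may now complete the termination.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=lifecycle_hook,
        AutoScalingGroupName=asg_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
```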

Experimenting with Chaos 🔥

Once a full-grown cluster was in place, monitored, handling scale activities and looking good on load test results, it was time for the last preproduction test: Chaos.

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

  • principlesofchaos.org

Experimenting with chaos doesn't have to mean a continuously running fleet of applications like Netflix's Simian Army. Chaos is about deliberately creating failures in different areas of the infrastructure to test their durability during mass failures, major outages that are out of our control, and so on. We tested chaos in the following scenarios:

  1. Over Load — Calling this test "DDoS" would be an exaggeration, but we definitely created a distributed load of requests on the system to test its responsiveness to peaks. The goal: being able to handle a steep peak of interaction caused by a major news event in the world. These kinds of scale peaks happen 2–3 times a year, but can be fatal to the company if its systems are not durable enough to handle the load and respond. Success: handling a load growth of up to 300% over one hour with a maximum latency of 3 seconds. How: to quickly spin up a multi-region request load we used Goad, which utilizes a multi-Lambda setup to facilitate load at any scale.

  2. Sudden Death — Instances on AWS sometimes just die, whether because of a planned removal of physical resources on AWS's side or a host-level failure that causes a malfunction. Normally, stopping and starting the instance would do the trick, since that migrates it to a new physical location. But we want to be completely agnostic to the reason rather than handle each one gracefully, and as long as we're dealing with stateless pieces of application we shouldn't care or put in any mitigation effort. The goal: drop random instances on ECS (no underlying stateful services involved) without any noticeable effect on the application. Success: 100% uptime, keeping ALB errors (the ones that originate in an unresponsive backend service) unchanged, with a maximum latency of 3 seconds. How: we used a simple Boto3-based application to randomly pick an instance from a given cluster and destroy it (a sketch appears after this list), together with Goad to keep a production-like level of requests. Lessons learned: the level of redundancy differs between services. It's not dramatically different, but it's a very good way to get to know how your microservices respond under fire.

  3. 3rd Party Disconnections — We rely on 3rd-party services on a daily basis: npmjs, Docker Hub, external Docker registries, external SDKs and software libraries that are fetched from remote locations during deployments. We need to be able to handle outages of the services we rely on, and perhaps build buffers and proxies so we can stand on our own. The goal: withstanding an outage of each of our 3rd-party services without errors in production, and being able to deploy critical applications even during such outages. Success: zero impact on production, zero errors related to external Docker images, NPM packages or Ruby gems. How: by killing the NAT connection of our private subnets' resources (sketched after this list), we were able to test the services' ability to withstand outgoing-web outages. Lessons learned: live services usually won't have any difficulty, since they use compiled software and preinstalled dependencies. Deployments, however, tend to fetch remote dependencies and will fail if these are unreachable. The conclusion is to cache / proxy / store as many of these as possible. We started using AWS's ECR for storing official images as well as our own, and we also started caching as many packages as possible using Nexus.
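The "Sudden Death" experiment in item 2 boils down to a few lines of Boto3. Here is a hedged sketch of the idea rather than the exact tool we ran (the cluster name is a placeholder):

```python
import random
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

def kill_random_instance(cluster: str) -> str:
    """Pick a random container instance in the cluster and terminate it."""
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    victim_arn = random.choice(arns)

    # Resolve the underlying EC2 instance id, then terminate it abruptly,
    # simulating an unplanned host failure.
    instance = ecs.describe_container_instances(
        cluster=cluster, containerInstances=[victim_arn]
    )["containerInstances"][0]
    instance_id = instance["ec2InstanceId"]

    ec2.terminate_instances(InstanceIds=[instance_id])
    return instance_id

print(kill_random_instance("staging-cluster"))
```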
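The NAT kill in item 3 can be equally small. One way to do it, assuming the private route tables send 0.0.0.0/0 through a NAT gateway (the route table and NAT gateway IDs below are placeholders), is to delete that default route for the duration of the experiment and restore it afterwards:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

PRIVATE_ROUTE_TABLES = ["rtb-0aaa1111bbbb22222"]  # placeholder route table ids
NAT_GATEWAY_ID = "nat-0ccc3333dddd44444"          # placeholder NAT gateway id

def cut_outbound_internet() -> None:
    """Remove the default route through the NAT gateway from each private route table."""
    for rtb in PRIVATE_ROUTE_TABLES:
        ec2.delete_route(RouteTableId=rtb, DestinationCidrBlock="0.0.0.0/0")

def restore_outbound_internet() -> None:
    """Put the default route back once the experiment is over."""
    for rtb in PRIVATE_ROUTE_TABLES:
        ec2.create_route(
            RouteTableId=rtb,
            DestinationCidrBlock="0.0.0.0/0",
            NatGatewayId=NAT_GATEWAY_ID,
        )
```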

Phase 3 | Money time: PRODUCTION

At this point, we had a fully functional ECS staging cluster, load tested, chaos experimented on, running with continuous integration and delivery processes. Next stop → PRODUCTION

Building a production cluster was a rather simple task. After experimenting with heavy load and "accidental" failures, we had a pretty good idea of how real traffic would be handled within the new system. The process involved deploying four core application services, one service at a time, into a cluster of 15 containers. One by one the services were migrated from the old environment to ECS, carefully letting them live in cycles of 24 hours while monitoring them as closely as possible. After a week of slow migrations, the first production cluster was live and healthy. Over the 72 hours that followed, the mission was to bring scalability to perfection by fine-tuning each service with its relevant metric levels and thresholds, upon which it would scale out or in. Doing that included manually tuning CloudWatch metrics and alarms, and later translating the final values into the CloudFormation templates used to back up and deploy the services.
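Each of those thresholds is just a few numbers on a CloudWatch alarm, which is why tuning them by hand first and only then freezing the final values into the CloudFormation templates worked well. A hedged Boto3 sketch of one such alarm (service names, threshold and policy ARN are placeholders, not our production values):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical scale-out policy ARN produced by Application Auto Scaling.
SCALE_OUT_POLICY_ARN = "arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:example"

cloudwatch.put_metric_alarm(
    AlarmName="cache-service-cpu-high",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "production-cluster"},
        {"Name": "ServiceName", "Value": "cache-service"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=65.0,                       # the value we keep tuning by hand
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[SCALE_OUT_POLICY_ARN],  # trigger the service's scale-out policy
)
```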

Phase 4 | From MVP to full power deployment

Running a production cluster with a few heavy-duty core services provided confidence in the system and the architecture. This was the starting pistol for migrating the entire production platform. Over the course of 3 months, different clusters were created and services were migrated into them. We decided to logically separate the clusters into 6 groups of applications, and that is how they were divided and deployed. Each cluster went through the same cycle: deployment, migration, monitoring, fine-tuning, monitoring the results, and once we were happy with them, the final step of updating the templates.

Aftermath

Major things I’ve learned in the process of building and migrating to ECS:

  1. Testing is everything. In system architecture, as in application development, knowing a component is production-ready can only be done with the help of tests.

  2. When migrating to a new technology, get in touch with its creators and make sure you possess every bit of information before starting to work. It'll save a lot of headaches and heartaches. If you can find boilerplates created by the developers and build on top of them, you'll be starting off on the right foot and save yourself hours of work and debugging sessions.

  3. Get out of your comfort zone — there’s no bigger cliché, but it’s true; use chaos, try load testing with new tools, experiment and research with the guidance of the creators. Do not enforce old habits on a system just because you’re comfortable with them.

  4. Keep your infrastructure code updated, you'll probably be using it sooner than you think. Whether for DR purposes or for duplicating environments, keep your infrastructure as code and make sure it's up to date. The best way to do that? Deploy using template updates!

The production environment described above has been running and growing ever since, 14 months and counting. Services have been added, and some of the biggest content providers in the world, such as FoxNews, Diply, AOL, PlayBoy and HuffPost, along with thousands of others, enjoy its resiliency and scalability every day.

Building on the success of the first iteration of the ECS deployment, and striving for perfection in deployment speed, infrastructure-as-code usage, rollback mechanisms and automatic updates, we've conceived a plan we call "ECS V2". The path to V2, the new concepts and the improvements will be detailed in my next story. Stay tuned.

My name is Omer, and I am an engineer at ProdOps — a global consultancy that delivers software in a Reliable, Secure and Simple way by adopting the DevOps culture. Let me know your thoughts in the comments below, or connect with me directly on Twitter @0merxx. Clap if you liked it, it helps me focus my future writings.

Originally published at www.prodops.io.