Libbie Jacobs-Moyal is a team leader of full-stack developers in the R&D Infrastructure & Oracle group at Panaya. She was a crucial member in some of Panaya’s most challenging development projects across different areas of expertise. Recently she led the cross-group Fargate expedition, taking Panaya’s infrastructure right into the future.
How Did This Start?
Over the years, our production database and processes became extremely heavy. To accommodate our customers' needs, we kept adding new features and customizations, with the intent of improving the user experience and providing additional value to our customers.
One day, we woke up and realized that our production offline processes had become slower and prone to failures related to heavy usage.
We started to check what was going on, and we discovered that some of our processes were excessively heavy and were hurting the performance of our EC2 servers.
So What Do We Do?
First, we concluded that some activities that run in parallel in the background were becoming heavily used and would soon need a scaling solution, both to avoid hurting performance and to deliver results faster than today.
Our next step was to look for technical solutions that could run in the background and help scale up our services without any user impact.
Well – guess what, there are indeed a bunch of solutions out there.
To find the best solution, we analyzed our offline services extensively and discovered that our servers are sometimes idle and at other times totally overloaded. This led us to conclude that the solution had to be based on computing power that scales on demand, and not necessarily on machines that are available 24/7.
We understood that this approach could also help us save $$$$$$, because we would pay for the services as used, when we need them, no more and no less.
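The pay-as-used argument can be made concrete with a little arithmetic. The sketch below compares servers that stay up all month with compute that is billed only while jobs run. Every number in it (rates, server count, job count, durations) is a hypothetical placeholder chosen for round results, not a real price or a measured figure.

```python
def always_on_monthly_cost(hourly_rate: float, servers: int,
                           hours_in_month: float = 730.0) -> float:
    """Cost of servers that stay up all month, whether busy or idle."""
    return hourly_rate * servers * hours_in_month


def on_demand_monthly_cost(jobs_per_month: int, hours_per_job: float,
                           hourly_rate_per_job: float) -> float:
    """Cost when each job is billed only for the time it actually runs."""
    return jobs_per_month * hours_per_job * hourly_rate_per_job


if __name__ == "__main__":
    # Hypothetical placeholder numbers, for illustration only.
    fixed = always_on_monthly_cost(hourly_rate=0.5, servers=4)
    on_demand = on_demand_monthly_cost(
        jobs_per_month=600, hours_per_job=0.5, hourly_rate_per_job=0.25
    )
    print(f"always on: {fixed:.2f}, on demand: {on_demand:.2f}")
```

The gap between the two results depends entirely on how much of the month the servers would otherwise sit idle.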
Since we also required the “separation” of processes to avoid cross-impact, we looked at technologies that would enable encapsulation and separation of processes.
And “voila” there it was! AWS Fargate.
Actually, the solution was apparent as we are AWS partners and have considerable experience with their services.
What Was the Next Step?
We took some of our heavier processes that are related to our impact analysis and encapsulated each one of them in a separate Docker container. By using containers, we eliminated the processes' cross-impact!
Nevertheless, we still risked overloading a few EC2 instances with an unbounded number of containers that would be spawned as a result of the background impact analysis.
Our next step was to scale this solution, so we examined Spot Instances, Lambda and Fargate. With each of them, you pay for the time your process runs.
Lambda was a great potential solution, but it is limited to 15 minutes per invocation, and our processes require more time.
Spot Instances are competitively priced, but they can be interrupted mid-process. We could possibly use them, but we would have to develop a suitable recovery mechanism. Hence, the winner was Fargate!
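The trade-offs above can be summarized as a small decision function. This is an illustrative sketch of the reasoning rather than production code: the function name and its parameters are hypothetical, and the only hard number is Lambda's 15-minute limit mentioned above.

```python
LAMBDA_MAX_SECONDS = 15 * 60  # Lambda's maximum run time per invocation


def choose_compute(expected_runtime_seconds: int,
                   has_recovery_mechanism: bool) -> str:
    """Pick a pay-per-use option for one offline process (hypothetical helper)."""
    if expected_runtime_seconds <= LAMBDA_MAX_SECONDS:
        return "lambda"
    if has_recovery_mechanism:
        # Spot capacity can be reclaimed mid-process, so it only fits
        # work that can safely resume or restart.
        return "spot"
    return "fargate"
```

For a long-running analysis with no recovery mechanism, `choose_compute(3600, False)` falls through to `"fargate"`, which mirrors the choice described here.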
We used the AWS SDK to spin up a Docker container on Fargate that runs our analysis process, allowing a virtually unlimited number of processes to run simultaneously. We pay only for the resources used: time, CPU and memory.
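As a rough illustration, launching one analysis container on Fargate through the AWS SDK could look like the sketch below. It is written in Python and only builds the arguments for the ECS `run_task` call; with boto3 they would be passed as `boto3.client("ecs").run_task(**request)`. The cluster, task definition, container name, subnets and the `PROJECT_ID` variable are hypothetical placeholders for illustration, not actual configuration values.

```python
from typing import Any, Dict, List


def build_run_task_request(
    cluster: str,
    task_definition: str,
    container_name: str,
    subnets: List[str],
    project_id: str,
) -> Dict[str, Any]:
    """Build keyword arguments for the ECS run_task API (Fargate launch type)."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "count": 1,
        # Fargate tasks use awsvpc networking, so subnets must be given.
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "assignPublicIp": "DISABLED",
            }
        },
        # Hand the work item to the container as an environment variable
        # (PROJECT_ID is an assumed name).
        "overrides": {
            "containerOverrides": [
                {
                    "name": container_name,
                    "environment": [
                        {"name": "PROJECT_ID", "value": project_id}
                    ],
                }
            ]
        },
    }
```

Each call launches one isolated task. The CPU and memory sizes are declared in the task definition, and together with the run time they determine what is billed.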
To summarize, our flow evolved from running many Java processes on a few EC2 instances, through running Docker containers on those same EC2 instances, to running containers on Fargate today, which allows offline analysis at large scale.
One of the challenges with this solution was monitoring and investigating possible problems. We solved this with Amazon CloudWatch, which gave us access to the logs required for investigation and let us set up alerts on different thresholds, such as CPU and memory usage. We also used EFS to store logs and configurations in a visible location that is available to developers and others.
Once we had accomplished all this, the task was complete.
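A threshold alert of this kind could be defined roughly as follows. The sketch builds the arguments for CloudWatch's `put_metric_alarm` call; with boto3 they would be passed as `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. The helper name, the alarm name and the namespace, metric, dimension and threshold values are assumptions for illustration. Which metrics are actually available depends on how the tasks are launched and on whether Container Insights is enabled.

```python
from typing import Any, Dict


def build_threshold_alarm(
    alarm_name: str,
    namespace: str,
    metric_name: str,
    dimensions: Dict[str, str],
    threshold: float,
    period_seconds: int = 60,
    evaluation_periods: int = 3,
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": name, "Value": value}
            for name, value in dimensions.items()
        ],
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        # Fire when the averaged metric rises above the threshold.
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

For example, `build_threshold_alarm("analysis-cpu-high", "AWS/ECS", "CPUUtilization", {"ClusterName": "analysis-cluster", "ServiceName": "analysis-service"}, 80.0)` describes an alarm that fires when average CPU utilization stays above 80% for three consecutive one-minute periods. All names in that call are hypothetical.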
Here is our Workflow, Simplified a Bit:
Great – What’s Next?
Our next bottleneck is the database. This is hardly surprising, since we now have many processes running on Fargate that all work against the same DB. Therefore, we are looking for solutions that would streamline our DB usage.
We are evaluating Spot Instances for recoverable, low-cost processes (either inside or outside Fargate). Kubernetes is another technology that we may adopt in order to orchestrate our containers. Finally, SQS and Lambda are a potential solution for receiving work requests and dispatching them to the containers.
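One possible shape for such a dispatcher is sketched below: a Lambda handler receives a batch of SQS messages and turns each one into a container launch. This is a sketch of an idea under evaluation, not an existing implementation. The message format (`{"project_id": ...}`) and the `launch_task` callback are hypothetical; in practice the callback would wrap a call such as ECS `run_task`.

```python
import json
from typing import Any, Callable, Dict


def make_handler(
    launch_task: Callable[[str], None],
) -> Callable[[Dict[str, Any], Any], Dict[str, int]]:
    """Create a Lambda handler that dispatches SQS work requests."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
        dispatched = 0
        # An SQS-triggered Lambda receives messages under event["Records"],
        # with each message payload in record["body"].
        for record in event.get("Records", []):
            message = json.loads(record["body"])
            launch_task(message["project_id"])  # assumed message field
            dispatched += 1
        return {"dispatched": dispatched}

    return handler
```

Injecting `launch_task` keeps the handler easy to test without AWS access, and the queue absorbs bursts of requests before they reach the containers.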
The future is bright!
More to come in the near future!
At Panaya we understand that behind each incredible product there is a team of experts. These individuals are our pride. They go above and beyond to make sure each component and stage in the process is done with perfect accuracy.
This initiative opens the curtain to introduce you to the people at Panaya.