At Sensor Tower, we operate a fleet of thousands of background workers that execute various tasks to bring our data and product to our users. In this blog, we are sharing some of the tools and techniques that have enabled us to process billions of jobs a month with minimal cost.
Often times when you have a huge amount of data to process, it’s hard to make sure that you are operating at the highest productivity without overloading a particular subset of resources. For example, early on in the company’s history we realized that we needed to have more workers for processing data and we managed to ask AWS to provide us with additional servers. We thought this would solve our issues, but, after only a couple of hours, our database crashed because the increased number of workers had overloaded it.
This was the first of many lessons that scaling taught us the hard way. Fast forward a few years and we have implemented many changes that allow us to process billions of jobs that interact with our databases hosting more than 60 TB of data in more than 90 billion documents.
Sensor Tower’s products provide mobile app metrics. We supply crucial competitive insights into mobile app advertising, revenue, downloads, usage, and other aspects of the ecosystem. Our customers are from a wide range of business types such as app developers that want a read on the competition, financial institutions that make data-driven investment decisions, and media that cover current mobile trends.
With that, allow me to tell you a little bit more about our team. We are a global company with offices and employees on multiple continents. Our main office is in beautiful San Francisco. Over time, we have grown to a partially remote company and, in order to make this work, we make every effort to feel like one team—both by using modern technology to bring us closer, but also by regularly having the team come together at the same physical location.
When I was looking for a new job, I had a set of important characteristics I was looking for. After many years in the industry, I’ve come to the conclusion that the key characteristics I am looking for in business are:
Solving a real problem – This is core to having a healthy business.
A growing business – If business is good, most other problems can be solved.
Technical depth – Don’t settle for simple, copycat, or trivial. Solve unique, hard problems and challenge yourself. Ten years from now you will thank yourself.
Sensor Tower checked all the boxes for me. As a bonus, our organization has a truly great culture. I hope that everyone that starts working at Sensor Tower enjoys the same feeling of warmth I got throughout my interviewing, onboarding, and day-to-day work. People are literally smiling all the time. It’s infectious and I love it. I’m not going to ramble on about how great it is here, but if you’d like to know more please read about our values and also check out our open positions.
The Sensor Tower Stack
Infrastructure and Tooling Overview
We are primarily an AWS shop and we manage our infrastructure using Ansible together with some home-baked scripts. The total number of EC2 instances we employ varies over time, but it normally hovers around 1,000 instances. Roughly half of the instances are used for Sidekiq workers.
For our CI pipeline we use Jenkins, which facilitates unit testing, integration testing, linting, docs, and the builds artifacts.
How We Look at Jobs
Our worker fleet executes a plethora of tasks:
Ingesting data from external sources
A task is executed using one or many Sidekiq jobs. A Sidekiq job is a program (a Ruby class) that carries out a subset of a task, and larger tasks are normally broken down into many smaller jobs that can then be processed in parallel by a fleet of workers. An example of such a division could be a task that calculates something for each month in a year and updates a database. This task can be broken up into twelve pieces where each job is working on one month each.
Custom Solution: Sidekiq Throttled
Sidekiq Throttled is an open source, custom middleware for Sidekiq developed by Sensor Tower. We are huge open source nerds and we try to share what we can with the rest of the community.
The purpose of Sidekiq Throttled is to control maximum job concurrency. This can be critical to ensure that a large number of jobs queued does not overload a constrained resource that they access.
Sidekiq Throttled offers a lot of features and, if you want to learn more, please head over to its Github page.
Our Take on Queue and Instance Allocation
Sidekiq manages priority and throughput of tasks by assigning a weight to each queue that specifies how often a Sidekiq worker should pull a job from that queue. For a basic Sidekiq system, these weights are shared among all Sidekiq worker nodes. This makes configuration straightforward, but it lacks the flexibility needed when running a wide range of jobs with different characteristics, memory, and CPU requirements on a large fleet of workers over many different types of EC2 instances. What’s ideal is to make sure that jobs are run in priority order on the smallest (cheapest) instance that can execute the job, while at the same time keeping the total fleet at high utilization.
Our solution to this problem was to create a custom weights specification file for each of the EC2 instance types.
Our system uses a default queue weights file used as the base for most EC2 instance types and then an instance-specific file with the weights for that particular instance file. Below is an example of an instance type specific file:
This file, together with the defaults, is compiled into the queue weights for each instance type.
The system allows us to run jobs on the appropriately sized EC2 instance that can’t handle the load and, at the same time, utilizing the larger instances for smaller jobs when they otherwise would have remained idle. By employing this strategy, we maximize resource utilization and therefore reduce our cost.
Job Issues We’ve Encountered Over Time
There are many pitfalls when designing jobs, weights, and managing a fleet of workers. Below is a list of issues we’ve encountered in the past:
Workers that idle because there are not enough jobs of the right type
Uneven worker load, for example during certain times of the day
Jobs in a different queues that use (and overload) the same database
Jobs not executing because a server dies or was killed
Jobs executing multiple times without being idempotent
Workers running old code due to worker code update issues
Workers dying due to external resource being unreliable during deploy, for example Github
Workers running out of memory
Jobs being paused manually but the operator forgetting to turn the back on
Instance Sources and Our Approach to Cost Management
Amazon provides three basic ways of acquiring instances.
Reserved instances are leased for a long time span (years) at a very discounted price compared to spot or on-demand instances. They can either be acquired directly or bought on the secondary market from companies reselling their old instances they don’t use anymore.
Spot instances are more temporary leases and are leased on the spot instance market.
Finally, there are the on-demand instances, which is the instance type a normal customer gets if they request a temporary EC2 node. They are the most flexible in terms of lease but also the most expensive.
In order to reduce cost, we utilize all three different means of acquiring instances. At times when our system is at normal load, we use mostly spot instances and reserved instances. When we are at high system load, we use on-demand instances. These are brought up using auto scaling or manual acquisition.
For us, it was a logical choice to create scripts that use Amazon’s API to periodically check the marketplace for both spot instances and reserved instances to be able to lease instances at discounted rates. This way, we can get the majority of the instances we need for normal load at a discounted price.
How We Handle Auto Scaling
Amazon’s Auto Scaling feature enables dynamic control of the size of the instance fleet. A set of user-defined rules dictates if the fleet should grow or shrink between a minimum number up to a maximum number of instances. This allows a customer to use custom logic to ensure adequate throughput at the right cost.
For certain instance types we have fixed number of instances, but for others we use Sidekiq queue size (the number of jobs in the queue) to govern if the number of instances should grow or shrink.
Modern data-centric SaaS services are driven by scheduled events and are often interfacing with external systems. Load will vary across the day and week, it will be affected by customer activity, new feature rollouts, and, in general, fluctuate significantly. For early stage startups, over-provisioning is okay as a solution to this, since finding product and market fit is more important than revenue. However, as a business matures, the cost of background workers tends to scale with the business and the loss in profit margins due to resource waste will soon become significant. In addition to this robust deployment, monitoring and troubleshooting systems are important to ensure product quality, as problems tends to scale with the size of your platform.
In this blog, we have shown you some of the tools we use, some open source software, and general strategies to greatly reduce OPEX and robustness of your worker fleet. These are by no means the limit of what we think is possible in terms of optimizing our approach to these challenges. There are still many opportunities to explore and solutions to find. If you would like to join us in this and other projects, please take a look at our careers page and get in touch.