Many data aggregators and ad tech companies are struggling with issues similar to sovrn's.
That is, as the supply of programmatic advertising grows, the amount of raw data collected grows exponentially as well.
In turn, the architectures and infrastructure that process the bids and data, which were designed for much lower volumes, cannot handle the peak volumes now being seen across the industry.
Therefore, failures that were not impactful in the past are now magnified. Examples include major delays in ingesting raw data into the "data lake" (i.e., the common database) and major delays in processing the data in real time due to a lack of processing capacity.
Recovery also becomes much more complex, because the platform must continue to service the current volume while "backfilling", or catching up on processing for delayed data.
A good metaphor is the old Lucy skit with the conveyor belt of candy moving faster than she can pack the boxes. In our case, the candy is the data and the boxes are the dashboards and reports. To catch up, we have to add more workers (i.e., capacity) so that we can keep up with current data and process the backlog at the same time.
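The catch-up arithmetic behind the metaphor can be sketched in a few lines. This is an illustration only: the function name and the example rates are hypothetical, not sovrn figures, and it assumes arrival and processing rates stay constant during the recovery.

```python
def hours_to_clear_backlog(backlog_events, arrival_rate, processing_rate):
    """Hours needed to drain a backlog while new data keeps arriving.

    Rates are in events per hour. Only the capacity left over after
    serving current traffic is available for backfilling.
    """
    spare_rate = processing_rate - arrival_rate
    if spare_rate <= 0:
        # The belt moves at least as fast as the packers: the backlog never shrinks.
        return float("inf")
    return backlog_events / spare_rate


# Hypothetical numbers, for illustration only.
backlog = 6_000_000   # events delayed during an incident
arrival = 1_000_000   # events per hour still coming in

print(hours_to_clear_backlog(backlog, arrival, 1_000_000))  # inf
print(hours_to_clear_backlog(backlog, arrival, 1_500_000))  # 12.0
print(hours_to_clear_backlog(backlog, arrival, 3_000_000))  # 3.0
```

With capacity equal to current volume the backlog never clears; recovery time depends entirely on how much capacity is added beyond what current traffic consumes.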
There are two platform architectures companies typically use: the first is dedicated servers in a fixed set of data centers; the second is the "elastic cloud", where capacity is virtualized and the company depends on the cloud provider for base services such as virtual servers, storage, and data processing software.
In the case of dedicated servers, companies are required to make capital investments to build out dedicated infrastructure capacity.
In the case of cloud, companies take advantage of "just in time" capacity, using automation to scale.
sovrn is taking a hybrid approach, meaning we use both dedicated data centers and cloud. We are also architecting our software to take advantage of both options based on workload requirements, i.e., running each workload where it is best served.
There are risks to both models.
In the case of dedicated data centers, there are constraints on the amount of physical capacity, and lead times to build out can be longer. This option typically requires some level of base investment and can result in idle capacity during non-peak times.
In the case of cloud, there is an implied agreement that any cloud service can be unavailable at any time. This option carries variable expense (i.e., you pay for what you use) and requires careful expense forecasting and controls.
In the case of last week's AWS outage in the Eastern Region, which impacted many publishers and demand partners, the storage service was unavailable. This caused major sites to go offline because they could not connect to their data.
Although not frequent, this scenario can occur in the cloud AND dedicated data centers.
There are common reliability patterns that can be applied in both dedicated data centers and cloud hosting to ensure that data is "offloaded" to geo-diverse hosting locations and that you can automatically move processing to a secondary hosting location to avoid impact to your business.
Think of these options as your insurance policy. As with other insurance, the costs are based on the probability of failure and the cost of failure (i.e., opportunity, brand, and revenue).
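The insurance comparison can be made concrete with a rough expected-loss calculation. The sketch below is illustrative: the function names and all figures are hypothetical, and the hourly loss is assumed to already bundle opportunity, brand, and revenue impact into one number.

```python
def expected_annual_loss(outages_per_year, hours_per_outage, loss_per_hour):
    """Expected yearly cost of outages: frequency x duration x hourly impact."""
    return outages_per_year * hours_per_outage * loss_per_hour


def is_protection_worth_it(annual_protection_cost, loss_without, loss_with):
    """True when the loss avoided exceeds what the protection costs."""
    return (loss_without - loss_with) > annual_protection_cost


# Hypothetical figures, for illustration only.
unprotected = expected_annual_loss(0.5, 4, 50_000)     # one 4-hour outage every two years
protected = expected_annual_loss(0.5, 0.25, 50_000)    # failover cuts each outage to 15 minutes

print(unprotected)  # 100000.0
print(protected)    # 6250.0
print(is_protection_worth_it(60_000, unprotected, protected))  # True
```

In this example a secondary site costing 60,000 per year pays for itself, because it avoids roughly 93,750 per year in expected loss.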
In the case of dedicated data centers, this involves working with your provider to set up data replication between data centers and deciding what level of processing you want to pay for in the second site. You can make choices about how far apart you want the data centers to be, and whether you want "hot", "warm", or "cold" processing (i.e., immediate, intermediate, or extended recovery).
When hosting in the cloud, your choices involve selecting which secondary Regions (i.e., geographic locations) you want your site to run in if the primary becomes unavailable, so that your site does not go down.
In the cloud you can also decide which "Zones" (i.e., data centers within a region) you want your site to use if a specific service is unavailable in one zone but not in others.
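The Region and Zone choices described above amount to an ordered list of fallback locations. The following is a minimal sketch of that idea, not any cloud provider's API: the function name, the region and zone names, and the notion of a "healthy" set fed by health checks are all assumptions made for illustration.

```python
def choose_location(preferences, healthy):
    """Return the first (region, zone) in preference order that is healthy.

    `preferences` is an ordered list of (region, zone) pairs: zones in the
    primary region first, then zones in secondary regions.
    `healthy` is the set of (region, zone) pairs currently passing health checks.
    """
    for location in preferences:
        if location in healthy:
            return location
    return None  # every listed location is down


# Hypothetical region and zone names.
PREFERENCES = [
    ("east", "zone-a"),
    ("east", "zone-b"),
    ("west", "zone-a"),
]

# One zone fails: stay in the primary region and move to another zone.
print(choose_location(PREFERENCES, {("east", "zone-b"), ("west", "zone-a")}))
# ('east', 'zone-b')

# The whole primary region fails: move to the secondary region.
print(choose_location(PREFERENCES, {("west", "zone-a")}))
# ('west', 'zone-a')
```

Listing zones in the primary region ahead of secondary regions keeps traffic close to home for a single-zone failure and only moves it across regions when the entire primary is unavailable.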
Ultimately, the goal is to ensure high reliability at massive volumes and to ensure that no single point of failure takes your business down!