appfleet is a cloud platform built to help users easily achieve high availability and 100% uptime.

Here we describe most of the mitigations we have in place to handle outages and failures automatically.

Automated checks

Our system runs automated checks on every action in the system. This includes, but is not limited to:

  • Deployment of new cluster settings
  • Pulling of a new container version
  • Addition and removal of nodes

It also runs periodic checks of all components, such as:

  • Overall cluster health
  • Individual node reachability and health
  • Container reachability and health
  • Routing system consistency and health
  • Validity of HTTPS certificates

On top of that, our system also takes into account any custom healthchecks that users have configured for each individual cluster.
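As an illustration of what a custom healthcheck target might look like inside a container, here is a minimal sketch using only the Python standard library. The `/healthz` path, port 8080, and the `database_ok` probe are assumptions made for this example and are not part of appfleet's documented configuration; adapt them to whatever your cluster's healthcheck is set up to call.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def database_ok() -> bool:
    """Hypothetical dependency probe; replace with a real check."""
    return True


# Hypothetical registry of checks that decide whether the app is healthy.
CHECKS: Dict[str, Callable[[], bool]] = {"database": database_ok}


def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check; return 200 if all pass, otherwise 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed check.
            results[name] = False
    code = 200 if all(results.values()) else 503
    return code, results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":  # assumed healthcheck path
            self.send_error(404)
            return
        code, results = health_status(CHECKS)
        body = json.dumps(results).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Assumed port; serve until the container is stopped.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Returning a non-2xx status when a dependency is down gives an external checker an unambiguous signal, which is the usual convention for HTTP health endpoints.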

If any component fails, independent automated systems handle the failure transparently to the user and without downtime. These systems are, however, limited to the resources the user has specified: if the user runs in only 1 region and that datacenter has an outage, we won't be able to re-route user traffic to a different region.

Single Region High-Availability

If you only need a single region, we recommend deploying at least 2 nodes in that region to ensure single-region high availability.

This way, if one of the nodes fails, we can instantly re-route user traffic to the second node.

Otherwise, the system will need some time (2-4 minutes, depending on size) to create a new node and migrate your application over.
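The difference between the two cases can be sketched in a few lines. This is only a simplified model of the behaviour described above, not appfleet's actual routing code; the `Node` type, the node names, and the `instant`/`rebuild` labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str      # hypothetical node identifier
    healthy: bool  # result of the latest health check


def route_targets(nodes: List[Node]) -> List[str]:
    """Return the names of nodes that may receive traffic."""
    return [node.name for node in nodes if node.healthy]


def failover_mode(nodes: List[Node]) -> str:
    """Classify what happens given the current health results.

    'instant': at least one healthy node remains, so traffic is re-routed.
    'rebuild': no healthy node remains, so a replacement must be created
               first (the 2-4 minute case).
    """
    return "instant" if route_targets(nodes) else "rebuild"


two_nodes = [Node("node-a", healthy=False), Node("node-b", healthy=True)]
one_node = [Node("node-a", healthy=False)]

print(failover_mode(two_nodes))  # instant
print(failover_mode(one_node))   # rebuild
```

With two nodes, losing one still leaves a routing target; with one node, nothing can serve traffic until a replacement exists.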

Multi-Region High-Availability

To ensure the best possible uptime, we recommend using a minimum of 2 regions with 2 nodes in each region.

This way, in case of a complete datacenter outage in one region, we will be able to re-route all traffic to the remaining healthy regions.
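To reason about whether a planned layout tolerates a full regional outage, a small check like the following can help. It is a sketch under stated assumptions: the topology is modelled as a plain mapping from region name to node count, and the region names are made up for the example.

```python
from typing import Dict, List

# Hypothetical topology: region name -> number of nodes deployed there.
Topology = Dict[str, int]


def regions_without_fallback(topology: Topology) -> List[str]:
    """List regions whose complete outage would leave no node anywhere else."""
    at_risk = []
    for region in topology:
        remaining = sum(n for r, n in topology.items() if r != region)
        if remaining == 0:
            at_risk.append(region)
    return at_risk


def meets_recommendation(topology: Topology) -> bool:
    """True for at least 2 populated regions with at least 2 nodes in each."""
    populated = [n for n in topology.values() if n > 0]
    return len(populated) >= 2 and all(n >= 2 for n in populated)


single_region = {"frankfurt": 2}
multi_region = {"frankfurt": 2, "new-york": 2}

print(regions_without_fallback(single_region))  # ['frankfurt']
print(regions_without_fallback(multi_region))   # []
print(meets_recommendation(multi_region))       # True
```

A single-region deployment always has its only region listed as having no fallback, which is the limitation described earlier: traffic can only be re-routed to resources that have been provisioned.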

All failover and mitigation systems are completely automated and don't require user intervention!
