appfleet is a cloud platform built to help users achieve high availability and 100% uptime.
Here we describe most of the mitigations we have in place to handle outages and failures in an automated way.
Our system runs automated checks on every action in the system. This includes, but is not limited to:
Deployment of new cluster settings
Pulling of new container version
Addition and removal of nodes
It also runs periodic checks of all components, such as:
Overall cluster health
Individual node reachability and health
Container reachability and health
Routing system consistency and health
Validity of HTTPS certificates
On top of that, our system also takes into account any custom health checks that users have configured for an individual cluster.
If any component fails, independent automated systems handle the failure transparently to the user and without downtime. These systems are, however, limited to the resources the user has specified: if a user runs in only 1 region and that datacenter has an outage, we won't be able to re-route user traffic to a different region.
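The way built-in and custom checks combine can be sketched as follows. This is a minimal illustration only, not appfleet's actual implementation: the names `CheckResult`, `run_checks`, and `cluster_is_healthy`, as well as the check names in the usage example, are hypothetical. It assumes each check is a function returning True or False, and that a check that raises an error counts as failed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: a check is any zero-argument function returning a bool.
Check = Callable[[], bool]


@dataclass
class CheckResult:
    name: str
    healthy: bool


def run_checks(builtin: Dict[str, Check], custom: Dict[str, Check]) -> List[CheckResult]:
    """Run platform checks first, then the user's custom checks."""
    results: List[CheckResult] = []
    for source, checks in (("builtin", builtin), ("custom", custom)):
        for name, check in checks.items():
            try:
                healthy = bool(check())
            except Exception:
                # Assumption: a check that crashes is treated as a failed check.
                healthy = False
            results.append(CheckResult(f"{source}:{name}", healthy))
    return results


def cluster_is_healthy(results: List[CheckResult]) -> bool:
    """A cluster is considered healthy only if every check passed."""
    return all(r.healthy for r in results)
```

For example, a cluster whose built-in checks pass but whose custom check fails would be reported as unhealthy:

```python
results = run_checks(
    {"node_reachability": lambda: True, "https_certificate": lambda: True},
    {"api_status": lambda: False},
)
print(cluster_is_healthy(results))
```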
Single Region High-Availability
If you only need a single region, we recommend deploying at least 2 nodes in that region to ensure single-region high availability.
This way, if one of the nodes fails, we can instantly re-route user traffic to the second node.
Otherwise, the system will need some time (2-4 minutes, depending on size) to create a new node and migrate your application over.
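The difference between the two cases can be sketched as a simple decision. This is an illustrative model under stated assumptions, not the platform's real logic: `Node` and `plan_failover` are hypothetical names, and the returned strings only describe the action that would be taken.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    healthy: bool


def plan_failover(nodes: List[Node], failed: str) -> str:
    """Describe what happens when the node named `failed` goes down."""
    # Any other node in the region that is still healthy can take the traffic.
    survivors = [n for n in nodes if n.name != failed and n.healthy]
    if survivors:
        return f"re-route traffic to {survivors[0].name}"
    # No spare node: a replacement must be created first (2-4 minutes per the text above).
    return "create a new node and migrate the application"
```

With 2 nodes, `plan_failover([Node("node-1", False), Node("node-2", True)], "node-1")` returns an immediate re-route to `node-2`. With a single node, the only option is to create a replacement.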
To ensure the best possible uptime, we recommend using a minimum of 2 regions with 2 nodes in each region.
This way, in case of a complete datacenter outage in one region, we will be able to re-route all traffic to the remaining healthy regions.
All failover and mitigation systems are completely automated and don't require user intervention!
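The same idea at region level can be sketched as follows. This is a simplified model, assuming a cluster is described as a mapping from region name to the health of each node in it. The function names and the region names in the usage note are hypothetical.

```python
from typing import Dict, List


def healthy_regions(cluster: Dict[str, List[bool]]) -> List[str]:
    """Return the regions that still have at least one healthy node."""
    return [region for region, nodes in cluster.items() if any(nodes)]


def routing_targets(cluster: Dict[str, List[bool]]) -> List[str]:
    """Pick the regions that should receive traffic."""
    targets = healthy_regions(cluster)
    if not targets:
        # With a single region and a full outage there is nowhere left to route.
        raise RuntimeError("no healthy region available")
    return targets
```

For a cluster such as `{"region-a": [False, False], "region-b": [True, True]}`, `routing_targets` returns only `region-b`. A cluster with a single region that is fully down raises an error instead, which mirrors the limitation described earlier for single-region setups.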