📢 Optimization of Website Access Instability Issues

ety001 (73) in #steemit • 3 months ago (edited)

Recently, the website has been experiencing unstable access.

In the background, we can see that the AWS AutoScaling Group has been constantly starting new machines and shutting down old ones.

This churn appears to be the direct cause of the unstable user access.

A user's request might be served by one machine, but a second later that machine could be shut down by the AWS AutoScaling Group. That causes the unstable access.

I spent a long time troubleshooting and finally found that the likely reason is incomplete exception handling in callBridge().

As a result, the exception propagates uncaught to the top level, and the request ends in a 500 error.
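
To illustrate the kind of handling that was missing, here is a minimal sketch, not the actual fix. It assumes callBridge() is an async function that rejects on upstream failure; the names `BridgeCall`, `BridgeResult`, and `safeCallBridge` are hypothetical and do not come from the site's code. The idea is to turn a failure into a value the caller can handle, rather than letting it escape to the top level.

```typescript
// Hypothetical stand-in type for callBridge(); the real signature may differ.
type BridgeCall<T> = (method: string, params: unknown) => Promise<T>;

// Result wrapper: a failure is reported as data instead of being thrown.
interface BridgeResult<T> {
  ok: boolean;
  data: T | null;
  error: string | null;
}

// Wrap a bridge call so a rejected promise is caught here and
// does not bubble up to the top-level request handler.
async function safeCallBridge<T>(
  call: BridgeCall<T>,
  method: string,
  params: unknown,
): Promise<BridgeResult<T>> {
  try {
    const data = await call(method, params);
    return { ok: true, data, error: null };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, data: null, error: message };
  }
}
```

With a wrapper like this, the caller can check `ok` and render a fallback or an error message for the affected part of the page, so one failed bridge call does not fail the whole request.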

The ELB health check wrongly takes this 500 response as a signal that the node is failing, marks the node in the AutoScaling Group as unhealthy, and the group then shuts down the machine and starts a new one.
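
The mechanism can be modeled with a small sketch. It assumes the health check treats only a 200 response as healthy and marks a node unhealthy after a configurable number of consecutive failures; the function name `firstUnhealthyCheck` and the threshold value used below are illustrative, not taken from the site's actual configuration.

```typescript
// Simplified model of a load balancer health check (assumption: only
// HTTP 200 counts as healthy, and a run of consecutive failures of
// length `unhealthyThreshold` marks the node unhealthy).
// Returns the index of the check that trips the threshold, or -1 if
// the node stays healthy for the whole sequence.
function firstUnhealthyCheck(
  statusCodes: number[],
  unhealthyThreshold: number,
): number {
  let consecutiveFailures = 0;
  for (let i = 0; i < statusCodes.length; i++) {
    const healthy = statusCodes[i] === 200;
    consecutiveFailures = healthy ? 0 : consecutiveFailures + 1;
    if (consecutiveFailures >= unhealthyThreshold) {
      return i;
    }
  }
  return -1;
}

// Two 500s in a row trip a threshold of 2, even though the machine
// itself is fine and only one upstream call was failing.
const trippedAt = firstUnhealthyCheck([200, 500, 500, 200], 2);
```

In this model, a node that returns intermittent 500s from an unhandled application exception is indistinguishable from a node that is genuinely down, which is why the group keeps replacing machines.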

We can also see this situation in the charts below.