
Let me briefly review the AWS incident and the actions I took as the SRE of an AIGC startup, in the hope that it helps others.
Because I discovered right at the start of my employment that our main cluster is in USE1, I began making some preparations early on.
The main things I did are as follows:
1. I set up multi-region backups for our core databases, with copies in USE1, Tokyo, and SG. In an extreme case we might lose some data, but we can keep the service running (the snapshot-copy sketch after this list shows the idea).
2. I rebuilt our SG testing cluster from a simple K3s setup on EC2 into a standard AWS EKS cluster. This lets us quickly warm up a cluster during a disaster while reusing our existing AWS components and minimizing the cost of manifest changes.
3. I drafted a simple SOP covering user announcements, DNS switching, version freezes, and other matters (see the DNS-switching sketch after this list).
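For item 1, the cross-region copies can be driven by a small script along these lines. This is a minimal sketch with boto3; the snapshot ARN, regions, and names are placeholders rather than our actual setup:

```python
import boto3

# Placeholder identifiers for illustration only.
SOURCE_REGION = "us-east-1"
TARGET_REGIONS = ["ap-northeast-1", "ap-southeast-1"]  # Tokyo, Singapore
SNAPSHOT_ARN = "arn:aws:rds:us-east-1:123456789012:snapshot:core-db-nightly"

def copy_snapshot_to(region: str) -> str:
    """Copy the latest core-DB snapshot into another region."""
    rds = boto3.client("rds", region_name=region)
    resp = rds.copy_db_snapshot(
        SourceDBSnapshotIdentifier=SNAPSHOT_ARN,  # must be the full ARN for cross-region copies
        TargetDBSnapshotIdentifier="core-db-nightly-copy",
        SourceRegion=SOURCE_REGION,  # lets boto3 presign the cross-region request
        CopyTags=True,
        # For encrypted snapshots, also pass a KmsKeyId that is valid in the target region.
    )
    return resp["DBSnapshot"]["DBSnapshotIdentifier"]

for region in TARGET_REGIONS:
    print(region, copy_snapshot_to(region))
```

Run on a schedule, this keeps a recent restorable copy sitting in each standby region, which is what makes the restore step later on fast.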
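The DNS-switching step in item 3 is, in essence, repointing the public record at the standby region's endpoint. A minimal sketch assuming the zone lives in Route 53; the hosted-zone ID, record name, and endpoint are hypothetical:

```python
import boto3

route53 = boto3.client("route53")

# Hypothetical values for illustration.
HOSTED_ZONE_ID = "Z0123456789ABCDEFG"
RECORD_NAME = "api.example.com."
SG_ENDPOINT = "api-sg.example.com."  # load balancer in the standby region

def point_api_at_standby() -> None:
    """UPSERT the public API record so traffic flows to the SG cluster."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "DR failover: USE1 -> SG",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "TTL": 60,  # keep the TTL low so the switch propagates quickly
                    "ResourceRecords": [{"Value": SG_ENDPOINT}],
                },
            }],
        },
    )
```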
Back to today: about 10 minutes after the AWS incident started, I noticed that newly created Pods in our cluster could not start.
After confirming with AWS support that it was a USE1 issue, I realized the ECR failure had to be tied to a broader set of problems, so I decisively started handling it as a Tier 1 incident according to my plan (for SREs, it is better to be wrong than to miss something).
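For context, an ECR outage usually shows up in the cluster as Pods stuck waiting on image pulls. A rough sketch of the kind of check that surfaces it, using the Kubernetes Python client and assuming the failures appear as ImagePullBackOff/ErrImagePull:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Flag any container stuck waiting on an image pull.
for pod in v1.list_pod_for_all_namespaces().items:
    for cs in pod.status.container_statuses or []:
        waiting = cs.state.waiting
        if waiting and waiting.reason in ("ImagePullBackOff", "ErrImagePull"):
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: {waiting.reason}")
```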
T+0 min, I issued a company-wide announcement and entered emergency mode. I set up an open meeting that anyone could join at any time.
T+2 min, I confirmed that the incident was expanding as I expected and issued two directives: 1. freeze all code merges/commits across the board (mainly to avoid new resources triggering Pod rotation and affecting traffic); 2. have the operations team prepare an announcement.
T+3 min, I began restoring the database in the SG region according to the SOP and kicked off creation of the dependent services such as OpenSearch/Redis.
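The restore itself is essentially "create a new instance from the most recent snapshot copied into SG," with the dependent services created alongside it. A minimal sketch with hypothetical names and instance classes; the real parameters depend on the production setup:

```python
import boto3

SG = "ap-southeast-1"

# Restore the core database from the snapshot copied into SG in advance.
rds = boto3.client("rds", region_name=SG)
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="core-db-sg",            # hypothetical name
    DBSnapshotIdentifier="core-db-nightly-copy",  # pre-copied snapshot
    DBInstanceClass="db.r6g.xlarge",
    MultiAZ=True,
    PubliclyAccessible=False,
)

# Kick off the dependent services while the restore runs.
boto3.client("opensearch", region_name=SG).create_domain(DomainName="search-sg")
boto3.client("elasticache", region_name=SG).create_replication_group(
    ReplicationGroupId="cache-sg",
    ReplicationGroupDescription="Standby cache for SG failover",
    Engine="redis",
    CacheNodeType="cache.r6g.large",
    NumNodeGroups=1,
    ReplicasPerNodeGroup=1,
)
```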
T+5 min, we started formally checking the specific impact on upstream and downstream dependencies and confirmed that a newly launched core service was affected.
T+10 min, we issued a service suspension announcement and an announcement regarding the impact on other services.
T+10 min, I asked two colleagues to help set up a new ECR and clean up existing resources in the testing environment, and I updated the CTO. In the worst case we might have had to choose to preserve the user experience at the cost of losing data.
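Standing up a replacement registry is mostly a matter of creating the repositories in an unaffected region and re-pushing images from local caches or CI runners. A minimal sketch; the repository list is hypothetical:

```python
import boto3

ecr = boto3.client("ecr", region_name="ap-southeast-1")

# Hypothetical list of repositories our deployments pull from.
REPOSITORIES = ["api-server", "worker", "gateway"]

for name in REPOSITORIES:
    try:
        ecr.create_repository(
            repositoryName=name,
            imageScanningConfiguration={"scanOnPush": True},
        )
    except ecr.exceptions.RepositoryAlreadyExistsException:
        pass  # idempotent, so the script is safe to re-run mid-incident

# Auth token for docker login when re-pushing images.
auth = ecr.get_authorization_token()["authorizationData"][0]
print("registry endpoint:", auth["proxyEndpoint"])
```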
T+15 min, we finally confirmed that the resources already created and the current traffic routing would not be significantly affected. The switchover plan was put on hold, but we continued preparing the relevant resources.
T+30 min, our first database was restored.
T+40 min, our second database was restored.
T+1h, all the associated core infra (RDS/ES/Redis) was on standby, with primary/replica replication and the other optimization options configured to match the production architecture. At the same time, we also started the new services in the new cluster.
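On the RDS side, the primary/replica arrangement can be attached once the restored instance is available; a sketch with the same hypothetical names as above:

```python
import boto3

rds = boto3.client("rds", region_name="ap-southeast-1")

# Wait for the restored primary, then attach a read replica to mirror production.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="core-db-sg")
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="core-db-sg-replica",
    SourceDBInstanceIdentifier="core-db-sg",
    DBInstanceClass="db.r6g.xlarge",
)
```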
Fortunately, the AWS outage did not affect all of our services, so we did not have to face the complex data-recovery work that would have followed a traffic switch.
Around T+2h to T+3h, I officially announced that the emergency was over. To be safe, we still kept a feature freeze in place for the night.
Looking back at the entire incident, I could have done more:
1. Make the worst-case SOP I had prepared for myself available to everyone, so that even if I am not online, someone can take over for me.
2. We could conduct some preemptive drills.
3. The directives could be issued more decisively.
That's about it. Just a small share that I hope helps someone.
Let’s talk bluntly.
With the arrival of the AI era, codebases and architectures will deteriorate at an unprecedented speed.
This means stability will become increasingly difficult to achieve. Many stability details and best practices that used to be overlooked will be magnified in the AI era, and more and more startups will hit their architectural bottlenecks, or the moment of repaying technical debt, earlier than expected.
The other sense in which stability is getting harder is that fewer and fewer people can focus on it. With vibe coding becoming prevalent, fewer people are willing to settle down, work on stability, and actually hit the metrics.