📢 Incident Report: Hivemind Service Disruption - January 19
At 02:41 AM Beijing Time on January 19, we began receiving alert emails from the hivemind service.
Investigation commenced around 03:00 AM.
Incident Overview
During the incident, several PostgreSQL database metrics were abnormal:
- Database connection sessions exceeded 30
- Large number of slow SQL queries, with the IO:DataFileRead wait event nearly saturating database load (a query sketch for spotting these follows this list)
- Peak load: 39.45 AAS (average active sessions)
- AWS Elastic Beanstalk instances were frequently restarting (two EC2 instances were continuously being removed and replaced by EB health checks)
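For reference, the sketch below shows the kind of check that can be run against pg_stat_activity to see the session count and any long-running queries. The DSN and the 30-second threshold are placeholders for illustration, not our production configuration.

```python
# Sketch: inspect PostgreSQL session count and long-running queries.
# The DSN and thresholds below are placeholders, not production values.
import psycopg2

DSN = "dbname=hive user=hive host=127.0.0.1"  # placeholder DSN


def check_db_pressure(conn):
    with conn.cursor() as cur:
        # Total sessions currently open against the database
        cur.execute("SELECT count(*) FROM pg_stat_activity;")
        sessions = cur.fetchone()[0]

        # Active queries that have been running for more than 30 seconds
        cur.execute("""
            SELECT pid, now() - query_start AS runtime, left(query, 80)
            FROM pg_stat_activity
            WHERE state = 'active'
              AND now() - query_start > interval '30 seconds'
            ORDER BY runtime DESC;
        """)
        slow = cur.fetchall()
    return sessions, slow


if __name__ == "__main__":
    conn = psycopg2.connect(DSN)
    try:
        sessions, slow = check_db_pressure(conn)
        print(f"open sessions: {sessions}")
        for pid, runtime, query in slow:
            print(f"pid={pid} runtime={runtime} query={query!r}")
    finally:
        conn.close()
```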
After three hours of investigation and re-verification of recently added indexes, we were unable to identify a clear root cause.
Resolution
We rebuilt the hivemind environment from scratch and redirected traffic from the affected environment to the new one, which resolved the incident.
Post-Incident Analysis
We did not remove the affected environment immediately.
Approximately 40 minutes later, we found that it had not recovered on its own (under normal circumstances, once excess traffic is removed from an overwhelmed service, AWS EB health checks should gradually restore the failed EC2 instances).
Instead, we observed a warning that one EC2 instance had reached 91% memory usage.
This observation strongly suggests that hivemind may have a memory leak, and that the incident was likely a service avalanche (cascading failure) triggered by an as-yet-unknown cause.
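To help confirm or rule out a leak, process memory can be sampled periodically and compared over time. The sketch below is illustrative only; the process-name match and the 90% alert threshold are assumptions, not monitoring we actually have in place.

```python
# Sketch: periodically log system and hivemind process memory to spot steady growth.
# The process-name match and the 90% alert threshold are assumptions.
import time

import psutil

ALERT_THRESHOLD = 90.0  # percent of system memory (assumed threshold)


def sample_memory(interval_sec=60):
    while True:
        mem = psutil.virtual_memory()
        print(f"system memory used: {mem.percent:.1f}%")
        for proc in psutil.process_iter(["pid", "name", "memory_info"]):
            # Match any process whose name contains "hive" (assumption)
            if "hive" in (proc.info["name"] or "").lower():
                rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
                print(f"  pid={proc.info['pid']} name={proc.info['name']} rss={rss_mb:.0f} MB")
        if mem.percent >= ALERT_THRESHOLD:
            print("WARNING: memory above threshold, possible leak or request backlog")
        time.sleep(interval_sec)


if __name__ == "__main__":
    sample_memory()
```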
Hypotheses
We have two hypotheses regarding the root cause:
Hypothesis 1: Application Layer Memory Leak Cascade
Application layer (hivemind) memory leak/high usage (91%)
↓
Application response slows (GC pressure, memory swapping)
↓
Health checks fail
↓
EB removes instances, starts new instances
↓
New instances start, connect to database
↓
Old instance connections not properly closed
↓
Database connection count accumulates
↓
Database load increases
↓
Slow query latency further increases
↓
Application layer waits longer for database responses
↓
Memory usage further increases (request backlog)
↓
More instances removed
↓
Avalanche intensifies
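The "connections not properly closed" step is the one we can harden most directly in code. As a minimal sketch, assuming a SQLAlchemy engine (the DSN and pool settings are placeholders, not hivemind's actual configuration), the connection pool can be disposed of when EB terminates an instance so replaced instances do not leave sessions open on PostgreSQL:

```python
# Sketch: close database connections cleanly when EB terminates an instance.
# Assumes a SQLAlchemy engine; the DSN and pool settings are placeholders.
import signal
import sys

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://hive:hive@127.0.0.1/hive",  # placeholder DSN
    pool_size=25,
    max_overflow=0,
)


def shutdown(signum, frame):
    # Dispose of the pool so all checked-in connections are closed
    # before the process exits (EB sends SIGTERM when replacing an instance).
    engine.dispose()
    sys.exit(0)


signal.signal(signal.SIGTERM, shutdown)
signal.signal(signal.SIGINT, shutdown)
```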
Hypothesis 2: EB Environment Abnormal State Cascade
EB environment enters abnormal state (possibly triggered by an event)
↓
Health check logic becomes abnormal
↓
Instances incorrectly marked as unhealthy
↓
Instances removed, new instances started
↓
New instances inherit abnormal state
↓
Cycle repeats
↓
Database connection count accumulates
↓
Database load increases
↓
Application layer resource contention intensifies
↓
Memory usage increases
↓
Avalanche intensifies
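If this hypothesis holds, making the health check itself more diagnosable would help distinguish "application unhealthy" from "environment confused". The sketch below is only an illustration of that idea; the port, path, and DSN are assumptions and this is not hivemind's actual health endpoint:

```python
# Sketch: a health endpoint that separates "app up" from "database reachable",
# so an unhealthy verdict from EB is easier to attribute.
# The port, path, and DSN are assumptions, not hivemind's actual configuration.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import psycopg2

DSN = "dbname=hive user=hive host=127.0.0.1"  # placeholder DSN


def db_reachable(timeout_sec=2):
    try:
        conn = psycopg2.connect(DSN, connect_timeout=timeout_sec)
        with conn.cursor() as cur:
            cur.execute("SELECT 1;")
        conn.close()
        return True
    except Exception:
        return False


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_response(404)
            self.end_headers()
            return
        db_ok = db_reachable()
        status = 200 if db_ok else 503
        body = json.dumps({"app": "ok", "database": "ok" if db_ok else "unreachable"})
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode())


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```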
Remediation Measures
We have implemented the following measures:
- In PR-362:
  - Made the database connection pool size configurable and increased it from 20 to 25 connections (see the sketch after this list)
  - Added a new index optimization
- Configured 4GB of SWAP on the EC2 instances (previously the instances had 8GB of memory with no SWAP configured)
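For illustration, a minimal sketch of a pool size driven by an environment variable is shown below. The variable name, default, and pool options are assumptions for this example and not necessarily what PR-362 implements.

```python
# Sketch: read the database pool size from an environment variable instead of
# hard-coding it. The variable name and defaults are illustrative only; they
# are not necessarily what PR-362 uses.
import os

from sqlalchemy import create_engine

POOL_SIZE = int(os.environ.get("HIVEMIND_DB_POOL_SIZE", "25"))  # previously hard-coded at 20

engine = create_engine(
    "postgresql://hive:hive@127.0.0.1/hive",  # placeholder DSN
    pool_size=POOL_SIZE,
    max_overflow=0,
    pool_pre_ping=True,  # drop dead connections before handing them out
)
```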
Next Steps
We will continue to monitor the service closely over the coming period.
If you experience any issues or have questions, please leave a message here.
Thank you for your patience and understanding.