📢 Incident Report: Hivemind Service Disruption - January 19
At 02:41 AM Beijing Time on January 19, we began receiving alert emails from the hivemind service.
Investigation commenced around 03:00 AM.
Incident Overview
During the incident, several PostgreSQL database metrics were abnormal:
- Database connection sessions exceeded 30
- Large number of slow SQL queries, with the IO:DataFileRead wait event nearly saturating database load (a query sketch for spotting these follows this list)
- Peak load: 39.45 AAS (average active sessions)
- AWS Elastic Beanstalk instances were frequently restarting (two EC2 instances were continuously being removed and replaced by EB health checks)
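For reference, the sketch below shows the kind of check that can be run against pg_stat_activity to see the session count and any long-running queries. The DSN and the 30-second threshold are placeholders for illustration, not our production configuration.

```python
# Sketch: inspect PostgreSQL session count and long-running queries.
# The DSN and thresholds below are placeholders, not production values.
import psycopg2

DSN = "dbname=hive user=hive host=127.0.0.1"  # placeholder DSN


def check_db_pressure(conn):
    with conn.cursor() as cur:
        # Total sessions currently open against the database
        cur.execute("SELECT count(*) FROM pg_stat_activity;")
        sessions = cur.fetchone()[0]

        # Active queries that have been running for more than 30 seconds
        cur.execute("""
            SELECT pid, now() - query_start AS runtime, left(query, 80)
            FROM pg_stat_activity
            WHERE state = 'active'
              AND now() - query_start > interval '30 seconds'
            ORDER BY runtime DESC;
        """)
        slow = cur.fetchall()
    return sessions, slow


if __name__ == "__main__":
    conn = psycopg2.connect(DSN)
    try:
        sessions, slow = check_db_pressure(conn)
        print(f"open sessions: {sessions}")
        for pid, runtime, query in slow:
            print(f"pid={pid} runtime={runtime} query={query!r}")
    finally:
        conn.close()
```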
After three hours of investigation and re-verification of recently added indexes, we were unable to identify a clear root cause.
Resolution
We rebuilt the hivemind environment from scratch and redirected traffic from the affected environment to the new one, which resolved the incident.
Post-Incident Analysis
We did not remove the affected environment immediately.
Approximately 40 minutes later, we found that it had not recovered on its own (under normal circumstances, once excess traffic is removed from an overwhelmed service, AWS EB health checks should gradually restore the failed EC2 instances).
Instead, we observed a warning that one EC2 instance had reached 91% memory usage.
This observation strongly suggests that hivemind may have a memory leak, and that the incident was likely a service avalanche (cascading failure) triggered by an as-yet-unknown cause.
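To help confirm or rule out a leak, process memory can be sampled periodically and compared over time. The sketch below is illustrative only; the process-name match and the 90% alert threshold are assumptions, not monitoring we actually have in place.

```python
# Sketch: periodically log system and hivemind process memory to spot steady growth.
# The process-name match and the 90% alert threshold are assumptions.
import time

import psutil

ALERT_THRESHOLD = 90.0  # percent of system memory (assumed threshold)


def sample_memory(interval_sec=60):
    while True:
        mem = psutil.virtual_memory()
        print(f"system memory used: {mem.percent:.1f}%")
        for proc in psutil.process_iter(["pid", "name", "memory_info"]):
            # Match any process whose name contains "hive" (assumption)
            if "hive" in (proc.info["name"] or "").lower():
                rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
                print(f"  pid={proc.info['pid']} name={proc.info['name']} rss={rss_mb:.0f} MB")
        if mem.percent >= ALERT_THRESHOLD:
            print("WARNING: memory above threshold, possible leak or request backlog")
        time.sleep(interval_sec)


if __name__ == "__main__":
    sample_memory()
```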
Hypotheses
We have two hypotheses regarding the root cause:
Hypothesis 1: Application Layer Memory Leak Cascade
Application layer (hivemind) memory leak/high usage (91%)
↓
Application response slows (GC pressure, memory swapping)
↓
Health checks fail
↓
EB removes instances, starts new instances
↓
New instances start, connect to database
↓
Old instance connections not properly closed
↓
Database connection count accumulates
↓
Database load increases
↓
Slow query latency further increases
↓
Application layer waits longer for database responses
↓
Memory usage further increases (request backlog)
↓
More instances removed
↓
Avalanche intensifies
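The "connections not properly closed" step is the one we can harden most directly in code. As a minimal sketch, assuming a SQLAlchemy engine (the DSN and pool settings are placeholders, not hivemind's actual configuration), the connection pool can be disposed of when EB terminates an instance so replaced instances do not leave sessions open on PostgreSQL:

```python
# Sketch: close database connections cleanly when EB terminates an instance.
# Assumes a SQLAlchemy engine; the DSN and pool settings are placeholders.
import signal
import sys

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://hive:hive@127.0.0.1/hive",  # placeholder DSN
    pool_size=25,
    max_overflow=0,
)


def shutdown(signum, frame):
    # Dispose of the pool so all checked-in connections are closed
    # before the process exits (EB sends SIGTERM when replacing an instance).
    engine.dispose()
    sys.exit(0)


signal.signal(signal.SIGTERM, shutdown)
signal.signal(signal.SIGINT, shutdown)
```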
Hypothesis 2: EB Environment Abnormal State Cascade
EB environment enters abnormal state (possibly triggered by an event)
↓
Health check logic becomes abnormal
↓
Instances incorrectly marked as unhealthy
↓
Instances removed, new instances started
↓
New instances inherit abnormal state
↓
Cycle repeats
↓
Database connection count accumulates
↓
Database load increases
↓
Application layer resource contention intensifies
↓
Memory usage increases
↓
Avalanche intensifies
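If this hypothesis holds, making the health check itself more diagnosable would help distinguish "application unhealthy" from "environment confused". The sketch below is only an illustration of that idea; the port, path, and DSN are assumptions and this is not hivemind's actual health endpoint:

```python
# Sketch: a health endpoint that separates "app up" from "database reachable",
# so an unhealthy verdict from EB is easier to attribute.
# The port, path, and DSN are assumptions, not hivemind's actual configuration.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import psycopg2

DSN = "dbname=hive user=hive host=127.0.0.1"  # placeholder DSN


def db_reachable(timeout_sec=2):
    try:
        conn = psycopg2.connect(DSN, connect_timeout=timeout_sec)
        with conn.cursor() as cur:
            cur.execute("SELECT 1;")
        conn.close()
        return True
    except Exception:
        return False


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_response(404)
            self.end_headers()
            return
        db_ok = db_reachable()
        status = 200 if db_ok else 503
        body = json.dumps({"app": "ok", "database": "ok" if db_ok else "unreachable"})
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode())


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```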
Remediation Measures
We have implemented the following measures:
- In PR-362:
  - Made the database connection pool size configurable and increased it from 20 to 25 connections (see the sketch after this list)
  - Added a new index optimization
- Configured 4GB of SWAP on the EC2 instances (previously the instances had 8GB of memory with no SWAP configured)
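For illustration, a minimal sketch of a pool size driven by an environment variable is shown below. The variable name, default, and pool options are assumptions for this example and not necessarily what PR-362 implements.

```python
# Sketch: read the database pool size from an environment variable instead of
# hard-coding it. The variable name and defaults are illustrative only; they
# are not necessarily what PR-362 uses.
import os

from sqlalchemy import create_engine

POOL_SIZE = int(os.environ.get("HIVEMIND_DB_POOL_SIZE", "25"))  # previously hard-coded at 20

engine = create_engine(
    "postgresql://hive:hive@127.0.0.1/hive",  # placeholder DSN
    pool_size=POOL_SIZE,
    max_overflow=0,
    pool_pre_ping=True,  # drop dead connections before handing them out
)
```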
Next Steps
We will continue to monitor the service closely over the coming period.
If you experience any issues or have questions, please leave a message here.
Thank you for your patience and understanding.