📢 Hivemind Major Update: Using Temporary Table to Improve Hot Data Query Efficiency
Background
The biggest bottleneck for Hivemind query performance today is the hive_posts_cache table. It holds hundreds of millions of rows and serves queries with complex conditions and joins. As a result, read performance for hot data (trending, hot, created, promoted, payout lists, etc.) has been poor for frontend websites.
A common approach to this kind of problem is to split the hive_posts_cache table (e.g., by time range or sharding). Given the current codebase and operational constraints, however, the engineering effort and risk of a full table split are high. We have therefore adopted a compromise: a dedicated temporary table for hot queries.
What We Did
We introduced a second cache table, hive_posts_cache_temp, which holds only the last ~90 days of post cache data.
The implementation is in PR 364.
- Dual-write: When syncing data from the chain, every insert or update to hive_posts_cache is applied to both the main table and hive_posts_cache_temp in the same transaction. This keeps the two tables in sync without a separate sync job.
- 90-day window: The temp table keeps only data from the last 90 days. Rows older than 90 days are pruned periodically (e.g., every 60 seconds). Post deletes and fork rollbacks also remove rows from both tables as needed.
- Query routing: Hot list APIs (trending, hot, created, promoted, payout, payout_comments) are routed to hive_posts_cache_temp. Other sorts continue to use hive_posts_cache. This reduces load on the main table for the most frequent, performance-sensitive queries.
Hot lists are intentionally limited to the last 90 days in the temp table; “load more” does not automatically switch to the main table beyond that window. Comment-by-ID APIs query the temp table first and fall back to the main table for missing IDs, so older comments can still be resolved.
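The temp-first lookup with main-table fallback described above can be sketched as follows. This is a minimal illustration, not Hivemind's actual code: the `db.query_all` helper and the `post_id` column name are assumptions.

```python
from typing import Dict, List

def load_posts_by_id(db, post_ids: List[int]) -> Dict[int, dict]:
    # Query the 90-day temp table first.
    sql_temp = "SELECT * FROM hive_posts_cache_temp WHERE post_id IN :ids"
    found = {r['post_id']: r for r in db.query_all(sql_temp, ids=tuple(post_ids))}

    # Fall back to the main table only for IDs the temp table did not have
    # (e.g., comments older than 90 days).
    missing = [pid for pid in post_ids if pid not in found]
    if missing:
        sql_main = "SELECT * FROM hive_posts_cache WHERE post_id IN :ids"
        for r in db.query_all(sql_main, ids=tuple(missing)):
            found[r['post_id']] = r
    return found
```

Because the fallback query runs only for the missing IDs, recent comments never touch the main table at all.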
Why This Helps
- Lowest coupling with the existing system: This approach requires no schema change to the main hive_posts_cache table, no migration of existing data, and no change to how the main table is written or read by other code. The temp table is an additive layer; hot APIs are routed to it via a small routing layer, and the rest of the system stays unchanged. That keeps rollout and rollback simple and safe.
- Lower load on the main table: Hot, high-traffic list APIs read from a much smaller table (~90 days of data) with indexes tuned for those sorts.
- Same transaction for writes: Dual-write means no eventual consistency window between main and temp; both are updated together.
- Lower risk than a full split: No schema change to the main table, no data migration of existing rows, and a clear rollback path (route back to main only) if needed.
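The "same transaction" guarantee can be illustrated with a small sketch. The `db.transaction()` context manager and the SQL column names here are assumptions for illustration, not the project's actual helpers; the point is that both upserts commit or roll back together.

```python
# Upsert SQL applied to both tables; {table} is filled in per target.
UPSERT_SQL = """
    INSERT INTO {table} (post_id, payload)
    VALUES (:post_id, :payload)
    ON CONFLICT (post_id) DO UPDATE SET payload = EXCLUDED.payload
"""

def upsert_post_cache(db, post_id: int, payload: str):
    # db.transaction() is assumed to yield an object whose queries
    # commit together or roll back together.
    with db.transaction() as tx:
        tx.query(UPSERT_SQL.format(table='hive_posts_cache'),
                 post_id=post_id, payload=payload)
        tx.query(UPSERT_SQL.format(table='hive_posts_cache_temp'),
                 post_id=post_id, payload=payload)
```

If the second write fails, the transaction aborts and the first write is rolled back, so the tables cannot diverge.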
We hope this update improves hot data read performance for Hive-based frontends and APIs. For technical details (schema, routing, dual-write, prune and delete behavior, and 90-day boundary semantics), see the documentation below.
How to Update
- Pull the new Docker image:
docker pull steemit/hivemind:latest-314979e
or
docker pull steemit/hivemind:latest
- Restart your service using the new image.
When Hivemind runs in sync mode, migration v24 will run automatically: it creates the hive_posts_cache_temp table (if it does not exist) and performs a one-time cold-start backfill by copying the last ~90 days of data from hive_posts_cache into hive_posts_cache_temp. After the migration and backfill complete, the instance is ready for normal use and hot list APIs will use the temp table.
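Conceptually, the cold-start backfill and the periodic prune are two sides of the same 90-day window. The SQL below is a sketch under assumed column names (e.g., `created_at`); the shipped migration v24 may differ.

```python
import time

# One-time cold-start copy of the last ~90 days into the temp table.
BACKFILL_SQL = """
    INSERT INTO hive_posts_cache_temp
    SELECT * FROM hive_posts_cache
     WHERE created_at >= NOW() - INTERVAL '90 days'
"""

# Periodic trim of rows that age out of the window.
PRUNE_SQL = """
    DELETE FROM hive_posts_cache_temp
     WHERE created_at < NOW() - INTERVAL '90 days'
"""

def backfill_temp_table(db):
    """Run once after the temp table is created."""
    db.query(BACKFILL_SQL)

def prune_forever(db, interval_seconds: int = 60):
    """Run in the sync loop; keeps the temp table within 90 days."""
    while True:
        db.query(PRUNE_SQL)
        time.sleep(interval_seconds)
```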
Further Reading
If you are interested in implementation and design details, please refer to: