Test Article

Last Updated: February 26, 2025, 1:55 pm UTC

Record-Book Incremental Stats Engine

Subject: Transitioning to Event-Driven Architecture with RabbitMQ & Bulk Upserts

Scope: 30 Tenants | 5 "Giant" Tenants (30M+ Rows) | Standard Tenants (~1M Rows)

Infrastructure Target: AWS RDS db.t3.large (8GB RAM) | RabbitMQ Cluster


1. Executive Summary

The current Batch Rebuild architecture operates at O(N) complexity, where system load increases linearly with data growth. For our "Giant" tenants (30M+ records), this model causes severe CPU starvation, table locking, and IOPS exhaustion during rebuild cycles.

The proposed solution transitions the system to an Event-Driven Delta model. By utilizing RabbitMQ as a high-throughput buffer and implementing Micro-Batching Consumers, we shift the computational cost from O(N) per rebuild cycle to O(1) per event. This ensures that a 30M-row tenant updates with the same sub-second latency as a 500k-row tenant, while reducing database I/O by approximately 90%.


2. Problem Statement

The current system relies on periodic aggregation queries (SUM, COUNT) against the match_stat_player table. This approach is no longer viable due to:

  1. CPU Spikes: Aggregating 40M rows drives sustained 100% CPU utilization, draining the t3 instance's CPU credit balance.
  2. Database Locking: Long-running writes block read operations, degrading the API experience.
  3. IOPS Saturation: Frequent large-scale reads/writes deplete the GP3 storage burst balance.

3. Infrastructure Layer: Reliability & Scaling

3.1 Message Broker: RabbitMQ (The Buffer)

To decouple high-velocity stat updates from the database layer, we will utilize RabbitMQ.

  • Role: Acts as a durable buffer that absorbs traffic spikes (e.g., simultaneous matches).
  • Durability: Messages are persisted to disk to survive service restarts.
  • Flow Control: We utilize prefetch_count (e.g., 500) to ensure workers process messages at a sustainable rate without exhausting RAM.
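The effect of prefetch_count can be sketched with a bounded in-memory queue. This is a rough analogy for the broker's behavior, not the RabbitMQ client API: once the limit of unacknowledged messages is outstanding, delivery pauses, so a slow worker never buffers unbounded data.

```python
import queue

# A rough analogy for prefetch_count (NOT the real client API): a bounded
# buffer caps how many unacknowledged messages a worker holds at once,
# so a slow consumer applies backpressure instead of exhausting RAM.
PREFETCH = 500
inflight = queue.Queue(maxsize=PREFETCH)

def deliver(msg):
    inflight.put(msg, block=False)  # raises queue.Full past the limit

for _ in range(PREFETCH):
    deliver({"delta": 1})

try:
    deliver({"delta": 1})
    paused = False
except queue.Full:
    paused = True  # the broker would pause delivery until an ack frees a slot

print(paused)  # True
```

In the real broker, acking a message frees a slot and delivery resumes automatically; the worker never has to manage this limit itself.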

3.2 Hardware Upgrade: db.t3.large (8GB RAM)

Upgrading to 8GB of RAM is a critical requirement for this architecture.

  • InnoDB Buffer Pool: The additional RAM lets the RDS instance keep the "hot" index trees for the 5 Giant Tenants in memory.
  • Why it matters: If the index does not fit in RAM, the database falls back to disk I/O, which is roughly 100x slower. 8GB provides the necessary headroom for our dataset.
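A back-of-envelope check shows why 8GB is the floor. The 40 bytes per index entry is an illustrative assumption (key columns plus primary-key pointer and overhead), not a measurement:

```python
# Back-of-envelope sizing with assumed, not measured, numbers.
BYTES_PER_INDEX_ENTRY = 40       # assumption: key columns + PK pointer + overhead
ROWS_PER_GIANT_TENANT = 30_000_000
GIANT_TENANTS = 5

hot_index_bytes = BYTES_PER_INDEX_ENTRY * ROWS_PER_GIANT_TENANT * GIANT_TENANTS
hot_index_gb = hot_index_bytes / 1024**3
print(round(hot_index_gb, 1))  # 5.6
```

At ~5.6GB of hot index data, the working set fits inside the buffer pool of an 8GB instance but clearly cannot fit on a smaller one.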

4. Core Logic: The Streaming Pipeline

4.1 Phase 1: The Producer (Zero-Latency)

The application intercepts changes in the match_stat_player table. Instead of calculating totals, it calculates the Delta (Difference).

  • Trigger: Event Listener on MatchStat model.
  • Payload: Lightweight JSON:

JSON

{
  "tenant_id": "client_A",
  "league_id": 10,
  "member_id": 505,
  "stat_id": 3,
  "delta": 1  // +1 Goal
}

  • Performance: The Producer finishes execution in milliseconds, as it does not wait for the database write.
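A minimal producer sketch, assuming a Python worker; publish and on_stat_saved are stand-in names (the real system would publish a persistent message via the RabbitMQ client):

```python
import json

outbox = []

def publish(queue_name, body):
    # Stand-in for the real broker call; a production implementation
    # would publish a persistent message to the stats exchange.
    outbox.append((queue_name, body))

def on_stat_saved(tenant_id, league_id, member_id, stat_id, old_value, new_value):
    # Compute the Delta instead of recalculating totals.
    delta = new_value - old_value
    if delta == 0:
        return  # nothing to propagate
    payload = {
        "tenant_id": tenant_id,
        "league_id": league_id,
        "member_id": member_id,
        "stat_id": stat_id,
        "delta": delta,
    }
    publish("stats_updates", json.dumps(payload))

on_stat_saved("client_A", 10, 505, 3, old_value=4, new_value=5)
print(outbox[0][1])
```

Because the handler only computes a difference and enqueues one small message, it returns in milliseconds regardless of table size.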

4.2 Phase 2: The Smart Consumer (Micro-Batching)

Directly writing every message to the DB would kill IOPS. The Worker implements a Buffering Strategy:

  1. Accumulation: The worker pulls messages from RabbitMQ and stores them in an internal memory array (grouped by Tenant ID).
  2. Execution Trigger: The buffer flushes to the database when:
    • Buffer size reaches 1,000 records, OR
    • Time elapsed reaches 60 seconds.
  3. Context Switching: The worker dynamically switches the Database Connection to the specific Tenant ID before executing the write.
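The three steps above can be sketched as follows. MicroBatcher and flush_to_db are illustrative names, not the production classes:

```python
import time
from collections import defaultdict

MAX_BUFFER = 1000       # flush when this many records are pending
MAX_WAIT_SECONDS = 60   # or when this much time has elapsed

class MicroBatcher:
    def __init__(self, flush_to_db):
        self.buffers = defaultdict(list)   # tenant_id -> pending deltas
        self.flush_to_db = flush_to_db
        self.last_flush = time.monotonic()

    def add(self, msg):
        self.buffers[msg["tenant_id"]].append(msg)
        total = sum(len(b) for b in self.buffers.values())
        if total >= MAX_BUFFER or time.monotonic() - self.last_flush >= MAX_WAIT_SECONDS:
            self.flush()

    def flush(self):
        for tenant_id, batch in self.buffers.items():
            # Context switch: select this tenant's DB connection here,
            # then execute the bulk upsert of Phase 3.
            self.flush_to_db(tenant_id, batch)
        self.buffers.clear()
        self.last_flush = time.monotonic()

flushed = []
batcher = MicroBatcher(lambda tenant, batch: flushed.append((tenant, len(batch))))
for i in range(1000):
    batcher.add({"tenant_id": "client_A" if i % 2 else "client_B", "delta": 1})
print(flushed)
```

Grouping by tenant before flushing means each database connection switch covers a whole batch, not a single record.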

4.3 Phase 3: Bulk Atomic Upsert

We execute a single SQL query to update hundreds of rows simultaneously.

SQL Logic:

SQL

INSERT INTO record_book_career (league_id, member_id, stat_id, value)
VALUES
  (1, 101, 3, 1),
  (1, 102, 3, 2),
  ... (up to 1,000 rows)
ON DUPLICATE KEY UPDATE
  value = value + VALUES(value);

  • Efficiency: Collapses up to 1,000 individual writes into a single I/O operation.
  • Atomic Safety: The math happens inside the storage engine, guaranteeing data integrity even during concurrency.
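Assembling that statement from a buffered batch might look like the sketch below, assuming a parameterized MySQL driver; build_bulk_upsert is a hypothetical helper, and the %s placeholders keep values out of the SQL string itself:

```python
def build_bulk_upsert(batch):
    # One row of placeholders per buffered delta.
    placeholders = ", ".join(["(%s, %s, %s, %s)"] * len(batch))
    sql = (
        "INSERT INTO record_book_career (league_id, member_id, stat_id, value) "
        f"VALUES {placeholders} "
        "ON DUPLICATE KEY UPDATE value = value + VALUES(value)"
    )
    # Flatten the batch into the driver's parameter list.
    params = [
        v
        for row in batch
        for v in (row["league_id"], row["member_id"], row["stat_id"], row["delta"])
    ]
    return sql, params

sql, params = build_bulk_upsert([
    {"league_id": 1, "member_id": 101, "stat_id": 3, "delta": 1},
    {"league_id": 1, "member_id": 102, "stat_id": 3, "delta": 2},
])
print(sql)
```

The `value = value + VALUES(value)` clause is what makes the math atomic: the addition happens inside the storage engine against the current row value, not against a value the worker read earlier.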

5. Storage Strategy & Optimization

5.1 Specialized Record Tables

Data is distributed across three optimized tables:

  • record_book_career: Cumulative all-time stats.
  • record_book_season: High-water mark tracking (Updates only if New_Total > Best_Season).
  • record_book_game: Peak performance per match.
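The split between cumulative and high-water-mark updates can be illustrated with plain arithmetic; apply_delta is a hypothetical helper, not production code:

```python
def apply_delta(career_total, season_total, best_season, delta):
    # record_book_career: always accumulates.
    career_total += delta
    # record_book_season: current season total also accumulates...
    season_total += delta
    # ...but the stored best only moves when the new total beats it.
    if season_total > best_season:
        best_season = season_total  # new high-water mark
    return career_total, season_total, best_season

career, season, best = apply_delta(career_total=90, season_total=29, best_season=31, delta=1)
print(career, season, best)  # 91 30 31 (best unchanged: 30 < 31)
career, season, best = apply_delta(career, season, best, delta=2)
print(career, season, best)  # 93 32 32 (new season record)
```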

5.2 Advanced Partitioning & Indexing

For the 5 Giant Tenants, standard indexing is insufficient.

  • Partitioning: Tables are partitioned by league_id. This restricts MySQL search operations to small physical segments of the disk.
  • Clustered Composite Key: (league_id, stat_id, member_id). This ensures the database finds the exact row via a direct "Seek" operation rather than a "Scan."
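An illustrative DDL sketch under these rules. The column types, HASH partitioning scheme, and partition count are assumptions; the source only specifies the partition column and the composite key:

```sql
-- Illustrative schema; types and partition count are assumed.
CREATE TABLE record_book_career (
    league_id INT NOT NULL,
    stat_id   INT NOT NULL,
    member_id INT NOT NULL,
    value     BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (league_id, stat_id, member_id)  -- clustered composite key
)
PARTITION BY HASH (league_id)
PARTITIONS 64;
```

Because InnoDB clusters rows on the primary key, a lookup by (league_id, stat_id, member_id) lands directly on the row within one partition: a seek, not a scan.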

6. Legacy Data Migration (Backfilling)

To populate the system with legacy records without downtime:

  1. Chunking: A background crawler processes legacy data in chunks of 5,000 IDs.
  2. Throttling: The script monitors AWS CloudWatch for RDS CPU Credits. If credits fall below 30%, migration pauses automatically.
  3. Parallel Convergence: Since the logic is additive (A + B = B + A), live RabbitMQ updates and the Legacy Migration script can run simultaneously without data corruption.
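The backfill loop can be sketched as follows; fetch_chunk, upsert_chunk, cpu_credit_pct, and pause are stand-ins for the real data access, the bulk upsert, and the CloudWatch check:

```python
CHUNK_SIZE = 5000        # IDs per chunk
CREDIT_FLOOR_PCT = 30    # pause below this CPU credit level

def backfill(last_id, fetch_chunk, upsert_chunk, cpu_credit_pct, pause):
    while True:
        if cpu_credit_pct() < CREDIT_FLOOR_PCT:
            pause()          # throttle until credits recover
            continue
        rows = fetch_chunk(last_id, CHUNK_SIZE)
        if not rows:
            return last_id   # legacy data fully migrated
        upsert_chunk(rows)   # additive, so safe alongside live deltas
        last_id = rows[-1]

# Demo with fake data: 12,000 legacy IDs, credits always healthy.
data = list(range(1, 12001))
processed = []
final_id = backfill(
    0,
    fetch_chunk=lambda after_id, size: [i for i in data if i > after_id][:size],
    upsert_chunk=processed.extend,
    cpu_credit_pct=lambda: 100,
    pause=lambda: None,
)
print(final_id, len(processed))
```

Resuming from last_id rather than an offset keeps each chunk query an index range scan, so the crawler's cost stays flat as it walks the table.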

7. Real-Time Leaderboard Caching (Top 10)

We avoid expensive ORDER BY queries on 30M rows using Threshold-Based Caching:

  • The Threshold Rule: Redis stores the score of the 10th-place player.
  • Conditional Trigger: A WarmCacheJob is only dispatched if a player's new score exceeds the threshold.
  • Debouncing: Multiple warm requests are collapsed into a single query every 60 seconds to prevent "Job Storms."

8. Reliability & Fault Tolerance

  • RabbitMQ Acknowledgements: Messages are only acked (removed from the queue) AFTER the Bulk Upsert is successfully committed to the database.
  • Idempotency: A stat_updates_log table tracks processed batches to prevent double-counting if a worker crashes mid-process.
  • Dead Letter Exchange (DLQ): Failed batches are routed to a separate queue (dlx_stats) for manual inspection instead of blocking the pipeline.
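The ack-after-commit and idempotency rules combine into a flow like this; stat_updates_log, the career table, and the broker ack are stubbed with in-memory stand-ins:

```python
processed_batches = set()   # stand-in for stat_updates_log
totals = {}                 # stand-in for record_book_career

def handle_batch(batch_id, deltas, ack):
    if batch_id in processed_batches:
        ack()               # already committed earlier: just acknowledge
        return "duplicate"
    for key, delta in deltas:
        totals[key] = totals.get(key, 0) + delta
    processed_batches.add(batch_id)  # recorded in the same transaction as the upsert
    ack()                   # ack only AFTER the commit succeeds
    return "committed"

acks = []
handle_batch("b-1", [((1, 101, 3), 2)], ack=lambda: acks.append("b-1"))
# Worker crashed before acking, so the broker redelivers the same batch:
handle_batch("b-1", [((1, 101, 3), 2)], ack=lambda: acks.append("b-1"))
print(totals[(1, 101, 3)], len(acks))  # 2 2
```

Redelivery is the normal failure mode here: because the batch id commits atomically with the upsert, the redelivered copy is detected and acked without being counted twice.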

9. Strategic Impact & Business Value

  • High Throughput: The architecture can handle spikes of thousands of events per minute without lag.
  • Cost Efficiency: We maximize the utility of the t3.large instance by reducing Database I/O by ~90%.
  • Scalability: The system becomes volume-agnostic; the performance for a 40M-row tenant is identical to that of a small tenant.
