Hands-on System Design with Java Spring Boot

Day 22: Designing for Task Statuses (Pending, Running, Succeeded, Failed)

The State Machine Approach to Bulletproof Task Execution

Sumedh
Oct 22, 2025

Why Task States Matter More Than You Think

Imagine you're at a busy coffee shop where orders move through distinct stages: ordered, brewing, ready, delivered. Without clear status tracking, chaos ensues: customers don't know when their coffee is ready, baristas don't know what to make next, and the manager can't tell which orders are stuck.

Your task scheduler faces the same challenge at a much larger scale. When thousands of tasks per second run across distributed servers, the system has to know exactly where each task stands. A single "lost" task status can leave a task running twice, never running at all, or blocking the work that depends on it.

The Hidden Complexity of Task States

Most engineers think task status is simple: "just mark it complete when done." In large distributed systems, reality is messier:

Race Conditions: What happens when two servers try to update the same task status simultaneously? Without proper state transitions, you might mark a failed task as succeeded, or worse, lose track of a critical task entirely.

Zombie Tasks: Tasks that start but never finish because a server crashed become "zombies": they appear to be running forever, holding resources and blocking dependent workflows.

Audit Requirements: In production systems, you need forensic-level tracking. When a million-dollar transaction fails, stakeholders demand to know exactly when and why each status change occurred.
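The zombie-task problem above is often handled with a heartbeat plus a timeout: a task that claims to be RUNNING but has not reported in for too long is flagged as a suspect. The following is a minimal sketch of that idea, assuming Java 17+. `TaskSnapshot`, `ZombieDetector`, the `lastHeartbeat` field, and the timeout value are all hypothetical names chosen for illustration, not part of the series' codebase.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Same four states as the article; redeclared so the sketch compiles on its own.
enum TaskStatus { PENDING, RUNNING, SUCCEEDED, FAILED }

// Hypothetical snapshot of a task: its status and the last time its worker reported in.
record TaskSnapshot(String id, TaskStatus status, Instant lastHeartbeat) {}

final class ZombieDetector {
    private final Duration timeout;

    ZombieDetector(Duration timeout) {
        this.timeout = timeout;
    }

    // A task is a suspected zombie if it is RUNNING and its heartbeat is older than the timeout.
    boolean isZombie(TaskSnapshot task, Instant now) {
        return task.status() == TaskStatus.RUNNING
                && Duration.between(task.lastHeartbeat(), now).compareTo(timeout) > 0;
    }

    // Returns only the suspected zombies, preserving input order.
    List<TaskSnapshot> findZombies(List<TaskSnapshot> tasks, Instant now) {
        return tasks.stream().filter(t -> isZombie(t, now)).toList();
    }
}
```

What to do with a suspected zombie is a separate decision. Moving it to FAILED is one option that stays within the RUNNING → FAILED transition.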

State Machines: Your Safety Net

A state machine approach treats status changes as controlled transitions with rules. Instead of allowing arbitrary status updates, you define valid paths:

  • PENDING → RUNNING ✅

  • RUNNING → SUCCEEDED ✅

  • RUNNING → FAILED ✅

  • SUCCEEDED → RUNNING ❌ (Invalid!)

This prevents impossible states and makes your system predictable under stress.

Real-World Impact

Content encoding pipelines, such as the one Netflix runs, are a natural fit for state machines. When a video upload moves through states (uploaded → processing → encoded → distributed), strict transitions keep partially processed videos from being served to millions of users.

Payment processors such as Stripe face a similar problem. If a payment cannot jump from "pending" to "refunded" without passing through "succeeded" first, a whole class of financial discrepancies is ruled out by construction.

Core Implementation Strategy

Our TaskStatus enum will be more than simple constants. It becomes a controlled vocabulary with transition rules:

public enum TaskStatus {
    PENDING,    // Queued for execution
    RUNNING,    // Currently executing
    SUCCEEDED,  // Completed successfully
    FAILED;     // Failed with error
    
    // Validation logic for state transitions
    public boolean canTransitionTo(TaskStatus newStatus) {
        return switch(this) {
            case PENDING -> newStatus == RUNNING;
            case RUNNING -> newStatus == SUCCEEDED || newStatus == FAILED;
            case SUCCEEDED, FAILED -> false; // Terminal states
        };
    }
}
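To show how `canTransitionTo` might be used, here is a minimal sketch of a holder that refuses invalid updates. The enum is repeated (without the `public` modifier) so the example compiles as a single file. `TaskRecord` and `transitionTo` are hypothetical names, and the class is deliberately not thread-safe.

```java
enum TaskStatus {
    PENDING, RUNNING, SUCCEEDED, FAILED;

    // Same rules as the enum above: terminal states allow no further moves.
    boolean canTransitionTo(TaskStatus newStatus) {
        return switch (this) {
            case PENDING -> newStatus == RUNNING;
            case RUNNING -> newStatus == SUCCEEDED || newStatus == FAILED;
            case SUCCEEDED, FAILED -> false;
        };
    }
}

// Hypothetical single-threaded holder for one task's status.
final class TaskRecord {
    private TaskStatus status = TaskStatus.PENDING;

    TaskStatus status() {
        return status;
    }

    // Rejects any change the enum does not allow, leaving the status untouched.
    void transitionTo(TaskStatus next) {
        if (!status.canTransitionTo(next)) {
            throw new IllegalStateException("Invalid transition: " + status + " -> " + next);
        }
        status = next;
    }
}
```

Throwing on an invalid transition is one design choice. Returning a boolean works too, but an exception makes a rejected update harder to ignore by accident.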

Integration with TaskExecution Entity

Building on Day 21's TaskExecution tracking, we'll enhance our entity to enforce state transitions at the database level. This creates a robust audit trail and prevents invalid state changes even under high concurrency.
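Under concurrency, this kind of guard usually takes the form of a conditional write: the update applies only if the task is still in the status the writer expects. The entity code is not shown in this section, so the sketch below illustrates the same compare-and-set idea in memory with `AtomicReference`. `GuardedTask` and `tryTransition` are hypothetical names. A database-backed version would express the same check in its write, for example with a conditional UPDATE or optimistic locking.

```java
import java.util.concurrent.atomic.AtomicReference;

enum TaskStatus {
    PENDING, RUNNING, SUCCEEDED, FAILED;

    boolean canTransitionTo(TaskStatus newStatus) {
        return switch (this) {
            case PENDING -> newStatus == RUNNING;
            case RUNNING -> newStatus == SUCCEEDED || newStatus == FAILED;
            case SUCCEEDED, FAILED -> false;
        };
    }
}

// Hypothetical in-memory stand-in for a task row.
final class GuardedTask {
    private final AtomicReference<TaskStatus> status =
            new AtomicReference<>(TaskStatus.PENDING);

    // Succeeds only if the transition is valid AND the task is still in the
    // expected state; a writer that lost the race gets false and changes nothing.
    boolean tryTransition(TaskStatus expected, TaskStatus next) {
        return expected.canTransitionTo(next) && status.compareAndSet(expected, next);
    }

    TaskStatus current() {
        return status.get();
    }
}
```

If two workers both believe the task is RUNNING and one reports FAILED while the other reports SUCCEEDED, only the first call wins. The second sees `false` and can log or retry rather than silently overwriting the result.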

The entity will track not just the current status, but also timestamps for each transition, creating a complete execution timeline. This proves invaluable when debugging issues in production or analyzing performance patterns.
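As a rough illustration of what such a timeline could look like, the sketch below records every accepted transition with its timestamp. It is plain Java rather than a JPA entity, and `StatusChange` and `ExecutionTimeline` are hypothetical names. The actual TaskExecution entity from Day 21 is not shown here and may be structured differently.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

enum TaskStatus {
    PENDING, RUNNING, SUCCEEDED, FAILED;

    boolean canTransitionTo(TaskStatus newStatus) {
        return switch (this) {
            case PENDING -> newStatus == RUNNING;
            case RUNNING -> newStatus == SUCCEEDED || newStatus == FAILED;
            case SUCCEEDED, FAILED -> false;
        };
    }
}

// One entry in the audit trail: which move happened, and when.
record StatusChange(TaskStatus from, TaskStatus to, Instant at) {}

// Hypothetical holder that keeps the current status plus its full history.
final class ExecutionTimeline {
    private TaskStatus status = TaskStatus.PENDING;
    private final List<StatusChange> history = new ArrayList<>();

    // Records the change only if the state machine allows it.
    void transitionTo(TaskStatus next, Instant at) {
        if (!status.canTransitionTo(next)) {
            throw new IllegalStateException("Invalid transition: " + status + " -> " + next);
        }
        history.add(new StatusChange(status, next, at));
        status = next;
    }

    TaskStatus status() {
        return status;
    }

    // Immutable copy, so callers cannot rewrite the audit trail.
    List<StatusChange> history() {
        return List.copyOf(history);
    }
}
```

Passing the timestamp in, rather than calling `Instant.now()` inside the method, keeps the sketch easy to test with fixed times.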

System Design Benefits

Observability: Clear states enable powerful monitoring. You can instantly see how many tasks are stuck in "RUNNING" state (potential zombies) or track average time in each state.

Debugging: When issues arise, state history provides forensic evidence. Instead of guessing why a task failed, you see its complete journey through the state machine.

Performance Optimization: State metrics reveal bottlenecks. If tasks spend excessive time in "PENDING", you need more worker capacity. If "RUNNING" time is high, optimize task logic.
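The two metrics mentioned above, task counts per state and average time in a state, can be computed with very little code. The sketch below is a minimal in-memory version assuming Java 17+. `StateSample` and `StateMetrics` are hypothetical names, and a real service would publish these numbers through whatever metrics library the project uses rather than compute them on demand.

```java
import java.time.Duration;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

enum TaskStatus { PENDING, RUNNING, SUCCEEDED, FAILED }

// Hypothetical sample: how long one task spent in one state.
record StateSample(TaskStatus status, Duration timeInState) {}

final class StateMetrics {
    // Counts samples per status; every status is present, with 0 if unseen.
    static Map<TaskStatus, Long> countByStatus(List<StateSample> samples) {
        Map<TaskStatus, Long> counts = new EnumMap<>(TaskStatus.class);
        for (TaskStatus s : TaskStatus.values()) {
            counts.put(s, 0L);
        }
        for (StateSample sample : samples) {
            counts.merge(sample.status(), 1L, Long::sum);
        }
        return counts;
    }

    // Average time in the given status (millisecond precision),
    // or Duration.ZERO when there are no samples for it.
    static Duration averageTimeIn(TaskStatus status, List<StateSample> samples) {
        long totalMillis = 0;
        long n = 0;
        for (StateSample sample : samples) {
            if (sample.status() == status) {
                totalMillis += sample.timeInState().toMillis();
                n++;
            }
        }
        return n == 0 ? Duration.ZERO : Duration.ofMillis(totalMillis / n);
    }
}
```

A rising PENDING average with a flat RUNNING average would point at worker capacity. The reverse would point at the task logic itself.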

Today's Implementation Focus

We'll build a state machine that's both robust and performant:

  1. TaskStatus Enum: Define states with transition validation

  2. Enhanced TaskExecution Entity: Add status tracking with timestamps

  3. State Transition Service: Centralized logic for safe status updates

  4. Monitoring Integration: Expose state metrics for observability

