The Problem We're Solving
Yesterday's duplicate task executions weren't just annoying—they revealed fundamental problems with single-instance scheduling:
Race Conditions: Multiple instances executing the same task simultaneously
Single Points of Failure: One server crash stops all scheduled operations
Zero Visibility: No way to monitor task states across distributed instances
Today we'll architect a solution that eliminates these issues while scaling to heavy production task volumes.
The Reality Check: When Single-Instance Scheduling Falls Apart
Remember yesterday when we discovered that running two instances of our Spring Boot app resulted in duplicate task executions? That wasn't just a minor inconvenience—it was a glimpse into one of the most critical challenges in modern distributed systems.
Think about Netflix's recommendation engine updating millions of user profiles every hour, or Uber's surge pricing algorithm recalculating rates across thousands of cities simultaneously. These systems can't afford duplicate executions, missed tasks, or single points of failure. They need what we're about to build: a distributed task scheduler.
The Three Horsemen of Distributed Scheduling Problems
1. Race Conditions: The Speed Demon Problem
When multiple scheduler instances try to pick up the same task simultaneously, you get race conditions. Imagine two instances both thinking, "I should process the 'daily-report' task right now!" Both grab it, both execute it, and suddenly your users receive duplicate daily reports.
In production systems handling millions of tasks, this isn't just annoying—it's expensive. Duplicate payment processing, double notification sends, or redundant data updates can cost companies thousands of dollars and erode user trust.
2. Single Points of Failure: The Achilles' Heel
Traditional @Scheduled approaches create a single point of failure. If that one server goes down, all scheduled tasks stop. No backups, no failover, just silence.
Real-world example: A major e-commerce company once lost $2 million in sales because their single scheduler instance crashed during Black Friday, stopping all inventory updates and promotional campaigns.
3. Task Visibility: The Black Box Problem
When tasks are scattered across multiple instances with no central coordination, you lose visibility. Which tasks are running? Which failed? Which instances are overloaded? Without answers, debugging becomes nearly impossible, and system optimization is just guesswork.
System Design Concepts: The Foundation
Leader Election Pattern
In distributed systems, we need one instance to act as the "leader" that coordinates task distribution. This leader doesn't execute all tasks—instead, it assigns tasks to available workers, including itself.
Why it matters: Prevents duplicate executions while maintaining high availability. If the leader fails, another instance takes over seamlessly.
Distributed Locking
Before executing any task, an instance must acquire a distributed lock. Think of it as a digital "Do Not Disturb" sign that all instances respect.
Implementation approaches:
Database-based locks (simple, reliable)
Redis-based locks (faster, ephemeral)
ZooKeeper/Consul locks (enterprise-grade)
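The lock semantics are easiest to see in miniature. Below is a hypothetical in-process sketch of TTL-based locking—the class name and API are illustrative, not part of the project; a real implementation would back this with Redis SET NX PX or a database row rather than local memory:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-process sketch of TTL-based lock semantics: one holder per lock name,
// and an expired lock can be taken over by another instance.
class TtlLockSketch {
    private record Lease(String owner, long expiresAtMillis) {}
    private final Map<String, Lease> leases = new ConcurrentHashMap<>();

    /** Returns true if ownerId now holds the lock. */
    public boolean tryAcquire(String lockName, String ownerId, long ttlMillis) {
        long now = System.currentTimeMillis();
        Lease winner = leases.compute(lockName, (name, current) -> {
            if (current == null || current.expiresAtMillis() <= now) {
                return new Lease(ownerId, now + ttlMillis); // free or expired: take it
            }
            return current; // still held by someone else
        });
        return winner.owner().equals(ownerId);
    }

    public void release(String lockName, String ownerId) {
        // Only the current owner may release, so a slow instance cannot
        // free a lock that was already taken over after expiry.
        leases.computeIfPresent(lockName,
            (name, l) -> l.owner().equals(ownerId) ? null : l);
    }
}
```

The TTL is what makes this safe against crashes: if the holder dies mid-task, the lease simply expires and another instance can proceed.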
Task State Management
Each task needs a lifecycle: SCHEDULED → RUNNING → COMPLETED/FAILED. This state must be shared across all instances in real-time.
Event Sourcing for Task Operations
Every task operation (creation, execution start, completion, failure) becomes an immutable event stored in sequence. This creates a complete audit trail and enables powerful replay capabilities.
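As a sketch of the idea (the event names here are illustrative, not the project's actual event types): each state change is appended as an immutable event, and the current state of any task can be derived by replaying the log in order.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Minimal event-sourcing sketch: task state is never mutated directly;
// it is derived by replaying an append-only event log.
class TaskEventLogSketch {
    public enum Kind { CREATED, STARTED, COMPLETED, FAILED }
    public record TaskEvent(String taskName, Kind kind, Instant at) {}

    private final List<TaskEvent> log = new ArrayList<>();

    public void append(TaskEvent event) { log.add(event); }

    /** Replay the log to recover the latest known state of a task. */
    public Kind currentState(String taskName) {
        Kind state = null;
        for (TaskEvent e : log) {
            if (e.taskName().equals(taskName)) state = e.kind();
        }
        return state;
    }
}
```

Because the log is append-only, the same replay also serves as the audit trail: every state a task ever passed through is preserved.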
High-Level Architecture: The Distributed Scheduler Ecosystem
Our distributed scheduler consists of five core components working in harmony:
1. Scheduler Coordinator (The Brain)
Responsibility: Task discovery, distribution, and health monitoring
Technology: Spring Boot with scheduled polling
Key Features: Leader election, load balancing, failure detection
2. Task Registry (The Memory)
Responsibility: Persistent storage of task definitions and execution state
Technology: PostgreSQL with optimistic locking
Key Features: ACID transactions, task versioning, execution history
3. Distributed Lock Manager (The Traffic Controller)
Responsibility: Prevents concurrent execution of the same task
Technology: Redis with Spring Integration
Key Features: TTL-based locks, automatic cleanup, performance monitoring
4. Worker Nodes (The Hands)
Responsibility: Actual task execution with resource isolation
Technology: Spring Boot instances with thread pool management
Key Features: Heartbeat reporting, graceful shutdown, resource monitoring
5. Event Stream (The Nervous System)
Responsibility: Real-time communication between components
Technology: Apache Kafka with Spring Cloud Stream
Key Features: Guaranteed delivery, event replay, horizontal scaling
Architecture Flow: From Task Creation to Execution
Task Registration: A new task definition is registered in the Task Registry
Discovery Phase: Scheduler Coordinator discovers eligible tasks for execution
Leader Election: Active instances elect a coordinator using distributed consensus
Task Assignment: Coordinator assigns tasks to available worker nodes
Lock Acquisition: Worker attempts to acquire distributed lock for the task
Execution Phase: If lock acquired, worker executes task and publishes events
State Updates: Task state changes are persisted and broadcast to all instances
Cleanup: Locks are released, and execution metrics are recorded
Real-World Applications
Financial Services
JPMorgan Chase uses distributed schedulers for:
End-of-day settlement processing
Risk calculation batch jobs
Regulatory report generation
E-commerce Platforms
Amazon's scheduler handles:
Inventory synchronization across warehouses
Price optimization algorithms
Recommendation engine updates
Social Media
Facebook's task scheduler manages:
News feed algorithm updates
Content moderation workflows
User activity aggregation
Component Integration Strategy
Our architecture promotes loose coupling through event-driven communication. Each component can be developed, tested, and deployed independently while maintaining system coherency through well-defined interfaces and event contracts.
Scalability Considerations
Horizontal Scaling: Add more worker nodes to handle increased task volume
Vertical Scaling: Increase resources for individual components based on bottlenecks
Auto-scaling: Dynamic instance management based on queue depth and system load
Fault Tolerance Features
Circuit Breakers: Prevent cascade failures when dependencies are down
Retry Mechanisms: Automated retry with exponential backoff for transient failures
Health Checks: Continuous monitoring with automatic failover capabilities
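A circuit breaker can be sketched in a few lines. This is an illustrative toy, not the Resilience4j or Spring implementation you would use in production: after a threshold of consecutive failures the breaker opens and rejects calls until a cooldown elapses.

```java
// Toy circuit breaker: opens after a failure threshold, rejects calls while
// open, and lets a trial call through once the cooldown has elapsed.
class CircuitBreakerSketch {
    private final int failureThreshold;
    private final long cooldownMillis;
    private int consecutiveFailures = 0;
    private long openedAtMillis = -1;

    public CircuitBreakerSketch(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    public boolean allowRequest() {
        if (openedAtMillis < 0) return true; // closed: normal operation
        if (System.currentTimeMillis() - openedAtMillis >= cooldownMillis) {
            openedAtMillis = -1;             // half-open: allow a trial call
            consecutiveFailures = 0;
            return true;
        }
        return false; // open: fail fast instead of hammering a dead dependency
    }

    public void recordSuccess() { consecutiveFailures = 0; }

    public void recordFailure() {
        if (++consecutiveFailures >= failureThreshold) {
            openedAtMillis = System.currentTimeMillis();
        }
    }
}
```

The point is the fail-fast behavior: while the breaker is open, callers skip the dead dependency entirely instead of queuing up timeouts and cascading the failure.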
Hands-On Implementation
Now let's build this architecture step by step. We'll create a working distributed scheduler that demonstrates all these concepts in action.
GitHub link:
https://github.com/sysdr/taskscheduler/tree/main/day5/distributed-task-scheduler
Prerequisites Setup
Before we start coding, ensure you have:
Java 21 or later installed
Maven 3.6+ for dependency management
Redis server (for distributed locking)
Docker (optional, for advanced multi-instance testing)
Your favorite IDE (IntelliJ IDEA recommended)
Quick verification:
java -version # Should show Java 21+
mvn -version # Should show Maven 3.6+
redis-server --version # Should show Redis info
Phase 1: Project Foundation (15 minutes)
Step 1: Create Spring Boot Project Structure
Initialize a new Maven project with this structure:
distributed-task-scheduler/
├── src/main/java/com/scheduler/
│ ├── model/ # Task entities
│ ├── repository/ # Data access
│ ├── service/ # Business logic
│ ├── controller/ # Web endpoints
│ └── config/ # Configuration
├── src/main/resources/
│ ├── templates/ # Web dashboard
│ └── application.yml # Configuration
└── docker/ # Container setup
Step 2: Configure Core Dependencies
Add these essential dependencies to your pom.xml:
spring-boot-starter-web - Web framework
spring-boot-starter-data-jpa - Database access
spring-boot-starter-data-redis - Redis integration
spring-boot-starter-actuator - Monitoring
shedlock-spring - Distributed locking (the RedisLockProvider used later also needs ShedLock's Redis lock-provider module)
h2 - Development database
Step 3: Application Configuration
Create application.yml with these key settings:
spring:
  application:
    name: distributed-task-scheduler
  datasource:
    url: jdbc:h2:mem:scheduler
  data:
    redis:
      host: localhost
      port: 6379
scheduler:
  enabled: true
  pool-size: 10
  distributed-lock:
    enabled: true
    default-lock-time: PT30M
Phase 2: Core Model Design (20 minutes)
Task Definition Entity
Create the foundation of our scheduler with a TaskDefinition class that includes:
Unique task name and description
Cron expression for scheduling
Task class for execution logic
Status tracking (ACTIVE, RUNNING, FAILED, COMPLETED, INACTIVE)
Timestamps for created, last executed, next execution
Version field for optimistic locking
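The version field is what makes optimistic locking work: each update must present the version it originally read, and a stale version is rejected. Here is a stripped-down, illustrative sketch of the check that JPA's @Version performs for you (not the real Hibernate mechanics):

```java
// Sketch of the optimistic-locking check that @Version gives you in JPA:
// an update only succeeds if the caller still holds the current version.
class VersionedTaskSketch {
    private String cronExpression;
    private long version = 0;

    public long getVersion() { return version; }
    public String getCronExpression() { return cronExpression; }

    /** Returns true if the update won; false means someone else updated first. */
    public synchronized boolean updateCron(String newCron, long expectedVersion) {
        if (expectedVersion != version) {
            return false; // stale read: in JPA this surfaces as an OptimisticLockException
        }
        this.cronExpression = newCron;
        this.version++; // every successful write bumps the version
        return true;
    }
}
```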
Task Execution Tracking
Build a TaskExecution entity to record:
Which task ran and on which instance
Start and completion timestamps
Execution duration and status
Error messages for failed executions
State Management
The task lifecycle follows this state machine:
ACTIVE: Ready for scheduling and execution
RUNNING: Currently executing with distributed lock held
COMPLETED: Successfully finished, ready for next schedule
FAILED: Encountered error, may need retry
INACTIVE: Manually disabled by administrator
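The transitions above can be encoded so that illegal moves (say, COMPLETED jumping straight back to RUNNING without being rescheduled) are rejected at the boundary. A minimal sketch, assuming the five states listed; the exact transition table is an assumption for illustration:

```java
import java.util.EnumSet;
import java.util.Set;

// Task lifecycle as an explicit state machine: each state lists the states
// it may legally move to, so bad transitions fail fast instead of silently
// corrupting shared state.
enum TaskStatus {
    ACTIVE, RUNNING, COMPLETED, FAILED, INACTIVE;

    private Set<TaskStatus> allowedTargets() {
        return switch (this) {
            case ACTIVE    -> EnumSet.of(RUNNING, INACTIVE);
            case RUNNING   -> EnumSet.of(COMPLETED, FAILED);
            case COMPLETED -> EnumSet.of(ACTIVE, INACTIVE); // rescheduled or disabled
            case FAILED    -> EnumSet.of(ACTIVE, INACTIVE); // retried or disabled
            case INACTIVE  -> EnumSet.of(ACTIVE);           // re-enabled by an admin
        };
    }

    public boolean canTransitionTo(TaskStatus target) {
        return allowedTargets().contains(target);
    }
}
```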
Phase 3: Repository Layer (10 minutes)
Create repository interfaces with custom queries:
TaskDefinitionRepository:
findEligibleTasks() - Tasks ready for execution
countActiveTasks() - System health metrics
findByStatus() - Filter by current state
TaskExecutionRepository:
findRecentExecutions() - Monitoring dashboard data
getAverageExecutionTime() - Performance metrics
countRunningExecutions() - Prevent race conditions
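The core of findEligibleTasks() is a simple predicate: a task is eligible when it is ACTIVE and its next scheduled run is due. A plain-Java sketch of that filter (in the repository this would be a JPQL or derived query, not an in-memory stream):

```java
import java.time.Instant;
import java.util.List;

// Sketch of the eligibility rule behind findEligibleTasks(): ACTIVE status
// and a next-execution time that is already due.
class EligibilitySketch {
    public enum Status { ACTIVE, RUNNING, FAILED, COMPLETED, INACTIVE }
    public record Task(String name, Status status, Instant nextExecution) {}

    public static List<Task> findEligibleTasks(List<Task> all, Instant now) {
        return all.stream()
                .filter(t -> t.status() == Status.ACTIVE)
                .filter(t -> !t.nextExecution().isAfter(now)) // due now or overdue
                .toList();
    }
}
```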
Phase 4: Distributed Coordination (25 minutes)
Scheduler Service with Distributed Locking
The heart of our system uses ShedLock annotations:
@Scheduled(fixedDelay = 30000)
@SchedulerLock(name = "taskDiscovery", lockAtMostFor = "29s")
public void discoverAndExecuteTasks() {
    // Only one instance executes this method
}
Task Executor Service
Handles asynchronous task execution with:
Proper state transitions
Error handling and retry logic
Performance metrics collection
Lock cleanup after completion
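The retry logic mentioned above typically uses exponential backoff: each attempt waits twice as long as the previous one before trying again. A self-contained sketch—the attempt count and delays are illustrative defaults, not the project's actual configuration:

```java
import java.util.function.Supplier;

// Sketch of retry with exponential backoff for transient task failures:
// the delay doubles after each failed attempt (100ms, 200ms, 400ms, ...).
class RetrySketch {
    public static <T> T retryWithBackoff(Supplier<T> action, int maxAttempts,
                                         long initialDelayMillis) {
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delay);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt(); // give up if interrupted
                        throw e;
                    }
                    delay *= 2; // exponential backoff
                }
            }
        }
        throw last; // all attempts exhausted
    }
}
```

Backoff matters here because a transient failure (a brief database hiccup, say) usually clears within seconds; hammering the dependency at a fixed interval only prolongs the outage.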
Redis Configuration
Configure ShedLock with Redis for distributed coordination:
@Bean
public LockProvider lockProvider(RedisConnectionFactory connectionFactory) {
    return new RedisLockProvider(connectionFactory);
}
Phase 5: Monitoring Dashboard (15 minutes)
Web Controller
Create endpoints for:
Dashboard view with task status
REST API for system metrics
Real-time execution monitoring
Dashboard Template
Build a responsive web interface showing:
Active task definitions and their schedules
Recent execution history with status
System metrics (active/running task counts)
Instance identification and health
Auto-refresh Functionality
Add JavaScript to automatically refresh the dashboard every 30 seconds, providing real-time visibility into system state.
Testing & Demonstration
Local Single-Instance Testing
Start Redis and Application:
# Start Redis server
redis-server
# Run the application
./mvnw spring-boot:run
Verify Basic Functionality:
Open the dashboard at http://localhost:8080/scheduler/dashboard
Watch tasks execute automatically every 30 seconds
Check the H2 console at http://localhost:8080/h2-console
Monitor health at http://localhost:8080/actuator/health
Multi-Instance Distributed Testing
Docker Compose Setup:
For true distributed testing, use Docker Compose to run multiple instances:
# Start all services (Redis, PostgreSQL, 2 scheduler instances)
docker-compose up --build
Observe Distributed Behavior:
Open both dashboards:
Instance 1: http://localhost:8080/scheduler/dashboard
Instance 2: http://localhost:8081/scheduler/dashboard
Watch which instance picks up each task (only one should execute)
Kill one instance and observe failover
Check logs for distributed locking messages
Race Condition Prevention Verification
Test Scenario:
Create a task with high frequency execution (every 10 seconds)
Start multiple instances simultaneously
Monitor execution logs and database records
Verify no duplicate executions occur
Expected Results:
Only one instance executes each task occurrence
Distributed locks prevent race conditions
Failed lock acquisitions are logged properly
System maintains exactly-once execution semantics
Performance and Scalability Testing
Load Testing:
Add multiple task definitions with varying execution times
Monitor system performance under increased load
Test horizontal scaling by adding more instances
Verify resource utilization and response times
Failure Recovery Testing:
Simulate instance crashes during task execution
Verify lock timeout and cleanup mechanisms
Test automatic failover to remaining instances
Confirm no task executions are lost
Success Criteria for Implementation
By the end of this lesson, your distributed scheduler should achieve:
Functional Requirements
Zero Duplicate Executions: Tasks run exactly once regardless of instance count
High Availability: System continues operating despite single instance failures
Complete Visibility: Real-time monitoring of all task states and executions
Distributed Coordination: Proper leader election and lock management
Technical Requirements
Spring Boot Integration: Leverages Spring ecosystem for enterprise features
Database Persistence: Reliable state management with ACID properties
Redis Coordination: Fast distributed locking with automatic cleanup
Production Ready: Comprehensive monitoring, logging, and error handling
Learning Objectives Achieved
Distributed Systems Understanding: Practical experience with real-world challenges
Race Condition Prevention: Hands-on implementation of coordination mechanisms
Architectural Design: Component-based thinking for scalable systems
Production Considerations: Monitoring, observability, and operational practices
Assignment: Architecture Design Challenge
Task: Design a distributed scheduler for a real-world scenario of your choice (e.g., social media platform, delivery service, streaming platform).
Requirements:
Identify 5 specific tasks your system needs to schedule
Determine the execution frequency and dependencies for each task
Design the data flow between components
Specify the technology choices for each component
Create a failure recovery plan for each component
Deliverables:
Architecture diagram showing all components and their interactions
Task definition specifications with execution constraints
Failure scenario analysis with recovery strategies
Solution Hints
Start Simple: Begin with database-based locking before moving to Redis
Monitor Everything: Add metrics from day one—they're crucial for debugging
Test Failures: Regularly test your failure scenarios in development
Performance Baseline: Establish performance benchmarks early to track improvements
Documentation: Maintain clear documentation of your design decisions and trade-offs
The distributed scheduler we're building isn't just code—it's a blueprint for handling one of the most challenging problems in modern system design. Master this, and you'll have the foundation for building truly scalable, resilient systems that companies like Netflix, Uber, and Amazon rely on every day.
Next lesson, we'll dive deep into designing our Task Definition Model, creating the data structures that will power our entire scheduler ecosystem.