The Problem We're Solving
Yesterday's duplicate task executions weren't just annoying—they revealed fundamental problems with single-instance scheduling:
Race Conditions: Multiple instances executing the same task simultaneously
Single Points of Failure: One server crash stops all scheduled operations
Zero Visibility: No way to monitor task states across distributed instances
Today we'll architect a solution that eliminates these issues while scaling to heavy production task volumes.
The Reality Check: When Single-Instance Scheduling Falls Apart
Remember yesterday when we discovered that running two instances of our Spring Boot app resulted in duplicate task executions? That wasn't just a minor inconvenience—it was a glimpse into one of the most critical challenges in modern distributed systems.
Think about Netflix's recommendation engine updating millions of user profiles every hour, or Uber's surge pricing algorithm recalculating rates across thousands of cities simultaneously. These systems can't afford duplicate executions, missed tasks, or single points of failure. They need what we're about to build: a distributed task scheduler.
The Three Horsemen of Distributed Scheduling Problems
1. Race Conditions: The Speed Demon Problem
When multiple scheduler instances try to pick up the same task simultaneously, you get race conditions. Imagine two instances both thinking, "I should process the 'daily-report' task right now!" Both grab it, both execute it, and suddenly your users receive duplicate daily reports.
In production systems handling millions of tasks, this isn't just annoying—it's expensive. Duplicate payment processing, double notification sends, or redundant data updates can cost companies thousands of dollars and erode user trust.
2. Single Points of Failure: The Achilles' Heel
Traditional @Scheduled approaches create a single point of failure. If that one server goes down, all scheduled tasks stop. No backups, no failover, just silence.
Real-world example: A major e-commerce company once lost $2 million in sales because their single scheduler instance crashed during Black Friday, stopping all inventory updates and promotional campaigns.
3. Task Visibility: The Black Box Problem
When tasks are scattered across multiple instances with no central coordination, you lose visibility. Which tasks are running? Which failed? Which instances are overloaded? Without answers, debugging becomes nearly impossible, and system optimization is just guesswork.
System Design Concepts: The Foundation
Leader Election Pattern
In distributed systems, we need one instance to act as the "leader" that coordinates task distribution. This leader doesn't execute all tasks—instead, it assigns tasks to available workers, including itself.
Why it matters: Prevents duplicate executions while maintaining high availability. If the leader fails, another instance takes over seamlessly.
Distributed Locking
Before executing any task, an instance must acquire a distributed lock. Think of it as a digital "Do Not Disturb" sign that all instances respect.
Implementation approaches:
Database-based locks (simple, reliable)
Redis-based locks (faster, ephemeral)
ZooKeeper/Consul locks (enterprise-grade)
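The lock semantics are easiest to see in miniature. Below is a hypothetical in-process sketch of TTL-based locking—the class name and API are illustrative, not part of the project; a real implementation would back this with Redis SET NX PX or a database row rather than local memory:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-process sketch of TTL-based lock semantics: one holder per lock name,
// and an expired lock can be taken over by another instance.
class TtlLockSketch {
    private record Lease(String owner, long expiresAtMillis) {}
    private final Map<String, Lease> leases = new ConcurrentHashMap<>();

    /** Returns true if ownerId now holds the lock. */
    public boolean tryAcquire(String lockName, String ownerId, long ttlMillis) {
        long now = System.currentTimeMillis();
        Lease winner = leases.compute(lockName, (name, current) -> {
            if (current == null || current.expiresAtMillis() <= now) {
                return new Lease(ownerId, now + ttlMillis); // free or expired: take it
            }
            return current; // still held by someone else
        });
        return winner.owner().equals(ownerId);
    }

    public void release(String lockName, String ownerId) {
        // Only the current owner may release, so a slow instance cannot
        // free a lock that was already taken over after expiry.
        leases.computeIfPresent(lockName,
            (name, l) -> l.owner().equals(ownerId) ? null : l);
    }
}
```

The TTL is what makes this safe against crashes: if the holder dies mid-task, the lease simply expires and another instance can proceed.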
Task State Management
Each task needs a lifecycle: SCHEDULED → RUNNING → COMPLETED/FAILED. This state must be shared across all instances in real-time.
Event Sourcing for Task Operations
Every task operation (creation, execution start, completion, failure) becomes an immutable event stored in sequence. This creates a complete audit trail and enables powerful replay capabilities.
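As a sketch of the idea (the event names here are illustrative, not the project's actual event types): each state change is appended as an immutable event, and the current state of any task can be derived by replaying the log in order.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Minimal event-sourcing sketch: task state is never mutated directly;
// it is derived by replaying an append-only event log.
class TaskEventLogSketch {
    public enum Kind { CREATED, STARTED, COMPLETED, FAILED }
    public record TaskEvent(String taskName, Kind kind, Instant at) {}

    private final List<TaskEvent> log = new ArrayList<>();

    public void append(TaskEvent event) { log.add(event); }

    /** Replay the log to recover the latest known state of a task. */
    public Kind currentState(String taskName) {
        Kind state = null;
        for (TaskEvent e : log) {
            if (e.taskName().equals(taskName)) state = e.kind();
        }
        return state;
    }
}
```

Because the log is append-only, the same replay also serves as the audit trail: every state a task ever passed through is preserved.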
High-Level Architecture: The Distributed Scheduler Ecosystem
Our distributed scheduler consists of five core components working in harmony:
1. Scheduler Coordinator (The Brain)
Responsibility: Task discovery, distribution, and health monitoring
Technology: Spring Boot with scheduled polling
Key Features: Leader election, load balancing, failure detection
2. Task Registry (The Memory)
Responsibility: Persistent storage of task definitions and execution state
Technology: PostgreSQL with optimistic locking
Key Features: ACID transactions, task versioning, execution history
3. Distributed Lock Manager (The Traffic Controller)
Responsibility: Prevents concurrent execution of the same task
Technology: Redis with Spring Integration
Key Features: TTL-based locks, automatic cleanup, performance monitoring
4. Worker Nodes (The Hands)
Responsibility: Actual task execution with resource isolation
Technology: Spring Boot instances with thread pool management
Key Features: Heartbeat reporting, graceful shutdown, resource monitoring
5. Event Stream (The Nervous System)
Responsibility: Real-time communication between components
Technology: Apache Kafka with Spring Cloud Stream
Key Features: Guaranteed delivery, event replay, horizontal scaling
Architecture Flow: From Task Creation to Execution
Task Registration: A new task definition is registered in the Task Registry
Discovery Phase: Scheduler Coordinator discovers eligible tasks for execution
Leader Election: Active instances elect a coordinator using distributed consensus
Task Assignment: Coordinator assigns tasks to available worker nodes
Lock Acquisition: Worker attempts to acquire distributed lock for the task
Execution Phase: If lock acquired, worker executes task and publishes events
State Updates: Task state changes are persisted and broadcast to all instances
Cleanup: Locks are released, and execution metrics are recorded
Real-World Applications
Financial Services
JPMorgan Chase uses distributed schedulers for:
End-of-day settlement processing
Risk calculation batch jobs
Regulatory report generation
E-commerce Platforms
Amazon's scheduler handles:
Inventory synchronization across warehouses
Price optimization algorithms
Recommendation engine updates
Social Media
Facebook's task scheduler manages:
News feed algorithm updates
Content moderation workflows
User activity aggregation
Component Integration Strategy
Our architecture promotes loose coupling through event-driven communication. Each component can be developed, tested, and deployed independently while maintaining system coherency through well-defined interfaces and event contracts.
Scalability Considerations
Horizontal Scaling: Add more worker nodes to handle increased task volume
Vertical Scaling: Increase resources for individual components based on bottlenecks
Auto-scaling: Dynamic instance management based on queue depth and system load
Fault Tolerance Features
Circuit Breakers: Prevent cascade failures when dependencies are down
Retry Mechanisms: Automated retry with exponential backoff for transient failures
Health Checks: Continuous monitoring with automatic failover capabilities
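A circuit breaker can be sketched in a few lines. This is an illustrative toy, not the Resilience4j or Spring implementation you would use in production: after a threshold of consecutive failures the breaker opens and rejects calls until a cooldown elapses.

```java
// Toy circuit breaker: opens after a failure threshold, rejects calls while
// open, and lets a trial call through once the cooldown has elapsed.
class CircuitBreakerSketch {
    private final int failureThreshold;
    private final long cooldownMillis;
    private int consecutiveFailures = 0;
    private long openedAtMillis = -1;

    public CircuitBreakerSketch(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    public boolean allowRequest() {
        if (openedAtMillis < 0) return true; // closed: normal operation
        if (System.currentTimeMillis() - openedAtMillis >= cooldownMillis) {
            openedAtMillis = -1;             // half-open: allow a trial call
            consecutiveFailures = 0;
            return true;
        }
        return false; // open: fail fast instead of hammering a dead dependency
    }

    public void recordSuccess() { consecutiveFailures = 0; }

    public void recordFailure() {
        if (++consecutiveFailures >= failureThreshold) {
            openedAtMillis = System.currentTimeMillis();
        }
    }
}
```

The point is the fail-fast behavior: while the breaker is open, callers skip the dead dependency entirely instead of queuing up timeouts and cascading the failure.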
Hands-On Implementation
Now let's build this architecture step by step. We'll create a working distributed scheduler that demonstrates all these concepts in action.
GitHub link:
https://github.com/sysdr/taskscheduler/tree/main/day5/distributed-task-scheduler
Prerequisites Setup
Before we start coding, ensure you have:
Java 21 or later installed
Maven 3.6+ for dependency management
Redis server (for distributed locking)
Docker (optional, for advanced multi-instance testing)
Your favorite IDE (IntelliJ IDEA recommended)
Quick verification:
java -version # Should show Java 21+
mvn -version # Should show Maven 3.6+
redis-server --version # Should show Redis info
Phase 1: Project Foundation (15 minutes)
Step 1: Create Spring Boot Project Structure
Initialize a new Maven project with this structure:
distributed-task-scheduler/
├── src/main/java/com/scheduler/
│ ├── model/ # Task entities
│ ├── repository/ # Data access
│ ├── service/ # Business logic
│ ├── controller/ # Web endpoints
│ └── config/ # Configuration
├── src/main/resources/
│ ├── templates/ # Web dashboard
│ └── application.yml # Configuration
└── docker/ # Container setup
Step 2: Configure Core Dependencies
Add these essential dependencies to your pom.xml:
spring-boot-starter-web - Web framework
spring-boot-starter-data-jpa - Database access
spring-boot-starter-data-redis - Redis integration
spring-boot-starter-actuator - Monitoring
shedlock-spring - Distributed locking (the RedisLockProvider used later also needs ShedLock's Redis lock-provider module)
h2 - Development database
Step 3: Application Configuration
Create application.yml with these key settings:
spring:
  application:
    name: distributed-task-scheduler
  datasource:
    url: jdbc:h2:mem:scheduler
  data:
    redis:
      host: localhost
      port: 6379
scheduler:
  enabled: true
  pool-size: 10
  distributed-lock:
    enabled: true
    default-lock-time: PT30M
Phase 2: Core Model Design (20 minutes)
Task Definition Entity
Create the foundation of our scheduler with a TaskDefinition class that includes:
Unique task name and description
Cron expression for scheduling
Task class for execution logic
Status tracking (ACTIVE, RUNNING, FAILED, COMPLETED, INACTIVE)
Timestamps for created, last executed, next execution
Version field for optimistic locking
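The version field is what makes optimistic locking work: each update must present the version it originally read, and a stale version is rejected. Here is a stripped-down, illustrative sketch of the check that JPA's @Version performs for you (not the real Hibernate mechanics):

```java
// Sketch of the optimistic-locking check that @Version gives you in JPA:
// an update only succeeds if the caller still holds the current version.
class VersionedTaskSketch {
    private String cronExpression;
    private long version = 0;

    public long getVersion() { return version; }
    public String getCronExpression() { return cronExpression; }

    /** Returns true if the update won; false means someone else updated first. */
    public synchronized boolean updateCron(String newCron, long expectedVersion) {
        if (expectedVersion != version) {
            return false; // stale read: in JPA this surfaces as an OptimisticLockException
        }
        this.cronExpression = newCron;
        this.version++; // every successful write bumps the version
        return true;
    }
}
```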
Task Execution Tracking
Build a TaskExecution entity to record:
Which task ran and on which instance
Start and completion timestamps
Execution duration and status
Error messages for failed executions
State Management
The task lifecycle follows this state machine:
ACTIVE: Ready for scheduling and execution
RUNNING: Currently executing with distributed lock held
COMPLETED: Successfully finished, ready for next schedule
FAILED: Encountered error, may need retry
INACTIVE: Manually disabled by administrator
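The transitions above can be encoded so that illegal moves (say, COMPLETED jumping straight back to RUNNING without being rescheduled) are rejected at the boundary. A minimal sketch, assuming the five states listed; the exact transition table is an assumption for illustration:

```java
import java.util.EnumSet;
import java.util.Set;

// Task lifecycle as an explicit state machine: each state lists the states
// it may legally move to, so bad transitions fail fast instead of silently
// corrupting shared state.
enum TaskStatus {
    ACTIVE, RUNNING, COMPLETED, FAILED, INACTIVE;

    private Set<TaskStatus> allowedTargets() {
        return switch (this) {
            case ACTIVE    -> EnumSet.of(RUNNING, INACTIVE);
            case RUNNING   -> EnumSet.of(COMPLETED, FAILED);
            case COMPLETED -> EnumSet.of(ACTIVE, INACTIVE); // rescheduled or disabled
            case FAILED    -> EnumSet.of(ACTIVE, INACTIVE); // retried or disabled
            case INACTIVE  -> EnumSet.of(ACTIVE);           // re-enabled by an admin
        };
    }

    public boolean canTransitionTo(TaskStatus target) {
        return allowedTargets().contains(target);
    }
}
```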
Phase 3: Repository Layer (10 minutes)
Create repository interfaces with custom queries:
TaskDefinitionRepository:
findEligibleTasks() - Tasks ready for execution
countActiveTasks() - System health metrics
findByStatus() - Filter by current state
TaskExecutionRepository:
findRecentExecutions() - Monitoring dashboard data
getAverageExecutionTime() - Performance metrics
countRunningExecutions() - Prevent race conditions
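The core of findEligibleTasks() is a simple predicate: a task is eligible when it is ACTIVE and its next scheduled run is due. A plain-Java sketch of that filter (in the repository this would be a JPQL or derived query, not an in-memory stream):

```java
import java.time.Instant;
import java.util.List;

// Sketch of the eligibility rule behind findEligibleTasks(): ACTIVE status
// and a next-execution time that is already due.
class EligibilitySketch {
    public enum Status { ACTIVE, RUNNING, FAILED, COMPLETED, INACTIVE }
    public record Task(String name, Status status, Instant nextExecution) {}

    public static List<Task> findEligibleTasks(List<Task> all, Instant now) {
        return all.stream()
                .filter(t -> t.status() == Status.ACTIVE)
                .filter(t -> !t.nextExecution().isAfter(now)) // due now or overdue
                .toList();
    }
}
```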
Phase 4: Distributed Coordination (25 minutes)
Scheduler Service with Distributed Locking
The heart of our system uses ShedLock annotations:
@Scheduled(fixedDelay = 30000)
@SchedulerLock(name = "taskDiscovery", lockAtMostFor = "29s")
public void discoverAndExecuteTasks() {
    // Only one instance executes this method
}
Task Executor Service
Handles asynchronous task execution with:
Proper state transitions
Error handling and retry logic
Performance metrics collection
Lock cleanup after completion
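The retry logic mentioned above typically uses exponential backoff: each attempt waits twice as long as the previous one before trying again. A self-contained sketch—the attempt count and delays are illustrative defaults, not the project's actual configuration:

```java
import java.util.function.Supplier;

// Sketch of retry with exponential backoff for transient task failures:
// the delay doubles after each failed attempt (100ms, 200ms, 400ms, ...).
class RetrySketch {
    public static <T> T retryWithBackoff(Supplier<T> action, int maxAttempts,
                                         long initialDelayMillis) {
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delay);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt(); // give up if interrupted
                        throw e;
                    }
                    delay *= 2; // exponential backoff
                }
            }
        }
        throw last; // all attempts exhausted
    }
}
```

Backoff matters here because a transient failure (a brief database hiccup, say) usually clears within seconds; hammering the dependency at a fixed interval only prolongs the outage.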
Redis Configuration
Configure ShedLock with Redis for distributed coordination:
@Bean
public LockProvider lockProvider(RedisConnectionFactory connectionFactory) {
    return new RedisLockProvider(connectionFactory);
}
Phase 5: Monitoring Dashboard (15 minutes)
Web Controller
Create endpoints for:
Dashboard view with task status
REST API for system metrics
Real-time execution monitoring
Dashboard Template
Build a responsive web interface showing:
Active task definitions and their schedules
Recent execution history with status
System metrics (active/running task counts)
Instance identification and health
Auto-refresh Functionality
Add JavaScript to automatically refresh the dashboard every 30 seconds, providing real-time visibility into system state.
Testing & Demonstration
Local Single-Instance Testing
Start Redis and Application:
# Start Redis server
redis-server
# Run the application
./mvnw spring-boot:run
Verify Basic Functionality:
Open the dashboard at http://localhost:8080/scheduler/dashboard
Watch tasks execute automatically every 30 seconds
Check the H2 console at http://localhost:8080/h2-console
Monitor health at http://localhost:8080/actuator/health
Multi-Instance Distributed Testing
Docker Compose Setup:
For true distributed testing, use Docker Compose to run multiple instances:
# Start all services (Redis, PostgreSQL, 2 scheduler instances)
docker-compose up --build
Observe Distributed Behavior:
Open both dashboards:
Instance 1: http://localhost:8080/scheduler/dashboard
Instance 2: http://localhost:8081/scheduler/dashboard
Watch which instance picks up each task (only one should execute)
Kill one instance and observe failover
Check logs for distributed locking messages
Race Condition Prevention Verification
Test Scenario:
Create a task with high frequency execution (every 10 seconds)
Start multiple instances simultaneously
Monitor execution logs and database records
Verify no duplicate executions occur
Expected Results:
Only one instance executes each task occurrence
Distributed locks prevent race conditions
Failed lock acquisitions are logged properly
System maintains exactly-once execution semantics
Performance and Scalability Testing
Load Testing:
Add multiple task definitions with varying execution times
Monitor system performance under increased load
Test horizontal scaling by adding more instances
Verify resource utilization and response times
Failure Recovery Testing:
Simulate instance crashes during task execution
Verify lock timeout and cleanup mechanisms
Test automatic failover to remaining instances
Confirm no task executions are lost
Success Criteria for Implementation
By the end of this lesson, your distributed scheduler should achieve:
Functional Requirements
Zero Duplicate Executions: Tasks run exactly once regardless of instance count
High Availability: System continues operating despite single instance failures
Complete Visibility: Real-time monitoring of all task states and executions
Distributed Coordination: Proper leader election and lock management
Technical Requirements
Spring Boot Integration: Leverages Spring ecosystem for enterprise features
Database Persistence: Reliable state management with ACID properties
Redis Coordination: Fast distributed locking with automatic cleanup
Production Ready: Comprehensive monitoring, logging, and error handling
Learning Objectives Achieved
Distributed Systems Understanding: Practical experience with real-world challenges
Race Condition Prevention: Hands-on implementation of coordination mechanisms
Architectural Design: Component-based thinking for scalable systems
Production Considerations: Monitoring, observability, and operational practices
Assignment: Architecture Design Challenge
Task: Design a distributed scheduler for a real-world scenario of your choice (e.g., social media platform, delivery service, streaming platform).
Requirements:
Identify 5 specific tasks your system needs to schedule
Determine the execution frequency and dependencies for each task
Design the data flow between components
Specify the technology choices for each component
Create a failure recovery plan for each component
Deliverables:
Architecture diagram showing all components and their interactions
Task definition specifications with execution constraints
Failure scenario analysis with recovery strategies
Solution Hints
Start Simple: Begin with database-based locking before moving to Redis
Monitor Everything: Add metrics from day one—they're crucial for debugging
Test Failures: Regularly test your failure scenarios in development
Performance Baseline: Establish performance benchmarks early to track improvements
Documentation: Maintain clear documentation of your design decisions and trade-offs
The distributed scheduler we're building isn't just code—it's a blueprint for handling one of the most challenging problems in modern system design. Master this, and you'll have the foundation for building truly scalable, resilient systems that companies like Netflix, Uber, and Amazon rely on every day.
Next lesson, we'll dive deep into designing our Task Definition Model, creating the data structures that will power our entire scheduler ecosystem.