How to Build a Distributed Task Scheduler in Java Spring Boot — Complete Guide
Every production backend has the same silent problem.
You deploy your Spring Boot app to three nodes. Each one has a @Scheduled method. Every 5 minutes, all three fire: processing the same records, sending the same emails, charging the same customers more than once.
@Scheduled works perfectly on one machine. It silently breaks on two.
Why @Scheduled Isn’t Enough
Spring Boot’s built-in @Scheduled annotation is the right starting point. fixedRate, fixedDelay, cron expressions — all clean, all simple. But it runs on a single thread by default, and it knows nothing about other JVM instances.
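fixedRate and fixedDelay differ in where the next run is measured from. That is easiest to see as arithmetic on start times. The sketch below is a simplified model, not Spring’s scheduler code. It assumes one scheduler thread and a constant task duration. It also assumes that a late fixedRate run starts as soon as the previous one finishes, so runs never overlap. ScheduleModel and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of when a single-threaded scheduler starts each run (times in seconds). */
final class ScheduleModel {

    /** fixedRate: run n is due at n * period, but cannot start before run n-1 has finished. */
    static List<Long> fixedRateStarts(long period, long duration, int runs) {
        List<Long> starts = new ArrayList<>();
        long previousEnd = 0;
        for (int n = 0; n < runs; n++) {
            long start = Math.max(n * period, previousEnd);
            starts.add(start);
            previousEnd = start + duration;
        }
        return starts;
    }

    /** fixedDelay: the next run starts a fixed delay after the previous run finishes. */
    static List<Long> fixedDelayStarts(long delay, long duration, int runs) {
        List<Long> starts = new ArrayList<>();
        long start = 0;
        for (int n = 0; n < runs; n++) {
            starts.add(start);
            start = start + duration + delay;
        }
        return starts;
    }
}
```

With a 5-second period and a 2-second task, fixedRate starts runs at 0, 5 and 10 s, while fixedDelay starts them at 0, 7 and 14 s. If the task grows to 7 seconds, fixedRate runs back to back at 0, 7 and 14 s. The rate is a target, not a guarantee.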
The moment you horizontally scale — even to just two pods — you have a race condition. Multiple instances competing to execute the same task with no coordination. This is Day 4 of the curriculum, and it’s the moment most engineers realize they’ve been shipping broken schedulers for years.
The fix isn’t a library swap. It’s a system design problem.
The Architecture of a Distributed Scheduler
A production distributed scheduler has four layers:
1. Task Definition & Persistence
Tasks aren’t just annotations. They’re entities, with a name, schedule expression, payload, status, and retry policy, stored in PostgreSQL with Spring Data JPA. This is what lets you create, update, pause, and inspect tasks at runtime (Days 6–9).
2. Distributed Coordination
Two mechanisms prevent duplicate execution:
- Distributed Locks: using Redis SET NX EX (Days 14–15). The first node to acquire the lock runs the task, and the others skip it. A single Redis node is itself a point of failure for the lock, so the Redlock algorithm acquires the lock on a majority of independent Redis nodes instead.
- Leader Election: one node becomes the scheduler “leader” via a database lease/heartbeat mechanism. Only the leader enqueues tasks. The others stand by and re-elect if the leader goes silent (Days 16–18).
3. Execution Engine
Tasks are encapsulated as Runnable or Callable. A custom ThreadPoolTaskScheduler bean replaces Spring’s default single-thread executor. Long-running jobs are dispatched asynchronously via @Async (Days 3, 10, 28).
4. Observability & Resilience
Retry logic with exponential backoff, circuit breakers for failing external dependencies, dead-letter queues for permanently failing tasks, and Prometheus metrics for execution counts, latency, and thread pool saturation (Days 30–45, 55–60).
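Layers 1 and 4 meet when a task keeps failing. The sketch below is plain Java with no Spring or JPA. It assumes a task record whose retry policy is reduced to a maximum attempt count. It retries a failing task and moves it to an in-memory dead-letter queue once the attempts run out. In the real system the record would be a JPA entity and the queue a table or broker topic. TaskDefinition, TaskOutcome and RetryingExecutor are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

/** Hypothetical task definition: name, schedule expression, payload and a retry limit. */
record TaskDefinition(String name, String cron, String payload, int maxAttempts) {}

/** Hypothetical final outcome of one dispatch through the retry loop. */
enum TaskOutcome { SUCCEEDED, DEAD_LETTERED }

/** Retries a failing task up to its limit, then parks it in a dead-letter queue. */
final class RetryingExecutor {
    // In-memory stand-in for a dead-letter table or topic.
    final Deque<TaskDefinition> deadLetters = new ArrayDeque<>();

    /** The handler returns true on success; an exception counts as a failed attempt. */
    TaskOutcome run(TaskDefinition task, Predicate<TaskDefinition> handler) {
        for (int attempt = 1; attempt <= task.maxAttempts(); attempt++) {
            try {
                if (handler.test(task)) {
                    return TaskOutcome.SUCCEEDED;
                }
            } catch (RuntimeException e) {
                // Swallowed for brevity; a real executor would log it and record a metric.
            }
            // A real executor would wait a backoff delay here before the next attempt.
        }
        deadLetters.add(task);
        return TaskOutcome.DEAD_LETTERED;
    }
}
```

Consider a task with maxAttempts = 3 whose handler always throws. The handler runs three times and the task ends up in deadLetters. A handler that fails once and then succeeds returns SUCCEEDED on the second attempt.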
The Core Technical Problems (and How to Solve Them)
The Duplicate Execution Problem
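Problem: Two nodes pick up the same task simultaneously. Solution: A Redis-backed distributed lock. Before executing, acquire LOCK:task:{task_id} with a TTL. The TTL must be longer than the task’s expected runtime but short enough that a crashed node doesn’t block the task for long. Release the lock after execution, but only if this node still owns it. If the node crashes mid-execution, the TTL expires on its own, so no lock stays stuck.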
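```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.springframework.data.redis.core.script.RedisScript;
// Simplified Redis lock pattern (assumes redisTemplate is a StringRedisTemplate)
String lockKey = "LOCK:task:" + taskId;
// SET lockKey nodeId NX EX 30: succeeds only if no other node holds the lock
Boolean acquired = redisTemplate.opsForValue()
        .setIfAbsent(lockKey, nodeId, 30, TimeUnit.SECONDS);
if (Boolean.TRUE.equals(acquired)) {
    try {
        executeTask(taskId);
    } finally {
        // Atomic compare-and-delete: a plain delete could remove a lock that
        // another node acquired after ours expired mid-execution.
        RedisScript<Long> release = RedisScript.of(
                "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end",
                Long.class);
        redisTemplate.execute(release, List.of(lockKey), nodeId);
    }
}
```
The Single Point of Failure Problem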
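Problem: Your designated “scheduler node” goes down. No tasks run. Solution: Database-backed leader election. Every node tries to upsert a scheduler_leader row that holds a lease with a 30-second TTL. The node that wins becomes leader and keeps renewing the lease while it is healthy. All other nodes poll every 10 seconds. If the leader stops renewing, the first poll after the lease expires triggers re-election. Failover takes at most about one TTL plus one poll interval, roughly 40 seconds with these numbers, and needs no manual intervention. No ZooKeeper is required for simple cases.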
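The lease logic can be modeled without a database. In the sketch below, an in-memory object stands in for the scheduler_leader row. A synchronized method plays the role of the atomic conditional upsert. In PostgreSQL that could be a single INSERT ... ON CONFLICT ... DO UPDATE ... WHERE statement. LeaderLease, its methods, and the explicit nowMillis parameter, which makes the clock testable, are hypothetical.

```java
/** Hypothetical in-memory stand-in for the single scheduler_leader row. */
final class LeaderLease {
    private static final long TTL_MILLIS = 30_000; // 30-second lease, as in the text

    private String leaderId;      // node currently holding the lease, or null
    private long expiresAtMillis; // when the current lease lapses

    /**
     * Atomic "acquire or renew". It succeeds if nobody holds the lease, the lease
     * has expired, or the caller already holds it. This mirrors a conditional upsert.
     */
    synchronized boolean tryAcquire(String nodeId, long nowMillis) {
        boolean free = leaderId == null || nowMillis >= expiresAtMillis;
        if (free || nodeId.equals(leaderId)) {
            leaderId = nodeId;
            expiresAtMillis = nowMillis + TTL_MILLIS;
            return true;
        }
        return false;
    }

    /** A node should only enqueue tasks while this returns true for it. */
    synchronized boolean isLeader(String nodeId, long nowMillis) {
        return nodeId.equals(leaderId) && nowMillis < expiresAtMillis;
    }
}
```

Suppose node A renews at t = 10 s and then crashes. Node B’s poll at t = 30 s still fails, because the lease runs until t = 40 s. Its poll at t = 40 s succeeds. The gap between crash and takeover is the failover window that the TTL and poll interval trade off.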
The Thundering Herd Problem
Problem: Your app restarts. 200 delayed tasks all fire simultaneously. The database collapses. Solution: Add jitter. Don’t fire all tasks immediately on startup; spread them with randomized initial delays. The same principle applies to retry backoff: delay = baseDelay * 2^attempt + random(0, 1000ms).
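Both kinds of jitter fit in a few lines of plain Java. The sketch below implements the retry formula from the text and a randomized startup delay. The cap on the exponential term (maxDelayMillis) is an added assumption that the formula above does not include. The Jitter class and its method names are hypothetical. The Random is injected so tests can seed it.

```java
import java.util.Random;

/** Hypothetical helpers for spreading load on restart and on retry. */
final class Jitter {
    private final Random random;

    Jitter(Random random) {
        this.random = random; // inject a seeded Random for reproducible tests
    }

    /** delay = min(baseDelay * 2^attempt, maxDelay) + random(0, 1000ms); the cap is an assumption. */
    long retryDelayMillis(long baseDelayMillis, int attempt, long maxDelayMillis) {
        long exponential = baseDelayMillis * (1L << Math.min(attempt, 30)); // bound the shift
        long capped = Math.min(exponential, maxDelayMillis);
        return capped + random.nextInt(1001); // 0..1000 ms inclusive
    }

    /** Random initial delay in [0, windowMillis) so restarted tasks do not all fire at once. */
    long startupDelayMillis(long windowMillis) {
        return (long) (random.nextDouble() * windowMillis);
    }
}
```

With a 500 ms base delay, attempts 0 to 3 wait about 0.5, 1, 2 and 4 seconds, each plus up to one second of random spread. The cap keeps later attempts from growing without bound.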
The Long-Running Task Problem
Problem: A 10-minute report generation blocks your scheduler thread. Nothing else runs on time. Solution: @Async + separate thread pool. The scheduler thread hands the task to a dedicated executor and returns immediately. Thread pool sizing becomes a tunable operational parameter, not a bottleneck.
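Outside Spring, the same handoff can be shown with plain java.util.concurrent executors. A single scheduler thread, roughly what Spring’s default scheduler gives you, only submits work to a separate worker pool. A slow job therefore cannot delay the next tick. This is a stdlib analogue of @Async with a dedicated executor, not Spring’s implementation. The HandoffScheduler name and the pool size of 4 are assumptions.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical scheduler that never runs job bodies on its own thread. */
final class HandoffScheduler implements AutoCloseable {
    // One scheduling thread, like Spring's default single-threaded scheduler.
    private final ScheduledExecutorService ticker = Executors.newSingleThreadScheduledExecutor();
    // Separate, sized pool that does the actual work (the role of the @Async executor).
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    /** Each tick only hands the job to the worker pool and returns immediately. */
    void scheduleAtFixedRate(Runnable job, long periodMillis) {
        ticker.scheduleAtFixedRate(() -> workers.execute(job), 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** One-off job after a delay, also executed on the worker pool. */
    void scheduleOnce(Runnable job, long delayMillis) {
        ticker.schedule(() -> workers.execute(job), delayMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        ticker.shutdownNow();
        workers.shutdownNow();
    }
}
```

The only tunables here are the worker pool size and the queue behind it. A 10-minute job occupies one worker, not the ticker.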
What the Full 60-Day Curriculum Covers
| Module | Days | Focus |
|---|---|---|
| Foundations | 1–10 | @Scheduled, thread pools, task persistence, dynamic scheduling |
| Distribution | 11–20 | Distributed locks, Redis, leader election, failover |
| Resilience | 21–30 | Retries, circuit breakers, dead-letter queues, @Async |
| Advanced Patterns | 31–40 | Message queues (Kafka/RabbitMQ), idempotency, priority queues |
| Observability | 41–50 | Prometheus, Grafana, distributed tracing, alerting |
| Production | 51–60 | Kubernetes, containerization, security, multi-tenancy, virtual threads |
Every lesson ships with runnable Spring Boot code. Not pseudocode: actual code that builds with mvn clean package and comes with tests.
Key Concepts You’ll Master
- Cron expressions, and why fixedRate vs fixedDelay is a semantic decision, not a style choice
- Pessimistic vs optimistic locking: when SELECT FOR UPDATE is fine and when it kills throughput
- Redlock: why the naive Redis lock has a failure mode and how Redlock addresses it
- Idempotency keys: how to make task execution safe to retry without repeating side effects
- Virtual threads (Java 21): how Project Loom changes the thread-per-task calculus
- Multi-tenancy: scheduling tasks for thousands of tenants without one tenant starving others
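An idempotency key records that a given logical execution has already been claimed, and that record is made before the side effect runs. In this stdlib sketch, a ConcurrentHashMap stands in for a database table with a unique constraint on the key. The key format (task id plus scheduled fire time) and the IdempotentRunner name are assumptions.

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical guard that lets each (task, fire time) pair take effect at most once. */
final class IdempotentRunner {
    // Stand-in for a table with a UNIQUE constraint on the key column.
    private final Map<String, Boolean> processed = new ConcurrentHashMap<>();

    /** Builds a key that is identical for every retry or duplicate of the same logical run. */
    static String key(String taskId, Instant scheduledFireTime) {
        return taskId + "@" + scheduledFireTime;
    }

    /** Returns true if the action ran now, false if this key was already claimed. */
    boolean runOnce(String key, Runnable action) {
        if (processed.putIfAbsent(key, Boolean.TRUE) != null) {
            return false; // duplicate: another node or an earlier attempt already claimed it
        }
        try {
            action.run();
            return true;
        } catch (RuntimeException e) {
            processed.remove(key); // release the claim so a later retry can try again
            throw e;
        }
    }
}
```

The key comes from the scheduled fire time, not from the moment of execution. Two nodes that both pick up the 00:05 run therefore compute the same key, and only one of them performs the side effect.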
Who This Is For
This course was built for engineers who already know Spring Boot basics and want to understand the distributed systems layer underneath. If you’ve ever wondered why Quartz, ShedLock, or JobRunr exist — and what problems they’re actually solving — this curriculum builds that understanding from scratch rather than wrapping a library.
The full 60-lesson course is free to read at javatsc.substack.com.
Start from Day 1. By Day 5 you’ll understand the problem space. By Day 20 you’ll have a working distributed scheduler. By Day 60 you’ll have the production-grade version.

