🔍 Introduction
In any distributed system, failure is the norm, not the exception. Nodes may crash, lose network connectivity, or experience high latency. A resilient Spring Batch framework must detect and recover from such failures gracefully, ensuring consistency and progress without manual intervention.
This part of the series focuses on how failure detection and retry logic work in the database-backed Spring Batch coordination model.
💥 Why Failure Handling Matters
Distributed job execution is susceptible to:
- 💻 Node crashes or shutdowns
- 🌐 Network issues causing delayed heartbeats
- ⏳ Transient processing failures (e.g., DB timeout, file lock)
- 🔁 Unacknowledged or abandoned partitions
Without an intelligent retry mechanism and failure awareness, these issues can lead to:
- Stalled jobs
- Incomplete processing
- Duplicate processing due to incorrect reassignment
🧠 Node Failure Detection
The framework monitors node health using periodic heartbeats recorded in the BATCH_NODES table. A two-step failure detection process is used:
- Mark as UNREACHABLE: If a node fails to update its heartbeat within a configurable time window (spring.batch.cluster.unreachableNodeThreshold), other nodes mark it as UNREACHABLE.
- Eviction: If the node remains stale beyond the configured threshold (spring.batch.cluster.nodeCleanupThreshold), it's evicted and deleted from the cluster registry.
This gives temporarily slow nodes time to recover, reducing false positives.
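The two-step check above can be sketched as a small classification function. This is only an illustration, not the framework's actual code: the class NodeHealthSketch, the Status enum, and the classify method are hypothetical names, and the sketch assumes both thresholds are measured from the node's last recorded heartbeat and that the cleanup threshold is the larger of the two.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration; not part of the framework's API.
final class NodeHealthSketch {

    enum Status { ACTIVE, UNREACHABLE, EVICT }

    /**
     * Classifies a node by how long it has been silent.
     * Assumption: both thresholds are measured from the last heartbeat,
     * and cleanupThreshold is larger than unreachableThreshold.
     */
    static Status classify(Instant lastHeartbeat, Instant now,
                           Duration unreachableThreshold, Duration cleanupThreshold) {
        Duration silence = Duration.between(lastHeartbeat, now);
        if (silence.compareTo(cleanupThreshold) > 0) {
            return Status.EVICT;        // stale for too long: remove from the registry
        }
        if (silence.compareTo(unreachableThreshold) > 0) {
            return Status.UNREACHABLE;  // missed heartbeats, but may still recover
        }
        return Status.ACTIVE;           // heartbeat is recent enough
    }
}
```

With an unreachable threshold of 30 seconds and a cleanup threshold of 120 seconds (illustrative values only), a node that has been silent for 60 seconds would be flagged as unreachable but kept in the registry, which is the grace period described above.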
🔄 Reassignment of Tasks
When a node is marked as unreachable, all active partitions associated with that node are checked for reassignment.
The decision is governed by the master step’s partitioner:
return new ClusterAwarePartitioner() {

    @Override
    public List<ExecutionContext> createDistributedPartitions(int availableNodeCount) {
        // Developer-defined logic to split workload
        // (e.g., based on row ranges, file segments, or other domain inputs)
        return generatePartitions(availableNodeCount);
    }

    @Override
    public PartitionTransferableProp arePartitionsTransferableWhenNodeFailed() {
        return PartitionTransferableProp.YES; // or NO, based on task sensitivity
    }

    @Override
    public PartitionStrategy buildPartitionStrategy() {
        return PartitionStrategy.builder()
                .partitioningMode(PartitioningMode.ROUND_ROBIN)
                .build();
    }
};
If this PartitionTransferableProp flag is set to YES, the framework reassigns those partitions to healthy nodes based on the configured strategy (e.g., round-robin, fixed-node).
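To make the round-robin case concrete, here is a minimal sketch of how orphaned partitions might be spread across the remaining healthy nodes. It is an assumption-laden illustration rather than the framework's implementation: ReassignmentSketch and reassignRoundRobin are hypothetical names, partitions are represented by plain integer IDs, and the transferable flag stands in for PartitionTransferableProp.YES or NO. How the framework treats non-transferable partitions is not covered here, so the sketch simply leaves them unassigned.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the framework's API.
final class ReassignmentSketch {

    /**
     * Distributes the failed node's partitions over the healthy nodes in turn.
     * Returns an empty plan when partitions are not transferable or when
     * no healthy node is available.
     */
    static Map<String, List<Integer>> reassignRoundRobin(List<Integer> orphanedPartitionIds,
                                                         List<String> healthyNodes,
                                                         boolean transferable) {
        Map<String, List<Integer>> plan = new LinkedHashMap<>();
        if (!transferable || healthyNodes.isEmpty()) {
            return plan; // nothing is moved
        }
        for (String node : healthyNodes) {
            plan.put(node, new ArrayList<>());
        }
        for (int i = 0; i < orphanedPartitionIds.size(); i++) {
            // Cycle through the healthy nodes one partition at a time.
            String target = healthyNodes.get(i % healthyNodes.size());
            plan.get(target).add(orphanedPartitionIds.get(i));
        }
        return plan;
    }
}
```

Calling reassignRoundRobin(List.of(1, 2, 3), List.of("node-a", "node-b"), true) assigns partitions 1 and 3 to node-a and partition 2 to node-b. A real implementation would also need to guard against two surviving nodes reassigning the same partition concurrently, for example through a database-level lock or conditional update, to avoid the duplicate processing mentioned earlier.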
✅ What’s Next (Part 7 Preview)
In the next part, we’ll cover:
- 📋 Best practices for running in production
- 🔐 Securing coordination and metadata tables
- 🧪 Performance tuning for large-scale workflows
- 🚀 Integrating with monitoring stacks (Grafana, Prometheus)
👨‍💻 About the Author
Janardhan Reddy Chejarla is a Lead Software Engineer specializing in distributed systems and batch processing frameworks. He is the author of spring-batch-db-cluster-partitioning and a contributor to multiple open-source initiatives.
⭐️ Star the GitHub repo and stay tuned for Part 7 → Production-Grade Best Practices!