Distributed Spring Batch Coordination, Part 6: Developer Control and Partitioning Strategies

#springbatch #java #opensource #cloudnative

🔍 Introduction

In any distributed system, failure is not an exception; it is the norm. Nodes may crash, lose network connectivity, or experience high latency. A resilient Spring Batch framework must detect and recover from such failures on its own, preserving consistency and making progress without manual intervention.

This part of the series focuses on how failure detection and retry logic work in the database-backed Spring Batch coordination model.

💥 Why Failure Handling Matters

Distributed job execution is susceptible to:

💻 Node crashes or shutdowns
🌐 Network issues causing delayed heartbeats
⏳ Transient processing failures (e.g., DB timeout, file lock)
🔁 Unacknowledged or abandoned partitions

Without an intelligent retry mechanism and failure awareness, these issues can lead to:

Stalled jobs
Incomplete processing
Duplicate processing due to incorrect reassignment

🧠 Node Failure Detection

The framework monitors node health using periodic heartbeats recorded in the BATCH_NODES table. A two-step failure detection process is used:

Mark as UNREACHABLE: If a node fails to update its heartbeat within a configurable time window (spring.batch.cluster.unreachableNodeThreshold), other nodes mark it as UNREACHABLE.
Eviction: If the node's heartbeat stays stale past a second configurable threshold (spring.batch.cluster.nodeCleanupThreshold), the node is evicted and deleted from the cluster registry.

This gives temporarily slow nodes time to recover, reducing false positives.
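
To make the two-step check concrete, here is a minimal sketch of how a node's status could be derived from its last heartbeat and the two thresholds. It is an illustration only, not the framework's actual code: the HeartbeatClassifier class, the NodeStatus enum, and the threshold values (30 and 120 seconds) are hypothetical, and the sketch assumes the cleanup threshold is at least as long as the unreachable threshold.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of two-step failure detection.
// These names are NOT part of the framework's API.
final class HeartbeatClassifier {

    enum NodeStatus { ACTIVE, UNREACHABLE, EVICT }

    private final Duration unreachableThreshold; // stands in for unreachableNodeThreshold
    private final Duration cleanupThreshold;     // stands in for nodeCleanupThreshold

    HeartbeatClassifier(Duration unreachableThreshold, Duration cleanupThreshold) {
        // Assumption: eviction should never trigger before the node is marked unreachable.
        if (cleanupThreshold.compareTo(unreachableThreshold) < 0) {
            throw new IllegalArgumentException(
                "cleanup threshold must be >= unreachable threshold");
        }
        this.unreachableThreshold = unreachableThreshold;
        this.cleanupThreshold = cleanupThreshold;
    }

    // Classifies a node by how long it has been silent.
    NodeStatus classify(Instant lastHeartbeat, Instant now) {
        Duration silence = Duration.between(lastHeartbeat, now);
        if (silence.compareTo(cleanupThreshold) > 0) {
            return NodeStatus.EVICT;        // step 2: remove from the registry
        }
        if (silence.compareTo(unreachableThreshold) > 0) {
            return NodeStatus.UNREACHABLE;  // step 1: flag, but allow recovery
        }
        return NodeStatus.ACTIVE;
    }

    public static void main(String[] args) {
        HeartbeatClassifier classifier =
            new HeartbeatClassifier(Duration.ofSeconds(30), Duration.ofSeconds(120));
        Instant now = Instant.now();
        System.out.println(classifier.classify(now.minusSeconds(5), now));   // ACTIVE
        System.out.println(classifier.classify(now.minusSeconds(45), now));  // UNREACHABLE
        System.out.println(classifier.classify(now.minusSeconds(300), now)); // EVICT
    }
}
```

The gap between the two thresholds is the recovery window: a node that resumes heartbeating while UNREACHABLE is classified as ACTIVE again on the next check.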

🔄 Reassignment of Tasks

When a node is marked as unreachable, all active partitions associated with that node are checked for reassignment.

The decision is governed by the master step’s partitioner:

return new ClusterAwarePartitioner() {

    @Override
    public List<ExecutionContext> createDistributedPartitions(int availableNodeCount) {
        // Developer-defined logic to split workload
        // (e.g., based on row ranges, file segments, or other domain inputs)
        return generatePartitions(availableNodeCount);
    }

    @Override
    public PartitionTransferableProp arePartitionsTransferableWhenNodeFailed() {
        return PartitionTransferableProp.YES; // or NO, based on task sensitivity
    }

    @Override
    public PartitionStrategy buildPartitionStrategy() {
        return PartitionStrategy.builder()
            .partitioningMode(PartitioningMode.ROUND_ROBIN)
            .build();
    }
};

If arePartitionsTransferableWhenNodeFailed() returns PartitionTransferableProp.YES, the framework reassigns the failed node's partitions to healthy nodes according to the strategy configured in buildPartitionStrategy() (e.g., round-robin, fixed-node).
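
As a rough illustration of round-robin reassignment, the sketch below spreads a failed node's partitions across the remaining healthy nodes. It is not the framework's implementation: the RoundRobinReassigner class, the string identifiers, and the boolean transferable parameter (standing in for the YES/NO flag) are hypothetical, and returning an empty plan when partitions are not transferable is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of round-robin reassignment; not the framework's API.
final class RoundRobinReassigner {

    // Returns a plan mapping each healthy node to the orphaned partitions it takes over.
    // Assumption: if partitions are not transferable (or no healthy node exists),
    // nothing is moved and the plan is empty.
    static Map<String, List<String>> reassign(List<String> orphanedPartitions,
                                              List<String> healthyNodes,
                                              boolean transferable) {
        Map<String, List<String>> plan = new LinkedHashMap<>();
        if (!transferable || healthyNodes.isEmpty()) {
            return plan;
        }
        for (String node : healthyNodes) {
            plan.put(node, new ArrayList<>());
        }
        // Hand out partitions one at a time, cycling through the healthy nodes.
        for (int i = 0; i < orphanedPartitions.size(); i++) {
            String node = healthyNodes.get(i % healthyNodes.size());
            plan.get(node).add(orphanedPartitions.get(i));
        }
        return plan;
    }

    public static void main(String[] args) {
        List<String> orphaned = List.of(
            "partition-0", "partition-1", "partition-2", "partition-3", "partition-4");
        List<String> healthy = List.of("node-B", "node-C");
        System.out.println(reassign(orphaned, healthy, true));
        // {node-B=[partition-0, partition-2, partition-4], node-C=[partition-1, partition-3]}
    }
}
```

Cycling through the nodes keeps the extra load within one partition of even across the survivors, which is why round-robin is a common default when partitions are roughly equal in cost.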

✅ What’s Next (Part 7 Preview)

In the next part, we’ll cover:

📋 Best practices for running in production
🔐 Securing coordination and metadata tables
🧪 Performance tuning for large-scale workflows
🚀 Integrating with monitoring stacks (Grafana, Prometheus)

👨‍💻 About the Author

Janardhan Reddy Chejarla is a Lead Software Engineer specializing in distributed systems and batch processing frameworks. He is the author of spring-batch-db-cluster-partitioning and a contributor to multiple open-source initiatives.

⭐️ Star the GitHub repo and stay tuned for Part 7 → Production-Grade Best Practices!