Choosing the correct Flink Restart Strategy & avoiding Production Gotchas

Having spent the last year and a half designing and developing real-time data pipelines, here are my 2 cents on the production gotchas I encountered with Flink’s restart strategy, which I wish I had known sooner. Also, if you are new to stream system development, check out my previous article where I wrote about the change of mindset required to move from batch to stream development. With all that out of the way, let’s get started on the current article!!

What is a Restart Strategy

A restart strategy is Flink’s primary failover mechanism and defines how the pipeline behaves when it encounters an exception. It is a per-job setting that instructs the JobManager whether, and how, to restart and re-run a job after a failure so that it can recover. The number of times your job will be restarted, if at all, depends on which strategy you choose and how you configure it. Let’s delve a bit deeper into this.
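
For context, here is a minimal sketch of where this setting lives when using the DataStream API. The class name and toy pipeline are placeholders of my own, and noRestart() is used only to show the call site; the two usable strategies are covered below.

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The restart strategy is a per-job setting on the execution environment.
        // noRestart() is just the simplest possible choice to show where it goes.
        env.setRestartStrategy(RestartStrategies.noRestart());

        // Toy pipeline so the job has something to run.
        env.fromElements(1, 2, 3).print();
        env.execute("restart-strategy-demo");
    }
}

A restart strategy can also be defined cluster-wide in flink-conf.yaml, in which case jobs that don’t set one explicitly fall back to it.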

Characteristics of an Ideal Restart Strategy

The strategy should be resilient enough to retry in the event of a failure, but at the same time it shouldn’t mindlessly get stuck in a Restarting-Failed-Restarting loop.
Basically, create a strategy like Wolverine, which recovers and heals quickly from failures but dies as a hero when the end is near. On the other hand, retrying Integer.MAX_VALUE times, which, for some reason, is the default [WHYYYYY!!!], is a terrible idea, because that job becomes tougher to kill than Thanos and takes down half of the cluster resources with it as it becomes immortal thanks to a rather generous 2147483647 attempts.

The 2 Options [Batch vs Stream]

Flink offers 2 usable types of restart strategies [the third being No-Restart, which is sorta unusable ;_; ]:
  • Fixed Delay Restart Strategy [Suited for batch jobs]
  • Failure Rate Restart Strategy [Suited for streaming jobs]

Fixed Delay Restart Strategy is the default strategy Flink uses. It restarts the job a fixed number of times [please don’t use the default value] after a failure. This makes sense for a batch job, which has a definite end at some point in time and can be rerun entirely since the input is still with us. Also, batch programs aren’t used in real-time applications anyway, so an error or some downtime is easier to digest. For the snippet below, the JobManager will wait for 10 seconds after a failure and then restart the failed job. This happens up to 5 times. After the 5th restart, the next failure won’t lead to a restart and the job will crash and stop.

RestartStrategies.fixedDelayRestart(
  5, // number of restart attempts
  Time.of(10, TimeUnit.SECONDS) // delay between attempts
)
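
For completeness, here is roughly how that snippet would be wired into a batch job. The class name and toy DataSet program are placeholders; only the setRestartStrategy call matters.

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.concurrent.TimeUnit;

public class FixedDelayBatchJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // On failure: wait 10 seconds, restart, and repeat at most 5 times.
        // The failure after the 5th restart is fatal and the job stays FAILED.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                5,
                Time.of(10, TimeUnit.SECONDS)));

        // Toy DataSet program; print() triggers execution in the batch API.
        env.fromElements(1, 2, 3).print();
    }
}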

With the Failure Rate Restart Strategy, the JobManager keeps restarting the job until the job has failed a specific number of times within a given time interval. For the configuration below, the JobManager will keep restarting the job until the number of failures crosses 5 within an interval of 5 minutes.

RestartStrategies.failureRateRestart(
  5, // max failures per interval
  Time.of(5, TimeUnit.MINUTES), // time interval for measuring failure rate
  Time.of(10, TimeUnit.SECONDS) // delay between restarts
)
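
And here is a sketch of that same configuration attached to a streaming job’s environment. The class name and toy source are placeholders; in a real pipeline the source would be Kafka or similar.

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.concurrent.TimeUnit;

public class FailureRateStreamingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep restarting (after a 10-second pause) as long as failures stay within
        // the rate: at most 5 failures inside any 5-minute window. Only once the
        // rate is exceeded does the job give up and go to FAILED.
        env.setRestartStrategy(RestartStrategies.failureRateRestart(
                5,                               // max failures per interval
                Time.of(5, TimeUnit.MINUTES),    // interval for measuring the rate
                Time.of(10, TimeUnit.SECONDS))); // delay between restarts

        // Toy source to keep the sketch self-contained and runnable.
        env.fromElements("a", "b", "c").print();
        env.execute("failure-rate-demo");
    }
}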

Why use Failure Rate strategy for Streaming jobs

Streaming jobs are supposed to run continuously, and using a Fixed Delay Restart Strategy is like putting an expiry date on a job that should never end. Distributed systems are fragile collections of commodity hardware tied together by a network; failures will eventually occur and cause exceptions. We should ideally let these exceptions propagate up the stack so that they can be logged and acted upon, by which I mean: crash and restart the job [and try to re-pick and reprocess the message]. And that’s precisely why capping the number of attempts at a fixed value doesn’t work.

Let’s say our Kafka cluster is under load and a produce request leads to a TimeoutException. By the time Kafka stabilizes, our Flink job has restarted 3 times and now has just 2 attempts left. After 24 hours we face the Kafka issue again, and this time Kafka comes back online after our job has restarted another 2 times.

Now the job is technically FUBAR. It has run out of its retry attempts and has therefore lost its failover and fault-tolerance traits. The next occurrence of any stray exception will stop our job, and it won’t be restarted by the JobManager. That’s not something we want in a streaming app.

Remember the point I mentioned about the ideal restart strategy: we want the job to be persistent and resilient to failures without overwhelming the cluster by restarting indefinitely. We achieve that by restarting the job with an error threshold [a specific number of failures in a given interval]. Once this threshold is crossed, the job gives up and doesn’t restart anymore. This makes sure that stray exceptions from the network or any external dependency, which happen sporadically and are beyond our control, won’t slowly eat away at our job’s failover mechanism.

So there you have it. I hope this article helps streaming developers who are new to Flink avoid the mistake of configuring the restart strategy incorrectly for their pipeline jobs.