[Bug] Delayed jobs get stuck in delayed queue if Redis is busy
Created by: swayam18
Description
When Redis is "busy running a script" the jobs in a queue get stuck in the delayed state unless a new job is added to the queue or the process is restarted. Upon investigation, the culprit is the following line:
https://github.com/OptimalBits/bull/blob/edfbd163991c212dd6548875c22f2745f897ae28/lib/queue.js#L899
If this command ever fails, the recursion breaks and updateDelayTimer is not called again until a new delayed job is added. Since that may never happen, jobs may get permanently stuck in the delayed queue.
Here is the sequence of events that leads to this scenario:
- Redis is busy running a heavy script (for example, queue.clean was run to clear failed jobs).
- During this time, a call to updateDelayTimer is made, which in turn calls the updateDelaySet command: https://github.com/OptimalBits/bull/blob/edfbd163991c212dd6548875c22f2745f897ae28/lib/queue.js#L897 https://github.com/OptimalBits/bull/blob/edfbd163991c212dd6548875c22f2745f897ae28/lib/queue.js#L899
- The updateDelaySet command fails with the following error, as Redis is busy: ReplyError: BUSY Redis is busy running a script. You can only call SCRIPT KILL or SHUTDOWN NOSAVE.
- The promise fails and the catch block simply emits an error: https://github.com/OptimalBits/bull/blob/edfbd163991c212dd6548875c22f2745f897ae28/lib/queue.js#L932
Because of this failure, the updateDelayTimer function is never called after this point, leaving the delayed jobs stuck. The only way to recover them is by adding another delayed job to the queue, which seemingly triggers the message handler to call updateDelayTimer and restart the recursive process.
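The failure mode described above can be reproduced without Redis in a small model. The sketch below is not Bull's code: fakeUpdateDelaySet and this updateDelayTimer are hypothetical stand-ins, and the stub is assumed to reject on its second call the way a BUSY reply would. It only illustrates that a self-rescheduling promise loop stops for good once one iteration rejects and the catch handler does not reschedule.

```javascript
// Simplified model of a self-rescheduling timer loop. This is NOT Bull's code:
// fakeUpdateDelaySet and updateDelayTimer are hypothetical stand-ins.
let calls = 0;
const errors = [];

// Succeeds on the first call, then rejects once the way a BUSY reply would.
function fakeUpdateDelaySet() {
  calls += 1;
  if (calls === 2) {
    return Promise.reject(new Error('BUSY Redis is busy running a script.'));
  }
  return Promise.resolve(5); // ms until the next delayed job is due
}

function updateDelayTimer() {
  return fakeUpdateDelaySet()
    .then(nextDelayMs => {
      // Success: schedule the next iteration of the loop.
      setTimeout(updateDelayTimer, nextDelayMs);
    })
    .catch(err => {
      // Failure: the error is recorded but nothing is rescheduled,
      // so the loop ends here.
      errors.push(err);
    });
}

// Start the loop, give it ample time to keep going, then report.
const done = updateDelayTimer()
  .then(() => new Promise(resolve => setTimeout(resolve, 100)))
  .then(() => {
    console.log(`calls: ${calls}, errors: ${errors.length}`);
    return { calls, errors: errors.length };
  });
```

Run under Node.js, this prints `calls: 2, errors: 1`: the second call fails and no third call is ever made, even though the stub would have succeeded again.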
Proposed Solution
I am not 100% sure if this makes sense, but adding this line of code seems to have fixed the problem:
.catch(err => {
  setTimeout(() => this.updateDelayTimer(), 1000); // <- this line
  this.emit('error', err, 'Error updating the delay timer');
});
Essentially, we retry updateDelayTimer after a constant delay and hope that Redis is no longer busy and can now run the updateDelaySet command.
I am not 100% sure whether this can cause more than one this.updateDelayTimer loop to be active; I would appreciate your feedback on this.
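One possible way to address the concern about multiple active loops is to keep a single timer handle and clear it before every reschedule. The following is a minimal sketch under that assumption, not a patch against Bull: DelayLoop, schedule, close, and the injected updateDelaySet stub are hypothetical names, and the stub is assumed to reject twice before succeeding.

```javascript
// Sketch of a retry that keeps at most one pending timer.
// DelayLoop and the injected updateDelaySet are hypothetical, not Bull's API.
class DelayLoop {
  constructor(updateDelaySet, retryMs = 1000) {
    this.updateDelaySet = updateDelaySet; // returns a promise of ms until the next job
    this.retryMs = retryMs;
    this.timer = null;
    this.closed = false;
    this.lastError = null;
  }

  // Replace any pending timer so two timer chains never run side by side.
  schedule(ms) {
    clearTimeout(this.timer);
    if (!this.closed) {
      this.timer = setTimeout(() => this.updateDelayTimer(), ms);
    }
  }

  updateDelayTimer() {
    return this.updateDelaySet()
      .then(nextDelayMs => this.schedule(nextDelayMs))
      .catch(err => {
        this.lastError = err;        // stands in for emitting an 'error' event
        this.schedule(this.retryMs); // retry after a constant delay
      });
  }

  close() {
    this.closed = true;
    clearTimeout(this.timer);
  }
}

// Stub: the first two calls fail as if Redis were busy, later calls succeed.
let calls = 0;
function flakyUpdateDelaySet() {
  calls += 1;
  if (calls <= 2) {
    return Promise.reject(new Error('BUSY Redis is busy running a script.'));
  }
  return Promise.resolve(10000); // next delayed job is far in the future
}

const loop = new DelayLoop(flakyUpdateDelaySet, 5);
loop.updateDelayTimer();
loop.updateDelayTimer(); // simulate a second trigger, e.g. from a message handler

const done = new Promise(resolve => setTimeout(resolve, 100)).then(() => {
  loop.close(); // clear the pending timer so the process can exit
  console.log(`calls: ${calls}`);
  return calls;
});
```

Here updateDelayTimer is triggered twice while the stub is failing, yet only one retry fires, so the stub is called three times in total rather than four. Whether the same guard fits Bull's internal timer handling would need confirmation from the maintainers.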
Minimal, Working Test code to reproduce the issue.
Let me know if this is necessary and I will create a repo with the code needed to reproduce the issue.
Bull version
3.22.1