Jobs get double processed (if removed immediately upon completion)
Created by: bradvogel
There is a situation in which a job can be processed a second time by the processStalledJobs periodic cleanup job. This only happens when jobs are removed upon completion by the worker that processed them (a common pattern; see #354 (closed)).
We only discovered this because Error: Missing Job xx when trying to move from active to completed showed up in the logs of a server running a high-volume (100 events/sec) Bull queue.
This is how it happens:
Time | Process A | Process B |
---|---|---|
1 | In the regular Bull run loop of Process A, getNextJob moves a job from wait to active | |
2 | | In Process B, processStalledJob happens to run, pulls all active jobs (this.client.lrangeAsync(this.toKey('active'))), and begins to iterate over them |
3 | processes the job normally | pulls job data via Job.fromId (the job data still exists at this point) |
4 | completes processing the job and the application code calls job.remove(), which removes the job data and the lock | |
5 | | calls scripts.getStalledJob, which only checks if the job is in completed (which it isn't anymore, so it continues to grab the lock) |
6 | | job is processed again |
7 | | upon job completion, Error: Missing Job 59694333 when trying to move from active to completed is thrown because the job data doesn't exist anymore |
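
The timeline above can be replayed without Redis. The following is only a minimal sketch: the state object, its field names, and replayRace are hypothetical in-memory stand-ins, not Bull's actual data model or code. It just illustrates why a check of "not in completed and no lock" lets the second process pick up a job that has already been removed.

```javascript
// Hypothetical in-memory stand-in for the Redis keys involved (not Bull's real layout).
function createState() {
  return {
    wait: ['job1'],
    active: [],
    completed: new Set(),
    data: new Map([['job1', { payload: 'hello' }]]),
    locks: new Map(),
  };
}

function replayRace() {
  const s = createState();
  const log = [];

  // Time 1: Process A moves the job from wait to active and takes the lock.
  const id = s.wait.pop();
  s.active.push(id);
  s.locks.set(id, 'A');

  // Time 2: Process B snapshots the active list.
  const snapshot = s.active.slice();

  // Time 3: Process B loads the job data, which still exists.
  const loaded = snapshot.filter((jobId) => s.data.has(jobId));

  // Time 4: Process A finishes and removes the job, its data and its lock.
  s.active.splice(s.active.indexOf(id), 1);
  s.data.delete(id);
  s.locks.delete(id);

  // Time 5: Process B only checks "not completed" and "lock is free".
  for (const jobId of loaded) {
    if (!s.completed.has(jobId) && !s.locks.has(jobId)) {
      s.locks.set(jobId, 'B');
      // Time 6: the job is processed a second time.
      log.push(`B reprocesses ${jobId}`);
      // Time 7: completion fails because the job data is gone.
      if (!s.data.has(jobId)) {
        log.push(`Missing Job ${jobId} when trying to move from active to completed`);
      }
    }
  }
  return log;
}
```

Calling replayRace() returns two log entries: the reprocessing by Process B, followed by the missing-job error.
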
Perhaps a solution here is to have scripts.getStalledJob ensure the job is in the active state, rather than merely checking that it's not in completed, since we now know the job could have been removed before this check runs. If a job is in active AND doesn't have an existing lock, then processStalledJobs knows that no other worker is processing it. However, since the active queue is a LIST, checking for the existence of an element is expensive (it requires iterating over the list in the Lua script).
Our temporary workaround is to delay calling job.remove() at the completion of the job, leaving a 'tombstone' behind so that processStalledJobs will see it in the completed queue and skip over it.
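
A rough sketch of that workaround follows. The helper name removeLater and its injectable schedule parameter are assumptions made for illustration; the only thing assumed about the job object is that it has the remove() method mentioned above.

```javascript
// Hypothetical helper: schedule job.remove() for later instead of calling it right away.
// `schedule` defaults to setTimeout and is injectable so the helper can be exercised
// without waiting for real timers.
function removeLater(job, delayMs, schedule = setTimeout) {
  return schedule(() => {
    // remove() may return a promise; log a failure instead of leaving it unhandled.
    Promise.resolve(job.remove()).catch((err) => {
      console.error('delayed job removal failed:', err);
    });
  }, delayMs);
}
```

A worker might call removeLater(job, delayMs) once the job has been moved to completed. The right delay is not specified here; presumably it needs to be long enough for any in-flight processStalledJobs pass to see the tombstone.
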