Race condition during bilateral initial resend request
Created by: sami-sweng
Race condition during bilateral initial resend request
I believe I have found a race condition when the initiator successfully connects to an acceptor. The specific scenarios which cause a bug is when both have a few messages of advance at the time of the first successful logon and the initiator unsent messages are all administrative messages (Logon).
Expected sequence of event
The expected flow is as follow:
Direction | Message type |
---|---|
init -> acc | Logon |
acc -> init | Logon |
acc -> init | ResendRequest |
init -> acc | ResendRequest |
acc -> init | News (the 1st message to resend) |
acc -> init | News (the 2nd message to resend) |
acc -> init | SequenceReset |
init -> acc | SequenceReset |
Issue
The issue is that sometimes, the initiator sends the sequenceReset
immediatly after the resendRequest
of the acceptor
and then sends it's own resendRequest
.
The sequenceReset
is visibly out of order as it has a wrong timestamp (a later one) and the resendRequest
continues with the sequence number of before the sequence reset.
The issue is that the acceptor that I'm connecting to (which is using Quickfix Java I think) doesn't tolerate the wrong sequence number and rejects it with a Logout. And thus the resend mechanism fails and the acceptor messages are lost.
Actual sequence of event
Here are the outputed FIX messages during the scenario:
8=FIX.4.4|A9=76|A35=A|A34=4|A49=INIT|A52=20210525-22:05:30.686|A56=ACC|A369=0|A98=0|A108=90|A10=220|A
8=FIX.4.4|A9=71|A35=A|A34=14|A49=ACC|A52=20210525-22:05:30.803|A56=INIT|A98=0|A108=90|A10=239|A
8=FIX.4.4|A9=68|A35=2|A34=15|A49=ACC|A52=20210525-22:05:30.804|A56=INIT|A7=1|A16=0|A10=068|A
8=FIX.4.4|A9=107|A35=4|A34=1|A43=Y|A49=INIT|A52=20210525-22:05:30.833|A56=ACC|A122=20210525-22:05:30.833|A369=15|A36=6|A123=Y|A10=013|A
8=FIX.4.4|A9=73|A35=2|A34=5|A49=INIT|A52=20210525-22:05:30.821|A56=ACC|A369=0|A7=1|A16=0|A10=030|A
8=FIX.4.4|A9=108|A35=5|A34=16|A49=ACC|A52=20210525-22:05:30.848|A56=INIT|A58=MsgSeqNum too low, expecting 6 but received 5|A10=159|A
8=FIX.4.4|A9=76|A35=A|A34=7|A49=INIT|A52=20210525-22:05:40.882|A56=ACC|A369=1|A98=0|A108=90|A10=223|A
8=FIX.4.4|A9=71|A35=A|A34=17|A49=ACC|A52=20210525-22:05:40.996|A56=INIT|A98=0|A108=90|A10=000|A
8=FIX.4.4|A9=68|A35=2|A34=18|A49=ACC|A52=20210525-22:05:40.997|A56=INIT|A7=6|A16=0|A10=090|A
8=FIX.4.4|A9=107|A35=4|A34=6|A43=Y|A49=INIT|A52=20210525-22:05:41.031|A56=ACC|A122=20210525-22:05:41.031|A369=18|A36=9|A123=Y|A10=008|A
8=FIX.4.4|A9=73|A35=2|A34=8|A49=INIT|A52=20210525-22:05:41.015|A56=ACC|A369=1|A7=2|A16=0|A10=032|A
8=FIX.4.4|A9=108|A35=5|A34=19|A49=ACC|A52=20210525-22:05:41.046|A56=INIT|A58=MsgSeqNum too low, expecting 9 but received 8|A10=160|A
Root cause
I believe I identified the root cause of the issue.
There are essentially three goroutines involved in processing messages in quickfixgo:
- the read loop (
go readLoop()
), which parse and sends messages to the messageIn channel. - the processing loop (
session.run()
), which selects from either messageIn (message to be processed) or messageEvent (signaling message in the sending queue) - the write loop (
go writeLoop()
)
Step 1
When the acceptor Logon message is received, we are in the LogonState
.
The Logon message is written in the messageIn
channel, and the session gets it and forwards it to the FixMsgIn()
of the LogonState
.
So the FixMsgIn()
of the logon state essentially create a resendRequest
which is simply put in the queue of message to send as we are not logged and trigger the messageEvent
signal.
This puts us in the state resendState
which in our case will simply proxy the next FixMsgIn
to the loggedOn
state as there is nothing else to do.
Step 2.a
The session get the messageEvent
signal from step 1, and as the resendState
signal is considered a loggedOn
state, it dequeue the single resend message and immediately sends it.
Step 2.b
The readLoop
gets the resendRequest
of the acceptor and writes it in the messageIn
channel.
The session get the message and submits it to the FixMsgIn
of the loggedOn
state. This creates a resendRequest
and IMMEDIATELY sends it as bytes.
Issue
The issue is that the step 2.a and 2.b are concurrent, as during the time requested to perform step 1, the readloop
could have first written the resendRequest
in the messageIn
which would cause step 2.b to be executed before step 2.a.
The problem is that in that case, step 2.b ignores the fact that there is already a message to send in the queue (the resend) and bypass it.
Possible remediation
If step 2.b happens first, if instead of immediately sending the sequenceReset
, it first send the queued resendRequest
, no out of order messages are sent.