Jump to content
OpenSplice DDS Forum
Bill

writer reached maximum heartbeat-frag sequence number

Recommended Posts

Running P637-OpenSpliceDDSV6.3.2p2-HDE-x86.win-vs2010

We have an application using DDS between multiple computers. After approximately 24 hours (happens reliably), we see the following error in the log after the application crashes:

"writer reached maximum heartbeat-frag sequence number"

Examination of the source code shows:

q_transmit.c :: create_HeartbeatFrag()

  if (wr->hbfragcount == DDSI_COUNT_MAX)
    NN_FATAL0 ("writer reached maximum heartbeat-frag sequence number");
  hbf->count = ++wr->hbfragcount;
#define NN_FATAL0(fmt) do { \
    nn_log (LC_FATAL, (fmt)); \
    os_report (OS_FATAL, config.servicename, __FILE__, __LINE__, 0, (fmt)); \
    abort (); \
  } while (0)

So it appears on the surface that each time a heartbeat fragment is created, a counter is incremented, and if the counter exceeds the maximum (32 bit max value), an abort() is executed and the entire program exits. The counter does not appear to every be reset except in the case of creating the initial connection. There is similar behavior in q_xevent.c :: add_AckNack(). This would correspond to roughly 20us traffic, which is approximately what we see using wireshark.

 

Can someone please describe the expected behavior in these cases, and what possible fixes can be applied? I have looked at version notes all the way through V6.8 and there appears to be no mention of this issue. Cannot find any mention in the forum either.

 

Thanks.

Share this post


Link to post
Share on other sites

Hi Bill,

The rule that we always try to follow is to never generate invalid messages, and in this case, that means once you reach 2^31-1, you cannot continue while remaining compliant with the specification. After all, it states that this particular sequence number is of type “Count_t”, described as a “[t]ype used to encapsulate a count that is incremented monotonically, used to identify message duplicates.” And quite obviously, you can’t increment a signed 32-bit (two’s complement) number past 2^31-1.

So what does one do in a case like this? Clearly the correct answer is not to crash but to let it roll over anyway, but perhaps out of frustration with some blatant errors in the specification that wasn’t the initial implementation. Needless to say, this should have been addressed before releasing, but somehow it slipped through the cracks. If it is any consolation, you are the first ever to report running into this.

It has been fixed long since; that the release notes don’t show it is an oversight. if you upgrade to the current version you will not encounter it anymore and moreover benefit from the many other improvements made since the 6.3 release — including some fixes that address an issue where the data path can stall when just the right packets get lost while sending fragmented data.

(If you're on the community edition, then you can also fix this by deleting the problematic two lines — and then please also take out two analogous cases in q_xevent.c — just search for DDSI_COUNT_MAX.)

Best regards,

Erik

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×