OpenSplice DDS Forum
Chris Roberts

Blocking with Takes/Reads

I am seeing intermittent latency spikes in a real-time application calling 'take' on topics.  Generally it works as expected: whether or not data is available, the call does not block and completes in microseconds.  Occasionally, however, a single call to 'take' takes anywhere from 2ms to over 20ms.  This happens both when data is available and when it is not.  The size of the data does not seem to matter either, as I see this with topic sizes from 40 bytes to 8000 bytes.  This causes problems because the calls are made by a process executing at 60Hz, so each frame has only 16.67ms to complete.

What conditions would cause a 'take' or 'read' to block?

 

Details on the configuration:

  • Using OpenSplice 6.7.1
  • Communication in question is only between 2 machines connected directly via 1Gb Ethernet.
  • Both machines are running federated/shared memory mode and communicate using DDSI2E.
    • I *am* running two separate DDSI2E services on each machine, using a configuration provided by PrismTech, because I need to communicate on two separate networks (with separate network cards).
    • However, the calls that block go through the DDSI2E service tied to the network cards that directly connect the two machines.
  • The durability service is running on each machine, but turning it off does not help.
  • The QoS is set as Reliable, Volatile durability with Keep All History, sorted by source timestamp, and min latency set to 50ms (also tried leaving at default of 0 as well.)  I also have a lifespan of 300 seconds set on these topics.
  • The process that is blocking on the 'take' call is running Real-Time scheduling and pinned to a CPU all to itself.
    • All 'takes' are also done from a single thread
  • All OpenSplice processes are also set to Real-Time scheduling and pinned to a different CPU.
  • I have pre-allocated memory for doing the 'takes' on each topic.  Memory is also locked for both the process calling 'take' and all opensplice services.

Additional Notes:

  • The blocking seems to occur when a backlog grows on some of the topic data.  There are periods of time where the reader has to pause doing takes for a while and the history depth grows on some of the topics...  There are other times where the sender sends a large burst of samples also causing a buildup of history (since the reader is only running at 60Hz and only does so many takes per frame).  In both of these cases, this is where I see the spikes in 'take' time.
  • I have tried extensive changes to the config files on both ends (queue sizes, bandwidth, max packet sizes, network buffers, etc.), all with little effect.

 

Any assistance would be greatly appreciated!

Chris

 

erik   

Hi Chris,

 

I don't usually keep an eye on these forums, so I guess you're lucky I did this time.

Firstly, are you sure it is blocked during the take? If you have allocated that CPU exclusively to this process, it should be pretty straightforward to determine whether "take" takes 20ms or whether it sleeps 20ms. Both are "less than ideal" of course, but being certain which case it is definitely would help with diagnosing.
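One way to tell the two cases apart is to compare wall-clock time with the calling thread's CPU time around the suspect call. The sketch below is in Python for brevity and is not OpenSplice code: `classify_call`, its `slow_ns` threshold, and the 50% cut-off are assumptions made for illustration. In a C or C++ application the same two clocks should be available through `clock_gettime` with `CLOCK_MONOTONIC` and `CLOCK_THREAD_CPUTIME_ID`.

```python
import time


def classify_call(fn, slow_ns=2_000_000):
    """Run fn once; return (wall_ns, cpu_ns, verdict).

    slow_ns is a hypothetical threshold (2 ms here) below which the
    call is considered uninteresting.
    """
    wall0 = time.perf_counter_ns()   # monotonic wall-clock time
    cpu0 = time.thread_time_ns()     # CPU time consumed by this thread only
    fn()
    cpu_ns = time.thread_time_ns() - cpu0
    wall_ns = time.perf_counter_ns() - wall0

    if wall_ns < slow_ns:
        verdict = "fast"
    elif cpu_ns * 2 < wall_ns:
        # Less than half of the elapsed time was spent executing:
        # the thread was blocked, sleeping, or preempted.
        verdict = "mostly off-CPU"
    else:
        # Most of the elapsed time was spent executing instructions.
        verdict = "mostly on-CPU"
    return wall_ns, cpu_ns, verdict
```

A slow call that is "mostly off-CPU" would point towards blocking on a lock or being descheduled, while "mostly on-CPU" would suggest the call itself is doing a lot of work, for example walking a large backlog of samples.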

That said, if it is blocked, it should be blocked on some mutex somewhere and I would expect it to be a victim of priority inversion, though I am not certain. There are a number of cases I can think of that might do this (in no particular order, and noting there may be more):

  • update of data received from the network or from a local writer
  • a GC step checking for old instances to be freed
  • a badly timed network disconnection
  • the memory allocator releasing large numbers of objects in a short period of time and hitting contention
  • possibly clearing trigger events higher up in the entity hierarchy that are used for blocking on waitsets and triggering listeners

None of these would lead me to expect delays in the millisecond range unless there are huge numbers of instances (or, in some cases, samples), but if it is indeed priority inversion then the scenarios can get pretty hairy pretty quickly. If that is the cause, then mitigation on Linux (which I think you're running) could be as simple as enabling priority inheritance on the mutexes. There is an option for that in the configuration file: Domain/PriorityInheritance, with the attribute "enabled" set to true.
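As a configuration fragment, the setting named above might look roughly like this. Only the Domain/PriorityInheritance path and the "enabled" attribute come from this post; the enclosing `OpenSplice` root element and the position within the `Domain` section are assumptions that should be checked against the deployment guide for the installed version.

```xml
<OpenSplice>
  <Domain>
    <!-- other existing Domain settings stay as they are -->
    <PriorityInheritance enabled="true"/>
  </Domain>
</OpenSplice>
```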

If you have a way of making it take 20ms reasonably often, then it should be possible to catch it in flagrante delicto without too much trouble if you have SystemTap or dtrace at hand. I've never actually done that, but once upon a time I did play with dtrace and I am certain it is possible to use it to profile only during a take operation. Then you discard the profile if it took mere microseconds, and something interesting might well show up.
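The "profile every call, keep only the slow ones" pattern can be sketched independently of the tracing tool. The example below uses Python's built-in `cProfile` purely to illustrate the control flow; it is not a SystemTap or dtrace script, and `profile_if_slow` and its `threshold_s` parameter are hypothetical names.

```python
import cProfile
import time


def profile_if_slow(fn, threshold_s=0.002):
    """Profile one call of fn; keep the profile only if the call was slow.

    Returns (elapsed_seconds, profile_or_None).
    """
    prof = cProfile.Profile()
    start = time.perf_counter()
    prof.enable()
    try:
        fn()
    finally:
        prof.disable()
    elapsed = time.perf_counter() - start

    if elapsed >= threshold_s:
        return elapsed, prof   # worth inspecting, e.g. prof.print_stats()
    return elapsed, None       # took mere microseconds: discard the profile
```

With a system-level tracer the equivalent would be to start collecting at entry to the take, stop at return, and emit the collected data only when the measured duration exceeds the threshold.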

Finally, while I don't think it is the case, it could be driven by interrupts on Linux. I believe it is possible to assign interrupts to CPUs, and hence to not handle them on this particular CPU, but I could be wrong there.
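To check the interrupt hypothesis, Linux exposes per-CPU interrupt counters in /proc/interrupts. A minimal sketch, assuming the usual layout of that file (a header row of CPU names followed by one row per interrupt source): take a snapshot before and after a slow frame and see which counters moved on the pinned CPU. The function names here are hypothetical.

```python
def irq_counts_for_cpu(text, cpu):
    """Parse the contents of /proc/interrupts; return {label: count} for one CPU."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()              # e.g. ["CPU0", "CPU1", ...]
    col = cpus.index(f"CPU{cpu}")
    counts = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        per_cpu = fields[:len(cpus)]
        # Skip rows that do not carry one numeric column per CPU (e.g. ERR, MIS).
        if len(per_cpu) < len(cpus) or not all(f.isdigit() for f in per_cpu):
            continue
        counts[label.strip()] = int(per_cpu[col])
    return counts


def irq_delta(before, after):
    """Return only the interrupt sources whose count changed between snapshots."""
    return {k: after[k] - before.get(k, 0)
            for k in after if after[k] != before.get(k, 0)}
```

Reading /proc/interrupts into `irq_counts_for_cpu` at the start and end of a frame and passing both results to `irq_delta` shows which sources fired on that CPU in between. If a device interrupt shows up there, /proc/irq/<N>/smp_affinity is the usual Linux knob for steering it to another CPU.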

 

Best regards,

Erik
