OpenSplice DDS Forum

erik

Members
  • Content count

    36

Everything posted by erik

  1. Blocking with Takes/Reads

    Hi Chris, I don't usually keep an eye on these forums, so I guess you're lucky I did this time. Firstly, are you sure it is blocked during the take? If you have allocated that CPU exclusively to this process, it should be pretty straightforward to determine whether "take" takes 20ms or whether it sleeps 20ms. Both are "less than ideal" of course, but being certain which case it is definitely would help with diagnosing. That said, if it is blocked, it should be blocked on some mutex somewhere and I would expect it to be a victim of priority inversion, though I am not certain. There are a number of cases I can think of that might do this (in no particular order, and noting there may be more):
    - update of data received from the network or from a local writer
    - a GC step checking for old instances to be freed
    - a badly timed network disconnection
    - the memory allocator releasing large numbers of objects in a short period of time and hitting contention
    - possibly clearing trigger events higher up in the entity hierarchy that are used for blocking on waitsets and triggering listeners
    None of these would lead me to expect delays in ms unless there are huge numbers of instances (or, in some cases, samples), but if it is indeed priority inversion then the scenarios can get pretty hairy pretty quickly. If it is this, then mitigation on Linux (which I think you're running) could be as simple as enabling priority inheritance on the mutexes. That has an option in the configuration file: Domain/PriorityInheritance, set attribute "enabled" to true. If you have a way of making it take 20ms reasonably often, then it should be possible to catch it in flagrante delicto without too much trouble if you have SystemTap or dtrace at hand. I've never actually done that, but once upon a time I did play with dtrace and I am certain it is possible to use it to profile only during a take operation.
Then you discard the profile if it took mere microseconds, and something interesting might well show up. Finally, while I don't think it is the case, it could be driven by interrupts on Linux. I believe it is possible to assign interrupts to CPUs, and hence to not handle them on this particular CPU, but I could be wrong there. Best regards, Erik
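    One rough way to tell "takes 20ms" from "sleeps 20ms" is to compare wall-clock time with the CPU time the calling thread consumed over the same call. The sketch below is only an illustration in Python, not OpenSplice code: measure() and classify() are hypothetical helpers, the 0.5 threshold is an arbitrary assumption, and time.sleep() merely stands in for a slow take().

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) used by the calling thread during fn()."""
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()   # CPU time of the current thread only
    fn()
    cpu_end = time.thread_time()
    wall_end = time.perf_counter()
    return wall_end - wall_start, cpu_end - cpu_start

def classify(wall, cpu, busy_fraction=0.5):
    """Label a call 'busy' if most of the wall time was spent on the CPU,
    otherwise 'blocked' (sleeping, e.g. waiting on a mutex).
    The 0.5 threshold is an assumption chosen for illustration."""
    if wall <= 0:
        return "busy"
    return "busy" if cpu / wall >= busy_fraction else "blocked"

# time.sleep() is a stand-in for a take() that is suspected of blocking.
wall, cpu = measure(lambda: time.sleep(0.02))
print(classify(wall, cpu), f"wall={wall * 1e3:.1f}ms cpu={cpu * 1e3:.1f}ms")
```

    If the call comes out as "blocked", that would point towards the mutex and priority-inversion scenarios above rather than towards the take itself doing 20ms of work.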
  2. Hi Bill, The rule that we always try to follow is to never generate invalid messages, and in this case, that means once you reach 2^31-1, you cannot continue while remaining compliant with the specification. After all, it states that this particular sequence number is of type “Count_t”, described as a “[t]ype used to encapsulate a count that is incremented monotonically, used to identify message duplicates.” And quite obviously, you can’t increment a signed 32-bit (two’s complement) number past 2^31-1. So what does one do in a case like this? Clearly the correct answer is not to crash but to let it roll over anyway, but, perhaps out of frustration with some blatant errors in the specification, that wasn’t the initial implementation. Needless to say, this should have been addressed before releasing, but somehow it slipped through the cracks. If it is any consolation, you are the first ever to report running into this. It has been fixed long since; that the release notes don’t show it is an oversight. If you upgrade to the current version you will not encounter it anymore and moreover benefit from the many other improvements made since the 6.3 release, including some fixes that address an issue where the data path can stall when just the right packets get lost while sending fragmented data. (If you're on the community edition, then you can also fix this by deleting the problematic two lines, and then please also take out two analogous cases in q_xevent.c; just search for DDSI_COUNT_MAX.) Best regards, Erik
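    To make "let it roll over" concrete, here is a small sketch of incrementing a signed 32-bit counter with two's-complement wrap-around. It is written in Python purely as an illustration; next_count() is a hypothetical helper and not the code in the DDSI2 sources.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)

def next_count(count):
    """Increment a signed 32-bit counter, wrapping from 2^31-1 to -2^31
    the way a two's-complement integer does when it rolls over."""
    return ((count + 1 + 2**31) % 2**32) - 2**31

print(next_count(41))
print(next_count(INT32_MAX) == INT32_MIN)
```

    A receiver that only uses the count to spot duplicates can still compare successive values for equality after the wrap; what is lost is the strict "monotonically incremented" wording of the specification.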
  3. Using DDS on Ubuntu 16.04 LTS

    Hi Jeremy, Yes, it will work. Best regards, Erik
  4. Using DDS on Ubuntu 16.04 LTS

    Hi Jeremy, There is no issue using it on Ubuntu 16.04 LTS. Backwards compatibility is excellent in Linux. Best regards, Erik
  5. I suspect the use of -flto (which turns on link-time optimizations) is the cause in your case, too, but I can't easily test it on my machine. I would suggest modifying bin/checkconf, commenting out the "set_var CFLAGS_LTO=-flto" (line 262), and doing a full rebuild.
  6. wishes and dreams (aka "i want a unicorn')

    Hi Bud, As is well-known, unicorns do exist. The problem is finding fully grown ones; it is only the baby ones that are quite common. In other words:
    - does it really have to be C++ or is C good enough?
    - what are your performance requirements?
    - how much effort are you willing to put into it?
    - what licensing schemes are acceptable?
    There is a C library named "corto" on github that does all this. The generic type handling in https://github.com/prismtech/opensplice-tools can convert between something resembling C99 designated initializers and the in-memory representation of the IDL-to-C mapping, and so all the tricks needed to do this are in there even if it does require hooking up the right parsers. Then there is my proof-of-concept Haskell binding (https://github.com/prismtech/haskell-dds), if you're really looking for a proper unicorn. The second and third are definitely limited to the C representation; I'm not sure about the first. One way to deal with that is to have a multi-language program that does the conversion between C and C++ representations via DDS. Best regards, Erik
  7. Hi Loay, In a shared-memory deployment OpenSplice uses shared memory to communicate between the OpenSplice applications that are attached to that shared memory, but for everything else it relies on the networking service, i.e., DDSI2. With a bit of trickery you can even have two independent shared memory domains running inside a single machine, and then connect them via DDSI2. So no, OpenSplice's shared memory is not in any way relevant here. Both OpenSplice and OpenDDS are multicasting to 239.255.0.1:7400 but there is no indication either of them receives anything, and I am certain OpenSplice did not receive anything from OpenDDS in the ddsi2.log file you sent earlier and from the traffic that it generates. I know I have at times had problems with multicasting in a VM, especially if the VM was in a NAT configuration, and I suspect this may be the case for you as well. That means as a next step I think you should try enabling unicast discovery (and probably disable multicasting altogether). In OpenSplice that is pretty straightforward:
    - set General/AllowMulticast to false
    - add a <Discovery><Peers><Peer address="localhost"/></Peers></Discovery>
    Obviously this is not a desirable configuration, but if it turns out that it is a virtual networking problem, then I don't think there are many alternatives short of configuring your VM differently. In any case, it is a sensible step to gain some further understanding of the problem.
  8. Hi Loay, Can you do a WireShark capture of all RTPS traffic and post it? Best regards, Erik
  9. Hi Loay, It is obvious that OpenDDS and OpenSplice are not talking, as OpenSplice doesn't receive a single packet from OpenDDS. Each packet contains a "vendor code", PrismTech's is 1.2, and there are no packets with a vendor code other than 1.2. I think OpenDDS uses 1.3, but it is just as easy to check for anything else. Try, e.g., "grep -E 'vendor 1\.([013-9]|2[0-9])'". Absolutely nothing happens unless both sides receive participant discovery data (SPDP) from each other. So if you see a hint of OpenDDS responding to OpenSplice, but nothing actually working, then the first thing to check is where OpenDDS is sending its SPDP data and why that isn't received by OpenSplice. In WireShark, the SPDP data is shown in the summary as DATA(p), so that's easy to spot. About the "proxy" thing: the world in DDSI is divided into two sides: the entities, and the proxy entities. The first are local, the proxy entities are where it stores information on the remote entities such as the locators, last sequence number received, what sequence numbers have been acknowledged so far, &c., &c. The DDSI endpoint discovery (SEDP) distributes the information on the readers and writers, so that every party in the network is aware of who is out there, and what data needs to be sent to whom. The "match_writer_with_proxy_readers" therefore is about matching local writers with remote readers, which determines the destination IP addresses to use and from whom to expect acknowledgements. Also note that the "plist" keyword messes up the trace but helps debug issues with the encoding/interpretation of the QoS and various other things that are transmitted as part of the discovery. If you leave it out, then you can easily search for new participants using the regular expression "SPDP.*NEW". As the case is now, it is a bit harder because it is split over multiple lines. Still, "bes .*NEW" works. Best regards, Erik
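    For anyone who prefers a script over grep, the two searches mentioned in this post can be sketched in Python as below. The regular expressions are the ones from the post; the sample trace lines are made up for illustration only and do not reproduce the real ddsi2.log layout.

```python
import re

# Same patterns as in the post; re.search behaves like grep -E here.
OTHER_VENDOR = re.compile(r"vendor 1\.([013-9]|2[0-9])")
NEW_PARTICIPANT = re.compile(r"SPDP.*NEW")

def matching_lines(pattern, lines):
    """Return the lines in which the pattern occurs anywhere."""
    return [line for line in lines if pattern.search(line)]

# Hypothetical trace lines, invented for this example.
sample = [
    "recv: ... vendor 1.2 ...",
    "recv: ... vendor 1.3 ...",
    "SPDP ST0 ... NEW ...",
]

print(matching_lines(OTHER_VENDOR, sample))
print(matching_lines(NEW_PARTICIPANT, sample))
```

    An empty result for the first search would mean the trace contains only packets with vendor code 1.2, i.e. nothing from another implementation was received.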
  10. Hi Loay, Perhaps the problem is with the selection of the network interface to use if you have multiple network interfaces. I don't know about OpenDDS, but the DDSI2 service is somewhat picky in that it really wants to use a single interface. It could well be that the two simply don't receive each other's multicasts. You can specify the interface (by name or by IP address) in the "General/NetworkInterfaceAddress" parameter. When all else fails ... try enabling DDSI2's tracing by adding: <Tracing> <EnableCategory>trace,plist</EnableCategory> </Tracing> to the DDSI2Service section in the ospl.xml file. The resulting trace consists of a dump of the configuration, then stuff about network interfaces, addresses, port numbers, &c., and finally you get all the traffic and discovery. This may help in finding out what network interface to use, but it usually also gives valuable information for more complicated problems. It would be unreasonable to expect you to understand everything that is in that trace file, so feel free to post fragments of it if you need further help. Best regards, Erik
  11. OpenSplice IPv6

    Hi, There is no "ipv6" boolean attribute in the network interface selection: the correct way is to add <UseIPv6>true</UseIPv6> to the "General" element of the DDSI configuration. Best regards, Erik
  12. Interop Issue

    Hi Peter, RTI are now sending both a UDPv4 and a type 16777216 locator, and then there is a "transport info" list that correlates with and gives some additional information, so it clearly must be a vendor-specific extension occupying part of the OMG-reserved namespace, that evidently should be ignored for things to work ... I'm pretty sure they added this recently, by the way, or we would've run into it ourselves in the most recent interoperability plugfest. Thanks for helping us discover it. For a quick fix, since you are using the open source version, just modify OpenSplice's DDSI implementation to ignore it (see my previous comment, just return 0 instead of ERR_INVALID). I'll make sure a fix goes into the OpenSplice sources. Best regards, Erik
  13. Interop Issue

    I wonder ... 16777216 could be an RTI-specific locator type that gets rejected (whether or not it should be rejected is debatable, the language of the specification is interpretable in multiple ways), or it could be a byte-swapped version of a UDPv4 locator ... A wireshark capture will likely give a hint: if this sample includes locators with kind 1 as well as locators with kind 16777216, then it almost certainly is an RTI-specific locator, but if there are none with kind 1, it likely is an endianness issue. In the former case, ignoring it is pretty simple (see https://github.com/PrismTech/opensplice/blob/master/src/services/ddsi2/code/q_plist.c#L1018). Would you be able to do a packet capture?
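    The byte-swap suspicion comes from the fact that 16777216 is 0x01000000, which is what you get when the 32-bit value 1 is read with the wrong endianness. A short Python sketch of that arithmetic and of the heuristic described above; byteswap32() and diagnose() are hypothetical helpers written for this illustration, not part of any DDSI implementation.

```python
import struct

def byteswap32(value):
    """Reinterpret a signed 32-bit integer with the opposite byte order."""
    return struct.unpack("<i", struct.pack(">i", value))[0]

def diagnose(locator_kinds):
    """Apply the heuristic from the post to the locator kinds seen in one sample."""
    kinds = set(locator_kinds)
    if 16777216 not in kinds:
        return "no kind-16777216 locator present"
    if 1 in kinds:
        return "probably a vendor-specific locator"
    return "possibly an endianness issue"

print(byteswap32(1))
print(diagnose([1, 16777216]))
print(diagnose([16777216]))
```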
  14. Interop Issue

    Hi Peter, Firstly, your only chance of interoperability is with “StandardsConformance” set to “lax”. Only in that mode will OpenSplice accept some of the non-conforming messages sent (and even send a few itself that are needed) by the other implementations. I suspect the “invalid qos” and “malformed packet […] parse:acknack” messages occurred in some mode other than “lax”; if not, it would be useful to have a Wireshark capture and perhaps a DDSI2 trace. Secondly, regarding: There are some subtle, at least formally incompatible, changes from the 2.1 to the 2.2 version of the specification, so we believe that our DDSI implementation should restrict itself to version 2.1 until it has been qualified for version 2.2. However, that is not a valid reason for flagging version 2.2 messages as invalid. Chances are that it will work, though, and you can change the check easily enough (see https://github.com/PrismTech/opensplice/blob/master/src/services/ddsi2/code/q_receive.c#L2798). Thirdly, in the second log file: and (as well as all analogous cases) are correct warnings: these messages are indeed not valid DDSI 2.1 messages. Why RTI sends them, I don’t know. Best regards, Erik
  15. Hi, What "goes wrong" when you raise the limit is that the unicast discovery will start blasting even larger numbers of packets into the network, and it has to do so periodically (the SPDPInterval). For each peer address, it sends a unicast packet to all N port numbers, so before you know it, the burst will be huge. There are some obvious ways of mitigating that, but those break support for asymmetrical discovery. Obviously, it is a bit silly that it is hard-coded at 10. It is a historical artefact (as is the call to "exit") that simply never is an issue in federated (shared memory) deployments because the limit is on the number of DDSI2 instances, not participants, and also not in environments that support multicast. How it came to be that this found its way into the product I am unfortunately not at liberty to tell, but I suspect you would understand if you knew ... Anyway, it never got its priority raised because it never became a real issue ... such is life. Please feel free to raise the limit and recompile; that has by far the shortest turn-around time. A periodic burst of packets presumably is better than a non-working system. In that case, please raise the limit in two (...) places: the one you found, and at https://github.com/PrismTech/opensplice/blob/master/src/services/ddsi2/code/q_addrset.c#L58. I will start the process of making it configurable, eliminating the call to exit(), and considering mitigations for the resulting packet bursts, but please be aware that whatever we do internally may take a while to reach github. I have very little influence on that. The decisions about what is freely available in the community edition and what is not are what they are, and you're welcome to use the community edition. It just happens that sometimes the commercial edition appears to be a better proposition on technical grounds, and from your description, I think yours is one of those cases.
Since I don't know why you are using the community edition, for all I know, you may be in a position to consider switching to the commercial package. If you are, you might want to look at the traffic overhead caused by having 10 hosts with 20 autonomous processes each, compared to having 10 hosts each containing a shared-memory deployment with 20 attached processes. The specification is freely downloadable, but I can give you the short summary: the mandated discovery protocol is quadratic in terms of the number of "participants" (scare quotes because in OpenSplice it is the number of DDSI2 instances, not application participants), and if there are multiple participants on a single node all subscribed to the same data, many copies will have to be sent. Both disappear in shared-memory deployments. (If you really want to scale up, you enter the territory of our Cloud package, with proper scalable discovery.) If you can't go commercial and bump into issues of scale, the best I can advise is to look at the "poor man's" shared memory mode: multiple threads in a single application. A bit of trickery with the run-time linker can go a long way ... Best regards, Erik
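    To put rough numbers on the two scaling effects in this post, here is a back-of-the-envelope sketch in Python. Both formulas are simplifying assumptions made for illustration: the burst is modelled as one packet per (peer address, port number) pair, and "quadratic" is modelled as the number of ordered pairs of DDSI2 instances.

```python
def spdp_burst_size(peer_addresses, ports_per_address):
    """Unicast SPDP packets per interval: one per peer address per port number."""
    return peer_addresses * ports_per_address

def discovery_pairs(instances):
    """Ordered pairs of DDSI2 instances that have to discover each other."""
    return instances * (instances - 1)

hosts, processes_per_host = 10, 20

# Single-process mode: every process carries its own DDSI2 instance.
standalone = discovery_pairs(hosts * processes_per_host)
# Shared-memory deployment: one DDSI2 instance per host.
federated = discovery_pairs(hosts)

print(spdp_burst_size(peer_addresses=hosts, ports_per_address=processes_per_host))
print(standalone, federated)
```

    Under these assumptions, covering 10 peer hosts with the limit raised to 20 gives a burst of 200 packets per SPDP interval, and the number of discovery relationships is 39800 for 200 autonomous processes against 90 for 10 shared-memory nodes.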
  16. Linux x64 Java core dump

    This looks more like a bug in the 6.3.0 that got fixed in the 6.3.3p1 and 6.4.0. I doubt increasing the stack size will do much.
  17. The use case you describe sounds interesting, and definitely not one that DDSI is tuned for out-of-the-box. The wireshark trace you’ve attached suggests a much shorter round trip than 2.5s, but it is not inconsistent with it: you can still get many heartbeats/acknowledgements in a row even with much shorter round trips, because the discovery involves quite a few independent readers and writers. This can be significantly reduced by setting the SquashParticipants option to true. As far as discovery traffic is concerned, that is always a win. With a 2.5s round trip, getting this trace would require a very particular relationship between the timing of heartbeats and latency to get an alternating sequence of heartbeats and acknowledgements. The bad news is that the parameters controlling this timing are not currently exposed in the configuration file. The good news is that it is possible to change the basic timing parameters in the source code for a quick test and rebuild OpenSplice; it is open source, after all. They can be found near the top of src/services/ddsi2/code/q_transmit.c. There may potentially be further consequences to changing these, but it is worth a try. Another issue you may run into is that with the MaxMessageSize setting you chose, you are relying on the fragmentation implementation at the IP level. Setting the MaxMessageSize to a little below the MTU of 976 bytes (to account for the UDP/IP headers), and the FragmentSize below that still, would eliminate IP fragmenting and instead rely on DDSI fragmenting only. In an unreliable network, this can reduce the size of the retransmissions significantly. As you’ve mentioned, the bandwidth limiting features of DDSI2E are relevant, and it also provides some more control over which data is sent where, via the network partitions. Is experimenting with the commercial version an option for you?
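    As a rough guide for choosing MaxMessageSize, the largest UDP payload that fits in one unfragmented IP packet can be worked out from the MTU. The sketch below assumes IPv4 without options (20-byte header) and the standard 8-byte UDP header; max_udp_payload() is a hypothetical helper, and the right value for your link may differ if there is additional encapsulation.

```python
IPV4_HEADER_BYTES = 20  # minimum IPv4 header, assuming no options
UDP_HEADER_BYTES = 8

def max_udp_payload(mtu, ip_header_bytes=IPV4_HEADER_BYTES):
    """Largest UDP payload that avoids IP-level fragmentation for a given MTU."""
    return mtu - ip_header_bytes - UDP_HEADER_BYTES

print(max_udp_payload(976))
```

    For the 976-byte MTU mentioned above this gives 948 bytes, so a MaxMessageSize at or a little below that, with FragmentSize smaller still, would keep fragmentation at the DDSI level.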
  18. interoperability with opensplice and opendds

    While I don't know this particular problem, it does remind me of issues with multicasts in discovery. Are you using VMs, by any chance?
  19. Type Names and DDSI2

    Technically, there is only an interoperability issue, with neither vendor having deviated from the specification: the specification does not actually specify what that "type_name" parameter means. In OpenSplice the interpretation chosen many years ago is that registering a type under a different name allows the DomainParticipant to use a local alias, but that globally it has no effect. The rationale for this interpretation is unknown to me — but it is old, and consistent with the specification. If I were to hazard a guess, it is that OpenSplice at the time only supported LANs where it is feasible to globally enforce consistency between topics; and with the topics necessarily consistent, obviously there was very little value in type name aliases. Moreover, this all antedates the specification of an interoperable wire protocol ... Clearly, the interpretation chosen in CoreDX is different, and that it is the type name to be used for matching globally. May I ask why you tried to use a type name alias? We are aware that OpenSplice and RTI use the same conventions for scoped names (they both use "::" as a separator), and that CoreDX uses a different convention (don't remember what exactly). Intriguingly, this detail is also left unspecified by the OMG ... Perhaps it would be possible to instruct CoreDX to use the name used by OpenSplice? That should solve the issue, I presume.
  20. Receiving same data message twice

    Bernie, No, the time stamp format is fixed. For automated processing this is typically much nicer, but indeed, for reading it as a human being it is a bit unfriendly. Good to hear the 1 minute issue is sorted! Erik
  21. Receiving same data message twice

    Bernie, This is a weird enough case that I don't want to ask you to go and dig through it, as I know how hard it can be. So send it to us - I think you can attach files on the forum, but if you need to keep it private, perhaps you'd best email it to me at erik.boasson@prismtech.com. Erik
  22. Receiving same data message twice

    Hi Bernie, I think I understand what happens. When you start a new process (given that you're limited to "single process" mode with the community edition), your new process gets its own local copy of the DDSI2 service and of the durability service. DDSI2 performs its discovery and starts receiving data from the peer publishers (and sending to the subscribers) pretty quickly in a small network (in the order of milliseconds), but the durability service instances need some time to discover and agree on the topology of the network, and to exchange data to create a consistent view. The speed of this process is related to the heartbeat interval and (with the default timing) it typically takes 10s to 15s, although it of course also depends on the amount of data to be exchanged and how much bandwidth is allocated to that process (also a paid-for feature; we have customers where it can take minutes). Since your data is of TRANSIENT durability, in this initial phase, you'll be receiving data directly, but also get copies later on through the initial data exchange by the durability service. Whether or not you would observe those copies is dependent on other QoS settings and on your application's behaviour. If you were to select BY_SOURCE ordering (rather than the default, BY_RECEPTION) and you were to call wait_for_historical_data() before starting processing, it should be fine: in BY_SOURCE mode, we check for duplicates; and by not removing the data from the reader until all the potential duplicates have arrived, there is actually something to check against. Having said that, while this should work, I don't think this is what suits your application. The reason I think so is that what you describe is a request-reply pattern using messages, but the TRANSIENT setting really is about maintaining a description of the important parts of the system state so that newly started processes can join quickly (and crashes can be recovered from quickly).
These request-reply messages typically do not describe a system state. So I suspect you really ought to consider using a durability setting of VOLATILE. Regards, Erik
  23. Receiving same data message twice

    Bernie, That one minute delay is very strange. I take it that all messages are delivered correctly, despite the delay? DDSI has a tracing capability that logs its actions in great (perhaps too great) detail. You might want to try enabling it, by adding to your ospl.xml, in the DDSI2Service section: <Tracing> <Verbosity>finest</Verbosity> <OutputFile>ddsi2.log</OutputFile> </Tracing> This contains enough information to find out whether DDSI sent the data across from the one process to the other (we much prefer using shared memory over treating processes in a single machine as if they were network nodes, but that is a paid feature). There are a few timed activities in DDSI, but 1 minute is a bit odd, so it would be interesting to see what exactly is going on. With regards to the duplication of samples, what QoS settings are you currently using? Without knowing your application, we can't of course make sensible recommendations, but it would help understand the exact situation. Erik
  24. Receiving same data message twice

    Hi Bernard, We know of some circumstances in which data duplication can occur. Since we don't know which exact version of OpenSplice you are using, and with what QoS settings, here are a few options off the top of my head:
    1. On versions prior to 6.2.3 with multiple partitions shared between readers and writers (fixed since 6.2.3 except for very rare circumstances involving resource limits and carefully crafted interleavings of events).
    2. For transient-local data, data can arrive twice with a lot of time in between (one is the transient-local support in DDSI that is required by the specification, one because of the durability services exchanging data to ensure all nodes have the same view of the transient(-local)/persistent state).
    3. For all data, in the first few seconds following startup of a process (this is a variant of 2), which relates to trying to work around the difference in philosophy underlying the OMG specification and that on which OpenSplice is based.
    OpenSplice works much harder to maintain a coherent picture throughout the network, but at the same time builds some features on this foundation. RT networking is designed from the ground up for this, but with DDSI we have to rely on some workarounds that sometimes have visible side effects. Does the duplication still occur if you create the participants, then wait a while (say 20s in a default configuration; I am not proposing 20s waits as a solution here, rather as an aid to diagnosing the problem) before starting data publication? Regards, Erik
  25. The trace shows that ddsi2 is not using the "wlan0" interface, but rather the "vibr0" interface. I googled, and it turns out to be related to Xen-based virtualization. Presumably it doesn't fully support multicast (but this is only an assumption). The "wlan0" interface is listed by ddsi2 as being down, whereas the ifconfig dump you showed listed it as being up. Ddsi2 will ignore any interface that's down. Of course, the two programs, ifconfig and ddsi2, should be in agreement on this - they use the exact same information, after all. We have never yet seen a discrepancy like this. Could you please double-check that ddsi2 and ifconfig are really in disagreement? Secondly, could you explicitly set the interface to loopback, by adding <General><NetworkInterfaceAddress>lo</NetworkInterfaceAddress></General> and trying again? Note that it may be that multicast over the loopback interface is disabled, even though it is possible. If "ifconfig lo" shows you "MULTICAST" among the flags, then it is supported, and it should simply work; else, you should add 127.0.0.1 as a peer address, as shown in a previous post. This should allow you to check if other things do work as expected while we try to figure out what's going on with your wireless LAN interface.