Tuesday 28 November 2017

Derecho discussion thread

 As requested by Scott...  Scott, how about reposting your questions here?
===
Hi Ken.

Thanks very much for the info. I wasn't familiar with LibFabrics. I think since I don't have access to RDMA hardware right now I will wait for the derecho over libfabrics. How can I know when it is available?

In the meantime, a question: you mentioned in one of your blog posts that it would be difficult to use the Derecho replicated object API from languages other than C++. Could you comment on why that would be, e.g. relative to Python or Java?

To share one thought: the OSGi (Java) concept of a service...or rather a remote service...is conceptually similar. OSGi services are plain ol' object instances that are managed by a local broker (aka service registry) and accessed through interfaces.

One thing that makes OSGi services different from java objects is support for dynamics...i.e. OSGi services can come and go at runtime

Which brings me to another question: Is it possible for replicated object instances to be created and destroyed at runtime within a derecho process group?

BTW, if there is some derecho mailing list that is more appropriate than your blog for such questions please just send me there.

Scott

25 comments:

  1. Java/Python/Rust/... : These languages all do memory management, so any data sent via RDMA would have to be copied to a pinned memory page first. Copying is 3x slower than 100Gb RDMA, so copying twice (sender... RDMA... receiver) is a path at least 7x slower than just using RDMA directly from memory, as we can do from C++. In practice it would be more like 25x because of the need to create new objects on the receiver side.
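
The 7x figure is plain addition if the three stages run one after another. A minimal sketch of that arithmetic, taking one RDMA transfer as the unit of time and using the 3x copy cost quoted above; the function name is made up for illustration, and the model ignores any pipelining or overlap between stages:

```cpp
// Relative time for a copy -> RDMA -> copy path, in units of one RDMA transfer.
// Assumes the stages are strictly serial (no pipelining or overlap).
constexpr int relative_path_cost(int copy_cost, int rdma_cost = 1) {
    return copy_cost + rdma_cost + copy_cost;  // sender copy, wire, receiver copy
}

static_assert(relative_path_cost(3) == 7, "3 + 1 + 3 = 7");
```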

    A further issue is that we use C++ generics (templates) for our Derecho API. The Java type system isn't the same as the C++ type system, so there is no simple way to call a C++ generic from Java. Python doesn't even have static types...

    Replies
    1. Hi Ken. I understand the limitations for memory management but couldn't c++/native impls be accessed from java/python? This is a common approach with (e.g.) tensorflow accessing GPUs.

      WRT C++ templates: Is there something about the Replicated Object API that requires C++ generics or is it a convenience?

      My question is primarily practical...i.e. to get adoption of a 'higher level' api like replicated objects rather than MPI or something else I expect it will help to support multiple languages.

    2. Good questions! I don’t really have the best answers, meaning that someone should probably do some actual experiments. Derecho currently uses a lot of fairly elaborate templates that do compile-time optimization as a way of minimizing runtime costs. So the direct full use of the platform needs that, and also needs memory to be pinned and registered. In Java, memory can be copied around without you knowing it happened.

      But one could definitely build some sort of Java library that would use generics a lot like Derecho templates, mimicking Derecho, and then call into C++ DLLs via Java’s native-method mechanism (JNI). So you could build a library (DLL) exposing Derecho as a set of non-templated stub methods, and in this way do cross-language calls. The big issue would be finding a way to lock down the Java memory regions so that you wouldn’t have endless copying.
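
A minimal sketch of what "non-templated stub methods" could look like on the C++ side. Nothing here is Derecho's actual API: `marshal_into` is a hypothetical stand-in for a templated core, and the `stub_marshal_*` names are invented. The point is only that each stub has C linkage and a fixed signature, so a foreign caller never has to instantiate a template.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical templated core, standing in for a template-heavy C++ API.
// Copies a trivially copyable value into a caller-supplied buffer.
template <typename T>
std::size_t marshal_into(const T& value, unsigned char* out, std::size_t capacity) {
    static_assert(std::is_trivially_copyable<T>::value, "plain data only");
    assert(out != nullptr);
    if (capacity < sizeof(T)) return 0;  // not enough room
    std::memcpy(out, &value, sizeof(T));
    return sizeof(T);
}

// Non-templated stubs with C linkage: one per concrete type. The compiler
// expands the template here, so the exported symbols are ordinary functions.
extern "C" std::size_t stub_marshal_i64(std::int64_t v, unsigned char* out, std::size_t cap) {
    return marshal_into(v, out, cap);
}

extern "C" std::size_t stub_marshal_f64(double v, unsigned char* out, std::size_t cap) {
    return marshal_into(v, out, cap);
}
```

Binding these to Java would still need native-method declarations and glue code on the Java side, which is not shown.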

      So in theory, I do see how it might be done for Java or C# or Rust.

      Python is pretty weird, since it lacks static types and does absolutely everything by a form of runtime reflection. I’ve been thinking about the issue because TensorFlow is layered over Python, but I think the best option there is to generate C++ code by literally parsing the Python and then emitting Derecho modules for the chunks that would do smart-memory stuff. Python does allow you to wire types down, so you could use that approach just for those modules... might be possible to get most of the speed of Derecho that way.

      If you want to give it a try, we’ll be happy to provide help if you get stuck!

    3. Hi Ken. I would be interested in the approaches you describe for either Java or Python...probably Java to start, as I'm more familiar with extending Java with native code libraries...and with VM internals (e.g. the garbage collector) than with the Python equivalents, although I'm facile with Python also. First I think I have to fully understand how Derecho is using C++ generics/templates and whether it could be reasonably exposed via Java interfaces and classes.

    4. Basically, we implement reflection, but entirely at compile time. And we do that with variadic (recursive) templates, constant expressions evaluated at compile time, and a form of parameter aliasing that results in inlined code. Java gets similar API features and polymorphism, but using costly runtime mechanisms.
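
To make the "reflection at compile time" idea concrete, here is a small sketch in the same spirit. It is not taken from Derecho: `packed_size`, `argument_bytes`, and the `Counter` type are invented for illustration. A recursive variadic template walks a method's argument types and a constant expression yields the result, so nothing is inspected at runtime.

```cpp
#include <cstddef>
#include <cstdint>

// Recursive variadic template: sums sizeof over an argument list at compile time.
template <typename... Args>
struct packed_size;

template <>
struct packed_size<> {
    static constexpr std::size_t value = 0;  // base case: no arguments left
};

template <typename Head, typename... Tail>
struct packed_size<Head, Tail...> {
    static constexpr std::size_t value = sizeof(Head) + packed_size<Tail...>::value;
};

// "Reflection" without runtime cost: the argument types are deduced from the
// member-function pointer's type while compiling.
template <typename Class, typename Ret, typename... Args>
constexpr std::size_t argument_bytes(Ret (Class::*)(Args...)) {
    return packed_size<Args...>::value;
}

// Hypothetical replicated object, used only to exercise the template.
struct Counter {
    void add(std::int32_t, std::int64_t) {}
};

static_assert(argument_bytes(&Counter::add) == 12, "4 + 8 bytes, known at compile time");
```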

    5. Thanks for the info. I assume by 'costly runtime mechanisms' you mean reflection. Java reflection is indeed costly. With OSGi remote services (my corner of the OSGi universe) the distribution providers typically do use reflection. The way they often do this, however, is to impose the reflection cost once at service registration time (i.e. registration with the service registry...i.e. the broker) and only with the service interfaces (the external contract for the service). In the Derecho case, if this registration were to correspond with the dynamic adding/removal from a group and a consequent view change, perhaps its runtime costs could be mitigated.

      I agree that c++ templates generating inline code at compile time would be generally faster, but it might be that the Java runtime costs could be reduced sufficiently with such approaches. And then there is the dynamic compiler...which possibly could be modified to help in these cases.
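
The "pay the lookup once at registration" pattern is not specific to Java. A rough analogue, written in C++ to match the rest of the thread; this is neither OSGi nor Derecho code, and `MethodRegistry` and its members are hypothetical names. The name-based lookup stands in for the reflective step, and later calls go through a cached handle instead.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical registry: the name lookup (the "reflective" step) happens when a
// method is registered or resolved, not on every call.
class MethodRegistry {
public:
    using Handler = std::function<int(int)>;
    using Handle = std::size_t;

    // Registration time: pay for the string hashing and map insertion once.
    Handle register_method(const std::string& name, Handler fn) {
        handlers_.push_back(std::move(fn));
        by_name_[name] = handlers_.size() - 1;
        return handlers_.size() - 1;
    }

    // Also a one-time cost: look a name up and keep the returned handle.
    Handle resolve(const std::string& name) const { return by_name_.at(name); }

    // Call time: an index into a vector, with no name lookup.
    int invoke(Handle h, int arg) const {
        assert(h < handlers_.size());
        return handlers_[h](arg);
    }

private:
    std::vector<Handler> handlers_;
    std::unordered_map<std::string, Handle> by_name_;
};
```

Whether this recovers enough speed in a JVM setting is exactly the open question raised above; the sketch only shows where the one-time cost would sit.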

  2. OSGi: These client/server boundaries are expensive! LibFabrics links directly to C++, and just maps to the cheapest option available: RDMA if we have the hardware, TCP if not...

    Replies
    1. Not sure what you mean wrt OSGi and client/server boundaries. The OSGi service registry is a within-process broker that separates contract (interfaces) from impl. Remote services extends the broker model by defining standardized meta-data (e.g. interfaces of instance, object identity, etc) and allows a pluggable 'distribution system' to handle state replication and rpc for object instances just as Derecho does for replicated objects. In any event, it's quite possible to have OSGi services that are all within process.

      I get that LibFabrics provides layered comm abstraction...which seems great to build something like a replicated object system on top of!

    2. Well, I stand corrected! Perhaps it could work. Honestly, you seem to know way more about OSGi than me... and what I knew wasn’t even right!

    3. If curious you might like this high-level presentation of OSGi/modularity: https://www.slideshare.net/bjhargrave/why-osgi

  3. Dynamic object creation: Yes, by changing the group "view" and having the mapping function do a new membership assignment at that point. You can't do it without a view change. But view changes are fast (150ms).

    Replies
    1. With a view change...I see. With the current derecho API, are there any tests/examples that do this?

    2. I think Edward Tremel has unit tests for that part of our logic. He owns the view change code, and is actually extending it now (we are finishing the code for restart with persisted version vectors). I’ll ask him to jump in on this, if you like.

    3. I would be grateful if he did!

  4. Availability of the LibFabrics version: Weijia Song is currently wrapping up his experiments to understand exactly how the library works, how MVAPICH (MPI) uses it and is able to configure itself semi-automatically, etc. Then the plan is to port FFFS, which Weijia wrote and can be used standalone (it runs in our GridCloud platform, and v2 might use Derecho, but FFFSv1 is pretty stable and has an RDMA option). If that goes easily, we will port Derecho next. I’m guessing early February?

    The project is open and we would welcome contributors but honestly, there are just three key developers who really know their stuff and all of them have priorities of various kinds tied to their research and career goals. So we do plan to do this soon, but it may not occur tomorrow.

    Replies
    1. Hi Ken. Hope things with you and Derecho are well. Any word on the LibFabrics version? I checked out the github site and there haven't been any recent commits and the docs appear to not point to LibFabrics yet. Anyway, I'm interested in what's happening.

      Also...any further word (from Ed Tremel?) on the dynamics of dynamic replicated object creation/deletion?

      Thanks in advance.

      BTW, I'm the longtime project lead for this OS project:

      http://www.eclipse.org/ecf

      One part of these APIs is a Java-based implementation of what's essentially a replicated object (called the shared object API in ECF). Another part of ECF uses this shared object API to implement smart proxies for the OSGi Remote Services specification...which is a spec/standard for network-services meta-data (endpoint info, intents, sync/async, ids, privacy/confidential, service-specific meta-data, etc).

      https://osgi.org/specification/osgi.cmpn/7.0.0/service.remoteservices.html


      One thing I've been thinking: It's possible that I could use Derecho to impl (native code) the shared object API, which could mean a high performance remote services impl as well.

    2. This comment has been removed by the author.

    3. This comment has been removed by the author.

    4. Well, it hasn’t gone as quickly as we hoped, but actually is going very well. Weijia now has LibFabrics working for a bunch of experiments and has done the port for about 2/3rds of the Verbs uses in the system. We actually hope to do the RDMC experiments over LibFabrics over TCP as soon as next week. Derecho will quickly follow, once those work. It went slowly because the package turns out to be rather sophisticated, and has mostly been used in MPI, which doesn’t have a dynamic notion of membership. Nothing has actually gone wrong, and if anything performance is looking very good. But it forces us to take tiny steps, test like crazy, then take a further tiny step, etc. Weijia is getting very good at this stuff, though, and the pace is accelerating. We also found him some help, a guy named Alex, who seems to be a natural.

      Edward has focused on the restart (from full shutdown) code, and has that mostly working now, in a development branch on github. But he also knows of some bugs, and Sagar also knows of some new bugs. We should fix those known issues before we release the new version to the public, but you could have it sooner if you like. The system is really in great shape, overall.

      You asked specifically about dynamically creating new subgroups or new sharding patterns (dynamic objects). For us, that would be done by triggering a new view and then getting your layout function to just decide who should belong to the new subgroup, or the new shards. Except for full restart, that would work today, over RDMA hardware. Theoretically, that should work in a few weeks on TCP via this LibFabrics port. I say your layout function because Edward’s is kind of basic, focused on regular patterns.
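
A minimal sketch of what a user-supplied layout function might look like, under heavy assumptions: the `View` struct, `NodeId`, and `layout_shards` are invented here and do not reflect Derecho's real types or signatures. The idea is only that the function is rerun on every new view, so changing the view (or the shard count it is asked for) changes who belongs to which shard.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using NodeId = int;

// Hypothetical view: an epoch number plus the current members, in rank order.
struct View {
    int epoch;
    std::vector<NodeId> members;
};

// Hypothetical layout function: deals members into shards round-robin.
// Rerunning it on each new view is what changes shard membership.
std::vector<std::vector<NodeId>> layout_shards(const View& view, std::size_t num_shards) {
    assert(num_shards > 0);  // caller must ask for at least one shard
    std::vector<std::vector<NodeId>> shards(num_shards);
    for (std::size_t i = 0; i < view.members.size(); ++i) {
        shards[i % num_shards].push_back(view.members[i]);
    }
    return shards;
}
```

For example, a view with members 10, 11, 12, 13 laid out over two shards gives {10, 12} and {11, 13}; a later view that adds 14 and 15 and asks for three shards gives {10, 13}, {11, 14}, {12, 15}.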

      It might make sense to set up a Skype call with you, me and Matt Milano (our PL expert) about this idea of C++ to Java linkage. Broadly, my reaction is that it can work. One option is to create a kind of universal generic on the Derecho side: our variadic templates instantiate at compile time, but Java types are instantiated at DLL load time, using reflection. But for the best speed, you need to dream up a way to compile a C++ DLL at load time, with stubs matched perfectly to your Java types, dynamically load it, then wire the reflection calls to resolve against the C++ API, which won’t involve generics at that point, since the compiler will have expanded them. Could work.

      Not many people could follow the above suggestions... but I can see how you and your team might be able to do it.

    5. More on this same idea.

      In Vsync, from C#, I actually had many of the same challenges. Your guys could look at that code to see how they were handled.

      Vsync did its own marshalling and demarshalling, but the approach was quite slow because it ended up with polymorphic type operations on the critical path, and in Java or C# you pay a fortune for querying the current dynamic type binding at runtime. I guess your guys are experts on this. Anyhow, it was messy and slow, and even with tricks like using a bound type reflection object in C#, I never got genuinely high speed that way. Derecho is 15,000x faster than Vsync, to give some sense of things. A lot of that was other stuff, but some of it was reflection costs.

      All this said, you still could code up a Java serialization of the request, send it down as a byte vector to some sort of stub compiled against Derecho and hence non-generic, have that stub do the RDMA multicast, and then deliver it back up to Java and hence into your demarshaller for polymorphic dispatch. So that's what I did in Vsync.
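
The byte-vector path just described can be sketched end to end. This is an illustration only, shown in C++ for consistency with the rest of the thread: the wire format (a 4-byte method id followed by the payload) is an assumption, and `loopback_multicast` is a hypothetical placeholder for the non-generic stub that would hand the bytes to the multicast layer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Marshal: 4-byte little-endian method id followed by the raw payload.
Bytes marshal_request(std::uint32_t method_id, const std::string& payload) {
    Bytes out;
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<std::uint8_t>((method_id >> shift) & 0xFF));
    }
    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}

// Hypothetical stand-in for the non-generic stub. A real stub would hand the
// bytes to the multicast layer; this one just loops them back to the caller.
Bytes loopback_multicast(const Bytes& message) { return message; }

// Demarshal on delivery and dispatch on the method id (the polymorphic step).
using Handler = std::function<std::string(const std::string&)>;

std::string deliver(const Bytes& message,
                    const std::unordered_map<std::uint32_t, Handler>& table) {
    assert(message.size() >= 4);
    std::uint32_t method_id = 0;
    for (int i = 0; i < 4; ++i) {
        method_id |= static_cast<std::uint32_t>(message[i]) << (8 * i);
    }
    std::string payload(message.begin() + 4, message.end());
    return table.at(method_id)(payload);
}
```

The stub only ever sees a byte vector, which is why it needs no generics; all type-dependent work stays in the marshal and deliver steps on either side.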

      Now the fancier option I'm outlining would have costs (just in time compilation, using the C++ compiler) and might be a little fragile (to compile Derecho you need the proper versions of one of the C++ toolchains, plus potentially the LibFabrics library now that we are starting to use that). But then you could basically JIT a bunch of stubs, one for each Java argument type "mix" needed for your polymorphic methods, which has the effect of compiling non-polymorphic POD methods that now can be called as C++ externs from a DLL you would dynamically load on the Java side. I've done all steps of this at one time or another.

      The core issue here is that even with such steps you may end up with some copying: Derecho is gradually moving towards scatter gather of arguments that have a memory representation the RDMA NIC can handle, and that are in pinned memory (Mellanox recently relaxed this need, but LibFabrics still imposes it). So potentially, you go to all this trouble but even so, we can't easily avoid copying. If someone will generate a marshalled byte array, it might as well be you. So that's the tradeoff.

      So then we get to yet a fifth (sixth?) idea. You could malloc a big buffer in memory, RDMA registered and pinned, and now your Java code could marshal to that or even somehow generate data directly in it. Then Derecho's raw mode kicks in (or we can get it to kick in) and we skip the C++ marshalling or scatter/gather question entirely (we still need the generic stub for byte arrays, but that's easy). Now we would not need a JIT step, and we get a pretty simple pathway -- but we've lost the elegance of the API, which leverages polymorphism quite extensively now. So you end up forced to reimpose the API but as part of the OSGi Java infrastructure via some form of Java preprocessor, presumably.
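
A sketch of the "one big buffer, marshal straight into it" idea, with the caveat that it models only the buffer handling. `SendBuffer` is a hypothetical class; the RDMA registration and pinning step is not shown, and nothing here calls into Derecho or LibFabrics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical send buffer, allocated once up front. In a real system this is
// the region that would be registered and pinned for RDMA; that step is omitted.
class SendBuffer {
public:
    explicit SendBuffer(std::size_t capacity) : storage_(capacity), used_(0) {}

    // Append a plain-data field in place. Returns false if it does not fit.
    template <typename T>
    bool put(const T& value) {
        assert(used_ <= storage_.size());
        if (sizeof(T) > storage_.size() - used_) return false;
        std::memcpy(storage_.data() + used_, &value, sizeof(T));
        used_ += sizeof(T);
        return true;
    }

    // What a raw-bytes send would be given: a pointer and a length, with no
    // intermediate marshalled copy.
    const std::uint8_t* data() const { return storage_.data(); }
    std::size_t size() const { return used_; }

    void reset() { used_ = 0; }  // reuse the same memory for the next message

private:
    std::vector<std::uint8_t> storage_;
    std::size_t used_;
};
```

The design choice is that the backing memory never moves or grows after construction, so a pointer handed to the transport would stay valid across messages; the cost is a fixed capacity chosen up front.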

    6. Hi Ken,

      Thanks for the thoughts.

      >All this said, you still could code up a Java serialization of the request, send it down as a byte vector to some sort of stub compiled against Derecho and hence non-generic, have that stub do the RDMA multicast, and then deliver it back up to Java and hence into your demarshaller for polymorphic dispatch. So that's what I did in Vsync.

      Yes that's similar to what I was thinking.

      > ...I've done all steps of this at one time or another.

      An interesting idea. I think this would require modifying/hooking into the jvm (jit)...which would be jvm impl dependent...but I'll look into that.

      >The core issue here is that even with such steps you may end up with some copying: Derecho is gradually moving towards scatter gather of arguments that have a memory representation the RDMA NIC can handle, and that are in pinned memory (Mellanox recently relaxed this need, but LibFabrics still imposes it). So potentially, you go to all this trouble but even so, we can't easily avoid copying. If someone will generate a marshalled byte array, it might as well be you. So that's the tradeoff.

      I understand. Question: Do you think that LibFabrics is going to relax the pinning? Or is this going to be a permanent constraint for Derecho? My thinking has been that zero-copy is the goal, but min-copy with dynamic types might be worth it (for dev usability).

      >So then we get to yet a fifth (sixth?) idea. You could malloc a big buffer in memory, RDMA registered and pinned, and now your Java code could marshal to that or even somehow generate data directly in it. Then Derecho's raw mode kicks in (or we can get it to kick in) and we skip the C++ marshalling or scatter/gather question entirely (we still need the generic stub for byte arrays, but that's easy). Now we would not need a JIT step, and we get a pretty simple pathway -- but we've lost the elegance of the API, which leverages polymorphism quite extensively now. So you end up forced to reimpose the API but as part of the OSGi Java infrastructure via some form of Java preprocessor, presumably.

      I like this thought. WRT the elegance of the API...I and others could help create a Derecho Java API...OS of course...and perhaps even attempt to standardize the API into the JVM. But it would have type safety, dynamic classloading, etc and could be optimized via native impls.

      I would be happy to chat via skype. Next couple of days are busy for me but would be my pleasure.

    7. I'm not sure about the pinning. Weijia may have a better sense of how this will play out. The problem isn't seen on HPC systems, where LibFabrics was born, because they tend to disable paging and run with direct page mapping ("bare metal") for the utmost speed. Virtualization has a price and HPC people don't pay prices for things they don't use.

      So this entire question is seen only on paged platforms. True full virtualization, VM style, is probably not feasible with RDMA. So that is most likely off the table.

      The middle ground is something like Mesos or Linux, not virtualized, but with processes that have mapped address spaces. For this case, Mellanox (the majority vendor) now can handle the mapping and can even generate a page fault if needed. No pinning required, but only for ConnectX-3 and above, and only with the new firmware update released last December. You still see some folks with ConnectX-2, and you also see lots of people who didn't upgrade the firmware. So that's an issue even for Mellanox.

      As for LibFabrics, it wants to map to Intel DirectPath (IODirect) and other non-RDMA technologies from the GenZ consortium, and my guess is that those will mostly want pinning and registration. People hack this by pinning and registering the whole of memory, but that causes the NIC to slow down (potentially) due to overloading the on-board cache the NIC itself uses to avoid consulting PTEs constantly, over in host DRAM.

      So my short answer is "no". And my long answer is: "that depends".

      I like the Java API offer! We accept...

    8. So when should we do this call? Are there other people who we should include?

  5. I have the hardware... like the switches and the Chelsio cards and also some Mellanox cards... all of which support RDMA over Ethernet, not InfiniBand... and am willing to help (not a high-end programmer, but can provide quality assurance, testing, test lab, etc.).

    How can I help, and whom can I get in touch with?

    https://twitter.com/apscomp

    Replies
    1. What would be most valuable is to find courageous early users who are interested in the idea of smart memory for vision tasks, know C++ and are researchers in computer vision, and would like to explore moving time-critical aspects into a smart memory based on Derecho.


This blog is inactive as of early in 2020. Comments have been disabled, and will be rejected as spam.
