Friday 24 November 2017

Can distributed systems have predictable delays?

I ran into a discussion thread about “deterministic networking” and got interested...

Here's the puzzle: suppose a system requires the lowest possible network latencies, the highest possible bandwidth, and steady timing.  What obstacles prevent end-to-end timing from being steady and predictable, and how might we overcome them?  What will be the cost?

To set context, it helps to realize that modern computing systems are fast because of asynchronous pipelines at every level.  The CPU runs in a pipelined way, performing branch prediction and speculative prefetch, and often executing multiple instructions concurrently, and all of this is asynchronous in the sense that each separate aspect runs independently of the others, as much as it can.  Now and then barriers arise that prevent concurrency, of course.  Those unexpected barriers make the delays associated with such pipelines unpredictable.
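
To make this concrete, here is a minimal C++ sketch (mine, purely illustrative) that times the same loop over the same data twice: once while the branch inside it is unpredictable, and once after sorting has made it predictable.  The absolute numbers vary from machine to machine, which is rather the point.

    // Branch predictability demo: identical work, very different timing.
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    static long long sum_ge_128_ns(const std::vector<int>& v, long long& sum) {
        auto t0 = std::chrono::steady_clock::now();
        sum = 0;
        for (int x : v)
            if (x >= 128) sum += x;            // this branch is the whole story
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    }

    int main() {
        std::vector<int> v(1 << 20);
        std::mt19937 rng(42);
        for (auto& x : v) x = rng() % 256;     // random: branch mispredicts often
        long long sum = 0;
        long long t = sum_ge_128_ns(v, sum);
        std::printf("random order: %lld ns\n", t);
        std::sort(v.begin(), v.end());         // sorted: branch becomes predictable
        t = sum_ge_128_ns(v, sum);
        std::printf("sorted order: %lld ns (sum %lld unchanged)\n", t, sum);
    }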

Same for the memory unit.  It runs asynchronously from the application, populating cache entries, updating or invalidating them, etc.  There is also a cache of memory translation data (the TLB), managed in an inverted lookup format that depends on a fast form of parallel lookup hardware for fast address translations.  With a cache hit, memory performance might be 20x better than with a miss.
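
A small sketch of that variance, assuming only that 64 MB exceeds your cache sizes: both calls below touch every element exactly once, yet the strided traversal defeats the cache (and the TLB) and can run many times slower.

    // Cache/TLB demo: the same number of memory touches, sequential vs strided.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    static long long touch_ns(const std::vector<int>& v, size_t stride) {
        long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t start = 0; start < stride; ++start)
            for (size_t i = start; i < v.size(); i += stride)
                sum += v[i];                   // every element visited exactly once
        auto t1 = std::chrono::steady_clock::now();
        std::printf("(sum %lld) ", sum);       // keep the compiler honest
        return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    }

    int main() {
        std::vector<int> v(1 << 24, 1);        // 64 MB: larger than typical caches
        std::printf("stride 1:    %lld ns\n", touch_ns(v, 1));
        std::printf("stride 4096: %lld ns\n", touch_ns(v, 4096));  // miss-heavy
    }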

Or consider reading input from a file.  The disk is asynchronous, with a buffer of pending operations between your application and the file system, another between the file system buffer area and the disk driver, and perhaps another inside the disk itself (a rotating disk or SSD often has a battery-backed DRAM cache to mask hardware access delays, and to soak up DMA transfers from main memory, which ideally occur at a higher speed than the SSD itself can sustain).  For a particular operation, if everything goes just right your performance could be 100x better than if luck runs against you.
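
A quick illustration, using a made-up file path: timing each individual read of a large file typically shows a huge spread between reads that some buffer along the path absorbed and reads that had to wait for the device.

    // Per-read latency spread; /tmp/bigfile.dat is a placeholder path.
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <fstream>
    #include <vector>

    int main() {
        std::ifstream f("/tmp/bigfile.dat", std::ios::binary);
        std::vector<char> buf(64 * 1024);
        std::vector<long long> ns;
        while (f) {
            auto t0 = std::chrono::steady_clock::now();
            f.read(buf.data(), static_cast<std::streamsize>(buf.size()));
            auto t1 = std::chrono::steady_clock::now();
            ns.push_back(std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
        }
        if (ns.empty()) { std::printf("could not read the file\n"); return 1; }
        auto [lo, hi] = std::minmax_element(ns.begin(), ns.end());
        std::printf("fastest read: %lld ns, slowest: %lld ns\n", *lo, *hi);
    }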

NUMA effects cause a lot of asynchronous behavior and nondeterminism.  Platforms hide the phenomenon by providing memory management software that will assign memory from a DRAM area close to your core in response to a malloc, but if a multithreaded application shares data internally and its threads run on different cores, beware: cross-core accesses can be very slow.  Simon Peter researched this, and ended up arguing that you should invariably make extra copies of your objects and use memcpy to rapidly move the bytes in these cross-core cases, so that each thread always accesses a local object.  His Barrelfish operating system is built around that model.  Researchers at MIT sped up Linux locking this way too, using “sloppy counters” and replicating inode structures.  In fact locking is a morass of unpredictable delays.
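
In rough outline, the copy-instead-of-share pattern looks like this sketch (my illustration, not Barrelfish's actual API): each thread copies the shared buffer into memory that it allocated and touched first, so a first-touch NUMA policy places the copy in nearby DRAM.

    // Copy-instead-of-share: each worker reads only its own local copy.
    #include <cstring>
    #include <thread>
    #include <vector>

    static void worker(const std::vector<double>& shared, double& out) {
        // Allocated and first touched by this thread, hence placed locally
        // under a first-touch NUMA policy.
        std::vector<double> local(shared.size());
        std::memcpy(local.data(), shared.data(), shared.size() * sizeof(double));
        double sum = 0;
        for (double x : local) sum += x;       // no cross-node traffic here
        out = sum;
    }

    int main() {
        std::vector<double> shared(1 << 20, 1.0);
        std::vector<double> results(4);
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back(worker, std::cref(shared), std::ref(results[i]));
        for (auto& t : workers) t.join();
    }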

Languages like C# and Java can hide some NUMA effects, but on the other hand they also do garbage collection and other memory compaction operations, which can be costly and hard to anticipate or control.  You can eliminate this with C++, but the standard libraries sometimes do copying in unexpected ways too, so you can’t just assume that everything is roses even with these lower-level languages.  These days, even C can have strange library-level delays.
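
One homely C++ example of such hidden copying: a std::vector that outgrows its capacity silently reallocates and moves every element, so one push_back out of thousands costs vastly more than the rest.

    // Hidden copying: most push_backs are cheap; the reallocating ones are not.
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> v;
        const int* old_data = v.data();
        for (int i = 0; i < 1000000; ++i) {
            v.push_back(i);
            if (v.data() != old_data) {        // capacity exhausted: all elements moved
                std::printf("reallocation at size %zu (capacity now %zu)\n",
                            v.size(), v.capacity());
                old_data = v.data();
            }
        }
    }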

The case I worry about is when activities with different priorities share a lock: it is easy to create a priority inversion in which urgent action A is stuck waiting for thread B, which holds the lock but is very low priority and has been preempted by medium-priority thread C.  This seems arcane as I describe it, but think of a NIC or an interrupt handler as thread A, your user code as thread B, and the garbage collector as thread C.  Voila: major problems are lurking if A and B share a lock that C's preemption can keep B from releasing.  Locks make concurrent coding easier, but wow, do they ever mess up timing!
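
Here is a stripped-down sketch of that scenario.  Real code would assign actual scheduler priorities (e.g., with pthread_setschedparam), which portable C++ cannot express, so the roles are only marked in comments and B's preemption by C is simulated with a sleep, but the shape of the stall is the same.

    // Priority inversion in miniature: A blocks on a lock held by B.
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    std::mutex shared_lock;          // shared by urgent A and low-priority B

    static void thread_B_low_priority() {         // e.g., your user code
        std::lock_guard<std::mutex> g(shared_lock);
        // Stand-in for being preempted by medium-priority C (e.g., the GC):
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }

    static void thread_A_high_priority() {        // e.g., NIC/interrupt handler
        auto t0 = std::chrono::steady_clock::now();
        std::lock_guard<std::mutex> g(shared_lock);   // stuck behind B
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("urgent thread stalled for %lld ms\n", (long long)ms);
    }

    int main() {
        std::thread b(thread_B_low_priority);
        std::this_thread::sleep_for(std::chrono::milliseconds(5));  // let B lock
        std::thread a(thread_A_high_priority);
        b.join();
        a.join();
    }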

In distributed systems, any form of remote interaction brings sources of nondeterministic timing.  First, on the remote side, some kind of task needs to wake up and run, so you have a scheduling effect.  But in addition to this, most networks use oversubscribed Clos structures, with spine and leaf routers or switches that can’t actually run at a full all-to-all data rate.  They congest and drop packets, send “slow down” messages, or mark traffic to warn the endpoints about excessive load.  So any kind of cross-traffic effect can cause variable performance.
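
You can see the shape of the congestion effect with a toy single-link queue (a simulation sketch, not any real switch): hold the link's capacity fixed and push the offered load toward it, and queueing delay, hence latency variance, explodes.

    // Toy single-link queue: as offered load nears capacity, delays blow up.
    #include <cstdio>
    #include <random>

    static double mean_wait(double utilization, int n, std::mt19937& rng) {
        // Service time is 1 unit, so arrival rate == utilization.
        std::exponential_distribution<double> interarrival(utilization);
        double next_free = 0, now = 0, total_wait = 0;
        for (int i = 0; i < n; ++i) {
            now += interarrival(rng);              // next packet arrives
            double start = now > next_free ? now : next_free;
            total_wait += start - now;             // time spent queued
            next_free = start + 1.0;               // one time unit to transmit
        }
        return total_wait / n;
    }

    int main() {
        std::mt19937 rng(1);
        for (double u : {0.5, 0.8, 0.95, 0.99})
            std::printf("load %.2f: mean queueing delay %.1f units\n",
                        u, mean_wait(u, 200000, rng));
    }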

Few of us have a way to control the placement of our application's tasks onto the nodes of the system.  So network latency can introduce nondeterminism simply because each launch of your application may end up with nodes P and Q at different distances from each other, measured in network hops.  In fact routing can also change, and tricks involving routing are surprisingly common.  Refreshing IP address leases and other kinds of dynamic host bindings can also glitch otherwise rock-steady performance.

If you use VPN security, encryption and decryption occur at the endpoints, adding delay.  There are cryptographic schemes that have constant cost, but many run faster or slower depending on the data you hand them, so more nondeterminism creeps in at that step.
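
For instance, here is the textbook contrast (a sketch, not production crypto): an early-exit byte comparison whose running time depends on where the first mismatch falls, next to a constant-time version that always scans every byte.

    // Data-dependent vs constant-time comparison.
    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    static bool leaky_equal(const unsigned char* a, const unsigned char* b, size_t n) {
        for (size_t i = 0; i < n; ++i)
            if (a[i] != b[i]) return false;    // timing leaks mismatch position
        return true;
    }

    static bool constant_time_equal(const unsigned char* a, const unsigned char* b, size_t n) {
        unsigned char diff = 0;
        for (size_t i = 0; i < n; ++i)
            diff |= static_cast<unsigned char>(a[i] ^ b[i]);  // always n steps
        return diff == 0;
    }

    int main() {
        unsigned char x[16], y[16];
        std::memset(x, 7, sizeof x);
        std::memset(y, 7, sizeof y);
        y[0] = 8;                              // mismatch in the very first byte
        std::printf("leaky=%d constant=%d\n",
                    (int)leaky_equal(x, y, 16), (int)constant_time_equal(x, y, 16));
    }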

Failure handling and any kind of system management activity can cause performance to vary in unexpected ways.  This would include automated data replication, copying, compaction and similar tasks on your file storage subsystem.

So... could we create a new operating system with genuinely deterministic timing?  Cool research question!  If I find the right PhD student, maybe I’ll let you know (but don’t hold your breath:  with research, timing can be unpredictable).

4 comments:

  1. Hi Professor Birman,

    I'm interested in trying out and possibly contributing to Derecho. In several of your 2017 blog posts you've mentioned Derecho 2...is that what's currently at https://github.com/Derecho-Project ? or is that still version 1? If 1, is there any expected timeline on v2 availability?

    FWIW, I designed and implemented a replicated object API in ~1999 based upon knowledge of Isis and virtual synchrony. It's interesting that Derecho is moving in those directions to simplify replication and distributed state synchronization.

    Thanks,

    Scott

  2. Hi Scott! The version on GitHub is the most current release, and it works quite well except for restarts from complete shutdowns. Edward has been rewriting the restart code (we migrated from a version of our code that maintained a single Paxos log to a much fancier approach using versioned state variables, but it requires changes to the cold restart logic). So yes, go for it. Just hold off on full shutdowns until Edward uploads this new version of the restart code. It will work for other purposes.

    We also should have a port to LibFabrics soon. That layer can run on TCP, RDMA, the Intel IODirect connector hardware, and other options. We find that SoftRoCE is just absurdly slow, way slower than it really should be. It pegs two cores at 100%, which is definitely not reasonable. LibFabrics seems to be twice as fast as SoftRoCE and doesn’t have this high background load. So that’s the path we are on.

    Let us know if you run into any problems... we’ll be happy to help.

    1. Hi Ken.

      Thanks very much for the info. I wasn't familiar with LibFabrics. Since I don't have access to RDMA hardware right now, I think I will wait for Derecho over LibFabrics. How can I know when it is available?

      In the meantime, a question: you mentioned in one of your blog posts that it would be difficult to use languages other than C++ with the Derecho replicated object API. Could you comment on why that would be?...e.g., relative to Python or Java, say.

      To share one thought: the OSGi (Java) concept of a service...or rather a remote service...is conceptually similar. OSGi services are plain ol' object instances that are managed by a local broker (aka the service registry) and accessed through interfaces.

      One thing that makes OSGi services different from Java objects is support for dynamics...i.e. OSGi services can come and go at runtime.

      Which brings me to another question: is it possible for replicated object instances to be created and destroyed at runtime within a Derecho process group?

      BTW, if there is some Derecho mailing list that is more appropriate than your blog for such questions, please just send me there.

      Scott

    2. I’ll create a Derecho discussion thread.

