Wednesday 28 December 2016

Next generation memory technologies

During the past few months I've had time to talk to Dave Patterson (Berkeley), Dave Cohen (Intel) and Pankaj Mehra (Sandisk/WD) about the next generation memory hierarchy.  This blog seems like a good way to organize what I'm learning and to jot down a few thoughts.

First, we're about to see another major technology dislocation, similar to that associated with RDMA (and actually it will include an RDMA dimension).  Here's why I say this:
  • There are a LOT of new technologies in the pipeline, including 3-D XPoint memory, much larger SSD (NAND) storage systems with battery-backed DRAM caches, phase-change memory, and probably more.  For those unfamiliar with the terms, normal computer memories are usually referred to as SRAM (the super fast kind used for registers and caches) and DRAM (main memory storage).
  • They have very different properties.  3-D XPoint is a persistent memory technology that can run nearly as fast as DRAM when evaluated purely in terms of throughput, and with only a very small access delay (a few clock cycles more than for DRAM).  It offers full byte addressability and byte-by-byte atomicity, and it can keep up with RDMA, so we should be able to direct RDMA data streams directly into 3-D XPoint persistent storage (a minimal sketch of what persisting such a write involves appears just after this list).  The issue for 3-D XPoint is likely to be its initial price: until volume becomes very large, this will be a fairly costly form of storage, closer in price to DRAM than to SSD/NAND flash.
  • Battery-backed flash (SSD/NAND) has been around for a while.  It takes an SSD and puts a large DRAM cache in front of it, with a control processor and a battery to keep the unit alive even if the host machine crashes.  The idea is that once you've written to the DRAM cache (which you can do at full speed, even using RDMA), you don't have to worry about persistence: if power is lost or the host crashes, the little controller finishes any pending transfers.  But because NAND has much lower transfer rates, at RDMA speeds you can easily fill the entire cache before the system has time to DMA the data onto the back-end flash, so any high-speed solution will need to either limit total transfer rates or stripe data across a bank of these units (a back-of-the-envelope calculation appears below).
  • Rotational disk is quite slow by the standards of either of these technologies, but with new recording techniques ("shingling") it offers some benefits that neither of the others possesses: it has really massive capacity, and it doesn't suffer from write wear-out, a notorious problem for flash (if you rewrite a block too many times, the block degrades and can no longer reliably record data).
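To make the persistence point a bit more concrete, here is a minimal sketch (in C, and purely illustrative: the mapping of the device into the address space is an assumption on my part, not any vendor's actual API) of what it takes to make a byte-addressable write durable on an x86 machine.  The key idea is that a store isn't durable until the cache line holding it has been flushed out to the device and the flushes have been fenced:

```c
/* Minimal sketch (not any vendor's API): persisting a byte-addressable write
 * on x86, assuming a 3-D XPoint-style device has been mapped into the address
 * space.  A store is not durable until the cache line holding it has been
 * flushed to the device and the flush has been fenced. */
#include <stdint.h>
#include <string.h>
#include <immintrin.h>

#define CACHE_LINE 64

static void persist_range(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);   /* push each dirty line out to the device */
    _mm_sfence();                       /* order the flushes before we claim "durable" */
}

/* usage: write a record, then make it durable before acknowledging it */
void log_append(void *pmem_dst, const void *rec, size_t len)
{
    memcpy(pmem_dst, rec, len);
    persist_range(pmem_dst, len);
}
```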
I didn't list phase-change memory here because I don't know as much about it, but people tell me that it has huge potential as well.  I mentioned that 3-D XPoint will initially be fairly expensive (unless companies like Intel decide to price it low to push volumes up quickly).  In contrast, SSD and rotational disk remain quite cheap, with SSD relatively more expensive than rotational disk, but quite a bit cheaper than 3-D XPoint.  On the other hand, if you put a battery-backed cache and an ARM co-processor on your SSD store, it becomes a much more costly storage solution than it would be if you had one battery-backed DRAM cache for the whole system and shared it across your fleet of SSD disks.
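Here's the back-of-the-envelope calculation promised above for the battery-backed DRAM/NAND case.  All of the numbers are round figures I'm assuming for illustration, not measurements of any particular product, but they show why the cache fills up in a second or two unless you throttle or stripe:

```c
/* Back-of-the-envelope sketch: how quickly does an RDMA stream fill the
 * battery-backed DRAM cache in front of a NAND device?  All numbers are
 * illustrative assumptions, not measurements of any particular product. */
#include <stdio.h>

int main(void)
{
    double cache_gb   = 16.0;   /* assumed DRAM cache size            */
    double ingest_gbs = 12.5;   /* ~100 Gbit/s RDMA ingest, in GB/s   */
    double drain_gbs  = 1.0;    /* assumed sustained NAND write rate  */

    double fill_seconds = cache_gb / (ingest_gbs - drain_gbs);
    printf("cache fills in about %.1f seconds\n", fill_seconds);  /* ~1.4 s */

    /* striping across N units raises the effective drain rate to N * drain_gbs */
    return 0;
}
```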

We need to think of these memory technologies in the context of other technologies, such as:
  • RDMA, mentioned above.
  • NUMA hardware: with modern multicore systems, a memory bank can be close to a given core or further away.  It is very important to ensure that your memory blocks are allocated from nearby memory; otherwise every access incurs a large penalty (see the allocation sketch just after this list).
  • NetFPGA and GPU coprocessors: increasingly popular and able to run at line rate, but only if the pipeline is very carefully crafted.
  • Other forms of data capture and data visualization hardware that might use DRAM as their main way to talk to the host (your display already does this, but we'll see more and more use of 3-D visualization technologies, novel forms of video recording devices and new interactive technologies).
  • Various forms of caches and pipelines: each technology tends to have its own tricks for keeping its compute engine busy, and those tricks invariably center on pipelining, prefetching, data prediction, etc.  If we move data around, we may move it from or into memory that some device is actively computing on, and then we might either see the "wrong" data (e.g. if a specialized compute engine signals that it has finished, but its results are still in a pipeline being written to the output memory), or start the specialized compute engine prematurely.  In fact we even see this with our battery-backed DRAM/NAND mix, or with 3-D XPoint, where it takes a few clock cycles before data written to the device has really become persistent within it.
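As a concrete illustration of the NUMA point above, here is a small sketch using Linux's libnuma (link with -lnuma); the buffer size is arbitrary, and the program simply demonstrates placing an allocation on the memory node local to the core that will touch it:

```c
/* Sketch of NUMA-aware allocation on Linux using libnuma (link with -lnuma).
 * The idea: allocate the buffer on the memory node of the core that will
 * touch it, rather than wherever malloc happens to place it. */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    int cpu  = sched_getcpu();            /* core this thread is running on */
    int node = numa_node_of_cpu(cpu);     /* its local memory node          */

    size_t len = 1 << 26;                 /* 64 MB working buffer           */
    void *buf  = numa_alloc_onnode(len, node);
    if (buf == NULL) {
        perror("numa_alloc_onnode");
        return 1;
    }

    printf("allocated %zu bytes on node %d (cpu %d)\n", len, node, cpu);
    numa_free(buf, len);
    return 0;
}
```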
Some of these technologies are block oriented: the main processor cache operates in terms of blocks that we call "cache lines", an SSD is normally structured into blocks of data, and a rotating disk invariably is.  Others, like 3-D XPoint and DRAM, are byte addressable.

So we have quite a range here of speeds, price points, and addressability models.  A further complication: to the extent that these technologies have pipelines, caches, or other internal memory structures, there isn't any standard way to flush them.  Thus when I receive an RDMA transfer into my machine and the RDMA NIC interrupts to say "data received", it isn't always obvious when, or how, it is safe to use that data.  RDMA vendors ensure that CPU access to the received data will operate correctly: I can do any kind of computation on it that I want, or initiate a new outbound RDMA.  But if I want to turn around and ask my GPU to compute on the received data, it isn't obvious how to ensure that the GPU subsystem hasn't cached some copy of its own, one that the RDMA transfer just overwrote and hence invalidated; and if I can't flush those caches, I have no good way to be sure the GPU will "see" my new bytes and not the old ones!
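To pin down the ordering problem, here is a pseudocode-style sketch.  Every function in it is a hypothetical placeholder, which is really the point: the invalidation step in the middle is the one for which no standard, portable API exists today.

```c
/* Pseudocode-style C sketch of the ordering problem described above.
 * Every function here is a hypothetical placeholder -- the whole point of
 * the paragraph is that no standard, portable flush/invalidate API exists. */
#include <stddef.h>

/* hypothetical placeholders, not real library calls */
extern void rdma_wait_for_completion(void);                 /* NIC says "data received"     */
extern void gpu_invalidate_cached_range(void *p, size_t n); /* drop any stale GPU-side copy */
extern void gpu_launch_kernel(void *p, size_t n);           /* start the accelerator        */

void hand_received_buffer_to_gpu(void *buf, size_t len)
{
    rdma_wait_for_completion();              /* the CPU's view of buf is now coherent */

    /* Without this step, the GPU might still "see" bytes it cached before the
     * RDMA transfer overwrote them -- and there is no standard, vendor-neutral
     * way to express it, which is exactly the gap being described. */
    gpu_invalidate_cached_range(buf, len);

    gpu_launch_kernel(buf, len);             /* only now is it safe to compute on buf */
}
```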

A further puzzle is that in a rack-scale setting with NUMA machines, memory management potentially becomes very complex.  RDMA makes it pretty much as cheap to pull in data from other machines as from my own: after all, the RDMA network runs as fast as, or faster than, the DMA transfer from an SSD or a rotational disk.  So the entire cluster potentially shares a huge pool of persisted memory, subject only to a small latency for accessing the first byte, provided that my transfers are large.  With byte addressing, though, local DRAM is vastly faster than remote 3-D XPoint accessed over RDMA: even if the access is handled entirely in hardware, we can't eliminate those microseconds of setup delay.
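Some illustrative arithmetic (again, assumed round numbers for the setup latency and bandwidth, not benchmarks) shows why large transfers amortize a setup delay that a byte-granularity remote access cannot escape:

```c
/* Illustrative arithmetic only (the latency and bandwidth are assumed round
 * numbers, not benchmarks): why big RDMA transfers amortize the setup delay
 * that a byte-granularity remote access cannot escape. */
#include <stdio.h>

int main(void)
{
    double setup_us = 3.0;    /* assumed one-sided RDMA setup latency, microseconds */
    double bw_gbps  = 12.0;   /* assumed RDMA bandwidth, GB/s                       */

    double sizes_b[] = { 64, 4096, 65536, 1048576, 67108864 };  /* 64 B ... 64 MB */
    for (int i = 0; i < 5; i++) {
        double xfer_us  = sizes_b[i] / (bw_gbps * 1e3);  /* bytes / (bytes per microsecond) */
        double total_us = setup_us + xfer_us;
        printf("%10.0f B: %10.2f us total (%.1f%% of it is setup)\n",
               sizes_b[i], total_us, 100.0 * setup_us / total_us);
    }
    return 0;
}
```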

Here's another thought: suppose that we shift perspective and ask ourselves what the application itself is likely to be doing with all of this persisted storage.  I like to use two examples to shape my thinking.  The first is that of self-driving cars sharing a huge world model that they read and update (the world model would include descriptions of the cars and their "plans" but also of everything they encounter and might need to chat about: other vehicles that aren't using the same self-driving technologies, pedestrians, transient road conditions like black ice or potholes, etc).  And my second example is that of a bank logging financial transactions.

Here one instantly notices that there would be a need to keep fresh data easily accessible at low delays, but that older data could either be garbage collected or archived.  So beyond the need to manage memory even for immediate uses, there is a broader need to think about "long term data curation".  In the case of the bank, protection becomes an issue too: both to protect secrets and also to prevent tampering.
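Just to make the curation idea concrete for the bank-log example, here is a minimal sketch of a tiered placement policy; the tiers, the age thresholds and the move_to_archive helper are all hypothetical choices of mine, intended only to show the shape of the problem:

```c
/* Minimal sketch of the curation idea for the bank-log example: keep recent
 * records in fast persistent memory, migrate older ones to cheaper bulk
 * storage.  The tiers, thresholds, and move_to_archive() helper are all
 * hypothetical -- this just makes the policy concrete. */
#include <stdint.h>
#include <time.h>

enum tier { TIER_PMEM, TIER_SSD, TIER_DISK_ARCHIVE };

struct log_record {
    uint64_t  txn_id;
    time_t    created;
    enum tier where;
    /* ... payload, checksum, tamper-evidence hash chain, etc. ... */
};

/* hypothetical data-movement helper */
extern void move_to_archive(struct log_record *r, enum tier dst);

void curate(struct log_record *r, time_t now)
{
    double age_days = difftime(now, r->created) / 86400.0;

    if (r->where == TIER_PMEM && age_days > 1.0)        /* assumed policy: hot for a day */
        move_to_archive(r, TIER_SSD);
    else if (r->where == TIER_SSD && age_days > 90.0)   /* then 90 days on flash         */
        move_to_archive(r, TIER_DISK_ARCHIVE);          /* then cheap shingled disk      */
}
```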

I don't have a magic answer here (nobody does), but I do think the picture highlights a huge opportunity for research in the operating systems and storage communities.  Honestly, this looks to me like a situation in which we need a whole new operating system specialized around just these questions: what the proper abstractions should be, how one would program against them, and how the technologies enable us to do so (or prevent us from doing so).

Then we should ask about new opportunities.  The self-driving car example points to a deep integration of storage with machine-learning applications.  What are the access patterns that arise in such cases, and how do they shape our caching needs?  What about replication?  We clearly will need to replicate data, both for fault-tolerance and for speedy local access in the case of byte-addressed data where low latency is key.  Get such questions right, and suddenly we open the door to new kinds of medical devices that capture, process and display imaging as an operation advances; get them wrong, and the same mix of hardware might be useless in the operating room!

I'll stop here: my blog entries are always way too long.  But what an exciting time to do research in systems... I'm jealous of the young researchers who will get a chance to really tackle these open questions.  And I'm looking forward to reading your SOSP papers once you answer them!
