We're fast approaching a new era of online machine learning and online machine-learned behavior: platforms that will capture real-time data streams at high data rates, process the information instantly, and then act on the output of the processing stage. A good example would be a smart highway that tells cars what to expect around the next curve, or a real-time television feed integrated with social networking tools (like a World Cup soccer broadcast that lets viewers control camera angles while their friends join from a remote location).
If you ask how such systems will need to be structured, part of the story is familiar: they will be cloud-hosted services that capture data from sensors of various kinds (I'm including video cameras here), crunch on it, then initiate actions.
A great example of such a service is Microsoft's Cosmos data farm. This isn't a research platform but there have been talks on it at various forums. The company organized a large number of storage and compute nodes (hundreds of thousands) into a data warehouse that absorbs incoming objects, then replicates them onto a few SSD storage units (generally three replicas per object, with the replication pattern fairly randomized to smooth loads, but in such a way that the replicas are on fault-independent machines: ones that live in different parts of the warehouse and are therefore unlikely to crash simultaneously).
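To make the placement rule concrete, here is a minimal sketch of randomized, fault-independent replica placement. It is not Cosmos code: the function `place_replicas`, the idea of using a rack as the fault domain, and the node names are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def place_replicas(nodes, fault_domain_of, copies=3, rng=random):
    """Pick `copies` nodes so that no two share a fault domain.

    `fault_domain_of` maps a node name to its fault domain (hypothetical:
    here a rack, but it could be a power circuit or a switch).
    """
    by_domain = defaultdict(list)
    for node in nodes:
        by_domain[fault_domain_of[node]].append(node)
    if len(by_domain) < copies:
        raise ValueError("not enough fault-independent domains")
    # Randomize which domains are used, then which node inside each domain,
    # so that load spreads out instead of piling onto a few machines.
    domains = rng.sample(sorted(by_domain), copies)
    return [rng.choice(by_domain[d]) for d in domains]

# Hypothetical warehouse: 4 racks with 3 nodes each.
fault_domain_of = {f"node-{r}-{i}": f"rack-{r}" for r in range(4) for i in range(3)}
replicas = place_replicas(list(fault_domain_of), fault_domain_of)
print(replicas)
```

Each call returns three nodes on three different racks, so a single rack failure can take out at most one replica of any object.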
Once the data is securely stored, Cosmos computes on it: it might compress or resize an image or video, or it could deduplicate, or run an image segmentation program. This yields intermediary results, which it also replicates, stores, and then might further process: perhaps, given the segmented image, it could run a face recognition program to automatically tag people in a photo. Eventually, the useful data is served back to the front-end for actions (like the smart highway that tells cars what to do). Cold but valuable data is stored to a massive backend storage system, like Microsoft's Pelican shingled storage server. Unneeded intermediary data is deleted to make room for new inputs.
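The store, transform, store again, then clean up pattern can be sketched in a few lines. This is a toy under stated assumptions: the class `StagedStore` and its methods are hypothetical names, an in-memory dict stands in for replicated storage, and plain strings stand in for images.

```python
class StagedStore:
    """Toy stand-in for a store that keeps inputs and intermediate results."""

    def __init__(self):
        self.objects = {}  # key -> stored value

    def put(self, key, value):
        self.objects[key] = value

    def run_stage(self, src_key, dst_key, transform):
        # Read a stored object, transform it, and store the result as a
        # new object so that later stages can build on it.
        self.put(dst_key, transform(self.objects[src_key]))

    def discard(self, key):
        # Drop an intermediate result that nothing downstream needs.
        self.objects.pop(key, None)

KNOWN_PEOPLE = {"Alice", "Bob"}  # hypothetical stand-in for face recognition

store = StagedStore()
store.put("photo/raw", "  A Photo Of Alice And Bob  ")
store.run_stage("photo/raw", "photo/normalized", str.strip)
store.run_stage("photo/normalized", "photo/tags",
                lambda text: [w for w in text.split() if w in KNOWN_PEOPLE])
store.discard("photo/normalized")  # make room for new inputs
print(store.objects["photo/tags"])
```

The point of the sketch is only the shape of the computation: every stage reads from storage and writes back to it, and cleanup is an explicit step.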
Thus Cosmos has a lot in common with Spark, the famous data processing platform, except on a much larger scale, and with an emphasis on a pipeline of transformations rather than on MapReduce.
If we step way back, we can start to perceive Cosmos as an example of a smart memory system: it remembers things, and also can think about them (process them), and could potentially query them too, although as far as I know Cosmos and Spark have limited interactive database query functionality. But you could easily imagine a massive storage system of this kind with an SQL front-end, and with some form of internal schema, dynamically managed secondary index structures, etc.
With such a capability, a program could interrogate the memory even as new data is received and stored into it. With Cornell's Derecho system, the data capture and storage steps can be an asynchronous pipeline that would still guarantee consistency. Then, because Derecho stores data into version vectors, queries can run asynchronously too, by accessing specific versions or data at specific times. It seems to me that the temporal style of indexing is particularly powerful.
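Here is a small sketch of what temporal indexing over versioned data might look like. It is not Derecho's API: the class `VersionedObject` and its methods are hypothetical, and the sketch is single-threaded and in-memory, so it shows only the lookup logic, not the asynchronous pipeline or the consistency guarantees.

```python
import bisect

class VersionedObject:
    """Append-only history of one object, queryable by version or by time."""

    def __init__(self):
        self._times = []   # timestamps, kept in non-decreasing order
        self._values = []  # _values[i] is the state as of version i

    def append(self, timestamp, value):
        if self._times and timestamp < self._times[-1]:
            raise ValueError("timestamps must be non-decreasing")
        self._times.append(timestamp)
        self._values.append(value)
        return len(self._values) - 1  # the new version number

    def get_version(self, version):
        return self._values[version]

    def get_at(self, timestamp):
        # Latest version whose timestamp is <= the requested time.
        i = bisect.bisect_right(self._times, timestamp)
        return self._values[i - 1] if i > 0 else None

road = VersionedObject()
road.append(10.0, "clear")
road.append(20.0, "ice ahead")
print(road.get_at(15.0), "|", road.get_version(1))
```

Because old versions are never overwritten, a query for time 15.0 keeps returning the same answer no matter how many updates arrive afterward, which is what lets readers proceed without waiting for writers.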
The interesting mix here is massive parallelism, and massive amounts of storage, with strong consistency... and it is especially interesting that in Derecho, the data is all moved asynchronously using RDMA transfers. Nobody has to wait for anything, and queries can be done in parallel.
Tomorrow's machine learning systems will surely need this kind of smart memory, so for those of us working in systems, I would say that smart memory architectures jump out as a very exciting next topic to explore. How should such a system be organized, and what compute model should it support? As you've gathered, I think Cosmos is already pretty much on the right track (and that Derecho can support this, even better than Cosmos does). How can we query it? (Again, my comment about SQL wasn't haphazard: I would bet that we want this thing to look like a database.) How efficiently can it use the next generation of memory hardware: 3D XPoint, phase-change storage, high-density RAID-style SSD, RDMA and Gen-Z communication fabrics?
Check my web page in a few years: I'll let you know what we come up with at Cornell...
This blog is inactive as of early in 2020. Comments have been disabled, and will be rejected as spam.