How SCI coherence works

SCI tells how to send signals ("Interface") in a very application- and technology-independent way ("Scalable"). It also explains how to keep track of duplicate cached copies of data so that all the stale copies can get refreshed when someone changes the data, and how to do this in a system of unspecified size and shape, with an unspecified number of processors and I/O controllers running unspecified applications, in a way that's simple enough for the hardware to take care of it for you.

The basic idea is to keep track of every cache that comes to memory to get a particular chunk, or "cache line", of data. The memory tells each cache who its predecessor is. The fact that each cache knows its predecessor is sufficient to find all the cached copies, because those links form a list.
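
Here is a minimal sketch of that bookkeeping, in the spirit of the standard's executable C (the names MemLine, CacheLine, and fetch are invented for illustration; this is not the spec's actual code):

    #define NO_CACHE (-1)

    /* Memory keeps one pointer per cache line: the ID of the most
       recent cache to fetch that line. */
    typedef struct {
        int newest;             /* last requester, or NO_CACHE */
    } MemLine;

    /* Each cache remembers which cache fetched the line before it. */
    typedef struct {
        int predecessor;        /* previous requester, or NO_CACHE */
    } CacheLine;

    /* Cache 'id' fetches the line: memory's reply names the old
       newest member (the fetcher's predecessor), and memory then
       records the fetcher as the new newest member. */
    void fetch(MemLine *m, CacheLine *c, int id)
    {
        c->predecessor = m->newest;    /* link into the list */
        m->newest = id;                /* memory now points at us */
    }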

For reasons of efficiency, however, the SCI protocols do more than this. Each cache also talks to its predecessor, so the predecessor learns who its successor is; the result is a doubly-linked list.

This double-linking process also determines one unique cache in the list at any instant, called the "head", which is the only cache allowed to change the data. The unique one is the cache that has received "headship" in the response from its predecessor and has not yet passed "headship" along in its response to its successor. Headship is constrained by the SCI protocols to proceed unidirectionally along the list, in the same sequence that the caches originally went to memory, which is important for SCI's forward-progress guarantees that ensure nobody gets starved by accidents of timing.

The doubly-linked part of the list is called the "sharing list" for this cache line, and the singly-linked part is the "prepend queue".
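
A sketch of those two steps, again with invented names (attach back-links a cache to its predecessor; pass_headship moves headship one member along, in arrival order):

    #define NO_CACHE (-1)

    typedef struct {
        int predecessor;        /* toward earlier arrivals */
        int successor;          /* toward later arrivals, or NO_CACHE */
        int has_headship;       /* nonzero while this cache is head */
    } CacheLine;

    /* The newer cache contacts its predecessor, which records its
       successor; the pair is now doubly linked. */
    void attach(CacheLine caches[], int pred_id, int self_id)
    {
        caches[pred_id].successor = self_id;
        caches[self_id].predecessor = pred_id;
    }

    /* Headship moves strictly from each cache to its successor, in
       the same order the caches originally reached memory. */
    void pass_headship(CacheLine caches[], int head_id)
    {
        int next = caches[head_id].successor;
        caches[head_id].has_headship = 0;
        if (next != NO_CACHE)
            caches[next].has_headship = 1;  /* next may now modify */
    }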

If the head modifies the data, it has the duty to follow the sharing list and notify all those caches that their shared copy is now invalid. If their processors request the data again, each cache knows it doesn't have a valid copy, so it goes to memory to get an up-to-date one. Memory may know that it does not have a current copy; if so, memory tells the cache to get the data from its predecessor instead. But in any case, the memory tells the cache who its predecessor is, so the cache is automatically in the prepend queue. If it got the data from memory, it can use them, but it cannot modify them until it becomes head.
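
The invalidation walk might look roughly like this (a sketch under the same invented types; in SCI each step is really a packet exchange, with each member's reply naming the next one to visit):

    #define NO_CACHE (-1)

    typedef struct {
        int predecessor;        /* rest of the sharing list */
        int successor;
        int valid;              /* nonzero while the copy is usable */
    } CacheLine;

    /* The head tells each sharing-list member in turn to discard
       its copy, until the list is empty except for the head. */
    void purge_list(CacheLine caches[], int head_id)
    {
        int id = caches[head_id].predecessor;
        while (id != NO_CACHE) {
            int next = caches[id].predecessor;
            caches[id].valid = 0;            /* copy is now stale */
            caches[id].predecessor = NO_CACHE;
            caches[id].successor = NO_CACHE;
            id = next;
        }
        caches[head_id].predecessor = NO_CACHE;  /* head stands alone */
    }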

Note that a cache only goes to memory once, so this process is distributed and does not contribute to congestion at the memory. (Another trip to memory does happen occasionally, when modified data are written back either as a logical necessity or to optimize efficiency.)

Note also that the sharing list is chopped to zero every time someone changes the data, so these lists are short (and efficient) unless data are widely shared and rarely modified, in which case the inefficiency of a long list doesn't matter because it is rarely traversed (only when the data change).

The base SCI protocols are "invalidation" protocols, discarding old copies instead of updating them. This avoids the inefficiency of updating copies for some processors that no longer care about those data, and keeps the sharing lists active and efficient.

If a sharing processor later wants to modify the data, it leaves the sharing list by telling its neighbors to point to each other, discards its shared copy, goes to memory to get a new copy (which might be different, due to the passage of time), and waits until it becomes head before it modifies the data. (It might not really have the current data until the instant it becomes head, if memory's copy was not current.)
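
Leaving the list is the familiar doubly-linked-list deletion, sketched below (invented names again; in SCI the two pointer updates are requests sent to the two neighbors):

    #define NO_CACHE (-1)

    typedef struct {
        int predecessor;
        int successor;
        int valid;
    } CacheLine;

    /* A member leaves by telling its neighbors to point to each
       other, then discarding its shared copy. */
    void leave_list(CacheLine caches[], int self_id)
    {
        CacheLine *self = &caches[self_id];
        if (self->predecessor != NO_CACHE)
            caches[self->predecessor].successor = self->successor;
        if (self->successor != NO_CACHE)
            caches[self->successor].predecessor = self->predecessor;
        self->predecessor = self->successor = NO_CACHE;
        self->valid = 0;        /* shared copy discarded */
    }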

SCI also has some nice optimizations, like pairwise sharing and QOLB (Queue on Lock Bit), which can greatly improve efficiency in some situations. Pairwise sharing lets two caches that see they are the only ones in a sharing list simply pass data back and forth without going back to memory each time. If others join in, the behavior automatically goes back to normal. QOLB uses the list to pass sequentially used resources from one to the next without generating much synchronizing traffic in the interconnect.
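
The QOLB idea can be sketched in the same style: the sharing list itself serves as the queue of lock waiters, so releasing the lock is just a hand-off to the next member (has_lock and release_lock are invented names; the real mechanism rides on the coherence packets):

    #define NO_CACHE (-1)

    typedef struct {
        int successor;          /* next waiter in the list */
        int has_lock;           /* nonzero while this cache holds it */
    } CacheLine;

    /* Releasing the lock passes it directly to the next waiter,
       with no broadcast or polling traffic in the interconnect. */
    void release_lock(CacheLine caches[], int holder_id)
    {
        int next = caches[holder_id].successor;
        caches[holder_id].has_lock = 0;
        if (next != NO_CACHE)
            caches[next].has_lock = 1;
    }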

So that's the story on coherence. It's not so very complicated, but it is important to get the details absolutely right. After all, these lists are distributed shared data structures that are being dynamically updated by multiple processors' interface hardware, and at any instant processors may be leaving the list because they need to reuse a cache entry for a different line or they want to become head again to modify the data, and you have to be sure that the system can recover from any error that might happen in any stage of this process.

So the SCI standard defines the coherence behavior precisely in executable C code, which shows exactly what a cache or memory controller does with any packet. That makes the spec seem hard to understand at first, but it's actually far simpler to use than a spec that doesn't detail all the corner cases, or has implementation bugs due to various interpretations of the English, or can't handle realistic environments in a robust industrial-strength way, or contains deadlock hazards.
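
To give the flavor of that style (this is not the standard's actual code, just an invented miniature): each controller's behavior is an ordinary C function of its state and the incoming packet, so there is no ambiguity about any case.

    typedef enum { REQ_FETCH, REQ_ATTACH, REQ_PURGE } PacketType;

    typedef struct {
        PacketType type;
        int source;             /* requesting cache's ID */
    } Packet;

    typedef struct {
        int newest;             /* most recent requester */
    } MemLine;

    /* Memory's response to a packet is fully determined by this
       function; the real spec enumerates every case like this. */
    void memory_receive(MemLine *m, const Packet *p, Packet *reply)
    {
        switch (p->type) {
        case REQ_FETCH:         /* reply names the predecessor */
            reply->source = m->newest;
            m->newest = p->source;
            break;
        default:                /* all other cases spelled out too */
            break;
        }
    }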

The SCI coherence protocols have been beaten on by many people now, from mathematical correctness folks (still a little beyond the state of the art, unfortunately) to manufacturers. I doubt there are any coherence protocols in existence that have been tested more thoroughly in simulation (companies' survival depends on getting this stuff right!), and now Convex Computer is shipping commercial supercomputers that use SCI coherence, and many other companies are quietly designing future machines that rely on SCI.

Note that the coherence protocols merely communicate via packets that are sent by one device to another, just like any other communication in SCI. Thus bridges don't have to know anything about coherence; they just check packet addresses and forward them or not, as always. There are no race conditions or critical timings--when a device gets a packet, it examines it when it's ready, and does the appropriate thing. Preferably this should be fast, of course, but nothing breaks if it isn't.
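
That is the whole of a bridge's coherence involvement, which might be sketched as nothing more than an address-range check (far_base and far_limit are invented names):

    /* Forward a packet iff its address lies on the far side of the
       bridge; coherence state never enters into the decision. */
    int bridge_should_forward(unsigned long addr,
                              unsigned long far_base,
                              unsigned long far_limit)
    {
        return addr >= far_base && addr < far_limit;
    }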

This is an enormous simplification compared to the usual coherence mechanisms today, which rely on every processor observing all transactions on a shared bus. In that scheme, the slowest controller sets the speed of the system, so increasingly heroic measures are needed to get the performance up as processor speeds increase. (Things like dual cache directories, or multiple write buffers with their own snoop logic, are often used.) Bridges between snooping-coherent buses are very expensive, as the bridge has to stand in for all the cache activity of the other side--it's intimately involved with coherence. And with more than two buses the problems caused by snooping get much worse.
