Sunday afternoon, August 16, 1998, I paid $20 to attend a 4-hour tutorial on High-Speed I/O, part of the IEEE Hot Chips conference at Stanford University. The tutorial claimed that I/O systems have been badly neglected, and need to be improved a lot--hardly controversial!
This turned out to be an incredibly good marketing presentation for SCI, presented by a team from Silicon Graphics. It was almost perfect, except for one little flaw that I'll mention later.
It justified using SCI/LAMP (Scalable Coherent Interface/Local Area MultiProcessor) for I/O systems, for LANs, for a SAN (System Area Network), and for ccNUMA (cache coherent Non-Uniform Memory Access--i.e., some memory is faster than other memory, but all memory works correctly, as expected, even when data is cached to hide the latency).
The tutorial became very annoying in a few places, where they listed all kinds of I/O systems and buses and showed their performance on graphs, and made statements about problems and requirements that were a perfect advertisement for SCI, except that (here's the one little flaw): they never ever once mentioned SCI anywhere, not even the version of SCI (GigaRing) that SGI's own Cray Research uses as the I/O system of all its computers!
That would have been bad enough, but perhaps excusable because the set of presentations was plainly intended as marketing for SGI (though that seems to me a violation of the spirit and tradition, if not any actual regulation, of the Hot Chips Conference). You can't expect a marketing presentation to mention a competitor. But it got worse.
After their chip designer's presentation I was one of the questioners at the microphone in the aisle. I pointed out that there was a common interest in standards between this tutorial and yesterday's high speed serial links tutorial, whose presenters pleaded for standardization of links, and that there is already an ANSI and international standard that seems to meet their requirements, namely the Scalable Coherent Interface.
SCI would have lower latency, the same signaling rate, fewer signal wires, much longer distance limits on fiber optic serial links (though shorter distance between repeaters on copper cables). SGI saved about half a bit from the packet by removing support for coherence! So, had they considered SCI?
We all love standards, so very much that every young engineer wants to design his own, so I understood the emotion behind SGI's decision--in fact, I had attended the initial kick-off meeting that was supposed to be a "shoot-out" to select the technology SGI would adopt, and it was obvious that everyone else present was excited about and already committed to designing their own new standard, with its name (SuperHiPPI) based on their previous much slower one (HiPPI). There was no visible interest in using or adapting another standard that was already available, met most of their objectives even without modification, and could have been adapted to their applications either compatibly or with modifications. Human nature, normal engineering attitude, we all sympathize with "Not Invented Here."
The SGI presentation showed the disadvantage of traditional network protocols, with those heavy software costs, and listed early attempts at direct memory-to-memory transfers--but no mention of SCI. They listed Modcomp, DEC, HIPPI-MI, Ethernet, ATM, Fibre Channel, Token Ring, HIPPI-FP, etc.
They surveyed the history of System Area Networks (like SCI), including ServerNet, GigaNet, Myrinet, and even IEEE 1355 Heterogeneous Interconnect. Not SCI.
They had a lot of negative things to say about the maturity of FibreChannel, but they are using it anyway for their disk I/O interface. They mentioned IEEE 1394 SerialBus, part of the same architectural family with SCI, and think it might have a future in disk interfaces.
However, the absolute last straw for me, way over the top, was the final section, where they went into a series of statements about how standards are needed, and their new I/O system will soon become an ANSI standard, and people need to start thinking "scalable" in their future designs, and plan ahead more, and think of using ccNUMA (cache coherent NonUniform Memory Access, like SCI) etc etc etc.
So when the speaker concluded a few minutes later, I headed for the mike in the aisle to ask a question (turned out to be a comment).
I expressed great pleasure that the presentation would have been a nearly perfect advertisement for SCI (they even mentioned the performance advantages of directory-based cache coherence!), except for the fact that they never mentioned SCI once, a surprise since they bought Cray Research and presumably know about Cray's technology, and Cray uses SCI for its I/O (double width and half speed because of constraints related to needing to use a particular chip and process, but that's a minor deviation that can easily be bridge-connected to the standard).
And that I was surprised that they hadn't mentioned SCI anywhere, though it works very well--IBM uses a version of it inside its AS/400's and its new RS/6000 RIO I/O system. Yet they did mention FibreChannel, which they feel still has a long way to go, but they are using it anyway.
And SCI was standardized in good time, finished in 1992, with LVDS versions added that reach up to 8 GBytes/s/cable, well over what is asked for today, and that could be sped up by another factor of 2 easily or 4 or 8 with effort, not counting the arbitrary speedup possible using multiple links in parallel like Convex does.
And it's used by Data General for its ccNUMA interconnect. And it's used by Sequent as its ccNUMA interconnect. And Siemens Nixdorf uses SCI in its I/O system, to connect large numbers of PCI interface cards. And Scali uses SCI, and Sun uses SCI, for high performance clustering based on message passing. And there are 2 commercial switches for SCI. And PCI interfaces to SCI, too (now both single and dual).
The tutorial MC assured me that they had considered SCI, and it didn't meet their requirements, though he didn't know the details. I replied that I had told SGI's VP Forest Baskett about SCI at Hot Chips back in about 1991, and he said then that 1 GByte/s/processor just wasn't fast enough for SGI. But Convex made a brilliant leap, a breakthrough, to get around that--Convex simply used 4 SCI's in parallel to get the bandwidth they wanted in their ccNUMA supercomputer, worked fine.
It seems to me that SGI is determined to use an interface that they have designed, which they control, which eventually will be a standard--not something that is already a standard, where they would have to face the existence of competing suppliers.
There's nothing illegal about that; it's a tactic used by many of the larger companies, most of the ones that think they are dominant enough to pull it off.
In practice, that strategy results in 1-2 years of effective monopoly and an insurmountable head start over those that try to be compatible competitors. Prices can be much higher, and the customer has no practical alternative but to pay. When competition begins to appear, the leader's costs are already so low that the prices can be dropped low enough to make competitors unprofitable. With skill and luck, this can be prolonged to maximize the damage to those who dared invest in the competitor. This process can be repeated for each new technology, and if very successful it can eventually result in a much relaxed pace of innovation, reducing R&D costs, making the leader's profits higher still.
This strategy fails, however, if the "open standard" marketing hype is not believed by customers, so that they don't pile on to support it and stimulate further investment and growth. The customers may select an alternative, if one is near enough, which can turn this great strategy into a catastrophe instead.
I guess I'm surprised that SGI thinks it is still that dominant in this industry, but so what--not my business to second-guess them.
What I am upset about is seeing/hearing:
And through all this, never once to give the reader/pupil any hint to let them know SCI exists as an open ANSI, IEEE, and IEC standard in the very space under discussion, covering the whole gap they describe, is really way beyond the pale, way beyond what I consider to be normal marketing-hype levels likely to be seen at Hot Chips conferences.
(You will need a copy of the handout to get the most out of this. I cannot reproduce the slides here due to copyright considerations. However, copies can be purchased from the Hot Chips organizers, mailto:email@example.com (Diane Smith). Ask for the afternoon tutorial notes.)
I'll include page-slide references to the handouts as I summarize the high points. E.g., p9-s16 means slide 16 on handout page 9. Slide numbers start over in several places, but the combination of slide number and page number seems to be unambiguous.
In some cases I've augmented what is printed on the slide based on my notes or recollections of what was said during its presentation.
The presentation starts out by summarizing some of the I/O problems that modern systems have to deal with, supporting fast memory, fast graphics displays, and lots of I/O devices.
They point out that I/O interconnects need to be relatively stable, scaling over several generations of processor technology, to preserve I/O investment.
The presentation lists a lot of typical bus speeds, and shows a variety of typical system flows, pointing out the bandwidth bottlenecks.
p3-s5: Various nonexistent memory devices are postulated for 1999, but the slide omits the likely cost-performance leader, SLDRAM memory devices.
p4-s6: Processor System buses are listed up to R12000's 64-wide 1.6GB/s in 1998, but no mention of the fact that this is only about the same as one 16-wide SCI connection, which concurrently inputs 1GB/s and outputs 1GB/s. (Some manufacturers use multiple SCI links in parallel to get higher bandwidths, or one can use wider SCI links, up to 128-wide in ANSI/IEEE Std 1596.3-1996 LVDS.)
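The bandwidth comparison above is simple arithmetic, and a rough sketch makes it easy to check. It assumes every signal runs at 500 Mbaud (the rate quoted later for both GSN and SCI) and counts only raw data signals in one direction; the function name and defaults are illustrative, not taken from either standard.

```python
def link_bandwidth_gbytes(width_bits: int, mbaud_per_signal: float = 500.0) -> float:
    """Raw one-direction bandwidth in GBytes/s for a parallel link.

    Assumption: every data signal runs at the same baud rate and carries
    one bit per symbol (no encoding overhead).
    """
    return width_bits * mbaud_per_signal * 1e6 / 8 / 1e9

for width in (16, 128):
    print(f"{width:>3}-wide: {link_bandwidth_gbytes(width):.0f} GB/s each direction")
```

Under those assumptions a 16-wide link gives 1 GB/s per direction and a 128-wide link gives 8 GB/s, which agrees with the figures quoted in this report.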
p4-s7: The time line from 1990 to 1999 lists EISA, PCI, and AGP but no SCI.
Graphics flows--SCI's LVDS transceivers are being used in portable computers to connect the display to the system very flexibly, with high bandwidth. Not mentioned here.
p5-s9: I/O Devices lists Ethernet, SCSI up through LVDS-SCSI, and FibreChannel (100MB/s), but SCI is not mentioned. This is strange, because Cray Research uses SCI (GigaRing) for its I/O, and SGI ought to know most of Cray's secrets by now, because Cray is their wholly-owned subsidiary! In addition, Siemens Nixdorf uses SCI for its I/O, connecting very large numbers of PCI cards through SCI, and IBM uses SCI (RIO) for its I/O for new RS/6000 workstations.
They point out in p9-s16 that having separate paths to memory is useful for high bandwidth graphics, and in p9-s17 they point out that keeping memory and caches coherent requires a lot of resources (but it is essential!). Then in p10-s19 they show how directory-based coherence mechanisms can handle much higher performance levels than the traditional snooping coherence that is generally used today.
SGI's coherence directory is apparently in a proprietary association with the processors. No mention of SCI here, though SCI's standard directory-based coherence is used in commercial products by HP/Convex, Sequent, and Data General.
They list the bottlenecks in p11-s20 and p11-s21, and emphasize the importance of making the system scalable. In s20, they point out the importance of having coherent I/O buses.
Too bad they don't mention that SCI is the modern equivalent of Multiple Coherent I/O buses!
In p12-s22 they state that ideal systems have scalable memory bandwidth, scalable I/O bandwidth using multiple standard buses, directory-based coherence, and high bandwidth for graphics.
Wow, that's exactly what SCI is!!
Then in p13-s2, SGI proposes their solution to these problems, a network they call Gigabyte System Network, which they say is a standard.
(I have a problem with this--in the IEEE we are not allowed to call draft documents standards. Of course this one really will be an honest-to-God standard very soon, and I'm sure that's true because I've heard those words over and over in many contexts for years.)
In p14-s4, they summarize the motivations for scalable solutions, high bandwidth (1+ GB/s) and low latency (1-10 us), an easy perfect match for SCI.
In p15-s5, they show a typical 14TFLOPS system made of several servers, listing the various bandwidths needed, typically a few GB/s on each of several ports of each 8-24-processor server. Each of the flows listed is an easy target for SCI.
In p15-s6 and p16-s7, they explain why existing technology isn't good enough: mainly, packet sizes are too small (they apparently assume software has to be involved in handling each packet, true for most systems but not for SCI); there is a lack of flow control, causing congestion and packet loss; there is a lack of error control; and it is too complex, with too much protocol stack overhead (i.e., software is assumed to be involved). In particular, they are speaking of network technology now, mostly not mentioning buses anymore.
In p16-s8, they reveal their solution to all these problems--20 differential signals in each direction, 500Mbaud each. Just like SCI, except SCI uses 18 instead of 20.
But then they encode each signal with 4b/5b, which is a different tradeoff than SCI made--encoding adds latency at every chip boundary, because you have to gather 5 bits before you can decode the 4 and vice versa, so this is a 10ns overhead for encoding, at each chip boundary.
The benefit is longer copper cables, 40m compared to SCI's 10 or 20m. Their fiber link is 10-wide instead of 20-wide, and at twice the bit rate, same encoding. (However, the intrinsic encoding penalty for optical is thus only 5ns; unfortunately, much of the 5ns saved will be lost in the electrical/optical/electrical conversion delays.)
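The 10ns and 5ns figures follow directly from the bit time. A minimal sketch of that arithmetic, assuming (as the reasoning above does) that each signal must collect a full 5-bit code group before it can be decoded, and ignoring any additional logic or conversion delay; the function name is made up for illustration.

```python
def code_group_delay_ns(baud_per_signal: float, bits_per_group: int = 5) -> float:
    """Nanoseconds needed to gather one code group on a single serial signal."""
    bit_time_ns = 1e9 / baud_per_signal
    return bits_per_group * bit_time_ns

copper_ns = code_group_delay_ns(500e6)    # 20 signals at 500 Mbaud each
fiber_ns = code_group_delay_ns(1000e6)    # 10 signals at twice the bit rate

print(f"copper: {copper_ns:.0f} ns per chip boundary")
print(f"fiber:  {fiber_ns:.0f} ns per chip boundary")
```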
p17-s9 shows how they manage to handle large packets.
SGI's GSN handles long packets by redefining the word "packet" to be a message, up to 4GB long, carried in "micropackets" (which others, like SCI, would call packets) that solve the latency and flow control problems by being small, 32 bytes each.
I call this a "breakthrough by definition."
(SCI uses small packets in order to keep latency low and reduce blocking in packet switches--long packets are more tolerable in the high-latency world of circuit switching; listen to the debates between FibreChannel (very long packets) and ATM (very short packets) to see these issues illustrated. Long packets are fine in the traditional I/O model or networking model, where software latencies are enormous anyway; they are unsuitable for low latency applications like distributed shared memory, or even clustering.)
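To make the message/micropacket distinction concrete, here is a small sketch of cutting a message into 32-byte units. It models payload only; real micropackets also carry control information that is not described here, and the helper names are hypothetical. "4GB" is taken to mean 2^32 bytes, which is an assumption.

```python
MICROPACKET_BYTES = 32  # micropacket size quoted from the slide

def micropacket_count(message_bytes: int) -> int:
    """Number of 32-byte micropackets needed to carry a message (ceiling division)."""
    return -(-message_bytes // MICROPACKET_BYTES)

def segment(message: bytes) -> list[bytes]:
    """Split a message into micropacket-sized payload chunks; the last may be short."""
    return [message[i:i + MICROPACKET_BYTES]
            for i in range(0, len(message), MICROPACKET_BYTES)]

chunks = segment(bytes(100))
print(len(chunks), [len(c) for c in chunks])   # 4 chunks: 32, 32, 32, 4 bytes
print(micropacket_count(4 * 2**30))            # micropackets in a maximum-size message
```

A maximum-size message under that assumption is 2^27 (about 134 million) micropackets, which shows how far the redefined "packet" is from the unit the link actually switches.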
p17-s10 shows their credit-based flow control for micropackets. This method of flow control is not defined in the 1992 SCI standard, but applications could choose to get the same effect without violating the standard.
The reason SCI does not rely on credits is that they only work effectively with rather stable pairings of communicating nodes, which are allocating buffers for one another for an extended series of transmissions, again similar to the properties of a circuit-switch (almost-static flow pattern, very infrequent changes, slow to reconfigure) model of the world.
SCI, on the other hand, expects many series of transmissions to sometimes be interleaving at intermediate points, so its fundamental flow control has to handle this very dynamic "reconfiguration" efficiently. Where long transfers are used, nanosecond responses aren't required, so performance can be improved by higher-level mechanisms such as rate limiting at the data source.
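For readers unfamiliar with credit schemes, the following toy model shows the general idea: the sender holds one credit per free receive buffer and stalls when it runs out. This is a generic illustration, not GSN's actual mechanism; the class and method names are invented, and the round-trip delay of returning credits is ignored.

```python
from collections import deque

class CreditLink:
    """Toy model: the receiver grants one credit per free buffer slot."""

    def __init__(self, buffer_slots: int):
        self.credits = buffer_slots      # sender's view of free receiver buffers
        self.rx_buffer = deque()

    def send(self, micropacket) -> bool:
        """Transmit only if a credit is in hand; otherwise the sender must wait."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.rx_buffer.append(micropacket)
        return True

    def drain(self):
        """Receiver consumes one micropacket and returns a credit to the sender."""
        item = self.rx_buffer.popleft()
        self.credits += 1
        return item

link = CreditLink(buffer_slots=2)
results = [link.send(n) for n in range(3)]   # third send stalls: no credits left
link.drain()                                 # frees a slot, returns one credit
results.append(link.send(3))
print(results)                               # [True, True, False, True]
```

The buffers are dedicated to one sender for as long as the credits are outstanding, which is why the scheme suits stable pairings better than rapidly interleaving traffic.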
p18-s11 reveals a Scheduled Transfer protocol for GSN that has been adapted to handle the deep protocol stacks common to the world of Ethernet, ATM, FibreChannel, etc.
That is not a low-latency world.
p19-s13 shows how it is more efficient to move data from where it is to where you want it by DMA, without operating system intervention. As examples, they list Modcomp, DEC DR11b, DEC CI, VaxCluster, VIA, HIPPI-MI.
Strangely, they don't mention SCI, which does exactly this, by making pages of memory scattered throughout the system appear as part of a global memory image. With appropriate permission/protection, memory pages can be made available for sharing by all or selected others, for direct access by I/O devices or user programs. An elegant and efficient model.
p19-s14 discusses an SGI chip for GSN, and reveals a new wonder--the chip deskews data from the cables automatically! This is handled by a standard sequence sent every 10 us.
This is a really good idea! Which is why it is part of the ANSI/IEEE Std 1596-1992 SCI standard, and is used in some (mainly IBM's) but not all SCI interfaces.
p21-s15 describes an adapter chip that includes DMA engines and the micropacket buffers.
These buffers are not small, because they have to be able to store all the data that could be in flight in the worst-case cable. Since 256 micropackets can be in flight in a 1km cable, these buffers have to be able to handle 256*32, or 8kBytes, for each virtual channel, so 32kBytes at each input port of a chip that limits itself to handling 1km cables.
This is why SGI's links have to be limited to 1km for single-mode fiber, while serial SCI single-mode links were demonstrated at 40km.
SCI architects considered this kind of limit to be a scaling failure, and rejected such mechanisms.
Note too that this is not cheap memory--these buffers have to handle 1GByte/s both in and out concurrently.
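The buffer arithmetic above can be sketched as follows. The count of 4 virtual channels is an assumption inferred from the 8kByte and 32kByte figures, and the 40km line is an extrapolation that assumes in-flight data grows linearly with cable length; it is not a figure from the slides.

```python
MICROPACKET_BYTES = 32
IN_FLIGHT_PER_KM = 256      # micropackets in flight on a 1 km cable (from the text)
VIRTUAL_CHANNELS = 4        # assumption: inferred from 32 kBytes / 8 kBytes

def input_buffer_bytes(cable_km: float) -> int:
    """Worst-case buffering per input port, assuming linear scaling with length."""
    per_channel = int(IN_FLIGHT_PER_KM * cable_km) * MICROPACKET_BYTES
    return per_channel * VIRTUAL_CHANNELS

for km in (1, 40):
    print(f"{km:>2} km: {input_buffer_bytes(km) // 1024} KiB per input port")
```

Under these assumptions, matching the 40km distance demonstrated for serial SCI would take 40 times the buffering at every input port, which is the scaling problem described above.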
This slide also mentions storage-area networking, NUMA, Clustering, etc. All of which SCI is doing today.
p23-s2 points out trends in storage, with Network Attached Storage on the rise, and System Area Networks and Storage Area Networks.
p24-s4 lists ServerNet, GigaNet, and Myrinet as historical SANs, and IEEE 1355 as an emerging one.
No mention of SCI Local Area MultiProcessor's very attractive distributed I/O model.
p25-s5 points out that FibreChannel today is very limited, still lacks infrastructural support for many promised features. "Forward Marketing". Says the concept of ubiquitous connectivity between storage and compute platforms emerged with FC development.
No mention of SCI here, though SCI postulated this model from the beginning (1987) as the obvious way to go.
p25-s6 discusses merging SAN (System Area Network) and StAN (Storage Area Network). Big latencies don't matter for StAN, but are deadly for clusters etc.
That's why SCI was designed for low latency--necessary for some applications, and never hurts to have it. (However, this requirement does restrict your design model--you can't be using software protocol stacks if you want low latency!)
p26-s7, p26-s8 discuss the Virtual Interface Architecture. This is supported by Compaq, Intel, Microsoft, and is intended for clustering.
SCI fits VIA well.
p27-s9 discusses the I2O Intelligent I/O Architecture. The basic idea is to put lower protocol layers into silicon.
That is exactly what SCI does, of course--the network in SCI just looks like memory, and I/O devices also appear in memory space. So SCI's software is normal load and store instructions, with all of the network protocol handled by the chip. Just what the doctor ordered.
p27-s10 discusses a transfer model where a third party can initiate transfers that are carried out by a source and a sink party.
SCI's DMA models (P1285) support this well.
p28-s11 discusses Network Attached Storage.
SCI can handle this model as well, in fact all native SCI storage devices are naturally "NAS", though SCI does not impose a file format or directory/allocation mechanism, which could be taken from any of the usual models as desired.
p29-s13 discusses security risks etc.
SCI can support various network security models, but does not impose any. Interfaces that implement security can do so in several ways without extending the SCI protocols.
However, high levels of security would require encryption of packet contents, which could be significantly expensive, so many applications will prefer to assume physical security, leaving the data flows unencrypted, for improved cost-effectiveness.
A general-purpose standard, like SCI, has to be able to allow the use of many different models in order to work in many different markets. Higher-level standards that define how encryption is managed, for example, will need to be generated for each family of applications.
p29-p32 discuss the evolution of a Network Attached Secure Disk etc. Requires more standards, lots of intellectual property issues. Lower level protocol layers have to move to hardware. Several years to go on this front. Assumes FibreChannel will be pervasive.
No mention here of IEEE 1394 Serial Bus, though it was mentioned elsewhere and seems an attractive candidate, or of ANSI/IEEE Scalable Coherent Interface, though it seems the ideal candidate.
p33-p48 discuss storage on tape, disk, etc. Drives will gradually get bigger and faster, at an ever-increasing rate. An oral comment was made that it may become attractive to attach a large number of simple drives to one controller, moving the smarts back toward the controller, in order to get lower costs. That fits well with the P1285 models developed with SCI in mind.
p41 lists 1394 Serial Bus as a likely future drive interface, along with Fibre Channel.
p49-70 sum up the situation in terms of "InfraStress", infrastructure stress, where our old models of I/O and networking become increasingly unsuited to our modern computing environment. The Internet is changing many things unpredictably, expanding markets, changing the game.
We've reached the point where large servers are hurt by 32-bit limits, so 64-bit systems will phase in over the next several years and spread downward toward microprocessors.
Disk storage is growing very rapidly, with annual storage shipped rising more than a factor of 30 from 1995 to 2000. Accessing this data will require improvements on several fronts. (I liked this quote: "Disks are binary devices: new (empty) and full.")
System workloads are converging, with some commercial loads resembling technical computing loads.
Need 64-bit addressing (like SCI), more bandwidth for memory and for interconnect.
p64-s32 discusses latency issues, from CPU to DRAM to disk. Latencies are not improving much. Beware of latency in interconnect.
p66-s35 shows I/O bus performance lagging behind system needs. Oral comment: "please help us, SGI had to do our own!"
p67-s37: Bandwidth vs year for buses, HIPPI, ATM, Ethernet, etc. No SCI!
p67-s38: SMP bandwidth--shows Sequent 1987 and 1994 machines, omits Sequent's current machines (using SCI). Shows dramatically that buses can't keep up.
p68-s39: shows the merits of switch-based ccNUMA machines.
SCI was designed for switch-based ccNUMA, but also scales downward for higher volume production efficiencies and I/O applications, by supporting mixtures of switched and daisy-chain-ring interconnections.
p69-s42: design for scalability now, don't wait!
p70-s43: Plan for survival, for incremental scalability, unanticipated growth, headroom.
All in all, a wonderful marketing presentation for justifying SCI!
Just one detail--it never mentioned SCI at all.
I wish SGI well, and hope they survive. But it seems to me they chose the wrong strategy for a company in their situation.
I write this in hopes SGI will "see the light" before it is too late, but I'm not optimistic--I expect that pride will keep them on their present course until it is too late. Then we'll mourn the loss of a valiant competitor, and the market will move yet another step toward monopoly.