SCI Industrial Takeup and Future Developments

David B. Gustavson

SCI's cultural context

The Scalable Coherent Interface began life in 1987, in the IEEE 896 Futurebus project. Futurebus was aiming at the high-performance bus/module market, above VME, hoping to be a multiprocessing backplane bus/module system.

Multiprocessing support in Futurebus evolved to include fair and prioritized arbitration; a set of primitives to handle process synchronization, locks and mutual exclusion; reflective memory, then cache coherence; and a variety of maintenance features as required for large systems.

Paul Sweazey, then of National Semiconductor, had led the Futurebus Cache Coherence task group, which developed the now-standard MOESI (Modified Owned Exclusive Shared Invalid) categorization of coherence schemes. Having finished that work, he took time to think about where things were going.

Sweazey extrapolated microprocessor chip performance versus time, and showed that within just a few years Futurebus would be unable to support multiprocessing meaningfully, because one processor would be able to saturate the bandwidth of the fastest backplane bus possible. Moreover, ever-increasing levels of integration had moved computer systems from rooms to racks to crates to boards, and soon to chips.

Thus the whole concept of a high-end modular bus had a finite remaining lifetime. Buses would be useful for I/O adapters, for customizing a system, but would be inadequate for memory expansion or multiprocessing demands.

Sweazey then chaired a Study Group under the IEEE Computer Society's Microprocessor Standards Committee, which was then chaired by Bob Davis, to consider what to do about this problem. After just a few meetings, the outlines of a potential solution began to appear, and a formal IEEE Project was begun, chaired by me.

At this point, all our emphasis was on the very high end of possible performance, and cost was a secondary consideration. Furthermore, we had to define our territory to avoid damaging the market for Futurebus. The agreement we worked out with Futurebus chairman Paul Borrill was that SCI's goal would be 1GByte/s per processor, far above anything that even the most optimistic Futurebus enthusiast thought any backplane bus could ever do. (Some buses have now exceeded 1GByte/s in total, but only by reducing loads and distances to far less than a backplane module/bus system requires.)

At about that time, Futurebus 1987 was complete but lacking much commercial support. The US Navy decided that Futurebus was just what it needed, if only a few minor changes would be incorporated. Therefore, the Futurebus+ group was formed and began to revise the 1987 spec, to add more priority levels and more performance options and packaging options. The effect of this effort was to kill adoption of Futurebus 1987 because its replacement was imminent. However, working out these modifications took several years, so that Futurebus+ did not get approved until 1991.

IEEE 1394 SerialBus (now sometimes commercially called FireWire(TM) or iLink(TM)) was being developed at the same time, by some of the same people, and was incorporated into Futurebus+ and SCI and several other bus standards as an alternate path for diagnostics and I/O. There was a deliberate decision to optimize SerialBus strictly for low cost for desktop I/O, without regard for scalability or the cost, complexity, or performance of future switches/bridges, but SerialBus did adopt the same address space model as SCI, and a similar command set.

SerialBus, Futurebus, and SCI also share a co-developed Control and Status Register architecture, IEEE Std 1212, chaired by SCI architect David James. This was a painful laborious process, because though his interests lay mainly toward the scalable SCI, it was important to keep the other groups committed to this joint architecture. That meant in effect that every complaint from the other groups was equivalent to a veto of the project, so every issue had to be worked through carefully, both with regard to their desires and also with regard to problems those desires might introduce when scaled up. Eventually agreement or respectful compromises were reached on all points and the standard was completed.

The marketing model in the early days of SCI development showed VME gaining market volume until Futurebus+ products arrived, then gradually declining as Futurebus+ volumes grew, and after about 10 years SCI would appear and begin to take volume from Futurebus+. That was attractive for everyone, because it followed the expected natural progression, allowed for planning, and reassured the customer that there was no dead-end in sight.

Unfortunately, that isn't how things turned out. The delay in producing Futurebus+ resulted in a large increase in microprocessor power, which nearly closed its multiprocessor-application window, and VME continued to develop extensions that increased its performance so VME could handle the remaining application-space adequately. Thus there was no compelling application to justify the costs of adopting a completely new technology like Futurebus+. A few companies shipped Futurebus+ products, including DEC, but the economics didn't work anymore. (For example, to keep bus stubs as short as possible, Futurebus+ had to use a very wide bus with transceivers as close to the connector as possible, resulting in many packages and many pins--17 packages in one commercial version. The protocol variants also made the controller chip rather complex, the asynchronous protocol was hard to test under worst-case conditions, and the need for short stubs made it impossible to use a passive extender card to make boards accessible while trouble-shooting. Thus trouble-shooting tended to change the signal timings, which often changed the conditions that were to be studied.)

On the other front, SCI arrived far too early. Futurebus+ was difficult partly because there was complete freedom in how to do everything, so it was hard to agree which of several methods should be used. (In some cases there was no agreement, and more than one method was adopted by the standard, deferring the decisions needed for product compatibility to other documents, which were called "profiles.") Futurebus meetings were very large, with many participants from every part of industry, so technical work was difficult, procedures became more formal, and decisions were by vote. Backtracking in the design space after problems were discovered was too difficult.

SCI also had complete freedom, but its technical problems were very difficult and there was seldom more than one good solution to choose from. Also, SCI had no credibility, due to its preposterous goals, so most people attended only because of personal interest rather than corporate agenda, and the meetings were small. As a result, it was possible to work in a different way that would avoid the dynamics that plagued Futurebus (and its predecessor IEEE 960 Fastbus). We had enough people, often the leading experts in the field, but not too many. We had leading researchers from academe, and chief architects from computer manufacturers. We asked for advice from companies active in the multiprocessor field, such as BB&N and Sequent, and they were extremely helpful. Norsk Data, later Dolphin Server Technology, planned from the beginning to build computers using SCI as the processor/memory interconnect, so they worked intensively with us and had a major influence in keeping the SCI design sound and practical.

Furthermore, we were fortunate to have the right personality mix, so that we could operate in a no-holds-barred ego-free nonpolitical environment. And, we had several key participants (especially our technical editor and chief architect David James, of HP and later Apple Computer) available fulltime.

The Internet also helped us work faster. HP provided an FTP server that could be accessed by all participants, and a reliable platform-independent document formatting program, FrameMaker, became available. This made it possible for people in the US to perform a round of edits, then pass the document over to people in Europe for further work, and vice versa on a daily basis when appropriate. The nine hour time difference made it possible to essentially double the editing hours available without having to deal with multiple-copies/multiple-authors problems. Later the HP server was replaced by one at Santa Clara University and then by one at SCIzzL, and FrameMaker was supplemented by another versatile technology, Adobe Acrobat.

There was only one vote in the SCI working group. That was the vote at the very end, to declare the project complete. Up to that point, every decision was made by consensus. If the right answer was clear, we took it. If not, we took a "good enough" answer until we ran into trouble, then we would backtrack if needed.

As a result, technical progress was extremely fast, with a document that was essentially complete in 1990 but was polished and refined until 1992. This schedule was unexpected--none of us would have predicted that SCI would take far less time than Futurebus. We started the project fully expecting to spend 10 years or more.

This schedule anti-slip threatened the marketing model of Futurebus, which caused a great deal of distress and even personal hostility.

But the threat was even worse than it first appeared, because SCI had to be very simple in order to run at such high speeds, and because SCI had no need for separate transceiver chips or terminators or constrained physical layouts (e.g. short bus stubs), and because SCI fit easily within one corner of an ASIC and needed only 72 pins for 2 GBytes/s, and because SCI's bandwidth scaled up with system size, and worked with fiber optic links as well as copper cables.
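The pin count and bandwidth quoted above can be checked with a little arithmetic. The sketch below assumes the usual description of the parallel SCI link (16 data signals plus one clock and one flag, all differential, one 16-bit symbol every 2 ns, and one input link plus one output link per node). Those parameters are assumptions brought in for illustration, not figures stated in this text, and the function names are hypothetical.

```python
# Assumed link parameters (illustrative; not given in the text above).
DATA_SIGNALS = 16       # data bits carried by one link
CONTROL_SIGNALS = 2     # one clock signal plus one flag signal
PINS_PER_SIGNAL = 2     # differential signaling uses two pins per signal
LINKS_PER_NODE = 2      # one input link and one output link
SYMBOL_TIME_NS = 2      # one 16-bit symbol every 2 ns


def pins_per_node():
    """Total signal pins a node needs for its input and output links."""
    return (DATA_SIGNALS + CONTROL_SIGNALS) * PINS_PER_SIGNAL * LINKS_PER_NODE


def link_gbytes_per_s():
    """Bandwidth of one link; bytes per nanosecond equals GBytes/s."""
    bytes_per_symbol = DATA_SIGNALS // 8
    return bytes_per_symbol / SYMBOL_TIME_NS


def node_gbytes_per_s():
    """Combined bandwidth of the input and output links of one node."""
    return link_gbytes_per_s() * LINKS_PER_NODE


if __name__ == "__main__":
    print(pins_per_node(), "pins")
    print(link_gbytes_per_s(), "GBytes/s per link")
    print(node_gbytes_per_s(), "GBytes/s per node")
```

Under these assumptions the 2 GBytes/s figure is the sum of the incoming and outgoing links, each running at 1 GByte/s.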

So the cooperative "VME then Futurebus then SCI" marketing model vaporized, and SCI became opposed by a large number of influential people whose goals were threatened.

Another marketing problem was the lack of ego commitment to SCI. Usually, for an interconnect standard there will be attendees from a large number of companies. The attendees will see numerous problems that are not being addressed to their satisfaction, and will join in the work, make some contributions, and thereby become committed to its success. They lobby their company management for support and commitment, and build up a substantial visibility in the marketplace. This can be essential for getting the critical mass of support needed for the success of a standard. (Of course, it is also responsible for many of the baroque features found in most standards, and adds a great deal of complexity to a system.)

We had a steady stream of visitors attending the monthly multi-day meetings, but few developed that sort of attachment. Our work was often very esoteric, such as working through the requirements of forward-progress guarantees (no deadlocks or livelocks/stalls) and their implications for the cache coherence mechanism. Furthermore, we had bypassed many of the more tractable problems by adopting other current work. For example, we adopted as a whole the new metric crate and backplane system (IEEE Std 1301) that had been developed primarily for Futurebus (with only one P1301 participant representing SCI interests).

As a result, we had only a few committed companies when we finished the work, not a broad base of industry acceptance. Our main competition at that time seemed to be Futurebus, with many committed supporting companies and significant visibility in the market and the trade press; and FibreChannel, with perhaps even greater commitment and certainly very professional coordinated marketing.

SCI's image as a backplane bus/module system also broke down for another reason. If manufacturers have a choice between using their favorite packaging for their module designs (connecting the modules by cable), and redesigning their modules to fit some standard's requirements, most prefer to use their own packaging. It was the severe constraints of backplane buses that caused standard modules to be widely used. Once SCI removed those constraints, allowing arbitrary packaging connected by cables, virtually all designs chose something other than SCI's recommended mechanical crate/module standard. Thus SCI devices have no common appearance that can be used for marketing photos, and SCI received very little promotion from the module and crate manufacturers, a significant reduction of visibility in the marketplace compared to a normal bus standard.

But probably the biggest obstacle to marketing SCI was the lack of a simple way to explain what it is, what it can do, how it relates to well-known and understood products. SCI began as a high-end backplane bus for processors and memory and I/O, but became much more general, supporting cables and switches and acting somewhat like an incredibly fast local area network, and simpler/cheaper. It required much more explanation than a typical new product that merely lowers prices or runs a little faster.

Too late, some better names were invented, like Local Area MultiProcessor and Local Area Memory Port, but by then "SCI" had gained its own name recognition. More recently, the descriptive term "concurrent bus" has begun to look interesting--more on that later.

SCI marketing and adoption

SCI and FibreChannel began marketing at almost the same time, but FibreChannel was enormously well funded and well coordinated, supported by large companies with significant resources, who saw FibreChannel as useful for solving real problems and also as a potential source of sales.

SCI lacked that kind of support. Only one company, Dolphin (no longer in the processor business), promoted SCI as an open interchange standard. The Navy did adopt SCI as the unifying interconnect for its Joint Advanced Strike Technology future fighter program, and had an interest in SCI's commercial success because of the need for military designs to take advantage of cost savings by using Commercial Off The Shelf (COTS) products wherever possible, and this did add somewhat to SCI's visibility. Unfortunately, it also created enemies, because JAST was seen as an important design win for FibreChannel as well. There were a number of design competitions and "shootouts" as a result, and last I saw FibreChannel had gained approximately equal authorization. The program has now gone under wraps, so the current status of SCI in JAST is not obvious, but indirect evidence indicates that SCI is still included. (For example, one startup company, iCore Technology, has received some SBIR (Small Business Innovation Research) funding from military sources, to bring a high performance PCI/SCI interface to a manufacturing-ready state.)

FibreChannel shared one marketing problem with SCI, that of being hard to describe or categorize, because it combined networking features with channel-based I/O systems, which placed conflicting demands on switches (channels prefer circuit switching and networks prefer packet switching). Initially, FibreChannel systems all required switches; however, after a presentation about SCI (perhaps the timing was a coincidence) a new project was initiated to define a ring for FibreChannel too, called the FibreChannel Arbitrated Loop.

The other companies that adopted SCI mainly did so in order to take advantage of its technology for high performance rather than for low cost, and most had no interest in using it as an open interchange standard enabling them to interconnect with others' products.

Another obstacle is inherent in the nature of SCI: SCI was designed to be integrated directly into ICs that are mainly doing something else, such as processors, I/O controllers, switches, and memory controllers. Previous buses created a market for transceiver chips, which met the bus standard on one side and used conventional logic signals on the other. But conventional logic's external signals are not fast enough to keep up with SCI, so the back end of an SCI transceiver has to use a wider data path than the SCI links do.

The Working Group should have gone on to specify a standard for such a back-end bus, but did not. There was a reluctance to standardize yet another bus as part of SCI, since the design process for the SCI standard was driven by the problems inherent in such buses. Following that logic always ends up moving SCI right into the customer's ASIC. But that makes it hard to get started--we were too early for solving the problem by putting SCI into every vendor's ASIC cell library.

Three companies designed (for sale to the public) different chips that interfaced SCI to other circuits via backend buses: Dolphin Interconnect Solutions http://www.dolphinics.com, Vitesse Semiconductor http://www.vitesse.com, and Interconnect Systems Solution http://www.iss-us.com. The production versions of these did converge on the official standard, though not all at the full SCI speed. The first product from Interconnect Systems Solution uses an 8-bit-wide link (IEEE Std 1596.3) and runs at 100 Mbytes/s in order to take advantage of low cost CMOS processes. This product was actually designed for a particular customer, much of whose application (including a processor!) is integrated on the same chip (in the true spirit of SCI), but it can also be packaged for general sale with the custom portion disabled.

In 1994, I left the Stanford Linear Accelerator Center, which had supported my work on SCI and related standards, and formed a new organization called SCIzzL at Santa Clara University, at the invitation of Prof. Qiang Li. SCIzzL (pronounced "sizzle," or like "scissors," the other cutting-edge tool that's safe to use) is a partial acronym abbreviating something like "Scalable Coherent Interface Local Area MultiProcessor Users, Developers, and Manufacturers Association," which didn't seem to lend itself to any catchy and concise acronym, notwithstanding various attempts at substitution and reordering.

SCIzzL was envisioned as providing a focus for the formation of an SCI trade association, and as a mechanism for supporting ongoing standardization work in the IEEE to enhance SCI and its relatives and descendants. Funding for SCIzzL relies wholly on memberships or donations from interested companies, which turned out to be a problem due to the nature of the SCI market until now. Fortunately, one of those supporting standards, IEEE Std 1596.4 RamLink, became the initial focus of the memory industry's effort to create SLDRAM (Synchronous Link Dynamic Random Access Memory, P1596.7) as an alternative to the Rambus RDRAM memory devices, which it was feared would put Intel in control of the memory industry. For several years this effort provided ample support for SCIzzL, but also a significant distraction from SCI. In 1999 the SLDRAM effort was abandoned, though it had demonstrated technical success with two independent working device designs, because Intel had made clear its commitment to RDRAM. SLDRAM Inc. was renamed "Advanced Memory International, Inc." and became a marketing and coordinating organization for the remaining DRAM industry, and SCIzzL divorced from that group, to find out whether the SCI marketplace has yet grown enough to begin supporting further standards development, and a users association, on its own.

The advanced signaling technology developed for the SLDRAM bus has become SLIO in JEDEC, and will apparently be used for the second generation of Double-Data Rate DRAMs (DDR2). SLIO has individually adjustable high and low levels for drivers, fine adjustment of individual bit timing, and the option of using stub-decoupling resistors (compensating for the attenuated driver signal amplitude with the individual adjustments as necessary). The signaling is mostly single-ended, but with differential clocks.

Another part of SCI that has become widely adopted is IEEE 1596.3 LVDS, Low Voltage Differential Signals. This standard was deliberately scoped to apply specifically to SCI (otherwise the whole world would have piled on and it might never have finished due to territorial disputes etc.). Almost immediately it was generalized and specified for standard telecom industry rates, like 622 Mbits/s, as TIA/EIA-644. A chip-enable has been added by some vendors, with double the current drive, in order to allow using LVDS in a bus configuration rather than point-to-point. LVDS has become the de facto standard for connecting flat panel displays in laptop computers.

GLVDS, a more advanced version developed and used by Ericsson Telecom, is now being standardized in EIA/JEDEC Committee 16. This has signal swings similar to LVDS (0.25 to 0.5 V) but has the low level near ground (the G of GLVDS) instead of having the center value above 1 V as LVDS does. At the time LVDS was defined, its levels seemed adequate for several generations of power supply voltages and several chip technologies, and CMOS manufacturers thought they needed some voltage above ground to make the chips practical. GLVDS also defines the common-mode termination scheme, which was left unspecified in 1596.3.

Commercial adoption of SCI

Interface chips and products

Dolphin Interconnect Solutions

Dolphin began as a minicomputer company, Norsk Data, then reorganized as Dolphin Server Technology when Norsk Data withdrew from that business, and finally abandoned the server market entirely, for an interconnect business based on SCI. Dolphin has experimented with several strategies along the way, and has had to discover the difficult tradeoffs between being a chip supplier and being a board or subsystem vendor and risking competition with its own customers. Some potentially large customers, e.g. Bit 3 Computer, were frightened away from SCI because of the prospect of competing with their sole supplier of SCI chips, which I believe was one of the biggest setbacks for SCI market acceptance in the early years. Dolphin is sensitive to this problem, and is increasingly supportive of growing the market as a whole rather than maintaining a near monopoly.

Dolphin's products include CMOS SCI interface chips, interface boards, switches, and development tools.

See their web site http://www.dolphinics.com for current information. In particular, check the application notes at http://www.dolphinics.com/dolphin2/interconnect/applications/apps.html

Dolphin Interconnect Solutions is still by far the dominant producer of SCI interfaces.

Vitesse Semiconductor

Vitesse Semiconductor builds a GaAs SCI interface for Sequent Computer (who is now being bought by IBM), but has not succeeded in marketing it broadly. This was the first practical chip to run at full standard speed (1 GByte/s). I think most customers are reluctant to use GaAs, regarding it as exotic, and they don't think they really need GByte/s bandwidth. I think a good strategy would have been to show that this system can economically provide excellent performance for a large number of attached devices that have ordinary bandwidth requirements, without needing an expensive switch. In other words, ignore the link bandwidth and focus on the cost for n-port interconnect performance based on a simple ring, with the built-in safety feature that switches can be added later as requirements increase. Vitesse says the chips are still for sale to other customers, but a manual search of their web site today reveals no mention of SCI or of those chips, and recent private comments from the management indicate that they have given up expanding this market.

Interconnect Systems Solution

ISS was founded by Khan Kibria, the lead designer on an SCI chip being designed at Unisys with other companies, including Vitesse. ISS does custom SCI (and other) chip design, but with an eye to leveraging the custom work to make SCI parts available to others. The ISS business model has been to support growth out of sales, which has hindered rapid growth. What is needed for making these chips broadly available is an investor who would fund a production run, and then sell the chips to various customers. The first chips use an 8-bit-wide SCI link and run at 100 Mbytes/s per link. When packaged for sale as general purpose chips, they use a simple wide back-side bus and internal features are memory-mapped for easy access in a wide variety of applications.

Lockheed Martin

Lockheed has developed a fast 16-port switch called RelianetSCI. See http://www.lmco.com/minn/raj.htm

Coherent Shared Memory implementations

Convex/Hewlett Packard

The first computers to use SCI as a high-performance cache-coherent shared-memory (ccNUMA) multiprocessor interconnect were the Convex Exemplar supercomputers. Multiple SCI rings were used in parallel to increase bandwidth, connecting switch-based hypernodes of 8 PA-RISC processors. The SCI chips started from a Dolphin design, but were optimized by Convex and fabricated by Fujitsu in GaAs.

The Exemplar allowed writing programs on an HP workstation, and transparently scaling them up for parallel execution by recompilation, gradually optimizing the program for more efficient operation across a larger number of processors. This is the beauty of coherent shared memory compared to message-passing for multiprocessing. On the other hand, conversion from a single workstation environment to a message-passing cluster environment usually requires a comprehensive rewriting and reorganization of the software.

Hewlett Packard bought Convex and added these machines to the top end of its line of workstations. Further optimization took place in subsequent generations of the product, and Convex/HP has not made the SCI signals available for external connections, so there has been no motivation for them to adhere to the standard--SCI was just a cheap source of useful technology.

Convex did make several attempts, in partnership with the DOE's Stanford Linear Accelerator Center, to get funding from the Department of Energy or the Department of Defense, to build an SCI-standard interface for external use. This would have been tested in the demanding High Energy Physics particle-accelerator lab as a tool for rapid data acquisition and analysis. Though the specifications of the proposed Convex SCI system seemed like a perfect fit for several DOE and DOD problems, no funding was ever forthcoming. This was probably partly an accident of bad timing, as the DOD was then in the process of commercializing its own funded technology, Myrinet. If Myrinet succeeded commercially, the government funders could easily justify their continued existence as a government program. But if it was surpassed by some other technology, they would not look so good.

Sequent Computer

Sequent specializes in high-end multiprocessor servers for commercial applications such as transaction processing. They had used a highly optimized backplane bus along with very carefully tuned software in order to get high performance for particular classes of application, but they had reached the fundamental limits of that technology.

The Sequent NUMA-Q products are SCI based. Sequent no longer builds any bus-based systems.

The first generation of NUMA-Q did not take full advantage of SCI's potential speed, opting instead to minimize risk by allowing easy design updates via firmware uploads. Subsequent generations were able to proceed with confidence into faster hardware implementations. Sequent has not made the SCI signals available for external use, so once again there is little motivation to adhere to the standard in all respects, and a variety of specific optimizations have undoubtedly been incorporated. However, the SCI links have been implemented with Vitesse GaAs chips, which have also been available for purchase by others, so the low level protocols are presumably quite close to the standard.

The Sequent systems are based on an SMP Quad of Intel processors, four processors sharing a common memory via a local SMP bus. Sequent converts between the Intel MESI coherence and SCI coherence in their interface to this bus. The Quad memory occupies a fraction of the global address space.

Sequent's operating system, "ptx," is spread across as many as 16 Quads. More details, including an overview of the NUMA-Q 2000, can be found in white papers at: http://www.sequent.com/whitepapers/numa_arch.html

In 1999, IBM announced that it would buy Sequent to fill in the high end of its server line.

Data General

Data General uses SCI in its high-end AV20000 and AV25000 servers, referring to it as their "NUMALiiNE Technology." Their approach is similar to Sequent's, using quad Intel processors as the building block, and designing their own coherent interface from the quad's bus to SCI. Data General uses the Dolphin SCI interface chips, so is very close to the standard, but has not made the SCI links accessible outside the machines as yet. However, rumor has it that DG envisions using SCI eventually as an open interface for third party I/O devices, which might become the first true use of SCI as an interchange standard as opposed to just being a cheap source of high performance technology for internal use.

More details about the Data General machines can be found in their white papers, at http://www.dg.com/about/html/white_papers.html

Other articles of interest include "SCI Interconnect Chipset and Adapter: Building Large Scale Enterprise Servers with Pentium II Xeon SHV Nodes," by Roy Clark, http://www.dg.com/about/html/sci_interconnect_chipset_and_a.html

http://www.dg.com/aviion/html/av_25000_enterprise_server.html

Data General's NUMALiiNE Technology: The Foundation for the AV 25000

http://www.dg.com/aviion/html/av_20000_enterprise_server.html

http://www.dg.com/aviion/html/av_20000_technical_overview.html

Non-coherent implementations

Cray Research

Cray Research studied SCI and in 1995 adopted a variant of it as the GigaRing I/O system for the three larger families of Cray mainframes: the J90, the T90 and the T3E MPP. GigaRing also served a suite of I/O subsystem modules, including Fibre Channel disk arrays, HIPPI channel adapters, ESCON and Block-Mux tape channel adapters, and a Multiple Purpose Node (MPN) adapter based on the SPARC processor and SBus technology for various other peripheral and network adapters. The GigaRing I/O system provided a common I/O product base for the mainframe products as well as a flexible system interconnect between mainframes or shared I/O.

Because of production considerations, to fit into qualified chips, it was necessary to double the width and halve the speed of standard SCI. In addition, encoding was added so that ground-potential differences could be blocked with high-pass capacitive coupling, and dual counter-rotating rings were used in order to support live removal and insertion of devices or systems, as well as for higher performance. Since compatibility with the standard was not a consideration, Cray also added protocol features to support multiple virtual channels.

IBM

IBM's AS400 designers began looking for a way to "firmly couple" systems in about 1991, discovered SCI and decided that SCI's direction looked about right. Though the original motivating project didn't develop as expected, IBM built a test chip in BiCMOS and described it in a 1995 ISSCC paper. This paper concluded that the SCI technology was robust and manufacturable. Eventually an 8-bit-wide version was designed for use as a mezzanine bus for I/O, the RIO interface, which is now shipping in AS400 and RS6000 machines. The physical signaling is similar to SCI's LVDS, and the interfaces implement per-bit deskew. IBM was the first to implement and validate SCI's per-bit deskew scheme. Like SCI, the signals in these parallel links are not encoded for DC balance. IBM favors longer CRC codes than SCI's 16 bits; 32 bit CRC was used for the test chip.

Some of the same IBM designers are now working on the physical layer for FutureIO; the FutureIO links look rather similar, but probably will include encoding for DC balance to simplify ground-isolation in large systems.

Scali

Scali began as a result of experience with military signal processors at Kongsberg. The plan was to demonstrate performance equal to or better than that of custom-built signal processors by using Commercial Off-The-Shelf (COTS) technology, and the strategy was to do this by connecting processors with SCI.

Scali began with building blocks of dual SPARC processors sharing dual SCI interfaces (via SBus). With two 4-port switches from Dolphin, this allows 8 processors to be interconnected redundantly for reliable operation.

The next family of products was based on Intel processors connecting to SCI interfaces via the PCI bus. In 1998 a 64- and a 192-processor system were delivered to Paderborn. http://www.siemens.com/computer/hpc/en/hpcline3/index.htm

With two SCI interfaces on a PCI card, Scali implemented a 2-dimensional toroidal interconnect, which does not require any switches: http://www.scali.com/Presentation/sld011.htm

The Scali technology is now the basis for the Siemens HPCLINE computers. http://www.siemens.com/computer/hpc/en/hpcline2/index.htm

http://www.scali.com/

Siemens

Siemens uses SCI to connect large numbers of PCI and other buses in the enhanced I/O system for its RM600E processors. http://manuals.mchp.siemens.de/servers/rm/rm_us/rm_pdf/rm600e37.pdf

In 1999, Siemens began to market Scali machines as building-block components for building large systems, called the HPCLINE. http://www.siemens.com/computer/hpc/en/hpcline5/index.htm

Sun

As of 1999 there are two main applications of SCI at Sun: Clustering and High Performance Computing. Other applications are expected soon.

Clustering is currently limited to 2, 3 or 4 nodes (using the Dolphin 4-port switch). The hardware is made by Dolphin, the software mostly by Sun.

The low end is the Ultra 2 desktop, with an SBus interface. At present, only SBus machines are shipping, but PCI is coming.

Sunfire servers: now E3000, E4000, E5000, E6000 with SBus; soon E3500, E4500, E5500, E6500 will have PCI.

Starfire servers (top of the line): E10000 with SBus; the PCI version will also support SCI.

Soon there will be a PCI-based Workgroup Server E250, and a bigger one, E450, that support SCI, so essentially the whole server product line will soon be supporting SCI.

Auspex

Auspex makes high-end network storage servers, called 4Front, based on SCI. See http://www.auspex.com/pdf/tr24.pdf

Silicon Graphics, Inc.

Silicon Graphics has been a strong exponent of coherent shared memory, and has made a very strong case for SCI on several occasions (but without mentioning SCI--see: http://www.SCIzzL.com/SGIarguesForSCI.html). However, SGI chose to define its own interconnect, most recently called a System Area Network, formerly known as Super-HIPPI. As of 1999, my impression is that SGI is not dominant enough in the marketplace to succeed with this strategy.

SGI bought Cray Research just as Cray was implementing its SCI-like GigaRing, but did not stop Cray from proceeding with the GigaRing deployment. At this writing, SGI is cutting back severely, and is reported to be ready to sell Cray again.

Future directions

There are several high-speed signaling technologies (e.g., Gigabit Ethernet and Fibre Channel) and at least two I/O architectures (NGIO and FutureIO) that are beginning to move into SCI's territory.

NGIO and FutureIO appear at first blush to be doing essentially identical tasks--they interface to a processor at its full-bandwidth nexus, the "North Bridge," and connect to a wide range of I/O devices with high bandwidth.

But they don't solve all the problems--they don't do shared memory!

Shared memory is highly desired by the users of multiprocessor systems who want high performance, because it can provide interprocess communication latencies that are about 100x better than channel-based communication for a given technology. (With shared memory, communication is a part of one memory-reference instruction, instead of the many instructions needed by non-shared-memory methods to set up buffer content, start the data moving, and extract it at the receiving end.)

The hard part of providing shared memory connections (e.g. for use by the IEEE Std 1596 Scalable Coherent Interface) has been getting full access to the processor bus, which both NGIO and FutureIO will solve.

The real problem is one of business strategy. If Intel allowed NGIO to use its access to the processor bus to support coherent shared memory, then anyone could build large powerful multiprocessors by stacking arrays of cheap high-volume processors, which would wipe out the high-profit high-end market for special expensive processor chips that has been carefully guarded until now.

Only a few companies, namely Sequent and Data General, have been allowed to build their own interfaces to Intel processors at the full processor bus level, for use in their shared memory high-end servers using the Scalable Coherent Interface. Another approach, Corollary, was limited to smaller systems, and in any case Corollary has now been bought by Intel.

IEEE P2100, SerialPlus

The original SCI architects have been looking at designs for the follow-on generation for SCI ever since SCI's completion. This process has gone through several generations of complete ground-up redesign, while waiting for the right time, the right customer, the right marketing strategy, to make bringing out the next generation a useful exercise. There is a balance to maintain between obsoleting existing product, which can kill a standard, and letting the technology stagnate, which can also be fatal. The place where this work has been most visible has been IEEE P1394.2 Serial Express, later renumbered as P2100, and now with a tentative new name, SerialPlus.

Although these protocols are capable of doing everything SCI does, and more, the SerialPlus document is positioned for serial links of modest speed, and with the capability of encapsulating SerialBus packets whole. This is to take advantage of a possible market opportunity for providing a truly scalable backbone (and more) to allow extending SerialBus systems arbitrarily. So, cache coherence and parallel links are barely mentioned, but behind the scenes the protocols have been designed to support those extensions smoothly, and to add some useful features to SCI. The following paragraphs reflect this new positioning in the presentation of SerialPlus.

SerialPlus is a concurrent bus, which can do many things at once. Like NGIO and FutureIO, SerialPlus devices can be connected by cables to switch hubs. However, in addition SerialPlus allows devices to be connected as a daisychained cable bus, so that many devices can share a switch port, or not use a switch at all. This makes SerialPlus much more versatile, gives it a much lower entry cost, and allows much better balancing of device requirements against switch port costs.

SerialPlus represents several generations of refinement of the proven SCI technology, to make it more consumer friendly, more robust, more versatile, and more economical. At present SerialPlus is being positioned by its designers as a useful backbone for interconnecting IEEE Std 1394 SerialBus devices (digital video cameras etc.), to get past the 1394 length and bandwidth limitations. (There are also other IEEE projects working on these problems, P1394.1 and P1394b.)

The advantage of SerialPlus is its scalability. Even though it will probably start at 1 Gbit/s for 1394-interconnecting applications, its protocols scale up to any speed that technology can offer, and a SerialPlus system can scale to any size or performance requirement by adding cables and switches. SerialPlus eases backward compatibility for future devices by supporting multiple speeds, intermixed. It supports isochronous transfers (guaranteed delivery time as required by AV data), live insertion/removal of devices without disturbing other users, discovery protocols, automatic hardware error recovery, nonstop robustness or failover by means of redundant paths and redundant packet streams, and arbitrary connection topology.

Concurrent buses--a new name for this technology

SerialPlus (like SCI) is a _concurrent_ bus. Other buses only allow one device to transmit at a time, but a concurrent bus allows any number of devices to transmit at the same time. The signals in the bus cables are directed so that these transmissions do not interfere with each other, and no information is ever lost. Each device cooperates by storing any information it receives while transmitting, and then sending that information along as soon as possible. That is, only one signal at a time can be present on a particular piece of cable, but different signals can be active on every piece of cable at the same time.

Obviously this means that, under the covers, SerialPlus cannot really be a bus. Real buses have continuous unbroken signal wires from one end to the other, and wires have no signal-directing ability, so multiple transmissions always interfere.

So why do we call SerialPlus a bus? Because it acts like one as far as all the connected devices are concerned. They can do reads, writes, and locks just as on any ordinary bus. They don't need to know about network protocols or routing, they just ask for data from some 64-bit address and soon the data arrive.

Of course, not all ordinary buses act alike--SerialPlus acts like a high performance split-transaction bus, where the bus is released for other uses between the data request and the arrival of those data. More primitive buses are more common, where one device holds the bus, blocking all other users, until the requested data arrive; but these "unified transaction" buses are hopelessly inefficient and not useful where high performance is desired. Split transaction buses were the first step toward concurrence, using the bus wires as efficiently as possible. However, they were only able to scale to about 10x the performance of primitive buses. Scaling further requires support for switches and some network-like features.
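The difference in bus occupancy between the two styles can be sketched with a little arithmetic. The following Python fragment is only an illustration: the function names and the timing values (in arbitrary units) are hypothetical, and it assumes the memory can overlap outstanding requests so that the bus is the only shared resource.

```python
def unified_bus_time(n_reads, request_t, memory_latency, response_t):
    """Bus-busy time when the requester holds the bus until the data arrive."""
    return n_reads * (request_t + memory_latency + response_t)


def split_bus_time(n_reads, request_t, response_t):
    """Bus-busy time when the bus is released between request and response."""
    return n_reads * (request_t + response_t)


# Hypothetical timings, chosen only to illustrate the effect.
reads = 4
unified = unified_bus_time(reads, request_t=1, memory_latency=8, response_t=1)
split = split_bus_time(reads, request_t=1, response_t=1)

print("unified transaction bus busy for", unified, "units")
print("split transaction bus busy for", split, "units")
print("bus time freed for other users:", unified - split, "units")
```

In this toy model the gain is bounded by the ratio of memory latency to transfer time on a single set of wires, which is consistent with the point that scaling further needs switches rather than a better bus.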

Moving from unified to split transaction buses introduced several new problems: mutual exclusion/locks; ordering; and resource allocation. Many users encounter these issues for the first time when they leap from unified transaction buses to SerialPlus or SCI, not realizing that split transaction buses already had to handle them. They then conclude that SerialPlus or SCI is inherently complex, which is not a fair or reasonable comparison.

Concurrent behavior is essential for scalability.

Scalable architectures are ones that do not change their behavior as technology advances--they just run faster, and gain higher capacity.

As a contrasting example, a 1394 SerialBus system spends time arbitrating to decide which device will transmit next. The arbitration takes a fixed time, independent of technology (here technology means the currently possible signal bandwidth). This fixed delay interferes with scaling up in bandwidth--one has to send more data with each transmission as the bandwidth goes up, or else the system performance becomes limited by the constant arbitration delay times. Making packets longer is not a minor problem--it causes backward compatibility problems or inefficiencies, and infrastructure problems (especially for interfaces and bridges that have to provide buffer storage for packets).
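The effect of a constant arbitration delay can be sketched numerically. The Python fragment below is a simplified model, not a description of the actual 1394 arbitration: the function names, the packet size, the arbitration time and the bandwidths are all hypothetical values chosen for illustration.

```python
def bus_efficiency(packet_bits, bandwidth_bits_per_s, arbitration_s):
    """Fraction of time spent moving data when every packet pays a fixed arbitration delay."""
    transfer_s = packet_bits / bandwidth_bits_per_s
    return transfer_s / (transfer_s + arbitration_s)


def packet_bits_for_efficiency(target, bandwidth_bits_per_s, arbitration_s):
    """Packet size needed to reach a target efficiency (0 < target < 1).

    Derived from efficiency = t / (t + a), so t = a * target / (1 - target).
    """
    return bandwidth_bits_per_s * arbitration_s * target / (1.0 - target)


# Hypothetical figures: a 4096-bit packet and a 1 microsecond arbitration delay.
PACKET_BITS = 4096
ARBITRATION_S = 1e-6

for bandwidth in (100e6, 400e6, 1.6e9):
    eff = bus_efficiency(PACKET_BITS, bandwidth, ARBITRATION_S)
    needed = packet_bits_for_efficiency(0.95, bandwidth, ARBITRATION_S)
    print(f"{bandwidth / 1e6:7.0f} Mbit/s: efficiency {eff:.2f}, "
          f"packet for 95% efficiency = {needed:.0f} bits")
```

With the packet size held constant, efficiency falls as bandwidth rises; holding efficiency constant instead makes the required packet size grow in proportion to bandwidth, which is the buffer-storage problem for interfaces and bridges.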

Arbitration also requires a time proportional to the physical size of the connected system, which thus forces limits on the possible system size.

Furthermore, when transmitting data 1394 sends the same information over all the cables in the system. Thus adding cables does not add capacity for 1394, as it does for SerialPlus or SCI.
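The capacity difference can be sketched with a toy model. The Python fragment below assumes a one-way ring of cable segments and fixed time slots; both are simplifications introduced here for illustration (real SCI or SerialPlus devices forward packets continuously rather than in slots), and all function names are hypothetical.

```python
def ring_path(src, dst, n):
    """Segments used going one way around an n-node ring.

    Segment i is the cable from node i to node (i + 1) % n.
    """
    segments = []
    node = src
    while node != dst:
        segments.append(node)
        node = (node + 1) % n
    return segments


def broadcast_slots(transfers):
    """Every transfer occupies all cables, so transfers run one per slot."""
    return len(transfers)


def concurrent_slots(transfers, n):
    """Greedy schedule: transfers share a slot if their cable segments do not overlap."""
    slots = []  # each entry is the set of segments busy in that slot
    for src, dst in transfers:
        path = set(ring_path(src, dst, n))
        for busy in slots:
            if busy.isdisjoint(path):
                busy |= path  # claim these segments in an existing slot
                break
        else:
            slots.append(path)  # no slot had room; open a new one
    return len(slots)


N = 8
neighbor_transfers = [(i, (i + 1) % N) for i in range(N)]
print("broadcast bus slots:", broadcast_slots(neighbor_transfers))
print("concurrent bus slots:", concurrent_slots(neighbor_transfers, N))
```

In this model eight neighbor-to-neighbor transfers need eight slots on the broadcast bus but fit in a single slot on the concurrent ring, because each uses a different piece of cable. Transfers whose paths overlap still have to take turns on the shared segments.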

This is not a criticism of 1394, which was designed for absolute minimum cost in a single desktop I/O environment, consciously ignoring scalability considerations, which were in those days believed to be expensive (we were wrong).

However, for high-end SerialBus users the resulting problems are already real, and gradually more and more users will bump into the built-in limits.

Now that technology has advanced to the point where the cost of logic gates is not the primary constraint in a system design, there are big advantages to using scalable architectures, which can just keep growing and adapting smoothly to use new technology as it evolves, generation after generation.

Standardized scalability is good for the consumer, but may not fit the business models of established manufacturers. Planned obsolescence and sharply defined market segments can support higher profits on a continuing basis. Standardized scalability replaces this controlled environment with free market forces, which make profits uncertain and the future unpredictable.

Thus the consumer's interests are always aligned with the underdog's--it's a very dynamic world.