Date: July 29, 2002
One of the most wonderful performance-enhancing features of the forthcoming Opteron (Hammer) is the memory controller built directly into each CPU, which gives a multi-processor system multiple independent memory channels. This is a huge, huge difference from the shared-memory (actually, shared-everything) approach of Intel symmetric multi-processing (SMP) systems. Given the limitations of today's memory technology, this was an exceedingly smart move on the part of AMD and will likely change the way we design some of our applications in the very near future.
Intel SMP systems exhibit terrible scalability beyond as few as 3 CPUs if programs and data don't fit into a local CPU's cache (unless the programs are designed from the start for parallelism). This is largely why Intel CPUs are being equipped with larger and larger caches. Even big cache increases don't buy much scalability on shared-memory SMP systems, since cache coherency maintenance between multiple CPUs is still a problem with serious limitations. The bus snooping and resultant cache flush/fill activity often cripple a given CPU for many cycles of operation. In addition, modern OSes use globally visible data structures to coordinate activity among independent CPUs. The atomic test/set operations needed to support spinlocks and semaphores can chew up lots of memory bandwidth and suffer potentially severe latency when dirty cache lines need flushing (forcing the other CPUs' caches to resynchronize their images of shared memory). The latency just kills these systems for many, many applications. That is why scalability is so poor, as can be seen in various tests running everyday apps on everyday OSes such as Unix and Windows NT/2K (when the fourth CPU delivers only a 10% performance kick, the term "scalable" is a complete misnomer). Generally, the apps themselves are simply not designed to run multi-threaded efficiently across multiple CPUs (they are architecturally limited). It's not just the applications that have problems with SMP. The OS kernels are often a nightmare, with critical code segments heavily dependent on a given hardware design (memory controller, cache subsystems, etc.), and with full SMP there is LOTS of activity competing for the memory system's limited bandwidth and adding to its latency.
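To make the test/set point concrete, here is a minimal spinlock sketch in C (x86, GCC inline assembly; an illustration, not code from any particular OS). Every contended acquisition performs a locked exchange, which drags the lock's cache line from CPU to CPU, and that bouncing is exactly the coherency traffic described above:

    /* Minimal test-and-set spinlock sketch.  Each xchg is a locked
     * bus/cache transaction, so a contended lock keeps its cache line
     * ping-ponging between CPUs. */
    typedef volatile int spinlock_t;

    static inline int test_and_set(spinlock_t *lock)
    {
        int old = 1;
        /* xchg with a memory operand is implicitly locked on x86 */
        __asm__ __volatile__("xchgl %0, %1"
                             : "+r"(old), "+m"(*lock)
                             :
                             : "memory");
        return old;                 /* previous value: 0 means we got the lock */
    }

    static inline void spin_lock(spinlock_t *lock)
    {
        while (test_and_set(lock))  /* locked exchange = coherency traffic */
            while (*lock)           /* spin on the locally cached copy     */
                ;
    }

    static inline void spin_unlock(spinlock_t *lock)
    {
        __asm__ __volatile__("" ::: "memory");  /* compiler barrier */
        *lock = 0;
    }

Every spin_lock/spin_unlock pair on a shared structure is another round trip for that cache line, which is why kernels that lean heavily on global locks scale so badly.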
Of course, a multi-threaded OS running on a multi-CPU system (actually, it's multi-process also) still needs some form of shared-resource control, and in the case of multi-threading some way of doing those spinlocks and semaphores. Most of our currently used software is capable of multi-threading over multiple CPUs but is essentially designed for the shared-memory model. How can we get past the performance bottleneck? Thread affinity (pinning a thread to a particular CPU so its working set stays put; a sketch follows below) and other techniques have been developed to help deal with the issue, but the shared-memory model itself is just plain the primary problem. There is only so much bandwidth to share and it gets used up fast. It's clear that memory bandwidth is generally THE limiting factor on SMP systems. Thus, to get away from the problem we will need to get away from SMP. It's that simple.
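For reference, pinning a thread to a CPU is only a few lines of C on Linux with the glibc affinity extension (a minimal sketch, assuming pthread_setaffinity_np is available; none of this comes from the article itself):

    /* Pin the calling thread to a single CPU so its working set stays in
     * that CPU's cache (and, on Opteron, in that CPU's local memory). */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);

        /* returns 0 on success, an error number otherwise */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

Affinity keeps a thread's data hot in one cache, but as argued above it does nothing about the total memory bandwidth the CPUs are fighting over.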
You could see this problem coming years ago as CPU speeds zoomed past memory performance, which has plodded along at a sub-Moore development rate. There simply isn't enough memory bandwidth in an SMP system to keep up with the CPUs. For
high performance and high volume applications (transaction processing,
rendering, searching, etc.) there MUST be division of the workload among
multiple separate memory systems. There are several ways to do
this and currently they tend to be rather expensive. One way to get a lot
of performance out of a multi-CPU configuration is to put the CPU's into
separate systems.
When data centers are designed to handle massive loads, we often utilize
separate systems that operate together in "clusters". Perhaps the company
best at doing this was the now absorbed Digital Equipment Corporation (good old
DEC). Their VMS operating system had a built-in clustering capability that
allowed multiple separate systems to share their resources in nearly complete
harmony (with a significant amount of overhead for the inter-system management
to take place). Each system of course had its own OS image and attached
resources (which could be shared as desired) and various enhancements to the OS
which allowed such things as distributed lock management, failover of
applications, etc. This allowed any given system to "see" the resources of another and, by passing an internally generated request to the other system (or virtualizing it), gain access to those resources. The point here is
that each system ran its own apps and utilized the storage and I/O systems of
other systems. It is pretty amazing when done properly (and you do have to
have some skill at application development and cluster management to make it
work well). We can do this with other operating systems, typically Linux
(and IBM mainframes can too) and Microsoft is trying very hard to do the same
(it ain't easy and their OS's are already such a kluge it may take them years
yet to get it right!). The main benefit of clusters is that we don't have
shared memory issues and the main drawback is that we still have distributed I/O
performance issues, usually due to the bandwidth limits of the system
interconnect.
Of course, clusters tend to be somewhat expensive and power hungry since we need
to create completely separate systems and then link them with Gigabit Ethernet, Fibre Channel, or some other high-performance LAN that acts as a "backplane" for the systems to communicate over. (DEC used a thingy called CI, or "cluster interconnect", that was very fast.) They work great, but the cost of each box and
the expense of hooking it together with state-of-the-art networking gear is
quite high. Of course there is also the power and cooling expense to
consider. If we could dream of a "perfect" system, what would we really
need?
What we really want are the best features of a cluster without all the expense
in hardware and energy cost. We also don't want just any old SMP system
with its bandwidth and latency limitations. What will really help eliminate the
SMP bottleneck issues is to have the CPUs do their own thing (just like any given system in a cluster) and then communicate with another CPU only when they must (just like in a cluster). With AMD's high-end Opteron CPUs a lot of
the cost can be eliminated and the whole idea of running applications on such a
platform may change the way we "do" computing forever.
If you've looked at some functional block diagrams of AMD's intended 4-way high-end Opteron systems (see this .pdf file for reference), you will note that all 4 CPUs are interconnected via HT (HyperTransport) links. One of the CPUs is intended to be
dedicated to serving as the primary display and I/O resource manager while the
other three are dedicated to process or thread execution. One of the
potential problems with this is that in most current OS designs process/thread
scheduling is controlled by a single CPU and this can create more inefficiencies
by requiring significant amounts of inter-process and thus inter-processor
communication (IPC). There simply isn't any free lunch. But what if we
didn't need lunch so often? If we just turned each CPU loose on its own set of problems and let it manage them by itself, it would only have to "talk" to another CPU when it needed I/O or access to some mutually agreed-upon shared resource. Hmmm, this is starting to sound a bit like a cluster....
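As a thought experiment, here is a shared-nothing sketch of that idea in C on Linux: one worker process per CPU, each pinned to its own CPU and working entirely in its own address space, which "talks" to the coordinating process exactly once, over a pipe, when it has a result to deliver. The CPU count and the do_local_work() function are invented for illustration:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <stdio.h>

    #define NCPUS 4

    static long do_local_work(int cpu)      /* placeholder for real work */
    {
        long sum = 0;
        for (long i = 0; i < 100000000L; i++)
            sum += i % (cpu + 2);
        return sum;
    }

    int main(void)
    {
        int pipes[NCPUS][2];

        for (int cpu = 0; cpu < NCPUS; cpu++) {
            pipe(pipes[cpu]);
            if (fork() == 0) {                   /* child acts as one "node"   */
                cpu_set_t set;
                CPU_ZERO(&set);
                CPU_SET(cpu, &set);
                sched_setaffinity(0, sizeof(set), &set);   /* stay on one CPU */

                long result = do_local_work(cpu);          /* no shared data  */
                write(pipes[cpu][1], &result, sizeof(result)); /* talk once   */
                _exit(0);
            }
        }

        for (int cpu = 0; cpu < NCPUS; cpu++) {  /* coordinator gathers results */
            long result;
            read(pipes[cpu][0], &result, sizeof(result));
            printf("cpu %d -> %ld\n", cpu, result);
            wait(NULL);
        }
        return 0;
    }

Nothing here needs a spinlock or a shared cache line; the pipe plays the role that a cluster interconnect (or an HT link) plays between real nodes.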
If we extend the concept a bit, why don't we just go ahead and provide a full
mini-kernel to each CPU and let that CPU schedule and manage its own list of
processes? That can cut down significantly on IPC requirements and for
certain applications (that are written properly) eliminate cache coherency and
shared memory latency/bandwidth issues. There's still the shared I/O to
deal with but at least that is handed off to a completely separate CPU. In
the Opteron architecture, the HT links allow extremely efficient I/O transfers
to the buffer memory of a CPU that wants to move data in/out. As Nils has
pointed out, this is almost a mainframe-class "channel" architecture (and it is very, very efficient). The HT channels can each handle over a giga-BYTE per second of data transfer, and that's enough to gobble up the output of over a dozen striped disk drives simultaneously (at a rough 50-80 MB/s of sustained throughput per drive, a dozen drives add up to well under a gigabyte per second); if that isn't enough, just wait for the next rev of HT! Each CPU's cache does its own thing with its
own list of processes and cuts down dramatically on inter-CPU IPC.
Applications generally would run completely within a given CPU and should be
designed for CPU thread affinity. This will keep garden-variety application design similar to how we do things today and keep things simple. This isn't much of a problem since these CPUs are smokin' anyway.
However, when there is a compelling need for wringing out performance (such as
OO and Relational DBMS, rendering, etc.) we can easily design threads that
should run more or less independently on separate CPUs and then coordinate their intermediate results over those HT channels (Linux Beowulf clusters often do this already, so the techniques are well known; a partition-and-combine sketch follows below).
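To make that concrete, here is a partition-and-combine sketch in C with pthreads (the data set, its size, and the names are invented for illustration). Each thread owns a private slice of the data and runs independently, ideally pinned to its own CPU as sketched earlier; the only cross-thread coordination is the final exchange of partial results:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N        (1 << 20)

    static double data[N];                 /* the full data set */

    struct slice {
        int    first, count;               /* this thread's private range  */
        double partial;                    /* its intermediate result      */
    };

    static void *worker(void *arg)
    {
        struct slice *s = arg;
        double sum = 0.0;

        for (int i = s->first; i < s->first + s->count; i++)
            sum += data[i];                /* touches only its own slice   */

        s->partial = sum;                  /* written once, read by joiner */
        return NULL;
    }

    int main(void)
    {
        pthread_t    tid[NTHREADS];
        struct slice s[NTHREADS];
        double       total = 0.0;

        for (int i = 0; i < N; i++)
            data[i] = 1.0;

        for (int t = 0; t < NTHREADS; t++) {
            s[t].first = t * (N / NTHREADS);
            s[t].count = N / NTHREADS;
            pthread_create(&tid[t], NULL, worker, &s[t]);
        }

        for (int t = 0; t < NTHREADS; t++) {   /* combine intermediate results */
            pthread_join(tid[t], NULL);
            total += s[t].partial;
        }

        printf("total = %f\n", total);
        return 0;
    }

The same shape carries over to the one-kernel-per-CPU model: swap the threads for processes (or whole OS images) and the join step for a message over the interconnect.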
These are not new ideas, but the imminent arrival of AMD's server-class Opteron CPUs begs for a more advanced OS architecture than we are currently using in shared-memory SMP systems. We will not be able to get the best from these
systems without a superior OS design. Is there one available?
If you've followed so far, you'll be interested in this
heavily footnoted paper
by Karim Yaghmour of Opersys. He suggests and analyzes what it will take
to get a separate Linux kernel running on each CPU of a multi-CPU system. There is no mention of Hammer or Opteron, but the concepts can easily be applied, and without doubt more efficiently, on the Opteron architecture than on Intel SMP.
Of course, the above doesn't even begin to outline the plethora of issues
involved but it at least highlights where some of the biggest problems currently
are and how the AMD Opteron architecture can potentially provide a significant
advance. Think on, y'all!
===================================
Other Links
Brief on paper
http://www.linuxdevices.com/news/NS7919961767.html
Opersys home page
http://www.opersys.com/
===================================