Re: Blue Gene

From: Robert J. Bradbury (bradbury@www.aeiveos.com)
Date: Tue Dec 07 1999 - 10:25:36 MST


On Tue, 7 Dec 1999, Mike Hall wrote:

>
> Maybe, but I've seen nothing in the published material that says this is
> anything other than a general-purpose machine. But again, the facts in
> these pieces are somewhat meager. I'd like to get a peek at the
> instruction set if they ever deign to publish it.

So would I. I think they would publish it; you can't effectively use a
machine unless you can work with it at multiple levels. I've rarely
seen a compiler that I can't out-code. The question is whether they
will publish it before the machine becomes available. For example, do
we even have the Merced instruction set (or the Playstation instruction
set) yet?

>
> But even if I'm right, the task of designing software to make full use
> of the machine's capabilities may be so daunting that no one else will
> want to take it on, effectively making it a single-purpose machine.

Not really. If it is general purpose, there are already software
models for programming similar machines (e.g. the Oxford Bulk
Synchronous Parallel (BSP) model, the OpenMP API for shared-memory
programming, and the BIP message passing model for Myrinet).
The only difference between programming something for a Beowulf
cluster and programming Blue Gene is the granularity of the
processor units.

What IBM probably did was ask themselves what the failure rate
was going to be in the processor units. With 1M processors
it might be quite high. Customers aren't going to be happy
if your machine is down most of the time getting boards replaced.
This is now solved in multiprocessor & clustered architectures
where you can afford to take out a node for a few minutes to
hours to replace parts. However, if you are running integrated
calculations (i.e. this isn't a client-server architecture) that
take days to weeks and the data in one node interacts with *all*
of the other data, then when you pull a node you slow down the
entire calculation. The clever trick is going to be detecting
the failures (you don't want soft failures, you want hard failures)
and having the data arranged so that multiple processors/nodes can
rapidly get to it.
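
As a toy illustration of that data-arrangement idea (entirely my own
sketch, not IBM's scheme), picture every block of the calculation
having a primary node and a replica node; when a hard failure is
detected, blocks owned by the dead node are served from their replicas
instead of stalling the whole run:

    /* Toy data-placement sketch in C (my own illustration). */
    #include <stdio.h>

    #define NBLOCKS 8
    #define NNODES  4

    typedef struct {
        int primary;   /* node normally responsible for this block */
        int replica;   /* node holding a spare copy of the data    */
    } block_t;

    int main(void)
    {
        block_t blocks[NBLOCKS];
        int node_alive[NNODES] = { 1, 1, 1, 1 };
        int b;

        /* Spread blocks so primary and replica never share a node. */
        for (b = 0; b < NBLOCKS; b++) {
            blocks[b].primary = b % NNODES;
            blocks[b].replica = (b + 1) % NNODES;
        }

        node_alive[2] = 0;   /* pretend node 2 failed hard */

        /* Any block whose primary is dead is served by its replica,
           so the calculation keeps moving while the board is swapped. */
        for (b = 0; b < NBLOCKS; b++) {
            int owner = node_alive[blocks[b].primary]
                            ? blocks[b].primary : blocks[b].replica;
            printf("block %d -> node %d\n", b, owner);
        }
        return 0;
    }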

This is a new level in computer architecture and getting very close
to what goes on in the brain. If they get the architecture and the
fault tolerance right, then, because they have solved the bandwidth
problem, you can expect a simple instruction set to gradually expand
as people come up with other applications and declining feature sizes
give you more chip real estate to work with.

> And this is likely the only one they will build, like Deep Blue.

IBM is one of the most clever marketing organizations in the world.
Unlike Deep Blue, they aren't doing this for publicity. (After all,
how many machines are you going to sell when you know you are going
to lose the game...) They realize the market for these machines is
in the dozens (major pharma & governments), thousands (universities &
small biotech), and potentially workstation quantities (individual
researchers). My prediction is that with this one they are planning to
make the software investment and then follow declining hardware costs
down to make the machines available to larger markets.

>
> P.S. I apologize for my sloppy editing on my original post
> (which was truly my first post to this list).

No problem. The information was quite helpful and appreciated.

Robert


