On Tue, 7 Dec 1999, Mike Hall wrote:
>
> Maybe, but I've seen nothing in the published material that says this is
> anything other than a general-purpose machine. But again, the facts in
> these pieces are somewhat meager. I'd like to get a peek at the
> instruction set if they ever deign to publish it.
So would I. I think they will publish it: you can't use a machine effectively unless you can work with it at multiple levels, and I've rarely seen a compiler that I can't out-code. The question is whether they will publish it before the machine becomes available. For example, do we even have the Merced instruction set (or the Playstation instruction set) yet?
>
> But even if I'm right, the task of designing software to make full use
> of the machine's capabilities may be so daunting that no one else will
> want to take it on, effectively making it a single-purpose machine.
Not really. If it is general purpose, there are already software models for programming similar machines, e.g. the Oxford Bulk Synchronous Parallel (BSP) model, the OpenMP API for shared-memory programming, and the BIP message-passing model (for Myrinet). The only difference between programming for a Beowulf cluster and programming for Blue Gene is the granularity of the processor units.
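To make the granularity point concrete, here is a toy sketch of the BSP style in Python. It is a single-process simulation, not a real BSP library, and the function name bsp_sum and the round-robin data layout are my own assumptions for illustration. The same two supersteps run unchanged whether you pretend to have 8 processors or 64; only the size of each processor's chunk changes.

```python
# Toy, single-process simulation of two BSP supersteps.
# Assumption: bsp_sum and the round-robin layout are hypothetical,
# chosen only to illustrate the compute / communicate / barrier pattern.

def bsp_sum(data, num_procs):
    data = list(data)
    # Distribute the data round-robin across the simulated processors.
    local = [data[p::num_procs] for p in range(num_procs)]
    inbox = [[] for _ in range(num_procs)]

    # Superstep 1: each processor computes on local data only, then
    # queues a message (destination, value) for processor 0.
    outbox = [(0, sum(chunk)) for chunk in local]

    # Barrier: messages are delivered only after every processor has
    # finished its local computation.
    for dest, value in outbox:
        inbox[dest].append(value)

    # Superstep 2: processor 0 combines the partial sums it received.
    return sum(inbox[0])


if __name__ == "__main__":
    # Coarse-grained and fine-grained runs of the same program.
    print(bsp_sum(range(100), 8))
    print(bsp_sum(range(100), 64))
```

Both calls give the same answer; the program text does not care how finely the work is divided.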
What IBM probably did was ask themselves what the failure rate of the processor units was going to be. With 1M processors it might be quite high, and customers aren't going to be happy if the machine is down most of the time getting boards replaced. This is already solved in multiprocessor and clustered architectures, where you can afford to take a node out for a few minutes to hours to replace parts. However, if you are running integrated calculations (i.e. this isn't a client-server architecture) that take days to weeks, and the data in one node interacts with *all* of the other data, then pulling a node slows down the entire calculation. The clever trick is going to be detecting the failures (you don't want soft failures, you want hard failures) and having the data arranged so that multiple processors/nodes can rapidly get to it.
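A back-of-the-envelope sketch of why the failure rate matters at this scale, in Python. The assumptions are mine, not IBM's: nodes fail independently at a constant rate (exponential lifetimes), and the per-node figure of 1,000,000 hours between failures is a made-up number for illustration, not a published Blue Gene specification.

```python
import math

# Assumption: independent nodes with exponentially distributed lifetimes.
# The per-node MTBF used below is hypothetical, not a published figure.

def system_mtbf_hours(node_mtbf_hours, num_nodes):
    """Mean time until the first failure anywhere in the machine."""
    return node_mtbf_hours / num_nodes

def prob_no_failure(node_mtbf_hours, num_nodes, run_hours):
    """Probability that no node at all fails during a run of run_hours."""
    return math.exp(-num_nodes * run_hours / node_mtbf_hours)


if __name__ == "__main__":
    node_mtbf = 1_000_000.0   # hypothetical: roughly 114 years per node
    nodes = 1_000_000         # the 1M processors mentioned above
    print(system_mtbf_hours(node_mtbf, nodes))      # hours between failures
    print(prob_no_failure(node_mtbf, nodes, 24.0))  # chance of a clean 24-hour run
```

Under those assumptions, even very reliable parts give you a failure somewhere in the machine about once an hour, so a calculation that runs for days essentially never completes without one. That is why detection and data placement have to be designed in from the start.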
>
> P.S. I apologize for my sloppy editing on my original post
> (which was truly my first post to this list).
No problem. The information was quite helpful and appreciated.
Robert