COMP:WARS: RE: Software/Hardware Architectures

From: Eugene Leitl (eugene.leitl@lrz.uni-muenchen.de)
Date: Thu Jul 15 1999 - 02:33:53 MDT


Billy Brown writes:
> I don't think we were really getting anywhere with the previous line of
> responses, so I decided to try it again from the beginning. Here goes:
 
Yep, that has also been my impression. We really do seem to have
irreconcilably different ways of seeing things. Not that diversity is a
bad thing, but it sure gets tedious having to reiterate the same
points and recoat them in different verbiage.
 
> Regarding Current Software
> Current PC software is written for hardware that is actually in use, not
> hypothetical designs that might or might not ever be built. This is
> perfectly logical, and I don't think it makes sense to blame anyone for it.

It might be logical, yet there are plenty of reasons to blame the
widespread "investment protection" attitude for it. Investment
protection is great for local optimization, but it is deleterious
even over the medium term. And it is really, really disastrous over
the long term.

To name a few notorious perpetrators: Big Blue, Intel and Microsoft
have each been reserved a special circle in Dante's Hell (at least
in my private version of it).

> If a better architecture becomes available, we can expect ordinary market
> forces to lead them to support it in short order (look at Microsoft's

Alas, free markets seem to fail miserably here. Technical
excellence has very little to do with market penetration. Deja vu,
deja vu, deja vu, deja vu.

> efforts with regard to the only-marginally-superior Alpha chip, for
> example).
 
There is really no fundamental difference between the x86 family, PowerPC,
the diverse MIPSen, or the Alpha. They all suck.

> The more sophisticated vendors (and like it or not, that included Microsoft)

The trouble with Microsoft is that our point of reference is a
marketplace flattened by more than a decade of its influence. To be
fair, we would have to evaluate multiple alternate branches
of reality-as-it-could-have-been, which necessarily makes for
extremely subjective judgements. Your mileage WILL vary.

As to Microsoft, I guess all the intelligence cream they've been
skimming off academia and industry for years, plus all that R&D
spending, will eventually lead somewhere. Right now, what I see
doesn't strike me as especially innovative or even high-quality, no
Sir, particularly regarding the ROI on all those research gigabucks
pouring in. Administrative hydrocephalus begets administrative
hydrocephalus.

> have been writing 100% object-oriented, multithreaded code for several years
> now. They use asynchronous communication anywhere there is a chance that it

I hear you. It is still difficult to believe.

> might be useful, and they take full advantage of what little multiprocessor
> hardware is actually available. There is also a trend currently underway

Well, essentially all we've got is SMP, which is built on the
shared-memory paradigm, and that is a dead end. Shared memory is at
least as unphysical as caches, in fact more so.

> towards designing applications to run distributed across multiple machines
> on a network, and this seems likely to become the standard approach for
> high-performance software in the near future.
 
I know clustering is going to be big, and it is eventually going to find
its way into desktops. It's still a back-assed way of doing things;
maybe smart RAM will have its say yet. If only the PlayStation 2 were
already available. Oh well. The marketplace will sure look different a
year downstream. It's difficult to do any planning when things are so
much in flux.
 
> Regarding Fine-Grained Parallelism
> Parallel processing is not a new idea. The supercomputer industry has been

Heck, LISP is the second-oldest HLL known to man, and Alonzo
Church invented the lambda calculus in the 1930s. ENIAC was a RISC
machine. Unix is a 1970s OS. GUIs, mice and Ethernet are 1970s
technologies. Von Neumann invented cellular automata in the 1950s.

What does age have to do with how good an idea is? If anything,
a brand-new, untried idea is the thing to be wary of.

> doing it for some time now, and they've done plenty of experimenting with
> different kinds of architectures. They have apparently decided that it

Nope, sorry, I don't think so. Most of the IT landscape is shaped by
vogues, and right now parallelism is becoming fashionable (what a
damnable word) again. (While ALife is heading into oblivion, which is
IMO a damn shame.)

> makes more sense to link 1,000 big, fast CPUs with large memory caches than
> 100,000 small, cheap CPUs with tiny independent memory blocks. That fits

Heck, cache hierarchies are unphysical. Fat CPUs are incompatible
with having both non-negligible amounts of fast on-die SRAM _and_
good die yield. Wafer-scale integration is impossible without good
die yield and failure tolerance. Kbit-wide buses are impossible
without embedded-DRAM technology, due to packaging constraints.
Embedded-RAM technology is hardly a year old. VLIW is at the
threshold of going mainstream, and VLIW only makes good sense with
kbit-wide buses and on-die memory. High code density is impossible
without threaded code, and threaded code requires stack-CPU
hardware support. Common HLLs don't support threaded code or stack
CPUs. No language supports fine-grain maspar systems.
Blahblahblah. I could go on for a long time, but (I hope) my
point is nauseatingly clear: it's a big ball of yarn buried in
yonder tarpit, and it requires a whole lotta muscle to haul it out.
You have to do it in one piece, because everything is
coherent/contiguous/synergistic. A bit of tarry string wrestled
from the pit won't excite anybody. Please go for the whole hog.
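
For the threaded-code bit, here's a throwaway sketch (mine, plain C,
call-threaded flavour, nothing to do with any real stack-CPU toolchain)
of what I mean: a program is just an array of word pointers over a data
stack, so code density approaches one cell per operation, and the
two-line dispatch loop is roughly what a stack CPU casts into hardware.

/* Hypothetical sketch of threaded code (call-threaded flavour): each
 * "word" is a function pointer, a program is an array of them, and the
 * inner interpreter is a two-line fetch/execute loop over a data stack. */
#include <stdio.h>

static long stack[64];
static int  sp = 0;                    /* data stack pointer */

typedef void (*word_t)(void);

static void lit2(void) { stack[sp++] = 2; }
static void lit3(void) { stack[sp++] = 3; }
static void add(void)  { long b = stack[--sp]; stack[sp - 1] += b; }
static void dot(void)  { printf("%ld\n", stack[--sp]); }

/* the threaded program "2 3 + ." -- one cell per operation */
static word_t program[] = { lit2, lit3, add, dot, NULL };

int main(void)
{
    for (word_t *ip = program; *ip; ip++)  /* the whole inner interpreter */
        (*ip)();
    return 0;
}

A real Forth-ish system would use direct or indirect threading rather
than C function calls, but the density argument is the same.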

IT is just another acronym for Inertia Technology. We're caught
in a local minimum, but that doesn't mean there is no lower one.
A number of decisions made in the past landed us in this particular
minimum. It could have been a different one.

Sorry if this sounds like just another technoshaman mantra, but
that's just how things are.

> perfectly with what I know about parallel computing - the more nodes you
> have the higher your overhead tends to be, and tiny nodes can easily end up
> spending 100% of their resources on system overhead.
 
There are codes where Amdahl's law is going to bite you. There are a lot
where overhead is not a problem. My particular problem (from the
domain of physical simulation on a 3d lattice) is in the latter camp.
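
To spell it out, a back-of-the-envelope sketch (my own numbers, not
taken from any particular program): each node owns a cubic subvolume,
updates every site in it each step, but only exchanges its six faces
with the neighbours, so communication/computation shrinks as the
subvolume grows.

/* Hypothetical arithmetic for a domain-decomposed 3d lattice code:
 * work per step grows with the subvolume (edge^3), while the halo that
 * must be exchanged grows only with its surface (6 * edge^2). */
#include <stdio.h>

int main(void)
{
    for (int edge = 8; edge <= 128; edge *= 2) {
        double compute = (double)edge * edge * edge;  /* interior updates */
        double comm    = 6.0 * edge * edge;           /* face exchange    */
        printf("edge %4d: comm/compute = %.4f\n", edge, comm / compute);
    }
    return 0;
}

At a 64^3 subvolume per node the ratio is already below a tenth, which
is why cheap networking per node is enough.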
 
> Now, if someone has found a new technique that changes the picture, great.
> But if this is something you've thought up yourself, I suggest you do some
> more research (or at least propose a more complete design). When one of the

Heck, I did it years ago. Somebody even used the writeup in a CPU
design class. As I don't have a fab and several hundred M$ to burn (and,
incidentally, more important research to do), I can hardly be expected
to assault the buttress alone, can I? You'll wind up in the moat, all
dirty and bloody, and have to listen to stupid French jokes and be used
for carcass target practice in the bargain.

> most competitive (and technically proficient) industries on the planet has
> already tried something and discarded it as unworkable, it's going to take
> more than arm-waving to convince me that they are wrong.

Right, the whole area of supercomputing is going to vanish into a
logic cloud overnight -- because, as everybody knows, they are
all monoprocessors. Beowulf is just a passing fad -- pay no
attention to the exponential growth in 'wulfers. Photolithographic
semiconductor CPUs will scale into the 10, 100, 1000 GHz regime
trivially. Einstein was dead wrong, and you can signal faster than the
speed of light in vacuum. It makes actual sense to implement a Merced
in buckytube logic. The people who proved that reversible cellular
automata in molecular logic are the most efficient way of
doing computation were just dweebs. Right.
 
> Regarding the Applicability of Parallelism
> The processes on a normal computer span a vast continuum between the
> completely serial and the massively parallel, but most of them cluster near
> the serial end of the spectrum. Yes, you have a few hundred processes in

Says who.

> memory on your computer at any given time, but only a few of them are
> actually doing anything. Once you've allocated two or three fast CPUs (or a

How would you know? I gave you a list of straightforward jobs my
machine could be doing right now. It all sounds very parallel to
me. Remember, there is a reason why I need to build a Beowulf.

> dozen or so slow ones) to the OS and any running applications, there isn't
> much left to do on a typical desktop machine. Even things that in theory

I guess I don't have a typical desktop machine, then. I could really
use an ASCI Red here, or better yet one of those kCPU QCD DSP machines.

> should be parallel, like spell checking, don't actually get much benefit
> from multiple processors (after all, the user only responds to one dialog
> box at a time).

Spell checking? I never do spell checking. I do have the C. elegans
genome sitting on my hard drive here, though, and I'd love to do some
statistical analysis on it. Guess what? Another embarrassingly parallel
app.
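
To illustrate the kind of statistic I mean, a hypothetical sketch (just
base composition over one slice of sequence): every node takes its own
chunk of the file, and the partial counts get summed at the end.
Nothing ever has to talk to anything mid-computation.

/* Hypothetical sketch: base composition over one chunk of sequence.
 * Chunks are independent, so every node can count its own slice and the
 * partial counts are added up afterwards -- embarrassingly parallel. */
#include <stdio.h>
#include <string.h>

static void count_bases(const char *seq, size_t len, long counts[4])
{
    for (size_t i = 0; i < len; i++) {
        switch (seq[i]) {
        case 'A': case 'a': counts[0]++; break;
        case 'C': case 'c': counts[1]++; break;
        case 'G': case 'g': counts[2]++; break;
        case 'T': case 't': counts[3]++; break;
        default: break;               /* skip Ns, headers, newlines */
        }
    }
}

int main(void)
{
    const char *chunk = "ACGTACGTTTGACN";   /* stand-in for one file slice */
    long counts[4] = { 0, 0, 0, 0 };

    count_bases(chunk, strlen(chunk), counts);
    printf("A=%ld C=%ld G=%ld T=%ld\n",
           counts[0], counts[1], counts[2], counts[3]);
    return 0;
}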

> On servers there is more going on, and thus more opportunity for
> parallelism. However, the performance bottleneck is usually in the network

You know what? We're going to move to xDSL pretty quick. And we're
going to need a database-backed web site, both for the intranet and
the outside. No never fork no more...

> or disk access, not CPU time. You can solve these problems by introducing
> more parallelism into the system, but ultimately it isn't cost-effective.
> For 99% of the applications out there, it makes more sense to buy 5
> standardized boxes for <$5,000 each than one $100,000 mega-server (and you
> get better performance, too).
 
Well, I guess I must be pretty special, because the $100,000
mega-server doesn't make sense at all when you want to do
multi-million-particle MD. Lots of cheap PCs with full-duplex
Fast Ethernet, very much yes. And there is no bottleneck, since from a
certain minimal system size per node onwards the thing scales O(N), and I
mean _strictly_ O(N).
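
The strict O(N) isn't magic, by the way. With a short-range cutoff you
bin the particles into cells once per step, and each particle only ever
looks at a bounded number of neighbours. A hypothetical sketch of that
binning pass (toy constants, not my actual code):

/* Hypothetical sketch of linked-cell binning for short-range MD: one
 * O(N) pass puts every particle into a cell at least as large as the
 * cutoff, so the force loop only ever visits a particle's own cell plus
 * its 26 neighbours -- bounded work per particle, O(N) in total. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NCELL 8                      /* cells along one box edge     */
#define NPART 1000                   /* particles owned by this node */

static int    head[NCELL][NCELL][NCELL];  /* first particle per cell  */
static int    next_in_cell[NPART];        /* chain within each cell   */
static double pos[NPART][3];              /* positions in a unit box  */

int main(void)
{
    /* random positions in [0,1) just so there is something to bin */
    for (int i = 0; i < NPART; i++)
        for (int d = 0; d < 3; d++)
            pos[i][d] = (double)rand() / ((double)RAND_MAX + 1.0);

    /* the O(N) binning pass */
    memset(head, -1, sizeof head);
    for (int i = 0; i < NPART; i++) {
        int cx = (int)(pos[i][0] * NCELL);
        int cy = (int)(pos[i][1] * NCELL);
        int cz = (int)(pos[i][2] * NCELL);
        next_in_cell[i] = head[cx][cy][cz];
        head[cx][cy][cz] = i;
    }

    /* the force loop (not shown) walks 27 cells per particle; at roughly
     * constant density that is a constant number of neighbours */
    int occupied = 0;
    for (int cx = 0; cx < NCELL; cx++)
        for (int cy = 0; cy < NCELL; cy++)
            for (int cz = 0; cz < NCELL; cz++)
                if (head[cx][cy][cz] >= 0)
                    occupied++;
    printf("%d of %d cells occupied\n", occupied, NCELL * NCELL * NCELL);
    return 0;
}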
 
> Of course, there are many processes that are highly amenable to being run in
> a parallel manner (video rendering, simulation of any kind, and lots of
> other things), but most of them are seldom actually done on PCs. The one

Well, I hate to keep repeating this, but it is not nearly as rare as
you seem to think it is.

> example that has become commonplace (video rendering) is usually handled by
> a specialized board with 1 - 8 fast DSP chips run by custom driver-level
> software (once again, the vendors have decided that a few fast, expensive
> chips are more economical than a lot of slow, cheap ones).
 
Cheap != slow. A $30 DSP can outperform a $300 CPU because it doesn't
have to put up with legacy bloat.

> Side Issues
> 1) Most parallel tasks require that a large fraction of the data in the
> system be shared among all of your CPUs. Thus, your system needs to provide

YMMV. Mine don't.

> for a lot of shared memory if it is going to be capable of tackling

Shared memory does not exist, at least not beyond 2-4 ports. If you
attempt to simulate it, you have to pay dearly in logic and
cache-coherence overhead, which starts to slow you down very quickly
(the point of diminishing returns is just around the corner). You can
simulate shared memory with message passing, though, if you really,
really, really need it.
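
For the record, that emulation is a few dozen lines of message passing.
A hypothetical MPI sketch (not production code): one rank owns the
array and answers read requests, everybody else does a send/receive
pair where a load instruction would have been, so you see exactly what
every "shared" access costs.

/* Hypothetical sketch: shared memory faked with MPI message passing.
 * Rank 0 owns the array and answers "read index i" requests; every
 * other rank does a blocking send/receive pair where a load
 * instruction would have been, paying a full message round trip. */
#include <stdio.h>
#include <mpi.h>

#define ASIZE     1024
#define TAG_REQ   1
#define TAG_REPLY 2

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {
        double shared[ASIZE];          /* the "shared memory" lives here */
        for (int i = 0; i < ASIZE; i++)
            shared[i] = (double)i;

        /* serve one read request from every other rank, then quit */
        for (int served = 0; served < nprocs - 1; served++) {
            int idx;
            MPI_Status st;
            MPI_Recv(&idx, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
                     MPI_COMM_WORLD, &st);
            MPI_Send(&shared[idx], 1, MPI_DOUBLE, st.MPI_SOURCE,
                     TAG_REPLY, MPI_COMM_WORLD);
        }
    } else {
        int    idx = rank % ASIZE;     /* each rank reads its own slot */
        double value;

        MPI_Send(&idx, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, TAG_REPLY, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank %d read shared[%d] = %g\n", rank, idx, value);
    }

    MPI_Finalize();
    return 0;
}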

> molecular CAD, atmospheric simulations, neural networks, etc. That brings

<laughter> Molecular CAD, weather codes and neural codes either already
are embarrassingly parallel or are patently formulable that
way. Really. Look on the code shelves.

> up all those issues of caching, inter-node communication and general
> overhead you were trying to avoid.
 
Caching doesn't exist (pray, pay no attention to the clever fata morgana).
Inter-node communication is readily addressable (read: solved) by a
failure-tolerant routing protocol with the switch fabric built into the
CPU. The next Alpha is going to have multi-10-GByte/s inter-CPU signalling
with 15 ns latency. A 3d lattice topology (6 links/CPU) is
really sufficient for most codes I care about.
 
See SGI/Cray, DSP clusters and Myrinet Beowulfs for illustration.
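
Today's message-passing libraries even expose that topology directly.
A hypothetical sketch with MPI's Cartesian communicators (nothing
vendor-specific) that hands each node exactly its six face neighbours:

/* Hypothetical sketch: a 3d (periodic) lattice of nodes via MPI's
 * Cartesian communicators.  Each node ends up knowing exactly its six
 * face neighbours (-x/+x, -y/+y, -z/+z), which is all a halo-exchange
 * code ever needs to talk to. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    int dims[3]    = { 0, 0, 0 };   /* let MPI factor nprocs into a grid */
    int periods[3] = { 1, 1, 1 };   /* periodic (torus) links            */
    int nbr[6];
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &rank);     /* ranks may be reordered in cart */

    /* the six links per CPU */
    for (int d = 0; d < 3; d++)
        MPI_Cart_shift(cart, d, 1, &nbr[2 * d], &nbr[2 * d + 1]);

    printf("rank %d neighbours: %d %d %d %d %d %d\n",
           rank, nbr[0], nbr[1], nbr[2], nbr[3], nbr[4], nbr[5]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}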

> 2) You also can't get away from context switching. Any reasonably complex
> task is going to have to be broken down into procedures, and each processor
> will have to call a whole series of them in order to get any useful work
> done. This isn't just an artifact of the way we currently write software,

Untrue. You almost never have to switch context if you have 1 kCPUs to
burn. You only have to do it if you run out of the allocable CPU
heap (when the number of your objects exceeds the number of your CPUs).
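
To put a number on that "CPU heap", a trivial hypothetical sketch: one
object per node means no switching at all, and multiplexing only begins
once the object count exceeds the node count.

/* Hypothetical arithmetic for the "CPU heap": one object per node means
 * no context switching; multiplexing only starts once the object count
 * exceeds the node count. */
#include <stdio.h>

int main(void)
{
    const int nodes = 1024;                 /* 1 kCPUs to burn */
    const int objects[] = { 800, 1024, 5000 };

    for (int i = 0; i < 3; i++) {
        int per_node = (objects[i] + nodes - 1) / nodes;  /* rounded up */
        printf("%5d objects on %d nodes: %d per node%s\n",
               objects[i], nodes, per_node,
               per_node > 1 ? " (switching required)" : " (no switching)");
    }
    return 0;
}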

> either. It is an inevitable result of the fact that any interesting
> computation requires a long series of distinct operations, each of which may
> require very different code and/or data from the others.
 
Strangely, my needs are very different.

> Billy Brown, MCSE+I
> ewbrownv@mindspring.com


