updated 2009-04-29.
Some notes on Computer Architecture. Very incomplete.
Contents:
[FIXME: should I put links to Beowulf here ?]
David also maintains related files:
``I wisdom dwell with prudence, and find out knowledge of witty inventions.'' -- Proverbs 8:12
computer architecture news
The latest computer architecture information is at
news:comp.arch.hobbyist mirror http://groups.google.com/groups?group=comp.arch.hobbyist and
news:comp.arch.embedded mirror http://groups.google.com/groups?group=comp.arch.embedded
The LEON core is a SPARC* compatible integer unit developed for future space missions. It has been implemented as a highly configurable, synthesisable VHDL model. To promote the SPARC standard and enable development of system-on-a-chip (SOC) devices using SPARC cores, the European Space Agency is making the full source code freely available under the GNU LGPL license.
The LEON core has been extensively tested against the SPARC V8 architecture manual and the IEEE-P1754 (SPARC) standard, but has not been formally tested and certified by SPARC International as being SPARC V8 compliant. ...
DO YOU WANT TO HELP? If you wish to contribute to LEON and work on (or donate) any of these modules, please contact Jiri Gaisler.
X-URL: http://www.egroups.com/list/f-cpu/
Date: Tue, 12 Jan 1999 21:47:57 -0500 (EST)
From: Robert Dale <rob at nb.net>
To: F-CPU <f-cpu at egroups.com>
Subject: [f-cpu] CVS server

We have anonymous (read-only) CVS access to AlphaRISC's TTA sim he kindly wrote.

$ export CVSROOT=:pserver:anonymous@www.deepfreeze.org:/home/src/f-cpu
$ cvs login
(Logging in to anonymous@www.deepfreeze.org)
CVS password: <hit Enter>
$ cvs co f-cpu

If you need help and/or want write access, feel free to email me. We'll probably put some help documentation on the website. ;)
-- Robert Dale
is one of the architectures suggested for the F-CPU. Its major advantage: gcc has already been ported to it, so we don't need to write a new compiler on top of everything else. We can immediately compile the Linux kernel and other code for this chip *now*, before we even start doing anything else, which should make cache analysis and other trade-off analysis easy. Also, it means that when the hardware is turned on for the first time, it could boot directly into Linux, which would be cool.
(Can we really go to 64 bit architecture and still use this compiler ?)
(Can we really use a TTA architecture for the attached FPUs ?)
is one of the architectures suggested for the F-CPU.
// $X = ($Y ^ rM) v ($Z|Z ^ ~rM).
MUX $X, $Y, $Z|Z

// sequence equivalent to MUX $X, $Y, $Z|Z
AND $t1, $Y, $rM
ORN $t2, $rM, $Z|Z
ORN $X, $t1, $t2

// sequence equivalent to MUX $X, $Y, $Z|Z
NAND $t1, $Y, $rM
ORN $t2, $rM, $Z|Z
NAND $X, $t1, $t2

// special peephole optimization for MUX $X, $Y, 0
AND $X, $Y, $rM

// special peephole optimization for MUX $X, $Y, $Z
// where $Z contains all ones:
ORN $X, $Y, $rM
// simulate MXOR $X, $Y, $Z|Z
NOT $znot, $Z|Z
NOT $ynot, $Y
MOR $t3, $Y, $znot
MOR $t4, $ynot, $Z|Z
AND $X, $t3, $t4
NOT $X, $Z|Z // set $X = ~$Z, i.e., bitcomplement; i.e., flip all the bits

NOT can often be merged with previous or following bit instructions by replacing AND, OR with NAND, NOR, ANDN, ORN. When it can't, the assembler can use any of these equivalent instructions:
// When Z is a register, either one of these instructions
// bitcomplements it.
NOR $X, $Z, 0
ANDN $X, $Z, 0

// When Z is a literal in the range -1 <= Z < #FF,
// we can complement it in one instruction
// at compile time.
// If Z is outside this range,
// just load ~Z directly into a register.
SUB $X, zero, (1+Z) // sets $X = ~Z for -1 <= Z <= #FE.
ANDN $X, zero, Z // sets $X = ~Z for 0 <= Z <= #FF.
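A quick Python sanity check of the MUX-equivalent sequences and peephole tricks above, treating registers as 8-bit integers. The operand convention assumed here is that ORN inverts its second source operand (ORN(a,b) = a | ~b); the ANDN variants depend on which operand ANDN inverts, so they are left unchecked.

```python
# Checks that the AND/ORN/ORN and NAND/ORN/NAND sequences really equal
# MUX, and that the complement tricks work, on 8-bit register values.
MASK = 0xFF

def AND(a, b):  return (a & b) & MASK
def NAND(a, b): return ~(a & b) & MASK
def ORN(a, b):  return (a | ~b) & MASK   # assumed: second operand inverted
def NOR(a, b):  return ~(a | b) & MASK

def mux(y, z, m):
    # MUX $X, $Y, $Z: bits of Y where mask rM is 1, bits of Z where it is 0
    return ((y & m) | (z & ~m)) & MASK

for y in range(0, 256, 17):
    for z in range(0, 256, 23):
        for m in (0x00, 0x0F, 0xA5, 0xFF):
            want = mux(y, z, m)
            assert ORN(AND(y, m), ORN(m, z)) == want      # AND/ORN/ORN
            assert NAND(NAND(y, m), ORN(m, z)) == want    # NAND/ORN/NAND
            assert AND(y, m) == mux(y, 0x00, m)           # MUX $X,$Y,0
            assert ORN(y, m) == mux(y, 0xFF, m)           # MUX $X,$Y,all ones

# literal bitcomplement tricks:
for z in range(-1, 0xFF):          # -1 <= Z <= #FE
    assert (0 - (1 + z)) & MASK == ~z & MASK   # SUB $X, zero, (1+Z)
for z in range(0x100):
    assert NOR(z, 0) == ~z & MASK              # NOR $X, $Z, 0
```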
is one of the architectures suggested for the F-CPU.
(only with FPGAs can you have reconfigurable computing)
see also robot_links.html#PLD Programmable Logic (FPGA, PLD, CPLD, etc.) and the devices needed to program them.
see also #simple_cpu for simple CPU architectures that one would think would be simple to implement in a FPGA.
see also vlsi.html#PCI_on_FPGA for FPGA devices compatible with the PCI bus.
[backup copy of something I wrote on http://electronicschat.org/echatwiki/HomebuiltCpu ]
I imagine that just about anyone who has programmed 80x86 assembly language has dreamed about building their own CPU with their own assembly language.
If you've dreamed about building your own CPU that runs your own assembly language, today is a wonderful time to be living. There are many, many ways to fulfill your vision. (A few of them are even useful). -- DavidCary ( http://david.carybros.com/html/computer_architecture.html#simple_cpu )
(Note: this page is *not* about building 80x86-based desktop computers. That sort of thing is discussed over at http://en.wikibooks.org/wiki/How_To_Build_A_Computer . Building a custom 80x86-based laptop -- I don't know.)
* alt.comp.hardware.homebuilt FAQ by Mark Sokos http://www.faqs.org/faqs/homebuilt-comp-FAQ/index.html
== useful restrictions ==
=== tiny size ===
Sometimes you want a tiny little processor -- say, you want it to fit inside a model rocket.
Imagine a fully custom tiny little processor, running exactly the instruction set you've always dreamed of. How is that different from a tiny PIC or Atmel processor, programmed to *emulate* your instruction set?
[http://www.taniwha.com/~paul/fc/ Taniwha Flight Computer Home Page]
[http://img.cmpnet.com/edtn/ccellar/e023pdf1.pdf "Picaro: A Stamp-like Interpreted Controller"] article by Tom Napier 1998-04 in _Circuit Cellar Ink_
I'm thinking of building one of these with an external *serial* memory (much smaller and cheaper than the normal parallel memory ... slow, but probably plenty fast enough for what I want it to do). Unlike most processors, this way you can execute code in serial memory. (I think this is also going to give the lowest-cost and also the lowest-power way to get your own custom instruction set).
=== higher speed ===
You can build a "processor" that performs some kinds of special-purpose tasks faster than Intel's latest device. Use FPGAs.
You can build a CPU that is "instant on" and "instant off" (use FLASH, and enough capacitance to dump the essential bits of the current state of RAM to FLASH).
=== extreme environments ===
Can you get it working over the entire "industrial temperature range"? ( -40 °C to +125 °C )
== challenging, educational, "fun" restrictions ==
=== built with "classic" TTL chips only ===
Generally built around a 74LS181 ALU or similar ( 74HC181 ). (Several homebuilt CPUs have been built like this)
=== bizarreness ===
Since we're building things from scratch, why count in the "natural binary code"
000 001 010 011 100 ...
?
Couldn't we increment address registers in some *other* sequence ?
What other sequences would be "interesting" to experiment with ?
Is there some sequence that uses *less* hardware (easier to build) -- perhaps LFSR ? Is there some sequence that uses less power (runs longer on batteries) -- perhaps gray code ?
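A small Python sketch of both alternatives next to natural binary. The 4-bit width and the Galois-LFSR tap mask 0b1001 are example choices, not from the text.

```python
# Two alternative counting sequences for an address register.
N = 4

def gray(i):
    # adjacent Gray codes differ in exactly one bit, so a Gray counter
    # flips fewer outputs per step than a binary counter (less switching,
    # possibly less power)
    return i ^ (i >> 1)

def lfsr_step(s, taps=0b1001):
    # Galois LFSR: one shift plus a conditional XOR -- cheaper hardware
    # than a ripple-carry incrementer, but it never visits state 0
    return (s >> 1) ^ (taps if s & 1 else 0)

# Gray code: exactly one bit changes per increment
for i in range(2 ** N - 1):
    assert bin(gray(i) ^ gray(i + 1)).count("1") == 1

# this LFSR cycles through all 2**N - 1 nonzero states before repeating,
# so it still covers nearly the whole address space, just out of order
seen, s = set(), 1
while s not in seen:
    seen.add(s)
    s = lfsr_step(s)
assert len(seen) == 2 ** N - 1
```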
=== all one part ===
As many of you are aware, most projects are built from a multitude of different parts. There's always one part that takes the longest time to ship in. And then I always break something, and I have to wait even longer for the replacement to ship in.
However, computers have been built out of large numbers of the *same* device wired together appropriately.
* transistors
* [http://en.wikipedia.org/wiki/Apollo_Guidance_Computer 4,100 ICs, each containing a single 3-input NOR logic gate.]
* dual 3-input NOR gate ICs
I (DavidCary) have been wondering:
* given what we now know about "simple" RISC and zero-operand instruction sets, could I design a CPU with significantly *less* than 4,100 NOR gates ?
* What other "universal" chips can be used (in large enough quantities) to build an entire CPU ?
I want to use chips that are readily available, and also "dense". (Obviously, the densest chips are the all-in-one CPU microcontrollers ... but I can't customize those. What other points on the spectrum are available?)
Let's focus on a 8 bit register (8 D flip-flops) for a moment. I imagine I'll use a bunch of them in my CPU. (program counter, registers, address pointers, etc.)
chips/bit   chips per 8-bit register   chips per 3-input NOR
'''universal chips'''
5(?)        40        1       single NOR
3(?)        24        1/2     dual NOR
2(?)        16        1/3     triple NOR 74HC27
1           8         1       dual 4-input mux 74HC153
1/2         4         3(?)    quad 2-input mux, inverting 74HC158
'''non-universal chips'''
1/2         4         N/A     dual D flip-flop 74HC74
1/4         2         N/A     quad D flip-flop 74HC173
1/6         2         N/A     hex D flip-flop 74HC174
1/8         1         N/A     octal D flip-flop 74HC564
Obviously, anything that could be built from single NOR gate ICs, could also be built from the much "denser" dual NOR gate ICs, and it would take significantly less space, time, effort, weight, etc.
It looks like the octal D flip-flop is the densest chip. Unfortunately, it is *not* "universal" -- parts of the CPU (the ALU, etc.) need to act in ways that I don't think D flip-flops can act. So if I want to stick to the "all one part" idea, I can't use it.
It looks like it takes three '158 chips to emulate a simple 3-input NOR, but only one '153 chip. So I suspect the '153 is better for building the random logic in the control section and the ALU.
* What other "universal" chips are there ?
* Which of the universal chips can implement a given CPU in the fewest number of chips ? (If the CPU is dominated by registers, I suspect it will be one that can store the most bits in the fewest number of chips -- the '158 is the best I've found so far).
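The chip counts above rest on the fact that a multiplexer is a "universal" gate. A sketch of the idea: a 4-input mux whose two select lines carry inputs (a, b) and whose data inputs are wired to constants or to c / ~c can realize any 3-input function (Shannon expansion). This assumes an inverted copy of c is available, e.g. from an inverting mux like the '158.

```python
# Why a mux counts as "universal": building a 3-input NOR
# from one half of a dual 4-input mux like the 74HC153.
def mux4(d0, d1, d2, d3, s1, s0):
    # one half of a dual 4-input mux such as the 74HC153
    return (d0, d1, d2, d3)[s1 * 2 + s0]

def nor3_from_mux(a, b, c):
    # NOR truth table: output is 1 only when a=b=0 and c=0,
    # so data input D0 is wired to ~c and the rest to constant 0
    return mux4(1 - c, 0, 0, 0, a, b)

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert nor3_from_mux(a, b, c) == (0 if (a | b | c) else 1)
```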
=== all 2 parts ===
Similar to the above, but easing the restriction to allow 2 different kinds of chips.
* If I use 2 different kinds of chips, which 2 chips can implement a given CPU in the fewest number of chips ? (The '564 is 4 times as dense as the '158 for storing bits ... and the '153 looks like it will be more dense than the '158 for most random control logic.)
It is amazing what you can squeeze onto these parts if you design the machine architecture carefully to exploit FPGA resources. In contrast, there was a very interesting article in a recent EE Times by a fellow from VAutomation doing virtual 6502's ... Although the 6502 design used only about 4000 "ASIC gates" it didn't quite fit in a XC4010, a so-called "10,000 gate" FPGA. That a dual-issue 32-bit RISC should fit, and a 4 MHz 6502 does not, states a great deal about VHDL synthesis vs. manual placement, about legacy architectures vs. custom ones, and maybe even something about CISC vs. RISC...
-- Jan Gray http://www3.sympatico.ca/jsgray/home1.txt
"according to Dataquest, the business of embedding CPUs into other chips is growing at about 25% per year and stood at about $20 billion in 2000. ...
The average is 2.3 processors per intelligent ASIC (those with any programmability) and rising. ...
... in the free category ... processor designs from CMOSexod and Free-IP cores. Both designs have little 8-bit processors you've never heard of but come with free tools to get you started. The Free-IP Project ... OpenCores ...
Elixent combines configurable processors with field-programmable logic to create a constantly changing processor. Elixent's tool switches implementations on the fly based on your workload. ...
Elixent enters the world of dynamic reconfigurability, or reconfigurable computing (RC). RC ... change the hardware on the fly to suit the application. ...
... It stands to reason that whenever one part of the circuit is working the other parts must be idle. ... Hardware systems have always been created from a superset of all the functions required over the life of the product, rolled into one.
If RC takes off, that might change. For example, future generations of cell phones may use half the transistors to do 10 times the work. Fewer transistors may be all that's needed, if none are wasted on functions that aren't required right now.
... Jim Turley ... visit his web site at www.jimturley.com. "
mentions these companies [FIXME: add to my other list of companies?] : VAutomation ARC International http://www.vautomation.com ... CMOSexod http://www.cmosexod.com/freeip.htm ... Elixent Ltd. http://www.elixent.com ... Lexra Inc. http://www.lexra.com ... OpenCores http://www.opencores.org/projects ... ARCtangent, ARChitect ARC International http://www.arccores.com ... Handel-C Celoxica http://www.celoxica.com ... LEON-1 Distributed by The European Space Agency http://www.estec.esa.nl/wsmwww/leon/ ... ST210 Agilent Technologies http://we.home.agilent.com ... STMicroelectronics http://us.st.com ... Jazz processor Improv Systems, Inc. http://www.improvsys.com ... Xtensa Tensilica http://www.tensilica.com ... The Free-IP Project http://www.free-ip.com with XESS, Corp. http://www.xess.com ...
... the FPGA High Performance Computing Alliance (FHPCA) ... The FHPCA aims to advance the development and adoption of FPGA-based high-performance computing through collaboration on the design and development of world's most powerful FPGA-based supercomputer using COTS (Commercial Off The Shelf) technology. ... provide opportunities for education and training ... facilitate technology transfer, including assistance with porting conventional supercomputing applications to an FPGA-based supercomputer; and promote research in the application of FPGA-based high-performance computing by making facilities available to visiting academics. ... "Nallatech ... We have systems installed in applications as diverse as genomic processing and military radar. ..." said Allan Cantle, CEO of Nallatech. "FPGA computing is today where conventional microprocessor-based computing was 15 years ago. The potential exists to deliver unprecedented computational capacity, using less power in a smaller space, however to unleash that potential the industry needs to develop the means to give users low risk access to that power." ... his white paper, "FPGA Centric Computing Institute" ... in October 2003.
http://www.microswiss.ch/tld/2004/abstracts.html ; http://www.microswiss.ch/tld/2004/papers/Brueckner.pdf
According to DataQuest, 17% of all FPGA design starts in 2003 included an embedded microprocessor. By 2007, estimates predict embedded microprocessor utilization in FPGAs will grow to 37%. ...
The b16 Processor is a minimalistic stack processor, inspired by Chuck Moore's recent work. The original incarnation is 16 bit, byte addressed. There are 32 instructions, each is 5 bits. Three and a bit (literally!) are packed into a 16 bit word (bundle). The first slot in the bundle can only be a nop or a call.
-- http://b16-cpu.de/ (designed in Verilog)
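The "three and a bit" packing can be sketched like this; note this is one plausible bit layout for illustration, not necessarily the b16's actual bit assignment.

```python
# Packing three 5-bit instructions plus one leftover bit into a 16-bit
# bundle (3 x 5 bits leaves exactly 1 bit over).
def pack(op0, op1, op2, extra_bit):
    assert all(0 <= op < 32 for op in (op0, op1, op2))
    assert extra_bit in (0, 1)
    return (op0 << 11) | (op1 << 6) | (op2 << 1) | extra_bit

def unpack(bundle):
    return ((bundle >> 11) & 31, (bundle >> 6) & 31,
            (bundle >> 1) & 31, bundle & 1)

assert unpack(pack(3, 17, 31, 1)) == (3, 17, 31, 1)
```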
DAV:
As CPUs get relatively faster than RAM, preload becomes more important.
After deciding on an address and sending it to RAM,
it takes several CPU cycles before the RAM is able to reply with the data.
If the very next instruction requires that data,
then the CPU has to wait.
But if the instruction set is designed so that
the programmer can designate an address ahead of time,
this allows the programmer to squeeze in a few ``internal'' instructions
instead of pausing.
(This is very similar to the "branch delay slot" on some RISC processors).
Early (asynchronous) RAM chips allowed you to present the data at the same time as the address
when you wanted to write to the RAM chip. This *seems* faster
than presenting the address, then later presenting the data, in isolation.
But in the context of alternating reads and writes,
where naturally the data coming from the RAM only arrives a memory cycle *after*
the RAM is presented with an address,
it turns out to be faster to always delay the data (whichever direction it is going)
for writes as well. This leads to synchronous RAM ( in particular, SDRAM --
but SRAM stands for something different: static RAM).
DAV: does simply A! help, or do we need to also indicate whether it will be a load or a store ?
There are several improvements that Chuck has added to his newer Forth virtual machine model.
One of them is the address register and the other is the circular stacks.
Chuck has explained that hardware considerations aside,
the idea of the address register was that the Forth words @ and ! ( fetch and store)
were clumsy
at the top of the stack and
were based on smaller atomic operations that the programmer could take advantage of.
@ was broken into two operations A! and @A. Likewise ! was broken into A! and !A.
...
advantage is the use of auto-increment addressing memory access opcodes
...
--
Jeff Fox, talking about Chuck Moore
http://www.ultratechnology.com/forth3.htm
[FIXME: Why isn't this listed on that page ? ]
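Jeff Fox's description of splitting @ and ! into smaller atomic operations can be modeled in a few lines of Python. This is a toy sketch of the idea only (method names like `fetch_a_inc` are my own, and the auto-increment variant is called @A+ here for illustration).

```python
# Forth's @ (fetch) and ! (store) decomposed into setting the A register
# (A!) and then reading (@A) or writing (!A) through it, plus an
# auto-increment variant for streaming through memory.
class Machine:
    def __init__(self, mem):
        self.mem = mem
        self.stack = []
        self.a = 0

    def a_store(self):          # A! : pop top of stack into A
        self.a = self.stack.pop()

    def fetch_a(self):          # @A : push [A]
        self.stack.append(self.mem[self.a])

    def store_a(self):          # !A : pop top of stack into [A]
        self.mem[self.a] = self.stack.pop()

    def fetch_a_inc(self):      # @A+ : push [A], then A := A + 1
        self.fetch_a()
        self.a += 1

    def fetch(self):            # classic @ rebuilt from the atoms
        self.a_store()
        self.fetch_a()

m = Machine([10, 20, 30, 40])
m.stack.append(1)
m.fetch()                       # @ : address 1 -> value 20
assert m.stack == [20]
m.fetch_a_inc()                 # stream onward from address 1 using @A+
m.fetch_a_inc()
assert m.stack == [20, 20, 30]
```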
[FIXME: get the book _HDL Chip Design_ by Douglas J. Smith, 1998, Doone Publications, ISBN 0-9651934-3-8 ( Computer Literacy , $65+tax). ]
... I strongly recommend the book HDL Chip Design ...
Prentice-Hall has published a Xilinx FPGA lab book and win95/NT software which includes a coupon for an XESS FPGA board. Adding the cost of a power supply, you can be doing experiments for less than $250. ...
Only 3 instructions:
br adr         Load adr into PC if F=0. Clear F.
add adr1,adr2  Add the contents of memory location adr1 and the contents of memory location adr2 and store the result in memory location adr2. Store the carry of the addition in F.
nor adr1,adr2  Calculate the NOR function of the contents of memory location adr1 and the contents of memory location adr2 and store the result in memory location adr2. F=1 if result=0, else F=0.

and only 1 user-accessible register (PC) and only 1 user-accessible flag (F).
DAV: Unfortunately, self-modifying code is required for subroutine call/return. This makes re-entrant code impossible.
DAV: some simple macros:
; These macros require 3 special locations:
; ZERO: a location reserved to handle the value 0.
; ONES: a location reserved to handle the value with every bit set (sometimes called -1 or ~0).
; Set them up with
;   really_clear ZERO
;   really_mov -1, ONES

;;;;;;;;; single-instruction macros (aliases)

not a ; invert each bit in a
    nor a,a
    ; or alternatively
    nor ZERO, a

clear a ; clear all bits in a to zero. Side effect: sets F=1
    nor ONES, a

mov 0, a
    clear a

;;;;;;;;; multi-instruction macros (perhaps should be subroutines)

really_clear a ; this is used only to set up locations ZERO and ONES
    add a,a
    add a,a
    add a,a
    ... (repeated for 16 bits, or whatever the wordsize is)
    add a,a

really_mov -1, a ; this is used only to set up location ONES
    really_clear a
    not a

mov -1, a ; set location a to the value -1
    clear a
    not a

mov -2, a ; set location a to the value -2
    mov -1, a
    add a,a

mov 1, a ; set location a to the value 1 -- this is used to set up location ONE
    mov -2, a
    not a
    ; this expands to the 4 instructions
    ;   clear a
    ;   not a
    ;   add a,a
    ;   not a

rotl a ; rotate a left, moving MSB to LSB
       ; side effect: always clears F.
    add a,a
    br continue
    add ONE, a
continue:

rotlc a ; move F into LSB of a, shifting all bits left, leaving old MSB in F
    br F_was_0
    add a,a
    br F_was_1_high_bit_was_0
    ; handle case of ``F was 1, high bit was 1''
    add ONE, a
    br set_F_and_continue
    ; execution never gets here
F_was_1_high_bit_was_0:
    add ONE, a
    br continue
    ; execution never gets here
F_was_0:
    add a,a
    ; the next 2 instructions seem redundant in the F_was_zero case,
    ; but they are necessary to handle the F_was_1 case.
    br continue
set_F_and_continue:
    ; this must fall through to the end, because branches always clear F
    nor ONES, ZERO
continue:

mov a, b ; set location b to the value in location a. WARNING: doesn't work when a=b.
    clear b
    add a,b

jump addr ; unconditionally jump to given address (side effect: clears F)
    ; isochronous code should use this:
    add ZERO, ZERO
    br addr
    ; alternatively, one instruction quicker when F=0:
    br addr
    br addr
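A minimal Python interpreter for this br/add/nor machine, usable as a sanity check of the macros. The semantics here are my reading of the description above: `br` loads PC when F=0 and always clears F, `add` leaves the carry in F, and `nor` sets F=1 exactly when the result is zero. Instructions are kept as (op, adr1, adr2) tuples rather than encoded words.

```python
# A 16-bit model of the 3-instruction machine (br / add / nor).
MASK = 0xFFFF

def run(mem, prog, steps=1000):
    pc, f = 0, 0
    while pc < len(prog) and steps > 0:
        steps -= 1
        op, a1, a2 = prog[pc]       # br uses only a1; a2 is ignored
        pc += 1
        if op == "br":
            if f == 0:
                pc = a1             # branch taken only when F=0
            f = 0                   # branches always clear F
        elif op == "add":
            t = mem[a1] + mem[a2]
            mem[a2], f = t & MASK, t >> 16
        elif op == "nor":
            mem[a2] = ~(mem[a1] | mem[a2]) & MASK
            f = 1 if mem[a2] == 0 else 0
    return mem

# the "not a" macro above (nor a,a) really does invert every bit:
mem = run({0: 0x1234}, [("nor", 0, 0)])
assert mem[0] == (~0x1234) & MASK
```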
lots of details -- the complete schematic diagram of the CPU, a pretty photograph of the test board (FPGA + RAM + ROM + some LEDs to simulate a traffic light) ... some assembly-language code ...
Date: Thu, 11 Jun 1998 05:32:23 +0200 (MET DST)
To: David Cary <d.cary at ieee.org>
From: Majordomo at lslsun.epfl.ch
Subject: Welcome to nlc

Welcome to the nlc mailing list! This is a mailing list for discussing the 'C -> netlist' compiler, or 'nlc' for short. To send mail messages to the nlc mailing list, send your message to: nlc@lslsun.epfl.ch

I have made nlc, the C to netlist compiler, available for anonymous ftp on lslsun5.epfl.ch (128.178.150.25) in /pub/nlc-0.9.tar.gz. It is compressed using gzip (available at a GNU archive site near you). A compiled binary version for Sparc Solaris 2.x is in the file /pub/nlc-0.9.bin.gz. If you have any trouble getting the file, let me know. I can probably mail it to you.

For those that are interested in the Spyder project, you will find below references to already published work. Please let me know if you have trouble locating these papers, and I'll see what I can do.
Have fun and please keep in touch,
Christian
---
* Christian Iseli and Eduardo Sanchez, "Spyder: A SURE (SUperscalar and REconfigurable) Processor", The Journal of Supercomputing, vol. 9, no. 3, pp. 231-252, 1995.
* Christian Iseli and Eduardo Sanchez, "A Superscalar and Reconfigurable Processor", Field-Programmable Logic: Architectures, Synthesis and Applications, Lecture Notes in Computer Science vol. 849, pp. 168-174, Springer-Verlag, Prague, Czech Republic, September 1994.
* Christian Iseli and Eduardo Sanchez, "Beyond Superscalar Using FPGAs", IEEE International Conference on Computer Design, Cambridge, Mass., October 1993.
* Christian Iseli and Eduardo Sanchez, "Spyder: A reconfigurable VLIW processor using FPGAs", IEEE Workshop on FPGAs for Custom Computing Machines, Napa, April 1993.
* Christian Iseli and Eduardo Sanchez, "Augmentation du parallélisme par la reconfigurabilité", Actes des 6èmes Rencontres Francophones du Parallélisme, RenPar'6, pp. 3-6, École normale supérieure, Lyon, June 1994.
* Serge Durand and Christian Iseli, "Developing a reconfigurable coprocessor on an SBus board", Sun User Group 1993 Proceedings, pp. 5-9, San Jose, California, December 1993.
* Christian Iseli and Eduardo Sanchez, "A C++ compiler for FPGA custom execution units synthesis", IEEE Symposium on FPGAs for Custom Computing Machines, pp. 173-179, Napa, CA, April 1995.
From: Christian Iseli <chris@lslsun.epfl.ch>
Date: Wed, 12 Aug 1998 13:53:07 +0200 (MET DST)
To: nlc@lslsun.epfl.ch
Subject: Re: Is this list dead ?

>I heard this mailing list talked about converting (mathematically
>intensive) C programs to run at high speed on FPGAs.
>
>I thought I would lurk and learn, but I haven't gotten any messages since I
>got the "Welcome to nlc" message on 1998-06-11.
>
>Is this list dead ?

Not really, but it was never very lively either... ;-) As it happens, I'm now done with my PhD thesis and since I had no means of pursuing nlc further, I now have another job and have little time to actively develop nlc.

The PhD thesis report is available at: http://lslwww.epfl.ch/pages/publications/rcnt_theses/home.html
The latest nlc source code is available at: ftp://lslwww.epfl.ch/pub/

HTH. Cheers,
Christian
The exact opposite thing is also being developed:
vhdl2c "vhdl2c is a vhdl to `C' converter by Michael Knieser"
see also "Tiny robots" robot_links.html#tiny and http://electronicschat.org/index.cgi/HomebuiltCpu
I am fascinated by CPU designs that are extremely simple, approaching "minimal", in several (sometimes incompatible) senses of the word "simple".
Here "CPU", "MCU", and "PE" are almost interchangeable.
Often these characteristics exhibit synergy -- when you eliminate some opcodes, that eliminates some gates (making it lower-cost) and makes it easier to understand (less documentation required). Occasionally, though, driving a design to extreme simplicity according to one measure causes extra complexity in another area to compensate. This general concept is known as the waterbed theory . When applied to CPU designs, this is called the Turing tarpit #tarpit .
For purposes of robotics, "low-cost" and "simple interface" are usually the dominant considerations. Some of the concepts of massively parallel processing are also present when one builds swarms of simple robots, but the kind of random communication between constantly-changing arrangements of simple robots is very different from the communication between the rigidly connected (most commonly in a 2-D mesh or a hypercube) elements of common cellular automata and multi-processors.
See also Using FPGAs to simulate CPUs #FPGA for some very simple CPUs designed to fit onto FPGAs. 2 very different reasons for simplicity there: (a) a simpler CPU can fit onto a smaller FPGA, making it much less expensive. (b) making the CPU smaller allows you to fit more copies of the CPU on a given FPGA, increasing MIPs at no cost. [FIXME: should combine these into 1 section ? Since the same op-code set may be implemented many ways, in TTL, in FPGA, in custom VLSI, it doesn't really make sense to split them into separate sections ... On the other hand, ``less hardware'' means something a little different in these 3 technologies. ]
[here I *list* simple CPUs I know about; Opcode considerations #considerations and #FPGA also talk about tips for designing new CPU architectures. ]
... Here I also list all the CPUs I know about that were built up out of TTL.
...
The Block I version used 4,100 ICs, each containing a single 3-input NOR gate. The later Block II version used dual 3-input NOR gates in a flat-pack; approximately 5,600 gates in all. The gates were resistor-transistor logic (RTL). ...
The instruction format was 3 bits for opcode, 12 bits for address.
(details about each instruction in the instruction set there). (Knowing what we now know about MISC, how much smaller could we make a "reasonable" CPU out of NOR gates ? )
For as long as I've been using microprocessors -- and I was designing the Intel 8080 into radiation-monitoring equipment in 1977 -- I've always itched to have control over the instruction set. The makers always seem to leave something out....
Picaro makes a pleasant change from systems with 16-MB minimum memories and operating systems no human being can comprehend.
describes a very easy-to-build system: There's the CPU connected to some EPROM (for programs and data) and a crystal. The EPROM is serial EPROM (makes things *much* easier to wire up). The "CPU" is really a PIC with an internal interpreter with 2 modes. If the magic value is discovered in EPROM, the PIC starts fetching and "executing" (interpreting) the program in EPROM. Otherwise, it assumes the EPROM is blank, and waits for a properly-formatted program to stream in on the serial port which it programs into the EPROM. [FIXME: todo: build something like this ... but with a completely different instruction set, natch.]
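The boot-mode decision described above can be sketched in a few lines. The function names and the MAGIC value here are made up for illustration; the Picaro article presumably defines its own EPROM format.

```python
# Picaro-style reset behavior: run the stored program if the EPROM
# looks programmed, otherwise accept a download over the serial port.
MAGIC = 0xA5  # hypothetical "valid program present" marker

def boot(eeprom_read_byte, run_interpreter, program_from_serial):
    if eeprom_read_byte(0) == MAGIC:
        run_interpreter()        # magic value found: interpret the program
    else:
        program_from_serial()    # blank EPROM: wait for a serial download

# simulated blank EPROM (reads back 0xFF) -> falls into programming mode
events = []
boot(lambda addr: 0xFF,
     lambda: events.append("run"),
     lambda: events.append("program"))
assert events == ["program"]
```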
(in other words, no ROM or FLASH is needed for initial testing).

From: Ben Franchuk
Date: Wed Jul 25, 2001 7:10 am
Subject: Re: [fpga-cpu] A debugger for xr16 the easy way...

On my cpu on reset a bootstrap loader is run from the serial input. This way all I need for external parts is [RAM] and a buffer for serial lines from the FPGA.
``The Jupiter Ace hardware page: How to build your own !'' by Grant Searle BSc. http://www.home-micros.freeserve.co.uk/JupiterAce/JupiterAce.html
Build your own ZX81 http://www.home-micros.freeserve.co.uk/zx80/zx80nmi.html
interesting wire-wrap style ... see ``Back of PCB'' photo. While this uses a Z80 cpu (and so is ``cheating'' in the context of building the CPU itself out of standard components), the TV / monitor output circuitry may be interesting... [FIXME: computer museum] [FIXME: crosslink to schematic.html ?]... old micros which can still be built since they don't use custom components. ...
-- http://www.pcengines.com/toy2.htm
`` TOY/2 - a minimal 16 bit CPU
TOY/2 was designed by Pascal Dornier and Stephan Paschedag ... in 1988 ...
TOY/2 can address a full 64K word program and memory address space. ... and takes up a whopping 3300 transistors (excluding I/O pads). ...
All instructions are 16 bit, 4 bits for the opcode, and 12 bits for the direct address. ''
DAV: I think the JMP instruction should be P := [vec] to make it consistent with the description and the conditional branches, and to allow jumping to code anywhere in the full 64 Kword program space. Only 15 defined instructions:
Programmer-visible registers:
  A: accumulator
  P: program counter
  T: temporary register for indirect store
  C: carry flag
  Z: zero flag (DAV: does this *always* indicate the current state of A ?)

; 3 for program flow control
JMP vec    P := [vec] ; indirect jump
BCC vec    IF (carry clear) then P := [vec]
BNE vec    IF (not zero, i.e., 0==Z) then P := [vec]
; 3 for direct load and store
LDC src    A := [src], C := 0 ; load A and clear carry
LDA src    A := [src] ; load A, don't clear carry
STA dest   [dest] := A
; 3 logical ops
(op) src   A := A (op) [src] ; (op) is one of: XOR, OR, AND.
; 2 arithmetic
ADC src    A := A + [src]
SBC src    A := A - [src] - C ; subtract
; 4 implicit instructions that don't use the 12 bit address:
ROR        A,C := A,C ror 1 ; rotate right through carry
TAT        T := A ; transfer A to T
LDI        A := [A] ; load A indirect
STT        [A] := T ; store T indirect
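The instruction set above can be sketched as a small Python interpreter. The actual 4-bit opcode encodings aren't listed, so instructions are kept as (mnemonic, address) tuples; the flag behavior (in particular Z always tracking A) is my reading of the description, not a verified model of the original design.

```python
# A TOY/2 instruction-set sketch: one instruction per call to step().
MASK = 0xFFFF

def step(mem, prog, r):
    """Execute one instruction. r holds the registers P, A, T, C, Z."""
    op, adr = prog[r["P"]]
    r["P"] += 1
    if op == "JMP":
        r["P"] = mem[adr]                      # indirect jump
    elif op == "BCC" and r["C"] == 0:
        r["P"] = mem[adr]
    elif op == "BNE" and r["Z"] == 0:
        r["P"] = mem[adr]
    elif op == "LDC":
        r["A"], r["C"] = mem[adr], 0           # load and clear carry
    elif op == "LDA":
        r["A"] = mem[adr]                      # load, keep carry
    elif op == "STA":
        mem[adr] = r["A"]
    elif op == "XOR":
        r["A"] ^= mem[adr]
    elif op == "OR":
        r["A"] |= mem[adr]
    elif op == "AND":
        r["A"] &= mem[adr]
    elif op == "ADC":
        t = r["A"] + mem[adr]
        r["A"], r["C"] = t & MASK, t >> 16
    elif op == "SBC":
        t = r["A"] - mem[adr] - r["C"]
        r["A"], r["C"] = t & MASK, 1 if t < 0 else 0
    elif op == "ROR":                          # rotate right through carry
        r["A"], r["C"] = (r["A"] >> 1) | (r["C"] << 15), r["A"] & 1
    elif op == "TAT":
        r["T"] = r["A"]
    elif op == "LDI":
        r["A"] = mem[r["A"]]                   # load A indirect
    elif op == "STT":
        mem[r["A"]] = r["T"]                   # store T indirect
    r["Z"] = 1 if r["A"] == 0 else 0           # assumption: Z tracks A

# tiny demo: A := [0]; A := A + [1]
mem, r = {0: 5, 1: 7}, {"P": 0, "A": 0, "T": 0, "C": 0, "Z": 0}
prog = [("LDC", 0), ("ADC", 1)]
step(mem, prog, r)
step(mem, prog, r)
assert r["A"] == 12
```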
DAV: Given that code is in ROM, and we choose some arbitrary RAM area ``stack_start'' to hold the stack, and we choose some other arbitrary RAM location SP to hold the return stack pointer, call instructions could be implemented like this:
(Is this the most compact implementation of CALL ?).

...
; DAV: This implementation of CALL eats up
; 2 instructions + 1 word of ROM = 3 words per call.
    LDA $+2
    JMP &(banana1)

    LDA $+2
    JMP &(banana2)
...
; linker must allocate a word of ROM to hold the $+2 value, 1 per CALL.
; linker must allocate a word of ROM to hold the &(banana2) value as well,
; but only 1 per subroutine, shared among all calls of banana2().

banana1: ; non-leaf subroutine
    ; we have return address in A,
    ; and SP points to an empty location to store it.
    ; adjust SP to point to empty location
    TAT
    LDA SP
    STT
    ADC &(1) ; assembler must allocate word of ROM to hold the constant 1
    STA SP
    ...
    ; return sequence
    JMP &(non_leaf_return)

non_leaf_return: ; common to all non-leaf subroutines
    LDA SP
    ADC &(-1)
    STA SP
    LDI
    STA temp
    JMP temp

banana2: ; leaf subroutine
    ; we have return address in A,
    ; and SP points to an empty location to store it.
    TAT
    LDA SP
    STT ; leave SP full until end of routine
    ...
    ; return sequence
    JMP &(subroutine_return)

subroutine_return: ; common to all subroutines
    LDA SP
    LDI
    STA temp
    JMP temp
DAV: From the instruction set listed here, I attempted to reverse-engineer a schematic suitable for implementation in TTL. I wonder how close this matches the original design by Pascal Dornier and Stephan Paschedag ?
TOY/2 as reverse-engineered by David Cary (Warning: not yet implemented -- does this really work ?)

Every instruction takes 2 cycles:
    1 SRAM memory cycle (either reading [address] or [A], or writing T to [A]).
    1 SRAM memory cycle to read next instruction (from [PC]).

Registers are labeled with their contents in the 1st cycle; PC and A are
swapped in the last cycle of the instruction. Calculate "next PC" first:

[schematic: "calculate next PC first" datapath -- instruction register
(opcode / 12-bit address), A(PC) and PC(A) swap registers, mux into the
SRAM address port, T register and buffer on the SRAM data port, ALU
producing Z and C; original ASCII art lost in transcription]

Let's try that again. Calculate "next PC" first:
    first cycle: selecting either P := [address],
        or [A] := T and P := P + 1,
        or P := P + 1.
    second cycle: calculating A := A (op) [address], or A := [A],
        and fetching next instruction from [P].
Whoopsies, that's not going to work --
we can't fetch [A] and [P] at the same time.

Simpler to calculate A in the first cycle, so we can do A := [A] in that
first cycle, and then always calculate the next P (and simultaneously
fetch it) in the next cycle.
    first cycle: calculating A := A (op) [address], or A := [A],
        or [A] := T (and keep A the same ?)
    second cycle: selecting either P := [address], or P := P + 1,
        and fetching next instruction from [P].
Um... but won't that fetch the next instruction from the *old* P,
so that we get a 1-instruction branch delay?
Registers are labeled with their contents in the 1st cycle; PC and A are
swapped in the last cycle of the instruction. Calculate "next A" first:

[schematic: "calculate next A first" datapath -- same blocks as above,
with a transparent [address]/([PC]) register feeding the ALU; original
ASCII art lost in transcription]

Control line summary:
0: instruction register : opcode : next value always latched at end of last cycle.
0: instruction register : address : next value always latched at end of last cycle,
   never used during last cycle.
   (Perhaps share with some other register only used during last cycle ?)
0: [address] transparent register: always transparent during 1st cycle,
   latched at end of 1st cycle, and held during last cycle.
0: A(PC) next value always latched at end of every cycle
0: PC(A) next value always latched at end of every cycle
2: Mux selection: depends on opcode and C and Z during 1st cycle;
   always comes from PC on last cycle.
2: T: output enabled only on STT ( [A] := T ) instruction;
   latched only during TAT instruction.
1: SRAM: last cycle always a read; 1st cycle may be read or write.
4: ALU function select (is this really all the control signals needed ?)
0: Z: zero flag always set or cleared during last cycle to reflect current state of A.
1: C: new value of carry flag only latched on LDC, ADC, SBC, ROR instructions.
--
10 bits ? Is this all ?

Notes:
1. Only 1 instruction loads T: the TAT instruction.
   There are several ways to implement this:
   * the output of A(PC) (during the 1st cycle)
   * the output of PC(A) (during the last cycle), or
   * the output of the ALU (during the last cycle),
   whichever is easiest.
2. The register to hold [address] must be transparent, to implement JMP
   and BCC and BNE properly.
   (... Much later: I don't think we need a latch here. ...)
3. There's still room for 1 more instruction. What should it be ?
   Some options that require *zero* more hardware (just setting up the
   microinstruction decoder):
   * PC-relative jumps: PC := PC + [vec] (for position-independent code)
   * TPT: Transfer PC to T: T := PC (to allow position-independent code)
   * T := PC + [vec] (to allow position-independent code)
   * [A] := PC+1 (to allow position-independent CALL)

This implementation calculates next PC first (using the transparent
[address] register), then next A. (I think this ended up simpler than
calculating next A first, then next PC). (Of course, faster CPUs
calculate both at the same time, and have Harvard-style fetches from
data cache simultaneous with fetches from instruction cache).
In any CPU architecture that has the data and the program in 1 unified
memory, clearly we must have 2 memory cycles/instruction to do Load and
Store:
(1) first cycle: do any Load or Store required by the instruction in the
    instruction register, and
(2) second cycle: load the next instruction into the instruction
    register through [P].

In any CPU architecture that has the data and program in 1 unified
memory, in order to do any indexed write, we need at least 4 registers:
    Instruction register (to remember that we are doing an indexed write)
    PC (to remember what instruction to do next)
    address (where to write)
    data (what to write)
(Mark's TTL CPU has fewer than that number of registers, so it cannot do
an indexed write). (Well, we could do it with only 3 registers, if we
allow self-modifying code to write the index into the location that will
be read into the instruction register, and make the instruction register
wide enough to hold that address).

If we also try to re-use the ALU to increment the PC every time (rather
than using an incrementing register for PC), then we must calculate the
next PC every instruction. If we calculate the next PC during the second
cycle, then we are loading the next instruction from the old PC --
creating a delay slot of 1 instruction.

If we restrict ourselves to exactly 2 memory cycles per instruction, and
only 1 ALU, and we try to avoid that delay slot, then during jumps we
must be calculating the next PC on the first cycle, so we can latch it
at the end of that cycle and use [PC] to load the next instruction. So
the question is: on non-jumps, do we use the ALU to increment PC on the
first cycle, so that using the ALU to calculate the next A must happen
on the second cycle? Or do we always increment PC on the second cycle,
and always use the ALU for other calculations only on the first cycle?
Branches (conditional and unconditional): prefer loading the new PC on
the first cycle, to avoid the delay slot.
Let's try always updating PC on the first cycle. This makes the [A] instructions ( A := [A] )( [A] := T ) a bit tricky, since they *must* execute on the first cycle (the SRAM is busy fetching the next instruction on the second cycle). Since the ALU is busy calculating the next value of PC on the first cycle, but the next value of A often depends on a value being fetched from SRAM in the first cycle, clearly the value from SRAM must be latched at the end of the first cycle in order to be used in the second cycle. To avoid that latch means we must calculate the next value of A on the first cycle, while the data required is still streaming out of the SRAM. Still, branches (conditional and unconditional) must also update PC on the first cycle, in order to load the next instruction on the second cycle.
Yet another implementation of Toy/2 from DAV: dedicated P and A registers.

[schematic: datapath with separate PC and A registers feeding a mux into
the SRAM address port, T register and buffer on the SRAM data port, ALU
producing Z and C; original ASCII art lost in transcription]

1. Only 1 instruction loads T: the TAT instruction.
   There are several ways to implement this:
   * the output of A
   * the output of the ALU (during the first cycle), while selecting A
   * the output of some mux, while selecting A
   whichever is easiest.
The Mano machine http://en.wikipedia.org/wiki/Mano_machine is similar to the PDP-8 and (more distantly) the TOY/2. The data bus is 16 bits. All instructions, loads, and stores are one 16-bit word long. Memory-referencing instructions contain 4 bits of opcode and 12 address bits. The programmer's model has only 2 registers and 2 bits: one 16-bit accumulator; one (12 bit?) program counter; the carry bit; and a "halt" bit.
"Microcoded Versus Hard-wired Control: A comparison of two methods for implementing the control logic for a simple CPU" article by Phil Koopman BYTE Magazine January 1987 http://www.skidmore.edu/~pvonk/cs318/calendars/Documents/hard-wired.pdf describes the Toy CPU.
The schematic diagram of the PISC-1a can be downloaded here. DAV: While the paper notes ``no conditional branch microinstruction -- an important need'', I agree that conditionals are important, but "machines do not need conditional branches, they only need conditional subroutine return instructions." -- Philip J. Koopman, Jr. #conditional_return

The PISC is a processor constructed from discrete TTL logic, which illustrates the operation of both hardwired and microcoded CPUs. ... simple hardware modifications demonstrate interrupts, memory segmentation, microsequencers, parallelism, and pipelining. A standalone PISC board should be an economical and effective tool for teaching processor design.
... Pathetic Instruction Set Computer ... Requiring only 22 standard TTL chips (excluding memory), it is well within the ability of a student to construct and understand. Its writeable microprogram store uses inexpensive EPROM and RAM. ...
Microprogram store and main program store are one and the same. Indeed, the PISC has characteristics of both a hardwired CPU and a microcoded CPU.
... Many weaknesses of the PISC become evident after a short period of use ... It can be argued that the PISC is a valuable educational tool because these faults, and several potential solutions, are painfully obvious. ...
... 2100 gates ...
Subject: Re: Home-made CPUs
Date: 20 Nov 1998 00:00:00 GMT
From: jones@cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879)
Organization: The University of Iowa
Newsgroups: sci.electronics.design

From article <365533b8.334246@wingate>,
by t.dorrington@dial.pipex.com (Tim Dorrington):

> Has anyone out there managed to design/build their own very simplistic
> CPU from basic logic chips, RAM and ROM?  I know that this is
> obviously a huge project which has no real practical use, but I just
> wondered if anyone has done it for the challenge?

It's been done, and it's not that huge a project.  Reasonable CPU
designs take about 50 SSI and MSI TTL chips.  See the Ultimate RISC and
the Minimal CISC architectures indexed on
http://www.cs.uiowa.edu/~jones/arch/.

A fair number of people have built that particular RISC.  Of course, I
just wrote the specs for it, you've got to figure out how to reduce it
to a chip level design.  It's a fun exercise!

                Doug Jones
                jones@cs.uiowa.edu
Mountain View Press: The Forth Source http://www.TheForthSource.com/ Sells
WISC CPU/16 - A patented hardware design of a 16-bit stack machine with a user writable instruction set. Available in several forms: Complete schematics and documentation with which you can build your own ($50.00); Kits with all of the parts to wire wrap your own ($500.00); bare PC board which you can stuff with your own chips ($250.00); and Assembled and tested system ($750.00). Design uses a PC/AT as an I/O server. FAST!
and also sells lots of educational materials for learning the Forth programming language.
(FIXME: move to http://c2.com/cgi/wiki?WritableInstructionSetComputer ? )
-- _Stack Computers: the new wave_ book by Philip J. Koopman, Jr. 1989
http://www.cs.cmu.edu/~koopman/stack_computers/

... one major design tradeoff ... hardwired control and microcoded control. ...
An introduction to the concepts ... may be found in (Koopman 1987a).
Hardwired designs traditionally allow faster and more space efficient implementations to be made. ...
... a microcoded machine can use fewer bits to specify the same possible instructions, ...
... microcoded implementations are more convenient to implement in discrete component designs, so they predominate in board-level implementations. Most single-chip implementations are hardwired.
... Koopman, P. (1987a) Microcoded versus hard-wired control. Byte, January 1987, 12(1) 235-242
DAV: the ``Mark's 8 chip uP'' at the venturalink site looks very clever. I am very impressed. It is far more compact than any other TTL/PAL based CPU I've seen. (It is even smaller than most ``single-board computers'' that use monolithic CPU chips). It makes me want to design a CPU again.
From: Bill Buzbee (bill at buzbees.com)
Subject: Re: building a 8 or 16 bit cpu out of TTL parts
Newsgroups: comp.arch.hobbyist
Date: 2002-10-30 14:30:45 PST

"john dobbs" wrote...
> Anyone ever thought of building a 8 or 16bit cpu using traditional TTL
> chips (NANDs, ORs, Flipflops etc)  I could see how it would be a useful
> learning device, or for the hardcore guys using transistors/diodes and
> no IC's period.

I agree with the other posters that there's no *rational* reason to do
this using TTL - FPGA is the way to go these days.  However, if you're
feeling irrational, it can be a fun experience.  Here's a link to a
recent simple CPU project you should check out:

http://www.venturalink.net/~jamesc/ttl/

Also, I found the following books very useful when trying to come up to
speed on the subject:

Digital Computer Electronics, by Albert Paul Malvino, McGraw-Hill, 1977
Understanding Digital Computers, by Forrest Mims III, Radio Shack,
2nd Edition, 1987.

Good luck,
..Bill Buzbee
``... building my own computer from scratch. By "scratch", I mean designing my own instruction set, wire-wrapping a CPU out of a pile of 74 series TTL devices and writing (or porting) my own assembler, compiler, linker, text editor and very rudimentary operating system.
... User and supervisor modes exist, along with hardware address translation ...
... I really want to better understand hardware and operating systems. ... pushing ... the limits of my hobbyist abilities, ... support for preemptive multitasking and paging to enable me to support a "real" (though, I hope, greatly simplified) operating system for my machine.
... wherever possible I'll trade off speed for simpler circuits. Also, as I get closer to actually having to start wrapping wires, I find myself more freely trading off speed for fewer connections.
Compactness. One of my pet peeves is how bloated modern software is. I think there's a lot you can do with 128K bytes of addressing, and I like the idea of keeping things compact and utilitarian. I've tried to construct an expressively dense set of 1-byte opcodes.
At the end of the day, I'd like to have a working, and useful, machine that I understand completely. Oh, and it's also got to have a real front panel with lots and lots of cool blinky lights. ''
The ``links'' page points to many other web pages and books that deal with: small processor design and implementation, digital logic, retargeting C compilers and Forth compilers. [FIXME: does that make this section redundant ?]
[FIXME: finish reading] ... relatively inexpensive project (<$100 including power supply, ICs, breadboards and assorted hardware) ...
General Features:
    All relays are the identical part (Four-Pole-Double-Throw, 12 Volts)
    415 Relays
    111 Switches
    350 LEDs
    Max Power Consumption: Estimated 12 Amps @ 13.5 Volts (160 Watts)

Features:
    8 general purpose registers (of 8 bits each)
    instruction register (8 bits)
    program counter (16 bits)
    2 additional registers (16 bits each)
    7 bit ALU (operations: AND, OR, XOR, NOT, SHL, ADD, INC)
    16 bit increment unit
    32 Kbytes of main memory (implemented using one CMOS chip -- static RAM)

Each instruction is 8 bits long. Some instructions (JUMP, CALL, BRANCH,
SET-16) are followed by 16 additional bits of immediate data.
The instruction set:
    MOVE       register to register
    CLEAR      set register to zero
    LOAD       1 byte from memory to register
    STORE      1 byte from register to memory
    AND        8 bit, register to register
    OR         8 bit, register to register
    XOR        8 bit, register to register
    NOT        8 bit, register to register
    SHL        8 bit, register to register, shift left 1 bit
    ADD        8 bit, register to register
    INCREMENT  8 bit, register to register
    INCR-XY    16 bit, XY register
    GOTO       unconditional
    CALL       return address is stored in XY register
    RETURN     jump indirect through a register
    BRANCH     conditional (beq, bne, blt, bcy)
    SET-16     load 16-bit immediate value into register
               (The 2-byte value follows the instruction.)
    SET-8      load 5-bit immediate value into register (with sign extension)
    HALT       suspend execution

The clock rate is approximately 5 Hz. Instructions take between 8 and 24
cycles. Obviously not fast, but lights blink and it makes noise.
The general purpose registers, which each have 8 bits, are called:
A, B, C, D, M1, M2, X, and Y

In some instructions, M1 and M2 are combined to form a 16-bit register, called M. Likewise, in some instructions, X and Y are combined to form a 16-bit register, called XY.
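The pairing is plain byte concatenation. A small sketch in Python; which half is the most significant byte (here M1 and X as high bytes) is my assumption, since the source doesn't say.

```python
# Combining two 8-bit registers into one 16-bit register (e.g. M1:M2 -> M).
def pair(hi, lo):
    """Concatenate two 8-bit register values into a 16-bit value."""
    return ((hi & 0xFF) << 8) | (lo & 0xFF)

def unpair(v):
    """Split a 16-bit value back into (hi, lo) 8-bit halves."""
    return (v >> 8) & 0xFF, v & 0xFF

pair(0x12, 0x34)   # 0x1234
```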
The program counter (PC) is 16 bits.

There is a 16-bit increment unit, which adds 1, and there is a 16-bit register called INC dedicated to the increment unit. This register is not visible in the programmer's model and is used only (1) to increment the PC during each instruction, and (2) in the INCR-16 instruction.

The instruction register (INST) is 8 bits and holds the instruction being executed.
There is a 16-bit register, called J, which is used during the CALL, JUMP, and GOTO instructions. These instructions load the register from the immediate 16-bit value following the instruction, and then to complete the transfer of control, the J register is copied to the PC register.
...
The LOAD and STORE Instructions
===============================

The memory is organized as 32K bytes, addressed from hex 0000 through 7FFF.
The LOAD instruction uses the 16-bit value in the M register as an address and reads a byte from Main Memory into either the A, B, C, or D registers. Likewise, the STORE instruction assumes the address is in the M register and stores the value from either A, B, C, or D into a byte in Main Memory.
...
The CALL Instruction
====================

The CALL instruction is exactly like the GOTO instruction, except that, in addition, the address of the next instruction after the CALL instruction is saved in the XY register. This instruction is encoded as:
    1 1 1 0 0 1 1 1
    a a a a a a a a
    a a a a a a a a

where the field "aaaaaaaa aaaaaaaa" is the address of the subroutine.
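As a sketch, encoding and decoding that 3-byte instruction in Python: the fixed opcode byte 1110 0111 followed by the 16 address bits. Big-endian byte order for the address is my assumption; the source only gives the bit layout.

```python
# Encode/decode the relay computer's CALL instruction as described above.
CALL_OPCODE = 0b11100111   # the fixed first byte, 0xE7

def encode_call(addr):
    """Build the 3-byte CALL: opcode byte, then 16 address bits (hi, lo)."""
    return bytes([CALL_OPCODE, (addr >> 8) & 0xFF, addr & 0xFF])

def decode_call(insn):
    """Recover the subroutine address from an encoded CALL."""
    assert insn[0] == CALL_OPCODE
    return (insn[1] << 8) | insn[2]

decode_call(encode_call(0x1234))   # 0x1234
```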
The RETURN Instruction
======================

The RETURN instruction moves the contents of XY back to the PC. This can be used to effect a RETURN (when used after a CALL) or an indirect jump, when XY contains a computed value.
This instruction is slightly more flexible and is actually a 16-bit move instruction that can perform any one of the following moves:
    PC = XY
    PC = M
    PC = J
    XY = M
    XY = J
    XY = 0

...

Timeline, History, and General Comments
=======================================

I teach CS and I have always loved computers and been interested in how they work. Over the years I have found that you can read and study and make paper designs, but there is no substitute for actually building things. There are always processes at work that you can't understand until you actually build a working unit. I had wanted to build a computer out of relays when I was much younger, but I didn't have the knowledge, patience, money, or time.
...
Final exams are a panic for students, but they are really different for the professors. It is the only time that there are no lectures to plan and no exams to grade. As I sat there during a final one term, I sketched out the outline of an architecture that I might use. So I designed the ALU to "fit" into a complete machine. However, when I decided to build the ALU, I still wasn't committed to building anything more. It is important to stay focussed on a project that you can complete. It is easy (and fun) to fantasize about building something big, but unless one is able to concentrate long enough to get something completed, it is really just idle day-dreaming!
Once I finished the ALU, I decided to go ahead and build the second cabinet, the register unit. However, in my mind, I had not committed to building anything beyond that. I figured that if I finished it and it worked, I would then decide whether I wanted to proceed. I knew there was a possibility I might be burned out, or might run into problems that would make a full computer non-functional. I always want successes, not failures, so I decided I would rather have a fully-working half-computer than a half-finished full-computer.
After finishing the ALU and the register unit, I found I enjoyed all aspects of the construction. Usually design is the most fun and, when building stuff, we tend to leave the most unpleasant or difficult tasks until the very end, but after completing 2 cabinets, I knew pretty much what would be involved. At that time I decided to keep going.
...
Costs
=====

415 Relays, 4PRLY-12, plus extras for prototyping ($3.40 each)
350 LEDs ($1.49 each)
111 Switches, SPDT, on-on mini toggle, MTS-4 ($0.90 each)
Acrylic boards, pre-cut and pre-drilled, 4 cabinets
    (total $1,095.37, approx $275 per cabinet)
    ALU (1 board), $82.19
    Register Cabinet (10 boards), $251.60 total
    PGM-CTRL Cabinet (10 boards), $381.80 total
    Sequencer Cabinet (5 boards), $379.78 total
4 Mahogany Cabinets (est $300 each)
2 Power Supplies, 12V, 10 Amp ($79.99 each)
20 Capacitors, 100uF, 100V non-polar ($1.55 each)
Memory Board:
    1 SRAM, 32K dip, Jameco # 82472CA, Other #: 62256LP-70 ($5.49)
    1 eight-channel FET driver module, NCD, www.controlanything.com, IOTEST-L ($49.00)
    3 eight-channel LED array module, NCD, www.controlanything.com, 8-FET ($10 each)
    1 Prototyping board ($5.99)
    1 small power supply (for memory) 5V, 4 Amp, 20 Watt, Jameco #: 213583CA ($26.95)
DB-9 Sub-miniature Connectors:
    32 Plugs ($1.59 each)
    32 Receptacles ($1.76 each)
    64 Connector hoods ($0.49 each)
22-Gauge black solid copper wire, 100 feet ($4.49 each), est. 20 rolls
8-connector cable, CAT5e, 4 pairs, 24AWG solid, 328 feet ($42.00)
4 Fans ($10 each)
Misc Hardware (est $100)

Summary:
    Relays,          $1,411
    LEDs,            $521
    Switches,        $100
    Acrylic Boards,  $1,095
    Cabinets,        $1,200
    Power Supplies,  $160
    SRAM Memory,     $117
    Capacitors,      $148
    Connectors,      $138
    Wire,            $132
    Fans,            $20
    Hardware,        $100
    Grand Total,     $5,142 (per unit: $1,285)
Cellular automata are related to other interests I have:
[FIXME: CA stuff scattered elsewhere ...]
The cellular automata questions I'm most interested in are:
There's a little loop here -- first we use (simulated) cellular automata to learn about replication, then we use that information to design replicating tools. We also use (simulated) cellular automata to explore good rules and good patterns for computronium. Then we use those replicating tools to (build enough copies of themselves to) build computronium. Then we use *that* computronium (hardware cellular automata) as a better computer.
"Excellent paper (several chapters of a forthcoming book, actually), if you are interested in cellular automata, reversible computation, hardware implementations and CAM modelling, you should check it out. ... Caution, this expands into 17381429 Bytes of PostScript." -- recc. Eugene Leitl <eugene@liposome.genebee.msu.su>
DAV: I agree. This paper has good stuff about the interaction (co-evolution) of hardware and software, efficient realization of computational architectures, some of the background thinking behind the massively parallel CAM-8 processor, and likely directions of future computing devices.
the theory behind the CAM-8 http://www.im.lcs.mit.edu/cam8.html a dedicated cellular automata processor. http://www.im.lcs.mit.edu/broch/med1.html mentions that it is fast at generating the data needed for holograms.
Small size is not everything - Q*Bert-based systems will require more cells for some tasks than the equivalent when expressed in (say) the Margolus neighbourhood. However for the majority of applications which involve performing computations, the additional richness provided by a larger LUT seems likely to be wasted much of the time, doing the equivalent of transporting signals from one point to another. We believe the advantage of the smaller cells in terms of their smaller size and higher update frequency will commonly win out.
Finally the Q*Bert neighbourhood is essentially hexagonal - and there is a general additional efficiency of hexagonal packing schemes compared to rectangular ones - a matter I will not go into further here.
[89] Toffoli, T. and N. Margolus, _Cellular Automata Machines -- a new environment for modeling_, MIT Press (1987)
DAV: Which of the 2 CA rules detailed in the _Crystalline Computation_ paper is more "interesting" (can build more compact logic circuits out of): "BBMCA", or "Critters" ? (David Cary thinks they are both "more interesting" than the Conway rules) More importantly, since hydrodynamics required a triangular grid, what rules are most "interesting" on triangular grids ? (Perhaps that grid has rules that are "more interesting" than any square-grid rules).
(a) blocking on even steps:

        aabbccddee.....
        aabbccddee.....
        ffgghhiijj.....
        ffgghhiijj.....
        kkllmmnnoo.....
        kkllmmnnoo.....
        ...
        ...zz

    blocking on odd steps:

        abbccddee.....a
        fgghhiijj.....f
        fgghhiijj.....f
        kllmmnnoo.....k
        kllmmnnoo.....k
        ...
        abbccddee...zza

(b)     OO -> OO    *O -> *O    *O -> O*
        OO    OO    OO    OO    O*    *O

        *O -> O*    O* -> **    ** -> **
        *O    O*    **    *O    **    **
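The table above can be coded directly. This is a sketch of one Critters update as I reconstruct it: 2x2 Margolus blocks, the rule extended by rotational symmetry (two-particle blocks are complemented, which moves the pair to the opposite diagonal or edge; three-particle blocks rotate 180 degrees), and even steps using blocks aligned at (0,0) and odd steps at (1,1), per (a).

```python
# Critters block rule and one blocked update step on a torus.
def critters_block(b):
    tl, tr, bl, br = b
    n = tl + tr + bl + br
    if n == 2:            # both kinds of pair change: complementing the block
        return (1 - tl, 1 - tr, 1 - bl, 1 - br)
    if n == 3:            # three particles: rotate the block 180 degrees
        return (br, bl, tr, tl)
    return b              # 0, 1, or 4 particles: unchanged

def critters_step(grid, odd):
    """One update; grid is a list of lists of 0/1 with even dimensions."""
    h, w = len(grid), len(grid[0])
    off = 1 if odd else 0             # alternate the blocking each step
    new = [row[:] for row in grid]
    for i in range(off, h + off, 2):
        for j in range(off, w + off, 2):
            i0, j0 = i % h, j % w
            i1, j1 = (i + 1) % h, (j + 1) % w
            tl, tr, bl, br = critters_block(
                (grid[i0][j0], grid[i0][j1], grid[i1][j0], grid[i1][j1]))
            new[i0][j0], new[i0][j1] = tl, tr
            new[i1][j0], new[i1][j1] = bl, br
    return new
```

Each of the 16 block states maps to a distinct state and every case conserves the number of 1's, which is easy to check by enumeration.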
Fig. 1.5 The invertible "Critters" CA.
(a) The solid and dotted blockings are used alternately.
(b) The Critters rule.
The Critters rule is ... "rotationally symmetric" ... "conserves 1's (particles), and only 3 cases change." ... "each of the 16 possible initial states of a block is turned into a distinct result state. Thus the Critters rule is invertible." ... "Unlike the gliders in Life, Critters gliders are quite robust. ... when two of these gliders collide in an empty region ... ... If nothing hits this blob for a while, we always see at least one of the gliders emerge."
(a)     OO -> OO    *O -> OO    *O -> O*
        OO    OO    OO    O*    O*    *O

        *O -> *O    O* -> O*    ** -> **
        *O    *O    **    **    **    **
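As I reconstruct the table above, only two canonical cases change: a lone particle crosses to the opposite corner of the block, and a diagonal pair flips to the other diagonal. A sketch of that block rule:

```python
# BBMCA block rule as reconstructed from the table above (rotational
# symmetry extends the two changing cases to all orientations).
def bbmca_block(b):
    tl, tr, bl, br = b
    n = tl + tr + bl + br
    if n == 1:                    # lone particle moves to the opposite corner
        return (br, bl, tr, tl)
    if n == 2 and tl == br:       # diagonal pair flips to the other diagonal
        return (tr, tl, br, bl)
    return b                      # everything else is unchanged

bbmca_block((1, 0, 0, 0))   # (0, 0, 0, 1): the particle crossed the block
```

This rule also maps each of the 16 block states to a distinct state, so it is invertible, and every case conserves particles.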
Fig. 1.8 An invertible Billiard Ball Model CA.
(a) The BBMCA rule
...
"rotationally symmetric"
"conserves 1's (particles), and only 2 cases change."
"we can recover macroscopic 2D hydrodynamics from a model that is only slightly more complicated than the HPP gas. A single-speed model with 6 particles per site, moving in 6 directions on a triangular lattice, will do. If all zero-net-momentum collisions cause the molecules at the collision site to scatter into a rotated configuration, and otherwise the particles go straight, then in the low speed limit we recover isotropic macroscopic fluid dynamics."
[100] Yakhot, V. and S. Orszag, "Reynolds number scaling of cellular-automaton hydrodynamics", _Phys. Rev. Lett._ (1986), 1691-1693.
It has been shown that self-replicators can be designed in cellular automata #cellular_automata . Some of this deals with ``real'', physical replication, while other parts -- cellular automata, quines, etc. -- deal only with patterns in a computer. Should I separate them ? But some of the theory applies to both.
related to reconfigurable robots robot_links.html#reconfigurable and robot construction (humans building robots) robot_links.html#construction and tool closure 3d_design.html#closure . and the bootstrap problem [FIXME:] [FIXME: cross-link all the self-replication stuff on my web pages. nano, robot, cellular automata, etc. Point back and forth from ``replication'' section to computer architecture # cellular automata robots nanotech idea_space and unknowns ]
The ``Game of Life'' http://www.astro.virginia.edu/~eww6n/life/ is the best-known cellular automaton, invented by John Conway and popularized by Martin Gardner. Lots and lots of interesting properties and pretty animations have been collected for this cellular automaton. But David Cary doubts that this is really the most interesting ("computationally compact") cellular automaton; other cellular automata are likely to have even more pretty animations and properties.

"replicator - a Life object which repeatedly forms copies of itself. Such things are known to be possible in Life, but no example is known. But in the HighLife variant, there is a simple replicator."

"HighLife - an alternate set of rules similar to Conway's, but with the additional rule that 6 neighbors generates a birth. Most of the interest in this variant is due to the replicator that evolves from:
    ***.
    ...*
    ...*
    ...*

-- http://www.cs.jhu.edu/~callahan/glossary.html

"unit Life cell: a pattern with two states, which is determined by its previous state and the previous state of its neighbors, using exactly the rules used to compute it; that is, it simulates its own universe. None have been constructed in Conway's Life yet."

When I (DAV) talk about one entity replicating, I mean that we end up with 2 (or more) practically identical copies of the original, in every way (in particular, size). A ``unit Life cell'' seems to me to be somehow related to the so-called ``replicating-tile'' http://www.geocities.com/alclarke0/PolyPages/Reptiles.htm (which creates identical copies, but scaled much larger).
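For experimenting with HighLife patterns like the seed above, one generation of the rule (birth on 3 or 6 neighbours, survival on 2 or 3) can be sketched on a small torus; the set-of-coordinates representation here is my choice, not anything from the source.

```python
# One generation of HighLife (B36/S23) on a w x h torus.
from collections import Counter

def highlife_step(cells, w, h):
    """cells is a set of live (x, y) coordinates; returns the next set."""
    counts = Counter(((x + dx) % w, (y + dy) % h)
                     for (x, y) in cells
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    return {c for c, n in counts.items()
            if (c in cells and n in (2, 3))        # survival: 2 or 3
            or (c not in cells and n in (3, 6))}   # birth: 3 or 6

blinker = {(1, 0), (1, 1), (1, 2)}
highlife_step(blinker, 8, 8)   # {(0, 1), (1, 1), (2, 1)}, just as in Life
```

The only difference from Conway's rule is the extra birth-on-6 clause, which is what makes the simple replicator possible.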
"Looking (and dreaming) toward the future, one can imagine nano-scale (bioware) systems becoming a reality, which will be endowed with evolutionary, reproductive, regenerative, and learning capabilities." http://lslwww.epfl.ch/pages/tutorials/poe/home.html
-- ``Beyond computation: a talk with Rodney Brooks'' 2002
http://www.kurzweilai.net/meme/frame.html?main=/articles/art0475.html |
http://www.edge.org/3rd_culture/brooks_beyond/beyond_p4.html

We're also trying to build self-reproducing robots. We've been doing experiments with Fischer Technik and Lego. We're trying to build a robot out of Lego which can put together a copy of itself with Lego pieces. Obviously you need motors and some little computational units, but the big question is to determine what the fixed points in mechanical space are to create objects that can manipulate components of themselves and construct themselves. There is a deep mathematical question to get at there, and for now we're using these off-the-shelf technologies to explore that. Ultimately we expect we're going to get to some other generalized set of components which have lots and lots of ways of cooperatively being put together, and hope that we can get them to be able to manipulate themselves. You can do this computationally in simulation very easily, but in the real world the mechanical properties matter. What is that self-reflective point of mechanical systems? ...
see #simple_cpu and minimal_instruction_set.html
------------------------------
Date: Mon, 19 Jun 2000 16:27:44 +0200
From: Tim Böscke
To: <MISC>
Subject: Re: MISC

> >The _real_ minimum instruction set is one with a
> >subtract-and-branch-when-negative instruction btw.
>
> Eugene Styer mentions six 'one instruction computers' at
> <http://eagle.eku.edu/faculty/styer/oisc.html>

Well - if you look at them closely it becomes obvious that the six
machines are after all just flavours of two ideas.

1) The move machine
2) subtract, branch

The mentioned "subtract" machine is a combination of both.

However I dont really consider 1) as a one instruction machine. After
all it is just a fancy way of doing microcoding. A move to the PC is
after all not just a move, but a JMP. And thus you already have two
instructions. The same goes for the ALU stuff.
------------------------------
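The subtract-and-branch machine mentioned above fits in a few lines of Python. This sketch uses the common "subleq" flavour (branch when the result is less than or equal to zero); the exact variant meant in the post, and my halt convention, are assumptions.

```python
# One-instruction "subleq" machine: each instruction is three words
# a, b, c; it does mem[b] -= mem[a], then jumps to c if the result <= 0,
# else falls through to the next triple. Code and data share one array.
def subleq(mem, pc=0, max_steps=1000):
    for _ in range(max_steps):
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        if a < 0:                      # my halt convention: negative 'a' operand
            break
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
    return mem

# y += x, using a scratch cell T that starts at 0:
#   T -= x  (T = -x),  then  y -= T  (y += x)
prog = [9, 11, 3,     # 0: mem[11] -= mem[9],  continue at 3 either way
        11, 10, 6,    # 3: mem[10] -= mem[11], continue at 6 either way
        -1, 0, 0,     # 6: halt
        5,            # 9:  x
        7,            # 10: y
        0]            # 11: T (scratch)
subleq(prog)
prog[10]   # 12
```

Note how the unconditional fall-through is expressed: setting c to the address of the next triple makes the branch target the same whether or not the branch is taken.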
------------------------------
Date: Thu, 22 Jun 2000 00:27:00 -0600
From: Roger Ivie <rivie at teraglobal.com>
To: MISC
Subject: Re: MISC

>Well - if you look at them closely it becomes obvious that
>the six machines are after all just flavours of two ideas.
>
>1) The move machine
>2) subtract, branch
>
>The mentioned "subtract" machine is a combination of both.
>
>However I dont really consider 1) as a one instruction machine.
>After all it is just a fancy way of doing microcoding. A move
>to the PC is after all not just a move, but a JMP. And thus
>you already have two instructions. The same goes for the ALU
>stuff.

Bear in mind that the PC doesn't have to be an actual register. A good example of this is the PDP-5, which stored the PC in location 0. Executing an instruction began by fetching location 0 and using it as the address of the instruction to be executed. It was possible for a DMA device to cause a jump by writing to location 0; this was, in fact, how the front panel loaded an address into the PC.
--
Roger Ivie rivie@teraglobal.com
Not speaking for TeraGlobal Communications Corporation
------------------------------
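DAV: To make the ``subtract-and-branch-when-negative'' idea concrete, here is a minimal simulator sketch. The encoding is my own invention for illustration (it is not anyone's standard OISC format): each instruction is three words (a, b, target) meaning mem[a] := mem[a] - mem[b], then jump to target if the result went negative; a negative target halts.

```python
# Minimal simulator for a "subtract and branch when negative" (SBN)
# one-instruction machine.  Encoding (my own, for illustration):
# each instruction is 3 words (a, b, target):
#   mem[a] := mem[a] - mem[b]; jump to target if the result is negative,
#   otherwise fall through to the next instruction.
# A negative target means "halt".
def run_sbn(mem, pc=0, max_steps=10000):
    for _ in range(max_steps):
        a, b, target = mem[pc], mem[pc + 1], mem[pc + 2]
        if target < 0:
            return mem
        mem[a] -= mem[b]
        pc = target if mem[a] < 0 else pc + 3
    raise RuntimeError("step limit exceeded")

# Example: negate mem[X] into mem[Z] using only SBN.
X, Z = 9, 10
prog = [Z, Z, 3,     # Z := Z - Z = 0; result never negative, fall through
        Z, X, 6,     # Z := 0 - X = -X; taken or not, execution reaches 6
        0, 0, -1,    # halt
        7,           # mem[X]
        99]          # mem[Z], overwritten
run_sbn(prog)
assert prog[Z] == -7
```

Conditional jumps, copies, and addition can all be built up from this one instruction, at the cost of several SBN triples each.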
Steamer16: a high-performance homebrewer's microprocessor http://www3.sympatico.ca/myron.plichota/steamer1.htm by Myron Plichota <myron.plichota at sympatico.ca>. written in VHDL and prototyped on a single wire-wrapped protoboard. Has only 7 instructions. Packets of 5 instructions (5 instructions * 3 bits = 15 bits/packet) execute in 6 cycles/packet at a cycle rate of 20 MHz. Both the address and data busses are 16 bits. Unusual ``ArF'' call/return protocol.
qUark ../mirror/quark.txt a viable stack-computer with 4-bit opcodes (c) vic plichota, original concept by Myron Plichota Dec '98.
------------------------------
Date: Sat, 20 Feb 1999 12:54:34 +0300
From: "Stas Pereverzev"
To: MISC
Subject: Re: nFORTH v2.3

>Comments, folks?

You need only five instructions, not 16. They are:

ALU:
1. nand
2. shr
RAM:
3. store ( addr n -- )
4. lit ( -- n )
CONTROL:
5. ncret \ JUMP to addr in N, if carry flag isn't set in T,
         \ also drop both T and N

Also, if PC is memory variable (or can be addressed as memory variable) we can awoid "ncret" instruction: In that case we sholud use NCSTORE instead STORE:

ncstore ( addr n flag -- )
ncstore ( addr n 0 )  - same as store,
ncstore ( PC n -1 )   - same as ncret

We need only 2 bits per instruction in that case. That all folks ;-)

Stas.
------------------------------
(does this really work ???)
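DAV: As a partial sanity check of the ALU half of that claim, here is a sketch (Python standing in for hardware) showing NOT, AND, OR, XOR, and ripple-carry addition all synthesized from NAND alone. One caveat: I use a left shift for carry propagation, but the machine above only has shr; on that hardware the carry chain would have to be handled bit-serially, e.g. with a table of power-of-two constants loaded via lit.

```python
# Everything in a conventional ALU reduces to NAND (plus a shift for
# carry propagation).  16-bit words for this sketch.
MASK = 0xFFFF

def nand(a, b):
    return ~(a & b) & MASK

def bnot(a):    return nand(a, a)
def band(a, b): return bnot(nand(a, b))
def bor(a, b):  return nand(bnot(a), bnot(b))

def bxor(a, b):
    # classic 4-NAND construction of XOR
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def add(a, b):
    # sum bits = a xor b, carry bits = (a and b) shifted left;
    # repeat until no carries remain.
    while b:
        a, b = bxor(a, b), (band(a, b) << 1) & MASK
    return a

assert bnot(0x0000) == 0xFFFF
assert bxor(0b1100, 0b1010) == 0b0110
assert add(1234, 4321) == 5555
assert add(0xFFFF, 1) == 0       # wraps at 16 bits
```

So the ALU part does work; the interesting open question is how expensive the control flow and carry handling get with only shr and ncret.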
Turing Tar Pit.
Some things you might think about if you are designing a new instruction set (for a Von Neumann machine).
]
The ``JavaOS'' #JavaOS attempts to do this in software, so it doesn't need MMUs, etc.
DAV: If you're going to be implementing FORTH on top of the architecture anyway, what with its threaded code and all (lists of subroutines to call, rather than lists of call instructions to those subroutines) ... do you really need a ``call'' instruction ? Or do all the ``calls'' you need really reduce down to the ``next'' instruction at the end of a subroutine, to look up the next subroutine on a list and call it ? -- DAV
[FIXME: very muddled thinking here:]
With threaded execution, what we have is: blocks: some register RN points to the ``next item'' in the list. (RN never points to a real assembly-language opcode; it points somewhere in the middle of a list of pointers, each of which points directly to a real assembly-language opcode.) Let's say that happens to be a (leaf, assembly-language) routine. When the (leaf, assembly-language) routine is done, it does NEXT (conditional ?). If the ``next item'' on the list is a pointer to another (leaf, assembly-language) routine, NEXT simply does
PC := [[RN]]
RN++

(compare to normal return instructions, which do
PC := [SP]
SP++ )

So how does NEXT know whether this new routine is a leaf or not ? Or do we just assume it's always a leaf, and add a special "receive" opcode instruction(s) at the beginning if it happens to not be a leaf (reminiscent of the ARM procedure call standard) ? Something like this:
; function header ("receive") ; push old RN to return stack [SP] := RN SP-- ; make RN point into block of subroutine, rather than calling block RN := [PC + n] ; NEXT PC := [[RN]] RN++At the end of a (non-leaf) subroutine, the instruction at the end of the block needs to (leaf) point to code that properly does a return.
; pop old RN from return stack
SP++
RN := [SP]
; NEXT
PC := [[RN]]
RN++
Control structures (things that are not blocks): sequence, selection, iteration: [FIXME:] Perhaps it would be interesting to deal with these only at the threaded-call level, not at the assembly-language level. Then we have: selection: using conditional returns: ... perhaps add an extra header to every subroutine; to jump *after* that header implies ``unconditionally execute this routine'', but to jump right at the beginning implies ``conditionally execute this routine: only when T is not zero''. I.e., that prepended extra header drops T, does a NEXT when that old value of T is zero, otherwise falls through.
How do I loop through a million pixels without leaving a million return addresses on the stack, when my only conditional is a conditional return ? just use ``fake jump'' of call/drop ?
[/muddled thinking]
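DAV: The NEXT / receive / return mechanism above can be sketched in a few lines of Python. This is a toy model, not real FORTH: Python lists stand in for the blocks of pointers, Python functions for leaf assembly-language routines, and a (list, index) pair for register RN.

```python
# Toy direct-threaded interpreter.  A "word" is either a primitive
# (a Python function taking the data stack) or a list of words.
# rn = (list, index) plays the role of register RN above; nesting into a
# non-leaf word pushes the old rn on the return stack ("receive"), and
# running off the end of a list pops it (the return).
def run(word):
    rn, rstack, stack = (word, 0), [], []
    while True:
        lst, i = rn
        if i >= len(lst):            # end of block: "return"
            if not rstack:
                return stack
            rn = rstack.pop()
            continue
        rn = (lst, i + 1)            # NEXT: advance RN past this cell
        w = lst[i]
        if callable(w):              # leaf routine: just execute it
            w(stack)
        else:                        # non-leaf: save RN, descend into it
            rstack.append(rn)
            rn = (w, 0)

lit = lambda n: (lambda s: s.append(n))
plus = lambda s: s.append(s.pop() + s.pop())

add12 = [lit(5), lit(7), plus]       # a small "colon definition"
twice = [add12, add12, plus]         # calls add12 two times, then adds
assert run(twice) == [24]
```

Note that there is indeed no explicit "call" instruction here: calling is just what NEXT does when the next cell is a pointer to another list rather than to a primitive.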
-- Joseph H Allen http://www.dejanews.com/getdoc.xp?AN=408499073.1

Basically I would not have made a weaker CPU if it came at the expense of being able to use data structures. This means that I prefer to have some form of indexing instead of the cheaper alternatives: direct addressing only (like the original Von Neumann computer with the resulting need for self-modifying code and the execute instructions of future computers), register indirect addressing only (like the 8080), or memory indirect addressing only (like the pdp-8). I had considered the 6502 method of indexing (where the base address is fixed, or stored in a zero-page location, and the 8-bit offset is in a register), but having programmed both 6502s and 6800s, I know that fixed offsets are to be preferred to fixed base-addresses (thus the 6502 index registers are, IMHO, useless). So I know that I need to encode two fields for indexed instructions: where the base is and the value of the offset. You want the offset to be as large as possible since that limits the size of the data structures you can have, thus the current 2-bit/7-bit format. If the base address was not going to be in an actual register, the next best place is the top of stack (which is much better than in a zero-page location).
The other nice thing about indexing is that it allows you to make fast unrolled move loops:
lda 0,x
sta 0,y
lda 2,x
sta 2,y
etc.

At 8 cycles for loads and 10 cycles for stores and two bytes per transfer, this gives a move speed of 9 cycles per byte.
If I could get high memory to memory copy speed (which I think is important) with indexing, then I wouldn't have to worry wasting op-code space with any other method (I.E., special instructions or auto-increment addressing modes).
Another lesson from the 6502 (and the 8088 for that matter), is that it just really sucks if the largest datum you can manipulate is smaller than your address size. This means that the accumulator needs to be the same size as the PC -- 16-bits.
So there you go. If you have any ideas for improving it without expending too much extra hardware, I'd like to hear them.
Lots of people have written far better descriptions.
BBW    Branch Both Ways
BEW    Branch Either Way
BBBF   Branch on Bit Bucket Full
BH     Branch and Hang
BMR    Branch Multiple Registers
BOB    Branch On Bug
BPO    Branch on Power Off
BST    Backspace and Stretch Tape
CDS    Condense and Destroy System
CLBR   Clobber Register
CLBRI  Clobber Register Immediately
CM     Circulate Memory
CMFRM  Come From -- essential for truly structured programming
CPPR   Crumple Printer Paper and Rip
CRN    Convert to Roman Numerals
Screwy Things
Life is full of compromises, and microcontrollers are no exception. Every microcontroller I have ever used has something a little strange about it. Here's a list of a few things I've found that didn't quite make sense to me. Consider them a "heads-up" so you don't have to find them yourself.
...
... The process of designing the AVR instruction set included a lot of gives and takes. We have consulted C compiler experts who have given us a lot of advice on how to tune the instruction set to yield compact C-code. As an example, the compiler experts advised us to sacrifice the ADDI for a SBCI (subtract immediate with carry).
For those instructions that are missing, convenient workarounds exist. The code efficiency of the AVR should prove that we have found a good compromise between which instructions to implement, and which to omit.
[instruction set design]
DSPs are heavily optimized to do this very rapidly, often finishing in N + a few cycles.
They do this by implementing "MAC", multiply-and-accumulate, A = A + X*Y, plus various other tricks to get zero-overhead loops.
Other processors typically break this down into a "multiply" instruction and an "add" instruction.
Some very simple processors don't have a multiply instruction; they synthesize it out of repeated shifts-and-adds.
I've seen one very simple processor that didn't even have an add instruction. It synthesized it out of AND, XOR, shifts, etc.
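The shift-and-add approach looks like this (a sketch; the word width and wrap-around behavior are arbitrary choices for illustration). It costs roughly one conditional add plus two shifts per bit of the multiplier, which is the cycle count the square()/t() idea below tries to undercut.

```python
# Multiply synthesized from shifts and adds, as on processors with no
# multiply instruction: one conditional add per bit of the multiplier.
def shift_add_mul(x, y, width=16):
    acc = 0
    for _ in range(width):
        if y & 1:             # low bit of multiplier set:
            acc += x          #   add the (shifted) multiplicand
        x <<= 1               # shift multiplicand up ...
        y >>= 1               # ... and multiplier down
    return acc & ((1 << width) - 1)   # wrap to the word width

assert shift_add_mul(7, 6) == 42
assert shift_add_mul(300, 300) == (300 * 300) & 0xFFFF
```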
I wonder if there's room for something intermediate between dedicated multiply hardware and synthesizing multiply out of shifts-and-adds.
In particular, I've been wondering if a dedicated square(x) or triangular-number t(x) = x*(x+1)/2 function would take significantly less silicon area than a full multiply(x,y) multiplier.
Then A = A + X*Y can be synthesized using
A = A + X*Y is equivalent to
A = A + ( square(X+Y) - square(X) - square(Y) )/2
or
A = A + t(X+Y) - t(X) - t(Y)

... also A = square(X) is equivalent to A = 2*t(X) - X

If we assume square(x) and t(x) and ADD and SUB and SHIFTR operate in a single cycle, and even if we don't combine 2 of them in a single cycle, that gives 3 function cycles plus 4 addition cycles (plus a shift if we use square).
These 7 cycles are far less than the cycles needed to synthesize multiply out of shifts and adds ... although admittedly more than a full sized single-cycle multiplier.
The gap is reduced even further if we combine t() and ADD in a single cycle -- t()-and-ADD, analogous to multiply-and-accumulate.
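A quick brute-force check of these identities (with t(n) = n*(n+1)/2, the triangular numbers; both divisions are exact, so integer arithmetic suffices, negative operands included):

```python
# Verify:  X*Y == ( square(X+Y) - square(X) - square(Y) )/2
#          X*Y == t(X+Y) - t(X) - t(Y)
#          square(X) == 2*t(X) - X
def t(n):
    return n * (n + 1) // 2          # triangular number; always exact

def square(n):
    return n * n

def mac_square(a, x, y):             # A = A + X*Y via the square() unit
    return a + (square(x + y) - square(x) - square(y)) // 2

def mac_tri(a, x, y):                # A = A + X*Y via the t() unit
    return a + t(x + y) - t(x) - t(y)

for x in range(-50, 50):
    for y in range(-50, 50):
        assert mac_square(100, x, y) == 100 + x * y
        assert mac_tri(100, x, y) == 100 + x * y
        assert square(x) == 2 * t(x) - x
```

The mac_tri form is where the 3-function-cycle, 4-add-cycle count above comes from: one add to form X+Y, three t() lookups, and three more adds/subtracts to accumulate.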
[FIXME: I have a lot of information on subroutines and the importance of well-factored code scattered about ... should I gather it all together here, or since it's so very important, leave a brief reference at each place ?]
(see also data_compression.html#program_compression )
Calls and returns are very important. Calls obviously take much less RAM space than in-line code except for the most trivial subroutines. Few people realize that calls can also take significantly less time than in-line code, because the most heavily re-used subroutines stay in the high-speed cache. Some programming styles (FORTH) create heavily factored code, where "call" is the most frequently executed instruction at run time and typically 1/3 of the (static) instructions [see _Stack Computers_ by Philip Koopman 1989, in particular the section "7.1 the importance of fast subroutine calls" http://www.ece.cmu.edu/~koopman/stack_computers/sec7_1.html which says ``expensive procedure calls lead to poorly structured programs''. Also,
``... Good Forth programming style ... Subroutines often only consist of 5 or 10 instructions. A static frequency of approximately 50% of the instructions being subroutine calls ... especially effective in environments with limited memory capacity. It also encourages the use of machines with fast subroutine calls.'' -- _Stack Computers_ by Philip Koopman http://www.ece.cmu.edu/~koopman/stack_computers/sec3_3.html#3322 ].

Since calls occur so frequently, it is important to minimize their RAM and cycle time overhead. Koopman suggests combining them with other instructions, so every instruction combines a call, branch, or exit with some other operation (since, at the hardware level, a data operation and a program flow operation can be executed in parallel).
From a runtime perspective, every call is paired with a return. So (everything else being equal) whatever minimizes (calling time + receiving time + returning time + resume time) is best. Sometimes you can get a net improvement by making one of these slightly slower and another one faster.
From a code space perspective, each function is likely to be called more than once, so there are many more call statements than return statements (unless your subroutines have lots of exit points). So to minimize space, it's probably better to reduce (call size + resume size) even at the expense of (receive size + return size).
Ah, here it is: "Note that, in contrast to popular myths, subroutine threading [a list of consecutive function calls] is usually slower than direct threading [a list of consecutive function pointers] ." -- "Threaded Code" article by Anton Ertl. http://www.complang.tuwien.ac.at/forth/threaded-code.html
also see Koopman's taxonomy of microprocessors [FIXME: ]
_Computation Structures_ by Stephen A. Ward and Robert H. Halstead, Jr. (c)1990 by The Massachusetts Institute of Technology The MIT Press
p.606 Models of Computation The von Neumann model of computation ... alternative models ... include:
The above list is neither exhaustive nor even representative...
... p.612 Taxonomy of Multiprocessors ... Michael Flynn's taxonomy ...
Ward and Halstead
quotes from _Computation Structures_ by Stephen A. Ward and Robert H. Halstead, Jr. (c)1990 by The Massachusetts Institute of Technology The MIT Press
... p. 347 Programming languages offer several /classes/ (allocation disciplines) for storage. These differ ... in the degree to which responsibility for allocation and deallocation of storage is assumed implicitly by the system (as opposed to being left to the programmer).
... common storage classes:
... p.354 the /general-register/ organization, in which ... active registers are available for explicit use by programs to hold frequently accessed data. ...
Proponents of the general-register organization cite two relatively independent arguments in its favor:
... objections to the general-register scheme:
* Lack of implementation transparency. The number of general registers is effectively "frozen" in the machine-language specification; a higher-performance implementation, offering a greater number of registers, will require modification of the machine language and hence of the software designed for it.
* Programming difficulty. The need to allocate registers to variables complicates the programming task.
... There are alternatives to the general-register organization that offer, at some additional implementation cost, its performance and compactness advantages while avoiding the above objections. ... cache-memory ... ... tricks for address encoding ... [DAV: is it really possible to have a "no programmer-visible registers" architecture ?]
... p.365 Pascal and Ada are also concerned with /safety/ in the sense that they strive to make as many programming errors as possible detectable at compile time rather than run time.
[FIXME: gather other NN information scattered over my other pages here.]
[FIXME: move to #simple_cpu
... requires 19 control signals, most of them directly mapped from the 16 bit instruction word from memory. sometimes you could make it do 2 completely different things at once by designing the instruction bits ... Very simple ... even addition is broken into even smaller instructions.
"statistical addition": Addition of TOS and NOS is synthesized by simultaneously calculating
TOSnext = (TOS and NOS)*2; // carry bits
NOSnext = TOS xor NOS;     // sum bits

then repeating until 0==TOS, and the sum is in NOS. ...
DAV: Say we have a long column of numbers to add (in particular, trying to synthesize MAC multiply-and-accumulate). Without changing this architecture: Rather than running every sum to completion, perhaps quicker to combine 3 or 4 or 7 values at a time into 2 partial-sum values:
carry_bits = 2*majority( a b c ); sum_bits = a xor b xor c;
Is there any minor change to this architecture that would speed up MAC ?
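DAV: A sketch of both tricks -- the iterated carry/sum addition above, and the majority-function 3-into-2 compression (in hardware terms, a carry-save adder row). This assumes unbounded Python integers rather than a fixed word width:

```python
# "Statistical addition": iterate (carry, sum) until the carries die out.
def add_iterative(tos, nos):
    while tos:
        tos, nos = (tos & nos) << 1, tos ^ nos   # carry bits, sum bits
    return nos

# 3:2 compressor: combine three addends into two partial-sum values
# in a single step, deferring full carry propagation until the end.
def compress_3_to_2(a, b, c):
    majority = (a & b) | (b & c) | (a & c)
    return majority << 1, a ^ b ^ c              # (carry bits, sum bits)

assert add_iterative(1234, 4321) == 5555
carry, s = compress_3_to_2(10, 20, 30)
assert add_iterative(carry, s) == 60
```

Accumulating a long column this way -- compressing addends into two partial sums and only running the carry iteration to completion once at the end -- is exactly the structure inside a hardware multiplier's adder tree.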
DAV: If the CPU synthesizes addition anyway, out of logical ops and bit-shifting, does it make any sense to choose some *other* counting sequence, different from natural binary? Perhaps grey codes? LFSR sequences?
http://www.dejanews.com/getdoc.xp?AN=411378028 says:
From: mj@isy.liu.se (Michael Josefsson)
Subject: How to do the smallest processor ever?
Date: 13 Nov 1998 00:00:00 GMT
Newsgroups: comp.arch

Hi all, while everybody seem to make everything as parallell as possible (to get the speed up), I tinker with the idea of making the easiest possible processor. Wherever I can, I want to get away from hardware. So these are some ideas that I have come up with (discussions baits perhaps):

1. Make it serial(!). Yes, it slows down things but OTOH there is only 1 bit busses. Adding 2 numbers (8-bit) would require a great number of clocks in multiples of eight. Slow? Yes! But feasible.

2. The original idea used only one 1 bit bus but I have come to the conclusion (opposite the reason to do this in the first place) that it is far more easy to implement something using 2 or 3 buses.

3. Minimise the number of registres. Yes a PC, IR and AR (address register) would be hard to get by without, but otherwise it would suffice to have only one other register - the accumulator. Incrementing PC is done with the ALU.

4. External memory is always parallell so I imagine that there would be 8 bit busses to the outside.

5. Speed would be slow with some 10-15 clocks to get the smallest operation done. Given a TTL implementation the total number of mips, at 10 MHz say, would be around 0.7 - not much, but then again what would one expect from a handful of TTLs?

[ASCII schematic, garbled in transcription: 1-bit buses everywhere; an 8-bit shifting accumulator clocked bit-by-bit, a 1-bit ALU, IR/opcode decode, program counter, and address register (PC and IR possibly parallel), with parallel-to-serial and serial-to-parallel converters between the 1-bit core and the 8-bit external memory buses.]

Has this been done anywhere? The focus seems to get everything as fast as possible and not as small as possible.

Cheers, /Micke Josefsson
University of Linkoping
Sweden
One reply http://www.dejanews.com/getdoc.xp?AN=411507856.1
From: Gary Boone <gary@micromethods.com>
Subject: Re: How to do the smallest processor ever?
Date: 13 Nov 1998 00:00:00 GMT
Organization: Micro Methods
Newsgroups: comp.arch

Regarding minimal processor history aspects:

1. See "Computer Structures ..." Siewiorek, Bell & Newell, 1982, beginning at page 581, for a description of TMS1000 microcontrollers, an increased function (and cost reduced) re-design based on TMS0100 architecture that I worked on.

2. In 1971 "state of the art" 10-micron 1-metal PMOS, TMS0100 4-bit data paths were used, not for performance, rather to allow less chip area (and single-chip design) compared to then-conventional 1-bit serial data paths. Consider, for example, area savings due to 3-transistor RAM compared to conventional 6-transistor shift register memory. Control decode for 4-bit data paths also saved (no "bit" timing).

3. The "most minimal" processor I know of was made at Litronix circa 1976, for use in digital watches. As I recall, this processor could only count, not even add. The architect was Steve McCrystal. I've lost track of my copy of his amazing instruction set. Steve, if you see this, please post it!

Gary Boone
The Motorola MC14500B http://www.microprocessor.sscc.ru/great/s2.html#MC14500B http://www.questlink.com/keyword.htm?proc=keyword&view=list&kwrd=14500 http://www.questlink.com/page.htm?proc=brief&part=MC14500B&vendor=11205&brief=22059&view=153540 had only 16 pins; processed everything serially.
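DAV: The heart of such a bit-serial machine is just a full adder plus a carry flip-flop, consuming one bit of each operand per clock, least-significant bit first -- hence Josefsson's "multiples of eight" clocks for an 8-bit add. A sketch, with each loop pass standing in for one clock:

```python
# Bit-serial addition: one full adder and a 1-bit carry register,
# clocked once per bit, least-significant bit first.
def serial_add(a, b, width=8):
    carry, result = 0, 0
    for k in range(width):                   # one loop pass = one clock
        abit, bbit = (a >> k) & 1, (b >> k) & 1
        s = abit ^ bbit ^ carry              # full-adder sum bit
        carry = (abit & bbit) | (carry & (abit ^ bbit))
        result |= s << k
    return result

assert serial_add(100, 55) == 155
assert serial_add(0xFF, 1) == 0      # wraps at the 8-bit word width
```

The hardware cost is tiny (one full adder, one flip-flop, and the shift registers); the price is the factor-of-width slowdown Josefsson describes.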
The Museum of HP Calculators http://www.hpmuseum.org/ mentions some zero-IC calculators (two types: slide rules and discrete transistors).
(see Minimum Instruction Set Computing (MISC) for more stack machines)
see also data_compression.html#space-optimizing_compiler and data_compression.html#compact_instruction_set for more thoughts on compact, dense, non-bloated code.
... ideally suited to embedded real-time control applications. ..." -- p. 601, _Computer Architecture: A Designer's Text Based on a Generic RISC_, book by Feldman and Retter
Includes "a study of FORTH instruction frequencies". Includes a analysis of the effect of increasing or decreasing return stack and/or data stack (how to handle stack overflow / stack underflow). Includes "the minimum" set of 17 operators for the Canonical Stack Machine [FIXME:] .
[FIXME: is the ``propagation of initial biases'' something that is universal enough to include under ``general design'' ? (see also 9.6 The impact of stack machines on computing http://www.ece.cmu.edu/~koopman/stack_computers/sec9_6.html ... I thought I saw Koopman write about this topic in another document ... reminiscent of a rumor I've heard about trains in the U.S. ... ) Koopman's solution to this problem, ``looking at the hardware/software problem as a whole.'', is reminiscent of Buckminster Fuller's ``start with the universe''. But I've also seen this perverted into ``everything must change at once'' computer_architecture.html#everything_must_change . ... the ``cascade of benefits'' is reminiscent of Fuller's ``synergy'' ]

`` ... Software that is able to customize the hardware to meet critical application-specific processing requirements will be able to attempt more difficult tasks on less expensive hardware.
... efficient hardware support for procedure calls, ... ability to tailor hardware to applications based on software requirements. ...
... propagation of initial biases ...
... CPU speeds have outstripped bulk memory speeds ... There are 2 ways to solve this problem:
- speed up average memory access time, and
- increase the amount of work done per memory access.
Cache memory decreases average memory access time ... Other techniques to speed memory access include interleaving banks of memory and pre-fetching opcodes ... methods tend to increase speed ... at the cost of added hardware complexity. Separate data and program memories ...
... increasing the average amount of work done by each opcode fetched from memory... has led to the development of what is now called the Complex Instruction Set Computer (CISC) ...
Truly interactive processing (which does _not_ mean doing batch-oriented edit-compile-link-execute-crash-debug cycles from a terminal) is ... not taught ... in universities.
... there is still considerably more mileage to be gained from uniprocessors by breaking out of past cycles and looking at the hardware/software problem as a whole. The answer lies not with a new hardware architecture that mirrors current software, nor in changing software to suit current hardware. The answer lies in a redefinition of how we think about hardware and software. ...
The first step ... take to heart the philosophy of breaking up a problem into smaller sub-problems. Instead of building a computer that supports procedure calls as special operations, what if we design a computer to expect subroutine calls as its primary mode of operation ?
... Programs are now organized as a tree structure... In such a machine, the very notion of a program "counter" becomes obsolete.
If this machine could actually process procedure calls simultaneously with other operations, modularity in programs would not be penalized. Such a machine would encourage better software design, and could fundamentally alter the way programmers think about programs.
... A WISC machine has high-speed procedure processing capability along with the capability to redefine the instruction set. ... an interesting and somewhat unexpected cascade of benefits is realized. ...
... makes the hardware simpler, faster, and less expensive to develop and manufacture. ...
procedure calls may be made a zero cost in execution time ... a stack-oriented opcode need only take roughly one-quarter of a 32-bit instruction word, the remaining instruction word bits are available to use as a procedure branching address
+--------+----------------------+
| opcode |       address        |
+--------+----------------------+

Figure 7. Generic WISC instruction format
... simple hardware means simple microcode.
... a single-format 32-bit micro-instruction format is more than sufficient for a WISC machine.
... The combination of hardware stacks with writable microcode memory results in the blurring of the boundaries between high level programs, machine code, and microcode.
... a procedure can be transparently replaced with a microcoded primitive simply by replacing the procedure call with an opcode. There is no impact to any other aspect of the source code. ... lead to a view of microcode memory as a cache memory for frequently used operations.
... The most heavily executed procedures can then be partly or wholly transferred from high level code to microcode, resulting in a significant speed increase. [DAV: is it possible to automatically do this for every (inner) loop, at run time / load time ?] ...
... If a suitable microcoded instruction set is used, compiled object code can closely correspond to the original source code, resulting in simpler and more efficient compilers and debugging tools. ...
... WISC machine ... Programmers are not penalized for organizing programs into small, understandable procedures. This results in compact tree-oriented program structures which are composed of hierarchically arranged solutions to sub-problems. Thus programs can be simultaneously optimized for small memory space, fast execution speed, and low development cost. ...
Design of a 32-bit WISC machine
... CPU/32 ...
BIT: | 31    23 | 22         2 | 1    0 |
     |  opcode  |   address    | call/  |
     |          |              |  exit  |

Figure 8. CPU/32 instruction format

....
The CPU/32 has no program counter. Each instruction contains the address of the next instruction. The only exception ... procedure returns ... the return stack value is passed directly through the memory address logic to access the next sequential instruction in the calling program.
While there is no program counter, there is an incrementer within the program memory logic that is used to add a 1 word displacement to procedure call addresses before they are saved on the stack. ... The incrementer is also useful in block memory moves. ... ''
... very reminiscent of Transmeta's ``code morphing'' concept ...
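DAV: A sketch of this program-counter-less idea (my own toy encoding, not the actual CPU/32 bit layout): every instruction pairs a data operation with an explicit successor address plus a call/exit flag, so the data op and the flow op happen "in parallel"; only ``exit'' ignores the address field and pops the return stack instead.

```python
# Each instruction is (op, address, flow): op acts on the data stack;
# flow is "next" (branch to address), "call" (branch to address, saving
# addr+1 via "the incrementer"), or "exit" (pop the saved address).
# op None halts -- a convention of this sketch only.
def run(mem, addr=0, max_steps=10000):
    stack, rstack = [], []
    for _ in range(max_steps):
        op, address, flow = mem[addr]
        if op is None:                 # halt (sketch convention)
            return stack
        op(stack)                      # data op, alongside the flow op
        if flow == "call":
            rstack.append(addr + 1)    # resume point for the matching exit
            addr = address
        elif flow == "exit":
            addr = rstack.pop()
        else:                          # "next": plain successor address
            addr = address
    raise RuntimeError("step limit exceeded")

nop = lambda s: None
lit = lambda n: (lambda s: s.append(n))
dup = lambda s: s.append(s[-1])
mul = lambda s: s.append(s.pop() * s.pop())

mem = {
    0: (lit(6), 4, "call"),    # push 6 and call "square" at address 4
    1: (None,   0, "next"),    # halt
    4: (dup,    5, "next"),    # square: dup ...
    5: (mul,    0, "exit"),    # ... multiply, then return to address 1
}
assert run(mem) == [36]
```

Note how the call costs nothing extra at run time: the instruction that pushes 6 *is* the call, exactly Koopman's point about folding procedure branching into every instruction word.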
DAV: Let's take this program-as-tree-structure idea seriously. Why should a return happen to go to the next statement after the statement that called it ? If the tree is a binary tree, then we can imagine each node calls 2 lower-level routines before returning to its parent higher-level routine. (tail-recursion makes this equivalent to 1 call and 1 jump). ... If this were a pure tree, it would be pretty simple to rearrange the tree so the destination of that 2nd call comes immediately after the instruction that calls it, simplifying to ``jump PC+1'' and leading to Koopman's single-address format. But is that true for the acyclic graph produced when a subroutine is called by many different higher-level nodes ? No. What happens then ? Some cases: When the common subroutine is the left (1st called) subroutine of more than 1 node: No problemo -- Koopman's single-address format still handles it fine.
DAV: When the common subroutine is the right (last called, AKA jumped-to) subroutine of more than 1 node: It's impossible to have more than 1 location ``fall through'' to this common subroutine. The simplest, most compact way to handle this is with a NOP, Jump instruction.
DAV: If the instruction to which this NOP,jmp points also happens to be a NOP,jmp instruction, then this NOP,jmp instruction can be adjusted to point to the destination of *that* jmp instruction -- so there's no more than 1 consecutive NOP.
On p. 69 Koopman discusses ways for a compiler to optimize away this NOP by ``bubbling up'', (slightly) expanding the code. It doesn't expand as much as subroutine in-lining, because we do not duplicate the *entire* subtree, only the right-hand branches. Perhaps we do not even need to duplicate that much ... We only need 2 versions of a subroutine to eliminate NOPs. Call the 1st non-NOP opcode executed by a subroutine op1. One version of that subroutine has op1 embedded in it, called by
(any op) call version1

If the compiler can put something productive into (any op), it uses this version. If the compiler is forced to put NOP into that call, it changes

NOP call version1

into

op1 call version2

where version2 leaves out that 1st op1 and depends on the caller to execute it.
DAV: on-the-fly opcode redefinition: Perhaps it would be interesting to dynamically execute data structures. The ``opcode'' at each node would merely indicate ``this is a leaf'' or ``this is not a leaf'', and that opcode would be redefined every time we wanted to do something different to the tree. Does opcode redefinition really give us enough flexibility to do prefix, infix, and postfix tree handling ?
Gforth http://www.jwdt.com/~paysan/html/gforth_toc.html (GPL'ed version of FORTH) includes Object-oriented Forth http://www.jwdt.com/~paysan/html/gforth_50.html
Forth Research at Institut für Computersprachen and related topics, such as stack-based languages (PostScript, JavaVM), threaded-code interpreters, and stack machines. http://www.complang.tuwien.ac.at/projects/forth.html
Implementation of Stack-Based Languages on Register Machines is pointed to by http://www.complang.tuwien.ac.at/forth/performance.html
David Cary thinks that if you know enough to develop an adequate OS, your time and skill could be applied to other projects more worthy of your effort todo.html .
Nevertheless, there's nothing like producing a system where you wrote every byte of its software. (It may be a waste of time, but it's sure a learning experience). You might want to start with a simple PIC robot_links.html#pic .
See also
[FIXME: consider moving to http://c2.com/cgi/wiki?OperatingSystemsResearch ... http://c2.com/cgi/wiki?OperatingSystemsDesignPrinciples ... or perhaps http://cliki.tunes.org/Microkernel-based%20OS ]
David Cary is fascinated by the variety of computer architectures, and nearly all of the "layers of abstraction" idea_space.html#level of a computer system (from the raw atoms, electrons, and photons at the bottom ... through typography ... to the user interface at the surface). However, I am singularly uninterested in the OS layer. I've seen this sequence of events far too often:
(For example of the ``everything must change'' mentality, see ``a major architectural overhaul is still required, on pretty much all fronts'' -- Ron http://www.cs.caltech.edu/~adam/LOCAL/faq-fork.html#munchkins , or check out Luke 5:36-38 )
Why re-invent the wheel ? There are already far too many OSes out there (in my opinion). Alan Turing proved that any OS can be ported to any computer architecture. Of course we want systems that are better in *every* area, but (correct me if I'm wrong here) it should be possible to just improve one area at a time. ( That's the main idea behind "scaffolding" and "levels" idea_space.html#level )
If you're interested in computer architecture, fine, invent a new architecture, and get an ugly old OS running on this clean new hardware (I hear that Linux is *designed* to be easy to port).
If you want to make your own OS, fine, get it running on ugly old hardware and port a bunch of the ugly old applications to it (there are tons of "free source" applications *designed* to be easy to port).
There's tons of applications (both open and closed source) that are written in C or C++. So you might consider
Porting "gcc" or another C compiler #porting_c to your OS. Once you have the C compiler and C libraries ported to your OS, suddenly thousands of programs "just work".
Porting a FORTH or FROTH compiler http://sourceforge.net/projects/froth/ is much easier than gcc, but not as many applications are written in this language.
If you're really serious about creating your own OS, here are some links I think you might be interested in.
First we have some general links relevant to developing an OS:
`` I think Linux is a great thing, because ... because, of all the operating systems that are at all relevant today, Unix is the best of a bad lot.
(Yes, that's right, though I've been living in Unix for more than a decade, I think Unix sucks. Read the ``Unix Hater's Handbook'' if you want to know why. But I'd rather run Unix than Windows or MacOS any day, because Unix sucks less. That doesn't mean it doesn't suck.) ...
Of course, all of the software I write runs on Linux; that's the beauty of standards, and of cross-platform code. I don't have to run your OS, and you don't have to run mine, and we can use the same applications anyway!
... I hope that some day it will have evolved to the point where my mom can take home a Linux box, turn it on, and get on with her life without having to become a Unix sysadmin first, and without having to give up on all the ease of use she's come to expect from allegedly less powerful operating systems.
Because, you see, what I want to do is to commoditize the OS. I want to have access to all the applications that I need to do the things that I need to do, regardless. Why should someone have to retrain themselves to use a new application that does the same basic thing as the old application, just because something as trivial as the operating system changed out from under them? ''
the Global File System http://sources.redhat.com/cluster/gfs/ , http://www.redhat.com/software/rha/gfs/ /* was http://www.globalfilesystem.org/ , http://www.openGFS.org/ */ seems like a really good idea. It creates what appears to be a single remote file system that many people can access (say, .../public_html/...), but (a) is fault tolerant -- the system is distributed over many servers, with enough redundancy that any single server (any hard drive) can fail without any users noticing (no lost data). (b) distributes/balances the load over many servers.
Next we have lots of stuff about particular OS ideas (usually embedded in a specific OS):
Some systems have the GUI mixed up in the OS. Other systems have a distinct GUI layer between the application and the OS.
-- Alan Grimes http://users.rcn.com/alangrimes/UCE/sphere.txt [simplicity ?]``As long as it is conceptually straight-forward, avoiding or hiding complexity wherever possible, and the requirement that the user need to remember stuff ... be reduced or eliminated, it WILL be easy to use. ...
... when I discuss hiding complexity I don't mean throwing a door over it and welding it shut. Rather I mean compartmentalizing it in such a way that the user can identify and focus on the things that concern him at present.''
http://webs.cs.berkeley.edu/tos/ latest source code: http://sourceforge.net/projects/tinyos/ [low-power] [program compression ?] [Smart Dust] ["the hardware design was made public" -- open hardware] TinyOS is a component-based runtime environment designed to provide support for deeply embedded systems which require concurrency-intensive operations while constrained by minimal hardware resources. For example, originally designed for the Smart Dust hardware platform, our scheduler fits in under 200 bytes of program memory.
One of the major advantages to the Mac platform ... is that for the most part you can drag an app from one machine to another and launch it and it will work, without having to be "reinstalled." Applications that install kernel extensions are the exception, but in some cases, the application merely prompts you to install the appropriate bits once it is launched for the first time on the new machine. This is the way that things should work, but you don't always notice how nice it is when it does. -- http://www.osnews.com/story.php?news_id=2792
RTEMS (Real-Time Executive for Multiprocessor Systems) is a commercial grade real-time operating system designed for deeply embedded systems. It is a free open-source solution that supports multi-processor systems. RTEMS is designed to support applications with the most stringent real-time requirements while being compatible with open standards. Development hosts include both MS-Windows and Unix (GNU/Linux, FreeBSD, Solaris, MacOS X, etc.) platforms. -- http://rtems.com/
Choosing a Linux distribution http://mirrors.kernel.org/LDP/HOWTO/Installation-HOWTO/before.html#DISTRIBUTIONS basically just defers to http://lwn.net/
Also, if you do create a big Makefile that will rebuild tomsrtbt from source, I would be glad to put it in the contrib directory on my site to make it easier for anyone else who wants for some reason to do the "big everything rebuild thing". ]
[FIXME: to_program: perhaps it would be interesting to port this to ARM, PIC, etc.] Ultra-flexible firmware through the magic of Forth. Tiny Open Firmware (TOF) for Coldfire, 68K and 8031 processors. ... free, full-source development environment
(Linux currently scales from a minimal size of around 500 kilobytes of kernel and 1.5MB of RAM, all before taking into consideration application and service requirements.) ... eCos provides the basic runtime infrastructure necessary to support devices with memory footprints in the 10's to 100's of kilobytes, or with real-time requirements." ''
``eCos Porting Guide'' article by Anthony J. Massa 2001-12-28 http://www.embedded.com/story/OEG20011220S0059
[FIXME: FPGA microprocessors ?] µClinux: the Embedded Linux/Microcontroller Project http://www.uclinux.org/ The Linux/Microcontroller project is a port of Linux to systems without a Memory Management Unit (MMU). uClinux was first ported to the Motorola MC68328 DragonBall Integrated Microprocessor. The first target system to successfully boot was the 3Com PalmPilot ... It is currently maintained by co-creator D. Jeff Dionne. ... Jeff Dionne and John Drabik announced that they have successfully ported uClinux to a Field Programmable Gate Array (FPGA) running the Leon SPARC open source core ...
Date: Tue, 12 May 1998 13:57:48 +0100 To: seul-dev-help@seul.org X-URL: http://www.seul.org/archives/seul/dev/help/Apr-1998/msg00045.html X-Personal_name: Dan Moore From: mooreds@whitman.edu Subject: contacting other groups Sender: owner-seul-dev-help@seul.org Here's another group that is apparently trying to put together a 'linux for the masses'. http://independence.dunadan.com/ Daniel
Our secret ambition is utterly insane. We would like to design an infinitely mutable, portable, and third-party extensible OS, as simple and as elegant as the Mac, which does not rely on dll's, object dependencies, exception traps, or any other messy run-time lookups. It's impossible, but we think we know how to do it.
Most people think a MMU is necessary for virtual memory to work ... apparently not:
``Applications are given a mechanism to declare allocated memory blocks for swapping while not accessing them (this works even on systems without an MMU).''
``All our software is written and tested using our own tools under DPS (our OS) on computers which are our development; and all the hardware and PCB design are done using our own CAD tools. In fact, at the main development location we do not have a single Wintel or other kind of non-TGI designed computer.'' -- http://transgalactic.freeyellow.com/whower.htm
Is this related to
``JavaOS: An Operating System for Small Devices: Designed for PDAs and embedded systems'' article by Steve Mann 1996 http://www.ddj.com/documents/s=949/ddj9617f/9617f.htm
[FIXME: computer_architecture#os]JavaOS doesn't require a memory-management unit (MMU).
?
the JOS Wiki http://jos.sourceforge.net/
AROS is a rewrite of AmigaOS 3.1. These are the main goals:
- 1. Be binary compatible on Amiga (so it can plug&play replace the original OS)
- 2. Be source compatible on any other hardware (so Amiga software can be ported to other hardware with no effort)
- 3. Provide every hardware with Amiga apps with native look and feel or, as the user chooses, ... with Amiga look and feel.
... we want to be binary compatible to the old AmigaOS on Amiga. The reason for this is just that a new OS without any programs which run on it has no chance to survive. Therefore we try to make the shift from the old OS to our new one as painless as possible (but not to the extent that we can't improve AROS afterwards). ...
AROS is Open Source
Lineo has released "AtomicRTAI, a working real-time Linux distribution that fits on a single floppy diskette." "AtomicRTAI is open source and freely distributed under the GPL and LGPL licenses." http://www.zentropix.com/products/atomicrtai.html
building a minimal+simple+fast kernel
The LispOS/VM Mailing Lists Archives http://cathcart.sysc.pdx.edu/lispos/ml/
[FIXME: should I split out Beowulf and similar "operating systems" into a separate category ?]
Hope you have fun !
There seems to be some confusion about what a compiler can and cannot do.
Here I have information about compilers in general. Also see Porting C compilers #porting_c
I'm especially interested in:
compiler quotes
Optimization is but one of many desirable goals in software engineering, and is often antagonistic to other important goals such as stability, maintainability, and portability. At its most cursory level (efficient implementation, clean non-redundant interfaces) optimization is beneficial and should always be applied. But at its most intrusive (inline assembly, pre-compiled/self-modified code, loop unrolling, bit-fielding, superscalar and vectorizing) it can be an unending source of time consuming implementation and bug hunting.
[FIXME: should I put this quote with the Abrash stuff in video_game.html ... cross link ...]I was magnificently pleased with myself. ... I calculated that I had saved ... a lot of loop overhead. ... I was so pleased, that I decided to measure the speed improvement, ... so that I could pat myself on the back more quantitatively.
Do you know what I found?
No improvement whatsoever. ... The compiler's optimizer had optimized the first code well enough that my optimization didn't help at all. I was powerfully disappointed, and I learned that the only thing you can usually be sure of without measuring performance is that you've made your code harder to read.
In short, if it's not worth measuring to prove it's more efficient, then it's not worth sacrificing clarity for a performance gamble.
...
Each compiler has different strengths and weaknesses, and some are better suited to your program than others.
...
When to Tune: ... when to optimize ... Make the program right. Make it modular and easily modifiable so that it's easy to work on later. When it's complete and correct, check the performance. If it's lacking, then make it fast and small. In short, don't optimize until you know you need to.
...
``using assembler to debottleneck applications is ... Common practice for mainframers but not for Unix users.'' http://www.linuxworld.com/site-stories/2002/0426.mainframelinux.react.html
[DAV has snipped out a bunch of stuff from this long post, leaving only things directly relevant to compilers] ------------------------------ Date: Tue, 20 Jun 2000 23:15:04 -0700 From: Jeff Fox To: misc Subject: CPUs and Forth ... I know what you mean. I have confidence that people are still capable of being smarter than their PCs at this stage of technology and don't really need to adopt the stance that they need a compiler that is smarter than they are. But I hear all the time that this is the goal today, don't think about any of that stuff and just have faith that the compiler will be smarter than you and generate optimized code for you. We are told that we should just have faith that the code in the canned libraries is going to be efficient and better than code we could write ourselves. We are told that we need hardware that is just too complicated for humans to deal with and the only option is tools that are smarter than we are. We have taken a very different approach where we enjoy being in control and being able to think about the problem. ... iTV was spawned from a fund available to NASA contractors who wanted to take high technology developed for NASA into the commercial marketplace. Gary Langford and Joe Zott had been doing spacecraft design for NASA through SkyWatch and had submitted designs for various spacecraft that proposed using Chuck's chips. They got the idea that they chips could also make very cheap internet appliances and iTV was born. I go back quite a ways. I was a physics major at U of Iowa when they built a number of experiments for the deep space probes. They were the only University putting experiments and equipment into those missions and I looked over shoulders and asked a lot of questions. Those were days before Forth chips let alone MISC. Between that time and when I started working with MISC I sold a couple of Forth systems to scientists working at NASA and heard stories from them about the use of Forth on early spacecraft. 
They used the silicon on sapphire 1802 in those days because it was low power and could take radiation. They were strange processors and one of the things that made them useful was Forth. They were weak so a lot of them were required. Those early deep space exploring spacecraft were Forth multiprocessing robots. Forth made 'em work and they worked remarkably well. As I say, as far as I know Forth is the only thing we have sent outside of our own solar system (and not because we were trying to get rid of it either!) ... > I went onto comp.arch.fpga a few months ago I used to read comp.arch, comp.embedded, comp.realtime, comp.lang.forth, comp.robotics, and comp.robotics.ai.philosophy ... I found the majority of the people in those groups are so used to bloated and inefficient software that they really believe things like that a Pentium needs 50 thousand instructions inside of an empty loop. Most often when we would discuss most anything I would be talking nano-seconds or micro-seconds and they would be talking milli-seconds or seconds. So when they would say that with a thousand dollar computer they could do such and such in 20 milliseconds and I would say well we do the same thing in 20 microseconds or 20 nanoseconds on a machine costing almost nothing this was pretty hard for them to grasp. The same thing with replacing megabytes of code with kilobytes of code. (Well they are related.) ... Well to be fair, the ANS standard was never intended as a tutorial on Forth let alone a tutorial on how to write good Forth. Like other standards it aimed for system implementors not end users. As a result I think understanding the standard is many times more complicated than understanding a real Forth system, even one that is ANS compliant. > Reminds me of when I thought of FORTH as a scripting language because it > was so easy to implement a FORTH interpreter in C.
One of my main complaints about ANS is that it is being used to promote Forth as yet another inefficient scripting language extension to C. I would hate to see that become Forth's fate. I prefer to think of it as a systems language rather than a scripting language. But if your perspective is that the OS, the compiler and most of the programs are written in C then about the only place to put Forth is as a scripting language extension to the C environment and most of the concern about the Forth will really be C interfacing issues. I think of Forth as providing the OS, the compiler, the programs, and so think of it as needing to be an efficient systems language, and don't see that the main issue is making it conform to the rules of a C environment ... Jeff Fox ------------------------------
[low power] [FIXME: DAV has a 1802 instruction set manual on his shelf ... Has this already been digitized and put online ?] 1802 emulator http://kristopherjohnson.net/cgi-bin/twiki/view/Main/TinyELF
-- From: Date: Wed, 27 Oct 1993 16:34:05 GMT Newsgroups: comp.compilers No human chess player in the world could defeat even a 1960's-era chess program in a tournament if required to play 500 games straight through with no breaks. For today's programs, the number would probably be 10 rather than 500.
No compiler can touch the best hand-coders for fairly small pieces of code that do complex things under messy conditions (e.g., on hairy machines like 80x86's). No human being can touch a good compiler when optimizing large amounts of "boring" code that has been written for clarity and human understanding, rather than tuned for a particular architecture. No human being can match the ability of optimizing compilers to keep the code optimized as it changes or as it's moved among different types of hardware.
-- Jerry
Q: Are there any books you recommend for designing compilers?
John Reagan:
Starting with
void tolowercase(const char * const input, char *output) { int i=0; for (i=0; i<strlen(input); i++) output[i] = tolower(input[i]); output[i] = '\0'; } Since the pointer
input
never changes,
and the characters it points to never change
(or do they -- when output
points to the same place ?),
then there's no need to re-run strlen() for each character.
A smart compiler should store that value only once:
void tolowercase(const char * const input, char *output) { unsigned int i=0; unsigned int length = strlen(input); for (i=0; i<length; i++) output[i] = tolower(input[i]); output[i] = '\0'; }
Since the *order* that we go through the loop doesn't matter, some architectures (such as the PIC) scan through strings *much* faster end-to-beginning rather than beginning-to-end:
void tolowercase(const char * const input, char *output) { int i = strlen(input); output[i] = '\0'; while( 0 < i ){ i--; output[i] = tolower(input[i]); }; } For this particular function, there's really no need to use strlen() or a temporary variable at all.
void tolowercase(const char * const input, char *output) { int i=0; do{ output[i] = tolower(input[i]); }while( 0 != output[i++] ); } I'm pretty sure compilers automatically optimize array references, like that, to give something closer to
void tolowercase(const char *input, char *output) { do{ *output = tolower(*input++); }while( 0 != *output++ ); } That should be the fastest on normal architectures (ones where RAM is so slow that there's time for a few quick operations between reading and writing).
On some architectures, I find it faster to break things up into separate loops. Apparently optimizing each loop separately sometimes more than compensates for the extra loop overhead.
(In particular, on machines that are register starved, if you have to keep shuffling around temporary variables in RAM inside one large loop, but you can break it up into smaller loops that can keep all the temporary variables in registers ... but once you have everything in registers, further splitting doesn't help. ).
void tolowercase(const char *input, char *output) { strcpy( output, input ); while( 0 != *output ){ *output = tolower(*output); output++; }; }
[FIXME: consider moving to http://c2.com/cgi/wiki?edit=BetterForLoopConstruct ]
Porting the "gcc" compiler; Porting other C compilers
see also general compiler quotes #compiler
2 sections here:
Porting "gcc", a full-size commercial-strength open-source compiler, and re-targeting it as a cross-compiler.
porting simpler C compilers, especially ones targeting very simple processors such as the Microchip PIC, especially ones that optimize space over time. This includes "Small C", a highly restricted subset of C (no structures; DAV finds this annoying) that was designed to require a far simpler compiler than one that can handle the full C language.
other C Compilers:
Since the full source to GCC is so large, perhaps it would be good to get a simple C compiler working first on a new OS or architecture.
(Does it make sense to bootstrap gcc using Small C or another C compiler ? Or should we just use gcc on a previous machine as a cross-compiler ? )
information about porting "gcc":
low power design [Should this be moved to vlsi.html ?]
Designing for low power has secondary benefits ... assembling a quiet PC hardware_david_uses.html#quiet_pc
There are many levels to low-power design. To get a low-power web server (top level) requires software that doesn't waste power, a low-power CPU, low-power clocking, and low-power circuits (including low-power oscillator). [FIXME: the last section is over at schematic.html; should I break up the other levels into their own sections ?]
see also:
external links:
[DAV: Is it possible to integrate sensors with switching power supplies, to get the best "performance": - low voltage ripple at power input of device under test, - "high" time resolution (?), - "high" precision of exact amount of power used over the full range of currents used by that device - power-efficient (not wasting too much power in "precision resistors" or other losses). ] The monitors are Crystal Semiconductor CS5460 Single Phase Bi-Directional Power/Energy ICs. ... Precision inline resistors have been added to the power inputs of various components, and the monitors measure the potential drop across these resistors.
The CS5460 devices are all clocked in parallel ... This design permits all eight devices to be updated in parallel, ensuring that the power samples are in phase.
-- Jennifer Loiacono http://www.piclist.com/techref/postbot.asp?by=time&id=piclist\2002\02\27\200032a&tgt=post We found that potatoes don't work for very long. The phosphoric acid in the potatoes quickly runs dry (within a matter of minutes). As others have already suggested, lemons (or other citrus fruits) will last a little bit longer. Galvanized roofing nails are also a good source of zinc. We cut a few solid copper o-rings from the lab - they work great - but I'm not sure of their general availability.
-- Douglas Wood http://www.piclist.com/techref/postbot.asp?by=time&id=piclist\2002\02\28\030950a&tgt=post I suggested driving a hobby DC motor/gear box combination with a fruit battery. Slipping all of the interceding research, I found that plastic soda bottle caps arranged in a honeycomb pattern (for density) filled with lemon juice work the best. Each battery cell was filled roughly half with juice and I used common zinc galvanized nails and copper mesh for the electrodes (The copper mesh provided a greater surface area for the chemical reaction to occur and produced about three times the current output when compared to a copper nail). Lemons work the best of all fruits (apples, oranges, lemon, limes, etc.) that I tried. Other foods I tried included [chilli] peppers and potatoes (potatoes are rechargeable!). The general rule of thumb is that a whole lemon will produce approximately 1 mA at 1 Vdc. I used the bottle caps to reduce the weight of the battery. Hope that helps... Douglas Wood
``Linux and other open source products end up making the computer suck TWO TO THREE TIMES MORE POWER than Windows.'' The article has some interesting numbers: 1 gallon of oil can be burned to generate 26 kilowatt-hours of electricity. 1 pound of coal can be burned to generate roughly 4.4 kilowatt-hours of energy.
[FIXME: todo: put current meters / energy meters on my computers.]
"Is it better to turn off your computer at night, or let it run all the time?" http://www.everything2.com/index.pl?node_id=554411
[FIXME: move to Wikibooks: Microprocessor Design/Resources
free (?) logic simulators: Clive "Max" Maxfield http://ro.com/~bebopbb includes C source code for "a simple home-brewed logic simulator"
see schematic.html for a few more free (?) logic simulators.
DAV: see idea_space.html#level for more of my ramblings about the concept of a ``level''.
Date: 1997-06-03 From: Jim Dodd <jim_dodd@onsetcomp.com> Organization: Onset Computer Corp. To: Mot-68HC11-Apps@freeware.mcu.motsps.com Subject: Comparison of HC11 and HC12 Op-Codes Mr. Sibigtroth wrote a fascinating piece called "A Closer Look at Instruction Set Design" in "Electronic Design" magazine in the August 19, 1996 issue. He looked at the HC12 instruction set from two angles: The first angle was that Motorola took input from users of the HC11 about what are weak points in the instruction set and how they were improved in the HC12 instruction set. The second angle was how to design a flexible yet fast instruction set that also allows for support of high level languages. I don't know if this article is also available through Motorola or if back issues are available of "Electronic Design" (their Web site is at www.penton.com/ed). I just hope that those of you who have old copies of the magazine will be able to look it up or that someone can tell us where it exists in Moto space on the web.
The article certainly makes my mouth water for the day when I can do some work on the 'HC12.
Date: 19970604 From: Dietmar Block <exp145@physik.uni-kiel.de> To: Mot-68HC11-Apps@freeware.mcu.motsps.com Subject: Re: Comparison of HC11 and HC12 Op-Codes I can't tell you if I found the right article (I have no access to this magazine) but at the page www.mcu.motsps.com/lit/fam_12.htm the article AN1284 'Transporting HC11 Code to HC12 Devices' looks a little like that.
From: K.J.Wood Subject: Re: Book: The Anatomy of a High-Performance Microprocessor Date: 10 Nov 1998 00:00:00 GMT ... > The Anatomy of a High-Performance Microprocessor > A Systems Perspective (Interactive Book & CD-ROM) > by Bruce Shriver and Bennett Smith > > from IEEE computer Society. ... I work on media processors (<plug>TriMedia</plug>) and 3D graphics and I bought the book through work to check out what the other guys are doing :-) Cynically, I suppose it's an indirect form of advertising for AMD but it's done with style and good taste. ... ... Chapters 2 and 3 deal with microarchitecture over 250 pages and suffer from the tendency of computer books to tabulate vast amounts of data. OTOH if that's what you're interested in it does look like good stuff, I'll probably absorb it by osmosis over time. I liked Chapters 1 4 5 6. OK, some of it is "PCs for beginners", but it's -really- handy having that sort of stuff in one place. When I have to write documents that mention things about PCs that "everybody knows" it's useful to be able to point to something from this book. The CD ROM is good. Sure you could dig around on the WWW for it all but again, having it all in one place is nice and it encourages you to read around the subject. I'd like to hear from someone who knows whether or not it is a good book from an architecture viewpoint, but from a systems viewpoint I would recommend it for someone new to the field (and as someone new to the field). -- K. J. Wood Philips Research Laboratories, Cross Oak Lane, Redhill, SURREY RH1 5HA, United Kingdom. Phone: +44 1293 815328 Fax: +44 1293 815500 karl@prl.research.nospam.philips.com
Here I point to a few museums
that list
lots and lots of computer architectures.
[FIXME: delete all the redundant stuff on this page
that can be found easier at these ``museums''].
Some of these architectures can be simulated on FPGAs .
Also see robot_links.html#ucontrollers for information about a few of the most popular architectures currently in production.
the TI-85 ... calculator does not use series or polynomial approximations,
but rather the so-called CORDIC method.
...
The constants arctan (2^{-k}) are stored in the calculator.
--
http://mathforum.org/library/drmath/view/54012.html
the same technique described slightly differently:
http://mathforum.org/library/drmath/view/51900.html
The Retrocomputing Museum http://www.tuxedo.org/~esr/retro/ | mirror http://www.catb.org/~esr/retro/ ``Our exhibits include many languages, some machine emulators, and a few games.'' including The BlooP and FlooP languages from Chapter XIII of Goedel, Escher, Bach: An Eternal Golden Braid by Douglas R. Hofstadter.
Some design goals for the Saturn processor (used in the HP-48 series calculators), some implications, and future trends http://www.brouhaha.com/~eric/hpcalc/rant.html .
Retrocomputing http://www.brouhaha.com/~eric/retrocomputing/ ``I'm interested in old computing devices. Mostly machines made before 1981, but with a few exceptions. ... Wanted: ... Apple ][ (not ][+) ...'' has lots of information on early machines. Lots of links to 6502 programming.
dealing with interrupts: some ideas about interrupt-time code. [FIXME: make a clean separation between peripheral interrupts here (which the CPU can safely ignore for one or two cycles), and MMU traps f_mmu.html which must block WRITE instructions before they overwrite ``protected'' memory. ] [FIXME: should I separate out things that are only useful to CPU designers, vs. things that are only useful to CPU users ?]
The simplest design is to do (1) for all instructions. Some CISC machines have instructions that can last hundreds of cycles (such as a "stringcopy" instruction). Real-time systems need a response within a few cycles; so (1) is unacceptable unless the instruction set is specifically designed such that *every* instruction is guaranteed to finish in a few cycles.
(3) is impossible for instructions that have already written a value to I/O locations, even when restarting the instruction is guaranteed to write the *same* value. Many pieces of hardware will *do* something on each write, so writing twice to some location is very different from writing only once to that location. Properly-designed hardware, however, should tolerate a restarted instruction performing multiple reads.
(I suppose it might be OK to complete the Write to cache, as long as that cache line is immediately marked as "invalid").
Of course, this has nothing to do with peripheral hardware interrupts -- so (1) is acceptable as a response to all peripheral hardware interrupts.
For nearly all hardware, the write ordering is important, and usually so is the read ordering. If you do out-of-order execution and reading and writing, and/or cache reads and writes, device drivers need some way to enforce the desired order when writing to hardware (typically a "barrier" or "synchronize" or "flush cache" or "read uncached, write uncached" instruction is provided to implement this). (In the C programming language c_programming.html#volatile , "volatile" is used to describe this situation).
-- Samuel A. Falvo II (?) (1999 March) Also, when handling interrupts, please EXCHANGE registers, instead of moving them from one register to another. For instance, there could be a register for PC, IPC (Interrupt Handler PC), and RPC (Return PC). When an interrupt occurs, PC and IPC are exchanged, and all future execution occurs where IPC left off. When executing an RFI (return from interrupt) instruction, IPC and PC are exchanged again. Likewise, when calling a subroutine (actually a co-routine), it exchanges RPC and PC. Returning from the co-routine simply involves exchanging RPC and PC again.
yesterday:
Once the CPU directly read and wrote all values directly to I/O devices.
today:
Main memory serves as the I/O interface for most I/O devices. Today a CPU typically flushes a buffer of values to DRAM, then triggers an I/O device which independently reads those values while the CPU is doing something else. When the I/O device has new information for the CPU, it writes those values directly to DRAM, then sends an interrupt to the CPU.
tomorrow:
Some people speculate that all peripherals will go to some sort of serial I/O. This is because it's much faster to burst an entire packet of data to the CPU than it is to request an IRQ, wait for the CPU to respond with a grant IRQ, and then finally send the data over.
"a rather simple method of allowing interrupt handlers to be written in C, reducing overall interrupt latency, and providing for simple and efficient task switching in a multitasking environment."
The flowchart next to Link's article seems to restore disposable registers far too soon -- but the actual code is correct: restoring them just before returning from the interrupt.
Why does Link have 2 different checks for ``nested'' subroutines ?
It took me (DAV) a long time to understand why Link thought he needed 2 different checks for ``nested'' subroutines.
We obviously need to check the g_doingASTS flag to see if we've interrupted the initial interrupt while it's handling the ASTs.
Why do we need that other check ? To handle this situation:
- One peripheral triggers an initial interrupt, (turning off IRQs for peripherals of the same and lower-priority). Let's say it's the buffer-full interrupt on the UART. Before that interrupt routine has a chance to set the g_doingASTS flag, another interrupt comes in, and the CPU pushes the PC and starts executing a second routine. (Obviously this is not an interrupt from the *same* peripheral, because we wrote the initial interrupt handler to set the g_doingASTS flag *before* re-enabling IRQs from that peripheral).
Well, now we're hosed. There's only 2 possible ways we can write the interrupt handler for this second routine:
- After setting AST bits, check only the g_doing_ASTs flag. We'll see that it is ``no'', so we set it to ``yes'', and eventually enable interrupts and call the AST_handler(). That might run for some time, and other interrupts could come in and cause the AST_handler() to run for a few more cycles. Meanwhile, more characters could come in on the UART and overflow the buffer. Even worse, if this 2nd interrupt or those other interrupts include a timer tick, then we might decide to switch tasks. That buffer-overflow interrupt may never be resumed if the task that happened to be executing when it occurred is never re-activated. That makes it almost certain that the buffer will overflow.
- After setting AST bits, use both checks. With the 2 checks, we correctly detect that this (higher-priority) interrupt is a nested interrupt, and exit quickly. We rely on the (lower-priority) initial interrupt to handle any ASTs the nested interrupt registered. (If that lower-priority initial interrupt wasn't written to handle ASTs, well, tough; I guess we'll handle those ASTs at the next timer tick, which is connected to an interrupt handler that will handle them).
Some ideas:
- If we set the g_doingASTS flag as the very first instruction in the interrupt routine, could we avoid this situation on the 680x0 ? That isn't an option on some architectures where we really can't do anything before pushing the disposable registers, and where further nested interrupts can occur before we have a chance to do anything else.
This problematic situation never happens on the PIC. In effect, every interrupt on the PIC has the highest priority, so an interrupt routine cannot be further interrupted unless it explicitly sets GIE to allow interrupts. So all we need to do is put the command to set g_doing_ASTs before the command to allow interrupts, and we only need to do 1 check.
This problematic situation never occurs on the ARM. The second check isn't needed for implementing ASTs on the Microchip PIC or on the ARM (because the CPU disables all peripheral interrupts when ``the'' interrupt occurs).
If you use an architecture where this problematic situation does occur, you must make sure that (lower-priority) interrupt routines can deal with being interrupted. In particular, if your IRQ_handler() decides to set bits in the g_AST_register, that *must* be done atomically. The wrong way to do it is to load g_AST_register into a CPU register, set the bit, then write it out again. You might be overwriting other bits that have been set by higher-priority interrupts.
OK, so now what ?
- Advantage of doing 2 checks: we can guarantee that the C language IRQ_handler() eventually finishes executing before interrupts return to any task-level code, (i.e., it avoids the ``disadvantage of doing 1 check'').
- disadvantage of Link's method: Say we still have a couple of ``simple'' subroutines -- when the interrupt occurs, everything is handled immediately, and interrupts are never enabled except implicitly by the return-from-interrupt instruction. These interrupts were written before anyone ever heard of ASTs. Sometimes one peripheral triggers an interrupt, (turning off IRQs for peripherals of the same and lower priority), then while the IRQ handler for that interrupt is executing, it is interrupted by another (higher-priority) interrupt (before the g_doingASTS flag is set). If this higher-priority interrupt only checks g_doingASTS, then it sees ``no'', concludes it is the initial interrupt, and runs the AST_handler() with interrupts enabled -- delaying the suspended ``simple'' handler indefinitely.
- advantages of doing only 1 check:
- disadvantage of doing 1 check:
Crude rough draft of how DAV would do this (perhaps I'm making some terrible mistake that Link has already considered and rejected).
(The g_ prefix indicates a global variable to which all instances of the interrupt routines have access. Other variables are local to this particular instance, and are pushed on the stack when interrupted).
(This corresponds to figure 1 / listing 1 of the article)
(IRQ) (disables interrupts)
  save disposable registers
  pass arguments for handler
  [[ call IRQ_handler ]] ; the IRQ_handler can be written in C.
  [[ call AST_handler ]] ; the AST_handler can be written in C.
  if( top_task != current_task ){ ; time to context switch.
    [save current task's context to stack]
    [switch stacks]
    [restore top task's context from stack]
    [set g_current_task = g_top_task]
  };
  [restore disposable registers]
(RTI) (allows interrupts)
This gives much lower latency, because new interrupts can be accepted and (partially) handled by the IRQ_handler, while older interrupts are still being processed by the AST handler.
(This corresponds to Figure 2 / Listing 4 of the article ... with what DAV is deluded into thinking are ``improvements'').
(IRQ) (disables interrupts)
  save disposable registers
  pass arguments for handler
  [[ call IRQ_handler ]]
    ; The IRQ handler may register one or more ASTs
    ; (i.e., set bits in g_AST_register).
    ; Since the IRQ handler subroutine runs with interrupts off,
    ; it doesn't need to be re-entrant.
    ; Make sure IRQ_handler() does *not* modify ``g_top_task'' and ``g_current_task''.
  toss arguments for handler
  if( g_nested ){
    ; nested interrupt -- do nothing.
    ; Any ASTs that were just set by the handler will be handled by the initial interrupt
    ; after this nested interrupt returns.
  }else{
    ; g_nested *was* false -- this is the ``initial interrupt'',
    ; not a nested interrupt.
    g_nested = true; // might be done during the test with a test-and-set instruction
    // re-enabling IRQs of this priority level and higher priority
    // reduces interrupt latency by a couple of cycles.
    // (I do it a few cycles later, so it's optional whether to do it here).
    // Is there a problem with re-enabling IRQs of lower priority ?
    [Allow IRQs] // optional
    // must disable *all* IRQs that modify g_AST_register,
    // even if they are higher priority.
    [disable IRQs]
    ; interrupts must be disabled while testing g_AST_register,
    ; to avoid the risk of these bad sequences:
    ;   1. g_AST_register tests equal to zero
    ;   2. a (nested) interrupt comes in and sets some g_AST bits
    ;   3. the nested interrupt exits
    ;   4. the initial routine exits, because it thought g_AST was zero
    ;   5. those set AST bits might not be handled for a very long time
    ;      (not until another interrupt comes in).
    ;   This is especially bad if IRQ_handler()
    ;   leaves data structures in an inconsistent state,
    ;   relying on AST_handler() to fix them up
    ;   before any (non-interrupt-time) tasks see that data.
    ; or
    ;   1. g_AST_register tests to nonzero
    ;   2. a copy is made
    ;   3. a (nested) interrupt comes in and sets some g_AST bits
    ;   4. the nested interrupt exits
    ;   5. the initial routine sets g_AST to zero,
    ;      as if that (nested) interrupt never happened.
    while( 0 != g_AST_register ){
      make copy of g_AST_register
      g_AST_register = 0;
      [Allow IRQs]
      [[ call AST handler with copy of AST_register ]]
        ; The AST handler,
        ; or sometimes the (non-interrupt-time) tasks themselves,
        ; might modify ``g_top_task'' and ``g_current_task''.
        ; Since the AST handler is called *only*
        ; by the initial interrupt,
        ; it doesn't need to be re-entrant,
        ; even though it runs with interrupts enabled.
      [toss copy of g_AST_register]
      if( g_top_task != g_current_task ){ ; time to context switch.
        ; May need to disable IRQs while switching stacks,
        ; but probably not -- most architectures
        ; can switch stacks in one atomic instruction.
        [save current task's context to stack]
        [switch stacks]
        [set g_current_task = g_top_task]
        [restore top task's context from stack]
      };
      [disable IRQs]
    } // while()...
    g_nested = false;
  } // if( g_nested )...
  [restore disposable registers]
(RTI) (allows interrupts)
// hints:
// g_nested may be a single bit
// in the g_AST_register, with some slight modifications ...
OK, so that looks pretty in pseudo-code. Does the real code look any better ?
680x0/ColdFire listing:
.text
_Trampoline:
        /* save disposable registers */
        moveml  #0xC0C0,a7@-     /* save d0,d1,a0,a1 on the stack */
        /* call IRQ_handler */
        movel   #0xF0F0F0F0,a7@- /* save spot for passed int and */
        movel   #0xF0F0F0F0,a7@- /* pointer */
        jsr     0xA0A0A0A0       /* call respective IrqHandler */
        jmp     common
TrampolineEnd:
        .globl  _TRAMP_SIZE,_Trampoline
_TRAMP_SIZE:
        .long   TrampolineEnd-_Trampoline
common:
        addql   #8,a7            /* toss parm1 and 2 */
        /* if this is the initial interrupt */
        tasb    g_doingASTS      /* test-and-set */
        bneb    endif_initial    /* someone else is handling ASTs */
        /* then handle ASTs */
        /* optional allow/disable pair -- */
        /* -- reduces interrupt latency by a couple of cycles. */
        movew   #0x2000,sr       /* enable all interrupts */
        movew   #0x2700,sr       /* disable irqs */
        /* end optional pair */
        /* while... */
        brab    test_ast
        /* ... do ... */
loop_ast:
        movel   _astRegister,a7@- /* pass copy of astRegister */
        clrl    _astRegister     /* and clear it */
        movew   #0x2000,sr       /* enable all interrupts */
        jsr     _astHandler
        addql   #4,a7            /* pop astRegister copy */
        /* if( g_top_task != g_current_task )... */
        moveal  _pRunningTask,a0 /* get pointer to current task */
        moveal  _readyList,a1    /* get pointer to ptop */
        cmpal   a0,a1            /* and don't save if */
        beqb    current_task_OK  /* pRunningTask == ptop */
        /* ...then... */
SaveContext:
        moveml  #0x3F3E,a7@-     /* save d2-d7,a2-a6 on the stack */
        movel   a7,a0@           /* save stack pointer, first item in task struct */
RestoreContext:
        /* switching to a different stack is atomic, ... */
        /* so it's OK that interrupts are enabled now */
        moveal  a1@,a7           /* point to ptop's stack */
        movel   a1,_pRunningTask /* set runningTask to new task */
        moveml  a7@+,#0x7CFC     /* restore d2-d7,a2-a6 from the stack */
current_task_OK:
        /* ...endif */
        movew   #0x2700,sr       /* disable irqs */
        /* ... while( 0 != g_astRegister ); */
test_ast:
        tstl    _astRegister     /* any jobs to run? */
        bneb    loop_ast         /* yes */
        /* end while */
        clrb    g_doingASTS      /* ok.. irqs still disabled */
endif_initial:
        /* endif */
        moveml  a7@+,#0x0303     /* restore d0,d1,a0,a1 from the stack */
        rte
doingASTS:
        .byte   0x00
PIC listing:
related links: http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2002/04/11/080137a&tgt=_top | http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2002/03/07/120706a&tgt=_top
-- recc. Comp.realtime FAQ http://www.faqs.org/faqs/realtime-computing/faq/ [FIXME: books to read] http://groups.google.com/groups?group=comp.realtime

Caxton Foster's _Real-Time Programming: Neglected Topics_, despite the title, is a very good introduction to the basic topics of real-time control, starting with simple things like interrupts and debouncing switches, all the way through digital filters. It's a thin paperback (Addison Wesley MicroBooks), and a (somewhat) experienced programmer can get through it in a couple of days.
GIE is automatically disabled when the PIC takes the interrupt. GIE will be re-enabled by RETFIE at the end of the interrupt. You do **NOT** want GIE on in the interrupt routine. This can cause an interrupt within an interrupt. You would have to go thru a lot of trouble to make sure your interrupt routine is re-entrant to handle this case. Think of what your interrupt routine will do to the saved state of the first interrupt if a second one comes along before the first completes. In theory you could make your interrupt routine re-entrant, but I think this is extremely unlikely to be a benefit in practise. Also note that extra call stack locations are used up globally if you allow nested interrupts.
-- Olin Lathrop http://www.embedinc.com http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2001/04/26/080610a&tgt=_top
--
1st of all, if the ONLY interrupt u need enabled is RB0 then

    INTCON = 0x90; //enables global and RB0 interrupts

2ndly, ur interrupt routine should be set up this way:

    static void interrupt isr(void){
        //RB0 interrupt
        if (INTF){
            // do something
            INTF = 0; //clear RB0 interrupt flag
        }
    }

FYI, u can only have one interrupt routine which means all ur interrupts should be handled within that one routine! so for example if u needed to have TMR0 interrupt enabled as well as RB0, then the above interrupt routine would look like this:

    static void interrupt isr(void){
        //TMR0 interrupt
        if (T0IF){
            // do something
            T0IF = 0; //clear TMR0 interrupt flag
        }
        // RB0 interrupt
        if (INTF){
            // do something
            INTF = 0; //clear RB0 interrupt flag
        }
    }

hope this helps. BTW, if u needed to have TMR0 interrupt enabled, then u would need to set the value in the OPTION register appropriately. this might be a good time to have a look at the datasheet for the PIC in use. seyi
-- BY: Oluseyi Odeinde http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2001/04/02/084836a&tgt=_top
According to the Microchip PIC16CXXX MCU family documentation, http://www.microchip.com/1000/suppdoc/refernce/midrange/midsect/758/index.htm | http://www.microchip.com/download/lit/suppdoc/refernce/midrange/midsect/31008a.pdf
Section 8. Interrupts
...
The Global Interrupt Enable bit, GIE (INTCON<7>), enables (if set) all un-masked interrupts or disables (if cleared) all interrupts. Individual interrupts can be disabled through their corresponding enable bits in the INTCON register. The GIE bit is cleared on reset. The return from interrupt instruction, RETFIE, exits the interrupt routine as well as sets the GIE bit, which allows any pending interrupt to execute. ...
When an interrupt is responded to, the GIE bit is cleared to disable any further interrupt, the return address is pushed into the stack and the PC is loaded with 0004h. Once in the interrupt service routine the source(s) of the interrupt can be determined by polling the interrupt flag bits. Generally the interrupt flag bit(s) must be cleared in software before re-enabling the global interrupt to avoid recursive interrupts.
Individual interrupt flag bits are set regardless of the status of their corresponding mask bit or the GIE bit.
When an instruction that clears the GIE bit is executed, any interrupts that were pending for execution in the next cycle are ignored. The CPU will execute a NOP in the cycle immediately following the instruction which clears the GIE bit. The interrupts which were ignored are still pending to be serviced when the GIE bit is set again.
...
8.2.1 INTCON Register
Interrupt flag bits get set when an interrupt condition occurs regardless of the state of its corresponding enable bit or the global enable bit, GIE (INTCON<7>). This feature allows for software polling.
...
8.2.2 PIE Register(s)
... Peripheral Interrupt Enable registers (PIE1, PIE2). These registers contain the individual enable bits for the peripheral interrupts. These registers will be generically referred to as PIE. If the device has a PIE register, the PEIE bit must be set to enable any of these peripheral interrupts. Note: Bit PEIE (INTCON<6>) must be set to enable any of the peripheral interrupts.
...
8.2.3 PIR Register(s)
... Peripheral Interrupt Flag registers (PIR1, PIR2). These registers contain the individual flag bits for the peripheral interrupts. These registers will be generically referred to as PIR. Note 1: Interrupt flag bits get set when an interrupt condition occurs regardless of the state of its corresponding enable bit or the global enable bit, GIE (INTCON<7>). Note 2: User software should ensure the appropriate interrupt flag bits are cleared (by software) prior to enabling an interrupt, and after servicing that interrupt.
...
8.5 Context Saving During Interrupts

During an interrupt, only the return PC value is saved on the stack. Typically, users may wish to save key registers during an interrupt, e.g. the W register and STATUS register. This has to be implemented in software.
... Example 8-1: Saving the STATUS and W Registers in RAM (for Devices with Common RAM) Example 8-2: Saving the STATUS and W Registers in RAM (for Devices without Common RAM)
Example 8-5: Register Saving / Restoring as Macros

PUSH_MACRO MACRO            ; This Macro Saves register contents
    MOVWF   W_TEMP          ; Copy W to a Temporary Register
                            ; regardless of current bank
    SWAPF   STATUS,W        ; Swap STATUS nibbles and place into W register
    MOVWF   STATUS_TEMP     ; Save STATUS to a Temporary register in Bank0
    ENDM                    ; End this Macro

POP_MACRO MACRO             ; This Macro Restores register contents
    SWAPF   STATUS_TEMP,W   ; Swap original STATUS register value
                            ; into W (restores original bank)
    MOVWF   STATUS          ; Restore STATUS register from W register
    SWAPF   W_TEMP,F        ; Swap W_Temp nibbles and return value to W_Temp
    SWAPF   W_TEMP,W        ; Swap W_Temp to W to restore original
                            ; W value without affecting STATUS
    ENDM                    ; End this Macro

;1. Store the W register, regardless of current bank.
;   For devices with or without Common RAM,
;   the user register, W_TEMP, must be defined across all banks and must
;   be defined at the same offset from the bank base address
;   (i.e., W_TEMP is defined at 0x70 - 0x7F in Bank0).
;   The user register, STATUS_TEMP, must be defined in Bank0.
;   Within the 70h - 7Fh range (Bank0),
;   wherever W_TEMP is expected, the corresponding locations
;   in the other banks should be dedicated for the possible saving of the W register.
    MOVWF   W_TEMP          ; Copy W to a Temporary Register [FIXME: why not use SWAPF ?]

;2. Store the STATUS register in Bank0.
    SWAPF   STATUS,W        ; (use SWAPF instead of MOVFW
                            ; so they can be restored without changing any status bits)
    BCF     STATUS,RP0      ; Change to Bank0 regardless of current bank
    MOVWF   STATUS_TEMP     ; In this example STATUS_TEMP is in Bank0.

;3. Execute the Interrupt Service Routine (ISR) code.
    :
    : (Interrupt Service Routine (ISR))
    :
;4. Restore the STATUS register (and bank select bit).
    SWAPF   STATUS_TEMP,W   ; Swap original STATUS register value
                            ; into W (restores original bank)
    MOVWF   STATUS          ; Restore STATUS register from W register

;5. Restore the W register.
    SWAPF   W_TEMP,F        ; Swap W_Temp nibbles and return value to W_Temp
    SWAPF   W_TEMP,W        ; Swap W_Temp to W to restore original
                            ; W value without affecting STATUS

;6. Return from interrupt (RETFIE).

Example 8-6: Source File Template
Example 8-7: Typical Interrupt Service Routine (ISR)

        LIST    p = p16C77      ; List Directive
        ; Revision History
        #INCLUDE <P16C77.INC>   ; Microchip Device Header File
        #INCLUDE <MY_STD.MAC>   ; Include my standard macros
        #INCLUDE <APP.MAC>      ; File which includes macros specific to this application

        ; Specify Device Configuration Bits
        __CONFIG _XT_OSC & _PWRTE_ON & _BODEN_OFF & _CP_OFF & _WDT_ON

        org     0x00            ; Start of Program Memory
RESET_ADDR:                     ; First instruction to execute after a reset
        end

        org     ISR_ADDR
        PUSH_MACRO              ; MACRO that saves required context registers, or in-line code
        CLRF    STATUS          ; Bank0
        BTFSC   PIR1, TMR1IF    ; Timer1 overflow interrupt?
        GOTO    T1_INT          ; YES
        BTFSC   PIR1, ADIF      ; NO, A/D interrupt?
        GOTO    AD_INT          ; YES, do A/D thing
        :                       ; NO, do this for all sources
        :
        BTFSC   PIR1, LCDIF     ; NO, LCD interrupt?
        GOTO    LCD_INT         ; YES, do LCD thing
        BTFSC   INTCON, RBIF    ; NO, Change on PORTB interrupt?
        GOTO    PORTB_INT       ; YES, Do PortB Change thing
INT_ERROR_LP1                   ; NO, do error recovery
        GOTO    INT_ERROR_LP1   ; This is the trap if you enter the ISR
                                ; but there were no expected interrupts
T1_INT                          ; Routine when the Timer1 overflows
        :
        BCF     PIR1, TMR1IF    ; Clear the Timer1 overflow interrupt flag
        GOTO    END_ISR         ; Ready to leave ISR (for this request)
AD_INT                          ; Routine when the A/D completes
        :
        BCF     PIR1, ADIF      ; Clear the A/D interrupt flag
        GOTO    END_ISR         ; Ready to leave ISR (for this request)
LCD_INT                         ; Routine when the LCD Frame begins
        :
        BCF     PIR1, LCDIF     ; Clear the LCD interrupt flag
        GOTO    END_ISR         ; Ready to leave ISR (for this request)
PORTB_INT                       ; Routine when PortB has a change
        :
END_ISR
        POP_MACRO               ; MACRO that restores required registers, or in-line code
        RETFIE                  ; Return and enable interrupts

Part of the function of the RETFIE instruction is to set the GIE bit = 1. It is exactly the same as the RETURN instruction except for that one feature. Example 8-3 stores and restores the STATUS and W registers for devices with general purpose RAM only in Bank0 (such as the PIC16C620).
...
Question 2: My system seems to lock up. Answer 2: If interrupts are being used, ensure that the interrupt flag is cleared after servicing that interrupt (but before executing the RETFIE instruction). If the interrupt flag remains set when the RETFIE instruction is executed, program execution immediately returns to the interrupt vector, since there is an outstanding enabled interrupt.
...
-- The PIC will automatically disable GIE during the interrupt handler and reenable it during the RETFIE. It _is_ possible to deliberately reenable interrupts in your interrupt handler but this is pretty ugly (you have to have multiple save areas and decide which one applies to a given invocation of the interrupt). Here is pseudocode for a reasonable way to go at it if you are expecting a lot of interrupts. This avoids repeatedly handling interrupts and saving/restoring context:
interrupt_handler:
    save context
check_next:
    if RB0/INT interrupt
        clear RB0/INT flag
        handle RB0/INT condition
        goto check_next
    endif
    if TMR interrupt
        clear TMR interrupt flag
        handle TMR condition
        goto check_next
    endif
    restore context
    RETFIE
Bob Ammerman RAm Systems -- http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2001/04/26/060855a&tgt=_top
The 16f84 datasheet http://www.microchip.com/1000/pline/picmicro/category/digictrl/8kbytes/devices/16f84a/5432/index.htm | http://www.microchip.com/download/lit/pline/picmicro/families/16f8x/35007b.pdf in section 4.2 PORTB and TRISB Registers has this little sentence:
A mismatch condition will continue to set flag bit RBIF. Reading PORTB will end the mismatch condition and allow flag bit RBIF to be cleared.
Peter Betts further clarifies:
--
On the 16F84, when the RB0/INT port triggers an interrupt, you have to make a read of PORTB and THEN clear the interrupt flag. I've done this inside the ISR and it's fine.

    movf    PORTB,w
    bcf     INTCON,RBIF     ; Clear pending interrupt flag

-- Peter Betts http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2001/02/22/022923a&tgt=_top
AN566: Using the Port B Interrupt on Change as an External Interrupt http://www.microchip.com/1000/suppdoc/appnote/all/an566/
ServiceIRQ source code http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2001/02/07/101658a&tgt=_top
-- ``Rate Monotonic Scheduling'' article by David B. Stewart and Michael Barr in _Embedded Systems Programming_ 2002-03

The RMA (rate monotonic algorithm) is a procedure for assigning fixed priorities to tasks ... A task set is considered schedulable if all tasks meet all deadlines all the time. The algorithm is simple:
Assign the priority of each task according to its period, so that the shorter the period the higher the priority.
RMA is the optimal fixed-priority algorithm. If a task set cannot be scheduled using the RMA algorithm, it cannot be scheduled using any fixed-priority algorithm.
...
Sometimes a particular set of tasks will have a total utilization above the worst-case schedulable bound and still be schedulable with fixed priorities. Schedulability then ... must be analyzed by hand. ...
Guidelines:
...
Always assign priorities according to RMA. Manually assigning fixed priorities will not give you a better solution.
If total utilization is less than or equal to ln(2) ~= 69.3% of the CPU, all tasks will meet all deadlines, so no additional work needs to be done.
...
To achieve 100% CPU utilization when using fixed priorities, assign periods so that all tasks are harmonic. This means that each task's period is an exact multiple of the period of every other task that has a shorter period.
... For example, a three-task set whose periods are 10, 20, and 40 ... is harmonic, and preferred over a task set with periods 10, 20, and 50.
See ``What is RMA?'' section of http://www.faqs.org/faqs/realtime-computing/faq/
[FIXME: does RMA belong here with interrupts, or over in #os ?]
(moved to forth.html )
benchmarks for computer architecture. CPU design goals. benchmarking.
While faster is better, all else being equal, many times all else is not equal. Speed is becoming less important in computer design (many tasks can already be completed ``fast enough''). Power usage, reliability, ease-of-use, and other factors are becoming more important. We must make sure to test for the things we really want creed.html#really_want , and not waste effort testing for things that are irrelevant.
What do we really want ? Here are a few of the design goals for various systems. Naturally, if your particular goals heavily emphasize one of these, it will require a different architecture than one that emphasizes a different goal.
CPU design goals: (see computer_architecture.html#CPU_goals )(2003-01-10)
Over at data_compression.html#benchmark I list benchmark test images for image compression.
assembling a quiet PC hardware_david_uses.html#quiet_pc
One might think that comparing 2 pieces of software (``which one compresses my files smaller?'') and comparing 2 CPUs (``which one executes my code faster / better / with less power?'') are completely unrelated. But there's one tiny bit of overlap: data_compression.html#program_compression indicates that while dealing with compressed data and compressed executable code is often slower than uncompressed, sometimes there's a synergy -- if we compress code enough (using techniques I talk about there) so it all fits in a higher-level cache, then our programs run faster (which is what I'm talking about here).
Over at bignums.html#cpu and at robot_links.html#ucontrollers and at vlsi.html#cRAM I list information about some historical CPUs. [FIXME: should I merge them ?]
Why are there not more ``reliability'' benchmarks ?
-- recc. David Ranch http://www.ecst.csuchico.edu/~dranch/LINUX/index-linux.html#trinityos

The Signal 11 FAQ: http://www.bitwizard.nl/sig11/ ... Whenever I build new computers... this is how I ALWAYS test the stability of the machine.
Every piece of hardware should get a fair review, showing its plusses and negatives in an equal light, with technical data and reproducible, reliable results to back that up.... links to several hardware review sites that do get it right ...
EEMBC, the Embedded Microprocessor Benchmark Consortium, develops and certifies real-world benchmarks and benchmark scores to help designers select the right embedded processors for their systems. http://www.eembc.org/
The SPEC Benchmarks http://home.earthlink.net/~mrob/pub/benchmarks/spec.html (the "SPECint" and "SPECfp" numbers)
Twenty leading microcontroller manufacturers have formed the EDN Embedded Microprocessor Benchmark Consortium (EEMBC--pronounced "embassy") http://archives.e-insite.net/eembc.htm . Markus Levy, EDN's Microprocessor and DSP editor, founded the group in an effort to finally kill Dhrystone MIPS as the universal microprocessor benchmark.
...
EEMBC will evaluate performance against such metrics as power consumption, code density, and compiler technology.
SPARK is a C-to-VHDL ... particularly targeted to control-intensive microprocessor functional blocks and multimedia and image processing applications. We have validated the effectiveness of our approach ... for large real-life applications such as the Instruction Length Decoder from the Intel Pentium and multimedia applications such as MPEG-1, MPEG-2 and the GIMP image processing tool.
SPARK takes behavioral ANSI-C code as input, schedules it using speculative code motions and loop transformations, runs an interconnect-minimizing resource binding pass and generates a finite state machine for the scheduled design graph. Finally, a backend code generation pass outputs synthesizable register-transfer level (RTL) VHDL. This VHDL can then be synthesized using logic synthesis tools into an ASIC or mapped onto an FPGA.
-- ``The Death of Hardware Engineering'' article by Jim Turley in _Embedded Systems Programming_ 2002-03 [FIXME: general design ?] [FIXME: C language flaws]... Just as assembly code gave way to higher-level languages, hardware engineers are gradually discovering the joys of high-level abstraction. The benefits are much the same, but so are the pitfalls and the battles. ...
Old-timers complained that compiled code could never be as fast, tight, or elegant as hand-written assembly code -- and they were absolutely right. But it didn't matter. ... compilers ... are more efficient in the one dimension that matters: programmer's time. ... compilers make [ more efficient use of the user's resources by making less efficient use of the computer's resources ]. They also have the commercially important side effect of allowing less experienced (not to say less talented) programmers to write useful and functional code.
... 10-million-gate chips with multiple embedded processors ... It's not an efficient use of the engineers' time to design a chip that big, gate by carefully crafted gate. ...
...
... designing hardware, per se, is not the real objective. The goal should be to create a system that performs a given function. ... railroads in the 1880s ... focused on optimizing locomotives and boxcars while their customers were enthusiastically ignoring them and embracing automobiles and, later, aircraft. ... the railroads were not in the railroad business; they were in the transportation business. ...
... Suddenly all C programmers (or at least, all Handel-C programmers) become honorary chip designers.
...
The circuits that Handel-C (or SystemC, et al.) produces would make a hardware engineer gag, but that's no different from most programmers' reactions to early C compilers. ... But you're not supposed to look at the output and you're not supposed to care. If ... the tool ... works, you should look to more important tasks or you defeat the purpose of the tool.
...
The general weakness that all these software-cum-hardware languages share is that conventional programming languages simply can't express parallelism. There's no way to show that two functions (hardware blocks) are supposed to operate simultaneously. The C language ... is innately serial and has no method for expressing parallelism or simultaneity. ...
In the end, the perverted software languages are bound to prevail. There are a lot more programmers in the world than hardware engineers, and the balance will probably tip still further in the future. Elegance of implementation has never triumphed over timesaving hacks.
A few unusual, interesting CPUs.
Ubicom's built-in multitasker with response time measured in nanoseconds. Ubicom says its IP3023 chip has a worst-case interrupt latency that's a thousand times shorter than that of VxWorks or Linux. Three orders of magnitude can make a big difference in how you plan your interrupt handlers. ... A PCI interface, for example, is a 200-instruction loop that takes up about one-tenth of the processor's horsepower.
... The processor switches threads on every cycle; every 4ns at 250MHz. Call it extreme time slicing; each task runs for exactly one instruction before being "switched out."
You organize tasks through a 64-entry task table. Each clock tick, the processor executes one instruction from the next task in the table. The task table acts as a kind of cache, giving the IP3023 visibility into the next 64 instructions. Unlike a cache, there's no risk of missing or mispredicting the next instruction, so performance is completely reliable and deterministic.
It's pretty easy to calculate how often a task needs to run because the Ubicom processor's computing is so predictable. ...
The IP3023 never disables interrupts because there are no interrupts to disable. You handle interrupts with another task that monitors one or more input pins (you get to decide how many). Thus, any pins you want can become interrupt inputs. Their latency is determined by how frequently you choose to monitor them; if you schedule your interrupt request task to run every eighth cycle, for instance, you're looking at a 32ns interrupt latency.
... runs at 250MHz ...
unsorted
Date: Wed, 25 Feb 1998 13:08:23 -0800 (PST) From: Mark Crosby <crosby_m@rocketmail.com> Subject: >H Autopoiesis Vs Autopotency To: transhuman@logrus.org MIME-Version: 1.0 Sender: owner-test-new@logrus.org Reply-To: transhuman@logrus.org Transhuman Mailing List Mitch Porter wrote: < 'Autopoiesis' and 'autopotency' are different concepts. An autopoietic system is basically a self-regenerating system [CUT] (I say 'basically', since Maturana & Varela's work, e.g. _Autopoiesis and cognition_, has philosophical subtleties which I do not fathom.) Whereas an autopotent system is one having "complete power and knowledge over itself" (see Anders' lexicon). > Yes, they are quite different concepts. However, autopoiesis is a highly-developed, formal approach to theoretical biology that is probably very relevant to any automorphing and autopotency technologies that we may wish to develop. (Developing some of these connections, and differences, would be a good paper for The Journal of Transhumanism ;) George Kampis is the chairman of the Department of History and Philosophy of Science at Eötvös Loránd University, Budapest, Hungary. While he is not strictly an autopoiesis theorist, he has adapted many of Maturana & Varela's principles in his Alife research, which has focussed on self-modification in neural networks. A very interesting (10-page) paper by Kampis provides an overview of these concepts: "Computability, Self-Reference, and Self-Amendment" is available at http://www.c3.lanl.gov/~rocha/kampis.html < Abstract: There exist theories of cognition that assume the importance of self-referentiality and/or self-modification. We argue for the necessity of such considerations. We discuss basic concepts of self-reference and self-amendment, as well as their relationship to each other. Self-modification will be suggested to involve non-algorithmic mechanisms, and it will be developed as a primary concept from which self-reference derives. 
A biologically motivated mechanism for achieving both phenomena is outlined. Problems of computability are briefly discussed in connection with the definability and describability of self-modifying systems. Finally, the relevance of these problems to applications in semantic problems of cognition is shown. > Mark Crosby
From: jpd@isoroku.mit.edu (John Doty)
Subject: Re: Designing Hardware For Software Systems
Date: 16 Sep 1998 00:00:00 GMT
Organization: Massachvsetts Institvte of Technology
Newsgroups: comp.realtime

In article <6tmu41$odr@gcsin3.geccs.gecm.com> walter.mallory@gecm.com (Walter Mallory) writes:

One of our digital engineers came up with a useful trick a few years ago. He was designing an interface that would multiplex several ADC outputs into a DSP input port, using an FPGA. The CAD system provides a way to generate a stimulus on an interface pin to test a design in simulation, but the engineer decided that it was easier to draw a circuit that imitated the ADC than it was to write a stimulus specification. His pseudo-ADC contained a counter, so it generated successive samples of 0, 1, 2, ... This was fine for testing the design in simulation. However, when he decided to make a real chip, he realized he had enough left over capacity to incorporate his test circuit in the chip. So now, we have an interface chip that can easily be put into a test mode that can test the data paths downstream of the ADC. This was especially valuable because the scientific instrument on the other side of the ADC doesn't work properly unless it is cooled to cryogenic temperatures. Even when it's working properly, 99% of the data is random noise (and the question comes up, "is the random noise we're seeing the *right* random noise" :-). Having an easy way to run a nonrandom pattern through the system is thus very handy. Since then, we've put pattern generators on most of our interfaces.

-- John Doty "You can't confuse me, that's my job." jpd@space.mit.edu
Products requirements design ---> Products requirements documentation
    |
    v
ASIC requirements design ---> ASIC requirements documentation
    |
    v
ASIC specification design
    |
    v
ASIC development strategy
    |
    v
ASIC architecture
    |
    v
ASIC top-level design
    |
    v
ASIC hierarchical decomposition
    |
    v
Module-level ASIC design <------+
    |                           | (iterate)
    v                           |
Module-level ASIC verification--+
    |
    v
Top-level ASIC composition <----+
    |                           | (iterate)
    v                           |
Top-level ASIC verification ----+
    |
    v    <--- Constraints, Target Libraries
ASIC synthesis ---> Gate-level netlist
    |
    v
Preroute gate-level simulation
    |
    v
Place and route ---> Back-annotated timing
    |
    v
Static Timing Analysis
    |
    v
Gate-level simulation (with annotated timing)
    |
    v
Release ASIC to board.
--
From: Dave Gillespie
Date: Fri, 29 Oct 1993 22:02:44 GMT
Keywords: C, optimize, assembler
Newsgroups: comp.compilers

CISC can refer to "complex instruction sets" or "sets of complex instructions." People tend to think in terms of the former as the downfall of CISC, but actually it's the latter that brings on the microcoding and the slow cycle times. The 860 is a step in the direction of RISC processors that still have large, rich instruction sets. I'd like to see more movement in this direction---why not have conditional move, *and* min and max, *and* absolute value, and so on. Many of the things you want are FP instructions, which even tend not to be starved for bits in the opcode. And compilers would have no trouble using these instructions, too.
From: f95mabr@dd.chalmers.se
To: f-cpu@egroups.com
Date: Fri, 19 Mar 1999 17:41:24 -0000

... Alpha Architecture Handbook section 1.4: they have only four different types of instruction structure:

 31-26 25-21 20-16 15 -- 5 4 - 0
+-----+-----+-----+-------+-----+
| OP  | RA  | RB  | Func. | RC  |  Operate Format
+-----+-----+-----+-------+-----+
| OP  | RA  | RB  |     Disp    |  Memory Format
+-----+-----+-----+-------------+
| OP  | RA  |        Disp       |  Branch Format
+-----+-----+-------------------+
| OP  |          Number         |  PALcode Format
+-----+-------------------------+

... Mathias
From: kenl@compassnet.com.au (Ken Lee)
Subject: Re: How many different processors do you use?
Date: 04 Jun 1999 00:00:00 GMT
Newsgroups: comp.arch.embedded

Logically, the main motivation for changing a processor design is based on either:
1. Performance - the current design can no longer cut it.
2. Price - for a given quantity level there is some cost savings in switching to another processor, even considering the amortisation of the design costs.
... there is a 3rd reason, but it is rarely exercised:
3. Boredom - the engineer did it for the hell of it; there is no visible added value.
Often management considers that everything can be fixed in software, and so an original succinct processor design becomes something like an 8051 with megabytes of banked memory. In the end, the engineering time spent making an inappropriate design jump through hoops could have been better used in a new design with a new, more suitable processor. I would consider that one of the main criteria for selecting a processor is its suitability for the job at hand. Otherwise, using an inappropriate processor will move the costs from the hardware design to the software design. Ken.
http://www.deja.com/[ST_rn=ps]/threadmsg_if.xp?AN=485567505
From: wardrg@my-deja.com
Subject: FIREFLY Embeddable MicroController Cores
Date: 13 Aug 1999 00:00:00 GMT
Organization: Deja.com - Share what you know. Learn what you don't.
Newsgroups: comp.arch.fpga

FIREFLY Embeddable MicroController Cores

At Mitel Semiconductor, we have combined our advanced CMOS ASIC technology with our MicroController design expertise to produce a unique Embedded MicroController ASIC capability. We have given the name Firefly to a set of embeddable MicroController cores developed for use in these ASICs. Firefly cores use the ARM7TDMI ('Thumb') processor core - a RISC processor core especially suitable for ASIC applications in wired and wireless communications and networking. Thumb is licensed by Mitel from ARM Ltd, and is offered as part of the Mitel SystemBuilder(TM) library of fully-supported embeddable macrofunctions. Mitel has enhanced the ARM core by adding a number of the sort of peripherals everyone always needs, like a memory interface, an interrupt controller, UART and timers, and turned it all into a single complex microcontroller macrocell, the Firefly core. The first in the series of Firefly cores is available in silicon and is accompanied by extensive support. Firefly ASICs bring all this practical design experience to customers needing special-purpose microcontrollers. Mitel's Embedded MicroController ASIC architecture has been specially developed to make it easy to integrate customers' own logic with a Firefly core, including memory and other standard functions. A comprehensive design flow assists customers in achieving right-first-time silicon for complex SLI designs, along with many other benefits. For more information regarding FIREFLY please visit - http://www.mitelse
Subject: Re: A Language Targeted at Microcontrollers From: "Stephen Pelc" Date: 2000/01/07 Newsgroups: comp.arch.embedded Gary Drummond wrote in message <387595C8.E53E1007@worldinter.net>... >Forth might very well be an answer to your question, except for very >time-critical applications, which a compiler for Forth might cure. >The fact that the "CS" community doesn't approve of it doesn't >mean much-COBOL isn't taught in universities any more, but >there are waiting lists for classes in the community >colleges. It doesn't take too many over-budget, two years late, >conversions to C, to wake up the business world. Modern Forth cross compilers, such as our Forth 6 VFX compilers (see website for details), produce optimised native code that is as fast as C. The code density is still very good, and on our 68HC12 compiler, the VFX code is smaller than the code from threaded systems. All the interactivity of Forth is available in the usual way. All our systems come with multitaskers. Many of our customers who use multiple programming languages believe that Forth is a better language for embedded systems than C. We even have a C to Forth translator so that you can reuse your existing C code. -- Stephen Pelc, sfp@mpeltd.demon.co.uk MicroProcessor Engineering Ltd - More Real, Less Time 133 Hill Lane, Southampton SO15 5AF, England tel: +44 (0)2380 631441, fax: +44 (0)2380 339691 web: http://www.mpeltd.demon.co.uk
Cray is not amused. "I have really strong feelings about that," he said. "I feel the bigger the group that works on the project, the lower the chances for success. I'm appalled at our trying to make a country-wide coordinated effort. I just can't imagine it ever being successful.
"I believe you want a lot of independent people thinking their own thoughts and trying their own things. We're not going to participate in any national effort, and I don't want any money from the government. We've got competition within the company. I've got a group here five miles away who I know are trying to outdo me."
[DAV: much stuff snipped from this long post]
------------------------------
Date: Tue, 20 Jun 2000 23:15:04 -0700
From: Jeff Fox
To: misc
Subject: CPUs and Forth

... I know what you mean. I have confidence that people are still capable of being smarter than their PCs at this stage of technology and don't really need to adopt the stance that they need a compiler that is smarter than they are. ... I really really wish people could stay away from the religion, high priest, and cult bullshit. I am happy to discuss technology, science, hardware, programming, etc. but I get very very tired of people trying to paint this as religion. ... The thing I find most insulting about it is that it denies that we have any real religious belief. It seems to deny that we have any concept of what religion is. As an ordained minister and as someone who has done meditation for over forty years I am amazed at how many people write things about me telling other people about my religion. ... Jeff Fox
------------------------------
    STORE R1, dest
    LOAD  R1, source
    // now dest contains original contents of R1

could be re-arranged to

    LOAD  R1, source
    STORE R1, dest
    // in ordinary architectures, dest contains data from source -- but
    // with proposed "load delay slot" architecture,
    // dest contains original contents of R1.

(I think some compilers already think about this to avoid pipeline stalls...)
some considerations
Turley's Law (my own humble contribution to journalists' amusement) says that the amount of processing power you carry on your body doubles every two years. Pat yourself down and see if you're not already carrying more computing horsepower than NASA left on the surface of the moon.
DAV: This seems to unfairly slam asynchronous designs. [vlsi][low power][asynchronous]asynchronous circuits ... Though the asynchronous paradigm appears on the surface to fit the requirement in that it removes the need for the globally distributed clock signal, the learning curve in migrating to this paradigm is simply too steep for the world to embrace at this time.
...
Designing better tools is a conceptually simple solution to propose, and a difficult one to implement.
...
A short summary of the effect of increasing wiring delays on the overall delay equation is that the cost of data transportation will continue to increase relative to the cost of data computation. It may become cheaper to recompute frequently used values rather than precomputing them and then attempting to bring them on chip when they are needed. Already, the looming change in the delay equation has prompted calls to return to some old ideas that had been deemed obsolete. Some have proposed that macro instructions be set up, so that frequently used sections of code may be kept on chip and need not be retrieved from memory every time. The idea of using macro instructions to minimize instruction bandwidth requirements invokes the memory of the venerable x86 processor's use of microcoded instructions for such things as string manipulations. Reduced Instruction Set Computing gained favor with the mantra that simplifying the hardware makes it easier to design and able to achieve a higher clock frequency, at the slight expense of increased memory storage and bandwidth requirements compared to the classical CISC processors of its day. It is therefore ironic that, with the cost of data computation dropping relative to the cost of data transportation, Complex Instruction Set Computing appears to be making a mild comeback via the idea of macro instructions. The future and the past of computer architecture may yet cross paths, and complex instructions which save instruction bandwidth may yet again be important.
...
This was a typical Richard Feynman explanation. On the one hand, it infuriated the experts who had worked on the problem because it neglected to even mention all of the clever problems that they had solved. On the other hand, it delighted the listeners since they could walk away from it with a real understanding of the phenomenon and how it was connected to physical reality.
We tried to take advantage of Richard's talent for clarity by getting him to critique the technical presentations that we made in our product introductions. Before the commercial announcement of the Connection Machine CM-1 and all of our future products, Richard would give a sentence-by-sentence critique of the planned presentation. "Don't say `reflected acoustic wave.' Say echo." ...
...
... building a big computer is a good excuse to talk to people who are working on some of the most exciting problems in science. We started working with physicists, astronomers, geologists, biologists, chemists -- every one of them trying to solve some problem that it had never been possible to solve before. ...
For Richard, figuring out these problems was a kind of a game. He always started by asking very basic questions like, "What is the simplest example?" or "How can you tell if the answer is right?" He asked questions until he reduced the problem to some essential puzzle that he thought he would be able to solve. Then he would set to work, scribbling on a pad of paper and staring at the results. ... Eventually he would either decide the problem was too hard (in which case he lost interest), or he would find a solution (in which case he spent the next day or two explaining it to anyone who listened). In this way he worked on problems in database searches, geophysical modeling, protein folding, analyzing images, and reading insurance forms.
...
... we never really knew what we were doing. But the things that we studied were so new that no one else knew exactly what they were doing either. It was amateurs who made the progress.
... The act of discovery was not complete for him until he had taught it to someone else.
...
run the MAC OS on my PC http://www.sysopt.com/reviews/macemu/
... The Earth Simulator is capable of 35 teraflops, or 35 million million calculations per second.
It trounces ASCI White's 7 teraflops ...
Hans Werner Meuer, founder of the top 500 list of world supercomputers, says he expects the Earth Simulator to outperform all of its nearest 19 rivals put together. ...
symmetric multiprocessing (SMP) ... simultaneous multithreading (SMT) ...
includes
... The IBM 360 machine, created in 1964, was probably the first major microcoded processor system and was built out of the need for flexibility. ...
... hardwired approaches ... when used with non complex instructions like RISC ... take up far less decoding area [than] a microprocessor that uses microcoded control.
Koopman (1987) ... suggests that as technology improves and microcoded memory access times and speed of processing increases [compared to main memory RAM access times], the speed advantages of hardwired systems decrease.
...
on-chip caches
additional functional units for superscalar execution
additional "non-RISC" (but fast) instructions
...
branch prediction
...
The microcode update is volatile and needs to be uploaded on each system boot. I.e. it doesn't reflash your cpu permanently. Reboot and it reverts back to the old microcode.'' Is this really true ? [FIXME: if it's really true, email Koopman]
"On a chip, regular parallel structures can be very dense compared to random control logic." -- Pascal Dornier
BOAR provides a reprogrammable, high-performance platform for prototyping DSP applications. Its main use is to provide an environment to test and prototype VHDL ASIC designs on real hardware. ... BOAR supports hardware/software codesign.
points to Hedgehog - A younger brother for BOAR http://www.sci.fi/~cubase/board.html by Pekka Martikainen
Top 10 List: Ways to Simplify Programming by Mike Elola http://www.amresearch.com/10_ways.html
Xilinx, Inc. (NASDAQ:XLNX) and Insight Electronics introduced today the industry's first User Datagram Protocol (UDP) stack core optimized for an FPGA. The UDP stack core provides a cost-effective solution for transmitting voice and data over the Internet. It can be used to create low cost consumer products such as Voice over Internet Protocol (VoIP) phones, Internet intercoms, remote security monitors, Internet-based voice recorders, and simple thin clients for home networking applications.
The UDP stack core has been used to create a low cost consumer VoIP reference design based on a single very low-cost Spartan-II FPGA. ...
... Xilinx invented the FPGA and fulfills more than half of the world demand for these devices today. ...
http://slashdot.org/articles/03/02/01/2035204.shtml?tid=127&tid=137&tid=164 [FIXME: read]
... do not see the C-One as a Commodore 64 replica. It's a giant leap in computer technology, having the opportunity to change the behaviour of the hardware on the fly ...
The C-One aims at those who are into computer nostalgia, as well as those who want it for educational purpose. We'll supply all kinds of material for you to start VHDL programming, and instantly try it out on this board. Start modifying the board without soldering, extend the capabilities of your video output, or even switch to a completely different computer on the fly.
This computer is not for the usual point-and-click user. It's going back to the times where each and every bit of the machine was documented, and forward to a new kind of computer technology: Re-configurable hardware.
...
By embracing the inevitability of system failures, recovery-oriented computing returns service faster ...
... our team concentrates on designing systems that recover rapidly when mishaps do occur. ...
esim: A Structural Design Language for Computer Architecture Education ... The compiler and simulator for the language are freely distributable. http://www.cse.ucsc.edu/~elm/Software/Esim/ (apparently a simplified subset of VHDL ?) [FIXME: crosslink to schematic.html#digital ]
the first machine that had all the components now classically regarded as characteristic of the basic computer. Most importantly it was the first computer that could store not only data but any (short!) user program in electronic memory and process it at electronic speed.
http://www.computer50.org/ [FIXME: toread. Any clever ideas here waiting to be rediscovered ?]
Subject: CPU design
http://www.emulators.com/pentium4.htm "it takes an Intel or an AMD or a Motorola a good 3 to 5 years to design a new processor architecture." also lists and has hard benchmarks of several chips, including the Transmeta Crusoe chip.
"the fatal design flaws of the Pentium 4" http://www.emulators.com/pentium4.htm :
MISTAKE #1 - Small L1 data cache "The Pentium 4 has a grossly under-sized 8K L1 data cache."
MISTAKE #2 - No L3 cache "How much is 256K or 2M? Well, that's about the typical size of an uncompressed bitmap. It's the reason a Power Mac G4 running Photoshop kills a typical Pentium III running Photoshop."
MISTAKE #4 - Trace cache throughput too low "Together, these execution units can in theory process 9 micro-ops per clock cycle - 4 simple integer operations, 1 integer shift/rotate, a read and write to memory, a floating point operation, and an MMX operation. Sounds pretty sweet, except for the problem that the trace cache feeds only 3 micro-ops at a time!"
MISTAKE #6 - Shifts and rotates are slow
MISTAKE #8 - Instructions take more clock cycles to complete "This is not so much a specific mistake as it is an overall side effect of the first 7 idiotic mistakes." "It reverts to 10 year old techniques which Intel abandoned and apparently forgot why."
http://www.techextreme.com/display.asp?ID=290&Page=1 claims that the above article is missing the point. "Remember, out-of-order execution does not only mean that the individual instructions are executed out-of-order, but parts of the instructions (the micro-ops) are as well." -- http://www.emulators.com/pentium4.htm
New TASKING TriCore toolset from Altium uses Code-Efficient Viper compiler technology
Altium Limited has announced the release of a new TASKING embedded software development toolset for the TriCore architecture that has been enhanced with Tasking's Viper compiler technology. Altium benchmarks show faster execution speed and code size decreases of an average of 10% when compared to the previous TASKING TriCore toolset. The new toolset is compatible with all leading third-party TriCore products, such as emulators and RTOSs.
May 14, 2003
[DAV: I wish for more details ... so I could add these ideas to a list of general ideas for improving code size] -- http://microcontroller.com/news/bits_and_bytes.asp
CodeForce (peter_de_heer) has a very strong opinion on the changes he wants: http://www.aceshardware.com/forum?read=105039062
On the other hand, "rational" makes a good argument that "The ISA just doesn't matter" http://www.aceshardware.com/forum?read=105038773 but Paul DeMone disputes every point.
"cheesemower" claims "all successful ISAs get Byzantine after a couple of years."
The x86 isn't all that complex -- it just doesn't make a lot of sense.
--
Mike Johnson,
Leader of 80x86 Design at AMD,
Microprocessor Report (1994)
his kids ... are, even as I write, editing home movies (and adding daft sound tracks, I shouldn't wonder).
And the point of all this? Simple: everything my friend's kids are doing is computer-intensive in that it requires fast, powerful processors together with lots of RAM and big disks. But they don't see it as computing. To them it's just record production, image manipulation or video editing. We are looking at technology's version of the old principle that work expands to fill the space available. And that is what explains why Dixons sell - and we buy - those absurdly powerful machines.
... "the Reconfigurable Cell (RC) array. It has 64 reconfigurable cells arranged in an 8 by 8 array. Each cell has an ALU/MAC unit and a register file." http://www.eng.uci.edu/morphosys/rcarray.htmlMorphoSys is a Reconfigurable Computer Architecture that is composed of a software programmable processing unit called TinyRISC and a reconfigurable hardware unit called RC Array. MorphoSys is currently running at 450MHz, using 0.13-micron technology and with an area of 8x8mm or 16x16mm with onchip memory.
"GNUPro Tools for embedded systems" http://www.redhat.com/docs/manuals/gnupro/GNUPro-Toolkit-99r1/pdf/6_embed.pdf tells how to set up Cygwin, and discusses development for ARM7, Hitachi H8, LSI TinyRISC, Matsushita MN10200, MIPS VR4100, MIPS VR5xxx, Mitsubishi D10V, Motorola M68K, NEC V850, PowerPC, SPARC, SPARClite, Toshiba TX39,
I'm not quite sure why http://npu2.npu.edu/sloc/notes/computer_architecture.html has mirrored this computer architecture page ...
This page started 1998-04-21 and has backlinks
David Cary
Return to index // end http://david.carybros.com/html/computer_architecture.html