updated 2009-04-29.
Some notes on Computer Architecture. Very incomplete.
Contents:
[FIXME: should I put links to Beowulf here ?]
David also maintains related files:
``I wisdom dwell with prudence, and find out knowledge of witty inventions.'' -- Proverbs 8:12
computer architecture news
The latest computer architecture information is at
news:comp.arch.hobbyist mirror http://groups.google.com/groups?group=comp.arch.hobbyist and
news:comp.arch.embedded mirror http://groups.google.com/groups?group=comp.arch.embedded
The LEON core is a SPARC* compatible integer unit developed for future space missions. It has been implemented as a highly configurable, synthesisable VHDL model. To promote the SPARC standard and enable development of system-on-a-chip (SOC) devices using SPARC cores, the European Space Agency is making the full source code freely available under the GNU LGPL license.
The LEON core has been extensively tested against the SPARC V8 architecture manual and the IEEE-P1754 (SPARC) standard, but has not been formally tested and certified by SPARC International as being SPARC V8 compliant. ...
DO YOU WANT TO HELP? If you wish to contribute to LEON and work on (or donate) any of these modules, please contact Jiri Gaisler.
X-URL: http://www.egroups.com/list/f-cpu/
Date: Tue, 12 Jan 1999 21:47:57 -0500 (EST)
From: Robert Dale <rob at nb.net>
To: F-CPU <f-cpu at egroups.com>
Subject: [f-cpu] CVS server

We have anonymous (read-only) CVS access to AlphaRISC's TTA sim he kindly wrote.

$ export CVSROOT=:pserver:anonymous@www.deepfreeze.org:/home/src/f-cpu
$ cvs login
(Logging in to anonymous@www.deepfreeze.org)
CVS password: <hit Enter>
$ cvs co f-cpu

If you need help and/or want write access, feel free to email me. We'll probably put some help documentation on the website. ;)
-- Robert Dale
is one of the architectures suggested for the F-CPU. Its major advantage: gcc has already been ported to it, so we don't need to write a new compiler on top of everything else. We can immediately compile the Linux kernel and other code for this chip *now*, before we even start doing anything else, which should make cache analysis and other trade-off analysis easy. Also, it means that when the hardware is turned on for the first time, it could boot directly into Linux, which would be cool.
(Can we really go to 64 bit architecture and still use this compiler ?)
(Can we really use a TTA architecture for the attached FPUs ?)
is one of the architectures suggested for the F-CPU.
// $X = ($Y ^ rM) v ($Z|Z ^ ~rM).
MUX $X, $Y, $Z|Z

// sequence equivalent to MUX $X, $Y, $Z|Z
AND $t1, $Y, $rM
ORN $t2, $rM, $Z|Z
ORN $X, $t1, $t2

// sequence equivalent to MUX $X, $Y, $Z|Z
NAND $t1, $Y, $rM
ORN $t2, $rM, $Z|Z
NAND $X, $t1, $t2

// special peephole optimization for MUX $X, $Y, 0
AND $X, $Y, $rM

// special peephole optimization for MUX $X, $Y, $Z
// where $Z contains all ones:
ORN $X, $Y, $rM
// simulate MXOR $X, $Y, $Z|Z
NOT $znot, $Z|Z
NOT $ynot, $Y
MOR $t3, $Y, $znot
MOR $t4, $ynot, $Z|Z
AND $X, $t3, $t4
NOT $X, $Z|Z // set $X = ~$Z, i.e., bitcomplement; i.e., flip all the bits

NOT can often be merged with previous or following bit instructions by replacing AND, OR with NAND, NOR, ANDN, ORN. When it can't, the assembler can use any of these equivalent instructions:
// When Z is a register, either one of these instructions
// bitcomplements it.
NOR $X, $Z, 0
ANDN $X, $Z, 0

// When Z is a literal in the range -1 <= Z < #FF,
// we can complement it in one instruction
// at compile time.
// If Z is outside this range,
// just load ~Z directly into a register.
SUB $X, zero, (1+Z) // sets $X = ~Z for -1 <= Z <= #FE.
ANDN $X, zero, Z // sets $X = ~Z for 0 <= Z <= #FF.
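A quick Python sanity check of the MUX-equivalent sequences and peephole tricks above, treating registers as 8-bit integers. The operand convention assumed here is that ORN inverts its second source operand (ORN(a,b) = a | ~b); the ANDN variants depend on which operand ANDN inverts, so they are left unchecked.

```python
# Checks that the AND/ORN/ORN and NAND/ORN/NAND sequences really equal
# MUX, and that the complement tricks work, on 8-bit register values.
MASK = 0xFF

def AND(a, b):  return (a & b) & MASK
def NAND(a, b): return ~(a & b) & MASK
def ORN(a, b):  return (a | ~b) & MASK   # assumed: second operand inverted
def NOR(a, b):  return ~(a | b) & MASK

def mux(y, z, m):
    # MUX $X, $Y, $Z: bits of Y where mask rM is 1, bits of Z where it is 0
    return ((y & m) | (z & ~m)) & MASK

for y in range(0, 256, 17):
    for z in range(0, 256, 23):
        for m in (0x00, 0x0F, 0xA5, 0xFF):
            want = mux(y, z, m)
            assert ORN(AND(y, m), ORN(m, z)) == want      # AND/ORN/ORN
            assert NAND(NAND(y, m), ORN(m, z)) == want    # NAND/ORN/NAND
            assert AND(y, m) == mux(y, 0x00, m)           # MUX $X,$Y,0
            assert ORN(y, m) == mux(y, 0xFF, m)           # MUX $X,$Y,all ones

# literal bitcomplement tricks:
for z in range(-1, 0xFF):          # -1 <= Z <= #FE
    assert (0 - (1 + z)) & MASK == ~z & MASK   # SUB $X, zero, (1+Z)
for z in range(0x100):
    assert NOR(z, 0) == ~z & MASK              # NOR $X, $Z, 0
```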
is one of the architectures suggested for the F-CPU.
(only with FPGAs can you have reconfigurable computing)
see also robot_links.html#PLD Programmable Logic (FPGA, PLD, CPLD, etc.) and the devices needed to program them.
see also #simple_cpu for simple CPU architectures that one would think would be simple to implement in a FPGA.
see also vlsi.html#PCI_on_FPGA for FPGA devices compatible with the PCI bus.
[backup copy of something I wrote on http://electronicschat.org/echatwiki/HomebuiltCpu ]
I imagine that just about anyone who has programmed 80x86 assembly language has dreamed about building their own CPU with their own assembly language.
If you've dreamed about building your own CPU that runs your own assembly language, today is a wonderful time to be living. There are many, many ways to fulfill your vision. (A few of them are even useful). -- DavidCary ( http://david.carybros.com/html/computer_architecture.html#simple_cpu )
(Note: this page is *not* about building 80x86-based desktop computers. That sort of thing is discussed over at http://en.wikibooks.org/wiki/How_To_Build_A_Computer . Building a custom 80x86-based laptop -- I don't know.)
* alt.comp.hardware.homebuilt FAQ by Mark Sokos http://www.faqs.org/faqs/homebuilt-comp-FAQ/index.html
== useful restrictions ==
=== tiny size ===
Sometimes you want a tiny little processor -- say, you want it to fit inside a model rocket.
Imagine a fully custom tiny little processor, running exactly the instruction set you've always dreamed of. How is that different from a tiny PIC or Atmel processor, programmed to *emulate* your instruction set?
[http://www.taniwha.com/~paul/fc/ Taniwha Flight Computer Home Page]
[http://img.cmpnet.com/edtn/ccellar/e023pdf1.pdf "Picaro: A Stamp-like Interpreted Controller"] article by Tom Napier 1998-04 in _Circuit Cellar Ink_
I'm thinking of building one of these with an external *serial* memory (much smaller and cheaper than the normal parallel memory ... slow, but probably plenty fast enough for what I want it to do). Unlike most processors, this way you can execute code in serial memory. (I think this is also going to give the lowest-cost and also the lowest-power way to get your own custom instruction set).
=== higher speed ===
You can build a "processor" that performs some kinds of special-purpose tasks faster than Intel's latest device. Use FPGAs.
You can build a CPU that is "instant on" and "instant off" (use FLASH, and enough capacitance to dump the essential bits of the current state of RAM to FLASH).
=== extreme environments ===
Can you get it working over the entire "industrial temperature range"? ( -40 °C to +125 °C )
== challenging, educational, "fun" restrictions ==
=== built with "classic" TTL chips only ===
Generally built around a 74LS181 ALU or similar ( 74HC181 ). (Several homebuilt CPUs have been built like this)
=== bizarreness ===
Since we're building things from scratch, why count in the "natural binary code"
000 001 010 011 100 ...
?
Couldn't we increment address registers in some *other* sequence ?
What other sequences would be "interesting" to experiment with ?
Is there some sequence that uses *less* hardware (easier to build) -- perhaps LFSR ? Is there some sequence that uses less power (runs longer on batteries) -- perhaps gray code ?
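A small Python sketch of both alternatives next to natural binary. The 4-bit width and the Galois-LFSR tap mask 0b1001 are example choices, not from the text.

```python
# Two alternative counting sequences for an address register.
N = 4

def gray(i):
    # adjacent Gray codes differ in exactly one bit, so a Gray counter
    # flips fewer outputs per step than a binary counter (less switching,
    # possibly less power)
    return i ^ (i >> 1)

def lfsr_step(s, taps=0b1001):
    # Galois LFSR: one shift plus a conditional XOR -- cheaper hardware
    # than a ripple-carry incrementer, but it never visits state 0
    return (s >> 1) ^ (taps if s & 1 else 0)

# Gray code: exactly one bit changes per increment
for i in range(2 ** N - 1):
    assert bin(gray(i) ^ gray(i + 1)).count("1") == 1

# this LFSR cycles through all 2**N - 1 nonzero states before repeating,
# so it still covers nearly the whole address space, just out of order
seen, s = set(), 1
while s not in seen:
    seen.add(s)
    s = lfsr_step(s)
assert len(seen) == 2 ** N - 1
```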
=== all one part ===
As many of you are aware, most projects are built from a multitude of different parts. There's always one part that takes the longest time to ship in. And then I always break something, and I have to wait even longer for the replacement to ship in.
However, computers have been built out of large numbers of the *same* device wired together appropriately.
* transistors
* [http://en.wikipedia.org/wiki/Apollo_Guidance_Computer 4,100 ICs, each containing a single 3-input NOR logic gate.]
* dual 3-input NOR gate ICs
I (DavidCary) have been wondering:
* given what we now know about "simple" RISC and zero-operand instruction sets, could I design a CPU with significantly *less* than 4,100 NOR gates ?
* What other "universal" chips can be used (in large enough quantities) to build an entire CPU ?
I want to use chips that are readily available, and also "dense". (Obviously, the densest chips are the all-in-one CPU microcontrollers ... but I can't customize those. What other points on the spectrum are available?)
Let's focus on a 8 bit register (8 D flip-flops) for a moment. I imagine I'll use a bunch of them in my CPU. (program counter, registers, address pointers, etc.)
chips/bit   chips per 8-bit register   chips per 3-input NOR
'''universal chips'''
5(?)        40        1       single NOR
3(?)        24        1/2     dual NOR
2(?)        16        1/3     triple NOR 74HC27
1           8         1       dual 4-input mux 74HC153
1/2         4         3(?)    quad 2-input mux, inverting 74HC158
'''non-universal chips'''
1/2         4         N/A     dual D flip-flop 74HC74
1/4         2         N/A     quad D flip-flop 74HC173
1/6         2         N/A     hex D flip-flop 74HC174
1/8         1         N/A     octal D flip-flop 74HC564
Obviously, anything that could be built from single NOR gate ICs, could also be built from the much "denser" dual NOR gate ICs, and it would take significantly less space, time, effort, weight, etc.
It looks like the octal D flip-flop is the densest chip. Unfortunately, it is *not* "universal" -- parts of the CPU (the ALU, etc.) need to act in ways that I don't think D flip-flops can act. So if I want to stick to the "all one part" idea, I can't use it.
It looks like it takes three '158 chips to emulate a simple 3-input NOR, but only one '153 chip. So I suspect the '153 is better for building the random logic in the control section and the ALU.
* What other "universal" chips are there ?
* Which of the universal chips can implement a given CPU in the fewest number of chips ? (If the CPU is dominated by registers, I suspect it will be one that can store the most bits in the fewest number of chips -- the '158 is the best I've found so far).
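The chip counts above rest on the fact that a multiplexer is a "universal" gate. A sketch of the idea: a 4-input mux whose two select lines carry inputs (a, b) and whose data inputs are wired to constants or to c / ~c can realize any 3-input function (Shannon expansion). This assumes an inverted copy of c is available, e.g. from an inverting mux like the '158.

```python
# Why a mux counts as "universal": building a 3-input NOR
# from one half of a dual 4-input mux like the 74HC153.
def mux4(d0, d1, d2, d3, s1, s0):
    # one half of a dual 4-input mux such as the 74HC153
    return (d0, d1, d2, d3)[s1 * 2 + s0]

def nor3_from_mux(a, b, c):
    # NOR truth table: output is 1 only when a=b=0 and c=0,
    # so data input D0 is wired to ~c and the rest to constant 0
    return mux4(1 - c, 0, 0, 0, a, b)

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert nor3_from_mux(a, b, c) == (0 if (a | b | c) else 1)
```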
=== all 2 parts ===
Similar to the above, but easing the restriction to allow 2 different kinds of chips.
* If I use 2 different kinds of chips, which 2 chips can implement a given CPU in the fewest number of chips ? (The '564 is 4 times as dense as the '158 for storing bits ... and the '153 looks like it will be more dense than the '158 for most random control logic.)
It is amazing what you can squeeze onto these parts if you design the machine architecture carefully to exploit FPGA resources. In contrast, there was a very interesting article in a recent EE Times by a fellow from VAutomation doing virtual 6502's ... Although the 6502 design used only about 4000 "ASIC gates" it didn't quite fit in a XC4010, a so-called "10,000 gate" FPGA. That a dual-issue 32-bit RISC should fit, and a 4 MHz 6502 does not, states a great deal about VHDL synthesis vs. manual placement, about legacy architectures vs. custom ones, and maybe even something about CISC vs. RISC...
-- Jan Gray http://www3.sympatico.ca/jsgray/home1.txt
"according to Dataquest, the business of embedding CPUs into other chips is growing at about 25% per year and stood at about $20 billion in 2000. ...
The average is 2.3 processors per intelligent ASIC (those with any programmability) and rising. ...
... in the free category ... processor designs from CMOSexod and Free-IP cores. Both designs have little 8-bit processors you've never heard of but come with free tools to get you started. The Free-IP Project ... OpenCores ...
Elixent combines configurable processors with field-programmable logic to create a constantly changing processor. Elixent's tool switches implementations on the fly based on your workload. ...
Elixent enters the world of dynamic reconfigurability, or reconfigurable computing (RC). RC ... change the hardware on the fly to suit the application. ...
... It stands to reason that whenever one part of the circuit is working the other parts must be idle. ... Hardware systems have always been created from a superset of all the functions required over the life of the product, rolled into one.
If RC takes off, that might change. For example, future generations of cell phones may use half the transistors to do 10 times the work. Fewer transistors may be all that's needed, if none are wasted on functions that aren't required right now.
... Jim Turley ... visit his web site at www.jimturley.com. "
mentions these companies [FIXME: add to my other list of companies?] : VAutomation ARC International http://www.vautomation.com ... CMOSexod http://www.cmosexod.com/freeip.htm ... Elixent Ltd. http://www.elixent.com ... Lexra Inc. http://www.lexra.com ... OpenCores http://www.opencores.org/projects ... ARCtangent, ARChitect ARC International http://www.arccores.com ... Handel-C Celoxica http://www.celoxica.com ... LEON-1 Distributed by The European Space Agency http://www.estec.esa.nl/wsmwww/leon/ ... ST210 Agilent Technologies http://we.home.agilent.com ... STMicroelectronics http://us.st.com ... Jazz processor Improv Systems, Inc. http://www.improvsys.com ... Xtensa Tensilica http://www.tensilica.com ... The Free-IP Project http://www.free-ip.com with XESS, Corp. http://www.xess.com ...
... the FPGA High Performance Computing Alliance (FHPCA) ... The FHPCA aims to advance the development and adoption of FPGA-based high-performance computing through collaboration on the design and development of world's most powerful FPGA-based supercomputer using COTS (Commercial Off The Shelf) technology. ... provide opportunities for education and training ... facilitate technology transfer, including assistance with porting conventional supercomputing applications to an FPGA-based supercomputer; and promote research in the application of FPGA-based high-performance computing by making facilities available to visiting academics. ... "Nallatech ... We have systems installed in applications as diverse as genomic processing and military radar. ..." said Allan Cantle, CEO of Nallatech. "FPGA computing is today where conventional microprocessor-based computing was 15 years ago. The potential exists to deliver unprecedented computational capacity, using less power in a smaller space, however to unleash that potential the industry needs to develop the means to give users low risk access to that power." ... his white paper, "FPGA Centric Computing Institute" ... in October 2003.
http://www.microswiss.ch/tld/2004/abstracts.html ; http://www.microswiss.ch/tld/2004/papers/Brueckner.pdf
According to DataQuest, 17% of all FPGA design starts in 2003 included an embedded microprocessor. By 2007, estimates predict embedded microprocessor utilization in FPGAs will grow to 37%. ...
The b16 Processor is a minimalistic stack processor, inspired by Chuck Moore's recent work. The original incarnation is 16 bit, byte addressed. There are 32 instructions, each is 5 bits. Three and a bit (literally!) are packed into a 16 bit word (bundle). The first slot in the bundle can only be a nop or a call.
-- http://b16-cpu.de/ (designed in Verilog)
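The "three and a bit" packing can be sketched like this; note this is one plausible bit layout for illustration, not necessarily the b16's actual bit assignment.

```python
# Packing three 5-bit instructions plus one leftover bit into a 16-bit
# bundle (3 x 5 bits leaves exactly 1 bit over).
def pack(op0, op1, op2, extra_bit):
    assert all(0 <= op < 32 for op in (op0, op1, op2))
    assert extra_bit in (0, 1)
    return (op0 << 11) | (op1 << 6) | (op2 << 1) | extra_bit

def unpack(bundle):
    return ((bundle >> 11) & 31, (bundle >> 6) & 31,
            (bundle >> 1) & 31, bundle & 1)

assert unpack(pack(3, 17, 31, 1)) == (3, 17, 31, 1)
```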
DAV:
As CPUs get relatively faster than RAM, preload becomes more important.
After deciding on an address and sending it to RAM,
it takes several CPU cycles before the RAM is able to reply with the data.
If the very next instruction requires that data,
then the CPU has to wait.
But if the instruction set is designed so that
the programmer can designate an address ahead of time,
this allows the programmer to squeeze in a few ``internal'' instructions
instead of pausing.
(This is very similar to the "branch delay slot" on some RISC processors).
Early (asynchronous) RAM chips allowed you to present the data at the same time as the address
when you wanted to write to the RAM chip. This *seems* faster
than presenting the address, then later presenting the data, in isolation.
But in the context of alternating reads and writes,
where naturally the data coming from the RAM only arrives a memory cycle *after*
the RAM is presented with an address,
it turns out to be faster to always delay the data (whichever direction it is going)
for writes as well. This leads to synchronous RAM ( in particular, SDRAM --
but SRAM stands for something different: static RAM).
DAV: does simply A! help, or do we need to also indicate whether it will be a load or a store ?
There are several improvements that Chuck has added to his newer Forth virtual machine model.
One of them is the address register and the other is the circular stacks.
Chuck has explained that hardware considerations aside,
the idea of the address register was that the Forth words @ and ! ( fetch and store)
were clumsy
at the top of the stack and
were based on smaller atomic operations that the programmer could take advantage of.
@ was broken into two operations A! and @A. Likewise ! was broken into A! and !A.
...
advantage is the use of auto-increment addressing memory access opcodes
...
--
Jeff Fox, talking about Chuck Moore
http://www.ultratechnology.com/forth3.htm
[FIXME: Why isn't this listed on that page ? ]
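Jeff Fox's description of splitting @ and ! into smaller atomic operations can be modeled in a few lines of Python. This is a toy sketch of the idea only (method names like `fetch_a_inc` are my own, and the auto-increment variant is called @A+ here for illustration).

```python
# Forth's @ (fetch) and ! (store) decomposed into setting the A register
# (A!) and then reading (@A) or writing (!A) through it, plus an
# auto-increment variant for streaming through memory.
class Machine:
    def __init__(self, mem):
        self.mem = mem
        self.stack = []
        self.a = 0

    def a_store(self):          # A! : pop top of stack into A
        self.a = self.stack.pop()

    def fetch_a(self):          # @A : push [A]
        self.stack.append(self.mem[self.a])

    def store_a(self):          # !A : pop top of stack into [A]
        self.mem[self.a] = self.stack.pop()

    def fetch_a_inc(self):      # @A+ : push [A], then A := A + 1
        self.fetch_a()
        self.a += 1

    def fetch(self):            # classic @ rebuilt from the atoms
        self.a_store()
        self.fetch_a()

m = Machine([10, 20, 30, 40])
m.stack.append(1)
m.fetch()                       # @ : address 1 -> value 20
assert m.stack == [20]
m.fetch_a_inc()                 # stream onward from address 1 using @A+
m.fetch_a_inc()
assert m.stack == [20, 20, 30]
```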
[FIXME: get the book _HDL Chip Design_ by Douglas J. Smith, 1998, Doone Publications, ISBN 0-9651934-3-8 ( Computer Literacy , $65+tax). ]
... I strongly recommend the book HDL Chip Design ...
Prentice-Hall has published a Xilinx FPGA lab book and win95/NT software which includes a coupon for an XESS FPGA board. Adding the cost of a power supply, you can be doing experiments for less than $250. ...
Only 3 instructions:
br adr         Load adr into PC if F=0. Clear F.
add adr1,adr2  Add the contents of memory location adr1 and the contents of memory location adr2 and store the result in memory location adr2. Store the carry of the addition in F.
nor adr1,adr2  Calculate the NOR function of the contents of memory location adr1 and the contents of memory location adr2 and store the result in memory location adr2. F=1 if result=0, else F=0.

and only 1 user-accessible register (PC) and only 1 user-accessible flag (F).
DAV: Unfortunately, self-modifying code is required for subroutine call/return. This makes re-entrant code impossible.
DAV: some simple macros:
; These macros require 3 special locations:
; ZERO: a location reserved to handle the value 0.
; ONES: a location reserved to handle the value with every bit set (sometimes called -1 or ~0).
; Set them up with
;   really_clear ZERO
;   really_mov -1, ONES

;;;;;;;;; single-instruction macros (aliases)

not a ; invert each bit in a
    nor a,a
    ; or alternatively
    nor ZERO, a

clear a ; clear all bits in a to zero. Side effect: sets F=1
    nor ONES, a

mov 0, a
    clear a

;;;;;;;;; multi-instruction macros (perhaps should be subroutines)

really_clear a ; this is used only to set up locations ZERO and ONES
    add a,a
    add a,a
    add a,a
    ... (repeated for 16 bits, or whatever the wordsize is)
    add a,a

really_mov -1, a ; this is used only to set up location ONES
    really_clear a
    not a

mov -1, a ; set location a to the value -1
    clear a
    not a

mov -2, a ; set location a to the value -2
    mov -1, a
    add a,a

mov 1, a ; set location a to the value 1 -- this is used to set up location ONE
    mov -2, a
    not a
    ; this expands to the 4 instructions
    ;   clear a
    ;   not a
    ;   add a,a
    ;   not a

rotl a ; rotate a left, moving MSB to LSB
       ; side effect: always clears F.
    add a,a
    br continue
    add ONE, a
continue:

rotlc a ; move F into LSB of a, shifting all bits left, leaving old MSB in F
    br F_was_0
    add a,a
    br F_was_1_high_bit_was_0
    ; handle case of ``F was 1, high bit was 1''
    add ONE, a
    br set_F_and_continue
    ; execution never gets here
F_was_1_high_bit_was_0:
    add ONE, a
    br continue
    ; execution never gets here
F_was_0:
    add a,a
    ; the next 2 instructions seem redundant in the F_was_zero case,
    ; but they are necessary to handle the F_was_1 case.
    br continue
set_F_and_continue:
    ; this must fall through to the end, because branches always clear F
    nor ONES, ZERO
continue:

mov a, b ; set location b to the value in location a. WARNING: doesn't work when a=b.
    clear b
    add a,b

jump addr ; unconditionally jump to given address (side effect: clears F)
    ; isochronous code should use this:
    add ZERO, ZERO
    br addr
    ; alternatively, one instruction quicker when F=0:
    br addr
    br addr
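A minimal Python interpreter for this br/add/nor machine, usable as a sanity check of the macros. The semantics here are my reading of the description above: `br` loads PC when F=0 and always clears F, `add` leaves the carry in F, and `nor` sets F=1 exactly when the result is zero. Instructions are kept as (op, adr1, adr2) tuples rather than encoded words.

```python
# A 16-bit model of the 3-instruction machine (br / add / nor).
MASK = 0xFFFF

def run(mem, prog, steps=1000):
    pc, f = 0, 0
    while pc < len(prog) and steps > 0:
        steps -= 1
        op, a1, a2 = prog[pc]       # br uses only a1; a2 is ignored
        pc += 1
        if op == "br":
            if f == 0:
                pc = a1             # branch taken only when F=0
            f = 0                   # branches always clear F
        elif op == "add":
            t = mem[a1] + mem[a2]
            mem[a2], f = t & MASK, t >> 16
        elif op == "nor":
            mem[a2] = ~(mem[a1] | mem[a2]) & MASK
            f = 1 if mem[a2] == 0 else 0
    return mem

# the "not a" macro above (nor a,a) really does invert every bit:
mem = run({0: 0x1234}, [("nor", 0, 0)])
assert mem[0] == (~0x1234) & MASK
```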
lots of details -- the complete schematic diagram of the CPU, a pretty photograph of the test board (FPGA + RAM + ROM + some LEDs to simulate a traffic light) ... some assembly-language code ...
Date: Thu, 11 Jun 1998 05:32:23 +0200 (MET DST)
To: David Cary <d.cary at ieee.org>
From: Majordomo at lslsun.epfl.ch
Subject: Welcome to nlc

Welcome to the nlc mailing list! This is a mailing list for discussing the 'C -> netlist' compiler, or 'nlc' for short. To send mail messages to the nlc mailing list, send your message to: nlc@lslsun.epfl.ch

I have made nlc, the C to netlist compiler, available for anonymous ftp on lslsun5.epfl.ch (128.178.150.25) in /pub/nlc-0.9.tar.gz. It is compressed using gzip (available at a GNU archive site near you). A compiled binary version for Sparc Solaris 2.x is in the file /pub/nlc-0.9.bin.gz. If you have any trouble getting the file, let me know. I can probably mail it to you.

For those that are interested in the Spyder project, you will find below references to already published work. Please let me know if you have trouble locating these papers, and I'll see what I can do.
Have fun and please keep in touch,
Christian
---
* Christian Iseli and Eduardo Sanchez, "Spyder: A SURE (SUperscalar and REconfigurable) Processor", The Journal of Supercomputing, vol. 9, no. 3, pp. 231-252, 1995.
* Christian Iseli and Eduardo Sanchez, "A Superscalar and Reconfigurable Processor", Field-Programmable Logic: Architectures, Synthesis and Applications, Lecture Notes in Computer Science vol. 849, pp. 168-174, Springer-Verlag, Prague, Czech Republic, September 1994.
* Christian Iseli and Eduardo Sanchez, "Beyond Superscalar Using FPGAs", IEEE International Conference on Computer Design, Cambridge, Mass., October 1993.
* Christian Iseli and Eduardo Sanchez, "Spyder: A reconfigurable VLIW processor using FPGAs", IEEE Workshop on FPGAs for Custom Computing Machines, Napa, April 1993.
* Christian Iseli and Eduardo Sanchez, "Augmentation du parallélisme par la reconfigurabilité", Actes des 6èmes Rencontres Francophones du Parallélisme, RenPar'6, pp. 3-6, École normale supérieure, Lyon, June 1994.
* Serge Durand and Christian Iseli, "Developing a reconfigurable coprocessor on an SBus board", Sun User Group 1993 Proceedings, pp. 5-9, San Jose, California, December 1993.
* Christian Iseli and Eduardo Sanchez, "A C++ compiler for FPGA custom execution units synthesis", IEEE Symposium on FPGAs for Custom Computing Machines, pp. 173-179, Napa, CA, April 1995.
From: Christian Iseli <chris@lslsun.epfl.ch>
Date: Wed, 12 Aug 1998 13:53:07 +0200 (MET DST)
To: nlc@lslsun.epfl.ch
Subject: Re: Is this list dead ?

>I heard this mailing list talked about converting (mathematically
>intensive) C programs to run at high speed on FPGAs.
>
>I thought I would lurk and learn, but I haven't gotten any messages since I
>got the "Welcome to nlc" message on 1998-06-11.
>
>Is this list dead ?

Not really, but it was never very lively either... ;-) As it happens, I'm now done with my PhD thesis and since I had no means of pursuing nlc further, I now have another job and have little time to actively develop nlc.

The PhD thesis report is available at: http://lslwww.epfl.ch/pages/publications/rcnt_theses/home.html
The latest nlc source code is available at: ftp://lslwww.epfl.ch/pub/

HTH. Cheers,
Christian
The exact opposite thing is also being developed:
vhdl2c "vhdl2c is a vhdl to `C' converter by Michael Knieser"
see also "Tiny robots" robot_links.html#tiny and http://electronicschat.org/index.cgi/HomebuiltCpu
I am fascinated by CPU designs that are extremely simple, approaching "minimal", in several (sometimes incompatible) senses of the word "simple".
Here "CPU", "MCU", and "PE" are almost interchangeable.
Often these characteristics exhibit synergy -- when you eliminate some opcodes, that eliminates some gates (making it lower-cost) and makes it easier to understand (less documentation required). Occasionally, though, driving a design to extreme simplicity according to one measure causes extra complexity in another area to compensate. This general concept is known as the waterbed theory . When applied to CPU designs, this is called the Turing tarpit #tarpit .
For purposes of robotics, "low-cost" and "simple interface" are usually the dominant considerations. Some of the concepts of massively parallel processing are also present when one builds swarms of simple robots, but the kind of random communication between constantly-changing arrangements of simple robots is very different from the communication between the rigidly connected (most commonly in a 2-D mesh or a hypercube) elements of common cellular automata and multi-processors.
See also Using FPGAs to simulate CPUs #FPGA for some very simple CPUs designed to fit onto FPGAs. 2 very different reasons for simplicity there: (a) a simpler CPU can fit onto a smaller FPGA, making it much less expensive. (b) making the CPU smaller allows you to fit more copies of the CPU on a given FPGA, increasing MIPs at no cost. [FIXME: should combine these into 1 section ? Since the same op-code set may be implemented many ways, in TTL, in FPGA, in custom VLSI, it doesn't really make sense to split them into separate sections ... On the other hand, ``less hardware'' means something a little different in these 3 technologies. ]
[here I *list* simple CPUs I know about; Opcode considerations #considerations and #FPGA also talk about tips for designing new CPU architectures. ]
... Here I also list all the CPUs I know about that were built up out of TTL.
...
The Block I version used 4,100 ICs, each containing a single 3-input NOR gate. The later Block II version used dual 3-input NOR gates in a flat-pack; approximately 5,600 gates in all. The gates were resistor-transistor logic (RTL). ...
The instruction format was 3 bits for opcode, 12 bits for address.
(details about each instruction in the instruction set there). (Knowing what we now know about MISC, how much smaller could we make a "reasonable" CPU out of NOR gates ? )
For as long as I've been using microprocessors -- and I was designing the Intel 8080 into radiation-monitoring equipment in 1977 -- I've always itched to have control over the instruction set. The makers always seem to leave something out....
Picaro makes a pleasant change from systems with 16-MB minimum memories and operating systems no human being can comprehend.
describes a very easy-to-build system: There's the CPU connected to some EPROM (for programs and data) and a crystal. The EPROM is serial EPROM (makes things *much* easier to wire up). The "CPU" is really a PIC with an internal interpreter with 2 modes. If the magic value is discovered in EPROM, the PIC starts fetching and "executing" (interpreting) the program in EPROM. Otherwise, it assumes the EPROM is blank, and waits for a properly-formatted program to stream in on the serial port which it programs into the EPROM. [FIXME: todo: build something like this ... but with a completely different instruction set, natch.]
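The boot-mode decision described above can be sketched in a few lines. The function names and the MAGIC value here are made up for illustration; the Picaro article presumably defines its own EPROM format.

```python
# Picaro-style reset behavior: run the stored program if the EPROM
# looks programmed, otherwise accept a download over the serial port.
MAGIC = 0xA5  # hypothetical "valid program present" marker

def boot(eeprom_read_byte, run_interpreter, program_from_serial):
    if eeprom_read_byte(0) == MAGIC:
        run_interpreter()        # magic value found: interpret the program
    else:
        program_from_serial()    # blank EPROM: wait for a serial download

# simulated blank EPROM (reads back 0xFF) -> falls into programming mode
events = []
boot(lambda addr: 0xFF,
     lambda: events.append("run"),
     lambda: events.append("program"))
assert events == ["program"]
```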
(in other words, no ROM or FLASH is needed for initial testing).

From: Ben Franchuk
Date: Wed Jul 25, 2001 7:10 am
Subject: Re: [fpga-cpu] A debugger for xr16 the easy way...

On my cpu on reset a bootstrap loader is run from the serial input. This way all I need for external parts is [RAM] and a buffer for serial lines from the FPGA.
``The Jupiter Ace hardware page: How to build your own !'' by Grant Searle BSc. http://www.home-micros.freeserve.co.uk/JupiterAce/JupiterAce.html
Build your own ZX81 http://www.home-micros.freeserve.co.uk/zx80/zx80nmi.html
interesting wire-wrap style ... see ``Back of PCB'' photo. While this uses a Z80 cpu (and so is ``cheating'' in the context of building the CPU itself out of standard components), the TV / monitor output circuitry may be interesting... [FIXME: computer museum] [FIXME: crosslink to schematic.html ?]... old micros which can still be built since they don't use custom components. ...
-- http://www.pcengines.com/toy2.htm
`` TOY/2 - a minimal 16 bit CPU
TOY/2 was designed by Pascal Dornier and Stephan Paschedag ... in 1988 ...
TOY/2 can address a full 64K word program and memory address space. ... and takes up a whopping 3300 transistors (excluding I/O pads). ...
All instructions are 16 bit, 4 bits for the opcode, and 12 bits for the direct address. ''
DAV: I think the JMP instruction should be P := [vec] to make it consistent with the description and the conditional branches, and to allow jumping to code anywhere in the full 64 Kword program space. Only 15 defined instructions:
Programmer-visible registers:
  A: accumulator
  P: program counter
  T: temporary register for indirect store
  C: carry flag
  Z: zero flag (DAV: does this *always* indicate the current state of A ?)

; 3 for program flow control
JMP vec    P := [vec] ; indirect jump
BCC vec    IF (carry clear) then P := [vec]
BNE vec    IF (not zero, i.e., 0==Z) then P := [vec]
; 3 for direct load and store
LDC src    A := [src], C := 0 ; load A and clear carry
LDA src    A := [src] ; load A, don't clear carry
STA dest   [dest] := A
; 3 logical ops
(op) src   A := A (op) [src] ; (op) is one of: XOR, OR, AND.
; 2 arithmetic
ADC src    A := A + [src]
SBC src    A := A - [src] - C ; subtract
; 4 implicit instructions that don't use the 12 bit address:
ROR        A,C := A,C ror 1 ; rotate right through carry
TAT        T := A ; transfer A to T
LDI        A := [A] ; load A indirect
STT        [A] := T ; store T indirect
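The instruction set above can be sketched as a small Python interpreter. The actual 4-bit opcode encodings aren't listed, so instructions are kept as (mnemonic, address) tuples; the flag behavior (in particular Z always tracking A) is my reading of the description, not a verified model of the original design.

```python
# A TOY/2 instruction-set sketch: one instruction per call to step().
MASK = 0xFFFF

def step(mem, prog, r):
    """Execute one instruction. r holds the registers P, A, T, C, Z."""
    op, adr = prog[r["P"]]
    r["P"] += 1
    if op == "JMP":
        r["P"] = mem[adr]                      # indirect jump
    elif op == "BCC" and r["C"] == 0:
        r["P"] = mem[adr]
    elif op == "BNE" and r["Z"] == 0:
        r["P"] = mem[adr]
    elif op == "LDC":
        r["A"], r["C"] = mem[adr], 0           # load and clear carry
    elif op == "LDA":
        r["A"] = mem[adr]                      # load, keep carry
    elif op == "STA":
        mem[adr] = r["A"]
    elif op == "XOR":
        r["A"] ^= mem[adr]
    elif op == "OR":
        r["A"] |= mem[adr]
    elif op == "AND":
        r["A"] &= mem[adr]
    elif op == "ADC":
        t = r["A"] + mem[adr]
        r["A"], r["C"] = t & MASK, t >> 16
    elif op == "SBC":
        t = r["A"] - mem[adr] - r["C"]
        r["A"], r["C"] = t & MASK, 1 if t < 0 else 0
    elif op == "ROR":                          # rotate right through carry
        r["A"], r["C"] = (r["A"] >> 1) | (r["C"] << 15), r["A"] & 1
    elif op == "TAT":
        r["T"] = r["A"]
    elif op == "LDI":
        r["A"] = mem[r["A"]]                   # load A indirect
    elif op == "STT":
        mem[r["A"]] = r["T"]                   # store T indirect
    r["Z"] = 1 if r["A"] == 0 else 0           # assumption: Z tracks A

# tiny demo: A := [0]; A := A + [1]
mem, r = {0: 5, 1: 7}, {"P": 0, "A": 0, "T": 0, "C": 0, "Z": 0}
prog = [("LDC", 0), ("ADC", 1)]
step(mem, prog, r)
step(mem, prog, r)
assert r["A"] == 12
```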
DAV: Given that code is in ROM, and we choose some arbitrary RAM area ``stack_start'' to hold the stack, and we choose some other arbitrary RAM location SP to hold the return stack pointer, call instructions could be implemented like this:
(Is this the most compact implementation of CALL ?).

...
; DAV: This implementation of CALL eats up
; 2 instructions + 1 word of ROM = 3 words per call.
    LDA $+2
    JMP &(banana1)

    LDA $+2
    JMP &(banana2)
...
; linker must allocate a word of ROM to hold the $+2 value, 1 per CALL.
; linker must allocate a word of ROM to hold the &(banana2) value as well,
; but only 1 per subroutine, shared among all calls of banana2().

banana1: ; non-leaf subroutine
    ; we have return address in A,
    ; and SP points to an empty location to store it.
    ; adjust SP to point to empty location
    TAT
    LDA SP
    STT
    ADC &(1) ; assembler must allocate word of ROM to hold the constant 1
    STA SP
    ...
    ; return sequence
    JMP &(non_leaf_return)

non_leaf_return: ; common to all non-leaf subroutines
    LDA SP
    ADC &(-1)
    STA SP
    LDI
    STA temp
    JMP temp

banana2: ; leaf subroutine
    ; we have return address in A,
    ; and SP points to an empty location to store it.
    TAT
    LDA SP
    STT ; leave SP full until end of routine
    ...
    ; return sequence
    JMP &(subroutine_return)

subroutine_return: ; common to all subroutines
    LDA SP
    LDI
    STA temp
    JMP temp
DAV: From the instruction set listed here, I attempted to reverse-engineer a schematic suitable for implementation in TTL. I wonder how close this matches the original design by Pascal Dornier and Stephan Paschedag ?
TOY/2 as reverse-engineered by David Cary (Warning: not yet implemented -- does this really work ?)

Every instruction takes 2 cycles:
    1 SRAM memory cycle (either reading [address] or [A], or writing T to [A]).
    1 SRAM memory cycle to read next instruction (from [PC]).

Registers are labeled with their contents in the 1st cycle; PC and A are
swapped in the last cycle of the instruction. Calculate "next PC" first:

[schematic: "calculate next PC first" datapath -- instruction register
(opcode / 12-bit address), A(PC) and PC(A) swap registers, mux into the
SRAM address port, T register and buffer on the SRAM data port, ALU
producing Z and C; original ASCII art lost in transcription]

Let's try that again. Calculate "next PC" first:
    first cycle: selecting either P := [address],
        or [A] := T and P := P + 1,
        or P := P + 1.
    second cycle: calculating A := A (op) [address], or A := [A],
        and fetching next instruction from [P].
Whoopsies, that's not going to work --
we can't fetch [A] and [P] at the same time.

Simpler to calculate A in the first cycle, so we can do A := [A] in that
first cycle, and then always calculate the next P (and simultaneously
fetch it) in the next cycle.
    first cycle: calculating A := A (op) [address], or A := [A],
        or [A] := T (and keep A the same ?)
    second cycle: selecting either P := [address], or P := P + 1,
        and fetching next instruction from [P].
Um... but won't that fetch the next instruction from the *old* P,
so that we get a 1-instruction branch delay?
Registers are labeled with their contents in the 1st cycle; PC and A are
swapped in the last cycle of the instruction. Calculate "next A" first:

[schematic: "calculate next A first" datapath -- same blocks as above,
with a transparent [address]/([PC]) register feeding the ALU; original
ASCII art lost in transcription]

Control line summary:
0: instruction register : opcode : next value always latched at end of last cycle.
0: instruction register : address : next value always latched at end of last cycle,
   never used during last cycle.
   (Perhaps share with some other register only used during last cycle ?)
0: [address] transparent register: always transparent during 1st cycle,
   latched at end of 1st cycle, and held during last cycle.
0: A(PC) next value always latched at end of every cycle
0: PC(A) next value always latched at end of every cycle
2: Mux selection: depends on opcode and C and Z during 1st cycle;
   always comes from PC on last cycle.
2: T: output enabled only on STT ( [A] := T ) instruction;
   latched only during TAT instruction.
1: SRAM: last cycle always a read; 1st cycle may be read or write.
4: ALU function select (is this really all the control signals needed ?)
0: Z: zero flag always set or cleared during last cycle to reflect current state of A.
1: C: new value of carry flag only latched on LDC, ADC, SBC, ROR instructions.
--
10 bits ? Is this all ?

Notes:
1. Only 1 instruction loads T: the TAT instruction.
   There are several ways to implement this:
   * the output of A(PC) (during the 1st cycle)
   * the output of PC(A) (during the last cycle), or
   * the output of the ALU (during the last cycle),
   whichever is easiest.
2. The register to hold [address] must be transparent, to implement JMP
   and BCC and BNE properly.
   (... Much later: I don't think we need a latch here. ...)
3. There's still room for 1 more instruction. What should it be ?
   Some options that require *zero* more hardware (just setting up the
   microinstruction decoder):
   * PC-relative jumps: PC := PC + [vec] (for position-independent code)
   * TPT: Transfer PC to T: T := PC (to allow position-independent code)
   * T := PC + [vec] (to allow position-independent code)
   * [A] := PC+1 (to allow position-independent CALL)

This implementation calculates next PC first (using the transparent
[address] register), then next A. (I think this ended up simpler than
calculating next A first, then next PC). (Of course, faster CPUs
calculate both at the same time, and have Harvard-style fetches from
data cache simultaneous with fetches from instruction cache).
In any CPU architecture that has the data and the program in 1 unified
memory, clearly we must have 2 memory cycles/instruction to do Load and
Store:
(1) first cycle: do any Load or Store required by the instruction in the
    instruction register, and
(2) second cycle: load the next instruction into the instruction
    register through [P].

In any CPU architecture that has the data and program in 1 unified
memory, in order to do any indexed write, we need at least 4 registers:
    Instruction register (to remember that we are doing an indexed write)
    PC (to remember what instruction to do next)
    address (where to write)
    data (what to write)
(Mark's TTL CPU has fewer than that number of registers, so it cannot do
an indexed write). (Well, we could do it with only 3 registers, if we
allow self-modifying code to write the index into the location that will
be read into the instruction register, and make the instruction register
wide enough to hold that address).

If we also try to re-use the ALU to increment the PC every time (rather
than using an incrementing register for PC), then we must calculate the
next PC every instruction. If we calculate the next PC during the second
cycle, then we are loading the next instruction from the old PC --
creating a delay slot of 1 instruction.

If we restrict ourselves to exactly 2 memory cycles per instruction, and
only 1 ALU, and we try to avoid that delay slot, then during jumps we
must be calculating the next PC on the first cycle, so we can latch it
at the end of that cycle and use [PC] to load the next instruction. So
the question is: on non-jumps, do we use the ALU to increment PC on the
first cycle, so that using the ALU to calculate the next A must happen
on the second cycle? Or do we always increment PC on the second cycle,
and always use the ALU for other calculations only on the first cycle?
Branches (conditional and unconditional): prefer loading the new PC on
the first cycle, to avoid the delay slot.
Let's try always updating PC on the first cycle. This makes the [A] instructions ( A := [A] )( [A] := T ) a bit tricky, since they *must* execute on the first cycle (the SRAM is busy fetching the next instruction on the second cycle). Since the ALU is busy calculating the next value of PC on the first cycle, but the next value of A often depends on a value being fetched from SRAM in the first cycle, clearly the value from SRAM must be latched at the end of the first cycle in order to be used in the second cycle. To avoid that latch means we must calculate the next value of A on the first cycle, while the data required is still streaming out of the SRAM. Still, branches (conditional and unconditional) must also update PC on the first cycle, in order to load the next instruction on the second cycle.
Yet another implementation of Toy/2 from DAV: dedicated P and A registers.

[schematic: datapath with separate PC and A registers feeding a mux into
the SRAM address port, T register and buffer on the SRAM data port, ALU
producing Z and C; original ASCII art lost in transcription]

1. Only 1 instruction loads T: the TAT instruction.
   There are several ways to implement this:
   * the output of A
   * the output of the ALU (during the first cycle), while selecting A
   * the output of some mux, while selecting A
   whichever is easiest.
The Mano machine http://en.wikipedia.org/wiki/Mano_machine is similar to the PDP-8 and (more distantly) the TOY/2. The data bus is 16 bits. All instructions, loads, and stores are one 16-bit word long. Memory-referencing instructions contain 4 bits of opcode and 12 address bits. The programmer's model has only 2 registers and 2 bits: one 16-bit accumulator; one (12 bit?) program counter; the carry bit; and a "halt" bit.
"Microcoded Versus Hard-wired Control: A comparison of two methods for implementing the control logic for a simple CPU" article by Phil Koopman BYTE Magazine January 1987 http://www.skidmore.edu/~pvonk/cs318/calendars/Documents/hard-wired.pdf describes the Toy CPU.
The schematic diagram of the PISC-1a can be downloaded here. DAV: While the paper notes ``no conditional branch microinstruction -- an important need'', I agree that conditionals are important, but "machines do not need conditional branches, they only need conditional subroutine return instructions." -- Philip J. Koopman, Jr. #conditional_return

The PISC is a processor constructed from discrete TTL logic, which illustrates the operation of both hardwired and microcoded CPUs. ... simple hardware modifications demonstrate interrupts, memory segmentation, microsequencers, parallelism, and pipelining. A standalone PISC board should be an economical and effective tool for teaching processor design.
... Pathetic Instruction Set Computer ... Requiring only 22 standard TTL chips (excluding memory), it is well within the ability of a student to construct and understand. Its writeable microprogram store uses inexpensive EPROM and RAM. ...
Microprogram store and main program store are one and the same. Indeed, the PISC has characteristics of both a hardwired CPU and a microcoded CPU.
... Many weaknesses of the PISC become evident after a short period of use ... It can be argued that the PISC is a valuable educational tool because these faults, and several potential solutions, are painfully obvious. ...
... 2100 gates ...
Subject: Re: Home-made CPUs
Date: 20 Nov 1998 00:00:00 GMT
From: jones@cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879)
Organization: The University of Iowa
Newsgroups: sci.electronics.design

From article <365533b8.334246@wingate>,
by t.dorrington@dial.pipex.com (Tim Dorrington):

> Has anyone out there managed to design/build their own very simplistic
> CPU from basic logic chips, RAM and ROM?  I know that this is
> obviously a huge project which has no real practical use, but I just
> wondered if anyone has done it for the challenge?

It's been done, and it's not that huge a project.  Reasonable CPU
designs take about 50 SSI and MSI TTL chips.  See the Ultimate RISC and
the Minimal CISC architectures indexed on
http://www.cs.uiowa.edu/~jones/arch/.

A fair number of people have built that particular RISC.  Of course, I
just wrote the specs for it, you've got to figure out how to reduce it
to a chip level design.  It's a fun exercise!

                Doug Jones
                jones@cs.uiowa.edu
Mountain View Press: The Forth Source http://www.TheForthSource.com/ Sells
WISC CPU/16 - A patented hardware design of a 16-bit stack machine with a user writable instruction set. Available in several forms: Complete schematics and documentation with which you can build your own ($50.00); Kits with all of the parts to wire wrap your own ($500.00); bare PC board which you can stuff with your own chips ($250.00); and Assembled and tested system ($750.00). Design uses a PC/AT as an I/O server. FAST!
and also sells lots of educational materials for learning the Forth programming language.
(FIXME: move to http://c2.com/cgi/wiki?WritableInstructionSetComputer ? )
-- _Stack Computers: the new wave_ book by Philip J. Koopman, Jr. 1989
http://www.cs.cmu.edu/~koopman/stack_computers/

... one major design tradeoff ... hardwired control and microcoded control. ...
An introduction to the concepts ... may be found in (Koopman 1987a).
Hardwired designs traditionally allow faster and more space efficient implementations to be made. ...
... a microcoded machine can use fewer bits to specify the same possible instructions, ...
... microcoded implementations are more convenient to implement in discrete component designs, so they predominate in board-level implementations. Most single-chip implementations are hardwired.
... Koopman, P. (1987a) Microcoded versus hard-wired control. Byte, January 1987, 12(1) 235-242
DAV: the ``Mark's 8 chip uP'' at the venturalink site looks very clever. I am very impressed. It is far more compact than any other TTL/PAL based CPU I've seen. (It is even smaller than most ``single-board computers'' that use monolithic CPU chips). It makes me want to design a CPU again.
From: Bill Buzbee (bill at buzbees.com)
Subject: Re: building a 8 or 16 bit cpu out of TTL parts
Newsgroups: comp.arch.hobbyist
Date: 2002-10-30 14:30:45 PST

"john dobbs" wrote...
> Anyone ever thought of building a 8 or 16bit cpu using traditional TTL
> chips (NANDs, ORs, Flipflops etc)  I could see how it would be a useful
> learning device, or for the hardcore guys using transistors/diodes and
> no IC's period.

I agree with the other posters that there's no *rational* reason to do
this using TTL - FPGA is the way to go these days.  However, if you're
feeling irrational, it can be a fun experience.  Here's a link to a
recent simple CPU project you should check out:

http://www.venturalink.net/~jamesc/ttl/

Also, I found the following books very useful when trying to come up to
speed on the subject:

Digital Computer Electronics, by Albert Paul Malvino, McGraw-Hill, 1977
Understanding Digital Computers, by Forrest Mims III, Radio Shack,
2nd Edition, 1987.

Good luck,
..Bill Buzbee
``... building my own computer from scratch. By "scratch", I mean designing my own instruction set, wire-wrapping a CPU out of a pile of 74 series TTL devices and writing (or porting) my own assembler, compiler, linker, text editor and very rudimentary operating system.
... User and supervisor modes exist, along with hardware address translation ...
... I really want to better understand hardware and operating systems. ... pushing ... the limits of my hobbyist abilities, ... support for preemptive multitasking and paging to enable me to support a "real" (though, I hope, greatly simplified) operating system for my machine.
... wherever possible I'll trade off speed for simpler circuits. Also, as I get closer to actually having to start wrapping wires, I find myself more freely trading off speed for fewer connections.
Compactness. One of my pet peeves is how bloated modern software is. I think there's a lot you can do with 128K bytes of addressing, and I like the idea of keeping things compact and utilitarian. I've tried to construct an expressively dense set of 1-byte opcodes.
At the end of the day, I'd like to have a working, and useful, machine that I understand completely. Oh, and it's also got to have a real front panel with lots and lots of cool blinky lights. ''
The ``links'' page points to many other web pages and books that deal with: small processor design and implementation, digital logic, retargeting C compilers and Forth compilers. [FIXME: does that make this section redundant ?]
[FIXME: finish reading] ... relatively inexpensive project (<$100 including power supply, ICs, breadboards and assorted hardware) ...
General Features:
    All relays are the identical part (Four-Pole-Double-Throw, 12 Volts)
    415 Relays
    111 Switches
    350 LEDs
    Max Power Consumption: Estimated 12 Amps @ 13.5 Volts (160 Watts)

Features:
    8 general purpose registers (of 8 bits each)
    instruction register (8 bits)
    program counter (16 bits)
    2 additional registers (16 bits each)
    7 bit ALU (operations: AND, OR, XOR, NOT, SHL, ADD, INC)
    16 bit increment unit
    32 Kbytes of main memory (implemented using one CMOS chip -- static RAM)

Each instruction is 8 bits long. Some instructions (JUMP, CALL, BRANCH,
SET-16) are followed by 16 additional bits of immediate data.
The instruction set:
    MOVE       register to register
    CLEAR      set register to zero
    LOAD       1 byte from memory to register
    STORE      1 byte from register to memory
    AND        8 bit, register to register
    OR         8 bit, register to register
    XOR        8 bit, register to register
    NOT        8 bit, register to register
    SHL        8 bit, register to register, shift left 1 bit
    ADD        8 bit, register to register
    INCREMENT  8 bit, register to register
    INCR-XY    16 bit, XY register
    GOTO       unconditional
    CALL       return address is stored in XY register
    RETURN     jump indirect through a register
    BRANCH     conditional (beq, bne, blt, bcy)
    SET-16     load 16-bit immediate value into register
               (The 2-byte value follows the instruction.)
    SET-8      load 5-bit immediate value into register (with sign extension)
    HALT       suspend execution

The clock rate is approximately 5 Hz. Instructions take between 8 and 24
cycles. Obviously not fast, but lights blink and it makes noise.
The general purpose registers, which each have 8 bits, are called:
A, B, C, D, M1, M2, X, and Y

In some instructions, M1 and M2 are combined to form a 16-bit register, called M. Likewise, in some instructions, X and Y are combined to form a 16-bit register, called XY.
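The pairing is plain byte concatenation. A small sketch in Python; which half is the most significant byte (here M1 and X as high bytes) is my assumption, since the source doesn't say.

```python
# Combining two 8-bit registers into one 16-bit register (e.g. M1:M2 -> M).
def pair(hi, lo):
    """Concatenate two 8-bit register values into a 16-bit value."""
    return ((hi & 0xFF) << 8) | (lo & 0xFF)

def unpair(v):
    """Split a 16-bit value back into (hi, lo) 8-bit halves."""
    return (v >> 8) & 0xFF, v & 0xFF

pair(0x12, 0x34)   # 0x1234
```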
The program counter (PC) is 16 bits.

There is a 16-bit increment unit, which adds 1, and there is a 16-bit register called INC dedicated to the increment unit. This register is not visible in the programmer's model and is used only (1) to increment the PC during each instruction, and (2) in the INCR-16 instruction.

The instruction register (INST) is 8 bits and holds the instruction being executed.
There is a 16-bit register, called J, which is used during the CALL, JUMP, and GOTO instructions. These instructions load the register from the immediate 16-bit value following the instruction, and then to complete the transfer of control, the J register is copied to the PC register.
...
The LOAD and STORE Instructions
===============================

The memory is organized as 32K bytes, addressed from hex 0000 through 7FFF.
The LOAD instruction uses the 16-bit value in the M register as an address and reads a byte from Main Memory into either the A, B, C, or D registers. Likewise, the STORE instruction assumes the address is in the M register and stores the value from either A, B, C, or D into a byte in Main Memory.
...
The CALL Instruction
====================

The CALL instruction is exactly like the GOTO instruction, except that, in addition, the address of the next instruction after the CALL instruction is saved in the XY register. This instruction is encoded as:
    1 1 1 0 0 1 1 1
    a a a a a a a a
    a a a a a a a a

where the field "aaaaaaaa aaaaaaaa" is the address of the subroutine.
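As a sketch, encoding and decoding that 3-byte instruction in Python: the fixed opcode byte 1110 0111 followed by the 16 address bits. Big-endian byte order for the address is my assumption; the source only gives the bit layout.

```python
# Encode/decode the relay computer's CALL instruction as described above.
CALL_OPCODE = 0b11100111   # the fixed first byte, 0xE7

def encode_call(addr):
    """Build the 3-byte CALL: opcode byte, then 16 address bits (hi, lo)."""
    return bytes([CALL_OPCODE, (addr >> 8) & 0xFF, addr & 0xFF])

def decode_call(insn):
    """Recover the subroutine address from an encoded CALL."""
    assert insn[0] == CALL_OPCODE
    return (insn[1] << 8) | insn[2]

decode_call(encode_call(0x1234))   # 0x1234
```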
The RETURN Instruction
======================

The RETURN instruction moves the contents of XY back to the PC. This can be used to effect a RETURN (when used after a CALL) or an indirect jump, when XY contains a computed value.
This instruction is slightly more flexible and is actually a 16-bit move instruction that can perform any one of the following moves:
    PC = XY
    PC = M
    PC = J
    XY = M
    XY = J
    XY = 0

...

Timeline, History, and General Comments
=======================================

I teach CS and I have always loved computers and been interested in how they work. Over the years I have found that you can read and study and make paper designs, but there is no substitute for actually building things. There are always processes at work that you can't understand until you actually build a working unit. I had wanted to build a computer out of relays when I was much younger, but I didn't have the knowledge, patience, money, or time.
...
Final exams are a panic for students, but they are really different for the professors. It is the only time that there are no lectures to plan and no exams to grade. As I sat there during a final one term, I sketched out the outline of an architecture that I might use. So I designed the ALU to "fit" into a complete machine. However, when I decided to build the ALU, I still wasn't committed to building anything more. It is important to stay focussed on a project that you can complete. It is easy (and fun) to fantasize about building something big, but unless one is able to concentrate long enough to get something completed, it is really just idle day-dreaming!
Once I finished the ALU, I decided to go ahead and build the second cabinet, the register unit. However, in my mind, I had not committed to building anything beyond that. I figured that if I finished it and it worked, I would then decide whether I wanted to proceed. I knew there was a possibility I might be burned out, or might run into problems that would make a full computer non-functional. I always want successes, not failures, so I decided I would rather have a fully-working half-computer than a half-finished full-computer.
After finishing the ALU and the register unit, I found I enjoyed all aspects of the construction. Usually design is the most fun and, when building stuff, we tend to leave the most unpleasant or difficult tasks until the very end, but after completing 2 cabinets, I knew pretty much what would be involved. At that time I decided to keep going.
...
Costs
=====

415 Relays, 4PRLY-12, plus extras for prototyping ($3.40 each)
350 LEDs ($1.49 each)
111 Switches, SPDT, on-on mini toggle, MTS-4 ($0.90 each)
Acrylic boards, pre-cut and pre-drilled, 4 cabinets
    (total $1,095.37, approx $275 per cabinet)
    ALU (1 board), $82.19
    Register Cabinet (10 boards), $251.60 total
    PGM-CTRL Cabinet (10 boards), $381.80 total
    Sequencer Cabinet (5 boards), $379.78 total
4 Mahogany Cabinets (est $300 each)
2 Power Supplies, 12V, 10 Amp ($79.99 each)
20 Capacitors, 100uF, 100V non-polar ($1.55 each)
Memory Board:
    1 SRAM, 32K dip, Jameco # 82472CA, Other #: 62256LP-70 ($5.49)
    1 eight-channel FET driver module, NCD, www.controlanything.com, IOTEST-L ($49.00)
    3 eight-channel LED array module, NCD, www.controlanything.com, 8-FET ($10 each)
    1 Prototyping board ($5.99)
    1 small power supply (for memory) 5V, 4 Amp, 20 Watt, Jameco #: 213583CA ($26.95)
DB-9 Sub-miniature Connectors:
    32 Plugs ($1.59 each)
    32 Receptacles ($1.76 each)
    64 Connector hoods ($0.49 each)
22-Gauge black solid copper wire, 100 feet ($4.49 each), est. 20 rolls
8-connector cable, CAT5e, 4 pairs, 24AWG solid, 328 feet ($42.00)
4 Fans ($10 each)
Misc Hardware (est $100)

Summary:
    Relays,          $1,411
    LEDs,            $521
    Switches,        $100
    Acrylic Boards,  $1,095
    Cabinets,        $1,200
    Power Supplies,  $160
    SRAM Memory,     $117
    Capacitors,      $148
    Connectors,      $138
    Wire,            $132
    Fans,            $20
    Hardware,        $100
    Grand Total,     $5,142 (per unit: $1,285)
Cellular automata are related to other interests I have:
[FIXME: CA stuff scattered elsewhere ...]
The cellular automata questions I'm most interested in are:
There's a little loop here -- first we use (simulated) cellular automata to learn about replication, then we use that information to design replicating tools. We also use (simulated) cellular automata to explore good rules and good patterns for computronium. Then we use those replicating tools to (build enough copies of themselves to) build computronium. Then we use *that* computronium (hardware cellular automata) as a better computer.
"Excellent paper (several chapters of a forthcoming book, actually), if you are interested in cellular automata, reversible computation, hardware implementations and CAM modelling, you should check it out. ... Caution, this expands into 17381429 Bytes of PostScript." -- recc. Eugene Leitl <eugene@liposome.genebee.msu.su>
DAV: I agree. This paper has good stuff about the interaction (co-evolution) of hardware and software, efficient realization of computational architectures, some of the background thinking behind the massively parallel CAM-8 processor, and likely directions of future computing devices.
the theory behind the CAM-8 http://www.im.lcs.mit.edu/cam8.html a dedicated cellular automata processor. http://www.im.lcs.mit.edu/broch/med1.html mentions that it is fast at generating the data needed for holograms.
Small size is not everything - Q*Bert-based systems will require more cells for some tasks than the equivalent when expressed in (say) the Margolus neighbourhood. However for the majority of applications which involve performing computations, the additional richness provided by a larger LUT seems likely to be wasted much of the time, doing the equivalent of transporting signals from one point to another. We believe the advantage of the smaller cells in terms of their smaller size and higher update frequency will commonly win out.
Finally the Q*Bert neighbourhood is essentially hexagonal - and there is a general additional efficiency of hexagonal packing schemes compared to rectangular ones - a matter I will not go into further here.
[89] Toffoli, T. and N. Margolus, _Cellular Automata Machines -- a new environment for modeling_, MIT Press (1987)
DAV: Which of the 2 CA rules detailed in the _Crystalline Computation_ paper is more "interesting" (can build more compact logic circuits out of): "BBMCA", or "Critters" ? (David Cary thinks they are both "more interesting" than the Conway rules) More importantly, since hydrodynamics required a triangular grid, what rules are most "interesting" on triangular grids ? (Perhaps that grid has rules that are "more interesting" than any square-grid rules).
(a) blocking on even steps:

        aabbccddee.....
        aabbccddee.....
        ffgghhiijj.....
        ffgghhiijj.....
        kkllmmnnoo.....
        kkllmmnnoo.....
        ...
        ...zz

    blocking on odd steps:

        abbccddee.....a
        fgghhiijj.....f
        fgghhiijj.....f
        kllmmnnoo.....k
        kllmmnnoo.....k
        ...
        abbccddee...zza

(b)     OO -> OO    *O -> *O    *O -> O*
        OO    OO    OO    OO    O*    *O

        *O -> O*    O* -> **    ** -> **
        *O    O*    **    *O    **    **
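The table above can be coded directly. This is a sketch of one Critters update as I reconstruct it: 2x2 Margolus blocks, the rule extended by rotational symmetry (two-particle blocks are complemented, which moves the pair to the opposite diagonal or edge; three-particle blocks rotate 180 degrees), and even steps using blocks aligned at (0,0) and odd steps at (1,1), per (a).

```python
# Critters block rule and one blocked update step on a torus.
def critters_block(b):
    tl, tr, bl, br = b
    n = tl + tr + bl + br
    if n == 2:            # both kinds of pair change: complementing the block
        return (1 - tl, 1 - tr, 1 - bl, 1 - br)
    if n == 3:            # three particles: rotate the block 180 degrees
        return (br, bl, tr, tl)
    return b              # 0, 1, or 4 particles: unchanged

def critters_step(grid, odd):
    """One update; grid is a list of lists of 0/1 with even dimensions."""
    h, w = len(grid), len(grid[0])
    off = 1 if odd else 0             # alternate the blocking each step
    new = [row[:] for row in grid]
    for i in range(off, h + off, 2):
        for j in range(off, w + off, 2):
            i0, j0 = i % h, j % w
            i1, j1 = (i + 1) % h, (j + 1) % w
            tl, tr, bl, br = critters_block(
                (grid[i0][j0], grid[i0][j1], grid[i1][j0], grid[i1][j1]))
            new[i0][j0], new[i0][j1] = tl, tr
            new[i1][j0], new[i1][j1] = bl, br
    return new
```

Each of the 16 block states maps to a distinct state and every case conserves the number of 1's, which is easy to check by enumeration.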
Fig. 1.5 The invertible "Critters" CA.
(a) The solid and dotted blockings are used alternately.
(b) The Critters rule.
The Critters rule is ... "rotationally symmetric" ... "conserves 1's (particles), and only 3 cases change." ... "each of the 16 possible initial states of a block is turned into a distinct result state. Thus the Critters rule is invertible." ... "Unlike the gliders in Life, Critters gliders are quite robust. ... when two of these gliders collide in an empty region ... ... If nothing hits this blob for a while, we always see at least one of the gliders emerge."
(a)     OO -> OO    *O -> OO    *O -> O*
        OO    OO    OO    O*    O*    *O

        *O -> *O    O* -> O*    ** -> **
        *O    *O    **    **    **    **
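As I reconstruct the table above, only two canonical cases change: a lone particle crosses to the opposite corner of the block, and a diagonal pair flips to the other diagonal. A sketch of that block rule:

```python
# BBMCA block rule as reconstructed from the table above (rotational
# symmetry extends the two changing cases to all orientations).
def bbmca_block(b):
    tl, tr, bl, br = b
    n = tl + tr + bl + br
    if n == 1:                    # lone particle moves to the opposite corner
        return (br, bl, tr, tl)
    if n == 2 and tl == br:       # diagonal pair flips to the other diagonal
        return (tr, tl, br, bl)
    return b                      # everything else is unchanged

bbmca_block((1, 0, 0, 0))   # (0, 0, 0, 1): the particle crossed the block
```

This rule also maps each of the 16 block states to a distinct state, so it is invertible, and every case conserves particles.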
Fig. 1.8 An invertible Billiard Ball Model CA.
(a) The BBMCA rule
...
"rotationally symmetric"
"conserves 1's (particles), and only 2 cases change."
"we can recover macroscopic 2D hydrodynamics from a model that is only slightly more complicated than the HPP gas. A single-speed model with 6 particles per site, moving in 6 directions on a triangular lattice, will do. If all zero-net-momentum collisions cause the molecules at the collision site to scatter into a rotated configuration, and otherwise the particles go straight, then in the low speed limit we recover isotropic macroscopic fluid dynamics."
[100] Yakhot, V. and S. Orszag, "Reynolds number scaling of cellular-automaton hydrodynamics", _Phys. Rev. Lett._ (1986), 1691-1693.
It has been shown that self-replicators can be designed in cellular automata #cellular_automata . Some of this deals with ``real'', physical replication, while other parts -- cellular automata, quines, etc. -- deal only with patterns in a computer. Should I separate them ? But some of the theory applies to both.
related to reconfigurable robots robot_links.html#reconfigurable and robot construction (humans building robots) robot_links.html#construction and tool closure 3d_design.html#closure . and the bootstrap problem [FIXME:] [FIXME: cross-link all the self-replication stuff on my web pages. nano, robot, cellular automata, etc. Point back and forth from ``replication'' section to computer architecture # cellular automata robots nanotech idea_space and unknowns ]
The ``Game of Life'' http://www.astro.virginia.edu/~eww6n/life/ is the best-known cellular automaton, invented by John Conway and popularized by Martin Gardner. Lots and lots of interesting properties and pretty animations have been collected for this cellular automaton. But David Cary doubts that this is really the most interesting ("computationally compact") cellular automaton; other cellular automata are likely to have even more pretty animations and properties.

"replicator - a Life object which repeatedly forms copies of itself. Such things are known to be possible in Life, but no example is known. But in the HighLife variant, there is a simple replicator."

"HighLife - an alternate set of rules similar to Conway's, but with the additional rule that 6 neighbors generates a birth. Most of the interest in this variant is due to the replicator that evolves from:
    ***.
    ...*
    ...*
    ...*

-- http://www.cs.jhu.edu/~callahan/glossary.html

"unit Life cell: a pattern with two states, which is determined by its previous state and the previous state of its neighbors, using exactly the rules used to compute it; that is, it simulates its own universe. None have been constructed in Conway's Life yet."

When I (DAV) talk about one entity replicating, I mean that we end up with 2 (or more) practically identical copies of the original, in every way (in particular, size). A ``unit Life cell'' seems to me to be somehow related to the so-called ``replicating-tile'' http://www.geocities.com/alclarke0/PolyPages/Reptiles.htm (which creates identical copies, but scaled much larger).
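For experimenting with HighLife patterns like the seed above, one generation of the rule (birth on 3 or 6 neighbours, survival on 2 or 3) can be sketched on a small torus; the set-of-coordinates representation here is my choice, not anything from the source.

```python
# One generation of HighLife (B36/S23) on a w x h torus.
from collections import Counter

def highlife_step(cells, w, h):
    """cells is a set of live (x, y) coordinates; returns the next set."""
    counts = Counter(((x + dx) % w, (y + dy) % h)
                     for (x, y) in cells
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    return {c for c, n in counts.items()
            if (c in cells and n in (2, 3))        # survival: 2 or 3
            or (c not in cells and n in (3, 6))}   # birth: 3 or 6

blinker = {(1, 0), (1, 1), (1, 2)}
highlife_step(blinker, 8, 8)   # {(0, 1), (1, 1), (2, 1)}, just as in Life
```

The only difference from Conway's rule is the extra birth-on-6 clause, which is what makes the simple replicator possible.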
"Looking (and dreaming) toward the future, one can imagine nano-scale (bioware) systems becoming a reality, which will be endowed with evolutionary, reproductive, regenerative, and learning capabilities." http://lslwww.epfl.ch/pages/tutorials/poe/home.html
-- ``Beyond computation: a talk with Rodney Brooks'' 2002
http://www.kurzweilai.net/meme/frame.html?main=/articles/art0475.html |
http://www.edge.org/3rd_culture/brooks_beyond/beyond_p4.html

We're also trying to build self-reproducing robots. We've been doing experiments with Fischer Technik and Lego. We're trying to build a robot out of Lego which can put together a copy of itself with Lego pieces. Obviously you need motors and some little computational units, but the big question is to determine what the fixed points in mechanical space are to create objects that can manipulate components of themselves and construct themselves. There is a deep mathematical question to get at there, and for now we're using these off-the-shelf technologies to explore that. Ultimately we expect we're going to get to some other generalized set of components which have lots and lots of ways of cooperatively being put together, and hope that we can get them to be able to manipulate themselves. You can do this computationally in simulation very easily, but in the real world the mechanical properties matter. What is that self-reflective point of mechanical systems? ...
see #simple_cpu and minimal_instruction_set.html
------------------------------
Date: Mon, 19 Jun 2000 16:27:44 +0200
From: Tim Böscke
To: <MISC>
Subject: Re: MISC

> >The _real_ minimum instruction set is one with a
> >subtract-and-branch-when-negative instruction btw.
>
> Eugene Styer mentions six 'one instruction computers' at
> <http://eagle.eku.edu/faculty/styer/oisc.html>

Well - if you look at them closely it becomes obvious that the six
machines are after all just flavours of two ideas.

1) The move machine
2) subtract, branch

The mentioned "subtract" machine is a combination of both.

However I dont really consider 1) as a one instruction machine. After
all it is just a fancy way of doing microcoding. A move to the PC is
after all not just a move, but a JMP. And thus you already have two
instructions. The same goes for the ALU stuff.
------------------------------
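The subtract-and-branch machine mentioned above fits in a few lines of Python. This sketch uses the common "subleq" flavour (branch when the result is less than or equal to zero); the exact variant meant in the post, and my halt convention, are assumptions.

```python
# One-instruction "subleq" machine: each instruction is three words
# a, b, c; it does mem[b] -= mem[a], then jumps to c if the result <= 0,
# else falls through to the next triple. Code and data share one array.
def subleq(mem, pc=0, max_steps=1000):
    for _ in range(max_steps):
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        if a < 0:                      # my halt convention: negative 'a' operand
            break
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
    return mem

# y += x, using a scratch cell T that starts at 0:
#   T -= x  (T = -x),  then  y -= T  (y += x)
prog = [9, 11, 3,     # 0: mem[11] -= mem[9],  continue at 3 either way
        11, 10, 6,    # 3: mem[10] -= mem[11], continue at 6 either way
        -1, 0, 0,     # 6: halt
        5,            # 9:  x
        7,            # 10: y
        0]            # 11: T (scratch)
subleq(prog)
prog[10]   # 12
```

Note how the unconditional fall-through is expressed: setting c to the address of the next triple makes the branch target the same whether or not the branch is taken.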
------------------------------
Date: Thu, 22 Jun 2000 00:27:00 -0600
From: Roger Ivie <rivie at teraglobal.com>
To: MISC
Subject: Re: MISC

>Well - if you look at them closely it becomes obvious that
>the six machines are after all just flavours of two ideas.
>
>1) The move machine
>2) subtract, branch
>
>The mentioned "subtract" machine is a combination of both.
>
>However I dont really consider 1) as a one instruction machine.
>After all it is just a fancy way of doing microcoding. A move
>to the PC is after all not just a move, but a JMP. And thus
>you already have two instructions. The same goes for the ALU
>stuff.

Bear in mind that the PC doesn't have to be an actual register. A good example of this is the PDP-5, which stored the PC in location 0. Executing an instruction began by fetching location 0 and using it as the address of the instruction to be executed. It was possible for a DMA device to cause a jump by writing to location 0; this was, in fact, how the front panel loaded an address into the PC.
--
Roger Ivie rivie@teraglobal.com
Not speaking for TeraGlobal Communications Corporation
------------------------------
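DAV: To make the ``subtract-and-branch-when-negative'' idea concrete, here is a minimal simulator sketch. The encoding is my own invention for illustration (it is not anyone's standard OISC format): each instruction is three words (a, b, target) meaning mem[a] := mem[a] - mem[b], then jump to target if the result went negative; a negative target halts.

```python
# Minimal simulator for a "subtract and branch when negative" (SBN)
# one-instruction machine.  Encoding (my own, for illustration):
# each instruction is 3 words (a, b, target):
#   mem[a] := mem[a] - mem[b]; jump to target if the result is negative,
#   otherwise fall through to the next instruction.
# A negative target means "halt".
def run_sbn(mem, pc=0, max_steps=10000):
    for _ in range(max_steps):
        a, b, target = mem[pc], mem[pc + 1], mem[pc + 2]
        if target < 0:
            return mem
        mem[a] -= mem[b]
        pc = target if mem[a] < 0 else pc + 3
    raise RuntimeError("step limit exceeded")

# Example: negate mem[X] into mem[Z] using only SBN.
X, Z = 9, 10
prog = [Z, Z, 3,     # Z := Z - Z = 0; result never negative, fall through
        Z, X, 6,     # Z := 0 - X = -X; taken or not, execution reaches 6
        0, 0, -1,    # halt
        7,           # mem[X]
        99]          # mem[Z], overwritten
run_sbn(prog)
assert prog[Z] == -7
```

Conditional jumps, copies, and addition can all be built up from this one instruction, at the cost of several SBN triples each.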
Steamer16: a high-performance homebrewer's microprocessor http://www3.sympatico.ca/myron.plichota/steamer1.htm by Myron Plichota <myron.plichota at sympatico.ca>. written in VHDL and prototyped on a single wire-wrapped protoboard. Has only 7 instructions. Packets of 5 instructions (5 instructions * 3 bits = 15 bits/packet) execute in 6 cycles/packet at a cycle rate of 20 MHz. Both the address and data busses are 16 bits. Unusual ``ArF'' call/return protocol.
qUark ../mirror/quark.txt a viable stack-computer with 4-bit opcodes (c) vic plichota, original concept by Myron Plichota Dec '98.
------------------------------
Date: Sat, 20 Feb 1999 12:54:34 +0300
From: "Stas Pereverzev"
To: MISC
Subject: Re: nFORTH v2.3

>Comments, folks?

You need only five instructions, not 16. They are:

ALU:
1. nand
2. shr
RAM:
3. store ( addr n -- )
4. lit ( -- n )
CONTROL:
5. ncret \ JUMP to addr in N, if carry flag isn't set in T,
         \ also drop both T and N

Also, if PC is memory variable (or can be addressed as memory variable) we can awoid "ncret" instruction: In that case we sholud use NCSTORE instead STORE:

ncstore ( addr n flag -- )
ncstore ( addr n 0 )  - same as store,
ncstore ( PC n -1 )   - same as ncret

We need only 2 bits per instruction in that case. That all folks ;-)

Stas.
------------------------------
(does this really work ???)
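DAV: As a partial sanity check of the ALU half of that claim, here is a sketch (Python standing in for hardware) showing NOT, AND, OR, XOR, and ripple-carry addition all synthesized from NAND alone. One caveat: I use a left shift for carry propagation, but the machine above only has shr; on that hardware the carry chain would have to be handled bit-serially, e.g. with a table of power-of-two constants loaded via lit.

```python
# Everything in a conventional ALU reduces to NAND (plus a shift for
# carry propagation).  16-bit words for this sketch.
MASK = 0xFFFF

def nand(a, b):
    return ~(a & b) & MASK

def bnot(a):    return nand(a, a)
def band(a, b): return bnot(nand(a, b))
def bor(a, b):  return nand(bnot(a), bnot(b))

def bxor(a, b):
    # classic 4-NAND construction of XOR
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def add(a, b):
    # sum bits = a xor b, carry bits = (a and b) shifted left;
    # repeat until no carries remain.
    while b:
        a, b = bxor(a, b), (band(a, b) << 1) & MASK
    return a

assert bnot(0x0000) == 0xFFFF
assert bxor(0b1100, 0b1010) == 0b0110
assert add(1234, 4321) == 5555
assert add(0xFFFF, 1) == 0       # wraps at 16 bits
```

So the ALU part does work; the interesting open question is how expensive the control flow and carry handling get with only shr and ncret.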
Turing Tar Pit.
Some things you might think about if you are designing a new instruction set (for a Von Neumann machine).
]
The ``JavaOS'' #JavaOS attempts to do this in software, so it doesn't need MMUs, etc.
DAV: If you're going to be implementing FORTH on top of the architecture anyway, what with its threaded code and all (lists of subroutines to call, rather than lists of call instructions to those subroutines) ... do you really need a ``call'' instruction ? Or do all the ``calls'' you need really reduce down to the ``next'' instruction at the end of a subroutine, to look up the next subroutine on a list and call it ? -- DAV
[FIXME: very muddled thinking here:]
With threaded execution, what we have is: blocks: some register RN points to the ``next item'' in the list. (RN never points to a real assembly-language opcode; it points somewhere in the middle of a list of pointers, each of which points directly to a real assembly-language opcode.) Let's say that happens to be a (leaf, assembly-language) routine. When the (leaf, assembly-language) routine is done, it does NEXT (conditional ?). If the ``next item'' on the list is a pointer to another (leaf, assembly-language) routine, NEXT simply does
PC := [[RN]]
RN++

(compare to normal return instructions, which do
PC := [SP]
SP++ )

So how does NEXT know whether this new routine is a leaf or not ? Or do we just assume it's always a leaf, and add a special "receive" opcode instruction(s) at the beginning if it happens to not be a leaf (reminiscent of the ARM procedure call standard) ? Something like this:
; function header ("receive") ; push old RN to return stack [SP] := RN SP-- ; make RN point into block of subroutine, rather than calling block RN := [PC + n] ; NEXT PC := [[RN]] RN++At the end of a (non-leaf) subroutine, the instruction at the end of the block needs to (leaf) point to code that properly does a return.
; pop old RN from return stack
SP++
RN := [SP]
; NEXT
PC := [[RN]]
RN++
Control structures (things that are not blocks): sequence, selection, iteration: [FIXME:] Perhaps it would be interesting to deal with these only at the threaded-call level, not at the assembly-language level. Then we have: selection: using conditional returns: ... perhaps add an extra header to every subroutine; to jump *after* that header implies ``unconditionally execute this routine'', but to jump right at the beginning implies ``conditionally execute this routine: only when T is not zero''. I.e., that prepended extra header drops T, does a NEXT when that old value of T is zero, otherwise falls through.
How do I loop through a million pixels without leaving a million return addresses on the stack, when my only conditional is a conditional return ? just use ``fake jump'' of call/drop ?
[/muddled thinking]
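DAV: The NEXT / receive / return mechanism above can be sketched in a few lines of Python. This is a toy model, not real FORTH: Python lists stand in for the blocks of pointers, Python functions for leaf assembly-language routines, and a (list, index) pair for register RN.

```python
# Toy direct-threaded interpreter.  A "word" is either a primitive
# (a Python function taking the data stack) or a list of words.
# rn = (list, index) plays the role of register RN above; nesting into a
# non-leaf word pushes the old rn on the return stack ("receive"), and
# running off the end of a list pops it (the return).
def run(word):
    rn, rstack, stack = (word, 0), [], []
    while True:
        lst, i = rn
        if i >= len(lst):            # end of block: "return"
            if not rstack:
                return stack
            rn = rstack.pop()
            continue
        rn = (lst, i + 1)            # NEXT: advance RN past this cell
        w = lst[i]
        if callable(w):              # leaf routine: just execute it
            w(stack)
        else:                        # non-leaf: save RN, descend into it
            rstack.append(rn)
            rn = (w, 0)

lit = lambda n: (lambda s: s.append(n))
plus = lambda s: s.append(s.pop() + s.pop())

add12 = [lit(5), lit(7), plus]       # a small "colon definition"
twice = [add12, add12, plus]         # calls add12 two times, then adds
assert run(twice) == [24]
```

Note that there is indeed no explicit "call" instruction here: calling is just what NEXT does when the next cell is a pointer to another list rather than to a primitive.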
-- Joseph H Allen http://www.dejanews.com/getdoc.xp?AN=408499073.1

Basically I would not have made a weaker CPU if it came at the expense of being able to use data structures. This means that I prefer to have some form of indexing instead of the cheaper alternatives: direct addressing only (like the original Von Neumann computer with the resulting need for self-modifying code and the execute instructions of future computers), register indirect addressing only (like the 8080), or memory indirect addressing only (like the pdp-8). I had considered the 6502 method of indexing (where the base address is fixed, or stored in a zero-page location, and the 8-bit offset is in a register), but having programmed both 6502s and 6800s, I know that fixed offsets are to be preferred to fixed base-addresses (thus the 6502 index registers are, IMHO, useless). So I know that I need to encode two fields for indexed instructions: where the base is and the value of the offset. You want the offset to be as large as possible since that limits the size of the data structures you can have, thus the current 2-bit/7-bit format. If the base address was not going to be in an actual register, the next best place is the top of stack (which is much better than in a zero-page location).
The other nice thing about indexing is that it allows you to make fast unrolled move loops:
lda 0,x
sta 0,y
lda 2,x
sta 2,y
etc.

At 8 cycles for loads and 10 cycles for stores and two bytes per transfer, this gives a move speed of 9 cycles per byte.
If I could get high memory to memory copy speed (which I think is important) with indexing, then I wouldn't have to worry wasting op-code space with any other method (I.E., special instructions or auto-increment addressing modes).
Another lesson from the 6502 (and the 8088 for that matter), is that it just really sucks if the largest datum you can manipulate is smaller than your address size. This means that the accumulator needs to be the same size as the PC -- 16-bits.
So there you go. If you have any ideas for improving it without expending too much extra hardware, I'd like to hear them.
Lots of people have written far better descriptions.
BBW    Branch Both Ways
BEW    Branch Either Way
BBBF   Branch on Bit Bucket Full
BH     Branch and Hang
BMR    Branch Multiple Registers
BOB    Branch On Bug
BPO    Branch on Power Off
BST    Backspace and Stretch Tape
CDS    Condense and Destroy System
CLBR   Clobber Register
CLBRI  Clobber Register Immediately
CM     Circulate Memory
CMFRM  Come From -- essential for truly structured programming
CPPR   Crumple Printer Paper and Rip
CRN    Convert to Roman Numerals
Screwy Things
Life is full of compromises, and microcontrollers are no exception. Every microcontroller I have ever used has something a little strange about it. Here's a list of a few things I've found that didn't quite make sense to me. Consider them a "heads-up" so you don't have to find them yourself.
...
... The process of designing the AVR instruction set included a lot of gives and takes. We have consulted C compiler experts who have given us a lot of advice on how to tune the instruction set to yield compact C-code. As an example, the compiler experts advised us to sacrifice the ADDI for a SBCI (subtract immediate with carry).
For those instructions that are missing, convenient workarounds exist. The code efficiency of the AVR should prove that we have found a good compromise between which instructions to implement, and which to omit.
[instruction set design]
DSPs are heavily optimized to do this very rapidly, often finishing in N + a few cycles.
They do this by implementing "MAC", multiply-and-accumulate, A = A + X*Y, plus various other tricks to get zero-overhead loops.
Other processors typically break this down into a "multiply" instruction and an "add" instruction.
Some very simple processors don't have a multiply instruction; they synthesize it out of repeated shifts-and-adds.
I've seen one very simple processor that didn't even have an add instruction. It synthesized it out of AND, XOR, shifts, etc.
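The shift-and-add approach looks like this (a sketch; the word width and wrap-around behavior are arbitrary choices for illustration). It costs roughly one conditional add plus two shifts per bit of the multiplier, which is the cycle count the square()/t() idea below tries to undercut.

```python
# Multiply synthesized from shifts and adds, as on processors with no
# multiply instruction: one conditional add per bit of the multiplier.
def shift_add_mul(x, y, width=16):
    acc = 0
    for _ in range(width):
        if y & 1:             # low bit of multiplier set:
            acc += x          #   add the (shifted) multiplicand
        x <<= 1               # shift multiplicand up ...
        y >>= 1               # ... and multiplier down
    return acc & ((1 << width) - 1)   # wrap to the word width

assert shift_add_mul(7, 6) == 42
assert shift_add_mul(300, 300) == (300 * 300) & 0xFFFF
```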
I wonder if there's room for something intermediate between dedicated multiply hardware and synthesizing multiply out of shifts-and-adds.
In particular, I've been wondering if a dedicated square(x) or triangular-number t(x) = x*(x+1)/2 function would take significantly less silicon area than a full multiply(x,y) multiplier.
Then A = A + X*Y can be synthesized using
A = A + X*Y is equivalent to
A = A + ( square(X+Y) - square(X) - square(Y) )/2
or
A = A + t(X+Y) - t(X) - t(Y)

... also A = square(X) is equivalent to A = 2*t(X) - X

If we assume square(x) and t(x) and ADD and SUB and SHIFTR operate in a single cycle, and even if we don't combine 2 of them in a single cycle, that gives 3 function cycles plus 4 addition cycles (plus a shift if we use square).
These 7 cycles are far less than the cycles needed to synthesize multiply out of shifts and adds ... although admittedly more than a full sized single-cycle multiplier.
The gap is reduced even further if we combine t() and ADD in a single cycle -- t()-and-ADD, analogous to multiply-and-accumulate.
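A quick brute-force check of these identities (with t(n) = n*(n+1)/2, the triangular numbers; both divisions are exact, so integer arithmetic suffices, negative operands included):

```python
# Verify:  X*Y == ( square(X+Y) - square(X) - square(Y) )/2
#          X*Y == t(X+Y) - t(X) - t(Y)
#          square(X) == 2*t(X) - X
def t(n):
    return n * (n + 1) // 2          # triangular number; always exact

def square(n):
    return n * n

def mac_square(a, x, y):             # A = A + X*Y via the square() unit
    return a + (square(x + y) - square(x) - square(y)) // 2

def mac_tri(a, x, y):                # A = A + X*Y via the t() unit
    return a + t(x + y) - t(x) - t(y)

for x in range(-50, 50):
    for y in range(-50, 50):
        assert mac_square(100, x, y) == 100 + x * y
        assert mac_tri(100, x, y) == 100 + x * y
        assert square(x) == 2 * t(x) - x
```

The mac_tri form is where the 3-function-cycle, 4-add-cycle count above comes from: one add to form X+Y, three t() lookups, and three more adds/subtracts to accumulate.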
[FIXME: I have a lot of information on subroutines and the importance of well-factored code scattered about ... should I gather it all together here, or since it's so very important, leave a brief reference at each place ?]
(see also data_compression.html#program_compression )
Calls and returns are very important. Calls obviously take much less RAM space than in-line code except for the most trivial subroutines. Few people realize that calls can also take significantly less time than in-line code, because the most heavily re-used subroutines stay in the high-speed cache. Some programming styles (FORTH) create heavily factored code, where "call" is the most frequently executed instruction at run time and typically 1/3 of the (static) instructions [see _Stack Computers_ by Philip Koopman 1989, in particular the section "7.1 the importance of fast subroutine calls" http://www.ece.cmu.edu/~koopman/stack_computers/sec7_1.html which says ``expensive procedure calls lead to poorly structured programs''. Also,
``... Good Forth programming style ... Subroutines often only consist of 5 or 10 instructions. A static frequency of approximately 50% of the instructions being subroutine calls ... especially effective in environments with limited memory capacity. It also encourages the use of machines with fast subroutine calls.'' -- _Stack Computers_ by Philip Koopman http://www.ece.cmu.edu/~koopman/stack_computers/sec3_3.html#3322 ].

Since calls occur so frequently, it is important to minimize their RAM and cycle time overhead. Koopman suggests combining them with other instructions, so every instruction combines a call, branch, or exit with some other operation (since, at the hardware level, a data operation and a program flow operation can be executed in parallel).
From a runtime perspective, every call is paired with a return. So (everything else being equal) whatever minimizes (calling time + receiving time + returning time + resume time) is best. Sometimes you can get a net improvement by making one of these slightly slower and another one faster.
From a code space perspective, each function is likely to be called more than once, so there are many more call statements than return statements (unless your subroutines have lots of exit points). So to minimize space, it's probably better to reduce (call size + resume size) even at the expense of (receive size + return size).
Ah, here it is: "Note that, in contrast to popular myths, subroutine threading [a list of consecutive function calls] is usually slower than direct threading [a list of consecutive function pointers] ." -- "Threaded Code" article by Anton Ertl. http://www.complang.tuwien.ac.at/forth/threaded-code.html
also see Koopman's taxonomy of microprocessors [FIXME: ]
_Computation Structures_ by Stephen A. Ward and Robert H. Halstead, Jr. (c)1990 by The Massachusetts Institute of Technology The MIT Press
p.606 Models of Computation The von Neumann model of computation ... alternative models ... include:
The above list is neither exhaustive nor even representative...
... p.612 Taxonomy of Multiprocessors ... Michael Flynn's taxonomy ...
Ward and Halstead
quotes from _Computation Structures_ by Stephen A. Ward and Robert H. Halstead, Jr. (c)1990 by The Massachusetts Institute of Technology The MIT Press
... p. 347 Programming languages offer several /classes/ (allocation disciplines) for storage. These differ ... in the degree to which responsibility for allocation and deallocation of storage is assumed implicitly by the system (as opposed to being left to the programmer).
... common storage classes:
... p.354 the /general-register/ organization, in which ... active registers are available for explicit use by programs to hold frequently accessed data. ...
Proponents of the general-register organization cite two relatively independent arguments in its favor:
... objections to the general-register scheme:
* Lack of implementation transparency. The number of general registers is effectively "frozen" in the machine-language specification; a higher-performance implementation, offering a greater number of registers, will require modification of the machine language and hence of the software designed for it.
* Programming difficulty. The need to allocate registers to variables complicates the programming task.
... There are alternatives to the general-register organization that offer, at some additional implementation cost, its performance and compactness advantages while avoiding the above objections. ... cache-memory ... ... tricks for address encoding ... [DAV: is it really possible to have a "no programmer-visible registers" architecture ?]
... p.365 Pascal and Ada are also concerned with /safety/ in the sense that they strive to make as many programming errors as possible detectable at compile time rather than run time.
[FIXME: gather other NN information scattered over my other pages here.]
[FIXME: move to #simple_cpu
... requires 19 control signals, most of them directly mapped from the 16 bit instruction word from memory. sometimes you could make it do 2 completely different things at once by designing the instruction bits ... Very simple ... even addition is broken into even smaller instructions.
"statistical addition": Addition of TOS and NOS is synthesized by simultaneously calculating
TOSnext = (TOS and NOS)*2; // carry bits
NOSnext = TOS xor NOS;     // sum bits

then repeating until 0==TOS, and the sum is in NOS. ...
DAV: Say we have a long column of numbers to add (in particular, trying to synthesize MAC multiply-and-accumulate). Without changing this architecture: Rather than running every sum to completion, perhaps quicker to combine 3 or 4 or 7 values at a time into 2 partial-sum values:
carry_bits = 2*majority( a b c ); sum_bits = a xor b xor c;
Is there any minor change to this architecture that would speed up MAC ?
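DAV: A sketch of both tricks -- the iterated carry/sum addition above, and the majority-function 3-into-2 compression (in hardware terms, a carry-save adder row). This assumes unbounded Python integers rather than a fixed word width:

```python
# "Statistical addition": iterate (carry, sum) until the carries die out.
def add_iterative(tos, nos):
    while tos:
        tos, nos = (tos & nos) << 1, tos ^ nos   # carry bits, sum bits
    return nos

# 3:2 compressor: combine three addends into two partial-sum values
# in a single step, deferring full carry propagation until the end.
def compress_3_to_2(a, b, c):
    majority = (a & b) | (b & c) | (a & c)
    return majority << 1, a ^ b ^ c              # (carry bits, sum bits)

assert add_iterative(1234, 4321) == 5555
carry, s = compress_3_to_2(10, 20, 30)
assert add_iterative(carry, s) == 60
```

Accumulating a long column this way -- compressing addends into two partial sums and only running the carry iteration to completion once at the end -- is exactly the structure inside a hardware multiplier's adder tree.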
DAV: If the CPU synthesizes addition anyway, out of logical ops and bit-shifting, does it make any sense to choose some *other* counting sequence, different from natural binary? Perhaps grey codes? LFSR sequences?
http://www.dejanews.com/getdoc.xp?AN=411378028 says:
From: mj@isy.liu.se (Michael Josefsson)
Subject: How to do the smallest processor ever?
Date: 13 Nov 1998 00:00:00 GMT
Newsgroups: comp.arch

Hi all, while everybody seem to make everything as parallell as possible (to get the speed up), I tinker with the idea of making the easiest possible processor. Wherever I can, I want to get away from hardware. So these are some ideas that I have come up with (discussions baits perhaps):

1. Make it serial(!). Yes, it slows down things but OTOH there is only 1 bit busses. Adding 2 numbers (8-bit) would require a great number of clocks in multiples of eight. Slow? Yes! But feasible.

2. The original idea used only one 1 bit bus but I have come to the conclusion (opposite the reason to do this in the first place) that it is far more easy to implement something using 2 or 3 buses.

3. Minimise the number of registres. Yes a PC, IR and AR (address register) would be hard to get by without, but otherwise it would suffice to have only one other register - the accumulator. Incrementing PC is done with the ALU.

4. External memory is always parallell so I imagine that there would be 8 bit busses to the outside.

5. Speed would be slow with some 10-15 clocks to get the smallest operation done. Given a TTL implementation the total number of mips, at 10 MHz say, would be around 0.7 - not much, but then again what would one expect from a handful of TTLs?

[ASCII schematic, garbled in transcription: 1-bit buses everywhere; an 8-bit shifting accumulator clocked bit-by-bit, a 1-bit ALU, IR/opcode decode, program counter, and address register (PC and IR possibly parallel), with parallel-to-serial and serial-to-parallel converters between the 1-bit core and the 8-bit external memory buses.]

Has this been done anywhere? The focus seems to get everything as fast as possible and not as small as possible.

Cheers, /Micke Josefsson
University of Linkoping
Sweden
One reply http://www.dejanews.com/getdoc.xp?AN=411507856.1
From: Gary Boone <gary@micromethods.com>
Subject: Re: How to do the smallest processor ever?
Date: 13 Nov 1998 00:00:00 GMT
Organization: Micro Methods
Newsgroups: comp.arch

Regarding minimal processor history aspects:

1. See "Computer Structures ..." Siewiorek, Bell & Newell, 1982, beginning at page 581, for a description of TMS1000 microcontrollers, an increased function (and cost reduced) re-design based on TMS0100 architecture that I worked on.

2. In 1971 "state of the art" 10-micron 1-metal PMOS, TMS0100 4-bit data paths were used, not for performance, rather to allow less chip area (and single-chip design) compared to then-conventional 1-bit serial data paths. Consider, for example, area savings due to 3-transistor RAM compared to conventional 6-transistor shift register memory. Control decode for 4-bit data paths also saved (no "bit" timing).

3. The "most minimal" processor I know of was made at Litronix circa 1976, for use in digital watches. As I recall, this processor could only count, not even add. The architect was Steve McCrystal. I've lost track of my copy of his amazing instruction set. Steve, if you see this, please post it!

Gary Boone
The Motorola MC14500B http://www.microprocessor.sscc.ru/great/s2.html#MC14500B http://www.questlink.com/keyword.htm?proc=keyword&view=list&kwrd=14500 http://www.questlink.com/page.htm?proc=brief&part=MC14500B&vendor=11205&brief=22059&view=153540 had only 16 pins; processed everything serially.
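DAV: The heart of such a bit-serial machine is just a full adder plus a carry flip-flop, consuming one bit of each operand per clock, least-significant bit first -- hence Josefsson's "multiples of eight" clocks for an 8-bit add. A sketch, with each loop pass standing in for one clock:

```python
# Bit-serial addition: one full adder and a 1-bit carry register,
# clocked once per bit, least-significant bit first.
def serial_add(a, b, width=8):
    carry, result = 0, 0
    for k in range(width):                   # one loop pass = one clock
        abit, bbit = (a >> k) & 1, (b >> k) & 1
        s = abit ^ bbit ^ carry              # full-adder sum bit
        carry = (abit & bbit) | (carry & (abit ^ bbit))
        result |= s << k
    return result

assert serial_add(100, 55) == 155
assert serial_add(0xFF, 1) == 0      # wraps at the 8-bit word width
```

The hardware cost is tiny (one full adder, one flip-flop, and the shift registers); the price is the factor-of-width slowdown Josefsson describes.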
The Museum of HP Calculators http://www.hpmuseum.org/ mentions some zero-IC calculators (two types: slide rules and discrete transistors).
(see Minimum Instruction Set Computing (MISC) for more stack machines)
see also data_compression.html#space-optimizing_compiler and data_compression.html#compact_instruction_set for more thoughts on compact, dense, non-bloated code.
... ideally suited to embedded real-time control applications. ..." -- p. 601, _Computer Architecture: A Designer's Text Based on a Generic RISC_, book by Feldman and Retter
Includes "a study of FORTH instruction frequencies". Includes a analysis of the effect of increasing or decreasing return stack and/or data stack (how to handle stack overflow / stack underflow). Includes "the minimum" set of 17 operators for the Canonical Stack Machine [FIXME:] .
[FIXME: is the ``propagation of initial biases'' something that is universal enough to include under ``general design'' ? (see also 9.6 The impact of stack machines on computing http://www.ece.cmu.edu/~koopman/stack_computers/sec9_6.html ... I thought I saw Koopman write about this topic in another document ... reminiscent of a rumor I've heard about trains in the U.S. ... ) Koopman's solution to this problem, ``looking at the hardware/software problem as a whole.'', is reminiscent of Buckminster Fuller's ``start with the universe''. But I've also seen this perverted into ``everything must change at once'' computer_architecture.html#everything_must_change . ... the ``cascade of benefits'' is reminiscent of Fuller's ``synergy'' ]

`` ... Software that is able to customize the hardware to meet critical application-specific processing requirements will be able to attempt more difficult tasks on less expensive hardware.
... efficient hardware support for procedure calls, ... ability to tailor hardware to applications based on software requirements. ...
... propagation of initial biases ...
... CPU speeds have outstripped bulk memory speeds ... There are 2 ways to solve this problem:
- speed up average memory access time, and
- increase the amount of work done per memory access.
Cache memory decreases average memory access time ... Other techniques to speed memory access include interleaving banks of memory and pre-fetching opcodes ... methods tend to increase speed ... at the cost of added hardware complexity. Separate data and program memories ...
... increasing the average amount of work done by each opcode fetched from memory... has led to the development of what is now called the Complex Instruction Set Computer (CISC) ...
Truly interactive processing (which does _not_ mean doing batch-oriented edit-compile-link-execute-crash-debug cycles from a terminal) is ... not taught ... in universities.
... there is still considerably more mileage to be gained from uniprocessors by breaking out of past cycles and looking at the hardware/software problem as a whole. The answer lies not with a new hardware architecture that mirrors current software, nor in changing software to suit current hardware. The answer lies in a redefinition of how we think about hardware and software. ...
The first step ... take to heart the philosophy of breaking up a problem into smaller sub-problems. Instead of building a computer that supports procedure calls as special operations, what if we design a computer to expect subroutine calls as its primary mode of operation ?
... Programs are now organized as a tree structure... In such a machine, the very notion of a program "counter" becomes obsolete.
If this machine could actually process procedure calls simultaneously with other operations, modularity in programs would not be penalized. Such a machine would encourage better software design, and could fundamentally alter the way programmers think about programs.
... A WISC machine has high-speed procedure processing capability along with the capability to redefine the instruction set. ... an interesting and somewhat unexpected cascade of benefits is realized. ...
... makes the hardware simpler, faster, and less expensive to develop and manufacture. ...
procedure calls may be made a zero cost in execution time ... a stack-oriented opcode need only take roughly one-quarter of a 32-bit instruction word, the remaining instruction word bits are available to use as a procedure branching address
+--------+----------------------+
| opcode |       address        |
+--------+----------------------+

Figure 7. Generic WISC instruction format
... simple hardware means simple microcode.
... a single-format 32-bit micro-instruction format is more than sufficient for a WISC machine.
... The combination of hardware stacks with writable microcode memory results in the blurring of the boundaries between high level programs, machine code, and microcode.
... a procedure can be transparently replaced with a microcoded primitive simply by replacing the procedure call with an opcode. There is no impact to any other aspect of the source code. ... lead to a view of microcode memory as a cache memory for frequently used operations.
... The most heavily executed procedures can then be partly or wholly transferred from high level code to microcode, resulting in a significant speed increase. [DAV: is it possible to automatically do this for every (inner) loop, at run time / load time ?] ...
... If a suitable microcoded instruction set is used, compiled object code can closely correspond to the original source code, resulting in simpler and more efficient compilers and debugging tools. ...
... WISC machine ... Programmers are not penalized for organizing programs into small, understandable procedures. This results in compact tree-oriented program structures which are composed of hierarchically arranged solutions to sub-problems. Thus programs can be simultaneously optimized for small memory space, fast execution speed, and low development cost. ...
Design of a 32-bit WISC machine
... CPU/32 ...
BIT: | 31    23 | 22         2 | 1    0 |
     |  opcode  |   address    | call/  |
     |          |              |  exit  |

Figure 8. CPU/32 instruction format

....
The CPU/32 has no program counter. Each instruction contains the address of the next instruction. The only exception ... procedure returns ... the return stack value is passed directly through the memory address logic to access the next sequential instruction in the calling program.
While there is no program counter, there is an incrementer within the program memory logic that is used to add a 1 word displacement to procedure call addresses before they are saved on the stack. ... The incrementer is also useful in block memory moves. ... ''
... very reminiscent of Transmeta's ``code morphing'' concept ...
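DAV: A sketch of this program-counter-less idea (my own toy encoding, not the actual CPU/32 bit layout): every instruction pairs a data operation with an explicit successor address plus a call/exit flag, so the data op and the flow op happen "in parallel"; only ``exit'' ignores the address field and pops the return stack instead.

```python
# Each instruction is (op, address, flow): op acts on the data stack;
# flow is "next" (branch to address), "call" (branch to address, saving
# addr+1 via "the incrementer"), or "exit" (pop the saved address).
# op None halts -- a convention of this sketch only.
def run(mem, addr=0, max_steps=10000):
    stack, rstack = [], []
    for _ in range(max_steps):
        op, address, flow = mem[addr]
        if op is None:                 # halt (sketch convention)
            return stack
        op(stack)                      # data op, alongside the flow op
        if flow == "call":
            rstack.append(addr + 1)    # resume point for the matching exit
            addr = address
        elif flow == "exit":
            addr = rstack.pop()
        else:                          # "next": plain successor address
            addr = address
    raise RuntimeError("step limit exceeded")

nop = lambda s: None
lit = lambda n: (lambda s: s.append(n))
dup = lambda s: s.append(s[-1])
mul = lambda s: s.append(s.pop() * s.pop())

mem = {
    0: (lit(6), 4, "call"),    # push 6 and call "square" at address 4
    1: (None,   0, "next"),    # halt
    4: (dup,    5, "next"),    # square: dup ...
    5: (mul,    0, "exit"),    # ... multiply, then return to address 1
}
assert run(mem) == [36]
```

Note how the call costs nothing extra at run time: the instruction that pushes 6 *is* the call, exactly Koopman's point about folding procedure branching into every instruction word.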
DAV: Let's take this program-as-tree-structure idea seriously. Why should a return happen to go to the next statement after the statement that called it ? If the tree is a binary tree, then we can imagine each node calls 2 lower-level routines before returning to its parent higher-level routine. (tail-recursion makes this equivalent to 1 call and 1 jump). ... If this were a pure tree, it would be pretty simple to rearrange the tree so the destination of that 2nd call comes immediately after the instruction that calls it, simplifying to ``jump PC+1'' and leading to Koopman's single-address format. But is that true for the acyclic graph produced when a subroutine is called by many different higher-level nodes ? No. What happens then ? Some cases: When the common subroutine is the left (1st called) subroutine of more than 1 node: No problemo -- Koopman's single-address format still handles it fine.
DAV: When the common subroutine is the right (last called, AKA jumped-to) subroutine of more than 1 node: It's impossible to have more than 1 location ``fall through'' to this common subroutine. The simplest, most compact way to handle this is with a NOP, Jump instruction.
DAV: If the instruction to which this NOP,jmp points also happens to be a NOP,jmp instruction, then this NOP,jmp instruction can be adjusted to point to the destination of *that* jmp instruction -- so there's no more than 1 consecutive NOP.
On p. 69 Koopman discusses ways for a compiler to optimize away this NOP by ``bubbling up'', (slightly) expanding the code. It doesn't expand as much as subroutine in-lining, because we do not duplicate the *entire* subtree, only the right-hand branches. Perhaps we do not even need to duplicate that much ... We only need 2 versions of a subroutine to eliminate NOPs. Call the 1st non-NOP opcode executed by a subroutine op1. One version of that subroutine has op1 embedded in it, called by
(any op) call version1

If the compiler can put something productive into (any op), it uses this version. If the compiler is forced to put NOP into that call, it changes

NOP call version1

into

op1 call version2

where version2 leaves out that 1st op1 and depends on the caller to execute it.
DAV: on-the-fly opcode redefinition: Perhaps it would be interesting to dynamically execute data structures. The ``opcode'' at each node would merely indicate ``this is a leaf'' or ``this is not a leaf'', and that opcode would be redefined every time we wanted to do something different to the tree. Does opcode redefinition really give us enough flexibility to do prefix, infix, and postfix tree handling ?
Gforth http://www.jwdt.com/~paysan/html/gforth_toc.html (GPL'ed version of FORTH) includes Object-oriented Forth http://www.jwdt.com/~paysan/html/gforth_50.html
Forth Research at Institut für Computersprachen and related topics, such as stack-based languages (PostScript, JavaVM), threaded-code interpreters, and stack machines. http://www.complang.tuwien.ac.at/projects/forth.html
Implementation of Stack-Based Languages on Register Machines is pointed to by http://www.complang.tuwien.ac.at/forth/performance.html
David Cary thinks that if you know enough to develop an adequate OS, your time and skill could be applied to other projects more worthy of your effort todo.html .
Nevertheless, there's nothing like producing a system where you wrote every byte of its software. (It may be a waste of time, but it's sure a learning experience). You might want to start with a simple PIC robot_links.html#pic .
See also
[FIXME: consider moving to http://c2.com/cgi/wiki?OperatingSystemsResearch ... http://c2.com/cgi/wiki?OperatingSystemsDesignPrinciples ... or perhaps http://cliki.tunes.org/Microkernel-based%20OS ]
David Cary is fascinated by the variety of computer architectures, and nearly all of the "layers of abstraction" idea_space.html#level of a computer system (from the raw atoms, electrons, and photons at the bottom ... through typography ... to the user interface at the surface). However, I am singularly uninterested in the OS layer. I've seen this sequence of events far too often:
(For example of the ``everything must change'' mentality, see ``a major architectural overhaul is still required, on pretty much all fronts'' -- Ron http://www.cs.caltech.edu/~adam/LOCAL/faq-fork.html#munchkins , or check out Luke 5:36-38 )
Why re-invent the wheel ? There are already far too many OSes out there (in my opinion). Alan Turing proved that any OS can be ported to any computer architecture. Of course we want systems that are better in *every* area, but (correct me if I'm wrong here) it should be possible to just improve one area at a time. ( That's the main idea behind "scaffolding" and "levels" idea_space.html#level )
If you're interested in computer architecture, fine, invent a new architecture, and get an ugly old OS running on this clean new hardware (I hear that Linux is *designed* to be easy to port).
If you want to make your own OS, fine, get it running on ugly old hardware and port a bunch of the ugly old applications to it (there are tons of "free source" applications *designed* to be easy to port).
There's tons of applications (both open and closed source) that are written in C or C++. So you might consider
Porting "gcc" or another C compiler #porting_c to your OS. Once you have the C compiler and C libraries ported to your OS, suddenly thousands of programs "just work".
Porting a FORTH or FROTH compiler http://sourceforge.net/projects/froth/ is much easier than gcc, but not as many applications are written in this language.
If you're really serious about creating your own OS, here are some links I think you might be interested in.
First we have some general links relevant to developing an OS:
`` I think Linux is a great thing, because ... because, of all the operating systems that are at all relevant today, Unix is the best of a bad lot.
(Yes, that's right, though I've been living in Unix for more than a decade, I think Unix sucks. Read the ``Unix Hater's Handbook'' if you want to know why. But I'd rather run Unix than Windows or MacOS any day, because Unix sucks less. That doesn't mean it doesn't suck.) ...
Of course, all of the software I write runs on Linux; that's the beauty of standards, and of cross-platform code. I don't have to run your OS, and you don't have to run mine, and we can use the same applications anyway!
... I hope that some day it will have evolved to the point where my mom can take home a Linux box, turn it on, and get on with her life without having to become a Unix sysadmin first, and without having to give up on all the ease of use she's come to expect from allegedly less powerful operating systems.
Because, you see, what I want to do is to commoditize the OS. I want to have access to all the applications that I need to do the things that I need to do, regardless. Why should someone have to retrain themselves to use a new application that does the same basic thing as the old application, just because something as trivial as the operating system changed out from under them? ''
the Global File System http://sources.redhat.com/cluster/gfs/ , http://www.redhat.com/software/rha/gfs/ /* was http://www.globalfilesystem.org/ , http://www.openGFS.org/ */ seems like a really good idea. It creates what appears to be a single remote file system that many people can access (say, .../public_html/...), but (a) is fault tolerant -- the system is distributed over many servers, with enough redundancy that any single server (any hard drive) can fail without any users noticing (no lost data). (b) distributes/balances the load over many servers.
Next we have lots of stuff about particular OS ideas (usually embedded in a specific OS):
Some systems have the GUI mixed up in the OS. Other systems have a distinct GUI layer between the application and the OS.
-- Alan Grimes http://users.rcn.com/alangrimes/UCE/sphere.txt [simplicity ?]``As long as it is conceptually straight-forward, avoiding or hiding complexity wherever possible, and the requirement that the user need to remember stuff ... be reduced or eliminated, it WILL be easy to use. ...
... when I discuss hiding complexity I don't mean throwing a door over it and welding it shut. Rather I mean compartmentalizing it in such a way that the user can identify and focus on the things that concern him at present.''
http://webs.cs.berkeley.edu/tos/ latest source code: http://sourceforge.net/projects/tinyos/ [low-power] [program compression ?] [Smart Dust] ["the hardware design was made public" -- open hardware] TinyOS is a component-based runtime environment designed to provide support for deeply embedded systems which require concurrency-intensive operations while constrained by minimal hardware resources. For example, originally designed for the Smart Dust hardware platform, our scheduler fits in under 200 bytes of program memory.
One of the major advantages to the Mac platform ... is that for the most part you can drag an app from one machine to another and launch it and it will work, without having to be "reinstalled." Applications that install kernel extensions are the exception, but in some cases, the application merely prompts you to install the appropriate bits once it is launched for the first time on the new machine. This is the way that things should work, but you don't always notice how nice it is when it does. -- http://www.osnews.com/story.php?news_id=2792
RTEMS (Real-Time Executive for Multiprocessor Systems) is a commercial grade real-time operating system designed for deeply embedded systems. It is a free open-source solution that supports multi-processor systems. RTEMS is designed to support applications with the most stringent real-time requirements while being compatible with open standards. Development hosts include both MS-Windows and Unix (GNU/Linux, FreeBSD, Solaris, MacOS X, etc.) platforms. -- http://rtems.com/
Choosing a Linux distribution http://mirrors.kernel.org/LDP/HOWTO/Installation-HOWTO/before.html#DISTRIBUTIONS basically just defers to http://lwn.net/
Also, if you do create a big Makefile that will rebuild tomsrtbt from source, I would be glad to put it in the contrib directory on my site to make it easier for anyone else who wants for some reason to do the "big everything rebuild thing". ]
[FIXME: to_program: perhaps it would be interesting to port this to ARM, PIC, etc.] Ultra-flexible firmware through the magic of Forth. Tiny Open Firmware (TOF) for Coldfire, 68K and 8031 processors. ... free, full-source development environment
(Linux currently scales from a minimal size of around 500 kilobytes of kernel and 1.5MB of RAM, all before taking into consideration application and service requirements.) ... eCos provides the basic runtime infrastructure necessary to support devices with memory footprints in the 10's to 100's of kilobytes, or with real-time requirements." ''
``eCos Porting Guide'' article by Anthony J. Massa 2001-12-28 http://www.embedded.com/story/OEG20011220S0059
[FIXME: FPGA microprocessors ?] µClinux: the Embedded Linux/Microcontroller Project http://www.uclinux.org/ The Linux/Microcontroller project is a port of Linux to systems without a Memory Management Unit (MMU). uClinux was first ported to the Motorola MC68328 DragonBall Integrated Microprocessor. The first target system to successfully boot was the 3Com PalmPilot ... It is currently maintained by co-creator D. Jeff Dionne. ... Jeff Dionne and John Drabik announced that they have successfully ported uClinux to a Field Programmable Gate Array (FPGA) running the Leon SPARC open source core ...
Date: Tue, 12 May 1998 13:57:48 +0100 To: seul-dev-help@seul.org X-URL: http://www.seul.org/archives/seul/dev/help/Apr-1998/msg00045.html X-Personal_name: Dan Moore From: mooreds@whitman.edu Subject: contacting other groups Sender: owner-seul-dev-help@seul.org Here's another group that is apparently trying to put together a 'linux for the masses'. http://independence.dunadan.com/ Daniel
Our secret ambition is utterly insane. We would like to design an infinitely mutable, portable, and third-party extensible OS, as simple and as elegant as the Mac, which does not rely on dll's, object dependencies, exception traps, or any other messy run-time lookups. It's impossible, but we think we know how to do it.
Most people think a MMU is necessary for virtual memory to work ... apparently not:
``Applications are given a mechanism to declare allocated memory blocks for swapping while not accessing them (this works even on systems without an MMU).''
``All our software is written and tested using our own tools under DPS (our OS) on computers which are our development; and all the hardware and PCB design are done using our own CAD tools. In fact, at the main development location we do not have a single Wintel or other kind of non-TGI designed computer.'' -- http://transgalactic.freeyellow.com/whower.htm
Is this related to
``JavaOS: An Operating System for Small Devices: Designed for PDAs and embedded systems'' article by Steve Mann 1996 http://www.ddj.com/documents/s=949/ddj9617f/9617f.htm
[FIXME: computer_architecture#os]JavaOS doesn't require a memory-management unit (MMU).
?
the JOS Wiki http://jos.sourceforge.net/
AROS is a rewrite of AmigaOS 3.1. These are the main goals:
- 1. Be binary compatible on Amiga (so it can plug&play replace the original OS)
- 2. Be source compatible on any other hardware (so Amiga software can be ported to other hardware with no effort)
- 3. Provide every hardware with Amiga apps with native look and feel or, as the user chooses, ... with Amiga look and feel.
... we want to be binary compatible to the old AmigaOS on Amiga. The reason for this is just that a new OS without any programs which run on it has no chance to survive. Therefore we try to make the shift from the old OS to our new one as painless as possible (but not to the extent that we can't improve AROS afterwards). ...
AROS is Open Source
Lineo has released "AtomicRTAI, a working real-time Linux distribution that fits on a single floppy diskette." "AtomicRTAI is open source and freely distributed under the GPL and LGPL licenses." http://www.zentropix.com/products/atomicrtai.html
building a minimal+simple+fast kernel
The LispOS/VM Mailing Lists Archives http://cathcart.sysc.pdx.edu/lispos/ml/
[FIXME: should I split out Beowulf and similar "operating systems" into a separate category ?]
Hope you have fun !
There seems to be some confusion about what a compiler can and cannot do.
Here I have information about compilers in general. Also see Porting C compilers #porting_c
I'm especially interested in:
compiler quotes
Optimization is but one of many desirable goals in software engineering, and is often antagonistic to other important goals such as stability, maintainability, and portability. At its most cursory level (efficient implementation, clean non-redundant interfaces) optimization is beneficial and should always be applied. But at its most intrusive (inline assembly, pre-compiled/self-modified code, loop unrolling, bit-fielding, superscalar and vectorizing) it can be an unending source of time consuming implementation and bug hunting.
[FIXME: should I put this quote with the Abrash stuff in video_game.html ... cross link ...]I was magnificently pleased with myself. ... I calculated that I had saved ... a lot of loop overhead. ... I was so pleased, that I decided to measure the speed improvement, ... so that I could pat myself on the back more quantitatively.
Do you know what I found?
No improvement whatsoever. ... The compiler's optimizer had optimized the first code well enough that my optimization didn't help at all. I was powerfully disappointed, and I learned that the only thing you can usually be sure of without measuring performance is that you've made your code harder to read.
In short, if it's not worth measuring to prove it's more efficient, then it's not worth sacrificing clarity for a performance gamble.
...
Each compiler has different strengths and weaknesses, and some are better suited to your program than others.
...
When to Tune: ... when to optimize ... Make the program right. Make it modular and easily modifiable so that it's easy to work on later. When it's complete and correct, check the performance. If it's lacking, then make it fast and small. In short, don't optimize until you know you need to.
...
``using assembler to debottleneck applications is ... Common practice for mainframers but not for Unix users.'' http://www.linuxworld.com/site-stories/2002/0426.mainframelinux.react.html
[DAV has snipped out a bunch of stuff from this long post, leaving only things directly relevant to compilers] ------------------------------ Date: Tue, 20 Jun 2000 23:15:04 -0700 From: Jeff Fox To: misc Subject: CPUs and Forth ... I know what you mean. I have confidence that people are still capable of being smarter than their PCs at this stage of technology and don't really need to adopt the stance that they need a compiler that is smarter than they are. But I hear all the time that this is the goal today, don't think about any of that stuff and just have faith that the compiler will be smarter than you and generate optimized code for you. We are told that we should just have faith that the code in the canned libraries is going to be efficient and better than code we could write ourselves. We are told that we need hardware that is just too complicated for humans to deal with and the only option is tools that are smarter than we are. We have taken a very different approach where we enjoy being in control and being able to think about the problem. ... iTV was spawned from a fund available to NASA contractors who wanted to take high technology developed for NASA into the commercial marketplace. Gary Langford and Joe Zott had been doing spacecraft design for NASA through SkyWatch and had submitted designs for various spacecraft that proposed using Chuck's chips. They got the idea that they chips could also make very cheap internet appliances and iTV was born. I go back quite a ways. I was a physics major at U of Iowa when they built a number of experiments for the deep space probes. They were the only University putting experiments and equipment into those missions and I looked over shoulders and asked a lot of questions. Those were days before Forth chips let alone MISC. Between that time and when I started working with MISC I sold a couple of Forth systems to scientists working at NASA and heard stories from them about the use of Forth on early spacecraft. 
They used the silicon on sapphire 1802 in those days because it was low power and could take radiation. They were strange processors and one of the things that made them useful was Forth. They were weak so a lot of them were required. Those early deep space exploring spacecraft were Forth multiprocessing robots. Forth made 'em work and they worked remarkably well. As I say, as far as I know Forth is the only thing we have sent outside of our own solar system (and not because we were trying to get rid of it either!) ... > I went onto comp.arch.fpga a few months ago I used to read comp.arch, comp.embedded, comp.realtime, comp.lang.forth, comp.robotics, and comp.robotics.ai.philosophy ... I found the majority of the people in those groups are so used to bloated and inefficient software that they really believe things like that a Pentium needs 50 thousand instructions inside of an empty loop. Most often when we would discuss most anything I would be talking nano-seconds or micro-seconds and they would be talking milli-seconds or seconds. So when they would say that with a thousand dollar computer they could do such and such in 20 milliseconds and I would say well we do the same thing in 20 microseconds or 20 nanoseconds on a machine costing almost nothing this was pretty hard for them to grasp. The same thing with replacing megabytes of code with kilobytes of code. (Well they are related.) ... Well to be fair, the ANS standard was never intended as a tutorial on Forth let alone a tutorial on how to write good Forth. Like other standards it aimed for system implementors not end users. As a result I think understanding the standard is many times more complicated than understanding a real Forth system, even one that is ANS compliant. > Reminds me of when I thought of FORTH as a scripting language because it > was so easy to implement a FORTH interpreter in C.
One of my main complaints about ANS is that it is being used to promote Forth as yet another inefficient scripting language extension to C. I would hate to see that become Forth's fate. I prefer to think of it as a systems language rather than a scripting language. But if your perspective is that the OS, the compiler and most of the programs are written in C then about the only place to put Forth is as a scripting language extension to the C environment and most of the concern about the Forth will really be C interfacing issues. I think of Forth as providing the OS, the compiler, the programs, and so think of it as needing to be an efficient systems language, and don't see that the main issue is making it conform to the rules of a C environment ... Jeff Fox ------------------------------
[low power] [FIXME: DAV has a 1802 instruction set manual on his shelf ... Has this already been digitized and put online ?] 1802 emulator http://kristopherjohnson.net/cgi-bin/twiki/view/Main/TinyELF
-- From: Date: Wed, 27 Oct 1993 16:34:05 GMT Newsgroups: comp.compilers No human chess player in the world could defeat even a 1960's-era chess program in a tournament if required to play 500 games straight through with no breaks. For today's programs, the number would probably be 10 rather than 500.
No compiler can touch the best hand-coders for fairly small pieces of code that do complex things under messy conditions (e.g., on hairy machines like 80x86's). No human being can touch a good compiler when optimizing large amounts of "boring" code that has been written for clarity and human understanding, rather than tuned for a particular architecture. No human being can match the ability of optimizing compilers to keep the code optimized as it changes or as it's moved among different types of hardware.
-- Jerry
Q: Are there any books you recommend for designing compilers?
John Reagan:
Starting with
void tolowercase(const char * const input, char *output) { int i=0; for (i=0; i<strlen(input); i++) output[i] = tolower(input[i]); output[i] = '\0'; } Since the pointer
input
never changes,
and the characters it points to never change
(or do they -- when output
points to the same place ?),
then there's no need to re-run strlen() for each character.
A smart compiler should store that value only once:
void tolowercase(const char * const input, char *output) { unsigned int i=0; unsigned int length = strlen(input); for (i=0; i<length; i++) output[i] = tolower(input[i]); output[i] = '\0'; }
Since the *order* that we go through the loop doesn't matter, some architectures (such as the PIC) scan through strings *much* faster end-to-beginning rather than beginning-to-end:
void tolowercase(const char * const input, char *output) { int i = strlen(input); output[i] = '\0'; while( 0 < i ){ i--; output[i] = tolower(input[i]); }; } For this particular function, there's really no need to use strlen() or a temporary variable at all.
void tolowercase(const char * const input, char *output) { int i=0; do{ output[i] = tolower(input[i]); }while( 0 != output[i++] ); } I'm pretty sure compilers automatically optimize array references, like that, to give something closer to
void tolowercase(const char *input, char *output) { do{ *output = tolower(*input++); }while( 0 != *output++ ); } That should be the fastest on normal architectures (ones where RAM is so slow that there's time for a few quick operations between reading and writing).
On some architectures, I find it faster to break things up into separate loops. Apparently optimizing each loop separately sometimes more than compensates for the extra loop overhead.
(In particular, on machines that are register starved, if you have to keep shuffling around temporary variables in RAM inside one large loop, but you can break it up into smaller loops that can keep all the temporary variables in registers ... but once you have everything in registers, further splitting doesn't help. ).
void tolowercase(const char *input, char *output) { strcpy( output, input ); while( 0 != *output ){ *output = tolower(*output); output++; }; }
[FIXME: consider moving to http://c2.com/cgi/wiki?edit=BetterForLoopConstruct ]
Porting the "gcc" compiler; Porting other C compilers
see also general compiler quotes #compiler
2 sections here:
Porting "gcc", a full-size commercial-strength open-source compiler, and re-targeting it as a cross-compiler.
porting simpler C compilers, especially ones targeting very simple processors such as the Microchip PIC, especially ones that optimize space over time. This includes "Small C", a highly restricted subset of C (no structures; DAV finds this annoying) that was designed to require a far simpler compiler than one that can handle the full C language.
other C Compilers:
Since the full source to GCC is so large, perhaps it would be good to get a simple C compiler working first on a new OS or architecture.
(Does it make sense to bootstrap gcc using Small C or another C compiler ? Or should we just use gcc on a previous machine as a cross-compiler ? )
information about porting "gcc":
low power design [Should this be moved to vlsi.html ?]
Designing for low power has secondary benefits ... assembling a quiet PC hardware_david_uses.html#quiet_pc
There are many levels to low-power design. To get a low-power web server (top level) requires software that doesn't waste power, a low-power CPU, low-power clocking, and low-power circuits (including low-power oscillator). [FIXME: the last section is over at schematic.html; should I break up the other levels into their own sections ?]
see also:
external links:
[DAV: Is it possible to integrate sensors with switching power supplies, to get the best "performance": - low voltage ripple at power input of device under test, - "high" time resolution (?), - "high" precision of exact amount of power used over the full range of currents used by that device - power-efficient (not wasting too much power in "precision resistors" or other losses). ] The monitors are Crystal Semiconductor CS5460 Single Phase Bi-Directional Power/Energy ICs. ... Precision inline resistors have been added to the power inputs of various components, and the monitors measure the potential drop across these resistors.
The CS5460 devices are all clocked in parallel ... This design permits all eight devices to be updated in parallel, ensuring that the power samples are in phase.
-- Jennifer Loiacono http://www.piclist.com/techref/postbot.asp?by=time&id=piclist\2002\02\27\200032a&tgt=post We found that potatoes don't work for very long. The phosphoric acid in the potatoes quickly runs dry (within a matter of minutes). As others have already suggested, lemons (or other citrus fruits) will last a little bit longer. Galvanized roofing nails are also a good source of zinc. We cut a few solid copper o-rings from the lab - they work great - but I'm not sure of their general availability.
-- Douglas Wood http://www.piclist.com/techref/postbot.asp?by=time&id=piclist\2002\02\28\030950a&tgt=post I suggested driving a hobby DC motor/gear box combination with a fruit battery. Slipping all of the interceding research, I found that plastic soda bottle caps arranged in a honeycomb pattern (for density) filled with lemon juice work the best. Each battery cell was filled roughly half with juice and I used common zinc galvanized nails and copper mesh for the electrodes (The copper mesh provided a greater surface area for the chemical reaction to occur and produced about three times the current output when compared to a copper nail). Lemons work the best of all fruits (apples, oranges, lemon, limes, etc.) that I tried. Other foods I tried included [chilli] peppers and potatoes (potatoes are rechargeable!). The general rule of thumb is that a whole lemon will produce approximately 1 mA at 1 Vdc. I used the bottle caps to reduce the weight of the battery. Hope that helps... Douglas Wood
``Linux and other open source products end up making the computer suck TWO TO THREE TIMES MORE POWER than Windows.'' The article has some interesting numbers: 1 gallon of oil can be burned to generate 26 kilowatt-hours of electricity. 1 pound of coal can be burned to generate roughly 4.4 kilowatt-hours of energy.
[FIXME: todo: put current meters / energy meters on my computers.]
"Is it better to turn off your computer at night, or let it run all the time?" http://www.everything2.com/index.pl?node_id=554411
[FIXME: move to Wikibooks: Microprocessor Design/Resources
free (?) logic simulators: Clive "Max" Maxfield http://ro.com/~bebopbb includes C source code for "a simple home-brewed logic simulator"
see schematic.html for a few more free (?) logic simulators.
DAV: see idea_space.html#level for more of my ramblings about the concept of a ``level''.
Date: 1997-06-03 From: Jim Dodd <jim_dodd@onsetcomp.com> Organization: Onset Computer Corp. To: Mot-68HC11-Apps@freeware.mcu.motsps.com Subject: Comparison of HC11 and HC12 Op-Codes Mr. Sibigtroth wrote a fascinating piece called "A Closer Look at Instruction Set Design" in "Electronic Design" magazine in the August 19, 1996 issue. He looked at the HC12 instruction set from two angles: The first angle was that Motorola took input from users of the HC11 about what are weak points in the instruction set and how they were improved in the HC12 instruction set. The second angle was how to design a flexible yet fast instruction set that also allows for support of high level languages. I don't know if this article is also available through Motorola or if back issues are available of "Electronic Design" (their Web site is at www.penton.com/ed). I just hope that those of you who have old copies of the magazine will be able to look it up or that someone can tell us where it exists in Moto space on the web.
The article certainly makes my mouth water for the day when I can do some work on the 'HC12.
Date: 19970604 From: Dietmar Block <exp145@physik.uni-kiel.de> To: Mot-68HC11-Apps@freeware.mcu.motsps.com Subject: Re: Comparison of HC11 and HC12 Op-Codes I can't tell you if I found the right article (I have no access to this magazine) but at the page www.mcu.motsps.com/lit/fam_12.htm the article AN1284 'Transporting HC11 Code to HC12 Devices' looks a little like that.
From: K.J.Wood Subject: Re: Book: The Anatomy of a High-Performance Microprocessor Date: 10 Nov 1998 00:00:00 GMT ... > The Anatomy of a High-Performance Microprocessor > A Systems Perspective (Interactive Book & CD-ROM) > by Bruce Shriver and Bennett Smith > > from IEEE computer Society. ... I work on media processors (<plug>TriMedia</plug>) and 3D graphics and I bought the book through work to check out what the other guys are doing :-) Cynically, I suppose it's an indirect form of advertising for AMD but it's done with style and good taste. ... ... Chapters 2 and 3 deal with microarchitecture over 250 pages and suffer from the tendency of computer books to tabulate vast amounts of data. OTOH if that's what you're interested in it does look like good stuff, I'll probably absorb it by osmosis over time. I liked Chapters 1 4 5 6. OK, some of it is "PCs for beginners", but it's -really- handy having that sort of stuff in one place. When I have to write documents that mention things about PCs that "everybody knows" it's useful to be able to point to something from this book. The CD ROM is good. Sure you could dig around on the WWW for it all but again, having it all in one place is nice and it encourages you to read around the subject. I'd like to hear from someone who knows whether or not it is a good book from an architecture viewpoint, but from a systems viewpoint I would recommend it for someone new to the field (and as someone new to the field). -- K. J. Wood Philips Research Laboratories, Cross Oak Lane, Redhill, SURREY RH1 5HA, United Kingdom. Phone: +44 1293 815328 Fax: +44 1293 815500 karl@prl.research.nospam.philips.com
Here I point to a few museums
that list
lots and lots of computer architectures.
[FIXME: delete all the redundant stuff on this page
that can be found easier at these ``museums''].
Some of these architectures can be simulated on FPGAs .
Also see robot_links.html#ucontrollers for information about a few of the most popular architectures currently in production.
the TI-85 ... calculator does not use series or polynomial approximations,
but rather the so-called CORDIC method.
...
The constants arctan (2^{-k}) are stored in the calculator.
--
http://mathforum.org/library/drmath/view/54012.html
the same technique described slightly differently:
http://mathforum.org/library/drmath/view/51900.html
The Retrocomputing Museum http://www.tuxedo.org/~esr/retro/ | mirror http://www.catb.org/~esr/retro/ ``Our exhibits include many languages, some machine emulators, and a few games.'' including The BlooP and FlooP languages from Chapter XIII of Goedel, Escher, Bach: An Eternal Golden Braid by Douglas R. Hofstadter.
Some design goals for the Saturn processor (used in the HP-48 series calculators), some implications, and future trends http://www.brouhaha.com/~eric/hpcalc/rant.html .
Retrocomputing http://www.brouhaha.com/~eric/retrocomputing/ ``I'm interested in old computing devices. Mostly machines made before 1981, but with a few exceptions. ... Wanted: ... Apple ][ (not ][+) ...'' has lots of information on early machines. Lots of links to 6502 programming.
dealing with interrupts: some ideas about interrupt-time code. [FIXME: make a clean separation between peripheral interrupts here (which the CPU can safely ignore for one or two cycles), and MMU traps f_mmu.html which must block WRITE instructions before they overwrite ``protected'' memory. ] [FIXME: should I separate out things that are only useful to CPU designers, vs. things that are only useful to CPU users ?]
The simplest design is to do (1) for all instructions. Some CISC machines have instructions that can last hundreds of cycles (such as a "stringcopy" instruction). Real-time systems need a response within a few cycles; so (1) is unacceptable unless the instruction set is specifically designed such that *every* instruction is guaranteed to finish in a few cycles.
(3) is impossible for instructions that have already written a value to I/O locations, even when restarting the instruction is guaranteed to write the *same* value. Many pieces of hardware will *do* something on each write, so writing twice to some location is very different from writing only once to that location. Properly-designed hardware, however, should tolerate a restarted instruction performing multiple reads.
(I suppose it might be OK to complete the Write to cache, as long as that cache line is immediately marked as "invalid").
Of course, this has nothing to do with peripheral hardware interrupts -- so (1) is acceptable as a response to all peripheral hardware interrupts.
For nearly all hardware, the write ordering is important, and usually so is the read ordering. If you do out-of-order execution and reading and writing, and/or cache reads and writes, device drivers need some way to enforce the desired order when writing to hardware (typically a "barrier" or "synchronize" or "flush cache" or "read uncached, write uncached" instruction is provided to implement this). (In the C programming language c_programming.html#volatile , "volatile" is used to describe this situation).
-- Samuel A. Falvo II (?) (1999 March) Also, when handling interrupts, please EXCHANGE registers, instead of moving them from one register to another. For instance, there could be a register for PC, IPC (Interrupt Handler PC), and RPC (Return PC). When an interrupt occurs, PC and IPC are exchanged, and all future execution occurs where IPC left off. When executing an RFI (return from interrupt) instruction, IPC and PC are exchanged again. Likewise, when calling a subroutine (actually a co-routine), it exchanges RPC and PC. Returning from the co-routine simply involves exchanging RPC and PC again.
yesterday:
Once the CPU directly read and wrote all values directly to I/O devices.
today:
Main memory serves as the I/O interface for most I/O devices. Today a CPU typically flushes a buffer of values to DRAM, then triggers an I/O device which independently reads those values while the CPU is doing something else. When the I/O device has new information for the CPU, it writes those values directly to DRAM, then sends an interrupt to the CPU.
tomorrow:
Some people speculate that all peripherals will go to some sort of serial I/O. This is because it's much faster to burst an entire packet of data to the CPU than it is to request an IRQ, wait for the CPU to respond with a grant IRQ, and then finally send the data over.
"a rather simple method of allowing interrupt handlers to be written in C, reducing overall interrupt latency, and providing for simple and efficient task switching in a multitasking environment."
The flowchart next to Link's article seems to restore disposable registers far too soon -- but the actual code is correct: restoring them just before returning from the interrupt.
Why does Link have 2 different checks for ``nested'' subroutines ?
It took me (DAV) a long time to understand why Link thought he needed 2 different checks for ``nested'' subroutines.
We obviously need to check the g_doingASTS flag to see if we've interrupted the initial interrupt while it's handling the ASTs.
Why do we need that other check ? To handle this situation:
- One peripheral triggers an initial interrupt, (turning off IRQs for peripherals of the same and lower-priority). Let's say it's the buffer-full interrupt on the UART. Before that interrupt routine has a chance to set the g_doingASTS flag, another interrupt comes in, and the CPU pushes the PC and starts executing a second routine. (Obviously this is not an interrupt from the *same* peripheral, because we wrote the initial interrupt handler to set the g_doingASTS flag *before* re-enabling IRQs from that peripheral).
Well, now we're hosed. There's only 2 possible ways we can write the interrupt handler for this second routine:
- After setting AST bits, check only the g_doing_ASTs flag. We'll see that it is ``no'', so we set it to ``yes'', and eventually enable interrupts and call the AST_handler(). That might run for some time, and other interrupts could come in and cause the AST_handler() to run for a few more cycles. Meanwhile, more characters could come in on the UART and overflow the buffer. Even worse, if this 2nd interrupt or those other interrupts include a timer tick, then we might decide to switch tasks. That buffer-overflow interrupt may never be resumed if the task that happened to be executing when it occurred is never re-activated. That makes it almost certain that the buffer will overflow.
- After setting AST bits, use both checks. With the 2 checks, we correctly detect that this (higher-priority) interrupt is a nested interrupt, and exit quickly. We rely on the (lower-priority) initial interrupt to handle any ASTs the nested interrupt registered. (If that lower-priority initial interrupt wasn't written to handle ASTs, well, tough; I guess we'll handle those ASTs at the next timer tick, which is connected to an interrupt handler that will handle them).
Some ideas:
- If we set the g_doingASTS flag as the very first instruction in the interrupt routine, could we avoid this situation on the 680x0 ? That isn't an option on some architectures where we really can't do anything before pushing the disposable registers, and where further nested interrupts can occur before we have a chance to do anything else.
This problematic situation never happens on the PIC. In effect, every interrupt on the PIC has the highest priority, so an interrupt routine cannot be further interrupted unless it explicitly sets GIE to allow interrupts. So all we need to do is put the command to set g_doing_ASTs before the command to allow interrupts, and we only need to do 1 check.
This problematic situation never occurs on the ARM. The second check isn't needed for implementing ASTs on the Microchip PIC or on the ARM (because the CPU disables all peripheral interrupts when ``the'' interrupt occurs).
If you use an architecture where this problematic situation does occur, you must make sure that (lower-priority) interrupt routines can deal with being interrupted. In particular, if your IRQ_handler() decides to set bits in the g_AST_register, that *must* be done atomically. The wrong way to do it is to load g_AST_register into a CPU register, set the bit, then write it out again. You might be overwriting other bits that have been set by higher-priority interrupts.
OK, so now what ?
- Advantage of doing 2 checks: we can guarantee that the C language IRQ_handler() eventually finishes executing before interrupts return to any task-level code, (i.e., it avoids the ``disadvantage of doing 1 check'').
- disadvantage of Link's method: Say we still have a couple of ``simple'' subroutines -- when the interrupt occurs, everything is handled immediately, and interrupts are never enabled except implicitly by the return-from-interrupt instruction. These interrupts were written before anyone ever heard of ASTs. Sometimes one peripheral triggers an interrupt, (turning off IRQs for peripherals of the same and lower priority), then while the IRQ handler for that interrupt is executing, it is interrupted by another (higher-priority) interrupt (before the g_doingASTS flag is set). If this higher-priority interrupt only checks g_doingASTS, then it sees ``no'', concludes it is the initial interrupt, and runs the AST_handler() with interrupts enabled -- delaying the suspended ``simple'' handler indefinitely.
- advantages of doing only 1 check:
- disadvantage of doing 1 check:
Crude rough draft of how DAV would do this (perhaps I'm making some terrible mistake that Link has already considered and rejected).
(The g_ prefix indicates a global variable to which all instances of the interrupt routines have access. Other variables are local to this particular instance, and are pushed on the stack when interrupted).
(This corresponds to figure 1 / listing 1 of the article)
(IRQ) (disables interrupts)
  save disposable registers
  pass arguments for handler
  [[ call IRQ_handler ]] ; the IRQ_handler can be written in C.
  [[ call AST_handler ]] ; the AST_handler can be written in C.
  if( top_task != current_task ){ ; time to context switch.
    [save current task's context to stack]
    [switch stacks]
    [restore top task's context from stack]
    [set g_current_task = g_top_task]
  };
  [restore disposable registers]
(RTI) (allows interrupts)
This gives much lower latency, because new interrupts can be accepted and (partially) handled by the IRQ_handler, while older interrupts are still being processed by the AST handler.
(This corresponds to Figure 2 / Listing 4 of the article ... with what DAV is deluded into thinking are ``improvements'').
(IRQ) (disables interrupts)
  save disposable registers
  pass arguments for handler
  [[ call IRQ_handler ]]
    ; The IRQ handler may register one or more ASTs
    ; (i.e., set bits in g_AST_register).
    ; Since the IRQ handler subroutine runs with interrupts off,
    ; it doesn't need to be re-entrant.
    ; Make sure IRQ_handler() does *not* modify ``g_top_task'' and ``g_current_task''.
  toss arguments for handler
  if( g_nested ){
    ; nested interrupt -- do nothing.
    ; Any ASTs that were just set by the handler will be handled by the initial interrupt
    ; after this nested interrupt returns.
  }else{
    ; g_nested *was* false -- this is the ``initial interrupt'',
    ; not a nested interrupt.
    g_nested = true; // might be done during the test with a test-and-set instruction
    // re-enabling IRQs of this priority level and higher priority
    // reduces interrupt latency by a couple of cycles.
    // (I do it a few cycles later, so it's optional whether to do it here).
    // Is there a problem with re-enabling IRQs of lower priority ?
    [Allow IRQs] // optional
    // must disable *all* IRQs that modify g_AST_register,
    // even if they are higher priority.
    [disable IRQs]
    ; interrupts must be disabled while testing g_AST_register,
    ; to avoid the risk of these bad sequences:
    ;   1. g_AST_register tests equal to zero
    ;   2. a (nested) interrupt comes in and sets some g_AST bits
    ;   3. the nested interrupt exits
    ;   4. the initial routine exits, because it thought g_AST was zero
    ;   5. those set AST bits might not be handled for a very long time
    ;      (not until another interrupt comes in).
    ;   This is especially bad if IRQ_handler()
    ;   leaves data structures in an inconsistent state,
    ;   relying on AST_handler() to fix them up
    ;   before any (non-interrupt-time) tasks see that data.
    ; or
    ;   1. g_AST_register tests to nonzero
    ;   2. a copy is made
    ;   3. a (nested) interrupt comes in and sets some g_AST bits
    ;   4. the nested interrupt exits
    ;   5. the initial routine sets g_AST to zero,
    ;      as if that (nested) interrupt never happened.
    while( 0 != g_AST_register ){
      make copy of g_AST_register
      g_AST_register = 0;
      [Allow IRQs]
      [[ call AST handler with copy of AST_register ]]
        ; The AST handler,
        ; or sometimes the (non-interrupt-time) tasks themselves,
        ; might modify ``g_top_task'' and ``g_current_task''.
        ; Since the AST handler is called *only*
        ; by the initial interrupt,
        ; it doesn't need to be re-entrant,
        ; even though it runs with interrupts enabled.
      [toss copy of g_AST_register]
      if( g_top_task != g_current_task ){ ; time to context switch.
        ; May need to disable IRQs while switching stacks,
        ; but probably not -- most architectures
        ; can switch stacks in one atomic instruction.
        [save current task's context to stack]
        [switch stacks]
        [set g_current_task = g_top_task]
        [restore top task's context from stack]
      };
      [disable IRQs]
    } // while()...
    g_nested = false;
  } // if( g_nested )...
  [restore disposable registers]
(RTI) (allows interrupts)
// hints:
// g_nested may be a single bit
// in the g_AST_register, with some slight modifications ...
OK, so that looks pretty in pseudo-code. Does the real code look any better ?
680x0/ColdFire listing:
.text
_Trampoline:
        /* save disposable registers */
        moveml  #0xC0C0,a7@-     /* save d0,d1,a0,a1 on the stack */
        /* call IRQ_handler */
        movel   #0xF0F0F0F0,a7@- /* save spot for passed int and */
        movel   #0xF0F0F0F0,a7@- /* pointer */
        jsr     0xA0A0A0A0       /* call respective IrqHandler */
        jmp     common
TrampolineEnd:
        .globl  _TRAMP_SIZE,_Trampoline
_TRAMP_SIZE:
        .long   TrampolineEnd-_Trampoline
common:
        addql   #8,a7            /* toss parm1 and 2 */
        /* if this is the initial interrupt */
        tasb    g_doingASTS      /* test-and-set */
        bneb    endif_initial    /* someone else is handling ASTs */
        /* then handle ASTs */
        /* optional allow/disable pair -- */
        /* -- reduces interrupt latency by a couple of cycles. */
        movew   #0x2000,sr       /* enable all interrupts */
        movew   #0x2700,sr       /* disable irqs */
        /* end optional pair */
        /* while... */
        brab    test_ast
        /* ... do ... */
loop_ast:
        movel   _astRegister,a7@- /* pass copy of astRegister */
        clrl    _astRegister     /* and clear it */
        movew   #0x2000,sr       /* enable all interrupts */
        jsr     _astHandler
        addql   #4,a7            /* pop astRegister copy */
        /* if( g_top_task != g_current_task )... */
        moveal  _pRunningTask,a0 /* get pointer to current task */
        moveal  _readyList,a1    /* get pointer to ptop */
        cmpal   a0,a1            /* and don't save if */
        beqb    current_task_OK  /* pRunningTask == ptop */
        /* ...then... */
SaveContext:
        moveml  #0x3F3E,a7@-     /* save d2-d7,a2-a6 on the stack */
        movel   a7,a0@           /* save stack pointer, first item in task struct */
RestoreContext:
        /* switching to a different stack is atomic, ... */
        /* so it's OK that interrupts are enabled now */
        moveal  a1@,a7           /* point to ptop's stack */
        movel   a1,_pRunningTask /* set runningTask to new task */
        moveml  a7@+,#0x7CFC     /* restore d2-d7,a2-a6 from the stack */
current_task_OK:
        /* ...endif */
        movew   #0x2700,sr       /* disable irqs */
        /* ... while( 0 != g_astRegister ); */
test_ast:
        tstl    _astRegister     /* any jobs to run? */
        bneb    loop_ast         /* yes */
        /* end while */
        clrb    g_doingASTS      /* ok.. irqs still disabled */
endif_initial:
        /* endif */
        moveml  a7@+,#0x0303     /* restore d0,d1,a0,a1 from the stack */
        rte
doingASTS:
        .byte   0x00
PIC listing:
related links: http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2002/04/11/080137a&tgt=_top | http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2002/03/07/120706a&tgt=_top
-- recc. Comp.realtime FAQ http://www.faqs.org/faqs/realtime-computing/faq/ [FIXME: books to read] http://groups.google.com/groups?group=comp.realtime

Caxton Foster's _Real-Time Programming: Neglected Topics_, despite the title, is a very good introduction to the basic topics of real-time control, starting with simple things like interrupts and debouncing switches, all the way through digital filters. It's a thin paperback (Addison Wesley MicroBooks), and a (somewhat) experienced programmer can get through it in a couple of days.
GIE is automatically disabled when the PIC takes the interrupt. GIE will be re-enabled by RETFIE at the end of the interrupt. You do **NOT** want GIE on in the interrupt routine. This can cause an interrupt within an interrupt. You would have to go thru a lot of trouble to make sure your interrupt routine is re-entrant to handle this case. Think of what your interrupt routine will do to the saved state of the first interrupt if a second one comes along before the first completes. In theory you could make your interrupt routine re-entrant, but I think this is extremely unlikely to be a benefit in practise. Also note that extra call stack locations are used up globally if you allow nested interrupts.
-- Olin Lathrop http://www.embedinc.com http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2001/04/26/080610a&tgt=_top
--
1st of all, if the ONLY interrupt u need enabled is RB0 then

    INTCON = 0x90; //enables global and RB0 interrupts

2ndly, ur interrupt routine should be set up this way:

    static void interrupt isr(void){
        //RB0 interrupt
        if (INTF){
            // do something
            INTF = 0; //clear RB0 interrupt flag
        }
    }

FYI, u can only have one interrupt routine which means all ur interrupts should be handled within that one routine! so for example if u needed to have TMR0 interrupt enabled as well as RB0, then the above interrupt routine would look like this:

    static void interrupt isr(void){
        //TMR0 interrupt
        if (T0IF){
            // do something
            T0IF = 0; //clear TMR0 interrupt flag
        }
        // RB0 interrupt
        if (INTF){
            // do something
            INTF = 0; //clear RB0 interrupt flag
        }
    }

hope this helps. BTW, if u needed to have TMR0 interrupt enabled, then u would need to set the value in the OPTION register appropriately. this might be a good time to have a look at the datasheet for the PIC in use. seyi
-- BY: Oluseyi Odeinde http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2001/04/02/084836a&tgt=_top
According to the Microchip PIC16CXXX MCU family documentation, http://www.microchip.com/1000/suppdoc/refernce/midrange/midsect/758/index.htm | http://www.microchip.com/download/lit/suppdoc/refernce/midrange/midsect/31008a.pdf
Section 8. Interrupts
...
The Global Interrupt Enable bit, GIE (INTCON<7>), enables (if set) all un-masked interrupts or disables (if cleared) all interrupts. Individual interrupts can be disabled through their corresponding enable bits in the INTCON register. The GIE bit is cleared on reset. The return from interrupt instruction, RETFIE, exits the interrupt routine as well as sets the GIE bit, which allows any pending interrupt to execute. ...
When an interrupt is responded to, the GIE bit is cleared to disable any further interrupt, the return address is pushed into the stack and the PC is loaded with 0004h. Once in the interrupt service routine the source(s) of the interrupt can be determined by polling the interrupt flag bits. Generally the interrupt flag bit(s) must be cleared in software before re-enabling the global interrupt to avoid recursive interrupts.
Individual interrupt flag bits are set regardless of the status of their corresponding mask bit or the GIE bit.
When an instruction that clears the GIE bit is executed, any interrupts that were pending for execution in the next cycle are ignored. The CPU will execute a NOP in the cycle immediately following the instruction which clears the GIE bit. The interrupts which were ignored are still pending to be serviced when the GIE bit is set again.
...
8.2.1 INTCON Register
Interrupt flag bits get set when an interrupt condition occurs regardless of the state of its corresponding enable bit or the global enable bit, GIE (INTCON<7>). This feature allows for software polling.
...
8.2.2 PIE Register(s)
... Peripheral Interrupt Enable registers (PIE1, PIE2). These registers contain the individual enable bits for the peripheral interrupts. These registers will be generically referred to as PIE. If the device has a PIE register, the PEIE bit must be set to enable any of these peripheral interrupts. Note: Bit PEIE (INTCON<6>) must be set to enable any of the peripheral interrupts.
...
8.2.3 PIR Register(s)
... Peripheral Interrupt Flag registers (PIR1, PIR2). These registers contain the individual flag bits for the peripheral interrupts. These registers will be generically referred to as PIR. Note 1: Interrupt flag bits get set when an interrupt condition occurs regardless of the state of its corresponding enable bit or the global enable bit, GIE (INTCON<7>). Note 2: User software should ensure the appropriate interrupt flag bits are cleared (by software) prior to enabling an interrupt, and after servicing that interrupt.
...
8.5 Context Saving During Interrupts

During an interrupt, only the return PC value is saved on the stack. Typically, users may wish to save key registers during an interrupt, e.g. the W register and STATUS register. This has to be implemented in software.
... Example 8-1: Saving the STATUS and W Registers in RAM (for Devices with Common RAM) Example 8-2: Saving the STATUS and W Registers in RAM (for Devices without Common RAM)
Example 8-5: Register Saving / Restoring as Macros

PUSH_MACRO MACRO            ; This Macro Saves register contents
    MOVWF   W_TEMP          ; Copy W to a Temporary Register
                            ; regardless of current bank
    SWAPF   STATUS,W        ; Swap STATUS nibbles and place into W register
    MOVWF   STATUS_TEMP     ; Save STATUS to a Temporary register in Bank0
    ENDM                    ; End this Macro

POP_MACRO MACRO             ; This Macro Restores register contents
    SWAPF   STATUS_TEMP,W   ; Swap original STATUS register value
                            ; into W (restores original bank)
    MOVWF   STATUS          ; Restore STATUS register from W register
    SWAPF   W_TEMP,F        ; Swap W_Temp nibbles and return value to W_Temp
    SWAPF   W_TEMP,W        ; Swap W_Temp to W to restore original
                            ; W value without affecting STATUS
    ENDM                    ; End this Macro

;1. Store the W register, regardless of current bank.
;   For devices with or without Common RAM,
;   the user register, W_TEMP, must be defined across all banks and must
;   be defined at the same offset from the bank base address
;   (i.e., W_TEMP is defined at 0x70 - 0x7F in Bank0).
;   The user register, STATUS_TEMP, must be defined in Bank0.
;   Within the 70h - 7Fh range (Bank0),
;   wherever W_TEMP is expected, the corresponding locations
;   in the other banks should be dedicated for the possible saving of the W register.
    MOVWF   W_TEMP          ; Copy W to a Temporary Register [FIXME: why not use SWAPF ?]

;2. Store the STATUS register in Bank0.
    SWAPF   STATUS,W        ; (use SWAPF instead of MOVFW
                            ; so they can be restored without changing any status bits)
    BCF     STATUS,RP0      ; Change to Bank0 regardless of current bank
    MOVWF   STATUS_TEMP     ; In this example STATUS_TEMP is in Bank0.

;3. Execute the Interrupt Service Routine (ISR) code.
    :
    : (Interrupt Service Routine (ISR))
    :
;4. Restore the STATUS register (and bank select bit).
    SWAPF   STATUS_TEMP,W   ; Swap original STATUS register value
                            ; into W (restores original bank)
    MOVWF   STATUS          ; Restore STATUS register from W register

;5. Restore the W register.
    SWAPF   W_TEMP,F        ; Swap W_Temp nibbles and return value to W_Temp
    SWAPF   W_TEMP,W        ; Swap W_Temp to W to restore original
                            ; W value without affecting STATUS

;6. Return from interrupt (RETFIE).

Example 8-6: Source File Template
Example 8-7: Typical Interrupt Service Routine (ISR)

        LIST    p = p16C77      ; List Directive
        ; Revision History
        #INCLUDE <P16C77.INC>   ; Microchip Device Header File
        #INCLUDE <MY_STD.MAC>   ; Include my standard macros
        #INCLUDE <APP.MAC>      ; File which includes macros specific to this application

        ; Specify Device Configuration Bits
        __CONFIG _XT_OSC & _PWRTE_ON & _BODEN_OFF & _CP_OFF & _WDT_ON

        org     0x00            ; Start of Program Memory
RESET_ADDR:                     ; First instruction to execute after a reset
        end

        org     ISR_ADDR
        PUSH_MACRO              ; MACRO that saves required context registers, or in-line code
        CLRF    STATUS          ; Bank0
        BTFSC   PIR1, TMR1IF    ; Timer1 overflow interrupt?
        GOTO    T1_INT          ; YES
        BTFSC   PIR1, ADIF      ; NO, A/D interrupt?
        GOTO    AD_INT          ; YES, do A/D thing
        :                       ; NO, do this for all sources
        :
        BTFSC   PIR1, LCDIF     ; NO, LCD interrupt?
        GOTO    LCD_INT         ; YES, do LCD thing
        BTFSC   INTCON, RBIF    ; NO, Change on PORTB interrupt?
        GOTO    PORTB_INT       ; YES, Do PortB Change thing
INT_ERROR_LP1                   ; NO, do error recovery
        GOTO    INT_ERROR_LP1   ; This is the trap if you enter the ISR
                                ; but there were no expected interrupts
T1_INT                          ; Routine when the Timer1 overflows
        :
        BCF     PIR1, TMR1IF    ; Clear the Timer1 overflow interrupt flag
        GOTO    END_ISR         ; Ready to leave ISR (for this request)
AD_INT                          ; Routine when the A/D completes
        :
        BCF     PIR1, ADIF      ; Clear the A/D interrupt flag
        GOTO    END_ISR         ; Ready to leave ISR (for this request)
LCD_INT                         ; Routine when the LCD Frame begins
        :
        BCF     PIR1, LCDIF     ; Clear the LCD interrupt flag
        GOTO    END_ISR         ; Ready to leave ISR (for this request)
PORTB_INT                       ; Routine when PortB has a change
        :
END_ISR
        POP_MACRO               ; MACRO that restores required registers, or in-line code
        RETFIE                  ; Return and enable interrupts

Part of the function of the RETFIE instruction is to set the GIE bit = 1. It is exactly the same as the RETURN instruction except for that one feature. Example 8-3 stores and restores the STATUS and W registers for devices with general purpose RAM only in Bank0 (such as the PIC16C620).
...
Question 2: My system seems to lock up. Answer 2: If interrupts are being used, ensure that the interrupt flag is cleared after servicing that interrupt (but before executing the RETFIE instruction). If the interrupt flag remains set when the RETFIE instruction is executed, program execution immediately returns to the interrupt vector, since there is an outstanding enabled interrupt.
...
-- The PIC will automatically disable GIE during the interrupt handler and reenable it during the RETFIE. It _is_ possible to deliberately reenable interrupts in your interrupt handler but this is pretty ugly (you have to have multiple save areas and decide which one applies to a given invocation of the interrupt). Here is pseudocode for a reasonable way to go at it if you are expecting a lot of interrupts. This avoids repeatedly handling interrupts and saving/restoring context:
interrupt_handler:
    save context
check_next:
    if RB0/INT interrupt
        clear RB0/INT flag
        handle RB0/INT condition
        goto check_next
    endif
    if TMR interrupt
        clear TMR interrupt flag
        handle TMR condition
        goto check_next
    endif
    restore context
    RETFIE
Bob Ammerman RAm Systems -- http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2001/04/26/060855a&tgt=_top
The 16f84 datasheet http://www.microchip.com/1000/pline/picmicro/category/digictrl/8kbytes/devices/16f84a/5432/index.htm | http://www.microchip.com/download/lit/pline/picmicro/families/16f8x/35007b.pdf in section 4.2 PORTB and TRISB Registers has this little sentence:
A mismatch condition will continue to set flag bit RBIF. Reading PORTB will end the mismatch condition and allow flag bit RBIF to be cleared.
Peter Betts further clarifies:
--
On the 16F84, when the RB0/INT port triggers an interrupt, you have to make a read of PORTB and THEN clear the interrupt flag. I've done this inside the ISR and it's fine.

    movf    PORTB,w
    bcf     INTCON,RBIF     ; Clear pending interrupt flag

-- Peter Betts http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2001/02/22/022923a&tgt=_top
AN566: Using the Port B Interrupt on Change as an External Interrupt http://www.microchip.com/1000/suppdoc/appnote/all/an566/
ServiceIRQ source code http://www.piclist.com/techref/postbot.asp?by=time&id=piclist/2001/02/07/101658a&tgt=_top
-- ``Rate Monotonic Scheduling'' article by David B. Stewart and Michael Barr in _Embedded Systems Programming_ 2002-03

The RMA (rate monotonic algorithm) is a procedure for assigning fixed priorities to tasks ... A task set is considered schedulable if all tasks meet all deadlines all the time. The algorithm is simple:
Assign the priority of each task according to its period, so that the shorter the period the higher the priority.
RMA is the optimal fixed-priority algorithm. If a task set cannot be scheduled using the RMA algorithm, it cannot be scheduled using any fixed-priority algorithm.
...
Sometimes a particular set of tasks will have a total utilization above the worst-case schedulable bound and still be schedulable with fixed priorities. Schedulability then ... must be analyzed by hand. ...
Guidelines:
...
Always assign priorities according to RMA. Manually assigning fixed priorities will not give you a better solution.
If total utilization is less than or equal to ln(2) ~= 69.3% of the CPU, all tasks will meet all deadlines, so no additional work needs to be done.
...
To achieve 100% CPU utilization when using fixed priorities, assign periods so that all tasks are harmonic. This means that each task's period is an exact multiple of the period of every other task that has a shorter period.
... For example, a three-task set whose periods are 10, 20, and 40 ... is harmonic, and preferred over a task set with periods 10, 20, and 50.
See ``What is RMA?'' section of http://www.faqs.org/faqs/realtime-computing/faq/
[FIXME: does RMA belong here with interrupts, or over in #os ?]
(moved to forth.html )
benchmarks for computer architecture. CPU design goals. benchmarking.
While faster is better, all else being equal, many times all else is not equal. Speed is becoming less important in computer design (many tasks can already be completed ``fast enough''). Power usage, reliability, ease-of-use, and other factors are becoming more important. We must make sure to test for the things we really want creed.html#really_want , and not waste effort testing for things that are irrelevant.
What do we really want ? Here are a few of the design goals for various systems. Naturally, if your particular goals heavily emphasize one of these, it will require a different architecture than one that emphasizes a different goal.
CPU design goals: (see computer_architecture.html#CPU_goals )(2003-01-10)
Over at data_compression.html#benchmark I list benchmark test images for image compression.
assembling a quiet PC hardware_david_uses.html#quiet_pc
One might think that comparing 2 pieces of software (``which one compresses my files smaller?'') and comparing 2 CPUs (``which one executes my code faster / better / with less power?'') are completely unrelated. But there's one tiny bit of overlap: data_compression.html#program_compression indicates that while dealing with compressed data and compressed executable code is often slower than uncompressed, sometimes there's a synergy -- if we compress code enough (using techniques I talk about there) so it all fits in a higher-level cache, then our programs run faster (which is what I'm talking about here).
Over at bignums.html#cpu and at robot_links.html#ucontrollers and at vlsi.html#cRAM I list information about some historical CPUs. [FIXME: should I merge them ?]
Why are there not more ``reliability'' benchmarks ?
-- recc. David Ranch http://www.ecst.csuchico.edu/~dranch/LINUX/index-linux.html#trinityos

The Signal 11 FAQ: http://www.bitwizard.nl/sig11/ ... Whenever I build new computers... this is how I ALWAYS test the stability of the machine.
Every piece of hardware should get a fair review, showing its plusses and negatives in an equal light, with technical data and reproducible, reliable results to back that up.... links to several hardware review sites that do get it right ...
EEMBC, the Embedded Microprocessor Benchmark Consortium, develops and certifies real-world benchmarks and benchmark scores to help designers select the right embedded processors for their systems. http://www.eembc.org/
The SPEC Benchmarks http://home.earthlink.net/~mrob/pub/benchmarks/spec.html (the "SPECint" and "SPECfp" numbers)
Twenty leading microcontroller manufacturers have formed the EDN Embedded Microprocessor Benchmark Consortium (EEMBC--pronounced "embassy") http://archives.e-insite.net/eembc.htm . Markus Levy, EDN's Microprocessor and DSP editor, founded the group in an effort to finally kill Dhrystone MIPS as the universal microprocessor benchmark.
...
EEMBC will evaluate performance against such metrics as power consumption, code density, and compiler technology.
SPARK is a C-to-VHDL ... particularly targeted to control-intensive microprocessor functional blocks and multimedia and image processing applications. We have validated the effectiveness of our approach ... for large real-life applications such as the Instruction Length Decoder from the Intel Pentium and multimedia applications such as MPEG-1, MPEG-2 and the GIMP image processing tool.
SPARK takes behavioral ANSI-C code as input, schedules it using speculative code motions and loop transformations, runs an interconnect-minimizing resource binding pass and generates a finite state machine for the scheduled design graph. Finally, a backend code generation pass outputs synthesizable register-transfer level (RTL) VHDL. This VHDL can then be synthesized using logic synthesis tools into an ASIC or mapped onto an FPGA.
-- ``The Death of Hardware Engineering'' article by Jim Turley in _Embedded Systems Programming_ 2002-03 [FIXME: general design ?] [FIXME: C language flaws]... Just as assembly code gave way to higher-level languages, hardware engineers are gradually discovering the joys of high-level abstraction. The benefits are much the same, but so are the pitfalls and the battles. ...
Old-timers complained that compiled code could never be as fast, tight, or elegant as hand-written assembly code -- and they were absolutely right. But it didn't matter. ... compilers ... are more efficient in the one dimension that matters: programmer's time. ... compilers make [ more efficient use of the user's resources by making less efficient use of the computer's resources ]. They also have the commercially important side effect of allowing less experienced (not to say less talented) programmers to write useful and functional code.
... 10-million-gate chips with multiple embedded processors ... It's not an efficient use of the engineers' time to design a chip that big, gate by carefully crafted gate. ...
...
... designing hardware, per se, is not the real objective. The goal should be to create a system that performs a given function. ... railroads in the 1880s ... focused on optimizing locomotives and boxcars while their customers were enthusiastically ignoring them and embracing automobiles and, later, aircraft. ... the railroads were not in the railroad business; they were in the transportation business. ...
... Suddenly all C programmers (or at least, all Handel-C programmers) become honorary chip designers.
...
The circuits that Handel-C (or SystemC, et al.) produces would make a hardware engineer gag, but that's no different from most programmers' reactions to early C compilers. ... But you're not supposed to look at the output and you're not supposed to care. If ... the tool ... works, you should look to more important tasks or you defeat the purpose of the tool.
...
The general weakness that all these software-cum-hardware languages share is that conventional programming languages simply can't express parallelism. There's no way to show that two functions (hardware blocks) are supposed to operate simultaneously. The C language ... is innately serial and has no method for expressing parallelism or simultaneity. ...
In the end, the perverted software languages are bound to prevail. There are a lot more programmers in the world than hardware engineers, and the balance will probably tip still further in the future. Elegance of implementation has never triumphed over timesaving hacks.
A few unusual, interesting CPUs.
Ubicom's built-in multitasker with response time measured in nanoseconds. Ubicom says its IP3023 chip has a worst-case interrupt latency that's a thousand times shorter than that of VxWorks or Linux. Three orders of magnitude can make a big difference in how you plan your interrupt handlers. ... A PCI interface, for example, is a 200-instruction loop that takes up about one-tenth of the processor's horsepower.
... The processor switches threads on every cycle; every 4ns at 250MHz. Call it extreme time slicing; each task runs for exactly one instruction before being "switched out."
You organize tasks through a 64-entry task table. Each clock tick, the processor executes one instruction from the next task in the table. The task table acts as a kind of cache, giving the IP3023 visibility into the next 64 instructions. Unlike a cache, there's no risk of missing or mispredicting the next instruction, so performance is completely reliable and deterministic.
It's pretty easy to calculate how often a task needs to run because the Ubicom processor's computing is so predictable. ...
The IP3023 never disables interrupts because there are no interrupts to disable. You handle interrupts with another task that monitors one or more input pins (you get to decide how many). Thus, any pins you want can become interrupt inputs. Their latency is determined by how frequently you choose to monitor them; if you schedule your interrupt request task to run every eighth cycle, for instance, you're looking at a 32ns interrupt latency.
... runs at 250MHz ...
unsorted
Date: Wed, 25 Feb 1998 13:08:23 -0800 (PST) From: Mark Crosby <crosby_m@rocketmail.com> Subject: >H Autopoiesis Vs Autopotency To: transhuman@logrus.org MIME-Version: 1.0 Sender: owner-test-new@logrus.org Reply-To: transhuman@logrus.org Transhuman Mailing List Mitch Porter wrote: < 'Autopoiesis' and 'autopotency' are different concepts. An autopoietic system is basically a self-regenerating system [CUT] (I say 'basically', since Maturana & Varela's work, e.g. _Autopoiesis and cognition_, has philosophical subtleties which I do not fathom.) Whereas an autopotent system is one having "complete power and knowledge over itself" (see Anders' lexicon). > Yes, they are quite different concepts. However, autopoiesis is a highly-developed, formal approach to theoretical biology that is probably very relevant to any automorphing and autopotency technologies that we may wish to develop. (Developing some of these connections, and differences, would be a good paper for The Journal of Transhumanism ;) George Kampis is the chairman of the Department of History and Philosophy of Science at Eötvös Loránd University, Budapest, Hungary. While he is not strictly an autopoiesis theorist, he has adapted many of Maturana & Varela's principles in his Alife research, which has focussed on self-modification in neural networks. A very interesting (10-page) paper by Kampis provides an overview of these concepts: "Computability, Self-Reference, and Self-Amendment" is available at http://www.c3.lanl.gov/~rocha/kampis.html < Abstract: There exist theories of cognition that assume the importance of self-referentiality and/or self-modification. We argue for the necessity of such considerations. We discuss basic concepts of self-reference and self-amendment, as well as their relationship to each other. Self-modification will be suggested to involve non-algorithmic mechanisms, and it will be developed as a primary concept from which self-reference derives. 
A biologically motivated mechanism for achieving both phenomena is outlined. Problems of computability are briefly discussed in connection with the definability and describability of self-modifying systems. Finally, the relevance of these problems to applications in semantic problems of cognition is shown. > Mark Crosby
From: jpd@isoroku.mit.edu (John Doty)
Subject: Re: Designing Hardware For Software Systems
Date: 16 Sep 1998 00:00:00 GMT
Organization: Massachvsetts Institvte of Technology
Newsgroups: comp.realtime

In article <6tmu41$odr@gcsin3.geccs.gecm.com> walter.mallory@gecm.com (Walter Mallory) writes:

One of our digital engineers came up with a useful trick a few years ago. He was designing an interface that would multiplex several ADC outputs into a DSP input port, using an FPGA. The CAD system provides a way to generate a stimulus on an interface pin to test a design in simulation, but the engineer decided that it was easier to draw a circuit that imitated the ADC than it was to write a stimulus specification. His pseudo-ADC contained a counter, so it generated successive samples of 0, 1, 2, ... This was fine for testing the design in simulation. However, when he decided to make a real chip, he realized he had enough left over capacity to incorporate his test circuit in the chip. So now, we have an interface chip that can easily be put into a test mode that can test the data paths downstream of the ADC. This was especially valuable because the scientific instrument on the other side of the ADC doesn't work properly unless it is cooled to cryogenic temperatures. Even when it's working properly, 99% of the data is random noise (and the question comes up, "is the random noise we're seeing the *right* random noise" :-). Having an easy way to run a nonrandom pattern through the system is thus very handy. Since then, we've put pattern generators on most of our interfaces.

-- John Doty "You can't confuse me, that's my job." jpd@space.mit.edu
Products requirements design ---> Products requirements documentation
    |
    v
ASIC requirements design ---> ASIC requirements documentation
    |
    v
ASIC specification design
    |
    v
ASIC development strategy
    |
    v
ASIC architecture
    |
    v
ASIC top-level design
    |
    v
ASIC hierarchical decomposition
    |
    v
Module-level ASIC design <------+
    |                           | (iterate)
    v                           |
Module-level ASIC verification--+
    |
    v
Top-level ASIC composition <----+
    |                           | (iterate)
    v                           |
Top-level ASIC verification ----+
    |
    v    <--- Constraints, Target Libraries
ASIC synthesis ---> Gate-level netlist
    |
    v
Preroute gate-level simulation
    |
    v
Place and route ---> Back-annotated timing
    |
    v
Static Timing Analysis
    |
    v
Gate-level simulation (with annotated timing)
    |
    v
Release ASIC to board.
--
From: Dave Gillespie
Date: Fri, 29 Oct 1993 22:02:44 GMT
Keywords: C, optimize, assembler
Newsgroups: comp.compilers

CISC can refer to "complex instruction sets" or "sets of complex instructions." People tend to think in terms of the former as the downfall of CISC, but actually it's the latter that brings on the microcoding and the slow cycle times. The 860 is a step in the direction of RISC processors that still have large, rich instruction sets. I'd like to see more movement in this direction---why not have conditional move, *and* min and max, *and* absolute value, and so on. Many of the things you want are FP instructions, which even tend not to be starved for bits in the opcode. And compilers would have no trouble using these instructions, too.
From: f95mabr@dd.chalmers.se
To: f-cpu@egroups.com
Date: Fri, 19 Mar 1999 17:41:24 -0000

... Alpha Architecture Handbook section 1.4: they have only four different types of instruction structure:

 31-26 25-21 20-16 15 -- 5 4 - 0
+-----+-----+-----+-------+-----+
| OP  | RA  | RB  | Func. | RC  |  Operate Format
+-----+-----+-----+-------+-----+
| OP  | RA  | RB  |     Disp    |  Memory Format
+-----+-----+-----+-------------+
| OP  | RA  |        Disp       |  Branch Format
+-----+-----+-------------------+
| OP  |          Number         |  PALcode Format
+-----+-------------------------+

... Mathias
From: kenl@compassnet.com.au (Ken Lee)
Subject: Re: How many different processors do you use?
Date: 04 Jun 1999 00:00:00 GMT
Newsgroups: comp.arch.embedded

Logically, the main motivation for changing a processor design is based on either:
1. Performance - the current design can no longer cut it.
2. Price - for a given quantity level there is some cost savings in switching to another processor, even considering the amortisation of the design costs.
... there is a 3rd reason, but it is rarely exercised:
3. Boredom - the engineer did it for the hell of it; there is no visible added value.
Often management considers that everything can be fixed in software, and so an original succinct processor design becomes something like an 8051 with megabytes of banked memory. In the end, the engineering time spent making an inappropriate design jump through hoops could have been better used in a new design with a new, more suitable processor. I would consider that one of the main criteria for selecting a processor is its suitability for the job at hand. Otherwise, using an inappropriate processor will move the costs from the hardware design to the software design. Ken.
http://www.deja.com/[ST_rn=ps]/threadmsg_if.xp?AN=485567505
From: wardrg@my-deja.com
Subject: FIREFLY Embeddable MicroController Cores
Date: 13 Aug 1999 00:00:00 GMT
Organization: Deja.com - Share what you know. Learn what you don't.
Newsgroups: comp.arch.fpga

FIREFLY Embeddable MicroController Cores

At Mitel Semiconductor, we have combined our advanced CMOS ASIC technology with our MicroController design expertise to produce a unique Embedded MicroController ASIC capability. We have given the name Firefly to a set of embeddable MicroController cores developed for use in these ASICs. Firefly cores use the ARM7TDMI ('Thumb') processor core - a RISC processor core especially suitable for ASIC applications in wired and wireless communications and networking. Thumb is licensed by Mitel from ARM Ltd, and is offered as part of the Mitel SystemBuilder(TM) library of fully-supported embeddable macrofunctions. Mitel has enhanced the ARM core by adding a number of the sort of peripherals everyone always needs, like a memory interface, an interrupt controller, UART and timers, and turned it all into a single complex microcontroller macrocell, the Firefly core. The first in the series of Firefly cores is available in silicon and is accompanied by extensive support. Firefly ASICs bring all this practical design experience to customers needing special-purpose microcontrollers. Mitel's Embedded MicroController ASIC architecture has been specially developed to make it easy to integrate customers' own logic with a Firefly core, including memory and other standard functions. A comprehensive design flow assists customers in achieving right-first-time silicon for complex SLI designs, along with many other benefits. For more information regarding FIREFLY please visit - http://www.mitelse
Subject: Re: A Language Targeted at Microcontrollers From: "Stephen Pelc" Date: 2000/01/07 Newsgroups: comp.arch.embedded Gary Drummond wrote in message <387595C8.E53E1007@worldinter.net>... >Forth might very well be an answer to your question, except for very >time-critical applications, which a compiler for Forth might cure. >The fact that the "CS" community doesn't approve of it doesn't >mean much-COBOL isn't taught in universities any more, but >there are waiting lists for classes in the community >colleges. It doesn't take too many over-budget, two years late, >conversions to C, to wake up the business world. Modern Forth cross compilers, such as our Forth 6 VFX compilers (see website for details), produce optimised native code that is as fast as C. The code density is still very good, and on our 68HC12 compiler, the VFX code is smaller than the code from threaded systems. All the interactivity of Forth is available in the usual way. All our systems come with multitaskers. Many of our customers who use multiple programming languages believe that Forth is a better language for embedded systems than C. We even have a C to Forth translator so that you can reuse your existing C code. -- Stephen Pelc, sfp@mpeltd.demon.co.uk MicroProcessor Engineering Ltd - More Real, Less Time 133 Hill Lane, Southampton SO15 5AF, England tel: +44 (0)2380 631441, fax: +44 (0)2380 339691 web: http://www.mpeltd.demon.co.uk
Cray is not amused. "I have really strong feelings about that," he said. "I feel the bigger the group that works on the project, the lower the chances for success. I'm appalled at our trying to make a country-wide coordinated effort. I just can't imagine it ever being successful.
"I believe you want a lot of independent people thinking their own thoughts and trying their own things. We're not going to participate in any national effort, and I don't want any money from the government. We've got competition within the company. I've got a group here five miles away who I know are trying to outdo me."
[DAV: much stuff snipped from this long post]
------------------------------
Date: Tue, 20 Jun 2000 23:15:04 -0700
From: Jeff Fox
To: misc
Subject: CPUs and Forth

... I know what you mean. I have confidence that people are still capable of being smarter than their PCs at this stage of technology and don't really need to adopt the stance that they need a compiler that is smarter than they are. ... I really really wish people could stay away from the religion, high priest, and cult bullshit. I am happy to discuss technology, science, hardware, programming, etc. but I get very very tired of people trying to paint this as religion. ... The thing I find most insulting about it is that it denies that we have any real religious belief. It seems to deny that we have any concept of what religion is. As an ordained minister and as someone who has done meditation for over forty years I am amazed at how many people write things about me telling other people about my religion. ... Jeff Fox
------------------------------
    STORE R1, dest
    LOAD  R1, source
    // now dest contains original contents of R1

could be re-arranged to

    LOAD  R1, source
    STORE R1, dest
    // in ordinary architectures, dest contains data from source -- but
    // with proposed "load delay slot" architecture,
    // dest contains original contents of R1.

(I think some compilers already think about this to avoid pipeline stalls...)
some considerations
Turley's Law (my own humble contribution to journalists' amusement) says that the amount of processing power you carry on your body doubles every two years. Pat yourself down and see if you're not already carrying more computing horsepower than NASA left on the surface of the moon.
DAV: This seems to unfairly slam asynchronous designs. [vlsi][low power][asynchronous]asynchronous circuits ... Though the asynchronous paradigm appears on the surface to fit the requirement in that it removes the need for the globally distributed clock signal, the learning curve in migrating to this paradigm is simply too steep for the world to embrace at this time.
...
Designing better tools is a conceptually simple solution to propose, and a difficult one to implement.
...
A short summary of the effect of increasing wiring delays on the overall delay equation is that the cost of data transportation will continue to increase relative to the cost of data computation. It may become cheaper to recompute frequently used values rather than precomputing them and then attempting to bring them on chip when they are needed. Already, the looming change in the delay equation has prompted calls to return to some old ideas that had been deemed obsolete. Some have proposed that macro instructions be set up, so that frequently used sections of code may be kept on chip and need not be retrieved from memory every time. The idea of using macro instructions to minimize instruction bandwidth requirements invokes the memory of the venerable x86 processor's use of microcoded instructions for such things as string manipulations. Reduced Instruction Set Computing gained favor with the mantra that simplifying the hardware makes it easier to design and able to achieve a higher clock frequency, at the slight expense of increased memory storage and bandwidth requirements compared to the classical CISC processors of its day. It is therefore ironic that, with the cost of data computation dropping relative to the cost of data transportation, Complex Instruction Set Computing appears to be making a mild comeback via the idea of macro instructions. The future and the past of computer architecture may yet cross paths, and complex instructions which save instruction bandwidth may yet again be important.
...
This was a typical Richard Feynman explanation. On the one hand, it infuriated the experts who had worked on the problem because it neglected to even mention all of the clever problems that they had solved. On the other hand, it delighted the listeners since they could walk away from it with a real understanding of the phenomenon and how it was connected to physical reality.
We tried to take advantage of Richard's talent for clarity by getting him to critique the technical presentations that we made in our product introductions. Before the commercial announcement of the Connection Machine CM-1 and all of our future products, Richard would give a sentence-by-sentence critique of the planned presentation. "Don't say `reflected acoustic wave.' Say echo." ...
...
... building a big computer is a good excuse to talk to people who are working on some of the most exciting problems in science. We started working with physicists, astronomers, geologists, biologists, chemists -- every one of them trying to solve some problem that it had never been possible to solve before. ...
For Richard, figuring out these problems was a kind of a game. He always started by asking very basic questions like, "What is the simplest example?" or "How can you tell if the answer is right?" He asked questions until he reduced the problem to some essential puzzle that he thought he would be able to solve. Then he would set to work, scribbling on a pad of paper and staring at the results. ... Eventually he would either decide the problem was too hard (in which case he lost interest), or he would find a solution (in which case he spent the next day or two explaining it to anyone who listened). In this way he worked on problems in database searches, geophysical modeling, protein folding, analyzing images, and reading insurance forms.
...
... we never really knew what we were doing. But the things that we studied were so new that no one else knew exactly what they were doing either. It was amateurs who made the progress.
... The act of discovery was not complete for him until he had taught it to someone else.
...
run the MAC OS on my PC http://www.sysopt.com/reviews/macemu/
... The Earth Simulator is capable of 35 teraflops, or 35 million million calculations per second.
It trounces ASCI White's 7 teraflops ...
Hans Werner Meuer, founder of the top 500 list of world supercomputers, says he expects the Earth Simulator to outperform all of its nearest 19 rivals put together. ...
symmetric multiprocessing (SMP) ... simultaneous multithreading (SMT) ...
includes
... The IBM 360 machine, created in 1964, was probably the first major microcoded processor system and was built out of the need for flexibility. ...
... hardwired approaches ... when used with non complex instructions like RISC ... take up far less decoding area [than] a microprocessor that uses microcoded control.
Koopman (1987) ... suggests that as technology improves and microcoded memory access times and speed of processing increases [compared to main memory RAM access times], the speed advantages of hardwired systems decrease.
...
on-chip caches
additional functional units for superscalar execution
additional "non-RISC" (but fast) instructions
...
branch prediction
...
The microcode update is volatile and needs to be uploaded on each system boot. I.e. it doesn't reflash your cpu permanently. Reboot and it reverts back to the old microcode.'' Is this really true ? [FIXME: if it's really true, email Koopman]
"On a chip, regular parallel structures can be very dense compared to random control logic." -- Pascal Dornier
BOAR provides a reprogrammable, high-performance platform for prototyping DSP applications. Its main use is to provide an environment to test and prototype VHDL ASIC designs on real hardware. ... BOAR supports hardware/software codesign.
points to Hedgehog - A younger brother for BOAR http://www.sci.fi/~cubase/board.html by Pekka Martikainen
Top 10 List: Ways to Simplify Programming by Mike Elola http://www.amresearch.com/10_ways.html
Xilinx, Inc. (NASDAQ:XLNX) and Insight Electronics introduced today the industry's first User Datagram Protocol (UDP) stack core optimized for an FPGA. The UDP stack core provides a cost-effective solution for transmitting voice and data over the Internet. It can be used to create low cost consumer products such as Voice over Internet Protocol (VoIP) phones, Internet intercoms, remote security monitors, Internet-based voice recorders, and simple thin clients for home networking applications.
The UDP stack core has been used to create a low cost consumer VoIP reference design based on a single very low-cost Spartan-II FPGA. ...
... Xilinx invented the FPGA and fulfills more than half of the world demand for these devices today. ...
http://slashdot.org/articles/03/02/01/2035204.shtml?tid=127&tid=137&tid=164 [FIXME: read]
... do not see the C-One as a Commodore 64 replica. It's a giant leap in computer technology, having the opportunity to change the behaviour of the hardware on the fly ...
The C-One aims at those who are into computer nostalgia, as well as those who want it for educational purpose. We'll supply all kinds of material for you to start VHDL programming, and instantly try it out on this board. Start modifying the board without soldering, extend the capabilities of your video output, or even switch to a completely different computer on the fly.
This computer is not for the usual point-and-click user. It's going back to the times where each and every bit of the machine was documented, and forward to a new kind of computer technology: Re-configurable hardware.
...
By embracing the inevitability of system failures, recovery-oriented computing returns service faster ...
... our team concentrates on designing systems that recover rapidly when mishaps do occur. ...
esim: A Structural Design Language for Computer Architecture Education ... The compiler and simulator for the language are freely distributable. http://www.cse.ucsc.edu/~elm/Software/Esim/ (apparently a simplified subset of VHDL ?) [FIXME: crosslink to schematic.html#digital ]
the first machine that had all the components now classically regarded as characteristic of the basic computer. Most importantly it was the first computer that could store not only data but any (short!) user program in electronic memory and process it at electronic speed.
http://www.computer50.org/ [FIXME: toread. Any clever ideas here waiting to be rediscovered ?]
Subject: CPU design
http://www.emulators.com/pentium4.htm "it takes an Intel or an AMD or a Motorola a good 3 to 5 years to design a new processor architecture." also lists and has hard benchmarks of several chips, including the Transmeta Crusoe chip.
"the fatal design flaws of the Pentium 4" http://www.emulators.com/pentium4.htm :
MISTAKE #1 - Small L1 data cache "The Pentium 4 has a grossly under-sized 8K L1 data cache."
MISTAKE #2 - No L3 cache "How much is 256K or 2M? Well, that's about the typical size of an uncompressed bitmap. It's the reason a Power Mac G4 running Photoshop kills a typical Pentium III running Photoshop."
MISTAKE #4 - Trace cache throughput too low "Together, these execution units can in theory process 9 micro-ops per clock cycle - 4 simple integer operations, 1 integer shift/rotate, a read and write to memory, a floating point operation, and an MMX operation. Sounds pretty sweet, except for the problem that the trace cache feeds only 3 micro-ops at a time!"
MISTAKE #6 - Shifts and rotates are slow
MISTAKE #8 - Instructions take more clock cycles to complete "This is not so much a specific mistake as it is an overall side effect of the first 7 idiotic mistakes." "It reverts to 10 year old techniques which Intel abandoned and apparently forgot why."
http://www.techextreme.com/display.asp?ID=290&Page=1 claims that the above article is missing the point. "Remember, out-of-order execution does not only mean that the individual instructions are executed out-of-order, but parts of the instructions (the micro-ops) are as well." -- http://www.emulators.com/pentium4.htm
New TASKING TriCore toolset from Altium uses Code-Efficient Viper compiler technology
Altium Limited has announced the release of a new TASKING embedded software development toolset for the TriCore architecture that has been enhanced with Tasking's Viper compiler technology. Altium benchmarks show faster execution speed and code size decreases of an average of 10% when compared to the previous TASKING TriCore toolset. The new toolset is compatible with all leading third-party TriCore products, such as emulators and RTOSs.
May 14, 2003
[DAV: I wish for more details ... so I could add these ideas to a list of general ideas for improving code size] -- http://microcontroller.com/news/bits_and_bytes.asp
CodeForce (peter_de_heer) has a very strong opinion on the changes he wants: http://www.aceshardware.com/forum?read=105039062
On the other hand, "rational" makes a good argument that "The ISA just doesn't matter" http://www.aceshardware.com/forum?read=105038773 but Paul DeMone disputes every point.
"cheesemower" claims "all successful ISAs get Byzantine after a couple of years."
The x86 isn't all that complex -- it just doesn't make a lot of sense.
--
Mike Johnson,
Leader of 80x86 Design at AMD,
Microprocessor Report (1994)
his kids ... are, even as I write, editing home movies (and adding daft sound tracks, I shouldn't wonder).
And the point of all this? Simple: everything my friend's kids are doing is computer-intensive in that it requires fast, powerful processors together with lots of RAM and big disks. But they don't see it as computing. To them it's just record production, image manipulation or video editing. We are looking at technology's version of the old principle that work expands to fill the space available. And that is what explains why Dixons sell - and we buy - those absurdly powerful machines.
... "the Reconfigurable Cell (RC) array. It has 64 reconfigurable cells arranged in an 8 by 8 array. Each cell has an ALU/MAC unit and a register file." http://www.eng.uci.edu/morphosys/rcarray.htmlMorphoSys is a Reconfigurable Computer Architecture that is composed of a software programmable processing unit called TinyRISC and a reconfigurable hardware unit called RC Array. MorphoSys is currently running at 450MHz, using 0.13-micron technology and with an area of 8x8mm or 16x16mm with onchip memory.
"GNUPro Tools for embedded systems" http://www.redhat.com/docs/manuals/gnupro/GNUPro-Toolkit-99r1/pdf/6_embed.pdf tells how to set up Cygwin, and discusses development for ARM7, Hitachi H8, LSI TinyRISC, Matsushita MN10200, MIPS VR4100, MIPS VR5xxx, Mitsubishi D10V, Motorola M68K, NEC V850, PowerPC, SPARC, SPARClite, Toshiba TX39,
I'm not quite sure why http://npu2.npu.edu/sloc/notes/computer_architecture.html has mirrored this computer architecture page ...
This page started 1998-04-21 and has backlinks
David Cary
Return to index // end http://david.carybros.com/html/computer_architecture.html