updated 2000-01-05.
When I was working on this, I thought it was pretty creative and unique. Later I found out that I'd been stuck in a ``Turing Tar Pit''. Still, this looks like a more reasonable instruction set than some of the really ugly things that have been developed by other people who have also wasted a lot of time in a ``Turing Tar Pit''.
Describes a microprocessor instruction set developed by David Cary that packs 2 instructions into 8 bits of RAM. As far as I know, this is *the* Minimal Instruction Set (for a single-processor Von Neuman machine). In other words, I know of no other (Turing-complete) instruction set that has as many or fewer distinct instructions (DI).
2003-01-04:DAV: I just discovered
``A Minimal CISC''
by Douglas W. Jones
http://www.cs.uiowa.edu/~jones/arch/cisc/
that has fewer distinct instructions (DI): only 8,
so 5 of them can pack into a 16 bit word.
The instructions are:
- NOP: No operation.
- DUP: Duplicate the stack top. This is the only way to allocate stack space.
- ONE: Shift the stack top left one bit, shifting one into the least significant bit.
- ZERO: Shift the stack top left one bit, shifting zero into the least significant bit.
- LOAD: Use the value on the stack top as a memory address; replace it with the contents of the referenced location.
- POP: Store the value from the top of the stack in the memory location referenced by the second word on the stack; pop both.
- SUB: Subtract the top value on the stack from the value below it, pop both and push the result.
- JPOS: If the word below the stack top is positive, jump to the word pointed to by the stack top. In any case, pop both.
... any constant can be pushed on the stack by a DUP followed by 16 ONE or ZERO instructions. Zero may be pushed on the stack by the sequence DUP DUP SUB; negation may be done by subtracting from zero; addition may be done by subtracting a negated value, and pushing zero prior to pushing an unconditional branch address allows an unconditional branch.
See also its counterpart, "The Ultimate RISC" by Douglas W. Jones http://www.cs.uiowa.edu/~jones/arch/risc/
The "Whitespace" programming language is similar to this "Minimal CISC". http://compsoc.dur.ac.uk/whitespace/
The "Path" language http://pathlang.sourceforge.net/ has an interesting 2D layout, and a near-minimal instruction set (expressed in single characters):
2002-12-07:[FIXME: move ``Turing Tar Pit'' information here.] computer_architecture.html#tarpit
contents: [FIXME:]
See also
Here David Cary pushes the MISC idea to ugly extremes. [FIXME: need a catchy name for this architecture]
"whenever you excessibly constrain any parameter, something else has got to give." -- Don Lancaster.
I've put together yet another MISC instruction set. I've squeezed it down to 11 instructions. I don't think I've painted myself into a corner yet (I hope). If I could just squeeze out 3 more instructions, I could pack 5 of them into a 16 bit cell.
One advantage of having few instructions -- you can document them in a reasonably-sized email, rather than needing a entire book to document the details of scads of instructions.
There's 2 very different ways of counting the the "size" of an instruction set.
We could count the number of different instruction mnemonics (NM) mentioned in the assembler documentation. This method is subjective, since different assemblers may describe the same machine with different numbers of mnemonics (for example, one assembler may require programmers to type "LOAD dest TO PC" to get the effect of an immediate jump. Another assembler may require programmers to use a additional mnemonic "JUMP TO dest" to get exactly the same bit pattern in the final executable.)
Processers with smaller NM are likely to have regular, orthogonal instruction sets (the same addressing modes apply to all instructions) which makes them easier to program than processors with larger NM.
The minimum number of mnemonics NM is 1: the "move" instruction used by TTA http://www.rdrop.com/~cary/html/computer_architecture.html#tta . Since you know what the instruction will be in a TTA architecture, it takes zero bits to specify -- -- but TTA still requires a bunch of bits per instruction to select registers and/or addressing modes.
Another way of counting the "size" of an instruction set is to count all possible distinct instructions (DI), all valid variations of "source register", "destination register", and "addressing mode". Using this count, TTA has lots more than 1 instruction.
This is a more objective count: if the longest instruction has b bits, then the number DI is equal to 2^b minus the number of invalid instructions of that length.
Processors with smaller DI are likely to need fewer bits b to specify each instruction. Processors with fewer bits per instruction
Unfortunately, this advantage is partially (perhaps completely) cancelled out by the fact that processors with smaller DI often require *more* instructions (more RAM and/or more cycles) to implement a given piece of functionality than processors with larger DI.
Nevertheless, it is a interesting intellectual challenge to wonder What is the minimum number of bits bmin required per instruction ? What is the minimum number of distinct instructions DI for a Turing-complete von Neumann machine ?
The shBoom(tm) microprocessor from Patriot Scientific Corporation http://www.ptsc.com/ http://www.circuitcellar.com/articles/misc/tom-92.pdf packs instructions into 8 bits.
So we know DImin is 256 or less.
The "Itty Bitty Stack Machine" http://www.ittybittycomputers.com/IttyBitty/IBSM.htm is very similar to the F21, with most instructions packed into 5 bits. (A total of about 54 instructions) (This instruction set is allegedly "fast, that is, it should be capable of emulation at a raw speed not slower than 10% of the host native hardware.")
The "F21" Forth engine by Chuck Moore F21 STACK PROCESSOR CPU DESCRIPTION http://pisa.rockefeller.edu:8080/MISC/F21.specs has 27 distinct instructions, each one is packed into 5 bits (neglecting the bits that follow "#" and branches).
So we know DImin is 27 or less.
Clive Sinclair http://www.cdworld.co.uk/zx2000/clive.html claims that he has designed a CPU with only "16 principle instructions", but he doesn't list any details.
Dr Neil Burgess mentions "ultraRISC processor, that has only 15 instructions" http://www.acue.adelaide.edu.au/leap/discipline/eng/Burgess.html but he doesn't list any details.
Can a CPU really be designed to have 16 or fewer distinct instructions DI, such that one can pack 2 instructions into 8 bits of RAM ? 8 instructions into a 32 bit word ? Or are these people counting NM, not DI ?
Starting with the elegant 27 instruction set
for the "F21" Forth engine by Chuck Moore
F21 STACK PROCESSOR CPU DESCRIPTION
http://pisa.rockefeller.edu:8080/MISC/F21.specs
,
and eliminating instructions that could be emulated
by (sometimes lengthy) combinations of the other instructions,
David Cary managed to get a (very ugly) 16 instruction set.
"# push pop A! A@ T=0 @A !A xor and + com 2/ dup over drop".
(16 instructions as of 1998)
Can you think of a more "elegant" (yet still "complete") instruction set of 16 or fewer distinct instructions DI ? Turing complete cellular automata are proof that a Turing Machine can be built without *any* distinct "instructions".
Hint: You soon get to the stage "Every instruction in this set is essential. if I eliminate *this*, I won't be able to do *that*, no matter how many of the remaining instructions I string together". While you can't simply eliminate 1 instruction from this set, sometimes you can replace 3 instructions from a set with 2 completely different instructions that, given the remaining instructions, can still do *that*. It helps to assume you have some scratchpad RAM, since the fewer internal registers you have, the fewer instructions you need to shuffle things back and forth between those registers.
Myron Plichota and vic plichota have developed a much more elegant set of 16 instructions they call qUark ../mirror/quark.txt
With a bit of a challenge from "vic plichota" <atsvap@cgo.wave.ca>, and many ideas from Myron Plichota, I've developed a instruction set with only 13 instructions: "# ! + xor nand 2/ push pop dropR swap dup T=0 nop" (13 instructions as of 1999-02-22) Programmers model: there is a instruction pointer P and 2 stacks, the "return stack" and the "data stack". The top of the data stack is called T and the second on the data stack is called S; the top of the return stack is called R.
T R P S . . . . . . .
I've managed to whittle it down even more with a ``conditional skip if arithmetic result not zero'' idea from Alan Grimes: Instruction summary: "# ! + xor nand 2/ push pop toA Afrom call nop" (12 instructions as of 1999-03-22)
I've managed to whittle it down even more "# ! + xor nand 2/ push popA AT call nop" (11 instructions as of 1999-03-24)
I *think* this is still functionally complete, and can still do everything that any other Turing-complete CPU can do. It just takes a *lot* more instruction cycles to do most things than most CPUs.
2000-01-05:DAV: I've just stumbled across "BF: An Eight-Instruction Turing-Complete Programming Language which was invented by Urban Mueller solely for the purpose of being able to create a compiler that was less than 256 bytes in size, for the Amiga OS."
More BF details:
A very clean instruction set, although it takes far more instruction cycles than even my 11 instructions to do even the most trivial things. I wonder if we could redefine the "input" and "output" op-codes ...
"# ! + xor nand 2/ push popA AT call nop" (11 instructions as of 1999-03-24) [perhaps more Forth-like names would be
>R toR instead of push A> Afrom instead of AT RA ?? instead of popA]
2 push-down stacks, the "return stack" and the "data stack".
T A R P S . . . . . . .
All registers are the same length, the length of a memory address.
[FIXME: A implementation needs to choose:
(Moore fixed (sizeof_address) = (sizeof_data_word + 1 bit). What other choices would be "interesting" ???) ]
[A slow, minimal-gate implementation needs stack pointers that point to S and R in RAM. Faster processors can keep most or all of the stack on-chip ... What should the processor do when stacks overflow or underflow ? ]
** Acronyms and notation: cell: the implementation's native integer size, the number of bits read and wrtten at once. [ ... ] indicates that everything inside the brackets is contained in a single cell. (. . .): parameter-stack diagram, T to far right, then S. <. . .>: return-stack diagram, R on far right. |: pipe char used to indicate an "either-or" choice
Subroutines are documented with the initial (to the left of the "-") and final (to the right of the "-") state of the data stack and the return stack.
Data on the external memory bus always comes from T or from "external devices", and always goes to "external devices" or T or the instruction latch. All data in memory is accessed only on cell boundaries (i.e., only 1 whole cell is read or written at a time).
Addresses on the memory bus always come from P or R (or perhaps from external devices while the CPU is not using the bus).
Since instructions are always only 4 bits each, instructions are packed into "cells", as many as will fit (e.g., 3 instructions per 12 bit cells on some machines, or 4 instructions per 16 bit cell, 5 instructions per 20 bit cell, or 8 instructions per 32 bit cell on other machines. ). (does it make any sense for "cells" to be bigger or smaller than sizeof_data_word, the size of a single word of memory ?). A minimum of 3 instructions per cell is needed for the ugly hack that works around the lack of a proper "load" command).
Any opcode ``# !R+ + xor nand 2/ push popA AT call nop'' can occur in any slot.
[FIXME: How are interrupts handled ? computer_architecture.html#interrupt ]
It seems that the instructions execute so quickly that the bottleneck is the speed of the RAM.
Surprisingly, only 1 memory read instruction is needed for both in-line literals and for data everywhere else in memory:
// ( -- x ) @P+ // load data in RAM at [P], push it onto T, then increment P. load // ditto # // ditto
For example, when the CPU executes a cell that starts with 3 "#" instructions, then the following 3 cells are pushed onto the data stack (the second one ends up on S, the third one ends up on T), then the remaining instructions in the cell are executed, and then the 4th following cell is loaded into the instruction latch and executed.
// ( ... - ... data1 data2 data3 ) [ # # # ... ] [data1] [data2] [data3] [more instructions ...] ...
The assembler pseudo-op "#(value)" keeps track of implementation details (word size, register size, sign-extension, the current instruction word, where the next literal/instruction cell will be loaded from, etc.). "#(value)" expands to "#" (or perhaps "# com" or "# unsigned com") and packs the (possibly complemented) literal value into the next available (sizeof_data_word) cell, such that when that code is executed, the desired value gets loaded into T.
To load data that is *not* an in-line literal,
somehow calculate the address and get it onto R,
then pack these instructions into a single cell:
"swapPR load swapPR
"
.
There is also only one write instruction:
// ( x -- ) < address -- address+1 > !R+ // pop T, store popped value to RAM at [R], then increment R. store // ditto
(??? should we use A for the address instead of R, a la the F21 ?) (hardware *might* be a bit simpler if P is used for *all* addresses to the memory bus, replacing !R+ with !P+).
The 2 operand arithmetical instructions always pop 2 items off the data stack (S and T) and push the result back onto the data stack (into T).
+ ( n1 n2 -- n1_+_n2 ) add S to T xor ( n1 n2 -- n1_bitxor_n2 ) bitwise exclusive-or S to T nand ( n1 n2 -- ~(n1_bitand_n2) ) bitwise nand S to T
One way to write software to implement multiple-precision arithemetic is to use the most significant bit of T as the carry bit, and extending precision in sizeof_address-1 bit chunks.
2/ ( x -- x/2 ) arithmetic shift right of T (keep sign), and skip next instruction if T was not zero. (Next instruction only executed if T was (and still is) 0). nop does nothing
[can "nop" be eliminated ?]
This "2/
" is the only conditional instruction in the entire set.
The skipped instruction is typically
"call
" or "nop
" or another ``2/
''.
For example,
to shift 3 bits to the right, do "2/ 2/ 2/ nop".
(I leave it as an exercise to the reader to show this sequence
always has the net effect of
unconditionally shifting T three times to the right).
[It may make hardware simpler (allow interrupts at end of every cell)
if the programmer model
always includes a virtual "nop" at the end of every cell,
i.e., even if a 2/ is the last instruction of a cell,
the first instruction of the next cell
is executed unconditionally.
This means the compiler must insert or delete a ``nop''
to make the ``2/
''
do what the programmer expects:
2/ call --> [... nop] [2/ call ...] 2/ nop call --> [... 2/ ] [call ... ] 2/ 2/ --> [... 2/ ] [2/ ... ]]
[Other options:
perhaps make "2/", if the result is not zero,
skip *all* the remaining instructions of the current cell.
perhaps make "2/", if the result is not zero,
skip precisely 3 instructions: following it in this cell
and perhaps the start of the next cell.
This delay slot could be filled with
"AT push popA
"
reducing the need for the "nop" instruction.
]
push ( x -- ) < -- x > pop T, push onto R popA ( -- ) < x -- > pop R, push onto A dropR ( -- ) < x -- > // alias for popA AT ( -- x ) push a copy of A onto T (leaving A unchanged).
Data in any register in the list (T R A T) can be moved in a single cycle to the register to its right.
"popA" is a hint to the reader that the value of A will soon be used with AT; "dropR" is a hint to the reader that the value of A is now irrelevant.
There is only 1 branch instruction, only one way to modify the value of P (but lots of different aliases to make the intent of the programmer clear):
// various aliases for the same branch instruction bit pattern swapPR ( -- ) < future_P -- past_P > // swap P with R. call branch return exit ; // pronounced "exit" jmp
P typically (but not always) points to the cell following the cell from which the currently-executing instructions were taken.
Once a cell is loaded into the instruction latch for execution, P is incremented, and then *all* of the instructions in that cell are executed, (possibly modifying P) from the first to the last. The CPU never skips the remaining instructions in a cell (not even for the call instruction).
If a "call" is immediately followed by a "#" instruction, then that distant value at [P] is loaded, *not* the data in the cell immediately following the cell currently being executed.
After *all* instructions in a cell have executed, the next cell of instructions is read from this new value of [P] into the instruction latch, P is incremented, and then all the instructions in that cell are always executed (possibly modifying P). Whatever value of P exists immediately after they are all executed is the address where the next instruction cell will be loaded from.
Sequences that "temporarily" change P must be very careful to restore P before the end of that cell.
All branch destinations are to (the first instruction of) a particular cell, not to some (other) instruction inside that cell.
Prefetch may be implemented on some chips. It has no effect on the programmer model; all implementations act "as if" there was no prefetch.
Typically, while the instructions in one cell are executed, the next cell is being pre-fetched. That pre-fetched cell will in turn be executed unless (a) the current cell mentions "load" (#), which diverts the pre-fetched cell to T instead of the instruction latch, increments P, and starts pre-fetching the following cell. (b) the current cell modifies P with "swapPR". The CPU will then flush the ``next cell'' it speculatively pre-loaded, then starts pre-fetching the next instruction where P now points.
[A strange alternative: Instead, a chip *could* stick with the absolute simplest thing to do with "call" -- merely swap R and P, and not worry about the consequences.
This has a *major* impact on the programmer's model.
This means that pre-fetched cells are *never* ``wasted''. The consequences are that after a cell with one ``call'' and no ``load'' instructions is fetched, the cell at the following address is prefetched and will be executed unconditionally (it is in the "branch delay slot" of a "delayed branch" a la Sparc ???), while prefetch loads the first cell of that subroutine. (note that there is *both* a call delay slot *and* a return delay slot).
Confusing, difficult-to-explain things to happen if one tries to mix "call" followed several "#"s in the same cell. The first "#" loads data from the cell immediately following the cell being executed, while the following "#"s load data from distant cells.
Other strange things happen if you put multiple call instructions in consecutive cells.
If you do this, then the macro for @R+ expands to:
: @R+ // next 4 instruction must be in same cell. swapPR @P+ // puts a wasted prefetched value in T, starts loading desired value swapPR @P+ // puts the desired value in T, starts loading next instruction cnop [ cnop ] // this cell is skipped swap // get the wasted value on top drop // get rid of it. ;]
[ we might be able to have a minimal implementation that packs only 1 instruction per cell, feeding out of a 4 bit RAM, if we change the instruction buffer so it pre-fetches and holds not only the currently executing instruction, but also the 2 following instructions. The ugly hack to fake a "load [R]": would work like this: while executing the source code "swapPR swapPR # push call", the instruction buffer would look like this:
swapPR swapPR #
''
(P and R get swapped; then data at [P++] is read)
swapPR # (data)
'' (P and R get swapped; instruction at [P++] is read)
# (data) push
''
(data shunted to T,
instruction buffer overwritten with NOP;
instruction at [P++] is read)
nop push call
'' (etc.)
If we we are feeding out of a 4 bit RAM but we want the register size to be N times larger, then we could prefetch N+1 following instructions (confused yet ?). Then the sequence would be (for N=3, so T = 3*4 bits = 12 bits)
swapPR nop nop swapPR #
''
(P and R get swapped; then data at [P++] is read)
nop nop swapPR # (data)
''
(nop; then data at [P++] is read)
nop swapPR # (data) (data)
''
(nop; then data at [P++] is read)
swapPR # (data) (data) (data)
'' (P and R get swapped; instruction at [P++] is read)
# (data) (data) (data) push
''
(data shunted to T,
instruction buffer overwritten with NOPs;
instruction at [P++] is read)
nop nop nop push call
'' (etc.)
The assembler pseudo-op ``cnop
''
fills the rest of the current cell with (zero or more) NOPs,
sufficient to align the next instruction
with the start of the next cell.
This alignment is necessary when you want that next instruction
to be the destination of some branch in the code.
(This is implied by labels at the start of a subroutine,
and the "then" and "else" of a "if-then" statement.
).
The compiler automatically generates inline code whenever the word being compiled would take *less* space inline than the subroutine call. (-- idea from Myron Plichota)
Should the compiler choose to actually generate a call,
`` .... subroutine_label ...
"
expands to something like this:
[ ... # push call] [ subroutine_label ] [ ... ] ... subroutine_label: ... // instructions to actually do something [ ... return dropR ]
To reduce space slightly, the compiler may choose to compact lists of subroutine calls (after making sure that none of the subroutines will interfere too much with the return stack) from
[ ... # push call] [ name_1 ] [ ... # push call] [ name_2 ] [ ... # push call] [ name_3 ] [ ... # push call] [ name_4 ] [ ... return dropR]to
[ # # # # push push ] [ name_1 ] [ name_2 ] [ name_3 ] [ name_4 ] [ push push jump dropR ]
or to
[ # push # push ] [ name_4 ] [ name_3 ] [ # push # push jump dropR ] [ name_2 ] [ name_1 ]
(This is a generalization of "tailbiting").
To minimize space (but not time) in long sequences of only subroutine calls some compilers may use direct threading. data_compression.html#program_compression Here is an example of a fully re-entrant direct-threaded subroutine list interpreter:
[... # push call] [ subroutine_list_interpreter ] [ sub1 ] [ sub2 ] [ sub3 ] [ sub4 ] [ continue ] continue: // ( ) < oldP p_subroutine_list > [dropR dropR ...] ... subroutine_list_interpreter: ( -- ) < p_subroutine_list -- > [ swapPR # swapPR push call cnop ] // = [ @R! push call cnop ] [ # jump dropR cnop ] [ subroutine_list_interpreter ]
This subroutine list interpreter assumes these subroutines don't modify the top 2 items on the return stack; it's ok if they use deeper items for parameters.
Once you have this subroutine list interpreter, you can collapse other lists of subroutine calls to 2 full cells + 2 partial cells of overhead, plus 1 more cell per subroutine call.
The compiler converts
``.... if ... true_stuff ... then ...
''
statements to something like
// last 5 instructions must all be in same cell [ ... # push 2/ jmp dropR push dropR ] [ then ] [ ... true_stuff ... ] ... then: ...
The compiler converts
``.... if ... true_stuff ... else ... false_stuff ... then ...
''
statements to something like
// last 5 instructions must all be in same cell [ ... # push 2/ jmp dropR push dropR ] [ else ] [ ... true_stuff ... ] ... [ ... # push jmp dropR ] [ then ] else: [ ... false_stuff ... ] ... then: ...
A minimum implementation (only 3 instructions per cell) is forced to use this alternate form:
// last 2 instructions must be in same cell [ ... ... # push 2/ jmp ] [ else ] [ dropR push dropR ... true_stuff ... ] ... [ ... # push jmp dropR ] [ then ] else: [ dropR push dropR ... false_stuff ... ] ... then: [ ... ] ...
A smarter compiler might have a special ``else 0 then'' case
for code like
`` ... if ... true_stuff ... else 0 then ...
''
to
[ ... # push 2/ jmp dropR ] // ( refrain from dropping T ) [ (then) ] [ push dropR ... true_stuff ... ] // T was non-zero; drop it ... then: ... // if T was 0, now T is still 0.
(note that the "2/" never drops T, even in the case where T=0 and the CPU actually executes that next instruction. I suppose a really smart compiler might be able to refrain from dropping T in certain unusual situations when we later want to use T/2 in the true_stuff. and/or 0 in more complicated false_stuff. ).
The compiler converts
`` ... DO ... LOOP ...
''
into
[FIXME]
DO: loop initialization
...
LOOP: increment by 1 loop
+LOOP: variable increment loop ???
Some processors have the only conditional instruction be a
"conditional return if 0" instruction.
Another way of implementing
``.... if ... true_stuff ... then ...
''
is to use the "conditional-return" style:
[ ... # push call] [ subroutine_label ] [ dropR ... ] ... subroutine_label: [ ... condition ... ] [ ... 2/ return push dropR ] // conditional return [ ... true_stuff ... ] [ ... return ] // unconditional return
A "conditional return" implementation of
``.... if ... true_stuff ... else ... false_stuff ... then ...
''
looks something like
// could calculate some of condition here [ ... # push call] [ (else) ] ... else: // < return_address > // could calculate some of condition here [ ... # push call ] [ (then) ] // < return_address > [ ... false_stuff ... ] [ ... return ] // unconditional return ... then: // < ... return_address else_address > [ ... condition ... ] // either return to "else:", // or fix up return stack so // we can do true_stuff // and return all the way to // original call. [ ... 2/ return drop dropR ] // < return_address > [ ... true_stuff ... ] ... [ ... return dropR ] // unconditional return
//some common macros //using the instruction set // "# !R+ + xor nand 2/ push popA AT call nop" (DAV 1999-03-24) // pseudo-primitives: pop, dup, drop, swapTR, swap. // pop: take top value off R stack, push onto T stack. : pop ( -- x ) < x -- > popA AT ; // always inlined : pop_dup ( -- x x ) < x -- > _[[ popA AT AT _]] ; // 3 cycles, always inlined // dup: make a duplicate copy of T, push into T and S : dup ( x -- x x ) push pop_dup ; // 4 cycles, always inlined : drop ( x -- ) < -- > push dropR ; // always inlined // swapTR: swap T with R : swapTR ( a -- b ) < b -- a > popA push AT ; // 3 cycles, always inlined : swap (... a b --... b a ) swap T with S push swapTR pop // 6 cycles : 0 ( -- 0 ) dup dup xor ; : 0 ( -- 0 ) _[[ AT AT _]] xor ; // 3 cycles, always inlined // The A register and ``popA'' and ``AT'' instructions // may be reserved exclusively for these pseudo-primitives to use. // Then use "push, pop, dup, drop, swap, swapTR, dropR, 0" macros // as if they were the primitive instructions. // stack manipulation : over (... b a --... b a b ) push dup pop swap ; : over (... b a --... b a b ) 13 cycles if inlined push dup swapTR pop ; : rot (... a b c --... b c a ) push swap pop swap ; : rot (... a b c --... b c a ) push push swapTR pop swapTR pop ; // arithmetical manipulation // com: complement: invert all bits of T. : com ( x -- (-x-1) ) or equivalently ( x -- -(x+1) ) dup nand ; // 5 cycles : -1 ( -- -1) 0 com ; // 8 cycles : -1 ( -- -1) 0 0 nand ; // ( -- -1) // 7 cycles : negate ( x -- -x ) #(-1) + com ; // ( x -- -x ) // 13 cycles, or 7 cycles + 1 RAM cycle : 1 ( -- +1 ) #(1) ; // 1 cycle + 1 RAM cycle // too much work for no gain over straightforward literal. : 1 -1 dup + com ; // 17 cycles : 1 _[[ AT AT _]] xor // ( -- 0 ) _[[ AT AT _]] xor // ( 0 -- 0 0 ) nand // ( 0 0 -- -1 ) push _[[ popA AT AT // + // ( ... -1 -1 -- -2 ); A=-1 AT AT _]] + // ( ... -2 -1 -1 -- ... -2 -2 ) nand ; // (... -2 -2 -- ... +1 ) // 16 cycles : and nand dup nand ; : or over com and xor ; // clever sequence from Chuck Moore : or com swap com nand ; : or // 11 cycles push push pop_dup nand pop_dup nand nand // subtract T from S: : - (... a b --... (a-b) ) negate + ; // 0<> logical buffer: 0->0, nonzero->(-1). // 0= logical not: 0->(-1); nonzero->0 // (Would it be better to map false to +1 rather than -1 ?) // (Do we really use less power by using +1 rather than -1 ?) // simple and straightforward : 0= ( 0 -- -1 ) | ( nonzero -- 0 ) if 0 else 1 then ; : 0<> ( 0 -- 0 ) | ( nonzero -- -1 ) if -1 else 0 // take advantage of special speedup for ``else 0 then'' then ; : 0= ( 0 -- -1 ) | ( nonzero -- 0 ) 0<> com ; : 0<> ( 0 -- 0 ) | ( nonzero -- -1 ) 0= com ; // doesn't work now -- only works if "swap" is really a primitive //: 0<> ( 0 -- 0 ) | ( nonzero -- -1 ) // #(-1) swap // // x=0 | x!=0 // 2/ swap // (...1 0 --...0 1 ) | (...1 x --...1 x/2 ) // drop // (...0 1 --...0 ) | (...1 x/2 --... 1 ) // somewhat more bizzare programming styles : 0= ( 0 -- -1 ) | ( nonzero -- 0 ) -1 swap if dup else 0 // take advantage of special speedup for ``else 0 then'' then xor ; : 0= ( 0 -- -1 ) | ( nonzero -- 0 ) 1 push 0 push // ( x -- x ) <... --... 1 0 > _[[ popA // ( x -- x ) <... 1 0 --... 1 >, A=0 // x=0 | x!=0 2/ popA // < 1 -- >, A=1 | < 1 -- 1 >, A=0 AT AT _]] // (0 --...0 1 1) < -- > | (x/2 --...x/2 0 0) < 1 -- 1 > 2/ dropR drop // (0 1 1 --...0 1)< -- > | (...x/2 0 0 --...x/2 0) < 1 -- > push push dropR pop // (...0 1 -- 1 )< -- > | (...x/2 0 -- 0 )< -- > ; : 0<> ( 0 -- 0 ) | ( nonzero -- -1 ) -1 push if pop ret // force a return else 0 then // take advantage of special speedup for ``else 0 then'' dropR ; which the compiler expands to : 0<> ( 0 -- 0 ) | ( nonzero -- -1 ) [ # push # push 2/ jmp dropR ] [ -1 ] [ iszero ] nonzero: [ push dropR popA AT ret ] iszero: [ dropR ret ] // even more bizzare versions: : 0<> ( 0 -- 0 ) | ( nonzero -- -1 ) [ 2/ ret cnop ] nonzero: [ push dropR # ret cnop ] [ -1 ] : 0= ( 0 -- -1 ) | ( nonzero -- 0 ) // this version must *not* be in-lined #(nonzero) push 2/ swapPR push dropR # dropR ret // these 6 instructions *must* be in the same cell zero: [-1] nonzero: [ 0] which the compiler expands to : 0= ( 0 -- -1 ) | ( nonzero -- 0 ) [# push 2/ swapPR push dropR # dropR ret] [ nonzero ] [-1] nonzero: [ 0] // A smart compiler might take any phrase of the form ``... if #(a) ... true_stuff ... else #(b) ... false_stuff ... then'' and compile it to [ ... # push 2/ jmp dropR push dropR # ] [ else ] [ a ] [ ... true_stuff ... ] ... [ ... # push jmp dropR ] [ then ] else: [ b ] [ ... false_stuff ... ] ... then: ... : 1+ ( ... a -- ... (1+a) ) com -1 + com ; // 19 cycles : 1+ ( ... a -- ... (1+a) ) #(1) + ; // 2 cycles + 1 RAM cycle : 1- ( ... a -- ... (a-1) ) -1 + ; // 9 cycles : R-- ( -- ) < x -- (x-1) > subtract one from R -1 pop + push ; // other macros // "R@+", "fetch": load word from RAM to T at address in R. // Note that both swapPR *must* be in same cell to avoid wild jumps. : R@+ ( -- value ) < source_address -- (1+source_address) > _[[ swapPR load swapPR _]] ; // "@", "fetch": load word from RAM at address in T. : @ ( source_address -- value ) < -- > push R@+ dropR ; // "!", "poke": put value in T into RAM at address S // (... value dest_address -- ) : ! push !R+ dropR ; : = ( n1 n2 -- (n1==n1) ) xor 0= ; HEX // from plichota : MIN- ( -- 80000000) 80000000 ; : signbit ( -- 80000000) 80000000 ; // from plichota : MAX+ ( -- 7FFFFFFF) 7FFFFFFF ; // if T is strictly negative, return true. // from plichota : 0< ( x<0 -- -1 ) | ( x>=0 -- 0 ) signbit and 0<> ; : > ( n1 n2 -- flag=n1>n2 ) - 0< ; : < ( n1 n2 -- flag=n1<n2 ) swap > ; : <= ( n1 n2 -- flag=n1<=n2) > 0= ; : >= ( n1 n2 -- flag=n1>=n2) swap <= ; // from plichota : ABS ( n -- |n|) DUP 0< IF NEGATE THEN ; // pull out the third element down in the stack, and put on T : rot ( ... a b c -- ... b c a ) push swap pop swap ; // multi-word arithmetic // copy word from one literal location to another // ( -- ) < -- > : copyword(source, dest) #(dest) #(source) @ ! ; : copyword ( dest -- 1+dest ) < source -- 1+source > @R+ push swapTR store pop ; // Note that both swapPR *must* be in same cell to avoid wild jumps. // this requires cell size to be at least 6 instructions // 9 cycles + 4 RAM cycles : copy2words ( source -- 2+source ) < dest -- 2+dest > push _[[ swapPR load store load store swapPR _]] pop // ( dest -- 4+dest ) < source -- 4+source > : copy4words (copyword copyword copyword copyword) // ( -- ) < source dest -- 4+source 4+dest > // Note that both swapPR *must* be in same cell to avoid wild jumps. // this requires cell size to be at least 6 instructions // 13 cycles + 8 RAM cycles : copy4words ( _[[ swapPR load load load load swapPR _]] _[[ popA store store store store AT _]] push [ _Stack Computers_ by Philip Koopman 1989 mentions these words: // ?DUP: conditionally duplicate T if it is non-zero : ?DUP ( 0 -- 0 ) or ( x -- x x ) dup if dup then ; (Chuck Moore doesn't like ?DUP: ``1x Forth'' by Charles Moore April 13, 1999 http://www.ultratechnology.com/1xforth.htm ) U< U1 U2 - FLAG Return a true FLAG if U1 is less than U2 when compared as unsigned integers. U> U1 U2 - FLAG Return a true FLAG if U1 is greater than U2 when compared as unsigned integers. U* N1 N2 - D3 Perform unsigned integer multiplication on N1 and N2, yielding the unsigned double precision result D3. U/MOD D1 N2 - N3 N4 Perform unsigned integer division on D1 and N2, yielding the quotient N4 and the remainder N3. how to implement multi-precision operations ? since there are no condition codes, the carry flag must be pushed onto the data stack as a logical value. _Stack Computers_ by Philip Koopman 1989 mentions these words: RLC N1 CIN -> N2 COUT -> Rotate left through carry N1 by 1 bit. CIN is carry-in, COUT is carry-out. RRC N1 CIN -> N2 COUT -> Rotate right through carry N1 by 1 bit. CIN is carry-in, COUT is carry-out. UNORM (... EXP1 U2 ->... EXP3 U4 ) Floating point normalize of unsigned 32-bit mantissa ADC (... N1 N2 CIN ->... N3 COUT ) Add with carry. CIN and COUT are logical flags on the stack. //Store the double-precision value D1 at the two memory //words starting at ADDR. // [Is most-significant or least-significant word on T ?] : D! (...D1 ADDR - ) //Drop the double-precision integer D1. : DDROP ( D1 -- ) drop drop ; //Duplicate double-precision integer D1 on the stack. : DDUP ( D1 - D1 D1 ) over over ; D+ D1 D2 - D3 Return the double precision sum of D1 and D2 as D3. D@ ADDR - D1 Fetch the double precision value D1 from memory starting at address ADDR. DNEGATE D1 - D2 Return D2, which is the two's complement of D1. // Swap the top two double-precision numbers on the stack. : DSWAP ( D1 D2 - D2 D1 ) (... a b c d --... c d a b ) push swap push swap //(...c a ) <...d b > pop pop //(...c a b d) <... > swap push swap //(...c d a) <...b > pop I - N1 Return the index of the currently active loop. I' - N1 Return the limit of the currently active loop. J - N1 Return the index of the outer loop in a nested loop structure. LEAVE - Set the loop counter on the return stack equal to the loop limit to force an exit from the loop. S-D N1 - D2 Sign extend N1 to occupy two words, making it a double precision integer D2. SP@ (fetch contents of data stack pointer) SP! (initialize data stack pointer) RP@ (fetch contents of return stack pointer) RP! (initialize return stack pointer) MATCH (string compare primitive) ABORT" (error checking & reporting word) +LOOP (variable increment loop) /LOOP (variable unsigned increment loop) CMOVE (string move) <CMOVE (reverse order string move) DO (loop initialization) ENCLOSE (text parsing primitive) LOOP (increment by 1 loop) FILL (block memory initialization word) TOGGLE (bit mask/set primitive) ]
The compiler must ensure that A is in a don't care state at the start and end of every cell in order for interrupt routines to be free to use A without saving it.
The compiler can enforce this rule by only allowing the use of the ``popA'' and ``AT'' opcodes in the subroutines "0", "dup", "pop", "swap", "swapTR" (and perhaps a few others), and making sure that when these subroutines are inlined, cell breaks don't occur at the "wrong" place. (The dropR opcode, even though it is the same as popA, can be allowed anywhere A is already in a don't care state, since it leaves A in a (different) don't care state).
Perhaps we need a special "non-breaking space" notation
so the programmer can indicate
that certain instructions (those sensitive to A,
and certain other ones following ``2/'')
must be packed into a single cell.
(if cells contain N instructions,
``cnop
'' forces the next N instructions into the same cell;
but when I really wanted the next 3 instructions
to stay together, and there were 4 empty slots
remaining in the current cell,
``cnop
'' is a bit wasteful.)
peephole optimizer usually needs to eliminate do-nothing "popa at push" sequences in straight-line code, since the only effect is to change A and usually A is in a "don't care state".
In particular, the
"pop swap
"
macro sequence (in the "over" macro) expands to
"pop push swapTR pop
",
and the
obviously do-nothing subsequence
"pop push
"
is further expanded to the do-nothing sequence
"popa at push
".
Perhaps it would be simplest to immediately replace the
"pop swap
"
sequence with the faster sequence
"swapTR pop
"
and then expand *that* into
"popA push AT popA AT
"
Similarly, "swap push" expands to "push swapTR pop push", with the same do-nothing subsequence "pop push"; it seems simple to make the compiler smart enough to immediately replace "swap push" with "push swapTR"
(The "A" is simulated as an location in RAM, _A )
else unconditional jump: #(dest) push _[[ jmp dropR _]] cnop T=0(dest) : dup #(dest) push _[[ 2/ jmp dropR push dropR _]] cnop call(dest) : #(dest) push call cnop dropR C=0(dest) : #(MSbset) nand dup nand T=0(dest) cnop ; return : return cnop cnop subroutine start: // no entry sequence needed. : @R+ // fetch, address in R, then increment R cnop // next 3 instruction must be in same cell. swapPR @P+ // do the load swapPR // restore P ; : @A+ : #(_A) // get address of _A (&A) dup push push @R+ dropR // fetch value of A: (A) < &A > push @R+ // fetch what A points to: (data) < A+1 &A > pop ! dropR // update A with new value. (data) < -- > # : # @A: #(_A) // get address of A (&A) push @R+ dropR // fetch value of A: (A) < -- > push @R+ // fetch what A points to: (data) < A+1 > dropR // !R+ : ! //store, address in R, increment R !A+ : #(_A) // get address of A (&A data) dup push push @R+ dropR // fetch value of A: (A data) < &A > push ! // pop T, store it at [A]: ( -- ) < A+1 &A > pop ! dropR // update A with new value. (data) < -- > !A : #(_A) // get address of A (&A data) push @R+ dropR // fetch value of A: (A data) < -- > push ! // pop T, store it at [A]: ( -- ) < A+1 > dropR // com : dup nand 2* : dup + 2/ : 2/ +* : // add S to T if T0 one DUP 1 AND IF OVER + THEN dup #(1) and if over + then +* : // add S to T if T0 one DUP 1 AND IF OVER + THEN dup #(1) and // expands to ``nand dup nand'' if over // expands to ``push dup swapTR pop'' else 0 then + which expands to (with a good compiler on a 4-instruction-per-word machine) [push popA AT AT] // dup [ # push # nand] // set up for ``if'', #(1), ``nand'' [ (then) ] [ (1) ] [ push popA AT AT] // ''dup'' [ nand 2/ jmp dropR ] // ``nand'', ``if'' [ push dropR push push ] // finish ``if'', start ''over'': ``push push'' [ popA AT AT nop] // continue ``over'': ``dupR'' [ popA push AT nop] // ``swapTR'' [ popA AT nop nop ] // finish ``over'': ``pop''. then: [ + ] +*R : // add R to T if T0 one DUP 1 AND IF pop dup push + THEN [push popA AT AT] // dup [ # push # nand ] // set up for ``if'', #(1), ``nand'' [ (then) ] [ (1) ] [ push popA AT AT] // ''dup'' [ nand 2/ jmp dropR ] // ``nand'', ``if'' [ push dropR nop nop ] // finish ``if'' [ popA AT AT push] // get copy of R ``pop dup push'' then: [ + ] // add either 0 or a copy of R to original T // add step on return stack +*_return : // add Rnext to Rtop if Rtop0 one //pop DUP 1 AND IF pop dup push + THEN push [ popA AT AT #] // pop dup #(1), [ (1) ] [ nand push nop nop ] // start of ``and'' [ popA AT AT nand] // end of ``and'' [ # 2/ jmp dropR ] //``if'' [ (then) ] [ push dropR nop nop ] // finish ``if'' [ popA AT AT push] // get copy of R ``pop dup push'' then: [ + push ] // add either 0 or a copy of Rnext to original Rtop xor : xor and : nand dup nand + : + pop : popA AT A@ : #(_A) push @R+ dropR dup : dup over : push dup pop swap over : push dup swapTR pop which expands to push push _[[ popA AT AT _]] _[[ popA push AT _]] _[[ popA AT _]] push : push A! : #(A) push ! dropR nop : nop drop : push popA
[FIXME:] 2000-07-23: After further thought, this seems like a very good thing. Replace "!R++" and "@P++" with "@R++" and "!A++". Don't have time to go through and see if this is really superior and update all the examples. Advantages: Now we have a real load and a real store instruction.
You may be asking, "OK, now we can *read* from a address in R, and we can *write* to a address in A -- but how do we get literal values into R ?". err... ummm... we can't. Other than painstakingly mathematically assembling them a bit at a time. Unless... we have a special function at memory address 0.
So normal code, whenever it needs some literal value, does
dup dup xor // get a zero push // push that zero onto the return stack swapPR // call the special function at address 0. 60321 // some literal value, embedded in the code dropR // get rid of the 0. ... // program continues from this point.
This is a classic technique in 6502 assembler (and probably other assembly languages) -- embed literal values, parameters to a subroutine, in the next memory cells after the subroutine call. Then the subroutine can access them using the value on the return stack:
[0 org] @R++ // grab the constant in the caller's code stream swapPR // return from subroutine. dropR // optionally get rid of the 0 in the branch delay.
With this modification, code is *not* forced to be executed a entire 3 instruction bundle at a time.
We don't need to call this subroutine *every* time -- I could imagine that we could have special versions of most subroutines that have a extra "@R++" tacked on the end, to get that literal value set to go.
Strange variations and ideas (ways to make the instruction set *more* ugly):
: f- ( ... S T -- ... (S-T) ) swap r- ; : r- ( ... S T -- ... (T-S) ) swap f- ; : negate 0 r- ; : negate 0 swap f- ; : + negate r- negate ; // 5 cycles : + negate f- ; // 4 cycles : +_+ ( ... a b c -- ... (a+b+c) ) negate r- r- negate ; // 6 cycles : +_+ ( ... a b c -- ... (a+b+c) ) negate f- negate f- ; // 8 cycles
There is only 1 branch instruction, only one way to modify the value of P. it is also the only conditional instruction.
T=0
conditional branch: conditionally swap P with R.
(??? should this also drop T when T is zero ? When you *know* it is zero, why bother keeping it ?) otherwise skip the "jump" and continue).
+* : // add S to T if T0 one DUP 1 AND IF OVER + THEN dup #(1) nand dup nand ( x -- lsb x ) #(continuelabel) push T=0 cnop // (1 x) < continuelabel > drop over + continuelabel: // (x) < old_address > dropR
Once a cell (pointed to by [P]) is loaded for execution, all of the instructions in that cell are executed, from the first to the last. The CPU never skips the remaining instructions in a cell (not even for the branch instruction T=0). The CPU never starts executing somewhere in the middle of a cell -- all branch destinations are to (the first instruction of) a particular cell, not to some (other) instruction inside that cell. Once a cell has been loaded from a particular address into the for execution, P is incremented. Typically the instructions in that cell are executed while the contents of the next address is being loaded; that next cell will in turn be executed unless (a) the current cell mentions "load" (#), which diverts the next cell to T instead of the instruction latch, increments P, and starts loading the following cell; (b) the current cell modifies P with "T=0". (We *could* just flush the "next cell" we just loaded, then load where P now points, then start executing along this new branch).
push
"). Perhaps making it part of the R stack
will be easier.
If A is part of the R stack, then
"swapPR
" is OK,
"popA
" is trivial (decrement pointer to A R stack),
"AT
" is OK,
"push
" is a bit tricky --
needs to copy old value of A to next higher location,
then move T to old location of A.
: dropT push dropR ; : dropR pop dropT ;Picking "dropR" seems to reduce branch and call overhead.
: over push dup pop swap ; >R dup >A R> A> ; : swap over push push drop pop pop ; push popA push AT pop ; >A >R A> R> ;
... load(label) SWAP(P,T) cnop (next instruction executed after subroutine returns) dropT ... label: (start of leaf subroutine) push ... pop SWAP(P,T) cnop(6 instructions, not counting cnop). Or alternatively
... load(label) push SWAP(P,R) cnop (next instruction executed after subroutine returns) dropR ... label: (start of leaf subroutine) ... SWAP(P,R) cnop(5 instructions, not counting cnop) (6 instructions if we replace "dropR" with "pop dropT")
... load(label) P->R, T->P cnop (next instruction executed after subroutine returns) dropR ... label: (start of leaf subroutine) ... pop P->R, T->P cnop(5 instructions, not counting cnop) (6 instructions if we replace "dropR" with "pop dropT")
"Go To" style jumps can be coded
#(label) push P->R, T->P // jump ... label: dropR // drop the unused address from R ...
Conditional "Go To" style jumps can be coded
#(label) push conditional(T=0 ? P->R, T->P) ... label: dropR // drop the unused address from the R stack ...
... load label push P->T, R->P // call subroutine cnop (next instruction executed after subroutine returns) dropT ... label: (start of leaf subroutine) push // save the return address on the R stack ... P->T, R->P cnop(6 instructions, not counting cnop) The return-like "3-register" branch seems to always be inferior to the call-like "3-register" branch, because the call sequence requres that extra "push" instruction.
Then push and pop do dual-purpose as call and return.
Then "Go To" style jumps can be coded
#(label) push with matching "come from" locations coded label: pop pop drop push // ... or alternatively #(label) pop drop pushwith no special code needed at the "come from" location.
Conditional "If(stuff)Then (thenpart) endif (continuation)" style structures are coded
(stuff) // leaves result on T // T=0 conditionally drops either P (nonzero) or R (zero) ??? #(label2) push T=0 cnop label: (thenpart) label2: (continuation)"If(stuff)Then (thenpart) else (elsepart) endif (continuation)"
(stuff) // leaves result on T // T=0 conditionally drops either P (nonzero) or R (zero) ??? #(label2) push T=0 cnop label: (thenpart) pop drop // get rid of current value of P #(label3) push // get new value of P label2: (elsepart) label3: (continuation)
Subroutine calls can be coded
# (com) push ... while subroutines them selves can be coded // the return address is on the R stack ... pop drop
I played with this (1998) and got Instruction summary:
The 15 instruction codes are: Code Name Description Traditional Forth where A is a variable @P+ fetch, address in P, push data onto T, P++ P @ @ 1 P +! !P+ store, address in P, pop data from T, P++ P @ ! 1 P +! push pop T, push into P >P pop pop P, push into T P> T=0 if T0 to T19 all zero, nop else drop P. (note: ignores MSbit of T !) A! pop T into A A ! A@ push A into T A @ xor exclusive-or S to T XOR nand nand S to T AND COM + add S to T + 2/ shift T, T20 to T19 (T20 unchanged) 2/ dup push T into T DUP over push S into T OVER drop pop T DROP nop NOPThe next instruction fetch begins as soon as no memory instructions (@P+ !P+) or possible branches (push pop T=0) are pending.
// fetch, address in R, increment R @R+ pop @P+ over push push drop pop //store, address in R, increment R !R+ pop over !P+ push drop push @P+ pop drop fetch, replacing address in T with data. @ push !P+ pop drop Store, address in T, data in S, removing both. !
One *could* read everything through a single register @P+ and !P+ but then references to variables in data RAM require careful manipulation to put the variable address in P, do the load, and restore the next-instruction address to P before the end of the cell:
swapPR !P+ swapPR.
One *could* access everything through a single register @R+ and !R+ but then references to in-line constants require that the sequence
swapPR @R+ swapPRfinish before the end of the cell.
As soon as instruction is loaded into the instruction latch and starts to execute, the next word starts to be loaded into a prefetch buffer.
If the instruction finishes with no memory access, we wait for the word to finish loading then copy the prefetch buffer to the instruction latch. If the instruction has a # literal load, we wait for the word to finish loading then copy the prefetch buffer to T, and start loading the *next* word. If the instruction has a data-memory access ... fastest would be to wait for the word to finish loading somehow keep it in some other buffer, start another memory cycle to load the requested item to T. But a simpler circuit would just interrupt and throw away the current cycle, start a new cycle to load the item to the buffer like all memory accesses do, when that is done move it to T and re-start the cycle to load the next instruction. (This may be faster if memory is *very* slow and can be cancelled quickly).
The only case when this prefetch doesn't help is with a calculated jump, then one must interrupt and throw away the current cycle, start a new cycle to load the new P location, and when that is done move it from the buffer to the instruction latch.
We might get away with *not* having a "delayed branch"
for normal calls (#(dest) push call
)
if we could force them into the sequence
- execute current instruction while loading next cell - when you hit #, wait for cell to finish loading - don't pre-fetch any more cells until we're ready to put this new address on the data bus - load new cell from [dest] into instruction register and continue from there.(but this still gives a delayed branch for the return).
Using @P+ (rather than @R+) as the only memory access instruction seems to better reflect the state of the "simpler circuit" prefetch mechanism.
//< address -- address+1 > //( -- [address] ) dup dup xor // push 0 to T // all in one word cSWAP(P,R) // always swaps, since T is zero @P+ // do the load swap // get the 0 back on top cSWAP(P,R) // restore P // end of word dropT // get rid of the 0.OK, packing 4 instructions into a single word isn't too difficult. Although it *does* seem like a lot of work (8 instructions) to do a pretty fundamental operation. (it would only be 4 instructions if we add a unconditional SWAP(P,R) instruction; if we instead add a @R+ instruction, it would only take 1 instruction). Something simple like "copy one word from this address to that address" becomes
//< source dest -- source++ dest++> //( -- ) dup dup xor // push 0 to T // all in one word cSWAP(P,R) // always swaps, since T is zero @P+ // do the load swap // get the 0 back on top cSWAP(P,R) // restore P // end of word dropT // get rid of the 0. pop swap // now < dest > ( value source++ ) !R+ push
This may be a slippery slope leading to a VLIW CPU rather than a MISC CPU. But it avoids the confusion of "What if there are multiple flow-control instructions in a word ?" Basically, I'm thinking something like a group of bits to indicate which type of cell is this:
(a) this entire cell is a literal; read it into T and go on to the next word. (b) This cell is normal sequential list of instructions; read them into the instruction latch, execute them, and go on to the next word. (c) When you're done with the instructions in this cell, do a conditional branch or call of type XXX. (d) When you're done with these instructions, return.At the time I'm writing this, that is 4 different possibilities which requires 2 bits, but it eliminates 3 instructions from my opcode list. (Just a single bit could be used to discriminate between sequential v. branching cells ...)
This page started 1998-01-15 (but some information much older, going back to 1996 Nov 12) and has backlinks
David Cary feedback.html
Return to index // end http://david.carybros.com/html/minimal_instruction_set.html