minimal instruction set

updated 2000-01-05.

When I was working on this, I thought it was pretty creative and unique. Later I found out that I'd been stuck in a ``Turing Tar Pit''. Still, this looks like a more reasonable instruction set than some of the really ugly things that have been developed by other people who have also wasted a lot of time in a ``Turing Tar Pit''.

Describes a microprocessor instruction set developed by David Cary that packs 2 instructions into 8 bits of RAM. As far as I know, this is *the* Minimal Instruction Set (for a single-processor von Neumann machine). In other words, I know of no other (Turing-complete) instruction set that has as many or fewer distinct instructions (DI).

2003-01-04:DAV: I just discovered
``A Minimal CISC'' by Douglas W. Jones http://www.cs.uiowa.edu/~jones/arch/cisc/ that has fewer distinct instructions (DI): only 8, so 5 of them can pack into a 16 bit word.

The instructions are:

  1. NOP: No operation.
  2. DUP: Duplicate the stack top. This is the only way to allocate stack space.
  3. ONE: Shift the stack top left one bit, shifting one into the least significant bit.
  4. ZERO: Shift the stack top left one bit, shifting zero into the least significant bit.
  5. LOAD: Use the value on the stack top as a memory address; replace it with the contents of the referenced location.
  6. POP: Store the value from the top of the stack in the memory location referenced by the second word on the stack; pop both.
  7. SUB: Subtract the top value on the stack from the value below it, pop both and push the result.
  8. JPOS: If the word below the stack top is positive, jump to the word pointed to by the stack top. In any case, pop both.

... any constant can be pushed on the stack by a DUP followed by 16 ONE or ZERO instructions. Zero may be pushed on the stack by the sequence DUP DUP SUB; negation may be done by subtracting from zero; addition may be done by subtracting a negated value, and pushing zero prior to pushing an unconditional branch address allows an unconditional branch.
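To make those semantics concrete, here is a minimal sketch in C of an interpreter for the 8 instructions. The 16-bit word size matches the 5-opcodes-per-16-bit-word packing above, but the memory size, the stack depth, and all of the names are assumptions of this sketch, not part of Jones's specification:

	#include <stdio.h>
	#include <stdint.h>

	enum { NOP, DUP, ONE, ZERO, LOAD, POP, SUB, JPOS }; // 8 opcodes = 3 bits each

	static int16_t mem[4096]; // word-addressed RAM (size is an assumption)
	static int16_t stk[256];  // operand stack; sp = next free slot
	static int sp;

	static void step(int op, uint16_t *pc)
	{
	    int16_t top, below;
	    switch (op) {
	    case NOP:  break;
	    case DUP:  stk[sp] = stk[sp-1]; sp++; break;
	    case ONE:  stk[sp-1] = (int16_t)(((uint16_t)stk[sp-1] << 1) | 1); break;
	    case ZERO: stk[sp-1] = (int16_t)((uint16_t)stk[sp-1] << 1); break;
	    case LOAD: stk[sp-1] = mem[(uint16_t)stk[sp-1] % 4096]; break;
	    case POP:  top = stk[--sp]; below = stk[--sp];        // top = value,
	               mem[(uint16_t)below % 4096] = top; break;  // below = address
	    case SUB:  top = stk[--sp];
	               stk[sp-1] = (int16_t)((uint16_t)stk[sp-1] - (uint16_t)top);
	               break;
	    case JPOS: top = stk[--sp]; below = stk[--sp];        // top = address,
	               if (below > 0) *pc = (uint16_t)top; break; // below = condition
	    }
	}

	int main(void)
	{
	    uint16_t pc = 0;
	    stk[sp++] = 7;  // pretend a DUP plus ONE/ZERO shifts built a literal 7
	    step(DUP, &pc); step(DUP, &pc); step(SUB, &pc); // DUP DUP SUB pushes zero
	    printf("T=%d S=%d\n", stk[sp-1], stk[sp-2]);    // prints T=0 S=7
	    return 0;
	}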

See also its counterpart, "The Ultimate RISC" by Douglas W. Jones http://www.cs.uiowa.edu/~jones/arch/risc/

The "Whitespace" programming language is similar to this "Minimal CISC". http://compsoc.dur.ac.uk/whitespace/

The "Path" language http://pathlang.sourceforge.net/ has an interesting 2D layout, and a near-minimal instruction set (expressed in single characters):

2002-12-07:[FIXME: move ``Turing Tar Pit'' information here.] computer_architecture.html#tarpit

contents: [FIXME:]

See also

What is the minimum number of instructions for a Turing-complete von Neumann machine ?

Here David Cary pushes the MISC idea to ugly extremes. [FIXME: need a catchy name for this architecture]

"whenever you excessibly constrain any parameter, something else has got to give." -- Don Lancaster.

I've put together yet another MISC instruction set. I've squeezed it down to 11 instructions. I don't think I've painted myself into a corner yet (I hope). If I could just squeeze out 3 more instructions, I could pack 5 of them into a 16 bit cell.

One advantage of having few instructions -- you can document them in a reasonably-sized email, rather than needing an entire book to document the details of scads of instructions.

There are 2 very different ways of counting the "size" of an instruction set.

We could count the number of different instruction mnemonics (NM) mentioned in the assembler documentation. This method is subjective, since different assemblers may describe the same machine with different numbers of mnemonics (for example, one assembler may require programmers to type "LOAD dest TO PC" to get the effect of an immediate jump, while another assembler may require programmers to use an additional mnemonic "JUMP TO dest" to get exactly the same bit pattern in the final executable).

Processors with smaller NM are likely to have regular, orthogonal instruction sets (the same addressing modes apply to all instructions), which makes them easier to program than processors with larger NM.

The minimum number of mnemonics NM is 1: the "move" instruction used by TTA http://www.rdrop.com/~cary/html/computer_architecture.html#tta . Since you know what the instruction will be in a TTA architecture, it takes zero bits to specify -- but TTA still requires a bunch of bits per instruction to select registers and/or addressing modes.

Another way of counting the "size" of an instruction set is to count all possible distinct instructions (DI), all valid variations of "source register", "destination register", and "addressing mode". Using this count, TTA has lots more than 1 instruction.

This is a more objective count: if the longest instruction has b bits, then the number DI is equal to 2^b minus the number of invalid instructions of that length. (For example, 5-bit instructions allow 2^5 = 32 encodings; the F21 below uses 27 of them, leaving 5 encodings invalid or unused.)

Processors with smaller DI are likely to need fewer bits b to specify each instruction. Processors with fewer bits per instruction can pack more instructions into each word of program RAM.

Unfortunately, this advantage is partially (perhaps completely) cancelled out by the fact that processors with smaller DI often require *more* instructions (more RAM and/or more cycles) to implement a given piece of functionality than processors with larger DI.

Nevertheless, it is an interesting intellectual challenge to wonder: what is the minimum number of bits bmin required per instruction ? What is the minimum number of distinct instructions DI for a Turing-complete von Neumann machine ?

The shBoom(tm) microprocessor from Patriot Scientific Corporation http://www.ptsc.com/ http://www.circuitcellar.com/articles/misc/tom-92.pdf packs instructions into 8 bits.

So we know DImin is 256 or less.

The "Itty Bitty Stack Machine" http://www.ittybittycomputers.com/IttyBitty/IBSM.htm is very similar to the F21, with most instructions packed into 5 bits. (A total of about 54 instructions) (This instruction set is allegedly "fast, that is, it should be capable of emulation at a raw speed not slower than 10% of the host native hardware.")

The "F21" Forth engine by Chuck Moore F21 STACK PROCESSOR CPU DESCRIPTION http://pisa.rockefeller.edu:8080/MISC/F21.specs has 27 distinct instructions, each one is packed into 5 bits (neglecting the bits that follow "#" and branches).

So we know DImin is 27 or less.

Clive Sinclair http://www.cdworld.co.uk/zx2000/clive.html claims that he has designed a CPU with only "16 principle instructions", but he doesn't list any details.

Dr Neil Burgess mentions "ultraRISC processor, that has only 15 instructions" http://www.acue.adelaide.edu.au/leap/discipline/eng/Burgess.html but he doesn't list any details.

Can a CPU really be designed to have 16 or fewer distinct instructions DI, such that one can pack 2 instructions into 8 bits of RAM ? 8 instructions into a 32 bit word ? Or are these people counting NM, not DI ?

Starting with the elegant 27 instruction set for the "F21" Forth engine by Chuck Moore F21 STACK PROCESSOR CPU DESCRIPTION http://pisa.rockefeller.edu:8080/MISC/F21.specs , and eliminating instructions that could be emulated by (sometimes lengthy) combinations of the other instructions, David Cary managed to get a (very ugly) 16 instruction set.
"# push pop A! A@ T=0 @A !A xor and + com 2/ dup over drop". (16 instructions as of 1998)

Can you think of a more "elegant" (yet still "complete") instruction set of 16 or fewer distinct instructions DI ? Turing complete cellular automata are proof that a Turing Machine can be built without *any* distinct "instructions".

Hint: You soon get to the stage "Every instruction in this set is essential. If I eliminate *this*, I won't be able to do *that*, no matter how many of the remaining instructions I string together". While you can't simply eliminate 1 instruction from this set, sometimes you can replace 3 instructions from a set with 2 completely different instructions that, given the remaining instructions, can still do *that*. It helps to assume you have some scratchpad RAM, since the fewer internal registers you have, the fewer instructions you need to shuffle things back and forth between those registers.

Myron Plichota and vic plichota have developed a much more elegant set of 16 instructions they call qUark ../mirror/quark.txt

With a bit of a challenge from "vic plichota" <atsvap@cgo.wave.ca>, and many ideas from Myron Plichota, I've developed an instruction set with only 13 instructions: "# ! + xor nand 2/ push pop dropR swap dup T=0 nop" (13 instructions as of 1999-02-22) Programmer's model: there is an instruction pointer P and 2 stacks, the "return stack" and the "data stack". The top of the data stack is called T and the second on the data stack is called S; the top of the return stack is called R.

  T  R  P
  S  .
  .  .
  .  .
  .  .

I've managed to whittle it down even more with a ``conditional skip if arithmetic result not zero'' idea from Alan Grimes: Instruction summary: "# ! + xor nand 2/ push pop toA Afrom call nop" (12 instructions as of 1999-03-22)

I've managed to whittle it down even more: "# ! + xor nand 2/ push popA AT call nop" (11 instructions as of 1999-03-24)

I *think* this is still functionally complete, and can still do everything that any other Turing-complete CPU can do. It just takes a *lot* more instruction cycles to do most things than most CPUs.

2000-01-05:DAV: I've just stumbled across "BF: An Eight-Instruction Turing-Complete Programming Language which was invented by Urban Mueller solely for the purpose of being able to create a compiler that was less than 256 bytes in size, for the Amiga OS."

More BF details:

A very clean instruction set, although it takes far more instruction cycles than even my 11 instructions to do even the most trivial things. I wonder if we could redefine the "input" and "output" op-codes ...

** Programmer's model

"# ! + xor nand 2/ push popA AT call nop" (11 instructions as of 1999-03-24) [perhaps more Forth-like names would be

 >R toR      instead of push
 A> Afrom    instead of AT
 RA ??       instead of popA
]

2 push-down stacks, the "return stack" and the "data stack".

    T A R P
    S   .
    .   .
    .   .
    .   .

All registers are the same length, the length of a memory address.

[FIXME: An implementation needs to choose:

(Moore fixed (sizeof_address) = (sizeof_data_word + 1 bit). What other choices would be "interesting" ???) ]

[A slow, minimal-gate implementation needs stack pointers that point to S and R in RAM. Faster processors can keep most or all of the stack on-chip ... What should the processor do when stacks overflow or underflow ? ]

** Acronyms and notation

  cell: the implementation's native integer size, the number of bits read and written at once.
  [ ... ]: everything inside the brackets is contained in a single cell.
  ( ... ): parameter-stack diagram, T on the far right, then S.
  < ... >: return-stack diagram, R on the far right.
  | : the pipe character indicates an "either-or" choice.

Subroutines are documented with the initial (to the left of the "-") and final (to the right of the "-") state of the data stack and the return stack.

** External interface summary

Data on the external memory bus always comes from T or from "external devices", and always goes to "external devices" or T or the instruction latch. All data in memory is accessed only on cell boundaries (i.e., only 1 whole cell is read or written at a time).

Addresses on the memory bus always come from P or R (or perhaps from external devices while the CPU is not using the bus).

Since instructions are always only 4 bits each, instructions are packed into "cells", as many as will fit: e.g., 3 instructions per 12 bit cell on some machines, 4 instructions per 16 bit cell, 5 instructions per 20 bit cell, or 8 instructions per 32 bit cell on other machines. (Does it make any sense for "cells" to be bigger or smaller than sizeof_data_word, the size of a single word of memory ?) A minimum of 3 instructions per cell is needed for the ugly hack that works around the lack of a proper "load" command.
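Here is a sketch in C of the unpack step for one 16-bit cell holding 4 four-bit opcodes; whether the first slot executed is the high nibble or the low nibble is an implementation choice, and high-nibble-first is merely an assumption of this sketch:

	#include <stdio.h>
	#include <stdint.h>

	// stub dispatcher: a real implementation would decode one of the
	// 11 opcodes here; 4-bit slots leave 5 encodings spare.
	static void execute_slot(unsigned op) { printf("op %x\n", op); }

	static void execute_cell(uint16_t cell)
	{
	    for (int slot = 0; slot < 4; slot++)
	        execute_slot((cell >> (12 - 4 * slot)) & 0xF); // high nibble first
	}

	int main(void) { execute_cell(0x1234); return 0; } // runs ops 1, 2, 3, 4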

Any opcode ``# !R+ + xor nand 2/ push popA AT call nop'' can occur in any slot.

[FIXME: How are interrupts handled ? computer_architecture.html#interrupt ]

It seems that the instructions execute so quickly that the bottleneck is the speed of the RAM.

** Memory access instructions:

Surprisingly, only 1 memory read instruction is needed for both in-line literals and for data everywhere else in memory:

	// ( -- x )
	@P+      // load data in RAM at [P], push it onto T, then increment P.
	load     // ditto
	#        // ditto

For example, when the CPU executes a cell that starts with 3 "#" instructions, then the following 3 cells are pushed onto the data stack (the second one ends up on S, the third one ends up on T), then the remaining instructions in the cell are executed, and then the 4th following cell is loaded into the instruction latch and executed.

  // ( ... - ... data1 data2 data3 )
  [ # # # ... ] [data1] [data2] [data3] [more instructions ...] ...

The assembler pseudo-op "#(value)" keeps track of implementation details (word size, register size, sign-extension, the current instruction word, where the next literal/instruction cell will be loaded from, etc.). "#(value)" expands to "#" (or perhaps "# com" or "# unsigned com") and packs the (possibly complemented) literal value into the next available (sizeof_data_word) cell, such that when that code is executed, the desired value gets loaded into T.

To load data that is *not* an in-line literal, somehow calculate the address and get it onto R, then pack these instructions into a single cell: "swapPR load swapPR" .

There is also only one write instruction:


	// ( x -- ) < address -- address+1 >
	!R+     // pop T, store popped value to RAM at [R], then increment R.
	store   // ditto

(??? should we use A for the address instead of R, a la the F21 ?) (hardware *might* be a bit simpler if P is used for *all* addresses to the memory bus, replacing !R+ with !P+).

** 2-operand arithmetical instructions

The 2 operand arithmetical instructions always pop 2 items off the data stack (S and T) and push the result back onto the data stack (into T).

  +     ( n1 n2 -- n1_+_n2 ) add S to T
  xor   ( n1 n2 -- n1_bitxor_n2 ) bitwise exclusive-or S to T
  nand  ( n1 n2 -- ~(n1_bitand_n2) ) bitwise nand S to T

One way to implement multiple-precision arithmetic in software is to use the most significant bit of T as the carry bit, extending precision in (sizeof_address - 1)-bit chunks.
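Here is a sketch in C of that carry convention: keep the numbers in 15-bit chunks inside 16-bit words, so that the sum of two chunks plus a carry can never overflow, and the most significant bit of the raw sum is exactly the carry into the next chunk. (The 16-bit container and the least-significant-chunk-first layout are assumptions of this sketch.)

	#include <stdint.h>

	// add two n-chunk numbers a and b into out; each chunk holds 15 value bits
	static void add15(const uint16_t *a, const uint16_t *b, uint16_t *out, int n)
	{
	    uint16_t carry = 0;
	    for (int i = 0; i < n; i++) {
	        uint16_t sum = (uint16_t)(a[i] + b[i] + carry); // at most 0xFFFF
	        out[i] = sum & 0x7FFF; // low 15 bits: the result chunk
	        carry  = sum >> 15;    // the most significant bit is the carry out
	    }
	}

	int main(void)
	{
	    uint16_t a[2] = { 0x7FFF, 0 }, b[2] = { 1, 0 }, out[2];
	    add15(a, b, out, 2); // 0x7FFF + 1 carries into the second chunk
	    return (out[0] == 0 && out[1] == 1) ? 0 : 1;
	}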

** 1-operand arithmetical instructions

  2/	( x -- x/2 ) arithmetic shift right of T (keep sign),
        and skip next instruction if T was not zero.
		(Next instruction only executed if T was (and still is) 0).
  nop	does nothing

[can "nop" be eliminated ?]

This "2/" is the only conditional instruction in the entire set.

The skipped instruction is typically "call" or "nop" or another ``2/''. Note that each executed ``2/'' whose T was not zero also skips the instruction after it, so an unconditional multi-bit shift must pair each ``2/'' with a skippable "nop": to shift 3 bits to the right, do "2/ nop 2/ nop 2/ nop". (I leave it as an exercise to the reader to show this sequence always has the net effect of unconditionally shifting T three times to the right.)
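Here is a sketch in C of that skip rule; following the parenthetical above, the skip decision is made from T's value before the shift. The function-pointer representation of a program is just an assumption of this sketch:

	#include <stdio.h>
	#include <stdint.h>

	static int16_t T; // top of data stack
	static int skip;  // when set, suppress exactly the next instruction

	static void shift_skip(void) // the "2/" instruction
	{
	    skip = (T != 0);      // skip next instruction if T was not zero
	    T = (int16_t)(T >> 1); // arithmetic shift right on typical compilers
	}

	static void nop(void) { }

	static void run(void (*const *program)(void), int n)
	{
	    for (int i = 0; i < n; i++) {
	        if (skip) { skip = 0; continue; } // a skipped slot does nothing
	        program[i]();
	    }
	}

	int main(void)
	{
	    void (*const shift3[])(void) =
	        { shift_skip, nop, shift_skip, nop, shift_skip, nop };
	    T = 8;
	    run(shift3, 6);
	    printf("T=%d\n", T); // 8 shifted right 3 times: prints T=1
	    return 0;
	}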

[It may make hardware simpler (allow interrupts at end of every cell) if the programmer model always includes a virtual "nop" at the end of every cell, i.e., even if a 2/ is the last instruction of a cell, the first instruction of the next cell is executed unconditionally. This means the compiler must insert or delete a ``nop'' to make the ``2/'' do what the programmer expects:

  2/ call      --> [... nop] [2/ call ...]
  2/ nop call  --> [... 2/ ] [call ... ]
  2/ 2/        --> [... 2/ ] [2/ ... ]
]

[Other options: perhaps make "2/", if the result is not zero, skip *all* the remaining instructions of the current cell. Or perhaps make "2/", if the result is not zero, skip precisely 3 instructions: those following it in this cell, and perhaps the first instruction(s) of the next cell. This delay slot could be filled with "AT push popA", reducing the need for the "nop" instruction. ]

** Stack manipulation instructions

  push	 ( x -- ) < -- x > pop T, push onto R
  popA	 ( -- ) < x -- > pop R, push onto A
  dropR	 ( -- ) < x -- > // alias for popA
  AT	 ( -- x ) push a copy of A onto T (leaving A unchanged).

Data can be moved in a single cycle from any register in the loop (T -> R -> A -> T) to the next register to its right: push moves T to R, popA moves R to A, and AT copies A back to T.

"popA" is a hint to the reader that the value of A will soon be used with AT; "dropR" is a hint to the reader that the value of A is now irrelevant.

** Branches (change-of-program-flow instructions)

There is only 1 branch instruction (only one way to modify the value of P), but lots of different aliases to make the intent of the programmer clear:

	// various aliases for the same branch instruction bit pattern
	swapPR ( -- ) < future_P -- past_P > // swap P with R.
	call
	branch
	return
	exit
	;      // pronounced "exit"
	jmp
	jump
	ret

P typically (but not always) points to the cell following the cell from which the currently-executing instructions were taken.

Once a cell is loaded into the instruction latch for execution, P is incremented, and then *all* of the instructions in that cell are executed, (possibly modifying P) from the first to the last. The CPU never skips the remaining instructions in a cell (not even for the call instruction).

If a "call" is immediately followed by a "#" instruction, then that distant value at [P] is loaded, *not* the data in the cell immediately following the cell currently being executed.

After *all* instructions in a cell have executed, the next cell of instructions is read from this new value of [P] into the instruction latch, P is incremented, and then all the instructions in that cell are always executed (possibly modifying P). Whatever value of P exists immediately after they are all executed is the address where the next instruction cell will be loaded from.

Sequences that "temporarily" change P must be very careful to restore P before the end of that cell.

All branch destinations are to (the first instruction of) a particular cell, not to some (other) instruction inside that cell.
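Here is a sketch in C of this fetch/execute order, with "swapPR" as the only way P changes; the 4-slot cell, the opcode numbering, and the memory size are assumptions of this sketch:

	#include <stdio.h>
	#include <stdint.h>

	static uint16_t mem[4096], P, rstack[64];
	static int rsp;

	static void swapPR(void) // the only branch: swap P with the top of R
	{
	    uint16_t t = P;
	    P = rstack[rsp-1];
	    rstack[rsp-1] = t;
	}

	static void run_cell(void)
	{
	    uint16_t cell = mem[P % 4096]; // load the instruction latch ...
	    P++;                           // ... then increment P ...
	    for (int slot = 0; slot < 4; slot++) {        // ... then run ALL 4
	        int op = (cell >> (12 - 4 * slot)) & 0xF; // slots, even after a call
	        if (op == 8) swapPR(); // "call" as opcode number 8 is an assumption
	        // ... dispatch of the other 10 opcodes would go here ...
	    }
	} // the next fetch uses whatever P is now

	int main(void)
	{
	    mem[0] = 0x8000;   // one cell: [call nop nop nop], with nop encoded as 0
	    rstack[rsp++] = 5; // subroutine address on the return stack
	    run_cell();
	    printf("P=%d R=%d\n", P, rstack[0]); // P=5, return address 1 left on R
	    return 0;
	}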

** prefetch implementation

Prefetch may be implemented on some chips. It has no effect on the programmer model; all implementations act "as if" there was no prefetch.

Typically, while the instructions in one cell are executed, the next cell is being pre-fetched. That pre-fetched cell will in turn be executed unless (a) the current cell mentions "load" (#), which diverts the pre-fetched cell to T instead of the instruction latch, increments P, and starts pre-fetching the following cell; or (b) the current cell modifies P with "swapPR", in which case the CPU flushes the ``next cell'' it speculatively pre-loaded and starts pre-fetching from wherever P now points.

[A strange alternative: Instead, a chip *could* stick with the absolute simplest thing to do with "call" -- merely swap R and P, and not worry about the consequences.

This has a *major* impact on the programmer's model.

This means that pre-fetched cells are *never* ``wasted''. The consequences are that after a cell with one ``call'' and no ``load'' instructions is fetched, the cell at the following address is prefetched and will be executed unconditionally (it is in the "branch delay slot" of a "delayed branch" a la Sparc ???), while prefetch loads the first cell of that subroutine. (note that there is *both* a call delay slot *and* a return delay slot).

Confusing, difficult-to-explain things happen if one tries to mix "call" followed by several "#"s in the same cell. The first "#" loads data from the cell immediately following the cell being executed, while the following "#"s load data from distant cells.

Other strange things happen if you put multiple call instructions in consecutive cells.

If you do this, then the macro for @R+ expands to:

: @R+
  // next 4 instructions must be in same cell.
  swapPR
  @P+ // puts a wasted prefetched value in T, starts loading desired value
  swapPR
  @P+ // puts the desired value in T, starts loading next instruction
  cnop
  [ cnop ] // this cell is skipped
  swap // get the wasted value on top
  drop // get rid of it.
  ;
]

[ we might be able to have a minimal implementation that packs only 1 instruction per cell, feeding out of a 4 bit RAM, if we change the instruction buffer so it pre-fetches and holds not only the currently executing instruction, but also the 2 following instructions. The ugly hack to fake a "load [R]" would work like this: while executing the source code "swapPR swapPR # push call", the instruction buffer would look like this:

  1. ``swapPR swapPR #'' (P and R get swapped; then data at [P++] is read)
  2. ``swapPR # (data)'' (P and R get swapped; instruction at [P++] is read)
  3. ``# (data) push'' (data shunted to T, instruction buffer overwritten with NOP; instruction at [P++] is read)
  4. ``nop push call'' (etc.)
every instruction triggers 1 read cycle to get the next instruction and shifts the instruction queue down one. ]

If we are feeding out of a 4 bit RAM but we want the register size to be N times larger, then we could prefetch N+1 following instructions (confused yet ?). Then the sequence would be (for N=3, so T = 3*4 bits = 12 bits)

  1. ``swapPR nop nop swapPR #'' (P and R get swapped; then data at [P++] is read)
  2. ``nop nop swapPR # (data)'' (nop; then data at [P++] is read)
  3. ``nop swapPR # (data) (data)'' (nop; then data at [P++] is read)
  4. ``swapPR # (data) (data) (data)'' (P and R get swapped; instruction at [P++] is read)
  5. ``# (data) (data) (data) push'' (data shunted to T, instruction buffer overwritten with NOPs; instruction at [P++] is read)
  6. ``nop nop nop push call'' (etc.)

** The compiler/assembler

The assembler pseudo-op ``cnop'' fills the rest of the current cell with (zero or more) NOPs, sufficient to align the next instruction with the start of the next cell. This alignment is necessary when you want that next instruction to be the destination of some branch in the code. (This is implied by labels at the start of a subroutine, and by the "then" and "else" of an "if-then" statement.)
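Here is a sketch in C of an assembler back-end emitting into 4-slot cells, with ``cnop'' as the alignment pad; the NOP encoding of 0 and the high-nibble-first slot order are assumptions of this sketch:

	#include <stdio.h>
	#include <stdint.h>

	enum { SLOTS = 4, NOP_OP = 0 }; // assumed cell geometry and NOP encoding

	static uint16_t image[4096]; // the output cells
	static int cell, slot;       // current fill position

	static void emit(int op) // pack one opcode into the next free slot
	{
	    image[cell] |= (uint16_t)((op & 0xF) << (12 - 4 * slot));
	    if (++slot == SLOTS) { slot = 0; cell++; }
	}

	static void cnop(void) // pad with NOPs until the next opcode starts a
	{                      // fresh cell, i.e. a legal branch destination
	    while (slot != 0)
	        emit(NOP_OP);
	}

	int main(void)
	{
	    emit(1); emit(2); // two opcodes into cell 0
	    cnop();           // fills cell 0 with 2 NOPs
	    emit(3);          // lands at the start of cell 1
	    printf("%04x %04x\n", (unsigned)image[0], (unsigned)image[1]); // 1200 3000
	    return 0;
	}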

The compiler automatically generates inline code whenever the word being compiled would take *less* space inline than the subroutine call. (-- idea from Myron Plichota)

Should the compiler choose to actually generate a call, `` ... subroutine_label ... '' expands to something like this:

	[       ... # push call]
	[ subroutine_label     ]
	[ ...            ]
	...
  subroutine_label:
	... // instructions to actually do something
	[ ...     return dropR ]

To reduce space slightly, the compiler may choose to compact lists of subroutine calls (after making sure that none of the subroutines will interfere too much with the return stack) from

	[       ... # push call]
	[ name_1               ]
	[       ... # push call]
	[ name_2               ]
	[       ... # push call]
	[ name_3               ]
	[       ... # push call]
	[ name_4               ]
	[      ... return dropR]
to
	[ # # # # push push    ]
	[ name_1               ]
	[ name_2               ]
	[ name_3               ]
	[ name_4               ]
	[ push push jump dropR ]

or to

	[ # push # push            ]
	[ name_4                   ]
	[ name_3                   ]
	[ # push # push jump dropR ]
	[ name_2                   ]
	[ name_1                   ]

(This is a generalization of "tailbiting").

To minimize space (but not time) in long sequences of only subroutine calls some compilers may use direct threading. data_compression.html#program_compression Here is an example of a fully re-entrant direct-threaded subroutine list interpreter:

	[... # push call]
	[ subroutine_list_interpreter  ]
	[ sub1 ]
	[ sub2 ]
	[ sub3 ]
	[ sub4 ]
	[ continue ]
  continue: // ( ) < oldP p_subroutine_list >
	[dropR dropR ...]
  ...
  subroutine_list_interpreter: ( -- ) < p_subroutine_list -- >
	[ swapPR # swapPR push call cnop ] // = [ @R+ push call cnop ]
	[ # jump dropR              cnop ]
	[ subroutine_list_interpreter    ]

This subroutine list interpreter assumes these subroutines don't modify the top 2 items on the return stack; it's ok if they use deeper items for parameters.

Once you have this subroutine list interpreter, you can collapse other lists of subroutine calls to 2 full cells + 2 partial cells of overhead, plus 1 more cell per subroutine call.

The compiler converts ``.... if ... true_stuff ... then ... '' statements to something like

    // last 5 instructions must all be in same cell
    [ ... # push 2/ jmp dropR push dropR ]
    [ then ]
    [ ... true_stuff ...      ]
    ...
  then:
    ...

The compiler converts ``.... if ... true_stuff ... else ... false_stuff ... then ... '' statements to something like

    // last 5 instructions must all be in same cell
    [ ... # push 2/ jmp dropR push dropR ]
    [ else ]
    [ ... true_stuff ...      ]
    ...
    [ ...    # push jmp dropR ]
    [ then ]
  else:
    [ ...   false_stuff   ... ]
    ...
  then:
    ...

A minimum implementation (only 3 instructions per cell) is forced to use this alternate form:

	// last 2 instructions must be in same cell
    [ ...    ... # push 2/ jmp ]
    [ else ]
    [ dropR push dropR  ... true_stuff ... ]
    ...
    [ ...    # push jmp dropR ]
    [ then ]
  else:
    [ dropR push dropR ...   false_stuff   ... ]
    ...
  then:
    [ ...                ]
    ...

A smarter compiler might have a special ``else 0 then'' case, compiling code like `` ... if ... true_stuff ... else 0 then ... '' to

    [ ... # push 2/ jmp dropR ] // ( refrain from dropping T )
    [ (then) ]
    [  push dropR ... true_stuff ...      ] // T was non-zero; drop it
    ...
  then:
    ... // if T was 0, now T is still 0.

(note that the "2/" never drops T, even in the case where T=0 and the CPU actually executes that next instruction. I suppose a really smart compiler might be able to refrain from dropping T in certain unusual situations when we later want to use T/2 in the true_stuff. and/or 0 in more complicated false_stuff. ).

The compiler converts `` ... DO ... LOOP ... '' into [FIXME: DO (loop initialization), LOOP (increment-by-1 loop), +LOOP (variable-increment loop) ???]

Some processors have the only conditional instruction be a "conditional return if 0" instruction. Another way of implementing ``.... if ... true_stuff ... then ... '' is to use the "conditional-return" style:

	[       ... # push call]
	[ subroutine_label     ]
	[ dropR ...            ]
	...
  subroutine_label:
    [ ... condition ... ]
    [ ... 2/ return push dropR ] // conditional return
    [ ... true_stuff ...      ]
	[ ...     return ] // unconditional return

A "conditional return" implementation of ``.... if ... true_stuff ... else ... false_stuff ... then ... '' looks something like

	// could calculate some of condition here
	[       ... # push call]
	[ (else)     ]
	...

  else:
	// < return_address >
	// could calculate some of condition here
	[       ... # push call ]
	[ (then)     ]
	// < return_address >
    [ ...   false_stuff   ... ]
	[ ...     return ] // unconditional return
    ...

  then:
	// < ... return_address else_address >
    [ ... condition ... ]
	// either return to "else:",
	// or fix up return stack so
	// we can do true_stuff
	// and return all the way to
	// original call.
    [ ... 2/ return drop dropR ]
	// < return_address >
    [ ... true_stuff ...      ]
    ...
	[ ...        return dropR ] // unconditional return

//some common macros
//using the instruction set
//  "# !R+ + xor nand 2/ push popA AT call nop" (DAV 1999-03-24)

// pseudo-primitives: pop, dup, drop, swapTR, swap.
// (_[[ ... _]] marks a group of instructions that must be packed into
// a single cell, e.g. so that A is not live across a cell boundary.)

// pop: take top value off R stack, push onto T stack.
: pop     ( -- x ) < x -- >
	popA AT ; // always inlined

: pop_dup	( -- x x ) < x -- >
	_[[ popA AT AT _]] ; // 3 cycles, always inlined

// dup: make a duplicate copy of T, push into T and S
: dup     ( x -- x x )
	push pop_dup ; // 4 cycles, always inlined

: drop    ( x -- ) < -- >
	push dropR ; // always inlined

// swapTR: swap T with R
: swapTR  ( a -- b ) < b -- a >
	popA push AT ; // 3 cycles, always inlined

: swap    (... a b --... b a ) // swap T with S
	push swapTR pop ; // 6 cycles

: 0     ( -- 0 )
	dup dup xor ;
: 0     ( -- 0 )
	_[[ AT AT _]] xor ; // 3 cycles, always inlined

// The A register and ``popA'' and ``AT'' instructions
// may be reserved exclusively for these pseudo-primitives to use.
// Then use "push, pop, dup, drop, swap, swapTR, dropR, 0" macros
// as if they were the primitive instructions.

// stack manipulation

: over    (... b a --... b a b )
	push dup pop swap ;

: over	(... b a --... b a b ) // 13 cycles if inlined
	push dup swapTR pop ;

: rot	(... a b c --... b c a )
	push swap pop swap ;

: rot	(... a b c --... b c a )
	push push swapTR pop swapTR pop ;

// arithmetical manipulation

// com: complement: invert all bits of T.
: com	( x -- (-x-1) ) or equivalently ( x -- -(x+1) )
	dup nand ; // 5 cycles

: -1    ( -- -1)
	0 com ; // 8 cycles
: -1    ( -- -1)
	0 0 nand ; // ( -- -1) // 7 cycles
: negate   ( x -- -x )
	#(-1) + com ; // ( x -- -x ) // 13 cycles, or 7 cycles +  1 RAM cycle

: 1   ( -- +1 )
	#(1) ; // 1 cycle + 1 RAM cycle
// too much work for no gain over straightforward literal.
: 1
	-1 dup + com ; // 17 cycles
: 1
	_[[ AT AT _]] xor // ( -- 0 )
	_[[ AT AT _]] xor // ( 0 -- 0 0 )
	nand // ( 0 0 -- -1 )
	push
	_[[ popA AT AT //
	+	// ( ... -1 -1 -- -2 ); A=-1
	AT AT _]]
	+	// ( ... -2 -1 -1 -- ... -2 -2 )
	nand ; // (... -2 -2 -- ... +1 )
	// 16 cycles

: and
	nand dup nand ;
: or
	over com and xor ; // clever sequence from Chuck Moore
: or
	com swap com nand ;
: or // 11 cycles
	push
	push
	pop_dup
	nand
	pop_dup
	nand
	nand ;



// subtract T from S:
: -      (... a b --... (a-b) )
	negate + ;

// 0<> logical buffer: 0->0, nonzero->(-1).
// 0= logical not: 0->(-1); nonzero->0
// (Would it be better to map false to +1 rather than -1 ?)
// (Do we really use less power by using +1 rather than -1 ?)

// simple and straightforward
: 0=     ( 0 -- -1 ) | ( nonzero -- 0 )
	if
	0
	else
	-1
	then
	;

: 0<>     ( 0 -- 0 ) | ( nonzero -- -1 )
	if
	-1
	else 0 // take advantage of special speedup for ``else 0 then''
	then
	;

: 0=     ( 0 -- -1 ) | ( nonzero -- 0 )
	0<> com ;

: 0<>     ( 0 -- 0 ) | ( nonzero -- -1 )
	0= com ;

// doesn't work now -- only works if "swap" is really a primitive
//: 0<>             ( 0 -- 0 ) | ( nonzero -- -1 )
//	#(-1) swap
//	         //       x=0          |         x!=0
//	2/ swap // (...1 0 --...0 1 )  | (...1 x --...1 x/2 )
//	drop    // (...0 1 --...0 )    | (...1 x/2 --... 1 )

// somewhat more bizarre programming styles
: 0=     ( 0 -- -1 ) | ( nonzero -- 0 )
	-1 swap
	if
	dup
	else 0 // take advantage of special speedup for ``else 0 then''
	then
	xor
	;

: 0=                     ( 0 -- -1 ) | ( nonzero -- 0 )
	1 push 0 push // ( x -- x ) <... --... 1 0 >

	_[[ popA // ( x -- x ) <... 1 0 --... 1 >, A=0
	           //                x=0 | x!=0
	2/ popA    // < 1 -- >, A=1      | < 1 -- 1 >, A=0
	AT AT _]]     // (0 --...0 1 1) < -- >   | (x/2 --...x/2 0 0) < 1 -- 1 >
	2/ dropR drop  // (0 1 1 --...0 1)< -- > | (...x/2 0 0 --...x/2 0)  < 1 -- >
	push push dropR pop // (...0 1 -- 1 )< -- > | (...x/2 0 -- 0 )< -- >
	;

: 0<>     ( 0 -- 0 ) | ( nonzero -- -1 )
	-1 push
	if
	pop
	ret  // force a return
	else 0
	then // take advantage of special speedup for ``else 0 then''
	dropR
	;
which the compiler expands to
: 0<>     ( 0 -- 0 ) | ( nonzero -- -1 )
	[ # push # push 2/ jmp dropR ]
	[ -1 ]
	[ iszero ]
nonzero:
	[ push dropR popA AT ret ]
iszero:
	[ dropR ret ]


// even more bizarre versions:

: 0<>     ( 0 -- 0 ) | ( nonzero -- -1 )
	[ 2/ ret cnop ]
  nonzero:
	[ push dropR # ret cnop ]
	[ -1 ]

: 0=     ( 0 -- -1 ) | ( nonzero -- 0 )
	// this version must *not* be in-lined
	#(nonzero) push
	2/
	swapPR push dropR # dropR ret // these 6 instructions *must* be in the same cell
  zero:
	[-1]
  nonzero:
	[ 0]

which the compiler expands to
: 0=     ( 0 -- -1 ) | ( nonzero -- 0 )
	[# push 2/ swapPR push dropR # dropR ret]
	[ nonzero ]
	[-1]
  nonzero:
	[ 0]

// A smart compiler might take any phrase of the form
	``... if #(a) ... true_stuff ... else #(b) ... false_stuff ... then''
and compile it to
	[ ... # push 2/ jmp dropR push dropR # ]
	[ else ]
	[ a ]
	[ ... true_stuff ... ]
	...
    [ ...    # push jmp dropR ]
    [ then ]
  else:
	[ b ]
    [ ...   false_stuff   ... ]
    ...
  then:
    ...



: 1+	( ... a -- ... (1+a) )
	com -1 + com ; // 19 cycles

: 1+	( ... a -- ... (1+a) )
	#(1) + ; // 2 cycles + 1 RAM cycle

: 1-	( ... a -- ... (a-1) )
	-1 + ; // 9 cycles




: R--	( -- ) < x -- (x-1) > subtract one from R
	-1 pop + push ;

// other macros

// "R@+", "fetch": load word from RAM to T at address in R.
// Note that both swapPR *must* be in same cell to avoid wild jumps.
: R@+      ( -- value ) < source_address -- (1+source_address) >
	_[[ swapPR load swapPR _]]
	;

// "@", "fetch": load word from RAM at address in T.
: @      ( source_address -- value ) < -- >
	push
	R@+
	dropR ;

// "!", "poke": put value in T into RAM at address S
// (... value dest_address -- )
: !
	push !R+ dropR ;


: =    ( n1 n2 -- (n1==n2) )
	xor 0= ;




HEX

// from plichota
: MIN-          ( -- 80000000)
    80000000 ;

: signbit          ( -- 80000000)
    80000000 ;

// from plichota
: MAX+          ( -- 7FFFFFFF)
    7FFFFFFF ;



// if T is strictly negative, return true.
// from plichota
: 0<    ( x<0 -- -1 ) | ( x>=0 -- 0 )
	signbit and 0<> ;

: >     ( n1 n2 -- flag=n1>n2 )
	swap - 0< ;

: <     ( n1 n2 -- flag=n1<n2 )
	swap > ;


: <=            ( n1 n2 -- flag=n1<=n2)
    > 0= ;

: >=            ( n1 n2 -- flag=n1>=n2)
    swap <= ;

// from plichota
: ABS           ( n -- |n|)
    DUP 0< IF NEGATE THEN ;


// multi-word arithmetic

// copy word from one literal location to another
// ( -- ) < -- >
: copyword(source, dest)
	#(source) @ #(dest) ! ;

: copyword    ( dest -- 1+dest ) < source  -- 1+source >
	@R+
	push
		swapTR
		store
	pop ;

// Note that both swapPR *must* be in same cell to avoid wild jumps.
// this requires cell size to be at least 8 instructions
// 9 cycles + 4 RAM cycles
: copy2words	( dest -- 2+dest ) < source -- 2+source >
	_[[ swapPR
		push
		load store
		load store
		popA
	swapPR _]]
	AT ;

// ( dest -- 4+dest ) < source  -- 4+source >
: copy4words
	copyword copyword copyword copyword ;

// ( -- ) < dest source -- 4+dest 4+source >
// [FIXME: as written, the 4 words end up stored in reverse order,
// since the loads push them onto the data stack and the stores pop
// them back off; use the copyword version above when order matters.]
// Note that both swapPR *must* be in same cell to avoid wild jumps.
// this requires cell size to be at least 6 instructions
// 13 cycles + 8 RAM cycles
: copy4words
	_[[ swapPR
		load load load load
	swapPR _]]
	_[[ popA
		store store store store
	AT _]]
	push ;


[


_Stack Computers_ by Philip Koopman 1989
mentions these words:

// ?DUP: conditionally duplicate T if it is non-zero
: ?DUP  ( 0 -- 0 ) or ( x -- x x )
	dup
	if
	dup
	then ;

(Chuck Moore doesn't like ?DUP:
``1x Forth'' by Charles Moore April 13, 1999

http://www.ultratechnology.com/1xforth.htm

)


U<      U1 U2 - FLAG
Return a true FLAG if U1 is less than U2 when compared
as unsigned integers.

U>       U1 U2 - FLAG
Return a true FLAG if U1 is greater than U2 when compared
as unsigned integers.

U*      N1 N2 - D3
Perform unsigned integer multiplication on N1 and N2,
yielding the unsigned double precision result D3.

U/MOD   D1 N2 - N3 N4
Perform unsigned integer division on D1 and N2, yielding
the quotient N4 and the remainder N3.



how to implement multi-precision operations ?
since there are no condition codes,
the carry flag must be pushed onto the data stack
as a logical value.

_Stack Computers_ by Philip Koopman 1989
mentions these words:


RLC          N1 CIN -> N2 COUT         ->
   Rotate left through carry N1 by 1 bit.  CIN is carry-in,
   COUT is carry-out.

RRC          N1 CIN -> N2 COUT         ->
   Rotate right through carry N1 by 1 bit.  CIN is carry-in,
   COUT is carry-out.


UNORM     (... EXP1 U2 ->... EXP3 U4 )
   Floating point normalize of unsigned 32-bit mantissa

ADC   (... N1 N2 CIN ->... N3 COUT )
   Add with carry.  CIN and COUT are logical flags on the
   stack.

//Store the double-precision value D1 at the two memory
//words starting at ADDR.
// [Is most-significant or least-significant word on T ?]
: D! (...D1 ADDR - )

//Drop the double-precision integer D1.
: DDROP ( D1 -- )
	drop drop ;

//Duplicate double-precision integer D1 on the stack.
: DDUP ( D1 - D1 D1 )
	over over ;

D+      D1 D2 - D3
Return the double precision sum of D1 and D2 as D3.

D@       ADDR - D1
Fetch the double precision value D1 from memory starting
at address ADDR.


DNEGATE    D1 - D2
Return D2, which is the two's complement of D1.

// Swap the top two double-precision numbers on the stack.
: DSWAP ( D1 D2 - D2 D1 ) (... a b c d --... c d a b )
  push
  swap
  push
  swap
  //(...c a ) <...d b >
  pop pop
  //(...c a b d) <... >
  swap
  push
  swap
  //(...c d a) <...b >
  pop ;

I             - N1
Return the index of the currently active loop.

I'            - N1
Return the limit of the currently active loop.

J             - N1
Return the index of the outer loop in a nested loop structure.

LEAVE         -
Set the loop counter on the return stack equal to the
loop limit to force an exit from the loop.


S-D        N1 - D2
Sign extend N1 to occupy two words, making it a double
precision integer D2.


SP@        (fetch contents of data stack pointer)
SP!        (initialize data stack pointer)
RP@        (fetch contents of return stack pointer)
RP!        (initialize return stack pointer)
MATCH      (string compare primitive)
ABORT"     (error checking & reporting word)
+LOOP      (variable increment loop)
/LOOP      (variable unsigned increment loop)
CMOVE      (string move)
<CMOVE     (reverse order string move)
DO         (loop initialization)
ENCLOSE    (text parsing primitive)
LOOP       (increment by 1 loop)
FILL       (block memory initialization word)
TOGGLE     (bit mask/set primitive)


]


note to assembly language programmers and compiler writers

The compiler must ensure that A is in a don't care state at the start and end of every cell in order for interrupt routines to be free to use A without saving it.

The compiler can enforce this rule by only allowing the use of the ``popA'' and ``AT'' opcodes in the subroutines "0", "dup", "pop", "swap", "swapTR" (and perhaps a few others), and making sure that when these subroutines are inlined, cell breaks don't occur at the "wrong" place. (The dropR opcode, even though it is the same as popA, can be allowed anywhere A is already in a don't care state, since it leaves A in a (different) don't care state).

Perhaps we need a special "non-breaking space" notation so the programmer can indicate that certain instructions (those sensitive to A, and certain other ones following ``2/'') must be packed into a single cell. (If cells contain N instructions, ``cnop'' forces the next N instructions into the same cell; but when I really want only the next 3 instructions to stay together and there are 4 empty slots remaining in the current cell, ``cnop'' is a bit wasteful.)

note to optimizing compiler writers:

A peephole optimizer usually needs to eliminate do-nothing "popA AT push" sequences in straight-line code, since their only effect is to change A, and usually A is in a "don't care" state.

In particular, the "pop swap" macro sequence (in the "over" macro) expands to "pop push swapTR pop", and the obviously do-nothing subsequence "pop push" is further expanded to the do-nothing sequence "popa at push". Perhaps it would be simplest to immediately replace the "pop swap" sequence with the faster sequence "swapTR pop" and then expand *that* into "popA push AT popA AT"

Similarly, "swap push" expands to "push swapTR pop push", with the same do-nothing subsequence "pop push"; it seems simple to make the compiler smart enough to immediately replace "swap push" with "push swapTR"

Macros to emulate each of the F21 instructions

(The "A" is simulated as an location in RAM, _A )

else unconditional jump: #(dest) push _[[ jmp dropR _]] cnop
T=0(dest) : dup #(dest) push _[[ 2/ jmp dropR push dropR _]] cnop
call(dest) : #(dest) push call cnop dropR
C=0(dest) : #(MSbset) nand dup nand T=0(dest) cnop
; return :	return cnop cnop

subroutine start: // no entry sequence needed.

: @R+ // fetch, address in R, then increment R
  cnop // next 3 instructions must be in same cell.
  swapPR
  @P+ // do the load
  swapPR // restore P
  ;

@A+ :
  #(_A) // get address of _A (&A)
  dup push push @R+ dropR // fetch value of A: (A) < &A >
  push @R+ // fetch what A points to: (data) < &A A+1 >
  pop pop ! // update A with new value: (data) < -- >
# : #
@A :
  #(_A) // get address of A (&A)
  push @R+ dropR // fetch value of A: (A) < -- >
  push @R+ // fetch what A points to: (data) < A+1 >
  dropR //
!R+ : ! //store, address in R, increment R
!A+ :
  #(_A) // get address of A: (data &A)
  dup push push @R+ dropR // fetch value of A: (data A) < &A >
  push !R+ // pop data, store it at [A]: ( -- ) < &A A+1 >
  pop pop ! // update A with new value: ( -- ) < -- >
!A :
  #(_A) // get address of A: (data &A)
  push @R+ dropR // fetch value of A: (data A) < -- >
  ! // pop data, store it at [A]: ( -- ) < -- >
com : dup nand
2* : dup +
2/ : 2/
+* : // add S to T if T0 one            DUP 1 AND IF OVER + THEN
	dup
	#(1) and
	if
	over +
	then

+* : // add S to T if T0 one            DUP 1 AND IF OVER + THEN
	dup
	#(1)
	and // expands to ``nand dup nand''
	if
	over // expands to ``push dup swapTR pop''
	else 0
	then
	+

which expands to (with a good compiler on a 4-instruction-per-word machine)
	[push popA AT AT] // dup
	[ # push # nand] // set up for ``if'', #(1), ``nand''
	[ (then) ]
	[ (1) ]
	[ push popA AT AT] // ''dup''
	[ nand 2/ jmp dropR ] // ``nand'', ``if''
	[ push dropR push push ] // finish ``if'', start ''over'': ``push push''
	[ popA AT AT nop] // continue ``over'': ``dupR''
	[ popA push AT nop] // ``swapTR''
	[ popA AT nop nop ] // finish ``over'': ``pop''.
  then:
	[ + ]


+*R : // add R to T if T0 one            DUP 1 AND IF pop dup push + THEN
	[push popA AT AT] // dup
	[ # push # nand ] // set up for ``if'', #(1), ``nand''
	[ (then) ]
	[ (1) ]
	[ push popA AT AT] // ''dup''
	[ nand 2/ jmp dropR ] // ``nand'', ``if''
	[ push dropR nop nop ] // finish ``if''
	[ popA AT AT push] // get copy of R ``pop dup push''
 then:
	[ + ] // add either 0 or a copy of R to original T


// add step on return stack
+*_return : // add Rnext to Rtop if Rtop0 one
		//pop DUP 1 AND IF pop dup push + THEN push
	[ popA AT AT #] // pop dup #(1),
	[ (1) ]
	[ nand push nop nop ] // start of ``and''
	[ popA AT AT nand] // end of ``and''
	[ # 2/ jmp dropR ] //``if''
	[ (then) ]
	[ push dropR nop nop ] // finish ``if''
	[ popA AT AT push] // get copy of R ``pop dup push''
 then:
	[ + push ] // add either 0 or a copy of Rnext to original Rtop



xor : xor
and : nand dup nand
+ : +
pop : popA AT
A@ : #(_A) push @R+ dropR
dup : dup
over : push dup pop swap
over : push dup swapTR pop
which expands to
       push push _[[ popA AT AT _]] _[[ popA push AT _]] _[[ popA AT _]]
push : push
A! : #(_A) !
nop : nop
drop : push popA

** Ways to expand instruction set and make less ugly

Strange variations and ideas (ways to make the instruction set *more* ugly):


This page started 1998-01-15 (but some information much older, going back to 1996 Nov 12).

comments, suggestions, errors, bug reports to

David Cary feedback.html
d.cary@ieee.org.

// end http://david.carybros.com/html/minimal_instruction_set.html