minimal instruction set

updated 2000-01-05.

When I was working on this, I thought it was pretty creative and unique. Later I found out that I'd been stuck in a ``Turing Tar Pit''. Still, this looks like a more reasonable instruction set than some of the really ugly things that have been developed by other people who have also wasted a lot of time in a ``Turing Tar Pit''.

Describes a microprocessor instruction set developed by David Cary that packs 2 instructions into 8 bits of RAM. As far as I know, this is *the* Minimal Instruction Set (for a single-processor von Neumann machine). In other words, I know of no other (Turing-complete) instruction set that has as many or fewer distinct instructions (DI).

2003-01-04:DAV: I just discovered
``A Minimal CISC'' by Douglas W. Jones http://www.cs.uiowa.edu/~jones/arch/cisc/ that has fewer distinct instructions (DI): only 8, so 5 of them can pack into a 16 bit word.

The instructions are:

  1. NOP: No operation.
  2. DUP: Duplicate the stack top. This is the only way to allocate stack space.
  3. ONE: Shift the stack top left one bit, shifting one into the least significant bit.
  4. ZERO: Shift the stack top left one bit, shifting zero into the least significant bit.
  5. LOAD: Use the value on the stack top as a memory address; replace it with the contents of the referenced location.
  6. POP: Store the value from the top of the stack in the memory location referenced by the second word on the stack; pop both.
  7. SUB: Subtract the top value on the stack from the value below it, pop both and push the result.
  8. JPOS: If the word below the stack top is positive, jump to the word pointed to by the stack top. In any case, pop both.

... any constant can be pushed on the stack by a DUP followed by 16 ONE or ZERO instructions. Zero may be pushed on the stack by the sequence DUP DUP SUB; negation may be done by subtracting from zero; addition may be done by subtracting a negated value, and pushing zero prior to pushing an unconditional branch address allows an unconditional branch.
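To make those semantics concrete, here is a minimal sketch in C of an interpreter for the 8 instructions. The 16-bit word size matches the 5-opcodes-per-16-bit-word packing above, but the memory size, the stack depth, and all of the names are assumptions of this sketch, not part of Jones's specification:

	#include <stdio.h>
	#include <stdint.h>

	enum { NOP, DUP, ONE, ZERO, LOAD, POP, SUB, JPOS }; // 8 opcodes = 3 bits each

	static int16_t mem[4096]; // word-addressed RAM (size is an assumption)
	static int16_t stk[256];  // operand stack; sp = next free slot
	static int sp;

	static void step(int op, uint16_t *pc)
	{
	    int16_t top, below;
	    switch (op) {
	    case NOP:  break;
	    case DUP:  stk[sp] = stk[sp-1]; sp++; break;
	    case ONE:  stk[sp-1] = (int16_t)(((uint16_t)stk[sp-1] << 1) | 1); break;
	    case ZERO: stk[sp-1] = (int16_t)((uint16_t)stk[sp-1] << 1); break;
	    case LOAD: stk[sp-1] = mem[(uint16_t)stk[sp-1] % 4096]; break;
	    case POP:  top = stk[--sp]; below = stk[--sp];        // top = value,
	               mem[(uint16_t)below % 4096] = top; break;  // below = address
	    case SUB:  top = stk[--sp];
	               stk[sp-1] = (int16_t)((uint16_t)stk[sp-1] - (uint16_t)top);
	               break;
	    case JPOS: top = stk[--sp]; below = stk[--sp];        // top = address,
	               if (below > 0) *pc = (uint16_t)top; break; // below = condition
	    }
	}

	int main(void)
	{
	    uint16_t pc = 0;
	    stk[sp++] = 7;  // pretend a DUP plus ONE/ZERO shifts built a literal 7
	    step(DUP, &pc); step(DUP, &pc); step(SUB, &pc); // DUP DUP SUB pushes zero
	    printf("T=%d S=%d\n", stk[sp-1], stk[sp-2]);    // prints T=0 S=7
	    return 0;
	}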

See also its counterpart, "The Ultimate RISC" by Douglas W. Jones http://www.cs.uiowa.edu/~jones/arch/risc/

The "Whitespace" programming language is similar to this "Minimal CISC". http://compsoc.dur.ac.uk/whitespace/

The "Path" language http://pathlang.sourceforge.net/ has an interesting 2D layout, and a near-minimal instruction set (expressed in single characters):

2002-12-07:[FIXME: move ``Turing Tar Pit'' information here.] computer_architecture.html#tarpit

contents: [FIXME:]

See also

What is the minimum number of instructions for a Turing-complete von Neumann machine ?

Here David Cary pushes the MISC idea to ugly extremes. [FIXME: need a catchy name for this architecture]

"whenever you excessibly constrain any parameter, something else has got to give." -- Don Lancaster.

I've put together yet another MISC instruction set. I've squeezed it down to 11 instructions. I don't think I've painted myself into a corner yet (I hope). If I could just squeeze out 3 more instructions, I could pack 5 of them into a 16 bit cell.

One advantage of having few instructions -- you can document them in a reasonably-sized email, rather than needing an entire book to document the details of scads of instructions.

There are 2 very different ways of counting the "size" of an instruction set.

We could count the number of different instruction mnemonics (NM) mentioned in the assembler documentation. This method is subjective, since different assemblers may describe the same machine with different numbers of mnemonics (for example, one assembler may require programmers to type "LOAD dest TO PC" to get the effect of an immediate jump, while another assembler may require programmers to use an additional mnemonic "JUMP TO dest" to get exactly the same bit pattern in the final executable).

Processors with smaller NM are likely to have regular, orthogonal instruction sets (the same addressing modes apply to all instructions), which makes them easier to program than processors with larger NM.

The minimum number of mnemonics NM is 1: the "move" instruction used by TTA http://www.rdrop.com/~cary/html/computer_architecture.html#tta . Since you know what the instruction will be in a TTA architecture, it takes zero bits to specify -- but TTA still requires a bunch of bits per instruction to select registers and/or addressing modes.

Another way of counting the "size" of an instruction set is to count all possible distinct instructions (DI), all valid variations of "source register", "destination register", and "addressing mode". Using this count, TTA has lots more than 1 instruction.

This is a more objective count: if the longest instruction has b bits, then the number DI is equal to 2^b minus the number of invalid instructions of that length. (For example, 5-bit instructions allow 2^5 = 32 encodings; the F21 below uses 27 of them, leaving 5 encodings invalid or unused.)

Processors with smaller DI are likely to need fewer bits b to specify each instruction. Processors with fewer bits per instruction can pack more instructions into each word of program RAM.

Unfortunately, this advantage is partially (perhaps completely) cancelled out by the fact that processors with smaller DI often require *more* instructions (more RAM and/or more cycles) to implement a given piece of functionality than processors with larger DI.

Nevertheless, it is an interesting intellectual challenge to wonder: what is the minimum number of bits bmin required per instruction ? What is the minimum number of distinct instructions DI for a Turing-complete von Neumann machine ?

The shBoom(tm) microprocessor from Patriot Scientific Corporation http://www.ptsc.com/ http://www.circuitcellar.com/articles/misc/tom-92.pdf packs instructions into 8 bits.

So we know DImin is 256 or less.

The "Itty Bitty Stack Machine" http://www.ittybittycomputers.com/IttyBitty/IBSM.htm is very similar to the F21, with most instructions packed into 5 bits. (A total of about 54 instructions) (This instruction set is allegedly "fast, that is, it should be capable of emulation at a raw speed not slower than 10% of the host native hardware.")

The "F21" Forth engine by Chuck Moore F21 STACK PROCESSOR CPU DESCRIPTION http://pisa.rockefeller.edu:8080/MISC/F21.specs has 27 distinct instructions, each one is packed into 5 bits (neglecting the bits that follow "#" and branches).

So we know DImin is 27 or less.

Clive Sinclair http://www.cdworld.co.uk/zx2000/clive.html claims that he has designed a CPU with only "16 principle instructions", but he doesn't list any details.

Dr Neil Burgess mentions "ultraRISC processor, that has only 15 instructions" http://www.acue.adelaide.edu.au/leap/discipline/eng/Burgess.html but he doesn't list any details.

Can a CPU really be designed to have 16 or fewer distinct instructions DI, such that one can pack 2 instructions into 8 bits of RAM ? 8 instructions into a 32 bit word ? Or are these people counting NM, not DI ?

Starting with the elegant 27 instruction set for the "F21" Forth engine by Chuck Moore F21 STACK PROCESSOR CPU DESCRIPTION http://pisa.rockefeller.edu:8080/MISC/F21.specs , and eliminating instructions that could be emulated by (sometimes lengthy) combinations of the other instructions, David Cary managed to get a (very ugly) 16 instruction set.
"# push pop A! A@ T=0 @A !A xor and + com 2/ dup over drop". (16 instructions as of 1998)

Can you think of a more "elegant" (yet still "complete") instruction set of 16 or fewer distinct instructions DI ? Turing complete cellular automata are proof that a Turing Machine can be built without *any* distinct "instructions".

Hint: You soon get to the stage "Every instruction in this set is essential. If I eliminate *this*, I won't be able to do *that*, no matter how many of the remaining instructions I string together". While you can't simply eliminate 1 instruction from this set, sometimes you can replace 3 instructions from a set with 2 completely different instructions that, given the remaining instructions, can still do *that*. It helps to assume you have some scratchpad RAM, since the fewer internal registers you have, the fewer instructions you need to shuffle things back and forth between those registers.

Myron Plichota and vic plichota have developed a much more elegant set of 16 instructions they call qUark ../mirror/quark.txt

With a bit of a challenge from "vic plichota" <atsvap@cgo.wave.ca>, and many ideas from Myron Plichota, I've developed an instruction set with only 13 instructions: "# ! + xor nand 2/ push pop dropR swap dup T=0 nop" (13 instructions as of 1999-02-22) Programmer's model: there is an instruction pointer P and 2 stacks, the "return stack" and the "data stack". The top of the data stack is called T and the second on the data stack is called S; the top of the return stack is called R.

  T  R  P
  S  .
  .  .
  .  .
  .  .

I've managed to whittle it down even more with a ``conditional skip if arithmetic result not zero'' idea from Alan Grimes: Instruction summary: "# ! + xor nand 2/ push pop toA Afrom call nop" (12 instructions as of 1999-03-22)

I've managed to whittle it down even more: "# ! + xor nand 2/ push popA AT call nop" (11 instructions as of 1999-03-24)

I *think* this is still functionally complete, and can still do everything that any other Turing-complete CPU can do. It just takes a *lot* more instruction cycles to do most things than most CPUs.

2000-01-05:DAV: I've just stumbled across "BF: An Eight-Instruction Turing-Complete Programming Language which was invented by Urban Mueller solely for the purpose of being able to create a compiler that was less than 256 bytes in size, for the Amiga OS."

More BF details:

A very clean instruction set, although it takes far more instruction cycles than even my 11 instructions to do even the most trivial things. I wonder if we could redefine the "input" and "output" op-codes ...

** Programmer's model

"# ! + xor nand 2/ push popA AT call nop" (11 instructions as of 1999-03-24) [perhaps more Forth-like names would be

 >R toR      instead of push
 A> Afrom    instead of AT
 RA ??       instead of popA
]

2 push-down stacks, the "return stack" and the "data stack".

    T A R P
    S   .
    .   .
    .   .
    .   .

All registers are the same length, the length of a memory address.

[FIXME: An implementation needs to choose:

(Moore fixed (sizeof_address) = (sizeof_data_word + 1 bit). What other choices would be "interesting" ???) ]

[A slow, minimal-gate implementation needs stack pointers that point to S and R in RAM. Faster processors can keep most or all of the stack on-chip ... What should the processor do when stacks overflow or underflow ? ]

** Acronyms and notation

  cell: the implementation's native integer size, the number of bits read and written at once.
  [ ... ]: everything inside the brackets is contained in a single cell.
  ( ... ): parameter-stack diagram, T on the far right, then S.
  < ... >: return-stack diagram, R on the far right.
  | : the pipe character indicates an "either-or" choice.

Subroutines are documented with the initial (to the left of the "-") and final (to the right of the "-") state of the data stack and the return stack.

** External interface summary

Data on the external memory bus always comes from T or from "external devices", and always goes to "external devices" or T or the instruction latch. All data in memory is accessed only on cell boundaries (i.e., only 1 whole cell is read or written at a time).

Addresses on the memory bus always come from P or R (or perhaps from external devices while the CPU is not using the bus).

Since instructions are always only 4 bits each, instructions are packed into "cells", as many as will fit: e.g., 3 instructions per 12 bit cell on some machines, 4 instructions per 16 bit cell, 5 instructions per 20 bit cell, or 8 instructions per 32 bit cell on other machines. (Does it make any sense for "cells" to be bigger or smaller than sizeof_data_word, the size of a single word of memory ?) A minimum of 3 instructions per cell is needed for the ugly hack that works around the lack of a proper "load" command.
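Here is a sketch in C of the unpack step for one 16-bit cell holding 4 four-bit opcodes; whether the first slot executed is the high nibble or the low nibble is an implementation choice, and high-nibble-first is merely an assumption of this sketch:

	#include <stdio.h>
	#include <stdint.h>

	// stub dispatcher: a real implementation would decode one of the
	// 11 opcodes here; 4-bit slots leave 5 encodings spare.
	static void execute_slot(unsigned op) { printf("op %x\n", op); }

	static void execute_cell(uint16_t cell)
	{
	    for (int slot = 0; slot < 4; slot++)
	        execute_slot((cell >> (12 - 4 * slot)) & 0xF); // high nibble first
	}

	int main(void) { execute_cell(0x1234); return 0; } // runs ops 1, 2, 3, 4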

Any opcode ``# !R+ + xor nand 2/ push popA AT call nop'' can occur in any slot.

[FIXME: How are interrupts handled ? computer_architecture.html#interrupt ]

It seems that the instructions execute so quickly that the bottleneck is the speed of the RAM.

** Memory access instructions:

Surprisingly, only 1 memory read instruction is needed for both in-line literals and for data everywhere else in memory:

	// ( -- x )
	@P+      // load data in RAM at [P], push it onto T, then increment P.
	load     // ditto
	#        // ditto

For example, when the CPU executes a cell that starts with 3 "#" instructions, then the following 3 cells are pushed onto the data stack (the second one ends up on S, the third one ends up on T), then the remaining instructions in the cell are executed, and then the 4th following cell is loaded into the instruction latch and executed.

  // ( ... - ... data1 data2 data3 )
  [ # # # ... ] [data1] [data2] [data3] [more instructions ...] ...

The assembler pseudo-op "#(value)" keeps track of implementation details (word size, register size, sign-extension, the current instruction word, where the next literal/instruction cell will be loaded from, etc.). "#(value)" expands to "#" (or perhaps "# com" or "# unsigned com") and packs the (possibly complemented) literal value into the next available (sizeof_data_word) cell, such that when that code is executed, the desired value gets loaded into T.

To load data that is *not* an in-line literal, somehow calculate the address and get it onto R, then pack these instructions into a single cell: "swapPR load swapPR" .

There is also only one write instruction:


	// ( x -- ) < address -- address+1 >
	!R+     // pop T, store popped value to RAM at [R], then increment R.
	store   // ditto

(??? should we use A for the address instead of R, a la the F21 ?) (hardware *might* be a bit simpler if P is used for *all* addresses to the memory bus, replacing !R+ with !P+).

** 2-operand arithmetical instructions

The 2 operand arithmetical instructions always pop 2 items off the data stack (S and T) and push the result back onto the data stack (into T).

  +     ( n1 n2 -- n1_+_n2 ) add S to T
  xor   ( n1 n2 -- n1_bitxor_n2 ) bitwise exclusive-or S to T
  nand  ( n1 n2 -- ~(n1_bitand_n2) ) bitwise nand S to T

One way to implement multiple-precision arithmetic in software is to use the most significant bit of T as the carry bit, extending precision in (sizeof_address - 1)-bit chunks.
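Here is a sketch in C of that carry convention: keep the numbers in 15-bit chunks inside 16-bit words, so that the sum of two chunks plus a carry can never overflow, and the most significant bit of the raw sum is exactly the carry into the next chunk. (The 16-bit container and the least-significant-chunk-first layout are assumptions of this sketch.)

	#include <stdint.h>

	// add two n-chunk numbers a and b into out; each chunk holds 15 value bits
	static void add15(const uint16_t *a, const uint16_t *b, uint16_t *out, int n)
	{
	    uint16_t carry = 0;
	    for (int i = 0; i < n; i++) {
	        uint16_t sum = (uint16_t)(a[i] + b[i] + carry); // at most 0xFFFF
	        out[i] = sum & 0x7FFF; // low 15 bits: the result chunk
	        carry  = sum >> 15;    // the most significant bit is the carry out
	    }
	}

	int main(void)
	{
	    uint16_t a[2] = { 0x7FFF, 0 }, b[2] = { 1, 0 }, out[2];
	    add15(a, b, out, 2); // 0x7FFF + 1 carries into the second chunk
	    return (out[0] == 0 && out[1] == 1) ? 0 : 1;
	}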

** 1-operand arithmetical instructions

  2/	( x -- x/2 ) arithmetic shift right of T (keep sign),
        and skip next instruction if T was not zero.
		(Next instruction only executed if T was (and still is) 0).
  nop	does nothing

[can "nop" be eliminated ?]

This "2/" is the only conditional instruction in the entire set.

The skipped instruction is typically "call" or "nop" or another ``2/''. Note that each executed ``2/'' whose T was not zero also skips the instruction after it, so an unconditional multi-bit shift must pair each ``2/'' with a skippable "nop": to shift 3 bits to the right, do "2/ nop 2/ nop 2/ nop". (I leave it as an exercise to the reader to show this sequence always has the net effect of unconditionally shifting T three times to the right.)
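Here is a sketch in C of that skip rule; following the parenthetical above, the skip decision is made from T's value before the shift. The function-pointer representation of a program is just an assumption of this sketch:

	#include <stdio.h>
	#include <stdint.h>

	static int16_t T; // top of data stack
	static int skip;  // when set, suppress exactly the next instruction

	static void shift_skip(void) // the "2/" instruction
	{
	    skip = (T != 0);      // skip next instruction if T was not zero
	    T = (int16_t)(T >> 1); // arithmetic shift right on typical compilers
	}

	static void nop(void) { }

	static void run(void (*const *program)(void), int n)
	{
	    for (int i = 0; i < n; i++) {
	        if (skip) { skip = 0; continue; } // a skipped slot does nothing
	        program[i]();
	    }
	}

	int main(void)
	{
	    void (*const shift3[])(void) =
	        { shift_skip, nop, shift_skip, nop, shift_skip, nop };
	    T = 8;
	    run(shift3, 6);
	    printf("T=%d\n", T); // 8 shifted right 3 times: prints T=1
	    return 0;
	}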

[It may make hardware simpler (allow interrupts at end of every cell) if the programmer model always includes a virtual "nop" at the end of every cell, i.e., even if a 2/ is the last instruction of a cell, the first instruction of the next cell is executed unconditionally. This means the compiler must insert or delete a ``nop'' to make the ``2/'' do what the programmer expects:

  2/ call      --> [... nop] [2/ call ...]
  2/ nop call  --> [... 2/ ] [call ... ]
  2/ 2/        --> [... 2/ ] [2/ ... ]
]

[Other options: perhaps make "2/", if the result is not zero, skip *all* the remaining instructions of the current cell. Or perhaps make "2/", if the result is not zero, skip precisely 3 instructions: those following it in this cell, and perhaps the first instruction(s) of the next cell. This delay slot could be filled with "AT push popA", reducing the need for the "nop" instruction. ]

** Stack manipulation instructions

  push	 ( x -- ) < -- x > pop T, push onto R
  popA	 ( -- ) < x -- > pop R, push onto A
  dropR	 ( -- ) < x -- > // alias for popA
  AT	 ( -- x ) push a copy of A onto T (leaving A unchanged).

Data can be moved in a single cycle from any register in the loop (T -> R -> A -> T) to the next register to its right: push moves T to R, popA moves R to A, and AT copies A back to T.

"popA" is a hint to the reader that the value of A will soon be used with AT; "dropR" is a hint to the reader that the value of A is now irrelevant.

** Branches (change-of-program-flow instructions)

There is only 1 branch instruction (only one way to modify the value of P), but lots of different aliases to make the intent of the programmer clear:

	// various aliases for the same branch instruction bit pattern
	swapPR ( -- ) < future_P -- past_P > // swap P with R.
	call
	branch
	return
	exit
	;      // pronounced "exit"
	jmp
	jump
	ret

P typically (but not always) points to the cell following the cell from which the currently-executing instructions were taken.

Once a cell is loaded into the instruction latch for execution, P is incremented, and then *all* of the instructions in that cell are executed, (possibly modifying P) from the first to the last. The CPU never skips the remaining instructions in a cell (not even for the call instruction).

If a "call" is immediately followed by a "#" instruction, then that distant value at [P] is loaded, *not* the data in the cell immediately following the cell currently being executed.

After *all* instructions in a cell have executed, the next cell of instructions is read from this new value of [P] into the instruction latch, P is incremented, and then all the instructions in that cell are always executed (possibly modifying P). Whatever value of P exists immediately after they are all executed is the address where the next instruction cell will be loaded from.

Sequences that "temporarily" change P must be very careful to restore P before the end of that cell.

All branch destinations are to (the first instruction of) a particular cell, not to some (other) instruction inside that cell.
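Here is a sketch in C of this fetch/execute order, with "swapPR" as the only way P changes; the 4-slot cell, the opcode numbering, and the memory size are assumptions of this sketch:

	#include <stdio.h>
	#include <stdint.h>

	static uint16_t mem[4096], P, rstack[64];
	static int rsp;

	static void swapPR(void) // the only branch: swap P with the top of R
	{
	    uint16_t t = P;
	    P = rstack[rsp-1];
	    rstack[rsp-1] = t;
	}

	static void run_cell(void)
	{
	    uint16_t cell = mem[P % 4096]; // load the instruction latch ...
	    P++;                           // ... then increment P ...
	    for (int slot = 0; slot < 4; slot++) {        // ... then run ALL 4
	        int op = (cell >> (12 - 4 * slot)) & 0xF; // slots, even after a call
	        if (op == 8) swapPR(); // "call" as opcode number 8 is an assumption
	        // ... dispatch of the other 10 opcodes would go here ...
	    }
	} // the next fetch uses whatever P is now

	int main(void)
	{
	    mem[0] = 0x8000;   // one cell: [call nop nop nop], with nop encoded as 0
	    rstack[rsp++] = 5; // subroutine address on the return stack
	    run_cell();
	    printf("P=%d R=%d\n", P, rstack[0]); // P=5, return address 1 left on R
	    return 0;
	}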

** prefetch implementation

Prefetch may be implemented on some chips. It has no effect on the programmer model; all implementations act "as if" there was no prefetch.

Typically, while the instructions in one cell are executed, the next cell is being pre-fetched. That pre-fetched cell will in turn be executed unless (a) the current cell mentions "load" (#), which diverts the pre-fetched cell to T instead of the instruction latch, increments P, and starts pre-fetching the following cell; or (b) the current cell modifies P with "swapPR", in which case the CPU flushes the ``next cell'' it speculatively pre-loaded and starts pre-fetching from wherever P now points.

[A strange alternative: Instead, a chip *could* stick with the absolute simplest thing to do with "call" -- merely swap R and P, and not worry about the consequences.

This has a *major* impact on the programmer's model.

This means that pre-fetched cells are *never* ``wasted''. The consequences are that after a cell with one ``call'' and no ``load'' instructions is fetched, the cell at the following address is prefetched and will be executed unconditionally (it is in the "branch delay slot" of a "delayed branch" a la Sparc ???), while prefetch loads the first cell of that subroutine. (note that there is *both* a call delay slot *and* a return delay slot).

Confusing, difficult-to-explain things happen if one tries to mix "call" followed by several "#"s in the same cell. The first "#" loads data from the cell immediately following the cell being executed, while the following "#"s load data from distant cells.

Other strange things happen if you put multiple call instructions in consecutive cells.

If you do this, then the macro for @R+ expands to:

: @R+
  // next 4 instructions must be in same cell.
  swapPR
  @P+ // puts a wasted prefetched value in T, starts loading desired value
  swapPR
  @P+ // puts the desired value in T, starts loading next instruction
  cnop
  [ cnop ] // this cell is skipped
  swap // get the wasted value on top
  drop // get rid of it.
  ;
]

[ we might be able to have a minimal implementation that packs only 1 instruction per cell, feeding out of a 4 bit RAM, if we change the instruction buffer so it pre-fetches and holds not only the currently executing instruction, but also the 2 following instructions. The ugly hack to fake a "load [R]" would work like this: while executing the source code "swapPR swapPR # push call", the instruction buffer would look like this:

  1. ``swapPR swapPR #'' (P and R get swapped; then data at [P++] is read)
  2. ``swapPR # (data)'' (P and R get swapped; instruction at [P++] is read)
  3. ``# (data) push'' (data shunted to T, instruction buffer overwritten with NOP; instruction at [P++] is read)
  4. ``nop push call'' (etc.)
every instruction triggers 1 read cycle to get the next instruction and shifts the instruction queue down one. ]

If we are feeding out of a 4 bit RAM but we want the register size to be N times larger, then we could prefetch N+1 following instructions (confused yet ?). Then the sequence would be (for N=3, so T = 3*4 bits = 12 bits)

  1. ``swapPR nop nop swapPR #'' (P and R get swapped; then data at [P++] is read)
  2. ``nop nop swapPR # (data)'' (nop; then data at [P++] is read)
  3. ``nop swapPR # (data) (data)'' (nop; then data at [P++] is read)
  4. ``swapPR # (data) (data) (data)'' (P and R get swapped; instruction at [P++] is read)
  5. ``# (data) (data) (data) push'' (data shunted to T, instruction buffer overwritten with NOPs; instruction at [P++] is read)
  6. ``nop nop nop push call'' (etc.)

** The compiler/assembler

The assembler pseudo-op ``cnop'' fills the rest of the current cell with (zero or more) NOPs, sufficient to align the next instruction with the start of the next cell. This alignment is necessary when you want that next instruction to be the destination of some branch in the code. (This is implied by labels at the start of a subroutine, and by the "then" and "else" of an "if-then" statement.)
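Here is a sketch in C of an assembler back-end emitting into 4-slot cells, with ``cnop'' as the alignment pad; the NOP encoding of 0 and the high-nibble-first slot order are assumptions of this sketch:

	#include <stdio.h>
	#include <stdint.h>

	enum { SLOTS = 4, NOP_OP = 0 }; // assumed cell geometry and NOP encoding

	static uint16_t image[4096]; // the output cells
	static int cell, slot;       // current fill position

	static void emit(int op) // pack one opcode into the next free slot
	{
	    image[cell] |= (uint16_t)((op & 0xF) << (12 - 4 * slot));
	    if (++slot == SLOTS) { slot = 0; cell++; }
	}

	static void cnop(void) // pad with NOPs until the next opcode starts a
	{                      // fresh cell, i.e. a legal branch destination
	    while (slot != 0)
	        emit(NOP_OP);
	}

	int main(void)
	{
	    emit(1); emit(2); // two opcodes into cell 0
	    cnop();           // fills cell 0 with 2 NOPs
	    emit(3);          // lands at the start of cell 1
	    printf("%04x %04x\n", (unsigned)image[0], (unsigned)image[1]); // 1200 3000
	    return 0;
	}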

The compiler automatically generates inline code whenever the word being compiled would take *less* space inline than the subroutine call. (-- idea from Myron Plichota)

Should the compiler choose to actually generate a call, `` ... subroutine_label ... '' expands to something like this:

	[       ... # push call]
	[ subroutine_label     ]
	[ ...            ]
	...
  subroutine_label:
	... // instructions to actually do something
	[ ...     return dropR ]

To reduce space slightly, the compiler may choose to compact lists of subroutine calls (after making sure that none of the subroutines will interfere too much with the return stack) from

	[       ... # push call]
	[ name_1               ]
	[       ... # push call]
	[ name_2               ]
	[       ... # push call]
	[ name_3               ]
	[       ... # push call]
	[ name_4               ]
	[      ... return dropR]
to
	[ # # # # push push    ]
	[ name_1               ]
	[ name_2               ]
	[ name_3               ]
	[ name_4               ]
	[ push push jump dropR ]

or to

	[ # push # push            ]
	[ name_4                   ]
	[ name_3                   ]
	[ # push # push jump dropR ]
	[ name_2                   ]
	[ name_1                   ]

(This is a generalization of "tailbiting").

To minimize space (but not time) in long sequences of only subroutine calls some compilers may use direct threading. data_compression.html#program_compression Here is an example of a fully re-entrant direct-threaded subroutine list interpreter:

	[... # push call]
	[ subroutine_list_interpreter  ]
	[ sub1 ]
	[ sub2 ]
	[ sub3 ]
	[ sub4 ]
	[ continue ]
  continue: // ( ) < oldP p_subroutine_list >
	[dropR dropR ...]
  ...
  subroutine_list_interpreter: ( -- ) < p_subroutine_list -- >
	[ swapPR # swapPR push call cnop ] // = [ @R+ push call cnop ]
	[ # jump dropR              cnop ]
	[ subroutine_list_interpreter    ]

This subroutine list interpreter assumes these subroutines don't modify the top 2 items on the return stack; it's ok if they use deeper items for parameters.

Once you have this subroutine list interpreter, you can collapse other lists of subroutine calls to 2 full cells + 2 partial cells of overhead, plus 1 more cell per subroutine call.

The compiler converts ``.... if ... true_stuff ... then ... '' statements to something like

    // last 5 instructions must all be in same cell
    [ ... # push 2/ jmp dropR push dropR ]
    [ then ]
    [ ... true_stuff ...      ]
    ...
  then:
    ...

The compiler converts ``.... if ... true_stuff ... else ... false_stuff ... then ... '' statements to something like

    // last 5 instructions must all be in same cell
    [ ... # push 2/ jmp dropR push dropR ]
    [ else ]
    [ ... true_stuff ...      ]
    ...
    [ ...    # push jmp dropR ]
    [ then ]
  else:
    [ ...   false_stuff   ... ]
    ...
  then:
    ...

A minimum implementation (only 3 instructions per cell) is forced to use this alternate form:

	// last 2 instructions must be in same cell
    [ ...    ... # push 2/ jmp ]
    [ else ]
    [ dropR push dropR  ... true_stuff ... ]
    ...
    [ ...    # push jmp dropR ]
    [ then ]
  else:
    [ dropR push dropR ...   false_stuff   ... ]
    ...
  then:
    [ ...                ]
    ...

A smarter compiler might have a special ``else 0 then'' case, compiling code like `` ... if ... true_stuff ... else 0 then ... '' to

    [ ... # push 2/ jmp dropR ] // ( refrain from dropping T )
    [ (then) ]
    [  push dropR ... true_stuff ...      ] // T was non-zero; drop it
    ...
  then:
    ... // if T was 0, now T is still 0.

(note that the "2/" never drops T, even in the case where T=0 and the CPU actually executes that next instruction. I suppose a really smart compiler might be able to refrain from dropping T in certain unusual situations when we later want to use T/2 in the true_stuff. and/or 0 in more complicated false_stuff. ).

The compiler converts `` ... DO ... LOOP ... '' into [FIXME: DO (loop initialization), LOOP (increment-by-1 loop), +LOOP (variable-increment loop) ???]

Some processors have the only conditional instruction be a "conditional return if 0" instruction. Another way of implementing ``.... if ... true_stuff ... then ... '' is to use the "conditional-return" style:

	[       ... # push call]
	[ subroutine_label     ]
	[ dropR ...            ]
	...
  subroutine_label:
    [ ... condition ... ]
    [ ... 2/ return push dropR ] // conditional return
    [ ... true_stuff ...      ]
	[ ...     return ] // unconditional return

A "conditional return" implementation of ``.... if ... true_stuff ... else ... false_stuff ... then ... '' looks something like

	// could calculate some of condition here
	[       ... # push call]
	[ (else)     ]
	...

  else:
	// < return_address >
	// could calculate some of condition here
	[       ... # push call ]
	[ (then)     ]
	// < return_address >
    [ ...   false_stuff   ... ]
	[ ...     return ] // unconditional return
    ...

  then:
	// < ... return_address else_address >
    [ ... condition ... ]
	// either return to "else:",
	// or fix up return stack so
	// we can do true_stuff
	// and return all the way to
	// original call.
    [ ... 2/ return drop dropR ]
	// < return_address >
    [ ... true_stuff ...      ]
    ...
	[ ...        return dropR ] // unconditional return

//some common macros
//using the instruction set
//  "# !R+ + xor nand 2/ push popA AT call nop" (DAV 1999-03-24)

// pseudo-primitives: pop, dup, drop, swapTR, swap.
// (_[[ ... _]] marks a group of instructions that must be packed into
// a single cell, e.g. so that A is not live across a cell boundary.)

// pop: take top value off R stack, push onto T stack.
: pop     ( -- x ) < x -- >
	popA AT ; // always inlined

: pop_dup	( -- x x ) < x -- >
	_[[ popA AT AT _]] ; // 3 cycles, always inlined

// dup: make a duplicate copy of T, push into T and S
: dup     ( x -- x x )
	push pop_dup ; // 4 cycles, always inlined

: drop    ( x -- ) < -- >
	push dropR ; // always inlined

// swapTR: swap T with R
: swapTR  ( a -- b ) < b -- a >
	popA push AT ; // 3 cycles, always inlined

: swap    (... a b --... b a ) // swap T with S
	push swapTR pop ; // 6 cycles

: 0     ( -- 0 )
	dup dup xor ;
: 0     ( -- 0 )
	_[[ AT AT _]] xor ; // 3 cycles, always inlined

// The A register and ``popA'' and ``AT'' instructions
// may be reserved exclusively for these pseudo-primitives to use.
// Then use "push, pop, dup, drop, swap, swapTR, dropR, 0" macros
// as if they were the primitive instructions.

// stack manipulation

: over    (... b a --... b a b )
	push dup pop swap ;

: over	(... b a --... b a b ) // 13 cycles if inlined
	push dup swapTR pop ;

: rot	(... a b c --... b c a )
	push swap pop swap ;

: rot	(... a b c --... b c a )
	push push swapTR pop swapTR pop ;

// arithmetical manipulation

// com: complement: invert all bits of T.
: com	( x -- (-x-1) ) or equivalently ( x -- -(x+1) )
	dup nand ; // 5 cycles

: -1    ( -- -1)
	0 com ; // 8 cycles
: -1    ( -- -1)
	0 0 nand ; // ( -- -1) // 7 cycles
: negate   ( x -- -x )
	#(-1) + com ; // ( x -- -x ) // 13 cycles, or 7 cycles +  1 RAM cycle

: 1   ( -- +1 )
	#(1) ; // 1 cycle + 1 RAM cycle
// too much work for no gain over straightforward literal.
: 1
	-1 dup + com ; // 17 cycles
: 1
	_[[ AT AT _]] xor // ( -- 0 )
	_[[ AT AT _]] xor // ( 0 -- 0 0 )
	nand // ( 0 0 -- -1 )
	push
	_[[ popA AT AT //
	+	// ( ... -1 -1 -- -2 ); A=-1
	AT AT _]]
	+	// ( ... -2 -1 -1 -- ... -2 -2 )
	nand ; // (... -2 -2 -- ... +1 )
	// 16 cycles

: and
	nand dup nand ;
: or
	over com and xor ; // clever sequence from Chuck Moore
: or
	com swap com nand ;
: or // 11 cycles
	push
	push
	pop_dup
	nand
	pop_dup
	nand
	nand ;



// subtract T from S:
: -      (... a b --... (a-b) )
	negate + ;

// 0<> logical buffer: 0->0, nonzero->(-1).
// 0= logical not: 0->(-1); nonzero->0
// (Would it be better to map false to +1 rather than -1 ?)
// (Do we really use less power by using +1 rather than -1 ?)

// simple and straightforward
: 0=     ( 0 -- -1 ) | ( nonzero -- 0 )
	if
	0
	else
	-1
	then
	;

: 0<>     ( 0 -- 0 ) | ( nonzero -- -1 )
	if
	-1
	else 0 // take advantage of special speedup for ``else 0 then''
	then
	;

: 0=     ( 0 -- -1 ) | ( nonzero -- 0 )
	0<> com ;

: 0<>     ( 0 -- 0 ) | ( nonzero -- -1 )
	0= com ;

// doesn't work now -- only works if "swap" is really a primitive
//: 0<>             ( 0 -- 0 ) | ( nonzero -- -1 )
//	#(-1) swap
//	         //       x=0          |         x!=0
//	2/ swap // (...1 0 --...0 1 )  | (...1 x --...1 x/2 )
//	drop    // (...0 1 --...0 )    | (...1 x/2 --... 1 )

// somewhat more bizarre programming styles
: 0=     ( 0 -- -1 ) | ( nonzero -- 0 )
	-1 swap
	if
	dup
	else 0 // take advantage of special speedup for ``else 0 then''
	then
	xor
	;

: 0=                     ( 0 -- -1 ) | ( nonzero -- 0 )
	1 push 0 push // ( x -- x ) <... --... 1 0 >

	_[[ popA // ( x -- x ) <... 1 0 --... 1 >, A=0
	           //                x=0 | x!=0
	2/ popA    // < 1 -- >, A=1      | < 1 -- 1 >, A=0
	AT AT _]]     // (0 --...0 1 1) < -- >   | (x/2 --...x/2 0 0) < 1 -- 1 >
	2/ dropR drop  // (0 1 1 --...0 1)< -- > | (...x/2 0 0 --...x/2 0)  < 1 -- >
	push push dropR pop // (...0 1 -- 1 )< -- > | (...x/2 0 -- 0 )< -- >
	;

: 0<>     ( 0 -- 0 ) | ( nonzero -- -1 )
	-1 push
	if
	pop
	ret  // force a return
	else 0
	then // take advantage of special speedup for ``else 0 then''
	dropR
	;
which the compiler expands to
: 0<>     ( 0 -- 0 ) | ( nonzero -- -1 )
	[ # push # push 2/ jmp dropR ]
	[ -1 ]
	[ iszero ]
nonzero:
	[ push dropR popA AT ret ]
iszero:
	[ dropR ret ]


// even more bizarre versions:

: 0<>     ( 0 -- 0 ) | ( nonzero -- -1 )
	[ 2/ ret cnop ]
  nonzero:
	[ push dropR # ret cnop ]
	[ -1 ]

: 0=     ( 0 -- -1 ) | ( nonzero -- 0 )
	// this version must *not* be in-lined
	#(nonzero) push
	2/
	swapPR push dropR # dropR ret // these 6 instructions *must* be in the same cell
  zero:
	[-1]
  nonzero:
	[ 0]

which the compiler expands to
: 0=     ( 0 -- -1 ) | ( nonzero -- 0 )
	[# push 2/ swapPR push dropR # dropR ret]
	[ nonzero ]
	[-1]
  nonzero:
	[ 0]

// A smart compiler might take any phrase of the form
	``... if #(a) ... true_stuff ... else #(b) ... false_stuff ... then''
and compile it to
	[ ... # push 2/ jmp dropR push dropR # ]
	[ else ]
	[ a ]
	[ ... true_stuff ... ]
	...
    [ ...    # push jmp dropR ]
    [ then ]
  else:
	[ b ]
    [ ...   false_stuff   ... ]
    ...
  then:
    ...



: 1+	( ... a -- ... (1+a) )
	com -1 + com ; // 19 cycles

: 1+	( ... a -- ... (1+a) )
	#(1) + ; // 2 cycles + 1 RAM cycle

: 1-	( ... a -- ... (a-1) )
	-1 + ; // 9 cycles




: R--	( -- ) < x -- (x-1) > subtract one from R
	-1 pop + push ;

// other macros

// "R@+", "fetch": load word from RAM to T at address in R.
// Note that both swapPR *must* be in same cell to avoid wild jumps.
: R@+      ( -- value ) < source_address -- (1+source_address) >
	_[[ swapPR load swapPR _]]
	;

// "@", "fetch": load word from RAM at address in T.
: @      ( source_address -- value ) < -- >
	push
	R@+
	dropR ;

// "!", "poke": put value in T into RAM at address S
// (... value dest_address -- )
: !
	push !R+ dropR ;


: =    ( n1 n2 -- (n1==n2) )
	xor 0= ;




HEX

// from plichota
: MIN-          ( -- 80000000)
    80000000 ;

: signbit          ( -- 80000000)
    80000000 ;

// from plichota
: MAX+          ( -- 7FFFFFFF)
    7FFFFFFF ;



// if T is strictly negative, return true.
// from plichota
: 0<    ( x<0 -- -1 ) | ( x>=0 -- 0 )
	signbit and 0<> ;

: >     ( n1 n2 -- flag=n1>n2 )
	swap - 0< ;

: <     ( n1 n2 -- flag=n1<n2 )
	swap > ;


: <=            ( n1 n2 -- flag=n1<=n2)
    > 0= ;

: >=            ( n1 n2 -- flag=n1>=n2)
    swap <= ;

// from plichota
: ABS           ( n -- |n|)
    DUP 0< IF NEGATE THEN ;


// multi-word arithmetic

// copy word from one literal location to another
// ( -- ) < -- >
: copyword(source, dest)
	#(source) @ #(dest) ! ;

: copyword    ( dest -- 1+dest ) < source  -- 1+source >
	@R+
	push
		swapTR
		store
	pop ;

// Note that both swapPR *must* be in same cell to avoid wild jumps.
// this requires cell size to be at least 8 instructions
// 9 cycles + 4 RAM cycles
: copy2words	( dest -- 2+dest ) < source -- 2+source >
	_[[ swapPR
		push
		load store
		load store
		popA
	swapPR _]]
	AT ;

// ( dest -- 4+dest ) < source  -- 4+source >
: copy4words
	copyword copyword copyword copyword ;

// ( -- ) < dest source -- 4+dest 4+source >
// [FIXME: as written, the 4 words end up stored in reverse order,
// since the loads push them onto the data stack and the stores pop
// them back off; use the copyword version above when order matters.]
// Note that both swapPR *must* be in same cell to avoid wild jumps.
// this requires cell size to be at least 6 instructions
// 13 cycles + 8 RAM cycles
: copy4words
	_[[ swapPR
		load load load load
	swapPR _]]
	_[[ popA
		store store store store
	AT _]]
	push ;


[


_Stack Computers_ by Philip Koopman 1989
mentions these words:

// ?DUP: conditionally duplicate T if it is non-zero
: ?DUP  ( 0 -- 0 ) or ( x -- x x )
	dup
	if
	dup
	then ;

(Chuck Moore doesn't like ?DUP:
``1x Forth'' by Charles Moore April 13, 1999

http://www.ultratechnology.com/1xforth.htm

)


U<      U1 U2 - FLAG
Return a true FLAG if U1 is less than U2 when compared
as unsigned integers.

U>       U1 U2 - FLAG
Return a true FLAG if U1 is greater than U2 when compared
as unsigned integers.

U*      N1 N2 - D3
Perform unsigned integer multiplication on N1 and N2,
yielding the unsigned double precision result D3.

U/MOD   D1 N2 - N3 N4
Perform unsigned integer division on D1 and N2, yielding
the quotient N4 and the remainder N3.



how to implement multi-precision operations ?
since there are no condition codes,
the carry flag must be pushed onto the data stack
as a logical value.

_Stack Computers_ by Philip Koopman 1989
mentions these words:


RLC          N1 CIN -> N2 COUT         ->
   Rotate left through carry N1 by 1 bit.  CIN is carry-in,
   COUT is carry-out.

RRC          N1 CIN -> N2 COUT         ->
   Rotate right through carry N1 by 1 bit.  CIN is carry-in,
   COUT is carry-out.


UNORM     (... EXP1 U2 ->... EXP3 U4 )
   Floating point normalize of unsigned 32-bit mantissa

ADC   (... N1 N2 CIN ->... N3 COUT )
   Add with carry.  CIN and COUT are logical flags on the
   stack.

//Store the double-precision value D1 at the two memory
//words starting at ADDR.
// [Is most-significant or least-significant word on T ?]
: D! (...D1 ADDR - )

//Drop the double-precision integer D1.
: DDROP ( D1 -- )
	drop drop ;

//Duplicate double-precision integer D1 on the stack.
: DDUP ( D1 - D1 D1 )
	over over ;

D+      D1 D2 - D3
Return the double precision sum of D1 and D2 as D3.

D@       ADDR - D1
Fetch the double precision value D1 from memory starting
at address ADDR.


DNEGATE    D1 - D2
Return D2, which is the two's complement of D1.

// Swap the top two double-precision numbers on the stack.
: DSWAP ( D1 D2 - D2 D1 ) (... a b c d --... c d a b )
  push
  swap
  push
  swap
  //(...c a ) <...d b >
  pop pop
  //(...c a b d) <... >
  swap
  push
  swap
  //(...c d a) <...b >
  pop ;

I             - N1
Return the index of the currently active loop.

I'            - N1
Return the limit of the currently active loop.

J             - N1
Return the index of the outer loop in a nested loop structure.

LEAVE         -
Set the loop counter on the return stack equal to the
loop limit to force an exit from the loop.


S-D        N1 - D2
Sign extend N1 to occupy two words, making it a double
precision integer D2.


SP@        (fetch contents of data stack pointer)
SP!        (initialize data stack pointer)
RP@        (fetch contents of return stack pointer)
RP!        (initialize return stack pointer)
MATCH      (string compare primitive)
ABORT"     (error checking & reporting word)
+LOOP      (variable increment loop)
/LOOP      (variable unsigned increment loop)
CMOVE      (string move)
<CMOVE     (reverse order string move)
DO         (loop initialization)
ENCLOSE    (text parsing primitive)
LOOP       (increment by 1 loop)
FILL       (block memory initialization word)
TOGGLE     (bit mask/set primitive)


]


note to assembly language programmers and compiler writers

The compiler must ensure that A is in a don't care state at the start and end of every cell in order for interrupt routines to be free to use A without saving it.

The compiler can enforce this rule by only allowing the use of the ``popA'' and ``AT'' opcodes in the subroutines "0", "dup", "pop", "swap", "swapTR" (and perhaps a few others), and making sure that when these subroutines are inlined, cell breaks don't occur at the "wrong" place. (The dropR opcode, even though it is the same as popA, can be allowed anywhere A is already in a don't care state, since it leaves A in a (different) don't care state).

Perhaps we need a special "non-breaking space" notation so the programmer can indicate that certain instructions (those sensitive to A, and certain other ones following ``2/'') must be packed into a single cell. (If cells contain N instructions, ``cnop'' forces the next N instructions into the same cell; but when I really want only the next 3 instructions to stay together and there are 4 empty slots remaining in the current cell, ``cnop'' is a bit wasteful.)

note to optimizing compiler writers:

A peephole optimizer usually needs to eliminate do-nothing "popA AT push" sequences in straight-line code, since their only effect is to change A, and usually A is in a "don't care" state.

In particular, the "pop swap" macro sequence (in the "over" macro) expands to "pop push swapTR pop", and the obviously do-nothing subsequence "pop push" is further expanded to the do-nothing sequence "popa at push". Perhaps it would be simplest to immediately replace the "pop swap" sequence with the faster sequence "swapTR pop" and then expand *that* into "popA push AT popA AT"

Similarly, "swap push" expands to "push swapTR pop push", with the same do-nothing subsequence "pop push"; it seems simple to make the compiler smart enough to immediately replace "swap push" with "push swapTR"

Macros to emulate each of the F21 instructions

(The "A" is simulated as an location in RAM, _A )

else unconditional jump: #(dest) push _[[ jmp dropR _]] cnop
T=0(dest) : dup #(dest) push _[[ 2/ jmp dropR push dropR _]] cnop
call(dest) : #(dest) push call cnop dropR
C=0(dest) : #(MSbset) nand dup nand T=0(dest) cnop
; return :	return cnop cnop

subroutine start: // no entry sequence needed.

: @R+ // fetch, address in R, then increment R
  cnop // next 3 instructions must be in same cell.
  swapPR
  @P+ // do the load
  swapPR // restore P
  ;

@A+ :
  #(_A) // get address of _A (&A)
  dup push push @R+ dropR // fetch value of A: (A) < &A >
  push @R+ // fetch what A points to: (data) < &A A+1 >
  pop pop ! // update A with new value: (data) < -- >
# : #
@A :
  #(_A) // get address of A (&A)
  push @R+ dropR // fetch value of A: (A) < -- >
  push @R+ // fetch what A points to: (data) < A+1 >
  dropR //
!R+ : ! //store, address in R, increment R
!A+ :
  #(_A) // get address of A: (data &A)
  dup push push @R+ dropR // fetch value of A: (data A) < &A >
  push !R+ // pop data, store it at [A]: ( -- ) < &A A+1 >
  pop pop ! // update A with new value: ( -- ) < -- >
!A :
  #(_A) // get address of A: (data &A)
  push @R+ dropR // fetch value of A: (data A) < -- >
  ! // pop data, store it at [A]: ( -- ) < -- >
com : dup nand
2* : dup +
2/ : 2/
+* : // add S to T if T0 one            DUP 1 AND IF OVER + THEN
	dup
	#(1) and
	if
	over +
	then

+* : // add S to T if T0 one            DUP 1 AND IF OVER + THEN
	dup
	#(1)
	and // expands to ``nand dup nand''
	if
	over // expands to ``push dup swapTR pop''
	else 0
	then
	+

which expands to (with a good compiler on a 4-instruction-per-word machine)
	[push popA AT AT] // dup
	[ # push # nand] // set up for ``if'', #(1), ``nand''
	[ (then) ]
	[ (1) ]
	[ push popA AT AT] // ''dup''
	[ nand 2/ jmp dropR ] // ``nand'', ``if''
	[ push dropR push push ] // finish ``if'', start ''over'': ``push push''
	[ popA AT AT nop] // continue ``over'': ``dupR''
	[ popA push AT nop] // ``swapTR''
	[ popA AT nop nop ] // finish ``over'': ``pop''.
  then:
	[ + ]


+*R : // add R to T if T0 one            DUP 1 AND IF pop dup push + THEN
	[push popA AT AT] // dup
	[ # push # nand ] // set up for ``if'', #(1), ``nand''
	[ (then) ]
	[ (1) ]
	[ push popA AT AT] // ''dup''
	[ nand 2/ jmp dropR ] // ``nand'', ``if''
	[ push dropR nop nop ] // finish ``if''
	[ popA AT AT push] // get copy of R ``pop dup push''
 then:
	[ + ] // add either 0 or a copy of R to original T


// add step on return stack
+*_return : // add Rnext to Rtop if Rtop0 one
		//pop DUP 1 AND IF pop dup push + THEN push
	[ popA AT AT #] // pop dup #(1),
	[ (1) ]
	[ nand push nop nop ] // start of ``and''
	[ popA AT AT nand] // end of ``and''
	[ # 2/ jmp dropR ] //``if''
	[ (then) ]
	[ push dropR nop nop ] // finish ``if''
	[ popA AT AT push] // get copy of R ``pop dup push''
 then:
	[ + push ] // add either 0 or a copy of Rnext to original Rtop



xor : xor
and : nand dup nand
+ : +
pop : popA AT
A@ : #(_A) push @R+ dropR
dup : dup
over : push dup pop swap
over : push dup swapTR pop
which expands to
       push push _[[ popA AT AT _]] _[[ popA push AT _]] _[[ popA AT _]]
push : push
A! : #(_A) !
nop : nop
drop : push popA

** Ways to expand instruction set and make less ugly

Strange variations and ideas (ways to make the instruction set *more* ugly):


This page started 1998-01-15 (but some information much older, going back to 1996 Nov 12).

comments, suggestions, errors, bug reports to

David Cary feedback.html
d.cary@ieee.org.

// end http://david.carybros.com/html/minimal_instruction_set.html