f_mmu

updated 2001-03-31

Extremely rough and crude design of a MMU unit to be integrated with the Freedom CPU. The organization is terrible, help me straighten it out. There are many errors and imperfections and gaps here. Help me fill them in.

See also The Freedom CPU Project computer_architecture.html#freedom

Introduction

The MMU and associated caches should be "transparent" -- a user program should not be able to tell the difference between a system with these things and a system without these things. Of course the operating system must handle these 2 cases slightly differently. A system without these things would be simpler and cheaper, so the only point to having them is if they increase performance.

A memory management unit (MMU) and its associated caches use locality of reference to improve a system's overall performance. Rather than force the CPU to directly access relatively slow DRAM on every "load" and "store" instruction, we improve performance at constant cost by taking some of the dollars we would have spent on (a bunch of fast, expensive) DRAM and spending them instead on (the same amount of DRAM, but much slower and cheaper, plus) a much faster SRAM cache and an MMU.

"the essence of cache design is balancing fast hits and few misses." -- H&P, ch. 5.4 p. 422

The MMU is also an integral part of the memory protection features required by most multitasking operating systems.

Terms: [FIXME: definitions needed !]

write-through
write-back
page
page miss
page hit
cache miss
cache hit
memory protection
the "multilevel inclusion property"
branch-on-miss
When you need some data to complete a calculation, but it has been evicted to L2 or DRAM or even further away, rather than just waiting a long time (latency) for that data, do something else useful in the meantime (hopefully with data that is still in the cache). See SMT.
SMT
simultaneous multi-threading
"perfectly virtualizable"

For memories designed in comparable technologies, a dollar will get you roughly 8 to 16 times as many DRAM bits as SRAM bits, but those SRAM bits will have a cycle time roughly 8 to 16 times faster. So for systems that are HDD bandwidth-limited, the optimum would be *no* L2 cache, spending all that money on DRAM instead.

A second-level cache has the "multilevel inclusion property" if all data in the 1st-level cache is always also in the 2nd-level cache. The simplest 2-level cache structure has the inclusion property and same-size blocks for both caches. (Often the second-level cache has larger blocks, which means more hardware is necessary to maintain multilevel inclusion -- or else that property is sacrificed.)
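A minimal software check of that property, as a sketch only (in hardware the property is enforced by the refill and invalidation logic, not by scanning block lists):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Returns true if every block address currently held in L1 is also
       held in L2, i.e. the 2-level hierarchy has multilevel inclusion. */
    bool has_multilevel_inclusion(const uint64_t *l1_blocks, size_t n_l1,
                                  const uint64_t *l2_blocks, size_t n_l2)
    {
        for (size_t i = 0; i < n_l1; i++) {
            bool found = false;
            for (size_t j = 0; j < n_l2; j++) {
                if (l2_blocks[j] == l1_blocks[i]) { found = true; break; }
            }
            if (!found)
                return false;       /* an L1 block is missing from L2 */
        }
        return true;
    }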

Most high-performance CPUs have a Harvard architecture internally: a data cache (D-cache) and a separate instruction cache (I-cache).

Superscalar CPUs generally read multiple instructions from the I-cache simultaneously and attempt to schedule them.

Most digital signal processors (DSPs) use 3 single-ported memory areas internally: 2 data RAMs (X and Y) and an instruction RAM (I). This gives DSPs the ability to read an instruction and 2 data values simultaneously, or to read an instruction and a data value and write a data value simultaneously.

However, nearly all common CPUs have a unified (Princeton ?) (von Neumann ?) single-ported second level ("L2") cache.

Please note that just because the F-CPU has 64 bit registers does *not* imply that everything else must also be 64 bits. For example, if we use a motherboard designed for the Alpha CPU, that motherboard's second-level cache, memory bus, and memory are all 256 bits wide.

Open questions:

Which is better: (a) closely integrate the MMU with both the on-chip first-level cache of the CPU and the off-chip second-level cache, or (b) have completely independent mechanisms dealing with L1 cache vs. L2 cache ?

general notes

General notes applicable to each level in the memory hierarchy.

Any cache has 2 data areas: one area holds a duplicate copy of something already stored in a lower memory level; the other area contains the "tag bits" that remember where, exactly, in the lower level all this stuff came from.

In a direct-mapped cache, each possible virtual address [ tag bits | block number | offset ] is mapped to 1 and only 1 location in the cache: the block at "block number". When the "block number" is presented to the cache, it replies with tag bits and the entire block's worth of values. If the tag bits match the ones in the virtual address, great; otherwise a miss occurs.
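As a concrete sketch in C (all field widths here are invented for illustration: 64-byte blocks and 256 blocks, so 6 offset bits and 8 index bits), a direct-mapped lookup might look like:

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define BLOCK_SIZE   64          /* bytes per block (assumed)       */
    #define NUM_BLOCKS   256         /* blocks in the cache (assumed)   */
    #define OFFSET_BITS  6           /* log2(BLOCK_SIZE)                */
    #define INDEX_BITS   8           /* log2(NUM_BLOCKS)                */

    struct dm_line {
        bool     valid;
        uint64_t tag;                /* remembers where the block came from      */
        uint8_t  data[BLOCK_SIZE];   /* duplicate copy of lower-level data       */
    };

    static struct dm_line cache[NUM_BLOCKS];

    /* Returns true on a hit and copies the byte at 'addr' into *out. */
    bool dm_lookup(uint64_t addr, uint8_t *out)
    {
        uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        struct dm_line *line = &cache[index];
        if (line->valid && line->tag == tag) {
            *out = line->data[offset];   /* hit */
            return true;
        }
        return false;                    /* miss: go to the next level */
    }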

In a fully associative cache, there is no block-number field: the address is just [ tag bits | offset ], and the incoming tag must be compared against the tag of every block in the cache. A "page table" is a table, with one entry for each possible (high bits of a) virtual address, where the entry tells which physical page (block number in the cache) is caching that address (or "none -- miss"). An "inverted page table" is a table, with one entry per physical page, that tells which (high bits of) virtual address that page is caching (the block number of the lower level). "Inverted page tables" can be much smaller than "page tables", since the virtual memory space is very much larger than the physical memory space, but "inverted page tables" require special hardware (a translation look-aside buffer (TLB) and/or a content-addressable memory (CAM)) -- or a long time -- to search through the entire inverted page table for a match.
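A rough software model of the two table shapes (the table sizes below are illustrative only, not proposals for the F-CPU):

    #include <stdint.h>

    #define NUM_VIRT_PAGES  (1u << 20)   /* illustrative: 20-bit virtual page number  */
    #define NUM_PHYS_PAGES  (1u << 14)   /* illustrative: 14-bit physical page number */
    #define PAGE_ABSENT     (~0u)

    /* Forward page table: one entry per *virtual* page -- large but trivial to index.
       (Entries would be filled in by the OS; zero is used here only as a placeholder.) */
    static uint32_t page_table[NUM_VIRT_PAGES];     /* page_table[vpn] == ppn, or PAGE_ABSENT */

    /* Inverted page table: one entry per *physical* page -- small, but must be searched. */
    static uint32_t inverted_table[NUM_PHYS_PAGES]; /* inverted_table[ppn] == vpn it holds */

    uint32_t forward_lookup(uint32_t vpn)
    {
        return page_table[vpn];          /* a single indexed read */
    }

    uint32_t inverted_lookup(uint32_t vpn)
    {
        /* In hardware this search is done by a CAM or TLB in one cycle;
           in software it is a (slow) linear scan, often helped by hashing. */
        for (uint32_t ppn = 0; ppn < NUM_PHYS_PAGES; ppn++) {
            if (inverted_table[ppn] == vpn)
                return ppn;
        }
        return PAGE_ABSENT;
    }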

CPU interface

the interface between the CPU and the MMU. As long as this interface stays constant, the same CPU can be used with a completely different MMU and memory sub-system.

"Is this address in your cache ?" "Yes, and here is the data." or "No, sorry. Wait a few cycles and I'll fetch it for you." or "Hey ! That's an unusual address. Let's do the page fault thing."

"Please write this data to that address." "OK." or "Hey, wait, we need to page fault here."

Memory interface

As long as this interface stays the same, a particular memory sub-system can be used with any MMU and any CPU.

memory hierarchy

details and quirks about the various levels of the hierarchy.

Registers are 64 bits on the f-cpu.

I-cache and D-cache (L1 cache) virtual address: [ tag bits | block index | block offset ]

SRAM (L2 cache) virtual address: [ lots of tag bits | block index | block offset ]

SRAM address lines are not multiplexed, so every location (address) in SRAM can be accessed equally quickly.

"Main Memory" Virtual address: [ Virtual page number | page offset ] DRAM is physically arranged as rows and columns of bits. It typically takes [FIXME] cycles to select any one row of bits. After one row has been selected, any bits in that row (all other addresses in that page) can be accessed relatively quickly, and in any order. ("nibble mode", "fast page mode", "burst memory access", etc.)

(All the addresses that access the same row of DRAM memory are called a "page", which is usually a *different* size from the "pages" used by the operating system).
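A small sketch of that idea: two accesses can use the fast "same open row" path exactly when their row fields match (the row and column widths below are invented for illustration):

    #include <stdint.h>

    #define COL_BITS 10   /* illustrative: 1024 columns per DRAM row */
    #define ROW_BITS 12   /* illustrative: 4096 rows per bank        */

    /* Two accesses hit the same DRAM row ("page") -- and so can use
       fast page mode / burst access -- exactly when their row fields match. */
    int same_dram_row(uint64_t addr_a, uint64_t addr_b)
    {
        uint64_t row_a = (addr_a >> COL_BITS) & ((1u << ROW_BITS) - 1);
        uint64_t row_b = (addr_b >> COL_BITS) & ((1u << ROW_BITS) - 1);
        return row_a == row_b;
    }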

It is simplest to consider DRAM as a cache for all virtual memory (invariably fully associative). Some areas of DRAM are "locked" (notably the subroutines necessary for reading and writing other pages to disk); these can be considered the equivalent of "tag memory". Some operating systems draw a distinction between the "working space" that overflows to the "swap file" section of the HDD, and the "disk cache" that caches other parts of the HDD. (David Cary fails to see why such a division would ever be preferable to a unified virtual memory area).

Occasionally DRAM is unavailable for a few cycles, because of the necessity of DRAM refresh / page switch cycles, or because an I/O device (or parallel CPU) is reading or writing the DRAM.

Consider the interaction between the MMU and how the CPU handles peripheral interrupts computer_architecture.html#interrupt

Variations on DRAM, such as RAMBUS: see vlsi.html#ram .

HDD. If there were only 1 process, virtual addresses could be considered specific spots on the HDD. Virtual address: [ HDD sector number | offset in that sector ].

The network. Often, for reasons outside the user's control, the network is unavailable. Only a tiny portion of the Internet is writable by the user; all the rest is "read-only". Even if everything else about the system is uniprocessor, software needs to be aware of cache coherency and periodically destroy stale cached data and read more recent versions. Typically information is tagged with an "expires-by" tag to help determine when to use cached data and when it's time to fetch more data.

Internal details

The L1 cache. All of memory and cache is considered to be made up of "blocks" of data; entire "blocks" of data are always written (?) or read as a single unit. (Given N wires between the F-CPU and the DRAM memory, it takes at least blocksize/N transactions to read or write a single block).

Since a block is bigger than a single register, a "load" from some location not currently cached causes the cache to not only load the desired data, but also to speculatively pre-load the rest of the block. If that information is later required by the CPU, then it will be read from the much faster cache rather than requiring another slow RAM access.
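Continuing the direct-mapped sketch from the "general notes" section above, the refill path might look like this (the next_level_read_block helper is hypothetical, standing in for whatever L2/DRAM read mechanism exists):

    /* Hypothetical helper: reads one whole block from the next memory level.
       With N wires to DRAM this costs at least (BLOCK_SIZE*8)/N bus transactions. */
    extern void next_level_read_block(uint64_t block_addr, uint8_t *dst);

    /* On a miss, fetch the *entire* block, not just the requested byte,
       so later accesses to neighbouring addresses hit in the cache. */
    uint8_t dm_load(uint64_t addr)
    {
        uint8_t byte;
        if (dm_lookup(addr, &byte))
            return byte;                                  /* fast path: hit */

        uint64_t index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        struct dm_line *line = &cache[index];

        line->tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        line->valid = true;
        next_level_read_block(addr & ~(uint64_t)(BLOCK_SIZE - 1), line->data);

        return line->data[addr & (BLOCK_SIZE - 1)];
    }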

Block size of L1 cache:

According to H&P p. 394, Figure 5.11 and Figure 5.13, a block size of 64*8 bits (8 registers' worth) seems optimal for 64 KB to 256 KB caches. "Larger block sizes will reduce compulsory misses ... [but] larger blocks increase the miss penalty ... may increase conflict misses and even capacity misses if the cache is small"
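To make the arithmetic concrete: with 64*8-bit (64-byte) blocks, a 64 KB direct-mapped cache holds 1024 blocks, so a virtual address splits into 6 offset bits, 10 index bits, and the remaining bits as tag.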

H&P, p. 396: "a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2. ... Direct-mapped caches are the fastest (faster than 2-way set-associative: 2% for CMOS, 10% in TTL or ECL)"

Many common processors have 2-way set-associative caches: the block just loaded can go in 1 of 2 possible locations. (Fully associative -- the block just loaded can go anywhere -- is, according to H&P, not significantly different from 8-way set-associative.)
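A 2-way lookup differs from the direct-mapped sketch only in that each index selects a set of 2 lines and both tags are checked (again, the field widths are invented for illustration):

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS      2
    #define SET_BITS  7                 /* illustrative: 128 sets       */
    #define OFS_BITS  6                 /* illustrative: 64-byte blocks */

    struct sa_line { bool valid; uint64_t tag; uint8_t data[1 << OFS_BITS]; };
    struct sa_set  { struct sa_line way[WAYS]; unsigned lru; };

    static struct sa_set sets[1 << SET_BITS];

    bool sa_lookup(uint64_t addr, uint8_t *out)
    {
        uint64_t set_i = (addr >> OFS_BITS) & ((1u << SET_BITS) - 1);
        uint64_t tag   = addr >> (OFS_BITS + SET_BITS);

        for (int w = 0; w < WAYS; w++) {
            struct sa_line *line = &sets[set_i].way[w];
            if (line->valid && line->tag == tag) {
                sets[set_i].lru = 1 - w;   /* the *other* way becomes least recently used */
                *out = line->data[addr & ((1u << OFS_BITS) - 1)];
                return true;
            }
        }
        return false;                       /* miss: refill into way sets[set_i].lru */
    }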

"Victim cache" (H&P p. 398) is a small (4 entry ?) fully-associative cache between a direct-mapped cache and its refill path. When a block is discarded from the direct-mapped cache because of a miss, a copy is sent to the victim cache. When there is a miss in the direct-mapped cache, the victim cache (tags) is checked; if the desired block of data is found, then the victim block and the cache block are swapped.

Pseudo-associative caches (H&P p. 399): if a miss occurs on a direct-mapped cache, check the other possible location (invert the MSb of the index field). If the block is there, swap the blocks; otherwise completely miss and go to RAM.
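A self-contained sketch of the two probes; note that each line stores the complete block number (tag plus index bits), so a block sitting in its alternate location can still be recognised (field widths invented):

    #include <stdint.h>
    #include <stdbool.h>

    #define PA_OFFSET_BITS 6
    #define PA_INDEX_BITS  8
    #define PA_LINES       (1u << PA_INDEX_BITS)

    struct pa_line { bool valid; uint64_t block; uint8_t data[1u << PA_OFFSET_BITS]; };
    static struct pa_line pa_cache[PA_LINES];

    bool pa_lookup(uint64_t addr, uint8_t *out)
    {
        uint64_t block  = addr >> PA_OFFSET_BITS;              /* full block number */
        uint64_t index  = block & (PA_LINES - 1);
        uint64_t alt    = index ^ (1u << (PA_INDEX_BITS - 1)); /* invert MSb of index */
        uint64_t offset = addr & ((1u << PA_OFFSET_BITS) - 1);

        if (pa_cache[index].valid && pa_cache[index].block == block) {
            *out = pa_cache[index].data[offset];    /* fast hit */
            return true;
        }
        if (pa_cache[alt].valid && pa_cache[alt].block == block) {
            *out = pa_cache[alt].data[offset];      /* slow hit; a real design would
                                                       also swap the two blocks here */
            return true;
        }
        return false;                               /* complete miss: go to RAM */
    }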

p. 423 "The guideline of making the common use fast suggests that we use virtual addresses for the cache, since hits are much more common than misses. Such caches are termed /virtual caches/, [as opposed to] /physical cache/ ... the traditional cache that uses physical address. Virtual addressing eliminates address translation time from a cache hit. Then why doesn't everyone build virtually addresses caches ?"

Reason 1: p. 423 H&P claim "every time a process is switched, the virtual addresses refer to different physical addresses, requiring the cache to be flushed." Why ? Is there some reason that clever OS design couldn't put every process in a (usually) non-overlapping virtual address range ? (I think this has the same effect as the "process-identifier tag" H&P mention). This makes the common case fast ... but requires extra care for protection violations.

Reason 2: p. 424 "operating systems and user programs may use 2 different virtual addresses for the same physical address. These duplicate addresses, called /synonyms/ or /aliases/, could result in 2 copies of the same data in a virtual cache; if one is modified, the other will have the wrong value. With a physical cache this never happens. /anti-aliasing/ hardware is used to guarantee every cache block a unique physical address. Or software can force all aliases to be identical in the last N bits of their address, so a direct-mapped cache of size 2^N words can never have duplicate physical addresses for blocks.

Reason 3: I/O typically uses physical addresses.

H&P ch. 6.2 p. 492 "The price of a megabyte of [HDD] disk storage in 1995 is about 100 times cheaper than the price of a megabyte of DRAM in a system, but DRAM is about 100 000 times faster. Many a scientist has tried to invent a technology to fill that gap, but thus far all have failed."

One of the first (and simplest) MMU chips was the 74LS610 MMU: see

misc

end f_mmu.html