f_mmu

updated 2001-03-31

Extremely rough and crude design of a MMU unit to be integrated with the Freedom CPU. The organization is terrible, help me straighten it out. There are many errors and imperfections and gaps here. Help me fill them in.

See also The Freedom CPU Project computer_architecture.html#freedom

Introduction

The MMU and associated caches should be "transparent" -- a user program should not be able to tell the difference between a system with these things and a system without these things. Of course the operating system must handle these 2 cases slightly differently. A system without these things would be simpler and cheaper, so the only point to having them is if they increase performance.

A memory management unit (MMU) and its associated caches use locality of reference to improve a system's overall performance. Rather than force the CPU to directly access relatively slow DRAM on every "load" and "store" instruction, we improve performance at constant cost by taking some of the dollars we would have spent on (a bunch of fast, expensive) DRAM and spending them instead on (the same amount of DRAM, but much slower and cheaper, plus) a much faster SRAM cache and an MMU.

"the essence of cache design is balancing fast hits and few misses." -- H&P, ch. 5.4 p. 422

The MMU is also an integral part of the memory protection features required by most multitasking operating systems.

Terms: [FIXME: definitions needed !]

write-through
write-back
page
page miss
page hit
cache miss
cache hit
memory protection
the "multilevel inclusion property"
branch-on-miss
When you need some data to complete a calculation, but it has been evicted to L2 or DRAM or even further away, rather than just waiting a long time (latency) for that data, do something else useful in the meantime (hopefully with data that is still in the cache). See SMT.
SMT
simultaneous multi-threading
"perfectly virtualizable"

For memories designed in comparable technologies, a dollar will get you roughly 8 to 16 times as many DRAM bits as SRAM bits, but those SRAM bits will have a cycle time roughly 8 to 16 times faster. So for systems that are HDD bandwidth-limited, the optimum would be *no* L2 cache, spending all that money on DRAM instead.

A second-level cache has the "multilevel inclusion property" if all data in the 1st-level cache is always also in the 2nd-level cache. The simplest 2-level cache structure has the inclusion property and same-size blocks for both caches. (Often the second-level cache has larger blocks, which means more hardware is necessary to maintain multilevel inclusion -- or else that property is sacrificed.)
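A minimal software check of that property, as a sketch only (in hardware the property is enforced by the refill and invalidation logic, not by scanning block lists):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Returns true if every block address currently held in L1 is also
       held in L2, i.e. the 2-level hierarchy has multilevel inclusion. */
    bool has_multilevel_inclusion(const uint64_t *l1_blocks, size_t n_l1,
                                  const uint64_t *l2_blocks, size_t n_l2)
    {
        for (size_t i = 0; i < n_l1; i++) {
            bool found = false;
            for (size_t j = 0; j < n_l2; j++) {
                if (l2_blocks[j] == l1_blocks[i]) { found = true; break; }
            }
            if (!found)
                return false;       /* an L1 block is missing from L2 */
        }
        return true;
    }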

Most high-performance CPUs have a Harvard architecture internally: a data cache (D-cache) and a separate instruction cache (I-cache).

Superscalar CPUs generally read multiple instructions from the I-cache simultaneously and attempt to schedule them.

Most digital signal processors (DSPs) use 3 single-ported memory areas internally: 2 data RAMs (X and Y) and an instruction RAM (I). This gives DSPs the ability to read an instruction and 2 data values simultaneously, or to read an instruction and a data value and write a data value simultaneously.

However, nearly all common CPUs have a unified (Princeton ?) (von Neumann ?) single-ported second level ("L2") cache.

Please note that just because the F-CPU has 64 bit registers does *not* imply that everything else must also be 64 bits. For example, if we use a motherboard designed for the Alpha CPU, that motherboard's second-level cache, memory bus, and memory are all 256 bits wide.

Open questions:

Which is better: (a) closely integrate the MMU with both the on-chip first-level cache of the CPU and the off-chip second-level cache, or (b) have completely independent mechanisms dealing with L1 cache vs. L2 cache ?

general notes

General notes applicable to each level in the memory hierarchy.

Any cache has 2 data areas: one area holds a duplicate copy of something already stored in a lower memory level; the other area contains the "tag bits" that remember where, exactly, in the lower level all this stuff came from.

In a direct-mapped cache, each possible virtual address [ tag bits | block number | offset ] is mapped to 1 and only 1 location in the cache: the block at "block number". When the "block number" is presented to the cache, it replies with tag bits and the entire block's worth of values. If the tag bits match the ones in the virtual address, great; otherwise a miss occurs.
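As a concrete sketch in C (all field widths here are invented for illustration: 64-byte blocks and 256 blocks, so 6 offset bits and 8 index bits), a direct-mapped lookup might look like:

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define BLOCK_SIZE   64          /* bytes per block (assumed)       */
    #define NUM_BLOCKS   256         /* blocks in the cache (assumed)   */
    #define OFFSET_BITS  6           /* log2(BLOCK_SIZE)                */
    #define INDEX_BITS   8           /* log2(NUM_BLOCKS)                */

    struct dm_line {
        bool     valid;
        uint64_t tag;                /* remembers where the block came from      */
        uint8_t  data[BLOCK_SIZE];   /* duplicate copy of lower-level data       */
    };

    static struct dm_line cache[NUM_BLOCKS];

    /* Returns true on a hit and copies the byte at 'addr' into *out. */
    bool dm_lookup(uint64_t addr, uint8_t *out)
    {
        uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        struct dm_line *line = &cache[index];
        if (line->valid && line->tag == tag) {
            *out = line->data[offset];   /* hit */
            return true;
        }
        return false;                    /* miss: go to the next level */
    }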

In a fully associative cache, there is no block-number field: the address is just [ tag bits | offset ], and the incoming tag must be compared against the tag of every block in the cache. A "page table" is a table, with one entry for each possible (high bits of a) virtual address, where the entry tells which physical page (block number in the cache) is caching that address (or "none -- miss"). An "inverted page table" is a table, with one entry per physical page, that tells which (high bits of) virtual address that page is caching (the block number of the lower level). "Inverted page tables" can be much smaller than "page tables", since the virtual memory space is very much larger than the physical memory space, but "inverted page tables" require special hardware (a translation look-aside buffer (TLB) and/or a content-addressable memory (CAM)) -- or a long time -- to search through the entire inverted page table for a match.
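A rough software model of the two table shapes (the table sizes below are illustrative only, not proposals for the F-CPU):

    #include <stdint.h>

    #define NUM_VIRT_PAGES  (1u << 20)   /* illustrative: 20-bit virtual page number  */
    #define NUM_PHYS_PAGES  (1u << 14)   /* illustrative: 14-bit physical page number */
    #define PAGE_ABSENT     (~0u)

    /* Forward page table: one entry per *virtual* page -- large but trivial to index.
       (Entries would be filled in by the OS; zero is used here only as a placeholder.) */
    static uint32_t page_table[NUM_VIRT_PAGES];     /* page_table[vpn] == ppn, or PAGE_ABSENT */

    /* Inverted page table: one entry per *physical* page -- small, but must be searched. */
    static uint32_t inverted_table[NUM_PHYS_PAGES]; /* inverted_table[ppn] == vpn it holds */

    uint32_t forward_lookup(uint32_t vpn)
    {
        return page_table[vpn];          /* a single indexed read */
    }

    uint32_t inverted_lookup(uint32_t vpn)
    {
        /* In hardware this search is done by a CAM or TLB in one cycle;
           in software it is a (slow) linear scan, often helped by hashing. */
        for (uint32_t ppn = 0; ppn < NUM_PHYS_PAGES; ppn++) {
            if (inverted_table[ppn] == vpn)
                return ppn;
        }
        return PAGE_ABSENT;
    }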

CPU interface

the interface between the CPU and the MMU. As long as this interface stays constant, the same CPU can be used with a completely different MMU and memory sub-system.

"Is this address in your cache ?" "Yes, and here is the data." or "No, sorry. Wait a few cycles and I'll fetch it for you." or "Hey ! That's an unusual address. Let's do the page fault thing."

"Please write this data to that address." "OK." or "Hey, wait, we need to page fault here."

Memory interface

As long as this interface stays the same, a particular memory sub-system can be used with any MMU and any CPU.

memory hierarchy

details and quirks about the various levels of the hierarchy.

Registers are 64 bits on the f-cpu.

I-cache and D-cache (L1 cache) virtual address: [ tag bits | block index | block offset ]

SRAM (L2 cache) virtual address: [ lots of tag bits | block index | block offset ]

SRAM address lines are not multiplexed, so every location (address) in SRAM can be accessed equally quickly.

"Main Memory" Virtual address: [ Virtual page number | page offset ] DRAM is physically arranged as rows and columns of bits. It typically takes [FIXME] cycles to select any one row of bits. After one row has been selected, any bits in that row (all other addresses in that page) can be accessed relatively quickly, and in any order. ("nibble mode", "fast page mode", "burst memory access", etc.)

(All the addresses that access the same row of DRAM memory are called a "page", which is usually a *different* size from the "pages" used by the operating system).
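A small sketch of that idea: two accesses can use the fast "same open row" path exactly when their row fields match (the row and column widths below are invented for illustration):

    #include <stdint.h>

    #define COL_BITS 10   /* illustrative: 1024 columns per DRAM row */
    #define ROW_BITS 12   /* illustrative: 4096 rows per bank        */

    /* Two accesses hit the same DRAM row ("page") -- and so can use
       fast page mode / burst access -- exactly when their row fields match. */
    int same_dram_row(uint64_t addr_a, uint64_t addr_b)
    {
        uint64_t row_a = (addr_a >> COL_BITS) & ((1u << ROW_BITS) - 1);
        uint64_t row_b = (addr_b >> COL_BITS) & ((1u << ROW_BITS) - 1);
        return row_a == row_b;
    }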

It is simplest to consider DRAM as a cache for all virtual memory (invariably fully associative). Some areas of DRAM are "locked" (notably the subroutines necessary for reading and writing other pages to disk); these can be considered the equivalent of "tag memory". Some operating systems draw a distinction between the "working space" that overflows to the "swap file" section of the HDD, and the "disk cache" that caches other parts of the HDD. (David Cary fails to see why such a division would ever be preferable to a unified virtual memory area).

Occasionally DRAM is unavailable for a few cycles, because of the necessity of DRAM refresh / page switch cycles, or because an I/O device (or parallel CPU) is reading or writing the DRAM.

Consider the interaction between the MMU and how the CPU handles peripheral interrupts computer_architecture.html#interrupt

Variations on DRAM, such as RAMBUS: see vlsi.html#ram .

HDD. If there were only 1 process, virtual addresses could be considered specific spots on the HDD. Virtual address: [ HDD sector number | offset in that sector ].

The network. Often, for reasons outside the user's control, the network is unavailable. Only a tiny portion of the Internet is writable by the user; all the rest is "read-only". Even if everything else about the system is uniprocessor, software needs to be aware of cache coherency and periodically destroy stale cached data and read more recent versions. Typically information is tagged with an "expires-by" tag to help determine when to use cached data and when it's time to fetch more data.

Internal details

The L1 cache. All of memory and cache is considered to be made up of "blocks" of data; entire "blocks" of data are always written (?) or read as a single unit. (Given N wires between the F-CPU and the DRAM memory, it takes at least blocksize/N transactions to read or write a single block).

Since a block is bigger than a single register, a "load" from some location not currently cached causes the cache to not only load the desired data, but also to speculatively pre-load the rest of the block. If that information is later required by the CPU, then it will be read from the much faster cache rather than requiring another slow RAM access.
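Continuing the direct-mapped sketch from the "general notes" section above, the refill path might look like this (the next_level_read_block helper is hypothetical, standing in for whatever L2/DRAM read mechanism exists):

    /* Hypothetical helper: reads one whole block from the next memory level.
       With N wires to DRAM this costs at least (BLOCK_SIZE*8)/N bus transactions. */
    extern void next_level_read_block(uint64_t block_addr, uint8_t *dst);

    /* On a miss, fetch the *entire* block, not just the requested byte,
       so later accesses to neighbouring addresses hit in the cache. */
    uint8_t dm_load(uint64_t addr)
    {
        uint8_t byte;
        if (dm_lookup(addr, &byte))
            return byte;                                  /* fast path: hit */

        uint64_t index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        struct dm_line *line = &cache[index];

        line->tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        line->valid = true;
        next_level_read_block(addr & ~(uint64_t)(BLOCK_SIZE - 1), line->data);

        return line->data[addr & (BLOCK_SIZE - 1)];
    }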

Block size of L1 cache:

According to H&P p. 394, Figure 5.11 and Figure 5.13, a block size of 64*8 bits (8 registers' worth) seems optimal for 64 KB to 256 KB caches. "Larger block sizes will reduce compulsory misses ... [but] larger blocks increase the miss penalty ... may increase conflict misses and even capacity misses if the cache is small"
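To make the arithmetic concrete: with 64*8-bit (64-byte) blocks, a 64 KB direct-mapped cache holds 1024 blocks, so a virtual address splits into 6 offset bits, 10 index bits, and the remaining bits as tag.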

H&P, p. 396: "a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2. ... Direct-mapped caches are the fastest (faster than 2-way set-associative: 2% for CMOS, 10% in TTL or ECL)"

Many common processors have 2-way set-associative caches: the block just loaded can go in 1 of 2 possible locations. (Fully associative -- the block just loaded can go anywhere -- is, according to H&P, not significantly different from 8-way set-associative.)
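A 2-way lookup differs from the direct-mapped sketch only in that each index selects a set of 2 lines and both tags are checked (again, the field widths are invented for illustration):

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS      2
    #define SET_BITS  7                 /* illustrative: 128 sets       */
    #define OFS_BITS  6                 /* illustrative: 64-byte blocks */

    struct sa_line { bool valid; uint64_t tag; uint8_t data[1 << OFS_BITS]; };
    struct sa_set  { struct sa_line way[WAYS]; unsigned lru; };

    static struct sa_set sets[1 << SET_BITS];

    bool sa_lookup(uint64_t addr, uint8_t *out)
    {
        uint64_t set_i = (addr >> OFS_BITS) & ((1u << SET_BITS) - 1);
        uint64_t tag   = addr >> (OFS_BITS + SET_BITS);

        for (int w = 0; w < WAYS; w++) {
            struct sa_line *line = &sets[set_i].way[w];
            if (line->valid && line->tag == tag) {
                sets[set_i].lru = 1 - w;   /* the *other* way becomes least recently used */
                *out = line->data[addr & ((1u << OFS_BITS) - 1)];
                return true;
            }
        }
        return false;                       /* miss: refill into way sets[set_i].lru */
    }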

"Victim cache" (H&P p. 398) is a small (4 entry ?) fully-associative cache between a direct-mapped cache and its refill path. When a block is discarded from the direct-mapped cache because of a miss, a copy is sent to the victim cache. When there is a miss in the direct-mapped cache, the victim cache (tags) is checked; if the desired block of data is found, then the victim block and the cache block are swapped.

Pseudo-associative caches (H&P p. 399): if a miss occurs on a direct-mapped cache, check the other possible location (invert the MSb of the index field). If the block is there, swap the blocks; otherwise completely miss and go to RAM.
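A self-contained sketch of the two probes; note that each line stores the complete block number (tag plus index bits), so a block sitting in its alternate location can still be recognised (field widths invented):

    #include <stdint.h>
    #include <stdbool.h>

    #define PA_OFFSET_BITS 6
    #define PA_INDEX_BITS  8
    #define PA_LINES       (1u << PA_INDEX_BITS)

    struct pa_line { bool valid; uint64_t block; uint8_t data[1u << PA_OFFSET_BITS]; };
    static struct pa_line pa_cache[PA_LINES];

    bool pa_lookup(uint64_t addr, uint8_t *out)
    {
        uint64_t block  = addr >> PA_OFFSET_BITS;              /* full block number */
        uint64_t index  = block & (PA_LINES - 1);
        uint64_t alt    = index ^ (1u << (PA_INDEX_BITS - 1)); /* invert MSb of index */
        uint64_t offset = addr & ((1u << PA_OFFSET_BITS) - 1);

        if (pa_cache[index].valid && pa_cache[index].block == block) {
            *out = pa_cache[index].data[offset];    /* fast hit */
            return true;
        }
        if (pa_cache[alt].valid && pa_cache[alt].block == block) {
            *out = pa_cache[alt].data[offset];      /* slow hit; a real design would
                                                       also swap the two blocks here */
            return true;
        }
        return false;                               /* complete miss: go to RAM */
    }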

p. 423 "The guideline of making the common use fast suggests that we use virtual addresses for the cache, since hits are much more common than misses. Such caches are termed /virtual caches/, [as opposed to] /physical cache/ ... the traditional cache that uses physical address. Virtual addressing eliminates address translation time from a cache hit. Then why doesn't everyone build virtually addresses caches ?"

Reason 1: p. 423 H&P claim "every time a process is switched, the virtual addresses refer to different physical addresses, requiring the cache to be flushed." Why ? Is there some reason that clever OS design couldn't put every process in a (usually) non-overlapping virtual address range ? (I think this has the same effect as the "process-identifier tag" H&P mention). This makes the common case fast ... but requires extra care for protection violations.

Reason 2: p. 424 "operating systems and user programs may use 2 different virtual addresses for the same physical address. These duplicate addresses, called /synonyms/ or /aliases/, could result in 2 copies of the same data in a virtual cache; if one is modified, the other will have the wrong value. With a physical cache this never happens. /anti-aliasing/ hardware is used to guarantee every cache block a unique physical address. Or software can force all aliases to be identical in the last N bits of their address, so a direct-mapped cache of size 2^N words can never have duplicate physical addresses for blocks.

Reason 3: I/O typically uses physical addresses.

H&P ch. 6.2 p. 492 "The price of a megabyte of [HDD] disk storage in 1995 is about 100 times cheaper than the price of a megabyte of DRAM in a system, but DRAM is about 100 000 times faster. Many a scientist has tried to invent a technology to fill that gap, but thus far all have failed."

One of the first (and simplest) MMU chips was the 74LS610 MMU: see

misc

end f_mmu.html