2013年2月28日星期四

cache, vitural memory, TLB, paging

CPI: clock cycles per instruction
CPU time = CPU cycle number * cycle time = instruction number * CPI * cycle time
                 = instruction number * CPI / frequency

Why we need cache?

Register is too expensive and limit; DRAM  is cheap but slow. So we need a media between them to make PC faster. It store the data that is most active.

How can we get data when running a program?
 Most of data are in registers;
if we need data outside register, first cache, if not in cache, then RAM, if not RAM, at last disk.

 What is cache looks like?
It is made of SRAM. It is in CPU. Is is a part image of memory.  It consists of many blocks, usually with the same size.

How to map cache from memory?
(block addr)mod(block number in cache)
We use part of the address as the block index.
eg: If block number is 8, we use lower three bit as block index.
So for 00001,01001,10001, 11001, these four memory address are all mapped to cache block 001.
So each cache position might corresponds to multiple memory address.

How can we know if the data in certain cache block is what we need? How can we know the requested data is in cache or not?
We use the higher bits of address to verify.
The 32 bit address consists of three parts: tag, index, and offset.
index is used to choose block.
tag is used to compare with tag in cache
cache consists of three parts: valid, tag, data
So the total bits of cache is bigger than its capacity.

Do we still need to search for tag in cache when we just init CPU and there is no data in cache?
No, we use an extra valid bit to see whether this block has a valid address or not. If it is N, we do not read the content in this block.
This type of miss we call it cold miss.

What will be done the index is the same but tag is different?
We use the most recent address to replace previous address.(in direct mapped)

What we can do if cache is full?
We replace the old one with new one

How can we guarantee the consistency of cache and memory when we write data?
Intuitively, we can update memory whenever we update cache, which is called write-through.
cons of write-through: write operation cost too much time
Solution:write buffer
cons of write buffer: still can be stalled, when write buffer is full
Solution: write-back, when write happens, new value is only written into cache. Only when the written block replace by new data, we write it to memory. dirty bit to identify
cons: we need two cycle to do it: first we check cache miss or not, next is the real execuation
solution: we can also use write buffer in write-back mechanism.

Can we remove index in mapping?
Yes, if we do not use index, we need to traverse all blocks in cache, which we called fully associative cache. We can put memory data in any cache block.
cons: to make fully associative cache efficient, we need a Comparator. So it is only practical for that case that block number is small.

Between fully associative and direct mapping is set-associative cache. Each block has n positions.
First, block is mapped to a set, then retrieve all elements in the set.
It could decrease cache miss but also increase hit time.
problem:when miss, we need to choose an item to substitute.  The rule is we choose the least recently used item.(LRU)
Replacement algorithm:
FIFO
LRU
LFU
Random

What type of cache we use for L1, L2?why?
We use direct mapping for L1, since hit time is more important than hit rate. since less hit time, less clock cycle.
We use set associative for L2, since hit rate is more important. since we need to reduce the cost to read memory, so we need high hit rate.

What is the motivation for virtual memory?
allow multiple program to share memory efficiently and Security.
allow user to use memory capacity that is bigger than physical memory.

Does every program really has such memroy?
No! That's the illusion OS offer us. Each program can only do read and write to part of memory which allocate to it. Memory only need to store the most active part, just like what cache did.

Is virtual memory similar to cache?
Yes! Block here we called page. cache miss we called page fault. we call the page transfer operation between memory and disk as swapping or paging. There is a mapping between virtual address and physical address.

Is page fault cost high? What should we do?
Yes! Page fault might cost millions clock cycle. So we need to try our best to decrease miss rate.
Solution:
allow page in fully associative pattern.
use write back machanism
page is big enough: 4KB-16KB

Fully associative is hard to position address. What should we do to improve this in VM?
We use page table. It is in memory. Each process has a separate PTE.

What is the size of page table? if page size is 4KB, and each item in page table is 4 Bytes, 32 bit address?
Number of page = 2^32/2^12 = 2^20
Page table size = 2^20 * 4 = 4MB
So each process has a page table with 4MB size. Multiple processes will occupy too much memory for just storing page tables.
Solution:
Inverted page table
multilevel page table
etc

Can we speed up page table operation?
YES! TLB. It store part of page table. It is like a cache of PTE. It use fully associative pattern. Random replacement algorithm.
So assume CPU has a VA, first we search TLB to find physical address, if it hit, then go to cache to find the data, if cache miss, then go to memory. If TLB miss, then turn to PTE to find physical address, if hit, then cache. If miss, exception: page fault.
Actually, an intuitive method is have a single page table consisting of an array of fast hardware register. VA could be the index. When a process is started up, OS loads the registers with the process's page table(in memory). During process execution, no memory reference during mapping.
cons: load the full page table at every context switch hurts performance. Moreover, if page table is large, it is expensive.

Why we need TLB? Why we need to speed up paging?
1. The mapping from VA to PA must be fast.
2. If VA space is large, the page table will be large.
All instruction must ultimately come from memory and many of them reference operands in memory as well. So, it is necessary to make 1, 2 or more page table reference per instruction.  If an instruction execution takes 1 nsec, the page table lookup must be done in under 0.2 nsec to avoid having the mapping become major bottleneck.
Each time CPU generate a VA, MMU need to look up PTE to do address translation. In worst case, this might an operation of memory which is hundreds of clock cycle. So we put a small buffer(TLB) in MMU to save the cost.


Does TLB has the same entry structure as page table?
NO!. Page table does not contain virtual page number, which is used as index to search page table. But TLB has virtual page number. TLB is inside the MMU and consists of a small number of entries, eight for example, but rarely more than 64.
Each TLB entry contains virtual page number, modified bit, protection bit, physical page frame and valid bit.
TLB lookup: Hardware check TLB by comparing it to all the entries simultaneously(ie., in parallel)

What TLB entry form is different from CPU to CPU.
From RB, VA consists of TLB index(virtual page number lower t bits) and TLB tag(rest bits of virtual page number) and virtual page offset. So TLB index is not in TLB entry, but TLB tag is in TLB, since TLB could be set-associative. TLB index is used to find set, and then TLB tag is used to find entry.




What  happens if virtual page number is not in TLB?
That's the interesting part! MMU detects the missing and does an ordinary page table lookup. It then evicts one entry of TLB and replaces it with the page table entry just looked up. If the page is used again soon, this time it will result in a TLB hit.
When an entry is purged from the TLB, modified bit is copied back into the page table entry in memory. Another bit need to changed is reference bit which help OS to choose a page to evict when a page fault occurs. Reference bit is important for page replacement algorithms.


 Is page table is still practical for 64-bit system?
 NOOOOO! We use two alternatives: multilevel page tables and inverted page tables.
first 10 bits for PT1, next 10 bits for PT2 and then offset. So each of 1024 PT1 entries represents 4M.
eg: for the address space  contains a million pages, only 4 page tables are actually needed: 4M table level table, 0-4M second level table for program text, 4M to 8M for data and top 4M for stack.
BUT...For 64bit system, we might need 5 level of page table....we can save space and time by inverting the page table -- mapping physical to virtual, instead of the other way 'round.
For tradition page table,  each process has a page table,and page table entries map VA to PA.
But now, we can have a single inverted page table for the system, which the entries mapping physical frames to (process ID, virtual page number) pairs.
Intuitively, we can use the pattern below to design a linear IPT
PA is not stored, since the index of IPT corresponds to it. Virtual page number and current process ID are compared against each entry, traversing the array sequentially.  When a match is found, the index of match replaces the virtual page number in the address to obtain a physical address.


problem: we need an efficient search strategies,
Solution: HASH!!!
Hash Anchor Table
each inverted PTE has to contain:
  • The process ID of the owning process.
  • The virtual page number.
  • A pointer to the next IPTE in the hash chain.
  • The normal protection, valid, modified bits referenced. 
The steps in a translation are:
  1. Hash the process ID and virtual page number to get an index into the HAT.
  2. Look up a Physical Frame Number in the HAT.
  3. Look at the inverted page table entry, to see if it is the right process ID and virtual page number. If it is, you're done.
  4. If the PID or VPN does not match, follow the pointer to the next link in the hash chain. Again, if you get a match then you're done; if you don't, then you continue. Eventually, you will either get a match or you will find a pointer that is marked invalid. If you get a match, then you've got the translation; if you get the invalid pointer, then you have a miss.