gcc -g testcpp.cpp -o testcpp
Many link errors occur. The reason is that gcc does not automatically link the C++ standard library; we have to do it manually (or simply invoke g++, which does). Putting the library after the sources keeps the linker happy, since it resolves symbols left to right:
gcc -g testcpp.cpp -o testcpp -lstdc++
Here we go.
LIANG
I am Liang Hong, a student in the CSE department at UB. This blog mainly serves as a record of my studies.
Saturday, August 24, 2013
Tuesday, July 9, 2013
Interrupt
How interrupts work
Simply put, you change the voltage on a pin and the processor starts executing at a new location.
Basically, an interrupt is an external hardware signal that alerts the processor that input is ready.
The processor suspends what it is doing.
Processors are actually really stupid. They have a clock signal, and on every clock they take whatever value is in the program counter register, read from that memory location, get an instruction, and execute it. In reality execution is pipelined and spread over several clock cycles, but logically the processor is fetching a sequence of instructions and executing them.
What you do is change the voltage on the interrupt request line. The interrupt hardware in the device says: oh, wait, don't use the program counter on this cycle; on the next cycle, fetch from a fixed location in memory instead and take the instruction there, rather than the one the program counter points at. And what's there is the interrupt service routine.
Logically, the way it works is that you set up your interrupt routine by telling the hardware you'd like an interrupt to occur when something happens on some external pin (a GPIO pin goes high, the serial port receives a byte). You write into a memory-mapped register that says: give me an interrupt when this external condition happens.
The processor itself goes into a loop where it just executes code until the interrupt condition is satisfied, at which point it goes off to execute the code of the interrupt service routine.
A microkernel is a small OS that just handles scheduling of tasks. You need to decide when to schedule. One common strategy is a timer interrupt: the microkernel writes into a memory-mapped register to set up a periodic interrupt at some interval, call it the jiffy interval. When that interrupt comes in, you have a bunch of things to do (a sketch follows the list):
1. Disable interrupts.
2. Store the machine registers into a persistent data structure, i.e., the thread structure that keeps track of the current thread, and store the return address. What typically happens in the interrupt hardware is that the program counter gets pushed onto the stack; we instead store it into persistent memory independent of the stack, in the thread data structure.
3. Decide whether to continue executing the current thread. (The heap persists across procedure calls; the stack does not. So malloc the space for the thread data structure.)
4. Put the chosen thread's PC on the stack, overwriting what is there, and return from the interrupt.
5. Re-enable interrupts.
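Those five steps read naturally as a handler skeleton. Here is a minimal sketch in C; disable_interrupts, save_registers, set_return_pc, and pick_next are illustrative stand-ins for hardware and scheduler hooks, not a real kernel API.

#include <stdint.h>

struct thread {
    uint32_t regs[16];   /* saved general-purpose registers */
    uint32_t pc;         /* saved program counter (return address) */
};

/* Hypothetical hardware/scheduler hooks. */
extern void disable_interrupts(void);
extern void enable_interrupts(void);
extern void save_registers(uint32_t *regs, uint32_t *pc);
extern void set_return_pc(uint32_t pc);
extern struct thread *pick_next(void);
struct thread *current;

void timer_isr(void)
{
    disable_interrupts();                        /* 1. no nested interrupts */
    save_registers(current->regs, &current->pc); /* 2. persist machine state */
    current = pick_next();                       /* 3. continue or switch? */
    set_return_pc(current->pc);                  /* 4. overwrite the saved PC */
    enable_interrupts();                         /* 5. then return from interrupt */
}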
What does creating a new thread mean?
Call a procedure, allocate memory for the thread data structure, and decide where its stack will go in memory. Make sure thread stacks don't interfere with one another: leave sufficient space between them.
So basically: initialize the data structure, then decide which thread to execute. If you decide to continue the thread that just created the new one, you simply return from the interrupt.
What runs after reset? Create the first idle process, which a timer or an external device will wake up; or sit in a no-op infinite loop; or branch to a bootloader, which looks at an outside memory (flash memory, for example), loads a program into RAM, starts that program, and so on.
How to decide which thread to schedule?
1. Fairness: give every thread a chance to run.
2. Importance: the idle thread gets the lowest priority.
3. Responsiveness: shortest job first.
4. Precedence: a thread might wait for another thread to complete. How do we know one is waiting on another? The thread data structure can carry status information indicating that the thread is blocked and why. When a thread acquires a lock, record that in the data structure; when the lock is later released, find all the threads blocked on it and try to run the waiting ones.
Non-preemptive: once a task starts, it runs until done. Only useful when the problem breaks down into small tasks whose cost is tiny.
Time-triggered interrupts.
Cooperative multitasking.
If priorities change dynamically, does that mean the scheduler is running concurrently?
Assume we have a single core executing a single instruction stream. The only source of concurrency is an interrupt kicking in and giving a new task the opportunity to run. The thread scheduler runs only when it is given an opportunity to run, by an interrupt or by a procedure call. If a thread makes a call into the pthread library to acquire a mutex, that will normally cause the thread scheduler to run. Why? Because an attempt to acquire a mutex may block, and you obviously want the scheduler to run at that point; you don't want to wait until the next timer interrupt, because the currently executing thread is stalled. Remember, the processor has to do something. What does "stall" mean? The clock keeps ticking and the program counter keeps getting incremented; the processor has to put some value in the program counter. If you can't acquire the mutex, a reasonable choice is to point the program counter at your thread scheduling algorithm and let that code decide what to do next.
So yes: if you have dynamic priorities, the scheduler needs more opportunities to run, and anything that could change a thread's priority needs to involve the thread scheduler.
Penalize threads that have run a long time by giving them lower priority, simply because they're taking a long time to execute.
Does the halting problem affect your ability to estimate execution time?
When a context switch happens, you disable interrupts first. Then save the PC (push the program counter onto the stack) and any processor registers the ISR might overwrite. Then the processor jumps to a designated address and executes the instruction there; or there may be a table of the addresses of all ISRs.
The interrupt vector is a table with one jump instruction for every kind of interrupt, jumping to the code of the corresponding handler. All of this happens before the ISR runs; then the ISR code executes.
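On many microcontrollers that table is literally an array of function pointers placed at a fixed address. A minimal sketch in C, assuming a Cortex-M-style layout; the ".vectors" section name, the handler set, and the linker symbol are all assumptions that depend on your part and linker script.

#include <stdint.h>

extern uint32_t _stack_top;                   /* initial SP, defined by the linker script */
void reset_handler(void);                     /* provided by startup code */
void timer_isr(void);
void uart_isr(void);
static void default_isr(void) { for (;;) {} } /* trap unexpected interrupts */

/* Placed where the hardware fetches vectors from; ".vectors" is an assumption. */
__attribute__((section(".vectors")))
void (*const vector_table[])(void) = {
    (void (*)(void))&_stack_top,   /* entry 0: initial stack pointer (Cortex-M style) */
    reset_handler,                 /* entry 1: where the PC goes on reset */
    timer_isr,                     /* one entry per interrupt source... */
    uart_isr,
    default_isr,
};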
At the end of the ISR, it restores the values of the registers it saved before running, then re-enables interrupts and returns to the main program.
Oftentimes it is useful to raise an interrupt after a certain amount of time has elapsed. For example, in the RRH project you need to read the digital power sensor value every 10 ms; so each time 10 ms elapses you want a new interrupt to be raised so that the main thread goes and gets a new sample.
What happens when multiple interrupts occur?
This depends on the hardware mechanism implemented. An interrupt is indicated simply by raising a voltage level, and only one interrupt can be signaled at any time; a buffer might help with multiple interrupts.
What happens if you are serving interrupt A and some other interrupt B occurs?
If you responded to A by disabling all interrupts, you will not be able to respond to interrupt B. When you re-enable interrupts and B is still pending, you will then be able to respond to it.
It is also possible not to disable all interrupts. In some cases there are different signals that you can rank in increasing order of priority. Maybe the reset interrupt is the most important: you want reset to have higher priority than the ADC interrupt. So the priorities are set, and you disable only the ADC interrupt when you go to service it.
Set up a timer and set its period.
Enable the timer so it starts counting.
Configure it to trigger an interrupt, and enable the interrupt itself.
When done, disable both the interrupt and the timer.
Using a function pointer lets us connect the timer interrupt to the interrupt service routine: it registers the ISR to be invoked on every SysTick interrupt.
Between any two instructions in the main function, the ISR could execute. And possibly, between any two instructions in the ISR, another instance of the ISR could execute, and so on. It is a form of interleaving: the code of the ISR is interleaved with the code of the main function.
SysTickIntRegister(&ISR)
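Putting those setup steps together, here is a minimal sketch using TI Stellaris/Tiva-style driverlib calls. The prototypes are declared locally for self-containment, the 50000-tick period is an arbitrary illustrative value, and the ISR body is an assumption; check your part's driverlib for the exact API.

#include <stdint.h>

/* TI driverlib-style prototypes (normally from driverlib/systick.h). */
extern void SysTickPeriodSet(uint32_t period);
extern void SysTickIntRegister(void (*pfn)(void));
extern void SysTickIntEnable(void);
extern void SysTickEnable(void);

volatile uint32_t ticks;        /* shared with main, hence volatile */

void ISR(void) { ticks++; }     /* invoked on every SysTick interrupt */

void timer_setup(void)
{
    SysTickPeriodSet(50000);    /* 1. set the timer period (illustrative value) */
    SysTickIntRegister(&ISR);   /* 2. connect the ISR via function pointer */
    SysTickIntEnable();         /* 3. enable the SysTick interrupt */
    SysTickEnable();            /* 4. start the timer counting */
}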
How do you avoid the case where you run the ISR and immediately get interrupted again, repeatedly? Under what conditions would that be highly unlikely? Assumptions:
1. The time the ISR takes to execute is much, much shorter than the time between interrupts.
2. The time the main function takes.
3. The main function does not mess with the timer counter while running its loop.
The stack is used for procedure calls and for interrupts. The heap is used for storing persistent data that lives across procedure calls.
Execution time: bounded iteration, bounded recursion.
Minute details of embedded systems:
Do you have a cache? What is your cache replacement policy?
What are the contents of the cache when the task starts executing, and what is in the cache after the task is preempted and then resumed?
timer interrupt:
Timer: the input is a pulse train, e.g. 1.193180 MHz; output 0 connects to pin 0 of the interrupt controller and triggers a periodic interrupt. The period of this interrupt is what we call a "tick". So the timer interrupt is really a periodic signal generated entirely in hardware, and this signal triggers an interrupt service routine.
timer_interrupt() -> do_timer_interrupt() -> do_timer() / update_RTC()
do_timer() -> update_process_times()
timer_interrupt is called every 10 ms.
Every 11 min (i.e., 660 s), the RTC information is updated to synchronize the OS timer with the RTC.
update_process_times() does many jobs, such as updating the tick counter of the current thread.
interrupt installation
time_init() -> setup_irq()
setup_irq(0, &irq0) connects the interrupt routine into the interrupt queue, where it waits to be executed when the interrupt arrives.
Generally, every timer tick allows the timer interrupt to execute; the timer interrupt frequency is 100 times/sec. The main job of the timer interrupt is to handle all the time-related bookkeeping and to decide whether to run the scheduler and the bottom halves. The time-related information includes the system time, process time slices, delays, CPU usage time, and various timers; the updated time slices provide the evidence for process scheduling. Bottom halves either have a dedicated kernel thread per handler or are executed by a pool of kernel worker threads (sitting on a run queue).
Funny example: an 8-bit microcontroller with 16-bit addresses
In particular, the 32 general-purpose registers also have memory addresses in the address space. This is a very common design, though not universal. It means that if you want to access a register, you can do so with an ordinary memory access using the first 32 memory locations, and what you get is a register rather than main memory. The instruction set typically also has instructions that reference these registers directly, so you don't have to go through ordinary memory accesses.
Given that the registers appear in data memory, why are there only 32 of them? Register access is generally very fast: often a single instruction cycle can access multiple registers and operate on them. Why not provide more registers, so we can get faster programs that operate on more data?
You need 5 bits to distinguish between 32 registers. Where are those 5 bits going to be? Embedded in the instruction. How big is an instruction? This is an 8-bit microcontroller, so each memory access gives you 8 bits, and you'd like an instruction to take up only 8 bits so it can be fetched and executed in a single cycle; if you need 16 bits per instruction, it takes two cycles just to fetch. So 5 of those 8 bits specify the register being operated on, leaving only 3 bits, which allows only 8 such instructions. Actually, only 7: because the machine needs more than 8 instructions in total, one of the 3-bit patterns has to be the one that says "this is not an 8-bit instruction, it's a 16-bit instruction". So you can have at most seven 8-bit instructions that operate on a register. If you had 64 registers, you would have only 3 possible instructions that access a register. So there is a design trade-off: you'd like a big register file, but registers are expensive because they take up real estate in the instruction word.
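To make that arithmetic concrete, here is the hypothetical encoding the argument implies (not any real ISA):

opcode (3 bits) | register (5 bits)  -> 8 opcode patterns, 1 reserved as the
                                        "this is a 16-bit instruction" escape,
                                        leaving 7 usable 8-bit instructions
opcode (2 bits) | register (6 bits)  -> with 64 registers: 4 patterns, 1 escape,
                                        leaving only 3 usable instructions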
There are 64 I/O registers.
Friday, May 24, 2013
Secrets of the page table
32-bit machine, 4 GB address space, 4 KB pages: that's 1M pages. The page table itself needs somewhere to live; with 4-byte entries on a 32-bit machine, the page table is 4 MB. That is already quite big.
OK, suppose you say 4 MB isn't actually big. What about 64-bit, then? 2^64 / 2^12 = 2^52 page table entries; at 4 bytes each, that's a 2^54-byte page table. No computer is extravagant enough to afford a page table that large.
With a two-level page table, the top 10 bits index the first-level table, which holds 1024 entries. Each entry stores the base of a second-level table, so each entry effectively covers 4 MB of address space; it's really a hashing idea. For a typical program, entry 0 (the first 4 MB) might hold the page table for the text segment, entry 1 the one for the data segment, and the last entry the one for the stack. PTO stands for page table offset.
Let's do an example. Take the virtual address 0x00403004:
bits 31-22: 0000 0000 01
bits 21-12: 0000 0000 11
bits 11-0: 0000 0000 0100
So PTO1 = 1, PTO2 = 3, and OFFSET = 4. Simple enough.
Now for what the machine actually does. The MMU first uses PT1 = 1 to find the second-level page table. How? The first-level entry stores the base address of the L2 table; entry 1 (counting from 0) covers addresses 4M to 8M, so this gives us the start of one second-level table. Then PT2 = 3 selects the fourth row of that table. Which addresses does that row cover? Within that 4 MB region, 10 bits again give 1024 entries of 4 KB each: 0-4K, 4K-8K, 8K-12K, 12K-16K; right, 12K-16K. The actual entry address is that offset plus the base address stored in the PT1 entry; once you have the entry address you can read it. That entry holds the page frame number we're after; append the offset and you have the physical address.
To extend this: the Core i7 uses a four-level page table. The i7 supports 48-bit virtual addresses and a 52-bit physical address space (plenty big...).
Take 48 bits as the example: the i7 uses the top 36 bits as the virtual page number, and each of the four levels gets 9 bits as its index. The flag bits differ slightly at each level, but the idea is the same as the two-level table above. In the first three levels, bits 51-12 of each entry hold the base address of the next-level table; a fourth-level entry holds a 40-bit PPN, which combined with the 12-bit offset taken directly from the VPO gives us a 4 PB (52-bit) physical address space. Impressive, right? Remember: each entry here is 64 bits, of which 40 bits (bits 51-12) store the next-level base address or the PPN, bits 11-0 are assorted flag bits, bits 62-52 are unused, and bit 63 is XD. XD was introduced for 64-bit systems and can be used to disable instruction fetches from individual memory pages, reducing buffer overflow attacks by restricting execution to the read-only text segment. A small puzzle: why does each level get 9 bits? What's the page size, and how much space does one level of page table take? Answer: the page size is 4 KB or 4 MB (Linux uses 4 KB), and one level of page table is exactly one page. The reason for 9 bits: 4 KB = 2^9 * 8 bytes, since each 64-bit entry is 8 bytes. There; at least it hangs together.
On the implementation of page tables in the kernel:
Take handling a TLB exception in the trap handler, i.e., a TLB miss. What do you do? With demand paging you will definitely hit this, because when a program starts running its data is not actually loaded into memory yet. When the shell loads the program via execve, it deletes the child process's virtual memory segments and creates a fresh set of text, data, heap, and stack segments. This "creation" is not real allocation, only definition: things like the vnode and offset are initialized first, and the actual paging happens via exceptions when the memory is really used. So a fault address arrives and the exception says: kernel, please handle this. The kernel says fine, give me the address. Assuming a two-level page table, the kernel first peels off the top 10 bits:
(faultaddress & 0xffc00000) >> 22
which yields PTO1 = 1 in the example above, and
(faultaddress & 0x003ff000) >> 12
which yields PTO2 = 3.
From the current thread we can get the running process's address space struct, which includes the first-level page table base address and the stack base.
Now the key part: every process has a page table base address, the CR3 mentioned earlier (CR3 is not a perfect analogy here, since the i7 uses a four-level table). At initialization one page, 4096 bytes, is allocated to store the table, and the base address is simply the start of that page. On the i7, for instance, the CR3 register points to the beginning of the level-1 page table; the value of CR3 is part of each process context and is restored during each context switch.
as->as_pagetable = alloc_kpages(1);
Say the PT1 base address register points to 0x00080000. Then vaddr1, the address of the first-level entry we want, is
0x00080000 + PTO1*4 = 0x00080004
What's in this entry? The base address of the second-level page table. Why not another register for that? The kernel says: buddy, registers are a luxury; be content with the CR3 you got, and figure out the rest yourself.
vaddr2, the address of the second-level page table entry, is
*vaddr1 + PTO2*4
Now *vaddr2 is the page table entry itself, containing the 20-bit PPN we want; append the 12-bit offset and you have the physical address.
Why do we care about these two addresses? *vaddr1 is the address of the second-level entry: if it is nonzero, a second-level table exists. *vaddr2 is the content of the second-level entry, which holds the present/absent bit, the R/W bit, the U/S bit (which determines whether the page can be accessed in user mode, protecting kernel code and data from user programs), and, most importantly, the physical page number.
Back to the point: how do we check the present/absent bit? Dead easy. Read *vaddr2 and inspect the corresponding bit; in practice it is just an & with a mask.
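Putting those pieces together, here is a minimal sketch of the walk in C. The masks and the 4-byte entry size follow the two-level example above; PTE_PRESENT at bit 0 and the function itself are illustrative assumptions, not a real kernel's layout.

#include <stdint.h>

#define PTE_PRESENT 0x1u   /* assumed position of the present/absent bit */

/* Translate a 32-bit VA through a two-level table; 0 signals "page fault". */
uint32_t translate(uint32_t *pt1_base, uint32_t faultaddress)
{
    uint32_t pto1   = (faultaddress & 0xffc00000) >> 22;  /* top 10 bits */
    uint32_t pto2   = (faultaddress & 0x003ff000) >> 12;  /* next 10 bits */
    uint32_t offset =  faultaddress & 0x00000fff;         /* low 12 bits */

    uint32_t *vaddr1 = pt1_base + pto1;                   /* first-level entry */
    if (*vaddr1 == 0)
        return 0;                                         /* no second-level table */

    uint32_t *vaddr2 = (uint32_t *)(uintptr_t)*vaddr1 + pto2; /* second-level entry */
    if (!(*vaddr2 & PTE_PRESENT))
        return 0;                                         /* page not resident */

    return (*vaddr2 & 0xfffff000) | offset;               /* 20-bit PPN + 12-bit offset */
}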
We saw above that a page table entry doesn't just store the physical page number; there are other flag bits too, e.g. the present/absent bit, which tracks which pages are resident in memory, i.e., whether this virtual address is cached in DRAM. Always hold on to one conviction: the kernel goes out of its way to create the illusion that every process has 4 GB of space, when in fact many of a process's virtual addresses are hollow, with no mapping in physical memory at all. Only the mapped ones do real work. So if this bit is 0, that virtual page has not been assigned a physical page.
A TLB entry also has a valid bit, which says whether the entry holds a valid translation. If it is 1, great: the entry has the physical frame number we want. The TLB is rather impressive: each entry stores a virtual page number together with its physical frame number, so once you match, you get the frame directly without walking any tables. This differs from a page table, whose entries store no virtual page number; the VPN is just the index used to find the entry. So the TLB is a luxury item, and luxury has its price: TLBs are usually tiny, say 8 or 16 entries, 64 at most. Small indeed. A TLB entry also has protection bits, set when the corresponding page changes: a text-segment page is read-and-execute (R X), a heap page is read-and-write (R W). You may ask: how is the TLB actually searched? That's the MMU's job. When a virtual address arrives, the hardware compares the virtual page number; how? Impressively, against all entries in parallel. The interesting part is what happens when there's no match: that's a miss, so the MMU does an ordinary page table lookup, then updates the TLB, adjusting the relevant bits (the modified bit, for instance).
And with that, multilevel page tables hold no more secrets. One last thing: an address translation optimization.
We've gone on about address translation for a while; the goal is to end up with a physical address. What for? To go fetch data from the cache or from memory!
So the MMU's job is:
1. Somehow turn the virtual address the CPU hands it into a physical address.
2. Hand that physical address to the L1 cache.
We've been analyzing the top 20 bits (32-bit) or top 36 bits (48-bit) of the virtual address, but don't forget the trailing 12-bit offset, which never changes: after all its work fetching the PPN, the MMU still has to dutifully append those 12 offset bits.
For the i7, each cache address is 64 bits, giving 2^64 distinct addresses that we use to look up data. Suppose the L1 cache is 8-way associative; then there are 64 sets, each set has 8 cache lines, and each line holds a 64-byte block along with a valid bit and some tag bits. Each physical address then has 6 cache offset bits and 6 index bits; the 6 is no coincidence, since log2(64) = 6. The index bits (CI) select the set, and the offset bits (CO) select the byte within the block.
After all that, the neat trick we were building up to: the 12-bit offset taken from the virtual address is exactly CI + CO, also 12 bits. This is no accident! While the MMU is busy finding the PPN, those 12 bits would just sit idle, so the CPU sends the VPN off to the MMU and hands the 12-bit VPO to the L1 cache to get started: it finds the appropriate set and reads out the 8 tags with their data. When the MMU finally delivers the physical address, the cache takes the top 20 bits as the tag and matches them against those 8 tags (assuming a 32-bit physical address space; for 48 or 52 bits, the tag width is that minus 12).
32-bit address
64 bytes/block
8 blocks/set
Number of sets: 64
Number of blocks: 64*8 = 512
bits of index = log2(64) = 6
bits of offset = log2(64) = 6
bits of tag = 32-6-6 = 20
cache capacity = 512*64 = 32 KB
bits 31-12 | bits 11-6 | bits 5-0
tag | set index | offset
offset = 100 (binary) indicates that the copy of w starts at byte 4 in the block (assuming 4-byte words).
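A quick sketch of that decomposition in C, using the example's parameters (6 offset bits, 6 index bits, 20 tag bits; these come from the worked example above, not from any particular CPU):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t pa  = 0x12345678;        /* an arbitrary example physical address */
    uint32_t co  = pa & 0x3f;         /* bits 5-0: byte offset within the block */
    uint32_t ci  = (pa >> 6) & 0x3f;  /* bits 11-6: set index */
    uint32_t tag = pa >> 12;          /* bits 31-12: tag */
    printf("tag=0x%05x set=%u offset=%u\n",
           (unsigned)tag, (unsigned)ci, (unsigned)co);
    return 0;
}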
---------------------------
Below are some thoughts on C program memory allocation.
Why do we give static variables a separate space?
Static allocation is the most primitive, simplest, and safest allocation pattern for variables: when the program is compiled, each such variable is assigned a unique, fixed location. (Which is cool...)
So questions like "how much space do the variables need?", "is the current space sufficient?", and "where should I put the variables?"
all have explicit answers from the start.
So the BSS and DATA segments are like the free apartments the government offers its officials in China.
So, where are local variables stored?
On the stack. The stack is like a rental house: it uses the space as efficiently as possible. When a function is called, some space is allocated to it; when it returns, that space is released. It is the stack that makes recursion possible.
Where does a segmentation fault come from?
From the stack! Stack overflow leads to segmentation fault. Continuous pushing eventually causes a page fault; then clever Linux finds some extra space to grow the stack. But once a limit is reached (RLIMIT_STACK, 8 MB), Linux really cannot cope. Then... stack overflow -> segmentation fault.
One more small question: will the grown stack ever shrink back?
No... It is like a government budget, which only ever increases...
Thursday, February 28, 2013
cache, virtual memory, TLB, paging
CPI: clock cycles per instruction
CPU time = CPU cycle count * cycle time = instruction count * CPI * cycle time = instruction count * CPI / frequency
Why do we need a cache?
Registers are too expensive and too few; DRAM is cheap but slow. So we need a medium between them to make the machine faster, one that stores the most active data.
How do we get data when running a program?
Most data is in registers; if we need data that isn't in a register, we try the cache first, then RAM, then finally disk.
What does a cache look like?
It is made of SRAM and sits in the CPU. It is a partial image of memory, consisting of many blocks, usually of the same size.
How is memory mapped into the cache?
(block address) mod (number of blocks in cache)
We use part of the address as the block index.
e.g., if there are 8 blocks, we use the lower three bits as the block index.
So the memory addresses 00001, 01001, 10001, 11001 all map to cache block 001.
So each cache position may correspond to multiple memory addresses (see the sketch below).
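A tiny C demonstration of that mapping; the 8-block direct-mapped cache is the example's own assumption:

#include <stdio.h>

int main(void)
{
    /* 8-block direct-mapped cache: index = block address mod 8,
       i.e. the low three bits of the block address. */
    unsigned addrs[] = {0x01, 0x09, 0x11, 0x19};  /* 00001, 01001, 10001, 11001 */
    for (int i = 0; i < 4; i++)
        printf("block address %#04x -> cache block %u\n", addrs[i], addrs[i] % 8);
    return 0;  /* all four land in cache block 1 (binary 001) */
}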
How do we know whether the data in a given cache block is what we need, i.e., whether the requested data is in the cache at all?
We use the higher bits of the address to verify.
The 32-bit address consists of three parts: tag, index, and offset.
The index is used to choose the block;
the tag is compared with the tag stored in the cache.
A cache line consists of three parts: valid bit, tag, and data.
So the total number of bits in the cache is larger than its data capacity.
Right after CPU initialization, when there is no data in the cache, do we still need to compare tags?
No: an extra valid bit says whether the block holds a valid address. If it is N, we do not read the block's contents.
This type of miss is called a cold miss.
What happens when the index is the same but the tag is different?
The most recent address replaces the previous one (in a direct-mapped cache).
What can we do if the cache is full?
We replace an old block with the new one.
How can we guarantee consistency between cache and memory when we write data?
Intuitively, we can update memory whenever we update the cache, which is called write-through.
Con of write-through: write operations cost too much time.
Solution: a write buffer.
Con of the write buffer: it can still stall, when the write buffer is full.
Solution: write-back. On a write, the new value is written only into the cache; only when the written block is evicted for new data do we write it back to memory. A dirty bit identifies such blocks.
Con: it takes two cycles: first check whether it is a cache miss, then do the real operation.
Solution: a write buffer can be used with the write-back mechanism as well.
Can we remove the index from the mapping?
Yes. If we do not use an index, we must search all blocks in the cache; this is called a fully associative cache, and memory data can be placed in any cache block.
Con: to make a fully associative cache efficient, we need a comparator per block. So it is only practical when the number of blocks is small.
Between fully associative and direct-mapped is the set-associative cache: each block has n possible positions.
First a block is mapped to a set, then all elements in the set are searched.
It can decrease cache misses but also increases hit time.
Problem: on a miss, we need to choose an item to replace. The usual rule is to choose the least recently used item (LRU).
Replacement algorithms:
FIFO
LRU
LFU
Random
What type of cache do we use for L1 and L2, and why?
Direct mapping for L1, since hit time matters more than hit rate: less hit time means fewer clock cycles.
Set-associative for L2, since hit rate matters more: we need to reduce the cost of going to memory, so we need a high hit rate.
What is the motivation for virtual memory?
To allow multiple programs to share memory efficiently and securely,
and to allow a program to use more memory than physically exists.
Does every program really have that much memory?
No! That's the illusion the OS offers us. Each program can read and write only the part of memory allocated to it, and memory only needs to hold the most active parts, just like a cache.
Is virtual memory similar to a cache?
Yes! A block here is called a page; a cache miss is called a page fault; the transfer of pages between memory and disk is called swapping or paging. There is a mapping between virtual addresses and physical addresses.
Is a page fault expensive? What should we do about it?
Yes! A page fault can cost millions of clock cycles, so we must try our best to decrease the miss rate.
Solutions:
place pages fully associatively;
use a write-back mechanism;
make pages big enough: 4 KB-16 KB.
Fully associative placement makes addresses hard to locate. How does VM improve on this?
We use a page table. It lives in memory, and each process has a separate one.
What is the size of the page table, if the page size is 4 KB, each entry is 4 bytes, and addresses are 32 bits?
Number of pages = 2^32/2^12 = 2^20
Page table size = 2^20 * 4 = 4 MB
So each process has a 4 MB page table; with multiple processes, too much memory goes to just storing page tables.
Solutions:
inverted page tables,
multilevel page tables,
etc.
Can we speed up page table lookups?
Yes: the TLB. It stores part of the page table; it is like a cache of PTEs. It uses fully associative placement and a random replacement algorithm.
So assume the CPU has a VA. First we search the TLB for the physical address; on a hit we go to the cache for the data, and on a cache miss, to memory. On a TLB miss we turn to the page table for the physical address; if the page is present, we proceed to the cache as before; if not, we get an exception: a page fault. (A sketch of this flow follows.)
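That flow reads naturally as code. A minimal sketch in C; tlb_lookup, walk_page_table, raise_page_fault, and cache_or_memory_read are hypothetical stand-ins for hardware and OS machinery, not a real API.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers standing in for the hardware/OS machinery. */
extern bool tlb_lookup(uint32_t vpn, uint32_t *pfn);
extern bool walk_page_table(uint32_t vpn, uint32_t *pfn);
extern void raise_page_fault(uint32_t va);
extern uint32_t cache_or_memory_read(uint32_t pa);

uint32_t load(uint32_t va)
{
    uint32_t vpn = va >> 12, offset = va & 0xfff, pfn;

    if (!tlb_lookup(vpn, &pfn)) {           /* TLB miss: walk the page table */
        if (!walk_page_table(vpn, &pfn)) {  /* not present: page fault */
            raise_page_fault(va);           /* OS brings the page in... */
            walk_page_table(vpn, &pfn);     /* ...then the walk succeeds */
        }
    }
    return cache_or_memory_read((pfn << 12) | offset);  /* L1 first, then memory */
}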
Actually, an intuitive alternative is a single page table held in an array of fast hardware registers, with the VA as the index. When a process is started up, the OS loads the registers from the process's page table in memory; during process execution, the mapping needs no memory references.
Cons: loading the full page table at every context switch hurts performance; moreover, if the page table is large, all those registers are expensive.
Why do we need a TLB? Why do we need to speed up paging?
1. The mapping from VA to PA must be fast.
2. If the VA space is large, the page table will be large.
All instructions ultimately come from memory, and many of them reference operands in memory as well, so each instruction requires 1, 2, or more page table references. If executing an instruction takes 1 ns, a page table lookup must be done in under 0.2 ns to keep the mapping from becoming the major bottleneck.
Each time the CPU generates a VA, the MMU needs to look up a PTE to do the address translation. In the worst case this is a memory operation costing hundreds of clock cycles, so we put a small buffer (the TLB) in the MMU to save that cost.
Does the TLB have the same entry structure as the page table?
No! A page table entry does not contain the virtual page number, which is only the index used to search the table; a TLB entry does. The TLB sits inside the MMU and consists of a small number of entries, eight for example, and rarely more than 64.
Each TLB entry contains a virtual page number, a modified bit, protection bits, the physical page frame, and a valid bit.
TLB lookup: the hardware checks the TLB by comparing the address against all the entries simultaneously (i.e., in parallel).
The exact TLB entry format differs from CPU to CPU.
From RB, the VA consists of the TLB index (the lower t bits of the virtual page number), the TLB tag (the remaining bits of the virtual page number), and the virtual page offset. So the TLB index is not stored in the TLB entry, but the TLB tag is, since the TLB can be set-associative: the index finds the set, and the tag then finds the entry within it.
What happens if the virtual page number is not in the TLB?
That's the interesting part! The MMU detects the miss and does an ordinary page table lookup. It then evicts one TLB entry and replaces it with the page table entry just looked up. If the page is used again soon, that access will be a TLB hit.
When an entry is purged from the TLB, its modified bit is copied back into the page table entry in memory. Another bit that needs updating is the reference bit, which helps the OS choose a page to evict when a page fault occurs; the reference bit is important for page replacement algorithms.
Is a flat page table still practical on a 64-bit system?
No! We use two alternatives: multilevel page tables and inverted page tables.
First 10 bits for PT1, the next 10 bits for PT2, and then the offset. So each of the 1024 PT1 entries represents 4 MB.
e.g., for an address space using a million pages, only 4 page tables are actually needed: the top-level table, a second-level table for 0-4M holding the program text, one for 4M-8M holding data, and one for the top 4M holding the stack.
BUT... a 64-bit system might need 5 levels of page table... We can save space and time by inverting the page table: mapping physical to virtual, instead of the other way 'round.
In a traditional page table, each process has its own table, and page table entries map VA to PA.
But now we can have a single inverted page table for the whole system, whose entries map physical frames to (process ID, virtual page number) pairs.
Intuitively, we can use the pattern below to design a linear IPT.
The PA is not stored, since the index into the IPT corresponds to it. The virtual page number and the current process ID are compared against each entry, traversing the array sequentially; when a match is found, the index of the match replaces the virtual page number in the address to obtain the physical address.
Problem: we need an efficient search strategy.
Solution: HASH!
Hash Anchor Table
Each inverted PTE has to contain the fields below (a lookup sketch follows the two lists):
- The process ID of the owning process.
- The virtual page number.
- A pointer to the next IPTE in the hash chain.
- The normal protection, valid, modified, and referenced bits.
Lookup then goes:
- Hash the process ID and virtual page number to get an index into the HAT.
- Look up a physical frame number in the HAT.
- Look at that inverted page table entry to see whether it has the right process ID and virtual page number. If it does, you're done.
- If the PID or VPN does not match, follow the pointer to the next link in the hash chain. Again, if you get a match you're done; if you don't, you continue. Eventually you either get a match or find a pointer marked invalid. A match gives you the translation; the invalid pointer means a miss.
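A minimal sketch of that lookup in C; the structures, sizes, and hash are illustrative assumptions, not any particular OS's layout.

#include <stdint.h>

#define HAT_SIZE 1024            /* assumed hash anchor table size */
#define INVALID  0xffffffffu     /* end-of-chain marker */

struct ipte {                    /* one entry per physical frame */
    uint32_t pid;                /* owning process ID */
    uint32_t vpn;                /* virtual page number */
    uint32_t next;               /* index of the next IPTE in the hash chain */
    uint32_t flags;              /* protection / valid / modified / referenced */
};

extern struct ipte ipt[];        /* inverted page table, indexed by frame number */
extern uint32_t hat[HAT_SIZE];   /* hash anchor table: heads of the chains */

/* Returns the physical frame number, or INVALID on a miss. */
uint32_t ipt_lookup(uint32_t pid, uint32_t vpn)
{
    uint32_t i = hat[(pid ^ vpn) % HAT_SIZE];  /* illustrative hash function */
    while (i != INVALID) {
        if (ipt[i].pid == pid && ipt[i].vpn == vpn)
            return i;            /* the IPT index is itself the frame number */
        i = ipt[i].next;         /* follow the chain */
    }
    return INVALID;              /* miss: fall through to page fault handling */
}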
Thursday, October 4, 2012
gnuplot for Fedora 16
Download the newest version here.
tar xvzf gnuplot-4.6.1.tar.gz
./configure
make
su
make install
gnuplot
------------------------------------------------
It should say
gnuplot>
congrats!
Tuesday, October 2, 2012
L1 BLAS MKL
I encountered a weird problem with ddot_(), which needs to be declared before use; if it isn't, the cc compiler doesn't even report an error.
Besides that problem, finding the right link and compile options for MKL is far more time-consuming. We can obtain the link and compile options from the official Intel website here, but make sure to check your own environment to make MKL work. I highly recommend reading the MKL manual if you have time. Basically, check that the -L path contains the .a files for MKL and that each -lxxxx corresponds to a libxxx.a under the -L path.
This is what I got from the official tool:
-L$(MKLROOT)/lib/intel64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread
For me, I changed:
$MKLROOT to $MKL
lib/intel64 to lib/em64t
-lmkl_intel_ilp64 to -lmkl_intel_lp64
and added -lm.
The reason I add -lm is that MKL also needs the math library; otherwise you get errors typical of a missing math module, such as:
/util/intel/mkl/10.2.2.025/lib/em64t/libmkl_core.so: undefined reference to `cos'
/util/intel/mkl/10.2.2.025/lib/em64t/libmkl_core.so: undefined reference to `sqrt'
/util/intel/mkl/10.2.2.025/lib/em64t/libmkl_sequential.so: undefined reference to `log'
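So the full link line on a setup like mine would look roughly like this (the source file name is illustrative, and the paths must match your own environment):
gcc -g ddot_test.c -o ddot_test -L$MKL/lib/em64t -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm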
------------------------------------------------------
Sometimes we need to mv files into a folder that sits in the same directory as those files. With a mouse this task is easy to handle, but in Linux it is tricky...
My method is to give the folder a name starting with an upper-case letter that is unique in the directory, then use this handy shell pattern:
mv [![:upper:]]* <folder name>
Similar patterns:
[![:digit:]]*
*[[:lower:]123] : ends with a lower-case letter, 1, 2, or 3
[abc]* : starts with 'a', 'b', or 'c'
Another handy shell command is script:
script -a blas.log
Tuesday, September 18, 2012
High Performance Computing
MPI & OpenMP are used for parallel computing. Simply put, MPI is mainly for multiple machines, while OpenMP is for multiple cores/CPUs on a single machine.
I found a good explanation here.
I am so excited to do something practical in parallel computing rather than theoretical models... really looking forward to it.