Lab 4 Reflection
BUAA 2023 Spring OS
Prologue
In this lab, we’ll take our first look at system calls.
A “Call” to Answer
For security reasons, certain instructions cannot be executed by a user process. However, user processes still need to get certain tasks done. Therefore, we have to “call” the kernel to do them for us. And we, again, as the “kernel”, have to answer such a “call”.
Well, this process is easy to understand since… PassBash also uses this pattern.
User Call
As users, we have a “phone book” of functions to call: user/lib/syscall_lib.c. It holds a lot of functions with the prefix syscall_, each indicating a system call. When we want a service, we just call the corresponding system call.
But how do we really call the kernel? We do this by executing the syscall instruction. Executing it traps into the kernel immediately, which makes the kernel answer. To keep things simple, we wrap all system calls in one unified interface, the so-called msyscall.
1 | // An example system call for user. |
As for msyscall, it just executes syscall to trap into the kernel.
1 | LEAF(msyscall) |
Kernel Answer
When the user calls, we, the kernel, answer. syscall actually raises exception No. 8, which is handled by handle_sys, so our answer starts from this specific function. However, before we actually get into the exception handler, we first jump to the exception entry, which is at a fixed address on MIPS. We set the exception entry point in kernel.lds, and the exception entry function is loaded there.
1 | # kernel.lds |
This will be executed every time there is an exception.
1 | # kern/entry.S |
So, what is exception_handlers? We defined it previously in kern/traps.c.
1 | void (*exception_handlers[32])(void) = { |
So we just jump to these specific handlers to handle our exceptions. You may notice that the CP0.Cause register is used to select the handler, so what is it?
By andi t0, 0x7c, we extract the ExcCode bits from it, which indicate the type of exception to handle.
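Notice that in exception_handlers(t0), t0 is not an index as in C. Instead, it is a raw byte offset! ExcCode sits in bits 2 to 6 of CP0.Cause, so after masking, the lowest 2 bits are zero and the offset is already a multiple of 4, the size of one handler pointer.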
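To make the offset-versus-index point concrete, here is a minimal sketch in plain C. The helper names are made up for illustration, and it assumes the MIPS32 layout where ExcCode occupies bits 2 to 6 of CP0.Cause and each table entry is a 4-byte pointer.

```c
#include <stdint.h>

/* Assumption: ExcCode occupies bits 6..2 of CP0.Cause (MIPS32 layout). */
#define CAUSE_EXCCODE_MASK 0x7cu

/* What `andi t0, 0x7c` leaves in t0: a byte offset into a table of
 * 4-byte handler pointers. (Hypothetical helper name.) */
static uint32_t handler_offset(uint32_t cause) {
    return cause & CAUSE_EXCCODE_MASK;
}

/* The equivalent C array index: the offset divided by the entry size. */
static uint32_t handler_index(uint32_t cause) {
    return handler_offset(cause) >> 2;
}
```

For a system call, ExcCode is 8, so the offset is 32 bytes while the C index would be 8.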
Exception Handler
Now that we can jump to exception handlers, let’s have a closer look at one of them, handle_sys for example. It is declared using a macro, and the function that does the actual work is do_syscall.
1 | # kern/genex.S |
And here is the declaration of do_syscall in kern/syscall_all.c.
1 | void do_syscall(struct Trapframe* tf); |
Since all registers (including pc, sp, etc.) are saved when an exception is triggered, we can get the parameters the user passed to us from the corresponding Trapframe. More specifically, one register holds the type of the call, and five more slots hold the actual parameters. For the type, the kernel also has a table of handlers in kern/syscall_all.c.
1 | void* syscall_table[MAX_SYSNO] = { |
Register $a0 indicates the type of the call, while $a1 to $a3 plus two more words on the stack are the real parameters. With these, we can pick the correct handler for the call and pass the parameters to it.
In fact, the user can pass at most 6 values to a system call. However, the first one is the type of the call rather than a real parameter.
After handling, we set $v0 ($2) to the return value of the system call to send it back to the user.
1 | tf->regs[2] = func(arg1, arg2, arg3, arg4, arg5); |
Here is the complete code for do_syscall.
do_syscall
1 | /* Overview: |
Notice that, for the user, a system call starts with syscall_. It eventually calls the real system call in kernel mode, which starts with sys_.
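If you’d like to play with the dispatch logic outside the kernel, here is a rough user-space model of it. Everything prefixed with mock_ or MOCK_ is invented for this sketch, and the stack layout (arguments 4 and 5 in the fifth and sixth words at $sp) is an assumption based on the usual o32 convention, not copied from MOS.

```c
#include <stdint.h>

/* Hypothetical, cut-down trapframe: general registers plus the saved PC. */
struct MockTrapframe {
    uint32_t regs[32];
    uint32_t cp0_epc;
};

/* MIPS register numbers used below. */
enum { REG_V0 = 2, REG_A0 = 4, REG_A1 = 5, REG_A2 = 6, REG_A3 = 7 };

#define MOCK_MAX_SYSNO 1
#define MOCK_E_NO_SYS 5 /* made-up error code */

typedef int (*mock_syscall_fn)(uint32_t, uint32_t, uint32_t, uint32_t, uint32_t);

/* A toy "system call" that just adds its five arguments. */
static int mock_sys_sum(uint32_t a1, uint32_t a2, uint32_t a3, uint32_t a4, uint32_t a5) {
    return (int)(a1 + a2 + a3 + a4 + a5);
}

static mock_syscall_fn mock_syscall_table[MOCK_MAX_SYSNO] = { mock_sys_sum };

/*
 * `user_stack` stands in for the user memory that $sp points to.
 * Assumption (o32 convention): words 0..3 are reserved for $a0..$a3,
 * so the 4th and 5th real arguments sit in words 4 and 5.
 */
static void mock_do_syscall(struct MockTrapframe *tf, const uint32_t *user_stack) {
    uint32_t sysno = tf->regs[REG_A0];
    if (sysno >= MOCK_MAX_SYSNO) {
        tf->regs[REG_V0] = (uint32_t)(-MOCK_E_NO_SYS);
        return;
    }
    tf->cp0_epc += 4; /* resume after the syscall instruction, not on it */
    mock_syscall_fn func = mock_syscall_table[sysno];
    tf->regs[REG_V0] = (uint32_t)func(tf->regs[REG_A1], tf->regs[REG_A2],
                                      tf->regs[REG_A3], user_stack[4], user_stack[5]);
}
```

The important part is the shape: read the call type from $a0, look up the handler, gather five arguments, and write the result into $v0 of the saved frame so the user sees it after returning.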
Basic System Calls
Before we get into some fancy system calls, let’s have a look at some fundamental ones. They are all located in kern/syscall_all.c.
Identify Process
To know which process to manipulate, we use the process id, which in MOS is Env::env_id. We simply get an Env from an envid via the envid2env function, located in kern/env.c. If checkperm is set, we can only get an Env we have permission over, which depends on its relation to the current Env.
1 | int envid2env(u_int envid, struct Env** penv, int checkperm); |
Notice that if envid is zero, this function degenerates into simply getting the current Env! By the way, this is why Env::env_id is never zero, since zero indicates the current Env.
Here is the complete definition of it.
envid2env
1 | /* Overview: |
For this line, we check e->env_id against envid. But why, when we already have an Env?
1 | if (e->env_status == ENV_FREE || e->env_id != envid) |
We have 1024 Envs, and the low 10 bits of an envid give the index of the Env relative to the base address envs. Since more than 1024 Envs can be created over time by reusing slots, the high 22 bits are there to keep each envid unique.
Here, we get e from the lowest 10 bits of envid and ignore the high bits, so two different envids with the same low 10 bits lead to the same slot. A mismatch probably means the envid refers to an out-of-date Env, or it is simply a bad envid.
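Here is a small sketch of that stale-id check. The id generator below (a generation counter shifted above the 10 slot bits) is only an assumption to make the example work; the real kernel may build its ids differently, and all names here are hypothetical.

```c
#include <stdint.h>

#define SK_LOG2NENV 10
#define SK_NENV (1u << SK_LOG2NENV)

struct SkEnvSlot {
    uint32_t env_id;
    int in_use;
};

/* Slot index inside the env array: the low 10 bits of the id. */
static uint32_t sk_env_slot(uint32_t envid) {
    return envid & (SK_NENV - 1);
}

/* Hypothetical id generator: generation counter in the high bits,
 * slot index in the low 10 bits. */
static uint32_t sk_make_envid(uint32_t generation, uint32_t slot) {
    return (generation << SK_LOG2NENV) | slot;
}

/* Returns 0 and stores the slot on success, -1 for a stale or bad id. */
static int sk_lookup(struct SkEnvSlot *envs, uint32_t envid, struct SkEnvSlot **out) {
    struct SkEnvSlot *e = &envs[sk_env_slot(envid)];
    if (!e->in_use || e->env_id != envid) {
        return -1;
    }
    *out = e;
    return 0;
}
```

Two ids made for the same slot in different generations share their low 10 bits, yet only the one currently stored in the slot passes the comparison.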
Memory Management
Memory management is important to the kernel, no doubt about it.
sys_mem_alloc
Well, just like malloc, this function asks the kernel to allocate a physical page, so that access to the given virtual address becomes legal.
1 | int sys_mem_alloc(u_int envid, u_int va, u_int perm); |
There’s not much to say about this one. It just allocates a new page and uses page_insert to map it at the given va. One subtle point: any page previously mapped at va will be unmapped (see page_insert).
sys_mem_alloc
1 | /* Overview: |
You may be curious about is_illegal_va. It just checks whether va lies outside the user’s address space.
1 | static inline int is_illegal_va(u_long va) |
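As a rough idea of what such a range check looks like, here is a sketch with a made-up name. The two boundary constants are assumptions chosen for illustration only, so check include/mmu.h for the values your kernel really uses.

```c
#include <stdint.h>

/* Assumed boundaries of the user-accessible range, for illustration only. */
#define SKETCH_UTEMP 0x003fe000u
#define SKETCH_UTOP  0x7f400000u

/* Returns 1 when va falls outside [SKETCH_UTEMP, SKETCH_UTOP), else 0.
 * (Hypothetical helper, not the kernel's own function.) */
static int sketch_va_is_illegal(uint32_t va) {
    return va < SKETCH_UTEMP || va >= SKETCH_UTOP;
}
```

With these assumed bounds, address 0 is rejected, which is handy for any interface that wants to use 0 as a “no address” marker.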
Breaking News! Here is how page_insert works. Basically, it just fills the address field of the page table entry for va with the physical page number of the page, so that we can use the physical memory through this va. It can be divided into two steps.
- First, it ensures that a page table entry for va exists. If the entry exists but maps a different page, it removes that mapping. If it already maps the given page, it flushes the TLB and resets the permission. If the entry doesn’t exist, it creates one.
- Then, it fills the page table entry with the physical address of the page.
Important! When we talk about mapping or un-mapping a page to virtual memory, we simply fill or clear the page table entry of the given virtual memory.
Again! page_insert does not create a new page! It only writes the physical address of the given page into a specific page table entry. The same goes for page_remove.
For a more detailed understanding of virtual memory and page table entries, you can refer to my previous post: Page Directory Self Mapping.
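To see that mapping is only bookkeeping, here is a toy model with a flat array standing in for the page table and a reference count on each physical page. All toy_ names are hypothetical, and TLB flushing and on-demand page table allocation are left out on purpose.

```c
#include <stddef.h>

#define TOY_NPAGES 8 /* virtual pages per toy address space */

/* A physical page, tracked only by its reference count. */
struct ToyPage {
    int pp_ref;
};

/* One page table entry; a NULL page means "not mapped". */
struct ToyPte {
    struct ToyPage *page;
    unsigned perm;
};

/* Clear the entry for vpn and drop one reference to the page it held. */
static void toy_page_remove(struct ToyPte *pgtable, size_t vpn) {
    if (pgtable[vpn].page != NULL) {
        pgtable[vpn].page->pp_ref--;
        pgtable[vpn].page = NULL;
        pgtable[vpn].perm = 0;
    }
}

/* Point the entry for vpn at pp. No page is created here. */
static void toy_page_insert(struct ToyPte *pgtable, struct ToyPage *pp,
                            size_t vpn, unsigned perm) {
    if (pgtable[vpn].page == pp) {
        pgtable[vpn].perm = perm; /* same page: only reset the permission */
        return;
    }
    toy_page_remove(pgtable, vpn); /* a different page was here: unmap it */
    pgtable[vpn].page = pp;
    pgtable[vpn].perm = perm;
    pp->pp_ref++;
}
```

Inserting the same page into two tables simply raises its reference count to 2, which is all that “sharing a page” means in this model.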
sys_mem_map
This is a tricky function. It takes the physical page that is mapped at virtual address srcva in the source Env with id srcid, and maps it at virtual address dstva in the destination Env with id dstid. That is to say, it makes the two Envs share the same page.
1 | int sys_mem_map(u_int srcid, u_int srcva, u_int dstid, u_int dstva, u_int perm); |
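The effect of such a mapping can be sketched with two toy address spaces that simply hold pointers to frames. The share_ names are hypothetical, and permission and address checks are reduced to the bare minimum.

```c
#include <stddef.h>
#include <stdint.h>

#define SHARE_PAGE_SIZE 4096
#define SHARE_NPAGES 4

/* A toy address space: each virtual page points at a frame or is NULL. */
struct ShareSpace {
    uint8_t *frames[SHARE_NPAGES];
};

/* Make dst_vpn in dst refer to the frame mapped at src_vpn in src.
 * Returns 0 on success, -1 if an index is bad or the source is unmapped. */
static int share_mem_map(struct ShareSpace *src, size_t src_vpn,
                         struct ShareSpace *dst, size_t dst_vpn) {
    if (src_vpn >= SHARE_NPAGES || dst_vpn >= SHARE_NPAGES) {
        return -1;
    }
    if (src->frames[src_vpn] == NULL) {
        return -1;
    }
    dst->frames[dst_vpn] = src->frames[src_vpn];
    return 0;
}
```

After the call, a write through one space is visible through the other, because both entries refer to the very same frame.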
sys_mem_map
Here is its implementation.
1 | /* Overview: |
sys_mem_unmap
Well, sys_mem_unmap is similar to its map brother. It just un-maps the physical page at the given virtual address, so the address is no longer valid to access.
1 | int sys_mem_unmap(u_int envid, u_int va); |
sys_mem_unmap
Quite simple, really.
1 | /* Overview: |
Schedule
Scheduling is also an important part of an OS. To yield is to give up the CPU, because the process cannot continue or for some other reason. It is simple: just make the kernel schedule again.
1 | /* Overview: |
Notice that it will not return, since schedule never returns.
Inter-Process Communication
Well, IPC, huh? The fundamental support of IPC lies within Env, so let’s have a look.
1 | struct Env |
As the receiver, we can decide whether to receive messages or not. For example, we can set Env::env_ipc_recving to 1, which indicates that we are receiving, or to 0 to block any message and make the sender’s request fail. Plus, we need Env::env_ipc_from to know the source of the message.
If only a value is being transferred, Env::env_ipc_value is all we need, and Env::env_ipc_dstva is set to 0. Otherwise, a valid virtual address is required.
As the sender, we set the receiver’s Env::env_ipc_value and Env::env_ipc_from and, if there is a page to send, map the message page at the receiver’s Env::env_ipc_dstva.
Now, let’s see some core functions of IPC.
sys_ipc_recv
This system call makes the current process start receiving a message. It blocks the current process from continuing until it receives the desired message. Here is its declaration.
1 | int sys_ipc_recv(u_int dstva); |
The caller of this system call is the current process. If we only want to receive a value, we simply set dstva to 0. Otherwise, we should set it to our desired virtual address. Then, since the process will be blocked, we set its status to ENV_NOT_RUNNABLE and remove it from env_sched_list. At last, we set the default return value 0 in $v0.
sys_ipc_recv
1 | /* Overview: |
sys_ipc_try_send
He’s waiting for us! Let’s send him a message! It is quite straightforward, taking all the parameters the receiver needs. Notice that this is a ‘try’ send, because sending may fail if the target process is not receiving, or due to other errors.
1 | int sys_ipc_try_send(u_int envid, u_int value, u_int srcva, u_int perm); |
Since the receiver was suspended while waiting for a message, we need to wake him up after we send it. If we also send a page, we should map our page at the receiver’s Env::env_ipc_dstva using page_insert. Finally, we set his Env::env_ipc_recving to 0 to mark completion.
Just as with page_insert in general, we only link the sender’s page into the receiver’s page table entry. Nothing more.
One more thing: in this system call, we do not check permission in envid2env, since messages can be exchanged between processes with no relation to each other.
sys_ipc_try_send
1 | /* Overview: |
You may wonder, what if the receiver’s Env::env_ipc_dstva is 0 and that is a valid virtual address? Haha, don’t worry. You can check include/mmu.h: addresses from 0x0 to UTEMP, a range larger than one page, are invalid memory. :P
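The whole receive and try-send handshake can be modelled in a few lines of user-space C. This is only a sketch: the struct keeps just the IPC fields discussed above, page transfer is left out, and the ipc_ names and the error code are made up.

```c
#include <stdint.h>

#define IPC_E_NOT_RECV 1 /* hypothetical error code */

enum IpcStatus { IPC_RUNNABLE, IPC_NOT_RUNNABLE };

/* Cut-down process control block with only the IPC-related fields. */
struct IpcEnv {
    uint32_t env_id;
    enum IpcStatus env_status;
    uint32_t env_ipc_value;
    uint32_t env_ipc_from;
    uint32_t env_ipc_dstva;
    int env_ipc_recving;
};

/* Receiver side: announce that we are waiting, then block. */
static void ipc_recv(struct IpcEnv *self, uint32_t dstva) {
    self->env_ipc_recving = 1;
    self->env_ipc_dstva = dstva;
    self->env_status = IPC_NOT_RUNNABLE;
}

/* Sender side: fails unless the receiver is currently waiting. */
static int ipc_try_send(const struct IpcEnv *sender, struct IpcEnv *receiver,
                        uint32_t value) {
    if (!receiver->env_ipc_recving) {
        return -IPC_E_NOT_RECV;
    }
    receiver->env_ipc_value = value;
    receiver->env_ipc_from = sender->env_id;
    receiver->env_ipc_recving = 0;         /* mark completion */
    receiver->env_status = IPC_RUNNABLE;   /* wake the receiver up */
    return 0;
}
```

A sender that gets the failure code would typically just try again later, which is why the call is a ‘try’ in the first place.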
Fork
As we learned before, fork is a useful function to duplicate a process. It is fascinating, as we’ve been curious about its implementation ever since we first encountered it in Process. You may want to read that post first for an application of fork.
In this part, I’d like to go through things from top to bottom. :)
Fork in Kernel Space
Well, fork is actually a library function in user space. Its core is a system call named syscall_exofork, which eventually calls sys_exofork, and this system call explains how we get a new process out of nowhere. Now, let’s have a look at it. It’s surprisingly short, but every line has its meaning.
First, its declaration. Err… really simple.
1 | int sys_exofork(void); |
First, we really create a PCB (Env) from nowhere.
1 | /* Step 1: Allocate a new env using 'env_alloc'. */ |
Then, the most important step, we copy current PCB’s Trapframe to it.
1 | /* Step 2: Copy the current Trapframe below 'KSTACKTOP' to the new env's 'env_tf'. */ |
Important! By copying the current Trapframe, we copy the break point of the current process to its child, which makes the child resume at the same point as its parent! Notice that the Trapframe is saved the moment the process makes the system call, and the system call dispatcher increases CP0.epc by 4 bytes before entering the handler, which makes it skip over the syscall instruction. Only then do we enter the real handler and do the fork. So nothing we do inside the system call affects the break point (mainly the PC), and the child process will not execute the system call again!
Now, why don’t we just copy curenv->env_tf instead? This is a little complicated, so feel free to expand the block below.
which one to copy
Read the source code carefully, and you’ll find that, for system calls that don’t schedule, only the region from KSTACKTOP - TF_SIZE to KSTACKTOP is used to save and restore the process break point. Env::env_tf is only written when scheduling happens, to store the process state so that this memory block can be restored the next time the process runs. So Env::env_tf is usually out of date.
As a further explanation, a system call uses ret_from_exception to restore registers from KSTACKTOP - TF_SIZE to KSTACKTOP, which also restores the PC, so execution goes straight back to the user process. This is how most system calls end. However, a system call that involves scheduling first uses env_pop_tf to reload the region at KSTACKTOP - TF_SIZE from Env::env_tf, and then calls ret_from_exception to jump to the new process.
To distinguish between parent and child, we set child process’s return value to 0.
1 | /* Step 3: Set the new env's 'env_tf.regs[2]' to 0 to indicate the return value in child. */ |
Notice that, parent process hasn’t got its return value yet.
At last, since the child process is not ready to run (no memory allocated), we temporarily set its status to ENV_NOT_RUNNABLE, and its priority to the same as its parent’s.
1 | /* Step 4: Set up the new env's 'env_status' and 'env_pri'. */ |
As all of this is executed on behalf of the parent process, though in the kernel, the return value goes directly back to the parent. So here we just return the child process’ Env::env_id.
1 | return e->env_id; |
So what’s the difference between the return values of parent and child? Well, the parent is a complete process at this point, yet the child only has a PCB. The child cannot receive a return value from the function, so we have to ‘make’ its return value out of nowhere.
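The “two return values from one call” trick is easier to believe when you see it in a toy model. In this sketch the Env is already allocated by the caller, the fork_ names are hypothetical, and the only thing that matters is which copy of $v0 gets which value.

```c
#include <stdint.h>

/* Cut-down saved context: general registers plus the saved PC. */
struct ForkTrapframe {
    uint32_t regs[32];
    uint32_t cp0_epc;
};

/* Cut-down process control block. */
struct ForkEnv {
    uint32_t env_id;
    uint32_t env_pri;
    int runnable;
    struct ForkTrapframe env_tf;
};

/*
 * Model of the kernel side of fork. `parent_tf` is the frame saved when
 * the parent trapped in; `child` is an Env that was already allocated.
 * The returned value is what the dispatcher would place in the parent's $v0.
 */
static uint32_t fork_sketch(const struct ForkTrapframe *parent_tf,
                            const struct ForkEnv *parent, struct ForkEnv *child) {
    child->env_tf = *parent_tf;   /* same break point as the parent */
    child->env_tf.regs[2] = 0;    /* child will see 0 in $v0 when it first runs */
    child->runnable = 0;          /* not ready yet: user space must set it up */
    child->env_pri = parent->env_pri;
    return child->env_id;         /* parent sees the child's id */
}
```

Both processes later resume at the same PC, and the value sitting in $v0 is the only way they can tell who they are.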
A complete view of this function.
sys_exofork
1 | int sys_exofork(void) |
It doesn’t do much. Yeah, I thought it did everything. But it does not; the rest is done in user space.
Fork in User Space
You can find the user space fork in user/include/lib.h and user/lib/fork.c. But hold your horses, please. There are some basic concepts to understand before we meet fork.
Copy on Write
To reduce physical memory use, when we fork a child, we first make the child share its parent’s physical pages. If neither parent nor child writes to them, we save a lot of memory. However, if one of them writes to one of these pages, it triggers a TLB mod exception, whose handling duplicates the page.
To mark such pages, we use PTE_COW. PTE_COW and PTE_D (write) are mutually exclusive. That is to say, such a page can’t be written before it is copied.
However, though it is a kernel exception, the actual work is done by a user space function whose address is stored in Env::env_user_tlb_mod_entry, and this entry is set to the user space handler cow_entry. This is a little tricky, and I’ll elaborate on it real soon.
What this entry does is simply make a copy of the page and jump back to the instruction that caused the exception.
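The copy step itself can be sketched like this. The flag values, the tiny page size and the cw_ names are all made up for the demo, and the caller hands in the fresh page instead of the handler allocating one.

```c
#include <stdint.h>
#include <string.h>

#define CW_PAGE_SIZE 64u  /* tiny "page" to keep the demo small */
#define CW_PTE_D   0x1u   /* writable (made-up bit value) */
#define CW_PTE_COW 0x2u   /* copy-on-write (made-up bit value) */

/* One virtual page of one process: which frame it uses, and how. */
struct CwMapping {
    uint8_t *frame;
    unsigned perm;
};

/*
 * Handle a write fault on `m` by giving it a private copy in `fresh`.
 * Returns 0 on success, -1 if the page is not copy-on-write at all.
 */
static int cw_handle_write_fault(struct CwMapping *m, uint8_t *fresh) {
    if (!(m->perm & CW_PTE_COW)) {
        return -1; /* a genuine illegal write, not our business */
    }
    memcpy(fresh, m->frame, CW_PAGE_SIZE);          /* duplicate the content */
    m->frame = fresh;                               /* remap to the copy */
    m->perm = (m->perm & ~CW_PTE_COW) | CW_PTE_D;   /* now privately writable */
    return 0;
}
```

Only the faulting side gets a new page. The other side keeps the original frame, still marked copy-on-write, until it writes too.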
cow_entry
1 | /* Overview: |
If you are confused about the jump between these functions, you can expand the following block.
tlb_mod_exception
First, we have to review do_tlb_mod. When a TLB mod exception happens, we first jump here.
do_tlb_mod
1 | void do_tlb_mod(struct Trapframe* tf) |
After this, the user process continues in cow_entry once it leaves the kernel. Notice that the user process still runs on the exception stack in cow_entry, so as not to clobber its previous stack.
You may wonder, how do we get back on the right track? The answer is the parameter we pass to env_user_tlb_mod_entry. We store the previous trapframe on the exception stack to preserve it, and in cow_entry we can see that it uses this trapframe to jump back to where the exception happened! Brilliant!
This is just like setjmp() and longjmp() in C. :)
Duplicate Pages
To achieve COW, we first need to make the child process share those pages, which is done by duplicating the mappings to them. This is the job of duppage. This function is relatively simple.
duppage
1 | static void duppage(u_int envid, u_int vpn) |
Don’t be afraid of accessing virtual memory directly. Every access is intercepted by the MMU in the end and translated to physical memory.
One question: why map the page into the child before remapping it in the parent?
Because once we remap it in the parent, the page becomes read-only, yet we still need to write to it ourselves, since the function call stack lives there!
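The permission juggling can be sketched as follows. This only models the rule described in this post (writable pages turn into copy-on-write, others are shared unchanged); the real duppage may handle more cases, and the flag values and dp_ names are made up.

```c
/* Made-up flag bit values, for illustration only. */
#define DP_PTE_D   0x1u /* writable */
#define DP_PTE_COW 0x2u /* copy-on-write */
#define DP_PTE_V   0x4u /* valid */

/* One page table entry, reduced to its permission bits. */
struct DpPte {
    unsigned perm;
};

/* Permission both sides end up with after sharing the page. */
static unsigned dp_shared_perm(unsigned perm) {
    if (perm & DP_PTE_D) {
        return (perm & ~DP_PTE_D) | DP_PTE_COW; /* writable becomes COW */
    }
    return perm; /* read-only pages are shared as they are */
}

/* Share one page: child first, then downgrade the parent's own entry. */
static void dp_duppage(struct DpPte *parent, struct DpPte *child) {
    unsigned perm = dp_shared_perm(parent->perm);
    child->perm = perm;
    parent->perm = perm;
}
```

The order of the two assignments mirrors the order discussed above: the child’s mapping is set up while the parent’s page is still writable.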
Fork
Finally, we can have a look at fork in user space. By now, it should be easy to understand. Just before leaving fork, we set the child process’ TLB mod entry. And as we have prepared everything for it, we can set its status to RUNNABLE.
Notice that, for the child process, we have to set its env manually. env is normally set at the start of a process to the Env it corresponds to, but the child doesn’t start the normal way, so its env would otherwise still be the same as its parent’s.
fork
1 | int fork(void) |
Epilogue
Well, I guess, this is it… Too many words. :(