ENABLING MULTI-PROCESSORS IN MY HOBBY OS

2015-10-19

I recently added multi-processor support to my homebrew OS. Here are the technical details. By the way, Chapters 8 and 10 of Volume 3 of the Intel manual are probably your best resource.

When the system starts, all CPUs but one are halted, and we must signal the others to start. I won't go into the details of how to bootstrap the first processor. That step is easy: switch to protected mode, then set up paging and jump to long mode. This is very well covered in the Intel manuals.

Basically, this is how we switch to protected mode:


    // Before going any further, you must enable the A-20 line. Not covered in this example

    push    %cs     /* remember, cs is 07C0*/
    pop     %ds
    mov     $GDTINFO,%eax
    lgdtl   (%eax)
    mov     %cr0,%eax
    or      $1,%al
    mov     %eax,%cr0   /* protected mode */
    mov     $0x08,%bx

    // far jump: loads CS with the protected mode code selector and flushes the prefetch queue
    ljmpl   $0x10,$PROTECTEDMODE_ENTRY_POINT


GDTINFO:
    // GDT INFO
    .WORD 0x1F          /* GDT limit: 4 entries * 8 bytes, minus 1 */
    .LONG . + 0x7C04    /* that will be the address of the beginning of the GDT table */

    // GDT
    .LONG 00
    .LONG 00

    // GDT entry 1. Data segment descriptor used during unreal mode
    .BYTE 0xFF
    .BYTE 0xFF
    .BYTE 0x00
    .BYTE 0x00
    .BYTE 0x00
    .BYTE 0b10010010
    .BYTE 0b11001111
    .BYTE 0x00

    // GDT entry 2. Code segment used during protected mode code execution
    .BYTE 0xFF
    .BYTE 0xFF
    .BYTE 0x00
    .BYTE 0x00
    .BYTE 0x00
    .BYTE 0b10011010
    .BYTE 0b11001111
    .BYTE 0x00

    // GDT entry 3. 64-bit code segment used for jumping to 64-bit mode.
    // This is just used to turn on 64-bit mode. Segmentation will not be used anymore after 64-bit code runs.
    // When we jump into this segment, we are already in long mode, but in the compatibility sub-mode.
    // Limit and permissions are ignored; the CPU only checks the D and L bits in this case.
    // So segments are not ignored entirely in long mode: the D and L bits are checked when jumping
    // to another segment and the sub-mode changes accordingly.
    // In long mode, segments have a different purpose: to change sub-modes.
    .BYTE 0xFF
    .BYTE 0xFF
    .BYTE 0x00
    .BYTE 0x00
    .BYTE 0x00
    .BYTE 0b10011010
    .BYTE 0b10101111  // bit 6 (D) must be 0, and bit 5 (L, was reserved before) must be 1
    .BYTE 0x00
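
As a sanity check on the descriptor bytes above, here is a minimal host-side C sketch that packs a base, limit, access byte and flag nibble into the 8-byte descriptor layout. It is not part of the bootloader, and every name in it (`gdt_entry`, the `ACC_*` and `FLAG_*` constants) is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Access byte bits (byte 5 of a descriptor). Names are illustrative. */
#define ACC_PRESENT    0x80u  /* P: segment present */
#define ACC_CODE_DATA  0x10u  /* S: code/data segment, not a system segment */
#define ACC_EXEC       0x08u  /* executable, i.e. a code segment */
#define ACC_RW         0x02u  /* writable (data) or readable (code) */

/* Flag bits (upper nibble of byte 6). */
#define FLAG_GRAN_4K   0x8u   /* G: limit is counted in 4 KiB units */
#define FLAG_DEFAULT32 0x4u   /* D: 32-bit default operand size */
#define FLAG_LONG      0x2u   /* L: 64-bit code segment */

/* Pack the fields into the little-endian 8-byte descriptor layout. */
static uint64_t gdt_entry(uint32_t base, uint32_t limit, uint8_t access, uint8_t flags)
{
    assert(limit <= 0xFFFFF);  /* the limit field is only 20 bits wide */
    assert(flags <= 0xF);      /* the flags field is one nibble */
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFF);             /* bytes 0-1: limit 15:0   */
    d |= (uint64_t)(base & 0xFFFFFF) << 16;      /* bytes 2-4: base 23:0    */
    d |= (uint64_t)access << 40;                 /* byte 5: access byte     */
    d |= (uint64_t)((limit >> 16) & 0xF) << 48;  /* byte 6 low: limit 19:16 */
    d |= (uint64_t)flags << 52;                  /* byte 6 high: G, D, L    */
    d |= (uint64_t)(base >> 24) << 56;           /* byte 7: base 31:24      */
    return d;
}

/* Entry 1: flat 32-bit data segment. */
static uint64_t gdt_data32(void)
{
    return gdt_entry(0, 0xFFFFF, ACC_PRESENT | ACC_CODE_DATA | ACC_RW,
                     FLAG_GRAN_4K | FLAG_DEFAULT32);
}

/* Entry 2: flat 32-bit code segment. */
static uint64_t gdt_code32(void)
{
    return gdt_entry(0, 0xFFFFF, ACC_PRESENT | ACC_CODE_DATA | ACC_EXEC | ACC_RW,
                     FLAG_GRAN_4K | FLAG_DEFAULT32);
}

/* Entry 3: 64-bit code segment, D clear and L set. */
static uint64_t gdt_code64(void)
{
    return gdt_entry(0, 0xFFFFF, ACC_PRESENT | ACC_CODE_DATA | ACC_EXEC | ACC_RW,
                     FLAG_GRAN_4K | FLAG_LONG);
}
```

Read as little-endian 64-bit values, the three entries come out as 0x00CF92000000FFFF, 0x00CF9A000000FFFF and 0x00AF9A000000FFFF, matching the .BYTE sequences byte for byte.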

This is how we switch to long mode:


PROTECTEDMODE_ENTRY_POINT:
    // Before going any further, you must set up the paging structures.
    // Not covered in this example since it is very easy and well documented
    // in the Intel manuals
    mov     $8,%ax
    mov     %ax,%ds
    mov     %ax,%es
    mov     %ax,%fs
    mov     %ax,%gs
    mov     %ax,%ss

    // set PML4 address
    mov     $PML4TABLE,%eax
    mov     %eax,%cr3

    // Enable PAE (bit 5) and global pages, PGE (bit 7)
    mov     %cr4,%eax
    or      $0b10100000,%eax
    mov     %eax,%cr4

    // enable long mode: set LME (bit 8) in the EFER MSR
    mov     $0xC0000080,%ecx
    rdmsr
    or      $0b100000000,%eax
    wrmsr

    //enable paging
    mov     %cr0,%eax
    or      $0x80000001,%eax
    mov     %eax,%cr0
    ljmpl   $0x18,$LONG_MODE_ENTRY_POINT

So at this point, the kernel is running in 64-bit long mode.
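
The magic numbers in the long mode switch are easier to read with names attached. The following C sketch models the three register updates on a plain struct so it can run on a host. The struct and function names are hypothetical, and nothing here touches real control registers.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_PE    (1u << 0)    /* protected mode enable */
#define CR0_PG    (1u << 31)   /* paging enable */
#define CR4_PAE   (1u << 5)    /* physical address extension */
#define CR4_PGE   (1u << 7)    /* global pages */
#define MSR_EFER  0xC0000080u  /* extended feature enable register */
#define EFER_LME  (1u << 8)    /* long mode enable */

/* Hypothetical stand-in for the registers touched by the switch. */
struct cpu_regs {
    uint32_t cr0;
    uint32_t cr4;
    uint64_t efer;
};

/* Apply the same OR masks, in the same order, as the assembly above. */
static struct cpu_regs enter_long_mode(struct cpu_regs r)
{
    assert(r.cr0 & CR0_PE);       /* we start from protected mode */
    r.cr4  |= CR4_PAE | CR4_PGE;  /* or $0b10100000,%eax  */
    r.efer |= EFER_LME;           /* or $0b100000000,%eax */
    r.cr0  |= CR0_PG | CR0_PE;    /* or $0x80000001,%eax  */
    return r;
}

/* True once every bit needed for long mode is set. */
static int long_mode_ready(struct cpu_regs r)
{
    return (r.cr4 & CR4_PAE) && (r.efer & EFER_LME) &&
           (r.cr0 & CR0_PG) && (r.cr0 & CR0_PE);
}
```

The far jump into the 64-bit code segment is still required afterwards. This sketch only covers the bit masks.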

Detecting the number of CPUs

The first thing to do is to detect the number of CPUs present. This can be done by looking for the "MP floating pointer" structure, which the BIOS fills in at boot time. It contains information about the CPUs and the IO APIC on the system. I won't go into the details of the structure since it is very well documented everywhere. The structure can be in several places in the BIOS address space, which is why we must search memory for it. It starts with "_MP_" and contains a checksum, so you will find it by scanning memory. The important steps are:

Find the structure in memory. According to the specs, it can be in a couple of different places.
Detect the number of CPUs and the local APIC address of the CPUs.
Detect IO APIC address.

For more details on how to find the structure and its format, search for "Intel Multi-Processor Specification".
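
The search itself can be sketched in a few lines of C. This is a minimal, host-testable version that scans a byte buffer rather than physical memory. It assumes the layout from the MultiProcessor Specification: the structure sits on a 16-byte boundary, byte 8 holds its length in 16-byte units, and byte 10 is a checksum that makes all bytes sum to zero. The function names are made up for illustration. In a kernel you would call it once per candidate region (per the spec: the first KB of the EBDA, the last KB of base memory, and the BIOS ROM area from 0xF0000 to 0xFFFFF).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MP_ALIGN        16  /* the structure sits on a 16-byte boundary */
#define MP_OFF_LENGTH   8   /* length of the structure in 16-byte units */
#define MP_OFF_CHECKSUM 10  /* makes all bytes sum to 0 modulo 256 */

/* Sum len bytes modulo 256; a valid structure sums to zero. */
static int mp_checksum_ok(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + p[i]);
    return sum == 0;
}

/*
 * Scan [mem, mem + size) for a valid MP floating pointer structure.
 * Returns the offset of the structure, or -1 if none is found.
 */
static long mp_find(const uint8_t *mem, size_t size)
{
    assert(mem != NULL);
    for (size_t off = 0; off + MP_ALIGN <= size; off += MP_ALIGN) {
        const uint8_t *p = mem + off;
        if (memcmp(p, "_MP_", 4) != 0)
            continue;                     /* no signature here */
        size_t len = (size_t)p[MP_OFF_LENGTH] * 16;
        if (len == 0 || off + len > size)
            continue;                     /* bogus length field */
        if (mp_checksum_ok(p, len))
            return (long)off;             /* signature and checksum match */
    }
    return -1;
}
```

Checking the checksum matters because the four signature bytes alone can show up in memory by accident.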

When wandering in the SMP world, you must forget about using the PIC (Programmable Interrupt Controller). The PIC is an old, obsolete device anyway. The new way is to use the APIC, so we won't be using the PIC anymore. There is a notion of a local APIC and an IO APIC. The local APIC is an APIC that is present on each CPU. The local APICs can be used to trigger interrupts from one CPU to another, as a way of communication. When the system starts, all CPUs but one are halted, and we must signal the others to start. The PIC does not allow us to do that, which is why we must use the APIC. The local APIC allows us to trigger an interrupt on the other CPUs to get them out of their halted state.

We must then set up the local APIC for the current CPU. Each CPU has its own APIC, and every APIC is mapped at the same address: 0xFEE00000. So when CPU0 reads or writes at 0xFEE00000, it does not reach the same device as CPU1 reading or writing at 0xFEE00000, since the address maps to each CPU's own APIC. This is nice because it means you don't need to do something like "What CPU am I? Number x? OK, then use address xyz." Each CPU only needs to write to the same address and is guaranteed to write to its own APIC. It's all transparent, so you don't need to worry about it. The address of the IO APIC, on the other hand, maps to the same IO APIC for all CPUs. That's also good, because all CPUs want to use the same IO APIC anyway.


    mov     $APIC_BASE,%rdi
    mov     $(SPURIOUS_INTERRUPT_VECTOR | 0x100), %rax   // OR with enable flag
    mov     %eax,APIC_REG_SPURIOUSINTERRUPTVECTOR(%rdi)
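
For readers who prefer C, the same register write can be sketched with a pointer to the memory-mapped registers. The base is passed as a parameter so the code can be tried on a host against a plain array. In a kernel it would point at 0xFEE00000. The helper names and the spurious vector value passed in are assumptions for illustration, while the register offsets (0x20 for the APIC ID, 0xF0 for the spurious interrupt vector) come from the Intel manual.

```c
#include <assert.h>
#include <stdint.h>

#define APIC_REG_ID        0x20u   /* local APIC ID in bits 31:24 */
#define APIC_REG_SPURIOUS  0xF0u   /* spurious interrupt vector register */
#define APIC_SW_ENABLE     0x100u  /* bit 8: APIC software enable */

/* Write a 32-bit local APIC register at byte offset reg. */
static void apic_write(volatile uint32_t *base, uint32_t reg, uint32_t value)
{
    assert(reg % 16 == 0);  /* APIC registers sit on 16-byte boundaries */
    base[reg / 4] = value;
}

/* Read a 32-bit local APIC register at byte offset reg. */
static uint32_t apic_read(volatile uint32_t *base, uint32_t reg)
{
    assert(reg % 16 == 0);
    return base[reg / 4];
}

/* Enable the APIC by setting the enable flag next to the spurious vector. */
static void apic_enable(volatile uint32_t *base, uint8_t spurious_vector)
{
    apic_write(base, APIC_REG_SPURIOUS, (uint32_t)spurious_vector | APIC_SW_ENABLE);
}

/* Which CPU am I? Each CPU reads its own ID from the same address. */
static uint32_t apic_id(volatile uint32_t *base)
{
    return apic_read(base, APIC_REG_ID) >> 24;
}
```

Because every CPU sees its own APIC at that address, `apic_id` needs no per-CPU lookup table.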

Then, we start the APs:


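    // "rep nop" encodes a single PAUSE and does not repeat, so count down explicitly
    #define WAIT(x) push %rcx; mov $x,%rcx; 1336: pause; dec %rcx; jnz 1336b; pop %rcx;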
    #define STALL() 1337: hlt; jmp 1337b;
    #define COUNT_ONES(regx,regy) push %rcx; \
        xor regy,regy; \
        1337:; \
        cmp $0,regx; \
        jz  1338f; \
        inc regy; \
        mov regx,%rcx; \
        dec %rcx; \
        and %rcx,regx; \
        jmp 1337b; \
        1338:; \
        pop  %rcx


    mov     $APIC_BASE,%rdi
    mov     $0xC4500, %rax              // broadcast INIT to all APs
    mov     %eax, APIC_REG_INTERRUPTCOMMANDLOW(%rdi)
    WAIT(100000000)                     // 100 million iterations should take more than 10ms on a 4GHz CPU
    mov     $0xC4600, %rax              // broadcast SIPI to all APs
    mov     $SMP_TRAMPOLINE,%rcx
    shr     $12,%rcx
    and     $0xFF,%rcx
    or      %rcx,%rax
    mov     %eax, APIC_REG_INTERRUPTCOMMANDLOW(%rdi)

    mov     STARTEDCPUS,%rbx
    COUNT_ONES(%rbx,%rdx)
    cmp     CPUCOUNT,%rdx
    jz      1f
    mov     %eax, APIC_REG_INTERRUPTCOMMANDLOW(%rdi)
    WAIT(100000000)
    mov     STARTEDCPUS,%rbx
    COUNT_ONES(%rbx,%rdx)
    cmp     CPUCOUNT,%rdx
    jz      1f
    //CPUs are not all started. should do something about that
    STALL()
1:

The SMP_TRAMPOLINE constant is the address I want the APs to jump to when starting. This address must be aligned on a 4k boundary because the SIPI message takes the page number as a parameter, which is why I SHR the address by 12 (divide by 4096). And since the APs start in 16-bit mode, the address must reside under the 1 MB barrier. STARTEDCPUS is a 64-bit bitfield that represents the CPUs. Each bit gets set by the corresponding AP (cpuX sets bit X).
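
The two command values written to the interrupt command register can be rebuilt from their fields, which makes 0xC4500 and 0xC4600 less mysterious. This is a small C sketch with made-up constant and function names. The field positions are the ones documented for the low dword of the ICR.

```c
#include <assert.h>
#include <stdint.h>

#define ICR_DELIVERY_INIT     (5u << 8)   /* delivery mode 101: INIT */
#define ICR_DELIVERY_STARTUP  (6u << 8)   /* delivery mode 110: start-up (SIPI) */
#define ICR_LEVEL_ASSERT      (1u << 14)  /* level: assert */
#define ICR_DEST_ALL_BUT_SELF (3u << 18)  /* shorthand 11: all excluding self */

/* Low dword of the ICR for an INIT broadcast to all APs. */
static uint32_t icr_init_broadcast(void)
{
    return ICR_DEST_ALL_BUT_SELF | ICR_LEVEL_ASSERT | ICR_DELIVERY_INIT;
}

/* Low dword of the ICR for a SIPI broadcast; the vector is the page number. */
static uint32_t icr_sipi_broadcast(uint32_t trampoline)
{
    assert((trampoline & 0xFFF) == 0);  /* must be aligned on a 4k boundary */
    assert(trampoline < 0x100000);      /* must be below 1 MB */
    return ICR_DEST_ALL_BUT_SELF | ICR_LEVEL_ASSERT | ICR_DELIVERY_STARTUP |
           (trampoline >> 12);
}
```

With a trampoline at 0x1000 the SIPI value is 0xC4601, so the APs start executing at page 1.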

Application processors trampoline code

I decided to put the Application Processor's trampoline code in the bootloader (I've got 512 bytes of room, that should be enough). The bootloader is a good place because it is below the 1 MB mark, the source file is compiled as 16-bit code and all the initialisation is done there anyway. But when an AP starts, it is given a start address aligned on a 4k page boundary, and the bootloader is at 0x7C00. So the bootloader copies a "jmp" to 0x1000 that jumps to the bootloader's AP init function. So the order of execution is:

AP receives SIPI with vector 0x01
AP jumps to 0x1000
Code at 0x1000 will make AP jump to 0x7C0:
AP will switch to protected mode and jump to KernelMain
KernelMain will check in MSR[0x1B] if this is an AP or the BSP. If BSP, then jump to normal initialisation
setup the temporary stack for the AP's thread of execution: 0x70000+256*APIC_ID (256-byte stacks)
enable long mode (64 bit)
set CPU started flag in global variable: STARTEDCPU = STARTEDCPU | (1<
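
The bookkeeping in the steps above can be sketched in C. This is a host-side illustration under a few assumptions: the function names are invented, the MSR value is passed in as a plain number instead of being read with rdmsr, and the started flag is set with a C11 atomic OR (several APs may set their bit at the same moment, and a plain read-modify-write could lose an update). The bit index is the CPU number, as in "cpuX sets bit X".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MSR_APIC_BASE  0x1Bu       /* IA32_APIC_BASE */
#define APIC_BASE_BSP  (1u << 8)   /* set only on the bootstrap processor */
#define AP_STACK_AREA  0x70000u    /* start of the temporary AP stacks */
#define AP_STACK_SIZE  256u        /* bytes of stack per AP */

/* Decide between the BSP and AP paths from the value of MSR 0x1B. */
static int is_bsp(uint64_t apic_base_msr)
{
    return (apic_base_msr & APIC_BASE_BSP) != 0;
}

/* Temporary stack address for a given APIC ID: 0x70000 + 256 * APIC_ID. */
static uint32_t ap_stack_address(uint32_t apic_id)
{
    return AP_STACK_AREA + AP_STACK_SIZE * apic_id;
}

/* 64-bit bitfield of started CPUs. */
static _Atomic uint64_t started_cpus;

/* Called by each CPU once it is up: atomically set its own bit. */
static void mark_started(unsigned cpu)
{
    assert(cpu < 64);  /* the bitfield has room for 64 CPUs */
    atomic_fetch_or(&started_cpus, (uint64_t)1 << cpu);
}

/* Count set bits by clearing the lowest one until none remain. */
static unsigned count_ones(uint64_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;  /* clears the lowest set bit */
        n++;
    }
    return n;
}

/* Number of CPUs that have reported in so far. */
static unsigned started_count(void)
{
    return count_ones(atomic_load(&started_cpus));
}
```

The BSP would compare `started_count()` against the CPU count found in the MP tables, which is what the COUNT_ONES macro does in the assembly version.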

So now I have multiple processors ready for work. The next step is to make an SMP-compatible scheduler and start using the IO-APIC. I'll cover that another time.