C++ websocket server
Last edited on Apr 1, 2015

I recently wanted to learn a bit more about websockets. By that, I don't mean how to use websockets from JavaScript but rather how the server part works and what the protocol looks like. So I decided to write my own server library. I followed RFC 6455, but there are still some things I need to change in order to be fully compliant. There's not much to say about the library other than that it is very easy to use. I did try libwebsocket before. It is pretty complete, but I felt it was a little more complicated than it should be. So although my library is not as complete as libwebsocket, it is easier to use and will be good enough for most of my projects.

My code is hosted on github: https://github.com/pdumais/websocket

SIP attack banning
Last edited on Mar 30, 2015

New and improved version

I wrote this article a few years ago and posted a C++ application I wrote that automatically invokes iptables to block hosts that are abusing my Asterisk server.

I rewrote the application, this time in Perl. I use Net::Pcap to sniff the network. The script runs as a daemon and looks for traffic going out of the LAN. It filters SIP responses and automatically invokes iptables to block any remote host to which it sees Asterisk sending more than 10 (configurable) responses with a status code of 400 or higher. Only responses to REGISTER and INVITE requests are considered.


You will find the script on github: https://github.com/pdumais/astban

Block caching and writeback
Last edited on Dec 19, 2014

I recently wrote a disk driver for my x86-64 OS. I also wrote a block caching mechanism with delayed writeback to disk.

Block Caching

Reading and writing blocks happens at a layer under the filesystem, so there is no notion of available or used blocks. This layer only reads, writes and caches blocks.

Reading a block

When a read request is executed, the requested block is first searched for in the cache. If the block is already cached, its data is returned. If the block does not exist in the cache, a new cache entry is created and marked as "pending read". The new cache entry is associated with the requested device and block number. The request then blocks the current thread until the entry gets its "pending read" status cleared, which is done by the IRQ handler. Creating a new cache entry is done atomically, so only one entry can be created at a time. That mechanism prevents two identical read accesses that occur at the same time from issuing two read requests.

When a block is read from disk, it is kept in the cache. Every time it is accessed, a timestamp is updated to keep track of the latest access.


A function called schedule_io() is called at the following times:

  • At the end of a disk IRQ.
  • After a cache entry is marked "pending read" and the disk driver is not busy (so no pending operation would trigger an IRQ)

The schedule_io() function iterates through the list of cache entries, finds an entry that is "pending read" and then requests the disk driver to read the sector from disk. Several different algorithms can be used in this function to choose which "pending read" entry to serve. A common algorithm is the "elevator" algorithm, where the scheduler chooses to execute a read operation for the sector that is closest to the last one read. This limits seeking on disk. An elevator that needs to go to floors 5,2,8,4 will not jump around all those floors in that order. If the elevator is currently at floor 3, it will do: 2,4,5,8.
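The selection rule described above can be sketched in C. This is only an illustration and not code from my driver: the function names and the parallel sector/pending arrays are made up for the example, and it implements the "closest to the last sector read" rule exactly as described (a stricter elevator would also keep moving in one direction before turning around).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the index of the pending sector closest to
 * last_sector, or -1 if nothing is pending. pending[i] != 0 means that
 * sector[i] is still waiting to be read. Ties go to the entry seen first. */
int pick_nearest_pending(const unsigned long *sector, const int *pending,
                         size_t count, unsigned long last_sector)
{
    int best = -1;
    unsigned long best_dist = 0;

    for (size_t i = 0; i < count; i++) {
        if (!pending[i])
            continue;
        unsigned long dist = sector[i] > last_sector
                                 ? sector[i] - last_sector
                                 : last_sector - sector[i];
        if (best < 0 || dist < best_dist) {
            best = (int)i;
            best_dist = dist;
        }
    }
    return best;
}

/* Serves every pending request, starting with the head at "start".
 * The order in which sectors get read is written to out[].
 * Returns the number of requests served. */
size_t drain_nearest_first(const unsigned long *sector, int *pending,
                           size_t count, unsigned long start,
                           unsigned long *out)
{
    size_t served = 0;
    unsigned long head = start;
    int idx;

    assert(sector != NULL && pending != NULL && out != NULL);
    while ((idx = pick_nearest_pending(sector, pending, count, head)) >= 0) {
        pending[idx] = 0;      /* request has been issued */
        head = sector[idx];    /* the head is now at that sector */
        out[served++] = head;
    }
    return served;
}
```

With requests for 5, 2, 8 and 4 and the head at 3, drain_nearest_first() serves them in the order 2, 4, 5, 8, which is the floor sequence from the elevator example.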

That is not the algorithm I chose to implement though. To keep it simple (and thus very inefficient), my scheduler just picks the first "pending read" entry it sees in the block cache list. When there are no more read requests, the scheduler proceeds with write requests. So read requests always have higher priority. This is good for speed, but bad for the reliability of data persistence.

Updating a block

When data needs to be written to an existing block, the block may already have been loaded in memory. This means that it was either read earlier for some other reason, or it was read and a small portion of it was updated. Either way, it is already in the cache and it needs to be written back to disk. In that case, the "pending write" flag is set on it, and when the scheduler picks it up, it sends a write request to the disk driver.

The following scenarios could occur:

  • Trying to read while write pending
    It doesn't matter. The block will be read directly from memory. This could happen after writing into a block and reading it right away. You would want the updated version.
  • Trying to write a block that does not exist yet in the cache
    This means that the block was never read and we just want to overwrite whatever is in it. A cache entry will be created for the block and the data will be copied into it. The "write pending" flag will be set.
  • Trying to update while write pending
    This call needs to block until the block has finished being written back to disk, because we want to avoid updating it in the middle of a write.

Block cache list

To keep things simple (and again very inefficient), I chose to implement the block cache list as a fixed-size array. A better approach would be to store the entries in a tree and let it grow as long as there is available memory.

Each cache entry is as follows:

#define CACHE_IN_USE 8

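struct block_cache_entry
{
    unsigned long block;            // sector number on disk
    char *data;                     // cached data for that block
    unsigned char device;           // device the block belongs to
    volatile unsigned char flags;   // combination of the CACHE_* bits
    unsigned long lastAccess;       // timestamp of the latest access
} __attribute__((packed));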

Each entry has a field for the sector number on disk and one for the device to which the sector belongs. lastAccess is used by the cache purging algorithm. The flags field is a combination of the following bits:

  • CACHE_WRITE_PENDING: The block contains valid data that has not been flushed to disk yet, but should be.
  • CACHE_READ_PENDING: The block does not contain data yet and is waiting for a disk read operation to fill it.
  • CACHE_BLOCK_VALID: The entry is valid. If 0, the entry is invalid and is free to use for caching. If 1, it contains valid data that belongs to a sector on disk.
  • CACHE_IN_USE: The entry is in use by the cache subsystem and should not be purged.
  • CACHE_FILL_PENDING: The entry was created for a write operation but does not contain data yet. So it cannot be read nor flushed to disk, but it should not be purged either.

Clearing cached blocks

When there is no space left in the block cache list (in my case because the fixed-size array is full; in the tree version, because the tree cannot grow anymore), cached blocks must be cleared. The block cache finds the blocks with the oldest access time that are not pending write or read and frees them. Obviously, this is a very simple algorithm that does not take into account the frequency of access or the probability of access given the location on disk. But it works.
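As a rough illustration of that purge rule, here is a minimal C sketch that picks the eviction victim. It is not the code from my kernel: only CACHE_IN_USE (8) is a value taken from this article, the other flag values, the trimmed-down struct and the function name are assumptions made for the example. It also assumes that entries marked in use or "fill pending" are protected from purging.

```c
#include <assert.h>
#include <stddef.h>

/* Only CACHE_IN_USE (8) comes from the article. The other bit values are
 * assumptions made for this sketch. */
#define CACHE_WRITE_PENDING 1   /* assumed value */
#define CACHE_READ_PENDING  2   /* assumed value */
#define CACHE_BLOCK_VALID   4   /* assumed value */
#define CACHE_IN_USE        8
#define CACHE_FILL_PENDING  16  /* assumed value */

/* Trimmed-down, hypothetical version of the cache entry. */
struct cache_entry_sketch {
    unsigned long block;
    unsigned char flags;
    unsigned long lastAccess;
};

/* Returns the index of the valid entry with the oldest lastAccess that is
 * neither pending nor in use, or -1 if nothing can be purged. */
int find_eviction_candidate(const struct cache_entry_sketch *cache,
                            size_t count)
{
    const unsigned char busy = CACHE_WRITE_PENDING | CACHE_READ_PENDING |
                               CACHE_IN_USE | CACHE_FILL_PENDING;
    int victim = -1;

    assert(cache != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (!(cache[i].flags & CACHE_BLOCK_VALID))
            continue;   /* entry is already free */
        if (cache[i].flags & busy)
            continue;   /* entry must be kept for now */
        if (victim < 0 || cache[i].lastAccess < cache[victim].lastAccess)
            victim = (int)i;
    }
    return victim;
}
```

A linear scan like this fits the fixed-size array. With a tree, the same rule would probably need a separate structure ordered by access time.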

ATA driver

Just for reference, here is a sample of the disk driver. The full source code can be downloaded at the end of this article, but I will show a portion of it here anyway.


ata.c (the disk driver)

Beaglebone Black bare metal development
Last edited on Dec 8, 2014

Not so long ago, I wrote a small OS prototype for the Cortex-A8 CPU. I was using qemu but now I wanted to play with a real device. So I decided to give it a shot with my BeagleBone Black.


The BeagleBone Black's AM3359 chip has an internal ROM (located at 0x40000000) that contains boot code. That boot code will attempt to boot from many sources, but I am only interested in booting from the eMMC. The boot code expects the eMMC to be partitioned and the first partition to be FAT32. I don't know if there is any way to just use the eMMC as raw memory and have the AM3359 boot code load whatever is at the bottom of the flash without any partitions, so I will live with the FAT32 concept. I want to use u-boot because I want to be able to update my kernel with tftp. The stock BBB has the eMMC formatted with a FAT32 partition with u-boot on it. I will make a u-boot script that downloads my kernel from the tftp server, copies it to flash memory and then has u-boot load that kernel from flash memory into RAM. That last step is not necessary, but I want to do it because at a later point in time I will remove the tftp part from the u-boot script and only have the kernel in flash be loaded in RAM.

The proper way to do this would be to store the kernel file and all of my application files in the ext2 partition that is already present in the eMMC. But then I would need an ext2 driver in my kernel so that it could load the application files from the flash. I don't want to bother writing an ext2 driver for now, so I will hack my way through this instead. Instead of getting u-boot to download the kernel and applications into an eMMC partition, I will get it to write the kernel at a fixed location (0x04000000) in the eMMC. This will most probably overwrite a part of the first or second partition, but I really don't care at this point, as long as I don't overwrite the partition table and the beginning of the FAT32 partition where u-boot sits. Then all applications will be written one after the other just after the kernel, in a linked-list style.

According to section 2.1 of the TI reference manual for the AM335x, the ROM starts at 0x40000000. But then, in another section, they say that the ROM starts at 0x20000. This is very confusing. It turns out that when booting, memory location 0x40000000 is aliased to 0x00000000. The CPU starts executing there, and some ROM code jumps to the "public ROM code". The public ROM code starts at ROM_BASE+0x20000. Since memory is aliased, 0x20000 is the same as 0x40020000. The manual also says that the ROM code relocates the interrupt vector to 0x20000, probably using CP15 register c12. When the ROM code finds the x-loader (MLO) in flash memory, it loads it in SRAM at 0x402F0400. At this point, system behavior is defined by u-boot (MLO was built with u-boot).

What was confusing me at first was that I thought the eMMC was mapped to 0x00000000. It turns out that this memory is not directly addressable. So if I need to retrieve my applications from the eMMC, I will need to write an eMMC driver, because the eMMC is only accessible through MMC1. Now that I understand how eMMC works, I realize that it was foolish of me to think that it could be directly addressable. The MMC1 peripheral allows you to communicate with the on-board eMMC, but you still need to write your own code to talk to it using the SD/MMC protocol.

I had a really hard time finding information on how to read the eMMC. The TI documentation is good at explaining how to use the MMC controller, but it doesn't explain how to actually communicate with the eMMC. That's normal, since the eMMC is board dependent. The TI documentation explains how to initialize the device, but since we know that the board contains eMMC, we don't have to go through all the trouble of detecting card types etc. I was really surprised at how hard it was to find good documentation on how to use the MMC/SD protocol.

I can't really explain what I did; all I know is that it works, and the code will definitely not be portable to another board. I read the TRM and also looked at other source code and, through trial and error, I was able to read the eMMC. The file emmc.S in my source code is pretty easy to understand. I was not able to send the proper command to set the device in "block addressing mode" or to change the bus width. Like I said, this information is kind of hard to find. I'll have to do a lot more research to make this work.

I want u-boot to download my kernel from tftp and load it in memory. There doesn't seem to be any easy way to do this. I couldn't find a way to install u-boot on my BBB without installing a full eMMC image containing Linux. So I decided to just use the stock eMMC image but modify u-boot to boot my kernel instead of the installed Linux. But it seems that changing the environment variable "bootcmd" is impossible from u-boot on the BBB. There is the uenv.txt file residing on the FAT partition that I can change to contain my own script to download my kernel. Well, that too is impossible to modify directly from u-boot.

So I ended up creating an SD card with an Angstrom image, booting from the SD card, mounting the eMMC FAT32 partition and editing the uenv.txt file. I modified it to look like this:

uenvcmd=set ipaddr;set serverip;tftp 0x80000000 os.bin;tftp 0x90000000 apps.bin;mmc dev 1;mmc write 0x90000000 0x28000 0x100;go 0x80000000

Now every time I want to update the uenv.txt file, I need to boot from the SD card, because I am destroying the 2nd partition on the eMMC with my kernel since I use raw writes on the eMMC. This is not a nice solution but it works for now.

Software IRQ

The software IRQs on the BBB work in a completely different way than on the realview-pb-a8 board. On the BBB, software IRQs are not dedicated IRQs. You get a register that allows you to trigger an IRQ that is already tied to a hardware IRQ. So you can only use a software IRQ to fake a hardware IRQ. This means that you could send a software IRQ 95, but that would be the same as receiving a TIMER7 IRQ. You actually need to unmask IRQ 95 for this to work, but unmasking IRQ 95 will also allow you to get TIMER7 IRQs. In my case, this is excellent, because my TIMER7 IRQ calls my scheduler code. So a Yield() function would just trigger that IRQ artificially using the software IRQ register.

User-mode handling of IRQs

User-mode threads can register interrupt handlers in order to be notified when a GPIO is triggered. The way this works is that whenever an interrupt is received, if a user handler is defined, the page table is changed to the page table base address of the thread that is interested in receiving the event. Then a jump to the handler is done. So the CPU stays in IRQ mode, but the page table is changed and the user-mode handler is executed in IRQ mode.

The code

There is a lot more I could describe here, but the source code might be a better source of documentation. Basically, the other things I have accomplished are:

  • AM3358 interrupt controller
  • AM3358 timer
  • SPI driver for a port expander (MCP23S18) and for an EEPROM chip (25aa256)
  • Pin muxing
  • GPIO (output and input with interrupts)
  • sending data on more than one UART.


ARM bare metal development
Last edited on Nov 1, 2014

The project

I've always wondered what programming for an ARM CPU is like. So I decided to try to make an OS, written 100% in assembly, for an ARM development board. I shouldn't say OS though. Every time I write an OS, I really only make: memory management, scheduler, mutex, netcard driver, serial port driver and some small application to run on the "OS". It's basically just to learn about the architecture of the device.

The ARMv7 architecture offers a lot of cool features that I am not using. I just want to keep things simple for now. Once I get something working well, I will go deeper into the documentation and try some more advanced stuff.

At first, I wanted to use my beaglebone black to run my OS. But then, I found out that qemu can emulate quite a few boards and it would be easier to do. By using qemu, I get the following advantages over using a real board:

  • no need to upload code to the board, I use the image directly
  • can reboot the machine easily while working remotely (no need to physically access the board)
  • very easy to peek in memory with qemu's monitor command "pmemsave"
  • can use gdb to debug with qemu
  • no need for a separate bootloader. Can boot kernel directly.

I chose to use the "realview-pb-a8" emulated board in qemu. I have never seen that board and I have no idea what it is. It uses a Cortex-A8, so I was able to get a programming guide for that SoC. I started from there.

The fact that I am using qemu makes things easier but removes a lot of the fun. For example, qemu boots my kernel directly. On a real board, I would need to write a bootloader (or use u-boot). I would need to initialize SDRAM, initialize clocks and "power domains" and do other board initialization. QEMU loads your kernel directly into RAM and you can run from there. So I wouldn't quite call this "bare-metal" programming. I guess I could only call this project "kernel programming for a Cortex-A8".

Getting started

The first step is to create a small test and actually run it. So I created the following program:

Note the qemu command in the Makefile. This allows me to run the test using "make run". It will emulate the ARM board which is the realview-pb-a8

Board specifications

When starting development on a new board, the first thing you need to do is to get a memory map of the device. Because the board will contain sdram, sram, memory mapped peripheral IO etc... From board to board, the physical location of those elements will change. Here is the memory map for the realview-pb-a8

Physical Memory Layout

Physical address | Description
0x00000000-0x0FFFFFFF | SDRAM mirror
0x10020000-0x1005FFFF | Board specific stuff that I don't need just yet
0x10060000-0x1007FFFF | On board SRAM
0x10080000-0x6FFFFFFF | Board specific stuff that I don't need just yet
0x90000000-0xFFFFFFFF | Board specific stuff that I don't need just yet

A more detailed memory map can be found in the RealView Platform Baseboard for Cortex A8 User Guide.


Interrupt vector table

This architecture only uses 7 interrupt vectors:

0x00 | Reset
0x04 | Undefined Instruction
0x08 | Software Interrupt
0x0C | Prefetch Abort
0x10 | Data Abort
0x14 | (reserved)
0x18 | IRQ
0x1C | FIQ

The interrupt vector table must be placed at the beginning of memory. Each entry is 32 bits wide and must be an instruction, not an address. So you would typically put a branch instruction there to jump to the proper handler. Using qemu, my kernel gets loaded at 0x70010000, so putting the IVT at the beginning of my kernel would not work. I had to relocate the IVT to 0x70000000 once the kernel was running. By the way, on that board the SDRAM starts at 0x70000000 but is mirrored to 0x00000000. Qemu still starts execution at 0x70010000, but if the IVT is at 0x70000000, the CPU will see it at 0x00000000 because of the mirror.
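Since each vector has to hold an instruction, it can help to see how a branch word is built. The following C sketch (the function name is made up for the example, it is not from my source code) computes the encoding of an unconditional ARM-state "B" instruction placed at a given vector address. It relies on two facts of the ARM instruction set: the PC reads as the address of the current instruction plus 8, and the branch offset is stored as a signed 24-bit word count.

```c
#include <assert.h>
#include <stdint.h>

/* Encodes an unconditional ARM-state "B target" instruction that will be
 * stored at vector_addr. Hypothetical helper, for illustration only. */
uint32_t encode_branch(uint32_t vector_addr, uint32_t target)
{
    /* In ARM state the PC reads as the instruction address plus 8. */
    int32_t offset = (int32_t)(target - (vector_addr + 8u));

    assert((offset & 3) == 0);                            /* word aligned */
    assert(offset >= -(1 << 25) && offset < (1 << 25));   /* +/- 32 MB    */

    /* cond = 1110 (always), opcode = 101, L = 0, then imm24 */
    return 0xEA000000u | (((uint32_t)offset >> 2) & 0x00FFFFFFu);
}
```

For example, a branch stored at 0x08 that jumps to itself encodes as 0xEAFFFFFE. Because the offset is relative, the same word works whether the table is seen at 0x70000000 or through the mirror at 0x00000000, but the CPU will then reach the handler through the mirrored address range as well.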

Setting up the stack

There are 6 CPU modes in this architecture. Each mode will shadow the register r13 (stack pointer). So they each need their own stack. To set those stacks, you must switch mode and set r13 appropriately. I don't set the User mode stack because this will be done on a per-process basis and System mode uses the same registers as user mode.

    msr     CPSR_c,#0b11010001           // stack for FIQ mode
    ldr     r13,=STACK_BASE_FIQ
    msr     CPSR_c,#0b11010010           // stack for IRQ mode
    ldr     r13,=STACK_BASE_IRQ
    msr     CPSR_c,#0b11010111           // stack for Abort mode
    ldr     r13,=STACK_BASE_ABORT
    msr     CPSR_c,#0b11011011           // stack for Undefined mode
    ldr     r13,=STACK_BASE_UNDEFINED
    msr     CPSR_c,#0b11010011           // stack for Supervisor mode. And we will stay in that mode
    ldr     r13,=STACK_BASE_SUPERVISOR
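The immediates written to CPSR_c above are just the mode number in the low 5 bits plus the IRQ and FIQ disable bits. This small C sketch (the names are mine, not from the kernel source) rebuilds those values so the binary constants are easier to read.

```c
#include <assert.h>
#include <stdint.h>

/* Mode field values (CPSR bits 4:0) as defined by the ARM architecture. */
enum arm_mode {
    MODE_USER       = 0x10,
    MODE_FIQ        = 0x11,
    MODE_IRQ        = 0x12,
    MODE_SUPERVISOR = 0x13,
    MODE_ABORT      = 0x17,
    MODE_UNDEFINED  = 0x1B,
    MODE_SYSTEM     = 0x1F
};

#define CPSR_IRQ_DISABLE 0x80u  /* I bit */
#define CPSR_FIQ_DISABLE 0x40u  /* F bit */

/* Builds the low byte of the CPSR (the part written through CPSR_c). */
uint8_t cpsr_control(enum arm_mode mode, int mask_irq, int mask_fiq)
{
    assert(((unsigned)mode & ~0x1Fu) == 0);
    return (uint8_t)((mask_irq ? CPSR_IRQ_DISABLE : 0u) |
                     (mask_fiq ? CPSR_FIQ_DISABLE : 0u) |
                     (unsigned)mode);
}
```

cpsr_control(MODE_FIQ, 1, 1) gives 0xD1, which is the 0b11010001 used for the FIQ stack, and cpsr_control(MODE_SUPERVISOR, 1, 1) gives 0xD3, the last value written.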

Memory Management Unit

Creating a paged memory system is not difficult. The MMU offers a 2-level page table system. The level 1 table has 4096 entries, each mapping 1 MB of virtual addresses. You could create "section" entries to map those 1 MB regions to physical memory directly. You would then get pages of 1 MB and only one table that takes 16 KB in memory. But if you want 4 KB pages, those entries need to be "coarse table" entries, meaning that each entry references a subtable (a level 2 table). Each level 2 table contains 256 entries, each mapping 4 KB of memory. So for a 4 KB paging system you would have one level 1 table with 4096 entries (a total of 16 KB in size) and 4096 level 2 tables containing 256 entries each, for a total of 4 MB in size.
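To make the split concrete, here is a short C sketch of how a 32-bit virtual address is divided between the two table levels when using 4 KB pages. The struct and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* The three parts of a virtual address in a 4 KB paging setup. */
struct va_parts {
    uint32_t l1_index;   /* 0..4095, selects a 1 MB region            */
    uint32_t l2_index;   /* 0..255, selects a 4 KB page in the region */
    uint32_t offset;     /* 0..4095, byte within the page             */
};

struct va_parts split_virtual_address(uint32_t va)
{
    struct va_parts p;

    p.l1_index = va >> 20;             /* top 12 bits  */
    p.l2_index = (va >> 12) & 0xFFu;   /* next 8 bits  */
    p.offset   = va & 0xFFFu;          /* low 12 bits  */
    assert(p.l1_index < 4096u);
    return p;
}

/* Memory used by the tables, with 4-byte entries. */
uint32_t l1_table_bytes(void)  { return 4096u * 4u; }          /* 16 KB */
uint32_t l2_tables_bytes(void) { return 4096u * 256u * 4u; }   /* 4 MB  */
```

For instance, the address 0x70010ABC falls in level 1 entry 0x700, level 2 entry 0x10, at offset 0xABC in the page.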

Level1, Section
base addr | NS | 0 | nG | S | AP2 | TEX | AP | 0 | domain | XN | C | B | 1 | PXN

Level1, Page Table (TODO)
page table addr | 0 | domain | 0 | NS | PXN | 0 | 1

Level2, Small Page (TODO)
base addr | nG | S | AP[2] | TEX | AP[1:0] | C | B | 1 | XN
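As a sketch of how such descriptors can be assembled, the C helpers below build a level 1 "page table" entry and a level 1 section entry from the layouts above. The function names are hypothetical, and every attribute bit that is not a parameter (NS, nG, S, TEX, C, B, XN, PXN) is simply left at 0, which is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Level 1 "page table" entry: level 2 table address in bits 31:10,
 * domain in bits 8:5, type bits 1:0 = 01. */
uint32_t l1_page_table_entry(uint32_t l2_table_phys, unsigned domain)
{
    assert((l2_table_phys & 0x3FFu) == 0);   /* tables are 1 KB aligned */
    assert(domain < 16);
    return l2_table_phys | ((uint32_t)domain << 5) | 0x1u;
}

/* Level 1 section entry: physical base in bits 31:20, AP[1:0] in
 * bits 11:10, domain in bits 8:5, type bit 1 set. */
uint32_t l1_section_entry(uint32_t phys, unsigned domain, unsigned ap)
{
    assert((phys & 0xFFFFFu) == 0);          /* sections are 1 MB aligned */
    assert(domain < 16 && ap < 4);
    return phys | ((uint32_t)ap << 10) | ((uint32_t)domain << 5) | 0x2u;
}
```

l1_page_table_entry(0x00104000, 0) returns 0x00104001, which in little-endian byte order is the "01 40 10 00" that starts the level 1 table dump shown later in this article.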

Domains and permissions

The MMU has a concept of domains and access permissions. 16 access domains exist. In a page descriptor, we set the access bits and the domain associated with that page. CP15.register3 contains 2 bits for each of the domains 0 to 15. These bits determine how page access should be checked. Example: a page is associated with domain 12. CP15.register3 indicates that domain 12 is Client. Therefore the access permissions in the page will be checked. If domain 12 was set to Manager, the permissions would have been ignored.
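The value to load in the domain access control register is easy to compute: each domain owns two bits, where 0b00 means no access, 0b01 means Client and 0b11 means Manager. A minimal C sketch with made-up helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Two-bit access field for one domain in CP15 register 3. */
enum domain_access {
    DOMAIN_NO_ACCESS = 0x0,   /* every access faults               */
    DOMAIN_CLIENT    = 0x1,   /* page permissions are checked      */
    DOMAIN_MANAGER   = 0x3    /* page permissions are not checked  */
};

/* Returns dacr with the field for "domain" replaced by "access". */
uint32_t dacr_set(uint32_t dacr, unsigned domain, enum domain_access access)
{
    assert(domain < 16);
    dacr &= ~(0x3u << (domain * 2));
    dacr |= (uint32_t)access << (domain * 2);
    return dacr;
}

/* Reads back the access field of one domain. */
enum domain_access dacr_get(uint32_t dacr, unsigned domain)
{
    assert(domain < 16);
    return (enum domain_access)((dacr >> (domain * 2)) & 0x3u);
}
```

Setting domain 12 to Client on an otherwise empty register gives 0x01000000, and setting domain 0 to Manager gives 0x00000003.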

Initializing the MMU

The first thing you need to do is set up the page tables as mentioned above. Obviously, you will want an identity mapping for the region of code that is currently running the MMU initialization, so that the mapping does not change once the MMU is enabled.

Level 1 Page Table

70100000  01 40 10 00 01 44 10 00  01 48 10 00 01 4c 10 00
70100010  01 50 10 00 01 54 10 00  01 58 10 00 01 5c 10 00
70103fe0  01 20 50 00 01 24 50 00  01 28 50 00 01 2c 50 00
70103ff0  01 30 50 00 01 34 50 00  01 38 50 00 01 3c 50 00

Level 2 Page Tables (all contiguous)

70104000  fe 0f 00 00 fe 1f 00 00  fe 2f 00 00 fe 3f 00 00
70104010  fe 4f 00 00 fe 5f 00 00  fe 6f 00 00 fe 7f 00 00
70503fd0  f2 4f ff ff f2 5f ff ff  f2 6f ff ff f2 7f ff ff
70503fe0  f2 8f ff ff f2 9f ff ff  f2 af ff ff f2 bf ff ff
70503ff0  f2 cf ff ff f2 df ff ff  f2 ef ff ff f2 ff ff ff

Then you need to configure the CP15 registers.

CP15.reg2 | Translation Table Base Register | Load the base address of the Level1 table
CP15.reg3 | Domain Access Control Register | MCR p15, 0, Rd, c3, c0, 0 where Rd contains the 32 bits we want to write. We will use domain 0 for the kernel for now. So bits 1:0 will be set to 0b11 to allow access without checking permissions. All other domains will be set to 0b00 to unconditionally deny access.
CP15.reg1 | Control | Set bit 0 high to enable the MMU. Do this as the last step

The following registers are also useful but not needed during initialization:

CP15.reg5 | FSR | Read this in your fault handler. It is the fault code
CP15.reg6 | FAR | Read this in your fault handler. It is the faulty virtual address
CP15.reg8 | Invalidate TLB | Used to invalidate the TLB. Invalidate the entire TLB: mcr p15, 0, Rd, c8, c7, 0
CP15.reg10 | TLB Lockdown | Used to mark a TLB entry as persistent so it does not get overwritten by other entries. This can increase performance for pages such as those containing interrupt handling code, so that the translation is always cached. We could probably use a level 1 table entry as a 1 MB Section for the kernel and lock it down in the TLB.


I will not list the different reasons for getting a fault since this is all covered in the reference manual. Basically, if a fault occurs while prefetching an instruction, an Instruction Fetch fault will occur. If the fault occurs while accessing data, a Data fault will occur. The virtual address that caused the fault will be stored in FAR. A more detailed error code will be found in FSR.


To test the multitasking system, I have created some small programs that I build separately and package in the image. The kernel loads each program in its own process. Ideally, I would have liked to create a flash image that contains the kernel at the very beginning with the programs appended at the end, kind of like a real hard disk with a bootloader and programs. But I could never get qemu to emulate a flash file. The documentation for the board says that when the board is powered on, the flash is mapped at 0x00000000. This shadows the SDRAM. To use the SDRAM, you must remap the flash to some other place. But I could never make that work. I tried creating a flash image and providing it to qemu with the "-pflash" option, but that doesn't seem to work. Qemu always wants a kernel file to be provided. I don't know why I can't just put my code in a flat binary that would act as flash and get the code running from 0x00000000. The kernel file gets loaded at 0x70010000, which is in the SDRAM. So I am creating an image file containing the kernel and the programs, and it gets loaded in SDRAM by qemu.

Programs run as domain 1, and in user mode. Their virtual mapping is:

Level1 table entries (1 MB mapping) | Type | Permissions | Description
0 | Section | PL1 RWX, PL0 - | Kernel code. Identity mapping
1-FF | Section | PL1 -, PL0 - | Unmapped
100 | Page table | PL1 RW, PL0 - | Peripherals and SRAM. Identity mapping
101-1DF | Section | PL1 -, PL0 - | Unmapped
1E0 | Page table | PL1 RW, PL0 - | Peripherals. Identity mapping
1E1-1FF | Section | PL1 -, PL0 - | Unmapped
200-2FF | Page table | PL1 RWX, PL0 RWX | Process code
300-6FF | Section | PL1 -, PL0 - | Unmapped
700-8FF | Section | PL1 RWX, PL0 - | Kernel code. Identity mapping
900-EFF | Section | PL1 -, PL0 - | Unmapped
F00-FFF | Page tables | PL1 RW, PL0 RW | Process stack

The task information page

When creating a process, I add it to a list of processes. The list of processes is a fixed-size list in kernel memory (accessible by any process in privileged mode) that contains a pointer to the L1 table of the process and other useful information about the process. This information is used by the scheduler and is formatted like this:

0x0000 | Physical address of the process's L1 page table
0x0004 | saved r13_irq registers
0x0040 | Quantum count
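Expressed as a C struct, that layout could look like the sketch below. The struct and field names are invented for this example, and since the article does not say what sits between offsets 0x0008 and 0x0040, that range is shown as an opaque array. The kernel itself is written in assembly and only uses the offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of one entry in the process list. Addresses are
 * 32 bits wide on the target, so uint32_t is used instead of pointers. */
struct task_info_sketch {
    uint32_t l1_table_phys;           /* 0x0000: physical address of L1 table */
    uint32_t saved_r13_irq;           /* 0x0004: saved IRQ-mode stack pointer */
    uint8_t  unknown[0x40 - 0x08];    /* 0x0008: not described in the article */
    uint32_t quantum_count;           /* 0x0040: quantum count                */
};

/* Returns 1 when the compiler laid the struct out at the listed offsets. */
int task_info_layout_matches(void)
{
    return offsetof(struct task_info_sketch, l1_table_phys) == 0x00 &&
           offsetof(struct task_info_sketch, saved_r13_irq) == 0x04 &&
           offsetof(struct task_info_sketch, quantum_count) == 0x40;
}

/* Value the scheduler would load in the translation table base register. */
uint32_t task_l1_table(const struct task_info_sketch *t)
{
    assert(t != NULL);
    return t->l1_table_phys;
}
```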

Using Software Interrupts

When using the SWI instruction, you need to pass it a parameter that would normally be the number of the function you want to call. For example: SWI 0x02. Once you are in the SWI handler, you want to get that parameter to know where to dispatch the call. But the SWI instruction completely ignores the parameter. It is not given to you in any way when you get in your handler. In order to get it, you need to take r14, which contains the return address, and subtract 4. That gives you the address of the SWI instruction itself. So you can read that memory location and see what parameter was provided. That is pretty weird in my opinion. I would rather just put the function number in r0 before calling SWI and read r0 once in the handler. That would eliminate an unnecessary memory access. Plus, the page at that location will obviously be in the instruction cache, but since we are using a load instruction (such as LDR) to bring the instruction into a register to read it, we will be looking in the data cache. And the page will most probably not be in that cache. So in my project, I will only pass function parameters in a register and ignore the one provided to SWI.

Something that confuses me is that when calling SWI, you enter Supervisor mode. Then you are in privileged mode. That makes sense. But then r13 and r14 get shadowed. I'm not entirely sure why I would want that. It actually complicates things when multi-tasking. I guess that in some more complex OS designs, this is very useful.
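For reference, this is roughly what recovering the SWI parameter looks like, written as a C sketch rather than the assembly a real handler would use. The function names are made up, and it assumes the SWI was executed in ARM state, where the instruction holds a 24-bit immediate in its low bits.

```c
#include <assert.h>
#include <stdint.h>

/* Extracts the 24-bit immediate from an ARM-state SWI instruction word.
 * Bits 27:24 of a SWI are always 1111. */
uint32_t swi_number_from_word(uint32_t instruction)
{
    assert((instruction & 0x0F000000u) == 0x0F000000u);
    return instruction & 0x00FFFFFFu;
}

/* return_address is the value of r14 on entry to the handler. It points
 * to the instruction after the SWI, so the SWI itself is one word back. */
uint32_t swi_number(const uint32_t *return_address)
{
    return swi_number_from_word(return_address[-1]);
}
```

"SWI 0x02" assembles to 0xEF000002, so swi_number_from_word(0xEF000002) returns 2. The read of return_address[-1] is the extra data access discussed above.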


Saving registers of mode X from mode Y

Assume we have a function called schedule(). This function saves the current context and reloads the context of the next task to run. In my implementation, this function will always be called from IRQ mode. So the schedule function will be called from a non-user mode. The schedule function needs to store the user-mode context (registers r0 to r14). But from the non-user IRQ mode, registers r13 and r14 are shadowed. r0-r12 are the same as in user mode, so we need to find a way to save the r13 and r14 of user mode. For this, the stm/ldm instructions with "^" can be used to store/load the user-mode registers. This saves/loads r0-r12 as usual, but the r13 and r14 will be the ones of user mode.

CPSR and SPSR: While in an exception (therefore in a mode different than user or system) the previous cpsr is saved in spsr. Before returning back from the exception, you must reload spsr back into cpsr. This will change the mode automatically, re-enable interrupts etc. To load this and to load r14 in r15 at the same time, look at the notes below about the LDM instruction.

LDM instruction format: Compared to AVR32 and x86, this is pretty complicated in my opinion. The "ldm" instruction has 3 forms. The first form does what it says it does. But the second form, which is ldm Rn,registers_without_r15^ (yes, there is a "^" at the end), loads the user-mode registers while you are in a non-user mode. So it is a way to load user registers while they are shadowed. The third form, ldm Rn,registers_with_r15^, will automatically load spsr back into cpsr. You could also use a data-processing instruction with the "S" flag and r15 as the destination. For some reason, it will conveniently reload spsr back into cpsr at the same time... Go figure. For example: movs r15,r14 will reload r15 and also reload spsr back into cpsr. I am wondering why they re-purposed a flag like this. That is one small thing that makes me like x86 more than ARM.

Since I re-enable interrupts after entering SWI, the SVC context must also be saved, since a context switch could occur while in a service call (that is actually the whole point of re-enabling interrupts in SWI). So my context-switching code also pushes r13_svc and r14_svc on the task's IRQ stack.

Context switching

The schedule function needs to do the following:

  • save registers r0-r14 user-mode
  • save register r14_irq (since this will be done from the IRQ handler)
  • save register spsr (which is the usermode cpsr)
  • change level1 page table for new process
  • flush tlb (unless using ASID)
  • restore r0-r14 (for user-mode)
  • restore the return address in r14_irq
  • restore spsr
  • to return, load r14_irq in r15 and spsr into cpsr

Each task has its own stack and its own IRQ mode stack. When entering the schedule() function in IRQ mode, the user-mode registers are pushed onto the IRQ stack. The current spsr and r14_irq are also pushed on the IRQ stack. The r13_irq is then saved in a list somewhere. When the time comes to restore the task, the page tables are switched back to that task's page tables, and r13_irq is restored from the list. At this point, the task's IRQ stack has been restored. We can then pop everything from it and the context switch is done. Here is a sample of my schedule function:

    mrs     r0,SPSR
    // At this point, whole context is saved on stack

    // Determine what is the next task to run

    // store context: save r13_irq
    // r0 points to the entry in the process list (as described earlier)
    str     r13,[r0,#4]     // offset 0x04 is r13

    // load new page table
    ldr     r1,[r5]             //r5 points to the entry of the next task to run
    mcr     p15,0,r1,c2,c0,0
    // flush TLB (note that there are ways to avoid this)
    mov     r1,#0
    mcr     p15,0,r1,c8,c7,0

    //load r13_irq
    ldr     r13,[r5,#4]

    // now restore context on stack
    pop(r0)            // this is just SPSR, only reg available to touch is r14
    msr     SPSR,r0
    b       returnFromInterrupt

This is a sample only. My schedule() function does a bit more than that. But it gives you the general idea.

When r0-r14 are restored for user mode, the task's context is restored as it was before entering the IRQ that called schedule(). The r13 and r14 of user mode are restored, not the banked ones of the currently executing mode.

Note that the TLB must be flushed when reloading the "translation table base register" in CP15, because the cached TLB entries would continue to correspond to the previous mapping. This is a very expensive operation, but we can avoid it with the concept of ASID, using the CONTEXTIDR register. By setting a unique task ID in CONTEXTIDR, all page translations that get loaded in the TLB are tagged with that ID. When doing a lookup, the MMU ignores entries that do not match the current CONTEXTIDR. So on a context switch, you would change the ID in CONTEXTIDR. This creates duplicate entries in the TLB, but with different IDs. So instead of flushing the TLB, entries are removed only when the TLB is full. See this article for more information about the TLB and ASID.

The schedule() function is called by the timer IRQ handler. But you might want to call it from other places. For example, if a task wants to yield, it should be scheduled out immediately. I could do this in a SWI handler but trying to change context from the SVC mode brings up other challenges. So to keep things simple, I want to do context switches only from the IRQ mode. For this, it is possible to use a "software IRQ". This is well documented in the GIC documentation.


Here is my source code