Openflow Web ServerLast edited on Nov 14, 2016

To learn a bit about openflow, I wanted to try to build a fake web server that runs on an openflow controller. This idea came to mind when I first learned about DHCP servers running in controllers. So I wanted to try to build a web server to see if I understood the concept correctly

OVS setup with pox

For my test, instead of creating a TAP interface and attaching a VM to it, I am simply going to create a VETH pair so I can use that interface from the host.

Create a bridge and attach to controller to it:

$ ovs-vsctl add-br sw0
$ ovs-vsctl set-controller sw0 tcp:

Now create a VETH pair and add one side in the bridge

$ ip link add veth1 type veth peer name veth2
$ ip l set dev veth1 up
$ ip l set dev veth2 up
$ ip addr add dev veth2
$ ovs-vsctl add-port sw0 veth1

This is what the bridge would look like

$ ovs-vsctl show
    Bridge "sw0"
        Controller "tcp:"
        Port "sw0"
            Interface "sw0"
                type: internal
        Port "veth1"
            Interface "veth1"
    ovs_version: "2.5.1"

Launch the POX controller with the modules that replies to any ping requests (and any ARP queries). I am assuming POX is already installed.

$ ./pox.py proto.pong
$ ping -I veth2
PING ( from veth2: 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=45.5 ms
64 bytes from icmp_seq=2 ttl=64 time=10.1 ms


This is not meant to be a a full explanation about what openflow is. It is just enough to understand the power of SDN and what could be done with it. Openflow is the protocol that the switch and controller use to make SDN happen.

When a packet comes in an openflow enabled switch, the switch will look in it's flow tables to see if a flow matches the incoming packet. If no flow is found, the packet is forwarded to the controller. The controller will inspect the packet and from there, a few things can happen. The controller could instruct the switch to add a new flow in its tables to instruct it how to forward the packet. This flow would then be used for the current packet and since it will be added in the flow tables, all future packets that match the rule won't need to be sent to the controller anymore since the flow will be found. The controller could also remove flows obviously. But things get interesting here because instead of adding/removing flows, the controller can just decide to tell the switch to send a packet instead. The controller would construct a packet, and tell the switch to send it on a specific physical port. The initial packet will then be dropped.

So this means, that the controller could receive a packet, inspect it and determine: This packet came from port1 from MAC-a, for MAC-b, it is a TCP (therefore IP) packet from IP-a, for IP-b and it is a SYN packet. So the controller could decide to build another packet similar to the incoming one but reverse the MAC addresses, reverse the IP addresses, add a ACK bit (SYN/ACK), swap TCP ports, and play around with the SEQ/ACK numbers a bit. The controller would instruct the switch to send that packet and drop the initial one. By doing this, any TCP connection packet, regardless of the destination MAC, destination IP, destination TCP port, sent on the network would get acknowledged.

Naive web server construction

Taking the idea above, we only need to do 3 things to make a really dumb web server.

  • Respond to any incoming ARP queries with a dummy MAC address. It doesn't matter what MAC we hand out. As long as the client receives a reply and is happy about it.
  • Ack tcp connections going to the IP address that we received a ARP query for earlier. So the client will successfully establish a TCP connection.
  • Respond with a HTTP response when a TCP packet with a payload with a "GET" is received at the IP used previously.


Here we see the client ARP request being broadcast

07:47:22.622933 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has tell, length 28
    0x0000:  ffff ffff ffff 0a26 7cd0 bfdc 0806 0001  .......&|.......
    0x0010:  0800 0604 0001 0a26 7cd0 bfdc c0a8 040a  .......&|.......
    0x0020:  0000 0000 0000 c0a8 049d                 ..........

The controller wrote a reply by putting the destination MAC in the source field and writing a ARP reply payload

07:47:22.667931 ARP, Ethernet (len 6), IPv4 (len 4), Reply is-at 02:00:de:ad:be:ef, length 28
    0x0000:  0a26 7cd0 bfdc 0200 dead beef 0806 0001  .&|.............
    0x0010:  0800 0604 0002 0200 dead beef c0a8 049d  ................
    0x0020:  0a26 7cd0 bfdc c0a8 040a                 .&|.......

Then, the client is satisfied with the ARP resolution, so the tries to connect to that fake host. Sending a SYN

07:47:22.667948 IP (tos 0x0, ttl 64, id 46203, offset 0, flags [DF], proto TCP (6), length 60) > Flags [S], cksum 0x8a26 (incorrect -> 0x893c), seq 547886470, win 29200, options [mss 1460,sackOK,TS val 522625533 ecr 0,nop,wscale 7], length 0
    0x0000:  0200 dead beef 0a26 7cd0 bfdc 0800 4500  .......&|.....E.
    0x0010:  003c b47b 4000 4006 fc48 c0a8 040a c0a8  .<.{@.@..H......
    0x0020:  049d cb19 0050 20a8 1586 0000 0000 a002  .....P..........
    0x0030:  7210 8a26 0000 0204 05b4 0402 080a 1f26  r..&...........&
    0x0040:  a1fd 0000 0000 0103 0307                 ..........

The controller added a ACK flag in the TCP header, copied SEQ+1 into ACK, set SEQ to 0. It swapped dst/src MAC addresses in the ethernet header and the IP addresses in the IP header and swapped ports on the TCP header.

07:47:22.670347 IP (tos 0x0, ttl 64, id 27868, offset 0, flags [none], proto TCP (6), length 40) > Flags [S.], cksum 0xb231 (correct), seq 0, ack 547886471, win 29200, length 0
    0x0000:  0a26 7cd0 bfdc 0200 dead beef 0800 4500  .&|...........E.
    0x0010:  0028 6cdc 0000 4006 83fc c0a8 049d c0a8  .(l...@.........
    0x0020:  040a 0050 cb19 0000 0000 20a8 1587 5012  ...P..........P.
    0x0030:  7210 b231 0000                           r..1..

The client sent an ACK as the final step of the 3way handshake, but we don't care.

07:47:22.670376 IP (tos 0x0, ttl 64, id 46204, offset 0, flags [DF], proto TCP (6), length 40) > Flags [.], cksum 0x8a12 (incorrect -> 0xb232), seq 547886471, ack 1, win 29200, length 0
    0x0000:  0200 dead beef 0a26 7cd0 bfdc 0800 4500  .......&|.....E.
    0x0010:  0028 b47c 4000 4006 fc5b c0a8 040a c0a8  .(.|@.@..[......
    0x0020:  049d cb19 0050 20a8 1587 0000 0001 5010  .....P........P.
    0x0030:  7210 8a12 0000                           r.....

Now we are getting a HTTP GET request

07:47:22.671570 IP (tos 0x0, ttl 64, id 46205, offset 0, flags [DF], proto TCP (6), length 117) > Flags [P.], cksum 0x8a5f (incorrect -> 0x498f), seq 547886471:547886548, ack 1, win 29200, length 77
    0x0000:  0200 dead beef 0a26 7cd0 bfdc 0800 4500  .......&|.....E.
    0x0010:  0075 b47d 4000 4006 fc0d c0a8 040a c0a8  .u.}@.@.........
    0x0020:  049d cb19 0050 20a8 1587 0000 0001 5018  .....P........P.
    0x0030:  7210 8a5f 0000 4745 5420 2f20 4854 5450  r.._..GET./.HTTP
    0x0040:  2f31 2e31 0d0a 5573 6572 2d41 6765 6e74  /1.1..User-Agent
    0x0050:  3a20 6375 726c 2f37 2e32 392e 300d 0a48  :.curl/7.29.0..H
    0x0060:  6f73 743a 2031 3932 2e31 3638 2e34 2e31  ost:.
    0x0070:  3537 0d0a 4163 6365 7074 3a20 2a2f 2a0d  57..Accept:.*/*.
    0x0080:  0a0d 0a                                  ... 

So we change all source/destination, SEQ, ACK etc. and add a reponse payload.

07:47:22.673373 IP (tos 0x0, ttl 64, id 27871, offset 0, flags [none], proto TCP (6), length 122) > Flags [.], cksum 0xa53c (correct), seq 1:83, ack 547886471, win 29200, length 82
    0x0000:  0a26 7cd0 bfdc 0200 dead beef 0800 4500  .&|...........E.
    0x0010:  007a 6cdf 0000 4006 83a7 c0a8 049d c0a8  .zl...@.........
    0x0020:  040a 0050 cb19 0000 0001 20a8 1587 5010  ...P..........P.
    0x0030:  7210 a53c 0000 4854 5450 2f31 2e31 2032  r..<..HTTP/1.1.2
    0x0040:  3030 204f 4b0a 436f 6e74 656e 742d 5479  00.OK.Content-Ty
    0x0050:  7065 3a74 6578 742f 706c 6169 6e0a 436f  pe:text/plain.Co
    0x0060:  6e74 656e 742d 4c65 6e67 7468 3a37 0a43  ntent-Length:7.C
    0x0070:  6f6e 6e65 6374 696f 6e3a 636c 6f73 650a  onnection:close.
    0x0080:  0a41 7765 736f 6d65                      .Awesome

This is obviously completely useless and you would definitely not want to have a controller doing that on your network. But it proves how powerfull SDN can be. The controller has complete power over the traffic that goes on the switch. It could respond to DNS queries, ARP queries, ICMP etc.

Usefull things to do

Learning switch

If that web server is completely useless, one thing we could do is to inspect all incomming packets, and forward them to the right port. If the destination MAC is unknown, we can get it to flood on all port and then learn what MACs are on what port when receiving packets so that we can make a better decision next time. But that's kind of useless because that's literally what every switches do.


A more usefull thing to do is to look at the destination IP address of a packets and wrap the packet in a vxlan packet based on the IP that was in there. Then we would always forward those packets out of a specific port. You would then have created a kind of vxlan gateway. As a bonus: no need for vlans. Since the switch only wraps the packet and will always forward out of port 24 (for example). This can all happen on the default vlan.

My code

This is the code I wrote to make the web server. I am using what pox already had for the ARP portion of things so you need to copy this file in "pox/proto". One you run the controller, just do a HTTP GET to any address, any port, and you will get a response. download the code


Having a controller like this adds a very powerful tool in the network. It can also be very dangerous if someone has access to the controller with malicious intents. It becomes very easy to poison DNS queries without even having access to the DNS server. Or to simply build a MITM, because the controller is basically a MITM anyway.

Creating your own linux containersLast edited on Jul 10, 2016

I'm not trying to say that this solution is better than LXC or docker. I'm just doing this because it is very simple to get a basic container created with chroot and cgroups. Of course, docker provides much more features than this, but this really is the basis. It's easy to make containers in linux, depending on the amount of features you need.


cgroups are a way to run a process while limiting its resources such as cpu time and memory. The it works is by creating a group (a cgroup), define the limits for various resources (you can control more than just memory and CPU time) and then run a process under that cgroup. It is important to know that every child process of that process will also run under the same same cgroup. So if you start a bash shell under a cgroup, then any program that you manually launch from the shell will be constrained to the same cgroup than the one of the shell.

If you create a cgroup with a memory limit of 10mb, this 10mb limit will be applicable to all processes running in the cgroup. The 10mb is a constraint on the sum of memory usage by all process under the same cgroup.


On slackware 14.2 RC2, I didn't have to install or setup anything and cgroups were available to use with the tools already installed. I had to manually enable cgroups in the kernel though since I compiled my own kernel. Without going into the details (because this is covered in a million other websites) you need to make sure that:

  • cgroup kernel modules are built-in are loaded as modules
  • cgroup tools are installed
  • cgroup filesystem is mounted (normally accessible through /sys/fs/cgroup/)

Here's how to run a process in a cgroup

cgcreate -g memory:testgroup
# now the "/sys/fs/cgroup/memory/testgroup" exists and contains files that control the limits of the group

# assign a limit of 4mb to that cgroup
echo "4194304" > /sys/fs/cgroup/memory/testgroup/memory.limit_in_bytes

# run bash in that cgroup
cgexec -g memory:testgroup /bin/bash

Note that instead of using cgexec, you could also just write the current shell's PID into /sys/fs/cgroup///task. Then your shell, and whatever process you start from it, would execute in the cgroup.

Making your own containers

A container is nothing more than a chroot'd environment with processes confined in a cgroup. It's not difficult to write your own software to automate the environment setup. There is a "chroot" system call that already exist. For cgroups, I was wondering if there was any system calls available to create them. Using strace while running cgcreate, I found out that cgcreate only manipulates files in the cgroup file system. Then I got the rest of the information I needed from the documentation file located in the Documentation folder of the linux kernel source: Documentation/cgroups/cgroups.txt.

Creating a cgroup

To create a new cgroup, it is simply a matter of creating a new directory under the submodule folder that the cgroups needs to control. For example, to create a cgroup that controls memory and cpu usage, you just need to create a directory "AwesomeControlGroup" under /sys/fs/cgroup/memory and /sys/fs/cgroup/cpu. These directories will automatically be populated with the files needed to control the cgroup (cgroup fs is a vitual filesystem, so files do not exist on a physical medium).

Configuring a cgroup

To configure a cgroup, it is just a matter of writing parameters in the relevant file. For example: /sys/fs/cgroup/memory/testgroup/memory.limit_in_bytes

Running a process in a cgroup

To run a process in a cgroup, you need to launch the process, take its PID and write it under /sys/fs/cgroup///task My "container creator" application (let's call it the launcher) does it like this:

  • The launcher creates a cgroup and sets relevant parameters.
  • The launcher clones (instead of fork. now we have a parent and a child)
  • The parent waits for the child to die, after which it will destroy the cgroup.
  • The child writes its PID in the /sys/fs/cgroup///task file for all submodules (memory, cpu, etc)
  • At this point, the child runs as per the cgroup's constraints.
  • The child invokes execv with the application that the user wanted to have invoked in the container.

The reason I use clone() instead of fork, is that clone() can use the CLONE_NEWPID flag. This will create a new process namespace that will isolate the new process for the others that exist on the system. Indeed, when the cloned process queries its PID it will find that it is assigned PID 1. Doing a "ps" would not list other processes that run on the system since this new process is isolated.

Destroying a cgroup

To destroy a cgroup, just delete the /sys/fs/cgroup// directory

So interfacing with cgroups from userland is just a matter of manipulating files in the cgroup file system. It is really easy to do programmatically and also from the shell without any special tools or library.

My container application

My "container launcher" is a c++ application that chroots in a directory and run a process under a cgroup that it creates. To use it, I only need to type "./a.out container_path". The container_path is the path to a container directory that contains a "settings" files and a "chroot" directory. The "chroot" directory contains the environment of the container (a linux distribution maybe?) and the "settings" file contains settings about the cgroup configuration and the name of the process to be launched.

You can download my code: cgroups.tar


I've extracted the slackware initrd image found in the "isolinux" folder of the dvd.

cd /tmp/slackware/chroot
gunzip < /usr/src/slackware64-current/isolinux/initrd.img  | cpio -i --make-directories

Extracting this in /tmp/slackware/chroot gives me a small linux environment that directory and I've created a settings file in /tmp/slackware. Call this folder a "container", it contains a whole linux environment under the chroot folder and a settings file to indicate under what user the container should run, what process it should start, how much ram max it can get etc. For this example, my settings file is like this:

user: 99
group: 98
memlimit: 4194304
cpupercent: 5

And running the container gives me:

[12:49:34 root@pat]# ./a.out /tmp/slackware
Mem limit: 4194304 bytes
CPU shares: 51 (5%)
Added PID 9337 in cgroup
Dropping privileges to 99:98
Starting /bin/bash -i

[17:26:10 nobody@pat:/]$ ps aux
nobody       1  0.0  0.0  12004  3180 ?        S    17:26   0:00 /bin/bash -i
nobody       2  0.0  0.0  11340  1840 ?        R+   17:26   0:00 ps aux
[17:26:14 nobody@pat:/]$ ls
a  a.out  bin  boot  cdrom  dev  etc  floppy  init  lib  lib64  lost+found  mnt  nfs  proc  root  run  sbin  scripts  sys  tag  tmp  usr  var
[17:26:15 nobody@pat:/]$ exit
Exiting container


When cloning with CLONE_NEWNET, the child process gets a separate netwrok namespace. It doesn't see the host's network interfaces anymore. So in order to get networking enabled in the container, we need to create a veth pair. I am doing all network interface manipulations with libnl (which was already installed on a stock slackware installation). The veth pair will act as a kind of tunnel between the host and the container. The host side interface will be added in a bridge so that it can be part of another lan. The container side interface will be assigned an IP and then the container will be able to communicate wil all peers that are on the same bridge. The bridge could be used to connect the container on the LAN or within a local network that consists of only other containers from a select group.

The launcher creates a "eth0" that appears in the container. The other end of the veth pair is added in a OVS bridge. An ip address is set on eth0 and the default route is set. I then have full networking functionality in my container.

Settings for networking

bridge: br0    


Mem limit: 4194304 bytes
CPU shares: 51 (5%)
Added PID 14230 in cgroup
Dropping privileges to 0:0
Starting /bin/bash -i

[21:59:33 root@container:/]# ping google.com
PING google.com ( 56 data bytes
64 bytes from seq=0 ttl=52 time=22.238 ms
--- google.com ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 22.238/22.238/22.238 ms
[21:59:36 root@container:/]# exit
Exiting container

virtio driver implementationLast edited on Jun 6, 2016


Virtio is a standard for implementing device drivers on a virtual machine. Normally, a host would emulate a device like a rtl8139 network card, the VM would detect such a device and would load a driver for it. But this adds a lot of overhead because the host is emulating that device so it must translate the data it receives into a format that is understood by the real hardware. The more complicated the emulated device it, the more challenge it will be for the host to keep latency low.

Virtio solves this problem by letting the host expose a Virtio device. A virtio device is a fake device that will be used by the VM. Virtio devices are very simple to use compared to other real hardware devices. For example, a host may implement a Virtio network card. The VM would detect such a device and start using it as its network card. Of course, the end-user wouldn't really notice this. The simplicity of the device is seen by the device driver implementers.

So to use Virtio, the host must support it. Currenly, KVM does. Then the guest must install the appropriate device drivers. Virtio device drivers are included in the linux kernel already, so there is no need to download separate drivers. On windows, drivers must be downloaded separately.

Virtio can be seen as a two layer device architecture. The first layer is the communication layer between the host and the guest. This is how both exchange information to say "Here's a packet I want you to send on the real hardware" or "Here's a packet I just received from the real hardware". Note that the driver knows it is running in a virtual environment and can implement optimizations in that effect. But the rest of the OS, using the Virtio driver, doesn't know that. It only knows that it is using a network card with a driver like any other ones. Every Virtio device drivers communicate with the host using the same model. This means that the code for this layer can be shared between all Virtio drivers.

The second layer is the protocol used over the first layer. Every virtio device use a different protocol. For example, a virtio-net driver will speak differently than a virtio-block driver to the guest. But they would both convey the messages the same way to the host.


The reason I was interested in virtio was because my hobby operating system required some device drivers to work. I had already written an ATA driver and a netcard driver (rtl8139) but those are old devices and I wanted to learn something new anyway. By having implemented virtio drivers in my OS, I should, technically, be able to run my OS on any host that support virtio. I don't need to worry about developing several device drivers because different hosts support different hardware. If virtio because a widely accepted standard (maybe it is already), then my OS should be fine on all different hosts. Note that I will still need to implement several different drivers if I want to support my OS on real hardware. But running it in a VM for now is just fine.

My code

These are the drivers. Note that, without the full code of my OS, these drivers won't make much sense but I'm putting them here in case someone could use it as an example whenever trying to write such a driver.


Implementing the virtio drivers was very simple. I was able to do so by using only two sources of information:


I'm going to describe the implementation using pseudo-code and will skip some of the basic details. Things like PCI enumeration will be left out since it is out of the scope of this document.

Device setup

Pci enumeration

The first thing to do is to discover the device on the PCI bus. You will be searching for a device with vendor ID 0x1AF4 with device ID between 0x1000 and 0x103F. The subsystemID will indicate which type of device it is. For example, susbsystem ID 1 is a netcard. So after finding device on the PCI bus, you will obtain the base IO address and IRQ number. You can then proceed to attaching your device driver interrupt handler to that IRQ and setup the device using the IObase address..

    foreach pci_dev
        if pci_dev.vendor == 0x1AF4 && pci_dev.device >= 0x1000 && pci_dev.device <= 0x103F && pci_dev.subsystem == 1
            return [pci_dev.iobase, pci_dev.irq];

Init sequence

The device initialization is very well described in the spec so there is no need to go into much details here. Here is the sequence:

    //Virtual I/O Device (VIRTIO) Version 1.0, Spec 4, section 3.1.1:  Device Initialization

    // Tell the device that we have noticed it
    // Tell the device that we will support it.

    // Get the features that this device supports. Different host may implement different features
    // for each device. The list of device-specific features can be found in the spec

    // This is called the "negotiation". You will negotiate, with the device, what features you will support.
    // You can disable features in the supportedFeatures bitfield. You would disable
    // features that your driver doesn't implement. But you cannot enable more features
    // than what is currently specified in the supportedFeatures. 

    // Tell the device that we are OK with those features

    // Initialize queues


The init_queues() function will discover all available queues for this device and initialize them. These queues are the core communication mechanism of virtio. This is what I was refering as the first layer. I will go in more details about queues a bit later. For now, to discover the queues, You just need to verify the queue size for each queue. If the size is not zero, then a queue exist. Queues are addressed with a 16bit number.

    q_addr = 0
    size = -1
    while (size != 0)
        // Write the queue address that we want to access
        // Now read the size. The size is not the byte size but rather the element count.

        if (size > 0) init_queue(q_addr, size)

For each queue, you must prepare a rather large structure containing information about the queue and slots for buffers to send in the queue. The structure is created in memory (anywhere you want, as long as it sits on a 4k boundary) and the address will be given to the device driver. I find that the structure that is detailed in the spec is a bit confusing because the structure can't really be defined as a struct since it has many elements that must be dynamically allocated since their size depends on the queue size.

Field Format Size
Buffer Descriptors u64 address; u32 length; u16 flags; u16 next; queue_size
Available buffers header u16 flags; u16 index; 1
Available buffers u16 rings queue_size
Padding to next page byte variable
Used buffers header u16 flags; u16 index; 1
Used buffers u32 index; u32 length; queue_size

This is how I create the structure in memory

typedef struct
    u64 address;
    u32 length;
    u16 flags;
    u16 next;
} queue_buffer;

typedef struct
    u16 flags;
    u16 index;
    u16 rings[];
} virtio_available;

typedef struct
    u32 index;
    u32 length;
} virtio_used_item;

typedef struct
    u16 flags;
    u16 index;
    virtio_used_item rings[];
} virtio_used;

typedef struct
    queue_buffer* buffers;
    virtio_available* available;
    virtio_used* used;
} virt_queue;

init_queue(index, queueSize)
    u32 sizeofBuffers = (sizeof(queue_buffer) * queueSize);
    u32 sizeofQueueAvailable = (2*sizeof(u16)) + (queueSize*sizeof(u16));
    u32 sizeofQueueUsed = (2*sizeof(u16))+(queueSize*sizeof(virtio_used_item));
    u32 queuePageCount = PAGE_COUNT(sizeofBuffers + sizeofQueueAvailable) + PAGE_COUNT(sizeofQueueUsed);
    char* buf = kernelAllocPages(queuePageCount);
    u32 bufPage = buf >> 12;
    vq->buffers = (u64)buf;
    vq->available = (virtio_available*)&buf[sizeofBuffers];
    vq->used = (virtio_used*)&buf[((sizeofBuffers + sizeofQueueAvailable+0xFFF)&~0xFFF)];
    vq->next_buffer = 0;

    // Tell the device what queue we are working on

    // Now we have to tell the device what is the page number (of the physical address, not logical) of the structure
    // for that queue
    vq->available->flags = 0;

The communication layer

The way the driver talks to the device is by placing data in a queue and notifying the device that some data is ready. Data is stored in a dynamically allocated buffer. The buffer's physical address is then writen to the first free buffer descriptor in the queue. Buffers can be chained, but forget about that now (it will be usefull when you want to optimize). Then, you need to tell the device that a buffer was placed in the queue. This is done by writing the buffer index into the next free slot in the "available" array.

BTW: it's important to know that queue sizes will always be powers of 2. Making it easy to naturally wrap around, so you never need to take care of checking bounds.

    // Find next free buffer slot
    buf_index = 0;
    foreach desc in vq->buffers
        if desc.length == 0
            buf_index = index of this descriptor in the vq->buffers array

    // Add it in the available ring
    u16 index = vq->available->index % vq->queue_size;
    vq->available->rings[index] = buffer_index;

    // Notify the device that there's been a change
    OUTPORTW(queue_index, dev->iobase+0x10);

Once the device has read your data, you should get an interrupt. You would then check the "used" ring and clear any used descriptors in vq->buffers that are referenced by the "used" ring (ie: set lenght back to 0)

To receive data, you would do it almost the same way. You would still place a buffer in the queue but you would set its lenght to the max size that you are expecting data (512bytes for a block device for example, or MTU for a net device). Then you would monitor the "used" ring to see when the buffer has been used by the device and filled up.

The transport interface

With this information, you should be able to write a generic virtio transport layer that provides 3 functions:

  • init()
  • send_buffer()
  • receive_buffer()

The virtio-net implementation

MAC address

The MAC address can be found in the 6 bytes at iobase+0x14..0x19. You must access those bytes one by one.

To send a packet out, you need to create a buffer that contains a "net_header" and the payload. For simplicity, we'll assume that no buffer chaining is done. So sending a packet would be done like this:

typedef struct
    u8 flags;
    u8 gso_type;
    u16 header_length;
    u16 gso_size;
    u16 checksum_start;
    u16 checksum_offset;
} net_header;

    char buffer[size+sizeof(net_header)];
    net_header* h = &buffer;
    h.gso_type = 0;
    h.checksum_start = 0;
    h.checksum_offset = size;

To receive packets, just fill up the rx queue with empty buffers (with lenth=MTU) and set them all available. It's important to set them back available after you received data in it (ie: they've been added in the used ring) since you want to keep the queue full of ready buffers at all time.

I didn't talk about buffer chaining (it's very simple, and well described in the spec) but you should obviously use that. You could use one buffer for the header and another one for the data. You could use the address of the data buffer supplied by the calling function in the descriptor directly (as long as you convert it to physical address) instead of copying the entire frame. This allows you to implement a zero-copy mechanism.

The virt-block implementation

Block devices are similar to net device but they use one queue only and instead of a net_header, they use a block_header

typedef struct
    u32 type;
    u32 reserved;
    u64 sector;
} block_header;

To write, fill the header with type = 1, sector = sector number to write. Followed by the 512 bytes of data and send the buffer. To read, fill the header with type = 0, sector = sector number to read. Followed by a 512 bytes empty buffer. The device will fill the buffer and will put the buffer descriptor in the used ring.

I think you need to separate the header and the data buffer into 2 descriptors that are chained. That's the way I did it anyway, but I think I read that it won't work if you don't do that.


This was a very rough explanation of virtio but it should be enough to get you started and have something working. Once this is done, I suggest going through the specs again since it has a lot of information what will be needed for handling failure scenarios, optmization and multi-platform support. The driver I wrote works for my OS only and has only been tested with KVM. I am not doing any real feature negoiation nor am I handling any failure cases. Things could surely be optmized also since virtio allows very easy zero-copy buffer passing.

Networking in my OSLast edited on Apr 14, 2016

Designing my OS

When I design and code my operating system, I usually document what I do in a couple markdown files. I thought maybe some people would find it usefull to see them so I am posting one here. This one is about the networking architecture of my OS. Note that this is really the product of my imagination so it might not be (or should I say it definitely isn't) the best way of doing things. It's just the way I imagined it and it actually works.

Of course, I can't pretend this is only the product of my imagination. I did get a lot of inspiration from many sources online. But I tend to stay away (as much as I can) from the linux source code. Since I think Linux is such a great implementation, it would be too easy to assimilate a lot of its concepts while looking at it. Because once you look at it, you realize that what you're looking at is probably the best way to do things. Take sockets for example. Once you are used to using sockets with open/send/recv/close, how else would you do it? So I ended up doing that because it's what I know. But I have no idea how linux works behind the socket implementation, so I tried to design it the way I think it works. I'm not looking at conceiving the best thing, I'm looking at picking my brain and create something that works the way I imagined it. Maybe like art instead of science.

There are flaws and still open questions but it does work. I don't have any code to post here because I am thinking on either posting the whole code on my website or on github. I have to sort that out.

Network card drivers

The netcard abstration layer is contained in netcard.c.

When the OS boots, net_init() is called. This function iterates through the PCI bus's devices to find all devices that match one of the OS's netcard drivers. For each netcard found, a struct NetworkCard is instanciated. The structure contains several function pointers that are initialized with the matching driver's functions. For example, NetworkCard::receive would be set to &rtl8139_receive if the netcard was a rtl8139. Each of these NetworkCard instances will be refered to as the "netcard implementation"

Only the rtl8139 driver is implemented right now. You can view the source code I have posted a while ago in Realtek 8139 network card driver

When a netcard IRQ occurs, a softIRQ is scheduled. (I know, softIRQ is a linux concept. But I actually thought about it before knowing it existed in linux. I didn't have a name for it so I found that Linux had a great name so I stole it. But it doesn't really work like linux's softIRQ. Linux has a way better way of doing it than my OS). When the softIRQ handler is invoked, it calls net_process(). net_process() iterates through all netcard that was discovered during boot. It then checks if data is available and forwards the the data up to the TCP/IP stack if the data is an IP packet, or to the ICMP stack if the packet is of the ICMP type.

TCP/IP stack

Receiving data

The ICMP handler responds immediately to ping requests. Therefore the ICMP response is sent from the softIRQ thread. This allows consistent RTT. The IP handler forwards the message to the TCP or UDP handlers. The UDP handler is not implemented yet. The TCP handler forwards segments to active sockets by finding the socket listening to the port and ip of the message. This is done by by finding the socket instance within a hash list. The message is added in the socket's queue. The user thread is then responsible for dequeuing those messages.

Sending data

The netcard abstraction's net_send() function locks the netcard implementation's send spinlock. This way, only one thread can send to one given netcard at the same time.

net_send() takes an interface index as destination parameter. ip_send takes an ip address as destination parameter. ip_send invokes ip_routing_route() to determine on what netcard to send the message based the destination address.

net_send() will send 1 frame of maximum MTU size. It returns 0 if sending failed or size of frame if sending suceeded. Frames are guaranteed to be sent in full or not at all but never partly.

ip_send() will send a packet of maximum 65536 bytes. It will do ip fragmentation in order to send frames of appropriate size.

tcp_send() will send 1 segment of maximum 64k. This means that the underlying ip_send() can end up calling net_send several times.

There is actually a design flaw with tcp_send right now. Sending data on a socket will end up in netcard_send being called from the user thread. The call is thus blocking.

Also, If net_send() returns 0, because of HW buffer overflow, then ip_send will return the number of bytes sent. But tcp_send should not attempt to send the rest of the data because the IP protocol expects the rest of the packet.

IP routing

IP routing is a whole subject itself. There are many algorithms and many ways to do this. It's actually the center of a very big industry. Basically, when a packet needs to be sent out, we know destination IP to which we want to send. The OS needs to know out of with netcard (if there are more than one) to send the packet and it also needs to know the source IP address to put in the packet. Normally, there would be one IP per netcard. Linux allows you to setup multiple IP for one physical netcard, so this would be done by creating virtual netcards. So the rule would still hold: 1 IP per netcard. My OS does not support virtual netcards, so only one IP per physical netcards.

This is all done with a routing table. By default, the routing table will consist of an entry for each IP configured IP for the host. Those route would be something like "anything within my own LAN must be forwarded out on my netcard". So if netcard0 has IP and netcard1 has IP then the routes would be:

  • -> netcard with IP (netcard0)
  • -> netcard with IP (netcard1)

A route can either say:

  • if destination IP matches my subnet, then route out of interface X
  • if destination IP matches my subnet, then route out to GW with IP address X

To match a rule, the rule's netmask is applied to the destination IP and the rule's IP. These masked IPs are subnets. If both subnets match, then the rule matches. A rule of would be a default route and would guarantee to always match. It would be the last resort route.

Usually, routing out of an interface is done when we expect a device connected on that interface (through a switch probably) to be able to recognize that destination IP. When routing to a GW IP, it is because we expect a specific device to know where to find that destination IP. This would be a router. A default route would be configured like that. If an outgoing packet's destination IP address matches the default route, it means it couldn't match any of the other routes. So we want to send that packet to a gateway. We expect the gateway to be able to route the packet because our own routing table can't.

So the destination subnet is verified in order to find a route for that subnet. The subnet is the IP address logically AND'ed with the netmask.

Then a route for the default gateway must be added. This will tell: "Anything not matched in the LAN IPs must be sent to this IP". So something like: -> So if something needs to be sent to then that route would be matched. The OS would then compare the subnet (using the netmask) of that gateway with the IPs configured on the host. It would find that this IP (of the gateway) is on the same LAN than so netcard1 will be chosen as the outgoing netcard and the destination MAC address of the gateway will be used. So it is clear that using a gateway that reside on a subnet that doesn't match any of the host's IPs would be an invalid configuration.

Frame forwarding

At the layer 2, devices don't know about IP addresses. They only know about MAC addresses. Only devices connected together on the same layer 2 network (physically connected with switches) can talk to each other. When a packet is ready to be sent out, its MAC address might not be known yet. The OS will discover it by sending an ARP request: "who has IP". Then, that device will answer "Me, and my MAC is xx:xx:xx:xx:xx:xx". So that destination MAC address will be used. When sending a packet to an external network, that is behind a router, such as the example of earlier, the destination MAC of the gateway will be used. So if we send a packet to a device (by selecting its MAC as the destination) but the destination IP does not match that device's IP, it means we are expecting that device to be able to do layer 3 routing. Such a device would generally be a router. But it could also be any workstation on the LAN. Those workstations could be able to route. Linux does it.

Blocking operations

All socket operations such as connect, accept, send, receive and close are non-blocking. This implies that the lower-level layer operations are non-blocking also. There are some exceptions to this, but only because there are design flaws that need to be addressed.

Problem Upon connecting a socket, ARP discovery and DNS resolution might need to be performed. DNS resolution will be left to the user to do. connect() will require an IP address. Upon connect, the socket will be placed in "arp discovery" state. The MAC address will be fetched from the cache or a discovery will be sent. Upon receiving an ARP reply, or any other ARP cache entry addition, the stack will search through the socket list for sockets that are waiting for that entry. The connection will then be resumed. Once the an ARP entry is found for a socket, it will be saved in the socket and reused for all remaining transmission.


At any time, only two threads can access the tcp/ip stack: The softIRQ thread, for processing incomming frames, and the owner of the socket. If multiple threads want to share socket usage, they will have to implement their own locking mechanism. The tcp/ip stack will only guarantee thread safety between the two threads mentioned above. Sockets should be used by one thread only in a user application. Two different sockets can be used by two different threads at the same time though.

net_send() locking

net_send() is used by the softIRQ context and user threads. Since the softIRQ context has high priority, it means that if a thead is preempted while it was holding the netcard's send spinlock, and then the softIRQ attempts to request the lock, a deadlock might occur. On uni-CPU systems, a deadlock would occur because the softIRQ will never give CPU time to other threads until it has completed its work. This could also happen on systems where the number of netcard is greater than or equal to the number of CPUs.

To solve this problem, the spinlock will disable interrupts.

  • Spinlock will prevent another CPU from accessing the send function
  • The thread would not be preempted on the local CPU so there is no chances that a softIRQ would deadlock (since softIRQ are prioritized over that thread, it could continue to run and never give time to the thread to release the lock).
  • On a single-CPU system, the interrupt cleared and spinlock would be redundant but would not cause a problem

Socket locking

No locking is currently done at the socket level. The following is a list of problems that would arise, and the associated solution

Problem 1
  • Make sure a thread does not delete its socket while softIRQ is using it.
  • A thread might want to get the seqnumber while the softirq is modifying it.
  • A thread might read in the receive queue while softIRQ is writing in it

Problem 2
Make sure that only one consumer at a time can append/remove from the socket list.

The hash list is thread safe so it should already be handled correctly. But more tests need to be done because there is a little more to it than just accessing the hash list.

Problem 3
Socket backlog list might get accessed by softIRQ and owning thread


Accepting incomming connections

A socket is created using create_socket(). Then the socket is placed in listening mode with listen(). listen() will set the source ip/port of the socket as per parameters of the function and destination ip/port to 0. A backlog buffer is created with size=(sizeof(socket*)*backlogCount).

When a segment is processed by tcp_process (in softirq context), a socket will try to be matched. If no socket is found then tcp_process() will try to find a listening socket that uses the same source ip/port (matching the received segment's destination fields) and with destination ip/port set to 0. if a socket is found, then we know that this segment is for an incomming connection.

The listening socket will only process SYN segments and will ignore any other segments. When processing a SYN segment, it will create a new socket with the same source ip/port and with destination ip/port matching the source of the incomming segment. The state will be set to CONNECTING. The new pending socket will be saved in the listening socket's backlog. The new socket will stay in the backlog until it gets picked up by accept(). accept() will then move the socket to the main list. The socket created in the backlog is only temporary. accept() will create a new socket instance based on that socket so that the new instance will reside in the accepting process's heap.

When the accept function is called, it will go through the backlog of the listening socket and will finish creating the socket. It will clone the socket and create the receive buffer and send the syn/ack. The socket will stay in "CONNECTING" state until it receives the ack of the syn/ack accept will move the socket from the backlog to the main list.

Other articles about my OS

Process Context ID and the TLB
Thread management in my hobby OS
Enabling Multi-Processors in my hobby OS
Block caching and writeback
Memory Paging
Stack frame and the red zone (x86_64)
AVX/SSE and context switching
Realtek 8139 network card driver

ESP8266 based Irrigation ControllerLast edited on Apr 8, 2016


In the past couple of months, I've taken interest in the popular ESP8266 wifi enable uC. I've also wanted for a long time to start designing PCBs, so I decided to mix both interests and build an irrigation controller. I am no expert in PCB design nor any electronics for that matter. I'm a software guy. So this whole PCB design is done in a very amateur way and some of the experts out there might find many flaws in my design. But the important thing is that it does work as expected and most of all: I had a lot of fun and learned a lot of things while doing this. That project may look like a failure to some, but it is definitely a victory for me since I ended up learning things. And next time, I'll learn even more until I get good at it. That's the process I went through to learn about software design. I have to fail once before taking on the theory.


An irrigation controller is a pretty simple device. It's just a couple of relays, a microcontroller and a power supply. For my project, I decided to use 4 relays (to control 4 valves). The uC I chose is a esp8266, more specifically, the ESP-03 board. This brings wifi capabilities to my project. So the esp8266 connects to my network and with a multicast message, it advertises its presence. My New Home Automation system then picks it up and connects to it. From that point, it can control all 4 valves.

Part list

  • My custom built PCB
  • esp-03
  • 4 G5LE-1A4 relays
  • 1 LM2596
  • capacitors (see diagram for values)
  • resistors (see diagram for values)
  • diodes (see diagram for values)
  • uf.l SMD connector
  • 2.4ghz adhesive antenna
  • 4 MPS2907A transistors (PNP)

Power supply

The ESP8266 needs a 3.3 VDC input and my valves need 24VAC. So I decided to supply my board with 24VAC, supply the valves (through the relays) directly and convert to 3.3vdc before the ESP8266 input. I had very limited experience in designing a power supply before this project. I still do. But I figured that using a diode and a capacitor, I could easily convert VAC to VDC. I then used a switched regulator to convert the high VDC down to 3.3v. I used the LM2596T regulator because I already had some pre-built power supplies that had that on it and found that they were working pretty well. The choice of capacitor to use was confusing. Looking online, I found many people using different values than what the datasheet suggests. So I experimented with a breadboard until I found a combination that worked (I know....). So bascially: 24VAC gets converted to something like 33VDC after going through 1 diode and a capacitor to smooth it out. Then Through a LM2596 to step down to 3.3V.


I wasn't expecting to use an external antenna at first. But when trying my board for the first time, I saw that it couldn't connect to wifi. When I designed the PCB, I added a trace to a SMD uf.l connector. But at that time, I had no connector nor any antenna on hand. So I was wondering if that trace could be the problem. So I cut it. But it still wasn't working. So I finally bought a connector and an antenna, soldered a wire on the board because the trace had been cut. And then everything worked out great. So yeah, you need an external antenna. The small ceramic antenna on the esp-03 won't do the job. Possibly because I put it too close to the power supply on my board? Or because I have no ground plane? I have no idea, that's something I'm gonna have to investigate.


The choice of PNP transistors was because the ESP8266 GPIO are set high on boot. So I wanted my relays to stay opened during that time. I keep all GPIOs high when valves must be shut off, and pull the GPIO down when a valve needs to be opened. I did a lot of searching about where to put the load (my relay) in the circuit. Do I put it after the collector of before the emitter? Apperently it's better to put it after the collector. I still don't understand why though. So I still have some reading to do. But it works.


I've already talked about the basics of the ESP8266 in my Led strip controller using ESP8266. article. It is using my custom made protocole "DWN protocol" to speak with my home automation software. So after my home automation software is connected to the irrigation controller, it sends 2-bytes commands to activate a valve or to turn it off. The firmware takes care of a couple of things like:

  • Hard limit of 30 minutes "on" time for every valve. In case you forget to shut it off. If longer period is needed, then the controlling software will need to take care of it.
  • Only 1 valve at a time can be opened. If the controlling software opens a valve, the irrigation controller will shut off any other opened valves.
  • No valve sequence support. For example: V1 on for 10min, then V3 for 5min and then V2 for 1min. This needs to be done by the controlling software.

The sources:


So far the biggest flaw I found is that I mixed up the RX/TX pins of the FTDI header. In eagleCAD, the part I was using for the header had to have its RX connected to the uC's RX pin. I didn't know that, I thought I had to reverse them. If I had looked at the FTDI pinout, I would have noticed it. Lesson learned. But it's not a big deal, I made an adapter that reverses both pins.

Also, like I said earlier, the wifi doesn't work without an external antenna. No big deal, I can use an external antenna. But I added that SMD footprint like 5 min before sending to the fab. I'm pretty glad I did that. The lesson learned here: anticipate for errors: put extra stuff in case you need it. That's obviously not good thinking for production but for a hobbyist like me, it's a good thing to do. Noticed the "spare" capacitor foot print on the board? hehehe.


I've seen a lot of amateur projects like mine on sites like Hackaday and a lot of people always make the same comments so I figured I'd answer those questions immediately.

  • Why did you do it this way instead of that way? Because I felt like doing it that way.
  • Why not use this and that instead? Because that's not what I felt like playing with.
  • Isn't this a fire hazard? I don't know! Is it?!?!?! Please tell me at youjustbuiltadeathtrapyoulunatic@dumaisnet.ca
  • Why not use an Insteon EZFlora? I did. But I wanted to build an irrigation controller. So now my EZFlora is useless