Implementing HTTP Digest Authentication
Last edited on Aug 15, 2014

Recently, I was trying to add HTTP digest authentication to my home automation device. The device exposes a REST interface through a proxy server. My web server is set up like this:

Now since my API is exposed to the world by proxying it like that, I wanted to add security by implementing HTTP digest authentication. Whether Digest authentication with MD5 is secure is a completely different story, but let's assume it is good enough for now. I have a restricted-access web page that I use to control my home automation device. This web page makes requests to DHAS using JavaScript. Since I've implemented digest authentication, I now need to put the credentials in the JavaScript so that the calls made with XMLHttpRequest can succeed. Even though that JavaScript code will only be served to me, while I am authenticated on the website, I felt uncomfortable leaving a hardcoded username and password in the JS source. So this is what I came up with:

Note that messages sent from JS to DHAS are being proxied by Apache. Therefore, DHAS receives a GET for /insteon/listmodules and not for /dhas/insteon/listmodules

  • Use XMLHttpRequest to make a request to DHAS (through the proxy)
  • Add a header "X-NeedAuthenticationHack" to the request
  • Receive a 401
  • Get the "X-WWW-Authenticate" header from the 401 response
  • Make an XMLHttpRequest to the web server and send it the "X-WWW-Authenticate" data
  • A server-side PHP script with the hardcoded username/password for DHAS solves the challenge and returns the response
  • Use XMLHttpRequest to make the request to DHAS again (through the proxy), with the response placed in an "Authorization" header

So basically, I just intercepted the 401 and, instead of letting the browser prompt for a username and password, I created the response myself. And instead of doing it in the JS, I did it on the server, limiting the exposure of the username/password. You may notice my two special X headers. This is because if the server returns a 401 with WWW-Authenticate, the browser will prompt for your credentials, even if I have a handler defined to get the 401. So when I send my initial request, I set the X-NeedAuthenticationHack header to tell the server: "Hey, don't send me a WWW-Authenticate, send an X-WWW-Authenticate instead so I can deal with it".

By the way, even if the information is easy to find, this is how the digest authentication is done:

  • Client makes request to http://webserver.com/url1/index.html
  • Server replies with a 401 and the header: WWW-Authenticate: Digest realm="testrealm", nonce="testnonce"
  • ha1 = md5("username:testrealm:password")
  • ha2 = md5("GET:/url1/index.html")
  • ha3 = md5(ha1+":testnonce:"+ha2)
  • Client sends: Authorization: Digest username="username", realm="testrealm", nonce="testnonce", uri="/url1/index.html", response="<value of ha3>"
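
The last step of the list above can be sketched in C. This is only an illustration: the function names digest_join3 and digest_format_header are made up for this example, and the MD5 computation itself is not shown (the response string is assumed to come from whatever MD5 routine you have at hand).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds the "a:b:c" string that gets hashed, e.g. username:realm:password
 * for ha1. Returns the length written, or -1 if `out` is too small.
 * (Hypothetical helper, not part of any library.) */
int digest_join3(char *out, size_t out_size,
                 const char *a, const char *b, const char *c)
{
    assert(out != NULL && out_size > 0);
    int n = snprintf(out, out_size, "%s:%s:%s", a, b, c);
    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}

/* Formats the Authorization header for the simple digest scheme described
 * above (no qop). `response` is the hex MD5 string called ha3 in the list.
 * Returns the header length, or -1 if `out` is too small. */
int digest_format_header(char *out, size_t out_size,
                         const char *username, const char *realm,
                         const char *nonce, const char *uri,
                         const char *response)
{
    assert(out != NULL && out_size > 0);
    int n = snprintf(out, out_size,
                     "Authorization: Digest username=\"%s\", realm=\"%s\", "
                     "nonce=\"%s\", uri=\"%s\", response=\"%s\"",
                     username, realm, nonce, uri, response);
    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}
```

Both helpers refuse to truncate: a header that does not fit is reported as an error instead of being sent half-formed.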

Stack frame and the red zone (x86_64)
Last edited on Mar 18, 2014

After days, and days, and days of troubleshooting odd problems in my homebrew x86_64 OS, I found out that they were caused by my C compiler. After spending all these years arguing with everyone that it is easier to make a hobby OS in pure assembly, I decided to make this OS with asm and C, and I just proved to myself that C is evil! Seriously, C is not evil, but it does hide a lot of things, which makes it hard to know what your OS does.

So after disassembling all my C code and inspecting the assembly code, I found this:

55                push rbp
4889E5            mov rbp,rsp
48897DE8          mov [rbp-0x18],rdi
488975E0          mov [rbp-0x20],rsi
488955D8          mov [rbp-0x28],rdx
C745FC00000000    mov dword [rbp-0x4],0x0
C9                leave
C3                ret

Notice how the function never decreases the stack pointer? The arguments are stored below the stack pointer. Can you imagine how insane that is? What would happen if an interrupt triggered while in that function? The correct code would be:

55                push rbp
4889E5            mov rbp,rsp
4883EC28     ---> sub rsp,byte +0x28 <---
48897DE8          mov [rbp-0x18],rdi
488975E0          mov [rbp-0x20],rsi
488955D8          mov [rbp-0x28],rdx
C745FC00000000    mov dword [rbp-0x4],0x0
C9                leave
C3                ret

It turns out that this behavior is normal according to the AMD64 ABI. There is a thing called the "red zone". The red zone is a 128-byte area that is guaranteed to be left untouched by signal and interrupt handlers (I'm not sure how, though). To quote the ABI:

The 128-byte area beyond the location pointed to by %rsp is considered to be reserved and shall not be modified by signal or interrupt handlers. Therefore, functions may use this area for temporary data that is not needed across function calls. In particular, leaf functions may use this area for their entire stack frame, rather than adjusting the stack pointer in the prologue and epilogue. This area is known as the red zone.

So my solution was to just disable that damn red zone with the gcc flag "-mno-red-zone". I'm guessing that the compiler does that to improve performance because it assumes that your code will be running in ring 3, so when an interrupt occurs, the stack will change because the handler will run in ring 0. Yeah sure, it will improve performance because there is one less instruction in the code, but I think that's a huge assumption to make. It definitely isn't the case when you are writing kernel code anyway.

AVX/SSE and context switching
Last edited on Mar 18, 2014

This article describes the way I designed AVX/SSE support in my homebrew OS.

AVX registers

In long mode, there are 16 XMM registers. These registers are 128 bits wide. With AVX, these registers are extended to 256 bits and named YMM. The YMM registers are not new registers, they are only extensions. YMM0 is to XMM0 what AX is to AL, meaning that XMM0 represents the lower 128 bits of the YMM0 register.

The XCR0 register selects which processor states the XSAVE and XRSTOR instructions may save and restore. The way to set bits in XCR0 is by using the XSETBV instruction. These bits represent feature sets.

  • 0b001: FPU feature set. Will save/restore content of FPU registers
  • 0b010: XMM feature set. Will save/restore all XMM registers (128bit)
  • 0b100: YMM feature set. Will save/restore upper half of YMM registers

Since YMM registers are 256-bit registers, and since the XMM registers alias the lower 128 bits of the YMM registers, it is important to enable both bit 2 and bit 1 in order to save the entire content of the YMM registers.

Enabling AVX support

  • Enable monitoring of media instructions so that #NM is generated when CR0.TS is set: CR0.MP (bit 1) = 1
  • Disable coprocessor emulation: CR0.EM (bit 2) = 0
  • Enable the fxsave instruction: CR4.OSFXSR (bit 9) = 1
  • Enable #XF instead of #UD when a SIMD exception occurs: CR4.OSXMMEXCPT (bit 10) = 1
  • Enable XSETBV: CR4.OSXSAVE (bit 18) = 1
  • Enable FPU, SSE, and AVX processor states: XCR0 = 0b111
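mov     %cr0,%rax
or      $0b10,%rax      // CR0.MP = 1
and     $~0b100,%rax    // CR0.EM = 0
mov     %rax,%cr0

mov     %cr4,%rax
or      $0x40600,%rax   // OSFXSR, OSXMMEXCPT and OSXSAVE
mov     %rax,%cr4

mov     $0,%edx
mov     $0b111,%eax
mov     $0,%ecx
xsetbv                  // XCR0 (selected by ecx=0) = edx:eax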
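
To double-check the magic numbers used above, here is a small C sketch that rebuilds them from the bit positions in the list. The constant and function names are mine, not from any header, and the functions only compute values; they do not touch the real registers.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the list above (names are illustrative). */
enum {
    CR0_MP         = 1 << 1,
    CR0_EM         = 1 << 2,
    CR4_OSFXSR     = 1 << 9,
    CR4_OSXMMEXCPT = 1 << 10,
    CR4_OSXSAVE    = 1 << 18,
    XCR0_FPU       = 1 << 0,
    XCR0_SSE       = 1 << 1,
    XCR0_AVX       = 1 << 2
};

/* XCR0 bit 0 must always be set, and the AVX bit requires the SSE bit. */
int xcr0_is_valid(uint64_t v)
{
    if (!(v & XCR0_FPU)) return 0;
    if ((v & XCR0_AVX) && !(v & XCR0_SSE)) return 0;
    return 1;
}

/* New CR0 value: set MP, clear EM, keep everything else. */
uint64_t cr0_for_avx(uint64_t cr0)
{
    return (cr0 | CR0_MP) & ~(uint64_t)CR0_EM;
}

/* New CR4 value: set OSFXSR, OSXMMEXCPT and OSXSAVE. */
uint64_t cr4_for_avx(uint64_t cr4)
{
    return cr4 | CR4_OSFXSR | CR4_OSXMMEXCPT | CR4_OSXSAVE;
}

/* Value loaded into EDX:EAX before XSETBV. */
uint64_t xcr0_for_avx(void)
{
    uint64_t v = XCR0_FPU | XCR0_SSE | XCR0_AVX;
    assert(xcr0_is_valid(v));
    return v;
}
```

With these definitions, cr4_for_avx(0) gives 0x40600, which is the constant ORed into CR4 in the assembly.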

Context Switching

On a context switch, it is important to save the state of all 16 YMM registers if we want to avoid data corruption between threads. Saving/restoring sixteen 256-bit registers can add a lot of overhead to a context switch (we could even wonder if implementing a fast_memcpy() is worth it because of that overhead). Saving/restoring is done with the XSAVE and XRSTOR instructions. Each instruction takes a memory operand that specifies the save area where registers will be dumped/restored. These instructions also look at the content of EDX:EAX to know which processor states to save. EDX:EAX is bitwise ANDed with XCR0 to determine which processor states to save/restore. In my case, I want to use EDX:EAX = 0b110 to save XMM and YMM, but not the FPU. Remember, if we set 0b100, we will only get the upper half of YMMx saved/restored. To get the lower half, we need to set bit 1 to enable XMM state saving.
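
The masking rule can be written down in a few lines of C. This is a sketch with made-up function names; it only models the arithmetic and does not execute XSAVE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Splits a 64-bit feature mask into the EDX:EAX pair expected by
 * XSAVE/XRSTOR (EAX = low 32 bits, EDX = high 32 bits). */
void xsave_mask_split(uint64_t mask, uint32_t *eax, uint32_t *edx)
{
    assert(eax != NULL && edx != NULL);
    *eax = (uint32_t)(mask & 0xFFFFFFFFu);
    *edx = (uint32_t)(mask >> 32);
}

/* States that actually get saved/restored: requested mask AND XCR0. */
uint64_t xsave_effective_mask(uint64_t requested, uint64_t xcr0)
{
    return requested & xcr0;
}
```

For example, requesting 0b110 with XCR0 = 0b111 saves XMM and the upper YMM halves, while the same request with XCR0 = 0b011 silently drops the YMM part.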

Optimizing context switching - lazy switching

Since media instructions are not used extensively by all threads, it is possible that one thread does not use any media instructions during a time slice (or even during its whole lifetime). In such a case, saving/restoring the whole AVX state would add a lot of overhead to the context switch for absolutely nothing.

There is a workaround for this. In my OS, every time there is a task switch, I explicitly set the TS bit in register CR0. Every time a media instruction is executed while the CR0.TS bit is set, a #NM exception (Device Not Available) is raised. My OS then handles that exception to save/restore the AVX context. So if a task does not use media instructions during a time slice, no #NM will be triggered and there will be no AVX context switch. The logic is simple.

  • Assume that there is a global kernel variable called LastTaskThatRestoredAVX.
  • On task switch, set CR0.TS=1
  • media instruction is executed, so #NM is generated
  • on #NM:
    • clear CR0.TS
    • if LastTaskThatRestoredAVX==current task, return from exception (still the same context!)
    • XSAVE into LastTaskThatRestoredAVX's save area
    • XRSTOR from current task's save area
    • LastTaskThatRestoredAVX = current task
  • Next media instruction to be executed will not trigger #NM, because we cleared CR0.TS
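
The logic above can be modeled in plain C to convince yourself that it does the minimum amount of work. This is a simulation, not kernel code: the task structure, the counters and the function names are invented for the example, CR0.TS is a plain variable, and the XSAVE/XRSTOR instructions are replaced by counters. The NULL check for the very first #NM is an addition that the list does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task structure: only what the sketch needs. */
struct task {
    int id;
    int saves;     /* times XSAVE would have written this task's area */
    int restores;  /* times XRSTOR would have loaded this task's area */
};

static struct task *current_task;
static struct task *last_task_that_restored_avx;
static int cr0_ts;  /* stands in for the CR0.TS bit */

/* Called by the scheduler on every task switch. */
void on_task_switch(struct task *next)
{
    assert(next != NULL);
    current_task = next;
    cr0_ts = 1;
}

/* #NM handler: runs when a media instruction executes with CR0.TS set. */
void on_device_not_available(void)
{
    cr0_ts = 0;  /* clear TS so the next media instruction does not trap */
    if (last_task_that_restored_avx == current_task)
        return;  /* the registers still hold this task's state */
    if (last_task_that_restored_avx != NULL)
        last_task_that_restored_avx->saves++;  /* XSAVE into old owner's area */
    current_task->restores++;                  /* XRSTOR from current area */
    last_task_that_restored_avx = current_task;
}

/* Models the current task executing one media instruction. */
void execute_media_instruction(void)
{
    if (cr0_ts)
        on_device_not_available();
}
```

A task that never executes a media instruction never reaches the handler, so its switches cost nothing on the AVX side.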

Save area

The memory layout of the saved registers will look like this (notice how highest 128bits of YMM registers are saved separately)

How to answer a question the smart way.
Last edited on Jan 6, 2014


The way I see it, the internet has made it easier for everyone to get answers and solutions to different problems. That's the beauty of the internet: information is easily accessed. If you think about web forums, they allow people to talk to each other. They allow you to ask a question and get an answer. Asking a question on a forum is easier than posting a question in a magazine or trying to find something in an encyclopedia.

If you look at the section "Before you ask" in Eric Steven Raymond's "How To Ask Questions The Smart Way", he lists 7 steps that you should do before asking a question. Attempting those 7 steps defeats the whole point of making information easily accessible. So what if a person asks a question on a forum without having performed those 7 steps? Does it make it harder for you to answer the question? If you don't want to answer the question, then just don't answer. In my opinion, if the question was asked before and the answer was already provided, there is no harm in providing the answer a second time. The more the information is duplicated, the easier it gets to find that information. If you understand how the Google search engine works, you will know that this is true.

Replying "Google it"

When a person asks a question and someone else replies "let me Google that for you" or just gives a link to a Google search, that person should just not reply at all. How many times have I Googled something, clicked the first result and landed on a forum where the OP asked the exact same question that I am asking myself, and the only answer is "Google it"? Well, I did Google it actually, and I landed on a page that says to Google it. Was it really that hard to provide the right answer or to just ignore the OP?

Replying "why would you wanna do it like that" or "you shouldn't do that"

I see that too often. The OP asks something like "I wanna print a document that I just scanned.... blah blah... how do I do it?" and someone replies "why would you do that? just use the original document". Never mind why he wants to do it that way. Do you know the answer or not? If you don't, then don't reply. The other day I was searching for "how to create SSH keys on behalf of another user". I landed on a forum where the OP asked that same question and there was one reply: "You should not do that because the private key is private blah blah blah.". The person who replied may find it stupid to do such a thing, but I had very specific constraints that pushed me into doing that. Maybe I have a script running as root that creates keys for users. Maybe I have other reasons too. So if that person just found it odd to do such a thing and did not know the answer, maybe that person should have ignored the question.

Questions not to ask

In Eric Steven Raymond's "How To Ask Questions The Smart Way", you can find this:

Q: Where can I find program or resource X?
A: The same place I'd find it, fool, at the other end of a web search. Ghod, doesn't everybody know how to use Google yet?

Let me get this straight: because you used to walk 4 miles in 4 feet of snow to go to school, I shouldn't take the bus? You just said that you found it at the other end of a web search, so do us a favor and share the information so we don't have to do a big search like that. And by giving us the link and duplicating that answer, the link will end up ranking high in Google.

"How To Ask Questions The Smart Way" seems to have been written by a smart person who is really tech savvy but has neither the skills nor the patience to share his knowledge. That person should not become a teacher.

My philosophy is: make the information easy to find. Why would I look up a word in a dictionary when the guy sitting across from me knows the definition and could tell me right now? The days of the teachers saying "You'll learn more if you work at finding it" are over. Make the information accessible. Duplicate the information and spend less time looking for answers. That's the whole point of the "information super highway". At least that's how my employer thinks. My boss will be very mad if I spend 8 hours searching for a solution on Google because a co-worker, who knows the answer, replied "Google it".

Realtek 8139 network card driver
Last edited on Dec 3, 2013

While building my homebrew OS, I got to the point where I needed a netcard driver. I run my OS in QEMU, which provides a Realtek 8139 netcard. The specs for that card are very easy to find.

Before I continue, you should know that when the datasheet specifies a register that is 2 bytes long (like ISR), it is important to read it as a 16-bit word even if all you need is the first 8 bits. I was reading ISR with "inb" and couldn't make my software work even though all I needed was the first byte. Changing "inb" to "inw" worked. The datasheet indicates that some registers need to be read or written as words or dwords even if it looks like they could be accessed as bytes.


  • Enable the card: OUTPORTB(0,iobase+0x52);
  • Reset the card:
    You need to write the "reset" bit in register 0x37, and then wait until that bit gets cleared
        unsigned char v=0x10;
        OUTPORTB(v,iobase+0x37);
        while ((v&0x10)!=0) INPORTB(v,iobase+0x37);
  • Enable TX and RX interrupts: OUTPORTW(0b101, iobase+0x3C); There are other interrupts in register 0x3C that can be interesting but I just need TOK and ROK for now.
  • Enable 100 Mbps full duplex: OUTPORTB(0b00100001, iobase+0x63)
  • Set the Receive Configuration Register (RCR):
    OUTPORTL(0x8F, iobase+0x44);
    Looking at the datasheet, you can see what those bits mean. Basically what we did is:
    • set promiscuous mode
    • accept frames for our MAC address
    • accept frames for our multicast addresses
    • accept broadcast frames
    • do not accept runts and erroneous frames
    • set the RX buffer size to 8k
    • disable WRAP. This means that if a frame is received and we are near the end of the RX buffer, the card will continue copying data past the end of the buffer. We are basically allowing a buffer overflow here, so for this reason we need to give extra space to our buffer. I chose to use a 10k buffer just to be sure.
  • Set the RX buffer address. The details of this buffer will be explained in the next section. For now, let's just reserve a buffer of 10k and tell the card about it: OUTPORTL(buf_addr, iobase+0x30)

    Warning: The addresses for the TX and RX buffers must be physical addresses, not virtual addresses.

  • Set the Transmit Configuration Register (TCR): The default values after reset are fine. So I'm not touching that register.
  • Set the TX descriptors. I won't go into the details of those buffers for now; this will be explained in the next section. All you need to know right now is that you need four 2k buffers and that you must tell the card about them:
    OUTPORTL(buf_addr_desc0, iobase+0x20);
    OUTPORTL(buf_addr_desc1, iobase+0x24);
    OUTPORTL(buf_addr_desc2, iobase+0x28);
    OUTPORTL(buf_addr_desc3, iobase+0x2C);
  • enable TX and RX: OUTPORTB(0b00001100,iobase+0x37);
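
To see where the 0x8F written to the RCR comes from, here is a small C sketch that builds it from named bits. The bit names follow my reading of the RTL8139 datasheet and the function names are invented for this example, so check them against the datasheet before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Receive Configuration Register bits (names as I read them in the datasheet). */
enum {
    RCR_AAP  = 1 << 0,   /* accept all packets (promiscuous mode) */
    RCR_APM  = 1 << 1,   /* accept frames matching our MAC address */
    RCR_AM   = 1 << 2,   /* accept multicast frames */
    RCR_AB   = 1 << 3,   /* accept broadcast frames */
    RCR_AR   = 1 << 4,   /* accept runts (left cleared) */
    RCR_AER  = 1 << 5,   /* accept erroneous frames (left cleared) */
    RCR_WRAP = 1 << 7,   /* 1 = keep copying past the end of the buffer */
    RCR_RBLEN_SHIFT = 11 /* 2-bit buffer length field, 0 = 8k+16 */
};

/* RCR value for the configuration described in the list above. */
uint32_t rcr_value(unsigned rblen)
{
    assert(rblen <= 3);
    return (uint32_t)(RCR_AAP | RCR_APM | RCR_AM | RCR_AB | RCR_WRAP)
         | ((uint32_t)rblen << RCR_RBLEN_SHIFT);
}

/* Size in bytes of the receive ring selected by the RBLEN field. */
uint32_t rx_ring_size(unsigned rblen)
{
    assert(rblen <= 3);
    return (8192u << rblen) + 16u;
}
```

With rblen = 0 this yields 0x8F and a ring of 8208 bytes, which is why a 10k buffer leaves enough room for one full-size frame spilling past the end.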

This is my init code. Note that there is some PCI stuff in there that I don't describe. I am assuming that you have a PCI driver written at this point

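void initrtl8139()
{
    unsigned int templ;
    unsigned short tempw;
    unsigned long i;
    unsigned long tempq;

    deviceAddress = pci_getDevice(0x10EC,0x8139); // vendor, device. Realtek 8139
    if (deviceAddress == 0xFFFFFFFF)
    {
        pf("No network card found\r\n");
        return;
    }

    for (i=0;i<6;i++)
    {
        unsigned int m = pci_getBar(deviceAddress,i);
        if (m==0) continue;
        if (m&1)
        {
            iobase = m & 0xFFFC;
        }
        else
        {
            memoryAddress = m & 0xFFFFFFF0;
        }
    }

    irq = pci_getIRQ(deviceAddress);

    // Activate card
    OUTPORTB(0,iobase+0x52);

    // reset
    unsigned char v=0x10;
    OUTPORTB(v,iobase+0x37);
    while ((v&0x10)!=0)
    {
        INPORTB(v,iobase+0x37);
    }

    // read the MAC address from the ID registers (offsets 0x00 to 0x05);
    // INPORTW and INPORTL are assumed to work like INPORTB
    INPORTW(tempw,iobase+4);
    INPORTL(templ,iobase);
    tempq = tempw;
    tempq = tempq <<32;
    tempq |= templ;
    macAddress = tempq;
}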

void rtl8139_start()
{
    // Enable TX and RX:
    OUTPORTB(0b00001100,iobase+0x37);

    // Set the Receive Configuration Register (RCR)
    OUTPORTL(0x8F, iobase+0x44);

    // set receive buffer address
    // We need to use physical addresses for the RX and TX buffers. In our case, we are fine since
    // we are using identity mapping with virtual memory.
    OUTPORTL((unsigned char*)&rxbuf[0], iobase+0x30);  // this is a 10k buffer

    // set TX descriptors
    OUTPORTL((unsigned char*)&txbuf[0][0], iobase+0x20); // 2k aligned buffers
    OUTPORTL((unsigned char*)&txbuf[1][0], iobase+0x24);
    OUTPORTL((unsigned char*)&txbuf[2][0], iobase+0x28);
    OUTPORTL((unsigned char*)&txbuf[3][0], iobase+0x2C);

    // enable full duplex 100 Mbps
    OUTPORTB(0b00100001, iobase+0x63);

    // enable TX and RX interrupts:
    OUTPORTW(0b101, iobase+0x3C);
}


Since we have enabled the ROK and TOK interrupts, we will receive an interrupt when a new frame arrives. So from my interrupt handler I check the ISR register to know if I got a TOK or a ROK. If it is a ROK, then I proceed with getting the frame. First, some definitions:

  • CAPR: This register holds the address within the RX buffer where the driver should read the next frame. This register must be incremented by the driver when a frame is read. The netcard will check that register to determine if a buffer overrun is occurring.
  • Packet header: This is a 4-byte field found at the beginning of the frame. The first word is a bitfield indicating if the frame is OK, if it was received as part of a multicast, etc. More information can be found in section 5.1 of the datasheet. The following 2 bytes indicate the size of the frame.

This is what I do:

  • 1) Trigger on interrupt: Since interrupts have been enabled, the IRQ will have been raised. So this will be done from the handler. We need to check ROK in the ISR register
  • 2) Get the position of the frame within the RX buffer by reading CAPR
  • 3) Get the size of the data: 2nd 16-bit word from the beginning of the frame (CAPR+2)
  • 4) Copy the frame: the address starts at rx_buffer_base+CAPR
  • 5) Update CAPR: CAPR=((rxBufIndex+size+4+3)&0xFFFC)-0x10 We are adding 4 to take into account the header size, and the +3&0xFFFC is to align on a 4-byte boundary. I have no idea why we need to subtract 0x10 from there. Note that you should keep track of rxBufIndex separately, i.e. do not update it from CAPR every time.
  • 6) Check the BUFE (buffer empty) bit in CMD. If it is clear, there are more frames, so go back to step 2
  • 7) Write 1 to ROK in the ISR register
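
Steps 3 and 5 are pure arithmetic, so they can be checked in isolation. The helpers below are a sketch with names of my own choosing; they work on plain integers and a byte pointer and do not touch the card.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RX_HEADER_SIZE 4u  /* 2 bytes of status followed by 2 bytes of size */

/* Frame size stored in the packet header (little-endian, bytes 2 and 3). */
uint16_t rx_frame_size(const uint8_t *header)
{
    assert(header != NULL);
    return (uint16_t)(header[2] | (header[3] << 8));
}

/* ROK flag of the packet header (bit 0 of the first byte). */
int rx_header_ok(const uint8_t *header)
{
    assert(header != NULL);
    return header[0] & 1;
}

/* Offset of the next frame: skip header and frame, round up to 4 bytes. */
uint32_t rx_next_offset(uint32_t rx_buf_index, uint16_t frame_size)
{
    return (rx_buf_index + frame_size + RX_HEADER_SIZE + 3u) & ~3u;
}

/* Value to write to CAPR: the next offset minus 0x10, kept to 16 bits so
 * that small offsets wrap around the way the 16-bit register does. */
uint16_t capr_value(uint32_t next_offset)
{
    return (uint16_t)(next_offset - 0x10u);
}
```

For a first 64-byte frame at offset 0, the next frame starts at offset 68 and the value written to CAPR is 52.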

The receiving function:

unsigned long rtl8139_receive(unsigned char** buffer)
{
    if (readIndex != writeIndex)
    {
        unsigned short size;
        unsigned short i;
        unsigned char* p = rxBuffers[readIndex];
        size = p[2] | (p[3]<<8);
        if (!(p[0]&1)) return 0; // PacketHeader.ROK
        *buffer = (unsigned char*)&p[4]; // skip header
        readIndex = (readIndex+1) & 0x0F; // increment read index and wrap around 16
        return size;
    }
    return 0;
}

I also wrote a 64-bit memcpy in a separate ASM file:

// rdi = source, rsi = destination, rdx = size
    push    %rcx
    xchg    %rdi,%rsi
    mov     %rdx,%rcx
    shr     $3,%rcx
    rep     movsq
    mov     %rdx,%rcx
    and     $0x07,%rcx
    rep     movsb
    pop     %rcx

The interrupt handler:

unsigned short isr;
INPORTW(isr,iobase + 0x3E);       // INPORTW is assumed to work like INPORTB
OUTPORTW(0xFFFF,iobase + 0x3E);
unsigned int status;
unsigned char  cmd=0;
unsigned short size;
unsigned short i;
if (isr&1)                  // ROK
{
    // It is very important to check this first because it's possible to get an interrupt
    // and still have cmd.BUFE set to 1. That caused me lots of problems like
    // reading bad status, causing buffer overflows
    INPORTB(cmd,iobase+0x37);
    while (!(cmd&1))   // loop until CMD.BUFE == 1 (buffer empty)
    {
        // if the last frame overflowed the buffer, the next one will start at rxBufferIndex%RX_BUFFER_SIZE instead of zero
        if (rxBufferIndex>=RX_BUFFER_SIZE) rxBufferIndex = (rxBufferIndex%RX_BUFFER_SIZE);

        status =*(unsigned int*)(rxbuf+rxBufferIndex);
        size = status>>16;

        // copy the header and the frame (size+4 bytes) from rxbuf+rxBufferIndex into rxBuffers[writeIndex]

        rxBufferIndex = ((rxBufferIndex+size+4+3)&0xFFFC);
        OUTPORTW(rxBufferIndex-0x10,iobase+0x38); // update CAPR
        writeIndex = (writeIndex+1)&0x0F;
        if (writeIndex==readIndex)
        {
            // Buffer overrun
        }
        INPORTB(cmd,iobase+0x37);
    }
}


I found that sending was easier than receiving. The first thing that needs to be done is to set up the buffer pointers in TSAD0-TSAD3. I'm not sure if these buffers require any special alignment, but I've aligned mine on 2k boundaries.

Sending a frame

There are 4 TX buffers available. You should keep track of which one is free by incrementing an index every time you send a frame. This way, you will know which buffer to use next time. You need to copy your frame into the buffer pointed to by TSAD[CurrentSendIndex]. You then write the size of the frame into TSD[CurrentSendIndex] with bit 13 cleared. Bit 13 is the OWN bit. Clearing it tells the card that this buffer is ready to be transmitted. Then you increment CurrentSendIndex to be ready for next time. At the next send, if TSD[CurrentSendIndex].bit13 is still cleared, it means that the buffer still belongs to the card and the frame hasn't been transmitted yet. This indicates a buffer overrun: your software is sending faster than the card can handle.

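unsigned long rtl8139_send(unsigned char* buf, unsigned short size)
{
    if (size>1792) return 0;
    unsigned short tsd = 0x10 + (currentTXDescriptor*4);
    unsigned int tsdValue;

    INPORTL(tsdValue,iobase+tsd); // INPORTL is assumed to work like INPORTB
    if ((tsdValue & 0x2000) == 0)
    {
        //the whole queue is pending packet sending
        return 0;
    }

    // copy the frame from buf into txbuf[currentTXDescriptor]

    tsdValue = size; // OWN (bit 13) is cleared, which hands the buffer to the card
    OUTPORTL(tsdValue,iobase+tsd);
    currentTXDescriptor = (currentTXDescriptor+1)&0b11; // wrap around 4
    return size;
}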
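
The descriptor bookkeeping used by the send path can be isolated into a few helpers. This is a sketch with invented names, using the register offsets given in this article (TSD0 at 0x10, TSAD0 at 0x20). It also shows the OWN test written with explicit parentheses: in C, == binds tighter than &, so "value & 0x2000 == 0" evaluates as "value & (0x2000 == 0)" and is always zero.

```c
#include <assert.h>
#include <stdint.h>

#define TSD_OWN 0x2000u  /* bit 13: set when the buffer belongs to the driver */
#define TSD_TOK 0x8000u  /* bit 15: transmit OK */

/* I/O offset of the status register of TX descriptor i (0 to 3). */
uint16_t tsd_offset(unsigned i)
{
    assert(i < 4);
    return (uint16_t)(0x10 + i * 4);
}

/* I/O offset of the buffer address register of TX descriptor i (0 to 3). */
uint16_t tsad_offset(unsigned i)
{
    assert(i < 4);
    return (uint16_t)(0x20 + i * 4);
}

/* Next descriptor in the ring of 4. */
unsigned next_descriptor(unsigned i)
{
    return (i + 1) & 3u;
}

/* A descriptor can be reused only when OWN is set. */
int descriptor_is_free(uint32_t tsd_value)
{
    return (tsd_value & TSD_OWN) != 0;
}
```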

Handling TX interrupt

Handling the interrupt is mostly done to detect send errors. I don't use it much. I won't go into details here, as the code explains pretty much everything.

unsigned short isr;
INPORTW(isr,iobase + 0x3E);       // INPORTW is assumed to work like INPORTB
OUTPORTW(0xFFFF,iobase + 0x3E);
if (isr&0b100)              //TOK
{
    unsigned long tsdCount = 0;
    unsigned int tsdValue;
    while (tsdCount <4)
    {
        unsigned short tsd = 0x10 + (transmittedDescriptor*4);
        tsdCount++;
        transmittedDescriptor = (transmittedDescriptor+1)&0b11;
        INPORTL(tsdValue,iobase+tsd);
        if (tsdValue&0x2000) // OWN is set, so it means that the data was transmitted to FIFO
        {
            if ((tsdValue&0x8000)==0)
            {
                //TOK is false, so the packet transmission was bad. Ignore that for now. We will drop it.
            }
        }
        else
        {
            // this frame is pending transmission, we will get another interrupt.
        }
        OUTPORTL(0x2000,iobase+tsd); // set length to zero to clear the other flags but leave OWN to 1
    }
}


These are good resources if you need more information on the rtl8139:

Get the full source code