Implementing your own mutex with cmpxchg
Last edited on Jun 28, 2012

The cmpxchg instruction takes the form "cmpxchg destination, source", where the destination is a memory location and the source is a register. Before using this instruction, you need to load a value into the EAX register. The instruction first compares the value in EAX with the value in memory pointed to by the destination operand. If both values are equal, the value of the source operand is stored at the destination and the zero flag (ZF) is set. If, on the other hand, the destination and EAX do not match, the destination is loaded into EAX and ZF is cleared. On a multiprocessor system, this compare-and-store is atomic only when the instruction carries the lock prefix. At first, it might not be clear why this instruction would be useful. But consider this:

l2: mov eax,[mutex]
    cmp eax,1
    je l2
    mov eax,1
l3: mov [mutex],eax

This is an unsafe way of creating a mutex. You loop until its value is zero and then store a 1 in it. But what if another thread or another CPU changed the value between l2 and l3? Both would have read a zero, both would store a 1, and both would believe they own the lock.

If you need to store the value of a lock in memory (let's say at location 0x12345678), then before attempting to lock a section of code you would read the lock to see if it is free. So you would read location 0x12345678 and test whether the value is zero. If it isn't, you keep reading memory until it reads as zero (because some other thread cleared it). After that, you need to store a "1" in this location to take ownership of the lock. But what if another thread takes ownership between the time you read the value and the time you write it? The CMPXCHG instruction will write a "1" there only if a "0" is still in memory. EAX is equal to "0" because we first spin until the memory value is "0". So after that, we tell the CPU: "EAX is zero now, so compare the value at 0x12345678 with EAX (thus 0) and change it to 1 if it is equal. Otherwise, if the value at 0x12345678 is not 0 anymore, load this value into EAX and I will go back to spinning until I get a zero". Simple enough? Here is some sample code that illustrates this.

    mov edx,1
l2: mov eax,[mutex]
    cmp eax,1
    je l2                   ; spin until we see that eax == 0
    lock cmpxchg [mutex],edx; At this point, eax=0 (the mutex only ever holds 0 or 1).
                            ; If the memory location is still equal to eax,
                            ; edx is stored there and zf is set.
                            ; Otherwise, zf is cleared and eax is loaded with
                            ; the changed value of mutex (should be 1).
    jnz l2                  ; zf clear: the mutex was modified by someone else
                            ; between our read and the cmpxchg, so go back to spinning.
                            ; zf set: we own the mutex and fall through.

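Now, notice how we used "lock" before cmpxchg? Without it, another CPU could modify the memory location between the compare and the store. The lock prefix makes the CPU take exclusive access to that location (historically by locking the bus) for the whole operation, so that no other CPU can interfere with it. To release the mutex, the owner simply stores a 0 back into it.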
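The same idea can be sketched in C with the C11 atomics from stdatomic.h. This is only an illustration under a few assumptions: the names spin_mutex, spin_mutex_trylock, spin_mutex_lock and spin_mutex_unlock are made up for this example, and the claim that the compare-exchange becomes a "lock cmpxchg" is what x86 compilers typically generate, not something the C standard promises.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 = free, 1 = taken, as in the assembly version. */
typedef struct {
    atomic_int state;
} spin_mutex;

static void spin_mutex_init(spin_mutex *m)
{
    atomic_init(&m->state, 0);
}

/* One compare-and-swap attempt. Returns true if we took the lock. */
static bool spin_mutex_trylock(spin_mutex *m)
{
    int expected = 0; /* plays the role of EAX */
    /* Stores 1 only if state is still 0; otherwise 'expected' receives
     * the current value, just like EAX after a failed cmpxchg. */
    return atomic_compare_exchange_strong(&m->state, &expected, 1);
}

static void spin_mutex_lock(spin_mutex *m)
{
    for (;;) {
        /* Spin on a plain load first, like the "mov eax,[mutex]" loop. */
        while (atomic_load(&m->state) != 0)
            ;
        if (spin_mutex_trylock(m))
            return; /* the swap succeeded, we own the lock */
        /* Someone else took it between the load and the swap: spin again. */
    }
}

static void spin_mutex_unlock(spin_mutex *m)
{
    assert(atomic_load(&m->state) == 1); /* unlocking a free lock is a bug */
    atomic_store(&m->state, 0);
}
```

A caller would wrap its critical section between spin_mutex_lock(&m) and spin_mutex_unlock(&m). Like the assembly version, this busy-waits, so it only makes sense for very short critical sections.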

WakeupCall server using resiprocate
Last edited on Jun 14, 2012

This is the first project I did with the resiprocate SIP stack. There are a lot of things left to do in this project, but I wanted to post the code here right away in case someone needs more examples of how to use resiprocate.

Dependencies and limitations

I chose to use resiprocate as the SIP stack, ortp as the RTP stack and libxml2 as the XML parser. The application only supports G.711 uLaw. The application only supports SIP INFO for receiving DTMF (inband and RFC2833 are not supported).


Resiprocate provides a Dialog Usage Manager (DUM). This engine is very useful for applications that don't want to deal with low level SIP messages. The DUM allows you to receive events such as onOffered, onAnswer, onTerminated (plus many more) through an observer pattern. Using a class called AppDialogSet, it is possible to represent a "call" or a "dialog" and let the DUM manage it. For example, you could override the AppDialogSetFactory with your own CallFactory that would create "Call" objects derived from AppDialogSet. When you receive an event such as onOffered, the DUM will already have created an AppDialogSet with your factory class, and you can then cast this AppDialogSet to your "Call". This is a good way to receive a "Call" reference on every event you get. And the beauty of this is that you never need to delete it because the DUM will take care of it. More information is available on the resiprocate website.


ortp is very easy to use but only provides basic functionality. It won't bind to any sound card or include encoders like other fancy stacks do. This stack only allows you to open a stream and feed it data encoded with whatever codec you want. It is the developer's responsibility to make sure that the data that is fed is encoded with the proper codec.

Threading model

I chose to use 1 thread for general processing and 1 thread for each RTP session. The main thread is used to give cycles to the resiprocate DUM and to the WakeupCallService. A new thread is created for each RTP session. The RTP session only handles the outgoing stream since we don't need the incoming stream. The ortp stack provides a way to read multiple streams from the same thread, but I prefer to use different threads in order to leverage multi-core CPUs.


The server is a user agent that registers with your PBX. Just call the server and enter the time at which you want your wakeup call and the extension at which you want to be notified. For example, you would enter 0,6,3,0 to get a wakeup call at 6:30 AM. I left the prompts out of the package, so you'll want to replace them. The IVR is defined in the xml file. Just change the prompt names. There is no configuration file you can use right now. You will need to set the proper values in config.h. To launch the application, run it and provide, as a command line argument, the IP address on which to bind on your computer.
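To make the time entry concrete, here is a minimal sketch in C of how four DTMF digits such as 0,6,3,0 could be turned into an hour and a minute. It is not taken from the WakeupCall source: wakeup_time and parse_wakeup_digits are hypothetical names, and the 24-hour HHMM format is an assumption based on the example above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    int hour;   /* 0..23 */
    int minute; /* 0..59 */
} wakeup_time;

/* Parse exactly four digits in HHMM order, e.g. "0630" -> 06:30.
 * Returns false if the input is not four digits or not a valid time. */
static bool parse_wakeup_digits(const char *digits, wakeup_time *out)
{
    assert(out != NULL);
    if (digits == NULL || strlen(digits) != 4)
        return false;
    for (int i = 0; i < 4; i++) {
        if (digits[i] < '0' || digits[i] > '9')
            return false; /* rejects '*' and '#' as well */
    }
    int hour = (digits[0] - '0') * 10 + (digits[1] - '0');
    int minute = (digits[2] - '0') * 10 + (digits[3] - '0');
    if (hour > 23 || minute > 59)
        return false;
    out->hour = hour;
    out->minute = minute;
    return true;
}
```

Rejecting invalid input instead of clamping it lets the IVR replay the prompt and ask the caller to try again.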


Download the source code

FFT on AMD64
Last edited on Jun 5, 2012

Fast Fourier Transform with x86-64 assembly language

This is an old application I wrote in 2005 when I got my first 64-bit CPU (AMD). The first thing I did after installing my new CPU was to open vi and start coding an FFT using 64-bit registers. This is old news, but 64-bit at that time was awesome. Not only can you store 64 bits in a register, but you get 16 general purpose registers (and 16 XMM registers)!

The only really annoying thing with this architecture is that it doesn't provide a bit reversal instruction. I don't understand why a simple RISC processor like the AVR32 (look up "brev") has one but not a high end CISC processor from Intel or AMD. I don't actually show the bit reversal part of the FFT here though.
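For readers who have not seen it, the bit reversal step can be sketched in plain C as follows. This is a generic illustration and not the code that was used with the assembly routine below; reverse_bits and bit_reverse_permute are names chosen for this example, and the array length is assumed to be a power of two (2^bits).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the lowest 'bits' bits of 'index', e.g. 001 -> 100 for bits = 3. */
static uint32_t reverse_bits(uint32_t index, unsigned bits)
{
    uint32_t reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}

/* Reorder an array of 2^bits floats in place into bit-reversed order. */
static void bit_reverse_permute(float *data, unsigned bits)
{
    assert(bits < 32);
    size_t n = (size_t)1 << bits;
    for (size_t i = 0; i < n; i++) {
        size_t j = reverse_bits((uint32_t)i, bits);
        if (j > i) { /* swap each pair only once */
            float tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```

The loop in reverse_bits costs one iteration per bit, which is exactly the work a single hardware instruction like "brev" would save.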

By the way, I remember doing some tests with this algorithm and, although I don't remember the results exactly (7 years ago), I remember that it was running at least 5 times faster than most other FFTs in other libraries.

//; x8664realfft(float* source,float** spectrum,long size)
        mov     	$1,%eax
        cvtsi2ss     %eax,%xmm10
        pshufd  	$0b00000000,%xmm10,%xmm10
        mov     	$-1,%eax
        cvtsi2ss     %eax,%xmm10
        pshufd  	$0b11000100,%xmm10,%xmm10
        jmp     	fftentry
	mov		$1,%eax
	cvtsi2ss	%eax,%xmm10
	pshufd	$0b00000000,%xmm10,%xmm10
fftentry:
        pushq   	%rbp
	movq    	%rsp,%rbp
	pushq	%rbp
	subq		$0xFF,%rsp
	movq	%rsp,%rbp
	//; make a 16bytes aligned buffer
	addq		$16,%rbp
	andq		$0xFFFFFFFFFFFFFFF0,%rbp

	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%r11
	pushq	%r10
	pushq	%r9
	pushq	%r8

        //; rcx = size
        movq    	%rdx,%rcx  				
        pushq	%rcx
	//; rdx = source 
	mov		%rdi,%rdx				
	pushq		%rdx

	//; rdi = spectrum[0]
	movq	(%rsi), %rdi			
	addq		$8, %rsi
	//; rsi = spectrum[1]
	movq	(%rsi), %rsi			

	//; r8 = log2(N), r14= N
	pushq	%rcx
	//; st(0) = 1.0, then st(0) = N, st(1) = 1.0
	fld1
	fild		(%rsp)
	//; st(0) = 1.0 * log2(N)
	fyl2x
	xorq		%r8,%r8
	pushq	%r8
	fistp		(%rsp)
	popq		%r8
	popq		%r14	
	//; bit reversal has already been done prior to calling this function
	//; r9 = nLargeSpectrum
	//; r10 = nPointsLargeSpectrum
	movq	%r14,%r9
	movq	$1,%r10
	movq	$1,%r11
	mov	%rdi,%r14
	mov	%rsi,%r15
	//;load 2PI in st(0)
	fldpi
	fldpi
	faddp	%st(0),%st(1)
	movq	%r8,%rcx

l1:	pushq	%rcx
	shrq	$1,%r9
	shlq	$1,%r10
	//;st(0) = theta, st(1) = 2pi
	fld	%st(0)
	pushq	%r10
	fidiv	(%rsp)
	popq	%r10

	//;xmm0 = 2*costheta[0],2*costheta[0],2*costheta[0],2*costheta[0]
	//;  st(0) = theta, st(1) = 2pi
	pushq	%rax
	fld	%st(0)
	fcos
	fstp	(%rsp)
	movss	(%rsp),%xmm0
	pshufd	$0b00000000,%xmm0,%xmm0
	popq	%rax
	addps	%xmm0,%xmm0
	movq	%r9,%rcx
l2:	pushq	%rcx
	//; r12 = point1 (index *4bytes)    r13 = point2 (index *4bytes)
	movq	%r10,%r12
	movq	%r9,%rax
	subq	%rcx,%rax
	pushq	%rdx
	mulq	%r12
	popq	%rdx
	movq	%rax,%r12
	movq	%r11,%r13
	addq	%r12,%r13
	shlq	$2,%r13
	shlq	$2,%r12

	//; xmm2 = costheta[2],sintheta[2],costheta[1],sintheta[1]  
	movq	%r12,16(%rbp)
	decq		16(%rbp)
	fld		%st(0)
	fimul		16(%rbp)
	//; st(0) = cos, st(1) = sin
	fsincos
	fstp		(%rbp)
	fstp		4(%rbp)
	decq		16(%rbp)
	fld		%st(0)
	fimul		16(%rbp)
	//; st(0) = cos, st(1) = sin
	fsincos
	fstp		8(%rbp)
	fstp		12(%rbp)
	movaps	(%rbp),%xmm2
	pshufd	$0b10110001 ,%xmm2,%xmm2
	//;xmm1 = costheta[1],sintheta[1],0,0
	movhlps	%xmm2,%xmm1
	movq	%r11,%rcx
l3:
	//; recurrence formula
	//; xmm3 = w.re,w.im,w.re,w.im
	movaps	%xmm2,%xmm3
	mulps	%xmm0,%xmm3
	subps	%xmm1,%xmm3
	movlhps	%xmm3,%xmm3
	movaps	%xmm2,%xmm1
	movaps	%xmm3,%xmm2
	mulps	%xmm10,%xmm3
	//; xmm5 := c.im,c.re,c.re,c.im
	movq	%r14,%rdi
	movq	%r15,%rsi
	addq		%r13,%rdi
	addq		%r13,%rsi
	movss	(%rdi),%xmm5
	pshufd	$0b00000011,%xmm5,%xmm5
	addss	(%rsi),%xmm5
	pshufd	$0b00101000,%xmm5,%xmm5
	//; xmm3 := inner product: re,re,im,im
	mulps	%xmm3,%xmm5
	pshufd	$0b11011101 ,%xmm5,%xmm3
	pshufd	$0b10001000 ,%xmm5,%xmm5
	addsubps	%xmm5,%xmm3
	pshufd	$0b10101111,%xmm3,%xmm3
	//;xmm6 := sortedArray[point1].re,sortedArray[point1].re,sortedArray[point1].im,sortedArray[point1].im
	movq	%r14,%rdi
	movq	%r15,%rsi
	addq	%r12,%rdi
	addq	%r12,%rsi
	movss	(%rdi),%xmm6
	pshufd	$0b00001111,%xmm6,%xmm6
	addss	(%rsi),%xmm6
	pshufd	$0b11100000,%xmm6,%xmm6
	addsubps	%xmm3,%xmm6
	pshufd	$0b00100111,%xmm6,%xmm6
	movss	%xmm6,(%rdi)
	pshufd	$0b11100001,%xmm6,%xmm6
	movss	%xmm6,(%rsi)
	movq	%r14,%rdi
	movq	%r15,%rsi
	addq	%r13,%rdi
	addq	%r13,%rsi
	pshufd	$0b01001110,%xmm6,%xmm6
	movss	%xmm6,(%rdi)
	pshufd	$0b11100001,%xmm6,%xmm6
	movss	%xmm6,(%rsi)
	//; increase point1 and point2 by 4 bytes (each index represent a float)
	addq		$4,%r12
	addq		$4,%r13
	decq		%rcx
	jnz		l3
	popq		%rcx
	decq		%rcx
	jnz		l2

	//; remove theta from fpu stack
	fstp		%st(0)
	shlq		$1,%r11
	popq		%rcx
	decq		%rcx
	jnz		l1

	popq	%rdx
	//; rcx is already pushed in stack
	cvtsi2ss      (%rsp),%xmm1
	pshufd  	$0b00000000,%xmm1,%xmm1
	popq		%rcx
	shrq          $2,%rcx
	movq	%r14,%rdi
	movq	%r15,%rsi

	//; is this a ifft or a fft?
	cvtss2si	%xmm10,%eax
	cmp	$-1,%eax
	jne	nrm

cp:	movaps	(%rdi),%xmm2
	movntdq	%xmm2,(%rdx)
	addq	$16,%rdi
	addq	$16,%rdx
	loop	cp
	jmp	cleanexit

nrm:	movaps	        (%rdi),%xmm2
	movaps	        (%rsi),%xmm3
	divps		%xmm1,%xmm2
	divps		%xmm1,%xmm3
	movntdq	        %xmm2,(%rdi)
	movntdq	        %xmm3,(%rsi)
	addq		$16,%rdi
	addq		$16,%rsi
	loop		nrm

cleanexit:
	fstp		%st(0)
	popq		%r8
	popq		%r9
	popq		%r10
	popq		%r11
	popq		%r12
	popq		%r13
	popq		%r14
	popq		%r15
	addq		$0xFF,%rsp	
	popq		%rbp
	leave
	ret

Cloning a hard drive
Last edited on May 17, 2012

In one of my computers, I have one hard drive that contains 2 partitions: one for the root filesystem and one for my /home partition. When I bought a new hard drive, I needed to clone the old one onto the new one. This can be easily done with "dd" as long as your partitions are the same size. So I decided to keep the root filesystem at the same size, but wanted to grow the /home partition.

Create the partitions

First, you need to create the partitions on the new drive using fdisk. Remember to keep the same size for the partitions you want to clone. If you create them smaller, you will end up with a corrupted filesystem. If you create them larger, the cloned filesystem will still have its old size, so the extra space is wasted unless you grow the filesystem afterwards. After creating the partitions, you don't need to create a filesystem (mkfs) on the ones you clone, since "dd" copies the filesystem of the old partition along with its data. But of course, you will need to create a filesystem on the other partitions that won't be cloned.


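You need to clone the boot code in your master boot record (which contains lilo/grub). The MBR is the first 512-byte sector of the disk, but bytes 446 to 511 of that sector hold the partition table and the boot signature. Copying all 512 bytes would overwrite the partition table you just created with the old one, so we copy only the first 446 bytes:

dd if=/dev/sda of=/dev/sdb bs=446 count=1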
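It helps to know what is inside that first sector: 446 bytes of boot code, then four 16-byte partition entries, then the two signature bytes 0x55 0xAA. The following C sketch reads fields out of a 512-byte buffer holding such a sector. The function and constant names are made up for this example, and reading the sector from the disk is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    MBR_SIZE            = 512,
    MBR_BOOT_CODE_SIZE  = 446, /* boot loader code (lilo/grub) */
    MBR_PART_ENTRY_SIZE = 16,  /* one partition table entry */
    MBR_PART_COUNT      = 4,   /* 446 + 4 * 16 = 510 */
    MBR_SIGNATURE_OFF   = 510  /* followed by 0x55 0xAA */
};

/* True if the sector ends with the 0x55 0xAA boot signature. */
static bool mbr_has_signature(const uint8_t sector[MBR_SIZE])
{
    return sector[MBR_SIGNATURE_OFF] == 0x55 &&
           sector[MBR_SIGNATURE_OFF + 1] == 0xAA;
}

/* Partition type byte of entry n (0..3); 0 means the entry is unused. */
static uint8_t mbr_partition_type(const uint8_t sector[MBR_SIZE], int n)
{
    assert(n >= 0 && n < MBR_PART_COUNT);
    return sector[MBR_BOOT_CODE_SIZE + n * MBR_PART_ENTRY_SIZE + 4];
}

/* First sector (LBA) of entry n, stored little-endian at offset 8. */
static uint32_t mbr_partition_start_lba(const uint8_t sector[MBR_SIZE], int n)
{
    assert(n >= 0 && n < MBR_PART_COUNT);
    const uint8_t *e = sector + MBR_BOOT_CODE_SIZE + n * MBR_PART_ENTRY_SIZE + 8;
    return (uint32_t)e[0] | ((uint32_t)e[1] << 8) |
           ((uint32_t)e[2] << 16) | ((uint32_t)e[3] << 24);
}
```

Because the partition table sits in the same sector as the boot code, a copy of the whole sector also carries the old disk's partition layout with it.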

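Then, we can clone the partition. With conv=noerror, dd keeps going after a read error; adding sync pads each failed block with zeros so that the rest of the data stays at the right offset:

dd if=/dev/sda1 of=/dev/sdb1 bs=4096 conv=noerror,sync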

At this point, my root partition was cloned successfully. For the other partition (/dev/sdb2), I had to create a new filesystem (mkfs) because my partition needed to be larger. After that, I copied the files manually using "cp".

Configuring and Using KVM-Qemu
Last edited on Feb 28, 2012

KVM Qemu

I was tired of VMware Server's sloooooow web interface that only works half of the time. I just couldn't take it anymore. So I started looking for other virtualization solutions. I found KVM. KVM/Qemu is, by far, easier to use than VMware Server. The thing I like about qemu is that there are no virtual machine files. You only create a virtual disk file, but the machine itself is built from the command line when invoking qemu. That means you have to "rebuild" the machine every time you reload it. It looks painful, but you just have to save your command in a script and invoke it. So a shell script is to qemu what a VMX file is to VMware. Don't ask me why, but this is a strong point for me.

Installing and preparing KVM Qemu

  1. Compile the kernel with KVM support (see flags VIRTUALIZATION,KVM,KVM_AMD,KVM_INTEL)
  2. Download and install qemu-kvm
  3. Install "tunctl"
  4. Make a network bridge script. It needs to run after every reboot (put it in rc.local):
    #load tun driver and create a TAP interface
    modprobe tun
    tunctl -t tap0
    # bring eth0 down, we will set it as promiscuous and it will be part of a bridge
    ifconfig eth0 down
    brctl addbr br0
    ifconfig eth0 promisc up
    ifconfig tap0 promisc up
    # set the IP address of the bridge interface. This is the interface that we will use from now on. So use
    # an IP address on your LAN. This is the address of the host computer, not the guest.
    ifconfig br0 netmask broadcast up
    # add tap0 and eth0 as members of the bridge and bring it up.
    brctl stp br0 off
    brctl setfd br0 1
    brctl sethello br0 1
    brctl addif br0 eth0
    brctl addif br0 tap0
    # setup default gateway.
    route add default gw

Note that you will need to run that on every reboot. So you might want to save this in a boot script.

Create a VM

  1. Create a 10 GB disk: qemu-img create -f qcow2 vdisk.img 10G.
  2. Install the OS: qemu-system-x86_64 -hda vdisk.img -cdrom /path/to/boot-media.iso -boot d -m 512 -vnc :1. Let's analyze that command:
    • "-hda vdisk.img": use vdisk.img as primary disk
    • "-cdrom /path/to/boot-media.iso": use boot-media.iso as the CD-ROM image
    • "-boot d": Boot from D drive, the cdrom
    • "-m 512": 512 MB of RAM
    • "-vnc :1" : The display will be on VNC port index number 1. Depending on your settings, if your base port is 5900, then the TCP port used in that case will be (5900 + 1).

So you can now use a VNC client to connect to port 5901 on your host to have access to the display. The VM will boot from the OS install CD you have provided so you will be able to install the OS like you would on a real computer.

Use a VM

  1. Run: qemu-system-x86_64 -usbdevice tablet -daemonize -enable-kvm -hda /virtual-machines/vdisk.img -boot c -m 512 -vnc :1 -monitor telnet:,server,nowait,ipv4 -net tap,ifname=tap0,script=no -net nic. Let's analyze that command:
    • -usbdevice tablet: I had problems with my mouse cursor when using VNC if I didn't use that option.
    • -daemonize: Run as background process
    • -enable-kvm: Enable the use of kernel-based virtualization.
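    • "-hda /virtual-machines/vdisk.img": use vdisk.img as primary disk
    • "-boot c": Boot from C drive, the primary disk
    • "-m 512": 512 MB of RAM
    • "-vnc :1" : The display will be on VNC port index number 1.
    • -monitor telnet:,server,nowait,ipv4: Listen for telnet connections to the qemu monitor (the management console), without waiting for a client to connect before starting the VM.
    • -net tap,ifname=tap0,script=no: Use tap0, and don't run the network setup script.
    • -net nic: Give the guest a virtual network card.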
  2. Install a vnc viewer on some other computer (TightVNC). Connect to host on port 5901
  3. Configure the network on the guest (if it runs Windows, enable remote desktop and disable the firewall or poke a hole in it)

You should now have access to your VM through remote desktop or SSH or whatever you configured in that last step.

Managing the VM

You can telnet into the VM's monitor console to manage it. Use the port you set up with the "-monitor telnet" option. To leave the monitor without stopping the VM, press 'ctrl-]' to get the telnet prompt and then type 'q'. If you type 'q' in the monitor itself, without 'ctrl-]', you will kill the VM.

Change CD in cdrom

Telnet into the management console and run: change ide1-cd0 /shared/newimg.iso

Changing specs

Of course, if you want to add more RAM or change other system specs, you can do it from the command line when invoking qemu.