WWW.DUMAIS.IO

Installing kubernetes manually
Last edited on Sep 13, 2017

The procedure described here applies to CentOS, but the same recipe can be adapted to other distributions with minor tweaks.

I am deploying a 3-node cluster where every server is both a master and a node. This is not recommended by the k8s team, but for a lab environment it is perfectly fine. If you understand this procedure well, you will find that deploying nodes and masters separately is just as easy. Let's assume I have 3 servers with IPs 192.168.1.100, 192.168.1.101 and 192.168.1.102.

System preparation

The first things we need to do are disable SELinux, the firewall and NetworkManager, and set up a yum repo. This is probably a bad idea, but then again, it makes things easier in a lab environment.

systemctl disable NetworkManager
systemctl stop NetworkManager
systemctl disable firewalld
systemctl stop firewalld
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config
cat >> /etc/yum.repos.d/virt7-docker-common-release.repo << EOF
[virt7-docker-common-release]
name=virt7-docker-common-release
baseurl=http://cbs.centos.org/repos/virt7-docker-common-release/x86_64/os/
gpgcheck=0
EOF

If you are behind a corporate proxy, also add this line to that last repo file: proxy=http://yourproxy_address

Install docker

yum install -y docker
systemctl enable docker
systemctl start docker

If you are behind a corporate proxy, add this line to /etc/sysconfig/docker: HTTP_PROXY=http://yourproxy_address and then restart docker

Install etcd

etcd should be installed on all masters. In this setup, etcd runs on the same servers as the k8s binaries, but you should definitely install several instances of it and cluster them. Some people also suggest installing at least one instance somewhere in the cloud. This is because etcd stores all the persistent admin data that k8s needs, so it would be painful to lose that data. Bringing back (or adding) a k8s node is very easy and transparent as long as your etcd cluster is intact.

yum install -y etcd 
cat >> /etc/etcd/etcd.conf << EOF
ETCD_INITIAL_CLUSTER="k1=http://192.168.1.100:2380,k2=http://192.168.1.101:2380,k3=http://192.168.1.102:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="awesome-etcd-cluster"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.1.100:2380"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://192.168.1.100:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.1.100:2379"
ETCD_NAME="k1"
EOF
systemctl enable etcd
systemctl start etcd

The ETCD_INITIAL_CLUSTER line should list all the hosts where etcd will be running. The other lines where you see 192.168.1.100 should be modified to match the IP address of the server you are currently installing etcd on, and ETCD_NAME should also match the server you are installing on (see the first line, where these names are used). The ETCD_NAME values can be arbitrary (as long as they properly match the names in ETCD_INITIAL_CLUSTER), but most people use the server's hostname.
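
As a sketch, those per-node edits can be scripted; the helper below and its name patch_etcd_conf are my own invention, not part of any etcd tooling:

```shell
# Hypothetical helper: rewrite the node-specific fields of an etcd config file.
# Arguments: config path, this node's IP, this node's ETCD_NAME.
patch_etcd_conf() {
  conf=$1; node_ip=$2; node_name=$3
  sed -i \
    -e "s|^ETCD_INITIAL_ADVERTISE_PEER_URLS=.*|ETCD_INITIAL_ADVERTISE_PEER_URLS=\"http://${node_ip}:2380\"|" \
    -e "s|^ETCD_LISTEN_PEER_URLS=.*|ETCD_LISTEN_PEER_URLS=\"http://${node_ip}:2380\"|" \
    -e "s|^ETCD_ADVERTISE_CLIENT_URLS=.*|ETCD_ADVERTISE_CLIENT_URLS=\"http://${node_ip}:2379\"|" \
    -e "s|^ETCD_NAME=.*|ETCD_NAME=\"${node_name}\"|" \
    "$conf"
}

# Example, on the second server:
# patch_etcd_conf /etc/etcd/etcd.conf 192.168.1.101 k2
```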

After installing etcd on all your servers, wait for the cluster to come up by checking the output of "etcdctl cluster-health". Make sure the cluster is up and healthy before continuing.
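
A minimal sketch of that wait, assuming etcdctl is in the PATH (the function name and retry counts are arbitrary choices of mine):

```shell
# Poll "etcdctl cluster-health" until it reports a healthy cluster, for up to
# about a minute, then give up.
wait_for_etcd() {
  for i in $(seq 1 30); do
    if etcdctl cluster-health 2>/dev/null | grep -q "cluster is healthy"; then
      echo "etcd cluster is healthy"
      return 0
    fi
    sleep 2
  done
  echo "etcd cluster did not become healthy" >&2
  return 1
}
```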

Now add some info about k8s in etcd; flannel will read this network config later:

etcdctl mkdir /kube-centos/network
etcdctl mk /kube-centos/network/config "{ \"Network\": \"172.30.0.0/16\", \"SubnetLen\": 24, \"Backend\": {\"Type\": \"vxlan\" } }"

Install Flannel

Your k8s cluster needs an overlay network so that all pods appear to be on the same layer-2 segment. This is nice because even though your cluster runs on different servers, each pod will be able to talk to the others using a virtual network that sits on top of your real network and spans across all k8s nodes. It's basically a virtual distributed switch, just like what openvswitch does. Flannel is one driver that enables that for kubernetes. It is the most basic overlay driver and works just fine. For more advanced stuff, Nuage is an option, and there are many others. If you are new to this (like I was), then this is the SDN stuff that all the cool kids are talking about.

yum install -y flannel
cat >> /etc/sysconfig/flanneld << EOF
FLANNEL_ETCD_ENDPOINTS="http://192.168.1.100:2379,http://192.168.1.101:2379,http://192.168.1.102:2379"
FLANNEL_ETCD_PREFIX="/kube-centos/network"
EOF
systemctl enable flanneld
systemctl start flanneld
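
To check what flanneld obtained, you can read the environment file it writes at startup; the helper name flannel_subnet is mine, and the default path below is the usual one on CentOS 7 but may differ on your system:

```shell
# Print the /24 subnet flanneld allocated to this host, extracted from the
# environment file it writes when it starts.
flannel_subnet() {
  sed -n 's/^FLANNEL_SUBNET=//p' "${1:-/run/flannel/subnet.env}"
}

# Example: flannel_subnet /run/flannel/subnet.env
```

Each host should report a different subnet inside the 172.30.0.0/16 network configured in etcd.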

Install Kubernetes

So far, we have just installed all the stuff that k8s requires. Now we get to the part where we actually install Kubernetes. To recap, we have installed:

  • Docker: This is what will actually run the containers
  • Etcd: This is the database that k8s and flannel use to store admin data
  • Flannel: The overlay network

Before we continue, let's create a user and some folders

groupadd kube
useradd -g kube -s /sbin/nologin kube
mkdir -p /var/run/kubernetes
chown root:kube /var/run/kubernetes
chmod 770 /var/run/kubernetes
mkdir /etc/kubernetes
mkdir /var/lib/kubelet

Kubernetes can be downloaded as a binary package from github. What I really like about these binaries is that they are simple standalone applications: you don't need to install an RPM and a whole bunch of libraries. Simply copy the executables and run them. You will need a total of six binaries, five of which run as daemons. So the first thing to do is to unpack the binaries. Download the version you want from https://github.com/kubernetes/kubernetes/releases

tar -zxf kubernetes.tar.gz
cd kubernetes
cluster/get-kube-binaries.sh
cd server
tar -zxf kubernetes-server-linux-amd64.tar.gz
cd kubernetes/server/bin
cp kube-apiserver /usr/bin
cp kube-controller-manager /usr/bin
cp kubectl /usr/bin
cp kubelet /usr/bin
cp kube-proxy /usr/bin
cp kube-scheduler /usr/bin
chmod 755 /usr/bin/kube*

Each of the daemons can simply be started from the CLI with a whole bunch of command-line arguments; you don't even need any configuration files. This is beautiful because it is so easy. So technically, you could just do:

kube-apiserver  &
kube-controller-manager  &
kubelet  &
kube-proxy  &
kube-scheduler  &

And that's it, kubernetes is running. Simple. But let's get systemd to manage those processes. If you are not using systemd (slackware?) then you can set up a bunch of rc scripts to launch them. But for this example, let's create some systemd unit files that will launch this for us.

#/usr/lib/systemd/system/kube-apiserver.service
[Unit]
Description=Kubernetes API Server
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
After=network.target
After=etcd.service

[Service]
EnvironmentFile=-/etc/kubernetes/config
EnvironmentFile=-/etc/kubernetes/apiserver
User=kube
ExecStart=/usr/bin/kube-apiserver \
        $KUBE_LOGTOSTDERR \
        $KUBE_LOG_LEVEL \
        $KUBE_ETCD_SERVERS \
        $KUBE_API_ADDRESS \
        $KUBE_API_PORT \
        $KUBELET_PORT \
        $KUBE_ALLOW_PRIV \
        $KUBE_SERVICE_ADDRESSES \
        $KUBE_ADMISSION_CONTROL \
        $KUBE_API_ARGS
Restart=on-failure
Type=notify
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

#/usr/lib/systemd/system/kube-controller-manager.service
[Unit]
Description=Kubernetes Controller Manager
Documentation=https://github.com/GoogleCloudPlatform/kubernetes

[Service]
EnvironmentFile=-/etc/kubernetes/config
EnvironmentFile=-/etc/kubernetes/controller-manager
User=kube
ExecStart=/usr/bin/kube-controller-manager \
        $KUBE_LOGTOSTDERR \
        $KUBE_LOG_LEVEL \
        $KUBE_MASTER \
        $KUBE_CONTROLLER_MANAGER_ARGS
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

#/usr/lib/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet Server
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
After=docker.service
Requires=docker.service

[Service]
WorkingDirectory=/var/lib/kubelet
EnvironmentFile=-/etc/kubernetes/config
EnvironmentFile=-/etc/kubernetes/kubelet
ExecStart=/usr/bin/kubelet \
        $KUBE_LOGTOSTDERR \
        $KUBE_LOG_LEVEL \
        $KUBELET_API_SERVER \
        $KUBELET_ADDRESS \
        $KUBELET_PORT \
        $KUBELET_HOSTNAME \
        $KUBE_ALLOW_PRIV \
        $KUBELET_POD_INFRA_CONTAINER \
        $KUBELET_ARGS
Restart=on-failure

[Install]
WantedBy=multi-user.target

#/usr/lib/systemd/system/kube-proxy.service
[Unit]
Description=Kubernetes Kube-Proxy Server
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
After=network.target

[Service]
EnvironmentFile=-/etc/kubernetes/config
EnvironmentFile=-/etc/kubernetes/proxy
ExecStart=/usr/bin/kube-proxy \
        $KUBE_LOGTOSTDERR \
        $KUBE_LOG_LEVEL \
        $KUBE_MASTER \
        $KUBE_PROXY_ARGS
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

#/usr/lib/systemd/system/kube-scheduler.service
[Unit]
Description=Kubernetes Scheduler Plugin
Documentation=https://github.com/GoogleCloudPlatform/kubernetes

[Service]
EnvironmentFile=-/etc/kubernetes/config
EnvironmentFile=-/etc/kubernetes/scheduler
User=kube
ExecStart=/usr/bin/kube-scheduler \
        $KUBE_LOGTOSTDERR \
        $KUBE_LOG_LEVEL \
        $KUBE_MASTER \
        $KUBE_SCHEDULER_ARGS
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Note how I did not hard-code all the command-line arguments in the systemd unit files. Instead, I store them in separate environment files under /etc/kubernetes. Earlier I was praising how nice it is to be able to launch the processes on the command line with no extra config files or service files, and here I am creating service files and config files. I know... I just like the fact that I can customize it any way I want. So here are the config files needed. But as I said, you could instead write one bash script that invokes all five daemons with all their command-line arguments, and you would have zero config files and no need for systemd; just the binaries you copied on your server.

#/etc/kubernetes/config 
KUBE_ETCD_SERVERS="--etcd-servers=http://192.168.1.100:2379,http://192.168.1.101:2379,http://192.168.1.102:2379"
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=0"
KUBE_ALLOW_PRIV="--allow-privileged=false"
KUBE_MASTER="--master=http://127.0.0.1:8080"
KUBE_SCHEDULER_ARGS="--leader-elect=true"
KUBE_API_ARGS="--leader-elect=true"

#/etc/kubernetes/apiserver
KUBE_API_ADDRESS="--insecure-bind-address=0.0.0.0"
KUBE_ETCD_SERVERS="--etcd-servers=http://127.0.0.1:2379"
KUBE_SERVICE_ADDRESSES="--service-cluster-ip-range=10.254.0.0/16"
KUBE_ADMISSION_CONTROL="--admission-control=AlwaysAdmit"
KUBE_API_ARGS=""

#/etc/kubernetes/controller-manager
KUBE_CONTROLLER_MANAGER_ARGS="--node-monitor-period=2s --node-monitor-grace-period=16s --pod-eviction-timeout=30s"

#/etc/kubernetes/kubelet
KUBELET_ADDRESS="--address=0.0.0.0"
KUBELET_PORT="--port=10250"
KUBELET_API_SERVER="--api-servers=http://127.0.0.1:8080"
KUBELET_ARGS="--node-status-update-frequency=4s --cgroup-driver=systemd"

#/etc/kubernetes/proxy
# empty

#/etc/kubernetes/scheduler
# empty

Then you can enable and start all those new services, and you have a k8s cluster running. You can test it by invoking "kubectl get nodes".
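
For example, that enable/start step can be scripted; the function below is just a sketch looping over the unit names created above:

```shell
# Enable and start the five kubernetes daemons whose unit files were
# created above. Run this on each server in the cluster.
enable_k8s_services() {
  systemctl daemon-reload
  for svc in kube-apiserver kube-controller-manager kube-scheduler kubelet kube-proxy; do
    systemctl enable "$svc"
    systemctl start "$svc"
  done
}
```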

Some notes about the k8s processes

kubectl can be copied to any machine. It will try to communicate with kubernetes through the apiserver process on localhost. If you are running this tool on a server where the apiserver is not running, then you need to specify --server=http://api-server-address:8080. In our case, we have installed the apiserver on all k8s nodes, so you can connect to any instance in the cluster.

The apiserver process needs to run on all masters. This is so you can control each master remotely. I've configured each apiserver to only talk to its local etcd because we have installed etcd on all nodes. But we could configure it to talk to all etcd servers, even remote ones.

kube-proxy should run on all worker nodes (in our case, workers and masters are the same). Let's say you have a pod running a web server on port 242. You don't know on which node your pod will run, but you want to be able to access it using the IP of any of the nodes in the cluster. That is what kube-proxy does. So if you go to http://192.168.1.100:242 but your pod runs on 192.168.1.102, then kube-proxy will handle it. It will, as you guessed, proxy the request. And this happens at the TCP/UDP level; it is not just an HTTP proxy.

I am not entirely sure which of these processes are considered "master" versus "worker node" processes, but I believe the nodes should only run kube-proxy and kubelet; the other processes should only run on what would be considered master servers. Since it is just a matter of copying the binaries over and changing some addresses in the environment files under /etc/kubernetes, it would be easy to tweak the procedure above to get a different infrastructure.

Ansible

As a bonus, I wrote an ansible script to do all this automatically. https://github.com/pdumais/ansible-kubernetes



Writing a hypervisor with Intel VT-x
Last edited on Jun 21, 2017

Introduction

Before virtualization support was available on CPUs, virtual machines had to emulate most of the code that a guest OS would run. With VMX, the CPU handles the virtualization. If you are familiar with the hardware context switches that were available on ia32, it's a bit similar. The hypervisor manages many VMs. When executing the code of a VM, a hardware context switch occurs. The context switch saves the current CPU state (but not the general purpose registers) and loads another state, that of the VM. Then code can execute from there. So the VM ends up running "bare-metal": the CPU has changed its entire context and runs your VM just as it would if the guest was running bare-metal. With one exception: some operations cause VMExits. For example, when an interrupt occurs, the CPU automatically switches back to the hypervisor context (what they call the VMM). The VMM can then handle the interrupt and change the guest's state data so that on the next entry it will think it has an interrupt to handle (or not). This is basically how you would handle the timer interrupt on the guest and on the host so that multithreading works. VMExits occur for many other reasons: a VMExit can occur when the VM tries to access unmapped memory, when executing "hlt", and so on.

For other articles about my OS:
Networking in my OS
virtio driver implementation
Process Context ID and the TLB
Thread management in my hobby OS
Enabling Multi-Processors in my hobby OS
Memory Paging
AVX/SSE and context switching
Stack frame and the red zone (x86_64)

To view the full source code of my OS: https://github.com/pdumais/OperatingSystem

Goal

The virtual environment we'll be setting up is for a 16-bit real mode VM, because that's how a real CPU would start. Control will initially be transferred to address 0x00000000. Normally, an entry point to BIOS code would be located there and the BIOS would eventually load the first sector of the primary HDD and transfer control to 0x7c00. But in my case, I am not writing a BIOS, so only test code will be executed in the VM.

Memory

In a VMX environment, there exists several layers of memory:

  • Guest virtual memory: The guest's virtual memory if paging is enabled in the Guest. That memory translates to guest physical memory using the guest's paging structures, entirely controlled by the guest.
  • Guest physical memory: This is the physical memory from the VM's point of view. From the hypervisor's point of view, this is just virtual memory that has to be mapped to physical memory. This memory would be allocated lazily as the VM uses more and more memory. This memory is mapped to physical memory by the hypervisor.
  • Physical memory: This is the real hardware physical memory as seen by the host's CPU.

For more details about how paging works, please refer to Memory Paging

For the VM to have memory, some of the host's physical memory must be assigned to the VM. The guest's physical memory is seen as virtual memory by the host. To assign memory to the guest, a CPU feature called "EPT" can be used. EPT works just like the regular paging system: it is a way to map guest-physical addresses to real physical addresses using a paging structure with a PML4 table on top.

So a guest would prepare its own internal paging system and have its own PML4 table. When a virtual address inside the VM needs to be resolved, it will be resolved to a guest-physical address using the guest's PML4 table. Then, the guest-physical address will be resolved using the EPT's PML4 table. So 2 translations would be done.

EPT is enabled by setting the "Enable EPT" bit in the VM-Execution controls. When enabled, an EPT PML4 structure must be prepared, and the base address of that structure must be stored in the VMCS EPT pointer.

Lazy allocation of guest physical memory

When a guest is launched with, for example, 16GB of RAM, it would not be desirable to reserve that entire memory on the hypervisor immediately. Instead, it would be preferable to allocate that memory lazily. A guest might never use its entire RAM, so reserving it all on the hypervisor would be a waste.

Reading an unallocated guest physical page will return zeros. Writing to it will trigger a EPT violation. The VMM can then reserve a physical page and map it there instead of the dummy page.

When the guest is launched, a single page of write-protected physical memory (filled with zeros) should be mapped to all the guest physical memory. If the VM's BIOS starts probing memory by reading all of it, every 4k page would be mapped to that single page, so the BIOS would read zeros and think it is reading valid memory. If the BIOS (or the guest OS) writes into memory, a VMExit would occur because writing to any address would map to this write-protected physical page. The hypervisor can then allocate a new R/W page in the EPT's paging structure for the guest physical memory. Of course, if the guest does a write/read-back kind of algorithm to probe the memory, then all the guest physical memory will have been allocated because all pages will have been written to, so all bets are off.

Reclaiming freed-up memory is a different story. The only way the hypervisor can know about memory that can be reclaimed is by using memory ballooning. The virtio specification describes a memory balloon device that can be implemented for that purpose.

When the guest will access guest-physical memory that is unmapped to physical memory, a VMExit will occur. The hypervisor needs to evaluate if the guest-physical address falls into the possible addressing space of the VM (i.e. if it is trying to access memory beyond the last byte of the virtual RAM). If the memory is unmapped because of the lazy-loading scheme, then a new page can be allocated. If it is because that memory address should not be mapped, then an exception will be injected back in the VM and the guest will need to handle that.

Translation Lookaside Buffer (TLB)

When the processor translates a virtual address to a physical address, the translation is cached in the TLB so that it won't have to walk through the page tables again next time. In a virtual environment, the CPU only has one TLB, in which it caches translations for:

  • All host virtual/physical mappings
  • All guest-physical/physical mappings (of all VMs)
  • All guest-virtual/guest-physical mappings (of all VMs).

This could create collisions in the TLB. For example, 2 VMs could have a mapping of virtual address 0x00000000. A simpler case would be two processes with different virtual memory mappings running on the host. For this reason, as I described in another article (Process Context ID and the TLB), the TLB supports processID-tagging. With VMX, two new taggings exist: VPID and EP4TA.

VPID is the tagging of a TLB entry for a specific virtual CPU. This only happens if VPID is enabled in the VMCS and if a non-zero VPID has been set in the VMCS. In such a case, every guest-virtual to guest-physical translation that is inserted in the TLB will be tagged with the current VPID. This means that another VCPU can insert a similar entry in the TLB and it will be tagged with a different VPID. When the MMU looks up a mapping in the TLB, it will only consider entries tagged with the current VPID.

EP4TA tagging is done when EPT is enabled. Every guest-virtual to host-physical mapping (going through the EPT tables) is cached in the TLB with a tagging of the current EP4TA, PCID and VPID. The EPT ID (EP4TA) is derived from the current EPTP, so there is no specific ID to set, unlike PCID and VPID.

If VPID is not enabled, then the VPID tag is always 0 and is always ignored during lookups. A lookup while EPT is active (during non-root operation) will only consider entries with a matching EP4TA.

I'm a bit confused by what the Intel documentation says about EPT and the TLB. From what I can understand, it seems like the TLB will cache guest-virtual mappings to host-physical addresses. Those are called "combined mappings". When EPT is enabled, these entries will be tagged with the current EP4TA, VPID and PCID. If VPID is not enabled:

  • entries will be tagged with VPID=0.
  • On vmexit/vmentry, all entries with VPID==0 will be invalidated, regardless of EP4TA tag.

For this reason, I am enabling VPID with a tag of "1" for all my VCPUs. The documentation clearly states that having the same VPID for different EPT is acceptable.

When the guest invokes "invlpg" or "invpcid", these instructions should be trapped because they will invalidate all requested TLB entries regardless of the current EP4TA. So the VMM should set a VMExit on those operations and emulate them using "invept". That's what I understand from section 28.3.3.1, but section 28.3.3.3 says that when EPT is enabled, there is no need to emulate these instructions.

When to invalidate

When a VMExit occurs because of an EPT violation, the faulting page will always be invalidated in the TLB. In my OS, I use the EPT violation handler to allocate more memory to the VM. When the VM resumes, the TLB will be up to date and the VM will be writing data in the correct physical page. If the VM is rescheduled on another core where a stale TLB entry exists for that address, then it will be accessing the wrong physical page. The TLB was only invalidated on the CPU that detected the EPT violation. For this reason, the host needs to do a TLB shootdown on all host CPUs. This is done by sending an IPI to all other CPUs. The CPUs will then execute "invept" to invalidate TLB entries associated with the EPT in question.

The "invept" instruction can be used to invalidate all TLB entries tagged with a specific EPT, or with any EPT.

Initialization

The initialization of VMX mode is well described. There are a couple of web sites with examples that can be found using a simple Google search, and the Intel SDM #3 describes an algorithm to do so. Basically, this is what needs to happen:

  • detect support for VMX
  • create a VMCS for vmxon
  • execute vmxon
  • create a VMCS for a VM and write all required data in it
  • run vmlaunch

It's important to note that VMXON should be executed on all CPUs, and this should be done when the hypervisor starts. In my OS, I do this as soon as I switch to long mode on all cores and I leave it on all the time. The tricky part is to get the VMCS for the guest initialized correctly. I had to spend quite some time fine-tuning the fields. If VMLAUNCH fails, the zero flag will be set; you can then do a VMREAD of VM_INSTRUCTION_ERROR to get the error code. The SDM (chapter 26) describes all the checks that the CPU does, so you can walk through those and make sure your code is compliant.

Also, it seems like access rights and segment limits for guest segment descriptors still need to be properly set even if the VM will run in real mode. If I recall correctly, it's because they will be checked even in real mode and that is how "unreal mode" works.

I am not going in the details about this because everyone talks about that and the SDM has very good documentation about it.

Multi-tasking (in a multi-core environment)

Running a VM in a multitasking environment requires a bit of logic that is beyond what the Intel manuals describe. This is because a VM really is just a process getting a time slice with the scheduler. So when the process is scheduled out, the entire VM is paused.

Every VCPU is managed by one and only one thread in the OS, so each thread has a VMCS (if it is running a VCPU). Before the task gets scheduled out, vmclear is executed so that any cached data gets flushed to the VMCS. At least that's what Intel recommends. When a task is scheduled back in, VMPTRLD is executed to load the VMCS (if any) that this task manages.

When a timer interrupt occurs and it's time to do a context switch on the host, since the VM runs as part of a process on the host, it is normal that it gets interrupted and that time is given to another process. That process might want to run a VM also, or maybe it's just a normal application. When the timer interrupt occurs, it triggers a VMExit and the CPU is back executing code in the context of the host. The thread will change. When getting back to this thread, we will do a VMRESUME to continue executing the VM. We will also reload the VMCS with VMPTRLD because another process might have loaded another VMCS. We could even be running on another processor in a multi-core system. And remember, a VMCS is only loaded on the current core. So with 4 cores, you could have 4 different VMs running.

When the VMM is running in root-operations because of a VMexit, it could have been scheduled out at any moment before doing the VMRESUME. So when VMRESUME is executed, it is possible that VMCLEAR was executed before because of a task switch. For this reason, the OS checks if VMRESUME fails and executes VMLAUNCH instead.

The OS should avoid scheduling VCPUs from core to core because moving a VCPU to another core makes it mandatory to do a VMCLEAR. My OS does not make such an effort and blindly does a VMCLEAR on every task switch, so performance could be improved there.

An interesting point here is that after executing VMLAUNCH, control never returns to the calling function (unless vmlaunch fails). VMLAUNCH and VMRESUME will jump into the guest's code without pushing a return address. A VMExit handler has been defined in the VMCS, so from the host's point of view, control skips from vmlaunch to the handler. Also, since the stack is set up in the host's section of the VMCS, the VMM gets a brand new stack every time the VMExit handler gets called. This means that there is no need to pop the stack before executing VMLAUNCH/VMRESUME.

VM Exits

When the CPU is running in non-root operation (running VM code), it's pretty much as if the VM was running bare-metal. But several events can make the VM exit back to root operation (the hypervisor). A VM exit restores the CPU state to what was set in the VMCS before vmlaunch. The hypervisor must determine what caused the VMExit and take proper action. VMExits are similar to interrupts in that the general purpose registers are not automatically saved. A VMExit handler must save the registers and restore them before resuming the VM, otherwise the VM will be corrupted. A separate stack is set, though, as per the host section of the VMCS. The VMM can alter the VM state by changing data in the VMCS before resuming the VM. For example, if the VMExit occurred because of an exception of type #UD, the VMM could emulate the invalid instruction and change general purpose registers and flags in the VM.

Interrupts

When an interrupt needs to be delivered to the host, a VMExit will occur and control will be immediately given to the VMM's exit handler. At that moment interrupts will be disabled (flags.IF=0). When bit 15 (Acknowledge interrupt on exit) is cleared in the VM-Exit controls, the interrupt is not acknowledged by the processor. This means that as soon as the "IF" flag becomes 1 (i.e., when executing "sti"), the interrupt handler will get called.

In my hypervisor, if such a VM exit is detected, the kernel executes "sti" and finally "vmresume". This lets the host execute the handler as soon as interrupts are re-enabled; control is then handed back to the VMM exit handler, which executes "vmresume".

It is worth noting that it doesn't seem to matter if interrupts were cleared before VMLAUNCH. Interrupts will still trigger a VMExit and the interrupt will be pending until the VMM enables them (using "sti").

Multiple VCPU

Supporting multiple VCPUs for a VM requires having 1 VMCS per VCPU (just as if they were all used in independent VMs). But the hypervisor has to keep those CPUs dormant until a "Startup IPI" is sent. From there, the hypervisor can configure the VCPU to start executing code at the desired address. The main thing here is that only one VCPU would be started when the VM is launched. My OS does not yet support this; I am focusing on 1 VCPU for now.

One thing to consider when running multiple VCPUs is that one of them could trigger a VMExit because of a memory allocation requirement (as described in a previous section of this article). In such a case, another VCPU could be running, or the current thread could get scheduled out with another VCPU being scheduled in. So extra care should be taken to ensure that memory does not get corrupted.

Other thoughts

There are many other things to consider to improve performance and to make the design robust. Things to consider are:

  • Trapping "hlt" with a vmexit so we can schedule out an idle VCPU
  • Trapping TLB/MMU operations to better manage memory
  • Trapping IO operations so that the hypervisor can retain control over hardware.
  • Setup a virtual apic so that VCPUs can receive interrupts (and make a scheduler).

The code

For reference, this is a part of my VMX code:

What next?

There is still some work to be done to implement a full hypervisor. The work done here only creates virtual CPUs. The hypervisor needs to emulate other hardware such as a hard drive, a network card, a video card, a keyboard, etc. I would probably write virtio devices since they are easy to code and very efficient. Another big part that is missing is a BIOS. The hypervisor has to provide a BIOS for your guests since they rely on it to boot.



Openflow Web Server
Last edited on Nov 14, 2016

To learn a bit about openflow, I wanted to try to build a fake web server that runs on an openflow controller. This idea came to mind when I first learned about DHCP servers running in controllers. So I wanted to try to build a web server to see if I understood the concept correctly.

OVS setup with pox

For my test, instead of creating a TAP interface and attaching a VM to it, I am simply going to create a VETH pair so I can use that interface from the host.

Create a bridge and attach the controller to it:

$ ovs-vsctl add-br sw0
$ ovs-vsctl set-controller sw0 tcp:127.0.0.1:6633

Now create a VETH pair and add one side in the bridge

$ ip link add veth1 type veth peer name veth2
$ ip l set dev veth1 up
$ ip l set dev veth2 up
$ ip addr add 192.168.4.10/24 dev veth2
$ ovs-vsctl add-port sw0 veth1

This is what the bridge would look like

$ ovs-vsctl show
6e1ab0bb-d086-46f0-866e-81196d0e9b09
    Bridge "sw0"
        Controller "tcp:127.0.0.1:6633"
        Port "sw0"
            Interface "sw0"
                type: internal
        Port "veth1"
            Interface "veth1"
    ovs_version: "2.5.1"

Launch the POX controller with the module that replies to any ping requests (and any ARP queries). I am assuming POX is already installed.

$ ./pox.py proto.pong
$ ping -I veth2 192.168.4.104
PING 192.168.4.104 (192.168.4.104) from 192.168.4.10 veth2: 56(84) bytes of data.
64 bytes from 192.168.4.104: icmp_seq=1 ttl=64 time=45.5 ms
64 bytes from 192.168.4.104: icmp_seq=2 ttl=64 time=10.1 ms
^C

Openflow

This is not meant to be a full explanation of what openflow is; it is just enough to understand the power of SDN and what can be done with it. Openflow is the protocol that the switch and the controller use to make SDN happen.

When a packet comes into an openflow enabled switch, the switch looks in its flow tables to see if a flow matches the incoming packet. If no flow is found, the packet is forwarded to the controller. The controller inspects the packet and, from there, a few things can happen. The controller could instruct the switch to add a new flow to its tables describing how to forward the packet. That flow would be used for the current packet, and since it is added to the flow tables, future packets that match it won't need to be sent to the controller anymore. The controller could also remove flows, obviously. But things get interesting here: instead of adding or removing flows, the controller can decide to tell the switch to send a packet instead. The controller would construct a packet and tell the switch to send it on a specific physical port. The initial packet would then be dropped.
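
The lookup logic can be sketched as a first-match search over a flow table. This is an illustrative sketch with made-up field names; real switches match many more fields and honor flow priorities:

```python
# Minimal sketch of an openflow switch's table lookup (simplified).

def lookup(flow_table, packet):
    """Return the action of the first flow whose match fields all equal
    the packet's fields; otherwise punt the packet to the controller."""
    for match, action in flow_table:
        if all(packet.get(k) == v for k, v in match.items()):
            return action
    return "send-to-controller"

flows = [({"dst_mac": "0a:26:7c:d0:bf:dc"}, "output:1")]
print(lookup(flows, {"dst_mac": "0a:26:7c:d0:bf:dc"}))  # known flow
print(lookup(flows, {"dst_mac": "ff:ff:ff:ff:ff:ff"}))  # miss -> controller
```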

So this means that the controller could receive a packet, inspect it and determine: this packet came in on port 1 from MAC-a, for MAC-b; it is a TCP (therefore IP) packet from IP-a, for IP-b; and it is a SYN packet. The controller could then decide to build another packet similar to the incoming one, but reverse the MAC addresses, reverse the IP addresses, set the ACK bit (making it a SYN/ACK), swap the TCP ports, and play around with the SEQ/ACK numbers a bit. The controller would instruct the switch to send that packet and drop the initial one. By doing this, any TCP connection packet sent on the network, regardless of the destination MAC, destination IP or destination TCP port, would get acknowledged.
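
The header manipulation described above is a pure function of the incoming SYN's fields. Here is a sketch using a plain dict to stand in for parsed headers (illustrative field names, not the actual POX packet API):

```python
# Sketch: derive SYN/ACK headers from an incoming SYN's headers.

def make_synack(syn):
    return {
        "src_mac": syn["dst_mac"], "dst_mac": syn["src_mac"],    # swap MACs
        "src_ip":  syn["dst_ip"],  "dst_ip":  syn["src_ip"],     # swap IPs
        "src_port": syn["dst_port"], "dst_port": syn["src_port"],
        "flags": "SA",              # SYN/ACK
        "seq": 0,                   # our (arbitrary) initial sequence number
        "ack": syn["seq"] + 1,      # acknowledge the client's SYN
    }

syn = {"src_mac": "0a:26:7c:d0:bf:dc", "dst_mac": "02:00:de:ad:be:ef",
       "src_ip": "192.168.4.10", "dst_ip": "192.168.4.157",
       "src_port": 51993, "dst_port": 80, "flags": "S", "seq": 547886470}
print(make_synack(syn)["ack"])  # 547886471
```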

Naive web server construction

Taking the idea above, we only need to do 3 things to make a really dumb web server.

  • Respond to any incoming ARP query with a dummy MAC address. It doesn't matter what MAC we hand out, as long as the client receives a reply and is happy with it.
  • ACK TCP connections going to the IP address that we received an ARP query for earlier, so the client successfully establishes a TCP connection.
  • Respond with an HTTP response when a TCP packet with a payload containing a "GET" is received at the IP used previously.

Analysis

Here we see the client's ARP request being broadcast:

07:47:22.622933 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.4.157 tell 192.168.4.10, length 28
    0x0000:  ffff ffff ffff 0a26 7cd0 bfdc 0806 0001  .......&|.......
    0x0010:  0800 0604 0001 0a26 7cd0 bfdc c0a8 040a  .......&|.......
    0x0020:  0000 0000 0000 c0a8 049d                 ..........

The controller wrote a reply by putting the request's source MAC in the destination field, the dummy MAC in the source field, and writing an ARP reply payload:

07:47:22.667931 ARP, Ethernet (len 6), IPv4 (len 4), Reply 192.168.4.157 is-at 02:00:de:ad:be:ef, length 28
    0x0000:  0a26 7cd0 bfdc 0200 dead beef 0806 0001  .&|.............
    0x0010:  0800 0604 0002 0200 dead beef c0a8 049d  ................
    0x0020:  0a26 7cd0 bfdc c0a8 040a                 .&|.......
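
A controller can build such a reply with a few lines of packing code. This sketch (plain Python struct, not the POX API) reproduces the frame above byte for byte:

```python
import struct, socket

def arp_reply(req_src_mac, req_src_ip, target_ip, fake_mac):
    """Build an Ethernet + ARP 'is-at' reply for target_ip (sketch)."""
    eth = req_src_mac + fake_mac + struct.pack("!H", 0x0806)  # dst, src, ARP
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)           # opcode 2 = reply
    arp += fake_mac + socket.inet_aton(target_ip)             # sender: fake host
    arp += req_src_mac + socket.inet_aton(req_src_ip)         # target: requester
    return eth + arp

client_mac = bytes.fromhex("0a267cd0bfdc")
fake_mac   = bytes.fromhex("0200deadbeef")
frame = arp_reply(client_mac, "192.168.4.10", "192.168.4.157", fake_mac)
print(len(frame))  # 42 bytes: 14 ethernet + 28 ARP, matching the dump
```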

Then the client, satisfied with the ARP resolution, tries to connect to that fake host, sending a SYN:

07:47:22.667948 IP (tos 0x0, ttl 64, id 46203, offset 0, flags [DF], proto TCP (6), length 60) 
    192.168.4.10.51993 > 192.168.4.157.http: Flags [S], cksum 0x8a26 (incorrect -> 0x893c), seq 547886470, win 29200, options [mss 1460,sackOK,TS val 522625533 ecr 0,nop,wscale 7], length 0
    0x0000:  0200 dead beef 0a26 7cd0 bfdc 0800 4500  .......&|.....E.
    0x0010:  003c b47b 4000 4006 fc48 c0a8 040a c0a8  .<.{@.@..H......
    0x0020:  049d cb19 0050 20a8 1586 0000 0000 a002  .....P..........
    0x0030:  7210 8a26 0000 0204 05b4 0402 080a 1f26  r..&...........&
    0x0040:  a1fd 0000 0000 0103 0307                 ..........

The controller added an ACK flag to the TCP header, copied SEQ+1 into ACK and set SEQ to 0. It swapped the dst/src MAC addresses in the ethernet header, the IP addresses in the IP header and the ports in the TCP header.

07:47:22.670347 IP (tos 0x0, ttl 64, id 27868, offset 0, flags [none], proto TCP (6), length 40) 
    192.168.4.157.http > 192.168.4.10.51993: Flags [S.], cksum 0xb231 (correct), seq 0, ack 547886471, win 29200, length 0
    0x0000:  0a26 7cd0 bfdc 0200 dead beef 0800 4500  .&|...........E.
    0x0010:  0028 6cdc 0000 4006 83fc c0a8 049d c0a8  .(l...@.........
    0x0020:  040a 0050 cb19 0000 0000 20a8 1587 5012  ...P..........P.
    0x0030:  7210 b231 0000                           r..1..

The client sent an ACK as the final step of the 3-way handshake, but we don't care:

07:47:22.670376 IP (tos 0x0, ttl 64, id 46204, offset 0, flags [DF], proto TCP (6), length 40) 
    192.168.4.10.51993 > 192.168.4.157.http: Flags [.], cksum 0x8a12 (incorrect -> 0xb232), seq 547886471, ack 1, win 29200, length 0
    0x0000:  0200 dead beef 0a26 7cd0 bfdc 0800 4500  .......&|.....E.
    0x0010:  0028 b47c 4000 4006 fc5b c0a8 040a c0a8  .(.|@.@..[......
    0x0020:  049d cb19 0050 20a8 1587 0000 0001 5010  .....P........P.
    0x0030:  7210 8a12 0000                           r.....

Now we receive an HTTP GET request:

07:47:22.671570 IP (tos 0x0, ttl 64, id 46205, offset 0, flags [DF], proto TCP (6), length 117)
    192.168.4.10.51993 > 192.168.4.157.http: Flags [P.], cksum 0x8a5f (incorrect -> 0x498f), seq 547886471:547886548, ack 1, win 29200, length 77
    0x0000:  0200 dead beef 0a26 7cd0 bfdc 0800 4500  .......&|.....E.
    0x0010:  0075 b47d 4000 4006 fc0d c0a8 040a c0a8  .u.}@.@.........
    0x0020:  049d cb19 0050 20a8 1587 0000 0001 5018  .....P........P.
    0x0030:  7210 8a5f 0000 4745 5420 2f20 4854 5450  r.._..GET./.HTTP
    0x0040:  2f31 2e31 0d0a 5573 6572 2d41 6765 6e74  /1.1..User-Agent
    0x0050:  3a20 6375 726c 2f37 2e32 392e 300d 0a48  :.curl/7.29.0..H
    0x0060:  6f73 743a 2031 3932 2e31 3638 2e34 2e31  ost:.192.168.4.1
    0x0070:  3537 0d0a 4163 6365 7074 3a20 2a2f 2a0d  57..Accept:.*/*.
    0x0080:  0a0d 0a                                  ... 

So we swap all the source/destination fields, fix up SEQ and ACK, and add a response payload.

07:47:22.673373 IP (tos 0x0, ttl 64, id 27871, offset 0, flags [none], proto TCP (6), length 122)
    192.168.4.157.http > 192.168.4.10.51993: Flags [.], cksum 0xa53c (correct), seq 1:83, ack 547886471, win 29200, length 82
    0x0000:  0a26 7cd0 bfdc 0200 dead beef 0800 4500  .&|...........E.
    0x0010:  007a 6cdf 0000 4006 83a7 c0a8 049d c0a8  .zl...@.........
    0x0020:  040a 0050 cb19 0000 0001 20a8 1587 5010  ...P..........P.
    0x0030:  7210 a53c 0000 4854 5450 2f31 2e31 2032  r..<..HTTP/1.1.2
    0x0040:  3030 204f 4b0a 436f 6e74 656e 742d 5479  00.OK.Content-Ty
    0x0050:  7065 3a74 6578 742f 706c 6169 6e0a 436f  pe:text/plain.Co
    0x0060:  6e74 656e 742d 4c65 6e67 7468 3a37 0a43  ntent-Length:7.C
    0x0070:  6f6e 6e65 6374 696f 6e3a 636c 6f73 650a  onnection:close.
    0x0080:  0a41 7765 736f 6d65                      .Awesome
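
That response payload is just a hand-assembled string. A sketch of how a controller could build it (note that the capture above uses bare \n line endings, which curl tolerates even though HTTP specifies \r\n):

```python
def http_response(body):
    """Assemble the minimal HTTP response seen in the dump above."""
    return ("HTTP/1.1 200 OK\n"
            "Content-Type:text/plain\n"
            "Content-Length:%d\n" % len(body) +
            "Connection:close\n"
            "\n" + body)

payload = http_response("Awesome")
print(len(payload))  # 82 bytes, matching the TCP payload length in the capture
```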

This is obviously completely useless and you would definitely not want a controller doing this on your network. But it shows how powerful SDN can be. The controller has complete power over the traffic that goes through the switch. It could respond to DNS queries, ARP queries, ICMP, etc.

Useful things to do

Learning switch

If that web server is completely useless, one thing we could do instead is inspect all incoming packets and forward them to the right port. If the destination MAC is unknown, we have the switch flood on all ports, and we learn which MACs live on which ports from the packets we receive, so that we can make a better decision next time. But that's kind of redundant, because that's literally what every switch already does.
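
The learning logic fits in a few lines. A sketch (a real controller would also install flows so it doesn't have to decide every packet itself):

```python
# Sketch of a MAC-learning switch: learn source MACs, forward on the
# learned port when the destination is known, flood otherwise.
mac_table = {}

def handle_packet(in_port, src_mac, dst_mac):
    mac_table[src_mac] = in_port             # learn where src_mac lives
    return mac_table.get(dst_mac, "flood")   # known port, or flood

print(handle_packet(1, "aa:aa", "bb:bb"))  # flood: bb:bb not seen yet
print(handle_packet(2, "bb:bb", "aa:aa"))  # 1: aa:aa was learned on port 1
```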

VXLANs

A more useful thing to do is to look at the destination IP address of a packet and wrap the packet in a VXLAN packet, with a VNI chosen from that IP. We would then always forward those packets out of a specific port. You would have created a kind of VXLAN gateway. As a bonus: no need for VLANs. Since the switch only wraps the packet and always forwards it out of port 24 (for example), this can all happen on the default VLAN.
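
Encapsulation itself is mostly a matter of prepending an 8-byte VXLAN header (plus outer UDP/IP/Ethernet headers, omitted here) carrying a 24-bit VNI. A sketch, with a made-up IP-to-VNI policy:

```python
import struct

def vxlan_wrap(inner_frame, vni):
    """Prepend the 8-byte VXLAN header: flags byte (0x08 = VNI valid),
    3 reserved bytes, 24-bit VNI, 1 reserved byte."""
    header = struct.pack("!B3xI", 0x08, vni << 8)  # VNI occupies bytes 4..6
    return header + inner_frame

def vni_for(dst_ip):
    # Hypothetical policy: one VNI per /24 of the inner destination IP.
    return int(dst_ip.split(".")[2]) + 1000

pkt = vxlan_wrap(b"\x00" * 64, vni_for("192.168.4.157"))
print(len(pkt))  # 72: 8-byte header + 64-byte inner frame
```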

My code

This is the code I wrote to make the web server. I am reusing what pox already had for the ARP portion of things, so you need to copy this file into "pox/proto". Once you run the controller, just do an HTTP GET to any address, on any port, and you will get a response. download the code

Conclusion

Having a controller like this adds a very powerful tool to the network. It can also be very dangerous if someone with malicious intent gets access to the controller. It becomes very easy to poison DNS queries without even having access to the DNS server, or to simply build a MITM, because the controller basically is a MITM anyway.



Creating your own linux containersLast edited on Jul 10, 2016

I'm not trying to say that this solution is better than LXC or docker. I'm just doing this because it is very simple to get a basic container created with chroot and cgroups. Of course, docker provides many more features than this, but this really is the basis. It's easy to make containers in linux, depending on the number of features you need.

cgroups

cgroups are a way to run a process while limiting its resources, such as CPU time and memory. The way it works is by creating a group (a cgroup), defining limits for various resources (you can control more than just memory and CPU time) and then running a process under that cgroup. It is important to know that every child process of that process will also run under the same cgroup. So if you start a bash shell under a cgroup, any program that you launch from the shell will be constrained to the same cgroup as the shell.

If you create a cgroup with a memory limit of 10MB, this limit applies to all processes running in the cgroup: the 10MB is a constraint on the sum of the memory used by all processes under that cgroup.

Configuring

On slackware 14.2 RC2, I didn't have to install or set up anything; cgroups were usable with the tools already installed. I did have to manually enable cgroups in the kernel though, since I compiled my own kernel. Without going into the details (because this is covered in a million other websites), you need to make sure that:

  • cgroup kernel support is built in or loaded as modules
  • cgroup tools are installed
  • cgroup filesystem is mounted (normally accessible through /sys/fs/cgroup/)

Here's how to run a process in a cgroup

cgcreate -g memory:testgroup
# now the "/sys/fs/cgroup/memory/testgroup" exists and contains files that control the limits of the group

# assign a limit of 4mb to that cgroup
echo "4194304" > /sys/fs/cgroup/memory/testgroup/memory.limit_in_bytes

# run bash in that cgroup
cgexec -g memory:testgroup /bin/bash

Note that instead of using cgexec, you could also just write the current shell's PID into /sys/fs/cgroup/<subsystem>/<group>/tasks. Then your shell, and whatever processes you start from it, would execute in the cgroup.

Making your own containers

A container is nothing more than a chroot'd environment with processes confined in a cgroup. It's not difficult to write your own software to automate the environment setup. There is already a "chroot" system call. For cgroups, I was wondering if any system calls were available to create them. Using strace while running cgcreate, I found out that cgcreate only manipulates files in the cgroup file system. Then I got the rest of the information I needed from the documentation file located in the Documentation folder of the linux kernel source: Documentation/cgroups/cgroups.txt.

Creating a cgroup

To create a new cgroup, it is simply a matter of creating a new directory under the folder of each subsystem that the cgroup needs to control. For example, to create a cgroup that controls memory and cpu usage, you just need to create a directory "AwesomeControlGroup" under /sys/fs/cgroup/memory and /sys/fs/cgroup/cpu. These directories will automatically be populated with the files needed to control the cgroup (cgroup fs is a virtual filesystem, so the files do not exist on a physical medium).

Configuring a cgroup

To configure a cgroup, it is just a matter of writing parameters in the relevant file. For example: /sys/fs/cgroup/memory/testgroup/memory.limit_in_bytes

Running a process in a cgroup

To run a process in a cgroup, you need to launch the process, take its PID and write it to /sys/fs/cgroup/<subsystem>/<group>/tasks. My "container creator" application (let's call it the launcher) does it like this:

  • The launcher creates a cgroup and sets relevant parameters.
  • The launcher clones (instead of forking); we now have a parent and a child.
  • The parent waits for the child to die, after which it will destroy the cgroup.
  • The child writes its PID to the /sys/fs/cgroup/<subsystem>/<group>/tasks file for all subsystems (memory, cpu, etc.)
  • At this point, the child runs as per the cgroup's constraints.
  • The child invokes execv with the application that the user wanted to have invoked in the container.

The reason I use clone() instead of fork() is that clone() can take the CLONE_NEWPID flag. This creates a new process namespace that isolates the new process from the others that exist on the system. Indeed, when the cloned process queries its PID, it will find that it is assigned PID 1. Doing a "ps" will not list the other processes running on the system, since this new process is isolated.

Destroying a cgroup

To destroy a cgroup, just delete the /sys/fs/cgroup/<subsystem>/<group> directory.

So interfacing with cgroups from userland is just a matter of manipulating files in the cgroup file system. It is really easy to do programmatically, and also from the shell, without any special tools or libraries.
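
To illustrate just how little code this takes, here is the whole create/configure/attach/destroy cycle sketched in Python. The base path is a parameter so the functions can be exercised against any scratch directory; on a real system it would be /sys/fs/cgroup (and they would need root):

```python
import os, shutil

def create_cgroup(base, subsystem, group):
    # Creating the directory is all it takes on the cgroup fs.
    os.makedirs(os.path.join(base, subsystem, group), exist_ok=True)

def set_param(base, subsystem, group, param, value):
    # e.g. param = "memory.limit_in_bytes", value = 4194304
    with open(os.path.join(base, subsystem, group, param), "w") as f:
        f.write(str(value))

def add_pid(base, subsystem, group, pid):
    # Appending a PID to the "tasks" file moves that process into the cgroup.
    with open(os.path.join(base, subsystem, group, "tasks"), "a") as f:
        f.write("%d\n" % pid)

def destroy_cgroup(base, subsystem, group):
    # On the real cgroup fs the control files vanish with the directory,
    # so a plain rmdir is enough.
    os.rmdir(os.path.join(base, subsystem, group))

# Exercised against a scratch directory instead of /sys/fs/cgroup:
base = "/tmp/fake-cgroup-root"
shutil.rmtree(base, ignore_errors=True)
create_cgroup(base, "memory", "testgroup")
set_param(base, "memory", "testgroup", "memory.limit_in_bytes", 4194304)
add_pid(base, "memory", "testgroup", os.getpid())
```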

My container application

My "container launcher" is a c++ application that chroots into a directory and runs a process under a cgroup that it creates. To use it, I only need to type "./a.out container_path". The container_path is the path to a container directory that contains a "settings" file and a "chroot" directory. The "chroot" directory contains the environment of the container (a linux distribution maybe?) and the "settings" file contains the cgroup configuration and the name of the process to be launched.

You can download my code: cgroups.tar

Example

I've extracted the slackware initrd image found in the "isolinux" folder of the dvd.

cd /tmp/slackware/chroot
gunzip < /usr/src/slackware64-current/isolinux/initrd.img  | cpio -i --make-directories

Extracting this in /tmp/slackware/chroot gives me a small linux environment in that directory, and I've created a settings file in /tmp/slackware. Call this folder a "container": it contains a whole linux environment under the chroot folder and a settings file to indicate which user the container should run as, what process it should start, how much ram it can get, etc. For this example, my settings file looks like this:

user: 99
group: 98
memlimit: 4194304
cpupercent: 5
process:/bin/bash
arg1:-i
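
The settings format is just "key: value" lines. My launcher is C++, but a parsing sketch in Python mirrors the format:

```python
def parse_settings(text):
    """Parse 'key: value' lines into a dict (sketch of the format above)."""
    settings = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")  # split on the first colon
            settings[key.strip()] = value.strip()
    return settings

cfg = parse_settings("""user: 99
group: 98
memlimit: 4194304
cpupercent: 5
process:/bin/bash
arg1:-i""")
print(cfg["process"], cfg["memlimit"])  # /bin/bash 4194304
```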

And running the container gives me:

[12:49:34 root@pat]# ./a.out /tmp/slackware
Mem limit: 4194304 bytes
CPU shares: 51 (5%)
Added PID 9337 in cgroup
Dropping privileges to 99:98
Starting /bin/bash -i

[17:26:10 nobody@pat:/]$ ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nobody       1  0.0  0.0  12004  3180 ?        S    17:26   0:00 /bin/bash -i
nobody       2  0.0  0.0  11340  1840 ?        R+   17:26   0:00 ps aux
[17:26:14 nobody@pat:/]$ ls
a  a.out  bin  boot  cdrom  dev  etc  floppy  init  lib  lib64  lost+found  mnt  nfs  proc  root  run  sbin  scripts  sys  tag  tmp  usr  var
[17:26:15 nobody@pat:/]$ exit
exit
Exiting container

Networking

When cloning with CLONE_NEWNET, the child process gets a separate network namespace. It doesn't see the host's network interfaces anymore. So in order to enable networking in the container, we need to create a veth pair. I am doing all network interface manipulations with libnl (which was already installed on a stock slackware installation). The veth pair acts as a kind of tunnel between the host and the container. The host-side interface is added to a bridge so that it can be part of another lan. The container-side interface is assigned an IP, and the container can then communicate with all peers that are on the same bridge. The bridge could be used to connect the container to the LAN, or to a local network consisting only of other containers from a select group.

The launcher creates an "eth0" that appears in the container. The other end of the veth pair is added to an OVS bridge. An IP address is set on eth0 and the default route is set. I then have full networking functionality in my container.

Settings for networking

bridge: br0    
ip: 192.168.1.228/24
gw: 192.168.1.1   

Result

Mem limit: 4194304 bytes
CPU shares: 51 (5%)
Added PID 14230 in cgroup
Dropping privileges to 0:0
Starting /bin/bash -i

[21:59:33 root@container:/]# ping google.com
PING google.com (172.217.2.142): 56 data bytes
64 bytes from 172.217.2.142: seq=0 ttl=52 time=22.238 ms
^C
--- google.com ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 22.238/22.238/22.238 ms
[21:59:36 root@container:/]# exit
exit
Exiting container


