Saturday, July 27, 2013

Plan for next milestone

So after the final (I hope it is the final one) version of the VMX hypervisor patch is committed, we need a plan for the next milestone. The next one is about test cases for the control bits in the VM-Execution Control Fields and the VM-Exit/VM-Entry Control Fields.


  1. Save and restore IA32_PAT and IA32_EFER on VM exit and VM entry. The control bits are defined in the Intel SDM: bits 18-21 in Table 24-10 (VM-exit controls) and bits 14-15 in Table 24-12 (VM-entry controls). IA32_EFER should be tested separately in and out of IA-32e mode.
  2. Test for VMX preemption timer. See details in Intel SDM "25.5.1 VMX-Preemption Timer", and bit 6 of Table 24-5.
  3. I/O bitmaps and exception bitmaps. Test if they act right.
  4. CR0/4 shadowing. See details in Intel SDM "24.6.6 Guest/Host Masks and Read Shadows for CR0 and CR4". CRx shadowing needs a Haswell host, as do APICv and posted interrupts. The related VMCS fields are "CR0 guest/host mask", "CR4 guest/host mask", "CR0 read shadow" and "CR4 read shadow".
  5. Instruction intercepts. Test the instruction intercepts of VMX and their exit codes and information. See Tables 24-6 and 24-7, as well as Section 27.2, which describes the information recorded on VM exit. Some instruction intercepts may depend on host capabilities.
The test cases listed above need to go in separate files, although the basic tests are simple enough to share a file, maybe instruction_intercepts.c or entry-exit-control.c. I'd prefer to put them all in a sub-directory named nvmx.
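Before any of these tests can flip a control bit, it has to know whether the host allows that bit to be set. Here is a minimal sketch of that check. The macro and function names are my own (not from any existing test framework), and cap_msr is assumed to be the raw value of a VMX capability MSR such as IA32_VMX_EXIT_CTLS, read elsewhere.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from item 1 above (hypothetical macro names). */
#define EXIT_SAVE_PAT   (1u << 18)
#define EXIT_LOAD_PAT   (1u << 19)
#define EXIT_SAVE_EFER  (1u << 20)
#define EXIT_LOAD_EFER  (1u << 21)
#define ENTRY_LOAD_PAT  (1u << 14)
#define ENTRY_LOAD_EFER (1u << 15)

/*
 * cap_msr layout: bits 31:0 report controls that must be 1,
 * bits 63:32 report controls that are allowed to be 1.
 */
static bool ctrl_supported(uint64_t cap_msr, uint32_t ctrl)
{
    uint32_t allowed1 = (uint32_t)(cap_msr >> 32);

    return (allowed1 & ctrl) == ctrl;
}

/* Start from the must-be-1 bits and add the wanted bits the host allows. */
static uint32_t ctrl_adjust(uint64_t cap_msr, uint32_t wanted)
{
    uint32_t must_be_1 = (uint32_t)cap_msr;
    uint32_t allowed1 = (uint32_t)(cap_msr >> 32);

    return (wanted | must_be_1) & allowed1;
}
```

A test for "save IA32_PAT" would then be skipped when ctrl_supported(cap, EXIT_SAVE_PAT) returns false instead of failing on VM entry.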

All the relevant patches should be committed after the framework patch is in, but I will keep developing in my tree.

Saturday, July 13, 2013

Finally finished the first vmx hypervisor

I finally finished my first vmx hypervisor!

Mini-Hypervisor Milestone 2013/07/13

The job is too trivial to debug!!!

Now the VM can run, and a VMCALL lands in vmx_entry with the right VMEXIT reason. But calling printf in the VM fails, and the VM hangs somewhere I haven't found yet.

Well, go to sleep first :)

Checking and Loading Guest State When VMLAUNCH/VMRESUME

If all checks on the VMX controls and the host-state area pass, the following checks take place, in any order. Because the processor is already entering the guest at this point, a failed check causes a VM exit that loads the host-state fields from the VMCS. So the failure can be identified from the exit reason field in the VMCS.
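To tell this kind of failure apart from a normal exit, the exit reason has to be decoded. A small sketch, with names of my own choosing: bit 31 of the exit reason flags a VM-entry failure, and basic exit reason 33 means the guest state was invalid.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the bit layout follows the SDM exit-reason field. */
#define EXIT_REASON_ENTRY_FAILURE (1u << 31) /* bit 31: VM-entry failure */
#define EXIT_REASON_BASIC_MASK    0xffffu    /* bits 15:0: basic exit reason */
#define EXIT_INVALID_GUEST_STATE  33u        /* entry failed on guest state */

/* True if the exit was caused by one of the guest-state checks failing. */
static bool entry_failed_on_guest_state(uint32_t exit_reason)
{
    return (exit_reason & EXIT_REASON_ENTRY_FAILURE) != 0 &&
           (exit_reason & EXIT_REASON_BASIC_MASK) == EXIT_INVALID_GUEST_STATE;
}
```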

I. Checks on Guest State Area

1. Checks on Guest Control Registers, Debug Registers and MSRs


  • The CR0 field must not set any bit to a value not supported in VMX operation, with the following exceptions:
    • CR0.PE (bit 0) and CR0.PG (bit 31) are not checked if "unrestricted guest" VM-execution control is 1 (Bit 7 of Secondary Processor-Based VM-Execution Controls)
    • CR0.NW (bit 29) and CR0.CD (bit 30) are never checked.
  • If bit 31 in the CR0 field (corresponding to PG) is 1, bit 0 in that field (PE) must also be 1.
  • The CR4 field must not set any bit to a value not supported in VMX operation
  • If the “load debug controls” (Bit 2 of VM-Entry Controls) VM-entry control is 1, bits reserved in the IA32_DEBUGCTL MSR must be 0 in the field for that register.
  • The following checks are performed on processors that support Intel 64 architecture:
    • If the “IA-32e mode guest” (Bit 9 of VM-Entry Controls) VM-entry control is 1, bit 31 in the CR0 field (corresponding to CR0.PG) and bit 5 in the CR4 field (corresponding to CR4.PAE) must each be 1.
    • If the “IA-32e mode guest” (Bit 9 of VM-Entry Controls) VM-entry control is 0, bit 17 in the CR4 field (corresponding to CR4.PCIDE) must be 0.
    • The CR3 field must be such that bits 63:52 and bits in the range 51:32 beyond the processor’s physical-address width are 0.
    • If the “load debug controls” (Bit 2 of VM-Entry Controls) VM-entry control is 1, bits 63:32 in the DR7 field must be 0.
    • The IA32_SYSENTER_ESP field and the IA32_SYSENTER_EIP field must each contain a canonical address.
  • If the “load IA32_PERF_GLOBAL_CTRL” (Bit 13 of VM-Entry Controls) VM-entry control is 1, bits reserved in the IA32_PERF_GLOBAL_CTRL MSR must be 0 in the field for that register. (See Intel SDM Volume 3B, Figure 18-3)
  • If the “load IA32_PAT” (Bit 14 of VM-Entry Controls) VM-entry control is 1, the value of the field for the IA32_PAT MSR must be one that could be written by WRMSR without fault at CPL 0. Specifically, each of the 8 bytes in the field must have one of the values 0 (UC), 1 (WC), 4 (WT), 5 (WP), 6 (WB), or 7 (UC-).
  • If the “load IA32_EFER” (Bit 15 of VM-Entry Controls) VM-entry control is 1, the following checks are performed on the field for the IA32_EFER MSR:
    • Bits reserved in the IA32_EFER MSR must be 0. (See Intel SDM Volume 3A, 2.2.1, Figure 2-4, Table 2-1)
    • Bit 10 (corresponding to IA32_EFER.LMA) must equal the value of the “IA-32e mode guest” (bit 9 of VM-Entry Controls) VM-entry control. It must also be identical to bit 8 (LME) if bit 31 in the CR0 field (corresponding to CR0.PG) is 1.
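Two of the checks above are easy to express in code, which is handy when a test wants to predict whether a VM entry should fail. This is only a sketch with names I made up; it covers the IA32_PAT byte check and the LMA/LME consistency check, and leaves out the reserved-bit checks.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_PG   (UINT64_C(1) << 31)
#define EFER_LME (UINT64_C(1) << 8)
#define EFER_LMA (UINT64_C(1) << 10)

/* Each of the 8 bytes must be 0, 1, 4, 5, 6 or 7. */
static bool pat_field_valid(uint64_t pat)
{
    for (int i = 0; i < 8; i++) {
        uint8_t type = (uint8_t)(pat >> (8 * i));

        if (type == 2 || type == 3 || type > 7)
            return false;
    }
    return true;
}

/* LMA must match the "IA-32e mode guest" control, and LME if CR0.PG is 1. */
static bool efer_field_valid(uint64_t efer, uint64_t cr0, bool ia32e_guest)
{
    bool lma = (efer & EFER_LMA) != 0;
    bool lme = (efer & EFER_LME) != 0;

    if (lma != ia32e_guest)
        return false;
    if ((cr0 & CR0_PG) != 0 && lma != lme)
        return false;
    return true;
}
```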

      2. Checks on Guest Segment Registers


      This section specifies the checks on the fields for CS, SS, DS, ES, FS, GS, TR, and LDTR.

      The following terms are used in defining these checks:

      • The guest will be virtual-8086 if the VM flag (bit 17) is 1 in the RFLAGS field in the guest-state area.
      • The guest will be IA-32e mode if the “IA-32e mode guest” VM-entry control (Bit 9 of VM-Entry Controls) is 1. (This is possible only on processors that support Intel 64 architecture.)
      • Any one of these registers is said to be usable if the unusable bit (bit 16) is 0 in the access-rights field for that register.
      The following checks are then performed on the segment registers:
      • Selector fields
        • TR. The TI flag (bit 2) must be 0.
        • LDTR. If LDTR is usable, the TI flag (bit 2) must be 0.
        • SS. If the guest will not be virtual-8086 and the “unrestricted guest” VM-execution control is 0, the RPL (bits 1:0) must equal the RPL of the selector field for CS.
      • Base-address fields.
        • CS, SS, DS, ES, FS, GS. If the guest will be virtual-8086, the address must be the selector field shifted left 4 bits (multiplied by 16).
        • The following checks are performed on processors that support Intel 64 architecture:
          • TR, FS, GS. The address must be canonical
          • LDTR. If LDTR is usable, the address must be canonical
          • CS. Bits 63:32 of the address must be zero
          • SS, DS, ES. If the register is usable, bits 63:32 of the address must be zero.
      • Limit fields for CS, SS, DS, ES, FS, GS. If the guest will be virtual-8086, the field must be 0000FFFFH.
      • Access-rights fields.
        • CS, SS, DS, ES, FS, GS.
          • If the guest will be virtual-8086, the field must be 000000F3H. This implies the following:
            • Bits 3:0 (Type) must be 3, indicating an expand-up read/write accessed data segment.
            • Bit 4 (S) must be 1.
            • Bits 6:5 (DPL) must be 3.
            • Bit 7 (P) must be 1.
            • Bits 11:8 (reserved), bit 12 (software available), bit 13 (reserved/L), bit 14 (D/B), bit 15 (G), bit 16 (unusable), and bits 31:17 (reserved) must all be 0.
          • If the guest will not be virtual-8086, the different sub-fields are considered separately:
            • Bits 3:0 (Type).
              • CS. The values allowed depend on the setting of the “unrestricted guest” VM-execution control (Bit 7 of Secondary Processor-Based VM-Execution Controls):
                • If the control is 0, the Type must be 9, 11, 13, or 15 (accessed code segment).
                • If the control is 1, the Type must be either 3 (read/write accessed expand-up data segment) or one of 9, 11, 13, and 15 (accessed code segment).
              • SS. If SS is usable, the Type must be 3 or 7 (read/write, accessed data segment).
              • DS, ES, FS, GS. The following checks apply if the register is usable:
                • Bit 0 of the Type must be 1 (accessed).
                • If bit 3 of the Type is 1 (code segment), then bit 1 of the Type must be 1 (readable).
            • Bit 4 (S). If the register is CS or if the register is usable, S must be 1.
            • Bits 6:5 (DPL).
              • CS.
                • If the Type is 3 (read/write accessed expand-up data segment), the DPL must be 0. The Type can be 3 only if the “unrestricted guest” VM-execution control is 1.
                • If the Type is 9 or 11 (non-conforming code segment), the DPL must equal the DPL in the access-rights field for SS.
                • If the Type is 13 or 15 (conforming code segment), the DPL cannot be greater than the DPL in the access-rights field for SS.
              • SS.
                • If the “unrestricted guest” VM-execution control is 0, the DPL must equal the RPL from the selector field.
                • The DPL must be 0 either if the Type in the access-rights field for CS is 3 (read/write accessed expand-up data segment) or if bit 0 in the CR0 field (corresponding to CR0.PE) is 0.
              • DS, ES, FS, GS. The DPL cannot be less than the RPL in the selector field if (1) the “unrestricted guest” VM-execution control is 0; (2) the register is usable; and (3) the Type in the access-rights field is in the range 0 – 11 (data segment or non-conforming code segment).
            • Bit 7 (P). If the register is CS or if the register is usable, P must be 1.
            • Bits 11:8 (reserved). If the register is CS or if the register is usable, these bits must all be 0.
            • Bit 14 (D/B). For CS, D/B must be 0 if the guest will be IA-32e mode and the L bit (bit 13) in the access-rights field is 1.
            • Bit 15 (G). The following checks apply if the register is CS or if the register is usable:
              • If any bit in the limit field in the range 11:0 is 0, G must be 0.
              • If any bit in the limit field in the range 31:20 is 1, G must be 1.
            • Bits 31:17 (reserved). If the register is CS or if the register is usable, these bits must all be 0.
        • TR. The different sub-fields are considered separately:
          • Bits 3:0 (Type).
            • If the guest will not be IA-32e mode, the Type must be 3 (16-bit busy TSS) or 11 (32-bit busy TSS).
            • If the guest will be IA-32e mode, the Type must be 11 (64-bit busy TSS).
          • Bit 4 (S). S must be 0.
          • Bit 7 (P). P must be 1.
          • Bits 11:8 (reserved). These bits must all be 0.
          • Bit 15 (G).
            • If any bit in the limit field in the range 11:0 is 0, G must be 0.
            • If any bit in the limit field in the range 31:20 is 1, G must be 1.
          • Bit 16 (Unusable). The unusable bit must be 0.
          • Bits 31:17 (reserved). These bits must all be 0.
        • LDTR. The following checks on the different sub-fields apply only if LDTR is usable:
          • Bits 3:0 (Type). The Type must be 2 (LDT).
          • Bit 4 (S). S must be 0.
          • Bit 7 (P). P must be 1.
          • Bits 11:8 (reserved). These bits must all be 0.
          • Bit 15 (G).
            • If any bit in the limit field in the range 11:0 is 0, G must be 0.
            • If any bit in the limit field in the range 31:20 is 1, G must be 1.
          • Bits 31:17 (reserved). These bits must all be 0.
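As a sanity check on my reading of the rules above, here is a sketch of two of them: the virtual-8086 constraints and the granularity (G) check against the limit. The struct and function names are assumptions of mine and are not taken from any existing code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for one guest segment register in the VMCS. */
struct guest_seg {
    uint16_t selector;
    uint64_t base;
    uint32_t limit;
    uint32_t ar; /* access-rights field */
};

/* Virtual-8086: base = selector << 4, limit = 0xFFFF, access rights = 0xF3. */
static bool v8086_seg_valid(const struct guest_seg *s)
{
    return s->base == ((uint64_t)s->selector << 4) &&
           s->limit == 0x0000FFFFu &&
           s->ar == 0x000000F3u;
}

/* G (bit 15) versus limit: bits 11:0 and bits 31:20 constrain it. */
static bool granularity_valid(uint32_t limit, uint32_t ar)
{
    bool g = ((ar >> 15) & 1u) != 0;

    if ((limit & 0x00000FFFu) != 0x00000FFFu && g)
        return false; /* some bit in 11:0 is 0, so G must be 0 */
    if ((limit & 0xFFF00000u) != 0 && !g)
        return false; /* some bit in 31:20 is 1, so G must be 1 */
    return true;
}
```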

      3. Checks on Guest Descriptor-Table Registers


      The following checks are performed on the fields for GDTR and IDTR:

      • On processors that support Intel 64 architecture, the base-address fields must contain canonical addresses.
      • Bits 31:16 of each limit field must be 0.
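Several checks in this post ask for a canonical address. A minimal sketch of that test, assuming the number of linear-address bits (48 on the machines I use) is passed in as va_bits; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical when bits 63:(va_bits - 1) are all 0 or all 1.
 * va_bits must be in the range 1..64.
 */
static bool is_canonical(uint64_t addr, unsigned va_bits)
{
    uint64_t upper = addr >> (va_bits - 1);

    return upper == 0 || upper == (UINT64_MAX >> (va_bits - 1));
}

/* GDTR/IDTR: canonical base, and bits 31:16 of the limit must be 0. */
static bool dtr_valid(uint64_t base, uint32_t limit, unsigned va_bits)
{
    return is_canonical(base, va_bits) && (limit >> 16) == 0;
}
```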

      4. Checks on Guest RIP and RFLAGS



      • RIP. The following checks are performed on processors that support Intel 64 architecture:
        • Bits 63:32 must be 0 if the “IA-32e mode guest” (Bit 9 of VM-Entry Controls) VM-entry control is 0 or if the L bit (bit 13) in the access rights field for CS is 0.
        • If the processor supports N < 64 linear-address bits, bits 63:N must be identical if the “IA-32e mode guest” VM-entry control is 1 and the L bit in the access-rights field for CS is 1.
      • RFLAGS.
        • Reserved bits 63:22 (bits 31:22 on processors that do not support Intel 64 architecture), bit 15, bit 5 and bit 3 must be 0 in the field, and reserved bit 1 must be 1.
        • The VM flag (bit 17) must be 0 either if the “IA-32e mode guest” VM-entry control is 1 or if bit 0 in the CR0 field (corresponding to CR0.PE) is 0.
        • The IF flag (RFLAGS[bit 9]) must be 1 if the valid bit (bit 31) in the VM-entry interruption-information field is 1 and the interruption type (bits 10:8) is external interrupt. (See Intel SDM, Volume 3C, 24.8.3 VM-Entry Controls for Event Injection)
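The RFLAGS rules can be sketched the same way. The names below are mine, and the function assumes a processor with Intel 64 support, so bits 63:22 are treated as reserved. Interruption type 0 in bits 10:8 is external interrupt.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_PE        (UINT64_C(1) << 0)
#define RFLAGS_FIXED1 (UINT64_C(1) << 1)  /* reserved, must be 1 */
#define RFLAGS_IF     (UINT64_C(1) << 9)
#define RFLAGS_VM     (UINT64_C(1) << 17)
/* Bits 63:22, 15, 5 and 3 must be 0. */
#define RFLAGS_RESERVED0 \
    (~UINT64_C(0x3FFFFF) | (UINT64_C(1) << 15) | (UINT64_C(1) << 5) | (UINT64_C(1) << 3))

static bool rflags_field_valid(uint64_t rflags, uint64_t cr0,
                               bool ia32e_guest, uint32_t intr_info)
{
    bool injecting = (intr_info >> 31) != 0;      /* valid bit */
    unsigned type = (intr_info >> 8) & 0x7u;      /* bits 10:8 */

    if ((rflags & RFLAGS_RESERVED0) != 0 || (rflags & RFLAGS_FIXED1) == 0)
        return false;
    if ((rflags & RFLAGS_VM) != 0 && (ia32e_guest || (cr0 & CR0_PE) == 0))
        return false;
    if (injecting && type == 0 && (rflags & RFLAGS_IF) == 0)
        return false; /* injecting an external interrupt needs IF = 1 */
    return true;
}
```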

      Sunday, July 7, 2013

      CPU state transfer in QEMU live migration

      The CPU migration state in QEMU lives in target-i386/machine.c. This file defines all the CPU state that needs to be transferred to the target VM in the structure const VMStateDescription vmstate_x86_cpu. Many of its fields carry version information. Optional state can be put in a VMStateSubsection so that it does not cause version problems.

      Read the relevant code for more details.

      Friday, July 5, 2013

      Format of GDT entry

      The Global Descriptor Table (GDT) is usually used to hold segment information. But the GDT can hold things other than segment descriptors: every 8-byte entry in the GDT is a descriptor, which can also be a TSS descriptor, an LDT descriptor, or a Call Gate descriptor. I will only introduce segment descriptors in this post, because the other three kinds are rarely used in most operating systems. Linux uses a shared TSS and LDT descriptor for every task and implements its own call-gate mechanism.

      First, the GDT is located through a GDT descriptor. This is not the same kind of descriptor as those mentioned above; it is a small structure with a 2-byte size and a 4-byte offset. The 2-byte size holds the size of the table in bytes minus one, and the 4-byte offset is the linear address of the first entry of the GDT. So in 32-bit mode the GDT can only be placed within the first 4 GiB of the linear address space.
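In C, the structure handed to lgdt in 32-bit mode could look like the sketch below. The names are hypothetical, and the packed attribute is a GCC/Clang extension that I am assuming is available. Note that the size field is the table length in bytes minus one, so a table of three 8-byte entries gets a size of 23.

```c
#include <stdint.h>

/* 6-byte operand of lgdt in 32-bit mode (hypothetical name). */
struct gdt_ptr {
    uint16_t size;   /* table length in bytes, minus 1 */
    uint32_t offset; /* linear address of the first entry */
} __attribute__((packed));

/* Build the pointer for a table with the given number of 8-byte entries. */
static struct gdt_ptr make_gdt_ptr(uint32_t table_addr, unsigned entries)
{
    struct gdt_ptr p = { (uint16_t)(entries * 8 - 1), table_addr };

    return p;
}
```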

      Each entry of segment descriptor in GDT has a complex structure:

      The three Base fields together hold the base address of the segment. "Limit 0:15" means that the field contains bits 0-15 of the limit value. The base is a 32-bit value containing the linear address where the segment begins. The limit, a 20-bit value, gives the maximum addressable unit (either in 1-byte units or in pages). Hence, if you choose page granularity (4 KiB) and set the limit value to 0xFFFFF, the segment will span the full 4 GiB address space. Here is the structure of the access byte and flags:


      The bit fields are:
      • Pr: Present bit. This must be 1 for all valid selectors.
      • Privl: Privilege, 2 bits. Contains the ring level, 0 = highest (kernel), 3 = lowest (user applications).
      • Ex: Executable bit. If 1, code in this segment can be executed, i.e. it is a code selector. If 0, it is a data selector.
      • DC: Direction bit/Conforming bit.
        • Direction bit for data selectors: Tells the direction. If 0, the segment grows up. If 1, the segment grows down, i.e. the offset has to be greater than the limit.
        • Conforming bit for code selectors:
          • If 1, code in this segment can be executed from an equal or lower privilege level. For example, code in ring 3 can far-jump to conforming code in a ring 2 segment. The privl bits represent the highest privilege level that is allowed to execute the segment. For example, code in ring 0 cannot far-jump to a conforming code segment with privl==0x2, while code in ring 2 and 3 can. Note that the privilege level remains the same, i.e. a far-jump from ring 3 to a privl==2 segment remains in ring 3 after the jump.
          • If 0 code in this segment can only be executed from the ring set in privl.
      • RW: Readable bit/Writable bit.
        • Readable bit for code selectors: Whether read access for this segment is allowed. Write access is never allowed for code segments.
        • Writable bit for data selectors: Whether write access for this segment is allowed. Read access is always allowed for data segments.
      • Ac: Accessed bit. Just set to 0. The CPU sets this to 1 when the segment is accessed.
      • Gr: Granularity bit. If 0 the limit is in 1 B blocks (byte granularity), if 1 the limit is in 4 KiB blocks (page granularity).
      • Sz: Size bit. If 0 the selector defines 16 bit protected mode. If 1 it defines 32 bit protected mode. You can have both 16 bit and 32 bit selectors at once.
      Not shown in the picture is the 'L' bit (bit 21, next to 'Sz') which is used for x86-64 mode. See table 3-1 of the Intel Architecture manual.

      Refer to Intel Manual 3.4.5 for more details.
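To make the layout concrete, here is a sketch that packs a base, a 20-bit limit, the access byte and the 4-bit flags nibble (Gr, Sz, L, and one software-available bit) into the 8-byte entry. The function name is my own. A flat 32-bit code segment (base 0, limit 0xFFFFF, access 0x9A, flags 0xC) encodes to 0x00CF9A000000FFFF.

```c
#include <stdint.h>

/* Pack one segment descriptor; flags is the upper nibble (Gr, Sz, L, AVL). */
static uint64_t gdt_entry(uint32_t base, uint32_t limit,
                          uint8_t access, uint8_t flags)
{
    uint64_t d = 0;

    d |= (uint64_t)(limit & 0xFFFFu);             /* Limit 0:15  -> bits 0..15  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;      /* Base 0:23   -> bits 16..39 */
    d |= (uint64_t)access << 40;                  /* access byte -> bits 40..47 */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;  /* Limit 16:19 -> bits 48..51 */
    d |= (uint64_t)(flags & 0xFu) << 52;          /* flags       -> bits 52..55 */
    d |= (uint64_t)(base >> 24) << 56;            /* Base 24:31  -> bits 56..63 */
    return d;
}
```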

      Some materials are cited from http://wiki.osdev.org/Global_Descriptor_Table

      Tuesday, July 2, 2013

      Intel Control Registers

      The control registers of Intel CPUs (CR0, CR1, CR2, CR3, CR4 and CR8) determine the operating mode of the processor and the characteristics of the currently executing task. Most of the content of this post is cited from Intel manual Volume 3A.

      All the CRx registers are accessed with MOV, but only to or from a general-purpose register. There is no form that moves directly between memory and a control register, so an assembler will reject such an instruction. In 64-bit mode, the control registers are expanded to 64 bits. The upper 32 bits of CR0 and CR4 are reserved and must be written as zero.

      • CR0 - Contains system control flags that control operating mode and states of the processor.
      • CR1 - Reserved.
      • CR2 - Contains the page-fault linear address (the linear address that caused a page fault).
      • CR3 - Contains the physical address of the base of the paging-structure hierarchy and two flags (PCD and PWT). The paging structure that CR3 points to is 4 KiB aligned, so the lower 12 bits of its address are always 0. Those bit positions in CR3 are used to store the flags instead.
      • CR4 - Contains a group of flags that enable several architectural extensions, and indicate operating system or executive support for specific processor capabilities.
      • CR8 - This register is available only in 64-bit mode. Provides read and write access to the Task Priority Register (TPR). It specifies the priority threshold value that operating systems use to control the priority class of external interrupts allowed to interrupt the processor.
      Two of the most important bits in CR0 are PE and PG. PE (Protection Enable) is bit 0 of CR0; it enables protected mode, and with it segment-level protection, when set. PG (Paging) is bit 31 of CR0; it enables paging when set. PG can be set only when PE is set. Setting both bits puts the processor in protected mode with paging enabled.
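A small sketch of how these bits can be decoded from raw register values. Reading the registers themselves needs ring 0, so the functions below only work on values passed in. All names are mine; the CR3 flag positions (PWT at bit 3, PCD at bit 4) assume PCIDs are not in use.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_PE  (UINT64_C(1) << 0)
#define CR0_PG  (UINT64_C(1) << 31)
#define CR3_PWT (UINT64_C(1) << 3)
#define CR3_PCD (UINT64_C(1) << 4)

enum cpu_mode { REAL_MODE, PROTECTED_NO_PAGING, PROTECTED_PAGING, INVALID_MODE };

/* Classify a CR0 value by its PE and PG bits. PG without PE is not allowed. */
static enum cpu_mode cr0_mode(uint64_t cr0)
{
    bool pe = (cr0 & CR0_PE) != 0;
    bool pg = (cr0 & CR0_PG) != 0;

    if (pg && !pe)
        return INVALID_MODE;
    if (!pe)
        return REAL_MODE;
    return pg ? PROTECTED_PAGING : PROTECTED_NO_PAGING;
}

/* The paging-structure base is CR3 with the low 12 (flag) bits cleared. */
static uint64_t cr3_base(uint64_t cr3)
{
    return cr3 & ~UINT64_C(0xFFF);
}
```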

      For more information about the bits in the control registers, refer to Intel Manual Volume 3A, chapter 2 System Architecture Overview.