CS695 Topics in Virtualization and Cloud Computing
Spring 2019

Lecture 4  16.1.2019
Lecture 5  18.1.2019
--------------------

0. Exercise #2 due 30th January 2019
   https://www.cse.iitb.ac.in/~puru/
   https://www.cse.iitb.ac.in/~puru/courses/spring19/cs695/exercises/ex2.html

   references for this lecture:
   - [popekgoldberg]
   - [vmbook] Chapter 8.
   - [vmware] [xen] [kvm]
   - [x86virt]

1. Recap

   - Challenges of VMM/hypervisor design
     - resource control and management
     - resource virtualization
     - system execution state management, e.g., system calls!

   - VMM design principles
     Popek and Goldberg 1974 [popekgoldberg], properties a VMM should satisfy
     (i)   efficiency: as much native execution as possible (performance)
     (ii)  resource control: VMM controls and manages all resources (safety)
     (iii) equivalence: identical behavior on bare-metal and virtual platforms (fidelity)
           (w.r.t performance and OS system-view of machine)

   - virtualization and VMM types

   - process, OS and hypervisor execution timeline on a CPU.

2. process/CPU virtualization

   - system view of a machine
     - CPU, ISA, machine registers, privilege levels for execution
     - linear addressable physical memory
     - devices

   - we will start with the CPU, and the simplest variant, in which
     the CPU being virtualized is identical to the physical CPU
     (same ISA, modes of execution, machine registers for status and control etc.)

   - the physical CPU has to be shared by the VM and the VMM.
     - VM's time is used by the guest OS to execute itself and
       to schedule and execute processes in the VM

3. Design 1: trap-and-emulate

   - allocate CPU to VM, same as an OS allocates CPU to a process.

   - [Q] but, on a physical machine, how does the OS regain control
     of the CPU (from a running process)?
     - using privileged modes of execution
       - applications in Ring 3, OS in Ring 0 (x86 example)
     - explicit interrupts (system calls) or
       implicit interrupts (IO interrupts, timer)
     - exceptions
     - each of the above switches execution to Ring 0, where the OS
       takes control and handles the situation:
       - increments time, schedules a new task, handles page fault,
         handles interrupt etc.

   - extend the privileged-modes idea to the VMM
       ring 3: applications
       ring 1: guest OS
       ring 0: hypervisor/VMM

   - will this work?
     - let's assume that the guest OS is switching in a process.
       one of the activities is to update the CR3 machine register.
     - what follows?
       - trap generated (guest OS does not have enough privileges to update CR3)
       - hypervisor handles the trap: checks that the value to be written
         to CR3 is valid for the VM. if so, performs the write, else error.
       - CR3 is virtualized ... hoooray!
     - other examples: timer interrupt, packet arrival, etc.
       - VMM can first process the exception and do the necessary actions for the VM.

   - yoohoo! CPU virtualization done. unfortunately, NO!
     - reason: some instructions, when executed without the privileges
       required for correct operation, do not generate a trap!
       the CPU silently ignores the privileged part of the operation
       (the instruction behaves as a no-op) or returns the real machine state.
     - e.g., popf: pop from stack into the EFLAGS register.
       one of the bits of EFLAGS (the interrupt flag, IF) enables/disables
       interrupts and can be set using popf.
       popf is a critical instruction: without sufficient privilege the
       update to IF is silently dropped. no trap, treated as a no-op.
     - e.g., reading the two low-order bits of the CS (code segment)
       register gives the CPL.
       the read succeeds in any ring and returns the real CPL, so a guest OS
       in Ring 1 sees CPL 1 instead of 0. no trap!

       S: sensitive instructions: instructions that read or change
          system state and so impact system operation
       P: privileged instructions: instructions that trap when executed
          in a less privileged mode
       C: critical instructions: sensitive instructions that do not
          result in a trap in a less privileged mode of execution

       C is a subset of S (C = S - P). S may not be a subset of P.
       if S is a subset of P, the ISA is virtualizable (with trap and emulate)

     - implication: the equivalence condition is not satisfied.
       the opportunity for the VMM to intervene and virtualize the operation is lost.
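
   The S/P/C definitions above can be checked mechanically. What follows is
   a minimal sketch, assuming a small illustrative (not exhaustive) subset of
   pre-VT x86 instructions. The set contents and the function names
   critical_instructions and is_virtualizable are assumptions made for this
   example; they are not from the lecture or any library.

```python
# Illustrative subset only; a real ISA has many more instructions.
# P: instructions that trap when executed outside Ring 0.
PRIVILEGED = {"mov_to_cr3", "lgdt", "lidt", "hlt"}

# S: instructions that read or change system state.
# The four extra entries are sensitive but do not trap in Ring 1 or Ring 3.
SENSITIVE = PRIVILEGED | {"popf", "pushf", "sgdt", "sidt"}


def critical_instructions(sensitive, privileged):
    """C = S - P: sensitive instructions that do not trap."""
    return sensitive - privileged


def is_virtualizable(sensitive, privileged):
    """Trap-and-emulate is sufficient only if S is a subset of P."""
    return sensitive <= privileged


print(sorted(critical_instructions(SENSITIVE, PRIVILEGED)))
# ['popf', 'pushf', 'sgdt', 'sidt']
print(is_virtualizable(SENSITIVE, PRIVILEGED))
# False
```

   With this subset, C is non-empty, so trap-and-emulate alone cannot
   virtualize the ISA. Each technique that follows is a way of dealing with
   the instructions in C.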
   - is it all over for world peace using hypervisors?
     - apparently not, VMs seem to be still in use!

   - techniques for CPU virtualization,
     0. trap and emulate
     1. scan and patch
     2. para-virtualization
     3. hardware-assisted CPU virtualization

   - 1. scan-and-patch (examples: vmware, virtualbox)

     - scan for instructions (in the kernel binary) that are critical,
       replace them with a trap to the VMM, and store
       information about the instruction as VMM state.
     - on the custom trap, control is in Ring 0 with the VMM, which emulates
       the instruction and virtualizes the system view correctly.
     - optimizations:
       - scan multiple blocks
       - caching of scanned blocks, along with patches
     - each architecture has its own set of critical instructions
       which are not privileged.
     - technique provides: equivalence, resource control, efficiency(?)

   - 2. para-virtualization (xen)

     - this is an intrusive technique
     - guest OS knows it will execute on a virtual machine
       and that cpu virtualization is tricky
     - guest OS does not issue critical instructions; instead it
       invokes the VMM explicitly when work involving critical
       instructions needs to be done
     - explicit service request to the hypervisor via hypercalls
       similar to system calls, issued by the entity executing in Ring 1
       and using a different software interrupt (int 82h)
     - since the call is explicit and the guest knows it executes in a VM,
       the abstraction of the hypercall interface is not at instruction level!
       e.g., instead of writing to CR3 from the guest OS and generating a trap etc.,
       whenever CR3 or a page table is to be updated, make an explicit
       hypercall with the required guest OS details for the hypervisor to handle.
       in effect, trap and emulate is replaced by hypercalls
       for sensitive and critical instructions and OS tasks.
     - improved efficiency. resource control is part of design, equivalence (?)

   - 3. hardware-assisted virtualization (kvm, vmware, xen)

     - the above techniques were software based
     - with hardware-assisted techniques, the CPU architecture is extended
       for virtualization support
       - AMD-V and Intel VT-x CPUs
         (the names below follow Intel VT-x terminology)
     - Makes the ISA virtualizable by,
       (i) introducing a new mode of operation, the vmx mode.
           more specifically, the vmx root mode and vmx non-root mode.
           - each mode in turn has Ring 0 to Ring 3 privilege modes
           - VMM/Hypervisor: executes in vmx root mode
           - Guest OS and its applications: execute in vmx non-root mode
       (ii) "programmatically" controlling the switch from vmx non-root to root mode.
            can configure conditions for the switch around critical tasks
            and critical instructions.

     - vmxon
       vmlaunch
       vmresume
       vmxoff
       vmcall: similar to hypercall, to exit vmx non-root mode
       vmexit: exit on critical instruction/conditions met
               for exit to vmx root mode
       vmcs: Virtual Machine Control Structure
             a per VM, per CPU control structure to store VMM and VM state
             (context), and conditions for exit and other control fields
       vmptr: pointer to the current VM's vmcs (of a CPU)
       vmptrld
       vmread and vmwrite to read from / write to the vmcs

       vmcs format:
         Guest state area: Area to save guest context
         Host state area: Area to save host context
         VM execution controls: specify what is allowed and what causes exit
         VM exit controls: specify registers to store during exit
         VM entry controls: specify registers/state to load during entry
         VM exit information: area to store information on exit
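
   The interplay of root mode, non-root mode and the vmcs can be illustrated
   with a toy model. This is only a sketch under stated assumptions: the Vmcs
   class, its field names, run_guest and vmm_loop are hypothetical names made
   up for this example, instructions are plain strings, and nothing here
   corresponds to a real KVM or VT-x programming interface. The choice of
   hlt and mov_to_cr3 as exit conditions is likewise illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Vmcs:
    """Toy stand-in for a per-VM, per-CPU vmcs (field names are assumptions)."""
    guest_state: dict = field(default_factory=dict)  # guest context saved on exit
    host_state: dict = field(default_factory=dict)   # host context (unused in this toy)
    exit_on: set = field(default_factory=set)        # execution controls: what causes exit
    exit_info: dict = field(default_factory=dict)    # information about the last exit


def run_guest(vmcs, instructions, start=0):
    """Model of non-root mode: run until an instruction configured to exit is hit.

    Returns the index of the exiting instruction, or None if the guest finished.
    """
    for pc in range(start, len(instructions)):
        op = instructions[pc]
        if op in vmcs.exit_on:
            vmcs.guest_state["pc"] = pc      # save guest context
            vmcs.exit_info = {"reason": op}  # record why the exit happened
            return pc
    return None


def vmm_loop(vmcs, instructions):
    """Model of root mode: enter the guest, handle each exit, resume."""
    emulated = []
    pc = run_guest(vmcs, instructions)               # analogous to vmlaunch
    while pc is not None:
        emulated.append(vmcs.exit_info["reason"])    # VMM emulates the operation
        pc = run_guest(vmcs, instructions, pc + 1)   # analogous to vmresume
    return emulated


vmcs = Vmcs(exit_on={"hlt", "mov_to_cr3"})
guest_code = ["add", "mov_to_cr3", "mov", "hlt"]
print(vmm_loop(vmcs, guest_code))
# ['mov_to_cr3', 'hlt']
```

   The point of the sketch is that the exit conditions are data in the vmcs
   rather than a fixed property of the ISA: the VMM decides which operations
   return control to it, and everything else runs natively in non-root mode.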