The PPC KVM paravirtual interface
=================================

The basic execution principle by which KVM on PowerPC works is to run all kernel
space code in PR=1, which is user space. This way we trap all privileged
instructions and can emulate them accordingly.

Unfortunately that is also the downfall. There are quite a few privileged
instructions that needlessly return us to the hypervisor even though they
could be handled differently.

This is what the PPC PV interface helps with. It takes privileged instructions
and transforms them into unprivileged ones with some help from the hypervisor.
This cuts down virtualization costs by about 50% on some of my benchmarks.

The code for that interface can be found in arch/powerpc/kernel/kvm*

Querying for existence
======================

To find out if we're running on KVM or not, we leverage the device tree. When
Linux is running on KVM, a node /hypervisor exists. That node contains a
compatible property with the value "linux,kvm".

Once you have determined that you're running under a PV capable KVM, you can
use hypercalls as described below.

KVM hypercalls
==============

Inside the device tree's /hypervisor node there's a property called
'hypercall-instructions'. This property contains at most 4 opcodes that make
up the hypercall. To issue a hypercall, just execute these instructions.
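
As an illustration of the 'hypercall-instructions' property described above,
here is a minimal sketch in C that decodes the raw property bytes into opcodes.
It assumes the property is a sequence of big-endian 32-bit cells, as is usual
for device tree data, and that each opcode is 4 bytes wide. The names
decode_hcall_insts and HCALL_MAX_INSTS are hypothetical and do not come from
the kernel sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HCALL_MAX_INSTS 4	/* "at most 4 opcodes", per the text above */

/*
 * Decode the raw bytes of a 'hypercall-instructions' property.
 *
 * Assumption: the property holds big-endian 32-bit cells, one per opcode.
 * Returns the number of opcodes written to out[], or -1 if the length is
 * zero, not a multiple of 4, or larger than 4 opcodes.
 */
static int decode_hcall_insts(const uint8_t *prop, size_t len,
			      uint32_t out[HCALL_MAX_INSTS])
{
	size_t i, n;

	assert(prop != NULL && out != NULL);

	if (len == 0 || len % 4 != 0 || len > 4 * HCALL_MAX_INSTS)
		return -1;

	n = len / 4;
	for (i = 0; i < n; i++) {
		const uint8_t *p = prop + 4 * i;

		/* Assemble one cell from its four big-endian bytes. */
		out[i] = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
			 ((uint32_t)p[2] << 8) | (uint32_t)p[3];
	}

	return (int)n;
}
```

A guest would then copy the decoded opcodes into its hypercall stub. Rejecting
any unexpected length keeps a malformed property from producing a partially
filled stub.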

The parameters are as follows:

        Register        IN                      OUT

        r0              -                       volatile
        r3              1st parameter           Return code
        r4              2nd parameter           1st output value
        r5              3rd parameter           2nd output value
        r6              4th parameter           3rd output value
        r7              5th parameter           4th output value
        r8              6th parameter           5th output value
        r9              7th parameter           6th output value
        r10             8th parameter           7th output value
        r11             hypercall number        8th output value
        r12             -                       volatile

Hypercall definitions are shared in generic code, so the same hypercall numbers
apply for x86 and powerpc alike with the exception that each KVM hypercall
also needs to be ORed with the KVM vendor code which is (42 << 16).

Return codes can be as follows:

        Code            Meaning

        0               Success
        12              Hypercall not implemented
        <0              Error

The magic page
==============

To enable communication between the hypervisor and guest there is a new shared
page that contains parts of supervisor visible register state. The guest can
map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.

With this hypercall issued the guest always gets the magic page mapped at the
desired location. The first parameter indicates the effective address when the
MMU is enabled. The second parameter indicates the address in real mode, if
applicable to the target. For now, we always map the page to -4096. This way we
can access it using absolute load and store functions. The following
instruction reads the first field of the magic page:

        ld      rX, -4096(0)

The interface is designed to be extensible should there be need later to add
additional registers to the magic page. If you add fields to the magic page,
also define a new hypercall feature to indicate that the host can give you more
registers. Only if the host supports the additional features, make use of them.

The magic page layout is described by struct kvm_vcpu_arch_shared
in arch/powerpc/include/asm/kvm_para.h.
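
The numeric conventions above (the vendor code ORed into the hypercall number,
the return code table, and the -4096 mapping address) can be sketched in C as
follows. This is only an illustration: the helper names kvm_hcall_token,
classify_hcall_rc and magic_page_ea are hypothetical, and the check that the
hypercall number fits in the low 16 bits is an assumption of this sketch, not
something the text above states.

```c
#include <assert.h>
#include <stdint.h>

#define KVM_VENDOR_CODE 42u	/* vendor code from the text: (42 << 16) */

/* Value for r11: generic hypercall number ORed with the KVM vendor code. */
static uint32_t kvm_hcall_token(uint32_t nr)
{
	/* Sketch assumption: the number must not overlap the vendor code. */
	assert(nr <= 0xffffu);
	return (KVM_VENDOR_CODE << 16) | nr;
}

/* Interpretation of the r3 return code, following the table above. */
enum hcall_status { HCALL_OK, HCALL_UNIMPLEMENTED, HCALL_ERROR, HCALL_OTHER };

static enum hcall_status classify_hcall_rc(long rc)
{
	if (rc == 0)
		return HCALL_OK;
	if (rc == 12)
		return HCALL_UNIMPLEMENTED;
	if (rc < 0)
		return HCALL_ERROR;
	return HCALL_OTHER;	/* positive codes not listed in the table */
}

/*
 * Effective address of the magic page: -4096, i.e. the last 4 KiB of the
 * address space, which is why a load with displacement -4096 and base 0
 * reaches it.
 */
static uintptr_t magic_page_ea(void)
{
	return (uintptr_t)(intptr_t)(-4096);
}
```

For example, a generic hypercall number of 4 would be passed in r11 as
0x2a0004. Because the page is 4 KiB aligned, the low 12 bits of its effective
address are always zero.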

Magic page features
===================

When mapping the magic page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE,
a second return value is passed to the guest. This second return value contains
a bitmap of available features inside the magic page.

The following enhancements to the magic page are currently available:

  KVM_MAGIC_FEAT_SR             Maps SR registers r/w in the magic page
  KVM_MAGIC_FEAT_MAS0_TO_SPRG7  Maps MASn, ESR, PIR and high SPRGs

For enhanced features in the magic page, please check for the existence of the
feature before using it!

Magic page flags
================

In addition to features that indicate whether a host is capable of a particular
feature we also have a channel for a guest to tell the host whether it's capable
of something. This is what we call "flags".

Flags are passed to the host in the low 12 bits of the Effective Address.

The following flags are currently available for a guest to expose:

  MAGIC_PAGE_FLAG_NOT_MAPPED_NX Guest handles NX bits correctly wrt magic page

MSR bits
========

The MSR contains bits that require hypervisor intervention and bits that do
not require direct hypervisor intervention because they only get interpreted
when entering the guest or don't have any impact on the hypervisor's behavior.

The following bits are safe to be set inside the guest:

  MSR_EE
  MSR_RI

If any other bit changes in the MSR, please still use mtmsr(d).

Patched instructions
====================

The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
respectively on 32 bit systems with an added offset of 4 to accommodate for big
endianness.

The following is a list of mappings the Linux kernel performs when running as
guest. Implementing any of those mappings is optional, as the instruction traps
also act on the shared page.
So calling privileged instructions still works as
before.

From                    To
====                    ==

mfmsr   rX              ld      rX, magic_page->msr
mfsprg  rX, 0           ld      rX, magic_page->sprg0
mfsprg  rX, 1           ld      rX, magic_page->sprg1
mfsprg  rX, 2           ld      rX, magic_page->sprg2
mfsprg  rX, 3           ld      rX, magic_page->sprg3
mfsrr0  rX              ld      rX, magic_page->srr0
mfsrr1  rX              ld      rX, magic_page->srr1
mfdar   rX              ld      rX, magic_page->dar
mfdsisr rX              lwz     rX, magic_page->dsisr

mtmsr   rX              std     rX, magic_page->msr
mtsprg  0, rX           std     rX, magic_page->sprg0
mtsprg  1, rX           std     rX, magic_page->sprg1
mtsprg  2, rX           std     rX, magic_page->sprg2
mtsprg  3, rX           std     rX, magic_page->sprg3
mtsrr0  rX              std     rX, magic_page->srr0
mtsrr1  rX              std     rX, magic_page->srr1
mtdar   rX              std     rX, magic_page->dar
mtdsisr rX              stw     rX, magic_page->dsisr

tlbsync                 nop

mtmsrd  rX, 0           b       <special mtmsr section>
mtmsr   rX              b       <special mtmsr section>

mtmsrd  rX, 1           b       <special mtmsrd section>

[Book3S only]
mtsrin  rX, rY          b       <special mtsrin section>

[BookE only]
wrteei  [0|1]           b       <special wrteei section>


Some instructions require more logic to determine what's going on than a load
or store instruction can deliver. To enable patching of those, we keep some
RAM around where we can live translate instructions to. What happens is the
following:

        1) copy emulation code to memory
        2) patch that code to fit the emulated instruction
        3) patch that code to return to the original pc + 4
        4) patch the original instruction to branch to the new code

That way we can inject an arbitrary amount of code as replacement for a single
instruction. This allows us to check for pending interrupts when setting EE=1
for example.

Hypercall ABIs in KVM on PowerPC
=================================
1) KVM hypercalls (ePAPR)

These are ePAPR compliant hypercall implementations (mentioned above). Even
generic hypercalls are implemented here, like the ePAPR idle hcall.
These are
available on all targets.

2) PAPR hypercalls

PAPR hypercalls are needed to run server PowerPC PAPR guests (-M pseries in QEMU).
These are the same hypercalls that pHyp, the POWER hypervisor, implements. Some of
them are handled in the kernel, some are handled in user space. This is only
available on book3s_64.

3) OSI hypercalls

Mac-on-Linux is another user of KVM on PowerPC, which has its own hypercall (long
before KVM). This is supported to maintain compatibility. All these hypercalls get
forwarded to user space. This is only useful on book3s_32, but can be used with
book3s_64 as well.