---------------------------------------------------- ----------------
Paper review    Debadatta Mishra   Roll No: 114050005 
---------------------------------------------------- ----------------

Paper title: Satori: Enlightened page sharing

Summary: The paper proposes ways to implement page sharing across guest VMs by taking advantage of the flexible design choices that para-virtualization makes available. It identifies and tries to solve issues in VMware's solution, which relies on content-based sharing triggered by a periodic scan. The problems addressed in this work are:
1) Finding short-lived matches   2) Avoiding system-level swapping  3) Avoiding periodic scans for matching 4) Finding matches quicker 5) Simplifying share breaking (private copy creation) 6) Fairness in distributing the benefit of sharing 7) Security in sharing

Problems 1, 2, 3 and 4 are solved by statically allocating memory to each VM and finding sharing opportunities in the page cache on every physical disk read. Putting a hook at the disk read is a good design choice, since disk I/O is orders of magnitude slower than the processing required to enable sharing. A page cache manager matches every disk read [into an MFN] against a self-maintained hash table to find sharing opportunities and relies on the hypervisor to perform the sharing.
Problems 4, 5 and 6 are addressed by modifications in both the hypervisor and the guest to leverage sharing benefits. The guest gets more physical memory as a result of sharing and nominates the pages the hypervisor may use when a share is broken. Fine-grained arrangements such as 'selective non-sharable pages' take care of the security issues that sharing may introduce.

Detailed Notes
--------------

A static allocation of physical memory is made to each VM for performance isolation. Whenever there are sharing opportunities, the benefit is reaped by all the guests that take part in the sharing. The approach follows self-paging, which requires each application to use its own resources (disk, memory and CPU) to deal with its own memory faults. The intermediate pseudo-physical addresses of pages with the same content are redirected, i.e. their pseudo-physical frame numbers (PFNs) are mapped to a single machine frame number (MFN). The guest shadow page tables are updated accordingly. Shared pages are marked read-only so that the hypervisor traps any write to them and breaks the share by allocating a copy of the page to the trapping guest.
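
The PFN-to-MFN redirection and the write trap described above can be illustrated with a small toy model. This is only a sketch for my own understanding, not the paper's code: the class name Hypervisor, its methods and the in-memory dictionaries are all my assumptions, and real frames, page tables and traps are replaced by plain Python objects.

```python
class Hypervisor:
    """Toy model (assumed names): per-VM PFN -> MFN map with copy-on-write."""

    def __init__(self):
        self.memory = {}        # mfn -> page contents (bytes)
        self.refs = {}          # mfn -> number of (vm, pfn) mappings
        self.p2m = {}           # (vm, pfn) -> mfn
        self.read_only = set()  # MFNs currently marked read-only
        self.next_mfn = 0

    def alloc(self, vm, pfn, data):
        """Back (vm, pfn) with a fresh machine frame holding data."""
        mfn = self.next_mfn
        self.next_mfn += 1
        self.memory[mfn] = data
        self.refs[mfn] = 1
        self.p2m[(vm, pfn)] = mfn
        return mfn

    def share(self, vm_a, pfn_a, vm_b, pfn_b):
        """Point both PFNs at one MFN if their contents are identical."""
        keep = self.p2m[(vm_a, pfn_a)]
        drop = self.p2m[(vm_b, pfn_b)]
        if keep == drop or self.memory[keep] != self.memory[drop]:
            return False
        self.p2m[(vm_b, pfn_b)] = keep
        self.refs[keep] += 1
        self.read_only.add(keep)
        self.refs[drop] -= 1
        if self.refs[drop] == 0:  # nobody maps the duplicate any more
            del self.memory[drop], self.refs[drop]
            self.read_only.discard(drop)
        return True

    def write(self, vm, pfn, data):
        """A write to a shared frame 'traps' and gets a private copy first."""
        mfn = self.p2m[(vm, pfn)]
        if mfn in self.read_only and self.refs[mfn] > 1:
            self.refs[mfn] -= 1
            mfn = self.alloc(vm, pfn, self.memory[mfn])  # share breaking
        else:
            self.read_only.discard(mfn)  # sole owner may write in place
        self.memory[mfn] = data
```

After share() both guests map the same MFN and one frame is freed; a later write() by either guest leaves the other guest's contents untouched.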
 
Duplicate detection:

A large share of the sharing potential lies within the page cache, so data is monitored as it enters the cache. This is achieved by intercepting block device accesses and building up knowledge about sharing from them. If guests use a single disk, the process becomes both simpler [the physical block number is unique] and more efficient [the read can be avoided]. This approach has much less overhead than frequent scans of all the VMs’ memory. It also does not depend on the scan period and therefore detects short-lived sharing opportunities. Zero-page sharing is not possible with this approach; the paper justifies this with reasoning and experimental results.

Distribution of saved memory:

The Satori design distributes the memory reclaimed through sharing in proportion to the amount of memory that each VM shares, which is represented by its sharing entitlement. A VM’s sharing entitlement is calculated by adding the sharing ranks of all its pages, where the sharing rank of a page is based on how many references to the MFN exist. A page’s sharing rank is updated whenever new duplicates are found or shared pages are evicted, and the VMs’ entitlements are readjusted accordingly. The sharing entitlements are checked every second so that each VM receives the appropriate amount of memory. The guests claim their sharing entitlement using memory balloons: whenever a guest wants memory and has a high entitlement, the balloon deflates to release additional pages to the guest. This design ensures a fair distribution of the memory saved by sharing.
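
The entitlement arithmetic can be worked through in a few lines. The exact rank formula is not given in my notes, so the sketch below assumes that a frame with n references frees n - 1 pages and credits each reference with (n - 1)/n of a page; the function name and the input layout are also my own choices.

```python
from collections import defaultdict
from fractions import Fraction


def sharing_entitlements(shared_frames):
    """Sum per-page sharing ranks into a per-VM entitlement.

    shared_frames maps an MFN to the list of VMs referencing it (a VM
    appears once per PFN it maps to that MFN). The rank (n - 1) / n is
    an assumed formula, chosen so that the ranks for one frame add up
    to the n - 1 pages that sharing it frees.
    """
    entitlement = defaultdict(Fraction)
    for holders in shared_frames.values():
        n = len(holders)
        if n < 2:
            continue  # an unshared frame saves nothing
        rank = Fraction(n - 1, n)
        for vm in holders:
            entitlement[vm] += rank
    return dict(entitlement)
```

For example, with one frame shared by VMs a and b and another shared by a, b and c, three pages are freed in total; a and b are each entitled to 7/6 of a page and c to 2/3.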

Write to a shared Page:

Shared pages are necessarily marked as COW. When a guest VM attempts to write to a shared page, Satori obtains the memory for the private copy from the guest itself. For this, the guest is modified to provide a list of pages that it is willing to give up without being notified. This lets the hypervisor obtain pages efficiently as and when they are needed. It also eliminates the double paging issue, as the guest is aware of these pages. In the worst case, all the shared pages of a particular guest may need to be copied, so the number of pages provided by the guest VM must be greater than or equal to its gain from its sharing entitlement. The hypervisor uses sharing entitlements to determine the VM from which memory should be reclaimed: the VM that drew more than its entitlement is selected. Note that only the VMs involved in the broken sharing are affected, which ensures performance isolation.
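
The requirement that a guest volunteers at least as many pages as it has gained can be sketched as a simple invariant on a FIFO. The names Guest, claim_extra_page and reclaim_for_cow are hypothetical, and the sketch ignores how the guest chooses which pages to volunteer.

```python
from collections import deque


class Guest:
    """Toy guest (assumed names) that repays sharing gains via a FIFO."""

    def __init__(self, name):
        self.name = name
        self.repayment_fifo = deque()  # PFNs the guest is willing to lose
        self.extra_pages = 0           # pages gained from sharing so far

    def claim_extra_page(self, volunteered_pfn):
        """Accept one page of sharing gain, volunteering one PFN in return."""
        self.repayment_fifo.append(volunteered_pfn)
        self.extra_pages += 1


def reclaim_for_cow(guest):
    """Hypervisor side: take a volunteered page without notifying the guest."""
    if not guest.repayment_fifo:
        raise MemoryError(f"{guest.name} has no volunteered pages left")
    guest.extra_pages -= 1
    return guest.repayment_fifo.popleft()
```

Because every claimed page is paired with a volunteered one, the FIFO is never shorter than the guest's gain, so even the worst case of every share breaking can be repaid from the guest's own list.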

When a guest wants to reallocate a shared page, Satori uses a no-copy-on-write scheme: the guest OS informs the VMM that the page is being reallocated, and the VMM directly allocates a zero page to the VM. This saves the VMM the unnecessary overhead of producing a private copy of a page that the guest is about to erase. Guests can protect sensitive data by specifying pages that should never be shared, so that the hypervisor ignores them when sharing.

Hypervisor and User space changes for Sharing:

1) Content Based Sharing:-   The Xen control domain (dom0) runs a user process, tapdisk, per virtual block device (VBD) that performs the actual disk I/O. The Shared Page Cache controller, implemented as a user-space process in the control domain, communicates with the per-VBD tapdisk processes to learn which disk blocks are read and which MFNs they are read into. After each read, the cache controller hashes the contents of the MFN to find sharing opportunities. If there is a matching entry in the hash table, a hypercall (share_mfns) is issued with the two matching MFNs. This hypercall takes care of the byte-by-byte comparison of the two MFNs and the bookkeeping (P2M mapping, freeing the frame, making the shared frame COW, etc.). If there is no hash match, any previous entry that maps to this MFN is invalidated and a new entry is made.

2) Copy on write Disk sharing:- On a disk shared by many VMs, whenever a disk block is read, the corresponding MFN is made read-only through a hypercall (make_ro). An entry [blockno, MFN] is also added to the hash table for future lookups. (It is not clear whether this assumes page size = disk block size or whether the hash table derives distinct pseudo block numbers from a single block number and maps them to different MFNs.) If a later read matches, another hypercall (get_ro_refs) verifies that the MFN is still read-only and increments its reference count. After that, the share_mfns hypercall is issued to carry out the sharing in the hypervisor.
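
The two lookup paths above can be sketched side by side. The hypercall names share_mfns, make_ro and get_ro_refs come from the paper, but their signatures and return values here are my assumptions, FakeHypervisor is only a stand-in for testing, and SHA-1 is an arbitrary choice of hash.

```python
import hashlib


class FakeHypervisor:
    """Stand-in for the hypercall interface; not Xen's real API."""

    def __init__(self):
        self.pages = {}    # mfn -> page contents
        self.ro_refs = {}  # mfn -> reference count while read-only
        self.shared = []   # (kept_mfn, freed_mfn) pairs

    def share_mfns(self, keep, drop):
        # Byte-by-byte comparison before any bookkeeping.
        if keep not in self.pages or self.pages[keep] != self.pages.get(drop):
            return False
        self.shared.append((keep, drop))
        return True

    def make_ro(self, mfn):
        self.ro_refs[mfn] = 1

    def get_ro_refs(self, mfn):
        if mfn not in self.ro_refs:
            return False  # the frame was written since it was indexed
        self.ro_refs[mfn] += 1
        return True

    def guest_write(self, mfn):
        self.ro_refs.pop(mfn, None)  # a write ends read-only status


class ContentIndex:
    """Path 1: match pages by a hash of their contents."""

    def __init__(self, hv):
        self.hv, self.by_hash, self.by_mfn = hv, {}, {}

    def on_read(self, mfn):
        stale = self.by_mfn.pop(mfn, None)  # invalidate any old entry
        if stale is not None and self.by_hash.get(stale) == mfn:
            del self.by_hash[stale]
        digest = hashlib.sha1(self.hv.pages[mfn]).digest()
        match = self.by_hash.get(digest)
        if match is not None and self.hv.share_mfns(match, mfn):
            return match
        self.by_hash[digest], self.by_mfn[mfn] = mfn, digest
        return None


class BlockIndex:
    """Path 2: match pages by block number on a shared COW disk."""

    def __init__(self, hv):
        self.hv, self.block_to_mfn = hv, {}

    def on_read(self, block_no, mfn):
        cached = self.block_to_mfn.get(block_no)
        if (cached is not None and cached != mfn
                and self.hv.get_ro_refs(cached)
                and self.hv.share_mfns(cached, mfn)):
            return cached
        self.hv.make_ro(mfn)
        self.block_to_mfn[block_no] = mfn
        return None
```

In both classes on_read returns the MFN that the new frame was shared with, or None when the frame was only recorded for future lookups. The block-number path needs no hashing at all, which is where the efficiency of the single-disk case comes from.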

Guest Kernel Modifications:

1) The guest is modified to provide hints about the pages that can be used when sharing breaks (repayment FIFO).
2) The kernel is modified to handle the page fault that occurs when a page has been silently taken by the hypervisor, and to update all VA references to this PFN.
3) A balloon driver inside the kernel claims the memory gained from sharing.
4) Some critical pages are marked as non-sharable (for security) with the help of the hypervisor.
5) A mechanism lets the hypervisor know that breaking the share on a page should not result in a copy of the page (to avoid a wasted copy if the guest is going to reallocate it).

Experiments:

1) Kernel build, httperf and RUBiS are run on 256 MB and 512 MB machines.
2) The sharing count in each rank, the lifetime of sharing and the number of hypercalls are measured.
3) Impact on I/O is measured using file system read.

Positives:

1) Takes advantage of para-virtualization to simplify the process of finding sharable pages. Changes in the guest OS also make memory reclaim and assignment easier and more efficient.
2) Addresses the security and isolation aspects of sharing by letting the guest select a set of pages that are never to be shared.
3) Shows that short-lived sharing opportunities are significant and cannot be ignored. The approach also finds matches quicker than VMware's.
4) Builds on previous work and applies it differently as needed. For example, IBM's collaborative memory management is used to implement the repayment FIFO.

Negatives:

1) Introduces a single point of failure in the form of a user-space cache manager process (apart from the performance problems associated with the approach, as discussed in the paper). If this process fails or is killed for some reason, all the guests may be affected. There might be implementation tricks to handle such a situation, but they are not addressed anywhere in the paper.
2) In the performance tests run on the Xen implementation, the authors have not looked into CPU utilization in dom0. There can be significant CPU overhead from computing hashes, updating the hash table (finding and updating stale entries when a block is read a second time) and the context switching that results from the pipe-based communication model.
3) The authors assume that the sharing potential of dirty pages is low without any logical or experimental reasoning to support it. Sharing opportunities may arise in memory depending on the order of execution of two processes working on the same data sets (for example, packets received from another machine).
4) The paper claims that the improvements are hypervisor agnostic but uses many Xen-specific features (dom0, split driver model, etc.). It does not discuss the implementation issues in the emulated disk I/O case.

Possible extensions:

1) Page cache matching could be combined with a guest module that implements collaborative share management in order to support unmodified guests.