Re: [PATCH] xen: speed up grant-table reclaim
On 12.06.23 22:09, Demi Marie Obenour wrote:
> On Mon, Jun 12, 2023 at 08:27:59AM +0200, Juergen Gross wrote:
>> On 10.06.23 17:32, Demi Marie Obenour wrote:
>>> When a grant entry is still in use by the remote domain, Linux must
>>> put it on a deferred list.
>>
>> This lacks quite some context.
>>
>> The main problem is related to the grant not having been unmapped
>> after the end of a request, but the side granting the access is
>> assuming this should be the case.
>
> The GUI agent has relied on deferred grant reclaim for as long as it
> has existed.  One could argue that doing so means the agent is misusing
> gntalloc, but this is not documented anywhere.  A better fix would be
> to use IOCTL_GNTDEV_SET_UNMAP_NOTIFY in the GUI daemon.

I don't think this is a gntalloc-specific problem.  You could create the
same problem with a kernel implementation.

And yes, using IOCTL_GNTDEV_SET_UNMAP_NOTIFY might be a good idea.

>> In general this means that the two sides implementing the protocol
>> don't agree on how it should work, or that the protocol itself has a
>> flaw.
>
> What would a better solution be?  This is going to be particularly
> tricky with Wayland, as the wl_shm protocol makes absolutely no
> guarantees that compositors will promptly release the mapping, and
> provides no way whatsoever for Wayland clients to know when this has
> happened.  Relying on an LD_PRELOAD hack is not sustainable.

>>> Normally, this list is very short, because the PV network and block
>>> protocols expect the backend to unmap the grant first.
>>
>> Normally the list is just empty.  Only in very rare cases, such as
>> premature PV frontend module unloading, are deferred grant reclaims
>> expected at all.
>
> In the case of a system using only properly designed PV protocols
> implemented in kernel mode, I agree.  However, both libxenvchan and
> the Qubes GUI protocol are implemented in user mode, which means that
> if the frontend process (the one that uses gntalloc) crashes, deferred
> grant reclaims will occur.

This would be the user-space variant of frontend driver unloading.

> Worse, it is possible for the domain to use the grant in a PV protocol.
> If the PV backend driver maps and unmaps the grant and then tells the
> frontend driver to reclaim it, but the backend userspace process (the
> one using gntdev) maps it before the frontend has reclaimed it, the
> frontend will think the backend is trying to exploit XSA-396 and will
> freeze the connection.

Let me sum up how I understand the above statement:

1. Frontend F in DomA grants DomB access via grant X for PV device Y.
2. DomB backend B for PV device Y maps the page using grant X and uses it.
3. DomB backend B unmaps grant X and signals end of usage to DomA
   frontend F.
4. A DomB userspace process maps grant X.
5. DomA frontend F tries to reclaim grant X and fails due to the DomB
   userspace mapping.

So why would the DomB userspace process map grant X?  This would imply a
malicious process in DomB.  That might be possible, but I don't see how
such an attack would relate to your problem.  It could happen with any
malicious userspace program with root access in a domain running a PV
backend.

>>> However, Qubes OS's GUI protocol is subject to the constraints of the
>>> X Window System, and as such winds up with the frontend unmapping the
>>> window first.  As a result, the list can grow very large, resulting
>>> in a massive memory leak and eventual VM freeze.
>>
>> I do understand that it is difficult to change the protocol and/or
>> behavior after the fact, or that performance considerations are in the
>> way of doing so.
>
> Would the correct fix be to use IOCTL_GNTDEV_SET_UNMAP_NOTIFY?  That
> would require that the agent either create a new event channel for each
> window or maintain a pool of event channels, but that should be doable.

I think this sounds like a promising idea.
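A minimal sketch of what the daemon-side call could look like, assuming
the GUI daemon already holds the /dev/xen/gntdev file descriptor fd, the
mapping offset index returned by IOCTL_GNTDEV_MAP_GRANT_REF for a given
window, and an event channel port that the in-guest agent listens on
(fd, index and port are placeholder names, not anything from this thread):

#include <stdint.h>
#include <sys/ioctl.h>
#include <xen/gntdev.h>      /* kernel UAPI header; include path may vary */

/* Ask gntdev to signal the event channel once the daemon's mapping of
 * this range goes away, whether by munmap() or by process exit. */
static int request_unmap_notify(int fd, uint64_t index, uint32_t port)
{
        struct ioctl_gntdev_unmap_notify notify = {
                .index = index,                     /* offset inside the mapped range */
                .action = UNMAP_NOTIFY_SEND_EVENT,  /* send an event on unmap */
                .event_channel_port = port,
        };

        return ioctl(fd, IOCTL_GNTDEV_SET_UNMAP_NOTIFY, &notify);
}

The agent would then wait on that port (one channel per window, or one
taken from a pool as discussed above) and only give the window's grant
references back to gntalloc once the notification arrives, so nothing
lands on the deferred list in the normal case.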
This still does not solve the problem of the frontend exiting
unexpectedly, though.  Such cases need to be handled via deferred
reclaim.  You could add a flag to struct deferred_entry when called
through gntalloc_vma_close(), indicating that this is an expected
deferred reclaim which should not lead to error messages.

>>> To partially solve this problem, make the number of entries that the
>>> VM will attempt to free at each iteration tunable.  The default is
>>> still 10, but it can be overridden at compile-time (via Kconfig),
>>> boot-time (via a kernel command-line option), or runtime (via sysfs).
>>
>> Is there really a need to have another Kconfig option for this?
>> AFAICS only Qubes OS is affected by the problem you are trying to
>> solve.  I don't see why you can't use the command-line option or the
>> sysfs node to set the higher reclaim batch size.
>
> Fair.  In practice, Qubes OS will need to use the sysfs node, since the
> other two do not work with in-VM kernels.

Sure, this seems appropriate.
 Nice. 
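A rough sketch of the runtime knob under discussion; the parameter name
free_per_iteration and its exact placement in the grant-table code are
illustrative assumptions rather than the final interface.  A writable
module parameter provides a boot-time setting and a sysfs file under
/sys/module/.../parameters/ in one go, the latter being the node an
in-VM kernel can use:

#include <linux/module.h>
#include <linux/moduleparam.h>

/* How many deferred grant entries the reclaim worker tries to free per
 * pass; 10 matches the previously hard-coded value. */
static unsigned int free_per_iteration = 10;
module_param(free_per_iteration, uint, 0600);
MODULE_PARM_DESC(free_per_iteration,
                 "Number of deferred grant entries to reclaim per pass");

With 0600 permissions, root inside the guest can raise the value at
runtime by writing to the sysfs file, so no extra Kconfig option is
needed.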
In this case the flag should relate to the file pointer used for
gntalloc_open().


Juergen
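To make the flag idea concrete, here is a rough sketch; the struct layout
is abbreviated and the names are illustrative, not necessarily matching
the actual grant-table.c and gntalloc.c definitions.  The idea is that the
per-file state created at gntalloc_open() records whether a teardown is
expected, and deferred entries created on its behalf inherit that bit so
the reclaim worker can skip its warning:

#include <linux/types.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <xen/grant_table.h>

struct deferred_entry {
        struct list_head list;
        grant_ref_t ref;
        uint16_t warn_delay;
        struct page *page;
        bool expected;          /* teardown was anticipated: do not warn */
};

static void gnttab_add_deferred(grant_ref_t ref, struct page *page,
                                bool expected)
{
        struct deferred_entry *entry = kmalloc(sizeof(*entry), GFP_ATOMIC);

        if (!entry)
                return;         /* the real code still warns in this case */

        entry->ref = ref;
        entry->page = page;
        entry->warn_delay = 60; /* arbitrary value for the sketch */
        entry->expected = expected;
        /* list insertion and timer arming as in the existing code ... */
}

In a real patch the flag would have to be plumbed from gntalloc
(gntalloc_vma_close() and the release path of the gntalloc_open() file
pointer) down through gnttab_end_foreign_access() or a new variant of it;
the sketch only shows the grant-table side.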