
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device



George / Tim,
Could you help me review these memory patches? Thanks!

-Quan



> -----Original Message-----
> From: Xu, Quan
> Sent: Wednesday, September 16, 2015 9:24 PM
> To: andrew.cooper3@xxxxxxxxxx; Dong, Eddie; ian.campbell@xxxxxxxxxx;
> ian.jackson@xxxxxxxxxxxxx; jbeulich@xxxxxxxx; Nakajima, Jun; keir@xxxxxxx;
> Tian, Kevin; tim@xxxxxxx; Zhang, Yang Z; george.dunlap@xxxxxxxxxxxxx
> Cc: xen-devel@xxxxxxxxxxxxx; Xu, Quan
> Subject: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
> 
> Introduction
> ============
> 
>    The VT-d code currently has a number of cases where completion of
> certain operations is waited for by way of spinning. The majority of
> instances do so indirectly through the IOMMU_WAIT_OP() macro, allowing
> for loops of up to 1 second (DMAR_OPERATION_TIMEOUT). While in many of
> the cases this may be acceptable, the invalidation case is particularly
> problematic.
> 
> Currently the hypervisor polls the status address of the wait
> descriptor for up to 1 second to get the invalidation flush result.
> When the invalidation queue includes a Device-TLB invalidation, a
> 1 second timeout is a mistake: that timeout reflects response times of
> the IOMMU engine, whereas for Device-TLB invalidation with PCI-e
> Address Translation Services (ATS) in use, the ATS specification
> mandates a timeout of 1 _minute_ for cache flush. The ATS case
> therefore needs to be taken into consideration when doing
> invalidations. Obviously we can't spin for a minute, so invalidation
> absolutely needs to be converted to a non-spinning model.
> 
>    Also, a new memory security issue needs to be fixed:
> a page freed by the domain must be held until the Device-TLB flush is
> completed (given the ATS timeout of 1 _minute_). The page previously
> associated with the freed portion of GPA must not be reallocated for
> another purpose until the appropriate invalidations have been
> performed. Otherwise, the original page owner can still access the
> freed page through DMA.
> 
> Why RFC
> =======
>     Patch 0001--0005, 0013 are IOMMU related.
>     Patch 0006 is about new flag (vCPU / MMU related).
>     Patch 0007 is vCPU related.
>     Patch 0008--0012 are MMU related.
> 
>     1. Xen MMU is very complicated. Could the Xen MMU experts help me
>        verify whether I have covered all of the cases?
> 
>     2. For gnttab_transfer, if the Device-TLB flush has still not
>        completed when the transferred page is to be mapped into a
>        remote domain, schedule and wait on a waitqueue until the
>        Device-TLB flush is completed. Is that correct?
> 
>        (I have tested the waitqueue in decrease_reservation()
>        [do_memory_op() hypercall]: I wake up a domain (with only one
>        vCPU) with the debug-key tool, and the domain is still working
>        after waiting 60s on the waitqueue.)
> 
> 
> Design Overview
> ===============
> 
> This design implements a non-spinning model for Device-TLB
> invalidation, using an interrupt based mechanism. The Device-TLB
> invalidation status is tracked in a per-domain invalidation table. The
> invalidation table keeps the count of in-flight Device-TLB invalidation
> requests, and also provides a global, per-domain polling parameter for
> in-flight Device-TLB invalidation requests.
> When submitting invalidation requests, update the invalidation table's
> count of in-flight Device-TLB invalidation requests, and put the
> address of the domain's global polling parameter in the Status Address
> field of each invalidation wait descriptor.
> 
> For example:
>   .
>   |invl |  Status Data = 1 (the count of in-flight Device-TLB
>   |wait |                   invalidation requests)
>   |dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   .
>   .
>   |invl |  Status Data = 2 (the count of in-flight Device-TLB
>   |wait |                   invalidation requests)
>   |dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   .
>   .
>   |invl |  Status Data = 3 (the count of in-flight Device-TLB
>   |wait |                   invalidation requests)
>   |dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   .
>   .
> 
> For more information about the VT-d Invalidation Wait Descriptor,
> please refer to
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
>   6.5.2.8 Invalidation Wait Descriptor.
>   Status Address and Data: used by hardware to perform the wait
>                            descriptor completion status write when the
>                            Status Write (SW) field is Set. Hardware
>                            behavior is undefined if the Status Address
>                            is in the interrupt address range
>                            (0xFEEX_XXXX). The Status Address and Data
>                            fields are ignored by hardware when the
>                            Status Write field is Clear.
> 
> The invalidation completion event interrupt is generated only after the
> invalidation wait descriptor completes. The invalidation interrupt
> handler schedules a soft-irq to do the following check:
> 
>   if the invalidation table's count of in-flight Device-TLB
>   invalidation requests == the polling parameter:
>     this domain has no in-flight Device-TLB invalidation requests.
>   else:
>     this domain has in-flight Device-TLB invalidation requests.
> 
> Track domain status:
>    The vCPU is NOT allowed to enter guest mode, and is put on the
> SCHEDOP_yield list, if the domain has in-flight Device-TLB invalidation
> requests.
> 
> Memory security issue:
>     With PCI-e Address Translation Services (ATS) in use, the ATS spec
> mandates a timeout of 1 minute for cache flush.
>     A page freed by the domain must be held until the Device-TLB flush
> is completed. The page previously associated with the freed portion of
> GPA must not be reallocated for another purpose until the appropriate
> invalidations have been performed. Otherwise, the original page owner
> can still access the freed page through DMA.
> 
>    *Hold the page until the Device-TLB flush is completed:
>       - Unlink the page from the original owner.
>       - Remove the page from the domain's page_list.
>       - Decrease the domain's total page count.
>       - Add the page to qi_hold_page_list.
> 
>     *Put the page in the Queued Invalidation (QI) interrupt handler
> once the Device-TLB flush is completed.
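The hold steps can be sketched as follows. This is illustrative only: the structures are simplified stand-ins for Xen's struct page_info / struct domain (Xen's real page lists and ownership fields differ), and the list removal is head-only for brevity.

```c
#include <stddef.h>

struct page_info {
    struct page_info *next;
    struct domain *owner;          /* stand-in for Xen's page owner */
};

struct domain {
    struct page_info *page_list;         /* pages owned by the domain */
    unsigned int tot_pages;
    struct page_info *qi_hold_page_list; /* proposed per-domain hold list */
};

static void qi_hold_page(struct domain *d, struct page_info *pg)
{
    /* Unlink the page from the original owner. */
    pg->owner = NULL;

    /* Remove the page from the domain's page_list (head removal only,
     * for brevity). */
    if (d->page_list == pg)
        d->page_list = pg->next;

    /* Decrease the domain's total page count. */
    d->tot_pages--;

    /* Park the page on qi_hold_page_list; the QI interrupt handler
     * puts it once the Device-TLB flush has completed. */
    pg->next = d->qi_hold_page_list;
    d->qi_hold_page_list = pg;
}
```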
> 
> Invalidation fault:
> A fault event will be generated if an invalidation fails, in which case
> we can disable the offending devices.
> 
> For Context invalidation and IOTLB invalidation without Device-TLB
> invalidation, Queued Invalidation (QI) submits invalidation requests as
> before. (This is a tradeoff, as the interrupt has overhead of its own;
> it will be revisited in a coming patch series.)
> 
> More details
> ============
> 
> 1. Invalidation table. We define a qi_table structure per domain:
> +struct qi_table {
> +    u64 qi_table_poll_slot;
> +    u32 qi_table_status_data;
> +};
> 
> @ struct hvm_iommu {
> +    /* IOMMU Queued Invalidation(QI) */
> +    struct qi_table table;
> }
> 
> 2. Modification to Device-TLB invalidation:
>     - Enable interrupt notification when hardware completes the
> invalidations:
>       Set the FN, IF and SW bits in the Invalidation Wait Descriptor.
> The reason the SW bit is also set is that the notification interrupt is
> global, not per domain, so we still need to poll the status address in
> the QI interrupt handler to know which Device-TLB invalidation request
> has completed.
>     - A new per-domain flag (*qi_flag) is used to track the status of
> Device-TLB invalidation requests.
>       The *qi_flag is set before submitting the Device-TLB invalidation
> requests. The vCPU is NOT allowed to enter guest mode, and is put on
> the SCHEDOP_yield list, while the *qi_flag is set.
>     - New synchronization logic:
>         if there is no Device-TLB invalidation:
>             fall back to the current invalidation logic.
>         else:
>             Set the IF, SW, FN bits in the wait descriptor and prepare
> the Status Data.
>             Set *qi_flag.
>             Put the domain on the pending-flush list (the vCPU is NOT
> allowed to enter guest mode, and is put on the SCHEDOP_yield list,
> while the *qi_flag is set).
>         Return
> 
> For more information about the VT-d Invalidation Wait Descriptor,
> please refer to
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
>   6.5.2.8 Invalidation Wait Descriptor.
>    SW: Indicate the invalidation wait descriptor completion by
> performing a coherent DWORD write of the value in the Status Data field
> to the address specified in the Status Address field.
>    FN: Indicate that descriptors following the invalidation wait
> descriptor must be processed by hardware only after the invalidation
> wait descriptor completes.
>    IF: Indicate the invalidation wait descriptor completion by
> generating an invalidation completion event per the programming of the
> Invalidation Completion Event Registers.
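As a sketch, a wait descriptor with IF + SW + FN set could be encoded as below. Bit positions follow spec section 6.5.2.8 (type 0x5 in bits 3:0, IF/SW/FN in bits 4:6, Status Data in bits 63:32 of the low qword, Status Address in the high qword); the helper name and types are mine, not Xen's qinval.c.

```c
#include <stdint.h>

#define QINVAL_WAIT_DSC   0x5ULL       /* wait descriptor type */
#define QINVAL_IF         (1ULL << 4)  /* raise completion interrupt */
#define QINVAL_SW         (1ULL << 5)  /* perform status write */
#define QINVAL_FN         (1ULL << 6)  /* fence following descriptors */

struct qinval_entry { uint64_t lo, hi; };

static struct qinval_entry make_wait_dsc(uint32_t status_data,
                                         uint64_t status_maddr)
{
    struct qinval_entry e;
    /* IF + SW + FN together, as proposed for the Device-TLB case:
     * interrupt for global notification, status write so the handler
     * can tell which domain's requests completed. */
    e.lo = QINVAL_WAIT_DSC | QINVAL_IF | QINVAL_SW | QINVAL_FN |
           ((uint64_t)status_data << 32);
    e.hi = status_maddr & ~3ULL; /* status address is 4-byte aligned */
    return e;
}
```

Here status_data would be the in-flight count and status_maddr the machine address of the domain's global polling parameter.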
> 
> 3. Modification to the domain running lifecycle:
>     - While the *qi_flag is set (i.e. there are in-flight Device-TLB
> invalidation requests), the domain's vCPUs are not allowed to enter
> guest mode, and are put on the SCHEDOP_yield list.
> 
> 4. New interrupt handler for invalidation completion:
>     - When hardware completes the Device-TLB invalidation requests, it
> generates an interrupt to notify the hypervisor.
>     - The interrupt handler schedules a tasklet to handle it.
>     - The tasklet does the following:
>         *Clear the IWC field in the Invalidation Completion Status
> register. If the IWC field was already Set at the time of setting this
> field, it is not treated as a new interrupt condition.
>         *Scan the domain list (the domains with VT-d passthrough
> devices; scan 'iommu->domid_bitmap'):
>                 for each domain:
>                 check the invalidation table values
> (qi_table_poll_slot and qi_table_status_data) of the domain.
>                 if equal:
>                    Put the on-hold pages.
>                    Clear the invalidation table.
>                    Clear *qi_flag.
> 
>         *If the IP field of the Invalidation Event Control Register is
> Set, try to *Clear IWC and *Scan the domain list again, instead of
> generating another interrupt.
>         *Clear the IM field of the Invalidation Event Control Register.
> 
> ((
>   Logic of IWC / IP / IM as below:
> 
>   Interrupt condition (an invalidation wait descriptor
>   with the Interrupt Flag (IF) field Set completed)
>                       ||
>                       v
>   --------------------(IWC)--------------------
>   (IWC is Set)                  (IWC is not Set)
>        ||                             ||
>        v                              v
>   (not treated as a new        (Set IWC / IP)
>    interrupt condition)               ||
>                                       v
>                 ---------------------(IM)---------------------
>           (IM is Set)                          (IM is not Set)
>               ||                                     ||
>               v                                      v
>   (interrupt is held pending;          (cause interrupt message;
>    clearing IM causes the               hardware then clears IP)
>    interrupt message)
> 
> * If the IWC field is cleared, the IP field is cleared too.
> ))
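The tasklet's per-domain scan from section 4 can be sketched as below. The structures are simplified stand-ins for the Xen ones (the domain array replaces the domid_bitmap walk, and "put the on-hold pages" is reduced to a counter); the names are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

struct qi_table { uint64_t poll_slot; uint32_t status_data; };

struct dom {
    struct qi_table table;
    bool qi_flag;             /* in-flight Device-TLB invalidations */
    unsigned int held_pages;  /* pages parked on qi_hold_page_list */
};

/* For each domain with VT-d passthrough devices, compare the polling
 * slot (written by hardware) with the count of submitted requests; on
 * a match, release the held pages, clear the table, clear qi_flag. */
static void qi_scan_domains(struct dom *doms, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++) {
        struct dom *d = &doms[i];
        if (!d->qi_flag)
            continue;
        if (d->table.poll_slot == d->table.status_data) {
            d->held_pages = 0;       /* put the on-hold pages */
            d->table.poll_slot = 0;  /* clear the invalidation table */
            d->table.status_data = 0;
            d->qi_flag = false;      /* vCPUs may enter guest mode again */
        }
    }
}
```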
> 
> 5. Invalidation failure:
>     - A fault event will be generated if an invalidation fails. We can
> disable the devices if we receive an invalidation fault event.
> 
> 6. Memory security issue:
> 
>     A page freed by the domain must be held until the Device-TLB flush
> is completed. The page previously associated with the freed portion of
> GPA must not be reallocated for another purpose until the appropriate
> invalidations have been performed. Otherwise, the original page owner
> can still access the freed page through DMA.
> 
>    *Hold the page until the Device-TLB flush is completed:
>       - Unlink the page from the original owner.
>       - Remove the page from the domain's page_list.
>       - Decrease the domain's total page count.
>       - Add the page to qi_hold_page_list.
> 
>   *Put the page in the Queued Invalidation (QI) interrupt handler once
> the Device-TLB flush is completed.
> 
> 
> ----
> There are 3 reasons to submit device-TLB invalidation requests:
>     *VT-d initialization.
>     *Reassign device ownership.
>     *Memory modification.
> 
> 6.1 *VT-d initialization
>     When VT-d is initializing, there is no guest domain running, so
> there is no memory security issue.
> iotlb (iotlb/device-tlb):
> |-iommu_flush_iotlb_global()--iommu_flush_all()--intel_iommu_hwdom_init()
>                                               |--init_vtd_hw()
> 6.2 *Reassign device ownership
>     Reassigning device ownership is invoked by 2 hypercalls:
> do_physdev_op() and arch_do_domctl().
> While the *qi_flag is Set, the domain is not allowed to enter guest
> mode. If the appropriate invalidations have not yet been performed, the
> *qi_flag is still Set, and the devices are not ready for the guest
> domain to launch DMA with. So with the *qi_flag introduced, there is no
> memory security issue here.
> 
> iotlb(iotlb/device-tlb)
> |-iommu_flush_iotlb_dsi()
>                        |--domain_context_mapping_one() ...
>                        |--domain_context_unmap_one() ...
> 
> |-iommu_flush_iotlb_psi()
>                        |--domain_context_mapping_one() ...
>                        |--domain_context_unmap_one() ...
> 
> 6.3 *Memory modification.
> While memory is modified, there are a lot of invocation flows that
> update the EPT, but not all of them update the IOMMU page tables. The
> IOMMU page tables are updated only when all of the following three
> conditions are met:
>   * The P2M is the hostp2m. ( p2m_is_hostp2m(p2m) )
>   * The previous mfn is not equal to the new mfn. (prev_mfn != new_mfn)
>   * The domain needs the IOMMU. (need_iommu(d))
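The three conditions can be written as a single predicate (a sketch only; in Xen these checks live in ept_set_entry() and its callers, and the function name here is mine):

```c
#include <stdint.h>
#include <stdbool.h>

/* The IOMMU page tables need updating (and hence a Device-TLB flush)
 * only when all three conditions from the text hold. */
static bool iommu_update_needed(bool is_hostp2m, uint64_t prev_mfn,
                                uint64_t new_mfn, bool need_iommu)
{
    return is_hostp2m && (prev_mfn != new_mfn) && need_iommu;
}
```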
> 
> ##
> |--iommu_pte_flush()--ept_set_entry()
> 
> #PoD (populate on demand) is not supported while IOMMU passthrough is
> enabled. So ignore the PoD invocation flows below.
>       |--p2m_pod_zero_check_superpage()  ...
>       |--p2m_pod_zero_check()  ...
>       |--p2m_pod_demand_populate()  ...
>       |--p2m_pod_decrease_reservation()  ...
>       |--guest_physmap_mark_populate_on_demand() ...
> 
> #Xen paging is not supported while IOMMU passthrough is enabled. So
> ignore the Xen paging invocation flows below.
>       |--p2m_mem_paging_evict() ...
>       |--p2m_mem_paging_resume()...
>       |--p2m_mem_paging_prep()...
>       |--p2m_mem_paging_populate()...
>       |--p2m_mem_paging_nominate()...
>       |--p2m_alloc_table()--shadow_enable()--paging_enable()
> --shadow_domctl()--paging_domctl()--arch_do_domctl()--do_domctl()
> 
> |--paging_domctl_continuation()
> 
> #Xen sharing is not supported while IOMMU passthrough is enabled. So
> ignore the Xen sharing invocation flow below.
>       |--set_shared_p2m_entry()...
> 
> 
> #The domain is paused, so it can't launch DMA.
>       |--relinquish_shared_pages()--domain_relinquish_resources( case
> RELMEM_shared: ) --domain_kill()--do_domctl()
> 
> #The p2m below is not the hostp2m; it is L2 to L0. So ignore the
> invocation flow below.
>       |--nestedhap_fix_p2m()--nestedhvm_hap_nested_page_fault()
> --hvm_hap_nested_page_fault()--ept_handle_violation()
> --vmx_vmexit_handler()
> 
> #If prev_mfn == new_mfn, it will not update the IOMMU page tables. So
> ignore the invocation flows below.
>       |--p2m_mem_access_check()--hvm_hap_nested_page_fault()
> --ept_handle_violation()--vmx_vmexit_handler() (L1 --> L0 / but it only
> checks p2m_type_t)
>       |--p2m_set_mem_access() ...
>       |--guest_physmap_mark_populate_on_demand() ...
>       |--p2m_change_type_one() ...
> # The previous page is not put and reallocated to Xen or another guest
> domain, so there is no memory security issue. Ignore the invocation
> flows below.
>    |--p2m_remove_page()--guest_physmap_remove_page() ...
> 
>    |--clear_mmio_p2m_entry()--unmap_mmio_regions()--do_domctl()
>                            |--map_mmio_regions()--do_domctl()
> 
> 
> # Held on the pages which are removed in guest_remove_page(), and put in QI
> interrupt handler when it has no in-flight Device-TLB invalidation requests.
> 
> |--clear_mmio_p2m_entry()--*guest_remove_page()*--decrease_reservation()
> 
> |--xenmem_add_to_physmap_one() --xenmem_add_to_physmap()
> /xenmem_add_to_physmap_batch()  .. --do_memory_op()
>                                                |--p2m_add_foreign() --
> xenmem_add_to_physmap_one() ..--do_memory_op()
> 
> |--guest_physmap_add_entry()--create_grant_p2m_mapping()  ...
> --do_grant_table_op()
> 
> ((
>    More explanation:
>    Actually, the previous pages may be mapped from the Xen heap for
> guest domains in decrease_reservation() / xenmem_add_to_physmap_one()
> / p2m_add_foreign(), but they are not mapped into the IOMMU table. The
> below 4 cases map Xen heap pages for guest domains:
>           * share page for Xen Oprofile.
>           * vLAPIC mapping.
>           * grant table shared page.
>           * domain shared_info page.
> ))
> 
> # For grant_unmap*: ignore it at this point, as we can hold the page
> when the domain frees the xenballooned page.
> 
> 
> |--iommu_map_page()--__gnttab_unmap_common()--__gnttab_unmap_grant_ref()
> --gnttab_unmap_grant_ref()--do_grant_table_op()
> 
> |--__gnttab_unmap_and_replace()--gnttab_unmap_and_replace()
> --do_grant_table_op()
> 
> # For grant_map*, ignore it as there is no pfn<--->mfn mapping in the
> Device-TLB.
> 
> # For grant_transfer:
>   |--p2m_remove_page()--guest_physmap_remove_page()
>                                                  |--gnttab_transfer() ...
> --do_grant_table_op()
> 
>     If the Device-TLB flush has still not completed when the
> transferred page is to be mapped into a remote domain, schedule and
> wait on a waitqueue until the Device-TLB flush is completed.
> 
>    Plan B:
>    ((If the Device-TLB flush has still not completed before adding the
> transferred page to the target domain, allocate a new page for the
> target domain and hold the old transferred page; it will be put in the
> QI interrupt handler when there are no in-flight Device-TLB
> invalidation requests.))
> 
> 
> Quan Xu (13):
>   vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt
>   vt-d: Register MSI for async invalidation completion interrupt.
>   vt-d: Track the Device-TLB invalidation status in an invalidation table.
>   vt-d: Clear invalidation table in invalidation interrupt handler
>   vt-d: Clear the IWC field of Invalidation Event Control Register in
>   vt-d: Introduce a new per-domain flag - qi_flag.
>   vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to
>   vt-d: Held on the freed page until the Device-TLB flush is completed.
>   vt-d: Put the page in Queued Invalidation(QI) interrupt handler if
>   vt-d: Held on the removed page until the Device-TLB flush is completed.
>   vt-d: If the Device-TLB flush is still not completed when
>   vt-d: For gnttab_transfer, If the Device-TLB flush is still
>   vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB
> 
>  xen/arch/x86/hvm/vmx/entry.S         |  10 ++
>  xen/arch/x86/x86_64/asm-offsets.c    |   1 +
>  xen/common/domain.c                  |  15 ++
>  xen/common/grant_table.c             |  16 ++
>  xen/common/memory.c                  |  16 +-
>  xen/drivers/passthrough/vtd/iommu.c  | 290 +++++++++++++++++++++++++++++++++--
>  xen/drivers/passthrough/vtd/iommu.h  |  18 +++
>  xen/drivers/passthrough/vtd/qinval.c |  51 +++++-
>  xen/include/xen/hvm/iommu.h          |  42 +++++
>  9 files changed, 443 insertions(+), 16 deletions(-)
> 
> --
> 1.8.3.2


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

