RE: [Xen-devel] [Q] Device error handling discussion -- Was: Is qemu used when we use VTd?
Yuji Shimada <mailto:shimada-yxb@xxxxxxxxxxxxxxx> wrote:
> On Thu, 16 Oct 2008 15:32:40 +0800
> "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx> wrote:
>
>>>>> Non-fatal error on I/O device:
>>>>>  - kill the domain with error source function.
>>>>>  - reset the function.
>>>>
>>>> From following statement in PCI-E 2.0 section 6.6.2: "Note that Port
>>>> state machines associated with Link functionality including those
>>>> in the Physical and Data Link Layers are not reset by FLR", I'm not
>>>> sure if FLR is a right method to handle the error situation. That's
>>>> the reason I asked on how to handle multiple-function devices.
>>>
>>> I think Non-fatal error is transaction's error and it does not require
>>> to reset lower layer. But I am not sure.
>>
>> By default, the data link layer's error is fatal, but the result
>> depends on how driver setup it. We can trap the access to AER
>> register, and make sure data link layer error always report as
>> fatal. That is easy to implement.
>
> It means non-fatal error is transaction layer's error, with default
> setting. When non-fatal error occurs on I/O device, FLR seems to
> recover it.
>
>
>>>
>>>>> Non-fatal error on PCI-PCI bridge.
>>>>>  - kill all domains with the functions under the PCI-PCI bridge.
>>>>>  - reset PCI-PCI bridge and secondary bus.
>>>>>
>>>>> Fatal error:
>>>>>  - kill all domains with the functions under the same root port.
>>>>>  - reset the link (secondary bus reset on root port).
>>>>
>>>> Agree. Basically I think the action of "reset PCI-PCI bridge and
>>>> secondary bus" or "reset the link" has been done by AER core
>>>> already. What we need define is PCI back's error handler. In first
>>>> step, the error handler will trigger domain reset, in future, more
>>>> elegant action can be defined/implemented, Any idea?
>>>
>>> I agree with you basically.
>>>
>>> Current AER core does not reset PCI-PCI bridge and secondary bus,
>>> when Non-fatal error occurs on PCI-PCI bridge. We need to implement
>>> resetting PCI-PCI bridge and secondary bus.
>>
>> I'd keep the AER core as current-is unless some special reason. For
>> example, why should we kill all domains under same root port and
>> reset root port's secondary link? Currently it will do so only if
>> the impacted device has no aer service register.
>
> On linux 2.6.27, there is aer driver which bind to root port. But
> there is no aer driver for other device. So When fatal error occurs,
> linux resets root port's secondary link.
>
>     drivers/pci/pcie/aer/aerdrv.c:aer_root_reset
>
>
>> Also not sure if we need reset the link for non-fatal error if AER
>> core does not do that. Are there any special difference between
>> virtualization/native situation?
>
> No. There is no difference.
> I agree with you to keep the AER core as current-is.
>
>
>>>>> Note: we have to consider to prevent device from destroying other
>>>>>       domain's memory.
>>>>
>>>> Why should we consider destroy other domain's memory? I think VT-d
>>>> should guarantee this.
>>>
>>> The device is re-assigned to dom0 on destroying HVM domain. If we
>>> destroy domain before resetting the device, I/O device can write
>>> memory of dom0. On the other hand, we have to stop guest software
>>> before resetting the device to prevent guest software from accessing
>>> device.
>>
>> That should same to normal VT-d situation. We need FLR before we
>> re-assign device to dom0 (If current not working like this, it
>> should be a bug). Also, to stop guest software before resetting the
>> device maybe helpful, but maybe not so important. Do you think
>> guest's second access will cause host impacted? After all, even on
>> native environment this is guaranteed unless platform support it. (It
>> is said PPC has such support).
>>
>> BTW, you stated "We have to solve many difficulties to keep guest
>> domain running", can you give some detail difficulties (it maybe
>> difficult to HVM, but not sure for PV side)?
>
> - HVM

For HVM, yes, it is tricky, and we have no plan for it till now.

>   * Implementing root port emulator in ioemu.
>   * Implementing memory mapped configuration access mechanism
>     for guest os.
>   * Enhancing guest aml to allow guest os to handle aer.
>   * Mapping host error to guest error.

This should be the tricky one considering:
1) How to translate the TLP for the header log register?
2) Need to map the Source Identification register.

>   * Interaction between ioemu and pci back driver.

The main difficulty is mapping the error_handler to AER register operation.

>   * Handling when guest does not work fine.
>
> - PV
>   * Notifying pciback to pcifront.
>   * Handling when guest does not work fine.

We are working on this now.

>
>>> By the way, do you have any plan to implement these function?
>>> I can provide the idea. But I can't provide the code.
>>
>> Yes, we try to work on it. But we may have not enough environment to
>> test all types of error. Also although the AER code can be
>> backported easily, some required ACPI fix is more challenging.
>
> I'm not sure backporting is good. In the long term, dom0 linux will be
> based on newer linux. How/When developers(we) can switch it to newer
> one? I'd like other developer's comment.

At least we need do that for internal testing. Not sure when kernel update will happen.

>
> Thanks,
> --
> Yuji Shimada

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
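
For reference, the recovery policy proposed in the quoted text (non-fatal error on an I/O function, non-fatal error on a PCI-PCI bridge, fatal error) can be written down as a small decision table. The following is only a sketch of that proposal, not code from Xen, pciback, or the Linux AER core. Every type and function name in it (err_severity, err_source, recovery_plan, plan_recovery, and so on) is hypothetical and was invented for illustration. It also assumes, per the discussion above, that a fatal error is handled the same way regardless of which device reported it.

```c
#include <assert.h>

/* All names below are hypothetical; none of them exist in Xen or Linux. */
enum err_severity { ERR_NONFATAL, ERR_FATAL };
enum err_source   { SRC_IO_FUNCTION, SRC_PCI_BRIDGE };

/* Which domains the proposal would kill. */
enum kill_scope {
    KILL_OWNER_DOMAIN,            /* domain owning the error source function */
    KILL_DOMAINS_UNDER_BRIDGE,    /* all domains with functions under the bridge */
    KILL_DOMAINS_UNDER_ROOT_PORT  /* all domains with functions under the root port */
};

/* Which reset the proposal would apply. */
enum reset_kind {
    RESET_FLR,                    /* Function Level Reset of the function */
    RESET_BRIDGE_SECONDARY_BUS,   /* reset PCI-PCI bridge and its secondary bus */
    RESET_ROOT_PORT_LINK          /* secondary bus reset on the root port */
};

struct recovery_plan {
    enum kill_scope kill;
    enum reset_kind reset;
};

/* Map (severity, source) to the action pair listed in the thread. */
static struct recovery_plan plan_recovery(enum err_severity sev,
                                          enum err_source src)
{
    struct recovery_plan p;

    assert(sev == ERR_NONFATAL || sev == ERR_FATAL);

    if (sev == ERR_FATAL) {
        /* Fatal error: widest scope, independent of the source (assumption). */
        p.kill  = KILL_DOMAINS_UNDER_ROOT_PORT;
        p.reset = RESET_ROOT_PORT_LINK;
    } else if (src == SRC_PCI_BRIDGE) {
        /* Non-fatal error on a PCI-PCI bridge. */
        p.kill  = KILL_DOMAINS_UNDER_BRIDGE;
        p.reset = RESET_BRIDGE_SECONDARY_BUS;
    } else {
        /* Non-fatal error on an I/O function. */
        p.kill  = KILL_OWNER_DOMAIN;
        p.reset = RESET_FLR;
    }
    return p;
}
```

A caller would look up the plan first and only then act on it. The thread's ordering concern (the device has to be reset before it is re-assigned to dom0) is outside what this table captures and would have to be enforced by whatever code executes the plan.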