Re: [PATCH v2 5/6] tools/xenstored: partially handle domains without a shared ring
On Wed, Sep 22, 2021 at 06:46:25PM +0500, Julien Grall wrote:
> (+ Some AWS folks)
>
> Hi Juergen,
>
> On 22/09/2021 17:34, Juergen Gross wrote:
> > On 22.09.21 12:23, Julien Grall wrote:
> > > Hi Roger,
> > >
> > > On 22/09/2021 14:58, Roger Pau Monné wrote:
> > > > On Wed, Sep 22, 2021 at 02:07:44PM +0500, Julien Grall wrote:
> > > > > Hi Roger,
> > > > >
> > > > > On 22/09/2021 13:21, Roger Pau Monne wrote:
> > > > > > Failure to map the shared ring and thus establish a xenstore
> > > > > > connection with a domain shouldn't prevent the "@introduceDomain"
> > > > > > watch from firing, likewise with "@releaseDomain".
> > > > > >
> > > > > > In order to handle such events properly xenstored should keep
> > > > > > track of the domains even if the shared communication ring cannot
> > > > > > be mapped. This was partially the case due to the restore mode,
> > > > > > which needs to handle domains that have been destroyed between the
> > > > > > save and restore period. This patch extends the previous limited
> > > > > > support for temporarily adding a domain without a valid interface
> > > > > > ring, and modifies check_domains to keep domains without an
> > > > > > interface on the list.
> > > > > >
> > > > > > Handling domains without a valid shared ring is required in order
> > > > > > to support domains without a grant table, since the lack of a
> > > > > > grant table will prevent the mapping of the xenstore ring grant
> > > > > > reference.
> > > > > >
> > > > > > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> > > > > > ---
> > > > > > oxenstored will need a similar treatment once grant mapping is used
> > > > > > there. For the time being it still works correctly because it uses
> > > > > > foreign memory to map the shared ring, and that will work in the
> > > > > > absence of grant tables on the domain.
> > > > > > ---
> > > > > > Changes since v1:
> > > > > > - New in this version.
> > > > > > ---
> > > > > >  tools/xenstore/xenstored_domain.c | 30 ++++++++++++++++++------------
> > > > > >  1 file changed, 18 insertions(+), 12 deletions(-)
> > > > > >
> > > > > > diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
> > > > > > index 9fb78d5772..150c6f082e 100644
> > > > > > --- a/tools/xenstore/xenstored_domain.c
> > > > > > +++ b/tools/xenstore/xenstored_domain.c
> > > > > > @@ -119,6 +119,11 @@ static int writechn(struct connection *conn,
> > > > > >  	struct xenstore_domain_interface *intf = conn->domain->interface;
> > > > > >  	XENSTORE_RING_IDX cons, prod;
> > > > > >
> > > > > > +	if (!intf) {
> > > > > > +		errno = ENODEV;
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > >  	/* Must read indexes once, and before anything else, and verified. */
> > > > > >  	cons = intf->rsp_cons;
> > > > > >  	prod = intf->rsp_prod;
> > > > > > @@ -149,6 +154,11 @@ static int readchn(struct connection *conn, void *data, unsigned int len)
> > > > > >  	struct xenstore_domain_interface *intf = conn->domain->interface;
> > > > > >  	XENSTORE_RING_IDX cons, prod;
> > > > > >
> > > > > > +	if (!intf) {
> > > > > > +		errno = ENODEV;
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > >  	/* Must read indexes once, and before anything else, and verified. */
> > > > > >  	cons = intf->req_cons;
> > > > > >  	prod = intf->req_prod;
> > > > > > @@ -176,6 +186,9 @@ static bool domain_can_write(struct connection *conn)
> > > > > >  {
> > > > > >  	struct xenstore_domain_interface *intf = conn->domain->interface;
> > > > > >
> > > > > > +	if (!intf)
> > > > > > +		return false;
> > > > > > +
> > > > >
> > > > > Rather than adding an extra check, how about taking
> > > > > advantage of is_ignored?
> > > >
> > > > Right, I just need to change the order in conn_can_read and
> > > > conn_can_write so that the is_ignored check is performed before the
> > > > can_{read,write} handler is called.
> > > >
> > > > > >  	return ((intf->rsp_prod - intf->rsp_cons) != XENSTORE_RING_SIZE);
> > > > > >  }
> > > > > > @@ -183,7 +196,8 @@ static bool domain_can_read(struct connection *conn)
> > > > > >  {
> > > > > >  	struct xenstore_domain_interface *intf = conn->domain->interface;
> > > > > >
> > > > > > -	if (domain_is_unprivileged(conn) && conn->domain->wrl_credit < 0)
> > > > > > +	if ((domain_is_unprivileged(conn) && conn->domain->wrl_credit < 0) ||
> > > > > > +	    !intf)
> > > > > >  		return false;
> > > > > >
> > > > > >  	return (intf->req_cons != intf->req_prod);
> > > > > > @@ -268,14 +282,7 @@ void check_domains(void)
> > > > > >  			domain->shutdown = true;
> > > > > >  			notify = 1;
> > > > > >  		}
> > > > > > -		/*
> > > > > > -		 * On Restore, we may have been unable to remap the
> > > > > > -		 * interface and the port. As we don't know whether
> > > > > > -		 * this was because of a dying domain, we need to
> > > > > > -		 * check if the interface and port are still valid.
> > > > > > -		 */
> > > > > > -		if (!dominfo.dying && domain->port &&
> > > > > > -		    domain->interface)
> > > > > > +		if (!dominfo.dying)
> > > > > >  			continue;
> > > > >
> > > > > This is mostly a revert of "tools/xenstore: handle dying domains
> > > > > in live update". However, IIRC, this check was necessary to
> > > > > release the connection if the domain has died in the middle of
> > > > > Live-Update.
> > > >
> > > > But if the domain has died in the middle of live update
> > > > get_domain_info will return false, and thus the code won't get here.
> > >
> > > Hmmm... I think I am mixing up a few things... I went through the
> > > original discussion (it was on the security ML) to find out why I
> > > wrote the patch like that. When going through the archives, I
> > > noticed that I provided a different version of this patch (see [1])
> > > because there was some issue with the check here (I wrote that it
> > > would lead to a zombie domain, but I don't have the rationale :().
> > >
> > > Juergen, I can't seem to find the reason why the patch was not
> > > replaced as we discussed on the security ML. Do you remember why?
> >
> > Sorry, no, I don't.
> >
> > You did send the new version for V6 of the LU series, but it seems at
> > least in V9 you commented on the patch, requesting the removal of a
> > comment in the very section that differs between the two variants.
> >
> > So either we both overlooked that the new variant hadn't gone in, or
> > we agreed to use the old version (e.g. in a security meeting). In my
> > IRC logs I couldn't find anything either (the only mention of that
> > patch was before V6 of the series was sent, and that was before you
> > sent the new one as a reply to V6).
> >
> > > Assuming this was a mistake, could someone take care of sending an
> > > update? If not, I could do it when I am back in October.
> > >
> > > For the archives, the issues would appear when shutting down a
> > > domain during Live-Update.
> >
> > Hmm, IIRC you did quite some extensive testing of LU and didn't find
> > any problem in the final version.
>
> I did extensive testing with early series but I can't remember whether I
> tested the shutdown reproducer with the latest series.
>
> >
> > Are you sure there really is a problem?
>
> I thought a bit more and looked at the code (I don't have access to a test
> machine at the moment). I think there is indeed a problem.
>
> Some watchers of @releaseDomain (such as xenconsoled) will only remove a
> domain from their internal state when the domain is actually dead.
>
> This is based on dominfo.dying which is only set when all the resources are
> relinquished and waiting for the other domains to release any resources for
> that domain.
>
> The problem is Xenstore may fail to map the interface or the event
> channel long before the domain is actually dead. With the current
> check, we would free the allocated structure and therefore send
> @releaseDomain too early. So daemons like xenconsoled would never
> clean up for the domain and would leave a zombie domain. Well...
> until the next @releaseDomain (or @introduceDomain for xenconsoled)
> AFAICT.
>
> The revised patch is meant to solve it by just ignoring the connection. With
> that approach, we would correctly notify watches when the domain is dead.
I think the patch provided by Julien is indeed better than the current
code, or else you will miss @releaseDomain events in corner cases
depending on dominfo.dying.
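For the archives, the intended behaviour can be sketched roughly as
follows. This is only an illustrative model of the semantics discussed
above; the structure and function names (fake_domain, should_release)
are made up for the example and are not the actual xenstored code:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Minimal model: a domain whose shared ring could not be mapped is
 * kept on the list with its connection ignored, and @releaseDomain is
 * only triggered once the domain is really dying.  Illustrative names
 * only, not the real xenstored structures.
 */
struct fake_domain {
	bool dying;        /* set once the domain is actually being destroyed */
	void *interface;   /* NULL when the shared ring could not be mapped */
	bool conn_ignored; /* connection kept around but ignored for I/O */
};

/* Returns true when the domain should be released (firing @releaseDomain). */
static bool should_release(struct fake_domain *d)
{
	if (!d->dying) {
		/* An unmappable interface alone no longer releases the domain. */
		if (d->interface == NULL)
			d->conn_ignored = true;
		return false;
	}
	return true; /* really dying: release and notify watchers */
}
```

The point of the model: watchers such as xenconsoled only see
@releaseDomain when the domain is genuinely going away, never merely
because the ring mapping failed.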
I think however the patch is missing a change in the order of the
checks in conn_can_{read,write}, so that the is_ignored check is
performed before calling can_{read,write}, which would otherwise poke
at the interface and trigger a fault if it is not mapped.
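The reordering described above could look roughly like this. The field
and callback names (fake_conn, can_read, can_write) are illustrative,
loosely modelled on xenstored's connection handling rather than copied
from it:

```c
#include <stdbool.h>

/*
 * Sketch of the reordering: test the connection's is_ignored flag
 * before invoking the per-connection can_read/can_write handler,
 * because the handler may dereference an unmapped shared ring.
 * Illustrative names only, not the real xenstored code.
 */
struct fake_conn {
	bool is_ignored;
	bool (*can_read)(struct fake_conn *conn);   /* may touch the ring */
	bool (*can_write)(struct fake_conn *conn);  /* may touch the ring */
};

static bool conn_can_read(struct fake_conn *conn)
{
	/* Reordered: bail out before the handler pokes at the interface. */
	if (conn->is_ignored)
		return false;
	return conn->can_read(conn);
}

static bool conn_can_write(struct fake_conn *conn)
{
	if (conn->is_ignored)
		return false;
	return conn->can_write(conn);
}
```

With this order an ignored connection is never asked to inspect its
ring indexes, so a NULL or stale interface pointer cannot be
dereferenced.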
Thanks, Roger.