 Re: x86/CET: Fix S3 resume with shadow stacks active
 
To: Jan Beulich <jbeulich@xxxxxxxx>
From: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>
Date: Fri, 25 Feb 2022 12:41:00 +0000
Cc: Roger Pau Monne <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, "Thiner Logoer" <logoerthiner1@xxxxxxx>, Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

 On 25/02/2022 08:38, Jan Beulich wrote:
> On 24.02.2022 20:48, Andrew Cooper wrote:
>> The original shadow stack support has an error on S3 resume with very bizarre
>> fallout.  The BSP comes back up, but APs fail with:
>>
>>   (XEN) Enabling non-boot CPUs ...
>>   (XEN) Stuck ??
>>   (XEN) Error bringing CPU1 up: -5
>>
>> and then later (on at least two Intel TigerLake platforms), the next HVM vCPU
>> to be scheduled on the BSP dies with:
>>
>>   (XEN) d1v0 Unexpected vmexit: reason 3
>>   (XEN) domain_crash called from vmx.c:4304
>>   (XEN) Domain 1 (vcpu#0) crashed on cpu#0:
>>
>> The VMExit reason is EXIT_REASON_INIT, which has nothing to do with the
>> scheduled vCPU, and will be addressed in a subsequent patch.  It is a
>> consequence of the APs triple faulting.
>>
>> The APs triple fault because we don't tear down the stacks on suspend.  The
>> idle/play_dead loop is killed in the middle of running, meaning that the
>> supervisor token is left busy.
>>
>> On resume, SETSSBSY finds the token already busy, suffers #CP and triple
>> faults because the IDT isn't configured this early.
>>
>> Rework the AP bringup path to (re)create the supervisor token.  This ensures
>> the primary stack is non-busy before use.
>>
>> Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
>> Link: https://github.com/QubesOS/qubes-issues/issues/7283
>> Reported-by: Thiner Logoer <logoerthiner1@xxxxxxx>
>> Reported-by: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>> Tested-by: Thiner Logoer <logoerthiner1@xxxxxxx>
>> Tested-by: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
> Reviewed-by: Jan Beulich <jbeulich@xxxxxxxx>

Thanks.

>> Slightly RFC.  This does fix the crash encountered, but it occurs to me that
>> there's a race condition when S3 platform powerdown is coincident with an
>> NMI/#MC, where more than just the primary shadow stack can end up busy on
>> resume.
>>
>> A larger fix would be to change how we allocate tokens, and always have each
>> CPU set up its own tokens.  I didn't do this originally in the hopes of having
>> WRSSQ generally disabled, but that plan failed when encountering reality...
> While I think this wants fixing one way or another, I also think this
> shouldn't block the immediate fix here (which addresses an unconditional
> crash rather than a pretty unlikely one).

Fair point.  I'll get this committed now, and work on the IST shstks later.

As a note for backporting, this depends on the adjustments made in c/s
311434bfc9d1 so isn't safe to backport in exactly this form.  I'll sort
something out in due course.

~Andrew
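
For readers following along, the failure mode in the commit message can be
modelled in plain C.  This is only a user-space sketch, not Xen code and not
the patch itself: model_setssbsy() and model_reset_token() are hypothetical
names, and the model assumes the usual CET layout where a supervisor token
holds its own address with bit 0 acting as the busy flag.  Real hardware
raises #CP where the model returns false.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this model: bit 0 of a supervisor token is the busy flag. */
#define TOKEN_BUSY UINT64_C(1)

/*
 * Hypothetical model of SETSSBSY: it succeeds only when the token contains
 * its own address with the busy bit clear, and then marks the token busy.
 */
static bool model_setssbsy(uint64_t *token)
{
    uint64_t addr = (uint64_t)(uintptr_t)token;

    if ( *token != addr )
        return false;           /* hardware would raise #CP here */

    *token = addr | TOKEN_BUSY;
    return true;
}

/*
 * Hypothetical model of the fix: (re)create the token as non-busy before
 * the stack is activated, regardless of what state suspend left it in.
 */
static void model_reset_token(uint64_t *token)
{
    *token = (uint64_t)(uintptr_t)token;
}
```

Calling model_setssbsy() twice on the same token without a reset in between
fails the second time, which mirrors an AP coming back from S3 with a stale
busy token.  Calling model_reset_token() first makes it succeed again.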