
Block protocol incompatibilities with 4K logical sector size disks



Hello,

To give some context, this started as a bug report against FreeBSD failing to
work with PV blkif attached disks with 4K logical sectors when the backend is
Linux kernel blkback:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=280884

Further investigation has led me to discover that the protocol described in
the public blkif.h header is not implemented uniformly, and there are major
inconsistencies between implementations regarding the meaning of the `sectors`
and `sector-size` xenstore nodes, and the sector_number and {first,last}_sect
struct request fields.  Below is a summary of the findings on the
implementations I've analyzed.

Linux blk{front,back} always assumes the `sectors` xenstore node to be in 512b
units, regardless of the value of the `sector-size` node.  Equally the ring
request sector_number and the segments {first,last}_sect fields are always
assumed to be in units of 512b regardless of the value of `sector-size`.  The
`feature-large-sector-size` node is neither exposed by blkfront nor checked
by blkback before exposing a `sector-size` node different than 512b.

FreeBSD blk{front,back} calculates (and for blkback exposes) the disk size as
`sectors` * `sector-size` based on the values in the xenstore nodes (as
described in blkif.h).  The ring sector_number is filled with the sector number
based on the `sector-size` value, however the {first,last}_sect fields are
always calculated as 512b units.  The `feature-large-sector-size` node is
neither exposed by blkfront nor checked by blkback before exposing a
`sector-size` node different than 512b.
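To make the mismatch concrete, here is a small Python sketch of the two
conventions described so far.  It is only an illustration of the arithmetic,
not code from either implementation; the function names and the
"linux"/"freebsd" convention labels are made up for this example.

```python
SECTOR_512 = 512  # fixed base unit assumed by Linux blk{front,back}


def _unit(sector_size: int, convention: str) -> int:
    """Unit used for `sectors` and sector_number under each convention."""
    if convention == "linux":      # always 512b, `sector-size` is ignored
        return SECTOR_512
    if convention == "freebsd":    # follows the `sector-size` node
        return sector_size
    raise ValueError(f"unknown convention: {convention}")


def disk_size_bytes(sectors: int, sector_size: int, convention: str) -> int:
    """Disk size derived from the `sectors` and `sector-size` nodes."""
    return sectors * _unit(sector_size, convention)


def sector_number(byte_offset: int, sector_size: int, convention: str) -> int:
    """Value placed in the ring request sector_number field."""
    unit = _unit(sector_size, convention)
    if byte_offset % unit:
        raise ValueError("offset is not aligned to the addressing unit")
    return byte_offset // unit


# A 1 GiB disk with 4K logical sectors, advertised the Linux way:
sectors = (1 << 30) // SECTOR_512                 # 2097152
print(disk_size_bytes(sectors, 4096, "linux"))    # 1073741824 (1 GiB)
print(disk_size_bytes(sectors, 4096, "freebsd"))  # 8589934592 (8 GiB)

# The same 1 MiB offset is encoded differently in sector_number:
print(sector_number(1 << 20, 4096, "linux"))      # 2048
print(sector_number(1 << 20, 4096, "freebsd"))    # 256
```

With a 4K `sector-size` the two readings differ by a factor of 8, both for
the disk size and for every request offset, so mixing a frontend and a
backend from different columns cannot work.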

QEMU qdisk blkback implementation exposes the `sectors` disk size in units of
`sector-size` (as FreeBSD blkback).  The ring structure fields sector_number
and {first,last}_sect are assumed to be in units of `sector-size`.  This
implementation will not expose a `sector-size` node with a value different than
512 unless the frontend xenstore path has the `feature-large-sector-size` node
present.

Windows blkfront calculates the disk size as `sectors` * `sector-size` from the
xenstore nodes exposed by blkback.   The ring structure fields sector_number
and {first,last}_sect are assumed to be in units of `sector-size`.  This
frontend implementation exposes `feature-large-sector-size`.

When using a disk with a logical sector size different than 512b, Linux is only
compatible with itself, and the same goes for FreeBSD.  The QEMU blkback
implementation is only compatible with the Windows blkfront implementation.
`feature-large-sector-size` seems to only be implemented for the QEMU/Windows
combination: neither Linux nor FreeBSD implements any support for it, in
either the backend or the frontend.

The following table attempts to summarize in which units the following fields
are defined for the analyzed implementations (please correct me if I got some
of this wrong):

                        │ sectors xenbus node │ requests sector_number │ requests {first,last}_sect
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
FreeBSD blk{front,back} │     sector-size     │      sector-size       │            512
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
Linux blk{front,back}   │         512         │          512           │            512
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
QEMU blkback            │     sector-size     │      sector-size       │        sector-size
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
Windows blkfront        │     sector-size     │      sector-size       │        sector-size
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
MiniOS                  │     sector-size     │          512           │            512
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
tapdisk blkback         │         512         │      sector-size       │            512
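The table can also be read mechanically: a frontend and a backend can only
interoperate on a non-512b disk if they agree on the unit of all three
fields.  The Python sketch below encodes the rows above and lists the
matching pairs.  It assumes the MiniOS row refers to its blkfront, and it
ignores the `feature-large-sector-size` negotiation; the names used are
invented for the example.

```python
from typing import NamedTuple


class Units(NamedTuple):
    sectors_node: str      # unit of the `sectors` xenstore node
    sector_number: str     # unit of the request sector_number field
    first_last_sect: str   # unit of the segment {first,last}_sect fields


SS = "sector-size"
B512 = "512"

# Rows of the table, split into frontends and backends.
FRONTENDS = {
    "FreeBSD blkfront": Units(SS, SS, B512),
    "Linux blkfront": Units(B512, B512, B512),
    "Windows blkfront": Units(SS, SS, SS),
    "MiniOS blkfront": Units(SS, B512, B512),  # assumed to be a frontend
}
BACKENDS = {
    "FreeBSD blkback": Units(SS, SS, B512),
    "Linux blkback": Units(B512, B512, B512),
    "QEMU blkback": Units(SS, SS, SS),
    "tapdisk blkback": Units(B512, SS, B512),
}


def compatible_pairs() -> list[tuple[str, str]]:
    """Frontend/backend pairs that agree on the unit of every field."""
    return sorted(
        (front, back)
        for front, front_units in FRONTENDS.items()
        for back, back_units in BACKENDS.items()
        if front_units == back_units
    )


for front, back in compatible_pairs():
    print(f"{front} <-> {back}")
```

Only three pairs come out: FreeBSD with FreeBSD, Linux with Linux, and
Windows blkfront with QEMU blkback.  By this table MiniOS and tapdisk do not
fully match any of the other implementations listed.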

It's all a mess, I'm surprised we didn't get more reports about brokenness when
using disks with 4K logical sectors.

Overall I think the in-kernel backends are more difficult to update (as doing
so might require a kernel rebuild) than QEMU or blktap.  Hence my slight
preference would be to adjust the public interface to match the behavior of
Linux blkback, and then adjust the implementation in the rest of the backends
and frontends.

There was an attempt in 2019 to introduce a new frontend feature flag to signal
whether the frontend supports `sector-size` xenstore nodes different than
512 [0].  However that was only ever implemented for QEMU blkback and Windows
blkfront; all the other backends will expose a `sector-size` different than 512
without checking whether `feature-large-sector-size` is exposed by the
frontend.  I'm afraid it's now too late to retrofit that feature into the
existing backends, seeing as they already behave this way.

My proposal would be to adjust the public interface as follows:

 * Disk size is calculated as: `sectors` * 512 (`sectors` being the contents of
   such xenstore backend node).

 * All the sector related fields in blkif ring requests use a 512b base sector
   size, regardless of the value in the `sector-size` xenstore node.

 * The `sector-size` node contains the disk logical sector size.  The frontend
   must ensure that each request segment's address is aligned to that size and
   that its length is a multiple of it.  Otherwise the backend will refuse to
   process the request.
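As a rough sketch of what these rules would mean for a backend, the Python
below checks a request under the proposed interface.  It is an illustration
only: the function names are invented, a 4096-byte page is assumed, "segment
address" is read as the segment's offset within its page, and the alignment
check on sector_number is an extra assumption that is not spelled out in the
list above.

```python
SECTOR_512 = 512   # base unit for every sector field under the proposal
PAGE_SIZE = 4096   # assumed granted page size


def disk_size_bytes(sectors: int) -> int:
    """Disk size from the `sectors` node, always counted in 512b units."""
    return sectors * SECTOR_512


def segment_ok(first_sect: int, last_sect: int, sector_size: int) -> bool:
    """Check one segment; first_sect and last_sect are inclusive 512b indexes."""
    if sector_size < SECTOR_512 or sector_size % SECTOR_512:
        raise ValueError("sector-size must be a multiple of 512")
    if not 0 <= first_sect <= last_sect < PAGE_SIZE // SECTOR_512:
        return False
    offset = first_sect * SECTOR_512                    # offset inside the page
    length = (last_sect - first_sect + 1) * SECTOR_512  # bytes transferred
    return offset % sector_size == 0 and length % sector_size == 0


def request_ok(sector_number: int, segments: list[tuple[int, int]],
               sector_size: int) -> bool:
    """Accept a request only if its start and all segments are aligned."""
    start = sector_number * SECTOR_512  # assumption: start must be aligned too
    return start % sector_size == 0 and all(
        segment_ok(first, last, sector_size) for first, last in segments
    )


# On a 4K logical sector disk only whole-page segments pass:
print(segment_ok(0, 7, 4096))          # True
print(segment_ok(0, 3, 4096))          # False, only 2048 bytes long
print(request_ok(8, [(0, 7)], 4096))   # True, starts at byte 4096
print(request_ok(3, [(0, 7)], 4096))   # False, starts at byte 1536
```

With a 512b `sector-size` every in-page segment is accepted, so existing
512b setups behave as before under these checks.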

Regards, Roger.

[0] https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=67e1c050e36b2c9900cca83618e56189effbad98



 

