|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: Block protocol incompatibilities with 4K logical sector size disks
On Fri, Aug 30, 2024 at 04:09:25PM +0000, Anthony PERARD wrote:
> On Thu, Aug 29, 2024 at 05:42:45PM +0200, Roger Pau Monné wrote:
> > On Thu, Aug 29, 2024 at 01:15:42PM +0000, Anthony PERARD wrote:
> > > On Thu, Aug 29, 2024 at 12:59:43PM +0200, Roger Pau Monné wrote:
> > > > The following table attempts to summarize in which units the following
> > > > fields
> > > > are defined for the analyzed implementations (please correct me if I
> > > > got some
> > > > of this wrong):
> > > >
> > > > │ sectors xenbus node │ requests sector_number
> > > > │ requests {first,last}_sect
> > > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > > FreeBSD blk{front,back} │ sector-size │ sector-size
> > > > │ 512
> > > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > > Linux blk{front,back} │ 512 │ 512
> > > > │ 512
> > > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > > QEMU blkback │ sector-size │ sector-size
> > > > │ sector-size
> > > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > > Windows blkfront │ sector-size │ sector-size
> > > > │ sector-size
> > > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > > MiniOS │ sector-size │ 512
> > > > │ 512
> > > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > > tapdisk blkback │ 512 │ sector-size
> > > > │ 512
>
> Tapdisk situation seems more like:
>
> tapdisk blkback │ ?????????? │ ??????????? │
> ?????
>
> I've looks at the implementation at xapi-project/blktat[1] and the way
> sector_number or {first,last}_sect seems to be used varied on which
> backend is used (block-vhd, block-nbd, block-aio).
>
> [1] https://github.com/xapi-project/blktap
>
> block-vhd seems mostly sectors of 512 but recalculated with "s->spb"
> (sector per block?) but still, sector seems to be only 512.
>
> block-nbd seems to set "sector-size" to always 512, but uses
> "sector-size" for sector_number and {first,last}_sect.
>
> The weirdest one is block-aio, where on read it multiply sector_number
> and {first,last}_sect by 512, but on write, those are multiplied by
> "sector-size". With "sector-size" set by ioctl(BLKSSZGET)
>
> At least, is seems "sectors" is a multiple of 512 on all those, like in
> the table, but I've only look at those 3 "drivers".
You looked more than myself, I've just looked at the block-aio write
path I think, and I was happy enough to see it was yet different from
the other implementations.
I think it's clear enough that every implementation has slight
differences, and I don't plan to fix all of them.
>
> > > There's OVMF as well, which copied MiniOS's implementation, and looks
> > > like it's still the same as MiniOS for the table above:
> > >
> > > OVMF (base on MiniOS) │ sector-size │ 512 │
> > > 512
> > >
> > > >
> > > > It's all a mess, I'm surprised we didn't get more reports about
> > > > brokenness when
> > > > using disks with 4K logical sectors.
> > > >
> > > > Overall I think the in-kernel backends are more difficult to update (as
> > > > it
> > > > might require a kernel rebuild), compared to QEMU or blktap. Hence my
> > > > slight
> > > > preference would be to adjust the public interface to match the
> > > > behavior of
> > > > Linux blkback, and then adjust the implementation in the rest of the
> > > > backends
> > > > and frontends.
> > >
> > > I would add that making "sector-size" been different from 512 illegal
> > > makes going forward easier, has every implementation will work with a
> > > "sector-size" of 512, and it probably going to be the most common sector
> > > size for a while longer.
> >
> > My main concern is the amount of backends out there that already
> > expose a "sector-size" different than 512. I fear any changes here
> > will take time to propagate to in-kernel backends, and hence my
> > approach was to avoid modifying Linux blkback, because (as seen in the
> > FreeBSD bug report) there are already instances of 4K logical sector
> > disks being used by users. Modifying the frontends is likely easier,
> > as that's under the owner of the VM control.
> >
> > > > There was an attempt in 2019 to introduce a new frontend feature flag
> > > > to signal
> > > > whether the frontend supported `sector-size` xenstore nodes different
> > > > than 512 [0].
> > > > However that was only ever implemented for QEMU blkback and Windows
> > > > blkfront,
> > > > all the other backends will expose `sector-size` different than 512
> > > > without
> > > > checking if `feature-large-sector-size` is exposed by the frontend.
> > > > I'm afraid
> > > > it's now too late to retrofit that feature into existing backends,
> > > > seeing as
> > > > they already expose `sector-size` nodes greater than 512 without
> > > > checking if
> > > > `feature-large-sector-size` is reported by the frontend.
> > >
> > > Much before that, "physical-sector-size" was introduced (2013):
> > >
> > > https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=a67e2dac9f8339681b30b0f89274a64e691ea139
> > >
> > > Linux seems to implement it, but QEMU or OVMF don't have it.
> >
> > Yeah, I was aware of this, normal disks already have a physical sector
> > size (optimal sector size) and a logical sector size (minimal size
> > supported by the drive). Some implement a smaller logical than
> > physical sector size by doing read-modify-write.
> >
> > > > My proposal would be to adjust the public interface with:
> > > >
> > > > * Disk size is calculated as: `sectors` * 512 (`sectors` being the
> > > > contents of
> > > > such xenstore backend node).
> > > >
> > > > * All the sector related fields in blkif ring requests use a 512b base
> > > > sector
> > > > size, regardless of the value in the `sector-size` xenstore node.
> > > >
> > > > * The `sector-size` contains the disk logical sector size. The
> > > > frontend must
> > > > ensure that all request segments addresses are aligned and it's
> > > > length is
> > > > a multiple of such size. Otherwise the backend will refuse to
> > > > process the
> > > > request.
> > >
> > > You still want to try to have a "sector-size" different from 512? To me
> > > this just add confusion to the confusion. There would be no way fro
> > > backend or frontend to know if setting something other than 512 is going
> > > to work.
> >
> > But that's already the case, most (all?) backends except QEMU will set
> > "sector-size" to the underlying block storage logical sector size
>
> QEMU, only if feature-large-sector-size is set, indeed, otherwise it
> just return an error if it have to set "sector-size" to a value
> different from 512.
>
> Otherwise, yes for Linux, FreeBSD, and maybe yes for blktap. For blktap
> it seems to depend of the storage, more or less:
> - block-vhd: always "sector-size" = 512
> - block-nbd: always "sector-size" = 512
> - block-aio: physical storage sector size
>
> > without any way to tell if the frontend supports sector-sizes != 512.
> > So the issue is not inherently with the setting of the "sector-size"
> > node to a value different than 512, but rather how different
> > implementations have diverged regarding which is the base unit of
> > other fields.
> >
> > > Also, it is probably easier to update backend than frontend, so
> > > it is just likely that something is going to lag behind and broke.
> >
> > Hm, I'm not convinced, sometimes the owner of a VM has no control over
> > the version of the backends if it's not the admin of the host. OTOH
> > the owner of a VM could always update the kernel in order to
> > workaround such blkfront/blkback incompatibility issues. Hence my
> > preference was for solutions that didn't involve changing Linux
> > blkback, as I believe that's the most commonly used backend.
>
> Going the Linux way might be the least bad option indeed. sectors in
> requests has been described as a 512-bytes for a long while. It's only
> "sectors" that have been described as "sector-size"-bytes size.
>
> > > Why not make use of the node "physical-sector-size" that have existed
> > > for 10 years, even if unused or unadvertised, and if an IO request isn't
> > > aligned on it, it is just going to be slow (as backend would have to
> > > read,update,write instead of just write sectors).
> >
> > I don't really fancy implementing read-modify-write on the backends,
> > as it's going to add more complexity to blkback implementations,
> > specially the in-kernel ones I would assume.
> >
> > All frontends I've looked into support "sector-size" != 512, but
> > there's a lack of uniformity on whether other units used in the
> > protocol are based on the blkback exposed "sector-size", or hardcoded
> > to 512.
> >
> > So your suggestion would be to hardcode "sector-size" to 512 and use
> > the "physical-sector-size" node value to set the block device logical
> > sector size the frontends?
> >
> > If we go that route I would suggest that backends are free to refuse
> > requests that aren't a multiple of "physical-sector-size".
>
> After looking in more detail in the different implementations, and linux
> one, I don't think changing "physical-sector-size" meaning is going to
> be helpful.
>
> What to do about "feature-large-sector-size"? Should backend refuse to
> connect to the front end if that flag is set and "sector-size" want to
> be different than 512? This would just be Windows frontend I guess.
> (Just as an helper for updated backend)
I think it's likely too late to mandate exposing
"feature-large-sector-size" in the frontends, as for example Linux
blkfront will handle "sector-size" != 512, yet it doesn't expose
"feature-large-sector-size". If we retroactively push this change to
the backends we might break setups that were working fine
previously.
>
> So yes, after more research, having sector in the protocol been a
> 512-byte size seems the least bad option. "sector_number" and
> "{first,last}_sect" have been described as is for a long while. Only
> "sectors" for the size has been described as a "sector-size" quantity.
Thanks for your input. I would also like to hear from the blktap and
Windows PV drivers maintainers, as the change that I'm proposing here
will require changes to their implementations.
Thanks, Roger.
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |