[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 13/26] block: move cache control settings out of queue->flags

To: Christoph Hellwig <hch@xxxxxx>, Jens Axboe <axboe@xxxxxxxxx>
From: Hannes Reinecke <hare@xxxxxxx>
Date: Tue, 11 Jun 2024 11:58:19 +0200
Authentication-results: smtp-out2.suse.de; none
Cc: Geert Uytterhoeven <geert@xxxxxxxxxxxxxx>, Richard Weinberger <richard@xxxxxx>, Philipp Reisner <philipp.reisner@xxxxxxxxxx>, Lars Ellenberg <lars.ellenberg@xxxxxxxxxx>, Christoph Böhmwalder <christoph.boehmwalder@xxxxxxxxxx>, Josef Bacik <josef@xxxxxxxxxxxxxx>, Ming Lei <ming.lei@xxxxxxxxxx>, "Michael S. Tsirkin" <mst@xxxxxxxxxx>, Jason Wang <jasowang@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Alasdair Kergon <agk@xxxxxxxxxx>, Mike Snitzer <snitzer@xxxxxxxxxx>, Mikulas Patocka <mpatocka@xxxxxxxxxx>, Song Liu <song@xxxxxxxxxx>, Yu Kuai <yukuai3@xxxxxxxxxx>, Vineeth Vijayan <vneethv@xxxxxxxxxxxxx>, "Martin K. Petersen" <martin.petersen@xxxxxxxxxx>, linux-m68k@xxxxxxxxxxxxxxxxxxxx, linux-um@xxxxxxxxxxxxxxxxxxx, drbd-dev@xxxxxxxxxxxxxxxx, nbd@xxxxxxxxxxxxxxxx, linuxppc-dev@xxxxxxxxxxxxxxxx, ceph-devel@xxxxxxxxxxxxxxx, virtualization@xxxxxxxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx, linux-bcache@xxxxxxxxxxxxxxx, dm-devel@xxxxxxxxxxxxxxx, linux-raid@xxxxxxxxxxxxxxx, linux-mmc@xxxxxxxxxxxxxxx, linux-mtd@xxxxxxxxxxxxxxxxxxx, nvdimm@xxxxxxxxxxxxxxx, linux-nvme@xxxxxxxxxxxxxxxxxxx, linux-s390@xxxxxxxxxxxxxxx, linux-scsi@xxxxxxxxxxxxxxx, linux-block@xxxxxxxxxxxxxxx
Delivery-date: Tue, 11 Jun 2024 09:58:30 +0000
List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 6/11/24 07:19, Christoph Hellwig wrote:

Move the cache control settings into the queue_limits so that they
can be set atomically and all I/O is frozen when changing the
flags.

Add new features and flags field for the driver set flags, and internal
(usually sysfs-controlled) flags in the block layer.  Note that we'll
eventually remove enough field from queue_limits to bring it back to the
previous size.

The disable flag is inverted compared to the previous meaning, which
means it now survives a rescan, similar to the max_sectors and
max_discard_sectors user limits.

The FLUSH and FUA flags are now inherited by blk_stack_limits, which
simplified the code in dm a lot, but also causes a slight behavior
change in that dm-switch and dm-unstripe now advertise a write cache
despite setting num_flush_bios to 0.  The I/O path will handle this
gracefully, but as far as I can tell the lack of num_flush_bios
and thus flush support is a pre-existing data integrity bug in those
targets that really needs fixing, after which a non-zero num_flush_bios
should be required in dm for targets that map to underlying devices.

Signed-off-by: Christoph Hellwig <hch@xxxxxx>
---
  .../block/writeback_cache_control.rst         | 67 +++++++++++--------
  arch/um/drivers/ubd_kern.c                    |  2 +-
  block/blk-core.c                              |  2 +-
  block/blk-flush.c                             |  9 ++-
  block/blk-mq-debugfs.c                        |  2 -
  block/blk-settings.c                          | 29 ++------
  block/blk-sysfs.c                             | 29 +++++---
  block/blk-wbt.c                               |  4 +-
  drivers/block/drbd/drbd_main.c                |  2 +-
  drivers/block/loop.c                          |  9 +--
  drivers/block/nbd.c                           | 14 ++--
  drivers/block/null_blk/main.c                 | 12 ++--
  drivers/block/ps3disk.c                       |  7 +-
  drivers/block/rnbd/rnbd-clt.c                 | 10 +--
  drivers/block/ublk_drv.c                      |  8 ++-
  drivers/block/virtio_blk.c                    | 20 ++++--
  drivers/block/xen-blkfront.c                  |  9 ++-
  drivers/md/bcache/super.c                     |  7 +-
  drivers/md/dm-table.c                         | 39 +++--------
  drivers/md/md.c                               |  8 ++-
  drivers/mmc/core/block.c                      | 42 ++++++------
  drivers/mmc/core/queue.c                      | 12 ++--
  drivers/mmc/core/queue.h                      |  3 +-
  drivers/mtd/mtd_blkdevs.c                     |  5 +-
  drivers/nvdimm/pmem.c                         |  4 +-
  drivers/nvme/host/core.c                      |  7 +-
  drivers/nvme/host/multipath.c                 |  6 --
  drivers/scsi/sd.c                             | 28 +++++---
  include/linux/blkdev.h                        | 38 +++++++++--
  29 files changed, 227 insertions(+), 207 deletions(-)

diff --git a/Documentation/block/writeback_cache_control.rst 
b/Documentation/block/writeback_cache_control.rst
index b208488d0aae85..9cfe27f90253c7 100644
--- a/Documentation/block/writeback_cache_control.rst
+++ b/Documentation/block/writeback_cache_control.rst
@@ -46,41 +46,50 @@ worry if the underlying devices need any explicit cache 
flushing and how
  the Forced Unit Access is implemented.  The REQ_PREFLUSH and REQ_FUA flags
  may both be set on a single bio.

+Feature settings for block drivers

+----------------------------------

-Implementation details for bio based block drivers

---------------------------------------------------------------
+For devices that do not support volatile write caches there is no driver
+support required, the block layer completes empty REQ_PREFLUSH requests before
+entering the driver and strips off the REQ_PREFLUSH and REQ_FUA bits from
+requests that have a payload.

-These drivers will always see the REQ_PREFLUSH and REQ_FUA bits as they sit

-directly below the submit_bio interface.  For remapping drivers the REQ_FUA
-bits need to be propagated to underlying devices, and a global flush needs
-to be implemented for bios with the REQ_PREFLUSH bit set.  For real device
-drivers that do not have a volatile cache the REQ_PREFLUSH and REQ_FUA bits
-on non-empty bios can simply be ignored, and REQ_PREFLUSH requests without
-data can be completed successfully without doing any work.  Drivers for
-devices with volatile caches need to implement the support for these
-flags themselves without any help from the block layer.
+For devices with volatile write caches the driver needs to tell the block layer
+that it supports flushing caches by setting the

+ BLK_FEAT_WRITE_CACHE-Implementation details for request_fn based block drivers

----------------------------------------------------------
+flag in the queue_limits feature field.  For devices that also support the FUA
+bit the block layer needs to be told to pass on the REQ_FUA bit by also setting
+the

-For devices that do not support volatile write caches there is no driver

-support required, the block layer completes empty REQ_PREFLUSH requests before
-entering the driver and strips off the REQ_PREFLUSH and REQ_FUA bits from
-requests that have a payload.  For devices with volatile write caches the
-driver needs to tell the block layer that it supports flushing caches by
-doing::
+   BLK_FEAT_FUA
+
+flag in the features field of the queue_limits structure.
+
+Implementation details for bio based block drivers
+--------------------------------------------------
+
+For bio based drivers the REQ_PREFLUSH and REQ_FUA bit are simplify passed on
+to the driver if the drivers sets the BLK_FEAT_WRITE_CACHE flag and the drivers
+needs to handle them.
+
+*NOTE*: The REQ_FUA bit also gets passed on when the BLK_FEAT_FUA flags is
+_not_ set.  Any bio based driver that sets BLK_FEAT_WRITE_CACHE also needs to
+handle REQ_FUA.

- blk_queue_write_cache(sdkp->disk->queue, true, false);

+For remapping drivers the REQ_FUA bits need to be propagated to underlying
+devices, and a global flush needs to be implemented for bios with the
+REQ_PREFLUSH bit set.

-and handle empty REQ_OP_FLUSH requests in its prep_fn/request_fn. Note that

-REQ_PREFLUSH requests with a payload are automatically turned into a sequence
-of an empty REQ_OP_FLUSH request followed by the actual write by the block
-layer.  For devices that also support the FUA bit the block layer needs
-to be told to pass through the REQ_FUA bit using::
+Implementation details for blk-mq drivers
+-----------------------------------------

- blk_queue_write_cache(sdkp->disk->queue, true, true);

+When the BLK_FEAT_WRITE_CACHE flag is set, REQ_OP_WRITE | REQ_PREFLUSH requests
+with a payload are automatically turned into a sequence of a REQ_OP_FLUSH
+request followed by the actual write by the block layer.

-and the driver must handle write requests that have the REQ_FUA bit set

-in prep_fn/request_fn.  If the FUA bit is not natively supported the block
-layer turns it into an empty REQ_OP_FLUSH request after the actual write.
+When the BLK_FEA_FUA flags is set, the REQ_FUA bit simplify passed on for the
+REQ_OP_WRITE request, else a REQ_OP_FLUSH request is sent by the block layer
+after the completion of the write request for bio submissions with the REQ_FUA
+bit set.
diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
index cdcb75a68989dd..19e01691ea0ea7 100644
--- a/arch/um/drivers/ubd_kern.c
+++ b/arch/um/drivers/ubd_kern.c
@@ -835,6 +835,7 @@ static int ubd_add(int n, char **error_out)
        struct queue_limits lim = {
                .max_segments           = MAX_SG,
                .seg_boundary_mask      = PAGE_SIZE - 1,
+               .features               = BLK_FEAT_WRITE_CACHE,
        };
        struct gendisk *disk;
        int err = 0;
@@ -882,7 +883,6 @@ static int ubd_add(int n, char **error_out)
        }

blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);

-       blk_queue_write_cache(disk->queue, true, false);
        disk->major = UBD_MAJOR;
        disk->first_minor = n << UBD_SHIFT;
        disk->minors = 1 << UBD_SHIFT;
diff --git a/block/blk-core.c b/block/blk-core.c
index 82c3ae22d76d88..2b45a4df9a1aa1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -782,7 +782,7 @@ void submit_bio_noacct(struct bio *bio)
                if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE &&
                                 bio_op(bio) != REQ_OP_ZONE_APPEND))
                        goto end_io;
-               if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
+               if (!bdev_write_cache(bdev)) {
                        bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
                        if (!bio_sectors(bio)) {
                                status = BLK_STS_OK;
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 2234f8b3fc05f2..30b9d5033a2b85 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -381,8 +381,8 @@ static void blk_rq_init_flush(struct request *rq)
  bool blk_insert_flush(struct request *rq)
  {
        struct request_queue *q = rq->q;
-       unsigned long fflags = q->queue_flags;       /* may change, cache */
        struct blk_flush_queue *fq = blk_get_flush_queue(q, rq->mq_ctx);
+       bool supports_fua = q->limits.features & BLK_FEAT_FUA;


Shouldn't we have a helper like blk_feat_fua() here?

        unsigned int policy = 0;

/* FLUSH/FUA request must never be merged */

@@ -394,11 +394,10 @@ bool blk_insert_flush(struct request *rq)
        /*
         * Check which flushes we need to sequence for this operation.
         */
-       if (fflags & (1UL << QUEUE_FLAG_WC)) {
+       if (blk_queue_write_cache(q)) {
                if (rq->cmd_flags & REQ_PREFLUSH)
                        policy |= REQ_FSEQ_PREFLUSH;
-               if (!(fflags & (1UL << QUEUE_FLAG_FUA)) &&
-                   (rq->cmd_flags & REQ_FUA))
+               if ((rq->cmd_flags & REQ_FUA) && !supports_fua)
                        policy |= REQ_FSEQ_POSTFLUSH;
        }

@@ -407,7 +406,7 @@ bool blk_insert_flush(struct request *rq)

         * REQ_PREFLUSH and FUA for the driver.
         */
        rq->cmd_flags &= ~REQ_PREFLUSH;
-       if (!(fflags & (1UL << QUEUE_FLAG_FUA)))
+       if (!supports_fua)
                rq->cmd_flags &= ~REQ_FUA;

/*

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 770c0c2b72faaa..e8b9db7c30c455 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -93,8 +93,6 @@ static const char *const blk_queue_flag_name[] = {
        QUEUE_FLAG_NAME(INIT_DONE),
        QUEUE_FLAG_NAME(STABLE_WRITES),
        QUEUE_FLAG_NAME(POLL),
-       QUEUE_FLAG_NAME(WC),
-       QUEUE_FLAG_NAME(FUA),
        QUEUE_FLAG_NAME(DAX),
        QUEUE_FLAG_NAME(STATS),
        QUEUE_FLAG_NAME(REGISTERED),
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f11c8676eb4c67..536ee202fcdccb 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -261,6 +261,9 @@ static int blk_validate_limits(struct queue_limits *lim)
                lim->misaligned = 0;
        }

+ if (!(lim->features & BLK_FEAT_WRITE_CACHE))

+               lim->features &= ~BLK_FEAT_FUA;
+
        err = blk_validate_integrity_limits(lim);
        if (err)
                return err;
@@ -454,6 +457,8 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
  {
        unsigned int top, bottom, alignment, ret = 0;

+ t->features |= (b->features & BLK_FEAT_INHERIT_MASK);

+
        t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
        t->max_user_sectors = min_not_zero(t->max_user_sectors,
                        b->max_user_sectors);
@@ -711,30 +716,6 @@ void blk_set_queue_depth(struct request_queue *q, unsigned 
int depth)
  }
  EXPORT_SYMBOL(blk_set_queue_depth);

-/**

- * blk_queue_write_cache - configure queue's write cache
- * @q:         the request queue for the device
- * @wc:                write back cache on or off
- * @fua:       device supports FUA writes, if true
- *
- * Tell the block layer about the write cache of @q.
- */
-void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
-{
-       if (wc) {
-               blk_queue_flag_set(QUEUE_FLAG_HW_WC, q);
-               blk_queue_flag_set(QUEUE_FLAG_WC, q);
-       } else {
-               blk_queue_flag_clear(QUEUE_FLAG_HW_WC, q);
-               blk_queue_flag_clear(QUEUE_FLAG_WC, q);
-       }
-       if (fua)
-               blk_queue_flag_set(QUEUE_FLAG_FUA, q);
-       else
-               blk_queue_flag_clear(QUEUE_FLAG_FUA, q);
-}
-EXPORT_SYMBOL_GPL(blk_queue_write_cache);
-
  int bdev_alignment_offset(struct block_device *bdev)
  {
        struct request_queue *q = bdev_get_queue(bdev);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 5c787965b7d09e..4f524c1d5e08bd 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -423,32 +423,41 @@ static ssize_t queue_io_timeout_store(struct 
request_queue *q, const char *page,

static ssize_t queue_wc_show(struct request_queue *q, char *page)

  {
-       if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
-               return sprintf(page, "write back\n");
-
-       return sprintf(page, "write through\n");
+       if (q->limits.features & BLK_FLAGS_WRITE_CACHE_DISABLED)


Where is the difference between 'flags' and 'features'?
Ie why is is named BLK_FEAT_FUA but BLK_FLAGS_WRITE_CACHE_DISABLED?
And if the feature is the existence of a capability, and the flag is
the setting of that capability, can you make it clear in the documentation?

Cheers,

Hannes
--
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@xxxxxxx                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

Follow-Ups:
- Re: [PATCH 13/26] block: move cache control settings out of queue->flags
  - From: Christoph Hellwig

References:
- move features flags into queue_limits
  - From: Christoph Hellwig
- [PATCH 13/26] block: move cache control settings out of queue->flags
  - From: Christoph Hellwig

Prev by Date: Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()
Next by Date: Re: [PATCH v2 3/7] x86/irq: limit interrupt movement done by fixup_irqs()
Previous by thread: Re: [PATCH 13/26] block: move cache control settings out of queue->flags
Next by thread: Re: [PATCH 13/26] block: move cache control settings out of queue->flags
Index(es):
- Date
- Thread

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.