
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance


  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Mon, 30 Jun 2014 10:04:05 +0100
  • Delivery-date: Mon, 30 Jun 2014 09:04:19 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 06/29/2014 06:07 AM, lee wrote:
Gordan Bobic <gordan@xxxxxxxxxx> writes:

On 06/28/2014 12:25 PM, lee wrote:
Kuba <kuba.0000@xxxxx> writes:

SSD caching
means two extra disks for the cache (or what happens when the cache disk
fails?),

For ZIL (write caching), yes, you can use a mirrored device. For read
caching it obviously doesn't matter.

That's not so obvious --- when the read cache fails, ZFS would
automatically have to resort to the disks.
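
Right - if the L2ARC (read cache) device dies, reads simply fall back to the pool disks; only the log device really benefits from a mirror. For the record, a minimal sketch of attaching both to an existing pool (pool name and device paths are placeholders):

  # mirrored SLOG - survives the loss of one of the two SSDs
  zpool add tank log mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B

  # L2ARC read cache - no redundancy needed
  zpool add tank cache /dev/disk/by-id/ata-SSD_C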

and ZFS doesn't increase the number of SAS/SATA ports you have.

No, but it does deprecate the RAID and caching parts of a controller,

Why does it deprecate them?

Because its RAID is far more advanced, and it makes far better use of the caches built into the disks. For example, ZFS parity RAID (n+1, n+2, n+3) avoids the parity RAID write hole, where a partial stripe update requires two separate operations:

1) Write of the data being committed, plus a read of the rest of the stripe
2) Write of the updated parity block

Since ZFS uses variable-width stripes, every write is a single full-stripe operation.
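
To put names to those levels: n+1, n+2 and n+3 correspond to raidz1, raidz2 and raidz3 when the pool is created. Purely as an illustration, with placeholder device names (the three lines are alternatives, not a sequence):

  zpool create tank raidz1 sdb sdc sdd sde                  # n+1, single parity
  zpool create tank raidz2 sdb sdc sdd sde sdf sdg          # n+2, double parity
  zpool create tank raidz3 sdb sdc sdd sde sdf sdg sdh sdi  # n+3, triple parity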

so you might as well just use an HBA (cheaper). Covering the whole
stack, ZFS can also make much better use of on-disk caches (my 4TB
HGSTs have 64MB of RAM each. If you have 20 of them on a 4-port SATA
card with a 5-port multiplier on each port,

There are multipliers for SATA ports?

Yes.

Can you connect SAS disks to them as well?

No. You cannot plug SAS disks into SATA ports, with or without multipliers.

Do the disks show up individually or bundled when you use one?

Individually - unless you get a more advanced multiplier that makes them into one big logical RAID-ed device - but don't do that. :)

Aren't they getting into each others ways, filling up the bandwidth of
the port?

If your HBA/PMP/HDD don't support FIS-based switching and NCQ, then yes: operations cannot be multiplexed, and performance degrades with every added disk, since every operation holds the bus until it completes.

If your HBA/PMP/HDD do support FIS-based switching and NCQ (the models I mentioned do), then the bandwidth is effectively multiplexed on demand. It works a bit like VLAN tagging. A command gets issued, but while that command is completing (~8ms on a 7200rpm disk) you can issue commands to other disks, so multiple commands on multiple disks can be completing at the same time. As each disk completes its command and returns data, the same happens in reverse. Because the commands get interleaved this way, as you add more disks you increase the upstream port utilization (up to its capacity, if your command mix can saturate that much bandwidth).

Saturating the upstream port is only really an issue if all of your disks are doing large linear transfers. With typical I/O patterns, you spend most of the time waiting for the rotational latency, so in realistic use, the fact that you don't have a dedicated port's worth of bandwidth for each disk doesn't matter as much.

SAS expanders work in a similar way.
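
If you want to sanity-check that NCQ is actually in use on a given drive under Linux, something like the following will show it (standard hdparm and sysfs; the device name is a placeholder):

  hdparm -I /dev/sda | grep -i -e "queue depth" -e ncq   # what the drive advertises
  cat /sys/block/sda/device/queue_depth                  # depth the kernel is using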

How does it do the checksumming?

Every block is checksummed, and the checksum is stored and checked on every
read of that block. In addition, every block (including its checksum)
is encoded for whatever extra redundancy is specified (e.g. mirroring or n+1,
n+2 or n+3). So if you read the block, you also read the checksum
stored with it, and if it checks out, you hand the data to the app
with nothing else to be done. If the checksum doesn't match the data
(silent corruption), or the read of one of the disks containing a piece of
the block fails (non-silent corruption, e.g. a failed sector), ZFS will go
and

And? Correct the error?

Sorry, did that get truncated?
Yes, indeed: ZFS will initiate recovery, find a combination of blocks which, when assembled, matches the checksum, return the data, recalculate the damaged block and write it back to the disk that didn't return the correct data.
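
You can watch this happening via the per-device error counters; a quick sketch, assuming a pool called "tank":

  zpool status -v tank   # READ/WRITE/CKSUM counters per device, plus any files
                         # that could not be repaired from redundancy
  zpool clear tank       # reset the counters once you've investigated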

So it's like RAID built into the file system?  What about all the CPU
overhead?

It's expensive - for a file system. In reality, I use a number of 1.3GHz N36L HP Microservers with 4-8 disks each in RAIDZ2 (n+2 redundancy, similar to RAID6), and even during weekly disk scrubs they never get anywhere near running out of CPU.
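
The weekly scrub is nothing more than a cron job; a sketch (pool name and path assumed):

  # /etc/cron.d/zfs-scrub: re-read and verify every block, Sundays at 03:00
  0 3 * * 0  root  /sbin/zpool scrub tank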

Read everything after it's been written to verify?

No, just written with a checksum on the block and encoded for extra
redundancy.

That means you don't really know whether the data has been written as
expected before it's read.

No, you don't - but you don't with any kind of RAID either. The only feature available for that is the disk's own Write-Read-Verify (WRV) feature, which most disks don't support. But at least with ZFS you get a decent chance of getting the data back intact even if some of it ended up as a phantom write.
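
If you do have a drive that supports WRV, a sufficiently recent hdparm (or one carrying the patch I mention below) can query and toggle it. Treat the exact option syntax as approximate and check your man page:

  hdparm -I /dev/sda | grep -i verify     # does the drive list the Write-Read-Verify
                                          # feature set at all?
  hdparm --write-read-verify 1 /dev/sda   # enable it; the mode argument follows the
                                          # ATA WRV modes, so check your hdparm version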

If you have Seagate disks that support the feature you can
enable Write-Read-Verify at disk level. I wrote a patch for hdparm for
toggling the feature.

Only 4 small SAS disks are Seagates (I only put two of them in), the
rest is WD SATAs --- and I'm starting to suspect that the RAID
controller in the server doesn't like the WD disks at all, which causes
the crashes.  Those disks weren't made at all for this application.

This is another problem with clever controllers, especially hardware RAID. RAID controllers typically wait around 8-9 seconds for the disk to return the data; if it doesn't, they kick the disk out of the array. A while back, most disks shipped with Time Limited Error Recovery (TLER). Nowadays most disks ship with feature-crippled firmware so that manufacturers can charge extra for disks which have TLER enabled (e.g. WD Reds do, other WDs don't).

My 1TB drives have it (I got some from all manufacturers, but most are up to 5 years old). Recent drives generally don't, unless they are the ones marketed for NAS applications. HGST is one exception to the rule - I have a bunch of their 4TB drives, and they only make one 4TB model, which has TLER. Most other manufacturers make multiple variants of the same drive, and most are selectively feature-crippled.
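
Whether TLER/ERC is enabled on a given drive, and whether it can be switched on at all, is easy to check with smartmontools (device name is a placeholder):

  smartctl -l scterc /dev/sda         # query the SCT Error Recovery Control timers
  smartctl -l scterc,70,70 /dev/sda   # set read/write limits to 7.0s (units of 100ms);
                                      # desktop drives often refuse this or forget it
                                      # after a power cycle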


_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

