
Re: AW: AW: [Xen-API] SG_IO for iscsi targets in XCP

To: Uli Stärk <Uli.Staerk@xxxxxxxxxxxxxx>, "xen-api@xxxxxxxxxxxxxxxxxxx" <xen-api@xxxxxxxxxxxxxxxxxxx>
From: George Shuklin <george.shuklin@xxxxxxxxx>
Date: Wed, 20 Jul 2011 01:42:24 +0400
Cc:
Delivery-date: Tue, 19 Jul 2011 14:42:12 -0700
List-id: Discussion of API issues surrounding Xen <xen-api.lists.xensource.com>

Well..

I don't quite understand your reasons. Are you talking about online or offline split brain? As I said earlier, online split brain can be prevented by using the same network adapter for replication and for serving clients (if the link is lost, there is no replication, no new write operations, and no 'inconsistent' read operations).

Offline split brain can be prevented by manual startup (the host boots without the DRBD and iSCSI services active). If only one server has been rebooted, then clients are served by the second server. If both of them go down, you need to find the most recent node (manually, with help from the DRBD sync process) and bring them up after the resync (you are already down, so a little more time will not make a drastic difference).

The main reason I want primary/primary DRBD is the doubled number of devices serving reads; this should really reduce the load. I expect a very significant difference... And one more small point: in primary/primary mode some XCP hosts go to one target and others to the second. If one of the nodes fails, only half of the customers will see a fairly long lag before switching over.


On 19.07.2011 19:10, Uli Stärk wrote:

A SAN replication is not good enough, because of the giant raidsets. There is so
much (random) workload on the disks that a re-sync won't exceed 100 MB/s. We usually
get about 50 MB/s if we don't want to affect the running applications. Our
raidsets would take more than a week to synchronize/verify :( We must be able
to replicate smaller sets of data. So we use DRBD for replicating data,
as you suggested for SANs.
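
To put the "more than a week" figure in perspective, here is a small back-of-the-envelope sketch. The message does not state the raidset size, so the code only derives how much data a week of resync covers at the quoted rates. The function names and the use of decimal units (1 TB = 1,000,000 MB) are assumptions made for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def resync_days(size_tb: float, rate_mb_s: float) -> float:
    """Days needed to copy size_tb terabytes at rate_mb_s MB/s (decimal units)."""
    size_mb = size_tb * 1_000_000
    return size_mb / rate_mb_s / SECONDS_PER_DAY


def tb_per_week(rate_mb_s: float) -> float:
    """Terabytes that can be resynced in seven days at rate_mb_s MB/s."""
    return rate_mb_s * SECONDS_PER_DAY * 7 / 1_000_000


# Rates quoted in the message: about 50 MB/s in practice, 100 MB/s at best.
print(tb_per_week(50))   # 30.24
print(tb_per_week(100))  # 60.48
```

So a resync that takes more than a week at 50 MB/s implies a raidset of roughly 30 TB or more, which is why replicating smaller sets of data is attractive.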

In our experience, there are several service interruptions on redundant
WAN connections. You can't avoid this! Usually the service interruptions are very
short (less than 5 minutes). Each interruption would trigger a failover process in
a master-master setup and put it into split-brain mode. In this case you will lose
data if you discard the changes on one node. Losing data is usually the worst thing
that can happen. A merge is usually not possible cost-effectively (duplicate
database keys, etc.). A short service interruption is within the SLA and we don't
lose data. If we can predict that a service interruption will take more than a few
minutes, we fail over to the second site. Usually this happens if the datacenter
burns to the ground or a redundant server or networking component fails. This
usually happens less than once a year ;)

IMHO a master-master setup can only be recommended if you have no real
networking between the nodes and use it for higher performance than a single node
can offer. In all other cases, use it for backup, and a backup should be a
master-slave setup.


-----Original Message-----
From: George Shuklin [mailto:george.shuklin@xxxxxxxxx]
Sent: Tuesday, 19 July 2011 16:09
To: Uli Stärk
Subject: Re: AW: [Xen-API] SG_IO for iscsi targets in XCP

There are two types of split-brain: online and offline.

Offline split-brain:

Two primary/primary (p/p) nodes are online.
The first goes down; the second primary operates for some time; the second goes down;
the first comes up [stage1]; the second comes up and finds that it conflicts with the first. [stage2]

This situation is rather bad. In stage2 we will need to discard all data on the
second node, and the problem actually starts at stage1, when we 'go to the past' by
bringing up the older machine.

In this situation we can: go down again and replicate all data from the second to the
first (we lose the 'time fork' we created during the second period of StandAlone operation).
OR
simply resync the second from the first and continue to operate in the 'past fork',
rolling the state back to the moment the first went down and forgetting everything the second did.
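
The offline sequence above can be sketched as a toy model. This is not how DRBD tracks state internally; each node here simply keeps a list of writes, and the names `Node`, `diverged` and `resolve` are hypothetical helpers invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    log: list = field(default_factory=list)  # writes applied on this node
    online: bool = True

    def write(self, item):
        if not self.online:
            raise RuntimeError(f"{self.name} is offline")
        self.log.append(item)


def is_prefix(a, b):
    return len(a) <= len(b) and b[:len(a)] == a


def diverged(a, b):
    """True when neither history is a prefix of the other (split brain)."""
    return not is_prefix(a.log, b.log) and not is_prefix(b.log, a.log)


def resolve(winner, loser):
    """Overwrite loser with winner's data and return the writes that are lost."""
    common = 0
    for x, y in zip(winner.log, loser.log):
        if x != y:
            break
        common += 1
    lost = loser.log[common:]
    loser.log = list(winner.log)
    return lost


first, second = Node("first"), Node("second")
for node in (first, second):
    node.write("w1")       # both primaries online, write is replicated
first.online = False       # first goes down
second.write("w2")         # second operates alone for some time
second.online = False      # second goes down
first.online = True        # [stage1] first comes up with old data
first.write("w3")          # ... and accepts new writes in the 'past fork'
second.online = True       # [stage2] second comes up and finds the conflict

conflict = diverged(first, second)
lost = resolve(winner=second, loser=first)  # option 1: keep second's data
```

With option 1 the write made in stage1 (`w3`) is thrown away; swapping the arguments to `resolve` gives option 2, where the second node's standalone write (`w2`) is lost instead. Either way one fork of history disappears.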

All those problems can be solved by manual disaster recovery. If one of the
servers goes down, then when it comes back it must be started manually. In a normal
datacenter, downtime is usually attended by staff anyway.

The second case is 'online' split-brain.

DRBD requires a link between the 'heads'. If this link goes down, both heads
start to think that the remote node is down and continue to operate independently.
(If we say 'go down if the remote is disconnected', we kill any fault
tolerance in DRBD, and there is no reason to do p/p DRBD at all.)
In this case we face horrible, complete data loss: some data goes to
one node, some to the other, and if we use load balancing, we can shut down the storage
and say 'oops, sorry guys, no more data'.

Even a dedicated cable between the DRBD hosts does not save us from the constant fear of
online split brain.
What if some asshole unplugs it?
... or it simply gets pulled out while moving equipment (someone drops something heavy?)
What if the network card or cable dies?
What if someone runs 'ethX down' by mistake on one of the servers?

None of those cases is a 'sorry, we have 36hr downtime'; they are all 'sh.t,
everything is lost'.

And there is a simple and elegant solution to all these fears: use the SAN network for
replication (the same interface for replication and iSCSI serving).

If you have enough bandwidth (10G usually does), this solves everything:

If some link, cable, network card and so on goes down, that host stops serving
clients. No IO, no new data, no problems with data corruption.
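
The argument can be sketched as a tiny truth-table model. It covers only the failures listed above (a node's own link, cable or NIC going down) and is an illustration of the reasoning, not a test of real DRBD or iSCSI behaviour; the function name and parameters are hypothetical.

```python
def split_brain_possible(shared_interface: bool,
                         replication_link_up: bool,
                         client_link_up: bool) -> bool:
    """Can this node keep accepting client writes while replication is cut?"""
    if shared_interface:
        # One physical interface carries both kinds of traffic,
        # so the two paths can only be up or down together.
        both_up = replication_link_up and client_link_up
        replication_link_up = client_link_up = both_up
    return client_link_up and not replication_link_up


# Dedicated replication cable is unplugged, clients still reach the node:
print(split_brain_possible(False, replication_link_up=False, client_link_up=True))  # True

# Same failure when replication and iSCSI share the interface:
print(split_brain_possible(True, replication_link_up=False, client_link_up=True))   # False
```

With a shared interface the dangerous combination (clients writing, replication down) cannot occur in this model, because the node that lost its link receives no IO at all.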


So I think dual head is possible in the case of XCP. Its specific architecture allows
this. (I hope so; I'll test and report later.)

On Tue, 19/07/2011 at 12:50 +0000, Uli Stärk wrote:

My 5 cents: In real-world applications a split-brain will cause so
much work/trouble (and even service-interruption) that most admins
here will not consider using a dual-primary configuration ;)

-----Original Message-----
From: xen-api-bounces@xxxxxxxxxxxxxxxxxxx
[mailto:xen-api-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of George
Shuklin
Sent: Tuesday, 19 July 2011 14:34
To: Dave Scott
Cc: xen-api@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-API] SG_IO for iscsi targets in XCP

Thank you very much.

I feel safer now with a dual-primary DRBD configuration. I'll report the results
of a practical deployment with real-life load later.

On Tue, 19/07/2011 at 12:21 +0100, Dave Scott wrote:

Hi George,

XCP just uses shared LVM over iSCSI as a generic block device. This is only safe because 
(i) we modified LVM to run in a "read-only" mode on slaves; and (ii) we 
co-ordinate all LVM metadata updates across the pool in the XCP storage layer.

I'm investigating whether XCP issues any SCSI commands
such as reservation or persistent reservation. I grepped the
source code for the SG_IO ioctl() and found just a few innocent inquiry/id requests.

Just to be sure: are any SCSI-specific features used in XCP for
cluster management or resource locking? Or is iSCSI used only as a
generic block device with LVM?
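
The source search described above could be reproduced with something like the following sketch. The set of file extensions and the function name are assumptions; the message does not say which files or tools were actually used.

```python
import os
import re

# Match the SG_IO identifier as a whole word (not e.g. SG_IO_HDR).
PATTERN = re.compile(r"\bSG_IO\b")


def find_sg_io(root, exts=(".c", ".h", ".py", ".ml")):
    """Return (path, line number, line text) for every SG_IO mention under root."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    if PATTERN.search(line):
                        hits.append((path, lineno, line.strip()))
    return hits
```

Calling `find_sg_io()` on a checked-out source tree lists each call site, which then has to be read by hand to see which SCSI command is being sent.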



_______________________________________________
xen-api mailing list
xen-api@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/mailman/listinfo/xen-api



