Re: [Xen-devel] Re: Interdomain comms



I like it.  To start with, local communication only would be fine.  Eventually 
it would scale neatly to things like remote device access.

I particularly like the abstraction for remote memory - this would be an 
excellent fit to take advantage of RDMA where available (e.g. a cluster 
running on an IB fabric).

Cheers,
Mark

On Friday 06 May 2005 13:14, Harry Butterworth wrote:
> On Fri, 2005-05-06 at 08:46 +0100, Mike Wray wrote:
> > Harry Butterworth wrote:
> > > The current overhead in terms of client code to establish an entity on
> > > the xen inter-domain communication "bus" is of the order of 1000
> > > statements (counting FE, BE and a slice of xend).  A better
> > > inter-domain communication API could reduce this to fewer than 10
> > > statements.  If it's not done by the time I finish the USB work, I will
> > > hopefully be allowed to help with this.
> >
> > This reminded me you had suggested a different model for inter-domain
> > comms. I recently suggested a more socket-like API but it didn't go down
> > well.
>
> What exactly were the issues with the socket-like proposal?
>
> > I agree with you that the event channel model could be improved -
> > what kind of comms model do you suggest?
>
> The event-channel and shared memory page are fine as low-level
> primitives to implement a comms channel between domains on the same
> physical machine. The problem is that the primitives are unnecessarily
> low-level from the client's perspective and result in too much
> per-client code.
>
> The inter-domain communication API should preserve the efficiency of
> these primitives but provide a higher level API which is more convenient
> to use.
>
> Another issue with the current API is that, in the future, it is likely
> (for a number of virtual-iron/fault-tolerant-virtual-machine-like
> reasons) that it will be useful for the inter-domain communication API
> to span physical nodes in a cluster. The problem with the current API is
> that it directly couples the clients to a shared-memory implementation
> with a direct connection between the front- and back-end domains, so the
> clients would all need to be rewritten if the implementation were to
> span physical machines or require indirection. Eventually I would expect
> the effort invested in the clients of the inter-domain API to equal or
> exceed the effort invested in the hypervisor, in the same way that the
> Linux device drivers make up the bulk of the Linux kernel code. There is
> therefore a risk that this might become a significant architectural
> limitation.
>
> So, I think we're looking for a higher-level API which can preserve the
> current efficient implementation for domains resident on the same
> physical machine but allows for domains to be separated by a network
> interface without having to rewrite all the drivers.
>
> The API needs to address the following issues:
>
> Resource discovery --- Discovering the targets of IDC is an inherent
> requirement.
>
> Dynamic behaviour --- Domains are going to come and go all the time.
>
> Stale communications --- When domains come and go, client protocols must
> have a way to recover from communications in flight or potentially in
> flight from before the last transition.
>
> Deadlock --- IDC is a shared resource and must not introduce resource
> deadlock issues, for example when FEs and BEs are arranged symmetrically
> in reverse across the same interface or when BEs are stacked and so
> introduce chains of dependencies.
>
> Security --- There are varying degrees of trust between the domains.
>
> Ease of use --- This is important for developer productivity and also to
> help ensure the other goals (security/robustness) are actually met.
>
> Efficiency/Performance --- obviously.
>
> I'd need a few days (which I don't have right now) to put together a
> coherent proposal tailored specifically to xen.  However, it would
> probably be along the lines of the following:
>
> A buffer abstraction to decouple the IDC API from the memory management
> implementation:
>
> struct local_buffer_reference;
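>
> One possible shape for this (purely illustrative; the fragment-list
> representation below is my assumption, not part of the proposal) would
> be a page-granular scatter-gather list:
>
> struct local_buffer_fragment
> {
>     unsigned long frame;   /* page frame backing this fragment */
>     unsigned int  offset;  /* byte offset within the page */
>     unsigned int  length;  /* bytes used in this fragment */
> };
>
> struct local_buffer_reference
> {
>     unsigned int                   fragment_count;
>     struct local_buffer_fragment * fragments;
> };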
>
> An endpoint abstraction to represent one end of an IDC connection.  It's
> important that this is done on a per-connection basis rather than having
> one endpoint per domain for all IDC activity, because that avoids
> deadlock issues arising from chained, dependent communication.
>
> struct idc_endpoint;
>
> A message abstraction because some protocols are more efficiently
> implemented using one-way messages than request-response pairs,
> particularly when the protocol involves more than two parties.
>
> struct idc_message
> {
>     ...
>     struct local_buffer_reference message_body;
> };
>
> /* When a received message is finished with */
>
> void idc_message_complete( struct idc_message * message );
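>
> For example (a sketch only; the example_* and process_notification names
> are invented for illustration), a back end's handle_message callback
> might look like:
>
> static void example_handle_message
>   ( struct idc_endpoint * endpoint, struct idc_message * message )
> {
>     /* Interpret the message body using the device-specific format. */
>     process_notification( message->message_body );
>
>     /* Hand the message resource back to the IDC layer when done. */
>     idc_message_complete( message );
> }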
>
> A request-response transaction abstraction because most protocols are
> more easily implemented with these.
>
> struct idc_transaction
> {
>     ...
>     struct local_buffer_reference transaction_parameters;
>     struct local_buffer_reference transaction_status;
> };
>
> /* Useful to have an error code in addition to status.  */
>
> /* When a received transaction is finished with. */
>
> void idc_transaction_complete
>   ( struct idc_transaction * transaction, error_code error );
>
> /* When an initiated transaction completes. Error code also reports
> transport errors when endpoint disconnects whilst transaction is
> outstanding. */
>
> error_code idc_transaction_query_error_code
>   ( struct idc_transaction * transaction );
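>
> For example (again a sketch; the service_request helper is invented), a
> back end would service a received transaction and then complete it:
>
> static void example_handle_transaction
>   ( struct idc_endpoint * endpoint, struct idc_transaction * transaction )
> {
>     /* Decode transaction_parameters, do the work and fill in
>        transaction_status with the device-specific result... */
>     error_code error = service_request( transaction );
>
>     /* ...then hand the transaction back to the IDC layer. */
>     idc_transaction_complete( transaction, error );
> }
>
> On the initiating side, once notified that the transaction has
> completed, the client calls idc_transaction_query_error_code to
> distinguish transport failures from device-specific status.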
>
> An IDC address abstraction:
>
> struct idc_address;
>
> A mechanism to initiate connection establishment; this can't fail
> because the endpoint resource is pre-allocated and create doesn't
> actually need to establish the connection.
>
> The endpoint calls the registered notification functions as follows:
>
> 'appear' when the remote endpoint is discovered, then 'disappear' if it
> goes away again, or 'connect' if a connection is actually established.
>
> After 'connect', the client can submit messages and transactions.
>
> 'disconnect' when the connection is failing; the client must wait for
> outstanding messages and transactions to complete (successfully or with
> a transport error) before completing the disconnect callback, and must
> flush received messages and transactions whilst disconnected.
>
> Then 'connect' if the connection is reestablished or 'disappear' if the
> remote endpoint has gone away.
>
> A disconnect, connect cycle guarantees that the remote endpoint also
> goes through a disconnect, connect cycle.
>
> This API allows multi-pathing clients to make intelligent decisions and
> provides sufficient guarantees about stale messages and transactions to
> make a useful foundation.
>
> void idc_endpoint_create
> (
>     struct idc_endpoint * endpoint,
>     struct idc_address address,
>     void ( * appear     )( struct idc_endpoint * endpoint ),
>     void ( * connect    )( struct idc_endpoint * endpoint ),
>     void ( * disconnect )
>       ( struct idc_endpoint * endpoint, struct callback * callback ),
>     void ( * disappear )( struct idc_endpoint * endpoint ),
>     void ( * handle_message )
>       ( struct idc_endpoint * endpoint, struct idc_message * message ),
>     void ( * handle_transaction )
>     (
>         struct idc_endpoint * endpoint,
>         struct idc_transaction * transaction
>     )
> );
>
> void idc_endpoint_submit_message
>   ( struct idc_endpoint * endpoint, struct idc_message * message );
>
> void idc_endpoint_submit_transaction
>   ( struct idc_endpoint * endpoint,
>     struct idc_transaction * transaction );
>
> idc_endpoint_destroy completes the callback once the endpoint has
> 'disconnected' and 'disappeared' and the endpoint resource is free for
> reuse for a different connection.
>
> void idc_endpoint_destroy
> (
>     struct idc_endpoint * endpoint,
>     struct callback * callback
> );
>
> The message bodies and the transaction parameters and status must be of
> finite length (these quota properties might be parameters of the
> endpoint resource allocation).  A mechanism for efficient,
> arbitrary-length bulk transfer is also needed.
>
> An abstraction for buffers owned by remote domains:
>
> struct remote_buffer_reference;
>
> Can register a local buffer with the IDC to get a remote buffer
> reference:
>
> struct remote_buffer_reference idc_register_buffer
>   ( struct local_buffer_reference buffer,
>     some kind of resource probably required here );
>
> Remote buffer references may be passed between domains in IDC messages,
> transaction parameters or transaction status.
>
> Remote buffer references may be forwarded between domains and are usable
> from any domain.
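>
> As an illustration (sketch only: the request layout, the read_buffer and
> registration_resource names and the resource argument are assumptions,
> not part of the proposal), a front end issuing a disk read might
> register its destination buffer and pass the resulting reference to the
> back end inside the transaction parameters:
>
> struct example_read_request          /* device-specific, illustrative */
> {
>     uint64_t                       sector;
>     uint32_t                       sector_count;
>     struct remote_buffer_reference destination;
> };
>
> request->destination =
>     idc_register_buffer( read_buffer, registration_resource );
> /* Marshal the request into transaction_parameters and submit. */
> idc_endpoint_submit_transaction( endpoint, transaction );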
>
> Once in possession of a remote buffer reference, a domain can transfer
> data between the remote buffer and a local buffer:
>
> void idc_send_to_remote_buffer
> (
>     struct remote_buffer_reference remote_buffer,
>     struct local_buffer_reference local_buffer,
>     struct callback * callback, /* transfer completes asynchronously */
>     some kind of resource required here
> );
>
> void idc_receive_from_remote_buffer
> (
>     struct remote_buffer_reference remote_buffer,
>     struct local_buffer_reference local_buffer,
>     struct callback * callback, /* Again, completes asynchronously */
>     some kind of resource required here
> );
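>
> Continuing the illustrative read example above (sketch only, with
> invented names), the back end would satisfy the request by pushing the
> data into the front end's registered buffer and then completing the
> transaction from the transfer callback:
>
> idc_send_to_remote_buffer
> (
>     request->destination,   /* remote buffer registered by the FE */
>     disk_data,              /* local buffer holding the data read */
>     &read_done_callback,    /* calls idc_transaction_complete when
>                                the transfer finishes */
>     transfer_resource       /* placeholder for the required resource */
> );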
>
> Can unregister to free a local buffer independently of remote buffer
> references still knocking around in remote domains (subsequent
> sends/receives fail):
>
> void idc_unregister_buffer
>   ( probably a pointer to the resource passed on registration );
>
> So, the 1000 statements of establishment code in the current drivers
> become:
>
> Receive an IDC address from somewhere (resource discovery is outside the
> scope of this sketch).
>
> Allocate an IDC endpoint from somewhere (resource management is again
> outside the scope of this sketch).
>
> Call idc_endpoint_create.
>
> Wait for 'connect' before attempting to use the connection for the
> device-specific protocol implemented using messages/transactions/remote
> buffer references.
>
> Call idc_endpoint_destroy and quiesce before unloading the module.
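>
> Concretely (a sketch only: the example_* callbacks and the address and
> endpoint lookup stand in for the discovery and resource management
> mechanisms above), a minimal front end might reduce to:
>
> static void example_connect( struct idc_endpoint * endpoint )
> {
>     /* Safe to submit messages and transactions from here on. */
> }
>
> static void example_disconnect
>   ( struct idc_endpoint * endpoint, struct callback * callback )
> {
>     /* Wait for outstanding work to finish, flush received work,
>        then complete the callback. */
> }
>
> idc_endpoint_create
> (
>     endpoint, address,
>     example_appear, example_connect, example_disconnect,
>     example_disappear, example_handle_message,
>     example_handle_transaction
> );
>
> /* ...and on module unload: */
>
> idc_endpoint_destroy( endpoint, &unload_callback );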
>
> The implementation of the local buffer references and memory management
> can hide the use of pages which are shared between domains and
> reference-counted to provide a zero-copy implementation of bulk data
> transfer and shared page-caches.
>
> I implemented something very similar to this before for a cluster
> interconnect and it worked very nicely.  There are some subtleties to
> get right about the remote buffer reference implementation and the
> implications for out-of-order and idempotent bulk data transfers.
>
> As I said, it would require a few more days' work to nail down a good
> API.
>
> Harry.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

