
Re: [Xen-devel] Re: Interdomain comms



I quite like some of this.  A few more comments below.

On Sat, 2005-05-07 at 19:57 -0500, Eric Van Hensbergen wrote:
> 
> In our world, this would result in you holding a Fid pointing to the
> open object.  The Fid is a pointer to meta-data and is considered
> state on both the FE and the BE. (this has downsides in terms of
> reliability and the ability to recover sessions or fail over to
> different BE's -- one of our summer students will be addressing the
> reliability problem this summer).

OK, so this is an area of concern for me.  I used the last version of
the sketchy API I outlined to create an HA cluster infrastructure. So I
had to solve these kinds of protocol issues and, whilst it was actually
pretty easy starting from scratch, retrofitting a solution to an
existing protocol might be challenging, even for a summer student.

> 
> The FE performs a read operation passing it the necessary bits:
>   ret = read( fd, *buf, count );

Here the API is coupling the client to the memory management
implementation by assuming that the buffer is mapped into the client's
virtual address space.

This is likely to be true most of the time, so an API at this level will
be useful, but I'd also like to be able to write I/O applications that
manage the data in buffers that are never mapped into the application
address space.

Also, I'd like to be able to write applications that have clients which
use different types of buffers without having to code for each case in
my application.

This is why my API deals in terms of local_buffer_references which, for
the sake of argument, might look like this:

struct local_buffer_reference
{
    local_buffer_reference_type type;   /* flavour of memory behind the buffer        */
    local_buffer_reference_base base;   /* type-specific base: pointer, page vector   */
    buffer_reference_offset     offset; /* start of the data relative to base         */
    buffer_reference_length     length; /* length of the data                         */
};

A local buffer reference of type virtual_address would have a base value
equal to the buf pointer above, an offset of zero and a length of count.

A local buffer reference for the hidden buffer pages would have a
different type, say hidden_buffer_page; the base would be a pointer to a
vector of page indices, the offset would be the offset of the start of
the buffer into the first page, and the length would be the length of
the buffer.

So, my application can deal with buffers described like that without
having to worry about the flavour of memory management backing them.

Also, I can change the memory management without changing all the calls
to the API; I only have to change where I get buffers from.
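
For example, here is a minimal sketch of the two flavours described
above.  The typedef choices, enum values and helper names are my
assumptions, just to make it concrete; in real code the typedefs would
of course come before the struct definition:

#include <stddef.h>
#include <stdint.h>

typedef enum {
    lbr_virtual_address,    /* buffer mapped into the caller's address space */
    lbr_hidden_buffer_page  /* buffer held in pages the caller never maps    */
} local_buffer_reference_type;

typedef uintptr_t local_buffer_reference_base;
typedef size_t    buffer_reference_offset;
typedef size_t    buffer_reference_length;

/* These helpers are the only place the choice of memory management is
 * made; the rest of the application just passes the references around. */

static struct local_buffer_reference
make_va_ref(void *buf, size_t count)
{
    struct local_buffer_reference r;
    r.type   = lbr_virtual_address;
    r.base   = (local_buffer_reference_base)buf;
    r.offset = 0;
    r.length = count;
    return r;
}

static struct local_buffer_reference
make_page_ref(const unsigned long *page_indices, size_t first_page_offset,
              size_t count)
{
    struct local_buffer_reference r;
    r.type   = lbr_hidden_buffer_page;
    r.base   = (local_buffer_reference_base)page_indices;
    r.offset = first_page_offset;
    r.length = count;
    return r;
}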

BTW, this specific abstraction I learnt about from an embedded OS
architected by Nik Shalor. He might have got it from somewhere else.

> 
> This actually would be translated (in collaboration with local
> meta-data) into a t_read message:
>   t_read tag fid offset count (where offset is determined by local fid
> metadata)
> 
> The BE receives the read request, and based on state information kept
> in the Fid (basically your metadata), it finds the file contents in
> the buffer cache.  It sends a response packet with a pointer to its
> local buffer cache entry:
> 
>  r_read tag count *data
> 
> There are a couple ways we could go when the FE receives the response:
> a) it could memcopy the data to the user buffer *buf .  This is the
> way things   currently work, and isn't very efficient -- but may be
> the way to go for the ultra-paranoid who don't like sharing memory
> references between partitions.
> 
> b) We could have registered the memory pointed to by *buf and passed
> that reference along the path -- but then it probably would just
> amount to the BE doing the copy rather than the front end.  Perhaps
> this approximates what you were talking about doing?

No, 'c' is closer to what I was sketching out, except that I was
proposing a general mechanism that had one code path in the clients even
though the underlying implementation could be a, b or c or any other
memory management strategy determined at run-time according to the type
of the local buffer references involved.
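
To make "one code path" concrete, here is a rough sketch.  The
remote_buffer_reference type and the two transfer functions are
placeholders I've invented for illustration; they aren't a real
interface:

struct remote_buffer_reference;  /* however the BE describes its buffer */

extern int copy_from_remote(const struct local_buffer_reference *dst,
                            const struct remote_buffer_reference *src);
extern int map_remote_pages_read_only(const struct local_buffer_reference *dst,
                                      const struct remote_buffer_reference *src);

/* The client always calls this; the memory management strategy is chosen
 * at run time from the type of the local buffer reference. */
int complete_read(const struct local_buffer_reference *dst,
                  const struct remote_buffer_reference *src)
{
    switch (dst->type) {
    case lbr_virtual_address:
        return copy_from_remote(dst, src);            /* your (a) or (b) */
    case lbr_hidden_buffer_page:
        return map_remote_pages_read_only(dst, src);  /* your (c)        */
    default:
        return -1;
    }
}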

> 
> c) As long as the buffers in question (both *buf and the buffer cache
> entry) were page-aligned, etc. -- we could play clever VM games
> marking the page as shared RO between the two partitions and alias the
> virtual memory pointed to by *buf to the shared page.  This is very
> sketchy and high level and I need to delve into all sorts of details
> -- but the idea would be to use virtual memory as your friend for
> these sort of shared read-only buffer caches.  It would also require
> careful allocation of buffers of the right size on the right alignment
> -- but driver writers are used to that sort of thing.

Yes, it also requires buffers of the right size and alignment to be used
at the receiving end of any network transfers, and for the alignment to
be preserved across the network even if the transfer starts at a
non-zero page offset. You might think that once the data goes over the
network you don't care, but it might be received by an application that
wants to share pages with another application, so in fact it does
matter. This is just something you have to get right if you want any
kind of page-referencing technique to work, although you can fall back
to memcpy for misaligned data if necessary.
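
The check for whether page sharing is even possible is cheap, something
along these lines (PAGE_SIZE and share_pages are placeholders here, not
a real interface):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096UL                 /* placeholder value */

extern void share_pages(void *dst, const void *src, size_t len); /* hypothetical */

/* Page sharing only works when source and destination start at the same
 * offset within a page; otherwise fall back to a plain copy. */
static void deliver(void *dst, const void *src, size_t len)
{
    uintptr_t d = (uintptr_t)dst & (PAGE_SIZE - 1);
    uintptr_t s = (uintptr_t)src & (PAGE_SIZE - 1);

    if (d == s)
        share_pages(dst, src, len);
    else
        memcpy(dst, src, len);
}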

> The above looks complicated, but to a FE writer would be as simple as:
>  channel = dial("net!BE"); /* establish connection */ 
> /* in my current code, channel is passed as an argument to the FE as a
> boot arg */
>   root = fsmount(channel, NULL); /* this does the t_version, auth, & attach */
>   fd = open(root, "/some/path/file", OREAD);
>   ret = read(fd, *buf, sizeof(buf));
>   close(fd);
>  close(root);
>  close(channel);

So, this is obviously a blocking API.  My API was non-blocking, like
AIO, because the network latency means that you need a lot of
concurrency for high throughput and you don't necessarily want that many
threads.  Having a blocking API as well is convenient though.
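
For illustration, something shaped roughly like this (none of these
names exist in either API under discussion; it's just the
submit-plus-completion-callback pattern that lets one thread keep many
operations in flight to cover the latency):

#include <stddef.h>

struct local_buffer_reference;  /* as above */
struct io_request;              /* opaque handle for an operation in flight */

/* Completion callback: invoked when the transfer finishes, with the
 * status and the number of bytes actually transferred. */
typedef void (*io_done_fn)(struct io_request *req, int status,
                           size_t bytes, void *cookie);

/* Submit a read described by a local buffer reference and return
 * immediately; the callback runs later when the data has arrived. */
extern struct io_request *submit_read(int fd,
                                      const struct local_buffer_reference *dst,
                                      unsigned long long offset,
                                      io_done_fn done, void *cookie);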

> 
> If you want to get fancy, you could get rid of the root arg to open
> and use a private name space (after fsmount):
>   bind(root, "/mnt/be", MREPL); /* bind the back end to a well known place */
> then it would be:
>   fd=open("/mnt/be/some/path/file", OREAD);
> 
> There's also all sorts of cool stuff you can do on the domain
> controller to provision child partitions using dynamic name space and
> then just exporting the custom fashioned environment using 9P -- but
> that's higher level organization stuff again.  There's all sorts of
> cool tricks you can play with 9P (similar to the stuff that the FUSE
> and FiST user-space file system packages provide) like copy-on-write
> file systems, COW block devices, muxed ttys, etc. etc.

I'm definitely going to study the organisational aspects of 9P.

> I've described it in terms of a file system, using your example as a
> basis, but the same sort of thing would be true for a block device or
> a network connection (with some slightly different semantic rules on
> the network connection).  The main point is to keep things simple for
> the FE and BE writers, and deal with all the accounting and magic you
> describe within the infrastructure (no small task).

If you use the buffer abstraction I described then you can start with a
very simple mm implementation and improve it without having to change
the clients.

> 
> Another difference would involve what would happen if you did have to
> bridge a cluster network - the 9P network encapsulation is well
> defined, all you would need to do (at the I/O partition bridging the
> network) is marshall the data according to the existing protocol spec.
>  For more intelligent networks using RDMA and such things, you could
> keep the scatter/gather style semantics and send pointers into the
> RDMA space for buffer references.

I don't see this as a difference. My API was explicitly compatible with
a networked implementation.

One thought that did occur to me is that a reliance on in-order message
delivery (which 9P has) turns out to be quite painful to satisfy when
combined with multi-pathing and failover: a link failure leaves the
remaining links full of messages that can't be delivered until the lost
messages (which happened to include the message that must be delivered
first) have been resent.  This is more problematic than you might think,
because all communication stalls for the time taken to _detect_ the lost
link, and all the remaining links are full, so you can't do the recovery
without cleaning up another link.

This is not insurmountable, but it's not obviously the best solution
either, particularly since, with SMP machines, you aren't going to want
to serialise the concurrent activity, so there needn't be any
fundamental in-order requirement at the protocol level.

> 
> As I said before, there's lots of similarities in what we are talking
> about, I'm just gluing a slightly more abstract interface on top,
> which has some benefits in some additional organizational and security
> mechanisms (and a well-established (but not widely used yet) network
> protocol encapsulation).

Maybe something like my API underneath and something like the
organisational stuff of 9P on top would be good.  I'd have to see how
the 9P organisational stuff stacks up against the publish-and-subscribe
mechanisms for resource discovery that I'm used to. Also, the
infrastructure requirements for building fault-tolerant systems are
quite demanding and I'd have to be confident that 9P was up to the task
before I'd be happy with it personally.

> 
> There are plenty of details I know I'm glossing over, and I'm sure
> I'll need lots of help getting things right.  I'd have preferred
> staying quiet until I had my act together a little more, but Orran and
> Ron convinced me that it was important to let people know the
> direction I'm planning on exploring.

Yes, definitely worthwhile.  I'd like to see more discussion like this
on the xen-devel list.  On the one hand, it's kind of embarrassing to
discuss vaporware and half-finished ideas, but on the other, the
opportunity for public comment at an early stage in the process is
probably going to save a lot of effort in the long run.

-- 
Harry Butterworth <harry@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

