Re: [Xen-devel] [RFC PATCH 06/24] ARM: GICv3 ITS: introduce host LPI array
On Thu, 10 Nov 2016, Andre Przywara wrote:
> Hi,
>
> On 27/10/16 23:59, Stefano Stabellini wrote:
> > On Wed, 28 Sep 2016, Andre Przywara wrote:
> >> The number of LPIs on a host can be potentially huge (millions),
> >> although in practice it will mostly be a reasonable number. So
> >> prematurely allocating an array of struct irq_desc's for each LPI
> >> is not an option.
> >> However Xen itself does not care about LPIs, as every LPI will be
> >> injected into a guest (Dom0 for now).
> >> Create a dense data structure (8 bytes) for each LPI which holds
> >> just enough information to determine the virtual IRQ number and
> >> the VCPU into which the LPI needs to be injected.
> >> Also, to not artificially limit the number of LPIs, we create a
> >> 2-level table for holding those structures.
> >> This patch introduces functions to initialize these tables and to
> >> create, look up and destroy entries for a given LPI.
> >> We allocate and access LPI information in a way that does not
> >> require a lock.
> >>
> >> Signed-off-by: Andre Przywara <andre.przywara@xxxxxxx>
> >> ---
> >>  xen/arch/arm/gic-its.c        | 154 ++++++++++++++++++++++++++++++++++++++++++
> >>  xen/include/asm-arm/gic-its.h |  18 +++++
> >>  2 files changed, 172 insertions(+)
> >>
> >> diff --git a/xen/arch/arm/gic-its.c b/xen/arch/arm/gic-its.c
> >> index 88397bc..2140e4a 100644
> >> --- a/xen/arch/arm/gic-its.c
> >> +++ b/xen/arch/arm/gic-its.c
> >> @@ -18,18 +18,31 @@
> >>
> >>  #include <xen/config.h>
> >>  #include <xen/lib.h>
> >> +#include <xen/sched.h>
> >>  #include <xen/err.h>
> >>  #include <xen/device_tree.h>
> >>  #include <xen/libfdt/libfdt.h>
> >>  #include <asm/p2m.h>
> >> +#include <asm/domain.h>
> >>  #include <asm/io.h>
> >>  #include <asm/gic.h>
> >>  #include <asm/gic_v3_defs.h>
> >>  #include <asm/gic-its.h>
> >>
> >> +/* LPIs on the host always go to a guest, so no struct irq_desc for them. */
> >> +union host_lpi {
> >> +    uint64_t data;
> >> +    struct {
> >> +        uint64_t virt_lpi:32;
> >> +        uint64_t dom_id:16;
> >> +        uint64_t vcpu_id:16;
> >> +    };
> >> +};
> >
> > Why not the following?
> >
> >     union host_lpi {
> >         uint64_t data;
> >         struct {
> >             uint32_t virt_lpi;
> >             uint16_t dom_id;
> >             uint16_t vcpu_id;
> >         };
> >     };
>
> I am not sure that gives me a guarantee of stuffing everything into a
> u64 (as per the C standard). It probably will on arm64 with gcc, but I
> thought better safe than sorry.

I am pretty sure that it is covered by the standard, also see
IHI0055A_aapcs64. Additionally, I don't think the union with "data" is
actually required either.
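For illustration, an (untested) sketch of the fixed-width variant with
a compile-time size check making the "fits in a u64" assumption
explicit; the BUILD_BUG_ON() would need to sit inside a function, e.g.
at the top of gicv3_lpi_init_host_lpis():

    /* Fixed-width members instead of bitfields; AAPCS64 should leave
     * no padding here, but make the assumption checkable. */
    union host_lpi {
        uint64_t data;
        struct {
            uint32_t virt_lpi;
            uint16_t dom_id;
            uint16_t vcpu_id;
        };
    };

    /* Fails the build if the compiler inserted any padding. */
    BUILD_BUG_ON(sizeof(union host_lpi) != sizeof(uint64_t));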
> >>  /* Global state */
> >>  static struct {
> >>      uint8_t *lpi_property;
> >> +    union host_lpi **host_lpis;
> >>      int host_lpi_bits;
> >>  } lpi_data;
> >>
> >> @@ -43,6 +56,26 @@ static DEFINE_PER_CPU(void *, pending_table);
> >>  #define MAX_HOST_LPI_BITS                                                \
> >>          min_t(unsigned int, lpi_data.host_lpi_bits, CONFIG_HOST_LPI_BITS)
> >>  #define MAX_HOST_LPIS   (BIT(MAX_HOST_LPI_BITS) - 8192)
> >> +#define HOST_LPIS_PER_PAGE      (PAGE_SIZE / sizeof(union host_lpi))
> >> +
> >> +static union host_lpi *gic_find_host_lpi(uint32_t lpi, struct domain *d)
> >
> > I take it "lpi" is the physical LPI here. Maybe we should rename it
> > to "plpi" for clarity.
>
> Indeed.
>
> >> +{
> >> +    union host_lpi *hlpi;
> >> +
> >> +    if ( lpi < 8192 || lpi >= MAX_HOST_LPIS + 8192 )
> >> +        return NULL;
> >> +
> >> +    lpi -= 8192;
> >> +    if ( !lpi_data.host_lpis[lpi / HOST_LPIS_PER_PAGE] )
> >> +        return NULL;
> >> +
> >> +    hlpi = &lpi_data.host_lpis[lpi / HOST_LPIS_PER_PAGE][lpi % HOST_LPIS_PER_PAGE];
> >
> > I realize I am sometimes obsessive about this, but division
> > operations are expensive and this is on the hot path, so I would do:
> >
> >     #define HOST_LPIS_PER_PAGE (PAGE_SIZE >> 3)
>
> to replace
>     #define HOST_LPIS_PER_PAGE (PAGE_SIZE / sizeof(union host_lpi))?
>
> This should be computed by the compiler, as it's constant.
>
> >     unsigned int table = lpi / HOST_LPIS_PER_PAGE;
>
> So I'd rather replace this by ">> (PAGE_SHIFT - 3)".

This is actually what I meant, thanks.

> But again the compiler would do this for us, as replacing constant
> divisions by powers of two with right shifts is a textbook example of
> an easy optimization, if I remember that compiler class at uni
> correctly ;-)

Yet we found instances where this didn't happen in the common Xen
scheduler code on x86.

> > then use table throughout this function.
>
> I see your point (though this is ARMv8, which always has udiv).
> But to prove your paranoia wrong: I don't see any divisions in the
> disassembly, but an lsr #3 and an lsr #9 and various other clever and
> cheap ARMv8 instructions ;-)
> Compilers have really come a long way in 2016 ...

Fair enough, thanks for checking. That is enough for me.

> >> +    if ( d && hlpi->dom_id != d->domain_id )
> >> +        return NULL;
> >
> > I think this function is very useful, so I would avoid making any
> > domain checks here: one day we might want to retrieve hlpi even if
> > hlpi->dom_id != d->domain_id. I would move the domain check outside.
>
> That's why I have "d && ..." in front. If you pass in NULL for the
> domain, it will skip this check. That saves us from coding the check
> in every caller.
> Is that not good enough?

There is a simple solution to this: write two functions, one without
the check and a wrapper around it with the check.
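For illustration, an (untested) sketch of that split; the wrapper name
is made up:

    /* Pure lookup, no ownership check. */
    static union host_lpi *gic_find_host_lpi(uint32_t plpi)
    {
        if ( plpi < 8192 || plpi >= MAX_HOST_LPIS + 8192 )
            return NULL;

        plpi -= 8192;
        if ( !lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE] )
            return NULL;

        return &lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE]
                                  [plpi % HOST_LPIS_PER_PAGE];
    }

    /* Wrapper for the common case: only return an entry owned by d. */
    static union host_lpi *gic_find_domain_host_lpi(uint32_t plpi,
                                                    struct domain *d)
    {
        union host_lpi *hlpi = gic_find_host_lpi(plpi);

        if ( hlpi && hlpi->dom_id != d->domain_id )
            return NULL;

        return hlpi;
    }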
> >> +    return hlpi;
> >> +}
> >>
> >>  #define ITS_COMMAND_SIZE        32
> >>
> >> @@ -96,6 +129,33 @@ static int its_send_cmd_sync(struct host_its *its, int cpu)
> >>      return its_send_command(its, cmd);
> >>  }
> >>
> >> +static int its_send_cmd_discard(struct host_its *its,
> >> +                                uint32_t deviceid, uint32_t eventid)
> >> +{
> >> +    uint64_t cmd[4];
> >> +
> >> +    cmd[0] = GITS_CMD_DISCARD | ((uint64_t)deviceid << 32);
> >> +    cmd[1] = eventid;
> >> +    cmd[2] = 0x00;
> >> +    cmd[3] = 0x00;
> >> +
> >> +    return its_send_command(its, cmd);
> >> +}
> >> +
> >> +static int its_send_cmd_mapti(struct host_its *its,
> >> +                              uint32_t deviceid, uint32_t eventid,
> >> +                              uint32_t pintid, uint16_t icid)
> >> +{
> >> +    uint64_t cmd[4];
> >> +
> >> +    cmd[0] = GITS_CMD_MAPTI | ((uint64_t)deviceid << 32);
> >> +    cmd[1] = eventid | ((uint64_t)pintid << 32);
> >> +    cmd[2] = icid;
> >> +    cmd[3] = 0x00;
> >> +
> >> +    return its_send_command(its, cmd);
> >> +}
> >> +
> >>  static int its_send_cmd_mapc(struct host_its *its, int collection_id, int cpu)
> >>  {
> >>      uint64_t cmd[4];
> >> @@ -330,15 +390,109 @@ uint64_t gicv3_lpi_get_proptable()
> >>      return reg;
> >>  }
> >>
> >> +/* Allocate the 2nd level array for host LPIs. This one holds pointers
> >> + * to the page with the actual "union host_lpi" entries. Our LPI limit
> >> + * avoids excessive memory usage.
> >> + */
> >>  int gicv3_lpi_init_host_lpis(int lpi_bits)
> >>  {
> >> +    int nr_lpi_ptrs;
> >> +
> >>      lpi_data.host_lpi_bits = lpi_bits;
> >>
> >> +    nr_lpi_ptrs = MAX_HOST_LPIS / (PAGE_SIZE / sizeof(union host_lpi));
> >> +
> >> +    lpi_data.host_lpis = xzalloc_array(union host_lpi *, nr_lpi_ptrs);
> >> +    if ( !lpi_data.host_lpis )
> >> +        return -ENOMEM;
> >
> > Why are we not allocating the 2nd level right away? To save memory?
> > If so, I would like some numbers for a real use case scenario
> > written either here or in the commit message.
>
> LPIs can be allocated sparsely. Each LPI uses 8 bytes, and chances are
> we never use more than a few dozen on a real system, so we just use
> two pages with this scheme.
>
> Allocating memory for all 2^20 LPIs (the default) takes 8 MB (probably
> for nothing), and extending this to 24 bits uses 128 MB already.
> The problem is that Xen cannot know how many LPIs Dom0 will use, so
> I'd rather make this number generous here - hence the allocation
> scheme.
>
> Not sure if this is actually overkill or paranoid and we would get
> away with a much smaller single-level allocation, driven by a config
> option or runtime parameter, though.

All right. Please write an in-code comment explaining this reasoning,
with a sample number of LPIs used by Dom0 in a real case scenario.

> >>      printk("GICv3: using at most %ld LPIs on the host.\n", MAX_HOST_LPIS);
> >>
> >>      return 0;
> >>  }
> >>
> >> +/* Allocates a new host LPI to be injected as "virt_lpi" into the specified
> >> + * VCPU. Returns the host LPI ID or a negative error value.
> >> + */
> >> +int gicv3_lpi_allocate_host_lpi(struct host_its *its,
> >> +                                uint32_t devid, uint32_t eventid,
> >> +                                struct vcpu *v, int virt_lpi)
> >> +{
> >> +    int chunk, i;
> >> +    union host_lpi hlpi, *new_chunk;
> >> +
> >> +    /* TODO: handle some kind of preassigned LPI mapping for DomUs */
> >> +    if ( !its )
> >> +        return -EPERM;
> >> +
> >> +    /* TODO: This could be optimized by storing some "next available" hint and
> >> +     * only iterate if this one doesn't work. But this function should be
> >> +     * called rarely.
> >> +     */
> >
> > Yes please. Even a trivial pointer to the last allocation would be
> > far better than this. It would be nice to run some numbers and prove
> > that in realistic scenarios finding an empty plpi doesn't take more
> > than 5-10 ops, which should be the case unless we have to loop over
> > and the initial chunks are still fully populated, causing Xen to
> > scan 512 entries at a time. We definitely want to avoid that, except
> > in rare worst-case scenarios.
>
> I can try, though keep in mind that this code is really only called on
> allocating a host LPI, which would only happen when an LPI gets
> mapped. And this is done only upon a Dom0 driver initializing a
> device. Normally you wouldn't expect this during the actual guest
> runtime.

It would happen during actual guest runtime with device assignment,
wouldn't it? A simple pointer to the last allocation would be a good
start, plus an in-code comment about how many times we loop to find an
empty plpi in the normal case (no device assignment).
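For illustration, an (untested) sketch of such a hint; "next_hint" and
the helper name are made up, second-level chunk allocation is omitted,
and I am assuming a zeroed entry (virt_lpi == 0) marks a free slot:

    static uint32_t next_hint;  /* LPI offset where the last search ended */

    static int find_unused_host_lpi(void)
    {
        uint32_t i, offset;

        for ( i = 0; i < MAX_HOST_LPIS; i++ )
        {
            union host_lpi *hlpi;

            /* Cold path, so the modulo wrap-around is acceptable. */
            offset = (next_hint + i) % MAX_HOST_LPIS;
            hlpi = gic_find_host_lpi(offset + 8192, NULL);
            if ( hlpi && !hlpi->virt_lpi )
            {
                next_hint = offset + 1;
                return offset + 8192;   /* physical LPI number */
            }
        }

        return -ENOSPC;
    }

In the normal case (no device assignment) the hint makes each call
resume right after the previously allocated LPI, so the loop should
terminate within one or two iterations.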