
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

On Thu, Sep 21, 2017 at 2:16 PM, Thomas Garnier <thgarnie@xxxxxxxxxx> wrote:
> On Thu, Sep 21, 2017 at 8:59 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
> >
> > ( Sorry about the delay in answering this. I could blame the delay on the
> >   merge window, but in reality I've been procrastinating on this due to the
> >   permanent, non-trivial impact PIE has on generated C code. )
> >
> > * Thomas Garnier <thgarnie@xxxxxxxxxx> wrote:
> >
> >> 1) PIE sometimes needs two instructions to represent a single
> >> instruction with mcmodel=kernel.
> >
> > What again is the typical frequency of this occurring in an x86-64 defconfig
> > kernel, with the very latest GCC?
> I am not sure of the best way to measure that.

A very approximate approach is to look at each instruction using the
signed trick with a _32S relocation. Not all _32S relocations translate
into extra instructions, because some just relocate part of an absolute
mov, which would actually be smaller if it were RIP-relative.

I used this command to get a rough estimate:

objdump -dr ./baseline/vmlinux | egrep -A 2 '\-0x[0-9a-f]{8}' | grep _32S | wc -l
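For what it's worth, here is why that egrep pattern works. A small
Python sketch (the kernel text address below is made up for
illustration):

```python
# Why the '-0x[0-9a-f]{8}' pattern catches _32S operands: with
# -mcmodel=kernel the kernel is linked in the top 2 GiB of the address
# space (0xffffffff80000000 and up), so an absolute address fits in a
# sign-extended 32-bit immediate, which objdump prints as a large
# negative constant. A made-up kernel text address for illustration:
addr = 0xffffffff81234567

low32 = addr & 0xffffffff
# Interpret the low 32 bits as a signed value, which is exactly what an
# R_X86_64_32S relocation stores in the instruction.
signed = low32 - (1 << 32) if low32 & 0x80000000 else low32

# objdump renders this immediate as a negative 8-hex-digit constant,
# which is what the egrep pattern above keys on.
print(hex(signed))  # -0x7edcba99

# Sign-extending back to 64 bits recovers the original address, which
# is why the 4-byte encoding is still correct in the top 2 GiB.
assert signed & ((1 << 64) - 1) == addr
```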

That found 6130 places. If you assume each adds at least 7 bytes, that
is at least 42910 extra bytes in the .text section. The .text section
is 78599 bytes bigger from baseline to PIE, so that accounts for at
least 54% of the size difference. This assumes we found all of them,
and it cannot capture the impact of using an additional register.
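As a quick sanity check, the arithmetic behind that estimate (all
numbers taken from the measurements above):

```python
relocs = 6130        # _32S sites found by the objdump pipeline above
extra_per_site = 7   # assumed minimum extra bytes per site under PIE
text_growth = 78599  # .text growth from baseline to PIE, in bytes

added = relocs * extra_per_site
share = added / text_growth
print(added, round(share * 100, 1))  # 42910 bytes, ~54.6% of the growth
```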

A similar approach works for the switch tables, but it is a bit more complex:

1) Find all constructs with a lea (%rip)-relative load followed by a jmp
instruction inside a function (the typical unfolded switch case).
2) Remove destination addresses with fewer than 4 occurrences.

Result: 480 switch cases in 49 functions. Each case takes at least 9
bytes and the switch itself takes 16 bytes (assuming one switch per
function).

That's 5104 bytes for the easy-to-identify switches (less than 7% of
the increase).
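A rough sketch of step 1, scanning objdump output for a RIP-relative
lea followed directly by an indirect jmp (the disassembly excerpt is
made up to show the shape, not taken from a real vmlinux), plus the
byte accounting:

```python
import re

# Hypothetical `objdump -d` excerpt showing the unfolded-switch shape:
# a RIP-relative lea computing a case target, then an indirect jmp.
disasm = """\
ffffffff81001000: lea 0x1f2a(%rip),%rax
ffffffff81001007: jmp *%rax
ffffffff81001020: mov %rdi,%rsi
ffffffff81001023: lea 0x2b10(%rip),%rdx
ffffffff8100102a: jmp *%rdx
"""

# Pair each lea ...(%rip) with an immediately following jmp.
lines = disasm.splitlines()
pairs = sum(
    1
    for prev, cur in zip(lines, lines[1:])
    if re.search(r'lea\s+-?0x[0-9a-f]+\(%rip\)', prev) and 'jmp' in cur
)
print(pairs)  # 2 in this excerpt

# The size accounting from the message: 480 cases at >= 9 bytes each,
# plus one 16-byte switch preamble per function (49 functions).
total = 480 * 9 + 49 * 16
print(total)  # 5104
```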

I am certainly missing a lot of differences. I checked whether the
percpu changes impact the size, and they don't (only 3 bytes added with
PIE).

I also tried other ways to compare the .text sections, like summing
symbol sizes or counting bytes in a full disassembly, but the results
are far off from the whole .text size, so I am not sure that is the
right way to go about it.
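One plausible reason the per-symbol sums drift: alignment padding and
code not covered by sized symbols are part of .text but invisible to a
symbol walk. A sketch on made-up `nm -S` output:

```python
# Hypothetical `nm -S --size-sort` lines: address, size, type, name.
nm_out = """\
ffffffff81000000 0000000000000040 T start_here
ffffffff81000080 0000000000000120 t helper_a
ffffffff81000200 0000000000000060 T helper_b
"""

# Sum the sizes nm reports for each symbol.
sym_bytes = sum(int(line.split()[1], 16) for line in nm_out.splitlines())

# Section span from the first symbol to the end of the last one. The
# gap versus sym_bytes is alignment padding plus code without sized
# symbols, which makes per-symbol sums diverge from the raw .text size.
span = (0xffffffff81000200 + 0x60) - 0xffffffff81000000
padding = span - sym_bytes
print(sym_bytes, span, padding)  # 448 608 160
```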

> >
> > Also, to make sure: which unwinder did you use for your measurements,
> > frame-pointers or ORC? Please use ORC only for future numbers, as
> > frame-pointers is obsolete from a performance measurement POV.
> I used the default configuration, which uses frame pointers. I built
> all the binaries with ORC as well and see an improvement in size.
> On the latest revision (just built and ran performance tests this week):
> With frame pointers: PIE .text is 0.837324% bigger than baseline.
> With ORC: PIE .text is 0.814224% bigger than baseline.
> Comparing baselines only, ORC is 2.849832% smaller than frame pointers.
> >
> >> 2) GCC does not optimize switches in PIE in order to reduce relocations:
> >
> > Hopefully this can either be fixed in GCC or at least influenced via a
> > compiler switch in the future.
> >
> >> The switches are the biggest increase on small functions but I don't
> >> think they represent a large portion of the difference (number 1 is).
> >
> > Ok.
> >
> >> A side note: while testing gcc 7.2.0 on hackbench I have seen the PIE
> >> kernel being faster by 1% across multiple runs (comparing 50 runs done
> >> across 5 reboots, twice). I don't think PIE is faster than
> >> mcmodel=kernel, but recent versions of gcc make them fairly similar.
> >
> > So I think we are down to an overhead range where the inherent noise (both
> > random and systematic) in 'hackbench' overwhelms the signal we are trying
> > to measure.
> >
> > So I think it's the kernel .text size change that is the best noise-free
> > proxy for the overhead impact of PIE.
> I agree but it might be hard to measure the exact impact. What is
> acceptable and what is not?
> >
> > It doesn't hurt to double check actual real performance as well, just
> > don't expect there to be much of a signal for anything but fully cached
> > microbenchmark workloads.
> That's aligned with what I see in the latest performance testing.
> Performance is close enough that it is hard to get exact numbers (PIE
> is just a bit slower than baseline on hackbench (~1%)).
> >
> > Thanks,
> >
> >         Ingo
> --
> Thomas


Xen-devel mailing list


