How do I benchmark a spin lock?

Elazar Leibovich elazarl at gmail.com
Wed Nov 26 18:22:23 IST 2014


Thanks! Great feedback.

I just want to quickly reference two points:

1) In a well-designed system, most of the time the lock will be held
uncontended.
From [0]: "it might seem that the behavior of multiprocessors when there is
contention for a spin lock is not important. A highly parallel application
will by definition have no lock with a significant amount of contention,
since that would imply a sequential component. If an application has a lock
that is a bottleneck, the best alternative would be to redesign the
application's algorithms to eliminate the contention. In no case does it
make sense to add processors to an application if they end up only
spin-waiting.
There are, however, several situations where spin lock performance when
there is contention is important. Poor contention performance may prevent
an application with a heavily utilized lock from reaching its peak
performance, because the average number of spin-waiting processors will
become nontrivial as the lock approaches saturation. Furthermore, if
processors arrive at a lock in a burst, queue lengths can be temporarily
long, resulting in bad short-term performance, without the lock being a
long-term bottleneck.
Alternatively, it may not always be possible to tune a program to use the
optimal number of processors. An operating system, for instance, has little
control over the rate at which users make operating system calls. At high
load, locks that are normally not a problem could become a source of
contention."

Measuring the uncontended case is a very good idea, but it is not the
problematic part.

2) The best benchmarking is done by looking at a real-world use case and
simplifying it.

Agreed. This is the motherhood and apple pie of all benchmarks. It is,
however, not always evident exactly how the lock will be used, even if you
are the one intending to use it. True, it will be used to protect a short
critical section, where the cost of waiting is lower than the cost of a
context switch, but that is hardly enough to specify the use cases.

Instead of further discussion, I'd like to propose a simple practical
spinlock benchmark, which I think could be useful as a general measure, and
could be the baseline of more specific tests. I'll be grateful for any
feedback.

Design goals:
1) The benchmark should run on a reasonable OS. Requiring special-purpose
hardware would increase accuracy, but would make running the benchmark much
more difficult.
2) Running under an OS might distort the benchmark with context switches or
interrupts. To deal with that, repeat the test multiple times and remove
outliers (a sketch of this follows the list).
3) It should give a few simple outputs, so that a lock designer can use the
results as a proxy for how to optimize the lock for his use case.
4) The initial design assumes x86 for brevity, but adapting it to other
architectures should be trivial.
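
To make (2) concrete, here is a rough, untested sketch in C;
median_of_runs() and measure_once() are names I made up, the latter
standing for whatever single-run measurement is being repeated:

#include <stdint.h>
#include <stdlib.h>

/* Runs that were hit by an interrupt or a context switch land in the
 * upper tail of the distribution, so reporting the median of many runs
 * discards them. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

uint64_t median_of_runs(uint64_t (*measure_once)(void), int runs)
{
    uint64_t *samples = malloc(runs * sizeof(*samples));
    uint64_t median;

    if (samples == NULL)
        return 0;
    for (int i = 0; i < runs; i++)
        samples[i] = measure_once();
    qsort(samples, runs, sizeof(*samples), cmp_u64);
    median = samples[runs / 2];    /* outliers sit above the median */
    free(samples);
    return median;
}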

Implementation:
Input parameters:
LENGTH - estimated number of cycles the lock would be held.
N - number of CPUs contending for the lock.
1) Count cycles (e.g., via rdtsc) for an uncontended acquire and release -
that would be the first output (see the first sketch below this list).
2) Measure the memory footprint of an uncontended acquire and release - that
would be the second output. This can be done by invalidating the cache,
running a lock acquire/release, and reading the cache misses from the
performance counters. To eliminate other overhead, do the same with a NOP
lock - that would be the overhead to subtract (see the second sketch below
this list).
3) Pin N threads to N different logical CPUs. We can afford to ignore the
distinction between logical and physical CPUs, since in most use cases you
cannot ensure two threads would run on different hardware CPUs. Each CPU
would pause for LENGTH cycles, by running instructions that do NOT use the
memory bus.
4) Count time (e.g., via rdtsc) from the time the first CPU acquired the
lock to the time the last one did (see the third sketch below this list).
Do that by acquiring the lock, keeping the timestamp in a register,
releasing the lock, and then, after a 1ms pause, writing this value to a
known location, so that writing the timestamp won't interfere with the
test. The results are collected after 10ms, when all CPUs are definitely
done with the benchmark.
The problem here is that the timestamps of different CPUs might not be
synchronized. Possible solutions are:
4.a. Use the HPET instead; then again, the HPET overhead might distort the
test case.
4.b. On newer Intel hardware, I understand there's an invariant TSC, which
could be a solution. I'm not sure how portable that would be.
4.c. Assuming the time drift is small, averaging over multiple experiments
might give a small enough standard deviation.
I'm not sure how problematic this would be in practice.
5?) I'm not sure it's possible, or meaningful, but one could measure the
memory bus utilization during the lock contention. Maybe by counting the
total cache misses for all CPUs while they contend over the lock.
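
For step 1, something along the lines of the following untested sketch
should do on x86-64 Linux (GCC/Clang builtins assumed). The test-and-set
lock is only a placeholder for the spinlock actually under test:

/* Build with something like: cc -O2 -pthread uncontended_bench.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtscp, _mm_lfence, _mm_pause */

static atomic_flag lck = ATOMIC_FLAG_INIT;   /* placeholder lock */

static void spin_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&lck, memory_order_acquire))
        _mm_pause();
}

static void spin_release(void)
{
    atomic_flag_clear_explicit(&lck, memory_order_release);
}

/* Serialized cycle-counter read: rdtscp waits for earlier instructions
 * to retire, the lfence keeps later ones from starting early. */
static uint64_t cycles(void)
{
    unsigned aux;
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Pin the calling thread to one logical CPU, so we always read the
 * same TSC. */
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    pin_to_cpu(0);

    /* Repeat and keep the minimum, so runs hit by an interrupt or a
     * context switch are discarded (median_of_runs() from the earlier
     * sketch would work just as well). */
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < 1000; i++) {
        uint64_t t0 = cycles();
        spin_acquire();
        spin_release();
        uint64_t t1 = cycles();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    printf("uncontended acquire+release: %llu cycles\n",
           (unsigned long long)best);
    return 0;
}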
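
For the counter-reading part of step 2, a sketch using Linux's
perf_event_open(2). Invalidating the cache beforehand and subtracting the
NOP-lock baseline are left to the caller, and the function names are again
mine:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>

static int open_cache_miss_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* Count for the calling thread only, on any CPU. */
    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Cache misses incurred by fn(); run it once with the lock under test
 * and once with a NOP lock, and subtract the two results. */
uint64_t cache_misses_of(void (*fn)(void))
{
    uint64_t count = 0;
    int fd = open_cache_miss_counter();

    if (fd < 0)
        return 0;                      /* no counters available */
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(count));
    close(fd);
    return count;
}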
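
And for steps 3 and 4, an untested sketch that assumes the TSCs of the
different CPUs are synchronized (see 4.a-4.c above); it repeats the lock
and timing helpers from the step 1 sketch so it compiles on its own:

/* Each of N threads is pinned to its own logical CPU, grabs the lock,
 * keeps its acquire timestamp in a register, holds the lock for LENGTH
 * cycles off the memory bus, and writes the timestamp out only 10ms
 * later, when everyone is certainly done, so the store cannot disturb
 * the run.  Build: cc -O2 -pthread contended_bench.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>

enum { N = 4, LENGTH = 200 };           /* the two input parameters */

static atomic_flag lck = ATOMIC_FLAG_INIT;   /* placeholder lock */
static uint64_t acquire_ts[N];               /* filled after the run */

static void spin_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&lck, memory_order_acquire))
        _mm_pause();
}

static void spin_release(void)
{
    atomic_flag_clear_explicit(&lck, memory_order_release);
}

static uint64_t cycles(void)            /* serialized rdtscp */
{
    unsigned aux;
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Hold the lock for roughly LENGTH cycles using register arithmetic
 * only, so the critical section stays off the memory bus. */
static void hold_for(uint64_t length)
{
    for (uint64_t i = 0; i < length; i++)
        __asm__ volatile("" : "+r"(i)); /* keep the loop alive */
}

static void *worker(void *arg)
{
    int id = (int)(intptr_t)arg;

    pin_to_cpu(id);
    spin_acquire();
    uint64_t t = cycles();              /* stays in a register for now */
    hold_for(LENGTH);
    spin_release();

    usleep(10000);                      /* everyone is done by now */
    acquire_ts[id] = t;                 /* only now touch shared memory */
    return NULL;
}

int main(void)
{
    pthread_t tid[N];

    for (int i = 0; i < N; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);

    uint64_t first = UINT64_MAX, last = 0;
    for (int i = 0; i < N; i++) {
        if (acquire_ts[i] < first) first = acquire_ts[i];
        if (acquire_ts[i] > last)  last = acquire_ts[i];
    }
    printf("first-to-last contended acquire: %llu cycles\n",
           (unsigned long long)(last - first));
    return 0;
}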

You'll end up with four actionable parameters:
CPU usage for the uncontended case.
Memory footprint for the uncontended case.
CPU overhead for the contended case.
Memory bus usage (or a close proxy of it) in the contended case.

Then the user can choose what to optimize according to his use case.

I'll be glad for any feedback.

Thanks again.


On Wed Nov 26 2014 at 10:43:43 AM Muli Ben-Yehuda <mulix at mulix.org> wrote:

> On Tue, Nov 25, 2014 at 08:56:01PM +0000, Elazar Leibovich wrote:
>
> > The first question I have in mind, is, how do you define a lock
> > benchmark?  Is your goal to minimize overhead? Is your goal to
> > minimize the latency of a successful uncontended acquire? Is your
> > goal to minimize bus load for other CPUs when three CPUs are waiting
> > for the spin lock?
>
> Good questions all, but usually you want to focus on the important
> case first. I would argue that for a spinlock, the most important case
> is an uncontended acquire, since if most acquires are contended, then
> you should redesign the code that uses the lock. The second most
> important case is the contended acquire/contended release.
>
> > What we're measuring is not well defined.
> >
> > Looking at [0] (wow, so little has changed in SMP architectures since
> > 1990!) and [1] gives a few options:
> >
> > 1) Measure the time it takes to perform a critical section *N* times
> > by *n* CPUs concurrently.
>
> This is measuring the contended acquire/contended release.
>
> > 2) Measure overhead. Compare the time it takes a single CPU to do a
> > task *N* times with no locks. Do that with *n* CPUs, where the task is
> > protected by a spin lock. The time it took to do the task with the
> > spinlock, minus the time it took to do it alone, is the overhead.
>
> ... is the overhead for *N* CPUs on this architecture. This is again
> measuring the contended case.
>
> > 3) Measure the overhead of other *m* CPUs doing a different task,
> > while *n* CPUs are contending over a critical section.
>
> This is measuring the interference caused by the bus traffic due to
> the contention. It's interesting, but it's hard to know just from this
> benchmark what is reasonable and what isn't, and how to improve the
> results. A benchmark whose results aren't actionable is not as useful
> as a benchmark where it's immediately obvious what you need to fix.
>
> > Is there an industry standard for benchmarking a spinlock? Something
> > like JS octane. A benchmark which should mimic many real world
> > scenarios?
>
> Not as far as I know. Locks are usually benchmarked using a
> micro-benchmark that performs many acquires/releases in a loop with
> exact details depending on how the lock is likely to be used "in
> anger". Before benchmarking, you should first ask yourself
> (1) how is this lock implementation likely to be used? (2) what are
> the fast-path, common operations, and what are the slow-paths,
> uncommon operations? And then benchmark the fast-path operations in a
> setup that mimics how the implementation is likely to be used.
>
> Cheers,
> Muli
>