About Multi-cores and Multi-tasking
Gilboa Davara
gilboad at gmail.com
Wed Apr 21 12:53:57 IDT 2010
Shimi,
>
> Intel's hyper-threading makes your system THINK you have two cores;
> It's actually one core, capable of doing the same amount of work. If
> your program is multi-process/thread it will perhaps have a slight
> advantage due to this because of how schedulers work (and what's done
> at "idle" cpu time), but on a single process, I believe it only makes
> it worse. I turn it off on Intel machines I manage...
>
> All in all, when HT does manage to add something to performance,
> benchmarks have shown that it's pretty much negligible. On the past
> few years, Intel dumped this "technology" due to that fact (I guess).
> Lately they've returned it (to i* CPUs), and I have no idea why.
>
The HT implementation on Core i5/i7 and Xeon 55xx/56xx/65xx/75xx is -far-
better than its P4 counterpart. The same goes for Atom CPUs.
I'd benchmark my application with and w/o it before deciding on turning
it off.
In my experience (running kernel-based packet inspection software), HT
on Xeon 55xx yields around a 15-20% performance benefit. (YMMV, of course)
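Before benchmarking with and without HT, it helps to confirm what the
kernel actually sees. Below is a minimal Python sketch, assuming a Linux
box that exposes CPU topology under /sys/devices/system/cpu. The helper
names (parse_cpu_list, thread_siblings) are made up for illustration:

```python
def parse_cpu_list(text):
    """Parse the kernel's cpulist format, e.g. "0,4" or "0-3,8"."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def thread_siblings(cpu=0):
    """Return the logical CPUs sharing a physical core with `cpu`.

    Returns None when the topology file is not available (assumption:
    non-Linux system or a kernel that does not expose it).
    """
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    try:
        with open(path) as f:
            return parse_cpu_list(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    sibs = thread_siblings(0)
    if sibs is None:
        print("topology information not available")
    elif len(sibs) > 1:
        print("cpu0 shares its core with:", sibs, "(HT looks active)")
    else:
        print("cpu0 has no sibling threads (HT off or unsupported)")
```

If more than one logical CPU is listed for a core, HT is on; run the
benchmark, disable HT in the BIOS, and run it again to compare.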
> ... and yes, multi-core on one chip certainly counts and sometimes
> even better in terms of performance than multi-CPU - especially if the
> CPU supports a shared cache between the cores and your processes does
> not have affinity to a specific core...)
Actually, as always in life, the picture is far more complicated.
Modern multi-socket capable CPUs (AMD Opteron, Intel Xeons listed
above) are NUMA based. Read: each socket (or even a single module within
an MCM) has its own memory bank, which produces unexpected results:
Compare a 4-year-old Athlon64 3600X2 to a dual-socket, single-core
Opteron 246. (I'm using older CPUs as it's easier to explain my point)
Both the Opteron and Athlon64 use the same basic core design (the
Opteron has a bigger L2 cache, but is forced to use slower registered
memory) and both run at 2 GHz.
In theory, the same multi-thread / multi-process application should
more-or-less perform the same on these two machines - but in reality,
you could design an application that will favor one of the
configurations by a factor of 2!
E.g. (which relates back to the OP's question)
A multi-thread application in which all threads share the same data set
should run faster on the Athlon64 X2 CPU, as the SRI bridge that
connects the cores is far faster than the HyperTransport link (not to
be confused with Hyper-Threading) that connects the two Opteron sockets.
A multi-process/multi-threaded application in which each process uses
its own data set should run faster on the Opteron machine, as each CPU
core has full, uninterrupted access to its adjacent memory bank.
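The "each process uses its own data set" case can be sketched roughly as
follows. This is only an illustration in Python, assuming Linux (for
os.sched_setaffinity) and the kernel's default local allocation policy;
pin_to_cpu, private_worker and run_private are hypothetical names, and
the summing loop stands in for real work:

```python
import os
import time
from multiprocessing import Pool


def pin_to_cpu(cpu):
    """Pin the calling process to one CPU; return True if it was applied."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # not Linux: run unpinned
    try:
        os.sched_setaffinity(0, {cpu})
        return True
    except OSError:
        return False  # CPU not available to this process


def private_worker(args):
    """Pin first, then build and process a data set private to this process."""
    cpu, n = args
    pin_to_cpu(cpu)
    # Allocated after pinning, so with the default policy the pages
    # should land in the memory bank next to this CPU (assumption).
    data = list(range(n))
    start = time.perf_counter()
    total = sum(data)
    return total, time.perf_counter() - start


def run_private(n_per_worker, cpus):
    """Run one pinned worker process per CPU in `cpus`."""
    with Pool(len(cpus)) as pool:
        return pool.map(private_worker, [(c, n_per_worker) for c in cpus])
```

Calling something like run_private(1000000, [0, 1]) from under an
`if __name__ == "__main__":` guard gives one (sum, seconds) pair per
worker. Comparing CPUs on the same socket against CPUs on different
sockets is where the difference should show up, if at all.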
> Of course that when you want to build a million-CPU box, sometimes you
> HAVE to multi-CPU ;) AMD recently announced a 12 core chip that is
> basically what Intel did in Pentium D - two 6 core "glued" together.
> Throw 4 of these on a 4 socket motherboard, and you have a 48-CPU
> supercomputer in one box. Nice, isn't it? :)
Me anxiously waiting to compare a 4S Xeon 75xx to a 4S Opteron 61xx :)
- Gilboa