04 January 2008 - 10:02
Does it Scale?
With thanks to AMD for providing us with a pair of 8-core systems to play with, I've been able to spend some quality time over the past few months doing more detailed profiling of OpenLDAP's concurrency behavior. As a result, OpenLDAP 2.4.7's performance on multiprocessor machines has again improved beyond previous releases. Just around the New Year I was also loaned access to an 8-core Intel server for testing. With all the hype about quad core designs flying around the web, I thought a detailed comparison would be interesting.
First we have the AMD 4P system with 4 Opteron 875 (2.2GHz dual core) processors, 16GB of DDR-333 DIMMs, and Broadcom BCM5704 Gbit ethernet. Next is the AMD 2P system with 2 Opteron 2347 (1.9GHz native quad core) processors, 16GB of DDR2-667 DIMMs, and Nvidia MCP55 Gbit ethernet. And now, thanks to Matt Ezell and the University of Tennessee EECS department, we have an Intel 2P system with 2 Xeon 5345 (2.3GHz MCM quad core) processors, 16GB of DDR2-667 FBDIMMs, and Broadcom BCM5708 Gbit ethernet. All of these machines are running a 2.6.24-rc3 Linux kernel. Both of the quad core systems were running Debian etch, while the AMD dual core system was running FedoraCore 6. The differences in distro are pretty much irrelevant given the identical kernel version.
All of the tests were based on OpenLDAP 2.4.7 with BerkeleyDB 4.6.21 and Google's tcmalloc 0.8. As such, very little of the performance-critical code depended on any other platform libraries. The kernel, OpenLDAP, and BerkeleyDB were custom compiled for each machine. On the AMD 4P system, gcc 4.2.2 was used; on the quad core systems gcc 4.3 was used with "-march=amdfam10" for the Opteron and "-march=core2" for the Xeon.
I initially began testing these machines with back-null and oprofile, to find bottlenecks in slapd's connection management code. (back-null is a very simple slapd backend that simply returns Success responses to any LDAP request. As such, the bulk of processing time measured here is all attributable to the frontend code.) The results of these tests led to further refinements of the Lightweight Dispatcher that was first released experimentally in OpenLDAP 2.3. While there may yet be some areas to explore there, I think these results show that slapd's frontend is now extremely efficient.

With 8 cores the Opteron 2347 yields over 54,000 authentications/second, which generates a network load of around 400,000 packets per second, or over 40MB/second of data. In fact at the 6 core mark the system's ethernet driver is already consuming about 98% of a core all by itself. As such, it's clear that slapd is not the performance limiter on this machine. The Xeon 5345 and Opteron 875 numbers are both indicative of systems that are bandwidth-starved as the number of active cores increases.
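As a rough sanity check on those figures, the arithmetic can be sketched in a few lines of Python. The inputs are the rounded numbers quoted above (the text says "over" 54,000 and "over" 40MB/second), and treating 1MB as 10^6 bytes is an assumption, so the results are ballpark values only.

```python
# Back-of-envelope check of the back-null network load quoted above.
# Inputs are the rounded figures from the text; MB is assumed to mean 10**6 bytes.
AUTHS_PER_SEC = 54_000
PACKETS_PER_SEC = 400_000
BYTES_PER_SEC = 40 * 10**6

packets_per_auth = PACKETS_PER_SEC / AUTHS_PER_SEC
bytes_per_packet = BYTES_PER_SEC / PACKETS_PER_SEC
bytes_per_auth = BYTES_PER_SEC / AUTHS_PER_SEC

print(f"{packets_per_auth:.1f} packets per authentication")
print(f"{bytes_per_packet:.0f} bytes per packet on average")
print(f"{bytes_per_auth:.0f} bytes on the wire per authentication")
```

That works out to roughly 7 packets of about 100 bytes each per authentication, so the load is mostly a large number of small packets, which is consistent with the ethernet driver eating most of a core.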
With concerns about the frontend put to rest, we look at operations against a real database, authenticating against a 1 million entry database in back-hdb.

The first thing you may notice is the dropoff in performance for both of the AMD systems when slapd is running with all 8 cores. This is because, as noted with the back-null results, the ethernet driver consumes a large number of CPU cycles itself. In the tests using 7 cores or fewer, slapd and the ethernet driver are always using different cores so they don't interfere with each other. When the 8th core is added to slapd, however, it's competing with the ethernet driver for CPU cycles, so overall network throughput drops. The only solution for this issue would be to use a more efficient ethernet device or driver. (In both cases the drivers were already tuned for interrupt coalescing, but that only goes so far...)
Another interesting point, which is also evident in the back-null tests, is that the Xeon is consistently faster than the AMD systems when only a single core is in use. That shouldn't be too surprising since it also has the fastest core clock, but what's more interesting is that this advantage immediately disappears as soon as 2 or more cores are in use. The Intel Core2 architecture provides an oversized L2 cache which is shared among pairs of cores; when only a single core is in use the entire cache can be used by that core. When both cores are in use, the cache sharing can help facilitate inter-core communication, but it can also hinder performance as each core competes over what data can be stored in it.
Also, despite a clock speed disadvantage of about 5% (2.2GHz vs. 2.3GHz), the AMD 4P system still outperforms the Xeon system as more cores are used. Both systems are totally outclassed by the AMD 2P Opteron 2347.
More interesting details can be seen in the raw data:
[Raw data table not preserved in this copy.]
With back-null, slapd is clearly not CPU-bound; all of the delay is in the network. With back-hdb the Opteron systems are CPU-bound at all load points. For the Xeon system there's a strange scheduling overhead for 6-7 cores that disappears when all 8 cores are in use. That's more likely a kernel issue than a system architecture issue.
One other test I usually perform is to see how fast slapd can scan its entire DB. I do this by issuing an ldapsearch command against the entire subtree, filtering on an unindexed attribute for a value that does not exist. Searches that aren't indexed necessitate scanning every entry in the database until a match is found, and in this case no match will be found. This is also a handy way to force slapd to load all of the DB into RAM when beginning a benchmark run. Once all of the entries are cached in RAM, this test shows (a) how fast the machine can scan its RAM, i.e., how much memory bandwidth is available, and (b) how fast slapd can test each entry against a search filter. (Note - on a production system, this type of search can really wreak havoc on a server and its entry cache. This is definitely not something you would ordinarily do.)
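For illustration, here is a minimal Python sketch of how such a scan could be issued and timed from a script. The server URI, base DN, attribute name, and filter value are all hypothetical placeholders, and this is not necessarily the exact command used for these tests; the ldapsearch options shown (`-x`, `-LLL`, `-H`, `-b`, `-s`) are standard OpenLDAP client options.

```python
import subprocess
import time


def build_scan_command(uri, base_dn, attr, value):
    """Build an ldapsearch argv for a subtree search on one attribute.

    If `attr` is unindexed and `value` matches nothing, the server has to
    examine every entry under `base_dn`.
    """
    return [
        "ldapsearch",
        "-x",            # simple authentication (anonymous bind here)
        "-LLL",          # terse LDIF output
        "-H", uri,
        "-b", base_dn,
        "-s", "sub",     # search the whole subtree
        f"({attr}={value})",
        "1.1",           # request no attributes
    ]


def time_scan(cmd):
    """Run the command and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=False, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical server, base DN, and attribute; substitute real values.
    cmd = build_scan_command(
        "ldap://localhost", "dc=example,dc=com", "description", "no-such-value"
    )
    print(" ".join(cmd))
    # print(f"{time_scan(cmd):.1f}s")  # uncomment on a machine running slapd
```

The timing call is left commented out so the sketch does not fire a full scan at a server by accident.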
With a single ldapsearch client, the Xeon 5345 is easily the fastest. But again, as soon as 2 or more threads are involved, it is the slowest.
| Clients | Opteron 875 | Opteron 2347 | Xeon 5345 |
|---|---|---|---|
| 1 | 3.1s | 2.9s | 1.6s |
| 2 | 6.7s | 6.5s | 12.6s |
| 3 | 12.1s | 10.5s | 20.5s |
| 4 | 18.1s | 14.9s | 44.6s |
The time listed is the total time for all of the ldapsearch clients (running in parallel) to complete. The clients are running on the same box as the server in this case, and there are no search results, so there's no network overhead. (FYI, the database on disk is about 1GB. Typically the in-memory cached entries are about twice as large as on disk, so that's about 2GB of memory being accessed. So for the Xeon, 1.6s represents about 1.25GB/sec memory bandwidth usage.)
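The same arithmetic can be extended to the rest of the table. The sketch below assumes that every client touches the full ~2GB of cached entries and that the listed time covers all clients finishing; both are assumptions layered on the estimate above, so treat the results as an effective scan rate (including filter evaluation), not a measurement of raw memory bandwidth.

```python
# Scan times in seconds from the table above, keyed by number of parallel clients.
SCAN_TIMES = {
    "Opteron 875":  {1: 3.1, 2: 6.7, 3: 12.1, 4: 18.1},
    "Opteron 2347": {1: 2.9, 2: 6.5, 3: 10.5, 4: 14.9},
    "Xeon 5345":    {1: 1.6, 2: 12.6, 3: 20.5, 4: 44.6},
}

# Assumption from the text: about 2GB of cached entries touched per full scan.
GB_PER_SCAN = 2.0


def aggregate_bandwidth(clients, seconds, gb_per_scan=GB_PER_SCAN):
    """GB/sec across all clients, assuming each one scans the full cache."""
    return clients * gb_per_scan / seconds


for system, times in SCAN_TIMES.items():
    row = ", ".join(
        f"{n} client(s): {aggregate_bandwidth(n, t):.2f} GB/s"
        for n, t in times.items()
    )
    print(f"{system}: {row}")
```

Under this model the Xeon drops from 1.25 GB/s with one client to about 0.18 GB/s with four, while the Opteron 2347 goes from about 0.69 GB/s to about 0.54 GB/s.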
For reference, the AMD 4P system is an AMD Quartet/Celestica A8440. The AMD 2P system is a Supermicro AS-2021M-UR+B. The Intel 2P system is a Dell PowerEdge 2950.
fourteen comments:
Any chance of trying out one of Sun’s Niagara-based systems? The newest 5×20 systems can have up to 64 simultaneous threads, and have two on-die 10GigE controllers. You can borrow kit for sixty days to test out:
http://www.sun.com/tryandbuy/
@David Magda: We tested a first-generation Niagara system a while back. ( http://connexitor.com/blog/pivot/entry.p.. ) It performed OK but a set of dual-core Opterons still outperformed it, and at lower cost too. Despite the fact that Niagaras have more bandwidth to the chip, their low per-core processing speed just doesn’t stack up. We never got anywhere near 32×1GHz threads’ worth of performance out of that system. (And considering that we’ve already wrung excellent performance out of a 64-processor SGI Altix system, I’m pretty sure our software isn’t the performance bottleneck.) The new system sounds good, and 64 hardware threads sounds great, but I seriously doubt that any real-world apps will ever see the performance of an actual 64 1.4GHz threads. But you’re right, the integrated 10Gb ethernet controllers sound like a compelling option. We probably ought to test it for completeness’ sake if nothing else.
Just thought it was worth pointing out that on the Niagara T1 it is not 32 * 1.4 GHz performance but 8 * 1.4 GHz performance, as each group of four threads is competing for the same core.
@Paul Bryan:
Understood. Still, Sun’s marketing literature tends to play it up for more than it is. E.g. http://www.sun.com/processors/UltraSPARC.. page 2 “The computing power of a 64 thread server, now on a single UltraSPARC® T2 chip.” The photo implies you get the power of an extremely large server…
In a few weeks we’ll know; Sun is shipping us a T5120 to try out.
Why not compare against Intel’s 45nm family? Faster FSB and larger cache.
@Tom: Tom, good idea. This was the machine offered to us when Howard put out a general call for an Intel system to benchmark. We have several feelers out towards Intel for access to comparable 45nm systems and hope we get the chance. ... Marty
@Tom:
As Marty already said, the 65nm chips are what we were offered. (See this email thread:
http://www.openldap.org/lists/openldap-d..
http://www.openldap.org/lists/openldap-d.. )
If someone wants to give us access to a comparable 45nm Harpertown system, we’ll be happy to test it.
With that said, it may close the gap but I doubt it will put Intel ahead. The Clovertown system we used here uses the 5000X (Greencreek) chipset; it has a 1333MHz FSB. Upping the FSB from 1333 to 1600 MHz is a 20% increase; it likely will account for less than a 20% boost in overall performance. The 1.9GHz Barcelona is already more than 20% faster than the 2.3GHz Clovertown we tested here at 5-8 cores.
Looking at the performance gains from doubling a 2MB L2 cache to 4MB, increasing the L2 cache by only 50% to 6MB isn’t likely to show more than a 1-2% performance boost here. Given the sizes of the working sets for our processing, I doubt it will have any impact at all.
http://www.xbitlabs.com/articles/cpu/dis..
http://www.tomshardware.com/2007/10/24/d..
But it’s all empty speculation until we actually have a machine to test.
Out of curiosity, were you seeing a comparable interrupt rate on the different systems?
@Matthew Sayler: Sorry, I don’t have that info, I didn’t record it during any of these tests.
Our Sun T5120 has arrived, with Solaris 10 preinstalled. We’ll be testing on that and then installing Linux to test again afterward. Also I’ve been loaned access to an AMD system with 8 Opteron 885s to do some more testing. Our poor little slamd server is going to be very busy for a while…
@hyc: Initial results on the T5120 aren’t so great. Compiled with Sun C 5.x, we’re getting only around 18,000 auths/sec on back-null with 16 slapd threads. (Only 8 threads were used with back-null on the above AMD and Intel systems; the 8 thread result for the T5120 was only around 13,000/sec.) Bumping it up to 24 slapd threads reduced it to only 17,500 auths/sec. This is with no tuning of the OS yet, but still it’s a far cry from the other systems. I think this reflects one of the big problems with the Niagara approach. In general, the more threads a program uses, the more time it loses in scheduling overhead (and the more memory it uses in thread stacks). But with the Niagara, you must use more threads and attempt to gain more parallelism, because each individual core is so slow.
@hyc:
I see impressive results with Niagara in our multithreaded network-close applications. For example a 6×1GHz core T1000 easily outperforms an 8×1.2GHz V880.
Give the Solaris dtrace-based analysis tools a spin. Try ‘plockstat’ for example – you may find that simple corrections in the software will allow the performance to take off with 16 concurrent threads or more.
+1 We tested a first-generation Niagara system a while back.
Old post but still useful; we use Xeon-based processors on our tube servers.