Nehalem EX and RHEL 6.0 KVM – Pushing the scalability envelope for virtualized workloads

In mid-December 2010, IBM contributed two new results to the officially published SPECvirt_sc2010 benchmark roster, in the form of two SUTs powered by Intel Nehalem EX CPUs and running RHEL 6.0 KVM:

  • IBM x3690 X5 SUT – 2 socket Intel X7560 platform, 1 TB RAM, running RHEL 6.0 KVM.  SPECvirt score of 1763.68 w/ 108 VM load.
  • IBM x3850 X5 SUT – 8 socket Intel X7560 platform, 2 TB RAM, running RHEL 6.0 KVM.  SPECvirt score of 5466.56 w/ 336 VM load.

Impressive numbers for both SUTs, but personally I’m most excited about the latter – the 8 socket, 128 logical core, 2 TB RAM monster of a system.  From a benchmarks perspective, the 8-socket x3850 X5 running RHEL 6.0 KVM towers over every other submitted SUT + hypervisor combination by a large margin.  This benchmark result is almost two months old now, and the next closest set of numbers belongs to an ESXi 4.1 Nehalem EX SUT submitted by Bull SAS and VMware literally last week – the Bull SAS ESXi 4.1 SUT posts roughly half the SPECvirt score and half the VM loadout of the IBM x3850 X5 SUT running RHEL 6.0 KVM (5466.56 vs 2721.88 and 336 vs. 168, respectively).

Nehalem EX-powered SUTs aren’t new to the SPECvirt_sc2010 roster – IBM submitted results for an x3690 X5 2-socket, 512 GB RAM Nehalem EX system running RHEL 5.5 KVM back in late summer 2010.  What makes the IBM x3850 X5 RHEL 6.0 KVM SUT so immediately special is that, according to the configuration maximums documentation for VMware ESX 4.1 Update 1 (also released last week), this system represents a hardware configuration + VM load currently unsupported by vSphere 4.1.  Let me break this down in detail:

  • The IBM x3850 X5 SUT is configured with 2 TB of RAM, all of which is addressable by RHEL 6.0 KVM.  vSphere 4.1 Update 1 currently only supports host memory configurations up to 1 TB. (note: VMware’s current best shot on the benchmarks roster, the Bull SAS Nehalem EX SUT, tops out at 512 GB of RAM)
  • The IBM x3850 X5 SUT running RHEL 6.0 KVM was able to scale to a workload of 336 VMs.  vSphere 4.1 Update 1 currently only supports up to 320 VMs per host.
  • RHEL 6.0 KVM is capable of addressing all 128 logical CPUs (64 physical CPU cores) present in the IBM x3850 X5 SUT, and can actually comfortably scale past this.  It’s worth pointing out that for vSphere 4.1, 128 logical CPUs was the maximum supported configuration per host, although this was just raised to 160 logical CPUs per host with vSphere 4.1 Update 1.
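
As a quick sanity check on the bullet points above, the logical-CPU count is just sockets × cores × hardware threads – pure arithmetic from figures quoted in this post:

```python
# Sanity-check the x3850 X5 SUT's headline CPU figures
sockets, cores_per_socket, threads_per_core = 8, 8, 2  # X7560: 8 cores, Hyper-Threading

physical_cores = sockets * cores_per_socket        # 64 physical cores
logical_cpus = physical_cores * threads_per_core   # 128 logical CPUs

print(physical_cores, logical_cpus)  # 64 128
```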

As evidenced by VMware’s move to raise their cap on the number of supported logical CPUs per host with vSphere 4.1 Update 1, it’s likely only a matter of time before we see a vSphere release capable of addressing 2 TB of RAM and supporting VM workloads in excess of 320 per host.  To proclaim that the KVM hypervisor has definitively won the scalability war over the vmkernel would be pure hubris.  In the meantime, though, the IBM x3850 X5 RHEL 6.0 KVM SUT sets a high SPECvirt_sc2010 watermark and demands serious consideration for KVM in today’s enterprise datacenters, where cost, performance, and consolidation ratios on production virtualization infrastructure are all important metrics.

Just as exciting to me is the technology that powers the 8-socket IBM x3850 X5 SUT and what it indicates about the future of virtualization technology and cloud computing.  For those not familiar with the Intel X7560 CPU, it belongs to the Nehalem EX family, an inherently scalable server architecture that is capable of shouldering workloads formerly reserved for very expensive high-performance computing (HPC) platforms.  Besides hardware-level support for Reliability, Availability, and Serviceability (RAS) features that are par for the course in the HPC realm – think hot plug CPU and memory modules, memory mirroring, dynamic isolation of failed server components – the general focus of Nehalem EX is more sockets per server, more cores per socket, and more RAM per server.  Although the 8-core Intel X7560 Nehalem EX CPU is clocked lower than a 6-core Intel X5680 Westmere EP CPU (2.27 GHz vs 3.33 GHz), it has a much larger on-die L3 cache (24 MB vs 12 MB) that lends it more inherent scalability and better clock-for-clock performance with heavy enterprise workloads (e.g. lots of concurrently running virtual machines).  All of this is pure butter for the KVM hypervisor – RHEL 6.0 has inherent support for the RAS features in Nehalem EX (vSphere 4.1 currently does not), and it should be noted that the overwhelming majority of today’s HPC clusters are based on Linux.  We are now at a point where aggregate computing resources and virtualization workloads that were once reserved for specialized HPC hardware can be shouldered by commodity Nehalem EX x86 hardware and marshaled by the RHEL 6.0 KVM hypervisor.
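
The clock-for-clock argument is easier to see on a per-core basis; using only the cache and core counts quoted above:

```python
# Per-core share of on-die L3 cache, from the figures quoted above
x7560_l3_mb, x7560_cores = 24, 8    # Nehalem EX
x5680_l3_mb, x5680_cores = 12, 6    # Westmere EP

print(x7560_l3_mb / x7560_cores)  # 3.0 MB of L3 per core (X7560)
print(x5680_l3_mb / x5680_cores)  # 2.0 MB of L3 per core (X5680)
```

A 50% larger per-core cache share goes a long way toward explaining the better behavior under many concurrent VMs, where cache pressure is constant.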

Nehalem EX servers use a Non-Uniform Memory Access (NUMA) architecture that dedicates local pools of high-speed quad-channel 1066 MHz memory to each CPU socket, reducing the need for multiple sockets to compete for memory access over a relatively slow shared bus.  In addition, the sockets on a Nehalem EX motherboard are interconnected using high-speed QuickPath Interconnect (QPI) links capable of transmitting data at rates as high as 25.6 GB/sec (see diagram below).  Compare this to the previous generation of systems employing the Intel Xeon 7400-series CPUs, where intensive workloads were easily bottlenecked by a single 533 MHz shared Front Side Bus (FSB).  On Nehalem EX systems, these very fast QPI pathways allow large computing workloads to be scheduled across multiple sockets and memory pools without choking the server, while smaller workloads can be kept on individual NUMA nodes (a single socket plus its dedicated local memory pool) to maximize performance.
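
A back-of-the-envelope calculation shows why keeping a workload on its local NUMA node pays off.  Assuming the standard 64-bit (8-byte) DDR3 channel width, the quad-channel 1066 MHz local pool works out to roughly:

```python
# Peak local memory bandwidth per Nehalem EX socket (theoretical, not measured);
# assumes a standard 64-bit DDR3 channel
transfers_per_sec = 1066e6   # "1066 MHz" DDR3 = 1066 MT/s
bytes_per_transfer = 8       # 64-bit channel width
channels = 4                 # quad-channel per socket

per_channel_gbs = transfers_per_sec * bytes_per_transfer / 1e9
per_socket_gbs = per_channel_gbs * channels

print(round(per_channel_gbs, 1))  # 8.5  GB/s per channel
print(round(per_socket_gbs, 1))   # 34.1 GB/s per socket, vs 25.6 GB/s per QPI link
```

So local memory access is meaningfully faster than anything reachable over QPI, but the QPI links are close enough in bandwidth that spilling a large workload across sockets is no longer catastrophic the way a shared FSB was.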

In my last post I gave VMware and HP a hard time for using $4000 16 GB PC3L-10600 ECC DDR3 1333 MHz RDIMM memory modules in their Intel Westmere EP ESX 4.1 SUT to gain a 4% performance lead over an equivalent IBM Westmere EP RHEL 5.5 KVM system.  Direct performance + cost comparisons between Nehalem EX and Westmere EP systems are an apples vs. oranges affair, so I won’t get into a cost breakdown of the two platforms.  That said, I want to firmly establish that the 128 16 GB PC3-8500 CL7 ECC DDR3 1066 MHz LP RDIMM modules used in the 8-socket IBM x3850 X5 RHEL 6.0 KVM SUT do NOT cost $4000 apiece.  Nehalem EX uses a relatively unique memory architecture that moves the Advanced Memory Buffer (AMB) unit from the DIMMs to the motherboard/memory daughtercard (see the heatsink-covered chips in the diagram below) – this decreases the operating temperature of the memory DIMMs and significantly lowers costs by allowing for the use of fairly standard off-the-shelf registered ECC DIMMs.  The 16 GB PC3-8500 CL7 ECC DDR3 1066 MHz LP RDIMM modules used in the x3850 X5 RHEL 6.0 KVM SUT can be had for between $800 and $1000 each – you will still end up paying a pretty penny for 128 of these, but it’s nowhere near the sticker shock you would get from trying to use $4000 DDR3 1333 MHz Westmere EP RDIMMs.
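
Using the street prices quoted above, the rough RAM bill for the 2 TB loadout works out as follows (pure arithmetic, prices are the estimates from this post):

```python
# Rough RAM bill for the 2 TB x3850 X5 SUT: 128 x 16 GB RDIMMs
dimms = 128
nehalem_ex_price_low, nehalem_ex_price_high = 800, 1000  # per 16 GB PC3-8500 RDIMM
westmere_ep_price = 4000                                 # per 16 GB 1333 MHz RDIMM

print(dimms * nehalem_ex_price_low)   # 102400 -> ~$102k
print(dimms * nehalem_ex_price_high)  # 128000 -> ~$128k
print(dimms * westmere_ep_price)      # 512000 -> ~$512k at $4000 apiece
```

Still a pretty penny, but a factor of four or five cheaper than the same capacity built from $4000 Westmere EP RDIMMs.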

One really cool aspect of the Nehalem EX QPI architecture is the ability to stitch multiple servers together using external QPI links to form a much more powerful computing node.  For example, there currently isn’t a “single box” IBM x3850 X5 8-socket configuration – to achieve the configuration observed in the RHEL 6.0 KVM SUT, IBM used a QPI scalability kit (see diagram immediately below) to link two quad-socket x3850 X5 systems (1 TB of RAM each) into an 8-socket compute node with 2 TB of RAM.  It’s worth noting that the Nehalem EX architecture makes it possible to scale past even 8 sockets by employing node controllers (not to be confused with the QPI scalability kit).


It is fairly common for virtualization deployments to follow a top-down deployment model, with the hierarchy broken down into datacenter > cluster > host.  In the end, each virtualization host represents the most basic unit of computing power available to the datacenter – a group of virtual machines can be dedicated to a specific cluster of hosts but in the end a single virtual machine cannot simultaneously span multiple hosts.  What I find so interesting about the Nehalem EX architecture is that its focus on RAS, NUMA, and inter-linked nodes via QPI begins to blur the distinction between individual hosts – we are now moving toward a deployment model where a datacenter can be literally viewed as an aggregate blob of computing resources, from which discrete portions can be dynamically allocated to large numbers of constantly shifting workloads.  RHEL 6.0 KVM has already demonstrated the ability to scale to the resources and workloads presented by Nehalem EX aggregate compute nodes – combine this with inherent support in RHEL 6.0 for scalable technologies like RAS, NUMA, SR-IOV, and QPI along with robust central management tools for public and private-cloud deployments and you have a winner.


SPECvirt_sc2010 VMware ESX 4.1 vs RHEL 5.5 KVM benchmarks: digging a little deeper…

2010 is quickly drawing to a close – both consumer interest in and implementation of virtualized computing infrastructures are high and only continuing to increase going into 2011.  Keeping pace with consumer interest is competition in the hypervisor arena, which is really heating up.  Vendors and customers love to create feature matrices comparing the various hypervisor + central management solutions, but nothing gets competitive blood pumping like good old benchmark scores.  The trouble is that benchmark suites are worth little if they are either not vendor-neutral or unsupported, which until recently described the major virtualization benchmark suites available.

The non-profit Standard Performance Evaluation Corp (SPEC) has stepped up this year and released their vendor-neutral SPECvirt_sc2010 benchmark suite.  According to the press release on SPEC’s web site, “SPECvirt_sc2010 uses a realistic workload and SPEC’s proven performance- and power-measurement methodologies to enable vendors, users and researchers to compare system performance across multiple hardware, virtualization platforms, and applications.”  Great!

In Q3 2010, Red Hat and IBM submitted SPECvirt_sc2010 benchmarks for the Kernel-based Virtual Machine (KVM) hypervisor running on RHEL5.5, but things really got interesting when in Q4 2010 VMware and HP submitted their own set of benchmarks for ESX 4.1.  Check out the benchmark scores for the two hypervisors in the table below – RHEL5.5 KVM got a SPECvirt_sc2010 score of 1169 while ESX 4.1 scored a 1221.

These scores are very close, but doing the math, that comes out to an ESX 4.1 benchmark advantage of 4.44% over RHEL5.5 KVM.  I know that a number of people are going to simply compare these two numbers and jump to the conclusion, “SPECvirt_sc2010 benchmarks show that ESX beats KVM in overall performance,” but I would advise against this.  The purpose of this post is to dig just a bit deeper into both of these benchmarks – that 4.44% advantage isn’t nearly as clear-cut as some might think.
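
For the record, the advantage figure is just the ratio of the two scores (note the exact value is 4.448%, so 4.44% comes from truncating rather than rounding):

```python
# ESX 4.1's benchmark advantage over RHEL5.5 KVM, from the published scores
kvm_score, esx_score = 1169, 1221

advantage = (esx_score - kvm_score) / kvm_score
print(f"{advantage:.2%}")  # 4.45% (4.448% exactly; 4.44% when truncated)
```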

Before we go any further, keep in mind that virtualization places significant stress on the CPU, memory, network, and storage (whether local or high-speed shared) available to the host.  Any significant weakness in one of these four areas can act as a bottleneck and bring down overall performance.

The one thing that remains consistent between the HP and IBM system under test (SUT) servers used as ESX and KVM virtualization hosts, respectively, is the CPU – an Intel Xeon X5680 “Westmere” hexacore chip clocked at 3.33 GHz per core.  With two populated sockets per SUT, that brings us to 12 physical cores, or 24 logical hyper-threaded cores.

The hardware used in benchmarking the two different hypervisors differs from this point onward.  First let’s take a look at the IBM x3650M3 SUT used for the RHEL5.5 KVM benchmark:

Notice the amount and speed of RAM in the SUT – 144 GB clocked at 800 MHz.  It is important to note why the IBM x3650 M3 SUT memory is clocked at 800 MHz when PC3L-10600 ECC DDR3 1333 MHz modules were actually used.  The explanation has to do with the Westmere architecture, so take a look at the diagram below depicting the dual X5680 CPUs, along with each socket’s dedicated triple-channel memory bus.

For the IBM x3650 M3 SUT, all three RDIMM slots in each memory channel feeding into the CPU were populated with 8 GB PC3L-10600 DDR3 1333 MHz memory modules.  That comes out to 24 GB per channel, 72 GB per CPU socket, and 144 GB for the entire system.  This memory configuration provided enough resources for the 72 VM load running on the IBM SUT, but the tradeoff is that when all three RDIMM slots in a channel are populated on this Westmere system, the clock speed of the RAM is limited to 800 MHz.  This may not seem especially significant at the moment, but it is very important to keep in mind the cost of the memory used in the SUT.  I did some quick price lookups and found that the price of a single 8 GB PC3L-10600 DDR3 1333 MHz RDIMM was about $450.  By my rough estimate, the 144 GB required to fully load the IBM SUT used in the RHEL5.5 KVM benchmark would have cost about $8100.
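
Spelling out the arithmetic for the fully populated IBM configuration:

```python
# IBM x3650 M3 memory loadout: every RDIMM slot populated with 8 GB modules
sockets = 2
channels_per_socket = 3
dimms_per_channel = 3
dimm_gb, dimm_price = 8, 450  # ~$450 street price per 8 GB RDIMM

per_channel_gb = dimms_per_channel * dimm_gb          # 24 GB per channel
per_socket_gb = per_channel_gb * channels_per_socket  # 72 GB per socket
total_gb = per_socket_gb * sockets                    # 144 GB total
total_dimms = sockets * channels_per_socket * dimms_per_channel  # 18 RDIMMs

print(total_gb, total_dimms * dimm_price)  # 144 8100
```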

Now let’s take a look at the HP DL380 G7 SUT used for the ESX 4.1 benchmark:

Not only was more RAM used in this server – 192 GB – but it is clocked higher at 1333 MHz.  Now that we have seen the memory configuration for the Westmere IBM SUT and understand why the 1333 MHz RAM was actually clocked at 800 MHz, it is definitely worth pointing out how HP and VMware got the larger memory configuration at the faster clock speed (spoiler: it didn’t come for free).  The HP DL380 G7 SUT uses the same Westmere architecture as the IBM x3650 M3 SUT, so let’s bring up the Westmere CPU socket / memory channel diagram to explain this:

We still have the same three-RDIMM, triple memory channel per CPU socket architecture, but note how one of the three RDIMM slots in each memory channel has been greyed out – instead we have 6 RDIMMs per CPU, and 12 for the whole system.  This is significant because under the Westmere architecture, populating only two of the three RDIMM slots in each DDR3 memory channel allows you to maintain the 1333 MHz native clock speed of the memory modules.  That’s good if you want better memory I/O performance out of your virtualized workloads (although performance does not scale linearly with the 67% higher clock speed), but it limits the memory capacity of your server and thus the guest-to-host consolidation ratio.  VMware and HP got around this in a rather brute-force manner – by using very high-density memory modules.  Twelve quad-ranked, dual-sided 16 GB PC3L-10600 ECC DDR3 1333 MHz RDIMMs, to be exact, at approximately $4000 a pop.  Loading the HP DL380 G7 SUT with the 192 GB of RAM used in the benchmark would have cost on the order of $48,000 – compare that to the 144 GB of RAM in the IBM SUT priced at $8100.  That’s an almost 500% price increase (~$40,000) over the IBM SUT for a 4.44% higher benchmark score, and it speaks volumes.  $40k is a lot of money that could be better spent on shared storage hardware, HBAs, or even additional virtualization hosts.
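
And the same arithmetic for the HP configuration, including the price delta against the IBM loadout priced earlier:

```python
# HP DL380 G7: only 2 of 3 slots per channel populated, but with 16 GB RDIMMs
sockets, channels_per_socket, dimms_per_channel = 2, 3, 2
dimm_gb, dimm_price = 16, 4000  # ~$4000 per high-density 16 GB RDIMM

total_dimms = sockets * channels_per_socket * dimms_per_channel  # 12 RDIMMs
total_gb = total_dimms * dimm_gb                                 # 192 GB
hp_ram_cost = total_dimms * dimm_price                           # $48,000

ibm_ram_cost = 8100  # the 144 GB IBM loadout priced earlier in the post
print(total_gb, hp_ram_cost, hp_ram_cost - ibm_ram_cost)  # 192 48000 39900
print(round((hp_ram_cost - ibm_ram_cost) / ibm_ram_cost * 100))  # 493 (% increase)
```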

…and that’s just the difference in memory between the two SUTs.  The network and shared storage configuration differences between the two sets of benchmarks are worth pointing out because VMware and HP clearly were not shy about pumping more expensive, higher performing hardware into their benchmark setup.

First, let’s review the network interface loadout in the IBM SUT used for the KVM benchmarks:

Not bad by any stretch of the imagination – eight 1 GbE interfaces provide ample room to load-balance and segment guest VM traffic.  But looking at the network interface loadout in the HP SUT reveals that things are even rosier for ESX 4.1:

There are 22 network interfaces available to this host, all of which are at least 1 GbE, and two of which are 10 GbE.  Although 16 of the 22 network interfaces were used in the benchmark, two of those 16 were the 10 GbE interfaces.  Also of note is the 10 GbE adapter used in the DL380 G7 – a dual-port Intel X520.  10 GbE pipeline aside, this NIC has significant hardware enhancements that directly benefit virtualization:

  • 128 dedicated transmit (Tx) and receive (Rx) queues per port, enabling efficient packet prioritization without waiting or buffer overflow
  • VMDq, a feature that offloads data-sorting functionality from the hypervisor to the network silicon, improving data throughput and CPU usage
  • Dedicated Virtual Machine Load Balancing (VMLB) that provides both Tx and Rx traffic load balancing across virtual guests bound to the team interface

Make no mistake, the Intel X520 10 GbE NIC in the DL380 G7 is an awesome piece of hardware designed to complement virtualized workloads on the host.  That and the other fourteen 1 GbE interfaces used in the SUT give ESX 4.1 plenty of room to work with.

The last hardware configuration aspect of the benchmark hardware that I want to highlight is the shared storage.  Let’s first look at the SUT storage for the RHEL5.5 KVM benchmark:

To sum it up, we have a 4 Gb Fibre Channel backbone with 96 15K RPM SAS spindles, spread across eight IBM storage appliances (two DS3400s and six DS3000s).  Not a bad setup at all, but HP and VMware had even better hardware for their SUT shared storage implementation:

Here we have an 8 Gb Fibre Channel backbone with 156 15K RPM SAS spindles, spread across 13 HP StorageWorks appliances (11 MSA2212fc, one P2000).  Keep in mind that both RAID 10 and RAID 5 arrays were used for the HP SUT storage setup, and RAID 10 features a significant write performance advantage over RAID 5 – an advantage that can certainly come into play with disk I/O intensive virtualized workloads.


For me, the take-away from this is that while ESX 4.1 had a 4.44% better SPECvirt_sc2010 score than RHEL5.5 KVM, this advantage likely has more to do with the superior memory, network, and storage configuration of the SUT used by HP and VMware than any inherent performance benefit that could be directly attributed to the vmkernel hypervisor.

Furthermore, a look at the cost breakdown of the components in the IBM and HP SUTs used for the benchmarks reveals the financial lengths taken by HP and VMware to ensure that ESX edged out KVM.  I priced out the hardware used in the IBM System x3650 M3 and HP ProLiant DL380 G7 virtualization hosts alone by doing some quick searches on the Internet, threw everything together in a table, and rounded off the numbers for easy math.  For the sake of simplicity, I subtracted the cost of the initial RAM loadout from the base system price for both the IBM and HP servers – I wanted to reflect only the price of the RAM actually used in the benchmark (you know, the most expensive component of each system).

I have a hard time ignoring how HP and VMware decided to go with a server configuration that cost approximately 250% more (~$40,000) than that used by the IBM x3650 M3. This is all for the sake of a 4.44% benchmark advantage and being able to squeeze an additional 6-VM tile into the ESX guest loadout (78 vs 72 on the RHEL5.5 KVM SUT).  The cost of the DL380 G7’s high-density memory loadout is unrealistic for a production system in this current economic climate and sharply illustrates the exponential cost increase associated with pursuing the highest possible RDIMM memory densities in virtualization hosts – both for the sake of performance and high guest-to-host consolidation ratios.

I also have to point out that the KVM SPECvirt_sc2010 benchmark was done on RHEL 5.5 – RHEL 6 just recently went GA on November 10, 2010 and brings significant performance increases to the table with virtualized workloads (definitely a blog post for the near future).  I would be very interested to see SPECvirt_sc2010 benchmarks for RHEL 6 KVM using BOTH of the SUT configurations.

Now this is a virt / tech blog so I won’t dwell too much more on money here, but cost savings IS a significant driving force in the virtualization of production workloads, and I would be remiss if I concluded without bringing up the obvious cost differential between RHEL 5.5 KVM and ESX 4.1.  Cost considerations certainly extend to the choice of one vendor’s hypervisor + management solution over another – even if one were to take the 4.44% performance benefit of ESX 4.1 over RHEL5.5 KVM at face value, going with VMware over open-source RHEL 5.5 KVM is CERTAINLY more than 4.44% more expensive…
