Nehalem EX and RHEL 6.0 KVM – Pushing the scalability envelope for virtualized workloads

In mid-December 2010, IBM added two new entries to the officially published roster of SPECvirt_sc2010 benchmark results, both SUTs powered by Intel Nehalem EX CPUs and running RHEL 6.0 KVM:

  • IBM x3690 X5 SUT – 2 socket Intel X7560 platform, 1 TB RAM, running RHEL 6.0 KVM.  SPECvirt score of 1763.68 w/ 108 VM load.
  • IBM x3850 X5 SUT – 8 socket Intel X7560 platform, 2 TB RAM, running RHEL 6.0 KVM.  SPECvirt score of 5466.56 w/ 336 VM load.

Impressive numbers for both SUTs, but personally I’m most excited about the latter – the 8 socket, 128 logical core, 2 TB RAM monster of a system.  From a benchmarks perspective, the 8-socket x3850 X5 running RHEL 6.0 KVM towers over every other submitted SUT + hypervisor combination by a large margin.  This benchmark result is almost two months old now and the next closest set of numbers belongs to an ESXi 4.1 Nehalem EX SUT submitted by Bull SAS and VMware literally last week – the Bull SAS ESXi 4.1 SUT posts roughly half the SPECvirt score and half the VM loadout of the IBM x3850 X5 SUT running RHEL 6.0 KVM (5466.56 vs. 2721.88 and 336 vs. 168, respectively).

Nehalem EX-powered SUTs aren’t new to the SPECvirt_sc2010 roster – IBM submitted results for an x3690 X5 2-socket, 512 GB RAM Nehalem EX system running RHEL 5.5 KVM back in late summer 2010.  What makes the IBM x3850 X5 RHEL 6.0 KVM SUT so immediately special is that, according to the configuration maximums documentation for VMware ESX 4.1 Update 1 (also released last week), this system represents a hardware configuration + VM load currently unsupported by vSphere 4.1.  Let me break this down in detail:

  • The IBM x3850 X5 SUT is configured with 2 TB of RAM, all of which is addressable by RHEL 6.0 KVM.  vSphere 4.1 Update 1 currently only supports host memory configurations up to 1 TB. (note: VMware’s current best shot on the benchmarks roster, the Bull SAS Nehalem EX SUT, tops out at 512 GB of RAM)
  • The IBM x3850 X5 SUT running RHEL 6.0 KVM was able to scale to a workload of 336 VMs.  vSphere 4.1 Update 1 currently only supports up to 320 VMs per host.
  • RHEL 6.0 KVM is capable of addressing all 128 logical CPUs (64 physical CPU cores) present in the IBM x3850 X5 SUT, and can comfortably scale past this.  It’s worth pointing out that for vSphere 4.1, 128 logical CPUs was the maximum supported configuration per host, although this was just raised to 160 logical CPUs per host with vSphere 4.1 Update 1.  (A quick way to check what a KVM host actually sees is sketched right after this list.)
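If you want to sanity-check these kinds of limits on your own hardware, here is a minimal sketch (not taken from the benchmark disclosure) that uses the python-libvirt bindings to ask a KVM host how much memory, how many logical CPUs, and how many running guests it actually sees.  The qemu:///system URI and the output formatting are illustrative assumptions.

    # Minimal sketch, assuming the python-libvirt bindings are installed
    # and qemu:///system is reachable.  Not part of the benchmark disclosure.
    import libvirt

    conn = libvirt.open("qemu:///system")

    # getInfo() returns: [cpu model, memory in MiB, logical CPUs, MHz,
    #                     NUMA nodes, sockets per node, cores per socket,
    #                     threads per core]
    model, mem_mib, cpus, mhz, nodes, sockets, cores, threads = conn.getInfo()

    print("Host memory : %.1f GB" % (mem_mib / 1024.0))
    print("Logical CPUs: %d (%d nodes x %d sockets x %d cores x %d threads)" %
          (cpus, nodes, sockets, cores, threads))

    # The SPECvirt run above had 336 guests active at once.
    print("Running VMs : %d" % conn.numOfDomains())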

As evidenced by VMware’s move to raise the cap on the number of supported logical CPUs per host with vSphere 4.1 Update 1, it’s likely only a matter of time before we see a vSphere release capable of addressing 2 TB of RAM and supporting VM workloads in excess of 320 per host.  To proclaim that the KVM hypervisor has definitively won the scalability war over the vmkernel would be pure hubris.  In the meantime though, the IBM x3850 X5 RHEL 6.0 KVM SUT sets a high SPECvirt_sc2010 watermark and demands serious consideration for KVM in today’s enterprise datacenters, where cost, performance, and consolidation ratios on production virtualization infrastructure are all important metrics.

Just as exciting to me is the technology that powers the 8-socket IBM x3850 X5 SUT and what it indicates about the future of virtualization technology and cloud computing.  For those not familiar with the Intel X7560 CPU, it belongs to the Nehalem EX family, an inherently scalable server architecture that is capable of shouldering workloads formerly reserved for very expensive high-performance computing (HPC) platforms.  Besides hardware-level support for Reliability, Availability, and Serviceability (RAS) features that are par for the course in the HPC realm – think hot plug CPU and memory modules, memory mirroring, dynamic isolation of failed server components – the general focus of Nehalem EX is more sockets per server, more cores per socket, and more RAM per server.  Although the 8-core Intel X7560 Nehalem EX CPU is clocked lower than the 6-core Intel X5680 Westmere EP CPU (2.27 GHz vs. 3.33 GHz), it has a much larger on-die L3 cache (24 MB vs. 12 MB) that lends it more inherent scalability and better clock-for-clock performance with heavy enterprise workloads (e.g. lots of concurrently running virtual machines).  All of this is pure butter for the KVM hypervisor – RHEL 6.0 has inherent support for the RAS features in Nehalem EX (vSphere 4.1 currently does not) and it should be noted that the overwhelming majority of today’s HPC clusters are based on Linux.  We are now at a point where aggregate computing resources and virtualization workloads that were once reserved for specialized HPC hardware can be shouldered by commodity Nehalem EX x86 hardware and marshaled by the RHEL 6.0 KVM hypervisor.

Nehalem EX servers use a Non-Uniform Memory Access (NUMA) architecture that dedicates local pools of high speed quad-channel 1066 MHz memory to each CPU socket, reducing the need for multiple sockets to compete for memory access through a relatively slow shared bus.  In addition, each of the sockets on a Nehalem EX motherboard is interconnected using high-speed QuickPath Interconnect (QPI) links that are capable of transmitting data at rates as high as 25.6 GB/sec (see diagram below).  Compare this to the previous generation of systems employing the Intel Xeon 7400-series CPUs, where intensive workloads were easily bottlenecked by a single 533 MHz shared Front Side Bus (FSB).  On Nehalem EX systems, these very fast QPI pathways allow large computing workloads to be scheduled across multiple sockets and memory pools without choking the server, while smaller workloads can be kept on individual NUMA nodes (single socket + dedicated local memory pool) to maximize performance.
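
To make the NUMA point concrete, here is a hedged sketch of how a RHEL 6 KVM host exposes its NUMA nodes through sysfs and how a guest’s vCPUs could be pinned to the logical CPUs of a single node with the python-libvirt bindings.  The guest name and the 16-CPUs-per-node layout are illustrative assumptions, not details from the benchmark configuration.

    # Hedged sketch: inspect NUMA nodes and pin a guest to one of them.
    # Assumes python-libvirt and a guest named "webserver-vm-01" (hypothetical).
    import glob
    import libvirt

    # Each /sys/devices/system/node/nodeN is one socket plus its local memory pool.
    for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        with open(node + "/cpulist") as f:
            print("%s -> CPUs %s" % (node.rsplit("/", 1)[-1], f.read().strip()))

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("webserver-vm-01")

    # Pin vCPU 0 to logical CPUs 0-15 (assumed to be NUMA node 0 here);
    # the cpumap is one boolean per host logical CPU.
    host_cpus = conn.getInfo()[2]
    cpumap = tuple(i < 16 for i in range(host_cpus))
    dom.pinVcpu(0, cpumap)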

In my last post I gave VMware and HP a hard time for using $4000 16 GB PC3L-10600 ECC DDR3 1333 MHz RDIMM memory modules in their Intel Westmere EP ESX 4.1 SUT to gain a 4% performance lead over an equivalent IBM Westmere EP RHEL 5.5 KVM system.  Direct performance + cost comparisons between Nehalem EX and Westmere EP systems are an apples vs. oranges affair, so I won’t get into a cost breakdown of the two platforms.  That said, I want to firmly establish that the 128 16 GB PC3-8500 CL7 ECC DDR3 1066 MHz LP RDIMM modules used in the 8-socket IBM x3850 X5 RHEL 6.0 KVM SUT do NOT cost $4000 apiece.  Nehalem EX uses a relatively unique memory architecture that moves the Advanced Memory Buffer (AMB) unit from the DIMMs to the motherboard/memory daughtercard (see the heatsink-covered chips in the diagram below) – this decreases the operating temperature of the memory DIMMs and significantly decreases cost by allowing the use of fairly standard off-the-shelf registered ECC DIMMs.  The 16 GB PC3-8500 CL7 ECC DDR3 1066 MHz LP RDIMM modules used in the x3850 X5 RHEL 6.0 KVM SUT can be had for between $800 and $1000 each – you will still end up paying a pretty penny for 128 of them, but it’s nowhere near the sticker shock you would get from trying to use $4000 DDR3 1333 MHz Westmere EP RDIMMs.
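
Back-of-the-envelope math under the pricing assumptions above: 128 modules at $800 to $1000 apiece works out to roughly $102,000 to $128,000 for the full 2 TB, versus roughly $512,000 if each of those 128 DIMMs carried the $4000 price tag of the Westmere EP RDIMMs.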

One really cool aspect of the Nehalem EX QPI architecture is the ability to stitch multiple servers together using external QPI links to form a much more powerful computing node.  For example, there currently isn’t a “single box” IBM x3850 X5 8-socket configuration – to achieve the configuration observed in the RHEL 6.0 KVM SUT, IBM used a QPI scalability kit (see diagram immediately below) to link two quad-socket x3850 X5 systems, each with 1 TB of RAM, into a single 8-socket compute node with 2 TB of RAM.  It’s worth noting that the Nehalem EX architecture makes it possible to scale past even 8 sockets by employing node controllers (not to be confused with the QPI scalability kit).

Conclusion:

It is fairly common for virtualization deployments to follow a top-down deployment model, with the hierarchy broken down into datacenter > cluster > host.  Each virtualization host represents the most basic unit of computing power available to the datacenter – a group of virtual machines can be dedicated to a specific cluster of hosts, but in the end a single virtual machine cannot simultaneously span multiple hosts.  What I find so interesting about the Nehalem EX architecture is that its focus on RAS, NUMA, and inter-linked nodes via QPI begins to blur the distinction between individual hosts – we are now moving toward a deployment model where a datacenter can literally be viewed as an aggregate blob of computing resources, from which discrete portions can be dynamically allocated to large numbers of constantly shifting workloads.  RHEL 6.0 KVM has already demonstrated the ability to scale to the resources and workloads presented by Nehalem EX aggregate compute nodes – combine this with inherent support in RHEL 6.0 for scalable technologies like RAS, NUMA, SR-IOV, and QPI, along with robust central management tools for public and private-cloud deployments, and you have a winner.
