I’ve had several folks ask me recently about how to support very large NTFS volumes on vSphere virtualized Windows servers. The current limitation for a VMDK in vSphere 5.1 is 2TB minus 512B. FWIW, a Hyper-V virtual disk in Windows Server 2012 can be up to 64TB. Those asking the question want to support NTFS volumes greater than 2TB for a variety of purposes – Exchange databases, SQL databases, and file shares. Windows (depending on the version and edition) can theoretically support NTFS volumes up to 256TB (depending on cluster size and assuming GPT), with files up to 256TB in size (see http://en.wikipedia.org/wiki/NTFS for more). The way they wanted to work around the limitation was to present multiple 2TB VMDKs to a Windows VM, use Windows Logical Disk Manager (LDM) to convert the VMDKs to dynamic disks (from the Windows default Basic Disk), and then concatenate or span multiple disk partitions into one large NTFS volume. Talk about a monster VM…. The question to me, then, became this: Is using spanned dynamic disks on multiple VMDKs a good idea? Here are some of my thoughts on the question.
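To make the sizing figures in the introduction concrete, here is a rough Python sketch of the arithmetic. It assumes the commonly documented NTFS implementation limit of 2^32 - 1 clusters per volume and treats the "TB" figures as binary terabytes; the function names are made up for illustration, and partition and metadata overhead are ignored.

```python
TIB = 2**40                      # "TB" figures here are treated as binary terabytes
VMDK_MAX_BYTES = 2 * TIB - 512   # vSphere 5.1 VMDK limit: 2TB minus 512B
NTFS_MAX_CLUSTERS = 2**32 - 1    # assumed cluster-count limit of the Windows NTFS implementation


def ntfs_max_volume_bytes(cluster_size_bytes):
    """Largest NTFS volume for a given cluster size, under the assumed cluster limit."""
    return NTFS_MAX_CLUSTERS * cluster_size_bytes


def vmdks_needed(target_bytes, vmdk_bytes=VMDK_MAX_BYTES):
    """Number of maximum-size VMDKs a spanned volume of target_bytes would need."""
    return -(-target_bytes // vmdk_bytes)  # integer ceiling division


if __name__ == "__main__":
    for cluster_kb in (4, 8, 16, 32, 64):
        size_tib = ntfs_max_volume_bytes(cluster_kb * 1024) / TIB
        print(f"{cluster_kb:>2}KB clusters: about {size_tib:.0f}TB maximum volume")
    print("VMDKs needed for a 10TB spanned volume:", vmdks_needed(10 * TIB))
```

Note that because each VMDK falls 512 bytes short of a full 2TB, a nominal 10TB spanned volume needs six VMDKs rather than five.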
First, there are no right or wrong answers. How you choose to support big data/large disk requirements will be a mix of preference, manageability, performance, recoverability, and fault domain considerations. These considerations will be at a few different levels – storage array, VMware, Guest OS, guest Application, and backup systems. A large spanned Windows volume can offer some simplified management – you might not have to worry as much about running out of space, or junior engineers having to think about where to place data in the guest OS. I tend to avoid using LVM/Spanned Windows Dynamic Disks within VM’s when possible for a variety of reasons – here are some of my considerations (for a variety of systems – Exchange, SQL, file servers, etc.):
- Some applications, such as Microsoft SQL, can benefit from having more, smaller disks with multiple files in a database filegroup. Having different files on different disks, on different vSCSI controllers, can increase SQL’s ability to do asynchronous parallel IO. Microsoft’s recommendation for SQL (http://technet.microsoft.com/library/Cc966534) is to have between 0.25 and 1 data files per filegroup per core, with each file on a different drive/LUN. So an 8 vCPU SQL server would have between 2 and 8 .mdf/.ndf files on an equal number of drives. This lends itself to more, smaller VMDK’s that are not striped or spanned by Windows. This requires a bit of design work within the database, optimizing your table and index structures to span multiple files in a filegroup.
- Smaller, purpose built files/volumes/LUNs can be placed on the right storage tier with the best caching mechanism (e.g. SQL log volumes placed on RAID1/0 with more write cache availability).
- A single volume may have a limited queue depth. You’ll probably increase queuing capabilities as you scale out the number of VMDK’s, and Windows will be able to drive more IO as additional disk channels are opened up.
- A greater number of virtual disks spread over different VMFS datastores may increase the number of paths used to service the workload. This may allow for increased storage bandwidth, more in-path cache, and more storage processor efficiency.
- By using Dynamic Disk striping, spanning or software RAID within the guest, you are introducing an extra layer of complexity that you will need to keep in mind while performing operations on the VM/VMDK. A storage operation on an array, LUN, VMFS datastore, or VMDK within your guest-striped volume could take the whole volume down.
- Having smaller, purpose-built VMDK’s allows you to move specific parts of your workload to a physical storage tier that best suits it. Putting everything into one monolithic volume doesn’t allow this level of granularity. For example, I might create a smaller Exchange mailbox database and put executives’ mailboxes in it. I would then place the mailbox database in a VMDK on a VMFS on a LUN on a high tier or replicated disk (great use case for VASA Profile Driven Storage BTW). The interns’ mailbox database would be placed on the lowest tier of non-replicated storage. This configuration would also lend itself to more targeted and efficient backup schemes.
- This Microsoft TechNet article for Exchange 2013 storage architecture (http://technet.microsoft.com/en-us/library/ee832792.aspx) suggests that using GPT Basic Disks is a best practice, although Dynamic disks are supported. Conversely, you could deduce that using spanned dynamic disks is not best practice. The TechNet article also recommends keeping your Exchange mailbox databases (MDB) under 200GB, so there’s no need for a VMDK over 2TB if you’re following best practices.
- The spanned dynamic disk configuration adds an additional layer of complexity and introduces another thing that can fail in the environment. I’m a big fan of reducing complexity and design elements.
- Splitting your workload out onto different databases, Windows Volumes, VMDK’s, VMFS, LUNs creates smaller fault domains. A failure of any one of these components would take the entire system out if you placed everything into a single large guest-striped volume. If you split everything out, a failure in one of the components would be less likely to affect a large population of users or the entire functionality of an application. Take the Exchange example again – a failure of one LUN/VMFS/VMDK/NTFS/Exchange MDB would not down email for your entire user base if you are following best practices of keeping MDBs under 200GB and distributing MDBs on different logical or even physical layers.
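The SQL guideline in the list above (0.25 to 1 data files per filegroup per core, each on its own disk) can be sketched as a small calculation. This is only an illustration: the function names and file names are hypothetical, and the round-robin placement across virtual SCSI controllers is one possible layout that assumes the usual limit of four such controllers per VM.

```python
import math

MAX_VSCSI_CONTROLLERS = 4  # assumed per-VM limit on virtual SCSI controllers


def data_file_range(cores, low_ratio=0.25, high_ratio=1.0):
    """Return (min, max) data files per filegroup for a given core count."""
    low = max(1, math.ceil(cores * low_ratio))
    high = max(low, math.floor(cores * high_ratio))
    return low, high


def plan_layout(file_count, controllers=MAX_VSCSI_CONTROLLERS):
    """Give each data file its own disk, spread round-robin across controllers."""
    layout = []
    for i in range(file_count):
        ext = "mdf" if i == 0 else "ndf"  # first file is the primary data file
        layout.append({
            "file": f"data_{i + 1:02d}.{ext}",  # hypothetical file name
            "disk": i + 1,                      # one VMDK per data file
            "controller": i % controllers,      # vSCSI controller index
        })
    return layout


if __name__ == "__main__":
    low, high = data_file_range(8)
    print(f"8 vCPUs: between {low} and {high} data files per filegroup")
    for entry in plan_layout(high):
        print(entry)
```

The point of the layout is that no single disk or controller queue carries all of the database IO, which is the opposite of what a spanned volume gives you.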
Backup and Disaster Recovery
- Backing up and restoring smaller files, filegroups, or virtual disks (depending on which levels you do backup – within app, guest-OS agent based, or VM based) is faster. This leads to more efficient backups, shorter backup windows, less downtime on restores, etc.
- Think about that Executive mailbox database – do you really want to tell your CEO that he has to wait while mailboxes for the interns and former employees are restored in your massive LUN/VMFS/VMDK/NTFS/MDB? No – you want to be able to target that MDB for rapid recovery and more frequent backups (and why is it not in a DAG???).
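As a back-of-the-envelope illustration of the restore-time argument, the sketch below compares a 200GB mailbox database with a 2TB monolithic volume. The 100 MB/s restore rate is purely an assumption chosen for illustration; substitute a rate measured in your own environment.

```python
def restore_hours(size_gb, throughput_mb_per_s):
    """Hours needed to restore size_gb at a sustained throughput (binary units)."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = size_gb * 1024 / throughput_mb_per_s
    return seconds / 3600


ASSUMED_THROUGHPUT = 100  # MB/s, hypothetical; measure your own restore rate

if __name__ == "__main__":
    cases = [("200GB executive MDB", 200), ("2TB monolithic volume", 2048)]
    for label, size_gb in cases:
        hours = restore_hours(size_gb, ASSUMED_THROUGHPUT)
        print(f"{label}: about {hours:.1f} hours")
```

Whatever the real throughput is, restore time scales linearly with size, so the targeted database comes back roughly ten times sooner than the monolithic volume.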
Volume Maintenance and Management
- Consider volume maintenance tasks like CHKDSK. The larger the volume, the longer CHKDSK will run to find and correct errors or corruption in the file system. I’d much rather get a CHKDSK done faster, especially if the CHKDSK needs to force the volume offline!
- If/When we get >2TB VMDK support, think about the time necessary to do a storage vMotion of a massive virtual disk. And I hope you don’t experience some sort of failure in-flight, or you’ll be starting again from scratch.
Personal Preference / Experience
- I’ve been burned by striped/spanned Windows dynamic disks in the past. Those experiences led to long, sleepless weeks. It was sometime back around Windows 2000 or 2003, and I’m sure the technology is more stable now. But… once bitten, twice shy.
Design Architecture Decisions & Final Thoughts
- Design your apps/data today to fit comfortably within that 2TB size. You can grow a VMDK up to 2TB (for now – bigger VMDK’s are on the horizon), and dynamically grow the volume within the guest to fill that space if needed (or just thin provision up front and be done with it). As your data grows in the future, so too will vSphere’s capabilities when it comes to VMDK size. Don’t handicap yourself today for a perceived future need.
- The same principles are true for VMFS sizing. Just because you can make a 64TB VMFS datastore does not mean that you should. Consider a few smaller datastores, with their backing LUNs spread over storage array controllers and storage paths. Use Storage DRS and storage profiles to manage placement of VMDKs on the best datastore based on capacity and performance.
- Even on VMFS5, where VAAI ATS has eliminated SCSI reservations, there are limits to the total size of VMDK files a single host can address, based on VMFS Heap Size. Without recently released updates, the maximum VMFS Heap Size on vSphere 5.0 was 256MB, allowing only 25-30TB of open VMDK files per host. With the patch, we can address 64TB of open files. If we’re talking about very large volumes within a guest OS, it is possible that you could be bumping into this limit. RDMs don’t count against the VMFS Heap Size limits.
- There may be times where a single large guest volume is needed. In those cases you have to weigh the tradeoffs. RDMs are often a good choice for > 2TB volumes. Virtual Mode RDM’s can be converted to VMDK with an online storage vMotion once vSphere supports VMDK’s with sizes greater than 2TB, but Virtual Mode RDM’s also have a 2TB limit. So that leaves you with physical mode RDMs and the limitations that they present. Another option would be to present a LUN via iSCSI to the guest OS, and use the Windows iSCSI initiator to mount the LUN.
- For file shares, consider DFS. You could create several smaller file servers and/or shares, and present them in a single DFS namespace. DFS would make future server upgrades and migrations very simple.
- You might also consider using NTFS mount points within the guest for your large volume requirements (http://technet.microsoft.com/en-us/library/cc753321.aspx). Mount points could be the best of both worlds, as they provide single drive letter access to multiple independent disks without the risk of spanned/striped software RAID (although they may introduce another layer of complexity).
- NAS storage like EMC Isilon Scale-out NAS or VNX Unified Arrays can serve up to 16TB file systems over SMB 3.0. Placing the data on a storage array may offer you additional options for array-level snapshots, replication and deduplication, without the overhead of a Windows server.
- The customers who asked these questions were all still working with Windows Server 2008 R2. Windows Server 2012 is starting to change how I’m approaching this question. For example, CHKDSK has been rewritten to support much larger volumes. SMB 3.0 and Scale-Out File Server offer some pretty neat ways of dealing with bigger data sets. Data Deduplication is built into NTFS in Server 2012. Resilient File System (ReFS), with support for volumes of up to 2^78 bytes with 16KB clusters (302,231,454,903,657,293,676,544 bytes), provides for greater than petabyte scale data sets. Storage Spaces allows for stupid simple expansion of storage space on JBOD without the complexity of traditional storage arrays; combined with Cluster Shared Volumes (CSV) you get clustered, failover storage for anything from simple file shares to SQL Server 2012 Parallel Data Warehouse for big data analytics.
- Find your happy place. I subscribe to the ‘everything in moderation’ maxim for my life, and this extends into my technical architectures. Just as I avoid a single ‘beefy’ server for a vSphere ESXi host (and two hosts does not a vSphere cluster make either), I also avoid having a ton of tiny servers. I shoot for a moderate number of sensibly apportioned hosts. The same is true for the number of disks/datastores I design. One big volume and you’ve got all your eggs in one basket. Conversely, if you split your workload up to the Nth degree, you will increase complexity and the number of managed elements in your environment. Find a sensible middle ground that supports your performance and manageability objectives, while not introducing undue complexity.
- I’m happy to hear your thoughts on this topic – leave a comment below.