I have been pulling my hair out with a small VI3 implementation running against an IBM DS3300 iSCSI array.  Performance, for lack of a better term, sucked.  Granted, the DS3300 is not an enterprise-level workhorse of a storage system, but it fit the budget.  Read performance from the array was decent, but write performance was terrible, maxing out at 10Mbps of throughput with insanely high latencies on long writes when the system was under load.  This led to some long P2V operations, poor guest performance, and some questions from the project sponsors on why I couldn’t make the environment sing.

The system was configured with a single controller with dual GigE NICs.  The controller had 512MB of battery-backed cache (there is also a 1GB cache upgrade option available).  I wrote off some of the poor performance to a single controller with a less-than-optimal amount of cache; blamed the SAS controller to SATA disk command translation overhead; cringed at the 6 disk RAID5 configuration; and engaged in some self-doubt.  I convinced the powers that be that we were IO constrained and got some funds to fill out the 3U chassis to a full 12 SATA disks, and reconfigured the array as a RAID10.  Performance gains were almost unnoticeable with these changes.

In addition, I did some basic troubleshooting of the network environment: verifying multiple paths to the storage, setting Flow Control on the switches to receive only, and double-checking my iSCSI initiator settings.  Note: The DS3300 is only supported with the ESX software initiator.

I found documentation on the DS3300 to be lacking, but did discover that the Dell MD3000i is based on the same LSI Engenio array.  Some Googling on the Dell solution led me to the ‘SMcli’ command line interface for both arrays.  The commands are slightly different for the Dell and IBM.  The links to the IBM CLI documentation were broken, so I had to do a bit of trial and error to get the commands right.  I used the Dell documentation as a starting point.  (Rant: Seriously, IBM?  Can you make your documentation any harder to get through?  Is it a Redbook, is it an Engineering Whitepaper, is it a support document, is it a case study?  And why can I only find these with complex Google searches, not on your own product pages, and why can’t you name your documents intelligently, not with some random string of characters?)

Moving on… I received an automated alert from the DS3300 about an incomplete battery learn cycle.  Using the IBM Storage Manager GUI I generated a ‘Storage Subsystem Profile’ from the Support tab to check the battery status.  In the profile I discovered that while write cache was enabled, it had a status of “Enabled (Suspended)”.  Ah ha!  Now I had some decent Google material, which led me to this: http://communities.vmware.com/thread/195838.  Hot damn I love the VMware Community Forums!
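If you save the profile to a text file, a check like this can be scripted.  The following is only a minimal sketch: the sample text and the `find_suspended_write_cache` helper are hypothetical, and the exact layout of a real Storage Subsystem Profile is an assumption.  The only detail taken from the array is the “Enabled (Suspended)” status wording.

```python
import re

# Hypothetical excerpt. The real profile layout is an assumption; only the
# "Enabled (Suspended)" status wording comes from the profile described above.
SAMPLE_PROFILE = """\
LOGICAL DRIVE NAME: vmfs_lun0
   Read cache:  Enabled
   Write cache: Enabled (Suspended)
LOGICAL DRIVE NAME: vmfs_lun1
   Read cache:  Enabled
   Write cache: Enabled
"""

# Matches lines such as "   Write cache: Enabled (Suspended)"
WRITE_CACHE_LINE = re.compile(r"^\s*Write cache:\s*(?P<status>.+?)\s*$", re.IGNORECASE)


def find_suspended_write_cache(profile_text):
    """Return (line_number, status) for each write-cache line marked suspended."""
    hits = []
    for line_number, line in enumerate(profile_text.splitlines(), start=1):
        match = WRITE_CACHE_LINE.match(line)
        if match and "suspended" in match.group("status").lower():
            hits.append((line_number, match.group("status")))
    return hits


if __name__ == "__main__":
    for line_number, status in find_suspended_write_cache(SAMPLE_PROFILE):
        print(f"line {line_number}: write cache is {status}")
```

The match is deliberately loose (any status containing "suspended"), since the precise wording may differ between firmware levels and between the IBM and Dell versions of the tool.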

It turns out that in a single-controller configuration the setting for cache mirroring remains enabled by default.  Because there is no 2nd controller to mirror to, the array suspends write caching.  This is probably a safety measure: with no redundant controller, data in cache is at risk should the only controller fail.  I weighed my options and decided that fixing the poor performance outweighed the HA concerns, so I disabled cache mirroring on the array, which lets write caching resume, using this command:

c:\program files\ibm_ds4000\client>smcli -n <ARRAYNAME> -c "set allLogicalDrives mirrorEnabled=false;"

And then followed with this for good measure:

c:\program files\ibm_ds4000\client>smcli -n <ARRAYNAME> -p <arraypassword> -c "set allLogicalDrives writeCacheEnabled=true;"
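If you would rather script these two steps than type them, here is a rough sketch of building the argument lists in Python.  It uses only the `-n`, `-p` and `-c` options and the two script strings shown above; the `build_smcli_args` helper, the array name `my-ds3300`, and the assumption that `smcli` is on the PATH are mine.  It only prints the commands rather than running them.

```python
def build_smcli_args(array_name, script, password=None):
    """Build an smcli argument list using only the -n, -p and -c options."""
    args = ["smcli", "-n", array_name]
    if password is not None:
        args += ["-p", password]
    # The script is passed as a single list element, so no shell quoting
    # (and no curly "smart quotes" from a web page) gets in the way.
    args += ["-c", script]
    return args


# The two script commands from this post, in the order they were run.
CACHE_FIX_SCRIPTS = [
    "set allLogicalDrives mirrorEnabled=false;",
    "set allLogicalDrives writeCacheEnabled=true;",
]

if __name__ == "__main__":
    for script in CACHE_FIX_SCRIPTS:
        # "my-ds3300" is a placeholder array name.
        args = build_smcli_args("my-ds3300", script)
        # Dry run only. To execute, hand the list to subprocess.run(args, check=True).
        print(args)
```

Keeping this as a dry run by default seems prudent, given that these commands change caching behavior on every logical drive.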

The results were immediately noticeable:

[Screenshot: DS3300 Performance Improvement when Write Cache is Enabled]

The screenshot is from Veeam Monitor Free Edition, taken during 4 concurrent V2V operations from Hyper-V to VMware.  With the write cache fully functional, disk usage peaked at 54MBps, latency dropped to about 6ms, and my blood pressure dropped a few notches.

While poking around the CLI I also found that you can dump performance stats from the array (performance data is otherwise hard to find on the thing) using this command:

C:\Program Files\IBM_DS4000\client>smcli -n <ARRAYNAME> -c "set session performanceMonitorInterval=5 performanceMonitorIterations=120;save storageSubsystem performanceStats file=\"c:\\ds3300perfstats.csv\";"

This will give you a 10 minute record of performance from the array (120 samples at 5 second intervals) which you can analyze using Excel.  The Dell Enterprise Center TechCenter Wiki has a great write-up, complete with a YouTube video that walks you through the process, on how to efficiently analyze the data from this command: http://www.delltechcenter.com/page/MD3000i+Performance+Monitoring
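If Excel is not handy, the same kind of summary can be sketched in Python.  Treat this as an illustration under assumptions: the column names `Devices` and `Current KB/second` and the sample rows are hypothetical, and a real stats file may carry extra preamble lines that need trimming before it parses this cleanly.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample. The column names and rows are assumptions, not a
# verified copy of what the array writes to the stats file.
SAMPLE_CSV = """\
Devices,Current KB/second,Current IO/second
Logical Drive vmfs_lun0,10240.0,320.0
Logical Drive vmfs_lun0,20480.0,640.0
Logical Drive vmfs_lun1,5120.0,100.0
"""


def capture_seconds(interval_s, iterations):
    """Total capture length: polling interval times number of iterations."""
    return interval_s * iterations


def summarize(csv_file, device_col="Devices", kb_col="Current KB/second"):
    """Return sample count plus average and peak MB/s for each device."""
    samples = defaultdict(list)
    for row in csv.DictReader(csv_file):
        samples[row[device_col]].append(float(row[kb_col]))
    return {
        device: {
            "samples": len(values),
            "avg_mb_s": sum(values) / len(values) / 1024,
            "peak_mb_s": max(values) / 1024,
        }
        for device, values in samples.items()
    }


if __name__ == "__main__":
    # 5 second interval x 120 iterations = 600 seconds, i.e. 10 minutes
    print(capture_seconds(5, 120), "seconds of data")
    for device, stats in summarize(io.StringIO(SAMPLE_CSV)).items():
        print(device, stats)
```

To run it against a real capture, pass an open file instead of the `io.StringIO` sample and adjust the column names to whatever header row your file actually contains.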

I am beginning to think that the DS3300 (and MD3000i) may actually be a viable starter solution for SMBs starting out on a virtualization project.  But I would recommend the cache upgrade, a 2nd controller, SAS disks instead of SATA to eliminate the SAS-to-SATA translation overhead, and more, faster disks instead of fewer, slower disks so you can drive throughput and IOPS to a higher level.
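To put rough numbers on the "more, faster disks" advice, here is a back-of-envelope sketch using the common RAID write-penalty rule of thumb (4 back-end IOs per write for RAID5, 2 for RAID10).  The per-disk IOPS figures and the 50/50 read/write mix are generic planning assumptions, not measurements from my array, and the estimate ignores controller cache entirely.

```python
# Rule-of-thumb back-end IOs generated per host write.
WRITE_PENALTY = {"RAID5": 4, "RAID10": 2}

# Rough planning figures per spindle. These are assumptions, not measurements.
ASSUMED_DISK_IOPS = {"7.2k SATA": 75, "15k SAS": 175}


def estimated_host_iops(disk_count, per_disk_iops, raid_level, write_fraction):
    """Estimate host-visible IOPS for a disk group, ignoring controller cache."""
    raw_iops = disk_count * per_disk_iops
    read_fraction = 1.0 - write_fraction
    # Each read costs one back-end IO; each write costs `penalty` back-end IOs.
    cost_per_host_io = read_fraction + write_fraction * WRITE_PENALTY[raid_level]
    return raw_iops / cost_per_host_io


if __name__ == "__main__":
    layouts = [
        (6, "7.2k SATA", "RAID5"),
        (12, "7.2k SATA", "RAID10"),
        (12, "15k SAS", "RAID10"),
    ]
    for disk_count, disk_type, raid_level in layouts:
        iops = estimated_host_iops(
            disk_count, ASSUMED_DISK_IOPS[disk_type], raid_level, write_fraction=0.5
        )
        print(f"{disk_count} x {disk_type} {raid_level}: ~{iops:.0f} host IOPS")
```

An estimate like this only describes the spindles.  As I found out, a suspended write cache can mask any gain from adding disks or changing RAID level.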

Have any of you deployed the DS3300 or MD3000i (or the generic LSI solution)?  Do you have any performance tuning tips for these arrays?  If so, share in the comments!



13 Responses to “IBM DS3300 iSCSI Write Performance Solved”

  • Hi Joshua! You wrote a perfect article! It has very much helped me.

  • Hi,
    thanks for your great article,
    but what about performance problems with dual controllers?

    • @Switchgott – There are several areas for you to consider in your troubleshooting:

      1.) Have you reached the max performance of your unit and workload? That is, with your current disk type, disk count, and RAID configuration, have you reached max load with your application profile (random read, sequential write, etc.)? If so, consider changing disk type, adding disks, or changing RAID type to better match your requirements. Use the SMCLI to capture performance stats and compare what you see to industry standard published numbers for IOPS under specific workloads.

      2.) Write cache could still be disabled on your dual-controller unit. Use the SMCLI to determine if write caching is suspended – this could happen with an incomplete battery learn cycle, for example.

      3.) Do you have a configuration error? Perhaps Jumbo Frames are enabled on the array, but not through the rest of the architecture (network switches, servers, etc.). Poor quality switches, oversubscribed switches, incorrect flow control settings, poor quality iSCSI initiators, multi-pathing errors, etc. could all cause problems on your system.

      I hope this helps – feel free to post back a reply if you have more questions and I’ll do my best to help.

      Josh

  • Do you set flow control on the switch, vSphere 4, and the IBM DS3300?

    Thanks,
    Mark

    • Mark – I’m working from memory here, but as I recall, best practice is to set flow control to Rx only on the switch. The DS3300 should detect that the switch is receiving and will auto-set to Tx (and I think Rx but I couldn’t find the documentation on this tonight). Flow control on ESX should be auto-negotiated by default, so it should also transmit. Hope this helps. Feel free to post back with additional questions or any answers that you dig up in your own research.

      Josh

  • Andy:

    You’re gambling with your data integrity if you enable write cache on a single controller model this way. If any component fails on the single controller, you’ve told the OS that the data has been committed but it’s only in cache. That’s OK if the cache can be moved to a replacement controller without disconnecting the battery, since the replacement controller can then commit it to the disks, but the DS3300 cache can’t be transported with the battery attached AFAIK.

    I agree about IBM’s documentation; you can only find the info if you call it a FAStT rather than DS3300.

  • As an ex-IBMer and current IBM Business Partner I would have to agree with you on the documentation aspect of finding things within IBM.com. Thought I might share this link with you which I often share with my customers and others which can make it a bit easier to find what you are looking for within IBM.

    http://www.ibmquicklinks.com/

    The IBM website is just sooo huge and the search functionality very hit or miss, but this is a good aggregation of main links.
    Also, I’d recommend contacting a skilled IBM Business Partner or two when you run into issues like this, as they are usually more eager to consult customers like yourselves and help them through issues which they’ve more than likely run into multiple times in the past and may have an easy answer to based on experience. We often do this for no fee to show our value add and hopefully earn your business in the future.
    While trolling the net to try and find an answer may be fun at times, it’s probably not the most efficient use of your time. Let me do the trolling for you, if I don’t have the answer already : )

  • Firstly this is a great article! I have followed your article and executed the commands via the SMCli. I ran SQLIO tests before and after and actually noticed quite a difference. My question is: if I have only 1 controller (512MB cache) on my DS3300 and I lose that controller, i.e. it fails, how can the cache be written to the disk if the controller is unavailable? I am finding it hard to see how having the setting at the default is that much of a risk? I have redundancy at the RAID level which is fine, but surely just purchasing the one controller exposes you to some risk anyhow? From what you have written here, because there is no 2nd controller to mirror to, the array suspends write caching; well, that’s obvious?

    I’m failing to see why you would want to enable mirrorenabled when you have no second controller?

    Cheers

    • Jerry -

      Good questions! RAID will only protect the data that is already on disk, not data that is in the cache waiting to be written to disk. The risk comes in when you are writing some changes to disk (say to a SQL DB) and only some of the blocks are written to disk while the remaining writes are still in cache. If the single controller dies, the data in cache does not get flushed to disk and the SQL DB is inconsistent/corrupt. The controllers in the DS3300 have battery-backed cache, so a power loss to the array should not be an issue, as the contents of the cache will be held as long as the battery can support and written to disk once power is restored and disks are spinning. The big risk is a catastrophic failure of the controller, but in that case you probably have bigger issues to worry about (like rebuilding your RAID sets) and/or a DR situation.

      In a test/dev environment, I personally see no problem with disabling cache mirroring to enable write caching. In a production environment, I would think twice before accepting the risk, make sure I have good monitoring, and argue long and hard for a second controller for proper redundancy.

  • Josh,

    This simply saved my life! When I checked my MD3000i settings I realized that some LUNs had the status below:

    Read cache: Enabled
    Write cache: Enabled (currently suspended)

    I had to set writeCacheEnabled to FALSE before setting to TRUE. The performance changed dramatically on the fly. I had a heavy Oracle write operation going on and svctm (iostat -xm 2) dropped from 20ms to near 0ms. Bandwidth went from 79Mbps to an avg of 150Mbps. The overall application performance is now at least 10 times better.

    Cheers,

    Leandro Cruz

  • [...] of my most popular posts to date had been IBM DS3300 Write Performance Problem Solved.  I am pleased to have upgraded my internal environment to an EMC Clariion CX4 array, but still [...]


About Me


Hello, and thank you for visiting VMtoday. My name is Josh Townsend. I am a technology professional with a strong background in VMware Virtualization, Storage, and Microsoft technologies. I am a Sr. Systems Administrator at Tiber Creek Consulting in Fairfax, VA, and hold several technical certifications, including VMware Certified Professional. I am also a 2010 VMware vExpert.


I am also leader of the Washington DC Metro Area VMware User Group (VMUG).


The opinions expressed on this site are my own and may not reflect the views of my employer, VMware, or any other party unless otherwise stated.

Please feel free to follow me on Twitter
@joshuatownsend
