Oracle & Sun - What to do with the hardware business

The questions are going to continue here until Oracle officially owns Sun and perhaps beyond. Will Oracle sell the Sun hardware business? As I have said in the past, I do not think they will. I could certainly be wrong and many industry analysts think I am. Here are a few new data points to think about:

  1. The rumor mill continues to churn and is suggesting that HP may want to purchase the Sun hardware business. HP has the cash, but does the investment make sense? Would Oracle sell Solaris as well? HP would be in a tough position if they bought the hardware but Oracle still owned Solaris. Interestingly, in the article, CNNMoney identifies Mark Hurd of HP as the unnamed "Party B" in the Sun regulatory filings.
  2. Oracle ran this front page ad in the Wall Street Journal today promoting Oracle DB on Sun SPARC. Is this just Oracle bluffing? Perhaps.
  3. If Oracle wants to be in the appliance space, I believe they need to sell general purpose servers. Without the volume that comes from selling general purpose servers, the cost of the appliance platform goes through the roof. Oracle would also have a difficult time getting specialized hardware without paying a premium for a small production run of servers.
  4. Oracle and Larry Ellison want to own the IT budget. The "save money on hardware and spend it on Oracle software" go-to-market strategy was nothing short of brilliant. Keeping the Sun hardware business is Ellison's opportunity to compete head to head with IBM. Oracle would have all the applications and the hardware to run them on. That would be quite a legacy for Ellison.
The US Department of Justice has approved the acquisition. Now, the European Union needs to make a decision before we will get any more answers.



Deduplication - The NetApp Approach

After writing a couple of articles (here and here) about deduplication and how I think it should be implemented, I figured I would try it on a NetApp system I have in the lab. The goal of the testing here is to compare storage performance of a data set before and after deduplication. Sometimes capacity is the only factor, but sometimes performance matters. The test is random 4KB reads against a 100GB file. The 100GB file represents significantly more data than the test system can fit into its 16GB read cache. I am using 4KB because that is the natural block size for NetApp.

To maximize the observability of the results in this deduplication test, the 100GB file is completely full of duplicate data. For those who are interested, the data was created by doing a dd from /dev/zero. It does not get any more redundant than that. I am not suggesting this is representative of a real world deduplication scenario. It is simply the easiest way to observe the effect deduplication has on other aspects of the system. This is the output from sysstat -x during the first test. The data is being transferred over NFS and the client system has caching disabled, so all reads are going to the storage device. (The command output below is truncated to the right, but the important data is all there.)

Random 4KB reads from a 100GB file – pre-deduplication:

 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 19%  6572     0     0    6579  1423 27901  23104     11     0     0     7   16%   0%  -  100%      0     7     0     0     0     0
 19%  6542     0     0    6549  1367 27812  23265    726     0     0     7   17%   5%  T  100%      0     7     0     0     0     0
 19%  6550     0     0    6559  1305 27839  23146     11     0     0     7   15%   0%  -  100%      0     9     0     0     0     0
 19%  6569     0     0    6576  1362 27856  23247    442     0     0     7   16%   4%  T  100%      0     7     0     0     0     0
 19%  6484     0     0    6491  1357 27527  22870      6     0     0     7   16%   0%  -  100%      0     7     0     0     0     0
 19%  6500     0     0    6509  1300 27635  23102    442     0     0     7   17%   9%  T  100%      0     9     0     0     0     0
The system is delivering an average of 6536 NFS operations per second. The cache hit rate hovers around 16-17%. As you can see, the working set does not fit in primary cache. This makes sense. The 3170 has 16GB of primary cache and we are randomly reading from a 100GB file. Ideally, we would expect a 16% cache hit rate (16GB cache / 100GB working set) and we are very close. The disks are running at 100% utilization and are clearly the bottleneck in this scenario. The spindles are delivering as many operations as they are capable of. So what happens if we deduplicate this data? First, we need to activate deduplication, a_sis in NetApp vocabulary, on the test volume and deduplicate the test data. (Before deduplication became the official buzzword, NetApp referred to their technology as Advanced Single Instance Storage.)
fas3170-a> sis on /vol/test_vol
SIS for "/vol/test_vol" is enabled.
Already existing data could be processed by running "sis start -s /vol/test_vol".
fas3170-a> sis start -s /vol/test_vol
The file system will be scanned to process existing data in /vol/test_vol.
This operation may initialize related existing metafiles.
Are you sure you want to proceed (y/n)? y
The SIS operation for "/vol/test_vol" is started.
fas3170-a> sis status
Path                           State      Status     Progress
/vol/test_vol                  Enabled    Initializing Initializing for 00:00:04
fas3170-a> df -s
Filesystem                used      saved       %saved
/vol/test_vol/         2277560  279778352          99%
There are a few other files on the test volume that contain random data, but the physical volume size has been reduced by over 99%. This means our 100GB file is now less than 1GB in size on disk. So, let's do some reads from the same file and see what has changed.
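As a sanity check, the %saved column that df -s reports can be reproduced from the used and saved values in the output above:

```python
# Values (in KB) taken from the df -s output above.
used_kb = 2277560
saved_kb = 279778352

logical_kb = used_kb + saved_kb          # what the data would occupy without dedup
pct_saved = saved_kb / logical_kb * 100  # share of the logical space eliminated

print(f"logical size: {logical_kb / 1024**2:.1f} GB")  # total logical data on the volume
print(f"space saved:  {pct_saved:.1f}%")               # 99.2%
```

The 99% in the df -s output is simply this value rounded to a whole percentage.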

Random 4KB reads from a 100GB file – post-deduplication:

 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 93% 96766     0     0   96773 17674 409570    466     11     0     0    35s  53%   0%  -    6%      0     7     0     0     0     0
 93% 97949     0     0   97958 17821 413990    578    764     0     0    35s  53%   8%  T    7%      0     9     0     0     0     0
 93% 99199     0     0   99206 18071 419544    280      6     0     0    34s  53%   0%  -    4%      0     7     0     0     0     0
 93% 98587     0     0   98594 17941 416948    565    445     0     0    36s  53%   6%  T    6%      0     7     0     0     0     0
 93% 98063     0     0   98072 17924 414712    398     11     0     0    35s  53%   0%  -    5%      0     9     0     0     0     0
 93% 96568     0     0   96575 17590 408539    755    502     0     0    35s  53%   8%  T    7%      0     7     0     0     0     0
There has been a noticeable increase in NFS operations. The system has gone from delivering 6536 NFS ops to delivering 96,850 NFS ops. That is a nearly fifteen-fold increase in delivered operations. The CPU utilization has gone up roughly 4.9x. The disk reads have dropped to almost 0 and the system is serving out over 400MB/s. This is a clear indication that the operations are being serviced from cache instead of from disk. It is also worth noting that the average latency, as measured from the host, has dropped by over 80%. The improvement in latency is not surprising given that the requests are no longer being serviced from disk.

The cache age has dropped down to 35 seconds. Cache age is the average age of the blocks that are being evicted from cache to make space for new blocks. The test had been running for over an hour when this data was captured, so this is not due to the load ramping. This suggests that even though we are accessing a small number of disk blocks, the system is evicting blocks from cache. I suspect this is because the system is not truly deduplicating cache. Instead, it appears that each logical file block is taking up space in cache even though they all refer to the same physical disk block. One potential explanation for this is that NetApp is eliminating the disk read by reading the duplicate block from cache instead of disk. I am not sure how to validate this through the available system stats, but I believe it explains the behavior. It explains why the NFS ops have gone up, the disk ops have gone down, and the cache age has dropped to 35 seconds. While it would be preferable to store only a single copy of the logical block in cache, this is better than reading all of the blocks from disk.

The cache hit percentage is a bit of a puzzle here. It is stable at 53% and I am not sure how to explain that. The system is delivering more than 53% of the read operations from cache. The very small number of disk reads shows that.
Maybe someone from NetApp will chime in and give us some details on how that number is derived. This testing was done on Data ONTAP 7.3.1. More specifically, I tried to replicate the results on versions of Data ONTAP prior to 7.3.1 without success. In older versions, the performance of the deduplicated volume is very similar to the original volume. It appears that reads for logically different blocks that point to the same physical block go to disk prior to 7.3.1.

Check back shortly as I am currently working on a deduplication performance test for VMware guests. It is a simple test to show the storage performance impact of booting many guests simultaneously. The plan is to use a handful of fast servers to boot a couple hundred guests. The boot time will be compared across a volume that is not deduplicated and one that has been deduplicated. I am also working on one additional test that may provide a bit of performance acceleration.

These results are from a NetApp FAS 3170. Please be careful trying to map these results to your filer as I am using very old disk drives and that skews the numbers a bit. The slow drives make the performance of the full (non-deduplicated) data set slower than it would be with new disk drives.
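To summarize the before/after comparison, the key ratios work out as follows; the figures are the averages quoted above:

```python
# Averages from the two sysstat runs above.
pre_ops, pre_cpu = 6536, 19       # NFS ops/s and CPU % before deduplication
post_ops, post_cpu = 96850, 93    # after deduplication

speedup = post_ops / pre_ops      # increase in delivered operations
cpu_growth = post_cpu / pre_cpu   # increase in CPU utilization
ops_per_cpu_pre = pre_ops / pre_cpu
ops_per_cpu_post = post_ops / post_cpu

# Expected steady-state hit rate before dedup: uniform random reads over a
# working set larger than cache.
expected_hit = 16 / 100           # 16GB cache / 100GB file

print(f"throughput speedup: {speedup:.1f}x")          # 14.8x
print(f"CPU increase:       {cpu_growth:.1f}x")       # 4.9x
print(f"ops per CPU point:  {ops_per_cpu_pre:.0f} -> {ops_per_cpu_post:.0f}")
print(f"expected pre-dedup cache hit rate: {expected_hit:.0%}")  # 16%
```

Note that operations per point of CPU roughly triple, so the system is not just busier, it is doing more useful work per CPU cycle once the disks are out of the picture.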



Deduplication - Sometimes it's about performance

In a previous post I discussed the topic of deduplication for capacity optimization. Removing redundant data blocks on disk is the first, and most obvious, phase of deduplication in the marketplace. It helps to drive down the most obvious cost - the cost per GB of disk capacity. This market has grown quickly over the last few years. Both startups and established storage vendors have products that compete in the space. They are most commonly marketed as virtual tape library (VTL) or disk-to-disk backup solutions. Does that mean that deduplication is a point solution for highly sequential workloads? No. There is another somewhat less obvious benefit of deduplication.

What storage administrator does not ask for more cache in the storage array? If I can afford 8GB, I want 16GB. If the system supports 16GB, I want 32GB. Whether it is for financial or technical reasons, cache is always limited. What about deduplicating the data in cache? When the workload is streaming sequential backup data from disk, this may not be very helpful. However, in a primary storage system with a more varied workload, this becomes very interesting. The cost per GB of cache (DRAM) is several orders of magnitude higher than the cost of hard drives. If the goal is to reduce storage capital expenses by making the storage array more efficient, then let's focus on the most expensive component: the cache. If the physical data footprint on disk is reduced, then it is logical that the array cache should also benefit from that space savings. If the same deduplicated physical block is accessed multiple times through different logical blocks, then it should result in one read from disk and many reads from cache.

Storing VMware, Hyper-V, or Xen virtual machine images creates a tremendous amount of duplicate data. It is not uncommon to see storage arrays that are storing tens or even hundreds of these virtual images.
If 10 or 20 or all of those virtual servers need to boot at the same time, it places an extreme workload on the storage array. If all of these requests have to hit disk, the disk drives will be overwhelmed. If each duplicate logical block requires its own space in cache, then the cache will be blown out. If each duplicate block is read off disk once and stored in cache once, then the clients will boot quickly and the cache will be maintained. Preserving the current contents of the cache will ensure the performance of the rest of the applications in the environment is not impacted.

This virtual machine example does make a couple of assumptions. The most significant is that the storage controllers can keep up with the workload. Deduplication on disk allows for fewer disk drives. Deduplication in cache expands the logical cache size and makes the disk drives more efficient. Neither of these does anything to reduce the performance demand on the CPU, fibre channel, network, or bus infrastructure on the controllers. In fact, they likely place more demand on these controller resources. If the controller is spending less time waiting for disk drives, it needs the horsepower to deliver higher IOPS from cache with less latency.

This same deduplication theory applies to flash technology as well. Whether the flash devices are being used to expand cache or provide another storage tier, they should be deduplicated. Flash devices are more expensive than disk drives, so let's get the capacity utilization up as high as possible. As the mainstream storage vendors start to bring deduplication to market for primary storage, it will be interesting to see how they deal with these challenges. Deduplication for primary storage presents an entirely different set of requirements from backup to disk.
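The logical-vs-physical caching distinction described above can be sketched in a few lines. This is purely an illustration of the concept, not any vendor's implementation: logical blocks map to physical blocks, and a dedup-aware cache indexes by physical block, so any number of duplicate logical blocks cost one cache slot and one disk read.

```python
from collections import OrderedDict

class DedupAwareCache:
    """LRU cache keyed by physical block, so duplicate logical blocks
    share a single cache slot (conceptual sketch only)."""

    def __init__(self, capacity, block_map):
        self.capacity = capacity      # cache slots, in physical blocks
        self.block_map = block_map    # logical block -> physical block
        self.cache = OrderedDict()    # physical block -> cached data
        self.disk_reads = 0

    def read(self, logical_block):
        physical = self.block_map[logical_block]
        if physical in self.cache:
            self.cache.move_to_end(physical)   # hit: refresh LRU position
            return self.cache[physical]
        self.disk_reads += 1                   # miss: one read from disk
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        self.cache[physical] = f"data@{physical}"
        return self.cache[physical]

# 1000 logical blocks (e.g. the same OS image in 1000 guests) that all
# deduplicate down to a single physical block:
block_map = {lb: 0 for lb in range(1000)}
cache = DedupAwareCache(capacity=4, block_map=block_map)
for lb in range(1000):
    cache.read(lb)
print(cache.disk_reads)   # 1 -- every duplicate is served from cache
```

With a cache keyed by logical block instead, the same trace would cost 1000 disk reads and evict everything else from a 4-slot cache.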



Why Oracle is NOT going to sell off Sun's hardware business

Why is there such a buzz among the analyst, press, and blogging community that Oracle is going to sell off the Sun hardware business? It makes no sense to me. I shared my thoughts on the acquisition in a previous post, but I am going to elaborate a bit here. Not only do I believe Oracle will continue selling Sun hardware, I think it is the primary reason they bought Sun.

Why would Oracle spend $7.4B to buy Sun? Is it for Solaris? I don't think so. Solaris is open source and Sun would have welcomed Oracle's help in tuning the operating system for Oracle's software applications. Is it for Java? That is a little more plausible, but there was no need for Oracle to control Java. As far as I know, Sun was not doing anything to make it difficult for Oracle to use Java. Oracle is buying Sun for the hardware business. The hardware (and support) business is what generates the revenue at Sun.

I would like to share a few relevant quotes. The first comes from Larry Ellison in a recent interview. He did his best to shut down the rumor mill churning on what will happen to Sun's hardware business.

Interviewer: "Are you going to exit the hardware business?"

Larry Ellison: "No, we are definitely not going to exit the hardware business. [...] If a company designs both hardware and software, it can build much better systems than if they only design the software."

Laura DiDio, an analyst with ITIC, was quoted in a recent Reuters article. "Sun has three decades and billions of dollars in investment in superlative hardware. They have some brilliant engineers," she said. "But Sun's marketing has not matched its technology. Larry Ellison is brilliant at marketing." Laura is being kind to Sun marketing when she suggests that it has not matched their technology. Sun has struggled to deliver a clear message to the market about their technology.

Oracle has roughly 50% market share in the relational database market.
They also have a dominant market share in other markets with products like Siebel, BEA, and PeopleSoft. Even a sizable portion of SAP deployments use Oracle DB as the back end database. This means Oracle is sitting in the room when a customer makes their application decision most of the time. This application/software decision is made long before any hardware decisions. The support matrix for the software application has a major influence on the hardware and operating system decisions. What fibre channel HBA do you run? Probably the one that your storage vendor recommended.

Solaris is the number one operating system for Oracle today. Linux was an afterthought for enterprise application deployment until Oracle put their name behind it and started pushing it as the platform for Oracle. What impact will Oracle's endorsement of Solaris have on AIX? While Oracle will continue to support the major operating systems in the marketplace, it is only logical that Solaris will become their recommended operating system. Larry suggests in the above interview that Oracle is planning to tightly integrate Oracle with Solaris. That is another logical reason for customers to run on Solaris. If the customer has picked an Oracle software platform and Oracle recommends they run on Solaris, then why not sell them the hardware as well?

Oracle has used software pricing to move the market in the past and I expect they will continue the practice. Today, Oracle DB licenses are more cost effective (at list price) for IBM Power CPUs than they are for Sun SPARC CPUs. If Oracle is selling the Sun servers, it only makes sense that it will be more cost effective to deploy the application on Sun servers moving forward. In the future, I can see Oracle selling you 2 sockets of Oracle DB licenses and shipping you a 2 socket server that is included with those licenses to plug into an existing RAC cluster.
You are welcome to run on a different platform, but this one is preinstalled and will join your cluster after you answer 3 simple setup questions. Will the Oracle hardware offering bother IBM, EMC, HP, Dell, NetApp, etc.? Yes, it will. What can they do though? Recommend a different database? It is too late for that. IBM has DB2, but nobody has a stack that is as complete as Oracle's. Perhaps this will drive someone like HP to acquire SAP.

Why does Oracle want to keep the Sun hardware business if Sun is losing money? The answer is that Sun is generating positive cash flow from operations every quarter. They post a GAAP loss because they make poor investment decisions and then have to write off those investments. Jonathan Schwartz was promoted to president and COO of Sun in 2004. Under his guidance, Sun paid $4.1B for StorageTek in 2005. In mid-2007, Sun announced a $3B stock buyback. Sun stock fell by nearly 50% in the next 12 months. In 2008, Sun paid ~$1B for MySQL. When Sun's stock price fell and the market cap with it, all of these events required write downs that show a GAAP loss. If Sun would stop spending on acquisitions, their bank balance would be rising every quarter.

In the last 4 years, Oracle has acquired over 40 companies. It has built a complete software portfolio that allows them to compete effectively throughout the entire software stack. With the purchase of Sun, they will now own the bottom half of the datacenter stack, from the operating system down to the disk drives.



Oracle to buy Sun for $7.4B - How will it affect the industry?

A few weeks ago, it looked like IBM was going to make a deal to purchase Sun. That fell through when the Sun board could not come to agreement. On April 20th, with very little rumor in the marketplace, Oracle announced they were buying Sun for $7.4B in cash. What does this mean for the new company?

  • To quote a recent Oracle publication, "Oracle plans to engineer and deliver an integrated system – applications to disk – where all the pieces fit and work together, so customers do not have to do it themselves." Sun is already shipping Infiniband switches and blades with InfiniBand on the motherboard. They have also mentioned IB is on the roadmap for the Sun 7000. Andy Bechtolsheim mentioned it at the Sun product announcement on April 14th. What about an integrated Oracle appliance running on Nehalem blades, Solaris x64, Sun 7000 storage, and Infiniband switches? It should not be a major technology leap to put it all together. What would this mean for the Oracle/HP appliance?
  • Sun SPARC processors are at an Oracle pricing disadvantage to IBM Power processors in the current Oracle pricing model. Oracle has never been afraid to use pricing to move the market in their direction. Watch for them to use their pricing model to encourage customers to buy Sun servers.
  • Solaris x64 has been intentionally neglected by Oracle. Oracle delivers patches on Solaris SPARC and Linux immediately. Then, they have historically waited up to 6 months to release that same patch for Solaris x64. In the past, this has helped Oracle push their Linux agenda in the marketplace. Given the ease of porting the Oracle patches to Solaris x64, there is no logical technical reason for this lag. Watch for Solaris x64 to become a first class citizen in the Oracle OS support matrix now that growth of Solaris x64 means growth for Oracle.
  • Storage - Sun is not generally thought of as a storage company. However, an Oracle executive recently announced in a Sun all hands meeting that Oracle was buying Sun for Solaris, Java, and storage. Look for the Sun 7000 to pick up some new features that integrate more tightly with Oracle. Perhaps Analytics that give visibility across Oracle, Solaris, and the storage? Oracle performance could benefit from the large cache footprint and even larger flash cache. The Sun 7000 is designed to reduce hardware costs, and Oracle has historically loved reducing hardware costs in order to leave more budget available for software.
  • Servers - I would expect business as usual at Sun on the server front. Oracle has no play in the hardware space today and Sun makes solid products. Sun’s server business generates positive cash flow every quarter. This is part of why they bought Sun, right? According to a recent release from Oracle, "Oracle plans to grow the Sun hardware business after the closing, protecting Sun customers’ investments and ensuring the long-term viability of Sun products."
  • MySQL – Many users adopt MySQL to escape Oracle pricing. The largest market is in the Web 2.0 space, but they play in other parts of the market as well. There is no way to kill MySQL because it is open source. If Oracle/Sun stopped supporting it today, there would be a well funded startup tomorrow stepping in to take their place. Many of my MySQL contacts have already moved on from Sun. In fact, some major MySQL users are turning to the community and other sources for their MySQL support and patches. Oracle already owns TimesTen, Berkeley DB, and InnoDB, so MySQL could just become another DB on the list. If Oracle really wants to put the pressure on Microsoft SQL, then they will encourage the use of MySQL and provide an easy migration path to Oracle DB.
  • The Wall Street Journal reported that the IBM deal fell through when the Sun board split into two factions. The faction in favor of the IBM deal was led by Jonathan Schwartz, while Scott McNealy headed the group opposing the acquisition. When the CEO takes that kind of stand and loses, one must wonder what this foretells of his future. I think there is a high likelihood we will see Jonathan move on to a new opportunity in the next few months.
  • Identity Management - Oracle and Sun have overlapping products here. When both products are 'good enough' I tend to lean towards the buyer. That said, Sun has their product somewhat integrated into Solaris, so I do not think it will go away. This one may be too close to call. There may also be some regulatory scrutiny on this as the combined market share will be significant.
  • Sun IBIS - Sun has been trying to centralize to a single ERP system over the last few years. This is a project that should never have impacted Sun's customers, partners, or suppliers. Unfortunately, there have been a few hiccups and many delays along the way, and it has affected many organizations outside of Sun. Look for Oracle to bring the resources to the table to clean this up.
I think we would have seen some serious regulatory scrutiny of an IBM acquisition of Sun. On the surface, it looks like there are very few issues with Oracle and Sun. The only two issues I see are Identity Management and MySQL. I don't think either one should be a major issue, but the government may. I don't think Oracle will have a problem finding a workable solution for either potential objection.

A couple thoughts that are a little further out of the box:

  • An Oracle appliance would give Oracle a private cloud play in the datacenter. Oracle has a large cloud presence today with Oracle On Demand. They could position themselves as the enterprise application private and public cloud provider. Cloud computing is the natural evolution of outsourcing. Oracle could compete with IBM Global Services on the high end and Amazon on the low end. Who better to outsource your enterprise applications to than the company that wrote them?
  • DTrace could be integrated into Oracle to provide a whole new level of observability. What if DTrace had a GUI that was able to observe systems across the database, server, and storage?
  • Oracle is based on transactions and ensuring transactional integrity is maintained all the way from the Oracle DB to the hard drive. Sun's ZFS is a transaction engine for storage that is most commonly used to present a filesystem. The ZFS transactional model is completely exposed through its API. What would happen to Oracle performance if they plugged into the ZFS transactional engine and took advantage of the ZFS integration with SSD/Flash?
  • Oracle could take OpenOffice to market and try to compete with Microsoft for the desktop. I am not suggesting that OpenOffice is perfect or as full featured as Microsoft Office, but Oracle has a history of monetizing software very effectively.
  • Oracle and Sun have both been working on Xen. I assume they are both working on tools to manage virtual environments. Do they try to enter the virtualization space? The hypervisor is well on the way to commoditization (read: free), but what about a virtualization appliance? Sun has the hardware and Oracle does a pretty good job on the software side.
  • Sun is a major proponent of Infiniband. One of the potential limiting factors in Oracle RAC environments is the latency of the cluster interconnect. What impact would Oracle's endorsement of Infiniband have in the marketplace? How would this impact Cisco's Data Center Ethernet (DCE) plans?

What are a couple of the potential challenges in this acquisition?

  • A few analysts have suggested that Oracle will sell off the Sun hardware business. I can not see it. Why would they do that? As mentioned above, Oracle states in the Oracle/Sun FAQ, "Oracle plans to grow the Sun hardware business after the closing, protecting Sun customers’ investments and ensuring the long-term viability of Sun products."
  • Oracle has never had hardware in their portfolio. How does this affect their relationships with IBM, HP, Dell, EMC, and NetApp?
  • Pillar is a small, but very well funded, storage company. Tako Ventures is the largest investor in Pillar and has a seat on the board. Lawrence Investments is the owner of Tako Ventures, and Lawrence "Larry" Ellison provides the capital behind Lawrence Investments. So, Larry will own two storage companies when the Sun deal closes.
  • Sun and Oracle have very different cultures at the field level. How will that play out? Given Sun's latest reorganization, I think the sales organization could be plugged into a new company quite easily. It almost appears like it was designed to be portable. Perhaps that was a design criterion?

How does it affect the rest of the industry?

Once the IBM news broke, it was generally expected that Sun would be acquired. Even if the IBM deal fell through, someone else would buy Sun. I expected it to be one of the major systems vendors such as HP or Dell. If I let my mind wander a bit, I could convince myself that Cisco or EMC might take the opportunity to enter a new space. I did not expect Oracle to jump in and make the deal. I don't think the rest of the industry expected it either.

Any company who makes a living selling hardware for enterprise applications in the datacenter is unhappy about this acquisition. Oracle is the largest vendor of enterprise application software. When they sold no hardware, their support matrix and pricing model would still drive hardware buying decisions. Now that they will have their own hardware portfolio, I expect them to push customers towards Solaris, Sun storage, and Sun SPARC and x64 servers.

The current state of the economy has driven down stock prices. This is a buying opportunity for companies with a war chest. Cisco took out a $4B bond a couple of months back, which suggests they are shopping. IBM, EMC, Dell, HP, NetApp, and Cisco all have money to invest. Sun is the first major transaction of the year, but I do not think it will be the last. Full disclosure: My company is a partner of both Sun and Oracle. Update: I have added another post on the subject of Oracle selling off the Sun hardware business.



Deduplication - It's not just about capacity

There is no debating that deduplication is one of the hottest topics in IT. The question is whether the hype has started to become bigger than the technology. Today, there are two primary use cases driving deduplication in the marketplace. The first is backup to disk and the second is virtual guest operating systems (VMware, Hyper-V, and Xen guests). (I will talk a bit about the disk to disk scenario in this article and the virtual guest topic in the next one.) These are both logical markets to adopt deduplication because they suffer from a common challenge: they both create a tremendous amount of redundant data on the disk array. The goal in both cases is to pack more data onto a disk drive and reduce the cost per GB. This is the first and most obvious use case for deduplication.

Disk drive capacity is growing exponentially, but disk performance is increasing at a much slower rate. In many cases, when helping customers size for their workload, performance drives the spindle count and not capacity. It is easy to meet the capacity needs with large drives, but will they meet the performance requirement? That is the problem: it is no longer sufficient to size a storage device based solely on capacity requirements. This is a general challenge that must be taken into account when sizing a storage array. So how is the growing disparity between size and performance affected by deduplication? Deduplication can make the performance issue worse by reducing the number of spindles even further. If the bottleneck in the storage device is the spindles, then using deduplication to pack more data onto those spindles is only going to exacerbate the situation. Let's take a closer look at sizing storage for a backup to disk workload. Delivering on the highly sequential read and write requirements of disk to disk backups is much easier than serving a more random workload. Disk drives do a great job with sequential reads and writes.
This makes backup to disk all about sizing for capacity. When deduplication is added into the mix, the disk drives should still meet the performance requirement as long as the deduplication technology being used does not turn sequential IO into random IO. This is why it is important to understand how a specific deduplication implementation works. The reality is that nearly every other IT workload is more random than backup to disk. If deduplication were used to pack more data onto the same number of spindles for a highly random workload, the spindles would likely not meet the performance requirements. Does that mean deduplication is a point solution for highly sequential workloads? I do not believe so. I am working on an entry covering the potential performance benefits of deduplication in a more random environment.
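The sizing argument above, that the spindle count is set by whichever of capacity and performance demands more drives, can be reduced to a quick calculation. The drive figures below (450GB, ~175 IOPS per spindle) are illustrative assumptions, not numbers from this article:

```python
import math

def spindles_needed(capacity_gb, iops_required, drive_capacity_gb, drive_iops):
    """Spindle count is the max of what capacity and performance each demand."""
    for_capacity = math.ceil(capacity_gb / drive_capacity_gb)
    for_performance = math.ceil(iops_required / drive_iops)
    return max(for_capacity, for_performance)

# A random workload where performance, not capacity, dictates the count:
print(spindles_needed(capacity_gb=10_000, iops_required=20_000,
                      drive_capacity_gb=450, drive_iops=175))   # 115 spindles

# Deduplicating the data 2:1 halves the capacity requirement but not the
# IOPS demand, so the spindle count does not move at all:
print(spindles_needed(capacity_gb=5_000, iops_required=20_000,
                      drive_capacity_gb=450, drive_iops=175))   # still 115
```

For a sequential backup-to-disk workload the IOPS term is small, capacity wins the max(), and deduplication translates directly into fewer drives.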

Click to read more ...


Do I need more cache in my NetApp?

How many times have you wondered whether you could improve the performance of your storage array by adding additional cache? This is what the vendors so often tell us to do, but they rarely offer objective information to explain why it is going to help. Depending on the workload, increasing the cache may have little or no effect on performance.

There are two ways to know whether your environment will benefit from additional cache. The first is to understand every nuance of your application. Most storage managers I speak with classify this as impractical at best and impossible at worst. Even if you have an application with a very well understood workload, most storage devices are not hosting a single application. Instead, they are hosting many different applications, and it is even more complex to understand how this combined workload will be affected by adding cache. The second way to measure cache benefit is to put the cache in and see what happens. This is the most common approach I see in the field. When performance becomes unacceptable, the options of adding additional disk and/or cache are weighed and a purchase is made. (I will save the topic of adding spindles to increase performance for a future post.) Both of these options force a purchase to be made with no guarantee it will solve the problem.

NetApp has introduced a tool to provide a third option: Predictive Cache Statistics. It provides the objective data needed to justify a hardware purchase. Predictive Cache Statistics (PCS) is available on systems running Data ONTAP 7.3 or later with at least 2GB of memory. When it is enabled, PCS reports what the cache hit ratio would be if the system had 2x (ec0), 4x (ec1), and 8x (ec2) the current cache footprint. (ec0, ec1, and ec2 are the names of the extended caches in the statistics the NetApp system presents.) Now, let's drill down into exactly how predictive cache statistics work...
Under most conditions there is no significant impact to system performance. I monitored the change in latency on my test system with PCS enabled and disabled, and there was not a measurable difference. The storage controller was running at about 25% CPU utilization at the time with a 40% cache hit rate. NetApp warns in their documentation that performance can be affected when the storage controller is at 80% CPU utilization or higher. That is understandable given the amount of information the array has to track in order to provide the cache statistics. It simply means some thought needs to be put into when PCS is enabled and how long it is left running in production. Here are the steps required to gather the information:

1) Enable Predictive Cache Statistics (PCS)

options flexscale.enable pcs
2) Allow the workload to run until the virtual caches have had time to warm up. In a system with a large amount of cache, this can take hours or even days. Monitor array performance while the storage workload runs. If latency increases to unacceptable levels, you can disable PCS.
options flexscale.enable off
3) The NetApp perfstat tool can be used to capture and analyze the data that is gathered. I prefer instant gratification, so for this example I will use the real-time stats command.
stats show –p flexscale-pcs
The way the results are reported can be a little confusing the first time you look at it. The ec0, ec1, and ec2 'virtual caches' are sized relative to the base cache in the system being tested (2x, 4x, and 8x). If the test system has 16GB of primary cache, ec0 will represent 32GB of 'virtual cache' (2x 16GB). ec1 brings the 'virtual cache' to a total of 4x base cache, or an additional 32GB beyond ec0. ec2 brings the total to 8x base cache, or an additional 64GB beyond ec0 + ec1. The statistics on each line represent the values for that specific cache segment. Hopefully that explanation clears up more confusion than it introduces. Here are a couple of examples. This testing was completed on a NetApp FAS3170. The 3170 platform comes with 16GB of cache standard, so in these examples, ec0 is 32GB, ec1 is 32GB, and ec2 is 64GB.
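The segment arithmetic can be written out as a quick sketch. The 2x/4x/8x multipliers come from the description above, and the block counts in the stats output below cross-check nicely (each WAFL block is 4KB):

```python
# Sketch: how the PCS virtual cache segments scale with the base cache.
# The 2x/4x/8x multipliers are from the article; this helper is plain
# arithmetic, not a NetApp API.

def pcs_segments(base_cache_gb):
    """Incremental size of each virtual cache segment, in GB.

    The cumulative virtual cache is 2x, 4x, and 8x the base cache,
    so the increments are 2x, 2x, and 4x base.
    """
    ec0 = 2 * base_cache_gb
    ec1 = 4 * base_cache_gb - ec0
    ec2 = 8 * base_cache_gb - ec0 - ec1
    return ec0, ec1, ec2

# A FAS3170 with 16GB of base cache:
print(pcs_segments(16))   # -> (32, 32, 64)

# Cross-check against the stats output: 8388608 blocks * 4KB = 32GB.
print(8388608 * 4 // (1024 * 1024))   # -> 32
```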

Example 1: 8GB working set, 4KB IO, and 100% random reads

fas3170-a> sysstat -x 5
 CPU   NFS  CIFS  HTTP   Total    Net  kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in    out   read  write  read write   age   hit time  ty util                 in   out    in   out
 39% 39137     0     0   39137  7102 165539    206    370     0     0   >60  100%   3%  T    2%      0     0     0     0     0     0
 39% 39882     0     0   39882  7236 168677    136      6     0     0   >60  100%   0%  -    1%      0     0     0     0     0     0
 39% 39098     0     0   39098  7094 165338    186    285     0     0   >60  100%   3%  T    2%      0     0     0     0     0     0

fas3170-a> stats show -p flexscale-pcs
Instance    Blocks Usage   Hit  Miss Hit Evict Invalidate Insert
                       %    /s    /s   %    /s         /s     /s
     ec0   8388608     0     0     0   0     0          0      0
     ec1   8388608     0     0     0   0     0          0      0
     ec2  16777216     0     0     0   0     0          0      0
     ec0   8388608     0     0     0   0     0          0      0
     ec1   8388608     0     0     0   0     0          0      0
     ec2  16777216     0     0     0   0     0          0      0
     ec0   8388608     0     0     0   0     0          0      0
     ec1   8388608     0     0     0   0     0          0      0
     ec2  16777216     0     0     0   0     0          0      0
The sysstat shows a cache hit rate of 100%. This is exactly what we would expect for an 8GB dataset on a system with 16GB of cache. The stats command shows that PCS is currently reporting no activity. Again, this is exactly what we should expect with a working set that fits completely in main cache.

Example 2: 30GB working set, 4KB IO, and 100% random reads

fas3170-a> sysstat -x 5
 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 27% 11607     0     0   11607  2173 49352  27850      6     0     0     3   41%   0%  -   99%      0     0     0     0     0     0
 27% 11642     0     0   11642  2180 49518  28097    279     0     0     3   41%  21%  T   99%      0     0     0     0     0     0
 26% 11413     0     0   11413  2138 48511  27773     11     0     0     3   41%   0%  -   99%      0     0     0     0     0     0

fas3170-a> stats show -p flexscale-pcs
Instance    Blocks Usage   Hit  Miss Hit Evict Invalidate Insert
                       %    /s    /s   %    /s         /s     /s
     ec0   8388608     1    38  8560   0     0          0  14811
     ec1   8388608     0     0  8560   0     0          0      0
     ec2  16777216     0     0  8560   0     0          0      0
     ec0   8388608     1    65  6985   0     0          0      0
     ec1   8388608     0     0  6985   0     0          0      0
     ec2  16777216     0     0  6985   0     0          0      0
     ec0   8388608     1   100  6922   1     0          0  11899
     ec1   8388608     0     0  6922   0     0          0      0
     ec2  16777216     0     0  6922   0     0          0      0
This data was gathered after the 30GB workload had been running for a few minutes, but just after I enabled predictive cache statistics. The PCS data shows that there are very few hits, but there are a significant number of inserts. This is what we should expect when PCS is first enabled. The sysstat output shows a cache hit rate of 41%.
fas3170-a> sysstat -x 5
 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 27% 11238     0     0   11238  2105 47784  27862    286     0     0     4   40%  18%  T   99%      0     0     0     0     0     0
 26% 11371     0     0   11371  2130 48349  27934     11     0     0     4   40%   0%  -   99%      0     0     0     0     0     0
 27% 11184     0     0   11184  2096 47554  27938    275     0     0     4   40%  33%  T   99%      0     0     0     0     0     0

fas3170-a> stats show -p flexscale-pcs
Instance    Blocks Usage   Hit  Miss Hit Evict Invalidate Insert
                       %    /s    /s   %    /s         /s     /s
     ec0   8388608    87  6536   456  93   933          0    934
     ec1   8388608     6   453     3  99     0        934    933
     ec2  16777216     0     0     3   0     0          0      0
     ec0   8388608    87  6512   435  93     0          0      0
     ec1   8388608     6   435     0 100     0          0      0
     ec2  16777216     0     0     0   0     0          0      0
     ec0   8388608    87  6472   450  93   963          0    964
     ec1   8388608     6   445     5  98     0        964    963
     ec2  16777216     0     0     5   0     0          0      0
Now that the ec0 virtual cache has warmed up, the potential value of additional cache becomes more apparent. The hit rate has gone up to 93% and the virtual cache is servicing over 6,500 operations per second. With 32GB of additional cache, those 6,500+ disk reads would be eliminated and latency would be dramatically reduced. (These cache hits are virtual, so at the moment those 'hits' are still causing disk reads.) Clearly, the additional cache would provide a major performance boost, but unfortunately, it is impossible to determine exactly how it will affect overall system performance. The current bottleneck, reads from disk, would be alleviated, but that simply means we will find the next one.

Additional cache can be added to most NetApp systems in the form of a Performance Accelerator Module (PAM). The PAM is a PCI Express card with 16GB of DRAM on it that plugs directly into one of the PCI Express slots in the filer. I suspect there is a slight increase in latency when accessing data in the PAM compared to the main system cache, although this increase is likely so small that it will not be noticed on the client side, as it is a very small portion of the total transaction time from the client perspective. Unfortunately, I do not have first-hand performance data that I can share, as I have not been able to get access to a PAM for complete lab testing.

It is important to note that a system with 16GB of primary cache and 32GB of PAM cache is not the same as a system with 48GB of primary cache. The PAM cache is populated as items are evicted from primary cache, and if there is a hit in the PAM, that block is copied back into primary cache. This type of cache is commonly referred to as a victim cache or an L2 cache. If the goal is to serve a working set without ever going to disk, then that working set needs to fit into the extended cache, not the primary cache plus the extended cache. Predictive cache statistics are a great feature.
They give us the power to answer a question we could only guess at in the past. However, like most end users, I always want more, and there are a couple of things I would love to see in the future. First, the PAM cards are 16GB in size. It would be great if the extended cache segments reported by PCS could be in 16GB increments. That would make it even easier to determine the value of each card added, and it would remove the confusion around how big ec0, ec1, and ec2 are. The ability to reset the PCS counters back to zero would also be helpful; when testing different workloads, this would allow the stats to be associated with each individual workload.

It is worth noting that this was not a performance test and the data above should be treated accordingly. Nothing was done to either the client or the filer to optimize NFS performance. In an attempt to prevent these numbers from being used to judge system performance, I am intentionally omitting the details of how the disks were configured.
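For readers who want a concrete picture of the victim cache behavior described above, here is a toy two-tier LRU model in Python. It is a conceptual sketch of how a victim cache works in general, not NetApp's actual implementation:

```python
# Sketch: a two-tier victim cache, conceptually similar to the primary
# cache + PAM arrangement described above. Toy model, not NetApp code.
from collections import OrderedDict

class VictimCache:
    def __init__(self, l1_size, l2_size):
        self.l1 = OrderedDict()   # primary cache (LRU order)
        self.l2 = OrderedDict()   # victim cache, filled only by evictions
        self.l1_size, self.l2_size = l1_size, l2_size

    def access(self, block):
        """Return 'l1', 'l2', or 'miss' for where the block was found."""
        if block in self.l1:
            self.l1.move_to_end(block)
            return 'l1'
        hit = 'l2' if block in self.l2 else 'miss'
        if hit == 'l2':
            del self.l2[block]      # an L2 hit promotes the block
        self._insert_l1(block)      # back into primary cache
        return hit

    def _insert_l1(self, block):
        self.l1[block] = True
        if len(self.l1) > self.l1_size:
            victim, _ = self.l1.popitem(last=False)
            self.l2[victim] = True  # evicted blocks fall into L2
            if len(self.l2) > self.l2_size:
                self.l2.popitem(last=False)

cache = VictimCache(l1_size=2, l2_size=4)
print(cache.access('a'))   # miss
print(cache.access('b'))   # miss
print(cache.access('c'))   # miss -- 'a' is evicted from L1 into L2
print(cache.access('a'))   # l2   -- and promoted back into L1
```

The key point the model shows: L2 is only populated by L1 evictions, so a cold working set must churn through primary cache before the victim cache starts earning hits, which is why the warm-up period matters.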

Click to read more ...


WAN optimization for array replication

As the need for disaster recovery continues to move downmarket from the enterprise to medium and small businesses, the number of IT shops replicating their data to an offsite location is increasing. Array based replication was once a feature reserved for the big budgets of the Fortune 1000. Today, it is available on most midrange storage devices (and even some of the entry level products). This increase in replication deployments has created a new challenge for IT. The most common replication solutions move the data over the IP network, and that data puts a significant load on the network infrastructure. The LAN is almost always up to the task, but the WAN is often not able to handle this new burden. While the prices of network infrastructure have come down over the years, big pipes are still an expensive monthly outlay. So, how do we get that data offsite without driving up those WAN costs?

WAN optimization technology provides a potential solution. Not every workload or protocol can benefit from today's WAN optimization technology, but replication is one that usually gets a big boost. I gathered some data from a client who uses NetApp SnapMirror to replicate to a remote datacenter and deployed WAN optimization to avoid a major WAN upgrade. The NetApp filer is serving iSCSI, Fibre Channel, and CIFS. The clients are primarily Windows, and they run Exchange and MS SQL along with some home grown applications. All of their data is stored on the NetApp storage. The chart below shows the impact the WAN optimization device had. For the purposes of this discussion, think of the device as having one unoptimized LAN port and one optimized WAN port. The LAN traffic is represented in red and the WAN traffic in blue. With no optimization, the traffic would be the same on both sides. The chart shows a dramatic reduction in the amount of data being pushed over the WAN.
[Chart: Network Throughput] This data was gathered over a 2 week period. The total data reduction over the WAN was 83% for the period shown in the chart, with a peak of 93% during one window. Again, this is not what every environment will see, so test before you deploy. In this case, the system paid for itself in less than 12 months through the savings in WAN costs. That is the kind of ROI that works for almost anyone. I am intentionally not naming the WAN optimization technology used in this solution. The last time we tested these devices in our lab, we brought in a half dozen and they all had their pros and cons. That is another topic for another post.
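The ROI claim above is simple arithmetic. Here is a sketch of it; the 83% reduction comes from the measured data, but the byte counts and dollar figures are hypothetical, chosen only to illustrate the calculation:

```python
# Sketch: the WAN-reduction and payback arithmetic. The 83% figure is
# from the measured data; the traffic volumes and costs below are
# hypothetical, used only to show how the numbers are derived.
import math

def reduction_pct(lan_bytes, wan_bytes):
    """Percent of replication traffic removed from the WAN."""
    return 100.0 * (1 - wan_bytes / lan_bytes)

def payback_months(device_cost, monthly_savings):
    """Months until cumulative WAN savings cover the device cost."""
    return math.ceil(device_cost / monthly_savings)

# Example: 1200GB pushed by replication, 204GB actually crossing the WAN.
print(round(reduction_pct(1200, 204)))   # -> 83

# If the reduction defers a $3,000/month circuit upgrade and the
# device costs $30,000, it pays for itself in under a year.
print(payback_months(30000, 3000))       # -> 10
```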

Click to read more ...


Benchmarking and 'real FC'

Sometimes I think the only people who read technology blogs are people who write other technology blogs. I have no way to figure out whether this is true, but it is an interesting topic to ponder. Do IT end users actually read technology blogs? If they are reading, they do not seem to comment very frequently; far more often the comments come from other bloggers or competing vendors. That said, I am going to talk about an issue that some of the storage bloggers seem to be caught up in at the moment: the issue of 'emulated FC' vs. 'real FC.' Let me start off by sharing a few recent posts from other blogs. Chuck Hollis at EMC writes about the EMC/Dell relationship and takes the opportunity to compare EMC to NetApp. In this case, he is comparing the EMC NX4 to the NetApp FAS2020. The comment in the post that certainly aggravated NetApp is that EMC does "real deal FC that isn't emulated." The obvious implications are that EMC FC is not emulated, NetApp FC is emulated, and FC emulation is bad. (This is not a new debate between EMC and NetApp. Look back through the blogs at both companies and you will find plenty of back and forth on the topic.) Kostadis Russos at NetApp has a post explaining why he, not surprisingly, completely disagrees with Chuck. Stephen Foskett, a storage consultant, posted what I think is an excellent overview of the issues. He cuts through the marketing spin and asks the right questions. His coverage of the topic is so complete, I almost decided not to write about it. I will try not to retrace all the issues he covered, but I will hit a couple of his high level points in case you have not had a chance to read his post (I highly recommend it, though; it is very good). In summary:

  • All enterprise storage arrays “emulate” Fibre Channel drives to one extent or another
  • NetApp is emulating Fibre Channel drives
  • All modern storage arrays emulate SCSI drives
  • Using the wrong tool for the job will always lead to trouble
  • Which is more important to you, integration, performance, or features?
So, why am I writing about it? I am writing about it because Chuck posted a very good blog entry about benchmarking a few days later that, to me, contradicts the importance he gave to 'real FC' on 12/9. I have never met Chuck or Stephen, but they both seem to be very technically adept from their postings. Without trying to put words in his mouth (text on his blog?), the overall theme of Chuck's post is to make sure you use meaningful tests if you want meaningful results from a storage product benchmark. He is absolutely correct; I could not agree more. How many times have we seen benchmarks performed that were completely irrelevant to the workload the array would see in production?

My question is, if performance testing with real world applications produces acceptable results, then who cares what is 'real' and what is 'emulated'? The average driver does not worry about how the computer in her car is controlling the variable valve timing; she worries about whether it reliably gets her to work on time. VMware is selling plenty of virtualization technology that presents devices that are not 'real.' I know it is not storage, but why is that any different? Less and less is 'real' in storage these days.

It is impossible to continue to drive innovation in storage array technology if we are bound by the old ideas of how we configure and manage our storage. With the introduction of technologies that leverage thin provisioning, dependent pointer based copies, compression, and deduplication, we need to rethink concepts as fundamental as RAID groups, block placement, and LUN configuration. Or, in my opinion, we need to stop thinking about those things. Controlling the location of the bits is not what matters. Features and performance are what matter. Results in the real world matter. Look at the systems available and decide what blend of features fits your organization and workload best.
Full disclosure: My company provides storage consulting on all of the platforms discussed above. We sell NetApp, Sun, and HDS products. We do not sell EMC products.

Click to read more ...


HSM without the headaches

Hierarchical Storage Management (HSM), Information Lifecycle Management (ILM), and Data Lifecycle Management (DLM). Everyone wants to manage their data intelligently to reduce their spending on storage infrastructure. The storage vendors and the trade rags would like to convince us that there are magic tools to solve this challenge. The truth is there is no magic tool to manage unstructured data. (I am not talking about archiving tools that integrate with applications here, only about unstructured data.) I have tried many tools over the years and they are simply not cost effective. Don't panic though; in most cases, the solution is far simpler and far less expensive than HSM.

File services is a huge consumer of storage capacity. For the purposes of this conversation, let's consider file services to be NFS or CIFS storage, whether served by integrated appliances or by servers leveraging back end storage devices. In most environments I visit, the file serving infrastructure is using tier 1 disk drives (Fibre Channel, SCSI, or SAS). These drives are populated with data that is mostly idle, and the storage managers want to get that idle data onto a less expensive disk tier. The most common request is to transparently move the idle data to a SATA based device.

Let's walk through the scenarios for an environment with 20TB of unstructured data. To make the example a little simpler, I am going to ignore both RAID capacity overhead and drive right-sizing. I am going to use 300GB FC drives for tier 1 and 1TB SATA drives for tier 2, and I am going to assume that 10% of the data gets 90% of the IO. (While every environment is different, this is in line with what I see in file sharing environments. Check out this interesting paper from a recent file server analysis: "Measurement And Analysis Of Large-Scale Network File System Workloads" by Andrew W. Leung and Ethan L. Miller of UC Santa Cruz and Shankar Pasupathy and Garth Goodson of NetApp.)
Storing 20TB of data on tier 1 drives takes 69 drives (20TB * 1024GB/TB / 300GB/drive = 68.26 drives). If 10% of that data is considered active, then a tiered environment would require 7 300GB drives for the tier 1 data and 18 1TB drives for the tier 2 data.

There are two apparent solutions to this problem. The first is 100% tier 1 disk (option A) and the second is 10% tier 1 and 90% tier 2 disk (option B). Using all tier 1 disk will deliver the required performance and capacity. The downside is that the disk is expensive and takes a tremendous amount of power and cooling. This is the expensive solution storage managers are trying to escape from. The second option is to use a mix of tier 1 and tier 2 disk. This has the potential to make the disks significantly less expensive. The challenge here is the requirement for a magic HSM tool. These tools are so expensive that they often cost more than is saved by using tier 2 disks. Additionally, they are very complex to deploy and manage.

There is a third option that is often not considered: use 100% tier 2 disk. Is that practical? Yes, in most environments the unstructured data will perform just fine on tier 2 disks. Let's go back to the 10% tier 1 example for a minute. In that example, the small number of tier 1 disks is being asked to shoulder 90% of the IO workload while the tier 2 drives sit nearly idle. When we use 100% tier 2 disk, we are able to put all of the spindles to work. Why pay the high price for tier 1 disk to centralize the workload and leave 70%+ of the drives underutilized? Put those tier 2 disks to work. Disk IOPS are the most common performance limiter I see, so I am always looking for ways to spread out the workload. Modern disk drives not only run fine with a mix of active and idle data, they actually need to host some idle data. If a drive were filled to capacity with active data, it would most likely be unable to handle the workload.
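The drive counts above can be reproduced with a few lines of Python (ignoring RAID overhead and drive right-sizing, as in the example):

```python
# Sketch: the drive-count arithmetic behind the three tiering options.
# RAID overhead and right-sizing are ignored, as in the article.
import math

def drives(capacity_tb, drive_gb):
    """Drives needed to hold capacity_tb on drives of drive_gb each."""
    return math.ceil(capacity_tb * 1024 / drive_gb)

data_tb = 20
active_tb = data_tb * 0.10      # 10% of the data gets 90% of the IO

option_a = drives(data_tb, 300)                        # all tier 1 (300GB FC)
option_b = (drives(active_tb, 300),                    # tier 1 portion
            drives(data_tb - active_tb, 1024))         # tier 2 portion (1TB SATA)
option_c = drives(data_tb, 1024)                       # all tier 2

print(option_a)   # -> 69 FC drives
print(option_b)   # -> (7, 18): 7 FC + 18 SATA
print(option_c)   # -> 20 SATA drives
```

Option B concentrates 90% of the IO on 7 of 25 spindles; option C spreads the same IO across all 20, which is the load-spreading argument made above.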
Disclaimer: This is not true for every environment. Some environments drive too much IO to leverage tier 2 disk effectively. For those environments, I suggest using 100% tier 1 disk. And yes, I will admit that in some extreme cases, the HSM solutions make sense. Far more often, though, the cost effective approach is to stick with either 100% tier 1 or 100% tier 2 disk.

Click to read more ...