Compression, de-dupe, erasure coding – STOP!

Before even considering enabling these features on any storage solution, you should weigh up the benefits against the penalties. In essence, they're great for marketing and ideal nerd knobs for the geeky admin, but they can add extra load to a system for a minimal return.

Vendors claiming to achieve high rates of storage reclamation should be reviewed and questioned: after all, how do they know what data is stored in your environment? If a vendor offers to give away free storage should they fail to meet expectations, that alone implies the savings aren't always achievable. De-duplication rates in excess of 1000:1 have been broadcast over WebEx and on trade show booths, but what's driving that number? In one example it was simply a single powered-on desktop virtual machine cloned hundreds of times, none of which were actively being used. Is a static VM a true representation of your environment?
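
To put some numbers on that claim, here's a rough Python sketch (purely illustrative, assuming simple block-level hashing of 4 KiB chunks) of how a de-dupe ratio is usually calculated, logical capacity divided by physical capacity, and why hundreds of identical, idle clones produce a headline-grabbing ratio that a mixed workload never will:

```python
import hashlib
import os

def dedupe_ratio(chunks):
    """Logical vs. physical capacity after block-level de-duplication."""
    logical = sum(len(c) for c in chunks)
    unique = {hashlib.sha256(c).hexdigest(): len(c) for c in chunks}
    physical = sum(unique.values())
    return logical / physical

# One 'golden' desktop VM image, cloned hundreds of times and never touched again.
golden_vm = [bytes([i % 256]) * 4096 for i in range(256)]   # toy 1 MiB image in 4 KiB blocks
clones = golden_vm * 500                                    # 500 identical, idle clones

print(f"Trade show demo: {dedupe_ratio(clones):.0f}:1")     # ~500:1

# Real workloads diverge: logs, swap, patches and user data make most blocks unique.
mixed = clones[:1000] + [os.urandom(4096) for _ in range(15000)]
print(f"Mixed workload:  {dedupe_ratio(mixed):.1f}:1")      # barely above 1:1
```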

Using hardware to process data optimisation seems like a good plan: offload to another layer and let it do the number crunching. But that won't necessarily speed processing up. The software has to hand over to the hardware; data traverses the system board buses, sits in a queue, gets processed and then returns along the same path. That's fine for workloads that aren't latency dependent and for lightly utilised hypervisor hosts, but is that always going to be the case? The counter-argument is that software disk optimisation puts more load on the CPU, which is of course correct. However, if the software has the intelligence to review data patterns and apply disk optimisation only when gains can be achieved, surely that's the better process? Hardware optimisation doesn't have that level of granularity; it adopts a brute-force mentality of trying everything and potentially returning nothing, all the while moving data across system buses. And then what happens when that hardware offload/accelerator fails…
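
As a rough illustration of the 'only optimise when gains can be achieved' point, the sketch below (a hypothetical software-side heuristic, not any vendor's actual implementation) trial-compresses a small sample of each block before committing CPU to the whole thing, so already-compressed or encrypted data gets skipped cheaply:

```python
import os
import zlib

def maybe_compress(block: bytes, sample_size: int = 512, min_saving: float = 0.10):
    """Trial-compress a small sample first; only spend CPU on the full block
    if the sample suggests a worthwhile saving."""
    sample = block[:sample_size]
    trial = zlib.compress(sample, 1)                 # cheap, fast probe
    if len(trial) >= len(sample) * (1 - min_saving):
        return block, False                          # incompressible: store as-is
    return zlib.compress(block, 6), True             # worth the CPU, compress properly

already_compressed = os.urandom(64 * 1024)           # e.g. media or encrypted blocks
repetitive_text = b"GET /index.html HTTP/1.1\r\n" * 2500

for name, blk in [("random/encrypted", already_compressed), ("repetitive text", repetitive_text)]:
    out, did = maybe_compress(blk)
    print(f"{name:17} compressed={did} {len(blk)} -> {len(out)} bytes")
```

A brute-force hardware offload path, by contrast, ships every block across the bus regardless of whether it will compress at all.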

Anyway, back to these 'great marketing' features. Before clicking 'enable', consider a few basic questions:

1. Why enable one, some or all the features?

Does the application data lend itself to being optimised? Multiple re-calculations using complex algorithms carry an overhead.

2. Can you enable one, some or all non-disruptively?

Not all vendors provide the granularity to enable only the feature you've decided on.

3. Can the software dynamically revert its ‘on’ state if no benefits are being returned?

The software intelligence handling data I/O should be able to intervene and prevent unwarranted processing time.

4. If data is de-duplicated what performance overheads could be imposed?

Reducing multiple distributed copies of data to a single copy is fine, but when many requests for that data arrive at once, the single block location can become saturated or incur latency over the network. Caching locally then starts to defeat the object of de-duplication, as multiple copies end up being kept anyway; the sketch after this list illustrates the effect.
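
Here's a back-of-the-envelope sketch of that hot-spot effect (a toy, hypothetical DedupStore class, not any real product): 500 VMs logically 'own' the same boot block, but every read lands on the one physical copy:

```python
import hashlib
from collections import Counter

class DedupStore:
    """Toy content-addressed store: identical blocks share one physical slot."""
    def __init__(self):
        self.blocks = {}          # fingerprint -> data (the single physical copy)
        self.refs = Counter()     # fingerprint -> logical references
        self.read_heat = Counter()

    def write(self, data: bytes) -> str:
        fp = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(fp, data)              # only ever stored once
        self.refs[fp] += 1
        return fp                                     # the 'pointer' each VM keeps

    def read(self, fp: str) -> bytes:
        self.read_heat[fp] += 1                       # every logical copy hits the same slot
        return self.blocks[fp]

store = DedupStore()
boot_block = b"\x00" * 4096                           # a block shared by every VM's OS image
pointers = [store.write(boot_block) for _ in range(500)]   # 500 VMs, one physical block

for fp in pointers:                                   # boot storm: everyone reads 'their' copy
    store.read(fp)

fp = pointers[0]
print(f"physical copies: {len(store.blocks)}, logical refs: {store.refs[fp]}, "
      f"reads against that one block: {store.read_heat[fp]}")
```

The array will cache that block, of course, but the moment each host also caches it locally you're back to holding multiple copies, which is exactly what de-duplication set out to avoid.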

There is no silver-bullet answer here, but as you can see it's not simply a matter of ticking a feature box.

