Intel Clarifies 600p SSD Endurance Limitations, But TBW Ratings Can Be Misleading

Very informative article.

Intel didn't really address anything about the low TBW, especially for the 1TB drive. I think the real answer is that there is very little overprovisioning on the 600p drives, and this MWI information makes the 72TB endurance limit look even worse than it did at first glance. If Intel provided that info for the article, they did their 72TBW drives a disservice.

As someone interested in a 1TB NVMe M.2 drive, I can't see how I wouldn't spend the extra $120 for the 960 EVO over the 600p. Even with the 3-year vs. 5-year warranty, 72TBW vs. 400TBW is huge: it's the difference between "I'm concerned about doing extra writes" and "within reason, I'm going to use this pretty much however I please," since the warranty ends at the TBW limit or the time limit, whichever comes first.
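For a rough sense of scale, here's a back-of-the-envelope sketch (Python, using only the figures quoted above; illustrative, not vendor guidance) of the average daily write budget each rating implies over its warranty period:

```python
# Average host writes per day (GB) that would exhaust a TBW rating
# exactly at the end of its warranty period. Figures are the ones
# quoted above and are illustrative only.

def daily_write_budget_gb(tbw_terabytes: float, warranty_years: float) -> float:
    days = warranty_years * 365
    return tbw_terabytes * 1000 / days  # 1 TB = 1000 GB

print(f"600p    ( 72 TBW / 3 yr): {daily_write_budget_gb(72, 3):.0f} GB/day")
print(f"960 EVO (400 TBW / 5 yr): {daily_write_budget_gb(400, 5):.0f} GB/day")
# -> roughly 66 GB/day vs. 219 GB/day
```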

Both the 600p and 960 EVO are great everyman products, though. I wouldn't step up further unless I were doing video/3D work all day, every day, and then I'd step up to something with a hefty heatsink on it to prevent throttling. For a professional use case, I'd look at the Intel 750 series before I moved to the 960 Pro class.
 
My HD stopped working and crashed Windows. Time to reboot. All my data is gone and the drive is read-only. If it looks like a duck and quacks like a duck... If I'm going to lose all of my data, it shouldn't be because of some arbitrary, artificial limitation. I understand this feature can be good for RAID, but not for non-RAID, and especially not for OS drives.
 
Why is Intel talking about the warranty period in the endurance context? Do they use the warranty period as a reference to reassure people that the drive will last at least that long, or do they replace it if you reach the endurance maximum within that time?

It would be nice if Windows would run without any errors on locked media. The drive would effectively be a time capsule. I think I tried Linux from a live CD many years ago with no problems.
It would also be nice to have the ability to write-lock/unlock the drive yourself, at your own discretion. Wouldn't that make a nice security feature, perhaps in a Windows UAC fashion?

So many thoughts...
 
This feature should be optional. Most consumers don't run checksums or anything of the sort, so maybe it's better to have the drive turn read-only rather than start having silent data corruption.
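On the checksum point, even a tiny script can catch silent corruption after the fact. A minimal sketch using only the Python standard library; the directory argument and manifest filename are placeholders, not part of any real tool:

```python
# Hash every file under a directory and compare against a saved manifest
# to detect silent corruption. Paths and the manifest name are placeholders.
import hashlib, json, pathlib, sys

def sha256_of(path: pathlib.Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root: pathlib.Path) -> dict:
    return {str(p.relative_to(root)): sha256_of(p)
            for p in root.rglob("*") if p.is_file()}

if __name__ == "__main__":
    root = pathlib.Path(sys.argv[1])
    manifest = pathlib.Path("manifest.json")
    current = build_manifest(root)
    if manifest.exists():
        for name, digest in json.loads(manifest.read_text()).items():
            if current.get(name) != digest:
                print(f"MISMATCH or missing: {name}")
    manifest.write_text(json.dumps(current, indent=2))
```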
 

The warranty is based solely upon the MWI counter. Once it has expired, so has your warranty.
 


It is a 100 to 0 measurement, with 100 being a brand new SSD. You can use a SMART value reader (such as Crystal Disk Info) to read the MWI. However, it isn't always the same attribute for each SSD. The Intel MWI counter is attribute E9.
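If you'd rather script it than open a GUI, here is a minimal sketch that parses `smartctl -A` output for attribute 233 (0xE9). It assumes smartmontools is installed and a SATA Intel SSD that exposes attribute-style SMART data; NVMe drives report a different health log, so treat this as illustrative:

```python
# Read the Media Wearout Indicator (SMART attribute 233 / 0xE9) by parsing
# `smartctl -A` output. Assumes smartmontools and a SATA drive that exposes
# attribute-style SMART data.
import subprocess, sys

def read_mwi(device: str):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] == "233":   # attribute rows start with the ID
            return int(fields[3])           # normalized value: 100 (new) -> 0 (worn)
    return None

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"
    mwi = read_mwi(dev)
    print(f"{dev}: MWI = {mwi}" if mwi is not None else f"{dev}: attribute 233 not found")
```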

 
No more Intel SSDs for me. You know, Intel, that 160GB drive firmware bug that bit me a couple of times was bad, and now this BS. Goodbye, Intel, I am sad to see you go...
 
It isn't reasonable to base the warranty on hard-to-determine internal functions of a device. The lifetime of the warranty should be based on host writes.

If Intel wants to use the MWI values to help them with quality control, that is fine. If Intel wants the user to treat MWI as an indicator of when to start thinking about replacing the device, that is fine too, but the warranty should still cover any failing devices within the stated lifetime.

Should the user be denied a warranty replacement for the device if the firmware doesn't handle some usage case and reports a nonsensical MWI value? What about if the firmware goes batty and just writes for no reason?

Finally, of course you should be able to read the data by setting a mode someplace that says: do the best you can, and, for a particular read, report whether this is good data or just the best guess, and do operation X to return the raw data including the error correction code.
 
Hm, a lot of Intel SSD haters. I hope people realize that Samsung has a worse reliability record than Intel here. :) Intel > Crucial/Micron > Samsung is the pecking order on quality SSDs. They've all had problems, but in order of quality/reliability, that's how it comes out for those who have been paying attention. It is true that they aren't always the best value, but that's another discussion.
 

Speaking for myself, and I suspect many others who've commented above, the issue is not so much about reliability as it is about Intel effectively placing a hard limit on the life of this drive.

Let's say I buy a new car that has a 5 year, 150,000km warranty. Once I cross the 150,000km mark, I understand there's an ever increasing risk of a major component of that car failing. That's fine, I understand the manufacturer can only accept wear & tear liability to a certain point. I also understand that if I decide to quit my job and become a full time Uber driver, I'm going to rack up those kilometres and exceed my warranty really quickly. Again, that's fine, I'm an adult, I understand those risks and can make an informed decision about whether I choose to keep running the car and deal with the costs of failure myself, or replace it.

What Intel has done is effectively set a hard limit on the life of their drive such that, the moment the endurance point of the warranty period is reached, it will no longer operate at all and must be replaced... the consumer who paid for and owns the drive has absolutely no say in the matter. As this article spells out, it's actually slightly worse than setting a hard kilometre (or mile) limit, because SSDs effectively clock up writes internally with their own maintenance tasks. At least with a car and a km (/mile) limit, the owner is in complete control of how far they choose to drive.
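To put a number on that "internal writes" point: the NAND absorbs host writes multiplied by a write-amplification factor (WAF), so if the cutoff tracks NAND-level wear rather than raw host writes, the host-visible budget shrinks accordingly. A rough sketch; the WAF values are hypothetical illustrations, not measured 600p figures:

```python
# Rough illustration of how internal maintenance writes (write amplification)
# shrink the host-visible write budget, assuming the cutoff tracks NAND wear.
# WAF values are hypothetical examples, not measured figures.

def host_writes_allowed_tb(nand_budget_tb: float, waf: float) -> float:
    return nand_budget_tb / waf

for waf in (1.5, 2.5, 4.0):
    print(f"WAF {waf}: ~{host_writes_allowed_tb(72, waf):.0f} TB of host writes "
          f"against a 72 TB NAND-write budget")
```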

Now Intel would likely argue that my analogy is flawed because, where a car might break down, the risk for the SSD is that exceeding the endurance rating could result in users losing critical data. So, the argument goes, their "feature" is implemented to protect users. That's a feeble argument on a number of levels:
    1) If they genuinely believe this is of benefit to consumers, why don't they advertise it anywhere? Why isn't the "feature" listed on product pages, or **at least** tech support pages, for the product?
    2) Given that the OS will simply fail to boot and report a drive failure, many users are likely to think that the drive has completely failed and replace it or their entire computer, thus losing their data anyway. We can't know for certain, but I'd suggest that the risk of users losing data through this assumption of complete drive failure is significantly higher than the actual risk Intel claims to be protecting us from.
    3) If this feature is genuinely in place to protect users from data loss, why is the same TBW rating in place for the entire capacity range? Why is the NAND in the 512GB model able to cope with just one quarter of the write cycles of the 128GB drives? (Rough numbers are sketched below.)
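The arithmetic behind point 3, assuming the flat 72TBW rating quoted for the range and ignoring write amplification, looks roughly like this:

```python
# Full-drive writes implied by a single flat TBW rating across capacities.
# Ignores write amplification; purely to show how the same 72 TBW figure
# implies very different per-cell wear on different capacities.

RATED_TBW = 72  # TB, the figure quoted for the whole range

for capacity_gb in (128, 256, 512, 1024):
    full_drive_writes = RATED_TBW * 1000 / capacity_gb
    print(f"{capacity_gb:>4} GB model: ~{full_drive_writes:.0f} full-drive writes")
# -> ~562 writes for the 128 GB model vs. ~140 for the 512 GB and ~70 for the 1 TB
```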

Even if you somehow cling to the argument that this "feature" really is in place to protect non-technical users, there is absolutely no excuse for not providing more technical users with the ability to turn it off and use the drive they bought until it actually fails... rather than just reaches an arbitrary cap that Intel has put in place.

So for me, this is nothing to do with reliability. It's about Intel hard-coding the death of my drive at a point when they've determined that I've used it enough and should get a new one, and then trying to justify it with weak arguments which do not hold up under scrutiny.
 
Intel believes that this way it's going to increase its future SSD sales. That's why it has such a low price on the 600p. It's like Intel is saying: your SSD is dead, get a new one, they're dirt cheap. But how many of those locked-drive customers are going to buy an Intel SSD again? I think Intel is losing future customers with that bad strategy. What's next for Intel? Maybe they should put a counter on their CPUs, telling us we can use our CPUs for x amount of time (hours/days), and then after that amount expires, boom, they're locked. Get a new one, it's time for an upgrade. Are they serious?
 
The author makes a good point about educating consumers so that they know to (1) buy a new drive and (2) copy the data off the old drive.

The Intel SSD Toolbox is where you read the wear-out indication. Intel could also start giving warnings every 10 minutes when the drive is within a few days of locking.

The MWI is not a linear measure on many drives. We use 8 SSDs in RAID configurations where each of the 8 drives gets exactly the same IO load, and we often see SSDs fail. One of the common failure modes is that the MWI on one of the drives drops from 99 left to 40 left, or 28 left, or some other really low number while the other 7 drives are still at 99. At that point you need to replace the drive ASAP -- it's going soon. I've always assumed the MWI was looking at the number of bad flash cells and that the large jumps in MWI were the result of chunks of the SSD failing, but that's pure guesswork.
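That sudden-drop pattern lends itself to simple monitoring: poll each member's normalized MWI and flag any drive that has fallen well below its peers. A minimal sketch, assuming smartmontools, SATA drives that expose attribute 233 (0xE9), and made-up device names and threshold:

```python
# Flag any array member whose normalized MWI has fallen well below its peers,
# matching the sudden-drop failure mode described above. Device names and
# the 20-point threshold are illustrative assumptions.
import statistics, subprocess

def read_mwi(device: str):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] == "233":   # Intel Media Wearout Indicator
            return int(fields[3])
    return None

def check_array(devices, threshold=20):
    readings = {d: read_mwi(d) for d in devices}
    known = [v for v in readings.values() if v is not None]
    if not known:
        return
    median = statistics.median(known)
    for dev, mwi in readings.items():
        if mwi is not None and median - mwi >= threshold:
            print(f"WARNING: {dev} MWI {mwi} vs. array median {median}; replace soon")

check_array([f"/dev/sd{letter}" for letter in "abcdefgh"])
```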

Aside: the comment about "no compression now that SandForce is gone" is not true. Some of the enterprise all-flash arrays use compression. IBM has "IBM® Real-time Compression™", and Hitachi's "VSP F series flash-only storage arrays feature custom flash modules of up to 6.4 TB raw capacity with on-board data compression."
 


That is correct; I was referring more to SSD controller-level compression, as opposed to software/system-level deduplication and compression, which are, of course, wonderful tech :)

 
This is more about Intel protecting its enterprise margins than it is about protecting user data. Intel has had problems with datacenters using consumer gear, and not paying the enterprise tax. This same SSD will come to market as an enterprise product with a 3X endurance rating and 3X the cost.
 


This can come from a number of things. Some software-based RAID implementations allow users to select a parity drive, which is then hammered more than the other drives in the array.

It sounds like you are using more of a traditional hardware RAID, though, but LSI and Adaptec controllers are also known to selectively hammer one drive, or set of drives, more than others, even within the same contiguous array. Some of this is due to the RAID coding, but it can also be due to application hotspots and other phenomena. I occasionally review HBAs and RAID controllers (since the first 6Gb/s adapters came to market), and have measured wear on a 24-drive RAID 10 set with Micron P400m drives (outstanding SSDs, btw). I confirmed that some drives received more wear than others, and LSI reps have confirmed that this is an issue in some circumstances. They have tried to address it through firmware and so on as the SSD RAID age came to fruition, but most users are still running older gear due to maintenance contracts.

I've always found it best to leave an unaddressed portion of the array to serve as extra OP for the SSDs; while it doesn't address the issue directly, it can help to improve general performance and endurance. Of course, the economics aren't great, but since I'm not running a production environment that isn't as much of a concern. There are some rather scattered articles around the net, mostly on Linux user forums, that explore the uneven wear issue in some detail, but I'm not aware of a one-size-fits-all solution. Uneven wear is also an issue with HDDs, as some will go through far more load/unload cycles due to concentrated activity on certain LBA spans.
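To put a rough number on the "extra OP" tip: effective over-provisioning is the spare NAND the controller can use relative to the LBA span actually exposed to writes, so leaving part of the array unpartitioned (and trimmed or never written) raises it. A back-of-the-envelope sketch with illustrative capacities:

```python
# Rough effective over-provisioning when part of the LBA range is left
# unpartitioned (and trimmed or never written). Capacities are illustrative.

def effective_op(physical_gb: float, used_lba_gb: float) -> float:
    # OP expressed the usual way: spare capacity / capacity actually in use
    return (physical_gb - used_lba_gb) / used_lba_gb

physical = 549.8  # ~512 GiB of raw NAND behind a nominal "512 GB" drive (example)
for used in (512, 480, 410):  # full span vs. leaving ~6% / ~20% unpartitioned
    print(f"using {used} GB of LBA span: effective OP = {effective_op(physical, used):.1%}")
```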
 


Double check where Hitachi and IBM are doing the compression.

 


I was under the impression that Hitachi does compression with a secondary FPGA, making it a system-level approach. Here is an article I wrote on it.

http://www.tomsitpro.com/articles/hds-hitachi-vsp-cloud-amazon,1-3258.html

The part covering FPGA offload is buried a bit in the VSP G Series yada yada, but here it is for reference:

As with many of the HDS core technologies, the company provides non-disruptive and transparent tiering services by employing powerful FPGAs to offload the associated processing overhead. Many of the storage vendors leverage the x86 platform to perform compute-intensive processing tasks, such as inline deduplication, in a gambit to reduce cost. In contrast, FPGAs can process more instructions per cycle, which boosts efficiency and allows the company to perform compute-intensive tasks in a non-disruptive fashion.

HDS leverages the intrinsic benefits of FPGAs for many of its features and employs sophisticated QoS mechanisms to eliminate front-end I/O and latency overhead. The industry is beginning to migrate back to FPGAs for some compute-intensive tasks, and Intel is even working to bring FPGAs on-die with the CPU as it grapples with the expiration of Moore's law.

HDS has extensive experience with FPGA-based designs and views the recent resurgence as a validation of its long-running commitment to the architecture, which it infused into its product and software stacks. It will be interesting to see how HDS, and others, adapt to the tighter integration as Intel moves to fuse FPGAs onto the CPU.

If memory serves correctly, IBM uses the same offloaded approach. These are in the 'controller' as defined by the node, but I am unsure if it is direct compression on the controller inside the final storage device (i.e., HDD/SSD).
 
If MWI is a measure of spare capacity (which is essentially the same thing as counting bad blocks), then it makes sense that it would drop at a non-linear rate. How abruptly it falls is primarily a function of the variance of block endurance.

If endurance is extremely consistent, across all blocks, and the SSD does a superb job of leveling wear, then it would basically be a cliff. On the other hand, if the SSD does a relatively poor job of wear-leveling and there's a relatively wide distribution of block endurance levels, then MWI should drop somewhat linearly (although you'd get less life out of the drive, assuming mean endurance and overprovisioning were the same, in both cases).
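A toy Monte Carlo sketch of that intuition (all parameters are made-up, not real NAND figures): give each block an endurance drawn from a distribution, apply perfectly even wear, and track the fraction of surviving blocks as a stand-in for an MWI-like spare-capacity gauge.

```python
# Toy simulation: blocks with varying endurance under perfectly even wear.
# Low variance -> a cliff near the mean; high variance -> an earlier but
# more gradual decline. Numbers are illustrative only.
import random

def surviving_fraction(mean_pe=3000, stdev=0, blocks=10_000, step=100):
    endurance = [max(1, random.gauss(mean_pe, stdev)) for _ in range(blocks)]
    cycles, curve = 0, []
    while True:
        alive = sum(1 for e in endurance if e > cycles) / blocks
        curve.append((cycles, alive))
        if alive == 0:
            return curve
        cycles += step

for stdev in (50, 500, 1500):
    curve = surviving_fraction(stdev=stdev)
    first = next(c for c, a in curve if a <= 0.9)  # 10% of blocks dead
    last = next(c for c, a in curve if a <= 0.1)   # 90% of blocks dead
    print(f"stdev {stdev:>4}: 10% dead at ~{first} cycles, 90% dead at ~{last} cycles")
```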
 
I read the additional and corrected text.
So, in conclusion, 72TB wasn't a threshold for the read-only feature, right? I feel the endurance part of the 600p review is completely misleading. There is no further evidence shown to judge the 600p as a low-endurance SSD, especially for the high-capacity models.
I hope Tom's will ask Intel, "What does the TBW figure in the 600p's spec actually show?" I predict the answer is, "It's a reference for endurance, but it does not show the actual endurance of each model."
 
Here is another problem: it stands to reason that some kind of software code controls all this, and by its nature, code has bugs. So who is to say that the drive will not die because of a code-induced bug? Intel is wasting its time on this: people will be less likely to buy the drives, there is a danger that the code will brick the drive, and Intel had to spend money on the code to begin with and will now have to maintain it. Intel used to be a great company, but slowly they are eroding their user base. Intel, get your house in order.
 
All drives, in recent history - whether mechanical or solid state - have a large amount of embedded firmware. In the past, Intel has used 3rd party controllers and differentiated only with custom firmware (and their own NAND).

Just look at the firmware updates released for any given SSD, and check the release notes for a sense of the kind of bugs that occur.

It goes without saying that the only way to ensure your data is safe is to back up anything you care about. Or you could just put it on the cloud, in the first place, but that has its own disadvantages & I'm old school.

P.S. I think Paul recently wrote an article on a move to dis-intermediate the storage from the host and basically let the host OS do the work of wear leveling, garbage collection, etc. I don't know whether that'll come to the consumer space, but if it does, it'll be a while.
 