News: Backblaze Annual Failure Rates for SSDs in 2022: Less Than One Percent

PlaneInTheSky

Oct 3, 2022
It's a pointless stat for consumers, because most SSDs fail during power loss or power fluctuations.

Because servers are under constant power and have many ways to mitigate power loss, they are not comparable to consumer PCs.

Backblaze literally says their data is ONLY applicable to comparable server environments and has no relevance to consumer devices.

You can in theory put a PC on an uninterruptible power supply, which would give you power stability similar to a server's, but it would be much cheaper to just back up to more media. Another thing is that cheap consumer UPSes use batteries, which are about as likely to catch fire as your storage is to fail from a power interruption.
 
It's a pointless stat for consumers, because most SSDs fail during power loss or power fluctuations.

Because servers are under constant power and have many ways to mitigate power loss, they are not comparable to consumer PCs.

Backblaze literally says their data is ONLY applicable to comparable server environments and has no relevance to consumer devices.

You can in theory put a PC on an uninterruptible power supply, which would give you power stability similar to a server's, but it would be much cheaper to just back up to more media. Another thing is that cheap consumer UPSes use batteries, which are about as likely to catch fire as your storage is to fail from a power interruption.

Depends on the drive and power supply. Some drives have capacitors that allow all in-flight write operations to complete. If you're on a UPS or a laptop, you're covered there too.

That said, power loss to an SSD results in FILE CORRUPTION (data), NOT DRIVE FAILURE (physical). Physical failures affect everyone, power failure or not.

File corruption is what redundant filesystems like ZFS check for; they add a few more layers of checksums on top of the per-block/sector checksum. NTFS has a transaction log: updates to table entries are bracketed by transaction begin/end records, and if power goes out before the operation is done, the end record is missing, the entry is assumed bad, and the change is rolled back. XFS isn't bad either; I consider it reliable enough for smaller backup arrays.
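In rough terms, the begin/commit idea looks like this. A toy Python sketch of the rollback logic, not how NTFS or ZFS actually implement their logs; the journal file name and record format are made up for illustration:

[code]
import json, os

JOURNAL = "journal.log"   # hypothetical journal file

def journaled_update(table, key, value):
    """Log the intent, apply the change, then log the commit marker."""
    with open(JOURNAL, "a") as j:
        j.write(json.dumps({"op": "begin", "key": key, "value": value}) + "\n")
        j.flush(); os.fsync(j.fileno())   # intent is durable before the change
        table[key] = value                # the actual update
        j.write(json.dumps({"op": "commit", "key": key}) + "\n")
        j.flush(); os.fsync(j.fileno())   # commit marker is durable last

def recover(table):
    """After a crash: any begin without a matching commit gets rolled back."""
    if not os.path.exists(JOURNAL):
        return
    pending = {}
    with open(JOURNAL) as j:
        for line in j:
            rec = json.loads(line)
            if rec["op"] == "begin":
                pending[rec["key"]] = rec["value"]
            else:                          # "commit"
                pending.pop(rec["key"], None)
    for key in pending:                    # power died mid-update: discard it
        table.pop(key, None)
[/code]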
 
Mar 10, 2023
Interestingly, we use SSDs on our servers in exactly the same way: as a boot and program drive and for logs, so it's encouraging to see that Backblaze came to the same conclusion.

We tried using an SSD for the data drive (the PostgreSQL database tablespace) but noticed that it would hang every now and then, which we attribute to the drive's garbage collection (GC). In the end it wasn't significantly faster than the HDDs, which we over-provisioned to keep the data as much as possible on the outermost tracks.

In 23 years we've never had an HDD failure but we replace servers and thus the drives every 4 years.
 

bit_user

We tried using an SSD for the data drive (the PostgreSQL database tablespace) but noticed that it would hang every now and then, which we attribute to the drive's garbage collection (GC).
Which drive? Datacenter SSDs put a lot of emphasis on minimizing tail latencies; most consumer SSDs don't at all.

StorageReview routinely does reviews of enterprise SSDs. Here's a recent one:


In the end it wasn't significantly faster than the HDDs, which we over-provisioned to keep the data as much as possible on the outermost tracks.
Now, I'm dying to know which SSD you tried, because even the most optimal data layout on the fastest 15K RPM HDDs is still limited to mere hundreds of IOPS, whereas a good NVMe drive can deliver several tens of thousands at a mere QD=1.
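As a rough sanity check on the HDD side (illustrative figures, not measurements of any particular drive):

[code]
# Back-of-envelope random-IOPS ceiling for a single 15k RPM spindle.
rpm = 15_000
avg_rotational_ms = (60_000 / rpm) / 2     # half a revolution ~ 2.0 ms
avg_seek_ms = 3.5                          # assumed short-stroked average seek
service_time_ms = avg_rotational_ms + avg_seek_ms

hdd_random_iops = 1_000 / service_time_ms  # ~180 IOPS per spindle
print(f"~{hdd_random_iops:.0f} random IOPS")
[/code]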

In 23 years we've never had an HDD failure but we replace servers and thus the drives every 4 years.
Your server fleet consists of just 1, I take it?
 
Mar 10, 2023
Which drive? Datacenter SSDs put a lot of emphasis on minimizing tail latencies; most consumer SSDs don't at all.

StorageReview routinely does reviews of enterprise SSDs. Here's a recent one:



Now, I'm dying to know which SSD you tried, because even the most optimal data layout on the fastest 15K RPM HDDs is still limited to mere hundreds of IOPS, whereas a good NVMe drive can deliver several tens of thousands at a mere QD=1.

It is a Samsung SSD bought in 2019 or 2020, IIRC. The model is a Samsung SSD 970 PRO 1TB. It currently has about 48 TB written (TBW).

If there are a lot of updates, HDDs can update a sector in place, whereas SSDs must write a new block, and that block size tends to be large. Still, we initially get better performance with the SSD, but it slows down under heavy load due to the occasional seconds-long hang. The issue was mentioned in forums at the time. It might even be that the Linux driver doesn't handle TRIM correctly, but that's just a theory.

A database restore is significantly faster on the SSD, but we don't notice much difference in the running application's general speed, except for the occasional hang. Who knows, maybe it's a defect.


Your server fleet consists of just 1, I take it?

We have a main server and a backup server for the application. We recycle older servers to fill other server roles such as printing or archiving. My development machines are all SSDs (one SATA and two NVMe), and my backup drives are all HDDs.
 

bit_user

It is a Samsung SSD bought in 2019 or 2020, IIRC. The model is a Samsung SSD 970 PRO 1TB. It currently has about 48 TB written (TBW).
Still, in spite of the name, it's a consumer-grade drive. Maybe pro-sumer, but definitely not server-grade.

If there are a lot of updates, HDDs can update a sector in place, whereas SSDs must write a new block, and that block size tends to be large.
Both SSDs and HDDs have to write entire blocks. An update of a partial block or RAID stripe necessarily involves a read-modify-write operation. SSDs are much faster at that than HDDs.
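To make that concrete, a partial update of a larger block boils down to something like this (a simplified sketch, not any particular firmware or driver; the 4 KiB block size is just an assumption):

[code]
BLOCK_SIZE = 4096   # assumed block size

def partial_write(device, block_no, offset, payload):
    """Update a few bytes inside one block: read, modify in RAM, write back."""
    block = bytearray(device[block_no])               # read the whole block
    block[offset:offset + len(payload)] = payload     # modify just the slice
    device[block_no] = bytes(block)                   # write the whole block back

# usage: a toy "device" of two zeroed blocks
device = {0: bytes(BLOCK_SIZE), 1: bytes(BLOCK_SIZE)}
partial_write(device, 0, 100, b"hello")
[/code]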

Still, we initially get better performance with the SSD, but it slows down under heavy load due to the occasional seconds-long hang.
It'd be interesting to know why it hangs for entire seconds, but it's well known that consumer SSDs bog down under sustained load. It's principally due to the way they use low-density flash to buffer writes. Once you fill that buffer, your write speed drops to the native speed of high-density writes.
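A toy model of that cache behavior (all figures assumed for illustration, not the specs of the 970 Pro or any other drive):

[code]
def sustained_write_seconds(total_gb, cache_gb=40.0,
                            cache_speed_gbs=2.5, native_speed_gbs=0.5):
    """Time to write total_gb when only the first cache_gb hits the fast cache."""
    fast = min(total_gb, cache_gb)
    slow = max(0.0, total_gb - cache_gb)
    return fast / cache_speed_gbs + slow / native_speed_gbs

for burst_gb in (10, 40, 200):
    t = sustained_write_seconds(burst_gb)
    print(f"{burst_gb:>4} GB burst -> {burst_gb / t:.2f} GB/s effective")
[/code]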

[chart: sustained sequential write performance]

While it seems the 970 can handle sequential writes without a dropoff, there definitely appears to be some buffering in effect for small writes:

[chart: sustained small-block write performance]


Another issue with sustained workloads on M.2 drives is thermal throttling! And that's something that you'll definitely encounter with a heavy database workload.

[chart: M.2 SSD thermal throttling under sustained load]


Still, it's not a big enough drop to explain the hang.

It might even be that the Linux driver doesn't handle TRIM correctly, but that's just a theory.
It's recommended not to mount it with the TRIM option. The preferred way to handle TRIM is to schedule fstrim to run during off-peak hours.
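If it helps, here's roughly what that amounts to on a Linux box. A sketch only: many distros already ship an fstrim.timer unit that does the same thing on a schedule, and the mountpoint below is a hypothetical example. Needs root to actually trim:

[code]
import subprocess

def mounted_with_discard(mountpoint):
    """True if the filesystem is mounted with the continuous 'discard' option."""
    opts = subprocess.run(["findmnt", "-no", "OPTIONS", mountpoint],
                          capture_output=True, text=True, check=True).stdout
    return "discard" in opts.strip().split(",")

def trim(mountpoint):
    """One-shot TRIM of free space, same as a scheduled fstrim run would do."""
    subprocess.run(["fstrim", "-v", mountpoint], check=True)

mp = "/var/lib/postgresql"   # hypothetical tablespace mountpoint
if mounted_with_discard(mp):
    print(f"{mp} is mounted with 'discard'; consider dropping it and "
          "relying on scheduled fstrim instead.")
trim(mp)
[/code]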

Whatever the specific cause of the hangs, I think the main issue is probably trying to use a consumer SSD outside of its intended usage envelope. The issue might've been compounded by misuse of the TRIM mount option.
 
Mar 10, 2023
Still, in spite of the name, it's a consumer-grade drive. Maybe pro-sumer, but definitely not server-grade.

At the time it was one of the most expensive SSDs on the market and definitely marketed as an enterprise SSD. It uses MLC, which is not typical for consumer SSDs; those are usually TLC or QLC.


Both SSDs and HDDs have to write entire blocks. An update of a partial block or RAID stripe necessarily involves a read-modify-write operation. SSDs are much faster at that than HDDs.

The block size is much smaller on HDDs and SSD blocks have to be "GC"d before they can be rewritten. An HDD sector can be updated in-place. In any case this is our real-world experience and we have to go by that. I suspect that a modern Samsung 990 Pro or WD SN850X with a heatsink would be significantly better.

It'd be interesting to know why it hangs for entire seconds, but it's well known that consumer SSDs bog down under sustained load. It's principally due to the way they use low-density flash to buffer writes. Once you fill that buffer, your write speed drops to the native speed of high-density writes.


While it seems the 970 can handle sequential writes without a dropoff, there definitely appears to be some buffering in effect for small writes:



Another issue with sustained workloads on M.2 drives is thermal throttling! And that's something that you'll definitely encounter with a heavy database workload.

We do a lot of relatively small updates, so that might not be the sweet spot for that drive. We will replace it when I visit the customer, but they are satisfied with the performance at the moment, so there's no rush.


 

bit_user

At the time it was one of the most expensive SSDs on the market and definitely marketed as an enterprise SSD. It uses MLC, which is not typical for consumer SSDs; those are usually TLC or QLC.
I understand why you thought it would be a good choice, but it actually isn't an enterprise drive. Enterprise/datacenter SSDs aren't typically sold via retail channels. If you look at some of the SSDs reviewed here, it will quickly change your idea of what constitutes an expensive SSD:



Most of those drives/brands, you probably haven't even heard of.

FWIW, this is a Samsung enterprise drive:



Note: they don't call it "Pro" anything. Real pros already know what it is.

The block size is much smaller on HDDs and SSD blocks have to be "GC"d before they can be rewritten.
A RAID stripe is more comparable in size. If you're running the HDDs in RAID-5 or RAID-6, then you'll have a similar write granularity.
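For reference, a partial-stripe update on RAID-5 works out to two reads and two writes: you XOR the old data out of the parity and the new data in. A simplified in-memory sketch, not md or any real controller:

[code]
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def update_chunk(stripe, member, new_data):
    """stripe maps member names ('d0', 'd1', ..., 'p') to equal-size chunks."""
    old_data = stripe[member]                                # read 1
    old_parity = stripe["p"]                                 # read 2
    stripe[member] = new_data                                # write 1
    stripe["p"] = xor(xor(old_parity, old_data), new_data)   # write 2

# usage: two data members plus parity, 4-byte chunks for illustration
stripe = {"d0": b"\x00" * 4, "d1": b"\xff" * 4}
stripe["p"] = xor(stripe["d0"], stripe["d1"])
update_chunk(stripe, "d0", b"\x0f" * 4)
assert stripe["p"] == xor(stripe["d0"], stripe["d1"])
[/code]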

If you refer back to that middle graph, the IOPS measured on the 970 Pro bottomed out at about 20K. That was 4K random writes, which is about as bad as it gets. That's still about two orders of magnitude greater than the IOPS you'll get out of an HDD.

In any case this is our real-world experience and we have to go by that.
Yes, but you just guessed at the underlying cause and, on that basis, reached a conclusion that SSDs aren't good for databases. This flies in the face of more than a decade of industry experience. SSDs are, in fact, very good at hosting databases if you select the appropriate drive type and configure it properly via the OS. This site is not the best place to find help or advice on how to do that.

BTW, it helps the readability of your replies if you either type your reply below the [quote] block, or chop the previous message into multiple [quote] blocks.
 
Mar 10, 2023
I understand why you thought it would be a good choice, but it actually isn't an enterprise drive. Enterprise/datacenter SSDs aren't typically sold via retail channels. If you look at some of the SSDs reviewed here, it will quickly change your idea of what constitutes an expensive SSD:

You are confusing what is available now with what was available circa 2019.



Most of those drives/brands, you probably haven't even heard of.

FWIW, this is a Samsung enterprise drive:

Correct, "is". This is not what was available in 2019.


Note: they don't call it "Pro" anything. Real pros already know what it is.

Perhaps you can recommend a better drive which was available in 2019.


Yes, but you just guessed at the underlying cause and, on that basis, reached a conclusion that SSDs aren't good for databases.

I didn't guess at anything. We just didn't experience the hangs with the HDD that we did with the SSD. It's that simple. End of story. Perhaps the new drives (mentioned already) will be better.

 

bit_user

I'll put this at the top, this time: the way you're writing replies is hard to follow. It helps the readability of your replies if you either type your entire reply below the [quote] block, or chop the previous message into multiple [quote] blocks.

You are confusing what is available now with what was available circa 2019.
Nope. Here's a slightly different link, where I've gone back 6 pages to their 2019 enterprise SSD reviews.



As for regional availability, I'm sure all the enterprise storage vendors have distributors in all the major markets. You just have to know where to look for them. Enterprise SSDs are not typically sold through retail channels, so you'll almost always have to order it online.

Correct, "is". This is not what was available in 2019.
I didn't mean that it was. I was just giving a for-instance, to show that they have different model lines than the ones you've heard of. So, here's one from 2019:



...also, not branded as "Pro".

Perhaps you can recommend a better drive which was available in 2019.
I'd need to know a bit about the workload, but then I'd basically just have gone through the above reviews. Given that you're able to get by with HDDs, I'd guess most enterprise drives oriented towards mixed read/write workloads would do fine.

I didn't guess at anything.
You said you hadn't identified the root cause, so you concluded the performance problems were simply due to using an SSD.

That's like a patient who has a heart attack and eats a lot of peanut butter. The doctor doesn't know what caused the heart attack, but since most people don't have heart attacks and don't eat that much peanut butter, concludes the peanut butter must've caused it. If medicine were that simple, we'd have cured cancer by now.

End of story.
You do what you want. As long as you're not spreading misinformation on here, I don't care one bit.

FWIW, the obvious thing you never did was to ask people with experience running a database on SSDs for guidance about the appropriate hardware and software configuration. Fortunately, you managed to find a configuration that's working for you. It's a good thing you didn't need to handle a higher transaction volume.

There's a point where consumer hardware no longer gets the job done and you have to switch over to enterprise gear. Clearly, you're past that point.

Perhaps the new drives (mentioned already) will be better.
SSDs from 10 years ago would've probably done the job, if you'd picked the right model and configured it properly.
 
Mar 10, 2023
You do what you want. As long as you're not spreading misinformation on here, I don't care one bit.

Yes, that's what we did. Interestingly Backblaze uses their SSDs in exactly the same way. You might want to let them know they are doing it wrong.

FWIW, the obvious thing you never did was to ask people with experience running a database on SSDs for guidance about the appropriate hardware and software configuration. Fortunately, you managed to find a configuration that's working for you. It's a good thing you didn't need to handle a higher transaction volume.

In fact, this is what was recommended to us by the experts. It was the top of the line available to buy. The 983 was announced in 2019 but not available at the time we bought it. And, at the time, Postgres itself was not recommending SSDs. You need to understand better the way Postgres works.

There's a point where consumer hardware no longer gets the job done and you have to switch over to enterprise gear. Clearly, you're past that point.

You might want to reread what I wrote.


SSDs from 10 years ago would've probably done the job, if you'd picked the right model and configured it properly.

That's why Postgres was recommending against them. But you obviously know better.
 

bit_user

Polypheme
Ambassador
We're still not there on the quoting thing, but I appreciate the effort. Note that the [quote] and [/quote] tags have to be matched. In case it helps, here's an example:


Interestingly Backblaze uses their SSDs in exactly the same way. You might want to let them know they are doing it wrong.
They said they just use them as boot drives. I never said consumer drives aren't fine for that (except perhaps in extreme circumstances).

What I'm talking about is your statement:

"We tried using an SSD for the data drive (Postgresql Database tablespace) but noticed that it would hang every now and then which we attribute to the drive's GC. And in the end it wasn't significantly faster than the HDD's ..."​


The article doesn't address this one way or another. We don't know to what extent they even use relational databases. If they did, and under any kind of non-trivial load, then perhaps they wouldn't use consumer SSDs. However, that's very speculative.

The point I come back to is that enterprise drives tend to be specifically optimized to minimize tail latencies, which seems highly relevant to the behavior you reported. If you even look at how StorageReview tests enterprise drives, it's primarily latency-oriented.
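If you want to see it in your own workload, log per-query latencies and look at the high percentiles rather than the average; a hang shows up immediately. A quick sketch with made-up numbers:

[code]
import statistics

def percentile(samples, p):
    s = sorted(samples)
    return s[min(len(s) - 1, round(p / 100 * (len(s) - 1)))]

# toy data: 1% of queries stall for 45 ms, the rest take 0.2 ms
latencies_ms = [0.2] * 9_900 + [45.0] * 100

print("mean :", round(statistics.mean(latencies_ms), 3), "ms")   # looks fine
print("p50  :", percentile(latencies_ms, 50), "ms")              # looks fine
print("p99.5:", percentile(latencies_ms, 99.5), "ms")            # the hang
[/code]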

In fact, this is what was recommended to us by the experts.
Like database admins, or people more experienced in building gaming computers?

The 983 was announced in 2019 but not available at the time we bought it.
The reason I'm providing these links is so you can get a sense of the range of enterprise drives that existed at the time. There are many other Samsung SSDs listed before it, but also drives from at least half a dozen other brands. The elephant in the room would be Intel's DC P4800X, which launched way back in 2017.


There were few metrics it didn't completely dominate.

at the time, Postgres itself was not recommending SSDs. You need to understand better the way Postgres works.
Sometimes, advice like this sticks around on the internet long after it's relevant. If you check the enterprise SSD reviews on StorageReview I've been linking, they specifically test SQL workloads.

The solid-state storage industry has been servicing database workloads for a long time, even going so far as to build exotic all-flash arrays for the highest levels of transaction throughput. If you want to see what the big boys are doing, NextPlatform has been talking about products in that field since about 2015. Maybe they go back even further; I'm not sure how much earlier the publication even existed.


That's why Postgres was recommending against them. But you obviously know better.
I know enough about storage to know about write amplification. And there's no way it's bad enough to put a hard drive ahead of a decent enterprise SSD. If you look up the media transfer rate of the hard drive and erase block size of the SSD, you can do the math for yourself. Even before factoring in seek & rotational latency, SSDs already win.
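Roughly (all figures assumed for illustration; plug in the numbers from your own drives' spec sheets):

[code]
erase_block_mb = 8        # assumed SSD erase-block size
ssd_write_mb_s = 2_000    # assumed sustained SSD write rate
hdd_media_mb_s = 250      # assumed HDD sequential media rate

# Worst case: pretend a tiny update forces rewriting an entire erase block.
ssd_worst_case_ms = erase_block_mb / ssd_write_mb_s * 1_000   # ~4 ms
# The HDD merely *streaming* that much data, with zero seeks:
hdd_transfer_ms = erase_block_mb / hdd_media_mb_s * 1_000     # ~32 ms

print(f"SSD worst-case rewrite: {ssd_worst_case_ms:.0f} ms")
print(f"HDD transfer alone:     {hdd_transfer_ms:.0f} ms (before any seek)")
[/code]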

BTW, older enterprise SSDs frequently supported a 512-byte sector mode. These days, it shouldn't be necessary.