Re: Intel SSDs that may not suck

From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Intel SSDs that may not suck
Date: 2011-03-29 10:34:08
Message-ID: 4D91B5A0.6040707@gmail.com
Lists: pgsql-performance

Hello Greg, list,

On 2011-03-28 22:21, Greg Smith wrote:
> Today is the launch of Intel's 3rd generation SSD line, the 320
> series. And they've finally produced a cheap consumer product that
> may be useful for databases, too! They've put 6 small capacitors onto
> the board and added logic to flush the write cache if the power
> drops. The cache on these was never very big, so they were able to
> avoid needing one of the big super-capacitors instead. Having 6
> little ones is probably a net reliability win over the single point of
> failure, too.
>
> Performance is only a little better than earlier generation designs,
> which means they're still behind the OCZ Vertex controllers that have
> been recommended on this list. I haven't really been hearing good
> things about long-term reliability of OCZ's designs anyway, so glad to
> have an alternative. *Important*: don't buy SSD for important data
> without also having a good redundancy/backup plan. As relatively new
> technology they do still have a pretty high failure rate. Make sure
> you budget for two drives and make multiple copies of your data.
>
> Anyway, the new Intel drives are fast enough for most things, though, and
> are going to be very inexpensive. See
> http://www.storagereview.com/intel_ssd_320_review_300gb for some
> simulated database tests. There's more about the internals at
> http://www.anandtech.com/show/4244/intel-ssd-320-review and the white
> paper about the capacitors is at
> http://newsroom.intel.com/servlet/JiveServlet/download/38-4324/Intel_SSD_320_Series_Enhance_Power_Loss_Technology_Brief.pdf
>
> Some may still find these too cheap for enterprise use, given the use
> of MLC limits how much activity these drives can handle. But it's
> great to have a new option for lower budget systems that can tolerate
> some risk there.
>
While I appreciate the heads-up about these new drives, your posting
suggests (though it is worded so that you do not actually say it) that
OCZ products lack long-term reliability, without giving any factual data.
If you know of SandForce-based OCZ drives failing, that would be
interesting, because that is the product line the new Intel SSD ought to
be compared with. From my POV, I've verified that the SandForce-based OCZ
drives operate as they should (w.r.t. barriers/write-through), and I've
reported what was tested and how (your help with that was much
appreciated) -
http://archives.postgresql.org/pgsql-performance/2010-07/msg00449.php.

The three drives we're using in a development environment right now
report (with recent SSD firmware and smartmontools) their health status,
including the supercap status, reserved blocks and a lot more, which can
be used to monitor when a drive is about to die. Since none of the drives
has failed yet or is anywhere near its predicted end of life, it is
currently unknown whether this health status is reliable. It may be, but
it may just as well not be. Therefore I'm very interested in hearing hard
facts about failures and the SMART readings right before them.
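
A minimal sketch of how such readings could be pulled out for monitoring,
assuming the attribute table comes straight from `smartctl -A` with one
attribute per line (not the mail-wrapped form further down). The function
name, the WATCHED selection and the sample rows are illustrative choices,
not part of smartmontools. Rows are keyed by ID because some attribute
names occur twice in the listings.

```python
import re

# One row of the `smartctl -A` attribute table, for example:
# "235 SuperCap_Health 0x0033 100 100 001 Pre-fail Always - 0"
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+|-+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+"
    r"(?P<raw>.+?)\s*$"
)


def parse_attributes(text):
    """Return {attribute id: dict of column strings} for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # header and non-table lines simply do not match
            row = m.groupdict()
            attrs[int(row.pop("id"))] = row
    return attrs


# Three rows copied from the first drive's listing, unwrapped.
SAMPLE = """\
170 Reserve_Block_Count 0x0032 000 000 000 Old_age Always - 17024
231 SSD_Life_Left 0x0013 100 100 010 Pre-fail Always - 0
235 SuperCap_Health 0x0033 100 100 001 Pre-fail Always - 0
"""

# Illustrative selection of attributes to watch, not a vendor recommendation.
WATCHED = (170, 231, 235)

if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    for attr_id in WATCHED:
        a = attrs[attr_id]
        print(attr_id, a["name"], "value=" + a["value"], "raw=" + a["raw"])
```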

Below are SMART readings from two Vertex 2 Pros; the first is the same
drive I did the earlier testing with. You can see that its lifetime
reads/writes as well as its unexpected power loss count are larger than
those of the other, newer one. The FAILING_NOW on available reserved
space is an artefact of the smartmontools db having the wrong threshold:
the value should be read as GB of reserved space, and I suspect that for
a new drive it might be on the order of 18 or 20.
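
To illustrate why attribute 232 gets flagged, here is a small sketch of
the usual SMART convention as I understand smartctl applies it: the
normalized VALUE (not the raw value) is compared with THRESH, and a
threshold of 0 never fails. The function name is made up for this example
and the rule is stated as an assumption, not quoted from the smartmontools
source.

```python
def attribute_state(value, worst, thresh):
    """Classify an attribute from its normalized VALUE, WORST and THRESH.

    Assumed convention: THRESH == 0 means the attribute can never fail;
    otherwise VALUE <= THRESH is a current failure and WORST <= THRESH
    alone is a past failure.
    """
    if thresh == 0:
        return "-"
    if value <= thresh:
        return "FAILING_NOW"
    if worst <= thresh:
        return "In_the_past"
    return "-"


if __name__ == "__main__":
    # 232 Available_Reservd_Space: VALUE 000, WORST 000, THRESH 010
    print(attribute_state(0, 0, 10))
    # 231 SSD_Life_Left: VALUE 100, WORST 100, THRESH 010
    print(attribute_state(100, 100, 10))
    # 241 Lifetime_Writes_GiB: VALUE 000, WORST 000, THRESH 000
    print(attribute_state(0, 0, 0))
```

With the listed numbers, 232 has a normalized value of 000 against a
threshold of 010, so it is reported as failing regardless of the raw value
of 16 or 17.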

It's hard to compare with spindles: I've seen them fail in all sorts of
ways, but I have not seen an SSD fail yet. I'm inclined to start a
perpetual pgbench on one SSD while monitoring its SMART stats, to see
whether what they report is really a good indicator of remaining
lifetime. If it is, I'd start to believe that this technology is better
in failure predictability than spindles, whose failures pretty much seem
random when you have large arrays.
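
A rough sketch of the logging side of such an experiment, under the
assumption that something else (pgbench in a loop) generates the load.
`read_raw_values` is a hypothetical caller-supplied function that returns
the current raw values, for instance by running `smartctl -A` on the
device and parsing the table; the demo below uses made-up numbers
instead of a real drive.

```python
import csv
import io
import time


def log_smart_samples(read_raw_values, out, samples, interval_s=0.0,
                      clock=time.time):
    """Write one CSV row per sample: a timestamp plus the raw values.

    read_raw_values: hypothetical callable returning {attribute name: raw}.
    out: any writable text file object.
    """
    writer = None
    for i in range(samples):
        values = read_raw_values()
        if writer is None:
            # Column order is fixed by the first sample.
            fields = ["timestamp"] + sorted(values)
            writer = csv.DictWriter(out, fieldnames=fields)
            writer.writeheader()
        writer.writerow({"timestamp": int(clock()), **values})
        if i + 1 < samples:
            time.sleep(interval_s)


if __name__ == "__main__":
    # Fake reader standing in for a real smartctl call (values are made up).
    fake = iter([
        {"Lifetime_Writes_GiB": 448, "Gigabytes_Erased": 128},
        {"Lifetime_Writes_GiB": 449, "Gigabytes_Erased": 128},
    ])
    buf = io.StringIO()
    log_smart_samples(lambda: next(fake), buf, samples=2)
    print(buf.getvalue(), end="")
```

Keeping the full time series, rather than only the latest reading, is what
would allow the readings just before a failure to be inspected afterwards.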

Model I tested with earlier:

=== START OF INFORMATION SECTION ===
Model Family: SandForce Driven SSDs
Device Model: OCZ VERTEX2-PRO
Serial Number: OCZ-BVW101PBN8Q8H8M5
LU WWN Device Id: 5 e83a97 f88e46007
Firmware Version: 1.32
User Capacity: 50,020,540,416 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Tue Mar 29 11:25:04 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (   0) seconds.
Offline data collection
capabilities:                    (0x7f) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   5) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   120   120   050    Pre-fail  Always   -           0/0
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always   -           0
  9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always   -           965h+05m+20.870s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           234
 13 Soft_Read_Error_Rate    0x000a   120   120   000    Old_age   Always   -           752/0
100 Gigabytes_Erased        0x0032   000   000   000    Old_age   Always   -           1152
170 Reserve_Block_Count     0x0032   000   000   000    Old_age   Always   -           17024
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always   -           0
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always   -           0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline  -           50
177 Wear_Range_Delta        0x0000   000   000   ---    Old_age   Offline  -           0
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always   -           0
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always   -           0
184 IO_Error_Detect_Code_Ct 0x0032   100   100   090    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
194 Temperature_Celsius     0x0022   032   031   000    Old_age   Always   -           32 (0 0 0 31)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline  -           0/0
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always   -           0
198 Uncorrectable_Sector_Ct 0x0010   120   120   000    Old_age   Offline  -           0x000000000000
199 SATA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline  -           0/0
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline  -           0/0
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always   -           100
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always   -           0
232 Available_Reservd_Space 0x0000   000   000   010    Old_age   Offline  FAILING_NOW 16
233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline  -           1088
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always   -           6592
235 SuperCap_Health         0x0033   100   100   001    Pre-fail  Always   -           0
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always   -           6592
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always   -           3200

SMART Error Log not supported
SMART Self-test Log not supported
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Relatively new model:

=== START OF INFORMATION SECTION ===
Model Family: SandForce Driven SSDs
Device Model: OCZ-VERTEX2 PRO
Serial Number: OCZ-7AVL07UM37FP45U1
LU WWN Device Id: 5 e83a97 f83e6388d
Firmware Version: 1.32
User Capacity: 50,020,540,416 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Tue Mar 29 11:34:28 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (   0) seconds.
Offline data collection
capabilities:                    (0x7f) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   5) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   120   120   050    Pre-fail  Always   -           0/0
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always   -           0
  9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always   -           452h+19m+31.020s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           64
 13 Soft_Read_Error_Rate    0x000a   120   120   000    Old_age   Always   -           3067/0
100 Gigabytes_Erased        0x0032   000   000   000    Old_age   Always   -           128
170 Reserve_Block_Count     0x0032   000   000   000    Old_age   Always   -           17440
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always   -           0
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always   -           0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline  -           16
177 Wear_Range_Delta        0x0000   000   000   ---    Old_age   Offline  -           0
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always   -           0
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always   -           0
184 IO_Error_Detect_Code_Ct 0x0032   100   100   090    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
194 Temperature_Celsius     0x0022   032   032   000    Old_age   Always   -           32 (Min/Max 0/32)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline  -           0/0
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always   -           0
198 Uncorrectable_Sector_Ct 0x0010   120   120   000    Old_age   Offline  -           0x000000000000
199 SATA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline  -           0/0
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline  -           0/0
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always   -           100
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always   -           0
232 Available_Reservd_Space 0x0000   000   000   010    Old_age   Offline  FAILING_NOW 17
233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline  -           128
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always   -           448
235 SuperCap_Health         0x0033   100   100   010    Pre-fail  Always   -           0
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always   -           448
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always   -           192

SMART Error Log not supported
SMART Self-test Log not supported
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data
