BLU Discuss list archive

[Discuss] How smart is SMART?

Subject: [Discuss] How smart is SMART?
From: tmetro+blu at gmail.com (Tom Metro)
Date: Sun, 25 Nov 2012 14:53:25 -0500
In-reply-to: <CANiupv6LLi1=1hHXXzWxcvCvMRoqDzpqybLyqV2_w=cZwnChXg@mail.gmail.com>
References: <CANiupv6LLi1=1hHXXzWxcvCvMRoqDzpqybLyqV2_w=cZwnChXg@mail.gmail.com>

Doug wrote:
> Thinking about the future, does anyone regularly monitor their hard
> disk SMART (Self-Monitoring, Analysis, and Reporting Technology)
> information?

Yes, I use smartmontools on all systems. (I don't own anything that runs
OS X, so I can't comment on its use on that platform.)

smartmontools includes the smartd daemon, which runs continuously and
monitors your drive's SMART data.

I have it set to alert me via email for critical problems, and I also
have GUI desktop notifiers installed that work with it ('smart-notifier'
on Linux).

Additionally, logwatch (log file monitoring tool) also reports on
anything interesting logged by smartd.

> Are there scripts anyone uses that go through the same data and/or run
> the short and long tests?

The smartd daemon, when configured to do so, can run tests on a regular
schedule.

I use the following config on my laptop, for example:

/dev/sda -a -I 194 -W 4,45,55 -R 5 -s (L/../../6/03|S/../.././05) \
  -m root -M exec /usr/share/smartmontools/smartd-runner

That breaks down as:

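/dev/sda - the drive to monitor
-a - turns on a bunch of common options, like reporting of errors and
self-test results
-I 194 - ignore attribute 194 (temperature) when tracking attribute
changes
-W 4,45,55 - set temperature limits: report changes of 4C or more, log
at 45C, and warn at 55C
-R 5 - monitor the raw value of attribute 5 (reallocated sector count)
-s (L/../../6/03|S/../.././05) - schedule a weekly long test (Saturdays
at 3 AM) and a daily short test (5 AM)
-m root - send emails to root
-M exec /usr/... - run this script, in place of the default mail
command, when an alert is sent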

(smartd-runner is a script bundled with the package that just iterates
through all the scripts in /etc/smartmontools/run.d/.)
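
To make the -s schedule above a bit more concrete, here is a minimal
Python sketch of how such an expression could be evaluated. It assumes
the expression is matched as a regular expression against a string of the
form T/MM/DD/d/HH (test type, month, day of month, day of week with
1 = Monday through 7 = Sunday, and hour), and that the whole string must
match. The function name due_tests is made up for illustration; this is
not smartd's actual code.

```python
import re
from datetime import datetime

# Schedule expression copied from the config line above.
SCHEDULE = r"(L/../../6/03|S/../.././05)"


def due_tests(schedule, when):
    """Return the test types (L, S, C, O) whose pattern matches the hour of `when`."""
    due = []
    for test_type in "LSCO":
        # Build the T/MM/DD/d/HH string; isoweekday() gives 1 = Monday ... 7 = Sunday.
        candidate = f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"
        if re.fullmatch(schedule, candidate):
            due.append(test_type)
    return due


if __name__ == "__main__":
    # Saturday at 3 AM: long test. Any day at 5 AM: short test.
    print(due_tests(SCHEDULE, datetime(2012, 11, 24, 3)))
    print(due_tests(SCHEDULE, datetime(2012, 11, 25, 5)))
```

Trying a few dates and hours this way is a quick check that a schedule
expression fires when you expect it to before putting it in smartd.conf.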

I've forgotten the specifics of why I have the -I and -R switches set. I
probably have notes on other systems where I've used those.

You do sometimes need to tune the parameters for specific drives.
Ignoring an attribute here, or explicitly monitoring an attribute there.

smartd does support a DEVICESCAN option where it will find all the
drives on your system and monitor them using defaults. That has some
low-maintenance appeal (if you add/remove drives, it'll automatically
adjust), but I tend to have better luck when I explicitly list each
drive, and that also lets me stagger the self-test times.
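
As a rough illustration of staggering, the following Python sketch
generates one smartd.conf line per drive, giving each drive its weekly
long test on a different day and its daily short test at a different
hour. The function name, the default hours, and the choice of -a and
-m root for every drive are assumptions made for the example, not a
recommended configuration.

```python
def staggered_smartd_lines(devices, long_hour=3, short_hour=5):
    """Build one smartd.conf line per device with staggered self-test times."""
    lines = []
    for index, device in enumerate(devices):
        # Day of week for the long test: 1 = Monday ... 7 = Sunday, one drive per day.
        long_day = index % 7 + 1
        # Shift the daily short test by one hour per drive, wrapping at midnight.
        hour = (short_hour + index) % 24
        schedule = f"(L/../../{long_day}/{long_hour:02d}|S/../.././{hour:02d})"
        lines.append(f"{device} -a -s {schedule} -m root")
    return lines


if __name__ == "__main__":
    for line in staggered_smartd_lines(["/dev/sda", "/dev/sdb", "/dev/sdc"]):
        print(line)
```

The output can be pasted into smartd.conf and then tuned per drive.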

> The long tests can last 3 hours.

It depends on the drive. The drive will suspend the self-test if there
is too much I/O activity. (At least that's what my current drives do.)

> ...the replacement...disk is running...48C versus 41C.   What temp is
> BAD?

I googled this myself recently, because a newly replaced drive in a
laptop has been frequently exceeding the 45C max, and occasionally
exceeding the 55C critical limit.

The postings I turned up had no definitive answers. Some mentioned that
the drives are typically specified to have a 60C max temperature by the
manufacturers. Others referenced a Google whitepaper that said
temperatures were optimally kept between 30 and 40C, while below 20C
actually increased failure rates. (This is all secondhand info. Check
primary sources before relying on these numbers.)

So clearly greater than 60C is bad. But will greater than 50C reduce the
lifespan of your drive? Perhaps.

With the space constraints in a laptop, I'm not sure what you can do
about it, as long as the air vents are free of dust, the area around the
machine is clear, and the fans are functioning.

I suppose you could hack the fan controls to boost the fan speed.

> Now that the test is done, the hot disk is down to 38C.

4 of the 5 times the drive I mentioned above hit the critical limit, it
was at 3:32 AM, which suggests some scheduled job is triggering it. The
long self-test does run at 3 AM, but not on the days when the over temp
happened, so it must be something else.

What I'd really like to see is a GUI tool that would read in the SMART
logs and show temperature graphs over time and average temperature. I'm
not concerned at all if the drive is only hitting 55C for a matter of
minutes.
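
Short of a GUI, the numbers I care about could be pulled out of a list
of temperature readings with something like the Python sketch below. It
assumes you already have (timestamp, temperature in C) samples from
somewhere; how you extract them from your logs is left out, and the
function name and sample values are made up for illustration. Each
reading is assumed to hold until the next one.

```python
from datetime import datetime, timedelta


def summarize(samples, limit_c=55):
    """Return average, peak, and minutes spent at or above limit_c."""
    samples = sorted(samples)  # order by timestamp
    temps = [temp for _, temp in samples]
    over = timedelta()
    # Treat each reading as valid until the next reading arrives.
    for (start, temp), (end, _) in zip(samples, samples[1:]):
        if temp >= limit_c:
            over += end - start
    return {
        "average_c": sum(temps) / len(temps),
        "peak_c": max(temps),
        "minutes_over_limit": over.total_seconds() / 60,
    }


if __name__ == "__main__":
    # Hypothetical readings, not real data.
    readings = [
        (datetime(2012, 11, 20, 3, 0), 41),
        (datetime(2012, 11, 20, 3, 30), 56),
        (datetime(2012, 11, 20, 3, 40), 50),
        (datetime(2012, 11, 20, 4, 0), 38),
    ]
    print(summarize(readings))
```

The minutes-over-limit figure is the one that separates a brief spike
from a drive that sits above the critical limit for hours.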

Jason Normand wrote:
> You should be able to set up smartmontools to run as a cron job...

Typically you'd use smartd, which has a built-in scheduler.

Scott Ehrlich wrote:
> Long story short, smart is simply not reliable.

Sure, in the sense that SMART is not guaranteed to indicate a failure
before it happens, but that's hardly a reason not to use it.

The Google whitepaper on drive reliability has some interesting stuff to
say about SMART monitoring. (The paper has been mentioned on Discuss
before. Check the archives.)

 -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/
