BLU Discuss list archive



Subject: forcing a raid recovery
From: bogstad-e+AXbWqSrlAAvxtiuMwx3w at public.gmane.org (Bill Bogstad)
Date: Tue, 3 Nov 2009 15:30:24 -0500
In-reply-to: <4AF0809D.6070401-wRvlPVLobi1/31tCrMuHxg@public.gmane.org>
References: <4AF0809D.6070401@stephenadler.com>

On Tue, Nov 3, 2009 at 2:12 PM, Stephen Adler <adler-wRvlPVLobi1/31tCrMuHxg at public.gmane.org> wrote:
> Hi all,
>
> I'm putting together a backup system at my job and in doing so set up the
> good ol' raid 5 array. While I was putting the disk array together, I
> read that one could encounter a problem in which you replace a failed
> drive, the rebuilding process trips over another bad sector in one
> of the drives which was good before starting the rebuilding process, and
> thus you end up with a screwed up raid array. So I was thinking of a way
> to avoid this problem. One solution is to kick off a job once a week or
> month in which you force the whole raid array to be read. I was thinking
> of possibly forcing a check sum of all the files I had stored on the
> disk.

Reading all the files (whether you checksum them or not) won't read
all of the allocated blocks on the disk:

1. With RAID 5, the parity blocks are only read if a drive error
occurs when reading the data blocks.  The result
is that the parity blocks won't ever get read during your testing
(unless a failure occurs).

2.  If the filesystem you are using supports snapshots, you will only
be reading the data blocks for the current version of the file.
(You could read all the snapshots as well, but that is going to result
in the same physical block on the disk being 'read' multiple times
(once for each snapshot in which it is included).)
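
To illustrate point 1, here is a minimal Python sketch of RAID 5 style
XOR parity.  It is a toy model, not how any particular RAID
implementation lays out its stripes; the block contents and the helper
name xor_blocks are made up for the example.  A normal read touches
only the data blocks, while a rebuild needs every surviving block,
parity included:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe of a toy 4-drive array: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# A normal file read returns the data blocks as-is; parity is never touched.
file_contents = b"".join(data)

# Rebuilding the drive that held data[1] needs ALL the other blocks,
# including parity, so a latent bad sector in any of them breaks the rebuild.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Here rebuilt equals data[1], but only because every other block in the
stripe was readable.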

If you have direct read access to the drives (partitions), you might
try just reading from them directly.  Any drive on which
you get read errors can then be taken offline and a rebuild can be
forced.  I think this is slightly better than what you suggest below
because you are at least taking a drive with a known problem (bad
blocks) offline rather than ignoring all of the good data on the
drive you are randomly picking to force an error.
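
A rough Python sketch of that direct read, in case it helps.  The
function name scan_device and the 1 MiB chunk size are my own choices,
and it assumes a Unix-like system where you have read permission on the
device node (e.g. something like /dev/sdb1; substitute your own).  Note
that reads may be served from the page cache, which this sketch does
nothing to bypass:

```python
import os

def scan_device(path, chunk_size=1024 * 1024):
    """Read `path` front to back; return (bytes_read, bad_offsets).

    bad_offsets lists the starting offset of every chunk whose read
    raised an OSError (for example EIO from an unreadable sector).
    """
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # works for files and block devices
        offset = 0
        while offset < size:
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError:
                # Record the failing chunk and keep going past it.
                bad_offsets.append(offset)
                offset += chunk_size
                continue
            if not data:
                break  # unexpected end of device
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

Run it once per member drive; any drive that comes back with a
non-empty bad_offsets list is the candidate to take offline.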

What I think you really want is RAID scrubbing.  Here is a link to
some GENTOO Linux RAID docs on the subject:

http://en.gentoo-wiki.com/wiki/Software_RAID_Install#Data_Scrubbing

If you are using hardware RAID, you should investigate similar
commands for your hardware controller.
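
For Linux software RAID (md), scrubbing is driven through sysfs: you
write "check" to the array's md/sync_action file and later read
md/mismatch_cnt.  Below is a small Python sketch of that, assuming the
array is called md0 (a placeholder, use your own) and that the script
runs as root.  The function names and the sysfs_root parameter are
mine, added only so the code can be tried against a scratch directory:

```python
from pathlib import Path

def md_dir(array="md0", sysfs_root="/sys/block"):
    """Directory holding the md control files for one array."""
    return Path(sysfs_root) / array / "md"

def start_check(array="md0", sysfs_root="/sys/block"):
    """Ask the md driver to read and verify every stripe of the array."""
    (md_dir(array, sysfs_root) / "sync_action").write_text("check\n")

def scrub_status(array="md0", sysfs_root="/sys/block"):
    """Return (current sync action, mismatch count) for the array."""
    d = md_dir(array, sysfs_root)
    action = (d / "sync_action").read_text().strip()
    mismatches = int((d / "mismatch_cnt").read_text().strip())
    return action, mismatches
```

Calling start_check() from a weekly cron job and looking at
scrub_status() once the action is back to "idle" gives you the
periodic full read you were after, parity blocks included, without
ever degrading the array.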

> The other idea I had was to force one of the drives into a failed
> state and then add it back in and thus force the raid to rebuild. The
> rebuilding process takes about 3 hours on my system which I could
> easily execute at 2am every Sunday morning.

And what if one of the drives you didn't take offline has a failure
during that window?

Bill Bogstad


