Re: Data corruption issues possibly involving cgd(4)


Author: Nino Dehne
Date: Jan 16, 2007 11:49

On Tue, Jan 16, 2007 at 09:28:21AM +0000, David Laight wrote:
> On Tue, Jan 16, 2007 at 08:00:14AM +0100, Nino Dehne wrote:
>>
>> After 50 runs of dd if=/dev/rcgd0d bs=65536 count=4096 | md5 and no error
>> I aborted the test. Replacing rcgd0d with cgd0a made no difference.
>> While not necessary IMO, I tried the same with rraid1d, no errors either
>> after 50 runs. For comparison, a loop on the filesystem on the cgd aborted
>> after the 14th run now.
>>
>> So the issue doesn't seem to be related to the power supply either and
>> frankly, it's starting to freak me out.
>
> The 'dd' will be doing sequential reads, whereas the fs version will be doing
> considerable numbers of seeks. It is the seeks that cause the disks to
> draw current bursts from the psu - so don't discount that...
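For anyone wanting to repeat the comparison quoted above, here is a minimal sketch in Python (not something used in this thread). It assumes the raw device can be opened as a plain file with unbuffered, block-aligned reads. The names hash_prefix, hash_shuffled and repeat_until_mismatch are made up for illustration. hash_prefix mirrors the quoted dd | md5 run, while hash_shuffled reads the same blocks in a fixed pseudo-random order so that the disk has to seek, yet the digest stays comparable between runs.

```python
import hashlib
import random


def hash_prefix(path, block_size=65536, count=4096):
    """MD5 of the first block_size * count bytes, read sequentially."""
    digest = hashlib.md5()
    # buffering=0 keeps Python from adding its own read-ahead buffer.
    with open(path, "rb", buffering=0) as dev:
        for _ in range(count):
            block = dev.read(block_size)
            if not block:
                break
            digest.update(block)
    return digest.hexdigest()


def hash_shuffled(path, block_size=65536, count=4096, seed=0):
    """MD5 over the same blocks, visited in a seeded random order.

    The same seed gives the same order, so two runs over unchanged
    data must produce the same digest despite the seeking.
    """
    order = list(range(count))
    random.Random(seed).shuffle(order)
    digest = hashlib.md5()
    with open(path, "rb", buffering=0) as dev:
        for index in order:
            dev.seek(index * block_size)
            digest.update(dev.read(block_size))
    return digest.hexdigest()


def repeat_until_mismatch(path, runs=50, **kwargs):
    """Return (run, reference, current) at the first differing digest, else None."""
    reference = hash_prefix(path, **kwargs)
    for run in range(2, runs + 1):
        current = hash_prefix(path, **kwargs)
        if current != reference:
            return run, reference, current
    return None
```

A call such as repeat_until_mismatch("/dev/rcgd0d") would correspond to the 50-run loop; the device path is the one quoted above, and whether Python may open it that way on a given system is an assumption.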
Re: Data corruption issues possibly involving cgd(4)         


Author: Daniel Carosone
Date: Jan 16, 2007 13:32

On Tue, Jan 16, 2007 at 08:49:14PM +0100, Nino Dehne wrote:
>> considerable numbers of seeks. It is the seeks that cause the disks to
>> draw current bursts from the psu - so don't discount that.
>
> Good point. To account for that, I repeatedly cat'ed the test file on the
> cgd partition to /dev/null. At the same time, I hashed the first 64M of rcgd0d
> in a loop. I used 64M instead of 256M because the disk thrashing was really
> bad. I also set the CPU frequency to its maximum to maximize the power the
> system draws.

A CPU-hog process would help here too.

> I attribute the checksum change to changes on the filesystem, since that was
> obviously mounted while doing the test.

Probably, yeah; I gave some suggestions for ways to avoid this a
moment ago, too.

> Getting over 70 equal checksums and then 3 equal other checksums in
> a row with flaky hardware seems highly improbable to me.

Or the 64M fits in cache most of the time, and the bad read was
cached and thus repeated?
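The combined test described in the quoted text (reading a file in a loop while hashing the start of the device) could be sketched roughly as follows. This is only an illustration in Python under stated assumptions: read_to_nowhere and digests_under_load are hypothetical names, the 64M default is taken from the quoted description, and nothing here controls whether reads are served from a cache.

```python
import hashlib
import threading


def read_to_nowhere(path, stop, chunk=1 << 20):
    """Rough stand-in for looping `cat path > /dev/null` until stop is set."""
    while not stop.is_set():
        with open(path, "rb", buffering=0) as handle:
            while handle.read(chunk) and not stop.is_set():
                pass


def digests_under_load(device, load_file, runs, length=64 * 1024 * 1024):
    """Hash the first `length` bytes of device `runs` times while a
    background thread keeps re-reading load_file. Returns the digests."""
    stop = threading.Event()
    worker = threading.Thread(
        target=read_to_nowhere, args=(load_file, stop), daemon=True
    )
    worker.start()
    try:
        results = []
        for _ in range(runs):
            md5 = hashlib.md5()
            remaining = length
            with open(device, "rb", buffering=0) as dev:
                while remaining > 0:
                    block = dev.read(min(65536, remaining))
                    if not block:
                        break
                    md5.update(block)
                    remaining -= len(block)
            results.append(md5.hexdigest())
        return results
    finally:
        # Always stop the background reader, even if hashing fails.
        stop.set()
        worker.join()
```

If len(set(results)) is greater than 1 on a device nobody is writing to, at least one read returned different data.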
Re: Data corruption issues possibly involving cgd(4)         


Author: Nino Dehne
Date: Jan 16, 2007 13:44

On Wed, Jan 17, 2007 at 08:32:50AM +1100, Daniel Carosone wrote:
> On Tue, Jan 16, 2007 at 08:49:14PM +0100, Nino Dehne wrote:
>>> considerable numbers of seeks. It is the seeks that cause the disks to
>>> draw current bursts from the psu - so don't discount that.
>>
>> Good point. To account for that, I repeatedly cat'ed the test file on the
>> cgd partition to /dev/null. At the same time, I hashed the first 64M of rcgd0d
>> in a loop. I used 64M instead of 256M because the disk thrashing was really
>> bad. I also set the CPU frequency to its maximum to maximize the power the
>> system draws.
>
> a cpu-hog process would help here too..

While doing the above, the CPU is about 0%-8% idle. I'm still running a
UP kernel.

>> I attribute the checksum change to changes on the filesystem, since that was
>> obviously mounted while doing the test.
>
> Probably, yeah; I gave some suggestions for ways to avoid this a
> moment ago, too.
Re: Data corruption issues possibly involving cgd(4)         


Author: Thilo Jeremias
Date: Jan 17, 2007 05:30

Is the changed checksum always deterministically the same?
That is, is this a systematic error, or is it a different checksum
each time (are there more than two distinct checksums)? In the latter
case I would suspect drive, cable, or power problems.

If it is deterministic, it probably always happens at a certain block, so
it might help to isolate the location of the fault in order
to find the cause.
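Both questions can be checked mechanically. The following is a minimal Python sketch, not a tool from this thread; classify, block_digests and differing_blocks are hypothetical names, and the block size is an arbitrary choice. classify tallies the checksums collected over many runs, and the other two helpers narrow a mismatch down to block numbers by hashing each block separately on two passes.

```python
from collections import Counter
import hashlib


def classify(digests):
    """Summarise the checksums seen over many runs."""
    distinct = len(Counter(digests))
    if distinct == 1:
        return "stable"      # every run agreed
    if distinct == 2:
        return "systematic"  # one alternative value keeps recurring
    return "random"          # more than two values were seen


def block_digests(path, block_size=65536, count=1024):
    """Per-block MD5 digests of the first `count` blocks of path."""
    digests = []
    with open(path, "rb", buffering=0) as dev:
        for _ in range(count):
            block = dev.read(block_size)
            if not block:
                break
            digests.append(hashlib.md5(block).hexdigest())
    return digests


def differing_blocks(first_pass, second_pass):
    """Indices of the blocks whose digests differ between two passes."""
    return [
        index
        for index, (a, b) in enumerate(zip(first_pass, second_pass))
        if a != b
    ]
```

Multiplying an index from differing_blocks by the block size gives the byte offset at which to look more closely.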

--
my 5 cts'

good luck

thilo

Nino Dehne wrote:
> On Wed, Jan 17, 2007 at 08:32:50AM +1100, Daniel Carosone wrote:
>
>> On Tue, Jan 16, 2007 at 08:49:14PM +0100, Nino Dehne wrote:
>>
>>>> considerable numbers of seeks. It...
Re: Data corruption issues possibly involving cgd(4)         


Author: Steven M. Bellovin
Date: Jan 17, 2007 07:02

On Wed, 17 Jan 2007 23:30:55 +1000
Thilo Jeremias optushome.com.au> wrote:
> is the changed checksum always deterministically the same?
> Meaning is this a systematic error, or
> (Where I would guess for drive/cable/power etc problems) is it always
> a different checksum (I mean are there more than two checksums)
>
> If it is deterministic, it probably just happens at a certain block,
> so it might help then to isolate the location where the fault is to
> find the cause
>
Is there any chance the two different mirrors -- you did say RAID,
right, though I confess I don't remember which variant -- have
different versions of the block? That shouldn't happen, of course, but
if it did it would explain the problem.
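One way to test that idea would be to read both mirror components directly and compare them block by block. The sketch below is Python and purely illustrative: first_mismatches is a made-up name, the component paths would have to be supplied by the reader, and the skip parameter exists only on the assumption that some leading area (for example a component label) may legitimately differ between the two sides. On a mounted, active filesystem, writes in flight could also show up as transient differences.

```python
def first_mismatches(path_a, path_b, block_size=65536, skip=0, limit=10):
    """Return byte offsets of the first `limit` blocks that differ.

    skip is the number of leading bytes to ignore on both sides; what
    value (if any) is appropriate for a given RAID setup is an assumption
    the caller has to make.
    """
    mismatches = []
    with open(path_a, "rb", buffering=0) as a, \
            open(path_b, "rb", buffering=0) as b:
        a.seek(skip)
        b.seek(skip)
        index = 0
        while len(mismatches) < limit:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if not block_a and not block_b:
                break  # both sides exhausted
            if block_a != block_b:
                mismatches.append(skip + index * block_size)
            index += 1
    return mismatches
```

An empty result means the compared ranges were identical at the time of reading.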

--Steve Bellovin, http://www.cs.columbia.edu/~smb

Re: Data corruption issues possibly involving cgd(4)         


Author: Nino Dehne
Date: Jan 18, 2007 00:21

On Wed, Jan 17, 2007 at 11:58:56PM +0100, Nino Dehne wrote:
> On Thu, Jan 18, 2007 at 07:31:47AM +1100, Daniel Carosone wrote:
>> Nino, are you running a kernel with DIAGNOSTIC and/or DEBUG? Looking
>> at the cgd panic you found, I'm guessing not, because the path we see
>> to that problem would have involved one or more likely DIAGNOSTIC
>> messages.
>
> Not yet, but that just went on my list of things to try.

I'm now running the system with those options. I haven't tried to provoke
the cgd panic yet, though. Parity recalculation is a lengthy process.

> 1) Boot DIAGNOSTIC+DEBUG kernel
> 2) Run fsck -f[1]

I ran fsck -fn 10 times in a row, with 4 gzips running concurrently.
Nothing. Output looked like this each time: