Author: Nino Dehne Date: Jan 16, 2007 11:49
On Tue, Jan 16, 2007 at 09:28:21AM +0000, David Laight wrote:
> On Tue, Jan 16, 2007 at 08:00:14AM +0100, Nino Dehne wrote:
>>
>> After 50 runs of dd if=/dev/rcgd0d bs=65536 count=4096 | md5 and no error
>> I aborted the test. Replacing rcgd0d with cgd0a made no difference.
>> While not necessary IMO, I tried the same with rraid1d, no errors either
>> after 50 runs. For comparison, a loop on the filesystem on the cgd aborted
>> after the 14th run now.
>>
>> So the issue doesn't seem to be related to the power supply either and
>> frankly, it's starting to freak me out.
>
> The 'dd' will be doing sequential reads, whereas the fs version will be doing
> considerable numbers of seeks. It is the seeks that cause the disks to
> draw current bursts from the psu - so don't discount that...
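
The repeated dd | md5 test described above can be sketched as a small script. This is only a rough illustration, not what was actually run in the thread: the function names `hash_prefix` and `tally_runs` are made up for this example, and the defaults simply mirror the bs=65536 count=4096 figures quoted in the post. It hashes the same region several times and counts how often each checksum shows up.

```python
import hashlib
from collections import Counter


def hash_prefix(path, block_size=65536, count=4096):
    """Return the MD5 of the first block_size * count bytes of path.

    Reads block by block, roughly like `dd bs=... count=... | md5`.
    (hash_prefix is a hypothetical helper name, not from the thread.)
    """
    digest = hashlib.md5()
    # buffering=0 avoids Python's own read buffering; the OS may still cache.
    with open(path, "rb", buffering=0) as dev:
        for _ in range(count):
            chunk = dev.read(block_size)
            if not chunk:
                break  # end of device or file reached early
            digest.update(chunk)
    return digest.hexdigest()


def tally_runs(path, runs=50, block_size=65536, count=4096):
    """Hash the same region `runs` times and count each distinct checksum."""
    return Counter(hash_prefix(path, block_size, count) for _ in range(runs))


if __name__ == "__main__":
    import sys

    # Example: python tally.py /dev/rcgd0d   (the device path is up to the caller)
    for checksum, hits in tally_runs(sys.argv[1]).most_common():
        print(hits, checksum)
```

A single checksum across all runs suggests the reads are consistent; more than one means some run returned different data. The script cannot tell a bad read from a legitimate change to the underlying data, for example when a filesystem on the device is mounted read-write during the test.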
Author: Daniel Carosone Date: Jan 16, 2007 13:32
On Tue, Jan 16, 2007 at 08:49:14PM +0100, Nino Dehne wrote:
>> considerable numbers of seeks. It is the seeks that cause the disks to
>> draw current bursts from the psu - so don't discount that.
>
> Good point. To accommodate to that, I repeatedly cat'ed the test file on the
> cgd partition to /dev/null. At the same time, I hashed the first 64M of rcgd0d
> in a loop. I used 64M instead of 256M because the disk thrashing was really
> bad. I also set the CPU frequency to its maximum to maximize the power the
> system draws.
a cpu-hog process would help here too..
> I attribute the checksum change to changes on the filesystem, since that was
> obviously mounted while doing the test.
Probably, yeah; I gave some suggestions for ways to avoid this a
moment ago, too.
> Getting over 70 equal checksums and then 3 equal other checksums in
> a row with flaky hardware seems highly improbable to me.
Or the 64m is fitting in cache most of the time, and the bad read was
cached and thus repeated?
Author: Nino Dehne Date: Jan 16, 2007 13:44
On Wed, Jan 17, 2007 at 08:32:50AM +1100, Daniel Carosone wrote:
> On Tue, Jan 16, 2007 at 08:49:14PM +0100, Nino Dehne wrote:
>>> considerable numbers of seeks. It is the seeks that cause the disks to
>>> draw current bursts from the psu - so don't discount that.
>>
>> Good point. To accommodate to that, I repeatedly cat'ed the test file on the
>> cgd partition to /dev/null. At the same time, I hashed the first 64M of rcgd0d
>> in a loop. I used 64M instead of 256M because the disk thrashing was really
>> bad. I also set the CPU frequency to its maximum to maximize the power the
>> system draws.
>
> a cpu-hog process would help here too..
While doing the above, the CPU is about 0%-8% idle. I'm still running a
UP kernel.
>> I attribute the checksum change to changes on the filesystem, since that was
>> obviously mounted while doing the test.
>
> Probably, yeah; I gave some suggestions for ways to avoid this a
> moment ago, too.
Author: Thilo Jeremias Date: Jan 17, 2007 05:30
is the changed checksum always deterministically the same?
Meaning is this a systematic error, or
(Where I would guess for drive/cable/power etc problems) is it always a
different checksum (I mean are there more than two checksums)
If it is deterministic, it probably just happens at a certain block, so
it might help then to isolate the location where the fault is
to find the cause
--
my 5 cts'
good luck
thilo
Nino Dehne wrote:
> On Wed, Jan 17, 2007 at 08:32:50AM +1100, Daniel Carosone wrote:
>
>> On Tue, Jan 16, 2007 at 08:49:14PM +0100, Nino Dehne wrote:
>>
>>>> considerable numbers of seeks. It...
Author: Steven M. Bellovin Date: Jan 17, 2007 07:02
On Wed, 17 Jan 2007 23:30:55 +1000
Thilo Jeremias wrote:
> is the changed checksum always deterministically the same?
> Meaning is this a systematic error, or
> (Where I would guess for drive/cable/power etc problems) is it always
> a different checksum (I mean are there more than two checksums)
>
> If it is deterministic, it probably just happens at a certain block,
> so it might help then to isolate the location where the fault is to
> find the cause
>
Is there any chance the two different mirrors -- you did say RAID,
right, though I confess I don't remember which variant -- have
different versions of the block? That shouldn't happen, of course, but
if it did it would explain the problem.
--Steve Bellovin, http://www.cs.columbia.edu/~smb
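
Both suggestions above (finding the block where a deterministic fault occurs, and checking whether two mirror components hold different versions of a block) come down to comparing data block by block. A minimal sketch follows, assuming you can read the two things being compared as plain files or raw devices. The names `block_digests` and `differing_blocks` are hypothetical, and nothing here is specific to RAIDframe or cgd.

```python
import hashlib


def block_digests(path, block_size=65536, count=1024):
    """Return one MD5 digest per block for the first `count` blocks of path.

    (block_digests is a hypothetical helper name, not from the thread.)
    """
    digests = []
    with open(path, "rb", buffering=0) as dev:
        for _ in range(count):
            chunk = dev.read(block_size)
            if not chunk:
                break  # end of device or file reached early
            digests.append(hashlib.md5(chunk).digest())
    return digests


def differing_blocks(first, second):
    """Return the indices of blocks whose digests differ.

    `first` and `second` can be two passes over the same device, or one
    pass over each of two devices. Only the common length is compared.
    """
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]
```

Multiplying a reported index by the block size gives the byte offset of the suspect block. If the same index keeps turning up across repeated comparisons, that points at a fixed location rather than a random fault.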
Author: Nino Dehne Date: Jan 18, 2007 00:21
On Wed, Jan 17, 2007 at 11:58:56PM +0100, Nino Dehne wrote:
> On Thu, Jan 18, 2007 at 07:31:47AM +1100, Daniel Carosone wrote:
>> Nino, are you running a kernel with DIAGNOSTIC and/or DEBUG? Looking
>> at the cgd panic you found, I'm guessing not, because the path we see
>> to that problem would have involved one or more likely DIAGNOSTIC
>> messages.
>
> Not yet, but that just went on my list of things to try.
I'm now running the system with those options. I didn't try to provoke
the cgd panic yet, though. Parity recalculation is a lengthy process.
> 1) Boot DIAGNOSTIC+DEBUG kernel
> 2) Run fsck -f[1]
I ran fsck -fn 10 times in a row, with 4 gzips running concurrently.
Nothing. Output looked like this each time: