Author: Phil Lewis Date: May 24, 2008 09:45
Apologies as this turned out to be rather lengthy....
I run a development engineering lab for a financial services company and we
are running into a rather peculiar but very troubling problem on some of our
performance servers when copying, moving or backing...
  |
Date: May 24, 2008 10:16
The data section of the event ID 50 contains a lot of useful information for
diagnosing the exact cause of the issue. Could you check the "words" radio
button and post the dump data? Also, what is the source of the ID 50? Is
it mrxsmb or disk?
"Phil Lewis" wrote in message
news:uVXDdzbvIHA.4876@TK2MSFTNGP02.phx.gbl...
> Apologies as this turned out to be rather lengthy....
>
> I run a development engineering lab for...
  |
Author: Ace Fekay [MVP] Date: May 24, 2008 10:22
In news:uVXDdzbvIHA.4876@TK2MSFTNGP02.phx.gbl,
Phil Lewis typed:
> Apologies as this turned out to be rather lengthy....
>
> I run a development engineering lab for a financial services company
> and we are running into a rather peculiar but very troubling problem
> on some of our performance servers when copying, moving or backing up
> very large files, i.e. files greater than about 52 Gigabytes in size
> (yes GB not MB). Primarily these are SQL Server database files but we
> also see the same problems copying or relocating large Virtual Server
> Hard Drive files of that size or larger.
> The problem is actually an old one that I think has not been dealt
> with, the end result for us on Windows Server 2000 and Windows Server
> 2003 R2 is that after copying about 52 Gbytes of the file, Windows
> starts reporting "Windows delayed write" errors and at that point the
> file copy collapses and stops. Although the system reports the copy
> is still running, no further data is being successfully copied. All
> the file IO and other windows processes slowdown considerably (more
> on this in a little bit). In each case where a delayed write error
> is generated, Event Viewer shows the first error as being event ID 50 ...
  |
Date: May 24, 2008 11:03
Given that they're at Windows 2003 SP2 or Windows 2008, my guess would be
CcDirtyPageThreshold is set too high. Let's see the details of the event ID
50. The event ID 26 is really immaterial; it's just the application popup
message (you got the dialog that says lost delayed write). The event ID 50
will contain the error code passed up the driver stack, as well as any SCSI
sense data. Between that and possibly a short perfmon run during a copy, we
should be able to sort this out. My guess is that we'll see half of RAM
cached before writing starts, then you'll either see c000009a (out of
resources) or a timeout due to slow disk once dirty pages are flushed. In
perfmon, you'll see heavy paging once dirty pages start flushing, and this is
really the cause of the lack of server...
  |
Author: Phil Lewis Date: May 24, 2008 12:47
Hi John,
Event 50 is being generated by Disk. I think that is being seen on all
servers, though I've seen one or two cases where mrxsmb was the source when
doing network copies. I can also confirm that at least 50% of RAM was indeed
cached; in most cases it was much higher. While we have seen c000009a (out
of resources), in almost all cases disk timeouts were the point of failure,
and yes, we do experience very heavy paging when or just before
the system tips over. On Server 2008 we were seeing the dedicated pagefile
disk solidly active for over an hour before we finally gave up waiting for
the copy to quit and the system to shut down.
I'll get the event viewer dumps later tonight and I can also get a Perfmon
dump as well. Let me know what counters you would like to see included...
  |
Date: May 25, 2008 23:39
The fix is there with SP2, but you still need to set the registry key.
cache
    copy reads/sec
    data flush pages/sec
memory
    pages/sec
    free system page table entries
physical disk
    sec/read
    sec/write
    reads/sec
    writes/sec
John Fullbright [Exchange MVP]
"Phil Lewis" wrote in message
news:uxIrTZdvIHA.576@TK2MSFTNGP05.phx.gbl...
> Hi John
> Event 50 is being generated by Disk. I think that is being seen on all
> servers, though...
  |
Author: Phil Lewis Date: May 28, 2008 06:50
Hi John, I'm about to grab the counters to a CSV file, or do you prefer
binary?
What do you recommend in terms of sample interval? I typically use 5-second
intervals for this kind of thing but will adjust to whatever you prefer.
I'm grabbing Physical Disk counters for both _Total and the target disk
[D:\] for the copy.
The first set of counters will be taken without the reg key, to see the
system in its raw state. I'll then add the key. What do you recommend in
terms of a value? The server is an HP ML570, quad Xeon dual core 3.33GHz, with
16GB of RAM. Reading suggests starting at half of installed RAM, which would
be 8192.
Phil
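For anyone wanting to script the same capture rather than set it up by hand in Perfmon, here is a rough sketch. It is not what was actually run in this thread; it only builds the counter paths for the counters John asked for and a matching typeperf command line, using the 5-second interval and CSV output mentioned above. The file names and the "0 D:" disk instance name are made up for illustration; check the real instance name on your own box.

```python
# Sketch only: builds counter paths and a typeperf command line.
# The instance name "0 D:" and both file names are hypothetical.

def counter_paths(disk_instances):
    """Return Windows counter paths for the cache, memory and disk counters."""
    paths = [
        "\\Cache\\Copy Reads/sec",
        "\\Cache\\Data Flush Pages/sec",
        "\\Memory\\Pages/sec",
        "\\Memory\\Free System Page Table Entries",
    ]
    disk_counters = [
        "Avg. Disk sec/Read",
        "Avg. Disk sec/Write",
        "Disk Reads/sec",
        "Disk Writes/sec",
    ]
    for instance in disk_instances:
        for counter in disk_counters:
            paths.append("\\PhysicalDisk({})\\{}".format(instance, counter))
    return paths


def typeperf_command(counter_file, output_csv, interval_s=5):
    """Return the typeperf argument list; it is not executed here."""
    return [
        "typeperf",
        "-cf", counter_file,      # file with one counter path per line
        "-si", str(interval_s),   # sample interval in seconds
        "-f", "CSV",              # output format
        "-o", output_csv,         # output file
    ]


# Both _Total and the copy target disk, as described in the post above.
for path in counter_paths(["_Total", "0 D:"]):
    print(path)
print(" ".join(typeperf_command("counters.txt", "copy_test.csv")))
```

Save the printed counter paths to counters.txt and run the printed command on the server just before starting the copy.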
  |
Author: Mat Young Date: May 28, 2008 14:53
Avg. Disk sec/Read and Avg. Disk sec/Write would be handy. If you are getting
lost writes because your array is being overrun beyond what the spindle count
can handle, then this may show up as a latency spike beforehand.
It would also be good to match IO against spindle count. To do this properly
you will need the cache hit ratio from the arrays.
rgds
mat
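A back-of-envelope way to do the IO versus spindle count comparison is sketched below. It assumes a very simple model that is my own simplification, not something from the array vendor: only IOs that miss the array cache reach the disks, and they spread evenly across the spindles. It ignores RAID write penalties, and the numbers in the example are invented.

```python
# Sketch only: simple cache-miss model, no RAID write penalty.

def backend_iops_per_spindle(frontend_iops, cache_hit_ratio, spindles):
    """Estimate the IOPS each spindle must service.

    frontend_iops   -- IOPS seen by the host (e.g. from perfmon)
    cache_hit_ratio -- fraction of IOs served by array cache, 0.0 to 1.0
    spindles        -- number of disks behind the LUN
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0.0 and 1.0")
    if spindles <= 0:
        raise ValueError("spindles must be a positive number")
    return frontend_iops * (1.0 - cache_hit_ratio) / spindles


# Invented example: 1000 host IOPS, 75% cache hits, 10 disks.
print(backend_iops_per_spindle(1000, 0.75, 10))
```

If the per-spindle figure comes out well above what a single disk of that type can sustain, the array is a likely suspect.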
"Phil Lewis" wrote:
> Hi John I'm about to grab the counters to a CSV File or do you prefer
> Binary?
>
> What do you recommend in terms of sample interval? I typically use 5 Second
> intervals for this kind of thing but...
  |
Author: Phil Lewis Date: May 28, 2008 15:03
Well, I'm getting some very interesting findings that I don't yet know what
to make of.
First off, it looks like the whole train of events ultimately leading up to
the delayed write failure begins about 2-4 mins into the copy process. The
file copy in Windows Explorer begins normally and...
  |
Author: Kenny Speer Date: May 28, 2008 15:50
Please take these comments with a grain of salt. It's typically pretty
difficult to troubleshoot these issues via newsgroup.
Phil Lewis wrote:
> Well, I'm getting some very interesting findings that I don't yet know what
> to make of.
>
> First off It looks like the whole train of events ultimately leading up to
> the delayed write failure begins about 2-4 mins into the copy process. The
> File copy in Windows Explorer begins normally and the 88.6GByte copy shows
> it is expecting to take around 24-26 mins. Watching a whole slew of counters
> in Perfmon I can see the Cache Bytes Growing steadily and Disk Total Reads
> per second running avg around 190.
[Kenny] This is most likely caused by two things. First, the NTFS
cache writes are where your low time estimate comes from. The average of
190 IOPS to your array is most likely from a write-back cache in the array.
> Somewhere around 2-4 mins into the copy
> the time to complete estimate in windows Explorer suddenly climbs
> dramatically topping out around 100Mins.
[Kenny] You just filled your cache for the filesystem.
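To put rough numbers on that jump, the sketch below works out the copy rate each Explorer estimate implies for the 88.6 GB file. It assumes Explorer's GB are 1024 MB and uses 25 mins as the midpoint of the 24-26 min estimate; both are my assumptions, not measured values.

```python
# Sketch only: converts a time-to-complete estimate into an implied rate.

def implied_rate_mb_s(size_gb, minutes):
    """Return the MB/s needed to move size_gb in the given minutes."""
    return size_gb * 1024.0 / (minutes * 60.0)


early = implied_rate_mb_s(88.6, 25)    # estimate while the cache absorbs writes
late = implied_rate_mb_s(88.6, 100)    # estimate after the cache has filled
print("early: %.1f MB/s, late: %.1f MB/s" % (early, late))
```

So the estimate drops from roughly 60 MB/s into cache to roughly 15 MB/s once the copy is limited by what actually reaches the disk.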