>>> Apologies as this turned out to be rather lengthy....
>>>
>>> I run a development engineering lab for a financial services company
>>> and we are running into a rather peculiar but very troubling problem
>>> on some of our performance servers when copying, moving or backing up
>>> very large files, i.e. files greater than about 52 Gigabytes in size
>>> (yes GB not MB). Primarily these are SQL Server database files but we
>>> also see the same problems copying or relocating large Virtual Server
>>> Hard Drive files of that size or larger.
>>> The problem is actually an old one that I think has not been dealt
>>> with. The end result for us on Windows Server 2000 and Windows Server
>>> 2003 R2 is that after copying about 52 GB of the file, Windows
>>> starts reporting "Windows delayed write" errors and at that point the
>>> file copy collapses and stops. Although the system reports the copy
>>> is still running, no further data is being successfully copied. All
>>> the file I/O and other Windows processes slow down considerably (more
>>> on this in a little bit). In each case where a delayed write error
>>> is generated, Event Viewer shows the first error as being Event ID 50
>>> and/or Event ID 26. The problem is seen when using both drive letters
>>> and UNC paths. We asked about hotfixes for Server 2003 SP2 but were
>>> told by support there were none, as the fix described in KB Article
>>> 890352 [ http://support.microsoft.com/kb/890352/ ] was rolled into
>>> SP2 and did not apply to our issue.
>>> We have only recently begun to see these problems because until
>>> recently most of our performance testing model used fairly small
>>> working sets for data (typically under 100GB total) and thus each
>>> file-group in our databases was less than 40GB so we never really saw
>>> a problem. I've seen this problem on ALL versions of Windows
>>> including Server 2008. In the Server 2008 case, the O/S collapses
>>> completely and cannot be shut down. In most cases we have to power off
>>> the server to get the problem system to recover. No delayed write
>>> error is reported on Server 2008.
>>> I've tested this on a variety of servers (listed next) and in as many
>>> cases as possible I tested on multiple servers with the same config
>>> and with different O/S editions. I also tested one of the servers
>>> that was experiencing the problems first, using small files (files
>>> from 1 byte up to 40 GB in size). The test set I use is approx
>>> 870GB in total size and has been copying continuously for about 40
>>> days now on this server. I think at last check it had copied around
>>> 2,973 Terabytes of data, all without error.
>>> Primary test servers and configurations:
>>> - HP ML570, quad Xeon, 16GB RAM, 1.2TB local storage + 3.4TB SAN
>>>   storage on a Nexsan SATABoy box. WinSvr 2003 R2 SP2 x86 and x64.
>>> - HP DL580, dual Xeon, 8GB or 16GB RAM, 1.2TB local storage + 3.4TB
>>>   SAN storage on a Nexsan SATABoy box. WinSvr 2003 R2 SP2 x86 and x64.
>>> - HP DL380 G4p, dual Xeon, 4GB RAM, 300GB local storage + 3.4TB SAN
>>>   storage on a Nexsan SATABoy box. WinSvr 2003 R2 SP2 x86.
>>> - Dell 2950, dual quad-core 2.83GHz, 32GB RAM, 320GB local storage,
>>>   4.2TB on an EMC AX150 SAN and 3.4TB SAN storage on a Nexsan SATABoy
>>>   box. WinSvr 2003 R2 SP2 x86 and x64, WinSvr 2008 x64.
>>> - Dell 2950, dual quad-core 2.5GHz, 4GB RAM, 140GB local storage,
>>>   4.2TB on an EMC AX150 SAN and 3.4TB SAN storage on a Nexsan SATABoy
>>>   box. WinSvr 2003 R2 SP2 x86 and WinSvr 2008 x86.
>>> - Dell Precision 460T workstation, dual Xeon 2.4GHz, 4GB RAM, 1TB of
>>>   local storage (SATA), 3.4TB SAN storage on a Nexsan SATABoy box
>>>   via iSCSI. WinSvr 2003 R2 SP2 x64.
>>> With the exception of the ML570 I've tested on multiple servers of
>>> the same type. All servers have succeeded in copying large volumes of
>>> files 40GB and smaller.
>>>
>>> On our HPs all drives are hot-swappable SCSI, either 10K or 15K. We
>>> don't mix spindle speeds in RAID groups. On our Dells, all local
>>> drives are SAS; on the EMC AX150 all drives are SATA and on the
>>> SATABoy all drives are SATA. On all systems, copies of files from
>>> 0 to 40GB are all successful.
>>> All systems are running the latest BIOS and we have seen the same
>>> behavior on prior BIOS versions. All disk controller firmware is
>>> updated to the latest version and, as with the BIOSes, the same
>>> behavior existed on earlier versions. We have checked and updated hard
>>> disk firmware where new versions are available, with the same result
>>> as for controller firmware. On local hard drives we run RAID 1 or
>>> RAID 5 to get best performance or max capacity. Both RAID modes
>>> exhibit the same behavior. I have tested on the HPs with no RAID at
>>> all and the same results occurred.
>>> I try not to do specialty O/S builds for our lab environment. I build
>>> a straight default Windows O/S configuration, fully patch it with
>>> Microsoft patches and burn-in test the system, then go test for this
>>> problem. I do not tweak system settings or apply registry hacks until
>>> I get baseline test data. In all cases here for the file copy tests I
>>> have not tweaked the system settings or registry at all. Our servers
>>> are set for background performance for the system cache, although we
>>> tested with 'foreground' set without success too. We've tried large
>>> and small pagefiles and have moved pagefiles to separate disk
>>> spindles to see if it made a difference. I've tried all manner of file
>>> copy and file sync tools, but what it comes down to is that if the
>>> file being copied is written to the system's storage (local or SAN),
>>> the system will collapse and file copying will fail somewhere around
>>> 52 - 58GB into the copy.
>>> Windows Server 2008 has given me my best look into the problem and
>>> what appears to be happening is that the system cache keeps expanding
>>> until all physical memory is used and the paged pool keeps growing
>>> until it hits around 380MB and non-paged pool hits about 82MB (I think
>>> the latter is right). What I then see is that the CPU flatlines, as
>>> does the total disk bytes written in Perfmon, but the physical memory
>>> usage history in Task Manager suddenly starts ramping until it gets
>>> out to around 70GB, and then everything is done: either the system
>>> hangs (Server 2008) or delayed write errors occur.
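
Given the cache growth described above, one experiment that might be worth a try is a copy loop that forces written data out to disk at regular intervals, so the amount of unwritten data sitting in the system cache stays bounded. Below is a minimal Python sketch of that idea. The function name `copy_with_periodic_flush`, the 4 MB block size and the 256 MB flush interval are all my own assumptions, not taken from any tool mentioned in this thread. I have not verified that this avoids the delayed write errors, and it is not the same thing as true unbuffered I/O (FILE_FLAG_NO_BUFFERING on Windows), which plain Python file objects do not expose.

```python
import os


def copy_with_periodic_flush(src, dst,
                             block_size=4 * 1024 * 1024,
                             flush_every=256 * 1024 * 1024):
    """Copy src to dst in fixed-size blocks.

    After roughly every `flush_every` bytes, flush Python's buffer and ask
    the OS to commit the file to disk (os.fsync), so dirty data does not
    pile up in the cache for the whole length of the copy.
    Returns the number of bytes copied.
    """
    copied = 0
    since_flush = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(block_size)
            if not block:
                break
            fout.write(block)
            copied += len(block)
            since_flush += len(block)
            if since_flush >= flush_every:
                fout.flush()              # push Python's buffer to the OS
                os.fsync(fout.fileno())   # ask the OS to write it to disk
                since_flush = 0
        # Final flush so the tail of the file is committed as well.
        fout.flush()
        os.fsync(fout.fileno())
    return copied
```

Usage would be something like `copy_with_periodic_flush(r"D:\data\big.mdf", r"E:\backup\big.mdf")`, where both paths are placeholders. Watching the system cache counters in Perfmon during the run should show whether the periodic flush actually changes the growth pattern.
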
>>> One place where I do not seem to see the problem is SAN Drive to SAN
>>> Drive Copies on the EMC SAN. I always see this problem on the SATABoy
>>> SAN when copying large files to the SAN volumes, regardless of cache
>>> settings. I can back up the files to tape, but due to their size it's
>>> an expensive option both in media cost and time to back up and restore
>>> the data. My preferred option is to back up to removable hard disk
>>> (external USB - SATA); sure, it's slower, but it would offer
>>> operating efficiencies I cannot get with tape right now (if
>>> it worked). Some testing has centered on using external drives, but
>>> most of the testing on my systems has been to copy or move the files
>>> from one volume to another on the server. It doesn't matter whether I
>>> turn caching off or on for external USB drives or local drives. I
>>> have had very occasional success on the EMC SAN by ensuring that
>>> Windows O/S disk caching is off. Success using this method has been
>>> spotty and limited to servers with 32GB or more of RAM. One very
>>> telling test setup was to populate the server with 128GB RAM and run
>>> Windows
>>> Server 2008 x64. In that case almost all file copies were successful,
>>> although they became painfully slow after about 60GB was copied and
>>> it took over 5 hours to copy the last ~35GB of a 95.8 GB file.
>>> I tried using Backup Exec and MsBackup to back up the files to a hard
>>> disk but it failed every time; when I run the same backup to tape it
>>> is successful.
>>> This is leading me to think the problem is generated in the lower
>>> level file system filter drivers. I've tested with and without
>>> AntiVirus software in the mix and have likewise tested systems that
>>> are built raw with no patches at all and see the same problems. I
>>> also tried splitting the file into chunks and copying the pieces and
>>> while I can split the file, I cannot join it again as the processes
>>> all seem to rely on creating a temp file and the process of copying
>>> the large temp file always results in the delayed write errors being
>>> generated. I've also tried zipping up the file to reduce its size
>>> (database files compress really well) but that process likewise
>>> requires a file copy of a large file, and at some point that fails.
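
On the split-and-join point: joining does not strictly need a temp file if the pieces are streamed straight into the final destination. Here is a rough Python sketch of both halves. The names `split_file` and `join_parts` and the `.partNNNN` naming scheme are made up for illustration. The join still ends up writing one large file, so it only removes the temp-file step; I can't say whether it would get past the delayed write errors.

```python
import os


def split_file(src, part_size, block_size=4 * 1024 * 1024):
    """Split src into pieces of at most part_size bytes.

    Pieces are written next to src as src.part0000, src.part0001, ...
    Returns the list of piece paths in order.
    """
    part_paths = []
    index = 0
    with open(src, "rb") as fin:
        while True:
            part_path = f"{src}.part{index:04d}"
            written = 0
            with open(part_path, "wb") as fout:
                while written < part_size:
                    block = fin.read(min(block_size, part_size - written))
                    if not block:
                        break
                    fout.write(block)
                    written += len(block)
            if written == 0:
                # Source exhausted: drop the empty piece and stop.
                os.remove(part_path)
                break
            part_paths.append(part_path)
            index += 1
    return part_paths


def join_parts(part_paths, dst, block_size=4 * 1024 * 1024):
    """Append each piece directly to dst, with no intermediate temp file.

    The output is flushed and committed to disk after each piece.
    Returns the total number of bytes written.
    """
    total = 0
    with open(dst, "wb") as fout:
        for path in part_paths:
            with open(path, "rb") as fin:
                while True:
                    block = fin.read(block_size)
                    if not block:
                        break
                    fout.write(block)
                    total += len(block)
            fout.flush()
            os.fsync(fout.fileno())
    return total
```

If each piece is kept under the size where copies are known to succeed (40GB per the tests above), the pieces can be moved individually and then joined on the target volume.
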
>>> It should go without saying I've tried using SQL DB backup writing to
>>> disk storage and it fails every time; tape is successful. In fact it
>>> was this very act that caused me to begin investigating the problem
>>> in the first place.
>>> In days gone by there were loads of users seeing this problem on Win
>>> XP copying much smaller files, and there are some other people seeing
>>> this problem today on Windows Server. Microsoft are very quiet on the
>>> subject for Windows Server, I think in no small part because very few
>>> people are seeing the problem and there is no demand to identify the
>>> problem or to fix it. I cannot believe, though, that I'm one of the
>>> first people to see the problem. I'll quite happily accept it's a
>>> configuration issue if someone can tell me how to fix the problem!
>>> All attempts to tweak a system have not yielded any success. What
>>> also sucks is that you typically have to wait ~30 mins to find out
>>> whether the problem will manifest.
>>>
>>> I'm about ready to escalate this issue to Microsoft. I think I now
>>> have enough test data to do so, but thought I would bounce this off
>>> others to see if anyone else has a solution or guidance first.
>>>
>>> Phil
>>> Checkfree: