HPSS and Ashur Tuning...

James W DeRoest (deroest@cac.washington.edu)
Mon, 28 Jul 1997 07:43:45 -0700


Message-Id: <199707281443.HAA20246@mailhost1.cac.washington.edu>
From: "James W DeRoest" <deroest@cac.washington.edu>
To: "AST Webpage" <astweb@u.washington.edu>
Subject: HPSS and Ashur Tuning...
Date: Mon, 28 Jul 1997 07:43:45 -0700

-----Original Message-----
From: Douglas Luft <dbluft@u.washington.edu>
To: Jim DeRoest <deroest@u.washington.edu>; Ken Lowe <ken@u.washington.edu>; Jim Fox <fox@u.washington.edu>
Date: Sunday, July 27, 1997 12:38 PM
Subject: HPSS and Ashur Tuning...

Backups on ashur seem to have caught up, but I feel ashur is still marginal - around 2 am this morning there were still several 'out of space' situations. The main bottleneck seems to be writing to the tape drives. Testing the drives, I found they can handle up to about 10 MB/sec, but throughput can drop when data can't be written to the drive fast enough.

Using the monitor command to see how things were running, this is what I learned:

1). The 4 GB Barracuda drives can put out about 5 MB/sec max when read sequentially (about 4 MB/sec when read from the inside sectors). When seeks are added (because HPSS is also writing to the disk), throughput can drop significantly. This may be why some of the other HPSS sites have an intelligent disk subsystem on their HPSS mover systems. Another option is to stripe the HPSS logical volumes (or have HPSS stripe the data onto the disks). This would spread some of the load across more disks, which I feel would help somewhat, but disk bottlenecks will still occur. We could test one disk subsystem (I have the spare CMD RAID unit installed in a test rack, and we could move it to 3737). We would lose some disk space (it holds only seven 4 GB disks instead of nine, and depending on the RAID type, there might be only six disks of usable data). The RAID stripe size can also be tested (stripe sizes can range from 512 bytes to 512 KB).
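To put rough numbers on the striping idea in item 1, here is a back-of-the-envelope sketch in Python. It assumes striped sequential read rates add up linearly with no seek penalty, which is optimistic given the contention described above, and it assumes one disk's worth of parity for the "six disks of usable data" case. The function names are made up for illustration; the 10, 5, and 4 MB/sec figures are the ones quoted in this message.

```python
import math

def disks_needed(target_mb_s, per_disk_mb_s):
    """Fewest striped disks whose combined sequential rate meets the target.

    Assumes ideal linear scaling; real seeks will make this a lower bound.
    """
    return math.ceil(target_mb_s / per_disk_mb_s)

def usable_gb(total_disks, disk_gb, parity_disks=0):
    """Usable capacity after setting aside whole disks' worth of parity."""
    return (total_disks - parity_disks) * disk_gb

TAPE_MB_S = 10.0    # tape drive rate measured in this message
OUTER_MB_S = 5.0    # sequential read, outer sectors
INNER_MB_S = 4.0    # sequential read, inner sectors

print("disks to feed one tape drive (outer):", disks_needed(TAPE_MB_S, OUTER_MB_S))
print("disks to feed one tape drive (inner):", disks_needed(TAPE_MB_S, INNER_MB_S))
print("current usable GB (9 disks):", usable_gb(9, 4))
print("RAID unit usable GB (7 disks, 1 parity assumed):", usable_gb(7, 4, parity_disks=1))
```

Even under these ideal assumptions it takes two to three disks reading flat out to keep a single tape drive streaming, and the RAID unit would cut usable space from 36 GB to 24 GB in the six-data-disk case.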

2). Monitoring the disk reads, I would see large amounts of data read for a period of time (a few seconds), and then it would stop. I assume this data is being written to tape. I found HPSS is configured to write 32 MB files onto the tapes. It is possible there is a delay between tape files (I'm not sure whether this is caused by HPSS or the drive, though the drive should also cache tape marks). Increasing the amount of data per tape file from 32 MB to maybe 128 MB (or higher) might help. The 'query and set position' ioctls are available (in both our Artape and the OMI atdd tape drivers), which will help in positioning within a file.
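The effect of tape file size in item 2 can be estimated with a simple model: each file streams at the drive rate and is followed by a fixed pause. The pause length has not been measured, so the 2-second value below is purely a placeholder assumption, as is the function name. Only the 10 MB/sec rate and the 32 MB and 128 MB sizes come from this message.

```python
def effective_mb_s(file_mb, stream_mb_s, gap_s):
    """Average tape rate when every file of file_mb is followed by gap_s of dead time."""
    return file_mb / (file_mb / stream_mb_s + gap_s)

STREAM_MB_S = 10.0  # drive rate measured in this message
GAP_S = 2.0         # hypothetical delay between tape files, NOT a measured value

for size_mb in (32, 128, 512):
    rate = effective_mb_s(size_mb, STREAM_MB_S, GAP_S)
    print(f"{size_mb:4d} MB files: {rate:.2f} MB/sec")
```

With that assumed pause, 32 MB files average a bit over 6 MB/sec and 128 MB files a bit over 8.6 MB/sec. The actual gain depends entirely on how long the real delay turns out to be.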

3). Ashur's CPU can also be a bottleneck. Much of the time the CPU may not be very busy, but when the number of dumps (or throughput) increases to around 3 MB/sec (or more), ashur's CPU can be saturated. This reduces the overall throughput but especially the throughput to the tape drives, and causes them to drop out of streaming mode more often (along with 1 and 2 above).

Thoughts?

dbl