Webserver Performance

Overview

What's going on?

Web server load, web server response time, and web development machine load are measured and graphed.

How's it done?

System load is measured by the output of the uptime command.
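A minimal sketch of pulling a load figure out of uptime output. The text above does not say which of the three load averages is used, so taking the 1-minute figure is an assumption, as are the function name and the sample line.

```python
import re

def parse_load(uptime_output: str) -> float:
    """Return the first (1-minute) load average from `uptime` output."""
    # Linux prints "load average: 3.24, 2.98, 2.75"; BSD-style systems
    # print "load averages: 3.24 2.98 2.75". Accept both spellings.
    match = re.search(r"load averages?:\s*([0-9]+\.[0-9]+)", uptime_output)
    if match is None:
        raise ValueError("no load average found in uptime output")
    return float(match.group(1))

# Illustrative sample line, not real data from the servers.
sample = " 11:53:47 up 10 days,  2:03,  3 users,  load average: 3.24, 2.98, 2.75"
print(parse_load(sample))  # prints 3.24
```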

Response time is measured by the [Apache Bench] program (ab) bundled with Apache. The program runs on each of the servers and requests http://localhost/webstats/index.html five times. Statistics on time spent waiting for response are gathered.
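As a rough illustration, the wait-time statistics could be pulled out of ab's output like this. The sample text follows the table layout printed by recent ab releases (older releases format it differently), its numbers are made up, and the helper names are hypothetical. Only the -n flag (number of requests) is used on the command line.

```python
import re

# Illustrative excerpt in the layout of recent ab versions; not real data.
SAMPLE_AB_OUTPUT = """\
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:   195  201   3.5    202     204
Waiting:      195  201   3.5    202     204
Total:        195  201   3.5    202     204
"""

def ab_command(url: str, requests: int = 5) -> list:
    """Build the argument list for one ab run; -n sets the request count."""
    return ["ab", "-n", str(requests), url]

def parse_waiting(ab_output: str) -> dict:
    """Extract min, mean and max of the "Waiting" row, in milliseconds."""
    # Row layout: min, mean, standard deviation, median, max.
    pattern = r"^Waiting:\s+(\d+)\s+(\d+)\s+[\d.]+\s+\d+\s+(\d+)\s*$"
    match = re.search(pattern, ab_output, re.MULTILINE)
    if match is None:
        raise ValueError("no Waiting row found in ab output")
    low, mean, high = (int(g) for g in match.groups())
    return {"min": low, "mean": mean, "max": high}

print(ab_command("http://localhost/webstats/index.html"))
print(parse_waiting(SAMPLE_AB_OUTPUT))
```

In a real run the argument list would be handed to something like subprocess.run and its captured output passed to parse_waiting.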

Graphs are generated with gnuplot and output in PNG format.

When's it done?

Both measurements (load and response) are taken at five-minute intervals, which makes 288 measurements per day. Graphs are generated at fifteen-minute intervals.

I wanna play!

You can run the performance measure manually right now: [Live]. Results will be in text format. It may take a little while to run.

Details

More details than you wanted to know:

A crontab on ovid runs Lynx and requests depts.washington.edu/webstats/webserver-stats/stats.cgi (with a special parameter).
stats.cgi requests performance information for each cluster (students, depts, courses and vieyra) by requesting {{students|depts|courses}.washington.edu|home.myuw.net}/webstats/webserver-stats/ask.cgi?host={students|depts|courses|vieyra}&path=/index.html (with Perl's LWP this time).
ask.cgi then requests (again, with LWP) {{students{1{1|2}|depts{1{1|2|3}}|courses{0{1|2}}.washington.edu|vieyra0{1|2}.myuw.net}/webstats/webserver-stats/perf.cgi?path=/index.html.
On each host, perf.cgi runs uptime and the ab program on localhost with the path /index.html. Output from ab and uptime is parsed, and reformatted to look like this:



1065552827 2003-10-07 11:53:47 load: 3.24 min: 195 mean: 201 max: 204 from depts01 to localhost/webstats/index.html

(Column 1 is the UNIX timestamp. Columns 2 and 3 are the human-readable date and time. The rest is also human readable. I hope. This is what you see in the "Raw load and response data" below.)
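For anyone who wants to read these lines back in, here is a hedged sketch of a parser for the format shown above. The field names are my own, and it only handles the exact layout of the sample line; the web development machine lines may look different.

```python
import re
from datetime import datetime, timezone

# One named group per field of the sample line.
RECORD = re.compile(
    r"^(?P<epoch>\d+) (?P<date>\S+) (?P<time>\S+) "
    r"load: (?P<load>[\d.]+) min: (?P<min>\d+) mean: (?P<mean>\d+) "
    r"max: (?P<max>\d+) from (?P<host>\S+) to (?P<target>\S+)$"
)

def parse_record(line: str) -> dict:
    """Turn one raw data line into a dictionary of typed fields."""
    m = RECORD.match(line.strip())
    if m is None:
        raise ValueError("line does not match the expected format")
    return {
        # The UNIX timestamp is unambiguous, so it is used in preference
        # to the local date and time in columns 2 and 3.
        "when": datetime.fromtimestamp(int(m["epoch"]), tz=timezone.utc),
        "load": float(m["load"]),
        "min": int(m["min"]),
        "mean": int(m["mean"]),
        "max": int(m["max"]),
        "host": m["host"],
        "target": m["target"],
    }

line = ("1065552827 2003-10-07 11:53:47 load: 3.24 min: 195 mean: 201 "
        "max: 204 from depts01 to localhost/webstats/index.html")
record = parse_record(line)
print(record["when"].isoformat())  # prints 2003-10-07T18:53:47+00:00
print(record["host"], record["load"], record["mean"])
```

The printed time is UTC; the 11:53:47 in the sample line is the same instant in local time, seven hours behind.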



When perf.cgi works on the first webserver machine in the cluster, it uses rsh to run uptime on that cluster's web development machine as well.  This data is listed on its own line along with the data from the rest of the webserver cluster.

 This output is passed back up the call-chain to ask.cgi, which returns it to stats.cgi, which actually distributes the data to the proper data files.
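The top of the chain can be sketched roughly as follows: build the ask.cgi URL for each cluster, then sort the returned lines by originating host before appending them to data files. This is not the actual stats.cgi (which is Perl); the http:// scheme, the function names, and the one-file-per-host naming are all assumptions made for illustration.

```python
import os
from collections import defaultdict

# Front-end host per cluster, taken from the request pattern described above.
CLUSTER_FRONTENDS = {
    "students": "students.washington.edu",
    "depts": "depts.washington.edu",
    "courses": "courses.washington.edu",
    "vieyra": "home.myuw.net",
}

def ask_url(cluster: str, path: str = "/index.html") -> str:
    """Build the ask.cgi URL for one cluster (http:// is assumed)."""
    host = CLUSTER_FRONTENDS[cluster]
    return (f"http://{host}/webstats/webserver-stats/ask.cgi"
            f"?host={cluster}&path={path}")

def group_by_host(lines) -> dict:
    """Group result lines by the host name that follows the word 'from'."""
    grouped = defaultdict(list)
    for line in lines:
        words = line.split()
        if "from" in words[:-1]:
            grouped[words[words.index("from") + 1]].append(line)
    return dict(grouped)

def append_to_data_files(grouped: dict, directory: str = ".") -> None:
    """Append each host's lines to <host>.dat (hypothetical file naming)."""
    for host, lines in grouped.items():
        with open(os.path.join(directory, f"{host}.dat"), "a") as handle:
            for line in lines:
                handle.write(line.rstrip("\n") + "\n")

print(ask_url("vieyra"))
```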



The graphing is done by calling gnuplot with some specialized scripts for each of the clusters.
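The real per-cluster gnuplot scripts are not reproduced here. As a sketch of what such a script might contain, the helper below writes one out for a data file in the raw format shown earlier, where column 1 is the UNIX timestamp, column 5 the load, and column 9 the mean response time. The function name, file names, axis labels, and choice of columns are assumptions.

```python
def gnuplot_script(data_file: str, png_file: str, title: str) -> str:
    """Return gnuplot commands that plot load and mean response over time."""
    commands = [
        "set terminal png",
        f'set output "{png_file}"',
        f'set title "{title}"',
        # Column 1 holds seconds since the UNIX epoch.
        "set xdata time",
        'set timefmt "%s"',
        'set format x "%H:%M"',
        'set ylabel "load"',
        # Response time goes on a second y axis since its scale differs.
        'set y2label "mean response"',
        "set y2tics",
        f'plot "{data_file}" using 1:5 with lines title "load", '
        f'"{data_file}" using 1:9 axes x1y2 with lines title "mean response"',
    ]
    return "\n".join(commands) + "\n"

script = gnuplot_script("depts01.dat", "depts01.png", "depts01")
print(script)
```

The resulting text could be saved to a file and passed to gnuplot on the command line.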

Improvements

mikeg (original author) (with some changes by agraf):

The current implementation is too fractured: The ask.cgi script can be completely eliminated, and stats.cgi can request and parse perf.cgi's output just as well. But the main problem remains, namely, that to measure response time and load average of the web servers, we need to interact with the web servers in a way that actually affects the results.

Ideally, stats.cgi and perf.cgi could be married into one script that simply runs on all six of the cluster members in a cron. But running cronjobs on a production system doesn't sound like a good idea. So I'm back to square one and requesting Apache to get to the tools that measure Apache.

There is also the issue of maintaining two codebases (one accessible on localhost from vergil, and the same for ovid). But I'm not sure there is an easy fix to this.

Going down the road of ssh to avoid Apache requires maintaining two crontabs (one on vergil so password-less logins to stewart work, and the same for ovid/depts), which adds to the maintenance of two codebases.

agraf: 

When this was moved to a depts account, the ask.cgi script couldn't be requested from individual depts hosts on the depts server, since they'd redirect to depts.washington.edu/webstats/webserver-stats/ask.cgi and obliterate any GET arguments. This meant that ask.cgi had to be moved to my account until a solution could be found. I resolved this by having the extended category staff added to the account so that its webtype included staff. This allowed it to load things on the depts hosts without being redirected.


I added support for the vieyra cluster, which serves webpages of myuw.net customers.  This feature required adding a MyUW.net subscription to the account.  I also put in a number of log-archiving and efficiency improvements that aren't visible from the site. 


I added the load of the web development machines associated with each web server cluster to the load graph, as the performance of these machines is heavily interdependent: many sites rely on MySQL databases running on the web development machines. This feature required small changes in each of the scripts.

nikky: 
Added the long-term MySQL observation graphs. Changed other variables to better judge response times on high-load machines.

Data Observations



 Data gap between March 25th and 26th because I broke the script with .htaccess.
 Data gap on April 26th due to quota problems.
 Data gap on April 27th-28th due to faulty archiving.


System Observations



 Hosts in the depts cluster show response times about 10 times slower than those in the boca cluster.
 Response time of boca02 remains consistently erratic. Change! boca02 stopped behaving erratically around May 18th.  Not sure what they did, but I like it.
 Since Wednesday October 22nd 2003, the load average on boca02 has not been following the daily up-down sinusoid path, and has doubled in magnitude since Monday October 27th 2003.
 The third boca and depts were out of DNS on March 25th for a while, so they had low load, since they weren't serving any pages.
 boca02 ran about 1.5 load points above the other bocas from Sat Apr 17th 2004 until 8:00 Wed Apr 21st.  Main cause: 2 sshd processes taking up ~25% CPU each. Is this what happened on Oct 22, 2003?
 This sshd behavior has continued to occur every week or so on various hosts, so I talked to Collaborative Platforms about it, and they are supposedly investigating.
 A couple of times I've seen slow response times with no apparent change in load; these were due to SYN attacks.  See 15:30, May 15th 2005.

[Graphs: MySQLDs for Vergil (Students), Ovid (All) (Depts), Ovid01 (Depts), Ovid02 (Depts) and Ovid03 (Depts), each available in a large and a huge version.]