[Graphs: load and response time for Students (students.washington.edu) and Depts ({depts, staff, faculty, courses}.washington.edu); load for Ovid ({ovid01, ovid02, ovid03}.u.washington.edu). Raw load and response data are linked for Students, raw load data for Ovid.]
[Two further graph tables with the same layout: load and response time for Students (students.washington.edu) and Depts.]
[Graphs: MySQLDs for Vergil (Students) and Ovid (All) (Depts).]
[Graphs: MySQLDs for Ovid01, Ovid02 and Ovid03 (Depts).]
Load is measured with the uptime command.
Response time is measured by the Apache Bench program (ab) bundled with Apache. The program runs on each of the servers and requests http://localhost/webstats/index.html five times. Statistics on the time spent waiting for a response are gathered.
Graphs are generated with gnuplot and output in PNG format.
More details than you wanted to know:
- ovid runs Lynx and requests depts.washington.edu/webstats/webserver-stats/stats.cgi (with a special parameter).
- stats.cgi requests performance information for each cluster (students, depts, courses and vieyra) by requesting {{students|depts|courses}.washington.edu|home.myuw.net}/webstats/webserver-stats/ask.cgi?host={students|depts|courses|vieyra}&path=/index.html (with Perl's LWP this time).
- ask.cgi then requests (again, with LWP) {{students{1{1|2}|depts{1{1|2|3}}|courses{0{1|2}}.washington.edu|vieyra0{1|2}.myuw.net}/webstats/webserver-stats/perf.cgi?path=/index.html.
- On each host, perf.cgi runs uptime and the ab program on localhost with the path /index.html. Output from ab and uptime is parsed and reformatted to look like this:
1065552827 2003-10-07 11:53:47 load: 3.24 min: 195 mean: 201 max: 204 from depts01 to localhost/webstats/index.html
(Column 1 is the UNIX timestamp. Columns 2 and 3 are the human-readable date and time. The rest is also human readable. I hope. This is what you see in the "Raw load and response data" below.)
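The reformatting step can be sketched in Python roughly as follows. This is an illustration, not the Perl in perf.cgi. It assumes the load figure is the first (one-minute) value that uptime reports, which the description above does not state, and the function names are invented for the example.

```python
import re
import time

# uptime prints "load average: 3.24, 2.98, 2.75" on many systems and
# "load averages: ..." on some BSD-derived ones; accept both spellings.
LOAD_RE = re.compile(r"load averages?:\s*([\d.]+)")


def one_minute_load(uptime_output):
    """Extract the first load figure from a line of uptime output."""
    match = LOAD_RE.search(uptime_output)
    if match is None:
        raise ValueError("no load average in uptime output")
    return float(match.group(1))


def format_record(epoch, load, wait, host, target):
    """Build one data line in the layout shown above.

    `wait` is a dict with "min", "mean" and "max" in milliseconds.
    The date and time columns are rendered in the machine's local time.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(epoch))
    return (
        f"{epoch} {stamp} load: {load:.2f} "
        f"min: {wait['min']} mean: {wait['mean']} max: {wait['max']} "
        f"from {host} to {target}"
    )
```

Keeping the UNIX timestamp in column 1 means the graphing tools never have to parse the human-readable date.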
When perf.cgi runs on the first webserver machine in the cluster, it also uses rsh to run uptime on that cluster's web development machine. This data is listed on its own line along with the data from the rest of the webserver cluster.
The output goes back to ask.cgi, which returns it to stats.cgi, which actually distributes the data to the proper data files.
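The distribution step might look something like the Python sketch below. The one-file-per-host naming (for example depts01.dat) is an assumption made for illustration; the actual data file layout is not described here. Lines that do not match the full record format, such as the load-only line for a web development machine, are simply skipped in this sketch.

```python
import re
from pathlib import Path

RECORD_RE = re.compile(
    r"^(?P<epoch>\d+) (?P<date>\S+) (?P<time>\S+) "
    r"load: (?P<load>[\d.]+) "
    r"min: (?P<min>\d+) mean: (?P<mean>\d+) max: (?P<max>\d+) "
    r"from (?P<host>\S+) to (?P<target>\S+)$"
)


def parse_record(line):
    """Parse one data line into a dict, or return None if it does not match."""
    match = RECORD_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["epoch"] = int(record["epoch"])
    record["load"] = float(record["load"])
    for key in ("min", "mean", "max"):
        record[key] = int(record[key])
    return record


def distribute(lines, data_dir):
    """Append each valid line to <data_dir>/<host>.dat (hypothetical naming).

    Returns a dict mapping host name to the number of lines written.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for line in lines:
        record = parse_record(line)
        if record is None:
            continue  # not a full load-and-response record
        path = data_dir / f"{record['host']}.dat"
        with path.open("a") as handle:
            handle.write(line.strip() + "\n")
        written[record["host"]] = written.get(record["host"], 0) + 1
    return written
```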
The graphing is done by calling gnuplot with some specialized scripts for each of the clusters.
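For a sense of what such a gnuplot script contains, here is a small Python sketch that writes one and pipes it to gnuplot. The file names, title and axis label are placeholders, and the real per-cluster scripts are certainly more elaborate. It assumes data lines in the format shown earlier, where column 1 is the UNIX timestamp and column 5 is the load value.

```python
import subprocess


def gnuplot_script(data_file, png_file, title):
    """Return gnuplot commands that plot load (column 5) against time."""
    commands = [
        "set terminal png",
        f'set output "{png_file}"',
        f'set title "{title}"',
        "set xdata time",
        'set timefmt "%s"',        # column 1 is seconds since the epoch
        'set format x "%H:%M"',
        'set ylabel "load average"',
        f'plot "{data_file}" using 1:5 with lines title "load"',
    ]
    return "\n".join(commands) + "\n"


def render(data_file, png_file, title):
    """Feed the generated script to gnuplot on standard input."""
    subprocess.run(
        ["gnuplot"],
        input=gnuplot_script(data_file, png_file, title),
        text=True, check=True,
    )
```

A response-time graph would use the same skeleton with columns 7, 9 and 11 (min, mean and max) in place of column 5.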
The current implementation is too fractured: The ask.cgi script can be completely eliminated, and stats.cgi can request and parse perf.cgi's output just as well. But the main problem remains, namely, that to measure response time and load average of the web servers, we need to interact with the web servers in a way that actually affects the results.
Ideally, stats.cgi and perf.cgi could be married into one script that simply runs from cron on all six of the cluster members. But running cronjobs on a production system doesn't sound like a good idea. So I'm back to square one: going through Apache to get to the tools that measure Apache.
There is also the issue of maintenance of two codebases (one accessible on localhost from vergil, and the same for ovid). But I'm not sure there is an easy fix for this.
Going down the road of ssh to avoid Apache requires maintenance of two crontabs (one on vergil so password-less logins to stewart work, and the same for ovid/depts), which adds to the maintenance of two codebases.
When this was moved to a depts account, the ask.cgi script couldn't be requested from individual deptss on the depts server, since they'd redirect to depts.washington.edu/webstats/webserver-stats/ask.cgi and obliterate any GET arguments. This meant that ask.cgi had to be moved to my account until a solution could be found. I resolved this by having the extended category staff added to the account so that its webtype included staff. This allowed it to load things on the deptss without being redirected.
I added support for the vieyra cluster, which serves webpages of myuw.net customers. This feature required adding a MyUW.net subscription to the account. I also put in a number of log-archiving and efficiency improvements that aren't visible from the site.
I added the load of the web development machines associated with each web server cluster to the load graph, as the performance of these machines is heavily interdependent: many sites rely on MySQL databases running on the web development machines. This feature required small changes in each of the scripts.
nikky: Added the long-term MySQL observation graphs. Changed other variables to better judge response times on high-load machines.