Dissertation Chapter 5

Chapter 5 --- Empirical Investigations

In the sections that follow two collaborative tools will be examined: DocReview (see (§4.3)http://depts.washington.edu/bkn/public/pubs/DocReview/C4/basedoc.html#dr) and Research Webs (see (Chapter 3)http://depts.washington.edu/bkn/public/pubs/DocReview/C3a/basedoc.html). DocReview is a web-based tool that allows readers of documents to become reviewers. This critical capability allows the collaborators to correct, expand, and refine the documents. DocReview is an integral part of Research Web Essays, the principal textual tool of the Research Web.

The Research Web is a customizable collaborative environment that permits the research team in a long-term, large-scale enterprise to examine an issue domain thoroughly. The Research Web (RW) has a WWW site that serves as the repository of the team’s corporate memory and research results. Tools available include a basic set that includes scholarly services of an annotatable bibliography and glossary, and an augmented web page format used for research essays. It incorporates any tool that the team finds necessary to its mission, provided that tool can be made web-compatible. Research Webs are unique, and for that reason may best be examined as case studies.

Table I Definition of Terms Used in Case Studies of DocReviews

Author, authoring team: the owner(s) of a document.
Base document: the document under review in DocReview.
Comment: the contents of an annotation returned by a reader.
Effectivity: the ratio of the size of comments received to the size of the review segment or the size of the base document.
Nature of participation: the number of people participating, the number of comments received, the volume of commentary (size), and the value [subjective] of the comments received.
Notification: e-mail sent to the reviewing team whenever a comment is received.
Quality of document, type of document: a score of value to the enterprise from 0 to 5 with zero being irrelevant and five being an essential product of the enterprise.
Review segment: a fragment, or chunk, of the base document that is the focus of annotation in DocReview. Usually a paragraph, element of a list, or a graphic insert.
Reviewing team, reviewers: people asked to participate in the DocReview.
Size (of document, comment, or review segment): the number of words longer than three letters.
Social character of comments: the content of the comments coded using the Bales IPA codes ()see §5.1.4.2 below)
Substantive character of comments: the content of the comments coded using the Meyers structurational argumentation codes (see (§5.1.4.2) below)

5.1 Case Studies of DocReview Installations
Since 1995, over 400 DocReviews have been installed. There are (or have been) at least nineteen DocReview sites, fifteen at the University of Washington, and one each at: University of California at San Diego, Haverford College, and the University of Wisconsin. Since the software has always been offered free of charge over the Internet, it may have been installed on several other servers, but these installations are likely to be inaccessible passworded intranets. The author has hosted several DocReview installations for other researchers to see if the system would help them in their work. Most of these installations were made at the request of people who wanted to use it as a tool to review independent documents, rather than as a part of an enterprise approximating the nature of a Research Web.

Five sets of DocReviews have been selected for detailed quantitative analysis in order to examine several propositions ()see §5.1.6) that arise from the basic research questions. The basis of selection was their similarity to DocReviews that might be found in active Research Webs. These DocReviews were all under the control of a knowledgeable facilitator (the author). These 101 DocReviews contained 1929 review segments that attracted 294 comments. These comments were coded into 767 Bales codes ()see Table V) and 425 Meyers argumentation codes ()see Table VI). The data was mounted in a relational database to support data conditioning and analysis. Analysis was performed with a spreadsheet program.

Two of the selected sets of DocReviews were the minutes of 59 meetings. The meetings were task-oriented meetings with an attendance averaging six members, with occasional participation of others by telephone. The minutes were quite comprehensive and averaged two pages of text. DocReview was integrated into the meeting routines by directing the attendees to review the minutes on the WWW before the next meeting. At the next meeting the scribe would distribute copies of the minutes with commentary inserted inline. The scribe would then explain how the minutes were revised in light of received annotations, and the team would then approve the minutes or suggest other changes. Usually this discussion was over in two or three minutes, thus saving considerable meeting time.

One set of seven DocReviews was sections of a draft of a professional paper. The paper was divided into seven sections in order to reduce the time required for each reviewing session. In reducing the review time, the busy schedules of the reviewers could accommodate the small time slices. The reviewers were professional colleagues of the author, some of who were involved in the design of DocReview. The author found the annotations very useful and most were incorporated into the final draft of the paper.

Another set of DocReviews was 19 workshop position papers for the 1999 conference on Computer-Supported Cooperative Learning (CSCL). Reviewing position papers was seen as an excellent application of DocReview from the beginning of design. In practice it lived up to its presumed promise. Perhaps the greatest impact was not intellectual, but in opening networking channels.

The final set of DocReviews was a set of 17 documents, Research Web Essays, written for a Research Web for the issue domain of chromium (CrVI) contamination on the Hanford Nuclear Reservation. The set was quite successful in accomplishing the objective of refining the initial versions of the documents, each of which centered on one aspect of the contamination.

5.1.1 Research Questions
The major research questions and the propositions derived from those questions are:

A. How does the behavior of dialog using DocReview compare to dialog that is face-to-face?

Does the social character of comments in DocReview differ from comments in face-to-face dialog?

Does the substantive character of comments in DocReview differ from comments in face-to-face dialog?

B. How should DocReview be segmented in order to maximize the effectiveness of the participants?

For complete DocReviews

Long base documents are ineffective relative to shorter documents.

For DocReview segments receiving comments:

The amount of commentary received on a review segment will be directly proportional to the segment's length.

The ratio of size of comments received to size of review segment will decline proportional to review segment size. (Long segments are less effective than short segments.)

C. How does the design of DocReview serve the research team?

Products similar to DocReview will emerge and will, by similarity, validate the design.

D. How does the quality of the document being reviewed affect the participation in the review?

Higher quality documents will attract more participation.

The nature of social commentary will vary with the type of document.

The nature of substantive commentary will vary with the type of document.

5.1.2 Design of Data Collection System
DocReviews have a very complete set of data. There are three types of data: text data, installation data, and transaction data. Text data includes the base document for the review and the annotations received from reviewers. The text data is contained in files on in single directory on the server (either active or archived). The text files include: the base document (basedoc.html), the HTML code for the document to be reviewed; comment files, either in a cumulative file (cummulate.html), or in files that contain the commentary on each commented review segment (point[nnn].html). Installation data is a set of parameters established by the editor at creation. Installation data includes the name and e-mail address of the sponsor, a list of e-mail addresses to be informed when annotations are received, and a descriptive title of the DocReview. The log file (log.txt) collects transactions (creation, annotation, reading and archiving) on the DocReview as they occur.

A program written by the author extracts and formats data from the files mentioned above. The program (makecsv.pl) creates several comma separated variable (.csv) files suitable for import into a database and thence to a spreadsheet program for the analysis. This program also does a word count on the base document and each of the document's review segments.

The analyst supplements two of the .csv files in order to add information that cannot be automatically extracted. A file (docrev.csv), that captures attributes of each DocReview, is augmented by including a description of the DocReview and a document type attribute designed to indicate the degree of quality, or the degree of completeness, of the document. This attribute is entered as a number from 1 to 5 and is defined as:

1 -- a sketchy document designed to collect ideas or impressions before investing any more effort in the document. A brainstorming dialog.
2 -- a first draft of a document, designed to catch major omissions, correct big mistakes and perhaps attract other authors.
3 -- a working document not intended to be advanced to publication. Minutes and position papers would fit here.
4 -- an advanced draft meant to polish the document before release as an essay for local use, or before submission to a journal for publication.
5 -- a released essay, or a published article. Under review for picking up new material as time passes, or to attract rebuttals or support.

The coder modifies the comments.csv file to add both the Bales codes (Interpersonal Process Analysis) and Meyers codes (Structurational Argumentation).

5.1.3 Quantitative Descriptive Statistics
There are three levels of abstraction used in the evaluation of DocReview: the base document, the document being reviewed; the review segments, sections of the base documents that are the annotatable units; and the annotations or comments that are made by the reviewers. This section reviews the descriptive statistics for the quantitative data collected on these levels of abstractions. The unit of analysis is a count of words of four or more characters.

Base Document
The document that is prepared for DocReview is called the base document. It varies in size, and quality (the degree of development). Very large base documents are usually broken into sections, each a DocReview, in order to allow the usually busy reviewers to complete a section at one sitting.

Data collected on the base document for a DocReview includes a word count, the document type (quality), the sponsor (author), and date of creation of the DocReview. The text of the base document is also available.

Table II Words in Base Document

Base Document Size (word count)
*Characteristic*	*All*	*Type 2*	*Type3*	*Type 4*
Mean	459.26	135.61	465.47	798.82
Median	422	130	469	598
Standard Deviation	325.27	91.259	196.71	695.66
Sample Variance	105801	8328	38693	483946
Kurtosis	20.77	-0.59	2.49	5.44
Skewness	3.44	0.48	0.88	2.22
Range	2647	309	1279	2657
Minimum	10	10	140	206
Maximum	2657	309	1279	2657
Count	100	13	76	11

The sample variances are too heteroscedastistic to employ an F test, so a logarithmic transformation was attempted in hopes of reducing the sample variances. The transform also failed to reduce the heteroscedasticity to an acceptable level. Differences between document types cannot be said to exist based on the base document word count.

Review Segments
Commentary other than general comments are directed toward a fragment of the base document called a review segment. Review segments are most frequently paragraphs or list elements (bullets), but occasionally include images or entire tables. The facilitator determines the review segments. The DocReviews in this case study were all prepared for review by the author and reflect a personal bias toward using relatively short review segments: paragraphs, at the largest; where lists are present, list elements; where large tables are presented, table cells; and individual graphic images. Section headings, bibliographic entries, and titles are usually excluded from review segments. Data collected on review segments consist of the text of the review segment and a word count.

Table III Words in Review Segments

Review Segment Size (word count)
*Characteristic*	*All Types*	*Type 2*	*Type 3*	*Type 4*
Mean	24.89	35.60	21.09	73.86
Median	14	8	14	65
Standard Deviation	28.19	43.33	20.41	55.23
Sample Variance	794	1877	417	3051
Kurtosis	17	1.5	5.6	4.7
Skewness	3.2	1.4	2.1	1.7
Range	306	165	158	306
Minimum	2	3	2	2
Maximum	308	168	160	308
Count (Number of segments)	1822	48	1656	118

The sample variances of both the raw data and a logarithmic transformation are too heteroscedastistic to perform a reliable analysis of variance. A null hypothesis of no differences between the three document types cannot be rejected. Examination of the means and standard deviations points out and obvious difference between types 3 and 4. This conclusion is supported by the nature of the genres represented: type 4 documents are drafts of conventional papers dominated with paragraph-long segments; and type 3 documents are dominated by meeting minutes composed of short segments such as action items and list bullets.

Comments
Each review segment attracts a set of comments, usually an empty set. The set may include not only comments on the review segment, but also comments on the other comments on the review segment. The comments are entirely free form, either text or HTML, and may include emphasis, paragraphing and even images.

Data collected on comments includes: the text of the comment, a word count, the name of the commentator, the commentator's e-mail address, the time and date, and the qualitative coding of the comment, both Bales codes ()see §5.1.4.1) and Meyers codes ()see §5.1.4.2). Due to the unrestricted length of comments, the unit of analysis for coding purposes must usually be a fragment of the message. The Bales codes were assigned to comments by dividing multi-sentence comments into written equivalents of speech acts. These fragments, as noted by Henri cannot be rigidly determined, but must be parsed out based on the analytic objectives dissertation(Henri 1991, 126). This same conclusion is seen in Meyers, et.al., where the units were "complete thoughts," rather than words or turns dissertation(Meyers, et.al. 1991, 53). Occasionally, there are additional, usually social, meanings that can be read into the commentary. For example, the wording of a comment may contain aggressive or supportive intent.

Table IV Words in Comments

Comment Size (word count)
*Characteristic*	*All Types*	*Type 2*	*Type 3*	*Type 4*	*General*
Mean	31.83	34.80	21.49	54.46	30.73
Median	19	22.5	12.5	43	12.5
Standard Deviation	37.61	36.98	26.79	48.01	36.77
Sample Variance	1414	1367	717	2305	1352
Kurtosis	14.6	1.2	40.3	7.9	0.8
Skewness	3.1	1.5	5.1	2.2	1.5
Range	289	122	256	288	124
Minimum	1	3	1	2	1
Maximum	290	125	257	290	125
Number of Comments	233	20	148	65	40

The sample variances are too heteroscedastistic to employ an F test, so a logarithmic transformation was attempted. The sample variances found in the transform were 1.18, 1.09, and 1.02. Using the transformed data a value of 22.1 was found for F. The value of F_2,232(.001) is 6.9, so a null hypothesis of no differences between the three document types can be rejected at the .001 level. Comments made on type 4 documents are much longer on average than comments made on type 3 documents such as meeting minutes.

In analysis of the DocReview commentary, it was discovered that the DocReviews of meeting minutes constituted a subset of commentary that demonstrated very random annotative behavior ()see Figure VII). When the comments for meeting minutes are removed from the Type 3 comments, the sample variances are too heteroscedastistic to employ an F test, so a logarithmic transformation was attempted. The sample variances found in the transform were 1.18, 1.26, and 1.01. Using the transformed data a value of 2.65 was found for F. The value of F_2,104(.10) is 2.36, so a null hypothesis of no differences between those three document types can be rejected at the .10 level.

5.1.4 Qualitative Coding Systems
Analysis of the content of the annotations must start with the selection or invention of a qualitative classification system. Many investigators have seen the wisdom of creating coding systems that are fitted closely to their problem. I chose to use existing systems, thus providing a possibility of drawing comparisons. I chose two systems, one for gauging the social functioning of research team, and the other to show how commentary became argumentation in the review of the document.

Any classification scheme must serve to differentiate between members of a group of cases. In our study the cases are DocReviews, an object that consists of a document that is partitioned into "review segments", and a set of comments made on each segment. The number of comments may be zero or more, and is usually zero. In uncommented segments, the question of implied agreement must be raised. One may be tempted to assume, since there is no limitation on reflection, that the reviewers agree with the review segment. Implied assent is very dangerous because it enables power mechanisms. No comment just means that the reviewers chose not to add to the dialog dissertation(Sheard 2000).

So how can we differentiate between the DocReviews? Certainly there are descriptive statistics such as size of the base document, the number of review segments, the number of comments, when the comments were made with respect to opening the review process, the size of the comments, and who made comments. These data were maintained in the log files, which are features of DocReview.

Beyond these physical statistics are the study of the character of the social interactions of the review team, the interpersonal process analysis (IPA); and the study of the efficacy of the review process, how the review contributed to the refinement of the knowledge represented by the document. Both the IPA and studies of efficacy can be conducted only by analysis of the content of the annotations. Measurement of the value of the comments to the collaboration is quite impossible in most cases, but a qualitative categorization of comments can be done by at least two classification schemes: an observational scheme and a scheme based on how the comments would fit into a formal argument. We must then code the DocReview multilogues twice, once for the social dimension of process-orientation and again for the knowledge content dimension of task-orientation.

To analyze the interpersonal process analysis of behavior in DocReviews, I classified the annotations using the Bales' codes dissertation(Bales 1950, 9), a well developed and respected tool. Analysis of how comments within a DocReview contributed to the knowledge-building content of the document will be conducted using a coding system based on the function of the comment from a task-oriented viewpoint, rather than from a social viewpoint as in IPA. The task-oriented functions are defined as the character of the comment (or comment fragment) in a formal argumentation framework. Meyers, Seibold and Brashers developed this coding system that was based on, and extended from, their previous work dissertation(Meyers, Seibold and Brashers 1991).

Classification schemes need to satisfy three conditions dissertation(Bowker and Star 1999, 10):

There are consistent, unique classificatory principles in operation.
The categories are mutually exclusive.
The system is complete [exhaustive dissertation(Cohen 1960, 38)].

The coding schemes I use vary in compliance with these desiderata. Bales codes are not complete; there is no place for nonsense or muttering. The Bales codes are not mutually exclusive, they instead are derived from four fairly distinct major categories that are each divided into three quasi-ordinal codes that have very fuzzy boundaries, e.g. what is the difference between giving information and giving an opinion? Bales attempts to close the ambiguities in the codes by a very thorough explanation of each dissertation(Bales 1950, 177-195), but overlaps and gaps exist.

Meyers' scheme provides a less complete guide to coding dissertation(Meyers et.al. 1991, 54) (appropriate for a research article as opposed to Bales' book). Both coding schemes are well described and coders can become facile with them in a reasonable time period. In reference to mutual exclusivity, a continuous system like Bales' IPA, must have fuzzy boundaries. Meyers' system is not a continuous system so is immune from this argument.

Meyers' scheme neatly solves the completeness problem with the introduction of the category "non-arguable." Fortunately, this category can contain no contextual knowledge, so it can safely be excluded from our analyses. Bales asserts that his categories are made complete and continuous by being concerned with the interaction content rather than the topical content and by eliminating any requirement for the observer "to make judgments of logical relevance, validity, rigor, etc." dissertation(Bales 1950, Chapt. 2).

Correct assignment of codes could perhaps be tested by comparing actual results from dialog in the source research and the coding of the same material by the author. In short, such testing would require studying intercoder reliability between the teams of Bales and Meyers and the team (myself) that would code the annotations. Bales offers six pages of coded dialog dissertation(Bales 1950, 93-99). Meyers et. al. offers some short examples. Both papers do offer good definitions of the categories. The categories are based on dialog quite familiar to any literate individual. A larger issue is the absence of gestural side-channel communication (head nodding, eye-rolling) in DocReview. As face-to-face dialog would present frequent "speech acts" that are gestures, facial expressions, or voice tones, there will be a loss of that dialog in the coding of DocReview annotations. This loss may account for some of the significantly lower "social-emotive" codes in the DocReview annotations.

I can only compare DocReviews to DocReviews since there was no attempt to set up a control review method by other means. In the DocReview study, all DocReviews use the WWW and are thus device independent. Usually, the participants within a given set of DocReviews are homogeneous, though between sets, they may vary in number. The same task is always performed: review of a document, though the nature of the documents may change (meeting minutes, position papers). Almost all users are invited, since most DocReviews are on intranet sites. Other than the exceptions noted, most dependent variables are identical. Most studies that apply IPA compare computer-mediated communication with face-to-face communication. In a meta-analysis of studies of computer-mediated collaboration, McGrath and Berdahl dissertation(McGrath and Berdahl 1998) make several cautionary points based on differences between face-to-face interaction and computer-mediated interaction: studies often use different computer systems; different kinds of participants are used; different types of tasks are performed; and there are different patterns of dependent variables.

5.1.4.1 Interaction Process Codes
These codes are intended to assign speech acts, including backchannel communication, to categories that are based on social processes rather than substantive content. Since we are social animals, the nature of our dialog will to a great extent determine both how we respond emotionally to our collaborative environment and how effective that environment is in attracting productive participation.

The Bales Code
Commentary of hyperdocuments through DocReview can be evaluated by use of categorization, volume and quality. DocReview comments can be categorized by using Bales codes dissertation(Bales 1955). Depending on the issue domain, these codes can be used to order value between categories. For instance, detection of errors in spelling or grammar is a low value contribution in studies of social behavior, but a high value contribution in the development of a manifesto or epic.

Table V Bales Code for Acts in Social Interaction

Main Categories	Frequency	Types of Acts	Frequency
Positive reactions	25.9%	Shows solidarity	3.4%
		Shows tension release	6.0%
		Shows Agreement	16.5%
Problem-solving attempts	56.0%	Gives suggestion	8.0%
		Gives opinion	30.1%
		Gives information	17.9%
Questions	7.0%	Asks for information	3.5%
		Asks for opinion	2.4%
		Asks for suggestion	1.1%
Negative reactions	11.2%	Shows disagreement	7.8%
		Shows tension	2.7%
		Shows antagonism	0.7%

--- after Bales, 1955

Commentary that expresses support or disagreement is not valueless, for such commentary does influence the behavior of the author and other contributors. So most commentary is of some value, even if it is merely reinforcing the recognition of a team effort. Sadly there are comments of negative worth that occasionally emerge, such as personal attacks or senseless graffiti.

Gay et.al. and classroom discussion forums
Geri Gay and others studied the character of student contributions by computer-mediated communication in university classes dissertation(Geri Gay et.al.1999). The discussion forums were conducted in CoNote, a WWW-based annotation program functionally similar to DocReview. Gay's study included questionnaires and observer data as well as a repository of documents and comments thereon. Gay's codes, like Bales' codes, are not based on the relationship of the annotation to the collaboration task, but on the character of interpersonal activity. Content of the annotations was organized into three categories: technical comments, affiliative comments and advice. Presumably, a single comment could contain all categories, but not multiple occurrences of a category. 197 comments produced percentages of 50.3 technical, 45.2 affiliative, and 68.5 advice. These percentages were obtained in an environment dominated by students who came into frequent contact, thus by age and group structure more inclined to engage in affiliative commentary than professional groups might be.

These codes are equivalent to portions of the twelve category Bales Codes for Interpersonal Activity. The affiliative comments, which presumably could be positive or negative, would fall into one of six categories: Shows Solidarity, Shows Tension Release, Agrees, Disagrees, Shows Tension or Shows Antagonism. The technical comments would fall into the neutral task-oriented area: Gives Opinion, Gives Orientation, Asks for Orientation, Asks for Opinion. The advice category corresponds to the extreme range of the task-oriented area: Gives Suggestion and Asks for Suggestion.

5.1.4.2 Argumentation Based Codes
If research is analogous to argumentation, as Eisenhart and Borko suggest dissertation(Eisenhart and Borko 1993), then a coding system that is based on the argumentation process would seem to be a more effective alternative for characterizing task-oriented activity than the more process-oriented Bales IPA coding. The value of a comment fragment (coding unit) to the collaboration is more closely related to task than process. Perhaps we can assign a value to a specific type, or if the coder is familiar with the document, we can actually assign an interval measure for value. Three coding systems have been considered: informal argumentation codes, structurational argument codes, and an observational categorization.

Informal Argumentation
In An Introduction to Reasoning, Toulmin, Rieke and Janik develop a dialog classification based on argumentation dissertation(Toulmin, Rieke and Janik 1979). Their system is proposed to be the basis for development of a tool (The Landscape of Reason) to organize dialog for the Research Web. Argumentation is broadly defined in this work, having a place in any "rational enterprise." As the authors put it, "... scientific arguments are sound only to the extent that they can serve the deeper goal of improving our scientific understanding." Every coding unit of a comment can be assigned a type based on this classification. The value of the comment in terms of value to the collaboration can be established through a surrogate, the value of the comment in the argument. There are six elements in argumentation: claims, grounds, warrants, backing, modal qualifiers, and rebuttals.

Claims are assertions put forward publicly for general acceptance. In DocReview terms, every review segment is a claim. The claim is that the review segment, whatever its nature within the document, presents and argues its proposition well, and conforms to accepted standards. For example, the role of a review segment within the document may be that of, say, a rebuttal. Comments directed toward the review segment can, in a recursive way, present grounds, warrants, backing, modal qualifiers and rebuttals to the review segment's basic claim as a rebuttal.
Grounds are facts that support a claim. Comments may be directed toward the grounds given, in our example the grounds supporting a rebuttal.
Warrants are ways of describing how one can validly draw a conclusion from the grounds offered. This is the argument in argumentation.
Backing makes explicit the experience relied on to establish the trustworthiness of the warrants. In scholarly documents, citations of literature are the principal means of supplying backing.
Modal qualifiers are statements that show what kind and degree of reliance is to be placed on the conclusions.
Rebuttals are statements showing exceptional circumstances where the conclusions might be undermined.

Structurational Argument Codes
In research on decision-making discussions in a face-to-face environment, a set of seventeen categories describing statements in terms of their place in argumentation was developed and used by a team that studied 45 discussions. This research had its roots in research by Toulmin (in 1958) and two other research teams in 1969 and 1980 dissertation(Meyers, Seibold and Brashers 1991, 50). I can find no subsequent application of this coding scheme in the literature. Coding is extremely difficult, as meanings can shift with context. The coder must be thoroughly immersed in the argument, not just the words, but also the intent of the words.

In Meyers et.al. discussions were analyzed with 8,408 codes produced, having the distribution given in the following table dissertation(Meyers et.al., 45). This dissertation found 425 codes in the DocReview annotations.

Table VI Structurational Argumentation Codes

ARGUABLES (67.4%)	Potential Arguables	*Assertions*	Statements of fact or opinion.
	Potential Arguables	*Propositions*	Statements that call for support, action or conference on an argument-related statement.
	Reason-using Arguables	*Elaborations*	Statements that support other statements by providing evidence, reasons, or other support.
	Reason-using Arguables	*Responses*	Statements that defend arguables met with disagreement.
	Reason-giving Arguables	*Amplifications*	Statements that explain or expound upon other statements in order to establish the relevance of the argument through inference
	Reason-giving Arguables	*Justifications*	Statements that offer validity of previous or upcoming statements by citing a rule of logic (Provide a standard whereby arguments are weighed).
REINFORCERS (13.6%)		*Agreement*	Statements that express agreement with another statement.
REINFORCERS (13.6%)		*Agreement +*	Statements that express agreement with another statement and then go on to state an arguable, promptor, delimitor, or non-arguable.
PROMPTORS (2.3%)		*Objection*	Statements that deny the truth or accuracy of any arguable.
		*Objection +*	Statements that deny the truth or accuracy of any arguable and then go on to state another arguable, promptor, delimitor or nonarguable.
		*Challenge*	Statements that offer problems or questions that must be solved if agreement is to be secured on an arguable.
DELIMITORS (2.1%)		*Frames*	Statements that provide a context for and/or qualify arguables.
		*Forestall/Secure*	Statements that attempt to forestall refutation by securing common ground.
		*Forestall/Remove*	Statements that attempt to forestall refutation by removing possible objections.
NONARGUABLES (14.5%)		*Process*	Non-argument related statements that orient the group to its task or specify the process the group should follow.
		*Unrelated*	Statements unrelated to the group's argument or process (tangents, side issues, self-talk, etc.)
		*Incomplete*	Statements that do not provide a cogent or interpretable idea (due to interruption, stopping to think in midstream, but are completed as a cogent idea elsewhere in the transcript.

after Table 1 Meyers, Seibold and Brashers dissertation(Meyers, Seibold and Brashers 1991, 54-55)

While Meyers et.al. conclude that the structurational argumentation codes reflect both process-orientation and task-orientation (or system and structure, as they put it); the coding scheme clearly supports task-orientation much better than the Bales IPA. In terms of support to a collaborative task, some categories have more value than others.

These argument codes provide places for every element in the Toulmin informal argumentation scheme. The nonarguables Process and Unrelated are very convenient "bins" for trivial or procedural content. One of the seventeen codes is extremely unlikely to be used: the nonarguable Incomplete. The argument codes were developed to analyze transcripts of face-to-face interactions, an environment where interruptions are frequent. It is difficult to imagine how an asynchronous contribution could be interrupted; if the writer is interrupted at the terminal, then the task can be resumed when the interruption terminates.

The Meyers, et.al. study used transcripts of actual face-to-face multilogue, with recourse to videotape only when the expression needed clarification dissertation(Meyers et.al. 1991, 56). Interruption and incomplete expressions were frequent, as in normal conversation. The computer-mediated environment of DocReview will make interruption unlikely and incomplete thought rare. I expect the distribution of message fragments in DocReviews to be quite different from conversational multilogues. As McGrath and Berdahl cautioned, these differences may be due to many different factors dissertation(McGrath and Berdahl 1998); nevertheless, if the differences are great, the argument in favor of computer-mediated communication as a more reflective medium gains support.

An Observational Categorization
The author's five years of experience in the use of DocReview has led to a potential coding system based on observation and sorting. Interpretation and characterization of the codes are based not only the original context of the commentary, but on assumptions of what character the comments would take in a fully implemented Research Web.

This scheme categorizes several nominal classes of comments seen in DocReviews. It has the advantage of being completely specific to DocReviews; that is it is not time restricted, and is asynchronous, document-centric. Most DocReview review segments, especially paragraphs, will contain an assertion, a conclusion and give evidence showing how the conclusion follows from the assertion. In addition to this logical imperative (substantial) there is also the requirement to conform to appropriate standards of scholarship and presentation (formal). In the Research Web environment, the documents are also subject to both the criticism process and an editing process.

Table VII
Observational Categorization of DocReview Annotations

	*Substantial*	*Formal*
Editing Process	Supplies references and citations. Supplies new information or examples. Suggests deleting information.	Corrects spelling or grammar. Questions document layout. Questions sentence or paragraph structure.
Criticism Process	Questions validity of statements. Gives opinions or suggestions. Supports or rejects substance with grounding.	Supports a comment w/o supplying new information. Disagrees with a comment w/o supplying new information. Questions or discusses the philosophical bases of the document.

5.1.5 Qualitative Coding Reliability
In the analysis of the data, the distribution of codes in the DocReview commentary is compared to the distribution of codes in the studies that defined the codes. In comparing the distributions, there is the necessary assumption that all coding would be consistent and correct. Bales points out three sources of variation between coders: unitizing, the correct parsing of dialog into units of analysis; categorizing, correct assignment of codes; and attributing, the source and target of the dialog dissertation(Bales 1950, 101). There is, in the DocReview analysis, no question of the source and target. Because this dissertation was not well funded, the author did all coding, so the skill and consistency of coding was not established by comparing the coding of dialog by independent coders.

Unitizing is a significant source of variability. The variability in unitization is induced by uncertainty in interpretation. Some methods of unitizing are less susceptible to variability than others. Time-based unitization, segments of elapsed real time, are not subject to interpretation dissertation(Nyerges et.al. 1998, 141). Turn taking in speech dialog is more variable due to complications that arise in parsing of monologues; annotations in DocReview are essentially monologues. Parsing face-to-face dialog into speech acts (Bales) is yet more variable because there is a need for insertion of implied speech acts and gestural acts. Even more variable is the event-based coding that was used in the argumentation coding (Meyers). Nyerges et.al. chose time-based coding over event-based coding because event-based coding required at least two coding passes dissertation(Nyerges et.al. 1998).

In the Bales coding, DocReview annotations were parsed during coding into approximations of "speech acts" by dividing the annotation into phrases, sentences or a set of contiguous sentences that dealt with a single topic. Not infrequently when the coder understands both the review segment and an annotation well, implied codes emerge. One comment usual contained a few codes (mean = 2.6) sometimes as many as a dozen. This parsing is assumed to be equivalent to the turn taking of face-to-face dialog.

In the argumentation coding, the unitizing protocol used in Meyers et.al. could not be employed since their unitizing was done by two judges concurrently. As Meyers used transcripts of dialog, so I used written dialog. The unitizing rule that Meyers et.al. used was: "any statement that functioned as a complete thought or change of thought." The Meyers team coded dialog that was parsed into turns, while DocReview comments are relatively long monologues. Rather than parsing the monologue into speech acts I parsed it into argument units that might include several sentences. Such units fit well into the Meyers categories. One comment usually contained one to a few codes (mean = 1.4) sometimes as many as eight.

Coding and unitization of DocReview annotation requires the coder to place the annotation into the context of the review segment being annotated. This contextualization is done by mentally converting the annotation unit and review segment into a narrative equivalent. Unfortunately, returning to the exact same mind set is difficult for either independent judges or for the same coder repeating the coding at a later time.

5.1.5.1 Coding Reliability Tests
In order to test the reliability of the coding, it was decided to take a 12.5% random sample of all review segments that received comments. The author, who was the original coder of the entire set of comments, then recoded this sample. There was no recall of the original coding.

Four sets of codes were tested for reliability: the Bales codes (twelve categories), the Bales categories (four sets of three codes each), the structurational argumentation codes (seventeen categories), and the five structurational argumentation categories derived from the seventeen codes.

5.1.5.2 Data Conditioning
The parsing of DocReview annotations into coding units (unitizing) proved to be uncomfortably variable. It seemed that the degree of engagement by the coder was the principal source of variability. When coding the annotations, the context had to be set by reading and understanding the review segment and then interpreting the annotation in context. When the coder is well engaged, the dialog shows more nuances (codable) than a perfunctory reading would provide. Of course coder drift and fatigue contributed to the variability too. If the coder is heavily engaged in reading between the lines, and sees and records an implied code that another does not, then there will be a difference in the number of codes. The two code strings may not align: for instance coder 1 codes "acbbbca" and coder 2, missing the implied code, codes "cbbbca".

Aligning codes at the beginning gives:

acbbbca
cbbbca

This results in 2 agreements, 4 disagreements, and one not matched.

If on the other hand we align like this:

acbbbca

cbbbca

Then we have a probably more accurate analysis of six agreements and one not matched.

If such realignment is allowed it is subject to much abuse, so I allow only a shift of the entire shorter code string within the limits of the longer code string. If the code strings are of equal length, then no shifting is allowed. Any unmatched codes resulting from unequal code string lengths are removed. Both Bales and the structurational argumentation codes were conditioned this way, and the resulting conditioned data was converted to the aggregated categorical data (the four Bales categories and the five structurational argumentation categories).

5.1.5.3 Analysis
Intercoder or recoder reliability can be measured by several methods. Cohen dissertation(Cohen 1960) and Landis & Koch dissertation(Landis & Koch 1977), in their examples, use nominal categories that are clear, complete and mutually exclusive. On the other hand Perreault and Leigh use more qualitative (though unstated) codes dissertation(Perreault and Leigh 1989). On this basis, plus favorable arguments from the Meyers et.al. paper, I am inclined to use the Perreault and Leigh measure. Since Cohen's kappa is so widely used, I include it for comparison purposes.

The conditioned data were placed in contingency tables comparing the two coding sessions. From the contingency tables, Cohen's kappa and Perreault and Leigh's Index of Reliability were calculated for the four sets of data.

Bales codes
From the initial set of 99 Bales codes, there were 82 codes remaining in the conditioned data. Each code could assume one of twelve values. Comparing the two sets showed 54 pairs in agreement, 28 pairs in disagreement and 17 unmatched codes. Cohen's kappa dissertation(Cohen 1960) for the Bales codes is 0.538, showing only moderate agreement between the two coding sessions dissertation(Landis and Koch 1977, 165). The Index of Reliability dissertation(Perreault and Leigh 1989) is 0.792 with a 95% confidence level of +/- 0.088. This mediocre result, in conjunction with some very low counts of several codes, provided the argument to use only the four Bales categories in the analysis of DocReview annotations.

Bales categories
In analyzing the four Bales categories, each code could assume one of four values. Comparing the two sets showed 80 pairs in agreement, 2 pairs in disagreement and 17 unmatched codes. For the Bales categories, Cohen's kappa is 0.878, showing almost perfect agreement between the two coding sessions. The Index of Reliability is 0.984 with a 95% confidence level of +/- 0.027.

Structurational argumentation codes
From the initial set of 70 structurational argumentation codes, there were 48 codes remaining in the conditioned data. Each code could assume one of seventeen values. Comparing the two sets showed 21 pairs in agreement, 27 pairs in disagreement and 22 unmatched codes. Cohen's kappa for these codes is 0.402, showing only fair agreement between the two coding sessions. The Index of Reliability is 0.668 with a 95% confidence level of +/- 0.133. As with the Bales codes, there were a large number of codes with low to zero counts.

Structurational argumentation categories
In analyzing the five structurational argumentation categories, each code could assume one of five values. Comparing the two sets showed 28 pairs in agreement, 20 pairs in disagreement and 22 unmatched codes. Cohen's kappa is 0.383, showing only fair agreement between the two coding sessions. The Index of Reliability is 0.673 with a 95% confidence level of +/- 0.133.

5.1.5.4 Conclusions
It was presumed that the Bales codes would measure the social aspect of DocReview "dialog." The reliability of the coding was acceptable, especially for the four Bales categories. It must be noted however that the strong concentration of the codes into the positive task-oriented category results in reliability that is perhaps misleading.

The structurational argumentation codes were too numerous and difficult to code to produce acceptable reliability. Applying argumentation codes to analysis of DocReview annotations will require the use at least pairs of coders working together (as Meyers et.al. did). The unitization problem was extremely serious, producing almost a one third rate of no matching codes. The combination of arbitrarily long review segments and arbitrarily long annotations will demand a very clever unitization scheme to produce any hope of consistent coding.

5.1.6 Analytical Results
The proposition designations below (e.g. A2) refer to the research questions discussed in §5.1.1. Three techniques were used to test the propositions: Chi-squared, regression analysis, and case studies.

Four of the propositions use the Chi squared test comparing the counts of DocReview codes versus the coding distributions in the original Bales and Meyers studies. In order to normalize the sample sizes a pseudo-sample of the Bales or Meyers codes was drawn with the same distribution as in the original studies but with a size equal to the DocReview sample.

Four of the propositions were tested using single variable regression analysis. In all these cases the independent variable (X) was the word count of the base document or a review segment of the base document. In some cases the dependant variable (Y) was confounded with the independent variable. This confounding was due to the definition of effectivity as the ratio of commentary to the size of the document (effectivity = Y/X). The shape of the best fitting regression line was found to be logarithmic.

One of the propositions was a case study comparing DocReview to three other web-based annotation programs. The comparison was made on the basis of a universe of features found in all the programs.

5.1.6.1 Proposition A1. The social character of comments in DocReview differs from comments in face-to-face dialog.

One of the most important questions arising from the use of DocReview is how the nature of dialog in DocReview is different from face-to-face dialog. Fortunately we have from Bales' work a distribution of codes assembled from thousands of face-to-face speech acts. If one makes the assumption that DocReview annotation is equivalent to one side of a face-to-face dialog, and further assume that in face-to-face dialog the two participants each produce an identical distribution of coded speech acts, then we can make a valid comparison. The assumption of equivalence is strained by the odd nature of this communication: essentially the document is the source of a series of propositions. The annotation is a set of responses to the proposition presented in the review segment by the readers. This set of responses is also complicated by the not infrequent presence of commentary on other annotations.

Operationalization:
Assigning Bales codes categories to all annotations operationalizes the social character of the comments. The Bales Interaction Process Analysis categorizes all speech acts, including gestures, into twelve codes. The differences between some of the Bales codes are very slight. These fine nuances result in a high variability between coders or between coding sessions by the same person. In order to reduce the intercoder variability it was decided to use Bales' broader classification: categories. Bales grouped the twelve codes into four categories that are generic and form a good basis of comparison. These categories are: positive reactions, problem-solving attempts, questions, and negative reactions. Problem-solving Attempts and Questions are further generalized into a supercategory of the task area, while Positive and Negative Reactions are generalized into the social-emotive area.

Table VIII Bales Interaction Process Analysis Codes

from dissertation(Bales 1950)

Data conditioning:
None.

Data Analysis:
The counts of codes of the entire set of DocReview annotations by Bales category demonstrates that DocReview annotations show a much higher degree of task-related dialog and a much lower degree of social-emotive dialog than is seen in face-to-face dialog. The comparisons (DocReview/face-to-face) are: for Negative Reactions -- 0.1%/11.2%; for Questions -- 7.3%/7.0%; for Problem-Solving Attempts -- 85.5%/56.0%; and for Positive Reactions -- 7.0%/25.9%.

Figure I Distribution of Bales Codes

We find that the null hypothesis that there will be no difference between face-to-face and DocReview dialog when Bales coded can be rejected. With three degrees of freedom, Chi squared = 213.2. This result is significant at <0.000001.

Discussion of Findings: The very low percentages of DocReview annotation in the social-emotive area (positive and negative reactions) may show the effect of moderation in dialog induced by the reflection afforded by DocReview as opposed to the more spontaneous nature of face-to-face dialog. The similarly low, though less extreme, percentages in the positive reactions category may show that there is less need felt for social reinforcement than in face-to-face dialog. Though DocReview annotations show less positive reinforcement, the reinforcement is there, it is simply less effusive. Questions (task area: negative) show an almost identical distribution. Problem-solving attempts (task area: positive) are much higher in DocReview annotation than in face-to-face dialog. This disparity may be the result of the ability of the reader to reflect much longer than is possible in face-to-face dialog. I suggest that this is the most important finding, demonstrating the value of DocReview in problem solving.

Interpretation of Findings:
The conclusions must be tempered with the realization that there are no gestural acts in the DocReviews and their annotations. While Bales does not record the percentages of gestural acts captured in his research, in his description of the codes gestures such as winks, nods, frowns, and even blushing appear. From Bales' description of the codes one can clearly see that most gestural acts are in the social-emotive categories. If an arbitrary portion of the Bales social-emotive codes (comprising 37.1 % of the total face-to-face acts) was assumed to be gestural, then in the annotation coding the missing percentage would need to be reassigned from the task oriented categories. This reassignment would cause the comparison between positive task-orientation to be somewhat less marked, and the comparison between negative task-orientation would shift from being almost equal to somewhat less negative than in face-to-face dialog.

5.1.6.2 Proposition A2: The substantive character of comments in DocReview differs from comments in face-to-face dialog.

The substantive nature of comments in DocReview is measured by determining the intent of the comment, or a portion of the comment. Intent is defined in this analysis as what place the comment would take in argumentation.

As in the analysis of social character of the comments above in Proposition A1, we have to assume that the dialog is quite one-sided, with the document providing propositions and the readers arguing with that proposition. Clearly there can be no negotiation of meaning and the document can make no rebuttals. In terms of argumentation, then we can have but one round of argumentation, but with several people participating.

Operationalization:
Assigning Meyers structurational argumentation code categories to each comment operationalizes the substantive character of the comments.

Data conditioning:
The raw data percentage comparisons (DocReview/face-to-face) are: for non-arguables -- 22.6%/14.5%; for delimitors -- 8%/2.1%; for promptors -- 23.1%/2.3%; for reinforcers -- 10.3%/13.6%; and for arguable -- 36%/67.4%.

Argumentation codes in the non-arguable category in the dialog were excised. In the raw data, DocReview annotations were 22.6% non-arguable, compared to 14.5% in the Meyers study. The difference in non-arguables is attributed to the assignment of annotations frequently complaining about grammar and spelling to that category. Arguably such commentary does not contribute to productive argumentation, and furthermore such corrections are seldom made in face-to-face dialog.

Codes in the arguable class were also excised. Difficulties in adjusting for the asymmetrical nature of DocReview argumentation are simply insurmountable. In the one turn dialog, responses to propositions (the base document's review segment) are much more prevalent than responses to annotations. Responses to annotation usually requires re-reading the comments; busy participants are not likely to return to review comments, even if they are reminded by e-mail notification. This would not be the case in face-to-face argumentation.

The data conditioning leaves us with three categories of codes: Reinforcers, Promptors and Delimitors. Unfortunately the excision of troublesome categories reduces our number of data points by 58% to 176. Since the central action of argumentation is carried out in these categories, I feel that they are an adequate basis for comparison.

Data Analysis:
The conditioned data comparisons (DocReview/face-to-face) are: for reinforcers -- 25%/75.6%; for promptors -- 55.7%/13.3%; and for delimitors -- 19.3%/11.1%.

Figure II Substantive Commentary (DocReview vs. F2F)

Comparing face-to-face distributions to the distributions found in the DocReviews shows a very strong difference in both promptors and reinforcers. There are four promptors in DocReviews for each face-to-face promptor and three face-to-face reinforcers for every DocReview reinforcer.

We find that the null hypothesis that there will be no difference between face-to-face and DocReview dialog when Meyers coded can be rejected. We find that with two degrees of freedom, Chi squared = 93.3. This result is significant at <0.000001.

Discussion of Findings:
The differences of face-to-face argumentation and DocReview annotation are clear: people are much more inclined to suggest changes to the document in DocReview than in face-to-face dialog; people are much less inclined to agree with the document in DocReviews than they are in face-to-face dialog. I see this finding as suggesting that there may be some satisficing occurring as people are less inclined to annotate texts that they see as not far enough wrong to complain about. The vast difference in promptors may be explained by the nature of DocReview: documents are mounted with the intent of drawing out errors and omissions. A portion of the differences may also be explained by social mechanisms: it is much easier to praise than object; and power effects may also be seen as people are more inclined to agree with a proposition offered in a meeting (usually by a leader).

5.1.6.3 Proposition B1: Long base documents are ineffective relative to short documents.

The lives of researchers are fragmented into scores of tasks of varying importance. This produces the need to engage in multitasking, a mosaic of activity that fills the available time with periods of variable lengths. There will be short periods to review documents, provided they are of a size that will fit into the time slot. Very long documents may encourage a shallow reading, thus shallow and short commentary.

Operationalization:
Effectivity is operationalized as the ratio of the sum of comment size to the size of the base document. Size of comments and base documents are both established by software that counts the words of more than three characters. For each DocReview that attracted annotation (n = 78), the word counts for annotations to segments that attracted annotation for each DocReview were accumulated in one column and the word count for the DocReview was placed in another. The DocReview word count was plotted on the X-axis and the effectivity on the Y-Axis.

Data conditioning:
Records for DocReviews that attracted no commentary were excluded. A DocReview with segments containing graphics was excluded due to the low word count in the segments, and the heavy annotation of the segments. The same DocReview contained an anomalously long general comment.

Data Analysis:
A correlation of 0.665 on the logarithmic regression line confirms the hypothesis. With 77 degrees of freedom a value of F = 60.1 was found. As expected the slope was negative with P = 3.27 x 10^-11. The P value of the intercept was 1.64 x 10^-12. A study of DocReviews by document type ()§5.1.6.7) suggests that the logarithmic relationship is even stronger among base documents that are not meeting minutes.

Figure III Effectivity by Document Size

Discussion of Findings:
The hypothesis is accepted. Smaller base documents produce more effective DocReviews. This leads to the conjecture that fragmenting a very long document will increase the effectivity of the review process. This conjecture could be tested, but not with the data from this study.

5.1.6.4 Proposition B2: The amount of commentary received on a review segment will be directly proportional to the segment's length.

An extremely long review segment may tax the reader’s concentration, leading to a decline of effectivity. Short review segments such as list "bullets" are sharply focused and easy to grasp and critique. Due to a small denominator, the effectivity of such short segments may be inflated. The deleterious effect of long review segments is one of the basic assumptions of the design of DocReview.

Operationalization:
Size of comments and base documents are both established by software that counts the words of more than three characters.

Data conditioning:
Segments not attracting annotation are removed. Segments that were graphic images were discarded. General comment segments were discarded.

Data Analysis:
A correlation of 0.235 on the linear regression line weakly confirms the hypothesis showing a direct relationship between segment size and received annotation. With 49 degrees of freedom a value of F = 2.80 was found. As expected the slope was positive with P = 0.101. The P value of the intercept was 0.391.

Figure IV Commentary Accumulated by Segment Size

Discussion of Findings:
The hypothesis is accepted. Commentary size is directly proportional to segment length; but while larger segments attract more commentary due to the positive slope, but they are not necessarily more effective (see §5.1.6.5) as seen by the low value (<1.0) of the slope of the regression line.

5.1.6.5 Proposition B3: The ratio of size of comments received to size of review segment (effectivity) will decline in proportion to review segment size.

Short entries in lists and cells in tables are very sharply focused and when they attract annotation, the annotations are likely to contain more information than the entry (effectivity > 1.0). The context of lists and tables are usually quite clear and contributes to their focus. When long segments such as paragraphs receive annotation, they are likely to contain less information than the segment.

Operationalization:
Size of comments and review segments are both established by software that counts the words of more than three characters.

Data conditioning:
For this analysis general comment segments were excluded, as they are not focused review segments. Segments that applied to graphic images were removed because the number of words in the graphic segment is simply the number of words in the title, and a picture is indeed often worth a thousand words. At this point outliers were examined and one more point was removed. This outlier was a document section heading that drew much commentary from the review segments within the section. Making a section heading a review segment is an error on the part of the facilitator; section headings are for ease of reading and are devoid of real content.

Data analysis:
The remaining segments that received comments were selected and two columns were produced by database query: size of the segment and the summation of the size of the commentary on the segment. This table was imported into the spreadsheet. For each segment, the size of the commentary was divided by the size of the segment to yield effectivity. A column was created for the effectivity. An XY scattergram was produced with segment size on the X-axis and effectivity on the Y-axis. A correlation of 0.451 on the logarithmic regression line confirms the hypothesis. With 184 degrees of freedom a value of F = 46.8 was found. As expected the slope was negative with P = 1.1 x 10^-10. The P value of the intercept was 1.1 x 10^-18.

Figure V Effectivity by Segment Word Count

Discussion of Findings:
The hypothesis is accepted, with strong indications that effectivity decays logarithmically rather than linearly. This hypothesis is also supported by style guides for printed text dissertation(Zinsser 1980, 111), dissertation(Strunk and White 1979, 15) and for the WWW dissertation(Nielsen 2000, 110 et.seq.). Long paragraphs are problem-laden when reading from a screen: scrolling may be required, especially when small displays are used and when the user has the font size increased to compensate for poor eyesight. When the user has set the window to single column width, even moderate length paragraphs may need to be scrolled.

5.1.6.6 Proposition C1: Products similar to DocReview will emerge and will, by similarity, validate the design.

At least four other web-based annotation products have been put into service. One of these (Third Voice) was forced to withdraw after it was subjected to numerous lawsuits centered on copyright issues, specifically allowing anyone to copy any publicly available web page on someone else's web site for annotation.

Since DocReview's debut in 1995, three similar products have emerged: Living Documents in 1998, PageSeeder in 2000, and QuickTopic in 2001. The four products may be compared on a set of core features. The core features are: notification service, in-line commentary option, security, segmentation flexibility, comments on comments, general comments, and review all comments.

Operationalization:
The three products are compared on a set of core features.

A DocReview demo may be used at http://faculty.washington.edu/~bkn/DocReview/review.cgi?name=DrDemo.

Several Interactive Papers may be examined at http://lrsdb.ed.uiuc.edu:591/ipp/.

A Document Review may be examined at http://www.quicktopic.com/6/D/QXx3sZA2kptQpnq9Rqwv.html.

A PageSeeder demo may be used at http://ps.pageseeder.com/ps/ps/demos/tryit/choco/choco.pshtml.

Table IX Annotation Program Features

	*DocReview*	*Interactive Papers*	*QuickTopic Document Review*	*PageSeeder*
Notification Service	Yes	No	Yes	Yes
In-line Commentary	Yes, click for alternative format.	Yes, by request.	No	Yes, no other alternative format.
Security	Yes, your server.	Yes, your server.	By obscure URL.	Yes, commercial service.
Segmentation Flexibility	Yes	No	No, paragraphs and list elements only.	No, chunks only.
Comments on Comments	No, by design.	Yes, three deep.	No, by design.	Yes, unlimited.
General Comments	Yes	Yes	Yes	No
Review all comments	Yes	No	Yes	No

Discussion of Findings:
DocReview's design has been validated by the similarity of several commercial and academic products that were developed in the five years following DocReview's original release.

5.1.6.7 Proposition D1: Higher quality documents will attract more participation.

Document quality may be categorized on an ordinal scale. Degree of completion on a scale from conceptual sketches to completed canonical documents. We have categorized the documents on a five-valued quality scale ()see §5.1.2).

Operationalization:
Participation is considered equivalent to effectivity and is operationalized as the ratio of the sum of comment size to the size of the base document. There were three document types represented: types 2, 3, and 4.

Data Conditioning:
DocReviews without comments were discarded. A DocReview with segments containing graphics was excluded due to the low word count in the segments, and the heavy annotation of those segments.

Data analysis:

Table X Effectivity of DocReviews by Document Type

*Category*	n	*total words in documents*	*total words in commentary*	*effectivity*
Type 2	10	1302	696	0.535
Type 3	58	27636	3181	0.115
Type 3 w/o minutes	8	4433	909	0.205
Type 4	10	8581	2914	0.340
All Types	78	37519	6791	0.181

The DocReviews that received comments were analyzed and two columns were produced by database query: size of the base document and the summation of the size of the commentary on the DocReview. This table was imported into the spreadsheet. For each DocReview, the size of the commentary was divided by the size of the base document to yield effectivity. A column was created for the effectivity. An XY scattergram was produced with segment size on the X-axis and effectivity on the Y-axis. Five effectivity distributions were studied: all DocReviews by document type, meeting minutes (most of the type 3 documents), and all DocReviews less the meeting minutes.

Studying the distributions of the three types shows three very distinct populations, type 2 with very strong logarithmic decay of effectivity with increasing base document size, type 3 documents with a very low effectivity and an almost random distribution ()see Figure VI), and type 4 with logarithmic decay of effectivity. Considering the strong (R2 = 0.4416) logarithmic decay of effectivity with increasing base document size seen in Proposition B1 ()§5.1.6.3), the nature of type 3 documents needs to be examined more closely.

Figure VI Effectivity to Document Length by Type

Table XI Regression Analysis Summary

Type	df	F	P_slope	P_intercept	R	Std Err
2	9	22.2	0.0015	3.3x10^-8	0.858	0.593
3	57	0.001	0.966	0.644	0.0057	0.117
4	9	8.72	0.018	0.013	0.722	0.658

Type 3 documents are working drafts, in the data examined here either position papers submitted for a workshop or minutes of weekly group meetings. Meeting minutes are a highly stable and consistent genre that does not attract much discussion, unless discussion topics were not reported or were reported incorrectly. All the meeting minutes were consistently formatted and prepared by only three people. They were separated from the position papers and examined and the effectivity was found to be essentially randomly distributed (R = 0.05) with respect to document length (see (Figure VII)). With 49 degrees of freedom a value of F = 1.21 was found. The slope was positive with P = 0.730. The P value of the intercept was 0.033.

Figure VII Effectivity of DocReviews of Minutes

Based on the finding that meeting minutes formed an essentially random cluster of data points that was well distributed at the knee (document size 200-800) of the logarithmic regression line, it was decided to plot all DocReviews except the meeting minutes. This distribution contains documents (n = 28) that are more likely to stimulate substantive dialog.

A correlation of 0.714 on the logarithmic regression line confirms a strong negative logarithmic relationship. With 27 degrees of freedom a value of F = 27.1 was found. As expected the slope was negative with P = 1.98 x 10^-5. The P value of the intercept was 1.7 x 10^-6.

Figure VIII Effectivity of DocReviews not Minutes

Discussion of Findings:
The hypothesis is soundly rejected. It is clear that less finished documents attract more participation than do more polished documents. This is likely due to the presence of more opportunities for change through collaborative critique.

5.1.6.8 Proposition D2: The nature of social commentary will vary with the type of document.

It is expected that the more formal nature of higher quality documents will evoke a more formal commentary as opposed to the informal and preliminary nature of the less mature documents.

Operationalization:
The social character of the comments is operationalized as the distribution of the Bales codes categories for each of the document types. The Bales Interaction Process Analysis categorizes all speech acts, including gestures, into twelve codes. Many of the Bales codes are specific to face-to-face dialog, so we must eliminate those codes in order to make a comparison. Bales grouped the twelve codes into four categories that are generic and form a good basis of comparison. These categories are: Social-emotive area: positive (positive reactions), Task area: positive (problem-solving attempts), Task area: negative (questions), and Social-emotive area: negative (negative reactions). The central two categories are further generalized into a supercategory of the task area, while the extremes are generalized into the social-emotive area.

For each of the four Bales categories, the percentages of commentary codes by document type (n = 3) are graphed.

Data conditioning:
None.

Data Analysis:
The Bales category distributions of DocReview annotations by document type demonstrate that the annotations are almost never negative reactions. The annotations that show positive reactions are more often directed to the more finished documents (type 4) than to the working and rough drafts (types 3 and 2). Questions are asked over twice as often in type 2 (rough) documents as in type 4 (finished documents).

Figure IX Distribution of Bales Categories by Type

We find that the null hypothesis that there will be no difference in the Bales category distribution between document types can be rejected. With six degrees of freedom, Chi squared = 46.5. This result is significant at <0.000001.

Discussion of Findings:
Finished documents are viewed more positively than rough documents in DocReview. Most commentary is directed toward problem solving.

5.1.6.9 Proposition D3: The nature of substantive commentary will vary with the type of document.

High quality documents such as Research Web Essays (type 4) will attract relatively few negative comments, just because the documents are likely to contain few errors and omissions. On the other hand speculative documents (type 2) are likely to attract negative commentary due to their incomplete and unfinished nature. Working documents are likely to occupy an intermediate position.

Operationalization:
The substantive character of the comments is operationalized as the distribution of the Meyers structurational argumentation codes categories for each of the document types.

Data conditioning:
None.

Data Analysis:
Of interest is the distribution of reinforcer percentages among the types of DocReviews. The more polished (Types 3 and 4) documents draw over twice the percentage of reinforcers than do the rough (Type 2) documents. This is distribution is inversely mirrored, weakly, by a corresponding presence of a lower percentage of promptors in the polished documents as compared to the rough documents.

Figure X Substantive Commentary by Document Type

We find that the null hypothesis that there will be no difference in the Meyers Argumentation Code category distribution among the document types can be rejected, but only very weakly. With four degrees of freedom, Chi squared = 3.92. This result is significant only at <0.5.

Discussion of Findings:
The distribution of argumentation categories is only weakly contingent on document type. There are indications that polished documents will attract more agreement and somewhat fewer objections than rough documents.

5.1.6.10 Other Findings
Exponential decay of multiple comments is seen. The regression line shows a correlation of 0.941 for classes of comment counts, 0 to 6.

Figure XI Exponential Decay of Comment Counts

5.1.7 Conclusions
The social character of the dialog elicited by DocReview shows substantial departures from face-to-face dialog [prop A1]. The social character of the dialog is much less emotive than face-to-face dialog and the task oriented dialog is very clearly oriented toward problem solving, with questioning being only slightly less than in face-to-face.

The substantive nature of dialog in DocReviews [prop A2] is very concentrated in constructive disagreements with the statements in the DocReview. Conversely, agreements are much less frequent than in face-to-face dialog. Most of these agreements include amplifications. This finding reinforces the similar findings in the study of the social nature of the dialog [prop A1].

Findings related to the size of the base document and the segment size found that the effectivity of the DocReview decays logarithmically with increasing base document size [prop B1]. Commentary size is directly, but not strongly, proportional to segment size [prop B2]. The effectivity of a review segment shows logarithmic decline with increasing segment length [prop B3]. This finding indicates that the document segmentation strategy should avoid long segments.

Analysis of the descriptive statistics on the document size shows that the length of annotations is significantly longer in more finished documents (type 4), perhaps reflecting the willingness to spend more time on "serious" documents, and shortest in working documents (type 3). Annotations on rough documents (type 2) fall into an intermediate length class, perhaps because they need more work to bring them to acceptable quality.
Comparing DocReview to roughly comparable products shows that no important features were overlooked in DocReview, though no product has implemented the features just as DocReview has [prop C1]. This convergence of design demonstrates that DocReview's design is in the mainstream. The differences in design implementation are largely due to differences in audience and commercial aspirations.

An attempt to measure the effect of base document quality on the effectivity (the ratio of words of commentary to words in the base document) of the DocReview found [prop D1] that (with exceptions) effectivity of documents declined with increasing quality, corroborating the findings of prop B1. Measuring the effect of base document quality on the social nature of the dialog showed comparable distributions among the Bales categories [prop D2] in all document types. The minor differences speak perhaps more to the consistent categorization of documents than to the significance of the differences. In the case of substantive dialog (Meyers codes), similar comparable distributions were seen [prop D3]; however there was an apparent, but insignificant, increase in agreements (reinforcers) with increasing quality. A corresponding decrease in objections was also seen.