In the sections that follow, two collaborative tools will be examined: DocReview and the Research Web.

The Research Web is a customizable collaborative environment that permits the research team in a long-term, large-scale enterprise to examine an issue domain thoroughly. The Research Web (RW) has a WWW site that serves as the repository of the team's corporate memory and research results. Its tools include a basic set of scholarly services, namely an annotatable bibliography and glossary, and an augmented web page format used for research essays. It incorporates any tool that the team finds necessary to its mission, provided that tool can be made web-compatible. Research Webs are unique, and for that reason may best be examined as case studies.
5.1 Case Studies of DocReview Installations
Five sets of DocReviews were selected for detailed quantitative analysis in order to examine several propositions.
Two of the selected sets of DocReviews were the minutes of 59 meetings. The meetings were task-oriented meetings with an attendance averaging six members, with occasional participation of others by telephone. The minutes were quite comprehensive and averaged two pages of text. DocReview was integrated into the meeting routines by directing the attendees to review the minutes on the WWW before the next meeting. At the next meeting the scribe would distribute copies of the minutes with commentary inserted inline. The scribe would then explain how the minutes were revised in light of received annotations, and the team would then approve the minutes or suggest other changes. Usually this discussion was over in two or three minutes, thus saving considerable meeting time.
One set of seven DocReviews comprised sections of a draft of a professional paper. The paper was divided into seven sections in order to reduce the time required for each reviewing session; with the reviews shortened, the busy schedules of the reviewers could accommodate the small time slices. The reviewers were professional colleagues of the author, some of whom were involved in the design of DocReview. The author found the annotations very useful, and most were incorporated into the final draft of the paper.
Another set of DocReviews was 19 workshop position papers for the 1999 conference on Computer-Supported Cooperative Learning (CSCL). Reviewing position papers was seen as an excellent application of DocReview from the beginning of its design, and in practice it lived up to that promise. Perhaps the greatest impact was not intellectual but social, in opening networking channels.
The final set of DocReviews was a set of 17 documents, Research Web Essays, written for a Research Web for the issue domain of chromium (CrVI) contamination on the Hanford Nuclear Reservation. The set was quite successful in accomplishing the objective of refining the initial versions of the documents, each of which centered on one aspect of the contamination.
5.1.1 Research Questions

5.1.2 Design of Data Collection System

A program written by the author extracts and formats data from the files mentioned above. The program (makecsv.pl) creates several comma-separated values (.csv) files suitable for import into a database and thence to a spreadsheet program for analysis. This program also does a word count on the base document and on each of the document's review segments. The analyst supplements two of the .csv files in order to add information that cannot be extracted automatically. A file (docrev.csv) that captures attributes of each DocReview is augmented by including a description of the DocReview and a document type attribute designed to indicate the degree of quality, or the degree of completeness, of the document. This attribute is entered as a number from 1 to 5, on a scale running from conceptual sketches to completed canonical documents. The coder modifies the comments.csv file to add both the Bales codes (Interaction Process Analysis) and the Meyers codes (Structurational Argumentation).
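makecsv.pl itself is not reproduced in this section; the following is a minimal Python sketch of the kind of extraction it performs. The log layout and CSV field names here are assumptions, not the actual makecsv.pl design; only the word-counting rule (words of more than three characters, as described under Proposition B1) is taken from the text.

```python
# Hypothetical sketch (in Python) of the extraction step that
# makecsv.pl performs; field names and input structure are assumed.
import csv

def word_count(text):
    # The study's counting software counts only words of more than
    # three characters.
    return sum(1 for w in text.split() if len(w) > 3)

def write_docrev_csv(docreviews, path="docrev.csv"):
    # docreviews: iterable of dicts with assumed keys
    # 'name', 'sponsor', 'created', and 'base_text'.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["docreview", "sponsor", "created",
                         "base_words", "description", "doc_type"])
        for d in docreviews:
            # 'description' and 'doc_type' (quality, 1 to 5) are left
            # empty here; the analyst fills them in by hand.
            writer.writerow([d["name"], d["sponsor"], d["created"],
                             word_count(d["base_text"]), "", ""])
```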
5.1.3 Quantitative Descriptive Statistics

Base Document

Data collected on the base document for a DocReview includes a word
count, the document type (quality), the sponsor (author), and date of
creation of the DocReview. The text of the base document is also
available.
The word counts of review segments are summarized in Table III, Words in Review Segments.
The sample variances of both the raw data and a logarithmic transformation are too heteroscedastic to permit a reliable analysis of variance, so a null hypothesis of no differences between the three document types cannot be rejected. Examination of the means and standard deviations nevertheless points out an obvious difference between types 3 and 4. This difference is consistent with the nature of the genres represented: type 4 documents are drafts of conventional papers dominated by paragraph-long segments, while type 3 documents are dominated by meeting minutes composed of short segments such as action items and list bullets.

Comments

Data collected on comments includes: the text of the comment, a word
count, the name of the commentator, the commentator's e-mail address, the
time and date, and the qualitative coding of the comment, both Bales codes and Meyers codes.
In analysis of the DocReview commentary, it was discovered that the DocReviews of meeting minutes constituted a subset of commentary that demonstrated essentially random annotative behavior.

5.1.4 Qualitative Coding Systems

Any classification scheme must serve to differentiate between members
of a group of cases. In our study the cases are DocReviews, an
object that consists of a document that is partitioned into "review
segments", and a set of comments made on each segment. The number of
comments may be zero or more, and is usually zero. In uncommented
segments, the question of implied agreement must be raised. One may
be tempted to assume, since there is no limitation on reflection, that the
reviewers agree with the review segment. Implied assent is very
dangerous because it enables power mechanisms. No comment just means
that the reviewers chose not to add to the dialog (Sheard 2000).

So how can we differentiate between the DocReviews? Certainly
there are descriptive statistics such as size of the base document, the
number of review segments, the number of comments, when the comments were
made with respect to opening the review process, the size of the
comments, and who made comments. These data were maintained in the
log files, which are features of DocReview.

Beyond these physical statistics lie the study of the character of the social interactions of the review team, the interaction process analysis (IPA), and the study of the efficacy of the review process: how the review
contributed to the refinement of the knowledge represented by the
document. Both the IPA and studies of efficacy can be conducted
only by analysis of the content of the annotations. Measurement of
the value of the comments to the collaboration is quite impossible in most
cases, but a qualitative categorization of comments can be done by at
least two classification schemes: an observational scheme and a scheme
based on how the comments would fit into a formal argument. We must
then code the DocReview multilogues twice, once for the social dimension
of process-orientation and again for the knowledge content dimension of
task-orientation.

To analyze the interpersonal process of behavior in DocReviews, I classified the annotations using Bales' codes (Bales 1950, 9), a well-developed and respected tool.

Analysis of how comments within a DocReview contributed to the
knowledge-building content of the document will be conducted using a
coding system based on the function of the comment from a task-oriented
viewpoint, rather than from a social viewpoint as in IPA. The
task-oriented functions are defined as the character of the comment (or
comment fragment) in a formal argumentation framework. Meyers, Seibold and Brashers developed this coding system, which was based on, and extended from, their previous work (Meyers, Seibold and Brashers 1991).

Classification schemes need to satisfy three conditions (Bowker and Star 1999, 10): the classificatory principles in use must be consistent, the categories must be mutually exclusive, and the system must be complete.
The coding schemes I use vary in compliance with these desiderata. Bales' codes are not complete; there is no place for nonsense or muttering. Nor are the Bales codes mutually exclusive; they are derived from four fairly distinct major categories, each divided into three quasi-ordinal codes with very fuzzy boundaries (e.g., what is the difference between giving information and giving an opinion?). Bales attempts to close the ambiguities in the codes with a very thorough explanation of each (Bales 1950, 177-195), but overlaps and gaps remain. Meyers' scheme provides a less complete guide to coding (Meyers et al. 1991, 54), as is appropriate for a research article as opposed to Bales' book. Both coding schemes are well described, and coders can become facile with them in a reasonable time. With respect to mutual exclusivity, a continuous system like Bales' IPA must have fuzzy boundaries; Meyers' system is not continuous and so is immune from this argument.

Meyers' scheme neatly solves the completeness problem with the
introduction of the category "non-arguable." Fortunately, this
category can contain no contextual knowledge, so it can safely be excluded
from our analyses. Bales asserts that his categories are made
complete and continuous by being concerned with the interaction content
rather than the topical content and by eliminating any requirement for the
observer "to make judgments of logical relevance, validity, rigor,
etc." dissertation(Bales 1950, Chapt. 2). Correct assignment of codes could perhaps be tested by comparing actual
results from dialog in the source research and the coding of the same
material by the author. In short, such testing would require
studying intercoder reliability between the teams of Bales and Meyers and
the team (myself) that would code the annotations. Bales offers six
pages of coded dialog (Bales 1950, 93-99). Meyers et al. offer some short examples. Both papers do offer good
definitions of the categories. The categories are based on dialog
quite familiar to any literate individual. A larger issue is the
absence of gestural side-channel communication (head nodding, eye-rolling)
in DocReview. Since face-to-face dialog presents frequent "speech acts" that are gestures, facial expressions, or voice tones, that portion of the dialog is lost in the coding of DocReview annotations.
This loss may account for some of the significantly lower "social-emotive"
codes in the DocReview annotations. I can only compare DocReviews to DocReviews since there was no attempt
to set up a control review method by other means. In the DocReview
study, all DocReviews use the WWW and are thus device independent.
Usually, the participants within a given set of DocReviews are
homogeneous, though between sets, they may vary in number. The same
task is always performed: review of a document, though the nature of the
documents may change (meeting minutes, position papers). Almost all
users are invited, since most DocReviews are on intranet sites. Other than
the exceptions noted, most dependent variables are identical. Most
studies that apply IPA compare computer-mediated communication with
face-to-face communication. In a meta-analysis of studies of
computer-mediated collaboration, McGrath and Berdahl
(McGrath and Berdahl 1998) make several
cautionary points based on differences between face-to-face interaction
and computer-mediated interaction: studies often use different computer
systems; different kinds of participants are used; different types of
tasks are performed; and there are different patterns of dependent
variables.

5.1.4.1 Interaction Process Codes

The Bales Codes

Commentary that expresses support or disagreement is not valueless, for
such commentary does influence the behavior of the author and other
contributors. So most commentary is of some value, even if it is
merely reinforcing the recognition of a team effort. Sadly there
are comments of negative worth that occasionally emerge, such as personal
attacks or senseless graffiti.

Gay et al. and Classroom Discussion Forums

These codes are equivalent to portions of the twelve-category Bales codes for interpersonal activity. The affiliative comments,
which presumably could be positive or negative, would fall into one of six
categories: Shows Solidarity, Shows Tension Release, Agrees, Disagrees,
Shows Tension or Shows Antagonism. The technical comments
would fall into the neutral task-oriented area: Gives Opinion, Gives
Orientation, Asks for Orientation, Asks for Opinion. The
advice category corresponds to the extreme range of the
task-oriented area: Gives Suggestion and Asks for Suggestion.

5.1.4.2 Argumentation Based Codes

Informal Argumentation

Structurational Argument Codes

In Meyers et al., discussions were analyzed and 8,408 codes were produced, with the distribution given in the following table (Meyers et al., 45). This dissertation found 425 codes in the DocReview annotations.
While Meyers et al. conclude that the structurational argumentation codes reflect both process-orientation and task-orientation (or system and structure, as they put it), the coding scheme clearly supports task-orientation much better than the Bales IPA does. In terms of support
to a collaborative task, some categories have more value than others. These argument codes provide places for every element in the Toulmin
informal argumentation scheme. The nonarguables Process and
Unrelated are very convenient "bins" for trivial or procedural
content. One of the seventeen codes is extremely unlikely to be
used: the nonarguable Incomplete. The argument codes were
developed to analyze transcripts of face-to-face interactions, an
environment where interruptions are frequent. It is difficult to
imagine how an asynchronous contribution could be interrupted; if the
writer is interrupted at the terminal, then the task can be resumed when
the interruption terminates.

The Meyers et al. study used transcripts of actual face-to-face multilogue, with recourse to videotape only when an expression needed clarification (Meyers et al. 1991, 56). Interruptions and incomplete expressions were frequent, as in normal conversation. The computer-mediated environment of
DocReview will make interruption unlikely and incomplete thought
rare. I expect the distribution of message fragments in DocReviews
to be quite different from conversational multilogues. As McGrath
and Berdahl cautioned, these differences may be due to many different
factors (McGrath and Berdahl 1998);
nevertheless, if the differences are great, the argument in favor of
computer-mediated communication as a more reflective medium gains support.
An Observational Categorization

This scheme categorizes several nominal classes of comments seen in DocReviews. It has the advantage of being completely specific to DocReviews: it is not time-restricted, and it is asynchronous and document-centric. Most DocReview review segments, especially
paragraphs, will contain an assertion and a conclusion, and will give evidence showing how the conclusion follows from the assertion. In addition to this logical imperative (substantial), there is also the requirement to conform to appropriate standards of scholarship and presentation (formal). In the Research Web environment, the documents are also subject to both the criticism process and an editing process.

5.1.5 Qualitative Coding Reliability

Unitizing is a significant source of variability. The variability
in unitization is induced by uncertainty in interpretation. Some
methods of unitizing are less susceptible to variability than
others. Time-based unitization, segments of elapsed real time, is not subject to interpretation (Nyerges et al. 1998, 141). Turn taking in speech dialog is more variable due
to complications that arise in parsing of monologues; annotations in
DocReview are essentially monologues. Parsing face-to-face dialog
into speech acts (Bales) is yet more variable because there is a need for
insertion of implied speech acts and gestural acts. Even more
variable is the event-based coding that was used in the argumentation
coding (Meyers). Nyerges et al. chose time-based coding over event-based coding because event-based coding required at least two coding passes (Nyerges et al. 1998).

In the Bales coding, DocReview annotations were parsed during coding
into approximations of "speech acts" by dividing the annotation into
phrases, sentences or a set of contiguous sentences that dealt with a
single topic. Not infrequently when the coder understands both the
review segment and an annotation well, implied codes emerge. One
comment usually contained a few codes (mean = 2.6), sometimes as many as a
dozen. This parsing is assumed to be equivalent to the turn taking
of face-to-face dialog. In the argumentation coding, the unitizing protocol used in Meyers
et.al. could not be employed since their unitizing was done by two judges
concurrently. As Meyers used transcripts of dialog, so I used
written dialog. The unitizing rule that Meyers et.al. used was:
"any statement that functioned as a complete thought or change of
thought." The Meyers team coded dialog that was parsed into turns,
while DocReview comments are relatively long monologues. Rather than
parsing the monologue into speech acts I parsed it into argument units
that might include several sentences. Such units fit well into the
Meyers categories. One comment usually contained one to a few codes
(mean = 1.4) sometimes as many as eight. Coding and unitization of DocReview annotation requires the coder to
place the annotation into the context of the review segment being
annotated. This contextualization is done by mentally converting the
annotation unit and review segment into a narrative equivalent.
Unfortunately, returning to exactly the same mindset is difficult, whether for independent judges or for the same coder repeating the coding at a later time.

5.1.5.1 Coding Reliability Tests

Four sets of codes were tested for reliability: the Bales codes (twelve
categories), the Bales categories (four sets of three codes each),
the structurational argumentation codes (seventeen categories), and the
five structurational argumentation categories derived from the seventeen
codes.

5.1.5.2 Data Conditioning

If such realignment is allowed, it is subject to much abuse, so I allow
only a shift of the entire shorter code string within the limits of the
longer code string. If the code strings are of equal length, then no
shifting is allowed. Any unmatched codes resulting from unequal
code string lengths are removed. Both Bales and the structurational
argumentation codes were conditioned this way, and the resulting
conditioned data was converted to the aggregated categorical data (the
four Bales categories and the five structurational argumentation
categories).

5.1.5.3 Analysis

The conditioned data were placed in contingency tables comparing the
two coding sessions. From the contingency tables, Cohen's kappa and
Perreault and Leigh's Index of Reliability were calculated for the four
sets of data: the Bales codes, the Bales categories, the structurational argumentation codes, and the structurational argumentation categories.

5.1.5.4 Conclusions

The structurational argumentation codes were too numerous and difficult
to code to produce acceptable reliability. Applying argumentation codes to the analysis of DocReview annotations will require the use of at least pairs of coders working together (as Meyers et al. did). The unitization problem was extremely serious, producing almost a one-third rate of unmatched codes. The combination of arbitrarily long review segments and arbitrarily long annotations will demand a very clever unitization scheme to produce any hope of consistent coding.

5.1.6 Analytical Results

Four of the propositions use the chi-squared test, comparing the counts
of DocReview codes versus the coding distributions in the original Bales
and Meyers studies. In order to normalize the sample sizes a pseudo-sample
of the Bales or Meyers codes was drawn with the same distribution as in
the original studies but with a size equal to the DocReview sample.
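As a minimal sketch of this normalization (the function and argument names are mine, assuming the reference distribution is given as percentages):

```python
# Compare observed DocReview code counts against a published reference
# distribution (Bales or Meyers) rescaled to the DocReview sample
# size -- the "pseudo-sample" described above.
from scipy.stats import chisquare

def compare_to_reference(observed_counts, reference_percentages):
    n = sum(observed_counts)
    expected = [n * p / 100.0 for p in reference_percentages]
    return chisquare(observed_counts, f_exp=expected)
```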
Four of the propositions were tested using single-variable regression analysis. In all these cases the independent variable (X) was the word
count of the base document or a review segment of the base document. In
some cases the dependent variable (Y) was confounded with the independent
variable. This confounding was due to the definition of effectivity as
the ratio of commentary to the size of the document (effectivity = Y/X).
The shape of the best fitting regression line was found to be logarithmic.
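The logarithmic fit itself can be sketched as follows, assuming NumPy and hypothetical arrays x (word counts) and y (the dependent variable):

```python
# Fit the logarithmic model y = a + b*ln(x) reported as the best
# fitting regression shape; when y is an effectivity ratio (Y/X),
# the confounding described above applies.
import numpy as np

def log_fit(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    b, a = np.polyfit(np.log(x), y, 1)    # slope, then intercept
    r = np.corrcoef(np.log(x), y)[0, 1]   # correlation against ln(x)
    return a, b, r
```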
One of the propositions was a case study comparing DocReview to three
other web-based annotation programs. The comparison was made on the basis
of a universe of features found in all the programs. 5.1.6.1 Proposition A1. The social character of comments in
DocReview differs from comments in face-to-face dialog. One of the most important questions arising from the use of DocReview
is how the nature of dialog in DocReview is different from face-to-face
dialog. Fortunately we have from Bales' work a distribution of codes
assembled from thousands of face-to-face speech acts. If one assumes that DocReview annotation is equivalent to one side of a face-to-face dialog, and further assumes that in face-to-face dialog the two participants each produce an identical distribution of coded speech acts, then we can make a valid comparison. The assumption of equivalence
is strained by the odd nature of this communication: essentially the
document is the source of a series of propositions. The annotation is a
set of responses to the proposition presented in the review segment by the
readers. This set of responses is also complicated by the not infrequent
presence of commentary on other annotations.

Operationalization:

Data conditioning:

Data Analysis:

We find that the null hypothesis that there will be no difference
between face-to-face and DocReview dialog when Bales coded can be
rejected. With three degrees of freedom, chi-squared = 213.2. This result is significant at p < 0.000001.

5.1.6.2 Proposition A2: The substantive character of comments in DocReview differs from comments in face-to-face dialog.

The substantive nature of comments in DocReview is measured by
determining the intent of the comment, or a portion of the comment. Intent
is defined in this analysis as what place the comment would take in
argumentation. As in the analysis of social character of the comments above in
Proposition A1, we have to assume that the dialog is quite one-sided, with
the document providing propositions and the readers arguing with that
proposition. Clearly there can be no negotiation of meaning and the
document can make no rebuttals. In terms of argumentation, then, we can have but one round of argumentation, though with several people participating.

Operationalization:

Data conditioning:

Argumentation codes in the non-arguable category in the dialog were
excised. In the raw data, DocReview annotations were 22.6% non-arguable,
compared to 14.5% in the Meyers study. The difference in non-arguables is
attributed to the assignment of annotations frequently complaining about
grammar and spelling to that category. Arguably such commentary does not
contribute to productive argumentation, and furthermore such corrections
are seldom made in face-to-face dialog. Codes in the arguable class were also excised. Difficulties in
adjusting for the asymmetrical nature of DocReview argumentation are
simply insurmountable. In the one turn dialog, responses to propositions
(the base document's review segment) are much more prevalent than
responses to annotations. Responding to annotations usually requires
re-reading the comments; busy participants are not likely to return to
review comments, even if they are reminded by e-mail notification. This
would not be the case in face-to-face argumentation. The data conditioning leaves us with three categories of codes:
Reinforcers, Promptors and Delimitors. Unfortunately the excision of
troublesome categories reduces our number of data points by 58% to 176.
Since the central action of argumentation is carried out in these
categories, I feel that they are an adequate basis for comparison.

Data Analysis:
The document that is prepared for DocReview
is called the base document. It varies in size and in quality (the
degree of development). Very large base documents are usually broken
into sections, each a DocReview, in order to allow the usually busy
reviewers to complete a section at one sitting.
Base Document Size (word count)

Characteristic | All | Type 2 | Type 3 | Type 4
Mean | 459.26 | 135.61 | 465.47 | 798.82
Median | 422 | 130 | 469 | 598
Standard Deviation | 325.27 | 91.259 | 196.71 | 695.66
Sample Variance | 105801 | 8328 | 38693 | 483946
Kurtosis | 20.77 | -0.59 | 2.49 | 5.44
Skewness | 3.44 | 0.48 | 0.88 | 2.22
Range | 2647 | 309 | 1279 | 2657
Minimum | 10 | 10 | 140 | 206
Maximum | 2657 | 309 | 1279 | 2657
Count | 100 | 13 | 76 | 11
Commentary other than general comments is
directed toward a fragment of the base document called a review
segment. Review segments are most frequently paragraphs or list
elements (bullets), but occasionally include images or entire
tables. The facilitator determines the review segments. The
DocReviews in this case study were all prepared for review by the author
and reflect a personal bias toward using relatively short review segments:
paragraphs, at the largest; where lists are present, list elements; where
large tables are presented, table cells; and individual graphic
images. Section headings, bibliographic entries, and titles are
usually excluded from review segments. Data collected on review segments
consist of the text of the review segment and a word count.
Table III Words in Review Segments

Characteristic | All | Type 2 | Type 3 | Type 4
Mean | 24.89 | 35.60 | 21.09 | 73.86
Median | 14 | 8 | 14 | 65
Standard Deviation | 28.19 | 43.33 | 20.41 | 55.23
Sample Variance | 794 | 1877 | 417 | 3051
Kurtosis | 17 | 1.5 | 5.6 | 4.7
Skewness | 3.2 | 1.4 | 2.1 | 1.7
Range | 306 | 165 | 158 | 306
Minimum | 2 | 3 | 2 | 2
Maximum | 308 | 168 | 160 | 308
Count (number of segments) | 1822 | 48 | 1656 | 118
Each review segment attracts a set of comments,
usually an empty set. The set may include not only comments on the
review segment, but also comments on the other comments on the review
segment. The comments are entirely free form, either text or HTML,
and may include emphasis, paragraphing, and even images.

Table IV Words in Comments
Characteristic | All | Type 2 | Type 3 | Type 4 | —
Mean | 31.83 | 34.80 | 21.49 | 54.46 | 30.73
Median | 19 | 22.5 | 12.5 | 43 | 12.5
Standard Deviation | 37.61 | 36.98 | 26.79 | 48.01 | 36.77
Sample Variance | 1414 | 1367 | 717 | 2305 | 1352
Kurtosis | 14.6 | 1.2 | 40.3 | 7.9 | 0.8
Skewness | 3.1 | 1.5 | 5.1 | 2.2 | 1.5
Range | 289 | 122 | 256 | 288 | 124
Minimum | 1 | 3 | 1 | 2 | 1
Maximum | 290 | 125 | 257 | 290 | 125
Number of Comments | 233 | 20 | 148 | 65 | 40
Commentary on hyperdocuments through DocReview can be evaluated by categorization, volume, and quality. DocReview comments can be categorized by using Bales codes (Bales 1955). Depending on the issue domain, these codes can be used to order value between categories. For instance, detection of errors
in spelling or grammar is a low value contribution in studies of social
behavior, but a high value contribution in the development of a manifesto
or epic.
Geri Gay and others studied the character of student contributions by
computer-mediated communication in university classes (Gay et al. 1999). The
discussion forums were conducted in CoNote, a WWW-based annotation program
functionally similar to DocReview. Gay's study included
questionnaires and observer data as well as a repository of documents and
comments thereon. Gay's codes, like Bales' codes, are not based on
the relationship of the annotation to the collaboration task, but on the
character of interpersonal activity. Content of the annotations was
organized into three categories: technical comments, affiliative comments
and advice. Presumably, a single comment could contain all
categories, but not multiple occurrences of a category. A set of 197 comments produced percentages of 50.3% technical, 45.2% affiliative, and 68.5% advice.
These percentages were obtained in an environment dominated by students
who came into frequent contact, thus by age and group structure more
inclined to engage in affiliative commentary than professional groups
might be.
In An Introduction to Reasoning, Toulmin, Rieke and Janik develop a
dialog classification based on argumentation (Toulmin, Rieke and Janik 1979). Their
system is proposed to be the basis for development of a tool (The
Landscape of Reason) to organize dialog for the Research Web.
Argumentation is broadly defined in this work, having a place in any
"rational enterprise." As the authors put it, "... scientific
arguments are sound only to the extent that they can serve the deeper goal
of improving our scientific understanding." Every coding unit of a comment
can be assigned a type based on this classification. The value of a comment to the collaboration can be established through a surrogate: the value of the comment in the argument. There are
six elements in argumentation: claims, grounds, warrants, backing, modal
qualifiers, and rebuttals.
In research on decision-making discussions in a face-to-face environment,
a set of seventeen categories describing statements in terms of their
place in argumentation was developed and used by a team that studied 45
discussions. This research had its roots in research by Toulmin (in
1958) and two other research teams in 1969 and 1980 (Meyers, Seibold and Brashers 1991, 50).
I can find no subsequent application of this coding scheme in the
literature. Coding is extremely difficult, as meanings can shift
with context. The coder must be thoroughly immersed in the argument,
not just the words, but also the intent of the words.
ARGUABLES (67.4%)
Assertions: Statements of fact or opinion.
Propositions: Statements that call for support, action or conference on an argument-related statement.
Elaborations: Statements that support other statements by providing evidence, reasons, or other support.
Responses: Statements that defend arguables met with disagreement.
Amplifications: Statements that explain or expound upon other statements in order to establish the relevance of the argument through inference.
Justifications: Statements that offer validity of previous or upcoming statements by citing a rule of logic (provide a standard whereby arguments are weighed).

REINFORCERS (13.6%)
Agreement: Statements that express agreement with another statement.
Agreement +: Statements that express agreement with another statement and then go on to state an arguable, promptor, delimitor, or nonarguable.

PROMPTORS (2.3%)
Objection: Statements that deny the truth or accuracy of any arguable.
Objection +: Statements that deny the truth or accuracy of any arguable and then go on to state another arguable, promptor, delimitor, or nonarguable.
Challenge: Statements that offer problems or questions that must be solved if agreement is to be secured on an arguable.

DELIMITORS (2.1%)
Frames: Statements that provide a context for and/or qualify arguables.
Forestall/Secure: Statements that attempt to forestall refutation by securing common ground.
Forestall/Remove: Statements that attempt to forestall refutation by removing possible objections.

NONARGUABLES (14.5%)
Process: Non-argument-related statements that orient the group to its task or specify the process the group should follow.
Unrelated: Statements unrelated to the group's argument or process (tangents, side issues, self-talk, etc.).
Incomplete: Statements that do not provide a cogent or interpretable idea (due to interruption or stopping to think in midstream) and are not completed as a cogent idea elsewhere in the transcript.
The author's five years of experience in the use of DocReview has led to a
potential coding system based on observation and sorting.
Interpretation and characterization of the codes are based not only on the original context of the commentary, but also on assumptions of what character
the comments would take in a fully implemented Research Web.
Observational Categorization of DocReview Annotations
Aligning codes at the beginning gives:

acbbbca
cbbbca

which matches only two codes. If on the other hand we shift the shorter string one position to the right:

acbbbca
 cbbbca

six codes match, and only the leading code is left unmatched.
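A sketch of this conditioning rule (the function is hypothetical): slide the shorter code string within the limits of the longer one and keep the shift that matches the most codes; unmatched codes are then removed.

```python
# Slide the shorter code string along the longer one (no shifting when
# the strings are of equal length) and keep the best-matching shift.
def best_alignment(longer, shorter):
    best_shift, best_matches = 0, -1
    for shift in range(len(longer) - len(shorter) + 1):
        matches = sum(a == b for a, b in zip(longer[shift:], shorter))
        if matches > best_matches:
            best_shift, best_matches = shift, matches
    return best_shift, best_matches

# best_alignment("acbbbca", "cbbbca") returns (1, 6): shifting the
# shorter string one position right matches six codes instead of two.
```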
From the initial set of 99 Bales codes, there were 82 codes remaining in
the conditioned data. Each code could assume one of twelve
values. Comparing the two sets showed 54 pairs in agreement, 28
pairs in disagreement and 17 unmatched codes. Cohen's kappa (Cohen 1960) for the Bales codes is 0.538, showing only moderate agreement between the two coding sessions (Landis and Koch 1977, 165). The Index of Reliability (Perreault and Leigh 1989) is 0.792 with a 95% confidence interval of +/- 0.088. This mediocre result, in
conjunction with some very low counts of several codes, provided the
argument to use only the four Bales categories in the analysis of
DocReview annotations.
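Both reliability measures have simple closed forms. The sketch below computes Cohen's kappa from a session-by-session contingency table and the Perreault-Leigh index from the agreement count; with the Bales-codes figures above, perreault_leigh(54, 82, 12) reproduces the reported 0.792.

```python
import numpy as np

def cohens_kappa(table):
    # table[i][j]: number of pairs coded category i in session 1 and
    # category j in session 2.
    t = np.asarray(table, dtype=float)
    n = t.sum()
    p_o = np.trace(t) / n                                  # observed
    p_e = (t.sum(axis=0) * t.sum(axis=1)).sum() / n ** 2   # chance
    return (p_o - p_e) / (1 - p_e)

def perreault_leigh(agreements, pairs, k):
    # I_r = sqrt((F/N - 1/k) * k/(k-1)) per Perreault and Leigh
    # (1989), where F is the number of agreements, N the number of
    # coded pairs, and k the number of categories.
    return ((agreements / pairs - 1 / k) * k / (k - 1)) ** 0.5
```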
In analyzing the four Bales categories, each code could assume one of
four values. Comparing the two sets showed 80 pairs in agreement, 2
pairs in disagreement and 17 unmatched codes. For the Bales
categories, Cohen's kappa is 0.878, showing almost perfect agreement
between the two coding sessions. The Index of Reliability is 0.984
with a 95% confidence interval of +/- 0.027.
From the initial set of 70 structurational argumentation codes, there were
48 codes remaining in the conditioned data. Each code could assume
one of seventeen values. Comparing the two sets showed 21 pairs in
agreement, 27 pairs in disagreement and 22 unmatched codes. Cohen's
kappa for these codes is 0.402, showing only fair agreement between the
two coding sessions. The Index of Reliability is 0.668 with a 95%
confidence interval of +/- 0.133. As with the Bales codes, there were
a large number of codes with low to zero counts.
In analyzing the five structurational argumentation categories, each
code could assume one of five values. Comparing the two sets showed
28 pairs in agreement, 20 pairs in disagreement and 22 unmatched
codes. Cohen's kappa is 0.383, showing only fair agreement between
the two coding sessions. The Index of Reliability is 0.673 with a
95% confidence interval of +/- 0.133.
Assigning Bales codes categories to all annotations operationalizes the
social character of the comments. The Bales Interaction Process Analysis
categorizes all speech acts, including gestures, into twelve codes. The
differences between some of the Bales codes are very slight. These fine
nuances result in a high variability between coders or between coding
sessions by the same person. In order to reduce the intercoder variability
it was decided to use Bales' broader classification: categories. Bales
grouped the twelve codes into four categories that are generic and form a
good basis of comparison. These categories are: positive reactions,
problem-solving attempts, questions, and negative reactions.
Problem-solving Attempts and Questions are further generalized into a
supercategory of the task area, while Positive and Negative Reactions are
generalized into the social-emotive area.
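The collapse from twelve codes to four categories can be written down directly. The twelve codes and the four categories are all named in this section; the grouping below is the standard Bales (1950) assignment, offered as a sketch.

```python
# Collapse the twelve Bales codes into the four generic categories.
BALES_CATEGORY = {
    # Social-emotive area: positive
    "Shows Solidarity": "Positive Reactions",
    "Shows Tension Release": "Positive Reactions",
    "Agrees": "Positive Reactions",
    # Task area: problem-solving attempts
    "Gives Suggestion": "Problem-Solving Attempts",
    "Gives Opinion": "Problem-Solving Attempts",
    "Gives Orientation": "Problem-Solving Attempts",
    # Task area: questions
    "Asks for Orientation": "Questions",
    "Asks for Opinion": "Questions",
    "Asks for Suggestion": "Questions",
    # Social-emotive area: negative
    "Disagrees": "Negative Reactions",
    "Shows Tension": "Negative Reactions",
    "Shows Antagonism": "Negative Reactions",
}
```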
None.
The counts of codes of the entire set of DocReview annotations by Bales
category demonstrate that DocReview annotations show a much higher degree
of task-related dialog and a much lower degree of social-emotive dialog
than is seen in face-to-face dialog. The comparisons
(DocReview/face-to-face) are: for Negative Reactions -- 0.1%/11.2%; for
Questions -- 7.3%/7.0%; for Problem-Solving Attempts -- 85.5%/56.0%; and
for Positive Reactions -- 7.0%/25.9%.
Assigning Meyers structurational argumentation code categories to each
comment operationalizes the substantive character of the comments.
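The corresponding collapse for the Meyers scheme, grouping the seventeen codes into the five categories used in the comparisons below (groupings as in the table reproduced earlier):

```python
# Collapse the seventeen structurational argumentation codes into the
# five categories of the Meyers et al. scheme.
MEYERS_CATEGORY = {
    "Assertions": "Arguables", "Propositions": "Arguables",
    "Elaborations": "Arguables", "Responses": "Arguables",
    "Amplifications": "Arguables", "Justifications": "Arguables",
    "Agreement": "Reinforcers", "Agreement +": "Reinforcers",
    "Objection": "Promptors", "Objection +": "Promptors",
    "Challenge": "Promptors",
    "Frames": "Delimitors", "Forestall/Secure": "Delimitors",
    "Forestall/Remove": "Delimitors",
    "Process": "Nonarguables", "Unrelated": "Nonarguables",
    "Incomplete": "Nonarguables",
}
```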
The raw data percentage comparisons
(DocReview/face-to-face) are: for non-arguables -- 22.6%/14.5%; for
delimitors -- 8%/2.1%; for promptors -- 23.1%/2.3%; for reinforcers --
10.3%/13.6%; and for arguables -- 36%/67.4%.
The conditioned data comparisons
(DocReview/face-to-face) are: for reinforcers -- 25%/75.6%; for promptors
-- 55.7%/13.3%; and for delimitors -- 19.3%/11.1%.
Figure II Substantive Commentary (DocReview vs. F2F)
Comparing face-to-face distributions to the distributions found in the DocReviews shows a very strong difference in both promptors and reinforcers. There are four promptors in DocReviews for each face-to-face promptor and three face-to-face reinforcers for every DocReview reinforcer.
We find that the null hypothesis that there will be no difference between face-to-face and DocReview dialog when Meyers coded can be rejected. With two degrees of freedom, chi-squared = 93.3. This result is significant at p < 0.000001.
Discussion of Findings:
The differences between face-to-face argumentation and DocReview annotation are clear: people are much more inclined to suggest changes to the document in DocReview than in face-to-face dialog, and much less inclined to agree with the document in DocReview than in face-to-face dialog. I see this finding as suggesting that some satisficing may be occurring: people are less inclined to annotate texts that they see as not far enough wrong to complain about. The vast difference in
promptors may be explained by the nature of DocReview: documents are
mounted with the intent of drawing out errors and omissions. A portion of
the differences may also be explained by social mechanisms: it is much
easier to praise than object; and power effects may also be seen as people
are more inclined to agree with a proposition offered in a meeting
(usually by a leader).
5.1.6.3 Proposition B1: Long base documents are ineffective relative to short documents.
The lives of researchers are fragmented into scores of tasks of varying importance. This produces the need to engage in multitasking, a mosaic of activity that fills the available time with periods of variable lengths. There will be short periods to review documents, provided they are of a size that will fit into the time slot. Very long documents may encourage a shallow reading, thus shallow and short commentary.
Operationalization:
Effectivity is operationalized as the ratio of the sum of comment size to
the size of the base document. Size of comments and base documents are
both established by software that counts the words of more than three
characters. For each DocReview that attracted annotation (n = 78), the word counts of the annotations it received were accumulated in one column, and the word count of the DocReview itself was placed in another. The DocReview word count was plotted on the X-axis and the effectivity on the Y-axis.
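Putting the pieces together, a sketch of the Proposition B1 computation under the same assumptions (the input structure is hypothetical):

```python
# Build the regression inputs for Proposition B1: one point per
# DocReview, x = base document word count, y = effectivity.
import numpy as np

def b1_points(docreviews):
    # docreviews: list of (base_word_count, [comment_word_counts]).
    x = [base for base, comments in docreviews]
    y = [sum(comments) / base for base, comments in docreviews]
    return np.asarray(x, float), np.asarray(y, float)
# The resulting (x, y) arrays can then be passed to log_fit above.
```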
Data conditioning:
Records for DocReviews that attracted no commentary were excluded. A
DocReview with segments containing graphics was excluded due to the low
word count in the segments, and the heavy annotation of the segments. The
same DocReview contained an anomalously long general comment.
Data Analysis:
A correlation of 0.665 on the logarithmic regression line confirms the
hypothesis. With 77 degrees of freedom a value of F = 60.1 was found. As
expected the slope was negative with P = 3.27 x 10^-11. The P value of the intercept was 1.64 x 10^-12. A study of DocReviews by document type is presented under Proposition D1 (§5.1.6.7).
Discussion of Findings:
The hypothesis is accepted. Smaller base documents produce more effective
DocReviews. This leads to the conjecture that fragmenting a very long
document will increase the effectivity of the review process. This
conjecture could be tested, but not with the data from this study.
5.1.6.4 Proposition B2: The amount of commentary received on a review segment will be directly proportional to the segment's length.
An extremely long review segment may tax the reader’s concentration, leading to a decline of effectivity. Short review segments such as list "bullets" are sharply focused and easy to grasp and critique. Due to a small denominator, the effectivity of such short segments may be inflated. The deleterious effect of long review segments is one of the basic assumptions of the design of DocReview.
Operationalization:
Sizes of comments and review segments are both established by software that counts the words of more than three characters.
Data conditioning:
Segments not attracting annotation are removed. Segments that were graphic
images were discarded. General comment segments were discarded.
Data Analysis:
A correlation of 0.235 on the linear regression line weakly confirms the hypothesis, showing a direct relationship between segment size and received
annotation. With 49 degrees of freedom a value of F = 2.80 was found.
As expected the slope was positive with P = 0.101. The P value of the
intercept was 0.391.
Discussion of Findings:
The hypothesis is accepted. Commentary size is directly proportional to segment length; but while larger segments attract more commentary, as shown by the positive slope, they are not necessarily more effective (see §5.1.6.5), as seen by the low value (<1.0) of the slope of the regression line.
5.1.6.5 Proposition B3: The ratio of size of comments received to size of review segment (effectivity) will decline in proportion to review segment size.
Short entries in lists and cells in tables are very sharply focused, and when they attract annotation, the annotations are likely to contain more information than the entry (effectivity > 1.0). The context of lists and tables is usually quite clear and contributes to their focus. When long segments such as paragraphs receive annotation, the annotations are likely to contain less information than the segment.
Operationalization:
Size of comments and review segments are both established by software that
counts the words of more than three characters.
Data conditioning:
For this analysis general comment segments were excluded, as they are not
focused review segments. Segments that applied to graphic images were
removed because the number of words in the graphic segment is simply the
number of words in the title, and a picture is indeed often worth a
thousand words. At this point outliers were examined and one more point
was removed. This outlier was a document section heading that drew much
commentary from the review segments within the section. Making a section
heading a review segment is an error on the part of the facilitator;
section headings are for ease of reading and are devoid of real content.
Data analysis:
The remaining segments that received comments were selected and two
columns were produced by database query: size of the segment and the
summation of the size of the commentary on the segment. This table was
imported into the spreadsheet. For each segment, the size of the
commentary was divided by the size of the segment to yield effectivity. A
column was created for the effectivity. An XY scattergram was produced
with segment size on the X-axis and effectivity on the Y-axis. A
correlation of 0.451 on the logarithmic regression line confirms the
hypothesis. With 184 degrees of freedom a value of F = 46.8 was found. As
expected the slope was negative with P = 1.1 x 10^-10. The P value of the intercept was 1.1 x 10^-18.
Discussion of Findings:
The hypothesis is accepted, with strong indications that effectivity
decays logarithmically rather than linearly. This hypothesis is also
supported by style guides for printed text (Zinsser 1980, 111; Strunk and White 1979, 15) and for the WWW (Nielsen 2000, 110 et seq.). Long paragraphs are problem-laden when reading
from a screen: scrolling may be required, especially when small displays
are used and when the user has the font size increased to compensate for
poor eyesight. When the user has set the window to single column width,
even moderate length paragraphs may need to be scrolled.
5.1.6.6 Proposition C1: Products similar to DocReview will emerge and will, by similarity, validate the design.
At least four other web-based annotation products have been put into service. One of these (Third Voice) was forced to withdraw after it was subjected to numerous lawsuits centered on copyright issues, specifically allowing anyone to copy any publicly available web page on someone else's web site for annotation.
Since DocReview's debut in 1995, three similar products have emerged: Living Documents in 1998, PageSeeder in 2000, and QuickTopic in 2001. The four products may be compared on a set of core features. The core features are: notification service, in-line commentary option, security, segmentation flexibility, comments on comments, general comments, and review all comments.
Operationalization:
The four products are compared on a set of core features.
A DocReview demo may be used at http://faculty.washington.edu/~bkn/DocReview/review.cgi?name=DrDemo.
Several Interactive Papers may be examined at http://lrsdb.ed.uiuc.edu:591/ipp/.
A Document Review may be examined at http://www.quicktopic.com/6/D/QXx3sZA2kptQpnq9Rqwv.html.
A PageSeeder demo may be used at http://ps.pageseeder.com/ps/ps/demos/tryit/choco/choco.pshtml.
Discussion of Findings:

5.1.6.7 Proposition D1: Higher quality documents will attract more participation.

Document quality may be categorized on an ordinal scale: degree of completion, ranging from conceptual sketches to completed canonical documents. We have categorized the documents on a five-valued quality scale.

Operationalization:

Data Conditioning:

Data Analysis:
Feature | DocReview | Living Documents | QuickTopic | PageSeeder
Notification Service | Yes | No | Yes | Yes
In-line Commentary | Yes, click for alternative format. | Yes, by request. | No | Yes, no other alternative format.
Security | Yes, your server. | Yes, your server. | By obscure URL. | Yes, commercial service.
Segmentation Flexibility | Yes | No | No, paragraphs and list elements only. | No, chunks only.
Comments on Comments | No, by design. | Yes, three deep. | No, by design. | Yes, unlimited.
General Comments | Yes | Yes | Yes | No
Review all comments | Yes | No | Yes | No
DocReview's design has been validated by the similarity of several
commercial and academic products that were developed in the five years
following DocReview's original release.
Participation is considered equivalent
to effectivity and is operationalized as the ratio of the sum of comment
size to the size of the base document. There were three document types
represented: types 2, 3, and 4.
DocReviews without comments were
discarded. A DocReview with segments containing graphics was excluded due
to the low word count in the segments, and the heavy annotation of those
segments.
The DocReviews that received comments were analyzed and two columns
were produced by database query: size of the base document and the
summation of the size of the commentary on the DocReview. This table was
imported into the spreadsheet. For each DocReview, the size of the
commentary was divided by the size of the base document to yield
effectivity. A column was created for the effectivity. An XY scattergram was produced with base document size on the X-axis and effectivity on the Y-axis. Five effectivity distributions were studied: all DocReviews by
document type, meeting minutes (most of the type 3 documents), and all
DocReviews less the meeting minutes.

Studying the distributions of the three types shows three very distinct populations: type 2 with very strong logarithmic decay of effectivity with increasing base document size, type 3 documents with a very low effectivity and an almost random distribution, and type 4 documents falling between the two.
Type | DocReviews | Base document words | Comment words | Effectivity
Type 2 | 10 | 1302 | 696 | 0.535
Type 3 | 58 | 27636 | 3181 | 0.115
Type 3 w/o minutes | 8 | 4433 | 909 | 0.205
Type 4 | 10 | 8581 | 2914 | 0.340
All Types | 78 | 37519 | 6791 | 0.181
Type | df | F | P(slope) | P(intercept) | R | Std Err
2 | 9 | 22.2 | 0.0015 | 3.3 x 10^-8 | 0.858 | 0.593
3 | 57 | 0.001 | 0.966 | 0.644 | 0.0057 | 0.117
4 | 9 | 8.72 | 0.018 | 0.013 | 0.722 | 0.658
Type 3 documents are working drafts, in the data examined here either
position papers submitted for a workshop or minutes of weekly group
meetings. Meeting minutes are a highly stable and consistent genre that
does not attract much discussion, unless discussion topics were not
reported or were reported incorrectly. All the meeting minutes were
consistently formatted and prepared by only three people. They were
separated from the position papers and examined and the effectivity was
found to be essentially randomly distributed (R = 0.05) with respect to document length.
Based on the finding that meeting minutes formed an essentially random cluster of data points that was well distributed at the knee (document size 200-800) of the logarithmic regression line, it was decided to plot all DocReviews except the meeting minutes. This distribution contains documents (n = 28) that are more likely to stimulate substantive dialog.
A correlation of 0.714 on the logarithmic regression line confirms a strong negative logarithmic relationship. With 27 degrees of freedom a value of F = 27.1 was found. As expected the slope was negative with P = 1.98 x 10^-5. The P value of the intercept was 1.7 x 10^-6.
Discussion of Findings:
The hypothesis is soundly
rejected. It is clear that less finished documents attract more
participation than do more polished documents. This is likely due to the
presence of more opportunities for change through collaborative
critique.
5.1.6.8 Proposition D2: The nature of social commentary will vary with the type of document.
It is expected that the more formal nature of higher quality documents will evoke a more formal commentary as opposed to the informal and preliminary nature of the less mature documents.
Operationalization:
The social character of the comments is
operationalized as the distribution of the Bales codes categories for each
of the document types. The Bales Interaction Process Analysis categorizes
all speech acts, including gestures, into twelve codes. Many of the Bales
codes are specific to face-to-face dialog, so we must eliminate those
codes in order to make a comparison. Bales grouped the twelve codes into
four categories that are generic and form a good basis of comparison.
These categories are: Social-emotive area: positive (positive
reactions), Task area: positive (problem-solving attempts), Task area:
negative (questions), and Social-emotive area: negative (negative
reactions). The central two categories are further generalized into a
supercategory of the task area, while the extremes are generalized into
the social-emotive area.
For each of the four Bales categories, the percentages of commentary codes by document type (n = 3) are graphed.
Data conditioning:
None.
Data Analysis:
The Bales category distributions of
DocReview annotations by document type demonstrate that the annotations
are almost never negative reactions. The annotations that show positive
reactions are more often directed to the more finished documents (type 4)
than to the working and rough drafts (types 3 and 2). Questions are asked
over twice as often in type 2 (rough) documents as in type 4 (finished
documents).
We find that the null hypothesis that there will be no difference in the Bales category distribution between document types can be rejected. With six degrees of freedom, chi-squared = 46.5. This result is significant at p < 0.000001.
Discussion of Findings:
Finished documents are viewed more
positively than rough documents in DocReview. Most commentary is directed
toward problem solving.
5.1.6.9 Proposition D3: The nature of substantive commentary will vary with the type of document.
High quality documents such as Research Web Essays (type 4) will attract relatively few negative comments, just because the documents are likely to contain few errors and omissions. On the other hand speculative documents (type 2) are likely to attract negative commentary due to their incomplete and unfinished nature. Working documents are likely to occupy an intermediate position.
Operationalization:
The substantive character of the
comments is operationalized as the distribution of the Meyers
structurational argumentation codes categories for each of the document
types.
Data conditioning:
None.
Data Analysis:
Of interest is the distribution of reinforcer percentages among the types of DocReviews. The more polished (types 3 and 4) documents draw over twice the percentage of reinforcers that the rough (type 2) documents do. This distribution is weakly mirrored, inversely, by a lower percentage of promptors in the polished documents as compared to the rough documents.
We find that the null hypothesis that there will be no difference in the Meyers argumentation code category distribution among the document types can be rejected, but only very weakly. With four degrees of freedom, chi-squared = 3.92. This result is significant only at p < 0.5.
Discussion of Findings:
The distribution of argumentation
categories is only weakly contingent on document type. There are
indications that polished documents will attract more agreement and
somewhat fewer objections than rough documents.
5.1.6.10 Other Findings
Exponential decay of multiple comments is seen: the regression line shows a correlation of 0.941 for classes of comment counts from 0 to 6.
5.1.7 Conclusions
The substantive nature of dialog in DocReviews [prop A2] is very concentrated in constructive disagreements with the statements in the DocReview. Conversely, agreements are much less frequent than in face-to-face dialog. Most of these agreements include amplifications. This finding reinforces the similar findings in the study of the social nature of the dialog [prop A1].
Findings related to the size of the base document and the segment size found that the effectivity of the DocReview decays logarithmically with increasing base document size [prop B1]. Commentary size is directly, but not strongly, proportional to segment size [prop B2]. The effectivity of a review segment shows logarithmic decline with increasing segment length [prop B3]. This finding indicates that the document segmentation strategy should avoid long segments.
Analysis of the descriptive statistics on document size shows that annotations are significantly longer on more finished documents (type 4), perhaps reflecting the willingness to spend more time on "serious" documents, and shortest on working documents (type 3). Annotations on rough documents (type 2) fall into an intermediate length class, perhaps because those documents need more work to bring them to acceptable quality.
Comparing DocReview to roughly comparable products shows
that no important features were overlooked in DocReview, though no product
has implemented the features just as DocReview has [prop C1]. This
convergence of design demonstrates that DocReview's design is in the
mainstream. The differences in design implementation are largely due to
differences in audience and commercial aspirations.
An attempt to measure the effect of base document quality on the effectivity (the ratio of words of commentary to words in the base document) of the DocReview found [prop D1] that (with exceptions) the effectivity of documents declined with increasing quality, corroborating the findings of prop B1. Measuring the effect of base document quality on the social nature of the dialog showed comparable distributions among the Bales categories [prop D2] in all document types. The minor differences speak perhaps more to the consistent categorization of documents than to the significance of the differences. In the case of substantive dialog (Meyers codes), comparable distributions were seen [prop D3]; however, there was an apparent, but insignificant, increase in agreements (reinforcers) with increasing quality. A corresponding decrease in objections was also seen.