PROJECTH OVERVIEW

APPENDIX

ProjectH:
A Collaborative Quantitative Study
Of Computer-Mediated Communication

Sheizaf Rafaeli, Fay Sudweeks, Joe Konstan, Ed Mabry

Contents

Introduction
Aims
Computer-Mediated Discussion Groups
Ethics
Copyright
Sampling
Coding
Reliability
Access
Acknowledgments

Introduction

A large group of people from several countries andmany universities collaborated for a period of two years (1992-1994) on a quantitative study of electronic discussions. The research group was coordinated by Sheizaf Rafaeli, Hebrew University of Jerusalem, Israel and Fay Sudweeks, University of Sydney, Australia. Members of the group included researchers from several dozen universities, representing numerous academic disciplines, who got together to use the net in order to study use of the net.

This appendix describes the design of the study and the methodology used to create the first, and perhaps only, representative sample of international, public group computer-mediated communication (CMC).

Aims

The alternatives in studying group CMC are numerous. One can use quantitative or qualitative methods. One may study societies, organizations, groups, coalitions in groups, individuals, or single messages. One may study cross-sectionally, or across time. The choice, of course, should be informed by intellectual interest, availability, reliability and validity concerns.

In our case, we perceived the largest opportunity residing in three facts: (a) we were a large group; (b) one-shot, one-list studies have been done numerous times; and (c) a focus on the self-reports of participants (which typifies much of the literature) still needs validation from less obtrusive studies of the content.

The aims of the study were:

to randomly sample a sizable chunk of publicly available, archived computer mediated group discussions
to analyze the content of messages contained in the sample
to focus on the single message, authors, aggregate thread and the lists as units of analysis
to empirically test hypotheses of interest to members
to collect descriptive data to document the state of the medium and the communication over it
to create a shared database to serve future cross-method, cross-media or historical analyses
to conduct research in a manner unprecedented so far working with a group of people diverse in interests, time, age, status and location.

We focused on the single message, the aggregate thread and the lists themselves. We randomly sampled a sizable chunk of publicly available, archived computer mediated group discussions and analyzed the content of the messages within the sample.

We chose a quantitative methodology because we viewed it as dovetailing the large number of experimental (laboratory based) studies of CMC, and the plethora of nongeneralizable surveys of single groups. We chose a content analysis method that is less sensitive to self report. We chose to harness our numbers to produce a cross-list, cross-time account. And we chose to not limit the range of research questions and hypotheses that can be accommodated within the study.

Computer-Mediated Discussion Groups

Decades ago, McLuhan (1964) foresaw a global network creating a global village. It turns out that the "global village" is neither global nor village. The organizing principle is a loosely coupled entity or group, which we will call "list". Each list is a virtual neighborhood, defined by common interest not geography. The networks within which the lists reside come in several flavors. Bitnet, an interuniversity network, has operated since the early eighties and its constituents are mostly academics and students. Internet, an amorphous network connecting thousands of regional networks, is the most rapidly growing and widest spread network. It has a mixed audience of both universities and a growing set of commercial affiliates. CompuServe is a commercial, privately-owned network. Following is a description of lists and their manifestations on different networks.

Discussion groups on Bitnet are called "lists" because most groups are handled by a Listserv program which holds a subscription list of electronic mail addresses. Mail sent by a subscriber to the list address is distributed to all other subscribers by the program. Access to Listserv (or similar) software is usually the only prerequisite for creation of a new list. As lists can be created at the whim of a single network user, a one-layer structure and a "free-for-all" attitude characterizes Bitnet groups.

CompuServe groups have a two-layer structure: SIGs and sections. Discussion groups are called "SIGs" (Special Interest Groups) and each SIG has a collection of subgroups called "sections". There are approximately 10-20 sections in each SIG on a diversity of subtopics. Creation of a new group is an expensive and complicated procedure so there is usually a substantial user base before a new group is formed.

Usenet groups, primarily on Internet, are generally referred to as "newsgroups". Newsgroups are multilayered and hierarchically structured. Whereas individual users can subscribe to any Bitnet or CompuServe group and receive messages in their personal mailbox, access to newsgroups varies at each site. A site has to receive the news "feed" and users read messages with some kind of reader software. The number and types of newsgroups held, therefore, is influenced by administration and censorship policies, and amount of storage space. The creation of new core newsgroups (i.e. those in the comp, misc, news, rec, sci, soc and talk hierarchies) is more structured than Bitnet, requiring 100 signatures and a demonstrated substantial audience. In addition to the core hierarchies, there are alternative and special-purpose ones (e.g. alt, k12, iee, bionet) in which groups are created under different, and sometimes looser, guidelines.

The depth of interactivity varies widely among discussion groups. Some groups are like cocktail parties with many conversations (threads) competing, rather like CB radio; some focus around specific topics ranging from postcard collecting to yacht design; some are like noticeboards in the local grocery store where messages are pinned and left for others to read and comment on; and some groups merely function as newspapers, disseminating electronic journals or computer programs, and advertising conferences or job vacancies. Many people are content to just read and listen, even in the most interactive groups, while a relatively few dominate conversations.

Ethics

A quantitative analysis of the aggregate of publicly available, archived content of large group discussions that occurred voluntarily is subject to fewer ethical concerns than other types of analyses. Nevertheless, ethical issues were raised: is there an ethical obligation to inform list owners and/or subscribers prior to sampling? is public discourse on CMC public? does the principle of "expectation of privacy" apply?

We invested extraordinary effort to compromise on a policy that all could accept as a framework for ethical and scholarly research. An Ethics Committee drafted a policy (Figure 1) and initiated a formal online voting process in which an overwhelming majority of members voted in favor of the policy.

Ethics Policy

Members of the ProjectH Research Group acknowledge and affirm the individual rights of informed consent, privacy, and intellectual property. We are all committed to reducing censorship and prior restraint. We believe the issue of informed consent of authors, moderators and/or archiving institutions does not apply to the ProjectH quantitative content analysis, as we intend to analyze only publicly available text. We believe public posts are public and their use is governed by professional and academic guidelines.

Each member of ProjectH will ensure that his/her participation in this project, data collection and analysis procedures does not violate the standards of his/her own institution's Human Subjects Committee or equivalent.

In this project, we will use only texts:

that are posted to public lists

that are publicly available

In the quantitative content analysis data collection process, the ProjectH group as a whole will observe the following policy regarding 'writers' (authors of messages in our sample), 'messages' (obvious), and 'groups' (the collections of contributors and readers of content in computer-mediated contexts.

Informed consent will not be sought in advance for the quantitative content analysis of publicly available messages.

No individual writer will be identified by name in either data collection or data set, unless that writer has been contacted, and her/his consent was obtained in writing.

Except for short excerpts of 1 or 2 sentences, no messages will be quoted, in any data set, paper or publication, unless the author of the message was contacted and her/his approval was obtained in writing.

Statements and findings about groups of contributors will avoid identifying individuals.

We will take all measures necessary to separate names of authors and groups from any data collected, measured, or assessed. Individual authors will be identified only by a number. The association of person and identifying number will be kept confidential.

Figure 1. ProjectH Ethics Policy

Copyright

Questions were also raised about intellectual ownership and copyright: who owns the messages that are sent to a discussion list? who holds the copyright? As we were using public data, we were committed to conducting the study publicly and making the data, eventually, available to all. The processed data is the intellectual property of members participating in the work and the ProjectH Research Group holds the copyright. Access to and use of the data set was on a hierarchical basis according to contribution rates. After a two-year exclusive access period by ProjectH members, the data set is available to the public at ftp.arch.usyd.edu.au/pub/projectH/dbase or via the web at http://www.arch.usyd.edu.au/~fay/projecth.html.

A Copyright Committee drafted a formal policy (Figure 2) which was accepted unanimously by members.

Copyright Policy

The content analysis data produced by the collaboration of ProjectH members is subject to the following conditions.

The processed data, defined as the data that is pulled together, cleaned, and in any way compiled from the raw data, is the result of considerable effort by members of the ProjectH Research Group, and is the intellectual property of ProjectH members participating in the work.

The data is copyright to "ProjectH Research Group" and included in the copyright notice will be "Coordinators: S. Rafaeli and F. Sudweeks; Members: [full list of current members]".

Any individual or group who uses the processed data, either in part or in full, must acknowledge the source of the data as "ProjectH Research Group, Coordinators: S. Rafaeli and F. Sudweeks; Members: [full list of members]" or simply "ProjectH Research Group, Coordinators: S. Rafaeli and F. Sudweeks".

Initial access to the processed data is dependent upon participation rate. Access is granted as follows:

Senior ProjectH members have immediate access to and use of data, subject to conditions 3 and 5. Senior membership is achieved by substantial contribution to the quantitative research project. Substantial contribution is deemed to be coding a complete list sample (100 messages) in addition to pretest coding, development of codebook and/or membership of a ProjectH committee.

Junior ProjectH members have access to and use of data six months after the data set is finalized, subject to conditions 3 and 5. Junior membership is achieved by minimal contribution to the quantitative research project. Minimal contribution is deemed to be participation in pretest coding, development of codebook and/or membership of a ProjectH committee.

ProjectH members who have not contributed to the quantitative research project have access to and use of data eighteen months after the data set is finalized, subject to conditions 3 and 5.

The data will be made available for public access and use twenty-four months after the data set is finalized subject to conditions 3 and 5.

ProjectH members who have access privileges may release data to their graduate research students or collaborators, subject to condition 3.

Access by person(s) other than specified in conditions 4 and 5 is considered on a case-by-case basis by the Copyright Committee. Appeals against Copyright Committee decisions are brought before the current ProjectH members and decisions overruled by 60% of members.

The processed data is stored on an ftp site with restricted (non-anonymous) access.

Any participant(s) who is about to commence a research project based solely or principally on the data, is required to register the general nature of the research with the ProjectH coordinators. A list of current research projects and principal investigators will be available for FTP with updates sent to ProjectH monthly. If requested by principal investigators, and approved by the coordinators, details of the research project can be kept confidential. Neither coordinators, nor ProjectH, may censor or censure any topic, or in any way interfere or hinder the academic freedom of any investigator.

Any person producing a paper, article, chapter, report, monograph or book from the processed data, either in part or in full, is to notify the ProjectH Research Group. In addition, it is requested that any or all papers based on this data be submitted in ASCII and/or postscript to the ftp repository.

The codebook, which is the product of considerable effort by members of the ProjectH Research Group, is the intellectual property of all ProjectH members. The codebook is copyright to "ProjectH Research Group" and included in the copyright notice is "Coordinators: S. Rafaeli and F. Sudweeks, Members: [list of current members]". Any individual or group who uses the codebook must acknowledge the source as "ProjectH Research Group, Coordinators: S. Rafaeli and F. Sudweeks; Members: [full list of members]" or simply "Project Research Group, Coordinators: S. Rafaeli and F. Sudweeks".

The annotated bibliography, which is the product of considerable effort by members of the ProjectH Research Group, is the intellectual property of all ProjectH members. The annotated bibliography is copyright to "ProjectH Research Group" and included in the copyright notice is "Coordinators: S. Rafaeli and F. Sudweeks; Members: [list of current members]".

Figure 2. ProjectH Copyright Policy

Conceptualization

The initial, conceptual stage of the study comprised deliberating on the unit of analysis, generating hypotheses and writing a codebook. We decided to focus on three units of analysis: the single message, the aggregate thread, and the list themselves. Research questions were many and varied, and included:

What are the characteristics of longer and lasting threads? Does longevity relate to number of participants, pace of discussion, interconnectedness of messages, amount and nature of metacommunication, emotic communication, interactivity/resonance/chiming?
Are "communities" formed on CMC lists, and if so, how? Can social "density" be measured? Can it be predicted, and/or manipulated by structural qualities of the list? Are any of the previously mentioned variables related to community formation? How? Can one discern the emergence of leadership in CMC groups? Is leadership related to talkativity?
How do "free" or "subsidized" lists compare with costly ones.
Are there measurable differences between professional, academic and recreational lists.
How does editorial intervention (moderation, collation, leadership, censorship) affect the nature of CMC?
The gender issue, and all of the above questions. Historically, CMC studies documented almost only male participation. This has clearly (and positively) changed.
The metacommunication concept/problem: How big is it? Is this the real downside of e-groups? Is it really a problem? How does it relate to social vs. task breakdowns? How does metacommunication interact (statistically) with length of thread, intensity of social connection? Do all threads disappear down the metacommunication drain?
What is the relative role (in collaboration, community formation, thread length, etc.) of asking vs. telling, of information provision vs. information demand?
When and where does "flaming" occur? Is it dysfunctional? If so, how is it dysfunctional?
Are there repeating patterns in the "life" of a group, list, thread?
How is the expression of emotion handled?
What is the role, frequency and place of shorthand, innovative forms of expression such as emoticons, smileys?

To accommodate the broad range of questions of interest, many of us chose one or more variables and described a method for measuring the quality(ies). The variables, with accompanying definition, extreme case examples, and measurement scale, were collated and formed the codebook. The codebook was pretested, assessed for reliability of measures and ambiguity of definitions and modified accordingly. The final comprehensive version of the codebook has 46 variables.

Sampling

Selecting a random representative sample of discussion groups was an important phase of the study. Initial discussions revealed divergent opinions on the virtues of random and stratified sampling. A Sampling Committee, representing the spectrum of sampling persuasions within the group, drafted a Sampling Statement (Figure 3) which was adopted by the ProjectH members.

Sampling Statement

Objectives and Constraints

The objectives of the sampling strategy are many and conflicting. Among the more critical objectives are:

Maintaining enough randomness to allow conclusions about as broad a range of CMC as possible.

Obtaining enough data from each group (newsgroup or list) to draw conclusions about the group.

Sampling a wide range of groups with diverse characteristics. Among the characteristics of interest to some of us are:

readership and authorship

list volume (messages per day or week)

average number of concurrent threads

average duration of threads

type of group (i.e., technical, recreational, etc.)

type of distribution (i.e., free vs. paid)

Learning about CMC and human interaction.

At the same time, we operate under certain constraints:

Limited human resources both for coding and for analysis of the types of groups.

Limited availability of data, both list contents and list statistics.

The Sampling Continuum

A sampling strategy, given the objectives stated above, lies on a continuum between random selection and stratification. We believe that the constraints posed above will limit us to 50 or 60 groups. We considered two extreme proposals:

Complete random sampling. Just pick any groups from any of the lists. This has the advantage of randomness, but the disadvantage of likely leading to the selection of inappropriate groups (perhaps groups with only announcements, automated postings, or test messages), and might well result in a sample that is poorly representative of the entirety of the networked experience. This is particularly the case on Usenet, for example, where there are relatively many low-volume groups and relatively few high-volume ones.

Heavy stratification. Select a set of strata and sample from within the strata. For example, given 60 groups, we would be sure to select 30 high-volume and 30 low-volume. Perhaps 20 each from Compuserve, Bitnet, and Usenet. And so forth. This has the clear problem that we would be unable to select much randomly, and even a few strata would lead to unacceptably few measures per category.

Accordingly, we examined the following compromises:

Weighted random sampling with a weighting factor based on the volume, authorship, and readership. We concluded that we did not yet know enough about the domain to derive a meaningful weighting function that would capture the "normality" of a group.

Purely random sampling. This had the problem that we would not be likely to sample enough groups from certain domains (i.e., Compuserve) to draw conclusions about the difference between pay and free services.

Random sampling over a more restricted domain with stratification by the type of list. This strategy limits the groups under consideration to exclude:

foreign language lists

local lists

announcement lists

help/support lists for specific products

test and control groups

lists whose contents are only excerpts of other lists selected by moderators

extremely low volume lists (i.e., lists with fewer than 25 messages and 3 authors during a selected test month)

The stratification will select equal numbers of lists from Compuserve, Bitnet, and Usenet. If the number of lists is not a multiple of three, the extra lists will be selected randomly from all groups.

It is this final strategy which we propose to adopt.

We propose to select randomly from all lists and reject those meeting the exclusion criteria above. Where possible, this rejection will be accomplished in advance by not considering clearly inappropriate groups. Otherwise, groups will be rejected as they are chosen. Lists that are primarily flames or other "degenerate" cases will be accepted and coded as long as they meet these criteria on the grounds that they too hold interesting scientific results and may be reflective of a segment of the CMC experience.

Once lists are selected, we will sample 100 messages or 3 days worth of messages, whichever is greater. This is to allow us to observe and code threads with sufficient time for e-mail lag and response. The selection period shall begin on a randomly selected Monday for which message data is available. While we considered pure random selection, we consider it unwise to try to compare weekend data with weekday data until we have a better understanding of the domain. Weekend data will be included in most low and medium volume groups. In addition, we will pre-process an additional 100 messages or 3 days worth of messages, whichever is greater, BEFORE the sampling region to provide extra thread and author information for coding.

Precoding

To assist coders and provide greater information, we will be pre-coding messages, including both messages in the sample and those before it, to identify authors and subject classifications. With each batch of messages, coders will get a list of authors with author ID numbers and a list of subjects with subject ID numbers. These numbers will be unique across the entire study to allow us to exploit the opportunity should authors participate in multiple lists or should a thread exist in or move across several lists. To the extent possible, this process will be automated and will simplify coding for each coder.

List Statistics

In addition to the message coding statistics, we will attempt to obtain list statistics. Of particular interest are the following, though additional ones are likely to be added:

Average number of postings per day in a one-month period

Number of authors in a one-month period

Number of readers in a one-month period

Average message length

Average thread length (# of messages)

Average length of threads longer than 2 messages

Average thread duration (# of days)

Average duration of threads longer than 2 messages

% of messages in threads

Editorial status (moderated, unmoderated)

Topic (Academic, Technical, Social, etc.)

Age of List (New, Old)

Figure 3. ProjectH Sampling Statement

List traffic is dynamic. Some groups are highly active, generating in excess of 200 messages a day; other groups are almost dormant, generating far fewer than 200 messages a year; some groups maintain a consistent volume of traffic; other groups experience high peaks and low troughs. Sampling an equal number of messages from selected groups has the advantage of capturing threads. Sampling over an equal time period has the advantage of typifying group activity. Rather than risk having to reject a high percentage of groups because we happened to sample during a quiet period, we compromised on the combination of numeric and time measures: 100 messages or 3 days worth of messages, whichever was the greater, beginning on a randomly selected Monday. Unexpectedly, few of the selected groups had 100 messages in less than 3 days so a standard numeric measure of 100 messages per list was used.

Populations of groups were compiled. A list of all known Bitnet lists was obtained from Listserv@gwuvm.Bitnet with a "LISTS GLOBAL" command. Four lists of Usenet newsgroups which are updated periodically were FTP'd from rtfm.mit.edu:

List_of_Active_Newsgroups,_Part_I
List_of_Active_Newsgroups,_Part_II
Alternative_Newsgroup_Hierarchies,_Part_I
Alternative_Newsgroup_Hierarchies,_Part_II

CompuServe groups presented a methodological complication. There is no available list of CompuServe sections so the CompuServe population is a list of SIGs, giving a deceptively low percentage of CompuServe groups.

Groups which were clearly in the categories to be excluded were filtered out prior to random sampling:

	Bitnet	Usenet	CompuServe
Pre-filtered groups	3485	1868	337
Post-filtered groups	1907	986	94

A C program generated a specified number of random numbers within a specified range and to match the generated numbers against post-filtered populations of groups. Twenty groups were selected from each of three network.

The sampling period began on Monday 15 March 1993 and volunteer members shared the task of downloading. Bitnet lists were sampled using a DBase program. Internet newsgroups were downloaded from Usenet news. Articles were collected from news servers at the Royal Institute of Technology, Stockholm, Sweden; University of Minnesota, USA; University of Western Sydney, Nepean; and University of Sydney, Australia. Articles were collected according to the date and time of arrival at each news server.

Many of the selected groups did not fit the restricted domain nor meet the set criteria so we dipped into the population hat again (and again). In all, 77 Bitnet lists, 39 Usenet newsgroups and 23 CompuServe SIGs were selected to get samples of 20 groups for each network. For CompuServe, the unavailability of section lists accounts for the high "hit rate". As each SIG contained a dozen or more subgroups, a secondary random process was applied. A section was selected from each SIG using a random number procedure. CompuServe corpora, then, are randomly selected sections from randomly selected SIGs.

Coding

Each batch of 100 messages downloaded from selected groups was prepared for coders. Programs were written to:

split files of 100 messages into individual files
renumber, if necessary, in numeric alphabetical order
precode the first six variables: CODERID, LISTID, MSGNUM, AUTHORID, MSGTIME and MSGDATE
compile a cumulative database of authors across all lists
reassemble messages in one file

Numerous universal systems for coding were considered and rejected as coders varied in technical expertise, access to technology and Internet resources, and working style. An enterprising member, using the catch phrase "if we build it will you come?", headed a technically skilled committee to develop standard coding formats for different platforms Hypercard stack for Macintosh, FileExpress database for DOS, and templates for text editors and wordprocessors.

After coding, data was exported as ASCII and emailed to an account dedicated to data processing. A C program and a suite of awk scripts verified and manipulated the data. The automatic processor involved five stages:

Check if incoming mail is data. Key strings were used to identify incoming mail as a data file. If one of the key strings were found, then the file was processed as data. If a string were not found, the processor assumed the mail to be regular, and ignored it.
Check for errors. Each mail message determined by the processor to be data was checked for errors, e.g. values out of coding range, missing values, wrong message numbers, non-numeric codes. Data with errors were returned to the coder.
Check for completeness. As each new list was processed, a unique subdirectory was created and error-free coded messages were transferred to the subdirectory as separate files. When the list was complete (i.e. 100 error-free coded messages as 100 files), the codes were transferred to databases.
Manipulate the database. Data was added to databases of two format types - with and without comma-delimiters for fields. In each case, each line is one message.
Report to coder and coordinator. Mail with processable data generated automatic error and completion status reports; unprocessable data was returned to the coder. A copy of all reports was sent to the coordinator and the system maintained a log file of all incoming and outgoing mail.

For each list coded, a questionnaire was completed to gather descriptive information about the coders, the technology used, impressions of the list, and problems experienced.

Reliability

Reliability assesses the degree to which variations in data represent real phenomena rather than variations in the measurement process. Once again, we followed the same procedure for attaining consensus on a methodological process. A Reliability Committee drafted a Reliability Statement (Figure 4) which was adopted by ProjectH members.

Reliability Statement

The following statement on reliability represents a month of intense discussions and a compromise among the many and varied opinions of the "reliability group". We consider, however, it is sufficiently flexible to satisfy both casual inquirers and restrictive publishing standards.

There are a number of ways to collect reliability data. We considered two that are proposed by Klaus Krippendorff (Content Analysis, Sage, 1980):

Test-standard: "The degree to which a process functionally conforms to a known standard." This involves training all coders to a standard set by "expert" coders and accepting as coders only those who code to the preset level of accuracy.

Test-test: "The degree to which a process can be recreated under varying circumstances." This involves using at least two coders for the same data to establish the reproducibility of results.

Given the unprecedented nature of our project, the unavailability of an established standard, and the number of coders involved, we propose to adopt a test-test design as follows.

Each coder must code the nine pretest messages using the pretest codebook. Completion of the pretest is a prerequisite for real coding. The purpose of this is to provide all an opportunity to complete a practice run, ask questions, realize problems, etc.

Everything will be coded twice. In other words, each 'list' (or batch) of 100 messages will be coded by two coders. We now have sufficient coding power (participants) to do this. It is crucial that each coder codes independently. Communication among coders introduces errors and makes data appear more reliable than they are. Independence of coding will be maintained as follows:

_ each `list' (batch of 100 messages) will be randomly assigned to two coders

_ the list assignment will be kept confidential

_ each coder will receive assigned lists privately

_ guidelines will be posted to ProjectH for avoiding coding discussions that threaten reliability

_ everyone is requested to ensure specific comments or quotes from messages are avoided in discussions with other group members, except "oracles" (see 4 below), either privately or publicly.

These strategies will provide us with full reliability figures.

We will set a threshold for an acceptable level of bi-coder agreement. In cases where this threshold is not reached, we will have a third coder deal with corpora/data. In other words, while all messages get double coded, we'll set a tolerable level of ambiguity. Any list (or pair of coders) that does not achieve that level of agreement, will be given to a third "blind" coder who will code the divergent variable(s). If the third coder codes the problematic variable(s) in a way that coincides with one of the two previous coders, then we accept the two consistent data. If the third coder's coding is different from both of the two previous coding attempts, we will use the original two coders' data and mark as 'unagreed'.

We will recruit a small number of "oracles" for sets of variables. Questions on the codebook will be directed privately to the oracle for that question. The question to the oracle may be specific and include quotes but the oracle will respond with a summary of general comments to ProjectH. We will also appoint a "Commissioner of Oracles" to coordinate this effort.

Figure 4. ProjectH Reliability Statement

For various reasons, 40% of potential coders were unable to code. Of the 37 lists (batches of 100 messages) distributed, 20 were single coded, 12 were double coded, and 5 were not coded. Of the 32 coded, 4 were unfinished, giving a final tally of 20 single coded and 10 double coded lists (Table 1). The database(s), therefore, has a total of 4000 messages from fully-completed lists, of which 3000 are unique. In addition, there are 322 messages from 4 unfinished lists.

Table 1. Single, double and partially coded lists

	Bitnet	Usenet	Compuserve
Single coded lists	BLIND-L BONSAI BUDDHA-L CJ-L EMAILMAN HOCKEY-L LAWSCH-L LITERARY	alt.cobol alt.sexual.abuse.recovery comp.ai.genetic k12.ed.math k12.ed.comp.literacy rec.arts.startrek.current soc.college	COMIC PHOTOFORUM TELECOM UKFORUM WINEFORUM
Double coded lists	CELTIC-L	rec.folk-dancing rec.humor.funny rec.nud rec.radio.swap	CARS DISABILITIES EFFSIG FISHNET JFORUM
Partially coded lists	HOCKEY-L	comp.bbs.waffle rec.arts.startrek.current soc.veterans

It was important to maintain independence of coding, particularly those lists that were double coded. Independent coders, working in a defined (and confined) physical work context, typically are less accessible to one another on a day-to-day basis. Email access, however, bridges distances and schedule clashes, and puts coders communicatively closer to each other.

To eliminate a possible source of invalid (inflated) reliability, coders were discouraged from discussing coding problems amongst themselves or within the group. Coder queries were directed, instead, to an advisory committee of twelve members. Each advisor, or oracle, fielded questions on a section of the codebook, responding in a nondirective manner. The more complicated questions were discussed amongst the oracles and the leader (the Commissioner of Oracles) summarized the discussions and responded to the inquirer. The typical practice was for an inquiry to be posted to the group, the specialist oracle (or the Commissioner if the appropriate oracle was not available) would post a recommended response, all oracles would comment on the response, and the Commissioner would summarize oracle recommendations and post the final recommendation to the inquirer and/or the group.

Requests for oracle assistance were relatively low. Inquiries could be divided into four types: technical, confirmatory, enigmatic and interpretive. Technical questions related to the group's procedures for precoding, sampling and distribution; confirmatory questions related to apprehension about and applicability of coding categories; enigmatic questions involved some form of an apparent paradox; and interpretive questions dealt with matters of coding protocol intent. Answers were couched in analytical yet open-ended terms. Turnaround time on inquiries posted to oracles was 48-72 hours.

Access

On completion of coding, the following information was compiled and archived:

databases
data index (explanation of column/row numbers)
list of listids, coderids, listnames and network
corpora
list of authorids and author names
coder questionnaires
technical report

and available, in the first instance, to participants who coded at least 100 messages, agreed to comply with ethics and copyright policies, and outlined the precautions that would be taken to protect author identification and the database.

Acknowledgments

The ProjectH research was supported by the following:

Comserve (vm.its.rpi.edu) sponsored the project and unwittingly endowed it with a name. Sponsorship is granted to research groups whose activities fall within the ambit of Comserve's aims to promote CMC-related research. The sponsorship includes a private "hotline", Listserv services and disk space for archiving logs.
A grant from Compuserve provided access to archives and downloading time.
The Recanati Fund provided funds for some computing resources and coordinating time.
The network resources of the Department of Architectural and Design Science, University of Sydney, Australia was used extensively throughout the project: anonymous ftp site for archiving of ProjectH material (key documents, coding formats, database, papers related to the project), system aliases for distribution lists, an account for processing data, and disk space.

Most importantly, we acknowledge the enthusiasm, perseverance and expertise of ProjectH members who contributed their valuable time and skills to the collaborative project. Listed below are the participants of various phases of the project, and a full list of members.

Project Coordinators
Sheizaf Rafaeli, Hebrew University of Jerusalem, Israel; Fay Sudweeks, University of Sydney, Australia

Software Development
Joe Konstan, University of Minnesota, USA

Ethics Committee
Sheizaf Rafaeli, Hebrew University of Jerusalem, Israel; Fay Sudweeks, University of Sydney, Australia

Copyright Committee
Sheizaf Rafaeli, Hebrew University of Jerusalem, Israel; Fay Sudweeks, University of Sydney, Australia; Jim Thomas, Northern Illinois University, USA

Sampling Committee
Joe Konstan, University of Minnesota, USA (Coordinator); Bob Colman, Pennsylvania State University, USA; Sheizaf Rafaeli, Hebrew University of Jerusalem, Israel; Fay Sudweeks, University of Sydney, Australia; Bob Zenhausern, St Johns University, USA

Reliability Committee
Bob Colman, Pennsylvania State University, USA (Coordinator); Joe Konstan, University of Minnesota, USA; Bob McLean, Ontario Institute for Studies in Education, Canada; Sheizaf Rafaeli, Hebrew University of Jerusalem, Israel; Bill Remington, Middle Tennessee State University, USA; Fay Sudweeks, University of Sydney, Australia; Phil Thompsen, University of Utah, USA; Bob Zenhausern, St Johns University, USA

Mechanics Committee (Coding Formats)
Cheryl Dickie, York University, Canada (Coordinator); Pat Edgerton, University of Texas, USA (FileExpress (format for DOS) Developer); Bob McLean, Ontario Institute for Studies in Education, Canada (Hypercard Stack (format for Macintosh) Developer); Joe Konstan, University of Minnesota, USA; Ed Mabry, University of Wisconsin-Milwaukee, USA; Michael Shiloh, TRW Financial Systems, USA

Dowloading of Corpora
Ray Archee, University of Western Sydney, Nepean, Australia; Joe Konstan, University of Minnesota, USA; Clare McDonald, Royal Institute of Technology, Sweden; Michael Shiloh, TRW Financial Systems, USA; Lucia Ruedenberg, New York University, USA; Sheizaf Rafaeli, Hebrew University of Jerusalem, Israel; Bob Zenhausern, St Johns University, USA

Oracles

Ed Mabry, University of Wisconsin-Milwaukee, USA (Coordinator ìThe Commishî); Sharon Boehlefeld, University of Wisconsin, USA; Pat Edgerton, University of Texas, USA; Nancy Evans, University of Pittsburg, USA; Sandra Katzman, Stanford University, USA; Joe Konstan, University of Minnesota, USA; Clare McDonald, Royal Institute of Technology, Sweden; Judy Norris, Ontario Institute for Studies in Education, Canada; Carole Nowicke, University of Indiana, USA; Bill Remington, Middle Tennessee State University, USA; Michael Shiloh, TRW Financial Systems, USA; Macey Taylor, Marie Curie University,, Poland; Michelle Violanti, University of Kansas, USA

Distribution Committee
Bob Colman, Pennsylvania State University, USA; Joe Konstan, University of Minnesota, USA; Ed Mabry, University of Wisconsin-Milwaukee, USA; Margaret McLaughlin, University of Southern California, USA
Diane Witmer, University of Southern California, USA; Sheizaf Rafaeli, Hebrew University of Jerusalem, Israel; Fay Sudweeks, University of Sydney, Australia

ICA Panel/Conference/Workshop Committees
Ray Archee, University of Western Sydney, Nepean, Australia; Deanie French, Southwest Texas University, USA; Joe Konstan, University of Minnesota, USA; Ed Mabry, University of Wisconsin-Milwaukee, USA; Margaret McLaughlin, University of Southern California, USA; Ted Mills, University of Connecticut, USA; Diane Witmer, University of Southern California, USA; Sheizaf Rafaeli, Hebrew University of Jerusalem, Israel; Myles Slatin, State University of New York-Buffalo, USA; Fay Sudweeks, University of Sydney, Australia; Bob Zenhausern, St Johns University, USA

Coders
Shamir Ahituv, Israel; Ray Archee, Australia; Ross Bender, USA; Bob Boldt, USA; Amos Cividalli, Israel; Bob Colman, USA; Cheryl Dickie, Canada; Patrick Edgerton, USA; Kerstin Eklundh, Sweden; Scott Erdley, USA: Nancy Evans, USA: Sueli Ferreira, Brazil; Deanie French, USA; Stephanie Fysh, Canada; Peter Gingiss, USA; Dean Ginther, USA; Jay Glicksman, USA, Allen Gray, USA; Steve Harries, UK; Richard Henry, USA; Merebeth Howlett, USA; Sandra Katzman, USA; Marcia Kaylakie, USA; Mavis Kelly, Hong Kong; Ed Mabry, USA; Clare Macdonald, Sweden; Leland McCleary, Brazil; Margaret McLaughlin, USA; Robert McLean, Canada; Ted Mills, USA; Carole Nowicke, USA; Andriana Pateris, USA; Diane Witmer, USA; Sheizaf Rafaeli, Israel; Vic Savicki, USA; Ermel Stepp, USA; Michelle Violanti, USA; Gerry White, USA; Nancy Wyatt, USA

ProjectH Members (January 1994)
Ray Archee, Australia; Lecia Archer, USA; Ross Bender, USA; Alex Black, Canada; Sharon Boehlefeld, USA; Luiz Henrique Boff, USA; Bob Boldt, USA; Ingo Braun, Germany; Doug Brent, Canada; Jeutonne Brewer, USA; Mark Bryson, UK; Bill Byers, USA; Paul Chandler, Australia; Robert Christina, USA; Bob Colman, USA; Alicia Conklin, USA; Brenda Danet, Israel; Boyd Davis, USA; Cheryl Dickie, Canada; Patrick Edgerton, USA; Kerstin Eklundh, Sweden; Jill Ellsworth, USA; Scott Erdley, USA; Nancy Evans, USA; Nicky Ferguson, UK; Sueli Ferreira, Brazil; Peter Flynn, Ireland; Davis Foulger, USA; Deanie French, USA; Al Futrell, USA; Stephanie Fysh, Canada; John Garrett, USA; Peter Gingiss, USA; Dean Ginther, USA; Jay Glicksman, USA; Allen Gray, USA; John Gubert, USA; Kate Harrie, USA; Steve Harries, UK; Anne Harwell, USA; Richard Henry, USA; Ping Huang, USA; Noam Kaminer, USA; Sandra Katzman, USA; Marcia Kaylakie, USA; Mavis Kelly, Hong Kong; Yitzchak Kerem, Israel; Mary Elaine Kiener, USA; Elliot King, USA; Lee Komito, Ireland; Joe Konstan, USA; Joan Korenman, USA; Herbert Kubicek, Germany; Stan Kulikowski, USA; David Levine, USA; Mazyar Lotfalian, USA; Ed Mabry, USA; Clare Macdonald, Sweden; Richard MacKinnon, USA; Carole Marmell, USA; Yael Maschler, Israel; Leland McCleary, Brazil; Margaret McLaughlin, USA; Robert McLean, Canada; Rosa Montes, Mexico; Ted Mills, USA; Michael Muller, USA; Rosemary Nowak, Brazil; Carole Nowicke, USA; Andriana Pateris, USA; Diane Witmer, USA; Janet Perkins, USA; Tom Postmes, Netherlands; Sheizaf Rafaeli, Israel; Volker Redder, Germany; Bill Remington, USA; Bernard Robin, USA; Alejandra Rojo, Canada; Roy Roper, USA; Yehudit Rosenbaum, Israel; Laurie Ruberg, USA; Lucia Ruedenberg, USA, Vic Savicki, USA; Steve Schneider, USA; Rob Scott, USA; Myles Slatin, USA; Gilbert Smith, USA; Ermel Stepp, USA; Fay Sudweeks, Australia; Pat Sullivan, USA; Philip Swann, Switzerland; Macey Taylor, Poland; Jim Thomas, USA; Lin Thompson, Australia; Philip Tsang, Australia; Alexander Voiskounsky, Russia; Dadong Wan, USA; Wendy Warren, USA; Gerry White, USA; Jesse White, USA; Sabina Wolfson, USA; Marsha Woodbury, USA; Nancy Wyatt, USA; Kathleen Yancey, USA; Bob Zenhausern, USA; Olga Zweekhorst, Netherlands