Development of a Tool Measuring User Satisfaction of the Human-Computer Interface

John P. Chin, Virginia A. Diehl, Kent L. Norman
Department of Psychology, University of Maryland, College Park, MD 20742

John P. Chin, Graduate Research Assistant, chin@tove.umd.edu, (301) 454-7985

Virginia A. Diehl, Graduate Research Assistant, diehl@tove.umd.edu, (301) 454-5049

Kent L. Norman, Associate Professor, kent@umd2.umd.edu, (301) 454-6388

Number of words: Main Body of Text=2908
Keywords: User Satisfaction, User Interface Questionnaire, Design Tool
Topic: Interface Design Tools and Techniques
For paper presentation at SigChi'88

Running Head: User Satisfaction of the Human-Computer Interface

Abstract

This study is a part of a research effort to develop the Generic User Interface Questionnaire (QUIS). Participants, 150 PC user group members, rated familiar software products. Two pairs of software categories were compared: 1) software that was liked and disliked, and 2) a standard command line system (CLS) and a menu driven application (MDA). The reliability of the questionnaire was high, Cronbach's alpha=.94. The overall reaction ratings yielded significantly higher ratings for liked software and MDA over disliked software and a CLS, respectively. PC users rated MDA more satisfying, powerful and flexible than CLS. Future applications of the QUIS on computers and telephones are discussed.

Introduction

There are many possible ways to evaluate the human-computer interface. Shneiderman (1987) lists five different types of dependent measures. For many tasks, speed and accuracy are two related performance measures which affect a person's attitude toward the system. In addition to performance measures, the time it takes to learn a system and the retention of acquired knowledge over time are associated with how effectively a system can be used. User acceptance of a system (i.e., subjective satisfaction) is also a critical measure of a system's success. Even when a system is evaluated favorably on every performance measure, it may not be used very much because of the user's dissatisfaction with the system and its interface. A large number of questionnaires concerning the user's subjective satisfaction with the system and many related issues have been developed. However, few have focused exclusively on user evaluations of the interface. Shneiderman (1987) presented a questionnaire that directs its attention to the user's subjective rating of only the human-computer interface. This paper concerns the subsequent development of this measurement tool, called the Generic User Interface Questionnaire (QUIS). A brief review of the relevant literature will be presented, followed by a description of the development of the QUIS (versions 3.0 and 4.0) in a previous study. The present study, involving the administration of the current QUIS (version 5.0) to a large user group, will then be discussed.

Review of the Literature
In the past, several questionnaires have been developed to assess users' perceptions of systems. Recently, extensive literature reviews found several weaknesses in many of the subjective evaluation measurement tools (Chin, Norman & Shneiderman, 1987; Ives, Olson & Baroudi, 1983). Problems ranged from a lack of validation (Gallagher, 1974) to low reliabilities (Larcker & Lessig, 1980). Ives et al. (1983) reported the possibility of inflated reliability values due to respondents marking the same response for many of the questions. One study suffered from a small sample size (N=29) and a nonrepresentative population (Bailey & Pearson, 1983). On the other hand, another study had a very large number of respondents (N=4,597), in which 179 different systems ranging from micros to large mainframes were evaluated (Rushinek & Rushinek, 1986). Thus, the range of problems in questionnaire construction is diverse.

Past studies have examined the types of questions that would be appropriate for questionnaires. Root and Draper (1983) found that checklist questionnaires were not sufficient in evaluating systems since they did not indicate what new features were needed. Open-ended questions were suggested as a possible supplement for checklists. Coleman, Williges and Wixon (1985) found that users preferred concrete adjectives for evaluations. In addition, they found that specific evaluation questions appeared to be more accurate than global satisfaction questions.

In general, the research regarding questionnaires for evaluating computer systems has been steadily improving in quality and increasing in number. More studies are using larger samples across a larger number of different systems. Many have demonstrated more concern for reliability and validity issues. However, very few studies have sustained a long-term effort in the development of a questionnaire, and although many of the surveys consider several issues associated with general subjective satisfaction of the system, few if any directly focus on the interface. The research effort described in this paper is intended to address these issues.

Review of Developmental Process
The original questionnaire (version 2.0, Shneiderman, 1987) consisted of a total of 90 questions. Five questions were overall reaction ratings of the system. The remaining 85 items were organized into 20 different groups. Each of the 20 groups of questions had a main component question followed by related subcomponent questions. The short version of QUIS (2.0) had only the 20 main questions listed along with the five overall questions. Each of the questions had rating scales ascending from 1 on the left to 10 on the right and anchored at both endpoints with adjectives (e.g., inconsistent/consistent). These adjectives were always positioned so that the scale went from negative on the left to positive on the right. In addition, each item had "not applicable" as a choice. Instructions also encouraged raters to include any written comments. Although the 2.0 version was published by Shneiderman (1987), no empirical work had been done to assess its reliability or validity.

The questionnaire (version 2.0) was modified and expanded to three major sections in version 3.0. In section I, three questions were concerned with the type of system being rated and the amount of time spent on the system. In section II, four questions dealt with the user's past computer experience. Section III included the modified version of 2.0 containing 103 questions. Modifications included changing the rating scale from 1 through 10 to 1 through 9. This allowed for the use of zero as the code for "not applicable" responses. This change in the rating scale also established the format for future administrations of a computerized questionnaire; since each rating would require only a single keystroke instead of two keystrokes for the number 10, the response bias would be reduced.
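The single-keystroke coding described above can be sketched as follows. This is a minimal illustration and not the authors' software: the function names `decode_keystroke` and `mean_rating` are hypothetical, and the sketch assumes the version 3.0 convention in which the digits 1 through 9 are ratings and 0 is the code for "not applicable."

```python
from statistics import mean
from typing import Optional


def decode_keystroke(key: str) -> Optional[int]:
    """Map one keystroke to a rating (1-9), or None for "not applicable" (0)."""
    if len(key) != 1 or key not in "0123456789":
        raise ValueError(f"expected a single digit 0-9, got {key!r}")
    value = int(key)
    return None if value == 0 else value


def mean_rating(keys: str) -> Optional[float]:
    """Average the applicable ratings in a string of keystrokes (hypothetical helper)."""
    ratings = [r for r in map(decode_keystroke, keys) if r is not None]
    return mean(ratings) if ratings else None
```

Treating the "not applicable" code as missing rather than as a numeric zero keeps it from pulling item means downward.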

The generalizability of the questionnaire could be established by having different populations of users use the questionnaire to evaluate different types of systems. The program of development of the questionnaire included sampling respondents who were: 1) students, 2) computer professionals, 3) computer hobbyists, and 4) novice users. Moreover, it was important to administer the questionnaire under different experimental conditions: 1) strictly controlled experiments with a small number of subjects exposed to a system for a very short period of time, 2) less rigidly controlled manipulations with a medium number of participants who use a system for a limited time, and 3) a field study involving no control with volunteers who have used a system extensively. The examination of the characteristics of versions 3.0 and 4.0 was based on the data from students who used the evaluated system in a moderately controlled situation. The present study's domain of sampling included computer professionals and hobbyists who had extensive and uncontrolled use of the evaluated systems.

The developmental process of the questionnaire began with a large number of items in the questionnaire. The reliability of a questionnaire is directly related to the number of items; the larger the number of items, the higher the reliability of the questionnaire (Nunnally, 1978). However, a questionnaire with a large number of items takes a long time to complete. The number of items on the QUIS had to be reduced so that more people would be willing to complete the questionnaire. Across a series of administrations of successive versions of the questionnaire, the number of items was reduced while maintaining a high degree of reliability.
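One standard way to express the relation between test length and reliability is the Spearman-Brown prophecy formula. Whether this is the specific result intended by the citation above is an assumption, and the function name `spearman_brown` is hypothetical. The sketch below applies the formula to figures reported later in this paper (alpha of .94 at 103 ratings, shortened to 70 ratings).

```python
def spearman_brown(reliability: float, length_ratio: float) -> float:
    """Predicted reliability when test length is multiplied by length_ratio."""
    if length_ratio <= 0:
        raise ValueError("length_ratio must be positive")
    return (length_ratio * reliability) / (1 + (length_ratio - 1) * reliability)


# Illustrative use: alpha = .94 with 103 ratings, shortened to 70 ratings.
predicted = spearman_brown(0.94, 70 / 103)
print(round(predicted, 3))
```

The formula assumes that the removed items are comparable to those retained, so the prediction of roughly .91 is only a rough reference point for the reliability of a shortened version.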

Chin, Norman and Shneiderman (1987) administered the questionnaire (version 3.0) and a subsequent revised version (4.0) to an introductory computer science class learning to program in CF PASCAL. Participants, 155 males and 58 females, were assigned to either the interactive batch run IBM mainframe or an interactive syntax-directed editor programming environment on an IBM PC. During their class time, they evaluated the environment they had used during the first six weeks of the course (version 3.0). A multiple regression of the sub-component questions of version 3.0 with each main component question was performed. Sub-component questions with low beta weights were eliminated to shorten the length from 103 ratings in version 3.0 to 70 ratings in version 4.0. This modification retained the same basic organization and scaling anchors. The participants switched programming environments for the next six weeks and then evaluated the other programming environment with QUIS (version 4.0). In addition to the ratings, the participants' exam and project grades were used as objective measures of their performance.

These grades were used as a possible reference point for establishing validity. Chin et al. (1987) reasoned that an effective interface would translate into better performance. They had expected higher ratings and performance for the interactive syntax-directed editor programming environment. However, they found that subjective ratings did not correspond with the students' performance in the class. Moreover, problems in the syntax-directed editor's interface had led to higher satisfaction ratings for the mainframe. Although the performance measures failed to help establish the validity of the questionnaire, the questionnaire was diagnostic in pointing to the existing interface problems of the syntax-directed editor programming environment.

The reliability of the questionnaires was high. Cronbach's alpha, which is an estimation of reliability based on the average intercorrelation among question items, was used as the measure of reliability. The 3.0 version had an overall reliability of .94. The interitem alpha values did not vary very much, ranging from .940 to .942. Version 4.0 had an overall reliability of .89, with the values of alpha ranging between .89 and .90. Although there was a drop of .05 in reliability in the later version, an alpha of .89 is respectable when the elimination of 33 items in version 4.0 is taken into account. The small variability of the alpha of each item indicates stability of the questionnaire in terms of internal consistency.
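The reliability computation can be sketched as follows. This is a minimal illustration using the common variance-based form of Cronbach's alpha on complete data; it is not the authors' analysis code. The function names are hypothetical, and reading the per-item alpha values as "alpha if item deleted" is an assumption.

```python
from statistics import variance
from typing import List, Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha: one row per respondent, one column per item."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def alpha_if_item_deleted(rows: Sequence[Sequence[float]]) -> List[float]:
    """Alpha recomputed with each item removed in turn."""
    k = len(rows[0])
    return [
        cronbach_alpha([[v for j, v in enumerate(row) if j != drop] for row in rows])
        for drop in range(k)
    ]
```

A narrow spread in the values returned by `alpha_if_item_deleted` would correspond to the stability described above, since no single item would be driving the overall coefficient.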

The Present Study
Although version 4.0 appeared to be reliable, the sample of the users doing the evaluation and the interfaces which had been evaluated were limited to those in the academic community. There was a clear need to determine if the reliability of the questionnaire for user interface satisfaction (QUIS, or version 5.0, see Appendix A) would generalize to other populations of users and products, like a local PC User's Group.

In order to look at ratings across products, the questionnaires were divided into four groups rating the following: 1) a product the rater liked, 2) a product the rater disliked, 3) a command line system (CLS), and 4) a menu driven application (MDA). This investigation examines the reliability and discriminability of the questionnaire. In terms of discriminability, we compared the ratings for software that is liked vs. disliked. Lastly, a comparison was made between a mandatory CLS and a voluntarily chosen MDA.

Method

Subjects
The participants, 127 males and 14 females (nine did not report their gender), were members/affiliates of a local PC User's Group, ranging from ages 14 to 78. They differed widely in their level of computer experience; 11% had used only PC-DOS systems and 32% had worked with over six other types. More than 75% of the respondents had experience using a word processor, file manager, spreadsheet, modem, and a hard disk drive. Among these respondents, 27 rated the command line system, MS-DOS™, and 25 evaluated the menu driven system, WordPerfect™. In addition, 35 respondents rated a software product they liked and 18 evaluated one they disliked. A total of 46 different software products were each evaluated by between 1 and 31 persons.

Materials
Participants were given the short version of the QUIS (5.0) consisting of 27 semantic differentials, along with written instructions describing how to complete the questionnaire, and number two pencils. In order to computerize the scoring, optical scanning sheets with 10 alternatives were used, requiring a 10 point scale from 0 to 9. The background information section (4.0) was changed so that the characteristics of the software and the hardware configuration could be determined in version 5.0. A factor analysis of the data from versions 3.0 and 4.0 led to a reorganization of the main component questions. Each group of items was given a heading based on the aspect of the user interface which the items in that group seemed to be describing. When an item did not clearly fall within a factor, intuition was used to determine the placement of an item under a particular heading. An item concerning the noisiness of the system was added, making a total of 27 main component items.

Procedure
Questionnaire distribution took place during the group's monthly meeting. As attendees entered the auditorium, they were asked if they would complete a survey evaluating a software product. Approximately 500 attendees accepted a questionnaire and pencil. Four different sets of instructions accompanied the questionnaires, asking raters to evaluate: 1) a product they liked, 2) a product they disliked, 3) MS-DOS™, or 4) WordStar™, WordPerfect™, Lotus™, DBase™ or any comparable software product. Next, a PC User's Group representative read a prepared statement while participants read and followed the instructions on the questionnaire's front page. Approximately 30% of the questionnaires were returned in the lobby at the end of the meeting. Some participants complained that the instructions were complicated and hard to read in dim lighting.

Results

Reliability
The overall reliability of version 5.0 using Cronbach's alpha was .939. Interitem alpha values did not vary very much, and ranged from .933 to .939. The mean ratings varied between 4.72 and 7.02, while standard deviations ranged from 1.67 to 2.25.

Factor Analysis
A factor analysis was performed on the 21 main component questions to determine if the factor analysis of versions 3.0 and 4.0 corresponded with the data from version 5.0 (see Table 1). The items under the Learning and System Capabilities headings factored perfectly, with the exception of "experienced and inexperienced users' needs are taken into consideration," which factored with the Learning items. The items under Terminology and System Information factored together with the exceptions of "computer keeps you informed of what it is doing" and "error messages." The items under the Screen heading did not match the original organization. The four latent factors may be named: 1) Learning, 2) Terminology and Information flow, 3) System Output, and 4) System Characteristics, respectively. Neither "error messages" nor "highlighting" fits any of the four factors very well.

Table 1
Sorted Rotated Factor Loadings of Questions from QUIS 5.0
Heading   Question   Factor 1   Factor 2   Factor 3   Factor 4
Learning Learning to operate the system (difficult/easy) 0.840 0.000 0.000 0.000
Learning Remembering names and use of commands (difficult/easy) 0.777 0.283 0.000 0.000
Learning Exploring new features by trial and error (difficult/easy) 0.751 0.000 0.283 0.000
System Experienced and inexperienced users' needs are taken into consideration (never/always) 0.658 0.310 0.000 0.275
Learning Tasks can be performed in a straight-forward manner (never/always) 0.639 0.376 0.313 0.000
Learning Supplemental reference materials (confusing/clear) 0.613 0.265 0.000 0.286
Learning Help messages on the screen (unhelpful/helpful) 0.579 0.459 0.000 0.000
Terminology Use of terms throughout system (inconsistent /consistent) 0.000 0.793 0.000 0.271
Terminology Position of messages on screen (inconsistent/consistent) 0.313 0.785 0.252 0.000
Screen Organization of information on screen (confusing/very clear) 0.255 0.774 0.293 0.000
Screen Sequence of screens (confusing/very clear) 0.404 0.724 0.000 0.000
Terminology Computer terminology is related to the task you are doing (never/always) 0.000 0.683 0.000 0.270
Terminology Messages on screen which prompt user for input (confusing/clear) 0.437 0.605 0.349 0.000
Terminology Computer keeps you informed about what it is doing (never/always) 0.000 0.353 0.736 0.000
Screen Characters on the computer screen (hard to read/easy to read) 0.000 0.286 0.694 0.000
System System speed (too slow/fast enough) 0.000 0.000 0.654 0.499
System System tends to be (noisy/quiet) 0.000 0.000 0.000 0.787
System System reliability (unreliable/reliable) 0.000 0.000 0.439 0.689
System Correcting your mistakes (difficult/easy) 0.491 0.391 0.000 0.575
Terminology Error messages (unhelpful/helpful) 0.431 0.400 0.478 0.000
Screen Highlighting on the screen simplifies task (not at all/very much) 0.475 0.416 0.000 0.000
Note: All loadings less than 0.250 were set to zero. Based on N=96.
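The way such a loading table is read can be sketched in a few lines. The loadings below are copied from five rows of Table 1, and the 0.250 suppression threshold comes from the table note. The 0.500 cutoff for calling an item's best loading "weak" is a hypothetical value chosen only for illustration, and all function names are likewise hypothetical.

```python
from typing import Dict, List, Tuple

# A few rows copied from Table 1: item label -> loadings on Factors 1-4.
LOADINGS: Dict[str, List[float]] = {
    "Learning to operate the system": [0.840, 0.000, 0.000, 0.000],
    "Use of terms throughout system": [0.000, 0.793, 0.000, 0.271],
    "System speed": [0.000, 0.000, 0.654, 0.499],
    "Error messages": [0.431, 0.400, 0.478, 0.000],
    "Highlighting on the screen simplifies task": [0.475, 0.416, 0.000, 0.000],
}

SUPPRESS_BELOW = 0.250  # threshold stated in the table note
WEAK_BELOW = 0.500      # hypothetical cutoff, chosen only for illustration


def suppress(loadings: List[float], threshold: float = SUPPRESS_BELOW) -> List[float]:
    """Set loadings whose absolute value is below the threshold to zero."""
    return [x if abs(x) >= threshold else 0.0 for x in loadings]


def dominant_factor(loadings: List[float]) -> Tuple[int, float]:
    """Return (1-based factor number, loading) for the largest absolute loading."""
    index = max(range(len(loadings)), key=lambda i: abs(loadings[i]))
    return index + 1, loadings[index]


def weak_items(table: Dict[str, List[float]], cutoff: float = WEAK_BELOW) -> List[str]:
    """Items whose largest loading falls below the cutoff."""
    return [name for name, row in table.items()
            if abs(dominant_factor(suppress(row))[1]) < cutoff]


print(weak_items(LOADINGS))
```

With this illustrative cutoff, the two items flagged in the sample are the same two that the text describes as not fitting any factor well.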

Liked vs. Disliked
The liked and disliked ratings were compared on the six overall reaction and 21 main component questions (see Table 2). The two groups differed from each other on all of the overall reaction items with the exception of the "easy/difficult" item. All of the means from the liked system evaluations were higher than those from the disliked systems. Although the main component questions revealed differences (p<.05) in the same direction on "exploring new features by trial and error," "system speed," "system reliability," "error correction," and "experienced and inexperienced users," none of the differences in the 21 main component questions was significant at the level of p<.001 (controlling for an overall error rate of p<.05). Liked ratings were significantly (p<.001) higher than disliked ratings in the overall reactions: 1) "terrible/wonderful," 2) "frustrating/satisfying," 3) "dull/stimulating," and 4) "rigid/flexible."
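One common way to arrive at a stricter per-test threshold when many items are compared is the Bonferroni adjustment. The text does not say which procedure was used, so the sketch below is only an illustration of the idea; the function name `bonferroni_threshold` is hypothetical.

```python
def bonferroni_threshold(familywise_alpha: float, n_tests: int) -> float:
    """Per-test significance level that bounds the familywise error rate."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return familywise_alpha / n_tests


# 6 overall reaction items + 21 main component items = 27 comparisons.
per_test = bonferroni_threshold(0.05, 6 + 21)
print(f"{per_test:.5f}")
```

For 27 comparisons the adjusted level falls between .001 and .002, so a criterion of p<.001 is slightly more conservative than this adjustment would require.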

Table 2
Means of Ratings for Like vs. Dislike Groups
Item   Like Mean   Like St. Dev.   Dislike Mean   Dislike St. Dev.   Significance
A. Overall Reactions to the System
1. (terrible/wonderful) 7.21 1.21 4.44 2.41 ****
2. (frustrating/satisfying) 7.12 1.43 3.29 2.14 ****
3. (dull/stimulating) 6.68 1.35 3.75 2.14 ****
4. (difficult/easy) 5.59 2.05 4.33 2.61
5. (inadequate power/adequate power) 7.00 1.52 5.06 2.73 **
6. (rigid/flexible) 6.28 1.76 3.52 2.53 ***
B. Screen
1. Characters on the computer screen (hard to read/easy to read) 6.94 1.80 7.19 1.80
2. Highlighting on the screen simplifies task (not at all/very much) 6.20 1.81 6.12 2.50
3. Organization of information on screen (confusing/very clear) 6.29 1.62 5.76 1.89
4. Sequence of screens (confusing/very clear) 6.45 1.46 5.69 1.48
C. Terminology and System Information
1. Use of terms throughout system (inconsistent /consistent) 7.09 1.42 6.68 1.62
2. Computer terminology is related to the task you are doing (never/always) 6.39 1.90 5.79 1.93
3. Position of messages on screen (inconsistent/consistent) 7.24 1.44 6.25 2.08
4. Messages on screen which prompt user for input (confusing/clear) 6.03 1.98 5.00 2.52
5. Computer keeps you informed about what it is doing (never/always) 6.24 1.89 5.24 2.46
6. Error messages (unhelpful/helpful) 5.97 2.15 4.71 3.24
D. Learning
1. Learning to operate the system (difficult/easy) 5.67 2.10 3.53 2.67 **
2. Exploring new features by trial and error (difficult/easy) 5.62 2.00 3.76 2.46 **
3. Remembering names and use of commands (difficult/easy) 5.52 2.40 4.56 2.68
4. Tasks can be performed in a straight-forward manner (never/always) 5.94 1.72 4.65 2.50 *
5. Help messages on the screen (unhelpful/helpful) 5.94 2.11 4.94 3.01
6. Supplemental reference materials (confusing/clear) 5.29 2.00 3.93 2.74
E. System Capabilities
1. System speed (too slow/fast enough) 6.09 2.28 4.29 2.61 *
2. System reliability (unreliable/reliable) 7.45 1.50 6.35 1.66 *
3. System tends to be (noisy/quiet) 6.50 2.05 7.14 1.66
4. Correcting your mistakes (difficult/easy) 6.64 1.92 4.71 2.44 **
5. Experienced and inexperienced users' needs are taken into consideration (never/always) 5.63 1.81 3.88 2.09 **
Note: * denotes p<.05, ** denotes p<.01, *** denotes p<.001, **** denotes p<.0001.
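A group comparison of this kind can be approximated from the summary statistics in Table 2. The sketch below uses Welch's unequal-variance t statistic, which is an assumption, since the paper does not state which form of the t-test was used. The group sizes of 35 and 18 are the respondent counts given in the Subjects section; the number of applicable responses per item is not reported, so those values are also an assumption. The function name `welch_t` is hypothetical.

```python
import math
from typing import Tuple


def welch_t(mean1: float, sd1: float, n1: int,
            mean2: float, sd2: float, n2: int) -> Tuple[float, float]:
    """Welch's t statistic and approximate degrees of freedom from summary statistics."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Overall reaction item 1 (terrible/wonderful): like vs. dislike groups.
# n = 35 and n = 18 are assumed; per-item response counts are not reported.
t, df = welch_t(7.21, 1.21, 35, 4.44, 2.41, 18)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The unequal-variance form is used here because the dislike group's standard deviations are noticeably larger than the like group's on most items.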

Command Line System vs. Menu Driven Applications
The ratings of CLS and MDA were compared in an item analysis. T-tests performed on the overall reaction and the main component questions revealed many differences (see Table 3). In general, all the MDA mean ratings were higher than those for CLS. All of the overall reaction items were significant at the .001 level, with the exception of "difficult/easy" and "dull/stimulating." Eight of the 21 main component items were significant at the .001 level: 1) "information organization," 2) "screen sequence," 3) "position of messages," 4) "status of computer," 5) "error messages," 6) "help," 7) "error correction," and 8) "experienced and inexperienced users."

Table 3
Means of Ratings for Command Line System (CLS) vs. Menu Driven Application (MDA)
Item   CLS Mean   CLS St. Dev.   MDA Mean   MDA St. Dev.   Significance
A. Overall Reactions to the System
1. (terrible/wonderful) 5.33 1.47 7.36 1.11 ****
2. (frustrating/satisfying) 5.07 1.96 6.84 1.60 ***
3. (dull/stimulating) 4.65 2.22 5.83 1.53 *
4. (difficult/easy) 4.59 1.58 5.24 1.56
5. (inadequate power/adequate power) 4.96 2.19 7.75 1.42 ****
6. (rigid/flexible) 4.33 2.17 6.88 1.54 ****
B. Screen
1. Characters on the computer screen (hard to read/easy to read) 6.08 2.78 7.62 1.20 *
2. Highlighting on the screen simplifies task (not at all/very much) 5.00 2.90 6.72 1.81 *
3. Organization of information on screen (confusing/very clear) 4.36 2.08 7.40 1.29 ****
4. Sequence of screens (confusing/very clear) 5.18 1.72 7.20 1.10 ***
C. Terminology and System Information
1. Use of terms throughout system (inconsistent /consistent) 6.42 1.89 7.54 1.06 *
2. Computer terminology is related to the task you are doing (never/always) 5.46 2.08 6.63 1.35 *
3. Position of messages on screen (inconsistent/consistent) 6.00 2.64 8.04 0.95 ***
4. Messages on screen which prompt user for input (confusing/clear) 4.77 2.25 6.44 1.58 **
5. Computer keeps you informed about what it is doing (never/always) 4.19 1.79 6.71 1.33 ****
6. Error messages (unhelpful/helpful) 3.54 1.92 5.80 1.61 ****
D. Learning
1. Learning to operate the system (difficult/easy) 3.56 1.78 5.08 2.12 **
2. Exploring new features by trial and error (difficult/easy) 4.35 2.24 5.56 1.98 *
3. Remembering names and use of commands (difficult/easy) 4.48 2.17 5.04 2.30
4. Tasks can be performed in a straight-forward manner (never/always) 4.74 1.75 6.16 1.31 **
5. Help messages on the screen (unhelpful/helpful) 3.74 2.16 6.16 1.80 ***
6. Supplemental reference materials (confusing/clear) 4.30 2.28 5.84 1.60 **
E. System Capabilities
1. System speed (too slow/fast enough) 5.31 2.31 6.84 1.34 **
2. System reliability (unreliable/reliable) 7.19 1.77 7.48 1.36
3. System tends to be (noisy/quiet) 6.33 1.88 7.13 2.01
4. Correcting your mistakes (difficult/easy) 5.24 2.01 7.04 1.31 ***
5. Experienced and inexperienced users' needs are taken into consideration (never/always) 3.80 2.12 6.00 1.53 ***
Note: * denotes p<.05, ** denotes p<.01, *** denotes p<.001, **** denotes p<.0001.

Discussion

Summary of the Results
The results show that the questionnaire has maintained a high degree of reliability as the number of items was decreased in successive versions. The low variability of the reliability values of each item indicates a high degree of stability. The factor analysis revealed that both the learning and terminology sections of the questionnaire corresponded well with the latent factors. System capability questions appeared to break down into two different factors: one concerning the system output and the other focusing on system characteristics. However, two questions concerning error messages and highlighting on the screen did not seem to fit any category. The item analyses using t-tests show that the QUIS has good discriminability in the overall reaction ratings between the following pairs: 1) like vs. dislike and 2) command line system vs. menu driven application. Both like and MDA groups had consistently higher ratings compared to dislike and CLS groups, respectively.

Although there were strong differences found between like vs. dislike and CLS vs. MDA in the overall reactions, a greater number of significant differences were found between CLS and MDA in the specific main component questions than between the like and dislike groups. The lack of significant differences between the like and dislike groups in the specific main component questions may be due to evaluating a large number of different software products. Each product differs in its strengths and weaknesses. The aggregation of the evaluations of different products may have cancelled the rating differences between the like and dislike groups in the 21 main component questions.

There are many reasons why MDA was rated higher than CLS. Shneiderman (1987) lists five advantages of MDA: 1) shortens learning, 2) reduces keystrokes, 3) structures decision-making, 4) permits the use of dialog management, and 5) supports error handling well. Surprisingly, although CLS are known for their flexibility and appeal to "power" users (Shneiderman, 1987), the overall ratings of MDA suggest that these frequent and sophisticated PC users rated MDA more powerful and flexible than CLS.

This study did not attempt to establish any construct or predictive validity; future research on the QUIS should concentrate on this area. There are several reasons for the difficulty in establishing validity. First, there is a lack of theoretical constructs about interfaces with which to test the QUIS. Second, there are few if any other established questionnaires with which to cross-validate the findings of the QUIS. Future plans to establish validity of the questionnaire include the use of a standard interface to calibrate the QUIS ratings. Calibration of the QUIS can be accomplished by comparing successive ratings with corresponding degradations of the interface standard.

Future Applications
All previous questionnaires have been paper and pencil tasks. The media with which questionnaires are administered are likely to change in many ways. Future questionnaires could be presented on computers to facilitate user evaluations of a computer system. Computerized questionnaires would allow tailoring of questions that are specific to particular systems as a supplement to the QUIS. Data collection on computers would eliminate data encoding errors and speed statistical analysis. At present a computerized version of the current questionnaire for the IBM PC has been implemented and distributed.

Telephone questionnaires administered by computers may also be feasible. Currently there are specialized software/hardware packages for telemarketing. These packages can easily be modified to automatically: 1) dial a list of phone numbers at a designated time of the day, 2) ask a set of prerecorded questions, and 3) collect the respondent's answers (verbal or touchtone). The advantages of telesurveys include: 1) the possibility of collecting data from a large and diverse population, 2) an automated and standardized orally presented questionnaire, and 3) a cost effective and convenient method of data collection. However, there are problems with telesurveys. People may not be very comfortable responding to a machine; thus, they may be less likely to agree to be respondents. Moreover, a telesurvey may be viewed as an unsolicited invasion of privacy. Nevertheless, automated telesurveys may be an effective way for future researchers to administer a questionnaire.

Author Notes:
We would like to thank Yuri Gawdiak and Steven Versteeg for their assistance in data collection. Partial funding of this research was from the National Science Foundation, AT&T and the computer science center at the University of Maryland.

References

Bailey, J. E., & Pearson, S. W. (1983). Development of a tool for measuring and analyzing computer user satisfaction. Management Science, 29(5), May, 530-545.

Chin, J. P., Norman, K. L., & Shneiderman, B. (1987). Subjective user evaluation of CF PASCAL programming tools. Technical Report (CAR-TR-304), Human-Computer Interaction Laboratory, University of Maryland, College Park, MD 20742.

Coleman, W. D., Williges, R. C., & Wixon, D. R. (1985). Collecting detailed user evaluations of software interfaces. Proceedings of the Human Factors Society - 29th Annual Meeting - 1985, 240-244.

Gallagher, C. A. (1974). Perceptions of the value of a management information system. Academy of Management Journal, 17(1), 46-55.

Ives, B., Olson, M. H., & Baroudi, J. J. (1983). The measurement of user information satisfaction. Communications of the ACM, 26, 785-793.

Larcker, D. F., & Lessig, V. P. (1980). Perceived usefulness of information: A psychometric examination. Decision Sciences, 11(1), 121-134.

Nunnally, J. C. (1978). Psychometric Theory. McGraw-Hill Book Company, New York.

Root, R. W., & Draper, S. (1983). Questionnaires as a software evaluation tool. CHI'83 Proceedings, December, 83-87.

Rushinek, A., & Rushinek, S. F. (1986). What makes users happy? Communications of the ACM, 29(7), 594-598.

Shneiderman, B. (1987). Designing the User Interface: Strategies for Effective Human-Computer Interaction. Addison-Wesley Publishing Co., Reading, MA.

Appendix A

Part 3: User Evaluation of an Interactive Computer System
(For each of the following questions, fill in 0-9 or leave blank if question is not applicable)
Skip question if not applicable
OVERALL REACTIONS TO THE SOFTWARE
terrible wonderful
0 1 2 3 4 5 6 7 8 9
difficult easy
0 1 2 3 4 5 6 7 8 9
frustrating satisfying
0 1 2 3 4 5 6 7 8 9
inadequate power adequate power
0 1 2 3 4 5 6 7 8 9
dull stimulating
0 1 2 3 4 5 6 7 8 9
rigid flexible
0 1 2 3 4 5 6 7 8 9
SCREEN
· Characters on the computer screen
hard to read easy to read
0 1 2 3 4 5 6 7 8 9
· Highlighting on the screen simplifies task
not at all very much
0 1 2 3 4 5 6 7 8 9
· Organization of information on screen
confusing very clear
0 1 2 3 4 5 6 7 8 9
· Sequence of screens
confusing very clear
0 1 2 3 4 5 6 7 8 9
TERMINOLOGY AND SYSTEM INFORMATION
· Use of terms throughout system
inconsistent consistent
0 1 2 3 4 5 6 7 8 9
· Computer terminology is related to the task you are doing
never always
0 1 2 3 4 5 6 7 8 9
· Position of messages on screen
inconsistent consistent
0 1 2 3 4 5 6 7 8 9
· Messages on screen which prompt user for input
confusing clear
0 1 2 3 4 5 6 7 8 9
· Computer keeps you informed about what it is doing
never always
0 1 2 3 4 5 6 7 8 9
· Error messages
unhelpful helpful
0 1 2 3 4 5 6 7 8 9
LEARNING
· Learning to operate the system
difficult easy
0 1 2 3 4 5 6 7 8 9
· Exploring new features by trial and error
difficult easy
0 1 2 3 4 5 6 7 8 9
· Remembering names and use of commands
difficult easy
0 1 2 3 4 5 6 7 8 9
· Tasks can be performed in a straight-forward manner
never always
0 1 2 3 4 5 6 7 8 9
· Help messages on the screen
unhelpful helpful
0 1 2 3 4 5 6 7 8 9
· Supplemental reference materials
confusing clear
0 1 2 3 4 5 6 7 8 9
SYSTEM CAPABILITIES
· System speed
too slow fast enough
0 1 2 3 4 5 6 7 8 9
· System reliability
unreliable reliable
0 1 2 3 4 5 6 7 8 9
· System tends to be
noisy quiet
0 1 2 3 4 5 6 7 8 9
· Correcting your mistakes
difficult easy
0 1 2 3 4 5 6 7 8 9
· Experienced and inexperienced users' needs are taken into consideration
never always
0 1 2 3 4 5 6 7 8 9