Watson-Glaser: Short Form Manual

Watson-Glaser® Critical Thinking Appraisal
Short Form Manual
Goodwin Watson & Edward M. Glaser
888-298-6227 • TalentLens.com
Copyright © 2008 by Pearson Education, Inc., or its affiliate(s).
All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any
means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval
system, without permission in writing from the copyright owner.
The Pearson and TalentLens logos, and Watson-Glaser Critical Thinking Appraisal are trademarks,
in the U.S. and/or other countries, of Pearson Education, Inc. or its affiliate(s).
Portions of this work were previously published.
Printed in the United States of America.
Table of Contents
Chapter 1
Introduction
Chapter 2
Critical Thinking Ability and the Development of the Original Watson-Glaser Forms
The Watson-Glaser Short Form
Chapter 3
Directions for Paper-and-Pencil Administration and Scoring
Preparing for Administration
Testing Conditions
Materials Needed to Administer the Test
Answering Questions
Administering the Test
Timed Administration
Untimed Administration
Concluding Administration
Scoring
Scoring with the Hand-Scoring Key
Machine Scoring
Test Security
Accommodating Examinees with Disabilities
Chapter 4
Directions for Computer-Based Administration
Preparing for Administration
Testing Conditions
Answering Questions
Administering the Test
Scoring and Reporting
Test Security
Accommodating Examinees with Disabilities
Chapter 5
Norms
Using Norms Tables to Interpret Scores
Converting Raw Scores to Percentile Ranks
Chapter 6
Development of the Short Form
Test Assembly Data Set
Criteria for Item Selection
Maintenance of Reading Level
Updates to the Test
Test Administration Time
Chapter 7
Equivalence of Forms
Equivalence of Short Form to Form A
Equivalent Raw Scores
Equivalence of Computer-Based and Paper-and-Pencil Versions of the Short Form
Chapter 8
Evidence of Reliability
Historical Reliability
Previous Studies of Internal Consistency Reliability
Previous Studies of Test-Retest Reliability
Current Reliability Studies
Evidence of Internal Consistency Reliability
Evidence of Test-Retest Reliability
Chapter 9
Evidence of Validity
Evidence of Validity Based on Content
Evidence of Criterion-Related Validity
Previous Studies of Evidence of Criterion-Related Validity
Current Studies of Evidence of Criterion-Related Validity
Evidence of Convergent and Discriminant Validity
Previous Studies of Evidence of Convergent and Discriminant Validity
Studies of the Relationship Between the Watson-Glaser and General Intelligence
Current Studies of Evidence of Convergent and Discriminant Validity
Chapter 10
Using the Watson-Glaser as an Employment Selection Tool
Employment Selection
Fairness in Selection Testing
Legal Considerations
Group Differences/Adverse Impact
Monitoring the Selection System
Research
Appendix A
Description of the Normative Sample and Percentile Ranks
Appendix B
Final Item Statistics for the Watson-Glaser–Short Form Three-Parameter IRT Model
References
Research Bibliography
Glossary of Measurement Terms
Tables
6.1 Distribution of Item Development Sample Form A Scores (N = 1,608)
6.2 Grade Levels of Words on the Watson-Glaser–Short Form
6.3 Frequency Distribution of Testing Time in Test-Retest Sample (n = 42)
7.1 Part-Whole Correlations (r_pw) of the Short Form and Form A
7.2 Raw Score Equivalencies Between the Short Form and Form A
7.3 Equivalency of Paper and Online Modes of Administration
8.1 Means, Standard Deviations (SD), Standard Errors of Measurement (SEM), and Internal Consistency Reliability Coefficients (r_alpha) for the Short Form Based on Previous Studies
8.2 Means, Standard Deviations (SD), Standard Errors of Measurement (SEM), and Internal Consistency Reliability Coefficients (r_alpha) for the Current Short Form Norm Groups
8.3 Test-Retest Reliability of the Short Form
9.1 Studies Showing Evidence of Criterion-Related Validity
9.2 Watson-Glaser Convergent Evidence of Validity
A.1 Description of the Normative Sample by Industry
A.2 Description of the Normative Sample by Occupation
A.3 Description of the Normative Sample by Position Type/Level
A.4 Percentile Ranks of Total Raw Scores for Industry Groups
A.5 Percentile Ranks of Total Raw Scores for Occupations
A.6 Percentile Ranks of Total Raw Scores for Position Type/Level
A.7 Percentile Ranks of Total Raw Scores for Position Type/Occupation Within Industry
B.1 Final Item Statistics for the Watson-Glaser Short Form Three-Parameter IRT Model (reprinted from Watson & Glaser, 1994)
Acknowledgements
The development and publication of updated information on a test like
the Watson-Glaser Critical Thinking Appraisal ®–Short Form inevitably involves the
helpful participation of many people in several phases of the project—design,
data collection, statistical data analyses, editing, and publication. The Harcourt
Assessment Talent Assessment team is indebted to the numerous professionals
and organizations that provided assistance.
The Talent Assessment team thanks Julia Kearney, Sampling Special Projects
Coordinator; Terri Garrard, Study Manager; and Victoria N. Locke, Director,
Catalog Sampling Department, for coordinating the data collection phase of this
project. David Quintero, Clinical Handscoring Supervisor, ensured accurate scoring of the paper-administered test data.
We thank Zhiming Yang, PhD, Psychometrician, and Jianjun Zhu, PhD, Manager,
Data Analysis Operations. Zhiming’s technical expertise in analyzing the data
and Jianjun’s psychometric leadership ensured the high level of analytical rigor
and psychometric integrity of the results reported.
Our thanks also go to Troy Beehler and Peter Schill, Project Managers, for skillfully managing the logistics of this project. Troy and Peter worked with several
team members from the Technology Products Group, Harcourt Assessment, Inc.
to ensure the high quality and accuracy of the computer interface. These dedicated individuals included Paula Oles, Manager, Software Quality Assurance;
Matt Morris, Manager, System Development; Christina McCumber and Johnny
Jackson, Software Quality Assurance Analysts; Terrill Freese, Requirements
Analyst; and Maurya Duran, Technical Writer. Dawn Dunleavy, Senior Managing
Editor, and Konstantin Tikhonov, Project Editor, provided editorial guidance
and support. Production assistance was provided by Stephanie Adams, Director,
Production; Mark Cooley, Designer; Debbie Glaeser, Production Coordinator;
and Robin Espiritu, Production Manager, Manufacturing.
Finally, we wish to acknowledge the leadership, guidance, support, and commitment
of the following people through all the phases of this project: Gene Bowles,
Vice President, Publishing and Technology; Larry Weiss, PhD, Vice President,
Psychological Assessment Products Group; and Aurelio Prifitera, PhD, Publisher,
Harcourt Assessment, Inc., and President, Harcourt Assessment International.
Kingsley C. Ejiogu, PhD, Research Director
Mark Rose, PhD, Research Director
John Trent, M.S., Senior Research Analyst
Chapter 1
Introduction
The Watson-Glaser Critical Thinking Appraisal® (subsequently referred to in
this manual as the Watson-Glaser) is designed to measure important abilities
involved in critical thinking. Critical thinking ability plays a vital role in academic instruction and occupations that require careful analytical thinking to
perform essential job functions. The Watson-Glaser has been used to predict
performance in a variety of educational settings and has been a popular selection tool for executive, managerial, supervisory, administrative, and technical
occupations for many years. When used in conjunction with information from
multiple sources about the examinee’s skills, abilities, and potential for success,
the Watson-Glaser can contribute significantly to the quality of an organization’s
selection program.
The Watson-Glaser–Short Form was published in 1994 to enhance the use of the
Watson-Glaser in assessing adult employment applicants, candidates for employment-related training, career and vocational counselees, college students, and
students in technical schools and adult education programs. As an abbreviated
version of the Watson-Glaser–Form A, the Short Form uses a subset of Form A
scenarios and items to measure the same critical thinking abilities.
This manual provides the following information about the Short Form:
• Updated guidelines for administration. Chapter 3 includes
guidelines for administering and scoring the traditional paper-and-pencil
version. Chapter 4 provides guidelines for administering the new
computer-based version.
• Updated normative information (norms). Twenty-three new norm
groups, based on 6,713 cases collected in 2004 and 2005, are presented in
chapter 5.
• Results of an equivalency study on the computer-based and
paper-and-pencil versions. To ensure equivalence between the newly
designed computer-based and traditional paper-and-pencil formats
of the Watson-Glaser, a study was conducted comparing scores on the two
versions. A full description of the study, which supported equivalence of the
two versions, is presented in chapter 7, Equivalence of Forms.
• Updated reliability and validity information. New studies describing
internal consistency and test-retest reliability are presented in chapter 8. New
studies describing convergent and criterion-related validity are presented
in chapter 9.
Information on Forms A and B was published in the 1994 Watson-Glaser manual.
Chapter 2
Critical Thinking Ability and the Development of the Original Watson-Glaser Forms
Development of the Watson-Glaser was driven by the conceptualization of
critical thinking as a combination of attitudes, knowledge, and skills. This
conceptualization suggests that critical thinking includes:
• the ability to recognize the existence of problems and an acceptance of the
general need for evidence in support of what is asserted to be true,
• knowledge of the nature of valid inferences, abstractions, and generalizations
in which the weight or accuracy of different kinds of evidence is logically
determined, and
• skills in employing and applying the above attitudes and knowledge.
The precursors of the Watson-Glaser were developed by Goodwin Watson in
1925 and Edward Glaser in 1937. These tests were developed with careful consideration given to the theoretical concept of critical thinking, as well as practical
applications. In 1964, The Psychological Corporation (now Harcourt Assessment,
Inc.) published Watson-Glaser Forms Ym and Zm. Each form contained 100
items and replaced an earlier version of the test, Form Am. In 1980, Form Ym
and Form Zm were modified for clarity, current word usage, and the elimination
of racial and sexual stereotypes. The revised instruments, each containing 80
items, were published as Form A and Form B.
The Watson-Glaser measures the extent to which examinees need training or
have mastered certain critical thinking skills. The availability of comparable
forms (i.e., the Short Form, Form A, and Form B) makes it possible to partially
gauge the efficacy of instructional programs and to measure the development of
these skills over an extended period of time. The Watson-Glaser has also been a
particularly popular tool for assessing the success of critical thinking instruction
programs and courses, and for placing students in gifted and talented programs
at the high school level and in honors curricula at the university level.
The Watson-Glaser is composed of a set of five tests. Each test is designed to tap
a somewhat different aspect of critical thinking. A high level of competency
in critical thinking, as measured by the Watson-Glaser, may be operationally
defined as the ability to correctly perform the domain of tasks represented by
the five tests.
1—Inference. Discriminating among degrees of truth or falsity
of inferences drawn from given data.
2—Recognition of Assumptions. Recognizing unstated
assumptions or presuppositions in given statements or assertions.
3—Deduction. Determining whether certain conclusions
necessarily follow from information in given statements or premises.
4—Interpretation. Weighing evidence and deciding if
generalizations or conclusions based on the given data are warranted.
5—Evaluation of Arguments. Distinguishing between
arguments that are strong and relevant and those that are weak
or irrelevant to a particular issue.
Each test is composed of reading passages or scenarios that include problems,
statements, arguments, and interpretations of data similar to those encountered
on a daily basis at work, in the classroom, and in newspaper or magazine articles.
Each scenario is accompanied by a number of items to which the examinee
responds. There are two types of item content: neutral and controversial. Neutral
scenarios and items deal with subject matter that does not cause strong feelings
or prejudices, such as the weather and scientific facts or experiments. Scenarios
and items having controversial content refer to political, economic, and social
issues that frequently provoke strong emotional responses. As noted in the
research literature about critical thinking, strong attitudes, opinions, and biases
affect the ability of some people to think critically (Jaeger & Freijo, 1975;
Jones & Cook, 1975; Mitchell & Byrne, 1973; Sherif, Sherif, & Nebergall, 1965).
Though the Watson-Glaser comprises five tests, it is the total score of these
tests that yields a reliable measure of critical thinking ability. Individually, the
tests are composed of relatively few items and lack sufficient reliability to measure specific aspects of critical thinking ability. Therefore, individual test scores
should not be relied upon for most applications of the Watson-Glaser.
The Watson-Glaser Short Form
The Short Form was designed to offer a brief version of the Watson-Glaser
without changing the essential nature of the constructs measured. The length
of time required to administer Form A or Form B of the Watson-Glaser is approximately one hour, making both forms well suited for administration during a
single classroom period in a school setting. However, such lengthy administration
time increases the cost and decreases the practicality of using the Watson-Glaser
in adult assessment, particularly in the employment selection context.
The Short Form is composed of 16 scenarios and 40 items selected from the
80-item Form A. The Short Form takes about 30 minutes to complete in a paper-and-pencil or computer-based format. It takes an additional five to ten minutes
to read the directions and sample questions. At one-half the length of Form A,
the Short Form presents a more practical measure of critical thinking ability, yet
retains an equivalent nature (see chapter 7, Equivalence of Forms). Organizations
requiring an alternative to the Short Form for retesting or other purposes may
use the full length Form B.
Like Form A and Form B, the Short Form is appropriate for use with persons
who have at least the equivalent of a ninth-grade education (see chapter 6,
Development of the Short Form, for more information on the reading level
of the Watson-Glaser).
Chapter 3
Directions for Paper-and-Pencil Administration and Scoring
Preparing for Administration
The person responsible for administering the Watson-Glaser does not need special training, but must be able to carry out standard examination procedures. To
ensure accurate and reliable results, the administrator must become thoroughly
familiar with the administration instructions and the test materials before
attempting to administer the test. Test administrators are encouraged to
take the Watson-Glaser themselves before administering it, complying with the
directions and any time requirement.
Testing Conditions
Generally accepted conditions of good test administration should be observed:
good lighting, comfortable seating, adequate desk or table space, and freedom
from noise and other distractions. Examinees should have sufficient seating
space to minimize cheating.
Each examinee needs an adequate flat surface on which to work. Personal
materials should be removed from the work surface.
Materials Needed to Administer the Test
• This manual or the Directions for Administration booklet
• 1 Test Booklet for each examinee
• 1 Answer Document for each examinee
• 2 No. 2 pencils with erasers for each examinee
• A clock or stopwatch if the test is timed
• 1 Hand Scoring Key (if the test will be hand-scored rather
than machine-scored)
Intended as a test of critical thinking power rather than speed, the Watson-Glaser may be given in either timed or untimed administrations. In timed
administrations, the time limit is based on the amount of time that the majority
of examinees in test tryouts needed to finish the test. The administrator should
have a regular watch with a second hand, a wall clock with a sweep-second hand,
or any other accurate device to time the test administration. To facilitate accurate
timing, the starting time and the finishing time should be written down immediately after the signal to begin has been given. In addition to testing time, allow
5–10 minutes to read the directions and answer questions.
Answering Questions
Examinees may ask questions about the test before the signal to begin is given.
To maintain standard testing conditions, answer such questions by rereading
the appropriate section of the directions. Do not volunteer new explanations
or examples. It is the responsibility of the test administrator to ensure that
examinees understand the correct way to indicate their answers on the Answer
Document and what is required of them. The question period should never be
rushed or omitted.
If any examinees have routine questions after the testing has started, try to
answer them without disturbing the other examinees. However, questions
about the test directions should be handled by telling the examinee to
do his or her best.
Administering the Test
All directions that the test administrator reads aloud to examinees are in bold
type. Read the directions exactly as they are written, using a natural tone and
manner. Do not shorten the directions or change them in any way. If you make
a mistake in reading a direction, say,
No, that is wrong. Listen again.
Then read the direction again.
When all examinees are seated, give each examinee two pencils and an
Answer Document.
Say, Please make sure that you do not fold, tear, or otherwise damage the Answer
Documents in any way. Notice that your Answer Document has an example of how
to properly blacken the circle.
Point to the Correct Mark and Incorrect Marks samples on the
Answer Document.
Say, Make sure that the circle is completely filled in as shown.
Note. You may want to point out how the test items are ordered on the front page
of the Short Form Answer Document so that examinees do not skip anything
or put the correct information in the wrong place.
Say, In the upper left corner of the Answer Document, you will find box A labeled
NAME. Neatly print your Last Name, First Name, and Middle Initial here. Fill in the
appropriate circle under each letter of your name.
The Answer Document provides space for a nine-digit identification number. If
you want the examinees to use this space for an employee identification number, provide them with specific instructions for completing the information at
this time. For example, say, In box B labeled IDENTIFICATION NUMBER, enter your
employee number in the last four spaces provided. Fill in the appropriate circle under
each digit of the number. If no information is to be recorded in the space, tell
examinees that they should not write anything in box B.
Say, Find box C, labeled DATE. Write down today’s Month, Day, and Year here. (Tell
examinees today’s date.) Blacken the appropriate circle under each digit of the date.
Box D, labeled OPTIONAL INFORMATION, provides space for any additional information you would like to obtain from the examinees. Let examinees know what
information, if any, they should provide in this box.
Note. If optional information is collected, the test administrator should explain
to the examinees the purpose of collecting this information (i.e., how it
will be used).
Say, Are there any questions?
Answer any questions.
Say, After you receive your Test Booklet, please keep it closed. You will do all your
writing on the Answer Document only. Do not make any additional marks on the
Answer Document until I tell you to do so.
Distribute the Test Booklets.
Say, In this test, all the questions are in the Test Booklets. There are five separate tests
in the booklet, and each one is preceded by its own directions. For each question,
decide what you think is the best answer. Because your score will be the number of
items you answered correctly, try to answer each question even if you are not sure
that your answer is correct.
Record your choice by making a black mark in the appropriate space on the Answer
Document. Always be sure that the answer space has the same number as the question in the booklet and that your marks stay within the circles. Do not make any other
marks on the Answer Document. If you change your mind about an answer, be sure to
erase the first mark completely.
Do not spend too much time on any one question. When you finish a page, go right on
to the next one. If you finish all the tests before time is up, you may go back and check
your answers.
Timed Administration
Say, You will have 30 minutes to work on this test. Now read the directions on the
cover of your Test Booklet.
After allowing time for the examinees to read the directions, say,
Are there any questions?
Answer any questions, preferably by rereading the appropriate section of the
directions, then say, Ready? Please begin the test.
Start timing immediately. If any of the examinees finish before the end of the
test period, either tell them to sit quietly until everyone has finished, or collect
their materials and dismiss them. At the end of 30 minutes, say,
Stop! Put your pencils down. This is the end of the test.
Intervene if examinees continue to work on the test after the time signal
is given.
Untimed Administration
Say, You will have as much time as you need to work on this test. Now read the directions on the cover of your Test Booklet.
After allowing time for the examinees to read the directions, say,
Are there any questions?
Answer any questions, preferably by rereading the appropriate section of the
directions, then instruct examinees regarding what they are to do upon completing the test (e.g., remain seated until everyone has finished, bring Test Booklet
and Answer Document to the test administrator).
Say, Ready? Please begin the test.
Allow the group to work until everyone is finished.
Concluding Administration
At the end of the testing session, collect all Test Booklets, Answer Documents,
and pencils. Place the completed Answer Documents in one pile and the Test
Booklets in another. The Test Booklets may be reused, but they will need to be
inspected for marks. Marked booklets should not be reused, unless the marks
can be completely erased.
Scoring
The Watson-Glaser Answer Document may be hand scored with the Hand
Scoring Key or machine scored.
Scoring With the Hand-Scoring Key
First, cross out multiple responses to the same item with a heavy red mark that
will show through the key. (Note: Red marks are only suitable for hand-scored
documents.) Check for any answer spaces that were only partially erased by the
examinee in changing an answer; partial erasures should be completely erased.
Next, place the scoring key over the Answer Document so that the edges are
neatly aligned and the two stars appear through the two holes that are the
closest to the bottom of the key. Count the number of correctly marked spaces
(other than those through which a red line has been drawn) appearing through
the holes in the stencil. Record the total in the “Score” box on the Answer
Document. The maximum raw score for the Short Form is 40. The percentile
score corresponding to the raw score may be recorded in the “Percentile” space
on the Answer Document, and the norm group used to determine that percentile
may be recorded in the space labeled “Norms Used.”
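The hand-scoring rules above can also be expressed as a short routine, which may be useful for checking hand-entered results. This is a minimal sketch only: the answer key, the response format, and the function name raw_score are hypothetical and are not part of the Watson-Glaser materials.

```python
from typing import Mapping, Sequence


def raw_score(responses: Mapping[int, Sequence[str]], key: Mapping[int, str]) -> int:
    """Count items answered correctly, giving no credit for multiple marks.

    responses maps an item number to every option the examinee marked;
    key maps an item number to the single keyed option (hypothetical data).
    """
    score = 0
    for item, keyed_option in key.items():
        marked = responses.get(item, ())
        # An item with two or more marks is crossed out and earns no credit.
        if len(marked) == 1 and marked[0] == keyed_option:
            score += 1
    return score


# Hypothetical three-item key and answer document (the Short Form has 40 items).
example_key = {1: "A", 2: "C", 3: "B"}
example_responses = {1: ["A"], 2: ["C", "D"], 3: ["B"]}
print(raw_score(example_responses, example_key))
```

In this example, item 2 carries two marks and is therefore not counted, mirroring the red-line rule for hand scoring.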
Machine Scoring
First, completely erase multiple responses to the same item or configure the scanning program to treat multiple responses as incorrect answers. If you find any
answer spaces that were only partially erased by the examinee, finish completely
erasing them.
The machine-scorable Answer Documents available for the Short Form
may be processed by any reflective scanning device programmed to
your specifications.
Test Security
Watson-Glaser scores are confidential and should be stored in a secure location
accessible only to authorized individuals. It is unethical and poor test practice to
allow test score access to individuals who do not have a legitimate need for the
information. The security of testing materials and protection of copyright must
also be maintained by authorized individuals. Storing test scores and materials
in a locked cabinet (or password-protected file in the case of scores maintained
electronically) that can only be accessed by designated test administrators is an
effective means to ensure their security.
Accommodating Examinees with Disabilities
You will need to routinely provide reasonable accommodations that make it possible for candidates with particular needs to comfortably take the test, such as
left-handed desks for some candidates and adequate and comfortable seating for
all individuals.
On occasion, a special administration may be required for an examinee with an
impairment that affects his or her ability to take a test in the standard manner.
Harcourt Assessment, Inc. recommends that reasonable accommodations for
these examinees be made in accordance with the Americans with Disabilities
Act (ADA) of 1990. The ADA has established basic legal rights for individuals
with physical or mental disabilities that substantially limit one or more major
life activities. Reasonable accommodations may include, but are not limited to,
modifications to the testing environment (e.g., high desks), medium (e.g., having a reader read questions to the examinee), time limit, and/or content (Society
for Industrial and Organizational Psychology, 2003).
If an examinee’s disability is not likely to impair job performance, but may hinder his or her performance on the Watson-Glaser, you may want to consider
waiving administration of the Watson-Glaser or de-emphasizing the test score in
favor of other application criteria.
Chapter 4
Directions for Computer-Based Administration
The computer-based Watson-Glaser is administered through eAssessTalent.com,
the Internet-based testing system designed by Harcourt Assessment, Inc., for the
administration, scoring, and reporting of professional assessments. Instructions
for administrators on how to order and access the test online are provided at eAssessTalent.com. Instructions for accessing the Watson-Glaser interpretive reports
are provided on the website. After a candidate has taken the Watson-Glaser on
eAssessTalent.com, you can review the candidate’s results in an interpretive
report, using the link that Harcourt Assessment provides.
Preparing for Administration
Being thoroughly prepared before the examinee’s arrival will result in a more
efficient online administration session. Test administrators should take the computer-based Watson-Glaser before administering it,
being sure to comply with the directions and any time requirement. Examinees
will not need pencils or scratch paper for this computer-based test. In addition,
examinees should not have access to any reference materials (e.g., dictionaries
or calculators).
Testing Conditions
It is important to ensure that the test is administered in a quiet, well-lit room.
The following conditions are necessary for accurate scores and for maintaining
the cooperation of the examinee: good lighting, comfortable seating, adequate
desk or table space, comfortable positioning of the computer screen, keyboard,
and mouse, and freedom from noise and other distractions.
Answering Questions
Examinees may ask questions about the test before the signal to begin is given.
To maintain standard testing conditions, answer such questions by rereading the
appropriate section of these directions. Do not volunteer new explanations or
examples. As the test administrator, it is your responsibility to ensure that examinees understand the correct way to indicate their answers and what is required
of them. The question period should never be rushed or omitted.
If any examinees have routine questions after the testing has started, try to
answer them without disturbing the other examinees. However, questions about
the test directions should be handled by telling the examinee to do his or her best.
Administering the Test
After the initial instruction screen for the Watson-Glaser has been accessed and
the examinee is seated at the computer, say,
The on-screen directions will take you through the entire process, which begins with
some demographic questions. After you have completed these questions, the test
will begin. You will have as much time as you need to complete the test items. The test
ends with a few additional demographic questions. Do you have any questions before
starting the test?
Answer any questions and say, Please begin the test.
Once the examinee clicks the “Start Your Test” button, test administration begins
with the first page of test questions. The examinee may review test items at the
end of the test. Examinees have as much time as they need to complete the exam,
but they typically finish within 30 minutes.
If an examinee’s computer develops technical problems during testing, move the
examinee to another suitable computer location. If the technical problems cannot
be solved by moving to another computer location, contact Harcourt Assessment,
Inc. Technical Support for assistance. The contact information, including phone
and fax numbers, can be found at the eAssessTalent.com website.
Scoring and Reporting
Scoring is automatic, and the report is available a few seconds after the test is
completed. A link to the report will be available on eAssessTalent.com. Adobe®
Acrobat Reader® is necessary to open the report. You may view, print, or save the
candidate’s report.
Test Security
Watson-Glaser scores are confidential and should be stored in a secure location
accessible only to authorized individuals. It is unethical and poor test practice to
allow test-score access to individuals who do not have a legitimate need for the
information. Storing test scores in a locked cabinet or password-protected file
that can only be accessed by designated test administrators will help ensure their
security. The security of testing materials (e.g., access to online tests) and protection of copyright must also be maintained by authorized individuals. Avoid
disclosing test access information such as usernames or passwords, and
administer the Watson-Glaser only in proctored environments. All computer
stations used in administering the Watson-Glaser must be in locations that
can be easily supervised with the same level of security as the paper-and-pencil administration.
Accommodating Examinees with Disabilities
As noted in Chapter 3, in the section on accommodating examinees with disabilities, the test administrator should provide reasonable accommodations to
enable candidates with special needs to comfortably take the test. Reasonable
accommodations may include, but are not limited to, modifications to the test
environment (e.g., high desks) and medium (e.g., having a reader read questions
to the examinee, or increasing the font size of questions) (Society for Industrial
and Organizational Psychology, 2003). In situations where an examinee’s disability is not likely to impair his or her job performance, but may hinder the
examinee’s performance on the Watson-Glaser, the organization may want
to consider waiving the test or de-emphasizing the score in favor of other application criteria. Interpretive data as to whether scores on the Watson-Glaser are
comparable for examinees who are provided reasonable accommodations are not
available at this time due to the small number of examinees who have requested
such accommodations.
If, due to some particular impairment, a candidate cannot take the computer-administered test but can take the test on paper, the administrator could provide
reasonable accommodation for the candidate to take the test on paper, and then
have the candidate’s certified responses and results entered into the computer
system. The Americans with Disabilities Act (ADA) of 1990 requires an employer
to reasonably accommodate the known disability of a qualified applicant provided such accommodation would not cause an “undue hardship” to the
operation of the employer’s business.
Chapter 5
Norms
The raw score on the Watson-Glaser–Short Form is the total number of correct
responses. The maximum raw score is 40. Raw scores may be
used to rank examinees in order of performance, but little can be inferred from
raw scores alone. It is important to relate the scores to specifically defined normative groups to make the test results meaningful.
Norms provide a basis for evaluating an individual’s score relative to the scores
of other individuals who took the same test. Norms allow for the conversion of
raw scores to more useful comparative scores, such as percentile ranks. Typically,
norms are constructed from the scores of a large sample of individuals who
took a test. This group of individuals is referred to as the normative group or
standardization sample.
The characteristics of the sample used for preparing norms are critical in determining the usefulness of those norms. For some purposes, such as intelligence
testing, norms that are representative of the general population are essential. For
other purposes, such as selecting from among applicants to fill a particular job,
normative information derived from a specific, relevant, well-defined group may
be most useful. However, the composition of a sample of job applicants is influenced by a variety of situational factors, including job demands and local labor
market conditions. Because such factors can vary across jobs, locations, and over
time, the limitations on the usefulness of any set of published norms should
be acknowledged.
When a test is used to help make human resource decisions, the most appropriate norm group is one that is representative of those who will be taking the test
in the local situation. It is best, whenever possible, to prepare local norms by
accumulating the test scores of applicants, trainees, or employees. One of the
factors that must be considered in preparing norms is sample size. With large
samples, all possible scores can be converted to percentile ranks. Data from
smaller samples tend to be unstable and the presentation of percentile ranks for
all possible scores presents an unwarranted impression of precision. Until a sufficient and representative number of cases has been collected (preferably 100 or
more), the norms presented in Appendix A should be used to guide the interpretation of test scores.
Using Norms Tables to Interpret Scores
The Short Form norms in Appendix A were derived from new data, collected
June 2004 through March 2005, from 6,713 adults in a variety of employment
settings. Please note that the distributions of occupational levels across industry
samples vary. Therefore, it is not appropriate to compare industry means presented in Appendix A to each other. The tables in Appendix A show the total
raw scores on the Watson-Glaser with their corresponding percentile ranks for
identified norm groups.
When using the norms tables in Appendix A, look for a group that is similar to
the individual or group tested. For example, you would compare the test score
of a person who applied for an engineer’s position with norms derived from the
scores of other engineers. If a person applied for a management position, you
would compare the candidate’s test score with norms for managers, or norms
for managers in manufacturing. When using the norms tables in Appendix A to
interpret candidates’ scores, keep in mind that norms are affected by the composition of the groups that participated in the normative study. Therefore, it
is important to examine specific industry and occupational characteristics of a
norm group.
By comparing an individual’s raw score to the data in a norms table, it is possible
to determine the percentile rank corresponding to that score. The percentile rank
indicates an individual’s relative position in the norm group. Percentiles should
not be confused with percentage scores, which represent the percentage of items
answered correctly. Percentiles are derived scores, expressed in terms of the percent
of people in the norm group scoring equal to or below a given raw score.
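The distinction between a percentile rank and a percentage score can be illustrated with a short calculation. The sketch below uses the definition given above (percent of the norm group scoring at or below a raw score); the norm group data and function names are hypothetical and do not come from Appendix A.

```python
from typing import Sequence


def percentile_rank(raw: int, norm_scores: Sequence[int]) -> float:
    """Percent of the norm group scoring at or below the given raw score."""
    if not norm_scores:
        raise ValueError("norm group is empty")
    at_or_below = sum(1 for s in norm_scores if s <= raw)
    return 100.0 * at_or_below / len(norm_scores)


def percentage_correct(raw: int, max_score: int = 40) -> float:
    """Percentage of items answered correctly (not a percentile)."""
    return 100.0 * raw / max_score


# Hypothetical norm group of ten examinees.
norm_group = [22, 25, 27, 29, 30, 31, 33, 35, 36, 38]
print(percentile_rank(35, norm_group))  # 8 of the 10 scored 35 or lower
print(percentage_correct(35))           # 35 of 40 items correct
```

The same raw score of 35 yields two different numbers because the percentile rank depends on the norm group, whereas the percentage score depends only on the number of items.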
Although percentiles are useful for explaining an examinee’s performance
relative to others, they have limitations. Percentile ranks do not have equal
intervals. In a normal distribution of scores, percentile ranks tend to cluster
around the 50th percentile. This clustering affects scores in the average range
the most because a difference of one or two raw score points may change the
percentile rank substantially. Extreme scores are less affected; a change of one or two raw
score points typically does not produce a large change in percentile ranks. These
factors should be taken into consideration when interpreting percentiles.
Converting Raw Scores to Percentile Ranks
To find the percentile rank of a candidate’s raw score, locate the raw score in
either the extreme right- or left-hand column of Tables A.4–A.7. The corresponding percentile rank is read from the selected norm group column. For example,
if a person applying for a job as an engineer had a score of 35 on the Watson-Glaser–Short Form, it is appropriate to use the Engineer norms in Appendix A
(Table A.5) for comparison. In this case, the percentile rank corresponding to a
raw score of 35 is 63. This percentile rank indicates that about 63% of the people
in the norm group scored 35 or lower on the Watson-Glaser–Short Form, and
about 37% scored higher than 35.
Each group’s size (N), mean, and standard deviation (SD) are shown at the bottom of the norms tables. The group mean or average is calculated by summing
the raw scores and dividing the sum by the total number of examinees. The
standard deviation indicates the amount of variation in a group of scores. In a
normal distribution, approximately two-thirds (68.26%) of the scores are within
the range of –1 SD (below the mean) to +1 SD (above the mean). These statistics
are often used in describing a study sample and setting cut scores. For example, a
cut score may be set as one SD below the mean.
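As an illustration of these statistics, the sketch below computes the mean, the standard deviation, and a cut score set one SD below the mean. The scores are hypothetical, and the choice of the population form of the standard deviation (pstdev) is an assumption; the manual does not state which form was used for the norms tables.

```python
import statistics

# Hypothetical raw scores for a small local group.
scores = [24, 28, 30, 31, 33, 34, 36, 40]

# Mean: sum of the raw scores divided by the number of examinees.
mean = statistics.mean(scores)
# pstdev treats the group as a complete population; statistics.stdev would
# give the sample estimate instead.
sd = statistics.pstdev(scores)
# Example cut score: one SD below the mean.
cut_score = mean - sd

print(round(mean, 2), round(sd, 2), round(cut_score, 2))
```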
In accordance with the Civil Rights Act of 1991, Title 1, Section 106, the norms
provided in Appendix A combine data for males and females, and for white and
minority examinees. The use of combined group norms can exacerbate adverse
impact if there are expected differences in scores due to differences in group
membership. Previous investigations conducted during the development of
Watson-Glaser Form A and Form B found no consistent differences between the
scores of male examinees and the scores of female examinees. Other studies of
earlier Forms Ym and Zm also found no consistent differences based on the sex
of the examinee in critical thinking ability as measured by the Watson-Glaser
(e.g., Burns, 1974; Gurfein, 1977; Simon & Ward, 1974).
Chapter 6
Development of the Short Form
The Watson-Glaser–Short Form is a shortened version of Form A. Historical and
test development information for Form A may be found in the Watson-Glaser,
Forms A and B Manual, 1980 edition.
Test Assembly Data Set
Two overlapping sets of data were used in the development of the Short Form.
The first data set consisted of item-level responses to Form A from 1,608 applicants and employees. These data were obtained from eight sources between 1989
and 1992. This data set was used to generate item statistics and to make decisions about item selection for inclusion in the Short Form. The average Form
A score for this sample was 61.78 (SD = 9.30) and the internal consistency (i.e.,
KR-20) coefficient was .87. Table 6.1 presents a frequency distribution of Form A
scores for the sample.
Table 6.1 Distribution of Item Development Sample Form A Scores (N = 1,608)

Form A Score  Frequency  Percent¹  Form A Score  Frequency  Percent¹
1             0          0.0       41            3          0.2
2             0          0.0       42            8          0.5
3             1          0.1       43            8          0.5
4             0          0.0       44            13         0.8
5             0          0.0       45            19         1.2
6             0          0.0       46            15         0.9
7             0          0.0       47            14         0.9
8             0          0.0       48            17         1.1
9             0          0.0       49            31         1.9
10            0          0.0       50            30         1.9
11            0          0.0       51            25         1.6
12            0          0.0       52            37         2.3
13            1          0.1       53            43         2.7
14            0          0.0       54            45         2.8
15            0          0.0       55            38         2.4
16            0          0.0       56            36         2.2
17            0          0.0       57            46         2.9
18            0          0.0       58            46         2.9
19            0          0.0       59            67         4.2
20            0          0.0       60            56         3.5
21            1          0.1       61            57         3.5
22            0          0.0       62            66         4.1
23            1          0.1       63            59         3.7
24            0          0.0       64            79         4.9
25            0          0.0       65            63         3.9
26            1          0.1       66            74         4.6
27            0          0.0       67            76         4.7
28            0          0.0       68            76         4.7
29            2          0.1       69            68         4.2
30            4          0.2       70            88         5.5
31            1          0.1       71            60         3.7
32            2          0.1       72            62         3.9
33            0          0.0       73            39         2.4
34            1          0.1       74            37         2.3
35            3          0.2       75            33         2.1
36            2          0.1       76            22         1.4
37            5          0.3       77            8          0.5
38            3          0.2       78            6          0.4
39            3          0.2       79            2          0.1
40            5          0.3       80            0          0.0

¹ The total percent equals 100.4 due to rounding.
A second set of data was created by combining the first data set with item-level
data obtained in 1993 from three additional sources of Form A data (N = 2,119).
The combined data set (N = 3,727) was used to evaluate the psychometric properties of the Short Form, including reliability, and to examine the equivalency
between the Short Form and Form A.
Criteria for Item Selection
In assembling the Short Form, the primary goal was to significantly reduce the
time limit required for Form A without changing the essential nature of the constructs measured. For additional information regarding the selection of items for
the Short Form, please refer to the 1994 edition of the Watson-Glaser manual.
The following criteria were used to select Short Form items:
• Maintenance of the Watson-Glaser five sub-test structure and the scenario-based format
• Use of psychometrically sound scenarios and items
• Maintenance of test reliability
• Maintenance of reading level
• Update of test currency
Maintenance of Reading Level
The reading level of the shortened test form was assessed using EDL Core
Vocabulary in Reading, Mathematics, Science, and Social Studies (Taylor et al., 1989)
and Basic Reading Vocabularies (Harris & Jacobson, 1982). Approximately 98.2%
of words appearing in directions and exercises were at or below ninth-grade reading level. A summary of word distribution by grade level is presented in Table 6.2.
Table 6.2 Grade Levels of Words on the Watson-Glaser–Short Form

Grade Level  Frequency  Percent
Preprimer    106        5.8
1            224        12.2
2            310        16.9
3            379        20.7
4            222        12.1
5            191        10.4
6            229        12.5
7            80         4.4
8            53         2.9
9            5          0.3
10           5          0.3
11           25         1.5
Total        1829       100.0
Updates to the Test
During the test assembly process, attention was given to the currency of Form
A scenarios and items. Some scenarios deal with dated subject matter, such as
Russia prior to the dissolution of the USSR. The item selection process removed
such dated scenarios, thereby making the composition of the Short Form more
contemporary.
Test Administration Time
The optional 30-minute time limit for the Short Form was established during a
study investigating the instrument’s test-retest reliability. A sample of 42 employees (92.9% non-minority; 54.8% female) at a large publishing company completed
the Short Form twice (two-week testing interval). The participants worked in a
variety of positions ranging from Secretary to Project Director. During the first
testing session, participants were given as much time as they required to complete the test. A frequency distribution of time taken (see Table 6.3) revealed that
approximately 90% of the respondents completed the Short Form in 30 minutes
or less. Consistent with the method used to establish testing time limits for previous forms of the Watson-Glaser, these results were used to set the time limit for
completing the Short Form at 30 minutes. The fact that most respondents completed the Short Form within 30 minutes supports
the point that the Watson-Glaser is a test of critical thinking power, rather than
speed. Furthermore, normative data gathered in both timed and untimed administration may be used to interpret Short Form results, as the variability in scores
is derived from test items rather than testing time limits.
Table 6.3 Frequency Distribution of Testing Time in Test-Retest Sample (n = 42)

Completion Time     Frequency  Percent  Cumulative Percent
20 minutes or less  2          4.8      4.8
21 to 25 minutes    14         33.3     38.1
26 to 30 minutes    22         52.4     90.5
31 minutes or more  4          9.5      100.0
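The cumulative percentages in Table 6.3 follow directly from the frequencies. The sketch below recomputes them from the four frequencies reported in the table; the variable names are illustrative only.

```python
from itertools import accumulate

# Completion-time bands and frequencies as reported in Table 6.3 (n = 42).
bands = [
    ("20 minutes or less", 2),
    ("21 to 25 minutes", 14),
    ("26 to 30 minutes", 22),
    ("31 minutes or more", 4),
]

total = sum(freq for _, freq in bands)
# Running total of examinees who had finished by the end of each band.
running = list(accumulate(freq for _, freq in bands))
cumulative_percent = [round(100 * count / total, 1) for count in running]

for (label, _), pct in zip(bands, cumulative_percent):
    print(f"{label}: {pct}%")
```

The third value corresponds to the share of respondents who finished within 30 minutes, which is the figure used to set the time limit.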
Chapter 7
Equivalence of Forms
Equivalence of Short Form to Form A
To support the equivalence of the Short Form and Form A, test item contents
were not changed and the new form was assembled from Form A test items.
As a result, the Short Form may be considered to measure the same abilities as
Form A.
Following assembly of the Short Form, correlation coefficients were computed
between raw scores on the Short Form and those on Form A. Because the constituent test items of the Short Form are completely contained in the longer Form A,
the coefficients are considered part-whole correlations (rpw). Part-whole correlations are known to overstate the relationship between independently measured
variables and cannot be interpreted as alternate form correlations. However, they
can be used to support the equivalence of a Short Form to a longer one, because
examinees are expected to respond the same way to the same items, regardless of
the form.
The overall correlation coefficient was calculated by using data from a sample of
3,727 adults who were administered Form A. To compute the correlations, each
Watson-Glaser was scored twice. First, the Form A raw score was computed, then
the Short Form raw score was computed by ignoring responses to the Form A
items not used in the Short Form. For the entire sample, the resulting coefficient
was .96. Correlations between Form A and the Short Form scores were also computed separately for each of 21 sources providing data, some of which were not
included in the Short Form developmental analysis. The resulting coefficients
are presented in Table 7.1. A description of the sample group is followed by the
group size (N) and the part-whole correlation (rpw) between the Short Form and
Form A. The coefficients presented in Table 7.1 indicate that raw scores on the
Short Form correlated very highly with Form A raw scores in a variety of groups.
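The double-scoring procedure described above can be sketched as follows. This is an illustration under stated assumptions: the item-level responses, the positions of the retained items, and the function names are hypothetical, and the actual Short Form item set is not reproduced here.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def score_twice(
    item_matrix: Sequence[Sequence[int]], short_items: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Score each answer document on the full form and on the retained items only."""
    full = [sum(row) for row in item_matrix]
    # Responses to items not retained in the short form are ignored.
    short = [sum(row[i] for i in short_items) for row in item_matrix]
    return full, short


# Hypothetical data: five examinees, four items (1 = correct), items 0 and 2 retained.
responses = [
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
full_scores, short_scores = score_twice(responses, short_items=[0, 2])
r_pw = pearson_r(full_scores, short_scores)
print(round(r_pw, 2))
```

Because every retained item also contributes to the full-form score, the two score lists share error as well as true variance, which is why part-whole coefficients run higher than alternate-form correlations.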
Table 7.1 Part–Whole Correlations (rpw) of the Short Form and Form A

N    rpw  Group
219  .93  Lower-level management applicants
501  .94  Lower to upper-level management applicants
211  .94  Mid-level management applicants
215  .94  Upper-level management applicants at Board of County Commissioners
322  .94  Construction management applicants
453  .93  Executive management applicants
149  .94  Supervisory and managerial applicants in the corrugated container industry
473  .94  Sales applicants
909  .94  Mid-level marketing applicants
95   .95  Bank employees
131  .94  Bank management associates
126  .95  Candidates for the ministry
99   .91  Clergy
199  .92  Railroad dispatchers
111  .95  Nurse managers and educators
225  .95  Police officers
23   .97  Administrative applicants in city government
42   .89  Security applicants
41   .89  Candidates for police captain
55   .94  Police department executives
133  .97  Various occupations
Equivalent Raw Scores
Table 7.2 presents raw score equivalents for the Short Form and Form A. For
every possible score on the Short Form, this table contains an equivalent raw
score on Form A. To convert a Form A raw score, find that score in the Form A
column. Then, look in the Short Form raw score column on the left. To convert
a Short Form raw score, simply reverse the procedure. Table 7.2 was prepared
with data obtained from the 3,727 adults in the combined data set. To
establish equivalent raw scores, raw-score-to-ability estimates for both the Short
Form and Form A were generated using Rasch-model difficulty parameters. Then,
using interpolation when necessary, the ability estimates were calibrated for all
possible scores on each form. Form A and Short Form raw scores corresponding
to the same ability estimate were considered equivalent (i.e., represent the same
ability level).
Organizations requiring an alternative to the Short Form for retesting or other
purposes may use Form B. Form A and Form B are equivalent, alternate forms.
Raw scores on one form (A or B) may be interpreted as having the same meaning
as identical raw scores on the other form (A or B). Therefore, scores from either
Form A or Form B may be equated to Form S using Table 7.2.
Table 7.2 Raw Score Equivalencies Between the Short Form and Form A

Short Form Raw Score  Form A Raw Score    Short Form Raw Score  Form A Raw Score
40                    78–80               20                    40–41
39                    77                  19                    38–39
38                    75–76               18                    36–37
37                    73–74               17                    34–35
36                    71–72               16                    33
35                    69–70               15                    31–32
34                    67–68               14                    29–30
33                    65–66               13                    27–28
32                    63–64               12                    25–26
31                    61–62               11                    23–24
30                    59–60               10                    21–22
29                    57–58               9                     19–20
28                    55–56               8                     17–18
27                    53–54               7                     15–16
26                    51–52               6                     13–14
25                    49–50               5                     10–12
24                    48                  4                     8–9
23                    46–47               3                     6–7
22                    44–45               2                     4–5
21                    42–43               1                     1–3
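For organizations that keep scores electronically, the conversion in Table 7.2 can be handled as a simple lookup. The sketch below transcribes the table values; the dictionary and function names are hypothetical and are not part of any Watson-Glaser software.

```python
from typing import Tuple

# Form A raw-score range (low, high) for each Short Form raw score,
# transcribed from Table 7.2.
SHORT_TO_FORM_A = {
    40: (78, 80), 39: (77, 77), 38: (75, 76), 37: (73, 74), 36: (71, 72),
    35: (69, 70), 34: (67, 68), 33: (65, 66), 32: (63, 64), 31: (61, 62),
    30: (59, 60), 29: (57, 58), 28: (55, 56), 27: (53, 54), 26: (51, 52),
    25: (49, 50), 24: (48, 48), 23: (46, 47), 22: (44, 45), 21: (42, 43),
    20: (40, 41), 19: (38, 39), 18: (36, 37), 17: (34, 35), 16: (33, 33),
    15: (31, 32), 14: (29, 30), 13: (27, 28), 12: (25, 26), 11: (23, 24),
    10: (21, 22), 9: (19, 20), 8: (17, 18), 7: (15, 16), 6: (13, 14),
    5: (10, 12), 4: (8, 9), 3: (6, 7), 2: (4, 5), 1: (1, 3),
}


def form_a_range(short_raw: int) -> Tuple[int, int]:
    """Return the Form A raw-score range equivalent to a Short Form raw score."""
    if short_raw not in SHORT_TO_FORM_A:
        raise ValueError("Short Form raw score is outside the range covered by Table 7.2")
    return SHORT_TO_FORM_A[short_raw]


def short_form_equivalent(form_a_raw: int) -> int:
    """Return the Short Form raw score equivalent to a Form A raw score."""
    for short_raw, (low, high) in SHORT_TO_FORM_A.items():
        if low <= form_a_raw <= high:
            return short_raw
    raise ValueError("Form A raw score is outside the range covered by Table 7.2")


print(form_a_range(35))
print(short_form_equivalent(48))
```

A score of zero does not appear in Table 7.2, so the sketch raises an error rather than guessing an equivalent.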
Equivalence of Computer-Based and Paper-and-Pencil Versions of the Short Form
Studies of the effect of the medium of test administration have generally
supported the equivalence of paper and computerized versions of non-speeded
cognitive ability tests (Mead & Drasgow, 1993). To ensure that these findings
held true for the Watson-Glaser, Harcourt Assessment conducted an equivalency
study using paper-and-pencil and computer-administered versions of the
Short Form.
In this study, a counter-balanced design was employed using a sample of 226
adult participants from a variety of occupations. Approximately half of the group
(n = 118) completed the paper form followed by the online version, while the
other participants (n = 108) completed the tests in the reverse order. Table 7.3
presents means, standard deviations, and correlations obtained from an analysis
of the resulting data. As indicated in the table, neither mode of administration
yielded consistently higher raw scores, and mean score differences between
modes were less than one point (0.5 and 0.7). The variability of scores also was
very similar, with standard deviations ranging from 5.5 to 5.7.
The coefficients indicate that paper-and-pencil raw scores correlate very highly
with online administration raw scores (.86 and .88, respectively). The high correlations provide further support that the two modes of administration can be
considered equivalent. Thus, raw scores on one form (paper or online) may
be interpreted as having the same meaning as identical raw scores on the
other form.
Table 7.3 Equivalency of Paper and Online Modes of Administration

Administration Order      N    Paper Mean  Paper SD  Online Mean  Online SD  r
Paper Followed by Online  118  30.1        5.7       30.6         5.5        .86
Online Followed by Paper  108  29.5        5.5       28.8         5.7        .88
Chapter 8
Evidence of Reliability
The reliability of a measurement instrument refers to the accuracy and precision
of test results and is a widely used indicator of the confidence that may be placed
in those results. The reliability of a test is expressed as a correlation coefficient that
represents the consistency of scores that would be obtained if a test could be
given an infinite number of times. In actual practice, however, we do not have
the luxury of administering a test an infinite number of times, so we can expect
some measurement error. Reliability coefficients help us to estimate the amount
of error associated with test scores. Reliability coefficients can range from .00 to
1.00. The closer the reliability coefficient is to 1.00, the more reliable the test. A
perfectly reliable test would have a reliability coefficient of 1.00 and no measurement error. A completely unreliable test would have a reliability coefficient of
.00. The U.S. Department of Labor (1999) provides the following general guidelines for interpreting a reliability coefficient: above .89 is considered “excellent,”
.80–.89 is “good,” .70–.79 is considered “adequate,” and below .70 “may have
limited applicability.”
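The guideline bands quoted above can be applied mechanically to a reported coefficient. The sketch below encodes those bands; the function name is hypothetical, and the labels are the ones quoted from the U.S. Department of Labor (1999) guidelines.

```python
def interpret_reliability(r: float) -> str:
    """Map a reliability coefficient to the guideline label quoted in the text."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability coefficients range from .00 to 1.00")
    if r > 0.89:
        return "excellent"
    if r >= 0.80:
        return "good"
    if r >= 0.70:
        return "adequate"
    return "may have limited applicability"


print(interpret_reliability(0.81))
```

Coefficients are assumed to be rounded to two decimal places, as they are reported in this manual, before the bands are applied.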
The methods most commonly used to estimate test reliability are test-retest
(the stability of test scores over time), alternate forms (the consistency of scores
across alternate forms of a test), and internal consistency of the test items (e.g.,
Cronbach’s alpha coefficient; Cronbach, 1970).
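For items scored simply right or wrong, Cronbach’s alpha is equivalent to the Kuder-Richardson formula 20 (KR-20). The sketch below shows that calculation on a small hypothetical response matrix. The data and function name are illustrative, and the use of the population variance of total scores is an assumption; some treatments use the sample variance instead.

```python
from typing import Sequence


def kr20(item_matrix: Sequence[Sequence[int]]) -> float:
    """KR-20 for dichotomous items; rows are examinees, columns are items (1 = correct)."""
    n = len(item_matrix)
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n
    # Population variance of the total scores (assumed convention).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # Sum over items of p * q, where p is the proportion answering correctly.
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)


# Hypothetical responses: four examinees, three items.
data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(data))
```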
Since repeated testing always results in some variation, no single test event ever
measures an examinee’s actual ability with complete accuracy. We therefore need
an estimate of the possible amount of error present in a test score, or the amount
that scores would probably vary if an examinee were tested repeatedly with the
same test. This error is known as the standard error of measurement (SEM). The SEM
decreases as the reliability of a test increases; a large SEM denotes less reliable
measurement and less reliable scores.
The SEM is a quantity that is added to and subtracted from an examinee’s test
score to create a confidence interval or band of scores around the obtained score.
The confidence interval is a score range that, in all likelihood, includes the
examinee’s hypothetical “true” score which represents the examinee’s actual
ability. A true score is a theoretical score entirely free of error. Since the true
score is a hypothetical value that can never be obtained because testing always
involves some measurement error, the score obtained by an examinee on any
test will vary somewhat from administration to administration. As a result,
any obtained score is considered only an estimate of the examinee’s “true”
score. Approximately 68% of the time, the observed score will lie within +1.0
and –1.0 SEM of the true score; 95% of the time, the observed score will lie
within +1.96 and –1.96 SEM of the true score; and 99% of the time, the
observed score will lie within +2.58 and –2.58 SEM of the true score.
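The confidence bands described above can be computed once the SEM is known. The sketch below uses the standard classical test theory formula, SEM = SD × √(1 − reliability), which is not stated in this manual; the SD, reliability, and observed score are hypothetical values chosen for illustration.

```python
from math import sqrt
from typing import Tuple


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory SEM: SD multiplied by sqrt(1 - reliability)."""
    return sd * sqrt(1.0 - reliability)


def confidence_band(observed: float, sem: float, z: float = 1.96) -> Tuple[float, float]:
    """Band of observed score plus or minus z SEMs (z = 1.0, 1.96, or 2.58)."""
    return observed - z * sem, observed + z * sem


# Hypothetical group statistics and observed score.
sem = standard_error_of_measurement(sd=5.0, reliability=0.81)
low, high = confidence_band(30, sem, z=1.96)
print(round(sem, 2), round(low, 1), round(high, 1))
```

Passing z = 1.0 or z = 2.58 gives the 68% and 99% bands, respectively.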
Historical Reliability
Previous Studies of Internal Consistency Reliability
For the sample used in the initial 1994 development of the Watson-Glaser Short
Form (N = 1,608), Cronbach’s alpha coefficient (r) was .81. Cronbach’s alpha and
the SEM were also calculated for a number of groups separately, including some
groups that were in the development sample and some that were not in the
development sample (see Table 8.1).
Table 8.1 Means, Standard Deviations (SD), Standard Errors of Measurement (SEM), and Internal Consistency
Reliability Coefficients (r alpha), Based on Previous Studies

N    Mean   SD    SEM   r alpha  Group
219  33.50  4.40  2.17  .76      Lower-level management applicants
501  32.29  4.63  2.31  .75      Lower to upper-level management applicants
211  33.99  4.20  2.12  .74      Mid-level management applicants
215  31.80  5.20  2.35  .80      Upper-level management applicants at a Board of County Commissioners
453  33.42  4.21  2.18  .73      Executive management applicants
322  32.05  4.87  2.32  .77      Construction management applicants
149  31.48  5.00  2.39  .77      Supervisory and managerial applicants in the corrugated container industry
473  30.88  4.98  2.43  .76      Sales applicants
909  31.02  5.08  2.42  .77      Mid-level marketing applicants
95   32.75  4.58  2.25  .76      Bank employees
131  31.61  4.69  2.39  .74      Bank management associates
126  34.10  4.71  2.08  .80      Candidates for the ministry
99   34.56  3.79  2.05  .71      Clergy
199  25.15  5.00  2.78  .69      Railroad dispatchers
111  30.52  4.86  2.46  .74      Nurse managers and educators
225  28.00  6.03  2.64  .81      Police officers
133  30.68  6.65  2.40  .87      Various occupations
23   30.43  5.82  2.42  .83      Administrative applicants in city government¹
42   25.00  4.79  2.77  .67      Security applicants
41   27.95  4.60  2.69  .66      Candidates for police captain²
55   32.56  4.14  2.32  .69      Police department executives³

¹ D.O.T. Code 169.167-010
² D.O.T. Code 375.167-034
³ Includes Commander (D.O.T. Codes 375.167-034 and 375.267-026), Chief (D.O.T. Code 375.117-010), Deputy Chief
(D.O.T. Code 375.267-026), and Warden (D.O.T. Code 187.117-018)
Using the SEM means that scores are interpreted as bands or ranges of scores,
rather than as precise points. Thinking in terms of score ranges serves as a check
against overemphasizing small differences between scores. The SEM may be used
to determine whether an individual’s score is significantly different from a cut
score, or whether the scores of two individuals differ significantly. An example of
one general rule of thumb is that the difference between two scores on the same
test should not be interpreted as significant unless the difference is equal to at
least twice the SEM of the test (Aiken, 1979; as reported in Cascio, 1982).
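The rule of thumb above can be sketched in a few lines of Python. The function name and the example scores are hypothetical, and the SEM of 2.4 is an illustrative value only; the sketch simply checks whether two scores on the same test differ by at least twice the SEM.

```python
def differs_by_rule_of_thumb(score_a, score_b, sem):
    """Return True if two scores differ by at least twice the SEM."""
    return abs(score_a - score_b) >= 2 * sem

# Hypothetical comparisons on a test with SEM = 2.4 (threshold = 4.8 points).
print(differs_by_rule_of_thumb(33, 30, 2.4))  # 3-point difference
print(differs_by_rule_of_thumb(36, 30, 2.4))  # 6-point difference
```

Under these assumptions a 3-point difference would not be interpreted as significant, while a 6-point difference would be.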
The internal consistency estimates calculated for the Short Form tests were
moderately low, consistent with research involving previous forms of the
Watson-Glaser; for this reason, scores on the individual tests should not be used on their own.
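For readers who want to see how an internal consistency coefficient of this kind is obtained, the following Python sketch computes Cronbach's alpha from item-level scores using the standard formula: alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The response matrix and the function name are hypothetical and are not Watson-Glaser data.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of examinee rows of item scores."""
    n_items = len(responses[0])
    # Variance of each item across examinees (columns of the matrix).
    item_variances = [variance(column) for column in zip(*responses)]
    # Variance of the examinees' total scores.
    total_variance = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: 5 examinees x 4 items, scored 1 (correct) or 0 (incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Because alpha depends on the number of items, a short test of a few items will tend to show a lower coefficient than a longer total score built from the same kind of items.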
Previous Studies of Test-Retest Reliability
In 1994, a study investigating the test-retest reliability and required completion time of the Watson-Glaser–Short Form was conducted at a large publishing
company. A sample of 42 employees (92.9% non-minority; 54.8% female) completed the Short Form two weeks apart. The participants worked in a variety of
positions ranging from Secretary to Project Director. The mean score at the first
testing was 30.5 (SD = 5.6) and at the second testing 31.4 (SD = 5.9), while the
test-retest correlation was .81 (p < .001). Scores for female (Mean = 31.0, SD = 6.1)
and male (Mean = 31.8, SD = 5.7) respondents were not significantly different
(t = 0.11, df = 40).
Current Reliability Studies
Evidence of Internal Consistency Reliability
Cronbach’s alpha and the standard error of measurement (SEM) were calculated
for the samples used for the current norm groups (see Table 8.2). Reliability
estimates for these samples were similar to those found in previous studies and
ranged from .76 to .85. Consistent with previous research, these values indicate
that the total score possesses adequate reliability. The test scores obtained lower
estimates of internal consistency reliability, thereby suggesting that the test
scores alone should not be used.
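The link between the SD, the reliability coefficient, and the SEM can be checked with the standard classical test theory formula, SEM = SD x sqrt(1 - reliability). The Python sketch below assumes that this is the formula behind the tabled SEM values; the function name is hypothetical, and the inputs are the SD (6.1) and ralpha (.83) reported in Table 8.2 for the Advertising/Marketing/Public Relations group.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: SD times the square root of (1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# SD and ralpha as reported for Advertising/Marketing/Public Relations in Table 8.2.
sem = standard_error_of_measurement(6.1, 0.83)
print(round(sem, 2))
```

The result rounds to 2.52, which agrees with the SEM listed for that group.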
Table 8.2 Means, Standard Deviations (SD), Standard Errors of Measurement (SEM), and Internal Consistency
Reliability Coefficients (ralpha) for the Current Short Form Norm Groups

Group | N | Mean | SD | SEM | ralpha
Industry
Advertising/Marketing/Public Relations | 101 | 28.7 | 6.1 | 2.52 | .83
Education | 119 | 30.2 | 5.4 | 2.47 | .79
Financial Services/Banking/Insurance | 228 | 31.2 | 5.7 | 2.42 | .82
Government/Public Service/Defense | 130 | 30.0 | 6.3 | 2.44 | .85
Health Care | 195 | 28.3 | 6.5 | 2.60 | .84
Information Technology/Telecommunications | 295 | 31.2 | 5.5 | 2.40 | .81
Manufacturing/Production | 561 | 32.0 | 5.3 | 2.31 | .81
Professional Business Services (e.g., Consulting, Legal) | 153 | 31.9 | 5.6 | 2.31 | .83
Retail/Wholesale | 307 | 30.8 | 5.3 | 2.43 | .79
Occupation
Accountant/Auditor/Bookkeeper | 118 | 30.2 | 5.8 | 2.46 | .82
Consultant | 139 | 33.3 | 4.8 | 2.20 | .79
Engineer | 225 | 32.8 | 4.8 | 2.25 | .78
Human Resource Professional | 140 | 30.0 | 5.7 | 2.48 | .81
Information Technology Professional | 222 | 31.4 | 5.9 | 2.36 | .84
Sales Representative - Non-Retail | 353 | 29.8 | 5.1 | 2.50 | .76
Position Type/Level
Executive | 409 | 33.4 | 4.5 | 2.20 | .76
Director | 387 | 32.9 | 4.7 | 2.20 | .78
Manager | 973 | 30.7 | 5.4 | 2.41 | .80
Supervisor | 202 | 28.8 | 6.2 | 2.63 | .82
Professional/Individual Contributor | 842 | 30.6 | 5.6 | 2.44 | .81
Hourly/Entry-Level | 332 | 27.7 | 5.9 | 2.64 | .80
Norms by Occupation Within Specific Industry
Manager in Manufacturing/Production | 170 | 31.9 | 5.3 | 2.31 | .81
Engineer in Manufacturing/Production | 112 | 32.9 | 4.7 | 2.25 | .77
Evidence of Test-Retest Reliability
Test-retest reliability was evaluated for the total score and for the individual test
scores in a sample of job incumbents representing various organizational levels
and industries (N = 57). The test-retest intervals ranged from 4 to 26 days, with
a mean interval of 11 days. As the data in Table 8.3 indicate, the Watson-Glaser
Total score demonstrates acceptable test-retest reliability (r12 = .89). The difference
in mean scores between the first testing and the second testing is statistically
small (d′ = 0.17). This difference (d′), proposed by Cohen (1988), is useful as an
index to measure the magnitude of the actual difference between two means.
The difference (d′) is calculated by dividing the difference of the two test
means by the square root of the pooled variance, using Cohen’s (1996)
Formula 10.4.
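The calculation described above can be sketched as follows. This is a minimal Python illustration that assumes both testings have the same number of examinees, so that the pooled variance is the simple average of the two variances; it may not match Cohen's (1996) Formula 10.4 in every detail. The function name is hypothetical, and the inputs are the Total score means and SDs reported in Table 8.3.

```python
import math

def difference_index(mean_1, sd_1, mean_2, sd_2):
    """Difference between two means divided by the square root of the pooled variance.

    Assumes equal sample sizes, so the pooled variance is the average
    of the two variances.
    """
    pooled_variance = (sd_1 ** 2 + sd_2 ** 2) / 2
    return (mean_2 - mean_1) / math.sqrt(pooled_variance)

# Total score: first testing Mean = 29.5, SD = 7.0; second testing Mean = 30.7, SD = 7.0.
d = difference_index(29.5, 7.0, 30.7, 7.0)
print(round(d, 2))
```

With these inputs the index rounds to 0.17, matching the value reported for the Total score.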
For the Watson-Glaser Total score, the small difference index (d′ = 0.17)
indicates that the difference in mean scores between the first testing and the
retesting is small in magnitude. In other words, the
Watson-Glaser Total score is stable over the test-retest period. The test-retest reliability coefficients of the test scores are somewhat lower, suggesting that the
Total score is more reliable than the test scores as a measure of critical thinking.
Table 8.3 Test-Retest Reliability of the Short Form

Score | First Testing Mean | First Testing SD | Second Testing Mean | Second Testing SD | r12 | Difference (d′)
Total | 29.5 | 7.0 | 30.7 | 7.0 | .89 | .17
Inference | 4.3 | 1.9 | 4.6 | 1.9 | .70 | .16
Recognition of Assumptions | 6.1 | 2.2 | 6.5 | 2.0 | .83 | .19
Deduction | 7.1 | 1.8 | 7.0 | 2.0 | .55 | –.05
Interpretation | 5.0 | 1.6 | 5.1 | 1.6 | .78 | .06
Evaluation of Arguments | 7.2 | 1.6 | 7.5 | 1.3 | .67 | .21
9
Evidence of Validity
The validity of a test refers to the degree to which specific data, research, or
theory support that the test measures what it is intended to measure. Validity is a
unitary concept. It is the extent to which all the accumulated evidence supports
the intended interpretation of test scores for the proposed purpose (AERA, APA,
& NCME, 1999). “Validity is high if a test gives the information the decision
maker needs” (Cronbach, 1970). Data from the Short Form sample were analyzed
for evidence of validity based on content, test-criterion relationships, and evidence of convergent and discriminant validity.
Evidence of Validity Based on Content
Evidence based on the content of a test exists when a test includes a representative sample of tasks, behaviors, knowledge, skills, abilities, or other characteristics
necessary to perform the job. Evidence of content validity is usually gathered
through job analysis and is most appropriate for evaluating knowledge and
skills tests.
Evaluation of content-related evidence is usually a rational, judgmental process
(Cascio & Aguinis, 2005). In employment settings, the principal concern is with
making inferences about how well the test samples a job performance domain—a
segment or aspect of the job performance universe which has been identified and
about which inferences are to be made (Lawshe, 1975). Because most jobs have
several performance domains, a standardized test generally applies only to one
segment of the job performance universe (e.g., a typing test administered to a
secretary applies to typing, one job performance domain in the job performance
universe of a secretary). Thus, the judgment of whether content-related evidence
exists depends upon an evaluation of whether the same capabilities are required
in both the job performance domain and the test (Cascio & Aguinis, 2005).
In an employment setting, evidence based on test content should be established
by demonstrating that the jobs for which the test will be used require the critical thinking abilities measured by the Watson-Glaser. Content-related validity
evidence of the Watson-Glaser in classroom and instructional settings may be
examined by noting the extent to which the Watson-Glaser measures a sample of
the specified objectives of such instructional programs.
Evidence of Criterion-Related Validity
One of the primary reasons tests are used is to provide an educated guess about
an examinee’s potential for future success. For example, selection tests are used
to hire or promote those individuals most likely to be productive employees. The
rationale behind selection tests is this: the better an individual performs on a
test, the better this individual will perform as an employee.
Criterion-related validity evidence addresses the inference that individuals who
score better on tests will be successful on some criterion of interest. Criterion-related validity evidence indicates the statistical relationship (e.g., for a given
sample of job applicants or incumbents) between scores on the test and one
or more criteria, or between scores on the tests and independently obtained
measures of subsequent job performance. By collecting test scores and criterion
scores (e.g., job performance ratings, grades in a training course, supervisor ratings), one can determine how much confidence may be placed in using test
scores to predict job success. Typically, correlations between criterion measures
and scores on the test serve as indices of criterion-related validity evidence.
Provided that the conditions for a meaningful validity study have been met
(sufficient sample size, adequate criteria, etc.), these correlation coefficients are
important indices of the utility of the test.
Unfortunately, the conditions for evaluating criterion-related validity evidence
are often difficult to fulfill in the ordinary employment setting. Studies of test-criterion relationships should involve a sufficiently large number of persons
hired for the same job and evaluated for success using a uniform criterion measure. The criterion itself should be reliable and job-relevant, and should provide
a wide range of scores. In order to evaluate the quality of studies of test-criterion
relationships, it is essential to know at least the size of the sample and the nature
of the criterion.
Assuming that the conditions for a meaningful evaluation of criterion-related
validity evidence have been met, Cronbach (1970) characterized validity coefficients of .30 or better as having “definite practical value.” The U.S. Department
of Labor (1999) provides the following general guidelines for interpreting validity coefficients: above .35 are considered “very beneficial,” .21–.35 are considered
“likely to be useful,” .11–.20 “depends on the circumstances,” and below .11
“unlikely to be useful.” It is important to point out that even relatively low
validities (e.g., .20) may justify the use of a test in a selection program (Anastasi
& Urbina, 1997). This is because the practical value of the test
depends not only on its validity but also on other factors, such as the base rate for
success on the job (i.e., the proportion of people who would be successful in the
absence of any selection procedure). If the base rate for success on the job is low
(i.e., few people would be successful on the job), tests of low validity can have
considerable utility or value. When the base rate is high (i.e., selected at random,
most people would succeed on the job), even highly valid tests may not contribute significantly to the selection process.
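The U.S. Department of Labor (1999) guidelines quoted above can be expressed as a simple lookup. The Python sketch below is only an illustration: the function name is hypothetical, and it assumes non-negative coefficients reported to two decimal places, so that the quoted ranges leave no gaps.

```python
def interpret_validity(coefficient):
    """Return the guideline label quoted in the text for a validity coefficient."""
    r = round(coefficient, 2)  # assumption: coefficients are reported to two decimals
    if r > 0.35:
        return "very beneficial"
    if r >= 0.21:
        return "likely to be useful"
    if r >= 0.11:
        return "depends on the circumstances"
    return "unlikely to be useful"

for r in (0.40, 0.33, 0.16, 0.05):
    print(r, "->", interpret_validity(r))
```

As the surrounding discussion notes, such labels are only general guidance, because the practical value of a test also depends on factors such as the base rate for success on the job.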
In addition to the practical value of validity coefficients, the statistical significance
of coefficients should be noted. Statistical significance refers to the odds that a
non-zero correlation of the observed size could have occurred by chance. If the odds
are 1 in 20 or less that such a correlation could have occurred by chance, then the correlation is
considered statistically significant. Some experts prefer even more stringent odds,
such as 1 in 100, although the generally accepted odds are 1 in 20. In statistical
analyses, these odds are designated by the lower case p (probability) to signify
whether a non-zero correlation is statistically significant. When p is less than or
equal to .05, the odds are presumed to be 1 in 20 (or less) that a non-zero correlation of that size could have occurred by chance. When p is less than or equal to
.01, the odds are presumed to be 1 in 100 (or less) that a non-zero correlation of
that size occurred by chance.
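To make the idea of p concrete, the following Python sketch approximates the two-sided probability that a correlation at least as large as the observed one would arise by chance when the true correlation is zero. It uses the Fisher z-transformation, which is a large-sample approximation and not necessarily the method used in the studies cited in this chapter. The function name is hypothetical; the example values (r = .33 with N = 142, and r = .25 with N = 64) are taken from the studies reported later in this chapter.

```python
import math
from statistics import NormalDist

def correlation_p_value(r, n):
    """Approximate two-sided p-value for the hypothesis that the true correlation is zero.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is treated
    as a standard normal deviate (a large-sample approximation).
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_large_sample = correlation_p_value(0.33, 142)
p_small_sample = correlation_p_value(0.25, 64)
print(p_large_sample < 0.01)   # meets the more stringent 1-in-100 standard
print(p_small_sample < 0.05)   # meets the 1-in-20 standard
```

The comparison illustrates why sample size matters: a correlation of a given size is more likely to reach significance in a larger sample.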
Previous Studies of Evidence of Criterion-Related Validity
Previous studies have shown evidence of the relationship between Watson-Glaser
scores and various job and academic success criteria. Gaston (1993), in a study of
law enforcement personnel, found a relationship between Watson-Glaser scores
and organizational level. Among his findings were that executives scored in the
7–9 decile range more often than non-executives, and non-executives scored in
the 1–3 decile range more often than executives. Holmgren and Covin (1984),
in a study of students majoring in education, found Watson-Glaser scores correlated .50 with grade-point average (GPA) and .46 with English Proficiency Test
scores. In studies of nursing students, Watson-Glaser scores correlated .50 with
National Council Licensure Exam (NCLEX) scores (Bauwens and Gerhard, 1987)
and .38 with state licensing exam scores (Gross, Takazawa, & Rose). Additional
studies of the relationship between Watson-Glaser scores and various criteria are
reported in the previous version of the manual (1994).
Current Studies of Evidence of Criterion-Related Validity
Studies continue to provide strong criterion-related validity evidence for the
Watson-Glaser. Kudisch and Hoffman (2002) reported that, in a sample of 71
leadership assessment center participants, Watson-Glaser scores correlated .58 with
ratings on Analysis and .43 with ratings on Judgment. Ratings on Analysis
and Judgment were based on participants’ performance on assessment center
exercises, including a coaching meeting, in-basket exercise or simulation, and a
leaderless group discussion.
Spector, Schneider, Vance, and Hezlett (2000) evaluated the relationship
between Watson-Glaser scores and assessment center exercise performance for
managerial and executive level assessment center participants. They found that
Watson-Glaser scores significantly correlated with overall scores on six of eight
assessment center exercises, and related more strongly to exercises involving
primarily cognitive problem-solving skills (e.g., r = .26, p < .05, with in-basket
scores) than exercises involving a greater level of interpersonal skills (e.g., r = .16,
p < .05, with in-basket coaching exercise).
In a study we conducted for this revision of the manual in 2005, we examined
the relationship between Watson-Glaser scores and on-the-job performance
of 142 job incumbents in various industries. Job performance was defined as
supervisory ratings on behaviors determined through research to be important
to most professional, managerial, and executive jobs. The study found that
Watson-Glaser scores correlated .33 with supervisory ratings on a dimension
made up of Analysis and Problem Solving behaviors, and .23 with supervisory
ratings on a dimension made up of Judgment and Decision Making behaviors.
Supervisory ratings from the sum of ratings on 19 job performance behaviors
(“Total Performance”), as well as ratings on a single-item measure of “Overall
Potential” were also obtained. The Watson-Glaser scores correlated .28 with
“Total Performance” and .24 with ratings of Overall Potential.
In an analysis of a sub-group of the 2005 study mentioned above, we examined
the relationship between the Watson-Glaser scores and on-the-job performance
of 64 analysts from a government agency. The results showed that Watson-Glaser
scores correlated .40 with supervisory ratings on each of the two dimensions
composed of (a) Analysis and Problem Solving behaviors and (b) Judgment
and Decision Making behaviors, and correlated .37 with supervisory ratings
on a dimension composed of behaviors dealing with Professional/Technical
Knowledge and Expertise. In the sample of 64 analysts mentioned above, the
Watson-Glaser scores correlated .39 with “Total Performance” and .25 with
Overall Potential.
Another part of the study we conducted in 2005 for this revision of the manual
examined the relationship between Watson-Glaser scores and job success as
indicated by organizational level achieved, for 2,303 job incumbents across 9
industry categories. Results indicated that Watson-Glaser scores correlated .33
with organizational level.
Other studies of job-relevant criteria have found significant correlations between
Watson-Glaser scores and creativity (Gadzella & Penland, 1995), facilitator effectiveness (Offner, 2000), positive attitudes toward women (Loo & Thorpe, 2005),
and openness to experience (Spector, et al., 2000).
In the educational domain, Behrens (1996) found that Watson-Glaser scores
correlated .59, .53, and .51 respectively, with semester GPA for three freshmen classes in a Pennsylvania nursing program. Similarly, Gadzella, Baloglu, &
Stephens (2002) found Watson-Glaser subscale scores explained 17% of the total
variance in GPA (equivalent to a multiple correlation of .41) for 114 Education
students. Williams (2003), in a study of 428 educational psychology students,
found Watson-Glaser total scores correlated .42 and .57 with mid-term and final
exam scores, respectively. Gadzella, Ginther, and Bryant (1996), in a study of 98
college freshmen, found that Watson-Glaser scores were significantly higher for
A students than B and C students, and significantly higher for B students relative
to C students.
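The equivalence noted above between 17% of variance explained and a multiple correlation of .41 follows from taking the square root of the proportion of variance explained. A minimal Python check, with a hypothetical function name:

```python
import math

def multiple_correlation(variance_explained):
    """Multiple correlation R as the square root of the proportion of variance explained."""
    return math.sqrt(variance_explained)

print(round(multiple_correlation(0.17), 2))
```

The result rounds to .41, in agreement with the figure given in the text.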
Studies have also shown significant relationships between Watson-Glaser scores
and clinical decision making effectiveness (Shin, 1998), educational experience
and level (Duchesne, 1996; Shin, 1998; Yang & Lin, 2004), educational level of
parents (Yang & Lin, 2004), academic performance during pre-clinical years
of medical education (Scott & Markert, 1994), and preference for contingent,
relativistic thinking versus “black-white, right-wrong” thinking (Taube, 1995).
Table 9.1 presents a summary of studies that evaluated criterion-related
validity evidence for the Watson-Glaser since 1994 when the previous manual
was published. Only studies that reported validity coefficients are shown.
Additional studies are reported in this chapter as well as in the previous version
of the manual (1994).
In Table 9.1, the column entitled N details the number of cases in the sample.
The criterion measures include job performance and grade point average, among
others. Means and standard deviations, for studies in which they were available, are shown for both the test and criterion measures. The validity coefficient
for the sample appears in the last column. Validity coefficients such as those
reported in Table 9.1 apply to the specific samples listed.
Table 9.1 Studies Showing Evidence of Criterion-Related Validity

Group | N | Form | Watson-Glaser Mean | Watson-Glaser SD | Criterion Description | Criterion Mean | Criterion SD | r
Leadership assessment center participants from a national retail chain and a utility service (Kudisch & Hoffman, 2002) | 71 | 80-item | – | – | Assessor Ratings: Analysis | – | – | .58*
 | | | | | Assessor Ratings: Judgment | – | – | .43*
Middle-management assessment center participants (Spector, Schneider, Vance, & Hezlett, 2000) | 189–407 | 80-item | 66.5 | 7.3 | Assessor Ratings: In-basket | 2.9 | 0.7 | .26*
 | | | | | Assessor Ratings: In-basket Coaching | 3.1 | 0.7 | .16*
 | | | | | Assessor Ratings: Leaderless Group | 3.0 | 0.6 | .19*
 | | | | | Assessor Ratings: Project Presentation | 3.0 | 0.7 | .25*
 | | | | | Assessor Ratings: Project Discussion | 2.9 | 0.6 | .16*
 | | | | | Assessor Ratings: Team Presentation | 3.1 | 0.6 | .28*
 | | | | | CPI Score: Openness to Experience | 41.8 | 6.4 | .36*
Job incumbents across multiple industries (Harcourt Assessment, Inc., 2005) | 142 | Short | | | Supervisory Ratings: Analysis and Problem Solving | | | .33**
 | | | | | Supervisory Ratings: Judgment and Decision Making | | | .23**
 | | | | | Supervisory Ratings: Total Performance | | | .28**
 | | | | | Supervisory Ratings: Potential | | | .24**
Job applicants and incumbents across multiple industries (Harcourt Assessment, Inc., 2005) | 2,303 | Short | | | Org. Level | | | .33*
(continued)
Table 9.1 Studies Showing Evidence of Criterion-Related Validity (continued)

Group | N | Form | Watson-Glaser Mean | Watson-Glaser SD | Criterion Description | Criterion Mean | Criterion SD | r
Incumbent analysts from a government agency (Harcourt Assessment, Inc., 2005) | 64 | Short | | | Supervisory Ratings: Analysis and Problem Solving | | | .40**
 | | | | | Supervisory Ratings: Judgment and Decision Making | | | .40**
 | | | | | Supervisory Ratings: Professional/Technical Knowledge & Expertise | | | .37**
 | | | | | Supervisory Ratings: Total Performance | | | .39**
 | | | | | Supervisory Ratings: Potential | | | .25*
Freshmen classes in a Pennsylvania nursing program (Behrens, 1996) | 41 | 80-item | 50.5 | – | Semester 1 GPA | 2.5 | – | .59**
 | 31 | | 52.1 | – | Semester 1 GPA | 2.5 | – | .53**
 | 37 | | 52.1 | – | Semester 1 GPA | 2.4 | – | .51**
Education majors (Gadzella, Baloglu, & Stephens, 2002) | 114 | 80-item | 51.4 | 9.8 | GPA | 3.1 | .51 | .41**
Educational psychology students (Williams, 2003) | 158–164 | Short | – | – | Exam 1 Score | – | – | .42**
 | | | | | Exam 2 Score | – | – | .57**
Job applicants and incumbents across multiple industries (Harcourt Assessment, Inc., 2005) | | | | | Education Level | – | – | .57**
Education majors (Taube, 1995) | 147–194 | 80-item | 54.9 | 8.1 | GPA | 2.8 | .51 | .30*
 | | | | | Checklist of Educational Views | 50.1 | 7.6 | .33*
Educational psychology students (Gadzella, Stephens, & Stacks, 2004) | 139 | 80-item | – | – | Course Grades | – | – | .42**
 | | | | | GPA | – | – | .28**

* p < .05. ** p < .01.
Test users should not automatically assume that these data constitute sole and
sufficient justification for use of the Watson-Glaser. Inferring validity for one
group from data reported for another group is not appropriate unless the organizations and job categories being compared are demonstrably similar.
Careful examination of Table 9.1 can help test users make an informed judgment
about the appropriateness of the Watson-Glaser for their own organization.
However, the data presented here are not intended to serve as a substitute for
locally obtained data. Locally conducted validity studies, together with locally
derived norms, provide a sound basis for determining the most appropriate use
of the Watson-Glaser. Hence, whenever technically feasible, test users should
study the validity of the Watson-Glaser, or any selection test, at their own location.
Sometimes it is not possible for a test user to conduct a local validation study.
There may be too few incumbents in a particular job, an unbiased and reliable
measure of job performance may not be available, or there may not be a sufficient range in the ratings of job performance to justify the computation of
validity coefficients. In such circumstances, evidence of a test’s validity reported
elsewhere may be relevant, provided that the data refer to comparable jobs.
Evidence of Convergent and Discriminant Validity
Convergent evidence is provided when scores on a test relate to scores on other
tests or variables that purport to measure similar traits or constructs. Evidence of
relations with other variables can involve experimental (or quasi-experimental)
as well as correlational evidence (American Educational Research Association,
American Psychological Association, & National Council on Measurement in
Education, 1999). Discriminant evidence is provided when scores on a test do
not relate closely to scores on tests or variables that measure different traits
or constructs.
Previous Studies of Evidence of Convergent and
Discriminant Validity
Convergent validity evidence for the Watson-Glaser has been provided in a
variety of instructional settings. Given that the Watson-Glaser is a measure of
critical thinking ability, experience in programs aimed at developing this ability
should be reflected in changes in performance on the test. Sorenson (1966)
found that participants in laboratory-centered biology courses showed greater
change in Watson-Glaser scores than did members of classes where the teaching
method was predominately a traditional lecture approach. Similarly, Agne and
Blick (1972) found that Watson-Glaser performance was differentially affected
by teaching earth science through a data-centered, experimental approach as
compared with a traditional lecture approach. In both of these studies, critical
thinking was cited as a key curriculum element for the nontraditional
teaching methods.
In addition, studies have reported that Watson-Glaser scores may be influenced
by critical thinking and problem-solving courses (Arand & Harding, 1987; Herber,
1959; Pierce, Lemke, & Smith, 1988; Williams, 1971), debate training (Brembeck,
1949; Colbert, 1987; Follert & Colbert, 1983; Jackson, 1961), developmental
advising (Frost, 1989, 1991), group problem-solving (Gadzella, Hartsoe, & Harper,
1989; Garris, 1974; Goldman & Goldman, 1981; Neimark, 1984), reading and
speech courses (Brownell, 1953; Duckworth, 1968; Frank, 1969; Livingston, 1965;
Ness, 1967), exposure to college curriculum (Berger, 1984; Burns, 1974; Fogg &
Calia, 1967; Frederickson & Mayer, 1977; McMillan, 1987; Pardue, 1987) and
computer-related courses (Jones, 1988; Wood, 1980; Wood & Stewart, 1987).
Convergent validity evidence for the Watson-Glaser has also been provided in
studies that examined its relationship with other tests. For example, studies
reported in Watson and Glaser (1994) showed significant relationships between
Watson-Glaser scores and scores on the following tests: Otis-Lennon Mental Ability
Test (Forms J & K), the California Test of Mental Maturity, the Verbal IQ test of the
Wechsler Adult Intelligence Scale (WAIS), the Miller Analogies Test, Wesman Personnel
Test, Differential Aptitude Test (Abstract Reasoning), College Entrance Examination
Board, Scholastic Aptitude Tests (SAT), Stanford Achievement Tests, and the American
College Testing Program (ACT).
Discriminant validity evidence for the Watson-Glaser has been shown in
studies such as Robertson and Molloy (1982), which found non-significant
correlations between Watson-Glaser scores and measures of dissimilar constructs
such as Social Skills and Neuroticism. Similarly, Watson and Glaser (1994) found
that Watson-Glaser scores correlated more strongly with a test measuring reasoning
ability (i.e., Wesman Personnel Test) than a test measuring the conceptually
less closely related construct of visual-spatial ability (i.e., Employee Aptitude
Survey—Space Visualization).
Studies of the Relationship Between the Watson-Glaser and
General Intelligence
An area concerning both convergent and discriminant validity evidence is the
relationship of the Watson-Glaser to general intelligence. Although the Watson-Glaser has been found to correlate with general intelligence, its overlap as a
construct is not complete. Factor analyses of the Watson-Glaser Short Form tests
with other measures of intelligence generally indicate that the Watson-Glaser is
measuring a dimension of ability that is distinct from overall intellectual ability. Landis (1976), for example, performed a factor analysis of the Watson-Glaser
tests with measures drawn from the Guilford Structure of Intellect Model. The
Watson-Glaser tests were found to reflect a dimension of intellectual functioning that was independent of that tapped by the measures of the structure of
the intellect system. Follman, Miller, and Hernandez (1969) also found that the
Watson-Glaser tests were represented by high loadings on a single factor when
analyzed along with various achievement and ability tests. Ross (1977) reported
that the Watson-Glaser tests, Inference and Deduction, loaded on a verbally
based induction and a deduction factor, respectively.
Current Studies of Evidence of Convergent and Discriminant
Validity
Studies of instructional programs aimed at developing critical thinking abilities
continue to support the effectiveness of the Watson-Glaser as a measure of such
abilities. For example, increases in critical thinking have been found as a result of
scenario-based community health education (Sandor, Clark, Campbell, Rains, &
Cascio, 1998), medical school curriculum (Scott, Markert, & Dunn, 1998), critical thinking instruction (Gadzella, Ginther, & Bryant, 1996), teaching methods
designed to stimulate critical thinking (Williams, 2003), nursing program experience (Frye, Alfred, & Campbell, 1999; Pepa, Brown, & Alverson, 1997), and
communication skills education such as public speaking courses (Allen,
Berkowitz, Hunt, & Louden, 1999).
Recent studies have also shown that the Watson-Glaser relates to tests of similar
and dissimilar constructs in an expected manner. In 2005, Harcourt Assessment
conducted a study of 63 individuals employed in various roles and industries
and found that Watson-Glaser total scores correlated .70 with scores on the Miller
Analogies Test for Professional Selection. Rust (2002), in a study of 1,546 individuals
from over 50 different occupations in the United Kingdom, reported a correlation of .63 between the Watson-Glaser and a test of critical thinking with numbers, the Rust Advanced Numerical Reasoning Appraisal. In a study of Education
majors, Taube (1995) found that Watson-Glaser scores correlated .37 with
scores on an essay test designed to measure critical thinking (Ennis-Weir Critical
Thinking Essay Test), .43 with SAT-Verbal scores, and .39 with SAT-Math scores.
Regarding discriminant validity evidence, non-significant correlations have been
found between Watson-Glaser scores and tests measuring dissimilar constructs
such as the Big Five construct Emotional Stability (Spector, et al., 2000) and the
psychological type characteristic Thinking/Feeling (Yang & Lin, 2004).
Table 9.2 presents correlations between the Watson-Glaser and other tests.
Additional studies are reported in the previous version of the manual (1994). In
Table 9.2, a description of the study participants appears in the first column.
The second column lists the total number of participants (N), followed by the
Watson-Glaser form for which data were collected, the mean and standard deviation (SD) of Watson-Glaser scores, and the comparison test name. The mean and
standard deviation of scores on the comparison test are reported next, followed
by the correlation coefficient (r) indicating the relationship between scores on
the Watson-Glaser and the comparison test. High correlation coefficients indicate overlap between Watson-Glaser and comparison tests. Low correlations,
on the other hand, suggest that the tests measure different traits.
Table 9.2 Watson-Glaser Convergent Evidence of Validity

Group | N | Form | Watson-Glaser Mean | Watson-Glaser SD | Other Test Description | Other Test Mean | Other Test SD | r
Job incumbents across multiple industries (Harcourt Assessment, Inc., 2005) | 63 | Short | | | Miller Analogies Test for Professional Selection | | | .70**
Job incumbents from multiple occupations in UK (Rust, 2002) | 1,546 | CUK | – | – | Rust Advanced Numerical Reasoning Appraisal | | | .63**
Education majors (Taube, 1995) | 147–194 | 80-item | 54.9 | 8.1 | SAT-Verbal | 431.5 | 75.3 | .43**
 | | | | | SAT-Math | 495.5 | 91.5 | .39*
 | | | | | Ennis-Weir Critical Thinking Essay Test | 14.6 | 6.1 | .37*
Baccalaureate Nursing Students (Adams, Stover, & Whitlow, 1999) | 203 | 80-item | 54.0 | 9.3 | ACT Composite | 21.0 | – | .53**
Dispatchers at a Southern railroad company (Watson & Glaser, 1994) | 180 | Short | 24.9 | 5.0 | Industrial Reading Test | 29.9 | 4.4 | .53**
 | | | | | Test of Learning Ability | 73.7 | 11.4 | .50**
Lower-level management applicants (Watson & Glaser, 1994) | 219 | Short | 33.5 | 4.4 | Wesman, Verbal | 27.5 | 6.0 | .51**
 | 217 | | | | EAS, Verbal Comp. | 20.7 | 3.1 | .54**
 | 217 | | | | EAS, Verbal Reasoning | 16.7 | 4.6 | .48**
(continued)
Table 9.2 Watson-Glaser Convergent Evidence of Validity (continued)

Group | N | Form | Watson-Glaser Mean | Watson-Glaser SD | Other Test Description | Other Test Mean | Other Test SD | r
Mid-level management applicants (Watson & Glaser, 1994) | 209 | Short | 34.0 | 4.2 | Wesman, Verbal | 27.5 | 6.0 | .66**
 | 209 | | | | EAS, Verbal Comp. | 21.0 | 3.0 | .50**
 | 208 | | | | EAS, Verbal Reasoning | 16.6 | 4.9 | .51**
Executive management applicants (Watson & Glaser, 1994) | 440 | Short | 33.4 | 4.2 | Wesman, Verbal | 27.0 | 5.8 | .54**
 | 437 | | | | EAS, Verbal Comp. | 21.1 | 3.4 | .42**
 | 436 | | | | EAS, Verbal Reasoning | 16.2 | 4.2 | .47**

* p < .05. ** p < .01.
Chapter 10
Using the Watson-Glaser as an Employment Selection Tool
The Watson-Glaser is used to predict success in certain occupations and instructional programs that require critical thinking ability. The test is also used to
measure gains in critical thinking ability resulting from instructional and training programs, and to determine, for research purposes, the relationship between
critical thinking ability and other abilities or traits.
Employment Selection
Many organizations use testing as a component of their employment selection
process. Typical selection test programs make use of cognitive ability tests, aptitude tests, personality tests, basic skills tests, and work values tests, to name a
few. Tests are used to screen out unqualified candidates, to categorize prospective
employees according to their probability of success on the job, or to rank order a
group of candidates according to merit.
The Watson-Glaser is designed to assist in the selection of employees for jobs
that require careful, analytical reasoning. Many executive, administrative, and
technical professions require the type of critical thinking ability measured by
the Watson-Glaser. The test has been used to assess applicants for a wide variety
of jobs, including administrative and sales positions, and lower-to-upper level
management jobs in construction, production, marketing, healthcare, financial,
police, public sector organizations, teaching facilities, and religious institutions.
It should not be assumed that the type of critical thinking required in a particular job is identical to that measured by the Watson-Glaser. Job analysis and
validation of the Watson-Glaser for selection purposes should follow accepted
human resource research procedures, and conform to existing guidelines concerning fair employment practices. In addition, no single test score can possibly
suggest all of the requisite knowledge and skills necessary for success in a job.
It is the responsibility of the hiring authority to determine how it uses the
Watson-Glaser scores. It is recommended that if the hiring authority establishes
a cut score, examinees’ scores should be considered in the context of appropriate
measurement data for the test, such as the standard error of measurement and
data regarding the predictive validity of the test. In addition, it is recommended
that selection decisions be based on multiple job-relevant measures rather
than relying on any single measure (e.g., using only Watson-Glaser scores
to make decisions).
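To make the role of the standard error of measurement (SEM) concrete, the sketch below applies the standard psychometric formula SEM = SD × sqrt(1 − reliability) and checks whether a cut score falls inside the band around an observed score. The SD, reliability, and score values are hypothetical placeholders chosen for easy arithmetic, not values taken from this manual, and the function names are invented for illustration.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability), the standard classical-test-theory formula."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed, sem, z=1.0):
    """Band of +/- z SEMs around an observed score."""
    return (observed - z * sem, observed + z * sem)


def band_includes_cut(observed, cut, sem, z=1.0):
    """True when the cut score lies inside the band around the observed score."""
    low, high = score_band(observed, sem, z)
    return low <= cut <= high


# Hypothetical values, invented for illustration only
sem = standard_error_of_measurement(sd=5.0, reliability=0.75)  # 2.5
print(score_band(30, sem))                # (27.5, 32.5)
print(band_includes_cut(30, 32, sem))     # True: 30 is within one SEM of a cut of 32
```

When the cut score falls inside the band, the observed score cannot be cleanly distinguished from the cut score, which is one reason to weigh other job-relevant measures as well.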
Organizations using the Watson-Glaser are encouraged to examine the relationship between examinees’ scores and their subsequent performance on the
job. This locally obtained information will provide the best assistance in score
interpretation and will most effectively enable a user of the Watson-Glaser to
differentiate examinees who are likely to be successful from those who are not.
Harcourt Assessment, Inc. does not establish or recommend a passing score for
the Watson-Glaser.
Fairness in Selection Testing
Fair employment regulations and their interpretation are continuously subject
to changes in the legal, social, and political environments. It is therefore advised
that a user of the Watson-Glaser consult with qualified legal advisors and human
resources professionals as appropriate.
Legal Considerations
There are governmental and professional regulations that cover the use of all
personnel selection procedures. Relevant source documents that the user may
wish to consult include the Standards for Educational and Psychological Testing
(AERA et al., 1999); the Principles for the Validation and Use of Personnel Selection
Procedures (Society for Industrial and Organizational Psychology, 2003); and the
federal Uniform Guidelines on Employee Selection Procedures (Equal Employment
Opportunity Commission, 1978). For an overview of the statutes and types of
legal proceedings which influence an organization’s equal employment opportunity obligations, the user is referred to Cascio and Aguinis (2005) or the U.S.
Department of Labor’s (2000) Testing and Assessment: An Employer’s Guide to
Good Practices.
Group Differences/Adverse Impact
Local validation is particularly important when a selection test may have adverse
impact. According to the Uniform Guidelines on Employee Selection Procedures
(EEOC, 1978), adverse impact is normally indicated when the selection rate for
one group is less than 80% (or 4 out of 5) of that of another. Adverse impact is
likely to occur with cognitive ability tests such as the Watson-Glaser. While it
is not unlawful to use a test with adverse impact (EEOC, 1978), the testing organization must be prepared to demonstrate that the selection test is job-related
and consistent with business necessity. A local validation study, in which scores
on the Watson-Glaser are correlated with indicators of on-the-job performance,
will help provide evidence to support the use of the test in a particular job context. In addition, an evaluation that demonstrates that the Watson-Glaser is
equally predictive for protected subgroups, as outlined by the Equal Employment
Opportunity Commission, will assist in the demonstration of fairness of the test.
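The 80% (four-fifths) comparison described above can be sketched as a short calculation. The function names and applicant counts below are hypothetical and invented for illustration; the sketch only flags a ratio below 80% and is not a substitute for legal or statistical review.

```python
def selection_rate(selected, applicants):
    """Proportion of applicants in a group who were selected."""
    if applicants <= 0:
        raise ValueError("applicants must be positive")
    if not 0 <= selected <= applicants:
        raise ValueError("selected must be between 0 and applicants")
    return selected / applicants


def impact_ratio(rate_one, rate_other):
    """Ratio of the lower selection rate to the higher one."""
    higher = max(rate_one, rate_other)
    if higher == 0:
        raise ValueError("at least one group must have a nonzero selection rate")
    return min(rate_one, rate_other) / higher


def indicates_adverse_impact(rate_one, rate_other, threshold=0.8):
    """True when the lower rate is less than 80% of the higher rate."""
    return impact_ratio(rate_one, rate_other) < threshold


# Hypothetical applicant counts, invented for illustration only
rate_group_a = selection_rate(48, 80)   # 0.60
rate_group_b = selection_rate(12, 40)   # 0.30
flagged = indicates_adverse_impact(rate_group_a, rate_group_b)
print(flagged)
```

In this made-up case the second group's rate is half of the first group's, well under the 80% threshold, so the comparison would be flagged for closer examination.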
Monitoring the Selection System
The ability to evaluate selection strategies and to implement fair employment
practices depends on an organization’s awareness of the demographic characteristics
of applicants and incumbents. Monitoring these characteristics and accumulating test score data are clearly necessary for establishing legal defensibility of a
selection system, including those systems that incorporate the Watson-Glaser.
The most effective use of the Watson-Glaser will be achieved where a local norms
database is established and continuously monitored for unfair consequences.
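A local norms database can be summarized as percentile ranks for each raw score. The sketch below is one possible approach under stated assumptions: it uses the midpoint convention (percent scoring below, plus half of those with the same score), which may differ from the convention used to build the tables in this manual, and it limits results to the 1 to 99 range shown in the Appendix A tables. The function names and scores are hypothetical.

```python
def percentile_rank(score, norm_scores):
    """Midpoint percentile rank of a score within a local norm group (assumed convention)."""
    if not norm_scores:
        raise ValueError("norm group is empty")
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


def norms_table(norm_scores, max_score=40):
    """Percentile rank for every raw score from max_score down to 0, limited to 1..99."""
    table = {}
    for raw in range(max_score, -1, -1):
        rank = round(percentile_rank(raw, norm_scores))
        table[raw] = min(99, max(1, rank))
    return table


# Hypothetical local scores, invented for illustration only
local_scores = [20, 25, 25, 30, 35]
local_table = norms_table(local_scores)
print(local_table[30])
```

In practice a local norm group would need far more than five examinees before its percentile ranks could be interpreted with any confidence.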
Research
The Watson-Glaser provides a reliable measure of critical thinking ability
and has been included in a variety of research studies on critical thinking and
related topics. Several studies are summarized in this manual. The citations are
listed in the References section. Other research studies are listed in the Research
Bibliography section.
Appendix A
Description of the Normative Sample and
Percentile Ranks
Table A.1 Description of the Normative Sample by Industry
Industry
Norms and Sample Characteristics
Advertising/Marketing/
Public Relations
N = 101
Various occupations within
advertising, marketing, and
public relations industries.
Mean = 28.7
SD = 6.1
Occupational Characteristics
12.9% Hourly/Entry-Level
4.3% Supervisor
34.4% Manager
16.1% Director
17.2% Executive
15.1% Professional/Individual Contributor
Education
N = 119
Various occupations within the
education industry.
Mean = 30.2
SD = 5.4
Occupational Characteristics
11.5% Hourly/Entry-Level
1.0% Supervisor
21.2% Manager
23.1% Director
26.0% Executive
17.3% Professional/Individual Contributor
Financial Services/
Banking/Insurance
N = 228
Various occupations within
financial services, banking, and
insurance industries.
Mean = 31.2
SD = 5.7
Occupational Characteristics
10.4% Hourly/Entry-Level
8.1% Supervisor
19.4% Manager
7.6% Director
23.2% Executive
31.3% Professional/Individual Contributor
Government/Public
Service/Defense
N = 130
Various occupations within
government, public service, and
defense agencies.
Mean = 30.0
SD = 6.3
Occupational Characteristics
19.8% Hourly/Entry-Level
10.3% Supervisor
20.7% Manager
7.8% Director
2.6% Executive
38.8% Professional/Individual Contributor
Health Care
N = 195
Various occupations within the
health care industry.
Mean = 28.3
SD = 6.5
Occupational Characteristics
19.1% Hourly/Entry-Level
9.5% Supervisor
18.5% Manager
15.5% Director
11.9% Executive
25.6% Professional/Individual Contributor
Information Technology/
Telecommunications
N = 295
Various occupations within
information technology and
telecommunications industries.
Mean = 31.2
SD = 5.5
Occupational Characteristics
8.1% Hourly/Entry-Level
4.0% Supervisor
22.8% Manager
8.1% Director
12.5% Executive
44.5% Professional/Individual Contributor
Manufacturing/Production
N = 561
Various occupations within
manufacturing and production
industries.
Mean = 32.0
SD = 5.3
Occupational Characteristics
7.6% Hourly/Entry-Level
10.2% Supervisor
36.8% Manager
15.9% Director
9.1% Executive
20.6% Professional/Individual Contributor
Professional Business Services
N = 153
Various occupations within the
professional business services
industry (e.g., consulting, legal)
Mean = 31.9
SD = 5.6
Occupational Characteristics
8.9% Hourly/Entry-Level
1.5% Supervisor
23.7% Manager
12.6% Director
16.3% Executive
37.0% Professional/Individual Contributor
Retail/Wholesale
N = 307
Various occupations within retail
and wholesale industries.
Mean = 30.8
SD = 5.3
Occupational Characteristics
8.1% Hourly/Entry-Level
3.9% Supervisor
45.3% Manager
10.5% Director
22.1% Executive
10.2% Professional/Individual Contributor
Table A.2 Description of the Normative Sample by Occupation
Occupation
Norms and Sample Characteristics
Accountant/Auditor/
Bookkeeper
N = 118
Accountant, auditor, and
bookkeeper positions within
various industries.
Mean = 30.2
SD = 5.8
Industry Characteristics
2.4% Advertising/Marketing/Public Relations
8.2% Education
24.7% Financial Services/Banking/Insurance
5.9% Government/Public Service/Defense
8.2% Health Care
8.2% Information Technology/High-Tech/Telecommunications
29.4% Manufacturing/Production
8.2% Professional Business Services
4.7% Retail/Wholesale
Consultant
N = 139
Consultant positions within
various industries.
Mean = 33.3
SD = 4.8
Industry Characteristics
6.3% Advertising/Marketing/Public Relations
2.7% Education
3.6% Financial Services/Banking/Insurance
1.8% Government/Public Service/Defense
0.9% Health Care
20.5% Information Technology/High-Tech/Telecommunications
6.3% Manufacturing/Production
50.9% Professional Business Services
7.1% Retail/Wholesale
Engineer
N = 225
Engineer positions within various industries.
Mean = 32.8
SD = 4.8
Industry Characteristics
0.0% Advertising/Marketing/Public Relations
1.3% Education
0.6% Financial Services/Banking/Insurance
9.0% Government/Public Service/Defense
0.6% Health Care
14.7% Information Technology/High-Tech/Telecommunications
71.8% Manufacturing/Production
1.9% Professional Business Services
0.0% Retail/Wholesale
Human Resource Professional
N = 140
Human resource professional
positions within various
industries.
Mean = 30.0
SD = 5.7
Industry Characteristics
0.9% Advertising/Marketing/Public Relations
7.5% Education
10.3% Financial Services/Banking/Insurance
9.4% Government/Public Service/Defense
13.1% Health Care
4.7% Information Technology/High-Tech/Telecommunications
35.5% Manufacturing/Production
7.5% Professional Business Services
11.2% Retail/Wholesale
Information Technology
Professional
N = 222
Information technology positions
within various industries.
Mean = 31.4
SD = 5.9
Industry Characteristics
1.0% Advertising/Marketing/Public Relations
2.0% Education
9.4% Financial Services/Banking/Insurance
4.0% Government/Public Service/Defense
5.5% Health Care
62.9% Information Technology/High-Tech/Telecommunications
8.4% Manufacturing/Production
3.0% Professional Business Services
4.0% Retail/Wholesale
Sales Representative—
Non-Retail
N = 353
Sales representative positions (non-retail) within
various industries.
Mean = 29.8
SD = 5.1
Industry Characteristics
11.6% Advertising/Marketing/Public Relations
4.2% Education
7.9% Financial Services/Banking/Insurance
1.1% Government/Public Service/Defense
11.1% Health Care
13.2% Information Technology/High-Tech/Telecommunications
26.5% Manufacturing/Production
8.5% Professional Business Services
15.9% Retail/Wholesale
Table A.3 Description of the Normative Sample by Position Type/Level
Position Type/Level
Norms and Characteristics
Executive
N = 409
Executive-level positions (e.g., CEO, CFO, VP) within
various industries.
Mean = 33.4
SD = 4.5
Industry Characteristics
5.7% Advertising/Marketing/Public Relations
9.6% Education
17.4% Financial Services/Banking/Insurance
1.1% Government/Public Service/Defense
7.1% Health Care
12.1% Information Technology/High-Tech/Telecommunications
17.0% Manufacturing/Production
7.8% Professional Business Services
22.3% Retail/Wholesale
Director
N = 387
Director-level positions within
various industries.
Mean = 32.9
SD = 4.7
Industry Characteristics
6.2% Advertising/Marketing/Public Relations
9.9% Education
6.6% Financial Services/Banking/Insurance
3.7% Government/Public Service/Defense
10.7% Health Care
9.1% Information Technology/High-Tech/Telecommunications
34.6% Manufacturing/Production
7.0% Professional Business Services
12.4% Retail/Wholesale
Manager
N = 973
Manager-level positions within
various industries.
Mean = 30.7
SD = 5.4
Industry Characteristics
5.6% Advertising/Marketing/Public Relations
3.9% Education
7.2% Financial Services/Banking/Insurance
4.2% Government/Public Service/Defense
5.5% Health Care
10.9% Information Technology/High-Tech/Telecommunications
34.3% Manufacturing/Production
5.6% Professional Business Services
22.7% Retail/Wholesale
Supervisor
N = 202
Supervisor-level positions within various industries.
Mean = 28.8
SD = 6.2
Industry Characteristics
3.1% Advertising/Marketing/Public Relations
0.8% Education
13.3% Financial Services/Banking/Insurance
9.4% Government/Public Service/Defense
12.5% Health Care
8.6% Information Technology/High-Tech/Telecommunications
42.2% Manufacturing/Production
1.6% Professional Business Services
8.6% Retail/Wholesale
Professional/Individual
Contributor
N = 842
Professional-level positions and
individual contributor positions
within various industries.
Mean = 30.6
SD = 5.6
Industry Characteristics
2.8% Advertising/Marketing/Public Relations
3.6% Education
13.3% Financial Services/Banking/Insurance
9.1% Government/Public Service/Defense
8.7% Health Care
24.4% Information Technology/High-Tech/Telecommunications
22.0% Manufacturing/Production
10.1% Professional Business Services
5.9% Retail/Wholesale
Hourly/Entry-Level
N = 332
Hourly and entry-level positions
within various industries.
Mean = 27.7
SD = 5.9
Industry Characteristics
6.1% Advertising/Marketing/Public Relations
6.1% Education
11.1% Financial Services/Banking/Insurance
11.6% Government/Public Service/Defense
16.2% Health Care
11.1% Information Technology/High-Tech/Telecommunications
20.2% Manufacturing/Production
6.1% Professional Business Services
11.6% Retail/Wholesale
Position Type/Occupation
Within Specific Industry Norms
Manager in Manufacturing/
Production
N = 170
Managers in manufacturing and
production industries.
Mean = 31.9
SD = 5.3
Engineer in Manufacturing/
Production
N = 112
Engineers in manufacturing and
production industries.
Mean = 32.9
SD = 4.7
Table A.4 Percentile Ranks of Total Raw Scores for Industry Groups
Raw             Advertising/  Education     Financial     Government/   Health Care
Score           Marketing/                  Services/     Public
                Public                      Banking/      Service/
                Relations                   Insurance     Defense
40              99            99            99            99            99
39              99            99            97            98            99
38              98            98            95            94            98
37              95            93            88            88            95
36              90            87            80            82            91
35              82            84            73            75            84
34              77            76            65            68            78
33              73            70            58            62            74
32              69            62            53            58            69
31              64            54            47            56            66
30              58            46            41            53            58
29              54            40            36            48            53
28              48            36            31            42            50
27              39            30            24            36            45
26              34            27            20            31            39
25              31            21            16            28            36
24              26            16            14            22            31
23              21            13            11            19            26
22              19            8             8             13            22
21              14            8             7             12            15
20              11            6             6             8             12
19              8             4             5             4             9
18              7             3             4             3             7
17              3             2             2             3             6
16              3             2             1             2             5
15              2             1             1             2             3
14              1             1             1             2             3
13              1             1             1             1             2
12              1             1             1             1             1
11              1             1             1             1             1
10              1             1             1             1             1
9               1             1             1             1             1
8               1             1             1             1             1
7               1             1             1             1             1
6               1             1             1             1             1
5               1             1             1             1             1
4               1             1             1             1             1
3               1             1             1             1             1
2               1             1             1             1             1
1               1             1             1             1             1
0               1             1             1             1             1
Raw Score Mean  28.7          30.2          31.2          30.0          28.3
Raw Score SD    6.1           5.4           5.7           6.3           6.5
N               101           119           228           130           195
Table A.4 Percentile Ranks of Total Raw Scores for Industry Groups (continued)
Raw             Information         Manufacturing/  Professional  Retail/
Score           Technology/         Production      Business      Wholesale
                Telecommunications                  Services
40              99                  99              99            99
39              99                  99              99            99
38              94                  94              95            96
37              89                  86              88            91
36              82                  77              76            85
35              74                  68              67            79
34              64                  60              56            73
33              57                  51              52            65
32              50                  45              46            58
31              45                  41              37            51
30              41                  36              33            42
29              36                  31              29            34
28              30                  24              26            29
27              25                  20              20            25
26              22                  17              16            21
25              18                  13              13            17
24              15                  11              12            14
23              12                  9               9             10
22              8                   6               8             10
21              6                   4               6             7
20              3                   3               5             5
19              2                   2               3             4
18              1                   2               3             2
17              1                   1               2             1
16              1                   1               2             1
15              1                   1               1             1
14              1                   1               1             1
13              1                   1               1             1
12              1                   1               1             1
11              1                   1               1             1
10              1                   1               1             1
9               1                   1               1             1
8               1                   1               1             1
7               1                   1               1             1
6               1                   1               1             1
5               1                   1               1             1
4               1                   1               1             1
3               1                   1               1             1
2               1                   1               1             1
1               1                   1               1             1
0               1                   1               1             1
Raw Score Mean  31.2                32.0            31.9          30.8
Raw Score SD    5.5                 5.3             5.6           5.3
N               295                 561             153           307
Table A.5 Percentile Ranks of Total Raw Scores for Occupations
Raw             Accountant/   Consultant    Engineer      Human         Information   Sales
Score           Auditor/                                  Resource      Technology    Representative,
                Bookkeeper                                Professional  Professional  Non-Retail
40              99            99            99            99            99            99
39              98            99            99            99            98            99
38              95            91            92            99            93            97
37              91            86            83            94            88            95
36              88            71            72            86            78            91
35              81            63            63            79            71            86
34              71            49            55            73            61            80
33              63            40            49            68            52            74
32              57            33            42            61            47            66
31              53            24            35            55            42            61
30              46            20            30            49            40            53
29              42            18            26            44            35            47
28              41            16            21            39            29            41
27              32            14            15            32            24            31
26              26            13            12            27            20            25
25              22            9             9             25            17            21
24              19            8             7             21            15            17
23              12            6             5             17            11            13
22              11            5             3             14            9             10
21              9             4             2             10            9             7
20              8             3             1             6             7             4
19              6             1             1             2             5             3
18              4             1             1             1             3             2
17              3             1             1             1             2             1
16              3             1             1             1             1             1
15              2             1             1             1             1             1
14              2             1             1             1             1             1
13              1             1             1             1             1             1
12              1             1             1             1             1             1
11              1             1             1             1             1             1
10              1             1             1             1             1             1
9               1             1             1             1             1             1
8               1             1             1             1             1             1
7               1             1             1             1             1             1
6               1             1             1             1             1             1
5               1             1             1             1             1             1
4               1             1             1             1             1             1
3               1             1             1             1             1             1
2               1             1             1             1             1             1
1               1             1             1             1             1             1
0               1             1             1             1             1             1
Raw Score Mean  30.2          33.3          32.8          30.0          31.4          29.8
Raw Score SD    5.8           4.8           4.8           5.7           5.9           5.1
N               118           139           225           140           222           353
Table A.6 Percentile Ranks of Total Raw Scores for Position Type/Level
Raw             Executive     Director      Manager       Supervisor    Professional/ Hourly/
Score                                                                   Individual    Entry-Level
                                                                        Contributor
40              99            99            99            99            99            99
39              97            98            99            99            99            99
38              92            94            97            97            95            97
37              83            87            92            92            90            96
36              72            76            85            88            84            94
35              63            63            78            82            78            91
34              53            53            72            76            70            88
33              42            43            65            72            64            80
32              37            38            57            68            57            76
31              28            32            50            63            51            72
30              23            27            44            59            45            66
29              16            23            37            55            40            59
28              13            19            32            48            35            52
27              11            15            27            41            29            45
26              10            11            22            38            25            41
25              7             8             18            33            20            36
24              6             8             14            28            16            32
23              3             4             11            23            13            29
22              3             3             8             19            10            23
21              2             2             7             15            7             18
20              2             2             5             12            5             13
19              1             2             3             6             3             8
18              1             1             2             3             2             6
17              1             1             1             2             2             5
16              1             1             1             1             1             2
15              1             1             1             1             1             2
14              1             1             1             1             1             1
13              1             1             1             1             1             1
12              1             1             1             1             1             1
11              1             1             1             1             1             1
10              1             1             1             1             1             1
9               1             1             1             1             1             1
8               1             1             1             1             1             1
7               1             1             1             1             1             1
6               1             1             1             1             1             1
5               1             1             1             1             1             1
4               1             1             1             1             1             1
3               1             1             1             1             1             1
2               1             1             1             1             1             1
1               1             1             1             1             1             1
0               1             1             1             1             1             1
Raw Score Mean  33.4          32.9          30.7          28.8          30.6          27.7
Raw Score SD    4.5           4.7           5.4           6.2           5.6           5.9
N               409           387           973           202           842           332
Table A.7 Percentile Ranks of Total Raw Scores for Position Type/Occupation Within Industry
Raw             Manager in                Engineer in
Score           Manufacturing/Production  Manufacturing/Production
40              99                        99
39              99                        99
38              95                        91
37              89                        82
36              81                        70
35              72                        63
34              65                        54
33              54                        48
32              45                        41
31              39                        37
30              35                        31
29              26                        28
28              22                        22
27              18                        15
26              14                        13
25              13                        8
24              11                        4
23              8                         3
22              5                         3
21              5                         2
20              5                         1
19              4                         1
18              4                         1
17              3                         1
16              1                         1
15              1                         1
14              1                         1
13              1                         1
12              1                         1
11              1                         1
10              1                         1
9               1                         1
8               1                         1
7               1                         1
6               1                         1
5               1                         1
4               1                         1
3               1                         1
2               1                         1
1               1                         1
0               1                         1
Raw Score Mean  31.9                      32.9
Raw Score SD    5.3                       4.7
N               170                       112
Appendix B
Final Item Statistics for the Watson-Glaser–
Short Form Three-Parameter IRT Model
Table B.1 Final Item Statistics for the Watson-Glaser Short Form Three-Parameter IRT Model
(reprinted from Watson & Glaser, 1994)
Form A  Short Form  One-Parameter  Discrimination  Difficulty  Guessing  Single-Factor
Item    Item        Rasch-Model    (a)             (b)         (c)       ML Solution
                    Difficulty                                           Factor
                    Index (b)                                            Loading
1       1           1.09           0.70            0.36        0.43      0.21
2       2           1.04           0.76            0.18        0.36      0.27
3       3           1.01           0.63            –0.39       0.28      0.31
5       4           1.03           0.87            0.33        0.31      0.31
11      5           1.05           0.61            –0.09       0.38      0.25
12      6           1.04           0.56            –0.36       0.29      0.29
14      7           1.04           0.84            0.55        0.27      0.29
20      8           1.05           0.57            –0.61       0.31      0.29
21      9           0.92           1.22            –1.19       0.40      0.40
22      10          0.99           0.66            –1.21       0.31      0.32
26      11          0.95           1.02            –0.04       0.28      0.39
27      12          1.02           0.43            –3.31       0.32      0.19
28      13          1.02           0.53            –2.16       0.37      0.24
31      14          1.03           0.54            –1.97       0.36      0.24
32      15          1.03           0.55            –0.48       0.32      0.27
33      16          0.96           1.51            –0.33       0.50      0.35
34      17          0.98           0.76            –1.09       0.39      0.33
36      18          0.92           1.06            0.21        0.23      0.39
37      19          1.10           0.56            0.62        0.50      0.20
38      20          1.00           0.71            –0.75       0.36      0.33
39      21          0.99           0.60            –1.56       0.33      0.29
40      22          1.02           0.51            –1.68       0.31      0.27
41      23          0.94           0.92            –0.31       0.25      0.42
42      24          0.95           0.72            –3.10       0.28      0.28
52      25          1.00           1.05            0.19        0.35      0.34
53      26          1.01           0.60            –0.89       0.34      0.29
54      27          0.87           1.34            –0.48       0.29      0.49
55      28          0.96           0.75            –0.68       0.21      0.41
57      29          0.92           0.92            –0.61       0.26      0.43
62      30          0.92           0.87            –1.93       0.22      0.41
64      31          0.94           0.77            –2.97       0.25      0.32
68      32          0.96           0.58            –2.90       0.25      0.28
69      33          0.90           0.98            –1.86       0.23      0.43
71      34          1.02           0.47            –1.43       0.23      0.30
72      35          1.05           0.46            –1.40       0.28      0.26
74      36          0.92           0.78            –1.87       0.21      0.40
75      37          0.99           1.13            0.01        0.38      0.37
76      38          0.95           0.96            –0.70       0.42      0.37
77      39          1.07           0.40            –1.37       0.31      0.23
80      40          1.02           0.42            –2.49       0.26      0.26
Note. Though item statistics were generated using both a one-parameter Rasch model and a three-parameter IRT model, item selection
decisions were based on the Rasch model for the following reasons:
1. the c parameter is held constant, as guessing is seldom truly random (i.e., even with true/false items, some partially incorrect
knowledge is used by the examinee to select the response);
2. one-to-one correspondence with the raw score scale (i.e., “number correct” scoring) is possible; and regardless of the
discrimination metric (Classical Test Theory or Item Response Theory), the estimates were similar across items—thus, little
was gained in using a more complicated two- or three-parameter estimate of discrimination.
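For readers less familiar with the three-parameter model, the sketch below evaluates the standard three-parameter logistic item response function using the a, b, and c values listed for Short Form item 1 in Table B.1. The function name is hypothetical, and the scaling constant is an assumption: this manual does not state whether the estimates use a logistic scale (scale = 1.0, as assumed here) or a normal-ogive approximation (scale = 1.7).

```python
import math


def three_pl_probability(theta, a, b, c, scale=1.0):
    """Probability of a correct response under the three-parameter logistic model.

    theta: examinee ability; a: discrimination; b: difficulty; c: guessing.
    scale is an assumed scaling constant (1.0 here; 1.7 is also common).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


# Parameters as listed for Short Form item 1 in Table B.1
item_1 = {"a": 0.70, "b": 0.36, "c": 0.43}

# At theta equal to the item difficulty, the probability is halfway
# between the guessing parameter and 1: c + (1 - c) / 2
p_at_b = three_pl_probability(0.36, **item_1)
print(round(p_at_b, 3))

# With a = 1 and c = 0 the same function reduces to the one-parameter (Rasch) form
p_rasch = three_pl_probability(0.0, a=1.0, b=0.0, c=0.0)
```

Fixing a and c for every item, as in the last line, is what allows the Rasch model to keep a one-to-one correspondence with number-correct scoring.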
References
Adams, M. H., Stover, L. M., & Whitlow, J. F. (1999). A
longitudinal evaluation of baccalaureate nursing
students’ critical thinking abilities. Journal of Nursing
Education, 38, 139–141.
Agne, R. & Blick, D. (1972). A comparison of earth
science classes taught by using original data in a
research-approach technique versus classes taught by
conventional approaches not using such data. Journal
of Research in Science Teaching, 9, 83–89.
Aiken, L. R. (1979). Psychological testing and assessment,
third edition. Boston: Allyn & Bacon.
Allen, M., Berkowitz, S., Hunt, S., & Louden, A. (1999).
A meta-analysis of the impact of forensics and
communication education on critical thinking.
Communication Education, 48, 18–30.
American Educational Research Association, American
Psychological Association, & National Council
on Measurement in Education (1999). Standards
for educational and psychological testing. Washington,
DC: Author.
Americans With Disabilities Act of 1990, Titles I &
V (Pub. L. 101-336). United States Code, Volume 42,
Sections 12101-12213.
Anastasi, A. & Urbina, S. (1997). Psychological testing
(7th ed.). Upper Saddle River, N.J.: Prentice Hall.
Arand, J. U. & Harding, C. G. (1987). An investigation
into problem solving in education; A problem-solving curricular framework. Journal of Allied Health,
16, 7–17.
Bauwens, E. E. & Gerhard, G. G. (1987). The use of the
Watson-Glaser Critical Thinking Appraisal to predict
success in a baccalaureate nursing program. Journal of
Nursing Education, 26, 278–281.
Behrens, P. J. (1996). The Watson-Glaser Critical
Thinking Appraisal and academic performance of
diploma school students. Journal of Nursing Education,
35, 34–36.
Berger, M. C. (1984). Clinical thinking ability and
nursing students. Journal of Nursing Education,
23, 306–308.
Brembeck, W. L. (1949). The effects of a course in
argumentation on critical thinking ability. Speech
Monographs, 16, 177–189.
Brownell, J. A. (1953). The influence of training in
reading in the social studies on the ability to think
critically. California Journal of Educational Research,
4, 25–31.
Burns, R. L. (1974). The testing of a model of critical thinking ontogeny among Central Connecticut
State College undergraduates. (Doctoral dissertation,
University of Connecticut). Dissertation Abstracts
International, 34, 5467A.
Cascio, W. F. (1982). Applied psychometrics in
personnel management, second edition. Reston, VA:
Reston Publishing.
Cascio, W. F., & Aguinis, H. (2005). Applied psychology
in human resource management (6th ed.). Upper Saddle
River, NJ: Prentice Hall.
Civil Rights Act of 1991 (Pub. L. 102-166). United States
Code, Volume 42, Sections 101-402.
Cohen, B. H. (1996). Explaining psychological statistics.
Pacific Grove, CA: Brooks & Cole.
Cohen, J. (1988). Statistical power analysis for the
behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence
Erlbaum Associates.
Colbert, K. R. (1987). The effects of CEDA and NDT
debate training on critical thinking ability. Journal of
the American Forensic Association, 23, 194–201.
Cronbach, L. J. (1970). Essentials of psychological testing,
third edition. New York: Harper & Row.
Duchesne, R. E., Jr. (1996). Critical thinking, developmental learning, and adaptive flexibility in
organizational leaders (Doctoral dissertation,
University of Connecticut). Dissertation Abstracts
International, 57, 2121.
Duckworth, J. B. (1968). The effect of instruction in
general semantics on the critical thinking of tenth
and eleventh grade students. (Doctoral dissertation,
Wayne State University). Dissertation Abstracts,
29, 4180A.
Equal Employment Opportunity Commission. (1978).
Uniform guidelines on employee selection procedures. Federal Register, 43(166), 38295–38309.
Fogg, C. & Calia, V. (1967). The comparative influence of two testing techniques on achievement in
science and critical thinking ability. The Journal of
Experimental Education, 35, 1–14.
Follert, V. F. & Colbert, K. R. (1983, November). An
analysis of the research concerning debate training and
critical thinking. Paper presented at the 69th Annual
Meeting of the Speech Communication Association,
Washington, DC.
Follman, J., Miller, W., & Hernandez, D. (1969). Factor
analysis of achievement, scholastic aptitude, and
critical thinking subtests. The Journal of Experimental
Education, 38, 48–53.
Frank, A. D. (1969). Teaching high school speech to
improve critical thinking ability. Speech Teacher,
18, 296–302.
Frederickson, K. & Mayer, G. G. (1977). Problem-solving skills: What effect does education have? American
Journal of Nursing, 77, 1167–1169.
Frost, S. (1989, October). Academic advising and cognitive development: Is there a link? Paper presented at the
13th Annual Conference of the National Academic
Advising Association, Houston.
Frost, S. H. (1991). Fostering critical thinking of
college women through academic advising and faculty contact. Journal of College Student Development,
32, 359–366.
Frye, B., Alfred, N., & Campbell, M. (1999). Use of
the Watson-Glaser Critical Thinking Appraisal with
BSN students. Nursing and Health Care Perspectives,
20(5), 253–255.
Fulton, R. D. (1989). Critical thinking in adulthood. (Eric
Document Reproduction Service No. ED 320015).
Gadzella, B. M., Baloglu, M., & Stephens, R. (2002).
Prediction of GPA with educational psychology
grades and critical thinking scores. Education,
122(3), 618–623.
Gadzella, B. M., Ginther, D. W., & Bryant, G. W. (1996,
August). Teaching and learning critical thinking skills.
Paper presented at the XXVI International Congress
of Psychology, Montreal, Quebec.
Gadzella, B. M., Hartsoe, K., & Harper, J. (1989). Critical
thinking and mental ability groups. Psychological
Reports, 65, 1019–1026.
Gadzella, B. M., & Penland, E. (1995). Is creativity
related to scores on critical thinking? Psychological
Reports, 77, 817–818.
Garris, C. W. (1974). A study comparing the improvement of students’ critical thinking ability achieved
through the teacher’s increased use of classroom
questions resulting from individualized or
group training programs. (Doctoral dissertation,
Pennsylvania State University). Dissertation
Abstracts International, 35, 7123A.
Gaston, A. (1993). Recognizing potential law enforcement
executives. (Reports No. NCJ 131646) Washington,
D.C.: National Institute of Justice/NCJRS.
Gibson, J. W., Kibler, R. J., & Barker, L. L. (1968). Some
relationships between selected creativity and critical
thinking measures. Psychological Reports, 23, 707–714.
Glaser, E. M. (1937). An experiment in the development of
critical thinking. Contributions to Education, No. 843.
New York: Bureau of Publications, Teachers College,
Columbia University.
Goldman, F. W. & Goldman, M. (1981). The effects of
dyadic group experience in subsequent individual
performance. Journal of Social Psychology, 115, 83–88.
Gross, Y. T., Takazawa, E. S., & Rose, C. L. (1987).
Critical thinking and nursing education. Journal of
Nursing Education, 26, 317–323.
Gurfein, H. (1977). Critical thinking in parents
and their adolescent children. Dissertation
Abstracts, 174A.
Harris, A. J. & Jacobson, M. D. (1982). Basic reading
vocabularies. New York: MacMillan.
Herber, H. L. (1959). An inquiry into the effect of
instruction in critical thinking upon students in
grades ten, eleven, and twelve. (Doctoral dissertation,
Boston University). Dissertation Abstracts, 20, 2174.
Holmgren, B. & Covin, T. (1984). Selective
characteristics of preservice professionals.
Education, 104, 321–328.
Jackson, T. R. (1961). The effects of intercollegiate
debating on critical thinking ability. (Doctoral dissertation, University of Wisconsin). Dissertation
Abstracts, 21, 3556.
Jaeger, R. M. & Freijo, T. D. (1975). Race and sex as
concomitants of composite halo in teachers’ evaluative rating of pupils. Journal of Educational Psychology,
67, 226–237.
Jones, P. K. (1988). The effect of computer programming instruction on the development of generalized
problem solving skills in high school students.
Unpublished doctoral dissertation, Nova University.
Jones, S. H. & Cook, S. W. (1975). The influence of
attitude on judgments of the effectiveness of alternative social policies. Journal of Personality and Social
Psychology, 32, 767–773.
Kudish, J. D. & Hoffman, B. J. (2002, October).
Examining the relationship between assessment center
final dimension ratings and external measures of cognitive ability and personality. Paper presented at the
30th International Congress on Assessment Center
Methods, Pittsburgh, PA.
Landis, R. E. (1976). The psychological dimensions of
three measures of critical thinking and twenty-four
structure-of-intellect tests for a sample of ninth-grade students. (Doctoral dissertation, University
of Southern California.) Dissertation Abstracts
International, 37, 5705A.
Pardue, S. F. (1987). Decision-making skills and critical
thinking ability among associate degree, diploma,
baccalaureate, and master’s prepared nurses. Journal
of Nursing Education, 26, 354–361.
Pepa, C. A., Brown, J. M., & Alverson, E. M. (1997).
A comparison of critical thinking abilities between
accelerated and traditional baccalaureate nursing students. Journal of Nursing Education, 36, 46–48.
Pierce, W., Lemke, E., & Smith, R. (1988). Critical thinking and moral development in secondary students.
High School Journal, 71, 120–126.
Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28, 563–575.
Rasch, G. (1980). Probabilistic models for some
intelligence and attainment tests. Chicago:
University of Chicago Press.
Livingston, H. (1965). An investigation of the effect
of instruction in general semantics on critical reading
ability. California Journal of Educational Research,
16, 93–96.
Robertson, I. T. & Molloy, K. J. (1982). Cognitive complexity, neuroticism, and research ability. British
Journal of Educational Psychology, 52, 113–118.
Loo, R. & Thorpe, K. (1999). A psychometric investigation of scores on the Watson-Glaser Critical Thinking
Appraisal New Form S. Educational and Psychological
Measurement, 59(6), 995–1003.
Loo, R. & Thorpe, K. (2005). Relationships between
critical thinking and attitudes toward women’s roles
in society. The Journal of Psychology, 139(1), 47–55.
Ross, G. R. (1977, April). A factor-analytic study of inductive reasoning tests. Paper presented at the 61st Annual
Meeting of the American Educational Research
Association, New York.
Rust, J. (2002). Rust advanced numerical
reasoning appraisal manual. London: The
Psychological Corporation.
McMillan, J. H. (1987). Enhancing college students’
critical thinking: A review of studies. Research in
Higher Education, 26, 3–19.
Sandor, M. K., Clark, M., Campbell, D., Rains, A. P., &
Cascio, R. (1998). Evaluating critical thinking skills in
a scenario-based community health course. Journal of
Community Health Nursing, 15(1), 21–29.
Mead, A. D. & Drasgow, F. (1993). Equivalence of
computerized and paper-and-pencil cognitive
ability tests: A meta-analysis. Psychological Bulletin,
114(3), 449–458.
Scott, J. N. & Markert, R. J. (1994). Relationship
between critical thinking skills and success in preclinical courses. Academic Medicine, 69(11), 920–924.
Mitchell, H. E. & Byrne, D. (1973). The defendant’s
dilemma: Effects of juror’s attitudes and authoritarianism on judicial decision. Journal of Personality and
Social Psychology, 25, 123–129.
Morse, H. & McCune, G. (1957). Selected items for
the testing of study skills and critical thinking (3rd
Edition). Washington, DC: National Council for
the Social Studies.
Neimark, E. D. (1984, August). A cognitive style change
approach to the modification of thinking in college
students. Paper presented at the Conference on
Thinking, Cambridge, MA. (ERIC Document
Reproduction Service No. ED 261301).
Ness, J. H. (1967). The effects of a beginning speech
course on critical thinking ability. (Doctoral dissertation, University of Minnesota). Dissertation Abstracts,
28, 5171A.
Offner, A. N. (2000). Tacit knowledge and group facilitator behavior (Doctoral dissertation, Saint Louis
University, 2000). Dissertation Abstracts International,
60, 4283.
Scott, J. N., Markert, R. J., & Dunn, M. M. (1998).
Critical thinking: change during medical school and
relationship to performance in clinical clerkships.
Medical Education, 32(1), 14–18.
Sherif, C., Sherif, M., & Nebergall, R. (1965). Attitude
and attitude change. Philadelphia: W. B. Saunders.
Shin, K. R. (1998). Critical thinking ability and clinical
decision-making skills among senior nursing students
in associate and baccalaureate programmes in Korea.
Journal of Advanced Nursing, 27(2), 414–418.
Simon, A. & Ward, L. (1974). The performance on
the Watson-Glaser Critical Thinking Appraisal of
university students classified according to sex, type
of course pursued, and personality score category.
Educational and Psychological Measurement,
34, 957–960.
Society for Industrial and Organizational Psychology.
(2003). Principles for the validation and use of personnel
selection procedures (4th ed.). Bowling Green,
OH: Author.
Sorenson, L. (1966). Watson-Glaser Critical Thinking
Appraisal: Changes in critical thinking associated
with two methods of teaching high school biology.
Test Data Report No. 51. New York: Harcourt
Brace & World.
Spector, P. A., Schneider, J. R., Vance, C. A., & Hezlett,
S. A. (2000). The relation of cognitive ability and
personality traits to assessment center performance.
Journal of Applied Social Psychology, 30(7), 1474–1491.
Taube, K. T. (1995, April). Critical thinking ability and
disposition as factors of performance on a written critical
thinking test. Paper presented at the Annual Meeting
of the American Educational Research Association,
San Francisco, CA.
Taylor, S. E., Frankenpohl, H., White, C. E., Nieroroda,
B. W., Browning, C. L., & Birsner, E. P. (1989). EDL
core vocabularies in reading, mathematics, science, and
social studies. Columbia, SC: EDL.
Thouless, R. H. (1949). Review of Watson-Glaser Critical
Thinking Appraisal. In O. K. Buros (Ed.), The third
mental measurements yearbook. Lincoln: University of
Nebraska Press.
U.S. Department of Labor. (1999). Testing and
assessment: An employer’s guide to good practices.
Washington, DC: Author.
Watson, G. B. (1925). The measurement of fair-mindedness. Contributions to Education, No. 176. New
York: Bureau of Publications, Teachers College,
Columbia University.
Watson, G., & Glaser, E. M. (1994). Watson-Glaser
Critical Thinking Appraisal, Form S manual. San
Antonio, TX: The Psychological Corporation.
Williams, B. R. (1971). Critical thinking ability as
affected by a unit on symbolic logic. (Doctoral dissertation, Arizona State University). Dissertation
Abstracts International, 31, 6434A.
Williams, R. L. (2003). Critical thinking as a predictor
and outcome measure in a large undergraduate educational psychology course. (Report No. TM-035-016).
Knoxville, TN: University of Tennessee. (ERIC
Document Reproduction Service No. ED478075)
Wood, L. E. (1980). An “intelligent” program to teach
logical thinking skills. Behavior Research Methods and
Instrumentation, 12, 256–258.
Wood, L. E. & Stewart, P. W. (1987). Improvement of
practical reasoning skills with a computer game.
Journal of Computer-Based Instruction, 14, 49–53.
Yang, S. C. & Lin, W. C. (2004). The relationship among
creative, critical thinking and thinking styles in
Taiwan high school students. Journal of Instructional
Psychology, 31(1), 33–45.
Research Bibliography
Alexakos, C. E. (1966). Predictive efficiency of two multivariate statistical techniques in comparison with
clinical predictions. Journal of Educational Psychology,
57, 297–306.
Alspaugh, C. A. (1970). A study of the relationships
between student characteristics and proficiency in
symbolic and algebraic computer programming.
(Doctoral dissertation, University of Missouri).
Dissertation Abstracts International, 31, 4627B.
Bergman, L. M. E. (1960). A study of the relationship
between selected language variables in extemporaneous speech and critical thinking ability. (Doctoral
dissertation, University of Minnesota). Dissertation
Abstracts, 21, 3552.
Bessent, E. W. (1961). The predictability of selected
elementary school principals’ administrative behavior. (Doctoral dissertation, The University of Texas at
Austin). Dissertation Abstracts, 22, 3479.
Alston, D. N. (1972). An investigation of the critical
reading ability of classroom teachers in relation to
selected background factors. Educational Leadership,
29, 341–343.
Betres, J. J. A. (1971). A study in the development of
the critical thinking skills of preservice elementary
teachers. (Doctoral dissertation, Ohio University).
Dissertation Abstracts International, 32, 2520A.
Annis, L. F. & Annis, D. B. (1979). The impact of
philosophy on students’ critical thinking ability.
Contemporary Educational Psychology, 4, 219–226.
Bishop, T. (1971). Predicting potential: Selection for
science-based industries. Personnel Management,
3, 31–33.
Armstrong, N. A. (1970). The effect of two instructional inquiry strategies on critical thinking and
achievement in eighth grade social studies. (Doctoral
dissertation, Indiana University). Dissertation
Abstracts International, 31, 1611A.
Bitner, B. (1991). Formal operational reasoning modes:
Predictors of critical thinking abilities and grades
assigned by teachers in science and mathematics for
students in grades nine through twelve. Journal of
Research in Science Teaching, 28, 265–274.
Awomolo, A. A. (1973). Teacher discussion leadership
behavior in a public issues curriculum and some
cognitive and personality correlates. (Doctoral dissertation, University of Toronto). Dissertation Abstracts
International, 35, 316A.
Bitner, C. & Bitner, B. (1988, April). Logical and critical
thinking abilities of sixth through twelfth grade students
and formal reasoning modes as predictors of critical
thinking abilities and academic achievement. Paper presented at the 61st Annual Meeting of the National
Association for Research in Science Teaching, Lake of
the Ozarks, MO.
Bass, J. C. (1959). An analysis of critical thinking in a
college general zoology class. (Doctoral dissertation,
University of Oklahoma). Dissertation Abstracts,
20, 963.
Beckman, V. E. (1956). An investigation and analysis
of the contributions to critical thinking made by
courses in argumentation and discussion in
selected colleges. (Doctoral dissertation, University
of Minnesota). Dissertation Abstracts, 16, 2551.
Berger, A. (1985). Review of test: Watson-Glaser Critical
Thinking Appraisal. In J. V. Mitchell (Ed.), The ninth
mental measurements yearbook. Lincoln: University of
Nebraska Press.
Blai, B. (1989). Critical thinking: The flagship of thinking
skills? (ERIC Document Reproduction Service No.
ED 311752).
Bledsoe, J. C. (1955). A comparative study of values and
critical thinking skills of a group of educational workers. The Journal of Educational Psychology, 46, 408–417.
Bostrom, E. A. (1969). The effect of class size on critical
thinking skills. (Doctoral dissertation, Arizona State
University). Dissertation Abstracts, 29, 2032A.
Brabeck, M. M. (1981, August). The relationship between
critical thinking skills and the development of reflective
judgment among adolescent and adult women. Paper
presented at the 89th Annual Convention of the
American Psychological Association, Los Angeles.
Brabeck, M. M. (1983). Critical thinking skills and
reflective judgment development: Redefining
the aims of higher education. Journal of Applied
Developmental Psychology, 4, 23–24.
Bradberry, R. D. (1968). Relationships among critical
thinking ability, personality attributes, and attitudes
of students in a teacher education program. (Doctoral
dissertation, North Texas State University, Denton).
Dissertation Abstracts, 29, 163A.
Brakken, E. (1965). Intellectual factors in PSSC and conventional high school physics. Journal of Research in
Science Teaching, 3, 19–25.
Braun, J. R. (1969). Search for correlates of self-actualization. Perceptual and Motor Skills, 28, 557–558.
Broadhurst, N. A. (1969). A measure of some learning outcomes in matriculation chemistry in South
Australia. Australian Science Teaching Journal,
15, 67–70.
Broadhurst, N. A. (1969). A study of selected teacher
factors in learning outcomes in chemistry in secondary schools in South Australia. (Doctoral dissertation,
Oregon State University). Dissertation Abstracts
International, 30, 485A.
Broadhurst, N. A. (1970). An item analysis of the
Watson-Glaser Critical Thinking Appraisal (Form
Ym). Science Education, 54, 127–132.
Brouillette, O. J. (1968). An interdisciplinary comparison of critical thinking objective among science and
non-science majors in higher education. (Doctoral
dissertation, University of Southern Mississippi).
Dissertation Abstracts, 29, 2877A.
Brubaker, H. L. (1972). Selection of college major by
the variables of intelligence, creativity, and critical
thinking. (Doctoral dissertation, Temple University).
Dissertation Abstracts International, 33, 1507A.
Bunt, D. D. (1974). Prediction of academic achievement and critical thinking of eighth graders in
suburban, urban, and private schools through
specific personality, ability, and school variables.
(Doctoral dissertation, Northern Illinois University).
Dissertation Abstracts International, 35, 2042A.
Burns, R. L. (1974). The testing of a model of critical thinking ontogeny among Central Connecticut
State College undergraduates. (Doctoral dissertation,
University of Connecticut). Dissertation Abstracts
International, 24, 5467A.
Bybee, J. R. (1972). Prediction in the college of education doctoral program at the Ohio State University.
(Doctoral dissertation, Ohio State University).
Dissertation Abstracts International, 33, 4111A.
Campbell, I. C. (1976). The effects of selected facets of
critical reading instruction upon active duty servicemen and civilian evening college adults. (Doctoral
dissertation, University of Georgia). Dissertation
Abstracts International, 37, 2591A.
Cannon, A. G. (1974). The development and testing of a policy-capturing model for the selection of
school administrators in a large urban school district.
(Doctoral dissertation, The University of Texas at
Austin). Dissertation Abstracts International, 35, 2565A.
Canter, R. R. Jr. (1951). A human relations training program. Journal of Applied Psychology, 35, 38–45.
Carleton, F. O. (1970). Relationships between follow-up
evaluations and information developed in a management assessment center. Proceedings of the 78th Annual
Convention of the American Psychological Association,
5, 565–566.
Carlson, D. A. (1975). Training in formal reasoning
abilities provided by the inquiry role approach and
achievement on the Piagetian formal operational
level. (Doctoral dissertation, University of Northern
California). Dissertation Abstracts International,
36, 7368A.
Carnes, D. D. (1969). A study of the critical thinking
ability of secondary summer school mathematics
students. (Doctoral dissertation, University of
Mississippi). Dissertation Abstracts International,
30, 2242A.
Cass, M. & Evans, E. D. (1992). Special education
teachers and critical thinking. Reading Improvement,
29, 228–230.
Chang, E. C. G. (1969). Norms and correlates of
the Watson-Glaser Critical Thinking Appraisal
and selected variables for Negro college students.
(Doctoral dissertation, University of Oklahoma).
Dissertation Abstracts International, 30, 1860A.
Coble, C. R. & Hounshell, P. B. (1972). Teacher
self-actualization and student progress. Science
Education, 26, 311–316.
Combs, C. M. Jr. (1968). An experiment with independent study in science education. (Doctoral
dissertation, University of Mississippi). Dissertation
Abstracts, 29, 3489A.
Cook, J. (1955). Validity Information Exchange, No. 813: D.O.T. Code 0-17-01, Electrical Engineer. Personnel
Psychology, 8, 261–262.
Cooke, M. M. (1976). A study of the interaction of
student and program variables for the purpose
of developing a model for predicting graduation
from graduate programs in educational administration at the State University of New York at Buffalo.
(Doctoral dissertation, State University of New York).
Dissertation Abstracts International, 37, 827A.
Corell, J. H. (1968). Comparison of two methods of
counseling with academically deteriorated university upperclassmen. (Doctoral dissertation, Indiana
University). Dissertation Abstracts, 29, 1419A.
Corlett, D. (1974). Library skills, study habits and
attitudes, and sex as related to academic achievement. Educational and Psychological Measurement,
34, 967–969.
Cornett, E. (1977). A study of the aptitude-treatment
interactions among nursing students regarding programmed modules and personological variables.
(Doctoral dissertation, The University of Texas at
Austin). Dissertation Abstracts International, 38, 3119B.
Cousins, J. E. (1962). The development of reflective thinking in an eighth grade social studies
class. (Doctoral dissertation, Indiana University).
Dissertation Abstracts, 24, 195.
Coyle, F. A. Jr. & Bernard, J. L. (1965). Logical thinking
and paranoid schizophrenia. Journal of Psychology,
60, 283–289.
Crane, W. J. (1962). Screening devices for occupational
therapy majors. American Journal of Occupational
Therapy, 16, 131–132.
Crawford, C. D. (1956). Critical thinking and personal
values in the listening situation: An exploratory
investigation into the relationships of three theoretical variables in human communication as indicated
by the relation between measurements on the
Allport-Vernon-Lindzey Study of Values and the
Watson-Glaser Critical Thinking Appraisal, and similar measurements of responses to a recorded radio
news commentary. (Doctoral dissertation, New York
University). Dissertation Abstracts, 19, 1845.
Crites, J. O. (1972). Review of test: Watson-Glaser
Critical Thinking Appraisal. In O. K. Buros (Ed.),
The seventh mental measurements yearbook. Lincoln:
University of Nebraska Press.
Crosson, R. F. (1968). An investigation into certain
personality variables among capital trial jurors.
Proceedings of the 76th Annual Convention of the
American Psychological Association, 3, 371–372.
Cruce-Mast, A. L. (1975). The interrelationship of
critical thinking, empathy, and social interest with
moral judgment. (Doctoral dissertation, Southern
Illinois University). Dissertation Abstracts International,
36, 7945A.
Cyphert, F. R. (1961). The value structures and critical
thinking abilities of secondary-school principals. The
Bulletin of the National Association of Secondary-School
Principals, 45, 43–47.
D’Aoust, T. (1963). Predictive validity of four psychometric tests in a selected school of nursing. Unpublished
master’s thesis, Catholic University of America,
Washington, DC.
Davis, W. N. (1971). Authoritarianism and selected
trait patterns of school administrators: Seventeen
case studies. (Doctoral dissertation, North Texas
State University, Denton). Dissertation Abstracts
International, 32, 1777A.
De Loach, S. S. (1976). Level of ego development,
degree of psychopathology, and continuation or termination of outpatient psychotherapy involvement.
(Doctoral dissertation, Georgia State University).
Dissertation Abstracts International, 37, 5348B.
De Martino, H. A. (1970). The relations between certain
motivational variables and attitudes about mental
illness in student psychiatric nurses. (Doctoral dissertation, St. John’s University). Dissertation Abstracts
International, 31, 3036A.
Denney, L. L. (1968). The relationships between teaching method, critical thinking, and other selected
teacher traits. (Doctoral dissertation, University of
Missouri). Dissertation Abstracts, 29, 2586A.
Dirr, P. M. (1966). Intellectual variables in achievement
in modern algebra. (Doctoral dissertation, Catholic
University of America). Dissertation Abstracts,
27, 2873A.
Dispenzieri, A., Giniger, S., Reichman, W., & Levy, M.
(1971). College performance of disadvantaged students as a function of ability and personality. Journal
of Counseling Psychology, 18, 298–305.
Dowling, R. E. (1990, February). Reflective judgment in
debate: Or, the end of “critical thinking” as the goal of
educational debate. Paper presented at the Annual
Meeting of the Western Speech Communication
Association, Sacramento.
Dressel, P. & Mayhew, L. (1954). General education: Exploration in evaluation. Final Report of
the Cooperative Study of Evaluation in General
Education. Washington, DC: American Council
on Education.
Eisenstadt, J. W. (1986). Remembering Goodwin
Watson. Journal of Social Issues, 42, 49–52.
Embretson, S. E. & Reise, S. P. (2000). Item response
theory for psychologists. Mahwah, NJ: Lawrence
Erlbaum Associates.
Ennis, R. H. (1958). An appraisal of the Watson-Glaser
Critical Thinking Appraisal. Journal of Educational
Research, 52, 155–158.
Fisher, A. (2001). Critical thinking: An introduction. New
York: Cambridge University Press.
Flora, L. D. (1966). Predicting academic success at
Lynchburg College from multiple correlational
analysis of four selected predictor variables. (Doctoral
dissertation, University of Virginia). Dissertation
Abstracts, 27, 2276A.
Follman, J. (1969). A factor-analysis of three critical thinking tests, one logical reasoning test, and
one English test. Yearbook of the National Reading
Conference, 18, 154–160.
Follman, J. (1970). Correlational and factor analysis
of critical thinking, logical reasoning, and English
total test scores. Florida Journal of Educational Research,
12, 91–94.
Follman, J., Brown, L., & Burg, E. (1970). Factor analysis of critical thinking, logical reasoning, and English
subtest. Journal of Experimental Education, 38, 11–16.
Follman, J., Hernandez, D., & Miller, W. (1969).
Canonical correlation of scholastic aptitude and
critical thinking. Psychology, 6, 3–6.
Follman, J., Miller, W., & Burg, E. (1971). Statistical
analysis of three critical thinking tests. Psychological
Measurement, 31, 519–520.
Foster, P. J. (1981). Clinical discussion groups: Verbal
participation and outcomes. Journal of Medical
Education, 56, 831–838.
Frank, A. D. (1967). An experimental study in improving the critical thinking ability of high school
students enrolled in a beginning speech course.
(Doctoral dissertation, University of Wisconsin).
Dissertation Abstracts, 28, 5168A.
Friend, C. M. & Zubek, J. P. (1958). The effects of age
on critical thinking ability. Journal of Gerontology,
13, 407–413.
Gable, R. K., Roberts, A. D., & Owens, S. V. (1977).
Affective and cognitive correlates of classroom
achievement. Educational and Psychological
Measurement, 37, 977–986.
Geckler, J. W. (1965). Critical thinking, dogmatism,
social status, and religious affiliation of tenth-grade
students. (Doctoral dissertation, University
of Tennessee). Dissertation Abstracts, 26, 886.
Geisinger, K. F. (1998). Review of Watson-Glaser Critical
Thinking Appraisal. In Impara, J. C. & Plake, B. S.
(Eds.), The thirteenth mental measurements yearbook.
Lincoln, NE: Buros Institute of Mental Measurements.
George, K. I. & Smith, M. C. (1990). An empirical
comparison of self-assessment and organizational
assessment in personnel selection. Public Personnel
Management, 19, 175–190.
George, K. D. (1968). The effect of critical thinking ability upon course grades in biology. Science Education,
52, 421–426.
Glidden, G. W. (1964). Factors that influence achievement in senior high school American History.
(Doctoral dissertation, University of Nebraska).
Dissertation Abstracts, 25, 3429.
Grace, J. L., Jr. (1968). Critical thinking ability of students in Catholic and public high schools. National
Catholic Education Association Briefs, 65, 49–57.
Grasz, C. S. (1977). A study to determine the validity
of test scores and other selected factors as predictors
of success in a basic course in educational administration. (Doctoral dissertation, Rutgers - The State
University of New Jersey). Dissertation Abstracts
International, 37, 7436A.
Gunning, C. S. (1981). Relationships among field
independence, critical thinking ability, and clinical problem-solving ability of baccalaureate nursing
students. (Doctoral dissertation, The University of
Texas at Austin). Dissertation Abstracts International,
42, 2780.
Guster, D. & Batt, R. (1989). Cognitive and affective
variables and their relationships to performance
in a Lotus 1-2-3 class. Collegiate Microcomputer,
7, 151–156.
Haas, M. G. (1963). A comparative study of critical
thinking, flexibility of thinking, and reading ability
involving religious and lay college seniors. (Doctoral
dissertation, Fordham University). Dissertation
Abstracts, 24, 622.
Hall, W. C., Jr. & Myers, C. B. (1977). The effect of a
training program in the Taba teaching strategies on
teaching methods and teacher perceptions of their
teaching. Peabody Journal of Education, 54, 162–167.
Hardesty, D. L. & Jones, W. S. (1968). Characteristics
of judged high potential management personnel—
The operations of an industrial assessment center.
Personnel Psychology, 21, 85–98.
Hatano, G. & Kuhara, K. (1980, April). The recognition
of inferences from a story among high and low critical
thinkers. Paper presented at the 64th Annual
Meeting of the American Educational Research
Association, Boston.
Helm, C. R. (1967). Watson-Glaser-DAT graduate
norms. Unpublished master’s thesis, University of
Toledo, OH.
Helmstadter, G. C. (1972). Review of Watson-Glaser
Critical Thinking Appraisal. In O. K. Buros (Ed.),
The seventh mental measurements yearbook. Lincoln:
University of Nebraska Press.
Henkel, E. T. (1968). Undergraduate physics instruction and critical thinking ability. Journal of Research in
Science Teaching, 5, 89–94.
Hicks, R. E. & Southey, G. N. (1990). The Watson-Glaser
Critical Thinking Appraisal and the performance of
business management students. Psychological Test
Bulletin, 3, 74–81.
Hildebrant, S. & Lucas, J. A. (1980). Follow-up students
who majored and are majoring in legal technology.
Research report, William Raney Harper College,
Palatine, IL.
Hill, O. W., Pettus, W. C., & Hedin, B. A. (1990). Three
studies of factors affecting the attitudes of Blacks and
females toward the pursuit of science and science-related careers. Journal of Research in Science Teaching,
27, 289–314.
Hill, W. H. (1959). Review of Watson-Glaser Critical
Thinking Appraisal. In O. K. Buros (Ed.), The fifth
mental measurements yearbook. Lincoln: University of
Nebraska Press.
Hillis, S. R. (1975). The relationship of inquiry orientation in secondary physical science classrooms and
student’s critical thinking skills, attitudes, and views
of science. (Doctoral dissertation, The University of
Texas at Austin). Dissertation Abstracts International,
36, 805A.
Himaya, M. I. (1972). Identification of possible variables
for predicting student changes in physical science
courses designed for non-science majors. (Doctoral
dissertation, University of Iowa). Dissertation Abstracts
International, 34, 67A.
Hudson, V. C., Jr. (1972). A study of the relationship
between the social studies student teacher’s divergent thinking ability and his success in promoting
divergent thinking in class discussion. (Doctoral
dissertation, University of Arkansas). Dissertation
Abstracts International, 33, 2219A.
Hughes, T. M., et al. (1987, November). The prediction of
teacher burnout through personality type, critical thinking, and self-concept. Paper presented at the Annual
Meeting of the Mid-South Educational Research
Association, Mobile, AL.
Hunt, D. & Randhawa, B. S. (1973). Relationship
between and among cognitive variables and achievement in computational science. Educational and
Psychological Measurement, 33, 921–928.
Hunt, E. J. (1967). The critical thinking ability of
teachers and its relationship to the teacher’s classroom verbal behavior and perceptions of teaching
purposes. (Doctoral dissertation, University of
Maryland). Dissertation Abstracts, 28, 4511A.
Hunter, N. W. (1968). A study of the factors which
may affect a student’s success in quantitative analysis. (Doctoral dissertation, University of Toledo).
Dissertation Abstracts, 29, 2437A.
Hinojosa, T. R., Jr. (1974). The influence of idiographic
variables on leadership style: A study of special education administrators (Plan A) in Texas. (Doctoral
dissertation, The University of Texas at Austin).
Dissertation Abstracts International, 35, 2082A.
Hurov, J. T. (1987). A study of the relationship
between reading, computational, and critical thinking skills and academic success in fundamentals of
chemistry. (ERIC Document Reproduction Service
No. ED 286569).
Hjelmhaug, N. N. (1971). Context instruction and
the ability of college students to transfer learning. (Doctoral dissertation, Indiana University).
Dissertation Abstracts International, 32, 1356A.
Ivens, S. H. (1998). Review of Watson-Glaser Critical
Thinking Appraisal. In Impara, J. C. & Plake, B. S.
(Eds.), The thirteenth mental measurements yearbook.
Lincoln, NE: Buros Institute of Mental Measurements.
Holdampf, B. A. (1983). Innovative associate degree nursing program—remote area: A comprehensive final report
on exemplary and innovative proposal. Department
of Occupational Education and Technology, Texas
Education Agency, Austin, TX.
Jabs, M. L. (1969). An experimental study of the
comparative effects of initiating structure and consideration leadership on the educational growth of
college students. (Doctoral dissertation, University
of Connecticut). Dissertation Abstracts International,
30, 2762A.
Hollenbach, J. W. & De Graff, C. (1957). Teaching for
thinking. Journal of Higher Education, 28, 126–130.
Hoogstraten, J. & Christiaans, H. H. C. M. (1975). The
relationship of the Watson-Glaser Critical Thinking
Appraisal to sex and four selected personality measures for a sample of Dutch first-year psychology
students. Educational and Psychological Measurement,
35, 969–973.
Houle, C. (1943). Evaluation in the eight-year study.
Curriculum Journal, 14, 18–21.
Hovland, C. I. (1959). Review of Watson-Glaser Critical
Thinking Appraisal. In O. K. Buros (Ed.), The fifth
mental measurements yearbook. Lincoln: University of
Nebraska Press.
James, R. J. (1971). Traits associated with the initial and
persistent interest in the study of college science.
(Doctoral dissertation, State University of New York).
Dissertation Abstracts International, 32, 1296A.
Jenkins, A. C. (1966). The relationship of certain measurable factors to academic success in freshman
biology. (Doctoral dissertation, New York University).
Dissertation Abstracts, 27, 2279A.
Jurgenson, E. M. (1958). The relationship between success in teaching vocational agriculture and ability
to make sound judgments as measured by selected
instruments. (Doctoral dissertation, Pennsylvania
State University). Dissertation Abstracts, 19, 96.
Kenoyer, M. F. (1961). The influence of religious life
on three levels of perceptual processes. (Doctoral dissertation, Fordham University). Dissertation Abstracts,
22, 909.
Ketefian, S. (1981). Critical thinking, educational preparation, and development of moral judgment among
selected groups of practicing nurses. Nursing Research,
30, 98–103.
Kintgen-Andrews, J. (1988). Development of critical
thinking: Career ladder P.N. and A.D. nursing students,
pre-health science freshmen, generic baccalaureate sophomore nursing students. (ERIC Document Reproduction
Service No. ED 297153).
Kirtley, D. & Harkless, R. (1970). Student political activity in relation to personal and social adjustment.
Journal of Psychology, 75, 253–256.
Klassen, P. T. (1984). Changes in personal orientation and critical thinking among adults returning to
school through weekend college: An alternative evaluation. Innovative Higher Education, 8, 55–67.
Kleg, M. (1987). General social studies knowledge and
critical thinking among pre-service elementary teachers. International Journal of Social Education, 1, 50–63.
Kooker, E. W. (1971). The relationship between performance in a graduate course in statistics and the
Miller Analogies Tests and the Watson-Glaser
Critical Thinking Appraisal. The Journal of Psychology,
77, 165–169.
Krockover, G. H. (1965). The development of critical
thinking through science instruction. Proceedings of
the Iowa Academy of Sciences, 72, 402–404.
La Forest, J. R. (1970). Relation of critical thinking to
program planning. (Doctoral dissertation, North
Carolina State University). Dissertation Abstracts
International, 32, 1253A.
Land, M. (1963). Psychological tests as predictors for
scholastic achievement of dental students. Journal of
Dental Education, 27, 25–30.
Landis, R. F. & Michael, W. B. (1981). The factorial
validity of three measures of critical thinking within
the context of Guilford’s structure-of-intellect model
for a sample of ninth grade students. Educational &
Psychological Measurement, 41, 1147–1166.
Larter, S. J. & Taylor, P. A. (1969). A study of aspects
of critical thinking. Manitoba Journal of Education,
5, 35–53.
Leadbeater, B. J. & Dionne, J. P. (1981). The adolescent’s
use of formal operational thinking solving problems
related to identity resolution. Adolescence, 16, 111–121.
Lewis, D. R. & Dahl, T. (1971). The Test of
Understanding in College Economics and its
construct validity. Journal of Economics Education,
2, 155–166.
Little, T. L. (1972). The relationship of critical thinking
ability to intelligence, personality factors, and academic achievement. (Doctoral dissertation, Memphis
State University). Dissertation Abstracts International,
33, 5554A.
Litwin, J. L. & Haas, P. F. (1983). Critical thinking: An
intensive approach. Journal of Learning Skills, 2, 43–47.
Lowe, A. J., Follman, J., Burley, W., & Follman, J. (1971).
Psychometric analysis of critical reading and critical
thinking test scores – twelfth grade. Yearbook of the
National Reading Conference, 20, 142–174.
Lucas, A. M. & Broadhurst, N. A. (1972). Changes in
some content-free skills, knowledge, and attitudes
during two terms of Grade 12 biology instruction
in ten South Australian schools. Australian Science
Teaching Journal, 18, 66–74.
Luck, J. I. & Gruner, C. R. (1970). Note on authoritarianism and critical thinking ability. Psychological
Reports, 27, 380.
Luton, J. N. (1955). A study of the use of standardized
tests in the selection of potential educational administrators. Unpublished doctoral dissertation, University of
Tennessee, Memphis.
Lysaught, J. P. (1963). An analysis of factors related
to success in constructing programmed learning
sequences. Journal of Programmed Instruction, 2, 35–42.
Lysaught, J. P. (1964). An analysis of factors related
to success in constructing programmed learning
sequences. (Doctoral dissertation, University of
Rochester, NY). Dissertation Abstracts, 25, 1749.
Lysaught, J. P. (1964). Further analysis of success among
auto-instructional programmers. Teaching Aid News,
4, 6–11.
Lysaught, J. P. (1964). Selecting instructional programmers: new research into characteristics of successful
programmers. Training Directors Journal, 18, 8–14.
Lysaught, J. P. & Pierleoni, R. G. (1964). A comparison
of predicted and actual success in auto-instructional
programming. Journal of Programmed Instruction,
3, 14–23.
Lysaught, J. P. & Pierleoni, R. G. (1970). Predicting individual success in programming self-instructional
materials. Audio-Visual Community Research, 18, 5–24.
Marrs, L. W. (1971). The relationship of critical thinking ability and dogmatism to changing regular class
teachers’ attitudes toward exceptional children.
(Doctoral dissertation, The University of Texas at
Austin). Dissertation Abstracts International, 33, 638A.
Mathias, R. O. (1973). Assessment of the development
of critical thinking skills and instruction in grade
eight social studies in Mt. Lebanon school district.
(Doctoral dissertation, University of Pittsburgh).
Dissertation Abstracts International, 34, 1064A.
McCammon, S. L., Golden, J., & Wuensch, K. L. (1988).
Predicting course performance in freshman and sophomore physics courses: Women are more predictable
than men. Journal of Research in Science Teaching,
25, 501–510.
McCloudy, C. W. (1974). An experimental study of
critical thinking skills as affected by intensity and
types of sound. (Doctoral dissertation, East Texas
State University, Commerce). Dissertation Abstracts
International, 35, 4086A.
McCutcheon, L. E., Apperson, J. M., Hanson, E.,
& Wynn, V. (1992). Relationships among critical
thinking skills, academic achievement, and misconceptions about psychology. Psychological Reports,
71, 635–639.
McMurray, M. A., Beisenherz, P., & Thompson, B.
(1991). Reliability and concurrent validity of a measure of critical thinking skills in biology. Journal of
Research in Science Teaching, 28, 183–191.
Miller, D. A., Sadler, J. Z., & Mohl, P. C. (1993). Critical
thinking in preclinical course examinations. Academic
Medicine, 68, 303–305.
Miller, W., Follman, J., & Hernandez, D. E. (1970).
Discriminate analysis of school children in integrated
and non-integrated schools using tests of critical
thinking. Florida Journal of Educational Research,
12, 63–68.
Milton, O. (1960). Primitive thinking and reasoning
among college students. Journal of Higher Education,
31, 218–220.
Moskovis, L. M. (1967). An identification of certain
similarities and differences between successful and
unsuccessful college level beginning shorthand
students and transcription student. (Doctoral dissertation, Michigan State University). Dissertation
Abstracts, 28, 4826A.
Moskovis, L. M. (1970). Similarities and differences of
college-level successful and unsuccessful shorthand
students. Delta Pi Epsilon Journal, 12, 12–16.
Murphy, A. J. (1973). The relationship of leadership potential to selected admission criteria for the
advanced programs in educational administration.
(Doctoral dissertation, State University of New York).
Dissertation Abstracts International, 34, 1545A.
Nixon, J. T. (1973). The relationship of openness to
academic performance, critical thinking, and school
morale in two school settings. (Doctoral dissertation,
George Peabody College for Teachers). Dissertation
Abstracts International, 34, 3999A.
Norris, C. A., Jackson, L., & Poirot, J. L. (1992). The
effect of computer science instruction on critical thinking skills and mental alertness. Journal of
Research on Computing in Education, 24, 329–337.
Norris, S. P. (1986). Evaluating critical thinking ability.
History and Social Science Teacher, 21, 135–146.
Norris, S. P. (1988). Controlling for background beliefs
when developing multiple-choice critical thinking
tests. Educational Measurement, 7, 5–11.
Modjeski, R. B. & Michael, W. B. (1983). An evaluation by a panel of psychologists of the reliability and
validity of two tests of critical thinking. Educational
and Psychological Measurement, 43, 1187–1197.
Norton, S., et al. (1985, November). The effects of an
independent laboratory investigation on the critical thinking ability and scientific attitude of students in a general
microbiology class. Paper presented at the 14th Annual
Meeting of the Mid-South Research Association,
Biloxi, MS.
Moffett, C. R. (1954). Operational characteristics of beginning master’s students in educational administration
and supervision. Unpublished doctoral dissertation,
University of Tennessee, Memphis.
Nunnery, M. Y. (1959). How useful are standardized
psychological tests in the selection of school administrators? Educational Administration and Support,
45, 349–356.
Molidor, J., Elstein, A., & King, L. (1978). Assessment
of problem solving skills as a screen for medical school
admissions. Michigan State University, Report No.
TM 800383. East Lansing: National Fund for Medical
Education. (ERIC Document Reproduction Service
No. ED 190595).
Obst, F. (1963). A study of abilities of women students entering the Colleges of Letters and Science
and Applied Arts at the University of California, Los
Angeles. Journal of Educational Research, 57, 54–86.
Moore, M. L. (1976). Effects of value clarification on
dogmatism, critical thinking, and self-actualization.
(Doctoral dissertation, Arizona State University).
Dissertation Abstracts International, 37, 907A.
Moore, M. R. (1973). An investigation of the relationships among teacher behavior, creativity, and critical
thinking ability. (Doctoral dissertation, University
of Missouri). Dissertation Abstracts International,
35, 10270A.
O’Neill, M. R. (1973). A study of critical thinking,
open-mindedness, and emergent values among high
school seniors and their teachers. (Doctoral dissertation, Fordham University). Dissertation Abstracts
International, 34, 2278A.
O’Toole, D. M. (1971). An accountability evaluation
of an in-service economic education experience.
(Doctoral dissertation, Ohio University). Dissertation
Abstracts International, 32, 2315A.
Owens, T. R. & Roaden, A. L. (1966). Predicting
academic success in master’s degree programs in education. Journal of Educational Research, 60, 124–126.
Parsley, J. F., Jr. (1970). A comparison of the ability
of ninth grade students to apply several critical
thinking skills to problematic content presented
through two different media. (Doctoral dissertation,
Ohio University). Dissertation Abstracts International,
31, 4620A.
Parson, C. V. (1991). Barrier to success: Community college students critical thinking skills. (ERIC Document
Reproduction Service No. ED 340415).
Rust, V. I. (1959). Factor analyses of three tests of critical thinking. (Doctoral dissertation, University of
Illinois). Dissertation Abstracts, 20, 225.
Rust, V. I. (1960). Factor analyses of three tests of
critical thinking. Journal of Experimental Education,
29, 177–182.
Pascarella, E. T. (1987, November). The development of
critical thinking: Does college make a difference? Paper
presented at the Annual Meeting of the Association
for the Study of Higher Education, Baltimore, MD.
Rust, V. I. (1961). A study of pathological doubting
as a response set. Journal of Experimental Education,
29, 393–400.
Pierleoni, R. G. & Lysaught, J. P. (1970). A decision ladder for prediction programmer success. NSPI Journal,
9, 6–7.
Rust, V. I., Jones, R. S., & Kaiser, H. F. (1962). A factor analytic study of critical thinking. Journal of
Educational Research, 55, 253–259.
Pillai, N. P. & Nayar, P. P. (1968). The role of critical thinking in science achievement. Journal of
Educational Research and Extension, 5, 1–8.
Ryan, A. M. & Sackett, P. R. (1987). A survey of individual assessment practices by I/O Psychologists.
Personnel Psychology, 40, 455–488.
Poel, R. H. (1970). Critical thinking as related to PSSC
and non-PSSC physics programs. (Doctoral dissertation, Western Michigan University). Dissertation
Abstracts International, 31, 3983A.
Schafer, P. J. (1972). An inquiry into the relationship
between the critical thinking ability of teachers and
selected variables. (Doctoral dissertation, University
of Pittsburgh). Dissertation Abstracts International,
33, 1066A.
Quinn, P. V. (1965). Critical thinking and openmindedness in pupils from public and Catholic secondary
schools. Journal of Social Psychology, 66, 23–30.
Schmeck, R. R. & Ribich, F. D. (1978). Construct validation of the Inventory of Learning Processes. Applied
Psychological Measurement, 2, 551–562.
Raburn, J. & Van Scuyver, B. (1984). The relationship of
reading and writing to thinking among university students taking English Composition. (ERIC Document
Reproduction Service No. ED 273978).
Scott, D. W. (1983). Anxiety, critical thinking, and
information processing during and after breast
biopsy. Nursing Research, 32, 24–28.
Radebaugh, B. F. & Johnson, J. A. (1971). Excellent
teachers: What makes them outstanding? Phase 2.
Illinois School Research, 7, 12–20.
Scott, D. W., Oberts, M. T., & Bookbinder, M. I. (1984).
Stress-coping response to genitourinary carcinoma in
men. Nursing Research, 33, 24–28.
Rawls, J. R. & Nelson, O. T. (1975). Characteristics
associated with preferences for certain managerial
positions. Psychological Reports, 36, 911–918.
Seymour, L. A. & Sutman, F. X. (1973). Critical thinking ability, open-mindedness, and knowledge of the
processes of science of chemistry and non-chemistry
students. Journal of Research in Science Teaching,
10, 159–163.
Ribordy, S. C., Holmes, D. S., & Buchsbaum, H. K.
(1980). Effects of affective and cognitive distractions
on anxiety reduction. Journal of Social Psychology,
112, 121–127.
Richards, M. A. (1977). One integrated curriculum: An
empirical evaluation. Nursing Research, 26, 90–95.
Richardson, Bellows, Henry & Co. (1963).
Normative information: Manager and executive
testing. New York: Author.
Rose, R. G. (1980). An examination of the responses to
a multivalue logic test. Journal of General Psychology,
102, 275–281.
Shatin, L. & Opdyke, D. (1967). A critical thinking appraisal and its correlates. Journal of Medical
Education, 42, 789–792.
Sherman, M. (1978). Concurrent validation of entry level
police officers’ examination. (Technical Report 78-1).
State of Minnesota Department of Personnel.
Shneidman, E. S. (1961). The case of E1: Psychological
test data. Journal of Projective Techniques, 25, 131–154.
Roberts, A. D., Gable, R. K., & Owen, S. V. (1977).
An evaluation of minicourse curricula in secondary
social studies. Journal of Experimental Education,
46, 4–11.
Shockley, J. T. (1962). Behavioral rigidity in relation to
student success in college physical science. Science
Education, 46, 67–70.
Rodd, W. G. (1959). A cross-cultural study of Taiwan’s
schools. Journal of Social Psychology, 50, 3–36.
Shultz, K.S. & Whitney, D.J. (2005). Measurement
Theory in Action: Case Studies and Exercises. London:
Sage Publications.
Singer, E. & Roby, T. B. (1967). Dimensions of decision-making behavior. Perceptual and Motor Skills,
24, 571–595.
Sternberg, R. J. (1986). Critical thinking: Its nature,
measurement, and improvement. (ERIC Document
Reproduction Service No. ED 272882).
Skelly, C. G. (1961). Some variables which differentiate
the highly intelligent and highly divergent thinking adolescent. (Doctoral dissertation, University of
Connecticut). Dissertation Abstracts, 22, 2699.
Stevens, J. T. (1972). A study of the relationships
between selected teacher affective characteristics
and student learning outcomes in a junior high
school science program. (Doctoral dissertation,
University of Virginia). Dissertation Abstracts
International, 33, 3430A.
Skinner, S. B. (1970). A study of the effect of the St.
Andrews Presbyterian College Natural Science course
upon critical thinking ability. (Doctoral dissertation,
University of North Carolina). Dissertation Abstracts
International, 31, 3984A.
Skinner, S. B. & Hounshell, P. B. (1972). The effect of St.
Andrews College Natural Science Course upon critical
thinking ability. School Science and Math, 72, 555–562.
Smith, D. G. (1977). College classroom interactions
and critical thinking. Journal of Educational
Psychology, 69, 180–190.
Smith, D. G. (1980, April). College instruction: four
empirical views. Paper presented at the 64th Annual
Meeting of the American Educational Research
Association, Boston.
Smith, J. R. (1969). A comparison of two methods of
conduction introductory college physics laboratories. (Doctoral dissertation, Colorado State College).
Dissertation Abstracts International, 30, 4159A.
Smith, P. M., Jr. (1963). Critical thinking and the science intangibles. Science Education, 47, 405–408.
Smith, R. G. (1965). An evaluation of selected aspects of
a teacher education admissions program. (Doctoral
dissertation, North Texas State University, Denton).
Dissertation Abstracts, 26, 3771.
Smith, R. L. (1971). A factor-analytic study of critical
reading/thinking, influence ability, and related factors. (Doctoral dissertation, University of Maine).
Dissertation Abstracts International, 32, 6220A.
Snider, J. G. (1964). Some correlates of all-inclusive
conceptualization in high school pupils. (Doctoral
dissertation, Stanford University). Dissertation
Abstracts, 25, 4005.
Sparks, C. P. (1990). How to read a test manual. In J.
Hogan & R. Hogan (Eds.), Business and industry testing.
Austin, TX: Pro-Ed.
Stalnaker, A. W. (1965). The Watson-Glaser Critical
Thinking Appraisal as a predictor of programming
performance. Proceedings of the Annual Computer
Personnel Research Conference, 3, 75–77.
Stephens, J. A. (1966). A study of the correlation between
critical thinking abilities and achievement in algebra
involving advanced placement. Unpublished master’s
thesis, North Carolina State University, Raleigh.
Steward, R. J. & Al Abdulla, Y. (1989). An examination of the relationship between critical thinking
and academic success on a university campus. (ERIC
Document Reproduction Service No. ED 318936).
Story, L. E., Jr. (1974). The effect of the BSCS inquiry
slides on the critical thinking ability and process
skills of first-year biology students. (Doctoral dissertation, University of Southern Mississippi). Dissertation
Abstracts International, 35, 2796A.
Taylor, L. E. (1972). Predicted role of prospective
activity-centered vs. textbook-centered elementary science teachers correlated with 16 personality
factors and critical thinking abilities. (Doctoral dissertation, University of Idaho). Dissertation Abstracts
International, 34, 2415A.
Thompson, A. P. & Smith, L. M. (1982). Conceptual,
computational, and attitudinal correlates of student
performance in introductory statistics. Australian
Psychologist, 17, 191–197.
Titus, H. W. (1969). Prediction of supervisory success
by use of standard psychological tests. Journal of
Psychology, 72, 35–40.
Titus, H. E. & Goss, R. G. (1969). Psychometric comparison of old and young supervisors. Psychological
Reports, 24, 727–733.
Trela, T. M. (1962). A comparison of ninth grade
achievement on selected measures of general reading comprehension, critical thinking, and general
educational development. (Doctoral dissertation,
University of Missouri). Dissertation Abstracts,
23, 2382.
Trela, T. M. (1967). Comparing achievement on tests
of general and critical reading. Journal of Reading
Specialists, 6, 140–142.
Troxel, V. A. & Snider, C. F. B. (1970). Correlations
among student outcomes on the Test of
Understanding Science, Watson-Glaser Critical
Thinking Appraisal, and the American Chemical
Society Cooperative Examination—General
Chemistry. School Science and Math, 70, 73–76.
Vance, J. S. (1972). The influence of a teacher questioning strategy on attitude and critical thinking.
(Doctoral dissertation, Arizona State University).
Dissertation Abstracts International, 33, 669A.
Vidler, D. & Hansen, R. (1980). Answer changing
on multiple-choice tests. Journal of Experimental
Education, 49, 18–20.
Walton, F. X. (1968). An investigation of differences
between more effective and less effective counselors
with regard to selected variables. (Doctoral dissertation, University of South Carolina). Dissertation
Abstracts, 29, 3844A.
Ward, J. (1972). The saga of Butch and Slim. British
Journal of Educational Psychology, 42, 267–289.
Watson, G., & Glaser, E. M. (1980). Watson-Glaser
Critical Thinking Appraisal, Forms A and B manual.
San Antonio, TX: The Psychological Corporation.
Welsch, L. A. (1967). The supervisor’s employee
appraisal heuristic: The contribution of selected
measures of employee aptitude, intelligence, and
personality. (Doctoral dissertation, University of
Pittsburgh). Dissertation Abstracts, 28, 4321A.
Wenberg, B. G. & Ingersoll, R. W. (1965). Medical
dietetics: Part 2, The development of evaluative
techniques. Journal of the American Dietetic
Association, 47, 298–300.
Wenberg, C. W. & Burness, G. (1969). Evaluation
of dietetic interns. Journal of the American Dietetic
Association, 54, 297–301.
Westbrook, B. W. & Sellers, J. R. (1967). Critical thinking, intelligence, and vocabulary. Educational and
Psychological Measurement, 27, 443–446.
Wevrick, L. (1970). Evaluation of the personnel test
battery. In L. Wevrick, et al. (Eds.), Applied research
in public personnel administration. Chicago: Public
Personnel Association.
White, W. F. & Burke, C. M. (1992). Critical thinking and teaching attitudes of preservice teachers.
Education, 112, 443–450.
Wilson, D. G. & Wagner, E. E. (1981). The Watson-Glaser Critical Thinking Appraisal as predictor of
performance in a critical thinking course. Educational
and Psychological Measurement, 41, 1319–1322.
Woehlke, P. L. (1985). Watson-Glaser Critical Thinking
Appraisal. In D. J. Keyser & R. C. Sweetland (Eds.),
Test critiques (Vol. III, pp. 682-685). Kansas City, MO:
Test Corporation of America.
Yager, R. E. (1968). Critical thinking and reference
materials in the physical science classroom. School
Science and Math, 68, 743–746.
Yoesting, C. & Renner, J. W. (1969). Is critical thinking an outcome of a college general physics science
course? School Science and Math, 69, 199–206.
Glossary of Measurement Terms
This glossary of terms is intended to aid in the interpretation of statistical information presented in the Watson-Glaser–Short Form Manual, as well as other
manuals published by Harcourt Assessment, Inc. The terms defined are fairly
common and basic. In the definitions, certain technicalities have been sacrificed
for the sake of succinctness and clarity.
achievement test—A test that measures the extent to which a person has
“achieved” something, acquired certain information, or mastered certain
skills—usually as a result of planned instruction or training.
alternate-form reliability—The closeness of correspondence, or correlation,
between results on alternate forms of a test; thus, a measure of the extent to
which the two forms are consistent or reliable in measuring whatever they do
measure. The time interval between the two testings must be relatively short so
that the examinees are unchanged in the ability being measured. See reliability,
reliability coefficient.
aptitude—A combination of abilities and other characteristics, whether innate
or acquired, that are indicative of an individual’s ability to learn or to develop
proficiency in some particular area if appropriate education or training is provided. Aptitude tests include those of general academic ability (commonly called
mental ability or intelligence tests); those of special abilities, such as verbal,
numerical, mechanical, or musical; tests assessing “readiness” for learning; and
prognostic tests, which measure both ability and previous learning and are used
to predict future performance – usually in a field requiring specific skills, such as
speaking a foreign language, taking shorthand, or nursing.
arithmetic mean—A kind of average usually referred to as the “mean.” It is
obtained by dividing the sum of a set of scores by the number of scores. See
central tendency.
average—A general term applied to the various measures of central tendency. The
three most widely used averages are the arithmetic mean (mean), the median,
and the mode. When the “average” is used without designation as to type, the
most likely assumption is that it is the arithmetic mean. See central tendency,
arithmetic mean, median, mode.
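As an illustration only, the three averages can be computed for a small set of scores. The sketch below uses Python's standard statistics module; the scores are hypothetical and are not drawn from any Watson-Glaser sample.

```python
from statistics import mean, median, mode

# Hypothetical raw scores for seven examinees (illustrative only).
scores = [12, 15, 15, 18, 20, 22, 31]

avg_mean = mean(scores)      # sum of the scores divided by the number of scores
avg_median = median(scores)  # middle score of the ranked set
avg_mode = mode(scores)      # score that occurs most frequently

print(avg_mean, avg_median, avg_mode)
```

In this set the single high score of 31 pulls the mean above the median, which is one reason the type of average should be stated when it is reported.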
battery—A group of several tests standardized on the same population so that
results on the several tests are comparable. Sometimes applied to any group of
tests administered together, even though not standardized on the same subjects.
ceiling—The upper limit of ability that can be measured by a test. When an individual earns a score which is at or near the highest possible score, it is said that
the “ceiling” of the test is too low for the individual. The person should be given
a higher level test.
central tendency—A measure of central tendency provides a single most typical
score as representative of a group of scores. The “trend” of a group of measures is
indicated by some type of average, usually the mean or the median.
Classical Test Theory (also known as True Score Theory)—The earliest theory of psychological measurement which is based on the idea that the observed
score a person gets on a test is composed of the person’s theoretical “true score”
and an “error score” due to unreliability (or imperfection) in the test. In Classical
Test Theory (CTT), item difficulty is indicated by the proportion (p) of examinees that answer a given item correctly. Note that in CTT, the more difficult an
item is, the lower p is for that item.
composite score—A score which combines several scores, usually by addition;
often different weights are applied to the contributing scores to increase or
decrease their importance in the composite. Most commonly, such scores are
used for predictive purposes and the weights are derived through multiple
regression procedures.
correlation—Relationship or “going-togetherness” between two sets of scores
or measures; tendency of one score to vary concomitantly with the other, as
the tendency of students of high IQ to be above the average in reading ability. The existence of a strong relationship (i.e., a high correlation) between two
variables does not necessarily indicate that one has any causal influence on the
other. Correlations are usually denoted by a coefficient; the correlation coefficient most frequently used in test development and educational research is the
Pearson or product-moment r. Unless otherwise specified, “correlation” usually
refers to this coefficient. Correlation coefficients range from –1.00 to +1.00; a
coefficient of 0.0 (zero) denotes a complete absence of relationship. Coefficients
of –1.00 or +1.00 indicate perfect negative or positive relationships, respectively.
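The product-moment coefficient described above can be sketched in a few lines. The function name pearson_r and the sample data are assumptions made for this illustration; the code simply divides the sum of cross-products of deviations by the product of the root sums of squared deviations.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations from each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Root sums of squared deviations for each variable.
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cross / (dev_x * dev_y)

# Hypothetical paired scores: a perfect positive and a perfect negative relationship.
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(round(r_positive, 2), round(r_negative, 2))
```

The function is undefined when either list has no variation, because the denominator is then zero.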
criterion—A standard by which a test may be judged or evaluated; a set of other
test scores, job performance ratings, etc., that a test is designed to measure, to predict, or to correlate with. See validity.
cutoff score (cut score)—A specified point on a score scale at or above which
applicants pass the test and below which applicants fail the test.
deviation—The amount by which a score differs from some reference value, such
as the mean, the norm, or the score on some other test.
difficulty index (p or b)—The proportion of examinees correctly answering an
item. The greater the proportion of correct responses, the easier the item.
discrimination index (d or a)—The difference between the proportion of
high-scoring examinees who correctly answer an item and the proportion of low-scoring examinees who correctly answer the item. The greater the difference, the
more information the item has regarding the examinee’s level of performance.
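A minimal sketch of the classical difficulty index (p) and discrimination index (d) follows. The response matrix is hypothetical, and splitting examinees into upper and lower halves by total score is an assumption made for this example; other splits are possible.

```python
# Hypothetical scored responses: one row per examinee, one column per item
# (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

def difficulty(rows, item):
    """Proportion of examinees in rows who answered the item correctly (p)."""
    return sum(row[item] for row in rows) / len(rows)

def discrimination(rows, item):
    """p for the high-scoring half minus p for the low-scoring half (d)."""
    # Rank examinees by total score; ties keep their original order.
    ranked = sorted(rows, key=sum, reverse=True)
    half = len(ranked) // 2
    high, low = ranked[:half], ranked[-half:]
    return difficulty(high, item) - difficulty(low, item)

for item in range(4):
    print(item, round(difficulty(responses, item), 2),
          round(discrimination(responses, item), 2))
```

In this data the second item (index 1) is answered correctly by every examinee in the upper half and by none in the lower half, so its discrimination index is 1.0 while its difficulty index is 0.5.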
distribution (frequency distribution)—A tabulation of the scores (or other
attributes) of a group of individuals to show the number (frequency) of each
score, or of those within the range of each interval.
factor analysis—A term that represents a large number of different mathematical procedures for summarizing the interrelationships among a set of variables
or items in terms of a reduced number of hypothetical variables, called factors.
Factors are used to summarize scores on multiple variables in terms of a single
score, and to select items that are homogeneous.
factor loading—An index, similar to the correlation coefficient in size and
meaning, of the degree to which a variable is associated with a factor; in test
construction, a number that represents the degree to which an item is related
to a set of homogeneous items.
Fit to the model—No model can be expected to represent complex human
behavior or ability perfectly. As a reasonable approximation, however, such a
model can provide many practical benefits. Item-difficulty and person-ability values are initially estimated on the assumption that the model is correct. An examination of the data reveals whether or not the model satisfactorily predicts each
person’s actual pattern of item passes and failures. The model-fit statistic, based
on discrepancies between predicted and observed item responses, identifies items
that “fit the model” better. Such items are then retained in a shorter version of a
long test.
internal consistency—Degree of relationship among the items of a test; consistency in content sampling.
item response theory (IRT)—Refers to a variety of techniques based on the
assumption that performance on an item is related to the estimated amount
of the “latent trait” that the examinee possesses. IRT techniques show the measurement efficiency of an item at different ability levels. In addition to yielding
mathematically refined indices of item difficulty (b) and item discrimination (a),
IRT models may contain additional parameters (e.g., guessing).
mean (M)—See arithmetic mean, central tendency.
median (Md)—The middle score in a distribution or set of ranked scores; the
point (score) that divides the group into two equal parts; the 50th percentile. Half
of the scores are below the median and half above it, except when the median
itself is one of the obtained scores. See central tendency.
mode—The score or value that occurs most frequently in a distribution.
multitrait-multimethod matrix—An experimental design to examine
both convergent and discriminant validity, involving a matrix showing the correlations between the scores obtained (1) for the same trait by different methods,
(2) for different traits by the same method, and (3) for different traits by different
methods. Construct-valid measures show higher same trait-different methods
correlations than the correlations obtained for different traits-different
methods and different traits-same method correlations.
normal distribution—A distribution of scores or measures that in graphic
form has a distinctive bell-shaped appearance. In a perfect normal distribution,
scores or measures are distributed symmetrically around the mean, with as many
cases up to various distances above the mean as down to equal distances below
it. Cases are concentrated near the mean and decrease in frequency, according
to a precise mathematical equation, the farther one departs from the mean.
Mean, median, and mode are identical. The assumption that mental and
psychological characteristics are distributed normally has been very useful
in test development work.
normative data (norms)—Statistics that supply a frame of reference by which
meaning may be given to obtained test scores. Norms are based upon the actual
performance of individuals in the standardization sample(s) for the test. Since
they represent average or typical performance, they should not be regarded as
standards or as universally desirable levels of attainment. The most common
types of norms are deviation IQ, percentile rank, grade equivalent, and stanine.
Reference groups are usually those of specified occupations, age, grade, gender,
or ethnicity.
part-whole correlation—A correlation between one variable and another
variable representing a subset of the information contained in the first; in test
construction, the correlation between a score based on a set of items and
another score based on a subset of the same items.
percentile (P)—A point (score) in a distribution at or below which fall the percent
of cases indicated by the percentile. Thus a score coinciding with the 35th percentile (P35) is regarded as equaling or surpassing 35% of the persons in the group,
such that 65% of the performances exceed this score. “Percentile” does not mean
the percent of correct answers on a test.
Use of percentiles in interpreting scores offers a number of advantages: percentiles are easy to compute and understand, can be used with any type of examinee,
and are suitable for any type of test. The primary drawback of using a raw score-to-percentile conversion is the resulting inequality of units, especially at the
extremes of the distribution of scores. For example, in a normal distribution,
scores cluster near the mean and decrease in frequency the farther one departs
from the mean. In the transformation to percentiles, raw score differences near
the center of the distribution are exaggerated—small raw score differences may
lead to large percentile differences. This is especially the case when a large proportion of examinees receive the same or similar scores, causing a one- or two-point
raw score difference to result in a 10- or 15-unit percentile difference. Short tests
with a limited number of possible raw scores often result in a clustering of scores.
The resulting effect on tables of selected percentiles is “gaps” in the table corresponding to points in the distribution where scores cluster most closely together.
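The inequality of units can be illustrated with a short sketch. It assumes raw scores follow a normal distribution with a hypothetical mean of 30 and standard deviation of 5 (illustrative values, not taken from this manual) and compares the percentile change produced by a one-point raw score difference near the mean with the same difference two standard deviations above it.

```python
from statistics import NormalDist

# Hypothetical score distribution; the mean and SD are illustrative assumptions.
scores = NormalDist(mu=30, sigma=5)

def percentile_of(raw_score):
    """Percent of the (assumed normal) distribution at or below raw_score."""
    return 100.0 * scores.cdf(raw_score)

# One raw-score point near the center of the distribution...
center_gap = percentile_of(31) - percentile_of(30)
# ...versus one raw-score point two SDs above the mean.
tail_gap = percentile_of(41) - percentile_of(40)

print(f"30 -> 31: {center_gap:.1f} percentile points")
print(f"40 -> 41: {tail_gap:.1f} percentile points")
```

Under these assumptions the same one-point raw score difference moves an examinee several times further in percentile terms at the center of the distribution than in the upper tail.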
percentile band—An interpretation of a test score which takes into account the
measurement error that is involved. The range of such bands, most useful in
portraying significant differences in battery profiles, is usually from one standard
error of measurement below the obtained score to one standard error of measurement above the score.
percentile rank (PR)—The expression of an obtained test score in terms of its
position within a group of 100 scores; the percentile rank of a score is the percent of scores equal to or lower than the given score in its own or some external
reference group.
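As a minimal sketch of this definition, the function below counts the scores in a reference group that are equal to or lower than a given score and expresses the count as a percent. The function name and the sample scores are hypothetical and serve only as illustration.

```python
def percentile_rank(score, reference_scores):
    """Percent of reference scores equal to or lower than `score`."""
    at_or_below = sum(1 for s in reference_scores if s <= score)
    return 100.0 * at_or_below / len(reference_scores)

# Hypothetical raw scores for a small reference group (illustrative only).
group = [18, 22, 25, 25, 27, 29, 30, 31, 33, 36]

# Six of the ten scores are 29 or lower.
pr = percentile_rank(29, group)
print(pr)
```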
point-biserial correlation (rpbis)—A type of correlation coefficient calculated
when one variable represents a dichotomy (e.g., 0 and 1) and the other represents a continuous or multi-step scale. In test construction, the dichotomous
variable is typically the item score (i.e., correct or incorrect) and the other is typically
the number correct for the entire test; good test items will have moderate to high
positive point-biserial correlations (i.e., more high-scoring examinees answer the
item correctly than low-scoring examinees).
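One way to compute this coefficient is to apply the ordinary product-moment correlation formula to a 0/1 item score and the total score, since the point-biserial is that correlation for a dichotomous variable. The sketch below does so; the function name and the response data are hypothetical.

```python
import math

def point_biserial(item_scores, total_scores):
    """Product-moment correlation between a 0/1 item score and a total score."""
    n = len(item_scores)
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    cross = sum((x - mean_item) * (y - mean_total)
                for x, y in zip(item_scores, total_scores))
    ss_item = sum((x - mean_item) ** 2 for x in item_scores)
    ss_total = sum((y - mean_total) ** 2 for y in total_scores)
    return cross / math.sqrt(ss_item * ss_total)

# Hypothetical data: 1 = item answered correctly, 0 = incorrectly,
# paired with each examinee's number-correct score on the whole test.
item = [1, 1, 1, 0, 1, 0, 0, 0]
total = [38, 35, 33, 30, 29, 25, 22, 20]

r_pbis = point_biserial(item, total)
print(round(r_pbis, 2))
```

A positive value indicates that examinees who answered the item correctly tended to earn higher total scores. Because the total score here includes the item itself, the result is also a part-whole correlation.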
practice effect—The influence of previous experience with a test on a later
administration of the same or similar test; usually an increased familiarity with
the directions, kinds of questions, etc. Practice effect is greatest when the interval
between testings is short, when the content of the two tests is identical or very
similar, and when the initial test-taking represents a relatively novel experience
for the subjects.
profile—A graphic representation of the results on several tests, for either an
individual or a group, when the results have been expressed in some uniform
or comparable terms (standard scores, percentile ranks, grade equivalents, etc.).
The profile method of presentation permits identification of areas of strength
or weakness.
r—See correlation.
range—For some specified group, the difference between the highest and the lowest obtained score on a test; thus a very rough measure of spread or variability,
since it is based upon only two extreme scores. Range is also used in reference to
the possible range of scores on a test, which in most instances is the number of
items in the test.
Rasch model—A technique in Item Response Theory (IRT) using only the item
difficulty parameter. This model assumes that both guessing and item differences
in discrimination are negligible.
raw score—The first quantitative result obtained in scoring a test. Examples
include the number of right answers, number right minus some fraction of
number wrong, time required for performance, number of errors, or similar
direct, unconverted, uninterpreted measures.
reliability—The extent to which a test is consistent in measuring whatever it does
measure; dependability, stability, trustworthiness, relative freedom from errors
of measurement. Reliability is usually expressed by some form of reliability coefficient or by the standard error of measurement derived from it.
reliability coefficient—The coefficient of correlation between two forms of
a test, between scores on two administrations of the same test, or between
halves of a test, properly corrected. The three measure somewhat different
aspects of reliability, but all are properly spoken of as reliability coefficients.
See alternate-form reliability, split-half reliability coefficient, test-retest
reliability coefficient.
representative sample—A subset that corresponds to or matches the population
of which it is a sample with respect to characteristics important for the purposes
under investigation. In a clerical aptitude test norm sample, such significant
aspects might be the level of clerical training and work experience of those in the
sample, the type of job they hold, and the geographic location of the sample.
split-half reliability coefficient—A coefficient of reliability obtained by correlating scores on one half of a test with scores on the other half, and applying
the Spearman-Brown formula to adjust for the double length of the total test.
Generally, but not necessarily, the two halves consist of the odd-numbered and
the even-numbered items. Split-half reliability coefficients are sometimes referred
to as measures of the internal consistency of a test; they involve content sampling only, not stability over time.
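The procedure can be sketched as follows, assuming an odd-even split and the usual Spearman-Brown step-up for doubled length, 2r / (1 + r). The function names and the response matrix are hypothetical and purely illustrative.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

def spearman_brown(r_half):
    """Step a half-test correlation up to the full (double) test length."""
    return 2.0 * r_half / (1.0 + r_half)

def split_half_reliability(responses):
    """responses: one row per examinee, one 0/1 entry per item."""
    odd_half = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_half, even_half))

# Hypothetical responses of six examinees to a six-item test.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(round(split_half_reliability(responses), 2))
```

The corrected coefficient is higher than the raw correlation between the two halves, reflecting the greater length of the full test.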
standard deviation (SD)—A measure of the variability or dispersion of a distribution of scores. The more the scores cluster around the mean, the smaller
the standard deviation. For a normal distribution, approximately two thirds
(68.27%) of the scores are within the range from one SD below the mean to
one SD above the mean. Computation of the SD is based upon the square of
the deviation of each score from the mean.
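A minimal sketch of the computation follows. It assumes the form in which the squared deviations are averaged over all scores in the group (some applications divide by one less than the number of scores instead). The function names and the scores are hypothetical.

```python
import math
from statistics import NormalDist

def variance(scores):
    """Average of the squared deviations of the scores from their mean."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)

def standard_deviation(scores):
    """Square root of the variance."""
    return math.sqrt(variance(scores))

# Hypothetical raw scores (illustrative only); their mean is 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
sd = standard_deviation(scores)
print(sd)

# Proportion of a normal distribution lying within one SD of the mean,
# i.e., roughly two thirds of the scores.
within_one_sd = NormalDist().cdf(1.0) - NormalDist().cdf(-1.0)
print(round(within_one_sd, 2))
```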
standard error (SE)—A statistic providing an estimate of the possible magnitude
of “error” present in some obtained measure, whether (1) an individual score or
(2) some group measure, as a mean or a correlation coefficient.
(1) standard error of measurement (SEM)—As applied to a single obtained
score, the amount by which the score may differ from the hypothetical true
score due to errors of measurement. The larger the SEM, the less reliable the
measurement and the less reliable the score. The SEM is an amount such that
in about two-thirds of the cases, the obtained score would not differ by more
than one SEM from the true score. (Theoretically, then, it can be said that the
chances are 2:1 that the obtained score is within a band extending from the true
score minus one SEM to the true score plus one SEM; but since the true score can
never be known, actual practice must reverse the true-obtained relation for an
interpretation.) Other probabilities are noted under (2) below. See true score.
(2) standard error—When applied to sample estimates (e.g., group averages,
standard deviation, correlation coefficients), the SE provides an estimate of
the “error” which may be involved. The sample or group size and the SD are the
factors on which standard errors are based. The same probability interpretation
is made for the SEs of group measures as is made for the SEM; that is, 2 out of 3
sample estimates will lie within 1.0 SE of the “true” value, 95 out of 100 within
1.96 SE, and 99 out of 100 within 2.6 SE.
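The two uses of the standard error can be sketched with the formulas commonly used in classical test theory, which this glossary does not state explicitly: SEM = SD × √(1 − reliability) for an individual score, and SE = SD / √n for a group mean. All function names and numeric values below are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM for a single score: SD times the square root of (1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def score_band(obtained, sem, multiplier=1.0):
    """Band from `multiplier` SEMs below to `multiplier` SEMs above a score."""
    return (obtained - multiplier * sem, obtained + multiplier * sem)

def standard_error_of_mean(sd, n):
    """SE of a group mean: SD divided by the square root of the group size."""
    return sd / math.sqrt(n)

# Hypothetical test statistics (illustrative only).
sem = standard_error_of_measurement(sd=5.0, reliability=0.84)
low, high = score_band(obtained=30, sem=sem)            # about 2 chances in 3
low95, high95 = score_band(obtained=30, sem=sem, multiplier=1.96)
se_mean = standard_error_of_mean(sd=5.0, n=100)

print(round(sem, 2), (round(low, 1), round(high, 1)), round(se_mean, 2))
```

Consistent with actual practice as described above, the band is placed around the obtained score, since the true score is unknown.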
standard score—A general term referring to any of a variety of “transformed”
scores, in terms of which raw scores may be expressed for reasons of convenience, comparability, ease of interpretation, etc. The simplest type of standard
score, known as a z score, is an expression of the deviation of a score from the
mean score of the group in relation to the standard deviation of the scores of
the group. Thus,
Standard Score = (Score - Mean) / Standard Deviation
Adjustments may be made in this ratio so that a system of standard scores having any desired mean and standard deviation may be set up. The use of such
standard scores does not affect the relative standing of the individuals in the
group or change the shape of the original distribution.
Standard scores are useful in expressing the raw score of two forms of a test in
comparable terms in instances where tryouts have shown that the two forms are
not identical in difficulty. Also, successive levels of a test may be linked to form a
continuous standard-score scale, making across-battery comparisons possible.
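The z score and its adjustment to another scale can be sketched as below. The target scale with a mean of 50 and a standard deviation of 10 is an illustrative assumption, as are the function names, the group statistics, and the raw scores.

```python
def z_score(score, mean, sd):
    """Deviation of a score from the group mean, in SD units."""
    return (score - mean) / sd

def rescale(z, new_mean, new_sd):
    """Express a z score on a scale with any desired mean and SD."""
    return new_mean + new_sd * z

# Hypothetical group statistics and raw scores (illustrative only).
group_mean, group_sd = 30.0, 5.0
raw_scores = [22, 35, 28, 41]

z_scores = [z_score(s, group_mean, group_sd) for s in raw_scores]
scaled = [rescale(z, new_mean=50, new_sd=10) for z in z_scores]

for raw, z, t in zip(raw_scores, z_scores, scaled):
    print(raw, round(z, 2), round(t, 1))
```

Because both steps are linear, sorting the examinees by raw score, z score, or rescaled score yields the same order.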
standardized test (standard test)—A test designed to provide a systematic
sample of individual performance, administered according to prescribed directions, scored in conformance with definite rules, and interpreted in reference to
certain normative information. Some would further restrict the usage of the term
“standardized” to those tests for which the items have been chosen on the basis
of experimental evaluation, and for which data on reliability and validity
are provided.
statistical equivalence—Occurs when test forms measure the same construct
and every level of the construct is measured with equal accuracy by the forms.
Statistically equivalent test forms may be used interchangeably.
testlet—A single test scenario that has a number of test questions based directly on
the scenario. A testlet score is generated by summing the responses for all items
in the testlet.
test-retest reliability coefficient—A type of reliability coefficient obtained by
administering the same test a second time, after a short interval, and correlating
the two sets of scores. “Same test” was originally understood to mean identical
content, i.e., the same form. Currently, however, the term “test-retest” is also
used to describe the administration of different forms of the same test, in which
case this reliability coefficient becomes the same as the alternate-form coefficient. In either type, the correlation may be affected by fluctuations over time,
differences in testing situations, and practice. When the time interval between
the two testings is considerable (i.e., several months), a test-retest reliability coefficient reflects not only the consistency of measurement provided by the test, but
also the stability of the trait being measured.
true score—A score entirely free of error; hence, a hypothetical value that can
never be obtained by psychological testing, because testing always involves some
measurement error. A “true” score may be thought of as the average score from
an infinite number of measurements from the same or exactly equivalent tests,
assuming no practice effect or change in the examinee during the testings. The
standard deviation of this infinite number of “samplings” is known as the standard error of measurement.
validity—The extent to which a test does the job for which it is used. This definition is more satisfactory than the traditional “extent to which a test measures
what it is supposed to measure,” since the validity of a test is always specific to
the purposes for which the test is used.
(1) content validity. For achievement tests, validity is the extent to which
the content of the test represents a balanced and adequate sampling of the outcomes (knowledge, skills, etc.) of the job, course, or instructional program it is
intended to cover. It is best evidenced by a comparison of the test content with
job descriptions, courses of study, instructional materials, and statements of educational goals; and often by analysis of the process required in making correct
responses to the items. Face validity, referring to an observation of what a test
appears to measure, is a non-technical type of evidence; apparent relevancy is,
however, quite desirable.
(2) criterion-related validity. The extent to which scores on the test are
in agreement with (concurrent validity) or predict (predictive validity) some
given criterion measure. Predictive validity refers to the accuracy with which an
aptitude, prognostic, or readiness test indicates future success in some area, as
evidenced by correlations between scores on the test and future criterion measures of such success (e.g., the relation of the score on a clerical aptitude test
administered at the application phase to job performance ratings obtained after a
year of employment). In concurrent validity, no significant time interval elapses
between administration of the test and collection of the criterion measure. Such
validity might be evidenced by concurrent measures of academic ability and of
achievement, by the relation of a new test to one generally accepted as or known
to be valid, or by the correlation between scores on a test and criterion measures
which are valid but are less objective and more time-consuming to obtain than a
test score.
(3) evidence based on internal structure. The extent to which a test measures some relatively abstract psychological trait or construct; applicable in
evaluating the validity of tests that have been constructed on the basis of
analysis (often factor analysis) of the nature of the trait and its manifestations.
(4) convergent and discriminant validity. Tests of personality, verbal ability, mechanical aptitude, critical thinking, etc., are validated in terms of the
relation of their scores to pertinent external data. Convergent evidence refers to
the relationship between a test score and other measures that have been demonstrated to measure similar constructs. Discriminant evidence refers to the
relationship between a test score and other measures demonstrated to measure
dissimilar constructs.
variability—The spread or dispersion of test scores, best indicated by their
standard deviation.
variance—For a distribution, the variance is the average of the squared deviations
from the mean. Thus, the variance is the square of the standard deviation.