Reliability and Validity of Measurement

jobando; ffehr; gregsonk19; stavingai23

Again, measurement involves assigning scores to individuals so that they represent some characteristic of the individuals. But how do researchers know that the scores actually represent the characteristic, especially when it is a construct like intelligence, self-esteem, depression, or working memory capacity? The answer is that they conduct research using the measure to confirm that the scores make sense based on their understanding of the construct being measured. This is an extremely important point. Researchers do not simply assume that their measures work. Instead, they collect data to demonstrate that they work. If their research does not demonstrate that a measure works, they stop using it.

As an informal example, imagine that you have been dieting for a month. Your clothes seem to be fitting more loosely, and several friends have asked if you have lost weight. If at this point your bathroom scale indicated that you had lost 10 pounds, this would make sense and you would continue to use the scale. But if it indicated that you had gained 10 pounds, you would rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement method, researchers consider two general dimensions: reliability and validity.

Reliability

Reliability refers to the consistency of a measure. Researchers consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).

Types of Reliability

Reliability → Consistency of Measurement

Test-retest → same results over time

Internal consistency → items behave similarly (Cronbach’s α)

Inter-rater reliability → raters agree

Test-Retest Reliability

When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.

Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the same group of people at a later time, and then looking at the test-retest correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing the correlation coefficient. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.
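The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed analysis: the scores are made-up data for six hypothetical people tested one week apart, and `pearson_r` is a helper written for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores (made-up data): the same six people, one week apart.
time1 = [22, 25, 18, 30, 27, 15]
time2 = [21, 26, 17, 29, 28, 16]

r = pearson_r(time1, time2)
print(f"test-retest r = {r:.2f}")
print("meets the +.80 guideline" if r >= 0.80 else "below the +.80 guideline")
```

With real data, the same two lists would also be plotted as a scatterplot so that outliers or non-linear patterns are visible before the coefficient is interpreted.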

**Figure 9.1** Reliable measurement is essential in quantitative research. Image: nurse taking a patient's pulse. Ime Stavinga/TRU Open Press, CC BY-NC-SA 4.0.

Internal Consistency

Another kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioral and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials.

Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing data. One approach is to look at a split-half correlation. This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. Then a score is computed for each set of items, and the relationship between the two sets of scores is examined. A split-half correlation of +.80 or greater is generally considered good internal consistency.
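As a rough sketch of the odd/even version of this procedure, the Python below splits a six-item questionnaire into odd- and even-numbered items, scores each half, and correlates the two sets of scores. The responses are made-up data, and the helper `pearson_r` is written for this example only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical responses (made-up data): one row per person, one column per
# item, each item answered on a 1-4 scale.
responses = [
    [4, 4, 3, 4, 4, 3],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 3, 4, 3, 3],
    [1, 2, 1, 1, 2, 1],
    [4, 3, 4, 4, 4, 4],
    [2, 2, 3, 2, 2, 2],
]

# Score each half: items 1, 3, 5 versus items 2, 4, 6.
odd_scores = [sum(row[0::2]) for row in responses]
even_scores = [sum(row[1::2]) for row in responses]

split_half_r = pearson_r(odd_scores, even_scores)
print(f"split-half r = {split_half_r:.2f}")
```

A different split, such as first half versus second half, would generally give a somewhat different value.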

Perhaps the most common measure of internal consistency used by researchers is a statistic called Cronbach’s α (the Greek letter alpha). Conceptually, α is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to split a set of 10 items into two sets of five. Cronbach’s α would be the mean of the 252 split-half correlations. Note that this is not how α is actually computed, but it is a correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good internal consistency.
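For readers curious about the computation itself, here is a minimal sketch of the usual variance-based formula, α = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The text above does not give this formula; it is the standard computational form. The responses are made-up data and `cronbach_alpha` is a name chosen for this example.

```python
from math import comb
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])                                  # number of items
    items = list(zip(*rows))                          # one tuple per item
    item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical responses (made-up data): six people, six items, 1-4 scale.
responses = [
    [4, 4, 3, 4, 4, 3],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 3, 4, 3, 3],
    [1, 2, 1, 1, 2, 1],
    [4, 3, 4, 4, 4, 4],
    [2, 2, 3, 2, 2, 2],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")

# The count quoted in the text: ways to choose which five of ten items
# form the first half.
print(comb(10, 5))
```

The formula uses only the item variances and the variance of the total scores, so no split-half correlation has to be enumerated.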

Inter-Rater Reliability

Many behavioral measures involve significant judgment on the part of an observer or a rater. Inter-rater reliability is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they were meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does, in fact, have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other. Inter-rater reliability is often assessed using Cronbach’s α when the judgments are quantitative or an analogous statistic called Cohen’s κ (the Greek letter kappa) when they are categorical.
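For categorical judgments, Cohen’s κ compares the observed proportion of agreement between two raters with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below assumes two hypothetical raters who each sorted the same ten video clips into "low", "medium", or "high" social skills. The ratings are made-up data and the function name is chosen for this example.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who categorized the same cases."""
    n = len(ratings_a)
    # Observed agreement: proportion of cases where both raters chose the same category.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: for each category, the product of the two raters'
    # proportions of use, summed over categories.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings (made-up data) of ten students' social skills.
rater_a = ["high", "low", "high", "medium", "low",
           "high", "medium", "low", "high", "low"]
rater_b = ["high", "low", "medium", "medium", "low",
           "high", "medium", "high", "high", "low"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")
```

In this example the raters agree on 8 of 10 cases, but κ is lower than .80 because some of that agreement would be expected by chance alone.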

Activity

Watch the following video, Reliability and Validity [7:13] by research assistant Katie Gregson (2025).

Note: If you are using a printed copy of this resource, watch the video by scanning the QR code with your mobile device.

Validity

Validity is the extent to which the scores from a measure represent the variable they are intended to represent. But how do researchers make this judgment? We have already considered one factor that they take into account: reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimeter longer than another’s would indicate nothing about which one had higher self-esteem.

Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure.

Internal Validity

Two variables being statistically related does not necessarily mean that one causes the other. You have probably heard the phrase “Correlation does not imply causation.” For example, if it were the case that people who exercise regularly are happier than people who do not, this would not necessarily mean that exercising increases people’s happiness. It could mean instead that greater happiness causes people to exercise or that something like better physical health causes people both to exercise and to be happier.

The purpose of an experiment, however, is to show that two variables are statistically related and to do so in a way that supports the conclusion that the independent variable caused any observed differences in the dependent variable. The logic is based on this assumption: If the researcher creates two or more highly similar conditions and then manipulates the independent variable to produce just one difference between them, then any later difference between the conditions must have been caused by the independent variable.

An empirical study is said to be high in internal validity if the way it was conducted supports the conclusion that the independent variable caused any observed differences in the dependent variable. Thus experiments are high in internal validity because the way they are conducted—with the manipulation of the independent variable and the control of extraneous variables (such as through the use of random assignment to minimize confounds)—provides strong support for causal conclusions. In contrast, non-experimental research designs (e.g., correlational designs), in which variables are measured but are not manipulated by an experimenter, are low in internal validity.
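Random assignment, mentioned above as one way to control extraneous variables, can be sketched as follows. This is an illustrative example only: the participant IDs, condition labels, and the `randomly_assign` helper are all hypothetical.

```python
import random

def randomly_assign(participants, conditions, seed=None):
    """Shuffle participants, then deal them into conditions in turn so that
    group sizes differ by at most one."""
    rng = random.Random(seed)          # a seed makes the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for i, person in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(person)
    return groups

# Hypothetical participant IDs and condition labels.
participants = [f"P{i:02d}" for i in range(1, 13)]
conditions = ["usual method", "new protocol"]

groups = randomly_assign(participants, conditions, seed=42)
for condition, members in groups.items():
    print(condition, sorted(members))
```

Because chance alone decides who ends up in which condition, participant characteristics such as age or prior experience are unlikely to differ systematically between the groups.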

External Validity

At the same time, the way that experiments are conducted sometimes leads to a different kind of criticism. Specifically, the need to manipulate the independent variable and control extraneous variables means that experiments are often conducted under conditions that seem artificial.

The issue we are confronting is that of external validity. An empirical study is high in external validity if the way it was conducted supports generalizing the results to people and situations beyond those actually studied. As a general rule, studies are higher in external validity when the participants and the situation studied are similar to those that the researchers want to generalize to and participants encounter every day, often described as mundane realism. As Table 9.4 shows, there are many threats to external validity that researchers must try to avoid or note in their limitations. Imagine, for example, that a group of researchers is interested in whether a new falls-prevention program reduces fall rates among older adults in long-term care homes in Ontario. If the study recruits only healthy volunteers or is conducted in a laboratory setting, it would be harder to generalize the results to real long-term care residents.

We should be careful, however, not to draw the blanket conclusion that experiments are low in external validity. One reason is that experiments need not seem artificial. Consider field experiments that are conducted entirely outside the laboratory. In one such experiment, nurses in a pediatric unit were randomly assigned to use either the usual pain-assessment method or a new evidence-based protocol during routine immunizations. The study was conducted during normal clinic hours with real patients, so the results reflect what could happen in everyday practice.

A second reason not to draw the blanket conclusion that experiments are low in external validity is that they are often conducted to learn about processes that are likely to operate in a variety of people and situations. Researchers studying how stress affects nurses’ clinical decision-making in emergency departments may simulate high-stress scenarios. Even though the scenarios are staged, the way stress influences decision-making is expected to be similar across many real Canadian emergency settings.

**Table 9.4:** **Threats to External Validity**
Threat	Explanation	Nursing/Health-Care Example
Testing	Taking a pre-test or completing a baseline measure changes how participants respond to the intervention.	Nurses on a surgical unit complete a questionnaire on burnout before a wellness intervention. Filling it out makes them more aware of their stress, so they start changing their self-care habits before the intervention even begins.
Sampling Bias	The sample does not represent the broader population the researchers want to generalize to.	A fall-prevention study recruits only highly mobile older adults from an independent-living residence. Results do not generalize to frailer older adults in long-term care who have different mobility needs.
Hawthorne Effect	Participants modify their behaviour because they know they are being observed or studied.	Nurses being studied for hand-hygiene compliance wash their hands more frequently than usual because they know auditors are present on the unit.

Face Validity

Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.

Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behavior, which are frequently wrong. It is also the case that many established measures in research work quite well despite lacking face validity.

Content Validity

Content validity is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then their measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that they think positive thoughts about exercising, feel good about exercising, and actually exercise. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.

Construct Validity

A construct is an idea or theoretical concept that is based on empirical observations but is not directly measurable. Examples include physical functioning and social anxiety. Construct validity is the extent to which an instrument measures the underlying construct of interest and discriminates it from other related constructs. Evidence of construct validity gives researchers confidence that the instrument captures the construct it is intended to capture. This type of validity can be assessed using factor analysis or other statistical techniques. For example, the SF-36 is a survey instrument widely used to measure quality of life or health status in sick and healthy populations (Hooker, 2020). In an evaluation of the Turkish version of the SF-36, principal components factor analysis with varimax rotation confirmed the presence of seven domains: physical functioning, role limitations due to physical and emotional problems, mental health, general health perception, bodily pain, social functioning, and vitality (Hooker, 2020). It was concluded that the Turkish version of the SF-36 was a suitable instrument that could be employed in cancer research in Turkey.

Criterion Validity

Criterion validity is the extent to which people’s scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.

A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).

Criteria can also include other measures of the same construct. For example, one would expect new measures of test anxiety or physical risk taking to be positively correlated with existing established measures of the same constructs. This is known as convergent validity. Assessing convergent validity requires collecting data using the measure.

Discriminant Validity

Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.
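Convergent and discriminant evidence are both examined with correlations, so the two checks can be sketched side by side. In the made-up data below, a hypothetical new self-esteem measure is compared with an established self-esteem measure, where a high correlation is expected, and with a mood rating, where a low correlation is expected. All scores and variable names are illustrative assumptions.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores (made-up data) for the same eight people.
new_self_esteem = [30, 22, 35, 18, 27, 33, 20, 25]
established_self_esteem = [31, 20, 36, 17, 28, 31, 22, 24]
current_mood = [5, 6, 4, 5, 7, 6, 5, 6]

convergent_r = pearson_r(new_self_esteem, established_self_esteem)
discriminant_r = pearson_r(new_self_esteem, current_mood)

print(f"new vs. established self-esteem: r = {convergent_r:.2f}")
print(f"new self-esteem vs. mood:        r = {discriminant_r:.2f}")
```

A pattern like this one, with a strong correlation with the established measure and a weak one with mood, would count as evidence for both convergent and discriminant validity. The same correlation check applies to any other criterion, such as exam performance for a test anxiety measure.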

Summary

Tables 9.5 and 9.6 below summarize key types of reliability and validity.

Table 9.5: Key Types of Reliability
Type	Focus	How to Assess	Acceptable Threshold
Internal Consistency	Are items within a test measuring the same thing?	Cronbach’s Alpha, Split-Half reliability	\(\alpha \geq 0.70\)
Test-Retest	Are results consistent over time?	Administering the same test to the same group twice	Correlation coefficient \(>0.70\)
Inter-Rater	Are results consistent across different observers?	Having multiple observers score the same test	Cohen’s Kappa (\(>0.60\))
Parallel-Forms	Are different versions of the test equivalent?	Comparing scores of two equivalent versions of a test	High positive correlation

Table 9.6: Key Types of Validity
Type	Focus	How to Assess	Validation Method
Content	Does the test cover the entire domain of the construct?	Expert review	Content Validity Index (CVI)
Construct	Does the test measure the intended theoretical concept?	Convergent and discriminant validity	Factor Analysis
Criterion	Do the results correlate with an established standard?	Predictive or concurrent validity	Correlation with a gold standard test
Face	Does the test appear to measure what it claims to?	General inspection or superficial evaluation	Subjective consensus

Remixed from:

Research methods in psychology (4th ed.) by Dr. Rajiv Jhangiani, Dr. Carrie Cuttler, and Dr. Dana C. Leighton, KPU. (2019). Published under a CC BY-NC-SA 4.0 license.
- Chapter 20: Reliability and Validity of Measurement
- Chapter 25: Experimentation and Validity
An Introduction to Research Methods for Undergraduate Health Profession Students by Faith Alele and Bunmi Malau-Aduli (2023). Published under a CC BY-NC 4.0 license.
- Chapter 3: Navigating Quantitative Research

Media Attributions

Figure 9.1 Reliable measurement is essential in quantitative research is by Research Assistant Ime Stavinga and is subject to the CC BY-NC-SA 4.0 license.

References

Alele, F., & Malau-Aduli, B. (2023). Chapter 3: Navigating quantitative research. In An introduction to research methods for undergraduate health profession students. James Cook University. https://jcu.pressbooks.pub/intro-res-methods-health/part/2-planning-a-research-project/

Gregson, K. (2025). Reliability and Validity [Video].

Hooker, S. A. (2020). SF-36. Encyclopedia of Behavioral Medicine, 2035–2036. https://doi.org/10.1007/978-3-030-39903-0_1597

Jhangiani, R. S., Chiang, I.-A., Cuttler, C., & Leighton, D. C. (2019). Research methods in psychology (4th ed.). Kwantlen Polytechnic University. https://kpu.pressbooks.pub/psychmethods4e/

