Internal and External Validity
Based on the Burgette and Magun-Jackson (2009) article:
1. List and describe one possible or plausible threat to statistical conclusion validity. Why do you think it is a possible or plausible threat?
2. List and describe one possible or plausible threat to internal validity. Why do you think it is a possible or plausible threat?
3. List and describe one possible or plausible threat to construct validity. Why do you think it is a possible or plausible threat?
4. List and describe one possible or plausible threat to external validity. Why do you think it is a possible or plausible threat?
5. Did the authors attempt to control for any of the threats to validity you mentioned in questions 1 – 4? If so, how did they try to control for one of these? If not, how could they control for one of these threats?
Rubric:
1. Describes a threat to statistical conclusion validity in the study (10.0 pts)
2. Describes a threat to internal validity in the study (10.0 pts)
3. Describes a threat to construct validity in the study (10.0 pts)
4. Describes a threat to external validity in the study (10.0 pts)
5. Describes how authors controlled for a threat to validity (10.0 pts)
Deductions: Late or Shared assignments (0.0 pts)
Total Points: 50.0
When working on Article Critique 3:
You will find that several threats to validity may apply to a single situation. Therefore, there will not always be one right answer. For instance, if the authors mention that several students with high pretest scores on academic achievement drop out of a study during the intervention, this can threaten statistical conclusion validity (restriction of range), internal validity (attrition), construct validity (confounding constructs with levels of constructs), and external validity (interaction of causal relationship with units). Rather than trying to figure out every plausible threat to validity that may impact a study’s results, I recommend that you select the clearest threat to validity.
Plausible vs. Possible Threats to Validity
A plausible threat to validity is when the authors mention something in the study that suggests that this is a reasonable threat to validity. A possible threat to validity is when you think of something that could have occurred to threaten the validity of the study, but the authors gave no indication that such a problem actually occurred. For example, if the authors specifically write that 11 participants dropped out of a study, this would be a plausible threat to internal validity (attrition). However, if there is no mention of attrition, but you think it could have occurred, this would be a possible threat to internal validity.
Grading Rubric for Article Critique 3
Questions 1 – 4
Full credit: identifies a specific threat to the type of validity in question (2 pts), describes how that threat could or did occur in the study (4 pts), the explanation supports or is relevant to the particular threat mentioned (4 pts). Partial credit may be given if: a problem is described, but the threat is not correctly identified; a threat is identified and supported, but is not a threat to the type of validity in question; the description of the problem does not support either the threat or type of validity.
Question 5
Full credit: identifies a specific threat discussed in answers to problems 1-4 (2 pts), describes how researchers addressed the problem (4 pts), the explanation is relevant to and could actually correct the particular threat mentioned (4 pts). Partial credit may be given if: a solution to a problem is described, but the threat is not correctly identified; a threat is identified and a solution is described, but the solution does not address the threat; the solution to the problem was not actually done by the authors.
Frequently Asked Questions
Question:
How do we recognize threats to Statistical Conclusion Validity if we haven’t had any coursework in statistics?
Answer:
I realize that many threats to Statistical Conclusion Validity will be tough to recognize without a background in statistics. Here are a few hints:
• Low power may be a problem if the study used only a few participants (fewer than 50) and did not find a statistical relationship.
• Fishing may be a problem if the researchers conducted multiple statistical tests on the same data.
• Unreliability of measures may be a problem if the authors report low reliability coefficients for the instruments they used.
• Unreliability of treatment implementation may be a problem if the authors report inconsistencies in how they administered the intervention.
• Violated statistical assumptions are a plausible threat only if the authors specifically state that their data violated the assumptions of the statistical tests they used.
• Similarly, restriction of range is a plausible threat only if the authors report the range of obtained scores alongside the range of possible scores.
Question:
What do the statistics in the results section of the article mean?
Answer:
The Chi Square and p values are simply statistical evidence to support the claims that are made before the symbols and values that are reported. For example, the authors write: “Of the 2001 cohort of 1,193 first-time freshman (773 white; 420 black), 806 (67.6%) students persisted to year two with a significant difference due to race (542-70.1% white; 264-62.9% black), χ²(1) = 6.544, p < .05” (p. 244). The statistics at the end of that sentence are another way of stating that there is a statistical difference in what is being tested. In this case, the difference is between the proportion of White students who persisted and the proportion of Black students who persisted. A greater percentage of White students returned to college for a second year than Black students.
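For readers who want to verify such a result, the chi-square value can be reproduced from the counts quoted above. Below is a minimal sketch using Python’s scipy library (purely illustrative; the article’s authors did not necessarily use this tool), with the 2 × 2 table reconstructed from the quoted persistence counts:

# Reproducing the chi-square test of persistence by race from the quoted counts
# (rows: White, Black; columns: persisted, did not persist).
from scipy.stats import chi2_contingency

table = [[542, 773 - 542],   # White: 542 of 773 persisted
         [264, 420 - 264]]   # Black: 264 of 420 persisted

chi2, p, df, expected = chi2_contingency(table, correction=False)
print(f"chi-square({df}) = {chi2:.3f}, p = {p:.3f}")  # about 6.55, p ≈ .01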
Study: A researcher looked at how gender and sexual orientation influence self-image.
• What type of validity did this researcher focus on?
• Which specific threats did he control for? How does this strengthen his claims?
• What type of validity was overlooked? What specific threats should he have considered?
• Which threats are plausible versus possible?
Research Validity Typologies
Campbell and Stanley (1966) created a validity typology to provide a logical and objective way to evaluate the quality of causal inferences made when using randomized and quasi-experimental (also known as causal-comparative) designs. Originally, they described only two types of validity: internal validity, which addresses the ability to infer that the treatment, rather than something else, caused the observed effect, and external validity, which addresses how well causal relationships generalize across units (people), treatments, observations, and settings. Cook and Campbell (1979) later added statistical conclusion validity, which addresses the ability to establish that the cause and effect covary, and construct validity, which addresses the ability to infer that the research operations sufficiently represent constructs of interest.
As part of their validity typology, Campbell and colleagues included threats to validity: common reasons, within each type of validity, why researchers may be incorrect about the causal inferences they draw from experiments. Although Campbell and Stanley (1966) were the first to describe the threats, Shadish, Cook, and Campbell (2002) have since added to and refined the threats within each type of validity.
Statistical Conclusion Validity
Statistical conclusion validity addresses how well researchers can conclude that cause and effect variables are statistically related. Are the results that we obtain from statistical analyses reasonable conclusions given the data, the statistics used, and the procedure by which the data were obtained?
Threats to Statistical Conclusion Validity
Low Statistical Power
Low statistical power is when the sample size, effect size, or alpha level is not large enough to detect a significant effect for a treatment. Statistical power is the probability of finding an effect in a sample if that effect exists in the population; increasing any of these three parameters also increases statistical power. Although any of these parameters can be changed, the alpha level (or Type I error rate) is usually held constant at .05 in educational research, and the effect size (the magnitude of the relationship) depends on the strength of the treatment or comparison condition. Therefore, the sample size is the easiest of the three to control in an experiment.
Example
Dr. Taylor is looking at the relationship between sex and romantic love style. He has a sample of 10 men and 10 women and measures all 20 participants on their erotic (sensual) and storgic (friendship-based) love styles. He finds that there is no statistically significant difference between men and women on either erotic, t (18) = 2.00, p = .06, or storgic, t (18) = -0.21, p = .84, love styles. However, there was a moderate effect size, d = .54, for the difference between men and women on erotic love. This effect suggests that, while there was a difference, it was not statistically significant due to such a small sample size. On the other hand, the effect size for the difference between men and women on storgic love was small, d = .05, suggesting that the sample size was not responsible for the null effect from the t-test.
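The connection between sample size, effect size, and power in this example can be checked with a power analysis. The following is a minimal sketch using the statsmodels library; the inputs (d = .54, 10 participants per group, alpha = .05) come from the example above, but the code is illustrative rather than anything Dr. Taylor is assumed to have run:

# Power analysis for an independent samples t-test.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

# Power with d = .54 and 10 participants per group: roughly .20, so the
# study had only about a one-in-five chance of detecting a true effect.
print(power_calc.solve_power(effect_size=0.54, nobs1=10, alpha=0.05))

# Participants needed per group to reach the conventional .80 power: roughly 55.
print(power_calc.solve_power(effect_size=0.54, alpha=0.05, power=0.80))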
Violation of Statistical Tests
Violation of statistical tests occurs when data violate assumptions of the statistical tests that are used to assess the relationship between variables. Parametric statistics (t-tests, analyses of variance, and correlations) assume that the data being analyzed have certain characteristics (e.g., continuous variables are normally distributed, variances are similar across categories, and observations are independent of each other). If data do not have these characteristics, the results from those analyses may not be accurate.
Example
Dr. Hoffman is looking at the relationship between college GPA and institution (University of Central Florida and Florida State University) among basketball players. Although there is a statistically significant difference between university teams, t (18) = 3.15, p = .01, he noticed that there is a substantial difference between the means (M UCF = 3.00, M FSU = 2.30) and the standard deviations (SD UCF = 0.65, SD FSU = 0.25). While the mean difference is apparent, the difference between the distributions of scores violates a statistical assumption of the t-test: the variances between groups are similar. We can see from Figure 6.1 that the distribution of scores (variance) is much larger for UCF than FSU. The variances between the groups are not similar; therefore, the results of the t-test may not be accurate.
Figure 6.1: Distribution of GPA for UCF and FSU.
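In practice, this assumption can be checked and corrected for. The sketch below simulates GPA data that mimic the means and standard deviations from the example (the data themselves are hypothetical), tests the equal-variance assumption with Levene’s test, and then applies Welch’s t-test, which does not assume equal variances:

# Simulated GPAs mimicking the example's means and standard deviations.
import numpy as np
from scipy.stats import levene, ttest_ind

rng = np.random.default_rng(42)
ucf = rng.normal(3.00, 0.65, 10)   # M = 3.00, SD = 0.65
fsu = rng.normal(2.30, 0.25, 10)   # M = 2.30, SD = 0.25

# Levene's test: a small p value signals unequal variances.
print(levene(ucf, fsu))

# Welch's t-test (equal_var=False) does not assume equal variances.
print(ttest_ind(ucf, fsu, equal_var=False))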
Fishing
Fishing, or the error rate problem, occurs when multiple statistical tests are conducted on the same data, increasing the likelihood of finding a significant effect by chance alone. When a researcher sets an alpha level (Type I error rate) at .05, he is allowing a 5% chance that the effect that he finds in his sample will not be found in the population. Each additional statistical test performed on the same data increases the overall (familywise) alpha rate. Therefore, repeatedly running tests on a sample increases the likelihood of finding some effect, but decreases the accuracy of the statistical results.
Example
Ms. Castilla is examining the relationship between the number of romantic partners people have and their love styles. Unlike Dr. Taylor, she is looking at all six of Lee’s (1988) love styles: erotic, ludic, storgic, pragmatic, manic, and agapic. She wants to see if there are differences between each of the six love styles. To examine all paired comparisons, she runs 15 t-tests. Unfortunately, running 15 tests could increase the alpha rate by as much as 15 times the original alpha rate. Therefore, if each t-test is run with an individual alpha of .05, her overall alpha could be inflated to as high as .75 (15 × .05). Although she will very likely find a significant difference between some of these love styles, some of these differences may be due to statistical chance, not because there is a true difference in the population.
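The arithmetic behind this inflation, and the standard Bonferroni remedy, can be worked out in a few lines. In the sketch below, .75 is the worst-case (Bonferroni) bound quoted above; if the 15 tests were fully independent, the familywise rate would be about .54:

# Familywise error rate for all pairwise comparisons of 6 love styles.
from math import comb

k = comb(6, 2)   # 15 pairwise comparisons
alpha = 0.05

print(k * alpha)              # 0.75: the worst-case (Bonferroni) bound
print(1 - (1 - alpha) ** k)   # ~0.54 if the 15 tests were independent

# Bonferroni correction: test each comparison at alpha / k instead.
print(alpha / k)              # ~0.0033 per test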
Reliability of Measurement
Reliability of measurement is a concern when the instrument used to measure the construct (thinking or behavior) of interest does not provide consistent scores or assessments.
Example
Dr. Clark likes taking Facebook personality quizzes. She recently ran across one that was supposed to tell her which Disney princess she was most similar to. When she first took the test, the result indicated that her personality was most like that of Cinderella. She thought this was an absurd result, so she retook the quiz. The second result informed her that she was most like Mulan. While this seemed like a more valid outcome, it was clear that this was not a reliable test, as it gave different answers each time Dr. Clark took the test.
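With a numeric measure, test-retest reliability is typically quantified by correlating scores from two administrations. Here is a hedged sketch with simulated data (a hypothetical stable trait measured twice with a very noisy instrument), showing how a low test-retest correlation signals unreliability:

# Test-retest reliability: correlate two administrations of a measure.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
trait = rng.normal(50, 10, 100)          # hypothetical stable trait
time1 = trait + rng.normal(0, 15, 100)   # noisy measurement, occasion 1
time2 = trait + rng.normal(0, 15, 100)   # noisy measurement, occasion 2

r, p = pearsonr(time1, time2)
print(r)   # around .3 here, far below the ~.80 often expected of instruments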
Reliability of Treatment Implementation
Reliability of treatment implementation is a concern when a treatment, program or intervention is implemented differently for different participants in a study.
Example
Dr. Cundiff is hoping to increase college retention among first year college students by implementing a program that focuses on social integration (meeting new people on campus, attending sporting events, joining clubs). Given the number of students who are interested in enrolling in the program, she cannot possibly run all of the program sessions herself. Therefore, she recruits two graduate assistants, Stan and Eric, to help her. While Stan is very thorough and conscientious about the instructions that he gives to his students, Eric frequently gives unclear instructions and fails to take his students to several social events. In this case, because Eric and Stan have such different ways of implementing the treatment, it may be difficult to tell if the treatment is effective when comparing the retention rates of those in the program to those who weren’t in it.
Restriction of Range
Restriction of range is a problem when the scores obtained from a sample on a particular variable do not span the total possible range of values. Very often when this threat is plausible, so is confounding constructs with levels of constructs (see threats to construct validity).
Example
Mr. Martin hypothesizes that college entrance exam scores predict first-year grade point average. He collects data from Harvard and finds that there is no relationship between ACT and GPA. Thinking that this was a strange result, he replicates his study at the University of Central Florida and finds that there is a relationship. In hopes of finding an explanation for why the two universities had different results, he examines his data distributions (Figures 6.2 and 6.3). While both distributions have data clustered in the top right quadrant of the figures, the Harvard data are restricted to that quadrant. For the Harvard sample, GPAs are restricted between 3.5 and 4.0 and ACT scores are restricted between 30 and 36. This restricted range is likely why he did not find a relationship between GPA and ACT for the Harvard sample.
Figure 6.2: Relationship between ACT and GPA for Harvard.
Figure 6.3: Relationship between ACT and GPA for UCF.
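The attenuation Mr. Martin observed can be reproduced by simulation: generate related ACT and GPA scores, then keep only the high scorers, as in the Harvard sample. The data and cutoffs below are hypothetical, chosen only to mirror the example:

# Restriction of range attenuates correlations: simulate related ACT and
# GPA scores, then keep only high scorers (the "Harvard" scenario).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
act = rng.uniform(15, 36, 2000)                               # hypothetical ACT scores
gpa = np.clip(1.0 + 0.08 * act + rng.normal(0, 0.35, 2000), 0.0, 4.0)

print(pearsonr(act, gpa)[0])            # strong correlation in the full range

top = (act >= 30) & (gpa >= 3.5)        # restrict to the Harvard-like quadrant
print(pearsonr(act[top], gpa[top])[0])  # much weaker in the restricted range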
Heterogeneity of Units
Heterogeneity of units is when the variability of scores on the outcome or dependent variable is so large within groups that it is difficult to detect an effect between groups. The variability of scores is how different scores are from each other. High variability is when some people in a group or condition have very high scores and other people have very low scores.
Example
Dr. Hoffman adds a third university to his study in which he is looking at the relationship between GPA and institution. This time he compares the University of Central Florida to the University of Florida. Unlike the previous analyses, he finds no significant difference between university teams, t (18) = 1.73, p = .10. Again, he noticed that there is a substantial difference between the means (M UCF = 3.00, M UF = 2.50); however, this time the standard deviations are large for both groups (SD UCF = 0.65, SD UF = 0.65). Finding this odd, he randomly selects players from each university’s football team and compares their GPAs. Although the means are the same as those found for the basketball players (M UCF = 3.00, M UF = 2.50), the standard deviations are much smaller (SD UCF = 0.45, SD UF = 0.45). Therefore, he does find a significant difference between the football teams, t (18) = 2.55, p = .02. Figures 6.4 and 6.5 show the difference in the standard deviations for each comparison.
Figure 6.4: Distribution of GPA for UCF and UF basketball players.
Figure 6.5: Distribution of GPA for UCF and UF football players.
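The role of within-group variability is easy to see by recomputing the t statistic from the summary statistics in the example. The sketch below uses scipy’s ttest_ind_from_stats with the means, standard deviations, and group sizes reported above (the reproduced t values are approximate):

# Same mean difference, different spreads: t shrinks as variability grows.
from scipy.stats import ttest_ind_from_stats

# Basketball: M = 3.00 vs 2.50, SD = 0.65 per group, n = 10 per group.
print(ttest_ind_from_stats(3.00, 0.65, 10, 2.50, 0.65, 10))  # t ~ 1.7, ns

# Football: the same means, but SD = 0.45 per group.
print(ttest_ind_from_stats(3.00, 0.45, 10, 2.50, 0.45, 10))  # t ~ 2.5, p < .05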
Inaccurate Effect Size Estimation
Inaccurate effect size estimation is when the statistic used to analyze the data tends to over- or underestimate treatment effects. This most often occurs when the sample size is extremely high or low, as many statistics are sample-size dependent; that is, they are easily influenced by the sample size. It can also occur when an inappropriate statistic is used, such as using a related samples (within-subjects) statistic for a between-subjects design.
Example
Dr. Rodriguez is looking at the difference between Republicans and Democrats in terms of their income. Since all of the people in his sample identified as either Republican or Democrat, this was a between-subjects design (different people in each condition). However, he mistakenly ran a related samples t-test for within-subjects designs on his data and found a statistically significant difference, t (63) = 2.10, p = .04. Realizing his mistake, he re-analyzed the data using an independent measures t-test for between-subjects designs. Because the independent samples test evaluates the difference against an error term and degrees of freedom appropriate for unrelated groups, he did not find a significant difference this time, t (126) = 1.90, p = .06.
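Dr. Rodriguez’s mistake is easy to reproduce. The sketch below runs both tests on hypothetical between-subjects income data; the related samples test treats unrelated people as pairs and evaluates the result against the wrong degrees of freedom:

# Hypothetical between-subjects income data, 64 people per party.
import numpy as np
from scipy.stats import ttest_rel, ttest_ind

rng = np.random.default_rng(3)
republicans = rng.normal(62000, 18000, 64)
democrats = rng.normal(58000, 18000, 64)

# Mistake: a related samples test pairs unrelated people (df = 63).
print(ttest_rel(republicans, democrats))

# Correct: an independent samples test for between-subjects data (df = 126).
print(ttest_ind(republicans, democrats))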
Extraneous Variance in the Experimental Situation
Extraneous variance in the experimental situation occurs when distractions or events that are not part of the treatment, program, or intervention take place during an experiment.
Example
Monsieur Dassi has been assigned to teach a 7th grade French course over the summer. However, Bartow Middle School is being renovated; therefore, his class has been moved to a modular classroom (a trailer for those of you who have not had the joy of teaching in one). As it turns out, the classroom is located near the baseball field and does not include an air conditioner. Therefore, the students are constantly distracted by the noise from the baseball games and the summer heat. While the noise and heat are not part of learning French, they are likely to negatively impact students’ learning and performance. Therefore, when Monsieur Dassi compares the summer midterm exams with the exam scores from his previous semester he finds that the summer school students performed significantly worse than the spring semester students, t (42) = -2.35, p = .02. Although it appears that Monsieur Dassi is doing a poorer job teaching, the more plausible explanation is that the students cannot comfortably attend to his lectures.
Internal Validity
Internal validity addresses how well researchers can make causal inferences about the relationship between predictive or independent variables, presumed “causes”, and outcomes or dependent variables, their “effects”.
Threats to Internal Validity
Ambiguous Temporal Precedence
Ambiguous temporal precedence is the inability to determine which variable occurred first, thereby preventing the researcher from knowing which variable is the cause and which is the effect. This commonly occurs when the independent variable is not manipulated and researchers measure the independent and dependent variables on a single occasion.
Example
Assume that you give a class of college freshmen two questionnaires: one that measures academic motivation and the second that measures conscientiousness. Because both variables were measured at the same time and neither variable was manipulated, there is no way to know which characteristic occurred first. Therefore, we cannot reasonably claim that conscientiousness causes greater academic motivation (or vice versa), even if there is a significant correlation between the two variables.
Selection
Selection is when systematic differences between the characteristics of units in each treatment condition, arising from a non-random assignment procedure (either self-selection or criterion-based assignment), could affect the observed effect. Selection most commonly occurs when it is not feasible or ethical to randomly assign participants to treatment conditions.
Example
Although Head Start educational programs are available to any child whose parents are below the poverty level, enrollment into the program is voluntary. Because parents elect to enroll their children in the educational program, it is possible that children who are enrolled in Head Start may be characteristically different (i.e. have more at-home instruction, better nutrition, better family support) than those who do not enroll. These differences between the treatment (those in Head Start) and control groups (those who are eligible, but do not enroll) may be what causes the differences between groups on a dependent variable, such as preparedness for kindergarten.
History
History is when events occur simultaneously with the treatment and could affect the outcome. This is a concern in any study in which participants cannot be isolated from outside influences that may affect the outcome.
Example
Suppose that a high school physical education instructor, Mr. Rodriguez, wants to develop and test a rigorous exercise program. He assigns half of his gym class to the new program and the other half to the regular physical education tasks. He expects that students in the new program will demonstrate increased physical stamina and reduced body mass. However, at the same time he implements his exercise program, the school cafeteria begins a new “healthy options” menu, in which students are served meals with less fat and more protein. Unless all the students in his gym class eat the cafeteria’s lunches, it will be difficult for Mr. Rodriguez to know whether his exercise program or the cafeteria’s menu caused any changes in students’ stamina and body mass.
Maturation
Maturation is when a natural development over time occurs during the same time as the treatment and could affect the outcome. This problem is similar to history, but it is most likely to occur among groups of participants who will mature relatively quickly over the period during which within-group measures are taken. Typically, maturation can be assessed using the same methods used for history.
Example
Quackery Pharmaceutical Company is marketing a new drug that increases height in boys. To determine the effectiveness of their medication, they give their height pill to 13-year-old boys for two years and find that the boys, on average, have grown six inches. Despite these remarkable results, it is unlikely that the increase in the boys’ height is due to the pills, but rather to their natural growth during puberty.
Regression
Regression (also known as Regression to the Mean) occurs when participants are selected because of their extreme scores but have less extreme scores on other measures, making it appear as if an effect occurred. This most often occurs when researchers use unreliable measures or heterogeneous units to compare participants’ treatment and control scores. The control score can either be a pretest measure taken from the same participant or an outcome measure taken from another participant in a control group who is matched to the treatment participant using a matching variable.
Example
Dr. Feelgood assessed the effectiveness of an alcohol treatment intervention for college students. He measured the number of drinks they had over Spring Break and then again after eight weeks of treatment. He found that students’ drinking decreased dramatically after the intervention. However, this was not because the treatment was so effective, but because the students had been drinking more than normal during the time that the pretest measure was administered. Even if the treatment had no true effect, he would have seen a decrease in drinking simply because students had returned to their normal (more moderate) drinking behavior in May.
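Regression to the mean can be demonstrated with a simulation in which no treatment exists at all: measure people on an atypically extreme occasion, then again under normal conditions. All numbers below are hypothetical:

# Regression to the mean with no treatment at all.
import numpy as np

rng = np.random.default_rng(5)
typical = rng.normal(6, 2, 500)                  # usual weekly drinks
spring_break = typical + rng.normal(4, 3, 500)   # atypical spike plus noise
followup = typical + rng.normal(0, 3, 500)       # back to normal, new noise

print(spring_break.mean(), followup.mean())      # ~10 at pretest, ~6 later

# Selecting the heaviest pretest drinkers makes the "improvement" look
# even larger, because they were partly extreme by chance.
heavy = spring_break > np.percentile(spring_break, 75)
print(spring_break[heavy].mean(), followup[heavy].mean())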
Attrition
Attrition (also known as Mortality) occurs when units drop out of the experiment after being assigned to treatment conditions. This is less problematic if they drop out randomly (i.e., the reason they drop out is not related to the experiment) than non-randomly. However, if participants who drop out of the experiment are systematically different in their responses than those who remain, this type of attrition, known as differential attrition, is a considerable threat to validity.
Example
Madame Lebos is looking at the difference in learning French through a fully online course versus a face-to-face course. She measures students’ academic motivation and French speaking skills at the beginning of the semester and again at the end of the semester. She finds that there is no difference between the two groups in their French speaking skills or their motivation at the beginning of the semester. Over the course of the semester, 10 out of 20 students dropped out of the online course and 5 out of 20 students dropped out of the face-to-face course. At the end of the semester, there was no difference between groups in their French speaking skills, but those who dropped out of the courses had lower academic motivation than those who remained. If academic motivation influenced the students’ performance, the null result at posttest is likely inaccurate, because the online course lost twice as many of its less motivated students as the face-to-face course.
Testing
Testing occurs when repeatedly exposing units to a test may permit them to learn the test, appearing as if a treatment effect occurred. This is likely to occur when an outcome is measured more than once from the same participants.
Example
Oviedo Education Center wanted to assess the effects of coaching on the SAT I: Reasoning test. The treatment group received ten weekly sessions of test preparation and was compared to a control group who did not receive coaching. Although participants had been randomly assigned to conditions, many of the participants had taken the SAT more than once. Not only did the treatment group show an increase in SAT scores from previous testing, but so did the control group. Therefore, students in both the treatment and control groups had benefited from repeated exposure to the test.
Instrumentation
Instrumentation occurs when changes in the instrument used to measure responses provide different results for the outcome, appearing as if an effect occurred. This often occurs when observations are taken over an extended period of time and a measure is revised.
Example
Dr. Julie wanted to examine the effectiveness of a psychological intervention on behavioral problems in adolescents. Pretest measures of behavior were taken using the Personality Inventory for Children (PIC). However, when the posttest measures were taken a year later, the PIC had been replaced with the PIC-2. If the revised version of the PIC was very different from the original measure, any change in the pretest and posttest scores could have mistakenly been attributed to the intervention rather than due to the change in the instrument.
Additive and Interactive Threats to Internal Validity
Additive and interactive threats to internal validity occur when the impact of a threat can be compounded by, or may depend on the level of, another threat.
Example
Dr. Emami, a school psychologist for Howell L. Watkins Junior High, wanted to implement and test a program intended to increase emotional intelligence among young adolescents. She first measured their emotional intelligence at the beginning of their seventh-grade year and again at the end of their eighth-grade year. The children were exposed to monthly exercises aimed at improving their emotional intelligence for two years. Although all of the children showed an increase in their emotional intelligence, she found that girls had higher scores than boys at posttest. However, her study was plausibly threatened by the interactive effects of maturation and selection. Not only were children likely to have grown more emotionally intelligent over the course of two years naturally, but girls were likely to mature at a faster rate than boys. Therefore, any effects that she may have found from her intervention may have been due to girls’ increased emotional maturity.
Construct Validity
Construct validity addresses how well researchers can infer that the research operations sufficiently represent constructs of interest. A construct is a broad concept, idea, or behavior that we are usually interested in studying, but it is not directly tangible. An operation is a tangible way to measure a construct. For example, student achievement is a construct and the Florida Comprehensive Assessment Test (FCAT) is the operation; that is, the FCAT is how we measure student achievement. There are often several ways to operationalize a construct; therefore, it is important to establish that the particular operation that is used to represent a construct is suitable.
Threats to Construct Validity
Inadequate Explication of Constructs
Inadequate explication of constructs is when a construct is not clearly defined, making it difficult to sufficiently operationalize it. This often occurs when there are several different ways of defining a construct.
Example
R. B. Cattell proposed that there were two different factors of intelligence: fluid and crystallized (Cattell, 1963). Fluid intelligence is innate or hereditary intellectual ability, which is often demonstrated through logical reasoning, problem solving, and pattern recognition. Crystallized intelligence is knowledge that is learned or acquired through experience or one’s environment. While Cattell recognized different types of intelligence, many researchers may only recognize one type. Therefore, if a researcher considers intelligence to be only natural abilities, then he will likely select an operation that measures only fluid intelligence. If this is the case, it is important that he clearly define his idea of intelligence within his research paper or proposal.
Construct Confounding
Construct confounding is when an operation or measure involves more than one construct. This frequently occurs when constructs are not clearly defined or tend to be strongly related to each other.
Example
The Hamilton Depression Rating Scale purports to measure depression (Hamilton, 1960). However, there are several characteristics that operationalize both depression and anxiety. Therefore, when asking about sadness, difficulties in eating and sleeping, and fatigue, researchers may actually be measuring both depression and anxiety, not just depression.
Mono-Operation Bias
Mono-operation bias is when only one operation or measure is used to assess a construct. Although this is a relatively frequent practice in research, mostly because we want to measure several constructs in as little time as possible, it is not recommended when making critical decisions or diagnoses.
Example
Dr. Robertson is a school psychologist who has been asked to assess elementary school students for Attention Deficit Hyperactivity Disorder (ADHD). She uses the ADHD Rating Scale IV (DuPaul, Power, Anastopoulos, & Reid, 1998) to get the teachers’ evaluations of children’s behaviors in the classroom. Although the measure is valid and reliable, she is using only a single measure from a single source (the teacher) to make a critical decision about children’s behaviors. In most cases, diagnoses that strongly impact children will be made using several measures.
Mono-Method Bias
Mono-method bias is when all operations or measures use the same method to assess a construct. While any study with mono-operation bias will also have mono-method bias, we can have mono-method bias without mono-operation bias. This can be avoided by using multiple methods to operationalize a construct. Methods can include written questionnaires, interviews, observations, self-reports, and other-reports (someone who knows the participant well assesses his or her behavior).
Example
Suppose that Dr. Robertson decides to add a second measure to her assessments of children that she screens for ADHD. In addition to having teachers evaluate students using the ADHD Rating Scale IV (DuPaul, et al., 1998), she also asks them to use the SNAP IV (Swanson, Nolan, & Pelham, 2008). In this case, she has ruled out mono-operation bias, but not mono-method bias since she is still asking only the teachers to provide ratings of children’s behaviors. To rule out mono-method bias, she could also ask the parents of the children to assess their children’s behavior at home using the SNAP IV or the ADHD Rating Scale IV. This way she would have two written other-reports, but using different raters. She could also interview the child (verbal self-report) or observe the child herself in the classroom (observational) to rule out mono-method bias.
Confounding Constructs with Levels of Constructs
Confounding constructs with levels of constructs is when an inference is made about a construct based on a restricted range of levels of that construct (i.e. only high or low values of performance). Very often when this threat is plausible, so is restriction of range (see threats to statistical conclusion validity).
Example
Dr. Dollinger wanted to see if there was a relationship between students’ intrinsic academic motivation (attending class for the pleasure of learning) and openness to new experiences (willingness to learn new things). Since his study was not part of the course curriculum, students were not required to attend class on the day that he collected data. He ran a correlation between the two variables and found that there was no relationship between intrinsic motivation and openness to new experiences, r (18) = .15, p = .52. However, when he looked more closely at the distribution of scores, he noticed that all of the students who participated in the study had high scores on intrinsic motivation. The following semester, he offered extra credit to students in exchange for attending class to participate in his study. This time when he ran a correlation between the two variables, he found that those with higher scores on intrinsic motivation were more open to new experiences, r (34) = .42, p = .01. Not surprisingly, he noticed that there was much more variability in the range of intrinsic motivation scores.
Treatment Sensitive Factorial Structure
Treatment sensitive factorial structure occurs when participants respond differently to an operation because they were exposed to a treatment, program or intervention. This most often occurs when they learn new ways to interpret or perceive the operation.
Example
Dr. Dicken offers a workshop in Gay, Lesbian, Bisexual and Transgender (GLBT) sensitivity to residence assistants on campus. Since the program is voluntary, she uses a pretest-posttest design to assess the effectiveness of her program. Before the workshop, she gives the participants a test of GLBT sensitivity, and then has them take the test again after the workshop. She is pleased to find that the scores on this measure increase from pretest to posttest, t (24) = 2.3, p = .03. Naturally, she assumes that the increase in GLBT sensitivity is due to her exceptional training program. However, she has the good sense to talk to several of her participants after the workshop and learns that, while they did not think they grew more compassionate towards those in the GLBT community, they did learn more about GLBT issues. The workshop had not affected sensitivity, but had taught the participants vocabulary terms, laws, and campus conflicts, which changed how the participants interpreted some of the items on the posttest.
Reactive Self-Report Changes
Reactive self-report changes occur when participants are motivated to alter their responses on a measure. This is most likely to occur when the participants believe that they will be rewarded (or avoid punishment) by having especially high or low scores on a measure.
Example
Mr. Douglass is a mental health evaluator for the State of Florida. It is his job to determine whether or not people suffering from mental health problems qualify for government disability support. When he meets his new client, Ms. Miller, she informs him that she is “crazy” and needs him to verify this so that she can get a “crazy check”. As he is giving his mental competency assessment to Ms. Miller, he asks her, “How many legs does a dog have?” To which, she replies, “Three.” Knowing that she has a dog, he says “Now Ms. Miller, you know that a dog doesn’t have three legs.” In psychological assessment, this is also called faking bad or malingering. Ms. Miller is intentionally giving answers that she believes will make her appear mentally unstable to ensure that she will be eligible for mental health disability support.
Experimenter Expectancies
Experimenter expectancies occur when experimenters have certain expectations about participants’ thinking or behaviors, which may bias how they interpret participants’ responses. This can be avoided by not telling the experimenter which condition participants are in.
Example
Rosenthal and Jacobson (1966) randomly selected elementary school children to be “bright” or “dull”. The teachers of these students were informed as to which students were “bright” and which ones were “dull” based on an abilities test. The researchers noticed that the teachers responded more positively to the “bright” children than to the “dull” children. The “bright” children were treated more warmly, received more attention, and their responses were viewed as correct more often than the “dull” children. Because the children had been randomly assigned to be “dull” or “bright”, there was really no difference in the children’s intellectual ability. However, because the teachers believed that the “bright” children were indeed smarter, those children had higher performance ratings than the “dull” children.
Reactivity to the Experimental Situation
Reactivity to the experimental situation occurs when participants respond differently under experimental conditions because they are aware that they are being evaluated. Two types of reactivity to the experimental situation are evaluation apprehension and hypothesis guessing. Evaluation apprehension is when participants try to provide more favorable responses because they are anxious about being evaluated. Hypothesis guessing is when participants attempt to guess the researchers’ hypotheses and alter their responses to support or refute those hypotheses. One way to avoid reactivity to the experimental situation is to prevent the participants from knowing either what condition they are in or what the study is about (within ethical guidelines).
Example
Dr. Korchmaros is interested in studying helping behavior among sixth graders. To measure helping behavior she gives them a self-report, written questionnaire that asks them to indicate how often and under what circumstances they would assist other people in need. When she examines the results from this measure, she notices that all of the children had very high scores on altruism. She is surprised by this and decides to set up a scenario in which a third grade student drops her books in front of individual sixth grade students. Dr. Korchmaros observes how often the sixth grade students stop to help the younger child. Unlike the results from the questionnaire, she finds that only 30% of the sixth graders actually assisted the third grader with her books. When operationalizing helping behavior using the questionnaires, the children were aware that their intent to help was being observed and responded favorably to the questions. However, when they were not aware that their behavior was being assessed, they were less likely to engage in helping behavior.
Novelty and Disruption Effects
Novelty and disruption effects occur when participants respond especially well or poorly to a new intervention, simply because it is different from their usual routine.
Example
Miss Frey has noticed that several of her dance students are becoming disillusioned by their regular exercise schedule. She decides to replace the traditional warm-up exercises with yoga stretches instead. Not knowing whether this will improve their flexibility over the traditional exercises, she uses a pretest-posttest design in which she compares their ease of movement after doing the traditional warm-up exercises and after doing yoga. She finds that some of her students are able to dance more fluidly after doing yoga, while others dance better after the traditional warm-ups. Because she notices a dramatic difference in dance performance depending on the style of warm-up, it appears that the exercises have an effect on performance, but not a consistent one. In truth, the yoga exercises are not better than the traditional warm-up exercises; rather, the dancers are reacting to the change in routine. Some dancers like the novelty of doing something different, adapt well to the change, and dance better after yoga than after the traditional exercises. Other dancers find the new routine disruptive, respond poorly, and do not dance as well after doing yoga as they did after the traditional exercises.
Compensatory Equalization
Compensatory equalization is when the treatment, program or intervention provides desirable services and an administrator or staff member provides similar services to compensate the control group.
Example
Orange County Elementary School is interested in developing a new program intended to improve reading skills for children with learning disabilities (LD). To obtain an unbiased treatment effect, Ms. Tamkin, an educational researcher, randomly assigns children with LD to be in either the treatment condition (the new program) or the control condition (regular school curriculum). Unfortunately, when several of the teachers who have LD students learn that not all of their students will receive the new intervention, they bring in teachers’ assistants to tutor the LD children who were put in the control condition. Not knowing that the control group is receiving an alternative treatment, Ms. Tamkin is disappointed to learn that the new LD program is not effective when she compares reading skills between the treatment and control groups at posttest, t (58) = 0.54, p = .59.
Compensatory Rivalry
Compensatory rivalry is when the treatment, program or intervention provides desirable services and the control group attempts to find similar treatments or services on its own.
Example
Mr. Benitone is a hypnotherapist who wants to know how effective hypnotherapy is at reducing weight among those with Type 2 diabetes. He randomly assigns participants to either receive hypnotherapy for weight loss (the treatment group) or to continue with their normal eating and exercise routine (the control group). Dismayed that they were not given the treatment, several people in the control group begin to exercise more and eat food with less fat and cholesterol. In essence, the control group has sought out and followed its own treatment program for reducing weight. When Mr. Benitone compares the mean weights of those in the treatment group to the control group, he finds that the treatment group did not weigh less than the control group, t (32) = -1.22, p = .23. Assuming that the control group had continued with their normal routine, he concludes that hypnotherapy is not an effective weight loss solution.
Resentful Demoralization
Resentful demoralization is when the control group resents not receiving a desirable treatment, program or intervention and responds more negatively on outcomes or dependent variables.
Example
Miss Provence is a 5th grade math teacher who has developed a series of interactive games that are intended to improve students’ math skills. Not knowing whether these games are worthwhile, she randomly assigns half of her students to spend two hours each week playing the math games and the other half to work independently on word problems. When the children in the control group learn that those in the treatment group get to play math games, they feel resentful and make less of an effort to complete their own math assignment. When Miss Provence compares the groups at posttest on their math skills, she finds a sizable effect in favor of the interactive games, d = .44. While the math games may have been more effective in teaching math than the word problems, the effect size is inflated because the children in the control group stopped trying to learn math.
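How a demoralized control group inflates the effect size follows directly from how Cohen’s d is computed: d is the mean difference divided by the pooled standard deviation, so anything that lowers control scores raises d. A hedged sketch with hypothetical posttest scores:

# Cohen's d for two independent groups, using the pooled standard deviation.
import numpy as np

def cohens_d(g1, g2):
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * np.var(g1, ddof=1) +
                  (n2 - 1) * np.var(g2, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(g1) - np.mean(g2)) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
games = rng.normal(75, 10, 15)     # hypothetical treatment posttests
control = rng.normal(72, 10, 15)   # control children trying their best
demoralized = control - 5          # same children, reduced effort

print(cohens_d(games, control))       # modest d
print(cohens_d(games, demoralized))   # inflated d, with no better treatment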
Treatment Diffusion
Treatment diffusion is when the control group learns about the program or intervention from the treatment group and also begins to use it.
Example
Dr. Weston has been asked by several of her students to give them a study guide when studying for their statistics exams. Although Dr. Weston does not believe that giving them a study guide will actually help them prepare for her exams, she decides that it would be prudent to test her hypothesis before refusing to give her students a study guide for their final exam. To ensure that she obtains an accurate treatment estimate, she randomly assigns only half of her students to get study guides for their midterm exam. When the students realize that only some of them got study guides, those who received the study guides copied them and shared them with their friends. In this case, many of those who were assigned to the control group had actually received the same intervention as the treatment group. Therefore, when Dr. Weston assessed the effectiveness of the study guides, she found that they did not help improve students’ test scores, t (18) = 0.09, p = .93.
External Validity
External validity addresses how well researchers can generalize causal relationships across units, treatments, observations, and settings. Essentially, this considers how well researchers can infer that a causal relationship found under one set of experimental conditions will hold for other people, treatments, measures, and settings of interest.
Threats to External Validity
Interaction of the Causal Relationship with Units
Interaction of the causal relationship with units is when the results of a study cannot be generalized to another group of participants.
Example
Dr. Howarth investigated the relationship between public speaking anxiety and audience perceptions of speakers among college students. She found that those with high levels of speaking anxiety believed that their audience thought that their speeches were terrible, whereas those with low levels of speaking anxiety were either not concerned with audience perceptions or believed that the audience enjoyed their speeches. While her findings were compelling, because she had limited her sample to first-year college students, she does not know whether or not she would find the same results among college seniors, graduates or those who had never attended college. In this case, it is not clear that she can generalize the relationship that she found between public speaking anxiety and audience perceptions beyond first-year college students.
Interaction of the Causal Relationship over Treatment Variations
Interaction of the causal relationship over treatment variations is when the results of a study cannot be generalized to a different type or part of a treatment, program or intervention.
Example
Dr. Walker introduces his obese patients to a weight loss program that has three main components. He proposes that to effectively lose weight, patients must reduce and limit the amount of fat they eat, increase and maintain the amount of protein they eat, and exercise at least four times a week for 30 minutes. He has found that patients who adhere to all three components of his program are able to significantly decrease their body mass index (BMI) within six months, t (84) = -2.41, p = .02. However, many of his patients change their eating habits but do not exercise. Because these patients do not follow the complete regimen, it is unlikely that they will obtain the same effects as those who do follow all three components of the program.
Interaction of the Causal Relationship with Outcomes
Interaction of the causal relationship with outcomes is when the same results of a study cannot be found if the researchers use a different operation or measure of the outcome or dependent variable.
Example
Dr. Floyd is looking at the relationship between birth order and intelligence. A previous study found that firstborn children tend to be more intelligent than lastborn children (Belmont & Marolla, 1973). However, when Dr. Floyd conducted his own study, he found that there was no difference in intelligence between first and lastborn children. While there are several possible explanations for this difference (the year each study was conducted, the size of the families studied), one of the differences was that he and the previous researchers used different instruments to measure intelligence. Dr. Floyd had used the Woodcock-Johnson IV and Belmont and Marolla had used the Raven Progressive Matrices. If the Woodcock-Johnson IV and the Raven Progressive Matrices measured intelligence differently, then the effects found in the two studies would not be the same even if they had been conducted at the same time using the same participants.
Interaction of the Causal Relationship with Settings
Interaction of the causal relationship with settings is when the results of a study cannot be generalized to another setting.
Example
Dr. Lancaster is a clinical psychologist who specializes in Post Traumatic Stress Disorder (PTSD). As a graduate student, he developed a treatment plan to reduce anxiety in his clients. To test the effectiveness of his intervention, he uses a pretest-posttest design and a sample of people within the community who seek treatment at the University Clinic in Savannah, GA. He finds that his treatment is effective in reducing his clients’ anxiety, F (2, 54) = 3.22, p = .03. Therefore, when he takes a position at a Veterans’ Affairs hospital in St. Paul, Minnesota, he uses his treatment plan with his new clientele. To his dismay, he finds that it is not as effective, F (2, 42) = 1.72, p = .18. It is possible that the reason for the difference in his findings is that he changed the setting in which he was treating clients. In his first study, he was treating clients in a university clinic in Georgia; in his second study, he was treating clients in a VA clinic in Minnesota.
Context-Dependent Mediation
Context-dependent mediation is when an explanatory mediator of one causal relationship would not mediate a similar relationship in another context. In a mediational relationship, an independent variable causes a change in a mediator, and the mediator causes the effect in the dependent variable. Therefore, the causal relationship between the independent and dependent variables depends on the mediator. If the mediator is not present, the relationship between the independent and dependent variables does not exist. In context-dependent mediation, the mediator may not have the same relationship with either the independent or dependent variable when the situation or circumstances of the relationship are changed.
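A mediated relationship of this kind is commonly estimated with a pair of regressions (the Baron and Kenny approach): path a predicts the mediator from the independent variable, path b predicts the outcome from the mediator while controlling for the independent variable, and the product a × b estimates the indirect effect. The sketch below uses simulated data, not any teacher’s actual grades:

# A simple mediation sketch: X -> M -> Y, estimated with two regressions.
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(0, 1, 300)             # independent variable (exam 1 grade)
m = 0.6 * x + rng.normal(0, 1, 300)   # mediator (study motivation)
y = 0.7 * m + rng.normal(0, 1, 300)   # dependent variable (exam 2 grade)

a = np.polyfit(x, m, 1)[0]            # path a: X predicts the mediator

design = np.column_stack([np.ones_like(x), x, m])
b = np.linalg.lstsq(design, y, rcond=None)[0][2]   # path b: M predicts Y, controlling for X

print(a * b)   # indirect (mediated) effect, roughly 0.6 * 0.7 = 0.42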
Example
Ms. Bath found that students in her 12th grade English class tended to improve their exam grades over time. In particular, students who made Ds and Fs on the midterm exam often made Cs and Bs on the final exam. Hearing this, Ms. Bonnard hypothesized that students in her 11th grade algebra class would likewise improve their scores on the second exam after receiving poor scores on the first exam. To her surprise, Ms. Bonnard’s students did not improve on the second exam. Contrary to Ms. Bath’s students, the students who made Ds and Fs on the first exam all earned Fs on the second exam. Puzzled as to why they had found different results, the two teachers asked their students about their grades. The poor performers in Ms. Bath’s class said that when they saw the poor grades on their midterm (independent variable), they were motivated to study harder (mediator), and earned higher grades on the final exam (dependent variable). However, the poor performers in Ms. Bonnard’s class said that when they saw their poor grades on the first exam (independent variable), they felt discouraged and studied less (mediator), and earned poorer grades on the second exam (dependent variable). Although in each of these situations the relationship between the exam grades was mediated by students’ motivation, the direction of that motivation differed depending on the context.