Empirical Methods in Political Science: An Introduction

By Justin Zimmerman

9.1 Introduction

The field of political science has traditionally focused on the importance of hypothesis testing, causal inference, experiments, and the use of large n data. Quantitative methods in all their capacities are without a doubt important, but what can be lost at times is the value of small n methods of inquiry within the field of political science. Researchers such as Katherine Cramer, Cathy Cohen, Reuel Rogers, and Jennifer Hochschild et al. have all used small n methods to tell stories about particular groups that have rarely been highlighted in political science. Whether it is identifying rural consciousness in Wisconsin (Cramer 2016), researching the secondary marginalization of the most disenfranchised in the black community (Cohen 1999), explaining the unique political stances of Afro-Caribbean immigrants (Rogers 2006), or highlighting the politics of a new racial order (Hochschild, Weaver, and Burch 2012), small n data allow a researcher to discover new information not easily attainable through quantitative methods alone. Small n methods allow for a more in-depth assessment of a particular area and people.

This chapter will focus on the importance of small n research. The chapter will highlight the various methods for conducting small n research, including interviews, participant observation, focus groups, process tracing, and ethnography, as well as the various procedures for determining case selection. First, the chapter will elaborate on the differences and goals of small n research as compared to quantitative research.

9.2 Background

To be a well-rounded political scientist, it is important to understand that not every question can be answered through quantitative methods alone. There are times when small n methods are the more appropriate option. Yet, how does a researcher decide when small n methods are appropriate for their research? The researcher must be able to identify the differences and purposes of small n qualitative research and quantitative research. First, quantitative research focuses on the effects of causes, while qualitative research focuses on the causes of effects. In other words, quantitative research, especially with regard to causal inference, aims to figure out whether a particular treatment causes a particular outcome, such as whether an increase in an individual’s education causes them to be more politically mobilized.

Small n qualitative research, on the other hand, focuses on understanding how the outcome came to be. American Political Development (APD) scholars are a great reference for this line of thinking. APD scholars look to track why certain outcomes came to be, such as Paul Frymer’s work on Western expansion in the United States of America (Frymer 2017) or Chloe Thurston’s research on housing policy and how it has historically discriminated against women, African Americans, and the poor through the use of public-private partnerships (Thurston 2018). Small n qualitative research also includes oral histories, such as those provided by Yolande Bouka concerning the Rwandan genocide (Bouka 2013), and the interviews and historical context used to explain the coercive power of policing in Latin America as researched by Yanilda María González (González 2017). In short, small n qualitative research aims to tell a story of how an event or policy came to be and what the experiences of particular groups are because of that event or policy.

Thus, a small n qualitative researcher must take care to ensure their work satisfies three characteristics of good qualitative research. First, their research must emphasize the cause and the implications it has. Second, good small n qualitative theories must explain the outcome in all the cases within the population. Lastly, qualitative questions must answer whether events were necessary or sufficient for an outcome to occur, with the cause providing the explanation. To set up qualitative research, it is important to understand that qualitative methods are interested more in the mechanisms behind outcomes. Small n approaches can help us explore underlying processes, such as how institutions evolve and change; a question like this can be answered by looking at institutional change in just one or two contexts. Small n qualitative research can be inductive, as a researcher builds theory and hypotheses from the data, or deductive, testing theories and hypotheses with the data. What is critical in building qualitative research, whether inductively or deductively, is case selection.

9.3 Case Selection

Case selection for small n qualitative research is set up to use a small number of cases in order to take a deep dive into a specific subject. For instance, a researcher may use a specific neighborhood to explain a specific political characteristic of the community. Reuel Rogers conducted exactly this kind of research when he interviewed Afro-Caribbean residents in New York City about their political preferences as new immigrants to the United States (Rogers 2006). This case selection allowed Rogers to assess the veracity of an age-old claim that pluralism allows immigrants to eventually assimilate into American culture and government participation by highlighting the complexity that comes with immigrants who are identified as black. Rogers finds that Afro-Caribbean immigrants suffer from discrimination that may hinder their ability to assimilate into American society. Yet, how does a researcher decide what cases to use? Seawright and Gerring provide some insight by identifying seven case selection procedures (Seawright and Gerring 2008). For the purposes of this text, this chapter will focus on four of these procedures: most similar, most different, typical, and deviant. The chapter will also briefly describe extreme, diverse, and influential cases.

9.3.1 Most Similar

Seawright and Gerring instruct that the most similar case selection procedure requires at least two cases to compare. Ideally, when using most similar cases, the cases would be similar on all independent variables other than the key independent variable and the dependent variable. For example, we may compare neighborhoods with similar values for income, religion, and education, with the key independent variable, such as race, being the only difference. Thus, a researcher could use this form of case selection to research the differences or similarities that black middle-class residents of a particular neighborhood have with residents of a white middle-class neighborhood. It should be noted that matching any particular cases on exact characteristics is essentially impossible in the social sciences. Thus, this technique is daunting to say the least. Yet, part of the compromise of political science, and social science in general, is doing the best with the information you have and being honest about the limitations. This is especially important in the use of the most similar case selection procedure.
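To make the matching logic concrete, below is a minimal sketch in Python (not from the chapter's sources); the neighborhood names and covariate values are invented for illustration. It scores every pair of cases that differs on the key independent variable and keeps the pair that is closest on the remaining standardized covariates.

```python
# A sketch of most-similar case selection: among pairs of neighborhoods
# that differ on the key independent variable (majority race), find the
# pair closest on the other covariates. All data are invented.
import itertools

neighborhoods = {
    # name: (income_z, religiosity_z, education_z, majority_race)
    "A": (0.2, 0.5, 0.1, "black"),
    "B": (0.3, 0.4, 0.2, "white"),
    "C": (-1.1, 0.9, -0.5, "white"),
    "D": (1.4, -0.3, 1.0, "black"),
}

def covariate_distance(a, b):
    """Euclidean distance on the standardized matching covariates."""
    return sum((x - y) ** 2 for x, y in zip(a[:3], b[:3])) ** 0.5

# Consider only pairs that differ on the key independent variable.
candidates = [
    (covariate_distance(v1, v2), n1, n2)
    for (n1, v1), (n2, v2) in itertools.combinations(neighborhoods.items(), 2)
    if v1[3] != v2[3]
]

dist, n1, n2 = min(candidates)
print(f"Most similar pair: {n1} and {n2} (covariate distance {dist:.2f})")
```

In practice a researcher would rarely compute this literally; the point is that "most similar" means minimizing differences on everything except the variable of interest.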

9.3.2 Most Different

Seawright and Gerring also identify the most different case selection procedure. Most different cases are cases that differ on the specified variables other than the key independent variable and the dependent variable. For instance, maybe there are class, education, and religion differences between two neighborhoods, but the key independent variable of race remains the same for both. Seawright and Gerring argue that this tends to be the weaker route to take in comparing two cases, but it is nonetheless an option for a small n researcher under the right circumstances.

9.3.3 Typical Case

The typical case is a common or representative case that a theory explains. According to Seawright and Gerring, the typical case should be well defined by an existing model, which allows the researcher to observe problems within the case rather than relying on any particular comparison. A typical case is great for confirming or disconfirming particular theories. Referring back to the work of Reuel Rogers on black Caribbean immigrants in New York City, Rogers was able to disconfirm Dahl’s argument that pluralism allows for the eventual full inclusion of immigrants by pointing to the racism and discrimination black Caribbean immigrants face, which hinders their ability to be fully incorporated into the American polity. What is most important for understanding the typical case is that it is representative, and that this representativeness must be placed somewhere within existing models and theories to be useful.

9.3.4 Deviant Case

In contrast to the typical case, the deviant case cannot be explained by theory. A researcher can have one or more deviant cases, and these cases serve more as a function of exploring and confirming variation within cases. The deviant case essentially checks for anomalies within an established theory and allows for the finding of previously unidentified explanations in particular cases. An example may be finding that liberalism is defined differently among certain populations, which runs counter to Hartz’s assertion that liberalism assumes a certain amount of unity throughout the country. What is most important for understanding the deviant case is that it lets a researcher check the representativeness of a theory, which is where much of the value of small n methods lies. A researcher can tell a story of a particular group that is often assumed to fit the general understandings of political science but that, through the use of qualitative methods, is shown to be more complex than previously understood.
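One common way to operationalize the typical/deviant distinction is against a statistical model of the population: typical cases sit close to the model's predictions, deviant cases far from them. The sketch below, with invented data, illustrates that heuristic; it is one illustration of the idea, not a prescribed procedure.

```python
# A sketch of picking typical and deviant cases by model fit: estimate
# the existing model, then rank cases by their residuals. Small
# residuals suggest typical cases; large residuals suggest deviant
# ones. Data are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=30)            # hypothetical cause (e.g., education)
y = 2.0 * x + rng.normal(size=30)  # hypothetical outcome (e.g., turnout)
y[7] += 6.0                        # plant one anomaly so a deviant case exists

# Fit the simple linear model y = a + b*x by least squares.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = np.abs(y - X @ beta)

print("typical case (smallest residual):", residuals.argmin())
print("deviant case (largest residual): ", residuals.argmax())
```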

9.3.5 Other Selection Approaches

Along with the four main case selection procedures, there are three other approaches worth noting. The first is the extreme case. The extreme case is characterized by cases that are very high or very low on a researcher’s key independent or dependent variable. It can provide the means to better understand and explore phenomena by maximizing variation on the dimensions of interest through the selection of very low and very high cases (Seawright and Gerring 2008). Unlike in linear regression, where extreme values can provide an incomplete or inaccurate picture, in small n approaches extreme cases can offer the opportunity to deepen the understanding of a phenomenon by focusing on its most extreme instances (Collier, Mahoney, and Seawright 2004, 4-5).

Second, diverse cases highlight the range of possible values. A researcher can choose low, medium, and high values of their independent variable to illustrate the range of possibility. Two or more cases are needed, and this procedure mainly serves as a method for developing new hypotheses. These cases are minimally representative of the entire population.
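As a quick illustration of diverse case selection (and, at the endpoints, extreme case selection), the sketch below picks the cases closest to the low, middle, and high values of a key independent variable. The values are invented.

```python
# A sketch of diverse-case selection: pick the cases nearest the low,
# middle, and high ends of the key independent variable's observed
# range. The values are invented for illustration.
import numpy as np

iv = np.array([0.12, 0.95, 0.43, 0.51, 0.08, 0.77, 0.66, 0.30])

targets = np.quantile(iv, [0.0, 0.5, 1.0])   # low / medium / high
picks = [int(np.abs(iv - t).argmin()) for t in targets]
print("diverse cases (indices):", picks)     # one case from each level
```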

Lastly, influential cases are outliers in the sense that they are not typical and may be playing an outsize role in a researcher’s results. Because influential cases are identified through large n methods, small n methods are unlikely to play a significant role here.

Check-in Question 1: How should a researcher go about choosing a case selection procedure?

9.4 Method: setup/overview

Small n methods are characterized by an emphasis on detail. A researcher has to be able to see the environment that they are studying. The purpose of small n methods is to gain in-depth knowledge of particular cases. Field notes will be a researcher’s best friend. A researcher should take notes on the demographics, noises, emotions, mores, and much more to gain an accurate understanding of the population they are studying. Additionally, small n methods are about building rapport with the population being studied and constantly taking into account one’s own biases and thoughts while conducting fieldwork. It is not uncommon for researchers to eventually live in the places they are studying. During her work on the black middle class, Mary Pattillo moved into the South Side Chicago neighborhood of Groveland, the subject of her book Black Picket Fences (Pattillo 2013). Pattillo attended community meetings, shopped, and cultivated lasting relationships with the community, which guided her research. There is a level of intimacy needed to do good small n research: not always to the extent of living with one’s participants, but a need for insight that goes beyond a shallow understanding of a particular community. A small n qualitative researcher gets at these insights through several methods.

Note: Take some time to think about your own research. What are you noticing during your fieldwork? How is this informing your study?

9.5 Method: types

The typical methods used in small n research are interviews, participant observation, focus groups, process tracing, and ethnography. Each method has its advantages and disadvantages, and a researcher can utilize more than one of these methods depending on the aims of their research. In deciding on a small n method, a researcher must consider the goals of the research, its validity, and the conceptual framework that will feed the researcher’s broader question. The diagram below illustrates that a small n qualitative researcher should be purposeful in their research design. They must consider their overall question, specify the goals of their research, consider the theories that drive the conceptual framework of their research, and consider the validity (does it make sense?) of their research design.


Figure 9.1: Research Methods Diagram

Focusing on the methods portion of the diagram, this chapter will now discuss each small n qualitative method in further detail.

9.5.1 Interviews

Conducting interviews can seem like a daunting experience. A researcher has to develop comfort in approaching diverse sets of people, many times in unfamiliar environments. A researcher has to be able to build rapport, get their questions answered within a limited amount of time, and encourage the participant to elaborate on and clarify answers. Interviews are challenging, but the good news is that there are ways to make the process smoother through organization, commitment, and earnestness.

Before contacting anyone for an interview, a researcher should take some time to organize their interview guide and decide whether they want to conduct structured or semi-structured interviews. The interview guide highlights the questions and themes the researcher plans to cover during the interview. The format of the interview guide is determined by whether the researcher has a rigid structure of questions they plan to ask each participant (a structured interview) or a more flexible strategy that allows the researcher to deviate from the questions and have a more exploratory conversation within the confines of the research question (a semi-structured interview).

Once a researcher has decided on an interview structure and completed their interview guide, they can decide whom they want to recruit to participate in the interviews. The researcher will need to consider the key informants and the representative sample they want to recruit. Key informants are experts who can discuss the population of interest, including but not limited to academics, community leaders, and politicians. The representative sample is the population that the research is based on. For example, Wendy Pearlman’s text We Crossed a Bridge and It Trembled: Voices from Syria has a representative sample of Syrians displaced during the civil war (Pearlman 2017). What is important to understand about the difference between the representative sample and key informants is that the sample gives a firsthand account of its experiences, while a key informant mainly gives their observations of the representative sample from an outside perspective.

Moving on to recruitment, Robert Weiss’s Learning From Strangers lists several factors that affect whether an individual is willing to participate in an interview, including occupation, region, retirement status, vulnerability, and sponsorship from others within their network (Weiss 1994). Unfortunately, there is no easy way to recruit, but from experience, face-to-face discussions with potential participants and immediate follow-up are quite effective. A researcher can also use snowball sampling, recruiting previous participants’ acquaintances and networks to participate in interviews. These strategies are not foolproof, but a layer of personal interaction through face-to-face contact or networks does have advantages in making many people more receptive to participating in interviews.
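Snowball recruitment can be pictured as a breadth-first walk through participants' networks: each interviewee refers the researcher onward. The sketch below uses a hypothetical referral network and a made-up `snowball` helper purely to show the mechanics.

```python
# A sketch of snowball recruitment as a breadth-first walk through a
# hypothetical contact network: each interviewed participant refers
# the researcher to acquaintances, who are recruited in turn.
from collections import deque

referrals = {
    "seed_1": ["ana", "luis"],
    "ana": ["marta"],
    "luis": ["jo", "marta"],
    "marta": [],
    "jo": ["seed_1"],
}

def snowball(seeds, max_n=10):
    recruited, queue, seen = [], deque(seeds), set(seeds)
    while queue and len(recruited) < max_n:
        person = queue.popleft()
        recruited.append(person)
        for contact in referrals.get(person, []):
            if contact not in seen:       # never re-recruit someone
                seen.add(contact)
                queue.append(contact)
    return recruited

print(snowball(["seed_1"]))  # ['seed_1', 'ana', 'luis', 'marta', 'jo']
```

The usual caveat applies: because recruitment flows along existing social ties, a snowball sample overrepresents well-connected people and is not a random sample.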

Lastly, when the day of the interview finally arrives, a researcher should have two recorders, tissues, the interview guide, a consent form, and a gift card for the participant if possible. The interview should not take any longer than an hour, as a sign of respect for the participant’s time. A researcher should take meticulous notes during the interview. Also, the researcher should gain the participant’s permission to conduct a follow-up interview if necessary.

Check-in Question 2: What is the difference between a representative sample and a key informant?

9.5.2 Participant Observation

Participant observation is a variation of ethnographic research in which the researcher participates in an organization, community, or other group-oriented activities as a member of the community. Typically used in anthropology, it involves a researcher immersing themselves within a community. Participant observation requires that the researcher build a strong bond of trust with the observed community. A researcher (with the help of the IRB) will need to decide whether participation will be active or passive and whether it should be overt or covert. This can be a particularly sticky situation, as passive and covert observation may mean community members have no idea they are being studied, while active and overt participation can change the environment because the community is aware of the presence and role of the researcher. Referring back to the work of Mary Pattillo, recall that she eventually became a resident of Groveland and participated like any other resident in community activities (Pattillo 2013). This included leading the local church choir, joining the community’s local action group, and coaching cheerleading at the local park. Pattillo saw her participant observation as essential to describing the black middle class in Groveland and even speaks of the parallels between the Groveland neighborhood and her upbringing in Milwaukee.

The key purpose of participant observation is to provide deeper insight into process and how things function. This exercise is good for theory building, but it may be best to include another method, such as interviewing, to allow the community to tell its story as well, a supplemental method Pattillo herself uses. What is most important when using participant observation (and in qualitative methods in general) is to take meticulous field notes with attention to accuracy. A researcher should be cognizant of their own biases and constantly think through their analysis to make sure they are capturing an accurate story. To do so, a researcher should keep both mental notes and a notepad. At the end of an event, it is important to write everything down while the researcher’s memory is fresh.

Check-in Question 3: What are the advantages and disadvantages of covert and overt participant observation?

9.5.3 Focus Groups

Focus groups, similar to individual interviews, require a researcher to set questions, recruit participants, and follow up with participants as necessary. As with an individual interview, the researcher should have an interview guide to help structure the questions and themes of the focus group. The advantage of a focus group is that a researcher is able to facilitate multiple respondents at once, which can lead to additional details and information that might not emerge in a series of individual interviews. As seen in Melissa Harris-Perry’s Sister Citizen, focus groups are great for spurring discussion about topics such as stereotypes (Harris-Perry 2011). A researcher should note impressions, points of contention, and general interactions within the group. Group dynamics and discussions can be used for theory building as well as for gaining a deeper understanding of a particular group of people.

9.5.4 Process Tracing

Process tracing is a method of causal inference that uses descriptive inference over time. Notably used by APD scholars, the goal of process tracing is to collect evidence to evaluate a set of hypotheses through the framing of historical events. Four tests are commonly discussed in process tracing.

The first is the straw in the wind test. Passing a straw in the wind test can increase a hypothesis’s plausibility, but the test provides neither a necessary nor a sufficient criterion for accepting or rejecting it; it can only mildly strengthen or weaken hypotheses. The hoop test establishes a necessary criterion: though passing a hoop test does not confirm a particular hypothesis, failing one eliminates it. The smoking gun test provides a sufficient but not necessary criterion: passing it gives strong support for a given hypothesis and can substantially weaken competing hypotheses, while failing it does not by itself eliminate the hypothesis. Lastly, the doubly decisive test provides evidence that is both necessary and sufficient: passing it confirms a hypothesis while eliminating its rivals, and failing it rules the hypothesis out.
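These four tests, often credited to Van Evera and elaborated by Collier, amount to a two-by-two grid over whether passing a test is necessary and/or sufficient for accepting a hypothesis. The small Python sketch below lays out that grid; it is an illustration of the logic, not a formal procedure.

```python
# A sketch of the four process-tracing tests as a grid over whether
# passing the test is necessary and/or sufficient for accepting a
# hypothesis, following the standard textbook reading described above.
TESTS = {
    # (passing is necessary, passing is sufficient): test name
    (False, False): "straw in the wind",
    (True, False): "hoop",
    (False, True): "smoking gun",
    (True, True): "doubly decisive",
}

def implication(necessary, sufficient, passed):
    """What a pass or fail on a test implies for the hypothesis."""
    if passed:
        return "confirms it" if sufficient else "mildly supports it"
    return "eliminates it" if necessary else "mildly undermines it"

for (necessary, sufficient), name in TESTS.items():
    print(f"{name:18} pass -> {implication(necessary, sufficient, True):20}"
          f" fail -> {implication(necessary, sufficient, False)}")
```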

What is important to understand about process tracing, beyond the four tests, is that it is a good way in political science to draw evidence for certain events and phenomena. Chloe Thurston uses process tracing to track the development of the public-private partnership with regard to housing policy (Thurston 2018). Through numerous historical texts, including archives, testimonials, and presidential records, Thurston is able to develop a story of how public-private partnerships led to homeownership policies that discriminated according to gender, race, and socioeconomic status, and how advocacy groups were able to combat those policies.

Thus, process tracing looks for historical evidence to explain certain events or policies.

9.5.5 Ethnography

Ethnography involves studying groups of people and their experiences (Emerson, Fretz, and Shaw 2011). As mentioned earlier with participant observation, the purpose of ethnography is for a researcher to immerse themselves in the environment they are studying. The researcher will need to develop relationships with the community and detail the environment through constant note taking and reflection. This is reflected in the work of many of the researchers already detailed in this chapter. Done correctly, ethnography allows a researcher to document the emotions, attitudes, and relationships in a community that are sometimes impossible to capture in quantitative work.

In his text Wounded City: Violent Turf Wars in a Chicago Barrio, Robert Vargas captures the fear, frustration, and empowerment felt by the residents of Chicago’s Little Village as they negotiate turf wars among gangs, police, and aldermen (Vargas 2016). The insights he gathers could not simply be collected through a survey; they had to be observed in the environment, which required developing trust within the community.

Ethnography is about relationship building and allows for latent findings that may give proper context for understanding particular groups. This is especially important for underrepresented communities, where in-depth research is often lacking and responsiveness to a survey may be unlikely under less personal circumstances. Ethnography allows a researcher to take a more holistic approach to understanding a community.

Check-in Question 4: What should a researcher be looking for when taking ethnographic field notes?

9.6 Applications

The application of small n qualitative methods is based on a researcher’s question. Sociologist Celeste Watkins-Hayes explains that qualitative research is meant to tell specific stories about a community. Going back to the diagram displayed earlier in the chapter, a researcher should think about the story they are trying to tell and their goals, whether the small n qualitative methods they want to use are valid, and how all of this relates to the research question. When applying small n qualitative methods, record keeping is of the utmost importance. A researcher should make sure that their field notes are detailed and capture an accurate depiction of the environment of study. This means not only self-reflecting on one’s own biases, but also using multiple small n and quantitative methods, when appropriate, to tell the most complete story possible. Lastly, a researcher needs a method of coding the themes and messages found through their study. Recording encounters and taking good field notes will go far in creating an organized system, which will allow a researcher to tell an accurate story that captures the nuances and characteristics of a particular community.
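As one illustration of what a first pass at coding might look like, the sketch below tags invented field-note segments against a hypothetical keyword codebook and tallies theme frequencies. Real qualitative coding is interpretive; keyword matching like this is at most a starting point for organizing the material.

```python
# A sketch of first-pass qualitative coding: tag field-note segments
# against a hypothetical codebook of themes, then tally how often each
# theme appears. Codebook, keywords, and notes are all invented.
from collections import Counter

codebook = {
    "distrust of police": ["police", "stopped", "harassed"],
    "community pride":    ["block club", "proud", "neighborhood"],
    "economic strain":    ["rent", "laid off", "bills"],
}

field_notes = [
    "Resident said she was stopped by police twice this month.",
    "He spoke at length about the block club and seemed proud of it.",
    "Several attendees mentioned rising rent and unpaid bills.",
]

tallies = Counter()
for segment in field_notes:
    text = segment.lower()
    for theme, keywords in codebook.items():
        if any(kw in text for kw in keywords):
            tallies[theme] += 1   # count each theme once per segment

for theme, n in tallies.most_common():
    print(f"{theme}: {n}")
```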

9.7 Advantages of Method

Small n qualitative research thrives on gaining in-depth information about a limited number of cases. This allows a researcher to provide insight into a small number of communities that may be missing from large n studies. In the same breath, small n methods allow for theory building that often complicates lessons taken for granted in the discipline of political science. It is one thing to ask an individual participant to check an answer on a question about immigration, race, or the president. Yet, there is value in going deeper and wrestling with the values, contradictions, and historic and present-day context that make up the politics of a particular people. It is through small n methods that researchers are able to get a better understanding of topics such as rural consciousness, neighborhood violence, and linked fate. Small n methods allow a researcher to tell the stories that are often ignored, unheard, or misinterpreted through other methods.

9.8 Disadvantages of Method

The major disadvantage of small n methods is that a researcher is working from a small pool. This should not be confused with having less data. Interviews, field notes, and archives bring an abundance of data, but the sources are limited. A responsible researcher will have to consider whether their case selection is representative of the broader community and how best to ensure that they are hearing from a diverse set of voices, to avoid inaccurate assessments of a community. Thus, it is difficult (but not impossible) to generalize from small n research. Including quantitative methods or multiple small n methods in a study will go a long way toward strengthening a researcher’s arguments.

9.9 Broader significance/use in political science

As has been noted numerous times in this chapter, small n qualitative methods allow a researcher to explore groups that cannot necessarily be understood merely with a survey, experiment, or causal inference. Small n methods allow a researcher to go into more detail about groups that cannot be fully understood through quantitative research, either because they are too small or too unresponsive to quantitative methods. Additionally, small n qualitative research allows political scientists to consider context and history when developing claims about the political behaviors and institutions that shape society. This context can help a political scientist go beyond superficial understandings of particular groups. For instance, Michael Dawson’s text Black Visions uses quantitative methods to show that African Americans have high support for Black Nationalism (Dawson 2001). This finding alone could be taken as an example of mass black prejudice, as Black Nationalism has been associated most notably with the bigoted views of Louis Farrakhan. Yet, Dawson takes care to include the historical context, including testimonials by leading black thinkers, detailing the long history of debate concerning Black Nationalism, as well as the economic violence and discrimination committed against the black community, which leads to support for some forms of Black Nationalism. Small n qualitative research, through the use of history, interviews, and ethnography, allows for the telling of these stories, adding complexity and nuance to many of political science’s well-established theories and perceptions.

9.10 Conclusion

Not all questions can be answered with a survey or experiment alone. Sometimes a deeper study of a community or event can lead to new and exciting insights in the discipline of political science. Admittedly, small n qualitative research can be met with some cynicism in certain parts of the political science community, but when done correctly, through meticulous note taking, coding, and preparation, small n qualitative methods can provide insights that have yet to be fully articulated in the discipline and assist in answering some of the most important questions of the day, including policing, immigration, and race relations.

9.11 Application Questions

Application Question 1

What are some materials needed to conduct small n research?

Application Question 2

When in the field, how does a researcher build rapport with the community?

9.12 Key Terms


covert observation

deviant case

diverse cases

ethnography

extreme case

focus groups

influential cases

key informant

most different

most similar

overt observation

participant observation

process tracing

representative sample

smoking gun test

snowball sampling

straw in the wind test

typical case

9.13 Answers to Application Questions

If conducting interviews or focus groups, a researcher should have their interview guide prepared, along with tissues and two recorders. Additionally, a researcher should have a notepad for field notes and consent forms if necessary. Business cards are also useful when trying to recruit participants in the field.

Rapport can be built through appearance, including dress, race, gender, and regional and class markers. Most importantly, a researcher should present themselves as engaged and attentive to the participants. A researcher should remain professional and read the room; rapport building with a group of blue-collar workers may be different than with college students. A researcher should remain cognizant of this distinction and look for openings to build connections when possible.


Evaluating Research Methods of Comparative Politics

Evaluating the Research Methods of Three Modern Classics of Comparative Politics

The main aim of this essay will be to explore and theoretically evaluate the research designs of three classics of comparative politics: Putnam’s case study method in Making Democracy Work , Linz’s small-N research design in “The Perils of Presidentialism” and Amorim Neto & Cox’s large-N statistical analysis in “Electoral Institutions, Cleavage Structures, and the Number of Parties”. To do this, I will split my essay into two separate sections. In the first section of the essay I intend to illustrate the strengths and weaknesses of the different comparative research designs, focusing exclusively on the types found in the three main texts under evaluation. After a short introduction, some important focuses, amongst others, will be on the objectives of case studies, small-N studies and large-N studies, their evolution throughout the 20 th century, when they are best used, what their purposes are and what they intend to achieve.

The second section of the essay will be a critical evaluation of how successful one particular text – Linz’s “The Perils of Presidentialism” – has been in the execution of the authors’ objectives, looking at other work in the field to assess if its aims have been fully achieved. In short, I will consider how convincing the text is as a comparative study. I will then conclude the essay by considering how alternative research designs may have improved or worsened the selected study, again drawing on important academic works to support my theories and assertions.

Robert Putnam’s Making Democracy Work undoubtedly brought the concept of social capital to the forefront of the socio-political sphere. His celebrated work is a key example of a case study, “an intensive analysis of an individual unit (as a person, event or community) stressing developmental factors in relation to environment or context.” [1] Putnam’s research design has a broad analysis that compares 20 regions across Italy – a MSSD. He exploits a natural experiment to assess the difference institutional reform makes to institutional performance over his 20 year period of study. Upon recognising disparities in regional findings, he then assesses the reasons for cross-sectional and cross-temporal variation in institutional performance, moving to a more in-depth focus on 6 regions.

Although the case study discipline is beginning to decline in the modern academic sphere, the strengths of this kind of research design are still evident. Prominent examples from the field are Tocqueville’s Democracy in America (1888) and Lijphart’s The Politics of Accommodation (1968). Landman highlights some of the main strengths of case studies such as these, writing:

“Single-country studies provide contextual description, develop new classifications, generate hypotheses, confirm and inform theories, and explain the presence of deviant countries identified through cross-national comparison.” [2]

In short, the case study method allows an intensive depth study of a unit with limited resources. As Landman statement suggests, they are extremely flexible and can serve a multitude of purposes. Making Democracy Work is also a prime example of another key strength of case studies – utilising process-tracing to uncover evidence of causal mechanisms or to explain outcomes. [3] For example, Putnam is able to trace the roots of modern civic community and institutional performance back to Italy’s “Golden Age” in the 14 th century through his historical analysis. Other comparative methods, such as large-N, are far less conducive to this type of process tracing and unearthing of causal mechanisms. They are often able to say what happened, but not necessarily how it happened. The detail in which these mechanisms are explained is also far less extensive.

Despite the strengths highlighted above, there are also some considerable limitations to the case study method that have been recorded extensively in the literature of the social sciences. A key limitation, again highlighted by Landman, is that “inferences made from single-country studies are necessarily less secure than those made from the comparison of several or many countries.” [4] Generalizations about other units cannot be made from a case study, unlike with large-N statistical analysis. Furthermore, case studies are usually better at description than establishing causation. The case study method is also considered intensive rather than extensive, as it tends to offer a great deal of in-depth analysis of a single unit at the expense of breadth analysis. With Putnam, for example, the study has a sole focus on Italy and Italian institutions and social capital over 600 years. George & Bennett draw upon the importance of this point, stating that another potential pitfall of the case study method is the “selection bias” that occurs when a unit is chosen on its “intrinsic historical importance”, or “on the accessibility of evidence” [5] This form of “selection bias” could certainly be evident in Putnam’s research design. Finally, where larger-N studies often confirmatory in their nature, seeking to either confirm or reject hypotheses, case studies tend to be more exploratory, attempting to gain new insights into a topic or unit from which new hypotheses might be later developed.

Juan Linz’s “Perils of Presidentialism” differs to Putnam’s work in that it is a small-N research design that compares the effect presidential and parliamentary regime types on democratic stability. Small-N research designs such as Linz’s usually consist of 2-12 intentionally selected cases and are therefore too limited to conduct statistical analysis as with large-N designs. [6] Linz’s research design is a qualitative comparison of a small number of cases selected from Latin America and Western Europe, with a particular focus on the USA also. Many have argued that his case selection is based upon MSSD. The main aim of “Perils of Presidentialism” was to explore political processes over time and within cases as a means of showing that “the superior historical performance of parliamentary democracies is no accident.” [7]

As with the case study method, there are numerous advantages to using small-N research designs – often called the comparative method – in comparative politics. One of the most significant strengths of using the comparative method comes from the intentional selection of cases as previously mentioned. Not only can it be a substitute for the experimental control evident in large-N analysis, the intentional selection of cases that share similar characteristics means that hypothesis testing is made easier. [8] A similarly significant advantage of using this method in comparison to the large-N statistical method is that concepts and ideas used in small-N studies can be operationalized at a lower level of abstraction, meaning that concepts are at a lower risk of being stretched. The result of this is greater confidence that chosen concepts are being accurately measured. Collier’s The Comparative Method also highlights the fact that small-N designs such as Linz’s allow for an intensive analysis of a few cases with limited energy expenditure, financial resources and time. These intensive analyses can be more fruitful than superficial statistical analysis of many cases that can be extremely time-consuming and difficult to execute successfully. The collection of large data sets has also proved extremely difficult. [9]

It could be argued that the weaknesses of the comparative method, or small-N research designs, out-weigh its strengths. Where the selection of cases can work as an experimental control, Landman, amongst others, has highlighted case selection as one of the major pitfalls of small-N designs, stating that “the selection of cases in the absence of any rules of inquiry can lead to insecure inferences, limited findings, and, in some cases, simply incorrect conclusions about a particular topic.” [10] Similarly, the issue of “many variables, small number of cases” is raised extensively in the literature (e.g. Lijphart, 1971; Goggin, 1986). Problems often arise when comparing few countries when there are more factors identified explaining the observed outcome than there are countries observed. Because small-N analysis involves hand-picked specific cases, there are often many variables linking the cases that are not central to the study, hence “too many variables, not enough cases.” [11] A way of potentially solving this problem, as highlighted by Lijphart, is adding more cases to the equation. This eventually becomes problematic when too many cases are added and the research passes from the small-N comparative method in to the large-N statistical analysis method. Using the small-N design that Linz has can therefore be extremely problematic.

Octavio Amorim Neto & Gary Cox’s “Electoral Institutions, Cleavage Structures, and the Number of Parties” is an example of the third type of research design – a quantitative large-N statistical analysis. Amorim Neto & Cox utilise a large-N cross-case statistical analysis which analyses the degree of correlation between the number of effective parties, various measures of electoral system permissiveness and ethnic fragmentation in electoral systems around the world. The design is also cross-sectional, as data is collected from 54 elections around the world (c.1985) and observes both parliamentary and presidential elections. Multivariate regression is used to assess the different impact each of the independent variables has on the dependant variable. Their conclusion is that “the effective number of parties appears to depend on the product of social heterogeneity and electoral permissiveness.” [12]

A particularly key advantage of using the large-N statistical method is the fact that statistical controls can be used to rule out rival explanations for an observed outcome and to control for potentially confounding factors. [13] In small-N analysis, it is clear that a lack of statistical control can lead to the production of incorrect of flawed results. Large-N analyses also allow for an extensive coverage of countries over both space and time. [14] In Democracy and Development (2000), for example, Prezworski et al. use 150 countries over a 40 year period in their analysis. Unlike in case studies and small-N designs, large-N designs are better able to avoid selection bias because units are usually randomly selected. Furthermore, large-N designs are better able to highlight ‘deviant’ or ‘outlier’ countries whose outcomes are not as expected from the study. The nature of such designs also means that generalizations can be made about the wider population from their findings because theories are tested with a far greater number of cases that are more representative of the wider population, whether it be individuals, groups or countries. As Coppedge states, “the principal advantage of the kind of theory that emerges from large-sample work is that it is relatively general, both in its aspirations and in its empirical grounding.” [15]

As made apparent by Collier, one of the main disadvantages of using the statistical method in a study is the difficulty in “collecting adequate information in a sufficient amount of time” when resources and time are limited. [16] Large-N analysis are, without doubt, extremely time consuming and often expensive; they also assume a certain amount of mathematical or computing knowledge in their execution and interpretation. Moreover, only certain types of data can be used using this method of analysis and the reliability of data extracted from developing countries, for example, is often questionable. This means that relevant data can sometimes be omitted or incorrect, meaning inferences can also be misleading. Even when data is available, its higher level of abstraction can sometimes lead to concept stretching, where we cannot be sure if the researcher is still measuring what he/she originally intended to. Another critique of the method is that large-N studies often rely upon assumptions that may not hold true, e.g. there are no omitted variables, cases are fully independent, or unit homogeneity. Finally, many who interpret large-N statistical analysis often mistake correlation for causation. Correlation simply highlights the direction and strength of a relationship, not causation.

Regardless of its wider political impact, Juan Linz’s “The Perils of Presidentialism” has come under a vast amount of criticism in the social sciences for being too implicit and underdeveloped. Horowitz, for example, describes Linz’s claims that parliamentary governments is better able to stabilise democracy than presidential government as being “based on a regionally skewed and highly selective sample of comparative experience, principally from Latin America,” that “rest on a mechanistic, even caricatured, view of presidency.” [17] Criticising Linz choice of Latin America as antagonistic examples of unstable presidential systems, Horowitz highlights the inherited post-colonial Westminster parliamentary systems in Africa and Asia as equally destabilising and not at all conducive to democracy. This is because their winner-takes-all features allow any party with a majority to seize power. [18] Linz makes only a passing comment on these important examples that oppose his central causal arguments. Horowitz’s argument here could also be that a way to avoid the “regionally skewed” design that Linz’s work represents would be to adopt a MDSD – comparing various parliamentary and presidential systems from across the world as cases – instead of his MSSD design. This could also avoid the ‘selection bias’ that is often present through hand-picking cases in a small-N research design.

Similarly, both Horowitz and Shugart & Cary highlight Linz’s omission of the variation in electoral systems, party systems and presidential powers as a reason for “The Perils of Presidentialism” being a somewhat unconvincing comparative study. Shugart & Cary’s Presidents and Assemblies emphasises the variation in presidential systems, introducing semi-presidentialism, premier-presidential and president-parliamentary systems that are only briefly touched upon by Linz in his paper. France and Germany, for example, are mentioned but not elaborated on. When Shugart & Cary include these regime types in their analysis, they find that only twelve ‘full’ presidential systems had broken down in the twentieth century in comparison to twenty one parliamentary systems, [19] contradicting Linz’s argument that parliamentary systems are more conducive to stable democracy. In continuation with this, Horowitz draws on three examples from the USA, Nigeria and Sri Lanka to highlight the variations in electoral systems, processes and party systems that produce presidential executives. [20] Once again these variations are only touched upon in passing by Linz, whose analysis is underdeveloped and openly omitting important cases that contradict what Horowitz called his “caricatured view on presidency.” This caricatured view is also highlighted by Mainwaring & Shugart, who argue that presidentialism is predicated upon a system of checks and balances which usually inhibit winner-takes-all tendencies that Linz describes as a central feature of presidential systems. [21]

Further criticisms of Linz’s design come from Cheibub’s Presidentialism, Parliamentarism, and Democracy . Cheibub opposes Linz’s argument that the historical evidence of democratic breakdown shows presidential regimes to be less democratically stable, instead stating that presidential democracies fail because they arise in countries with a higher probability of democratic breakdown, regardless of regime type. [22] Cheibub highlights the importance of economic development in the survival of democratic regime types. Countries with parliamentary systems happen to be more wealthy, and therefore more likely to survive. [23] Economic development is not the only variable omission present in Linz’s work either, as Cheibub highlights the importance of geographical location and size of the country as well as economic development as important to democratic longevity. Cheibub also questions Linz’s decision to omit coalitions as a variable, an electoral outcome Linz holds in high regard in terms of democratic stability and performance. Again, Linz does not draw on any of these points in his qualitative analysis. A final point highlighted by Cheibub is that impeachment, something Linz asserts is extremely difficult in presidential systems, happened six separate times in 1990s Latin America alone, four of which were passed. This again shows some of Linz’s assertions to be incorrect or at least without empirical support.

It could be argued that Prezworski et al’s 1996 study What Makes Democracy Work? represents a superior research design to Linz’s. This is because Prezworski’s large-N statistical analysis of 135 countries since 1950 controls for affluence, economic performance, international climate and political learning/experience with democracy. [24] Linz’s small-N research design – a design that is never properly justified – has no explicit control variables, even though both studies have a similar aim of determining what/why democratic regimes prevail. The exploratory nature of Linz’s study, who openly admits is only to “recover a debate on the role of alternative democratic institutions in building state democracies,” [25] means that there is no explicit hypothesis testing per se. Prezworski, on the other hand, extensively tests his hypothesis on a large-N and can therefore make generalizations from his inferences. It could be argued that these factors make it much more useful to the social scientist studying the reasons for the longevity of democratic regimes. This is not to say, however, that Prezworski’s findings do not support Linz’s findings. Prezworski does find that “Linz is right about the durability of alternative institutional arrangements,” [26] although factors such as economic development are more important in determining the permanence of specific regime types. This shows that the statistical method has some features that could have improved Linz’s study.

Conclusions

Section 1 of this essay has shown the various strengths and weaknesses of each comparative research design. Case studies allow for an extremely in-depth analysis of a single unit with limited resources. As Landman shows, they can also serve a multitude of purposes: developing new classifications, generating hypotheses, confirming and infirming theories, and so on. They are, however, limited. Inferences made from case studies are less secure, and generalizations cannot be made from their findings. They tend to be extremely descriptive and are considered intensive rather than extensive, because breadth of analysis gives way to depth of analysis. The potential for "selection bias" identified by George & Bennett can also be problematic.

Similarly, small-N research designs have a multitude of strengths and weaknesses. The intentional selection of cases can work as a substitute for the experimental control found in large-N statistical analysis. Operationalizing concepts at a lower level of abstraction means that concept stretching is far less likely than in large-N research designs. The comparative method also allows for an intensive analysis of a few countries when resources and time are scarce. On the other hand, small-N research designs often suffer from the "many variables, small number of cases" problem, where there are more observed explanatory factors than there are cases. This is just one of the problems with hand-picking cases, another being the potential for insecure inferences and limited findings.

Finally, the evaluation of large-N research designs shows that they, too, have many strengths and weaknesses. The fact that they allow for statistical controls is one of their greatest strengths. They also allow for extensive coverage of many cases over both space and time, as with Amorim Neto & Cox. The random selection of cases removes the issue of selection bias in the traditional sense. The ability to highlight 'deviant' or 'outlier' countries is also a key strength of the large-N design. They are, however, time-consuming, resource-intensive, and assume a level of mathematical and computing knowledge from the outset. The collection of relevant and correct data, especially from developing countries, can also be problematic.

Section 2 of the essay has shown that, upon evaluation, Juan Linz's "The Perils of Presidentialism" has many flaws in both its research design and its execution. Although Przeworski et al. have shown Linz's findings to be broadly correct, Horowitz demonstrates that Linz's case choices are not only unjustified but "regionally-skewed." Evidence from post-colonial Africa and Asia contradicts Linz's theory, so a switch from MSSD to MDSD may therefore have improved the design. Moreover, Cheibub's work shows a blatant omission of important variables in Linz's research design, as well as an ignorance of the variation in presidential regimes. This, combined with Shugart & Carey's work showing a lack of acknowledgement of variation in electoral systems and party systems, demonstrates an underdeveloped design with unconvincing causal inferences. Cheibub also counters Linz's central causal argument by suggesting that regime type is not necessarily the driving force behind democratic longevity, but rather the circumstances in which the system is born. The importance of geographical size, location and economic development as key factors in the survival of a regime type is not highlighted, which once again shows important omissions in the execution of his work. The large-N statistical method has features that could improve Linz's design, as it allows for what has been described by Kerlinger as "the three criteria of the ideal research design: (1) that the design answer the research question; (2) that it introduce the element of control for extraneous independent variables; and (3) that it permit the investigator to generalize from his or her findings." [27] An intrinsic weakness of case studies and small-N designs, including Linz's, is that the second and third of these criteria are often unobtainable.

Bibliography

Amorim Neto, Octavio & Gary Cox, “Electoral Institutions, Cleavage Structures, and the Number of Parties”, American Journal of Political Science 41(1), Indiana: Midwest Political Science Association, Jan., 1997, pp. 149-174.

Boix, Carles & Daniel N. Posner, "Making Social Capital Work: A Review of Robert Putnam's Making Democracy Work: Civic Traditions in Modern Italy", The Weatherhead Center for International Affairs 96(4), Harvard University, June, 1996, pp. 1-22. <http://www.wcfia.harvard.edu/sites/default/files/96-04.pdf> (accessed 06/12/2012)

Cheibub, José Antonio, Presidentialism, Parliamentarism, and Democracy , New York: Cambridge University Press, 2007.

Clark, William Roberts, Matt Golder & Sona Nadenichek Golder, Principles of Comparative Politics (2nd ed.), London: Sage, 2013.

Collier, David, “The Comparative Method”, in Ada W. Finifter (ed.), Political Science: The State of the Discipline II , Washington, D.C.: The American Political Science Association, 1993, pp. 105-119.

Coppedge, Michael, “Theory Building and Hypothesis Testing: Large- vs. Small-N Research on Democratization”, Paper prepared for presentation at the Annual Meeting of the Midwest Political Science Association , Chicago, Illinois, April 25-27, 2002.

Flyvbjerg, Bent, "Case Study", in Norman K. Denzin & Yvonna S. Lincoln (eds.), The Sage Handbook of Qualitative Research (4th ed.), California: Sage, 2011, pp. 301-316.

George, Alexander L. & Andrew Bennett, Case Studies and Theory Development in the Social Sciences , Massachusetts: MIT Press, 2005.

Gerring, John, "What Is a Case Study and What Is It Good For?", American Political Science Review 98(2), Cambridge: Cambridge University Press, May, 2004, pp. 341-354.

Goggin, Malcolm L., "The 'Too Few Cases/Too Many Variables' Problem in Implementation Research", The Western Political Quarterly 39(2), Utah: University of Utah, Jun., 1986, pp. 328-347.

Horowitz, Donald L., “Comparing Democratic Systems”, Journal of Democracy 1(4), Maryland: The Johns Hopkins University Press, Fall 1990, pp. 73-79.

Kerlinger, Fred N., Foundations of Behavioral Research (2nd ed.), New York: Holt, Rinehart and Winston, 1973.

Landman, Todd, Issues and Methods in Comparative Politics: An Introduction (3rd ed.), London: Routledge, 2008.

Lijphart, Arend, "Comparative Politics and the Comparative Method," American Political Science Review 65(3), Cambridge: Cambridge University Press, Sep., 1971, pp. 682-693.

Linz, Juan J., “The Perils of Presidentialism,” Journal of Democracy 1(1), Maryland: The Johns Hopkins University Press, Win., 1990, pp. 51-56.

Mainwaring, Scott & Matthew S. Shugart, "Juan Linz, Presidentialism, and Democracy: A Critical Appraisal", Comparative Politics 29(4), New York: Ph.D. Program in Political Science of the City University of New York, Jul., 1997, pp. 449-471.

Morgan-Jones, Edward, "Modern Classics of Comparative Politics," Lectures at the University of Kent, Canterbury, 27 Sep.–14 Dec., 2012.

Przeworski, A. et al., Democracy and Development: Political Institutions and Well-Being in the World, 1950-1990, Cambridge: Cambridge University Press, 2000.

Przeworski, A. et al., "What Makes Democracies Endure?", Journal of Democracy 7(1), Maryland: The Johns Hopkins University Press, Win., 1996, pp. 39-55.

Putnam, Robert D., Making Democracy Work: Civic Traditions in Modern Italy , NJ: Princeton University Press, 1993.

Shugart, Matthew Soberg & John M. Carey, Presidents and Assemblies: Constitutional Design and Electoral Dynamics , Cambridge: Cambridge University Press, 1992.

[1] Bent Flyvbjerg, "Case Study", in Norman K. Denzin & Yvonna S. Lincoln (eds.), The Sage Handbook of Qualitative Research (4th ed.), (California: Sage, 2011), pp. 301-316.

[2] Todd Landman, Issues and Methods in Comparative Politics: An Introduction (3rd ed.), (London: Routledge, 2008)

[3] Alexander L. George & Andrew Bennett , Case Studies and Theory Development in the Social Sciences , (Massachusetts: MIT Press, 2005)

[4] Landman, Issues and Methods in Comparative Politics

[5] George & Bennett, Case Studies and Theory Development in the Social Sciences

[6] Arend Lijphart, "Comparative Politics and the Comparative Method", American Political Science Review 65(3), (Cambridge: Cambridge University Press, Sep., 1971), pp. 682-693.

[7] Juan J. Linz, “The Perils of Presidentialism”, Journal of Democracy 1(1), (Maryland: The Johns Hopkins University Press, Win., 1990), pp. 51-56.

[8] In comparison to case study method; see Lijphart (1971)

[9] David Collier, “The Comparative Method”, in Ada W. Finifter (ed.), Political Science: The State of the Discipline II , (Washington, D.C.: The American Political Science Association, 1993), pp. 105-119; see also Lijphart (1971)

[10] Landman, Issues and Methods in Comparative Politics

[11] Ibid .

[12] Octavio Amorim Neto & Gary Cox, “Electoral Institutions, Cleavage Structures, and the Number of Parties”, American Journal of Political Science 41(1), (Indiana: Midwest Political Science Association, Jan., 1997), pp. 149-174.

[13] Landman, Issues and Methods in Comparative Politics ; see Collier (1993) & Lijphart (1971)

[14] Ibid .

[15] Michael Coppedge, “Theory Building and Hypothesis Testing: Large- vs. Small-N Research on Democratization”, Paper prepared for presentation at the Annual Meeting of the Midwest Political Science Association , (Chicago, Illinois, April 25-27, 2002)

[16] Collier, “The Comparative Method”

[17] Donald L. Horowitz, “Comparing Democratic Systems”, Journal of Democracy 1(4), (Maryland: The Johns Hopkins University Press, Fall 1990), pp. 73-79.

[18] Horowitz, “Comparing Democratic Systems”; see Lewis’ lectures on Politics in West Africa .

[19] Matthew Soberg Shugart & John M. Carey, Presidents and Assemblies: Constitutional Design and Electoral Dynamics, (Cambridge: Cambridge University Press, 1992) – the count includes third-world states; six other breakdowns were of president-parliamentary systems only.

[20] Horowitz, “Comparing Democratic Systems”

[21] Juan J. Linz, "The Perils of Presidentialism", Journal of Democracy 1(1), (Maryland: The Johns Hopkins University Press, Win., 1990), pp. 51-56; see also Mainwaring & Shugart (1997)

[22] José Antonio Cheibub, Presidentialism, Parliamentarism, and Democracy , (New York: Cambridge University Press, 2007)

[23] Cheibub, Presidentialism, Parliamentarism, and Democracy, Table 6.1

[24] A. Przeworski et al., "What Makes Democracies Endure?", Journal of Democracy 7(1), (Maryland: The Johns Hopkins University Press, Win., 1996), pp. 39-55.

[25] Linz, “The Perils of Presidentialism”

[26] Przeworski et al., "What Makes Democracies Endure?"

[27] Malcolm L. Goggin, "The 'Too Few Cases/Too Many Variables' Problem in Implementation Research," The Western Political Quarterly 39(2), (Utah: University of Utah, Jun., 1986), pp. 328-347, 349; see Fred N. Kerlinger, Foundations of Behavioral Research (2nd ed.), (New York: Holt, Rinehart and Winston, 1973), pp. 322-326.

Written by: Luke Johns
Written at: University of Kent, Canterbury
Written for: Dr. Edward Morgan-Jones
Date written: November 2012


Statistical analysis in Small-N Designs: using linear mixed-effects modeling for evaluating intervention effectiveness

Background:

Advances in statistical methods and computing power have led to a renewed interest in addressing the statistical analysis challenges posed by Small-N Designs (SND). Linear mixed-effects modeling (LMEM) is a multiple regression technique that is flexible and suitable for SND and can provide standardized effect sizes and measures of statistical significance.

Aims:

Our primary goals are to: 1) explain LMEM at the conceptual level, situating it in the context of treatment studies, and 2) provide practical guidance for implementing LMEM in repeated measures SND.

Methods & procedures:

We illustrate an LMEM analysis, presenting data from a longitudinal training study of five individuals with acquired dysgraphia, analyzing both binomial (accuracy) and continuous (reaction time) repeated measurements.

Outcomes & results:

The LMEM analysis reveals that both spelling accuracy and reaction time improved and, for accuracy, improved significantly more quickly under a training schedule with distributed, compared to clustered, practice. We present guidance on obtaining and interpreting various effect sizes and measures of statistical significance from LMEM, and include a simulation study comparing two p -value methods for generalized LMEM.

Conclusion:

We provide a strong case for the application of LMEM to the analysis of training studies as a preferable alternative to visual analysis or other statistical techniques. When applied to a treatment dataset, the approach holds up under the extreme conditions of small numbers of individuals, with repeated measures training data for both continuous (reaction time) and binomially distributed (accuracy) dependent measures. The approach provides standardized measures of effect size that are obtained through readily available and well-supported statistical packages, and provides statistically rigorous estimates of the expected average effect size of training effects, taking into account variability across both items and individuals.

Introduction

The analysis of single-case or Small-N Designs (what we will refer to as SND) has long been a controversial topic. Debate over what constitutes an appropriate statistical approach has gone as far as questioning whether statistics should be used at all ( Evans et al., 2014 ), or only as support for visual analysis ( Campbell, Herzinger, & Gast, 2010 ; Lane & Gast, 2014 ; but also see Davis et al., 2013 ). Renewed interest in statistical approaches has been driven in part by advances in statistical computing power enabling the development of new techniques for analyzing SND. Over the last several years, many articles and special issues have been dedicated to this topic in journals in neuropsychology and education, including Neuropsychological Rehabilitation ( Evans et al., 2014 ), Journal of Behavioral Education ( Burns, 2012 ), Journal of Counseling and Development ( Lenz, 2015 ), and Journal of School Psychology ( Shadish, 2014 ).

Various statistical approaches have been proposed for evaluating a single case or a small number of cases, both to compare cases to controls and for within-case studies. These approaches can be categorized into regression and non-regression approaches. The latter category includes but is not limited to: obtaining standardized measures of effect size (e.g., Campbell, 2004; Hedges, Pustejovsky, & Shadish, 2012; Parker, Vannest, & Davis, 2011), simulation modeling analysis (Borckardt & Nash, 2014), modified t-tests (Corballis, 2009; Crawford & Howell, 1998), and ANOVAs (e.g., Crawford, Garthwaite, & Howell, 2009; Mycroft, Mitchell, & Kay, 2002). In this paper, we present an application of linear mixed-effects modeling (LMEM), a type of multiple regression which is well-documented in the statistical literature and is increasingly investigated for its applicability to the designs common in neuropsychological training studies (e.g., Baek et al., 2014; Huber, Klein, Moeller, & Willmes, 2015; Moeyaert, Ugille, Ferron, Beretvas, & Den Noortgate, 2014; Owens & Ferron, 2012; Swaminathan, Rogers, Horner, Sugai, & Smolkowski, 2014; Shadish, Kyse, & Rindskopf, 2013; Shadish, Zuur, & Sullivan, 2014; Van Den Noortgate & Onghena, 2003). We argue that LMEM is well suited to the analysis of repeated measures data where single individuals serve as their own controls. LMEM can be applied to training data to determine whether training has had the desired effect and provides standardized measures of effect size. To provide a concrete demonstration of the validity and usefulness of LMEM for SND, we apply it to a dataset from a training study of five individuals with acquired dysgraphia. We illustrate a "trial-based" approach, wherein the individual measurements taken throughout the training study are modeled as longitudinal repeated measures, as opposed to the more common approach of aggregating measurements taken within a time point and evaluating the data as a time series. 1

The primary goal of this paper is to introduce clinicians and aphasia researchers to LMEM and demonstrate its potential for analyzing the results of treatment studies that involve repeated measures across time. As will be demonstrated, LMEM has a number of properties (such as handling missing data and incorporating multiple, simultaneous estimates of random-effects) that offer advantages over other approaches like ANOVA or ordinary least squares (OLS) regression. The first section provides a conceptual overview of LMEM and an explanation of terminology, within the context of treatment studies. We then present a brief review of the existing literature on the use of LMEM specifically in SND. This review highlights the considerable work that has already been done showing how LMEM applies to SND. What is novel about this review is that, in addition to being written for an audience that does not specialize in statistics, we focus on repeated measures analysis and on accuracy data, which are more common in treatment studies but less researched in the statistical literature.

The second section provides a practical tutorial on conducting a repeated measures LMEM analysis of a SND, using the free statistical software R (R Core Team, 2016). We walk through the analysis and interpretation of results of a new treatment study of five individuals with dysgraphia. We also report (Appendix 2) the results of a simulation study investigating Type I error inflation in LMEM used to analyze accuracy (binomial) data, and present different options for obtaining p-values and measures of effect size for treatment effects.

Mixed-effects models: terminology and concepts

Fixed, mixed, and random-effects.

“Mixed-effects”, “multilevel”, and “hierarchical” all refer to regression models in which the probabilities of a model’s parameters (e.g., the regression coefficients) are modeled at multiple levels, allowing for testing hypotheses not otherwise possible using OLS regression (see Dedrick et al., 2009 , for a review). The term “mixed-effects” arises from the fact that these models include both fixed and random-effects. The determination of whether an effect should be modeled as fixed or random is somewhat contentious (see Gelman & Hill, 2006 ). For the purposes of understanding LMEM as it applies to analyzing SND, we adopt the pragmatic approach of considering the impact that including an effect as either fixed or random has on the interpretation of the results.

As an example, consider a study in which a clinician working with individuals with acquired dysgraphia launches a training program. Individuals are recruited from several clinics and are given a spelling app to practice spelling on a tablet device at home. Each individual practices 20 different words, and after every hour of practice the app administers a spelling test on those words. The app provides repeated measures data to the clinician: each individual’s performance on the spelling test at multiple time points. One question the clinician has is whether the individuals show significant improvement across the first 10 administrations of the spelling test (corresponding to the first 10 hours of practice). Thus, for each individual there are 200 observations (20 words × 10 time points).

The data could be modeled using OLS regression, where the dependent variable is accuracy on the spelling test items and the independent variable is the time point (1–10). Formula 1 (Table 1) represents a fixed-effects-only model. Here, y is the score on the spelling test (the dependent variable), which the model attempts to predict via the variables on the other side of the equation. Specifically, a is some constant (the intercept), b is the coefficient of interest (the slope), x is the time point (the independent variable), and ϵ is error (the difference between the predicted and the observed values of the dependent variable). Both a and b are fixed-effects: the model solution will find the one value for a and the one value for b that together minimize the error in predicting y, given different values of x. For example, the researchers might find that the best-fitting b = 2, interpreted as an increase of two points in accuracy from one administration of the spelling test to the next. These coefficients are fixed-effects because they are "fixed" to be the same for all "units", meaning the effect of the dysgraphia treatment is predicted to be the same for all individuals, across all items on the spelling test, and across the different clinics from which the individuals were recruited.

Formulas reflecting: (1) a fixed-effects-only regression model, (2) the first level of a mixed-effects regression model, and (3a, 3b) individual coefficients for a random intercept and a random slope modeled at the second level of a mixed-effects regression model.

Formula 1 (fixed-effects-only): y = a + b·x + ϵ. Estimates a single value for each coefficient (a and b).
Formula 2 (mixed-effects, first level): y = α_j + β_j·x + ϵ. Estimates a value for each indexed coefficient (α_j and β_j) specific to each unit j (e.g., participant, item, etc.).
Formula 3a (mixed-effects, second level): α_j = a + γ_j. The unit-specific random intercept α_j is the estimated fixed-effect intercept a plus a unit-specific random-effect γ_j.
Formula 3b (mixed-effects, second level): β_j = b + η_j. The unit-specific random slope β_j is the estimated fixed-effect slope b plus a unit-specific random-effect η_j.
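As a concrete illustration, the fixed-effects-only model in Formula 1 can be fit with ordinary least squares in R. This is a minimal sketch, assuming a hypothetical data frame spelling with columns score (spelling test score) and time_point (administrations 1–10):

    # Fixed-effects-only model (Formula 1): one intercept a, one slope b.
    ols_model <- lm(score ~ time_point, data = spelling)
    summary(ols_model)  # the `time_point` estimate is the fixed-effect slope b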

However, the researchers may have reasons to believe that the outcome of the intervention may vary across units, such as individuals or clinics. Formula 2 (Table 1) corresponds to a mixed-effects model; the coefficients α and β are modeled at two levels, the first of which incorporates variability across individuals. In this mixed-effects regression, at the first level, the subscript j indexes the different individuals included in the study. Within this formula, we count two random-effects "by-individuals": an intercept α_j and a coefficient β_j. Each of these will be calculated for each individual, 1 through j. Like the model in Formula 1, this model hypothesizes that the spelling test scores can be predicted by a line with an intercept and a slope. However, Formula 2 additionally hypothesizes that this line may be different for each individual, j. This is accomplished because the parameters α_j and β_j are determined by further equations, namely the second level of the mixed-effects regression (Formulas 3a and 3b, Table 1).

The equations in Formulas 3a and 3b reveal why these models are called mixed: they contain both fixed-effects and random-effects. For example, in Formula 3b, the coefficient b is the fixed-effect of time point (the slope for time point, as in the fixed-effects-only regression, Formula 1), and will be the same value for all individuals. The coefficient η_j is the random slope of time point and is specific to each individual (in mixed-effects terminology, the "random slope of time point, by-individuals"). The effect of time point is calculated for each individual as the sum of two values: the fixed-effect b plus the random-effect η_j, which results in a different value for each individual. The same logic applies to the equivalent formula for the parameter α_j, reflecting the fixed and random-effects related to the intercept term.
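In practice, both levels are estimated simultaneously rather than as separate equations. A minimal sketch of Formulas 2–3 using the lme4 package in R, under the same hypothetical spelling data, with an individual column identifying participants:

    library(lme4)  # install.packages("lme4") if needed

    # Random intercept and random slope of time_point, by-individuals:
    # the fixed-effects a and b are estimated for the average individual, and
    # per-individual deviations from them are estimated as random-effects.
    mixed_model <- lmer(score ~ time_point + (1 + time_point | individual),
                        data = spelling)
    summary(mixed_model)  # fixed-effects (a, b) and random-effect variances
    ranef(mixed_model)    # per-individual deviations from a and b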

Fixed-effect coefficients (e.g., a, b) indicate the effect size for the average unit, which in this example corresponds to how much accuracy improves per hour of practice. The difference between the estimate of b in Formula 1 vs. Formula 3 is that the latter takes into account variability attributable to individual differences. The purpose of including random-effects is, in this sense, similar to including additional "control" predictors that researchers believe could affect the outcome. However, a key strength of random-effects is that they explain variance that can be attributed to dimensions of the data that are otherwise unexplained, without positing any specific hypotheses about how the units vary. For example, including sex as a control variable entails a specific hypothesis: that males will have lower values than females on the dependent variable, or vice versa. Including individuals as a control variable would similarly entail a specific hypothesis test; however, researchers seldom have hypotheses about how specific individuals will vary from one another. If they do, this means they have knowledge of some dimension(s) on which these individuals differ, and they would likely instead include those underlying dimensions as control variables.

Explanatory power of mixed-effects models

How can adding random-effects to a model change the inferences drawn from an analysis? Consider a hypothetical scenario in which the individuals recruited from one clinic showed faster improvement. To illustrate this, we created an artificial dataset in which individuals were recruited from five clinics. This data set was designed such that, collapsing across all five clinics, the correlation between time point and score on the spelling test is 0.20. However, the correlation is 0.91 averaging across individuals only from Clinic A, but in the other four clinics only 0.04. If the researchers ignored the testing site variability and modeled this data using OLS regression as in Formula 1, they would find that improvement on spelling accuracy across the first 10 administrations of the spelling test was significant, at p = 0.02. If instead the researchers adopted a mixed-effects approach and used Formula 2 with random-effects by-clinics, they would find that the relationship between time point and test score was not significant, at p = 0.35. This occurs because the correlation of 0.20 found when looking across all individuals can primarily be explained as an effect of Clinic A—this is reflected by the amount of variance explained in the mixed-effects approach, where only 1.4% is explained by the fixed-effect of time point, but an additional 67.5% is explained by the random slope for time point by-clinics. That is, the average individual’s score is poorly explained by the number of hours they used the app, but is quite well explained by the clinic. Although this is a hypothetical scenario, it is not far-fetched. Such a result could occur if there was a variable associated with the clinics that was not included in the regression model, but which was correlated with the outcomes. For example, perhaps Clinic A is associated with higher education levels or milder deficits relative to the other clinics. Ideally the researchers in this scenario would include variables like these as additional (fixed-effect) predictors, but it is not always possible to identify and quantify all such variables. This highlights an advantage of mixed-effects models: if there are unknown or otherwise unquantifiable sources of variance that are related to a unit (like clinic), then this variance can be taken into account by the random-effects structure.
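A sketch of how this comparison might look in code, assuming a hypothetical clinic column identifying each recruitment site:

    library(lme4)

    # OLS regression ignores the clinic grouping entirely (Formula 1):
    m_ols <- lm(score ~ time_point, data = spelling)

    # Mixed-effects model with a random intercept and a random slope for
    # time_point by-clinics; the fixed-effect of time_point now reflects
    # the average clinic rather than being driven by any single clinic.
    m_clinic <- lmer(score ~ time_point + (1 + time_point | clinic),
                     data = spelling)

    summary(m_ols)     # time_point can appear significant here...
    summary(m_clinic)  # ...yet not once by-clinic variability is modeled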

The approach that we adopt in this paper is specifically appropriate to training studies in the neuropsychological context. We examine the small group ( N = 5) situation, with random-effects by-items and by-individuals. Including these random-effects increases our confidence that treatment effects are not due to any subset of the items trained or individuals enrolled in the study.

Nested versus crossed random-effects

In the actual analysis we present in this paper, we use the approach of crossed random-effects by-items and by-individuals, as opposed to nested. To illustrate how nested and crossed random-effects differ, we consider a further scenario in our hypothetical study. The researchers include random-effects both by-individuals and by-items, such that instead of estimating a coefficient for each individual, β_j (Formula 2, Table 1), a coefficient is estimated for each individual-item combination, β_jk (where k indexes items). Items in this example correspond to the words included on the spelling test. If items are nested within individuals, then each β_jk reflects how each item differs from the others within each individual. If items are crossed with individuals, then each β_jk reflects how each item differs from the others across all individuals, in addition to how each individual differs from the others. The decision to use nested or crossed random-effects should be theoretically motivated, but essentially hinges on assumptions about what the units do or do not share: nesting assumes that the lower-level units are unique to each higher-level unit. This nested relationship is not appropriate for all units, such as linguistic items like words, where in fact we expect individuals to share common knowledge of words. This motivates a statistical model in which there are crossed random-effects by-items and by-individuals. 2
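In lme4's formula syntax, the nested versus crossed distinction is expressed through the grouping terms. A sketch under the same hypothetical data, with an item column identifying the words:

    library(lme4)

    # Crossed random-effects: individuals and items each get their own
    # grouping term, because all individuals respond to the same words.
    m_crossed <- lmer(score ~ time_point + (1 | individual) + (1 | item),
                      data = spelling)

    # Nested random-effects: the `/` operator treats each individual's items
    # as distinct, estimating item effects separately within each individual.
    m_nested <- lmer(score ~ time_point + (1 | individual/item),
                     data = spelling)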

Small-N designs and LMEM

There are many potential applications of LMEM to treatment studies and, accordingly, researchers have investigated different aspects of the approach that are relevant to SND: necessary sample sizes ( Raudenbush & Bryk, 2002 ; Hox, 1998 ; Ferron et al., 2009 ; Maas & Hox, 2005 ), statistical power and accuracy of estimates ( Ferron, Farmer, & Owens, 2010 ; Heyvaert et al., 2017 , Baayen, Davidson, & Bates, 2008 ; Barr, Levy, Scheepers, & Tily, 2013 ; Huber et al., 2015 ; Matuschek, Kliegl, Vasishth, Baayen, & Bates, 2017 ), meta-analysis of treatment studies ( Owens & Ferron, 2012 ; Ugille et al., 2012 ; Van den Noortgate & Onghena, 2003 , 2008 ; Moeyaert, Ferron, et al., 2014 ; Moeyaert, Ugille, & Ferron, et al., 2014 ), and how to customize the models to different types of experimental designs ( Baek et al., 2014 ; Moeyaert, Ugille, & Ferron, et al., 2014 ; Moeyaert et al., 2015 ; Shadish et al., 2013 , 2014 ). The majority of this research has focused on the analysis of continuous outcome measures (e.g., reaction times), as opposed to count or binomial data (e.g., accuracy) where logistic regression is needed. This research has also typically assumed that the data will be modeled as a time series, where a single measure is obtained for each time point. Importantly, the LMEM approach that we present does not treat the data as a time series, but rather as longitudinal repeated measures, just as in the example of the spelling app. The difference lies in the fact that the repeated measures design provides multiple observations at each time point. This distinction cannot be over-emphasized: we are advocating that for SND in which individuals are tested on multiple items within a training session, performance on the items should not be averaged together. This allows for an items-based, trial-by-trial analysis in which the regression model attempts to predict the responses on each trial of each item. Not averaging items together increases the number of observations, maintains information about variance across items, and allows the researcher to control for variables related to the items. For example, in the spelling app study, additional fixed-effects of word length could be included to test if spelling accuracy improved more quickly for some words than others.
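The trial-based layout is easiest to see in the data frame itself. A toy sketch of the hypothetical spelling-app data in long format, where each row is one item at one time point:

    # Trial-based (long) format: one row per item per time point.
    trials <- data.frame(
      individual = c("P01", "P01", "P01", "P01"),
      item       = c("cat", "pen", "cat", "pen"),
      time_point = c(1, 1, 2, 2),
      length     = c(3, 3, 3, 3),
      correct    = c(0, 1, 1, 1)
    )

    # The time-series alternative collapses this to one score per time point,
    # discarding item-level variance and item predictors such as `length`:
    aggregate(correct ~ individual + time_point, data = trials, FUN = mean)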

Repeated measures LMEM guidelines

Summarizing the literature on LMEM in SND, we identify guidelines for using the approach to quantify treatment study results. Rather than providing an exhaustive review, we focus specifically on using LMEM in longitudinal repeated measures designs, with the goal of obtaining an effect size for the effect of treatment.

Sample size

A distinction is made between sample sizes at the different levels of a LMEM. In the spelling app example, the first-level sample size refers to the number of observations from each individual; the second-level sample size is the number of individuals, if there is a random-effect for individuals. Based on the literature (e.g., Raudenbush & Bryk, 2002; Hox, 1998; Ferron et al., 2009; Maas & Hox, 2005), an adequate sample size for the second level can be as few as five units, if the question of interest relates to the fixed-effects. This means that data from five individuals is sufficient to estimate the treatment effects (beta coefficients of the fixed-effects). However, the estimated variance across individuals (the random-effects estimates) will be susceptible to inflated Type I error. 3 Those interested in estimating how much treatment effects vary across individuals, as opposed to the effect size for the average individual, are thus encouraged to collect more data (Ferron, Farmer, and Owens, 2010). Regarding the sample size at the first level, there are no constraints particular to the LMEM approach: if there are enough observations for an OLS regression, then there are enough for LMEM. In general, the more observations the better, and, importantly, there is evidence that a larger sample size at the lowest level can somewhat compensate for smaller sample sizes at the higher levels (Maas & Hox, 2005).

Autocorrelation

Serial dependency, or autocorrelation, is a problem in regression analysis when the residual errors (the ϵ in Formulas 1 and 2, the difference between the predicted and the observed values) violate the assumption of independence. 4 Autocorrelation means the residual errors for observations at time t are correlated with the residual errors at some other time, such as t-1 . This can lead to inflated Type I error. Autocorrelation is not only a problem for LMEM or regression but it has even been found to influence visual analysis of time series ( Matyas & Greenwood, 1996 ). Multiple methods can address autocorrelation; arguably the best approach is to identify the underlying cause(s) and add predictors to remove it ( Hanke & Wichern, 2005 ). Although autocorrelation can arise whenever observations are ordered, it is prominently a concern for time series analyses. We refer to other researchers for methods for identifying and addressing autocorrelation in LMEM (see e.g., Baek & Ferron, 2013 ; Davis et al., 2013 ; Howard, Best, & Nickels, 2015 ; Swaminathan et al., 2014 ). However, it should be noted that the primary questions of interest in treatment studies (e.g., whether or not treatment was effective for the group) are likely to be addressed by the fixed-effects, and importantly, estimates of the fixed-effects will be unbiased even without explicitly incorporating autocorrelation estimates into the LMEM ( Shadish et al., 2013 ; Singer & Willett, 2003 ).
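As a quick diagnostic, the residuals of a fitted model can be inspected for autocorrelation with base R's acf(). A minimal sketch, again using the hypothetical spelling data:

    library(lme4)

    # Order observations by time within individuals, then inspect the
    # autocorrelation function of the model residuals; large spikes at
    # lag 1 and beyond suggest serial dependency.
    spelling <- spelling[order(spelling$individual, spelling$time_point), ]
    m <- lmer(score ~ time_point + (1 + time_point | individual),
              data = spelling)
    acf(resid(m))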

Determining random-effects structure

One of the challenges with LMEM is determining the optimal random-effects structure. Researchers familiar with multiple regression will recognize the general challenge of determining which variables to include in any model; the inclusion of random-effects adds to this challenge. First, one must decide which units should be included as random-effects: while items and individuals are the most common, one is not limited to those (sites, clinicians, etc.). Second, one must consider which random intercepts and slopes to include by those units. Much attention has been paid to this difficulty in the psycholinguistics literature on mixed-effects (see Baayen et al., 2008; Barr et al., 2013; Matuschek et al., 2017). Barr et al. (2013) argued for the "maximal structure": random intercepts by-items and by-individuals as well as random slopes by-items and by-individuals for each of the fixed-effects. 5 However, the maximal structure is not always possible for practical reasons, perhaps the chief being a limited number of observations. The minimum number of observations required is the sample size at the higher level multiplied by the number of random-effects. For example, including one random intercept and one random slope by-items, in a study with 40 items, requires at least 80 observations.

Using model simulations, Matuschek et al. (2017) reported that the minimal Type I error rate achieved by the maximal structure comes at the cost of statistical power, and is likely to be overly conservative. Therefore, a reduced random-effects structure is recommended; this can be determined through model comparison, such as by the Akaike Information Criterion (AIC; Sakamoto, Ishiguro, & Kitagawa, 1986) or likelihood ratio tests, but the starting point should be theoretical considerations. Just as the researcher must decide what variables to include as fixed-effects based on their research question and experimental design, the decision of which random-effects to include should be based on which are thought to vary across units. In the case of repeated measures SND, the most critical random-effects are going to be: (1) those related directly to the experimental conditions (e.g., treatment conditions), and (2) those related to time (e.g., session number). We suggest that random slopes for treatment conditions and time variables be included both by-individuals and by-items. In the spelling app example, this would mean including a random slope of time point both by-individuals and by-items.
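A sketch of such a comparison in lme4, with the particular random slopes chosen purely for illustration:

    library(lme4)

    # "Maximal" structure: random intercepts and slopes for time_point,
    # both by-individuals and by-items.
    m_max <- lmer(score ~ time_point
                  + (1 + time_point | individual) + (1 + time_point | item),
                  data = spelling)

    # Reduced structure: drop the by-items random slope.
    m_red <- lmer(score ~ time_point
                  + (1 + time_point | individual) + (1 | item),
                  data = spelling)

    AIC(m_max, m_red)    # lower AIC = better fit/complexity trade-off
    anova(m_red, m_max)  # likelihood ratio test between the nested models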

Effect sizes and p-values/confidence intervals

One advantage of LMEM is the ease of obtaining standardized effect sizes: the beta coefficients from the fitted models correspond to effect sizes. In logistic regression, used to analyze binomial data, the beta coefficients are reported in log-odds, which can be converted into changes in the odds or probability of a correct response. For this reason, LMEM is useful for meta-analysis or otherwise combining results across multiple studies (see Owens & Ferron, 2012; Ugille et al., 2012; Van den Noortgate & Onghena, 2003, 2008; Moeyaert, Ugille, Ferron, et al., 2014; Moeyaert et al., 2015).
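A minimal sketch of the conversion, assuming a binary correct column in the hypothetical spelling data:

    library(lme4)

    # Logistic LMEM for binary accuracy data:
    m_acc <- glmer(correct ~ time_point + (1 + time_point | individual),
                   data = spelling, family = binomial)

    fixef(m_acc)       # beta coefficients in log-odds
    exp(fixef(m_acc))  # odds ratios: e.g., 1.25 means a 25% increase in the
                       # odds of a correct response per unit of time_point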

Obtaining p-values is less straightforward and somewhat controversial (Kuffner & Walker, 2017). Specifically for LMEM, the challenge lies in enumerating degrees of freedom, which is complicated by random-effects. There are different formulas for this calculation, including the Kenward-Roger and Satterthwaite approximations, or the Wald-Z test for logistic regression. The first has been shown to provide accurate results for SND (Ferron et al., 2009; Heyvaert et al., 2017), and is available in most statistical software. Another option is the likelihood ratio test (LRT), which provides a Chi-square measure of goodness of fit when comparing two models. While it has been suggested that the LRT may be preferable in SND (see Moeyaert et al., 2014), we conducted a simulation study to compare the performance of the LRT and Wald-Z tests in logistic LMEM, and report the novel finding that the LRT better controls Type I error in the study we report with a sample size of 5 (Appendix 2).
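Both options can be obtained in R. A sketch, assuming a hypothetical log-transformed reaction time column log_rt for the continuous case:

    library(lme4)
    library(lmerTest)  # adds Satterthwaite df and p-values to lmer() output

    # Satterthwaite p-values for a continuous outcome:
    m_rt <- lmer(log_rt ~ time_point + (1 + time_point | individual),
                 data = spelling)
    summary(m_rt)  # with lmerTest loaded, includes Satterthwaite p-values

    # Likelihood ratio test for a logistic model: drop the fixed-effect of
    # interest while keeping the same random-effects structure.
    m_full <- glmer(correct ~ time_point + (1 + time_point | individual),
                    data = spelling, family = binomial)
    m_null <- glmer(correct ~ 1 + (1 + time_point | individual),
                    data = spelling, family = binomial)
    anova(m_null, m_full)  # Chi-square (LRT) test of the time_point effect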

Dysgraphia treatment: analysis and results

In the introduction, we presented a hypothetical dysgraphia study to situate the LMEM approach in the context of small-N training studies. Next, we present new results of an actual dysgraphia treatment study. In the hypothetical study, the question was whether practice with a spelling app improved accuracy. Time point (administrations 1–10 of the test) was the independent variable, and was modeled as a fixed-effect, with its effect size (beta coefficient) and its p-value the parameters of interest. We discussed including multiple random-effects to account for variability in outcomes due to factors related to differences across items and individuals.

In the actual dysgraphia study, participants ( N = 5) underwent spelling treatment for a number of sessions, under two different training schedules (manipulated within-participants). The goal, as in the hypothetical study, was to assess the amount of improvement in spelling ability per training session. However, the actual study additionally asks if this improvement differed for the two training schedules.

To demonstrate the use of LMEM, we present the results of two analyses and describe how to interpret the LMEM output and effect sizes. Analysis 1 examines the improvement in spelling accuracy attributed to the treatment, and also the difference between two treatment conditions. Analysis 2 examines the improvement in reaction time. A tutorial with full explanation for conducting these analyses using the software R ( R Core Team, 2016 ) appears in Appendix 1 .

We note that the approach demonstrated here is readily applied to the analysis of single case studies by simply removing the random-effects by-individuals but keeping the random-effects by-items. As long as the individual has been administered at least five items, a mixed-effects analysis is feasible. This highlights another benefit of analyzing the data as repeated measures (i.e., trial-by-trial) as opposed to as a time series: if the study includes fewer than five individuals and only one observation per time point, a mixed-effects analysis is not feasible.
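A sketch of this single-case variant, under the same hypothetical data, for a participant labeled "P01":

    library(lme4)

    # Single-case logistic LMEM: keep the by-items random-effects, drop the
    # by-individuals terms, and fit to one individual's data.
    m_single <- glmer(correct ~ time_point + (1 + time_point | item),
                      data = subset(spelling, individual == "P01"),
                      family = binomial)
    summary(m_single)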

Study rationale

The goal was to determine if there was a significant effect of training schedule (the spacing of learning trials) on the relearning of word spellings. There is an extensive literature with neurotypical adults examining the effects of the spacing of learning trials. Results very consistently show that distributing training trials across a session (or distributing training sessions across a training period) produces superior learning relative to massing training trials within a session (or massing training sessions across a training period) (see Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006 for a meta-analysis). Somewhat surprisingly, this issue has received scant attention in research on treatment of language disorders, though it is sometimes related to the more researched question of training intensity (e.g., Sage, Snell, & Lambon Ralph, 2011 ). Recently, Middleton, Schwartz, Rawson, Traut, and Verkuilen (2016) examined the effects of massed vs. distributed scheduling of training trials within sessions for individuals receiving treatment for naming deficits. Their findings, consistent with those reported for neurotypical individuals, showed clear superiority for distributed over massed schedules. Here, we evaluate two schedules that varied the spacing of the training trials: distributed and clustered schedules, with the latter a type of massed spacing. This manipulation allowed us to compare “bursts” of intense training across the training period with more regular, but less intense, repetition. More details are provided below.

Participants

Five college-educated individuals (2 females) ages 45–80 with acquired dysgraphia resulting from a single left hemisphere stroke participated in the study. They were, on average, 4 years post-stroke at enrollment and had no pre-morbid history of spelling or reading difficulties. See Table 2 for demographic information and language and cognitive testing results. It can be seen that there was considerable variability in performance levels for most language measures: word and pseudoword spelling, oral reading, and spoken picture naming. However, word comprehension was excellent for all but KST and, in terms of working memory, all had normal visual-spatial spans. Regarding their specific dysgraphia deficits, 2 had only orthographic working memory deficits, 1 only an orthographic long-term memory deficit, and 2 had mixed deficits (see Buchwald & Rapp, 2009, for information on these dysgraphia deficit types). The same treatment approach was used with all participants, given previous research showing its effectiveness across these deficit types (Rapp, 2005; Rapp & Kane, 2002).

Demographic information and results of language and cognitive testing for the five individuals (initials: DTE, KST, MSO, PQS, RHN). O-WM = orthographic working memory; O-LTM = orthographic long-term memory. JHU Word Length List and JHU Pseudoword List (Johns Hopkins University Dysgraphia Battery; Goodman & Caramazza, 1985); PALPA 35 (Psycholinguistic Assessments of Language Processing in Aphasia Task 35; Kay, Lesser, & Coltheart, 1992); NNB (Northwestern Naming Battery; Thompson et al., 2012); PPVT (Peabody Picture Vocabulary Test; Dunn & Dunn, 1981); Corsi spatial span (Mueller & Piper, 2014).

Individual | DTE | KST | MSO | PQS | RHN
Age | 80 | 61 | 45 | 54 | 75
Sex | F | M | M | M | F
Handedness | R | R | R | R | L
Education (years) | 18 | 14 | 18 | 18 | 19
Lesion location | L frontal/parietal | L frontal/parietal | L frontal/temporal | L frontal/parietal | L posterior frontal
Dysgraphia deficit type/s | O-WM | Mixed | Mixed | O-WM | O-LTM
Number of training sessions | 40 | 29 | 29 | 17 | 16
Number of training trials per item | 18 | 16 | 20 | 13 | 14
Spelling: JHU Word Length List (letter accuracy /336) | 59% | 82% | 33% | 81% | 89%
Spelling: JHU Pseudoword List (letter accuracy /99) | 66% | 64% | 19% | 83% | 92%
Auditory word comprehension: PPVT | 91st percentile | 10th percentile | 50th percentile | 99th percentile | 73rd percentile
Oral reading: PALPA 35 (/60) | 90% | 55% | 85% | 68% | 100%
Spoken picture naming: NNB (/66) | 94% | 56% | 83% | 97% | 98%
Forward digit span | 5 | 3 | 3 | 4 | 6
Corsi spatial span | 5 | 5 | 5 | 5 | 5

There were 5 study phases: pre-training evaluation, spelling training, post-training evaluation, 3 month waiting period, and follow-up evaluation. Here, we analyze data collected during the spelling training phase. Forty training words were selected for each individual such that baseline letter accuracy on each word was between 25% and 85%. A spell-study-spell technique ( Beeson, 1999 ; Rapp & Kane, 2002 ) was administered for approximately 90 min, typically 2×/week for an average of 13 weeks per individual. Each training trial was as follows: (1) the individual heard a target word, repeated it, and attempted to write the spelling. The accuracy and RT data for this response served as the data analyzed in this paper. (2) Regardless of accuracy, the individual was shown the correct spelling while the experimenter said aloud the word’s letters. The individual copied the word once. (3) If the word was spelled correctly at Step 1, then Step 3 was omitted and the experimenter continued to the next item; otherwise, the word was removed from view and the individual was asked to spell it. Steps 2 and 3 were repeated until the word was spelled correctly or for a maximum of 3 times before moving to the next item.

Each individual had one word set for each training schedule ( Figure 1 ). In the distributed training schedule, words were trained once per session every 2–3 sessions, while in the clustered schedule, words were trained 3–4 times within a session every 6–8 sessions. The total number of trials per word was the same regardless of schedule, ranging from 13–21 depending on the individual ( Table 2 ). Using the example in Figure 1 , “pen” would provide 12 observations of RT, and 36 (3 letters × 12 trials) observations of accuracy.

Figure 1. Example of the training schedules, either clustered ("cat") or distributed ("pen").

Table 3 reports summary results for each individual and the group average, showing general improvement across the training study and retention at follow-up.

Summary of training study data for each individual and group averages. Accuracy and RT are computed per letter. Observations for Accuracy Analysis = number of letters attempted over the course of the study. Observations for RT Analysis = number of words spelled correctly over the course of the study. S = seconds. First or last 25% indicates the mean during the initial or final quarter of the training program and follow-up accuracy refers to accuracy approximately 3 months after the end of treatment.

Individual | Observations for Accuracy Analysis | Observations for RT Analysis | Mean Accuracy, first 25% | Mean Accuracy, last 25% | Mean Accuracy, follow-up | Mean RT (s), first 25% | Mean RT (s), last 25%
DTE | 2484 | 394 | 84% | 95% | 87% | 10.4 | 7.3
KST | 2460 | 299 | 83% | 88% | 87% | 4.7 | 3.6
MSO | 2844 | 364 | 60% | 80% | 78% | 2.7 | 2.9
PQS | 2180 | 269 | 80% | 97% | 94% | 5.6 | 2.9
RHN | 2372 | 276 | 87% | 99% | 98% | 2.8 | 0.8
Group average | – | – | 78.8% | 91.8% | 88.6% | 5.24 | 3.50

Analysis 1: effect of treatment on spelling accuracy

This analysis addresses two questions: (1) is there a significant effect of training, on average, across individuals? and (2) is there a significant difference in training effectiveness between the two training schedules (distributed versus clustered)? With the clustered schedule, accuracy improved across the within-session trials (from the first to the fourth trial) but fell sharply from the fourth trial to the first trial of the subsequent session. This non-linear "sawtooth" pattern would need to be accounted for in the regression by including Within-Session Trial Number as an additional predictor. However, it would then become impossible to include items trained on the clustered and distributed schedules in the same model, because within-session trial number is not a relevant variable for the distributed schedule. Therefore, only the first within-session trials were included for the clustered condition (e.g., in Figure 1: the first trial with CAT on sessions 1, 9 and 17). This effectively reduced the amount of available data for this particular analysis by ~35%.

Model description

The model included six fixed-effects: the effects of Schedule (distributed versus clustered), Session (the session on which an item was trained), the interaction of Schedule × Session, DaysSince (the number of days since last training on that word), Word Length, and Word Frequency. Crossed random-effects were included both by-items (with a random intercept and slopes for Session and DaysSince) and by-individuals (with a random intercept and slopes for Schedule, Session, DaysSince). 6
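In lme4 formula syntax, the structure described above corresponds to something like the following sketch; the variable and data frame names are our shorthand, and the exact specification used in the study is given in Appendix 1:

    library(lme4)

    m_analysis1 <- glmer(
      accuracy ~ Schedule * Session + DaysSince + Length + Frequency
        + (1 + Session + DaysSince | item)                    # by-items
        + (1 + Schedule + Session + DaysSince | individual),  # by-individuals
      data = training_data, family = binomial
    )
    summary(m_analysis1)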

To illustrate the impact of including random-effects, Figure 2 depicts how different model structures would estimate the effect of Session depending on which random-effects were included. Without any random-effects (Figure 2(a)) the estimated effect of Session would be identical for all individuals. Including only a random intercept by-individuals (Figure 2(b)) would take into account that different individuals began training (Session = 0) with different performance levels, but would estimate the slope of Session to be the same for all individuals. Including only a random slope for the effect of Session by-individuals (Figure 2(c)) would account for different rates of improvement across individuals, but ignore the variability in initial spelling accuracy (the intercept). Finally, including both a random intercept and a random slope for Session by-individuals (Figure 2(d)) takes into account differences across individuals both in terms of initial accuracy and improvement rates. Allowing for variability in both intercepts and slopes can improve the model's ability to explain the data, resulting in more accurate estimates of the average effects.

Figure 2. Depiction of the results of different model structures: (a) fixed-effects-only, (b) random intercept by-individuals only, (c) random slope for the effect of Session by-individuals only, and (d) both random intercept and slope for the effect of Session by-individuals. The x-axes depict the training session and the y-axes the percent letter accuracy. In the random-effects models (b–d), separate regression lines for the effect of Session are estimated for each individual (DTE, KST, MSO, PQS, and RHN).

Lastly, because the by-items and by-individuals random-effects were crossed, item-specific effects (variability attributed to the different training words) were based on responses from all individuals who were trained on that item. For example, three of the participants were trained on "scout", and so variability from that word is estimated based on these three individuals' responses to it (if the random-effects were nested, the effect of "scout" would be calculated separately for each individual).

Model results and interpretation

The results of Analysis 1 are reported in Table 4 ; details of the LMEM structure are available in Appendix 1 (“Analysis 1: Model Formula & Description”).

Results of LMEM Analysis 1 (letter accuracy). *** p < 0.001, ** p < 0.01, * p < 0.05.

Predictor | Estimate | Std. Error | z value | LRT p
(Intercept) | 3.531 | 0.586 | 6.030 | 0.000***
Schedule | 0.035 | 0.173 | 0.202 | 0.627
Session | 1.316 | 0.207 | 6.345 | 0.001**
Frequency | 0.199 | 0.098 | 2.028 | 0.048*
Length | −0.195 | 0.095 | −2.046 | 0.042*
DaysSince | −0.217 | 0.091 | −2.385 | 0.059
Schedule × Session | 0.127 | 0.057 | 2.225 | 0.040*

Table 4 reports the following: the first column, Estimate, refers to the estimated value of the beta coefficients. The second column, Std. Error, is the standard error of the estimated betas. The last two columns present z-values of the estimated beta coefficients and the likelihood ratio test (LRT) p-value. The selection of the LRT p-value was made on the basis of the literature review as well as the results of an original Type I error simulation (see Appendix 2). The visualization in Figure 3 depicts the slopes of the fixed-effects (i.e., the expected change in spelling accuracy (y-axis) per change in the independent variables (x-axis)).

Figure 3. Visualization of the LMEM for Analysis 1 effects (using the effects package in R; see Appendix 1). Each panel depicts the predicted percent letter accuracy across sessions for: (a) Word Frequency, (b) Word Length, (c) DaysSince, and (d) the Session × Schedule interaction. The gray bands reflect 95% confidence intervals for the associated effects.

Table 4 shows that all predictors except Schedule and DaysSince are significant at p < 0.05. Critically, there is a significant effect of Session, indicating increased accuracy over time, as well as a significant interaction of Schedule × Session. Because Schedule was coded as +1 for distributed and −1 for clustered, the positive slope of the interaction term reveals that spelling accuracy improved more rapidly for the distributed compared to the clustered schedule. The generalized R^2 measure of variance explained (see Appendix 1 "Effect size measures") equaled 50.3% in total, with 22.0% explained by all of the fixed-effects and 10.8% by the effect of Session (i.e., 49% of the variance explained by fixed-effects was due to the effect of Session).

Summary of analysis 1

The results of this analysis confirm that the spell-study-spell training procedure was, on average, effective for improving the spelling accuracy of the five individuals. The beta for Session was 1.316 (p = 0.001). Because this is a logistic regression, exponentiating the beta values provides a standardized effect size, namely the odds ratio: e^β. Thus, the effect size of Session is e^1.316 = 3.73. This can be interpreted as ≈34% improvement per training session in the odds of a correct trial. An individual who began treatment with a 50% chance of correct responses would improve to 93% after 10 sessions (for details on interpreting the odds ratio, see Appendix 1 "Effect size measures").
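The log-odds arithmetic behind this kind of statement can be reproduced with base R's qlogis()/plogis(). A sketch using a hypothetical per-session odds improvement of ≈34%, chosen only to illustrate the conversion:

    # Hypothetical per-session improvement of ~34% in the odds:
    b_session  <- log(1.34)  # ~0.293 log-odds per session
    p_start    <- 0.5        # 50% chance of a correct response at baseline
    n_sessions <- 10

    # Add the cumulative log-odds change to the baseline log-odds,
    # then transform back to a probability:
    p_end <- plogis(qlogis(p_start) + b_session * n_sessions)
    p_end  # ~0.95, on the order of the improvement described above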

The results also revealed a significant interaction of Schedule × Session (beta = 0.127, p = 0.04), such that the effect of training was (1.316 − 0.127) = 1.189 for the clustered schedule and (1.316 + 0.127) = 1.443 for the distributed schedule. This means, for example, that an individual beginning with a 50% probability of producing a correct letter improved to ≈90% under the clustered schedule versus ≈96% under the distributed schedule after Session 10.

Two control variables were also significant, indicating that performance was worse on words that were longer (effect of Length = −0.195), and better on higher frequency words (effect of Frequency = 0.199). Altogether, the results of the LMEM reveal that training was effective in improving spelling accuracy, and more so for the distributed schedule, even after controlling for the significant effects of word length and frequency.

Overall the results of Analysis 1 reveal that the spell-copy-spell training procedure was successful in improving the spelling accuracy of these five individuals with acquired dysgraphia. The LMEM provides standardized measures of the effect sizes (in the form of beta coefficients and amount of variance explained) as well as p -values for statistical significance.

Analysis 2: effect of treatment on reaction time

The second analysis examines the improvement in reaction time (RT, measured as seconds per letter), using only trials with correct responses (Table 6 and Figure 4).

Figure 4. Visualization of the LMEM for Analysis 2 effects (using the effects package in R; see Appendix 1). Each panel depicts the predicted RT (seconds per letter) across sessions for: (a) DaysSince, (b) Word Frequency, (c) Length, and (d) the Session × Schedule interaction. The gray bands reflect 95% confidence intervals for the associated effects.

Results of LMEM Analysis 2 (RT per letter for words that were correctly spelled). * p < 0.05.

                    Estimate   Std. Error   t value   Satterthwaite
(Intercept)            1.062        0.342     3.109   0.036*
Schedule              −0.029        0.033    −0.875   0.416
Session               −0.247        0.063    −3.957   0.016*
Frequency             −0.018        0.028    −0.617   0.538
Length                −0.021        0.028    −0.755   0.451
DaysSince             −0.004        0.024    −0.146   0.889
Schedule*Session       0.015        0.017     0.910   0.363

Summary of analysis 2

It should be noted that the beta coefficients reported in Table 6 are not log-odds (as in Analysis 1), but instead indicate the expected change in log RT per unit of each predictor.

The results reveal a significant effect of training (beta for Session = −0.247), which equates to an approximately 3% decrease in RT per training session. No other predictors were significant in Analysis 2. Unlike the findings regarding accuracy, for RT there was no interaction of Schedule × Session, indicating no significant advantage of one treatment schedule over the other for RT on correct trials.
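
As a quick check on this arithmetic, a minimal sketch (assuming, as in Analysis 1, that one standardized unit of Session spans roughly 8 actual sessions):

    exp(-0.247)                  # ~0.78: RT is multiplied by ~0.78 per standardized unit of Session
    (1 - exp(-0.247)) / 8 * 100  # ~2.7%, i.e., roughly a 3% decrease in RT per actual session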

General discussion

A statistical approach to analyzing SND is desirable in order to obtain standardized effect sizes and objective measures of significant outcomes, both of which are difficult to achieve with visual analysis. Researchers are faced with an array of choices in statistical techniques, the applicability of which depends on the experimental design. In this paper, we argued for using LMEM to analyze treatment effects in repeated measures studies for two primary reasons: 1) By modeling all individual observations instead of averaging scores within time points (i.e., trial-by-trial repeated measures analysis, not time series), more information and more data points are retained and, furthermore, predictors related to the specific items (e.g., word frequency) can be included. 2) The addition of random-effects improves upon OLS regression by modeling the possibility that differences across units (e.g., items, individuals) have effects on the outcome measure.

The LMEM analysis demonstrated here uses a “trial-based” approach, distinguishing itself from the common approach of averaging together multiple observations collected within a session. For example, a trial-based approach provides 100 data points if ten measures are collected on each of 10 training sessions, compared to only 10 data points under the common approach. A trial-based approach is particularly beneficial for SND, where the paucity of data is a concern. We focused on logistic regression, which can be used to analyze data types that are more common in treatment studies (i.e., accuracy or other binomially distributed data). We also conducted an original simulation study confirming that the Wald-Z test may lead to inflated Type I error in logistic LMEM in the analysis of SND, and recommended the use of the LRT for obtaining p-values.

There remain outstanding questions about using multilevel modeling with SND, as outlined by Shadish et al. (2013). Briefly, the two major issues are statistical power and autocorrelation. Regarding the former, while we have demonstrated that LMEM is viable with as few as five participants, a relatively large number of observations within these participants is necessary for sufficient statistical power. It is impossible to state an exact number of required observations, because power depends on a number of factors, perhaps most crucially the size of the treatment effect, which is often not known in advance (see also Ferron, Farmer, & Owens, 2010; Heyvaert et al., 2017; Baayen et al., 2008; Barr et al., 2013; Huber et al., 2015; Matuschek et al., 2017). For increasing the number of observations, we have highlighted the advantage of modeling individual trials, rather than averaging within-session observations into a single measure, which, of course, is only relevant where there are multiple observations or items tested within sessions.

In addition to limitations of statistical power, any regression-based approach in which time is a variable faces the issue of autocorrelation. We have underscored, however, that estimation of the fixed-effects is unbiased in the presence of autocorrelation and, moreover, there is reason to believe that a complex random-effects structure (e.g., with random intercepts and slopes for effects of treatment and variables related to time) sufficiently addresses the issue of autocorrelation (see Shadish et al., 2013 ). In summary, the primary limitations of LMEM in the context of SND are low statistical power (a concern for all SND analyses) and potentially poor estimation of the random-effects parameters. However, the goal of this paper was not to solve these limitations per se, but rather to explain LMEM to aphasia researchers, and provide practical guidance to inform decisions about experimental design and analysis.

LMEM compared to other approaches

We have advocated for a LMEM approach to longitudinal repeated measures data, but what about alternative statistical approaches? As referenced in the introduction, LMEM is not the only approach available for the statistical analysis of SND. However, LMEM has a number of advantages. It has long been recognized that LMEM improves upon ANOVA (e.g., Gueorguieva & Krystal, 2004; Baayen et al., 2008) by addressing common pitfalls, including unequal spacing of observations and missing data, and in the worst case LMEM will return results identical to a comparable ANOVA. Regarding other advantages, in the treatment study we reported on, time (Session and DaysSince) was treated as a continuous variable, which is not possible in the ANOVA framework. Another challenge for ANOVA is addressing the variability across both individuals and items: the traditional approach of collapsing across each separately to compute an F1 (by-individuals) and F2 (by-items) test is inherently problematic (e.g., when a significant result is obtained in the F1 but not the F2 ANOVA). Note that a by-individuals ANOVA is equivalent to a LMEM with a random intercept by-individuals (and no other random-effects), and similarly a by-items ANOVA is equivalent to a LMEM with a random intercept by-items only. Mixed-effects models avoid this F1–F2 problem by allowing simultaneous modeling of random-effects for both individuals and items, as well as the possibility of including both random intercepts and slopes. Furthermore, in LMEM missing data points result in losing only those observations, as opposed to the common practice of listwise deletion in ANOVA, which results in losing an entire individual’s data.

In short, LMEM provides all of the benefits of other statistical approaches like ANOVA and OLS regression, and at the very least will produce the same results and lead to the same interpretation. However, in most cases, LMEM is likely to provide better results: it handles missing data and unevenly spaced observations, allows more flexibility in modeling both dependent variables (e.g., continuous, binomial, count data) and independent variables (e.g., time can be treated as categorical or continuous), and simultaneously models both across-individual and across-item variability. LMEM is also well-documented and well-supported in multiple statistical software platforms.

Recovery of function in acquired dysgraphia

Although the focus of this paper has been the statistical evaluation of training data from SND, we will briefly summarize and discuss four aspects of the findings concerning the dysgraphia treatment itself. The first two points concern the general success of the treatment protocol, while the latter two address current issues in learning and memory in the context of treatment of cognitive deficits.

First, the study provides additional evidence of positive outcomes in remediation of acquired dysgraphia (e.g., Beeson, 1999; Rapp & Kane, 2002). In addition to the group analyses, we also estimated treatment effects for each individual, by computing LMEMs identical to those reported here (Analysis 1 and Analysis 2; see Appendix 1), dropping the random-effects by-individuals (as each model was for one individual) but keeping the by-items effects. The results revealed that each of the five individuals had statistically significant gains in spelling accuracy for the trained words and 4 of 5 improved their writing times. 7 Furthermore, although not the focus of the analyses that were presented, all five maintained their gains at the 3-month follow-up (see Table 3). Second, as also found in Rapp and Kane (2002) and Rapp (2005), the same treatment approach was effective for individuals with underlying spelling deficits affecting orthographic working memory, orthographic long-term memory, or both. Why the same treatment should be successful with different deficits is not entirely clear, and future research directed at understanding this would be useful for developing remediation approaches tailored to the underlying deficits.

With regard to mechanisms of learning and re-learning, the success of the spell-copy-spell treatment approach is relevant to the debate regarding errorless/errorful learning, since it falls within the class of errorful learning methods. On each trial, individuals are first asked to spell the word and are allowed to produce errors. Then, during the subsequent copy portion of the trial, they are given feedback and practice and, finally, they are asked to try again to retrieve and produce the spelling. While this study did not compare the benefits of errorful vs. errorless learning, it does provide evidence that errorful learning can be successful (see Middleton & Schwartz, 2012 for a review and similar conclusions). Interestingly, research on learning and memory suggests that the success of this type of approach likely lies in the opportunities for repeated retrieval attempts in the “spell” components of each trial. Specifically, research with neurotypical individuals finds that learning improves more when the learner attempts to recall information (whether successfully or not) rather than simply restudying it; this benefit of retrieval over study is referred to as the “retrieval practice” effect (for a review, see Roediger & Karpicke, 2006).

Finally, and perhaps most significantly, we find a clear benefit for a distributed compared to a clustered training schedule. Recall that for each individual, all items were trained for the same number of trials, but half the items were trained using a clustered schedule (3–4 trials per session) and half with a distributed schedule (1 trial per session). Results revealed that, for the group, the distributed compared to the clustered schedule produced significantly greater accuracy. This finding is consistent with the many studies in the neurotypical learning and memory literature that support distributed over massed practice ( Cepeda et al., 2006 ) as well as more recent work with individuals with aphasia ( Middleton et al., 2016 ; Sage et al., 2011 ). These findings should provide some guidance to clinicians in planning treatment schedules.

Conclusions

This paper provides a strong case for the application of LMEM to the analysis of training studies as a preferable alternative to visual analysis or other statistical techniques. We have presented evidence that the approach is robust under the extreme conditions of small numbers of individuals, in the context of repeated measures training data for both continuous (reaction time) and binomially distributed (accuracy) dependent measures. The approach provides standardized measures of effect sizes that are obtained through readily available and well-supported statistical programs (i.e., R; R Core Team, 2016), and provides statistically rigorous estimates of the expected average effect size of training effects, taking into account variability across both items and individuals.

Acknowledgments

We are very grateful to Jennifer Shea for her many valuable contributions to this project, from individual testing through data analysis. We also thank Colin Wilson for guidance on statistical theory. This work was supported by the multi-site NIH-supported grant DC006740 examining the neurobiology of language recovery in aphasia.

This work was supported by the National Institutes of Health [DC006740].

Appendix 1. Tutorial

R packages and set-up.

First, the following R packages must be installed: lme4 (for fitting the models; Bates, Maechler, Bolker, & Walker, 2015 ); lmerTest (for obtaining Satterthwaite p -values; Kuznetsova, Brockhoff, & Christensen, 2017 ); effects (for visualizing effects; Fox, 2003), and MuMIn (for calculating variance explained; Bartoń, 2009 ):
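
A minimal sketch of the installation step:

    # One-time installation of the four packages used in this tutorial
    install.packages(c("lme4", "lmerTest", "effects", "MuMIn"))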

Then the packages must be loaded:
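
For example:

    library(lme4)      # fitting mixed-effects models (lmer/glmer)
    library(lmerTest)  # Satterthwaite p-values for lmer models
    library(effects)   # visualizing model effects
    library(MuMIn)     # generalized R-squared (r.squaredGLMM)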

The data should be loaded, which can be done from a number of formats. For example, a comma separated values file (“.csv”) from Excel, with the data for every trial and every individual, can be loaded:
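
For example (the file name here is hypothetical):

    group_data <- read.csv("group_data.csv")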

The data frame group_data now exists, which in this case contains the following columns:

  • timeperletter: RT on each trial
  • letterpercent: proportion of letters spelled correctly
  • DaysSince: number of days since last presentation of the target word
  • Frequency: (log) word frequency (using SUBTLEX-US, Brysbaert & New, 2009 )
  • Length: number of letters in the word
  • Schedule: either Clustered or Distributed
  • Session: training session number
  • Individual: The individual’s initials
  • Target: The word spelled.

Consideration should be given to whether continuous variables should be mean-centered or standardized (mean-centered and divided by the standard deviation); this is recommended in particular when there are interaction terms in the model, as it will reduce multicollinearity. In the analyses presented here, all continuous variables were standardized. Any categorical variables that will be entered as predictor variables should be indicated as such to R, e.g., the variable Schedule, which can be referred to within the group_data frame by use of the $ symbol:
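
For example:

    group_data$Schedule <- factor(group_data$Schedule)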

The appropriate contrasts for the levels of the categorical variable must also be set. By default, dummy coding is used, e.g., a two-level categorical variable is coded as 0 and 1 (with the first level, alphabetically, coded as 0). To set the contrasts to sum coding, e.g., the Schedule variable coded as −1 and 1:
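
One way to do this (a sketch; with the levels in alphabetical order, this assigns Clustered = −1 and Distributed = +1):

    contrasts(group_data$Schedule) <- c(-1, 1)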

With dummy coding, the intercept reflects the mean for the level of the variable that is coded as 0 (the “reference level”), and the betas associated with that variable reflect the difference between the level(s) of the variable coded as 1 and the reference level. In sum coding, the intercept instead reflects the grand mean (the mean of means for each level of the variable), and the betas associated with that variable reflect the difference between the level(s) of the variable coded as 1 and the grand mean. Sum coding may be preferable to dummy coding in the presence of higher-order interactions (e.g., the interaction between two categorical variables) because it is more interpretable (always reflecting the difference between some level of the categorical variable compared to the overall effect). See the R documentation, command help("contrasts"), for more information and alternative coding schemes. The data are now ready for analysis.

Analysis 1: model formula and description

The dependent variable of accuracy is measured as the proportion of letters correct in each word, and is modeled as a binomial distribution: e.g., the word “cat” is modeled as three trials, one for each letter. In R, the left-hand side of the equation represents the dependent variable, and the right hand side the formula for the fixed-effects, followed by the random-effects. For this analysis, the formula specified was:
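
The formula is reconstructed here as a sketch consistent with the description below and with footnotes 6 and 8; nLetters is a hypothetical column holding the raw letter count of each word (needed for the weights, since Length was standardized for use as a predictor):

    model1 <- glmer(letterpercent ~ Schedule * Session + DaysSince + Frequency + Length
                    + (1 + Schedule + Session + DaysSince | Individual)
                    + (1 + Session + DaysSince | Target),
                    data = group_data, family = binomial, weights = nLetters)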

The first line of the formula can be interpreted as follows: we make a call to the function glmer (generalized linear mixed-effects, used for logistic regression) and specify that the dependent variable (letterpercent) is predicted to be a linear combination of the independent variables Schedule, Session, DaysSince, Frequency, and Length. The * indicates an interaction between Schedule and Session, i.e., testing the hypothesis that the effect of training (Session) may differ across schedules (clustered vs. distributed). The second and third lines specify the random-effects structure; in this analysis, we have included random intercepts and slopes for both the effect of treatment and all time-related variables, by-individuals and by-items. 8 The last line specifies that the dependent variable should be modeled as binomially distributed, and that the weights for each trial are equal to the length of the word.

The function coef() can be wrapped around the function summary() to return the results of the model for the fixed-effects, as follows (see Table 4 ):
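
For example:

    coef(summary(model1))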

Effect size measures

As has already been stated, the beta coefficients themselves provide an effect size measure. For logistic regression, these betas are log-odds ratios, and so are standardized measures of effect size. In this model, model1, they represent the increase in the odds of correctly spelling a word per training session. As presented in Table 4, the beta for Session = 1.316 can be exponentiated to give the odds ratio: e^1.316 = 3.73. This odds ratio of 3.73 indicates a 273% increase in the odds of a correct trial per unit increase of the predictor. In this model, one unit of the predictor Session corresponded to approximately 8 actual training sessions (because the measure was standardized), and so equates to 273/8 ≈ 34% improvement per single training session in the odds of a correct trial. Odds can be transformed into more-familiar probabilities via the formula P = Odds / (Odds + 1). In this model, e.g., an individual who had even (50/50) odds of producing a correct letter during a trial in Session 1 would improve to odds of (50 × 1.34)/50 = 67/50 = 1.34 in Session 2. Those odds are interpretable as a probability of 1.34 / (1 + 1.34) ≈ 57%.
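
The same conversions in R (a sketch of the arithmetic above):

    exp(1.316)                  # 3.73: odds ratio per standardized unit (~8 sessions)
    (exp(1.316) - 1) * 100 / 8  # ~34% increase in the odds per actual session
    odds <- 1 * 1.34            # even (50/50) odds in Session 1, improved by Session 2
    odds / (odds + 1)           # ~0.57 probability of a correct letter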

An alternative effect size measure can be obtained in the form of a generalized R² measure, that is, the amount of variance explained. The function r.squaredGLMM() from the MuMIn package can be used for this purpose. It provides two measures of variance explained: marginal (fixed-effects only) and conditional (fixed plus random-effects). One can compare the amount of variance explained by the full fitted model to that explained by a reduced model to determine how much variance a single predictor explains. For example, we can determine how much variance is explained overall by the effect of Session; first, how much variance is explained in model1?
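
For example:

    r.squaredGLMM(model1)  # marginal (fixed-effects only) and conditional (fixed + random) R^2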

We can then fit a reduced model that removes the fixed-effect of Session, and attribute the difference in the amount of explained variance to the effect of Session:
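
A sketch (whether the Schedule:Session interaction is also dropped alongside the Session main effect is a modeling choice not specified here):

    model1_no_session <- update(model1, . ~ . - Session - Schedule:Session)
    r.squaredGLMM(model1)             # full model
    r.squaredGLMM(model1_no_session)  # reduced model; the difference is attributed to Session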

Calculating the LRT p-value

The default p-value returned from the summary of the glmer() function is based on the Wald-Z statistic. An alternative is to conduct a likelihood ratio test (LRT) using the built-in anova() function. As with the approach for isolating the amount of variance explained by a particular (set of) predictors, the LRT relies on model comparison. For example, to calculate a p-value for the interaction of Session and Schedule, another reduced model can be fit that is identical to model1 only without the interaction term; this model and the original are then called together with the anova() function, returning the following output (Table 5):
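
For example:

    # Identical to model1 but without the interaction term
    model1_no_interaction <- update(model1, . ~ . - Schedule:Session)
    anova(model1_no_interaction, model1)  # likelihood ratio test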

LRT for interaction term in model1. AIC = Akaike Information Criterion, BIC = Bayesian Information Criterion, logLik = log likelihood, Chi Df = chi-squared degrees of freedom, Pr(>Chisq) = p-value.

                       Df   AIC      BIC      logLik   deviance   Chisq   Chi Df   Pr(>Chisq)
model1_no_interaction  22   4421.7   4547.4   −2189    4378
model1                 23   4420     4551     −2187    4374       4.207   1        0.040

This output returns for each model the degrees of freedom (Df), the Akaike and Bayesian Information Criterion measures (AIC and BIC), the log likelihood (logLik) and deviance, and the Chi-squared value based on the degrees of freedom (here, 1, because of the 1 additional degree of freedom used by adding the interaction term to the model). Most importantly, the last column gives the p-value of the Chi-squared test, here testing the interaction between Schedule and Session and calculated to be 0.04. This compares to the Wald-Z p-value of 0.026 returned by the summary() function. See Appendix 2 for the results of a Type I error simulation study that confirms the greater Type I error inflation for the Wald-Z p-values relative to the LRT.

Visualizing results

The effects calculated in the fitted model can be visualized by using the “effects” package. Figure 2 depicts all of the effects included in model1 and can be accessed using the following commands:
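
For example (allEffects() computes displays for all fixed-effects in the fitted model):

    plot(allEffects(model1))  # one panel per fixed-effect term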

A note on pre/post or follow-up designs

Pre-test/post-test designs or follow-up designs can also be analyzed with LMEM. For example, in the dysgraphia treatment study reported here and analyzed as model1, individuals returned for follow-up tests where they spelled each of their training words. To determine whether their improved spelling performance was retained, a separate LMEM could be computed, using the scores from the final training session as post-training scores and comparing them to the follow-up scores. The primary difference between such a model and the ones presented here is that with only two time points, the time point variable (which previously was “Session”) becomes a categorical variable 9 encoding whether an observation was associated with the post-training or follow-up time point. Aside from that, the same method can be used, still without averaging items, so that factors associated with them (like word frequency) can be accounted for. In such a model, the main effect of the time point would indicate if there was a significant change in performance after the end of training; the interaction Time × Schedule would indicate whether this change was significantly different between treatment schedules.
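
A sketch of such a model, assuming a hypothetical data frame pf_data restricted to the final training ("post") and follow-up observations, with nLetters as before:

    pf_data$TimePoint <- factor(pf_data$TimePoint, levels = c("post", "follow-up"))
    contrasts(pf_data$TimePoint) <- c(-1, 1)  # sum coding, per footnote 9
    model_followup <- glmer(letterpercent ~ Schedule * TimePoint + Frequency + Length
                            + (1 + TimePoint | Individual) + (1 | Target),
                            data = pf_data, family = binomial, weights = nLetters)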

Analysis 2

The second analysis demonstrates how to apply the same approach used in Analysis 1 to a continuous outcome variable, in this case reaction time (RT). The main difference relevant to the tutorial is that instead of logistic regression and the function glmer(), continuous variables are modeled with regular (i.e., non-logistic) regression and the function lmer().

Model formula and description

The dependent variable is RT, restricted to trials where the individual spelled the word correctly. This results in a reduced number of data points, and is the typical practice for RT analyses. It is worth noting that, because of the reduced number of data points for the RT analysis, some of the individuals have fewer data points than are needed to include random-effects for the effects of Schedule, Session, and DaysSince; this is one advantage of analyzing the individuals as a group, because the total number of observations combining across all five individuals is sufficient to include all of these random-effects. The model formula is identical to that used for model1 in Analysis 1, except that the dependent variable is (log) RT and we call the function lmer() instead of glmer(). For this linear regression, we also do not specify the family as binomial, nor do we specify the weights for the trials.
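
A sketch of this model, assuming correct trials are those with letterpercent equal to 1:

    correct_data <- subset(group_data, letterpercent == 1)
    model2 <- lmer(log(timeperletter) ~ Schedule * Session + DaysSince + Frequency + Length
                   + (1 + Schedule + Session + DaysSince | Individual)
                   + (1 + Session + DaysSince | Target),
                   data = correct_data)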

The results of this model are presented in Table 6. With the lmerTest package loaded, the summary() function reports p-values based on the Satterthwaite approximation for degrees of freedom.
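
For example:

    summary(model2)  # with lmerTest loaded, p-values use the Satterthwaite approximation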

Appendix 2. Type I error simulation

As indicated in the Introduction, there is discussion in the literature on the best method for estimating p-values in LMEM. The Wald-Z and LRT represent commonly used approaches that make different statistical assumptions and thus can yield different p-value estimates. In order to provide further understanding of this issue, we conducted an original simulation study to compare the Type I error rates associated with the Wald-Z and the LRT in the LMEM for Analysis 1. To do so, we generated artificial datasets by sampling from a distribution designed such that the average beta values for the variables of Session and the Schedule × Session interaction were zero, with standard deviations equal to those observed for the real dataset. We then subjected each of those datasets to LMEM analysis, evaluating the results using the Wald-Z and LRT approaches. If the Type I error of the statistical test equals 0.05, then in 1000 simulations we should obtain approximately 50 p-values below 0.05. Table 7 reports the Type I error rates based on this simulation study: the probability of falsely finding a significant effect of Session or the interaction Schedule × Session when the effect (by design) is not different from zero. These results reveal that for the main effect of Session, Type I error is near the desired criterion of 0.05, although it is somewhat higher for the Wald-Z compared to the LRT (0.064 versus 0.060). In the case of the interaction Schedule × Session, Type I error is highly inflated under the Wald-Z (0.116) but remains near 0.05 under the LRT (0.059). These results indicate that the Wald-Z method leads to inflated Type I error relative to the LRT, as has been suggested elsewhere (e.g., Moeyaert et al., 2014), and for this reason we recommend the use of the LRT, particularly in the case of SND and generalized LMEM (e.g., analysis of binomial data). See Appendix 1 (“Calculating the LRT p-value”) for details on how to compute the LRT p-value.

Results of Type I error simulation for the main effect (Session) and interaction (Schedule*Session), comparing the Wald-Z p-value and LRT p-value.

          Session   Schedule*Session
Wald-Z    0.064     0.116
LRT       0.060     0.059

Type I error simulation procedure

A simulation-based procedure was used to evaluate the Type I error rate resulting from two methods for obtaining p-values from the logistic LMEM: the Wald-Z test, which is the default method in the lme4 package in R, and the LRT as outlined in Analysis 1. Briefly, we simulated 1,000 datasets based on the fitted model1 presented in Analysis 1, setting the beta coefficients for Session and the Schedule*Session interaction to 0. We then refit the model to each of these simulated datasets and obtained the p-values for both Session and Schedule*Session using each of the two p-value methods. The results of these simulations are presented in Table 7.

The results of the simulation indicate that for the fixed-effect of Session, the Wald-Z test performed near the desired Type I error level of 0.05, specifically 0.064. The LRT p-value for the fixed-effect of Session was also near the nominal Type I error level, and slightly better than the Wald-Z test, specifically 0.060.

For the interaction effect of Session*Schedule, a noticeable difference was found between the two methods. The Wald-Z test resulted in decidedly inflated Type I error, with a rate of 0.116. In this case, the LRT clearly outperformed the Wald-Z, as it maintained performance near the level of 0.05, specifically 0.059. This result confirms what has been suggested elsewhere (Moeyaert et al., 2014), and provides evidence that the LRT is preferable to the Wald-Z test for logistic LMEM, at least in the case of SND.

Models fit by the lme4 package can be used to simulate hypothetical datasets by first extracting the relevant parameters. For example, they can be obtained from logistic LMEM analyzing accuracy, model1:
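
For example:

    # Extract the fitted fixed-effects (beta) and random-effects (theta) parameters
    params <- getME(model1, c("beta", "theta"))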

These parameters can be selectively edited to assess Type I error (or statistical power) for various effect sizes by editing the values of the parameters of interest. For example, to assess Type I error for the effect of Session and the interaction Session*Schedule in this model:
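
A sketch; the coefficient labels are assumptions and should be checked against names(fixef(model1)):

    beta_null <- params$beta
    names(beta_null) <- names(fixef(model1))
    beta_null[c("Session", "Schedule1:Session")] <- 0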

This sets the mean of the sampling distribution for these betas to zero. To assess the Type I error rate, hypothetical datasets are then generated from these new parameters, and models are refit to the simulated data:
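
For example:

    nsim <- 1000
    dv.sim <- simulate(model1, nsim = nsim,
                       newparams = list(beta = beta_null, theta = params$theta))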

The number of simulations (“nsim”) determines the number of new sets of simulated dependent variables, each of which is stored in a column of “dv.sim”. Models are then fit to each of these simulated sets of observations with the same model formula used for the original model1. The desired output(s) of each model should be stored; this can be done, e.g., using a for loop:
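
A sketch of the loop, storing the Wald-Z p-values (the LRT p-values would come from an analogous anova() comparison against a reduced model); the coefficient labels are the same assumptions as above:

    pvals <- matrix(NA, nrow = nsim, ncol = 2,
                    dimnames = list(NULL, c("Session", "Schedule1:Session")))
    for (i in 1:nsim) {
      fit_i <- refit(model1, dv.sim[[i]])  # refit model1 to the i-th simulated response
      pvals[i, ] <- coef(summary(fit_i))[colnames(pvals), "Pr(>|z|)"]
    }
    colMeans(pvals < 0.05)  # estimated Type I error rates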

1. The distinction between a repeated-measures design and a time series is that while the former provides multiple observations of the same variable at the same time point, the latter provides only a single observation per time point.

2. Nesting the effect of items within individuals would create separate effects for each word in each individual, e.g., allowing “house” to show a steeper training effect overall in one individual but a shallower effect in another individual. In the analysis we present, by crossing items with individuals, we assume instead that any differences in the effect of training the word “house” not accounted for by other variables, should be common across all individuals included in the study.

3. Type I error is the rejection of the null hypothesis when it is actually true, i.e., incorrectly finding an effect as significantly different from zero at some α threshold (typically 0.05) when it is not. In a treatment study this would correspond to a false positive—reporting an effect as significant when it is not.

4. To be clear, it is statistical independence of the errors that is a basic assumption of linear regression, not of the dependent variable itself. It is natural for data points observed from the same individual across time to show a relationship, due to any number of underlying factors. The problem of autocorrelation arises if this relationship is not accounted for by the independent variables in the model and/or a statistical method of control, and thus remains in the residual errors.

5. Barr et al. (2013) clarify that “if a factor [i.e., a fixed-effect] is a between-unit factor, then a random intercept is usually sufficient” (p. 275). The point here is that if any given unit (e.g., any specific item) only takes on one value for some predictor, then a random slope for that predictor should not be included (by definition, a slope cannot be estimated based on only one value). Thus, e.g., a random slope for word frequency does not make sense by-items, as word frequency only varies between words, not within.

6. Random slopes were included for all variables that related to time (i.e., Session and DaysSince) both by-items and by-individuals, and for Schedule by-individuals, on the basis that they are the critical variables of interest. A random slope for Schedule by-items was not included because no item appeared in both schedules (i.e., each word was either on a Clustered or a Distributed schedule but never both).

7. The choice of modeling a small number of individuals as a group or as separate individuals depends on both theoretical and practical questions. Theoretically, if the interest is in estimating an effect for an average individual, then the participants should be modeled as a group. If instead the interest is in measuring the effects specific to each individual, then each can be modeled separately. There may, however, be insufficient data for individual LMEM models (e.g., a single observation per individual per time point), but sufficient data when combining the individuals into a group.

8. The random-effects structure in model1 is of crossed random-effects for participants and items. The alternative, nested random-effects, for participants and items would be modeled as (1 + Schedule + Session + DaysSince | Target:Individual).

9. We recommend in this situation using sum-coding (i.e., coding “post” as −1 and “follow-up” as +1) as opposed to treatment or dummy-coding (coding as 0 and 1).

Disclosure statement

No potential conflict of interest was reported by the authors.

References

  • Baayen RH, Davidson DJ, & Bates DM (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412. doi: 10.1016/j.jml.2007.12.005
  • Baek EK, & Ferron JM (2013). Multilevel models for multiple-baseline data: Modeling across-participant variation in autocorrelation and residual variance. Behavior Research Methods, 45(1), 65–74. doi: 10.3758/s13428-012-0231-z
  • Baek EK, Moeyaert M, Petit-Bois M, Beretvas SN, Van Den Noortgate W, & Ferron JM (2014). The use of multilevel analysis for integrating single-case experimental design results within a study and across studies. Neuropsychological Rehabilitation, 24(3–4), 590–606. doi: 10.1080/09602011.2013.835740
  • Barr DJ, Levy R, Scheepers C, & Tily HJ (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. doi: 10.1016/j.jml.2012.11.001
  • Bartoń K (2009). MuMIn: Multi-model inference. R package version 1.40.4. http://r-forge.r-project.org/projects/mumin/
  • Bates D, Maechler M, Bolker B, & Walker S (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. doi: 10.18637/jss.v067.i01
  • Beeson PM (1999). Treating acquired writing impairment: Strengthening graphemic representations. Aphasiology, 13(9–11), 767–785. doi: 10.1080/026870399401867
  • Borckardt JJ, & Nash MR (2014). Simulation modelling analysis for small sets of single-subject data collected over time. Neuropsychological Rehabilitation, 24(3–4), 492–506. doi: 10.1080/09602011.2014.895390
  • Brysbaert M, & New B (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. doi: 10.3758/BRM.41.4.977
  • Buchwald A, & Rapp B (2009). Distinctions between orthographic long-term memory and working memory. Cognitive Neuropsychology, 26(8), 724–751. doi: 10.1080/02643291003707332
  • Burns MK (2012). Meta-analysis of single-case design research [special issue]. Journal of Behavioral Education, 21(3). doi: 10.1007/s10864-012-9158-9
  • Campbell JM (2004). Statistical comparison of four effect sizes for single-subject designs. Behavior Modification, 28(2), 234–246. doi: 10.1177/0145445503259264
  • Campbell JM, & Herzinger CV (2010). Statistics and single subject research methodology. In Gast DL (Ed.), Single subject research methodology in behavioral sciences (pp. 91–109). New York, NY: Routledge.
  • Cepeda NJ, Pashler H, Vul E, Wixted JT, & Rohrer D (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132(3), 354–380. doi: 10.1037/0033-2909.132.3.354
  • Corballis MC (2009). Comparing a single case with a control sample: Refinements and extensions. Neuropsychologia, 47(13), 2687–2689. doi: 10.1016/j.neuropsychologia.2009.04.007
  • Crawford JR, Garthwaite PH, & Howell DC (2009). On comparing a single case with a control sample: An alternative perspective. Neuropsychologia, 47(13), 2690–2695. doi: 10.1016/j.neuropsychologia.2009.04.011
  • Crawford JR, & Howell DC (1998). Comparing an individual’s test score against norms derived from small samples. The Clinical Neuropsychologist, 12(4), 482–486. doi: 10.1076/clin.12.4.482.7241
  • Davis DH, Gagné P, Fredrick LD, Alberto PA, Waugh RE, & Haardörfer R (2013). Augmenting visual analysis in single-case research with hierarchical linear modeling. Behavior Modification, 37(1), 62–89. doi: 10.1177/0145445512453734
  • Dedrick RF, Ferron JM, Hess MR, Hogarty KY, Kromrey JD, Lang TR, & Lee RS (2009). Multilevel modeling: A review of methodological issues and applications. Review of Educational Research, 79(1), 69–102. doi: 10.3102/0034654308325581
  • Dunn LM, Dunn LM, Bulheller S, & Häcker H (1981). Peabody Picture Vocabulary Test–Revised. Circle Pines, MN: American Guidance Service.
  • Evans JJ, Gast DL, Perdices M, & Manolov R (2014). Single case experimental designs: Introduction to a special issue of Neuropsychological Rehabilitation. Neuropsychological Rehabilitation, 24(3–4), 305–314. doi: 10.1080/09602011.2014.903198
  • Ferron JM, Bell BA, Hess MR, Rendina-Gobioff G, & Hibbard ST (2009). Making treatment effect inferences from multiple-baseline data: The utility of multilevel modeling approaches. Behavior Research Methods, 41(2), 372–384. doi: 10.3758/BRM.41.2.372
  • Ferron JM, Farmer JL, & Owens CM (2010). Estimating individual treatment effects from multiple-baseline data: A Monte Carlo study of multilevel-modeling approaches. Behavior Research Methods, 42(4), 930–943. doi: 10.3758/BRM.42.4.930
  • Gelman A, & Hill J (2006). Data analysis using regression and multilevel/hierarchical models. New York, NY: Cambridge University Press.
  • Goodman RA, & Caramazza A (1985). The Johns Hopkins University dysgraphia battery. Baltimore, MD: Johns Hopkins University.
  • Gueorguieva R, & Krystal JH (2004). Move over ANOVA: Progress in analyzing repeated-measures data and its reflection in papers published in the Archives of General Psychiatry. Archives of General Psychiatry, 61(3), 310. doi: 10.1001/archpsyc.61.3.310
  • Hanke JE, & Wichern DW (2005). Business forecasting (8th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
  • Hedges LV, Pustejovsky JE, & Shadish WR (2012). A standardized mean difference effect size for single case designs. Research Synthesis Methods, 3(3), 224–239. doi: 10.1002/jrsm.1052
  • Heyvaert M, Moeyaert M, Verkempynck P, Van den Noortgate W, Vervloet M, Ugille M, & Onghena P (2017). Testing the intervention effect in single-case experiments: A Monte Carlo simulation study. The Journal of Experimental Education, 85(2), 175–196. doi: 10.1080/00220973.2015.1123667
  • Howard D, Best W, & Nickels L (2015). Optimising the design of intervention studies: Critiques and ways forward. Aphasiology, 29(5), 526–562. doi: 10.1080/02687038.2014.985884
  • Hox J (1998). Multilevel modeling: When and why. In Classification, data analysis, and data highways (pp. 147–154). Berlin, Heidelberg: Springer.
  • Huber S, Klein E, Moeller K, & Willmes K (2015). Comparing a single case to a control group: Applying linear mixed-effects models to repeated measures data. Cortex, 71, 148–159. doi: 10.1016/j.cortex.2015.06.020
  • Kay J, Lesser R, & Coltheart M (1992). PALPA: Psycholinguistic assessments of language processing in aphasia. Hove: Lawrence Erlbaum Associates.
  • Kuffner TA, & Walker SG (2017). Why are p-values controversial? The American Statistician. doi: 10.1080/00031305.2016.1277161
  • Kuznetsova A, Brockhoff PB, & Christensen RHB (2017). lmerTest package: Tests in linear mixed-effects models. Journal of Statistical Software, 82(13), 1–26. doi: 10.18637/jss.v082.i13
  • Lane JD, & Gast DL (2014). Visual analysis in single case experimental design studies: Brief review and guidelines. Neuropsychological Rehabilitation, 24(3–4), 445–463. doi: 10.1080/09602011.2013.815636
  • Lenz AS (2015). Using single-case research designs to demonstrate evidence for counseling practices. Journal of Counseling and Development, 93(4), 387–393.
  • Maas CJ, & Hox JJ (2005). Sufficient sample sizes for multilevel modeling. Methodology, 1(3), 86–92. doi: 10.1027/1614-2241.1.3.86
  • Matuschek H, Kliegl R, Vasishth S, Baayen H, & Bates D (2017). Balancing Type I error and power in linear mixed models. Journal of Memory and Language, 94, 305–315.
  • Matyas TA, & Greenwood KM (1996). Serial dependency in single-case time series. In Franklin RD, Allison DB, & Gorman BS (Eds.), Design and analysis of single-case research (pp. 215–243). Lawrence Erlbaum.
  • Middleton EL, & Schwartz MF (2012). Errorless learning in cognitive rehabilitation: A critical review. Neuropsychological Rehabilitation, 22(2), 138–168. doi: 10.1080/09602011.2011.639619
  • Middleton EL, Schwartz MF, Rawson KA, Traut H, & Verkuilen J (2016). Towards a theory of learning for naming rehabilitation: Retrieval practice and spacing effects. Journal of Speech, Language, and Hearing Research, 59(5), 1111. doi: 10.1044/2016_JSLHR-L-15-0303
  • Moeyaert M, Ferron JM, Beretvas SN, & Van den Noortgate W (2014). From a single-level analysis to a multilevel analysis of single-case experimental designs. Journal of School Psychology, 52(2), 191–211. doi: 10.1016/j.jsp.2013.11.003
  • Moeyaert M, Ugille M, Ferron JM, Beretvas SN, & Van den Noortgate W (2014). Three-level analysis of single-case experimental data: Empirical validation. The Journal of Experimental Education, 82(1), 1–21. doi: 10.1080/00220973.2012.745470
  • Moeyaert M, Ugille M, Ferron JM, Beretvas SN, & Van den Noortgate W (2014). The influence of the design matrix on treatment effect estimates in the quantitative analyses of single-subject experimental design research. Behavior Modification, 38(5), 665–704. doi: 10.1177/0145445514535243
  • Moeyaert M, Ugille M, Ferron JM, Onghena P, Heyvaert M, Beretvas SN, & Van den Noortgate W (2015). Estimating intervention effects across different types of single-subject experimental designs: Empirical illustration. School Psychology Quarterly, 30(1), 50–63. doi: 10.1037/spq0000068
  • Mycroft RH, Mitchell DC, & Kay J (2002). An evaluation of statistical procedures for comparing an individual’s performance with that of a group of controls. Cognitive Neuropsychology, 19(4), 291–299. doi: 10.1080/02643290143000150
  • Owens CM, & Ferron JM (2012). Synthesizing single-case studies: A Monte Carlo examination of a three-level meta-analytic model. Behavior Research Methods, 44(3), 795–805. doi: 10.3758/s13428-011-0180-y
  • Parker RI, Vannest KJ, & Davis JL (2011). Effect size in single-case research: A review of nine nonoverlap techniques. Behavior Modification, 35(4), 303–322. doi: 10.1177/0145445511399147
  • R Core Team (2016). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.
  • Rapp B (2005). The relationship between treatment outcomes and the underlying cognitive deficit: Evidence from the remediation of acquired dysgraphia. Aphasiology, 19(10–11), 994–1008. doi: 10.1080/02687030544000209
  • Rapp B, & Kane A (2002). Remediation of deficits affecting different components of the spelling process. Aphasiology, 16(4–6), 439–454. doi: 10.1080/02687030244000301
  • Raudenbush SW, & Bryk AS (2002). Hierarchical linear models: Applications and data analysis methods (Vol. 1). Sage.
  • Roediger HL, & Karpicke JD (2006). The power of testing memory: Basic research and implications for educational practice. Perspectives on Psychological Science, 1(3), 181–210. doi: 10.1111/j.1745-6916.2006.00012.x
  • Sage K, Snell C, & Lambon Ralph MA (2011). How intensive does anomia therapy for people with aphasia need to be? Neuropsychological Rehabilitation, 21(1), 26–41. doi: 10.1080/09602011.2010.528966
  • Sakamoto Y, Ishiguro M, & Kitagawa G (1986). Akaike information criterion statistics. Dordrecht: D. Reidel.
  • Shadish WR (2014). Analysis and meta-analysis of single-case designs [special issue]. Journal of School Psychology, 52(2). doi: 10.1016/j.jsp.2013.11.009
  • Shadish WR, Kyse EN, & Rindskopf DM (2013). Analyzing data from single-case designs using multilevel models: New applications and some agenda items for future research. Psychological Methods, 18(3), 385–405. doi: 10.1037/a0032964
  • Shadish WR, Zuur AF, & Sullivan KJ (2014). Using generalized additive (mixed) models to analyze single case designs. Journal of School Psychology, 52(2), 149–178. doi: 10.1016/j.jsp.2013.11.004
  • Singer JD, & Willett JB (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York, NY: Oxford University Press.
  • Swaminathan H, Rogers HJ, Horner RH, Sugai G, & Smolkowski K (2014). Regression models and effect size measures for single case designs. Neuropsychological Rehabilitation, 24(3–4), 554–571. doi: 10.1080/09602011.2014.887586
  • Thompson CK, Lukic S, King MC, Mesulam MM, & Weintraub S (2012). Verb and noun deficits in stroke-induced and primary progressive aphasia: The Northwestern Naming Battery. Aphasiology, 26(5), 632–655. doi: 10.1080/02687038.2012.676852
  • Ugille M, Moeyaert M, Beretvas SN, Ferron J, & Van den Noortgate W (2012). Multilevel meta-analysis of single-subject experimental designs: A simulation study. Behavior Research Methods, 44(4), 1244–1254. doi: 10.3758/s13428-012-0213-1
  • Van den Noortgate W, & Onghena P (2008). A multilevel meta-analysis of single-subject experimental design studies. Evidence-Based Communication Assessment and Intervention, 2(3), 142–151. doi: 10.1080/17489530802505362
  • Van den Noortgate W, & Onghena P (2003). Hierarchical linear models for the quantitative integration of effect sizes in single-case research. Behavior Research Methods, Instruments, & Computers, 35(1), 1–10. doi: 10.3758/BF03195492


  14. PDF What are Small-N Designs?

    What are Small-N Designs? Large-N Designs. Large numbers of subjects tested (the more the better). Subjects randomly distributed into groups (to deal with EVs). Experiments are relatively brief.

  15. Case Selection in Small-N Research

    Summary. Recent methodological work on systematic case selection techniques offers ways of choosing cases for in-depth analysis such that the probability of learning from the cases is enhanced. This research has undermined several long-standing ideas about case selection. In particular, random selection of cases, paired or grouped selection of ...

  16. Strategies of Causal Inference in Small-N Analysis

    Methods associated with three major strategies of small -N causal inference are examined: nominal comparison, ordinal comparison, and within-case analysis. The article argues that the use of these three strategies within particular small -N studies has led scholars to reach radically divergent conclusions about the logic of causal analysis in ...

  17. Evaluating Research Methods of Comparative Politics

    Evaluating the Research Methods of Three Modern Classics of Comparative Politics. The main aim of this essay will be to explore and theoretically evaluate the research designs of three classics of comparative politics: Putnam's case study method in Making Democracy Work, Linz's small-N research design in "The Perils of Presidentialism ...

  18. Small N And Large N Designs

    Name Small N designs Small N designs are not Case studies Case studies are published reports about a unique person, group, or situation that has been studied over a specific time period.

  19. Small N's and Big Conclusions: An Examination of the Reasoning

    An increasing number of studies, particularly in the area of comparative and historical research, are using the method of agreement and method of difference proposed by Mill (1872) to infer causality based on a small number of cases. This article examines the logic of the assumptions implicit in such studies. For example, the research must assume: (1) a deterninistic approach rather than a ...

  20. (PDF) Small N designs for rehabilitation research

    Single-case designs are better suited for studies in which understanding and changing patient behavior and functional status are primary goals and the targeted sample sizes are less than 30 and ...

  21. Cisco: Software, Network, and Cybersecurity Solutions

    Meet your career goals, build your IT know-how, or upskill your entire team with learning from Cisco Learning & Certifications.

  22. Statistical analysis in Small-N Designs: using linear mixed-effects

    Various statistical approaches have been proposed for evaluating a single case or a small number of cases, both to compares cases to controls and for within-case studies. These approaches can be categorized into regression and non-regression approaches.

  23. Validation of the Adult Eating Behavior Questionnaire in a Norwegian

    The present study adds to existing research by testing the psychometric properties of the AEBQ in a sample of 14-year-olds and examining its construct validity by means of the parent-reported CEBQ. The current study uses age 14 data (analysis sample: n = 636) from the ongoing Trondheim Early Secure Study, a longitudinal study of a ...

  24. Small‐N Designs

    Summary Small-N designs, such as systematic case studies and single-case experiments, are a potentially appealing way of blending science and practice, since they enable clinicians to integrate formal research methods into their everyday work. There are two main types of design: single-case experiments and naturalistic case studies.

  25. How Did Mpox Become a Global Emergency? What's Next?

    Faced once again with a rapidly spreading epidemic of mpox, the World Health Organization on Wednesday declared a global health emergency. The last time the W.H.O. made that call was in 2022, when ...

  26. Switches der Cisco Catalyst 9300-Serie

    Switches der Cisco Catalyst 9300-Serie sind auf Sicherheit, IoT und die Cloud ausgelegt. Schaffen Sie eine sichere Grundlage, die optimierte Automatisierung und Einfachheit sowie umfassende Einblicke ermöglicht.