Understanding confidence intervals helps you make better clinical decisions
Perhaps you didn't learn about the confidence interval (CI) in your formal education, or you don't hear the term in daily conversation. Confidence interval just doesn't roll off the tongue of a staff nurse quite like blood pressure or urine output does.
But knowing the importance of the CI allows you to interpret research for its impact on your practice. Evidence-based decision making is central to healthcare transformation. To make good decisions, you must know how to interpret and use research and practice evidence. Evaluating research means determining its validity (were the researchers’ methods good ones?) and reliability (can clinicians get the same results the researchers got?).
CI and the degree of uncertainty
In a nutshell, the CI expresses the degree of uncertainty associated with a sample statistic (also called a study estimate). The CI allows clinicians to determine if they can realistically expect results similar to those in research studies when they implement those study results in their practice. Specifically, the CI helps clinicians identify a range within which they can expect their results to fall most of the time.
Used in quantitative research, the CI is part of the stories that studies tell in numbers. These numeric stories describe the characteristics, or parameters, of a population; populations can be made up of individuals, communities, or systems. Collecting information from the whole population to find answers to clinical questions is practically impossible. For instance, we can’t possibly collect information from all cancer patients. Instead, we collect information from smaller groups within the larger population, called samples. We learn about population characteristics from these samples through a process called inference.
To differentiate sample values from those of the population (parameters), the numeric characteristics of a sample most commonly are termed statistics, but also may be called parameter estimates because they’re estimates of the population. Inferring information from sample statistics to population parameters can lead to errors, mainly because statistics may differ from one sample to the next. Several other terms are related to this opportunity for error—probability, standard error (SE), and mean. (See What are probability, standard error, and mean?)
Calculating the CI
Used in the formula to calculate the upper and lower boundaries of the CI (within which the population parameter is expected to fall), the SE reveals how accurately the sample statistics reflect population parameters. Choosing a more stringent probability, such as 0.01 (meaning a CI of 99%), would offer more confidence that the lower and upper boundaries of the CI contain the true value of the population parameter.
Not all studies provide CIs. For example, when we prepared this article, our literature search found study after study with a probability (p) value but no CI. However, studies usually report SEs and means. If the study you're reading doesn't provide a CI, here's the formula for calculating it:
95% CI: X = X̄ ± (1.96 × SE), where X denotes the boundary estimate and X̄ denotes the mean of the sample.
To find the upper boundary of the estimate, add 1.96 times the SE to X̄. To find the lower boundary of the estimate, subtract 1.96 times the SE from X̄. Note: 1.96 is the number of standard deviations from the mean required for the range of values to contain 95% of the values.
Be aware that values found with this formula aren't reliable with samples of less than 30. But don't despair; you can still calculate the CI, although explaining that formula is beyond the scope of this article. Watch the video at https://goo.gl/AuQ7Re to learn about that formula.
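If you would rather let a computer do the arithmetic, the following minimal Python sketch implements the formula above. The mean, SE, and sample size are made-up illustrative numbers, and the small-sample branch uses the t-distribution adjustment mentioned above rather than 1.96.

```python
from scipy import stats

def ci_from_summary(mean, se, n, level=0.95):
    """Return a (lower, upper) confidence interval for a mean.

    Uses the z multiplier (1.96 for 95%) for large samples and the
    t multiplier when n < 30, as discussed above.
    """
    if n >= 30:
        crit = stats.norm.ppf(1 - (1 - level) / 2)   # 1.96 for a 95% CI
    else:
        crit = stats.t.ppf(1 - (1 - level) / 2, df=n - 1)
    return mean - crit * se, mean + crit * se

# Hypothetical study report: mean = 52.0, SE = 2.5, n = 64
print(ci_from_summary(52.0, 2.5, 64))   # roughly (47.1, 56.9)
# The same summary statistics from a small sample (n = 12) give a wider CI
print(ci_from_summary(52.0, 2.5, 12))   # roughly (46.5, 57.5)
```

Note how the same mean and SE produce a wider interval when the sample is small.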
Real-world decision-making: Where CIs really count
Now let’s apply your new statistical knowledge to clinical decision making. In everyday terms, a CI is the range of values around a sample statistic within which clinicians can expect to get results if they repeat the study protocol or intervention, including measuring the same outcomes the same ways. As you critically appraise the reliability of research (“Will I get the same results if I use this research?”), you must address the precision of study findings, which is determined by the CI. If the CI around the sample statistic is narrow, study findings are considered precise and you can be confident you’ll get close to the sample statistic if you implement the research in your practice. Also, if the CI does not contain the statistical value that indicates no effect (such as 0 for effect size or 1 for relative risk and odds ratio), the sample statistic has met the criteria to be statistically significant.
The following example can help make the CI concept come alive. In a systematic review synthesizing studies of the effect of tai chi exercise on sleep quality, Du and colleagues (2015) found tai chi affected sleep quality in older people as measured by the Pittsburgh Sleep Quality Index (mean difference of -0.87; 95% CI [-1.25, -0.49]). Here’s how clinicians caring for older adults in the community would interpret these results: Across the studies reviewed, older people reported better sleep if they engaged in tai chi exercise. The lower boundary of the CI is -1.25, the study statistic is -0.87, and the upper boundary is -0.49. Each limit is 0.38 from the sample statistic, which is a relatively narrow CI. Keep in mind that a mean difference of 0 indicates there’s no difference; this CI doesn’t contain that value. Therefore, the sample statistic is statistically significant and unlikely to occur by chance. Because this was a systematic review and tai chi exercise has been established as helping people sleep, based on the sample statistics and the CI, clinicians can confidently include tai chi exercises among possible recommendations for patients who have difficulty sleeping.
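The check described above (does the CI contain the no-effect value?) can be written out explicitly. Here is a minimal Python sketch using the tai chi numbers reported by Du and colleagues; the odds ratio in the second call is a made-up illustration.

```python
def excludes_null(lower, upper, null_value):
    """True if the CI excludes the no-effect value, i.e., the result
    meets the usual criterion for statistical significance."""
    return not (lower <= null_value <= upper)

# Du et al. (2015): mean difference -0.87, 95% CI [-1.25, -0.49]; no effect = 0
print(excludes_null(-1.25, -0.49, null_value=0))   # True -> statistically significant
# A hypothetical odds ratio of 1.4 with 95% CI [0.9, 2.2]; no effect = 1
print(excludes_null(0.9, 2.2, null_value=1))       # False -> not significant
```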
Now you can apply your knowledge of CIs to make wise decisions about whether to base your patient care on a particular research finding. Just remember—when appraising research, consistently look for the CI. If the authors report the mean and SE but don’t report the CI, you can calculate the CI using the formula discussed earlier.
The authors work at the University of Texas at Tyler. Zhaomin He is an assistant professor and biostatistician of nursing. Ellen Fineout-Overholt is the Mary Coulter Dowdy Distinguished Professor of Nursing.
Selected references
Du S, Dong J, Zhang H, et al. Taichi exercise for self-rated sleep quality in older people: a systematic review and meta-analysis. Int J Nurs Stud. 2015;52(1):368-79.
Fineout-Overholt E. EBP, QI, and research: strange bedfellows or kindred spirits? In: Hedges C, Williams B, eds. Anatomy of Research for Nurses. Indianapolis, IN: Sigma Theta Tau International; 2014:23-44.
Fineout-Overholt E, Melnyk BM, Stillwell SB, Williamson KM. Evidence-based practice, step by step: critical appraisal of the evidence: part II: digging deeper—examining the “keeper” studies. Am J Nurs. 2010;110(9):41-8.
Khan Academy. Small sample size confidence intervals [video].
Melnyk BM, Fineout-Overholt E. ARCC (Advancing Research and Clinical practice through close Collaboration): a model for system-wide implementation and sustainability of evidence-based practice. In: Rycroft-Malone J, Bucknall T, eds. Models and Frameworks for Implementing Evidence-Based Practice: Linking Evidence to Action. Indianapolis, IN: Wiley-Blackwell & Sigma Theta Tau International; 2010.
O’Mathúna DP, Fineout-Overholt E. Critically appraising quantitative evidence for clinical decision making. In: Melnyk BM, Fineout-Overholt E, eds. Evidence-Based Practice in Nursing and Healthcare: A Guide to Best Practice. 3rd ed. Philadelphia: Lippincott Williams and Wilkins; 2015:81-134.
Plichta SB, Kelvin E. Munro’s Statistical Methods for Health Care Research. 6th ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2013.
Hypothesis Testing and Confidence Intervals
By Jim Frost
Confidence intervals and hypothesis testing are closely related because both methods use the same underlying methodology. Additionally, there is a close connection between significance levels and confidence levels. Indeed, there is such a strong link between them that hypothesis tests and the corresponding confidence intervals always agree about statistical significance.
A confidence interval is calculated from a sample and provides a range of values that likely contains the unknown value of a population parameter. To learn more about confidence intervals in general, how to interpret them, and how to calculate them, read my post about Understanding Confidence Intervals.
In this post, I demonstrate how confidence intervals work using graphs and concepts instead of formulas. In the process, I compare and contrast significance and confidence levels. You’ll learn how confidence intervals are similar to significance levels in hypothesis testing. You can even use confidence intervals to determine statistical significance.
Read the companion post for this one: How Hypothesis Tests Work: Significance Levels (Alpha) and P-values. In that post, I use the same graphical approach to illustrate why we need hypothesis tests, how significance levels and P-values can determine whether a result is statistically significant, and what that actually means.
Significance Level vs. Confidence Level
Let’s delve into how confidence intervals incorporate the margin of error. Like the previous post, I’ll use the same type of sampling distribution that showed us how hypothesis tests work. This sampling distribution is based on the t-distribution, our sample size, and the variability in our sample. Download the CSV data file: FuelsCosts.
There are two critical differences between the sampling distribution graphs for significance levels and confidence intervals–the value that the distribution centers on and the portion we shade.
The significance level chart centers on the null value, and we shade the outside 5% of the distribution.
Conversely, the confidence interval graph centers on the sample mean, and we shade the center 95% of the distribution.
The shaded range of sample means [267 394] covers 95% of this sampling distribution. This range is the 95% confidence interval for our sample data. We can be 95% confident that the population mean for fuel costs falls between 267 and 394.
Confidence Intervals and the Inherent Uncertainty of Using Sample Data
The graph emphasizes the role of uncertainty around the point estimate. This graph centers on our sample mean. If the population mean equals our sample mean, random samples from this population (N=25) will fall within this range 95% of the time.
We don’t know whether our sample mean is near the population mean. However, we know that the sample mean is an unbiased estimate of the population mean. An unbiased estimate does not tend to be too high or too low. It’s correct on average. Confidence intervals are correct on average because they use sample estimates that are correct on average. Given what we know, the sample mean is the most likely value for the population mean.
Given the sampling distribution, it would not be unusual for other random samples drawn from the same population to have means that fall within the shaded area. In other words, given that we did, in fact, obtain the sample mean of 330.6, it would not be surprising to get other sample means within the shaded range.
If these other sample means would not be unusual, we must conclude that these other values are also plausible candidates for the population mean. There is inherent uncertainty when using sample data to make inferences about the entire population. Confidence intervals help gauge the degree of uncertainty, also known as the margin of error.
Related post: Sampling Distributions
Confidence Intervals and Statistical Significance
If you want to determine whether your hypothesis test results are statistically significant, you can use either P-values with significance levels or confidence intervals. These two approaches always agree.
The relationship between the confidence level and the significance level for a hypothesis test is as follows:
Confidence level = 1 – Significance level (alpha)
For example, if your significance level is 0.05, the equivalent confidence level is 95%.
Both of the following conditions represent statistically significant results:
- The P-value in a hypothesis test is smaller than the significance level.
- The confidence interval excludes the null hypothesis value.
Further, it is always true that when the P-value is less than your significance level, the interval excludes the value of the null hypothesis.
In the fuel cost example, our hypothesis test results are statistically significant because the P-value (0.03112) is less than the significance level (0.05). Likewise, the 95% confidence interval [267 394] excludes the null hypothesis value (260). Using either method, we draw the same conclusion.
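To see this agreement in code, the sketch below reconstructs the fuel cost comparison from the summary numbers quoted in this post (sample mean 330.6, n = 25, null value 260). The standard error of roughly 30.8 is not reported directly; it is back-calculated from the $63.57 margin discussed below, so treat it as an approximation.

```python
from scipy import stats

n, sample_mean, null_value = 25, 330.6, 260
sem = 30.8            # approximate SEM, back-calculated from the $63.57 margin
df = n - 1

# Hypothesis test: one-sample t-test from summary statistics
t_stat = (sample_mean - null_value) / sem
p_value = 2 * stats.t.sf(abs(t_stat), df)

# 95% confidence interval for the mean
margin = stats.t.ppf(0.975, df) * sem
ci = (sample_mean - margin, sample_mean + margin)

print(round(t_stat, 2), round(p_value, 4))   # ~2.29, ~0.031  (significant at 0.05)
print([round(x) for x in ci])                # ~[267, 394]    (excludes 260)
```

Both outputs point to the same conclusion: the p-value is below 0.05 and the interval excludes the null value of 260.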
Hypothesis Testing and Confidence Intervals Always Agree
The hypothesis testing and confidence interval results always agree. To understand the basis of this agreement, remember how confidence levels and significance levels function:
- A confidence level determines the distance between the sample mean and the confidence limits.
- A significance level determines the distance between the null hypothesis value and the critical regions.
Both of these concepts specify a distance from the mean to a limit. Surprise! These distances are precisely the same length.
A 1-sample t-test calculates this distance as follows:
distance = critical t-value × standard error of the mean
Interpreting these statistics goes beyond the scope of this article. But, using this equation, the distance for our fuel cost example is $63.57.
P-value and significance level approach : If the sample mean is more than $63.57 from the null hypothesis mean, the sample mean falls within the critical region, and the difference is statistically significant.
Confidence interval approach : If the null hypothesis mean is more than $63.57 from the sample mean, the interval does not contain this value, and the difference is statistically significant.
Of course, they always agree!
The two approaches always agree as long as the same hypothesis test generates the P-values and confidence intervals and uses equivalent confidence levels and significance levels.
Related posts: Standard Error of the Mean and Critical Values
I Really Like Confidence Intervals!
In statistics, analysts often emphasize using hypothesis tests to determine statistical significance. Unfortunately, a statistically significant effect might not always be practically meaningful. For example, a significant effect can be too small to be important in the real world. Confidence intervals help you navigate this issue!
Similarly, the margin of error in a survey tells you how near you can expect the survey results to be to the correct population value.
Learn more about this distinction in my post about Practical vs. Statistical Significance .
Learn how to use confidence intervals to compare group means !
Finally, learn about bootstrapping in statistics to see an alternative to traditional confidence intervals that does not use probability distributions and test statistics. In that post, I create bootstrapped confidence intervals.
Neyman, J. (1937). Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability . Philosophical Transactions of the Royal Society A . 236 (767): 333–380.
Reader Interactions
December 7, 2021 at 3:14 pm
I am helping my Physics students use their data to determine whether they can say momentum is conserved. One of the columns in their data chart was change in momentum and ultimately we want this to be 0. They are obviously not getting zero from their data because of outside factors. How can I explain to them that their data supports or does not support conservation of momentum using statistics? They are using a 95% confidence level. Again, we want the change in momentum to be 0. Thank you.
December 9, 2021 at 6:54 pm
I can see several complications with that approach and also my lack of familiarity with the subject area limits what I can say. But here are some considerations.
For starters, I’m unsure whether the outside factors you mention bias the results systematically away from zero or just add noise (variability) to the data without systematically biasing them.
If the outside factors bias the results to a non-zero value, then you’d expect larger samples to be more likely to produce confidence intervals that exclude zero. Indeed, only smaller sample sizes might produce CIs that include zero, but that would only be due to the relative lack of precision associated with small sample sizes. In other words, limited data won’t be able to distinguish the sample value from zero even though, given the bias of the outside factors, you’d expect a non-zero value. If the bias exists, the larger samples will detect the non-zero values correctly while smaller samples might miss it.
If the outside factors don’t bias the results but just add noise, then you’d expect that both small and larger samples will include zero. However, you still have the issue of precision. Smaller samples will include zero because they produce relatively wider intervals. Larger samples should include zero but have narrower intervals. Obviously, you can trust the larger samples more.
In hypothesis testing, when you fail to reject the null, as occurs in the unbiased discussion above, you’re not accepting the null. Click the link to read about that. Failing to reject the null does not mean that the population value equals the hypothesized value (zero in your case). That’s because you can fail to reject the null due to poor quality data (high noise and/or small sample sizes). And you don’t want to draw conclusions based on poor data.
There’s a class of hypothesis testing called equivalence testing that you should use in this case. It flips the null and alternative hypotheses so that the test requires you to collect strong evidence to show that the sample value equals the null value (again, zero in your case). I don’t have a post on that topic (yet), but you can read the Wikipedia article about Equivalence Testing .
I hope that helps!
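For readers who want to see what such an equivalence test looks like, here is a minimal Python sketch of the two one-sided tests (TOST) procedure. The momentum-change data and the ±0.05 kg·m/s equivalence margin are invented for illustration; in practice the margin should come from the measurement precision of the experiment.

```python
import numpy as np
from scipy import stats

def tost_one_sample(x, low, high):
    """Two one-sided t-tests: is the mean of x inside (low, high)?
    Returns the larger of the two one-sided p-values; a small value
    supports equivalence (the mean lies within the margin)."""
    n = len(x)
    mean, sem = np.mean(x), stats.sem(x)
    p_lower = stats.t.sf((mean - low) / sem, df=n - 1)    # H0: mean <= low
    p_upper = stats.t.cdf((mean - high) / sem, df=n - 1)  # H0: mean >= high
    return max(p_lower, p_upper)

# Made-up class data: measured change in momentum (kg*m/s)
delta_p = np.array([0.01, -0.02, 0.00, 0.03, -0.01, 0.02, -0.03, 0.01, 0.00, -0.02])
print(tost_one_sample(delta_p, low=-0.05, high=0.05))
# A small p-value here supports the claim that the momentum change is
# equivalent to zero within the stated margin.
```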
September 19, 2021 at 5:16 am
Thank you very much. When training a machine learning model using bootstrap, in the end we will have the confidence interval of accuracy. How can I say that this result is statistically significant? Do I have to convert the confidence interval to p-values first and if p-value is less than 0.05, then it is statistically significant?
September 19, 2021 at 3:16 pm
As I mention in this article, you determine significance using a confidence interval by assessing whether it excludes the null hypothesis value. When it excludes the null value, your results are statistically significant.
September 18, 2021 at 12:47 pm
Dear Jim, Thanks for this post. I am new to hypothesis testing and would like to ask you how we know that the null hypotheses value is equal to 260.
Thank you. Kind regards, Loukas
September 19, 2021 at 12:35 am
For this example, the null hypothesis is 260 because that is the value from the previous year and they wanted to compare the current year to the previous year. It’s defined as the previous year value because the goal of the study was to determine whether it has changed since last year.
In general, the null hypothesis will often be a meaningful target value for the study based on their knowledge, such as this case. In other cases, they’ll use a value that represents no effect, such as zero.
I hope that helps clarify it!
February 22, 2021 at 3:49 pm
Hello, Mr. Jim Frost.
Thank you for publishing precise information about statistics, I always read your posts and bought your excellent e-book about regression! I really learn from you.
I got a couple of questions about the confidence level of the confidence intervals. Jacob Cohen, in his article “things I’ve learned (so far)” said that, in his experience, the most useful and informative confidence level is 80%; other authors state that if that level is below 90% it would be very hard to compare across results, as it is uncommon.
My first question is: in exploratory studies, with small samples (for example, N=85), if one wishes to generate correlational hypotheses for future research, would it be better to use a lower confidence level? What is the lowest level you would consider to be acceptable? I ask that because of my own research now, and with a sample size of 85 (non-probabilistic sampling) I know all I can do is generate some hypotheses to be explored in the future, so I would like my confidence intervals to be more informative, because I am not looking to generalize to the population.
My second question is: could you please provide an example of an appropriate way to describe the information about the confidence interval values/limits, beyond the classic “it contains a difference of 0; it contains a ratio of 1”.
I would really appreciate your answers.
Greetings from Peru!
February 23, 2021 at 4:51 pm
Thanks so much for your kind words and for supporting my regression ebook! I’m glad it’s been helpful! 🙂
On to your questions!
I haven’t read Cohen’s article, so I don’t understand his rationale. However, I’m extremely dubious of using a confidence level as low as 80%. Lowering the confidence level will create a narrower CI, which looks good. However, it comes at the expense of dramatically increasing the likelihood that the CI won’t contain the correct population value! My position is to leave the confidence level at 95%. Or, possibly lower it to 90%. But, I wouldn’t go further. Your CI will be wider, but that’s OK. It’s reflecting the uncertainty that truly exists in your data. That’s important. The problem with lowering the confidence level is that it makes your results appear more precise than they actually are.
When I think of exploratory research, I think of studies that are looking at tendencies or trends. Is the overall pattern of results consistent with theoretical expectations and justify further research? At that stage, it shouldn’t be about obtaining statistically significant results–at least not as the primary objective. Additionally, exploratory research can help you derive estimated effect sizes, variability, etc. that you can use for power calculations . A smaller, exploratory study can also help you refine your methodology and not waste your resources by going straight to a larger study that, as a result, might not be as refined as it would without a test run in the smaller study. Consequently, obtaining significant results, or results that look precise when they aren’t, aren’t the top priorities.
I know that lowering the confidence level makes your CI look more informative, but that is deceptive! I’d resist that temptation. Maybe go down to 90%. Personally, I would not go lower.
As for the interpretation, CIs indicate the range within which a population parameter is likely to fall. The parameter can be a mean, effect size, ratio, etc. Often, you as the researcher are hoping the CI excludes an important value. For example, if the CI is of the effect size, you want the CI to exclude zero (no effect). In that case, you can say that there is unlikely to be no effect in the population (i.e., there probably is a non-zero effect in the population). Additionally, the effect size is likely to be within this range. Other times, you might just want to know the range of values itself. For example, if you have a CI for the mean height of a population, it might be valuable on its own knowing that the population mean height is likely to fall between X and Y. If you have a specific example of what the CI assesses, I can give you a more specific interpretation.
Additionally, I cover confidence intervals associated with many different types of hypothesis tests in my Hypothesis Testing ebook. You might consider looking into that!
July 26, 2020 at 5:45 am
I got a very wide 95% CI of the HR of height in the cox PH model from a very large sample. I already deleted the outliers defined as 1.5 IQR, but it doesn’t work. Do you know how to resolve it?
July 5, 2020 at 6:13 pm
Hello, Jim!
I appreciate the thoughtful and thorough answer you provided. It really helped in crystallizing the topic for me.
If I may ask for a bit more of your time, as long as we are talking about CIs I have another question:
How would you go about constructing a CI for the difference of variances?
I am asking because while creating CIs for the difference of means or proportions is relatively straightforward, I couldn’t find any references for the difference of variances in any of my textbooks (or on the Web for that matter); I did find information regarding CIs for the ratio of variances, but it’s not the same thing.
Could you help me with that?
Thanks a lot!
July 2, 2020 at 6:01 pm
I want to start by thanking you for a great post and an overall great blog! Top notch material.
I have a doubt regarding the difference between confidence intervals for a point estimate and confidence intervals for a hypothesis test.
As I understand, if we are using CIs to test a hypothesis, then our point estimate would be whatever the null hypothesis is; conversely, if we are simply constructing a CI to go along with our point estimate, we’d use the point estimate derived from our sample. Am I correct so far?
The reason I am asking is that while reading from various sources, I’ve never found a distinction between the two cases, and they seem very different to me.
Bottom line, what I am trying to ask is: assuming the null hypothesis is true, shouldn’t the CI be changed?
Thank you very much for your attention!
July 3, 2020 at 4:02 pm
There’s no difference in the math behind the scenes. The real difference is that when you create a confidence interval in conjunction with a hypothesis test, the software ensures that they’re using consistent methodology. For example, the significance level and confidence level will correspond correctly (i.e., alpha = 0.05 and confidence level = 0.95). Additionally, if you perform a two-tailed test, you will obtain a two-sided CI. On the other hand, if you perform a one-tailed test, you will obtain the appropriate upper or lower bound (i.e., one-sided CIs). The software also ensures any other methodological choices you make will match between the hypothesis test and CI, which ensures the results always agree.
You can perform them separately. However, if you don’t match all the methodology options, the results can differ.
As for your question about assuming the null is true. Keep in mind that hypothesis tests create sampling distributions that center on the null hypothesis value. That’s the assumption that the null is true. However, the sampling distributions for CIs center on the sample estimate. So, yes, CIs change that detail because they don’t assume the null is correct. But that’s always true whether you perform the hypothesis test or not.
Thanks for the great questions!
December 21, 2019 at 6:31 am
The confidence interval has the sample statistic as the most likely value (the value in the center), and the sampling distribution assumes the null value to be the most likely value (the value in the center). I am a little confused about this. Would be really kind of you if you could show both in the same graph and explain how both are related. How is the distance from the mean to a limit the same for the significance level and the CI?
December 23, 2019 at 3:46 am
That’s a great question. I think part of your confusion is due to terminology.
The sampling distribution of the means centers on the sample mean. This sampling distribution uses your sample mean as its mean and the standard error of the mean as its standard deviation.
The sampling distribution of the test statistic (t) centers on the null hypothesis value (0). This distribution uses zero as its mean and also uses the SEM for its standard deviation.
They’re two different things and center on different points. But, they both incorporate the SEM, which is why they always agree! I do have a section in this post about why that distance is always the same. Look for the section titled, “Why They Always Agree.”
November 23, 2019 at 11:31 pm
Hi Jim, I’m the proud owner of 2 of your ebooks. There’s one topic though that keeps puzzling me: If I were to take 9 samples of size 15 in order to estimate the population mean, the SE of the mean would be substantially larger than if I took 1 sample of size 135 (divide pop SD by sqrt(15) or sqrt(135)), whereas the E(x) (or mean of means) would be the same.
Can you please shine a little light on that.
Tx in advance
November 24, 2019 at 3:17 am
Thanks so much for supporting my ebooks. I really appreciate that!! 🙂
So, let’s flip that scenario around. If you know that a single large sample of 135 will produce more precise estimates of the population, why would you collect nine smaller samples? Knowing how statistics works, that’s not a good decision. If you did that in the real world, it would be because there was some practical reason that you could not collect one big sample. Further, it would suggest that you had some reason for not being able to combine them later. For example, if you follow the same random sampling procedure on the same population and used all the same methodology and at the same general time, you might feel comfortable combining them together into one larger sample. So, if you couldn’t collect one larger sample and you didn’t feel comfortable combining them together, it suggests that you have some reason for doubting that they all measure the same thing for the same population. Maybe you had differences in methodology? Or subjective measurements across different personnel? Or, maybe you collected the samples at different times and you’re worried that the population changed over time?
So, that’s the real world reason for why a researcher would not combine smaller samples into a larger one.
The formula for the standard error of the mean is SEM = σ/√n. As you can see, the expected value for the population standard deviation is in the numerator (sigma). As the sample size increases, the numerator remains constant (plus or minus random error) because the expected value for the population parameter does not change. Conversely, the square root of the sample size is in the denominator. As the sample size increases, it produces a larger value in the denominator. So, if the expected value of the numerator is constant but the value of the denominator increases with a larger sample size, you expect the SEM to decrease. Smaller SEMs indicate more precise estimates of the population parameter. For instance, the equations for confidence intervals use the SEM. Hence, for the same population, larger samples tend to produce smaller SEMs and more precise estimates of the population parameter.
I hope that answers your question!
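A quick numeric illustration of the standard error formula discussed above (SEM = σ/√n); the population standard deviation of 10 is an arbitrary assumption:

```python
import math

sigma = 10            # assumed population standard deviation
for n in (15, 135):
    print(n, round(sigma / math.sqrt(n), 2))
# 15  -> 2.58  (one small sample of 15)
# 135 -> 0.86  (one sample of 135; three times smaller, since sqrt(135/15) = 3)
```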
November 6, 2018 at 10:26 am
first of all: Thanks for your effort and your effective way of explaining!
You say that p-values and C.I.s always agree. I agree.
Why does Tim van der Zee claim the opposite? I’m not enough into statistics to figure this out.
http://www.timvanderzee.com/not-interpret-confidence-intervals/
Best regards Georg
November 7, 2018 at 9:31 am
I think he is saying that they do agree–just that people often compare the wrong pair of CIs and p-values. I assume you’re referring to the section “What do overlapping intervals (not) mean?” And, he’s correct in what he says. In a 2-sample t-test, it’s not valid to compare the CI for each of the two group means to the test’s p-values because they have different purposes. Consequently, they won’t necessarily agree. However, that’s because you’re comparing results from two different tests/intervals.
On the one hand, you have the CIs for each group. On the other hand, you have the p-value for the difference between the two groups. Those are not the same thing and so it’s not surprising that they won’t agree necessarily.
However, if you compare the p-value of the difference between means to a CI of the difference between means, they will always agree. You have to compare apples to apples!
April 14, 2018 at 8:54 pm
First of all, I love all your posts and you really do make people appreciate statistics by explaining it intuitively compared to theoretical approaches I’ve come across in university courses and other online resources. Please continue the fantastic work!!!
At the end, you mentioned how you prefer confidence intervals as they consider both “size and precision of the estimated effect”. I’m confused as to what exactly size and precision mean in this context. I’d appreciate an explanation with reference to specific numbers from the example above.
Second, do p-values lack both size and precision in determination of statistical significance?
Thanks, Devansh
April 17, 2018 at 11:41 am
Hi Devansh,
Thanks for the nice comments. I really appreciate them!
I really need to write a post specifically about this issue.
Let’s first assume that we conduct our study and find that the mean cost is 330.6 and that we are testing whether that is different than 260. Further suppose that we perform the hypothesis test and obtain a p-value that is statistically significant. We can reject the null and conclude that the population mean does not equal 260. And we can see our sample estimate is 330.6. So, that’s what we learn using p-values and the sample estimate.
Confidence intervals add to that information. We know that if we were to perform the experiment again, we’d get different results. How different? Is the true population mean likely to be close to 330.6 or further away? CIs help us answer these questions. The 95% CI is [267 394]. The true population value is likely to be within this range. That range spans 127 dollars.
However, let’s suppose we perform the experiment again but this time use a much larger sample size and obtain a mean of 351 and again a significant p-value. However, thanks to the large sample size, we obtain a 95% CI of [340 362]. Now we know that the population value is likely to fall within this much tighter interval of only 22 dollars. This estimate is much more precise.
Sometimes you can obtain a significant p-value for a result that is too imprecise to be useful. For example, the first CI might be too wide to be useful for what we need to do with our results. Maybe we’re helping people make budgets and that range is too wide to allow for practical planning. However, the more precise estimate of the second study allows for better budgetary planning! That determination of how much precision is required must be made using subject-area knowledge and focusing on the practical usage of the results. P-values don’t indicate the precision of the estimates in this manner!
I hope this helps clarify this precision issue!
- What is a CI?
Jane Clarke
Correspondence to Jane Clarke, Department of O&G, University of Auckland, 4 Prime Rd, Grey Lynn, Auckland 1021, New Zealand; janeclarkehome@gmail.com
https://doi.org/10.1136/ebnurs-2012-100802
When reading a research report, the range of the CI provides assurance (or confidence) regarding how precise the data are. CIs are calculated at a confidence level predetermined by the researcher, usually 95%, although levels of 90%, 99%, and 99.9% are sometimes applied.
CIs are considered a more useful measure than p values, which only reflect a level of statistical significance.1 (p values were discussed in a previous Research Made Simple paper.2)
The concept
A CI is a numerical range used to describe research data. For example, for a study outcome of weight, a CI may be 53 to 71 kg. This interval of 53 to 71 kg is where there is 95% certainty that the true weight would lie (if you were applying a 95% CI).
The mainstream press often quote CIs when interpreting the results of polls, for example: the results of the latest XXX Research telephone poll considering two weight loss techniques showed that 62% of respondents favoured technique A while 39% would choose technique B. This telephone poll of 12 073 respondents had a margin of error of plus or minus 4.2 percentage points.
Presuming a 95% confidence level was applied, these results suggest there is a 95% chance that between 57.8% and 66.2% of people would choose technique A (62%±4.2%). Conversely, there is a 5% chance that fewer than 57.8% of people or more than 66.2% of people would choose technique A.
CIs are used to interpret the reliability of research data. The width or range of the CI indicates the reliability of the data (sometimes known as precision). A narrow CI implies high precision and credible values whereas a wide interval suggests the reverse. A wide interval may indicate more data should be collected before conclusions can be drawn. Sometimes when a CI is very wide, it may indicate an inadequate sample size.
Meaning and interpretation
CIs are usually found in the results section of a paper and provide the reader with an opportunity to draw conclusions about the importance of the size or strength of the results. CIs are expressed as X (A–B), where X is the observed statistic, for example a mean, A is the lower limit of the CI, and B is the upper limit.
Two questions should be considered when interpreting a CI:
Does the CI contain a value that implies no change or no effect?
Does the CI include (cross) zero? If it does, this implies no statistically significant change. For example, research on a treatment for hypertension found that the 95% CI included zero (−1 to 13) suggesting the treatment is ineffective.
Does the CI lie partly or entirely within a range of clinical indifference or clinical significance?
Clinical indifference represents values of such a small size that you would not want to change current practice. For example, a weight reduction programme showing a loss of 3 kg over 2 years, or a diagnostic test that had a predictive value of less than 50%, would not be considered useful.
Some examples
Researchers examined the efficacy of a homeopathic preparation for analgesia and swelling after oral surgery. Two days postoperatively the homeopathic preparation had led to a mean reduction in swelling of 3 mm. The 95% CI ranged from −5.5 to 7.5 mm. This wide interval (which crosses zero) suggests there was neither a large increase nor a decrease in swelling due to the homeopathic preparation. As the CI crosses zero, this suggests the treatment is ineffective.
If parliamentary elections were held today and an opinion poll predicted the Blogs party would win 62% of the vote, the pollster might attach a 95% CI to the interval, so the range would then be 59 to 65%. It would be reasonable to conclude the Blogs party would get between 59% and 65% of the total vote. This CI is quite narrow telling us the estimated value is relatively reliable and that repeated polls would give similar results.
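The Blogs party example can be reproduced with the usual normal-approximation (Wald) formula for a proportion's CI. The poll size is not given in the article, so the n of 1000 below is an assumption chosen to yield roughly a ±3 percentage point margin:

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """95% Wald confidence interval for a proportion."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Assumed poll: 62% support among roughly 1000 respondents
lower, upper = proportion_ci(0.62, 1000)
print(f"{lower:.1%} to {upper:.1%}")   # roughly 59.0% to 65.0%
```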
Gardner MJ, Altman DG, Bryant TN.
Norman GR, Streiner DL.
Competing interests None.
The clinician’s guide to p values, confidence intervals, and magnitude of effects
Mark R. Phillips, Charles C. Wykoff, Lehana Thabane, Mohit Bhandari & Varun Chaudhary, for the Retina Evidence Trials InterNational Alliance (R.E.T.I.N.A.) Study Group
Eye volume 36, pages 341–342 (2022)
Introduction
There are numerous statistical and methodological considerations within every published study, and the ability of clinicians to appreciate the implications and limitations associated with these key concepts is critically important. These implications often have a direct impact on the applicability of study findings – which, in turn, often determine the appropriateness for the results to lead to modification of practice patterns. Because it can be challenging and time-consuming for busy clinicians to break down the nuances of each study, herein we provide a brief summary of 3 important topics that every ophthalmologist should consider when interpreting evidence.
p-values: what they tell us and what they don’t
Perhaps the most universally recognized statistic is the p-value. Most individuals understand the notion that (usually) a p-value <0.05 signifies a statistically significant difference between the two groups being compared. While this understanding is shared amongst most, it is far more important to understand what a p-value does not tell us. Attempting to inform clinical practice patterns through interpretation of p-values is overly simplistic, and is fraught with potential for misleading conclusions. A p-value represents the probability that the observed result (difference between the groups being compared)—or one that is more extreme—would occur by random chance, assuming that the null hypothesis (the alternative scenario to the study’s hypothesis) is that there are no differences between the groups being compared. For example, a p-value of 0.04 would indicate that the difference between the groups compared would have a 4% chance of occurring by random chance. When this probability is small, it becomes less likely that the null hypothesis is accurate—or, alternatively, that the probability of a difference between groups is high [1]. Studies use a predefined threshold to determine when a p-value is sufficiently small to support the study hypothesis. This threshold is conventionally a p-value of 0.05; however, there are reasons and justifications for studies to use a different threshold if appropriate.
What a p-value cannot tell us is the clinical relevance or importance of the observed treatment effects [1]. Specifically, a p-value does not provide details about the magnitude of effect [2, 3, 4]. Despite a significant p-value, it is quite possible for the difference between the groups to be small. This phenomenon is especially common with larger sample sizes, in which comparisons may result in statistically significant differences that are actually not clinically meaningful. For example, a study may find a statistically significant difference (p < 0.05) in visual acuity outcomes between two groups, while the difference between the groups may amount to only a 1-letter difference or less. While this may in fact be a statistically significant difference, it is likely not large enough to make a meaningful difference for patients. Thus, p-values lack vital information on the magnitude of effects for the assessed outcomes [2, 3, 4].
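The large-sample phenomenon described above is easy to demonstrate with simulated data. In the sketch below, two groups of visual acuity letter scores differ by only 1 letter on average, yet with thousands of patients per arm the p-value is routinely below 0.05. All numbers (means, standard deviation, sample sizes) are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10000                                        # patients per arm (invented)
group_a = rng.normal(loc=60, scale=15, size=n)   # mean 60 letters, SD 15
group_b = rng.normal(loc=61, scale=15, size=n)   # true difference: 1 letter

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(round(group_b.mean() - group_a.mean(), 2), p_value)
# A difference of about 1 letter, with a p-value typically well below 0.05:
# statistically significant, yet clinically trivial.
```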
Overcoming the limitations of interpreting p-values: magnitude of effect
To overcome this limitation, it is important to consider both (1) whether or not the p-value of a comparison is significant according to the pre-defined statistical plan, and (2) the magnitude of the treatment effects (commonly reported as an effect estimate with 95% confidence intervals) [5]. The magnitude of effect is most often represented as the mean difference between groups for continuous outcomes, such as visual acuity on the logMAR scale, and the risk or odds ratio for dichotomous/binary outcomes, such as occurrence of adverse events. These measures indicate the observed effect that was quantified by the study comparison. As suggested in the previous section, understanding the actual magnitude of the difference in the study comparison provides an understanding of the results that an isolated p-value does not provide [4, 5]. Understanding the results of a study should shift from a binary interpretation of significant vs not significant, and instead focus on a more critical judgement of the clinical relevance of the observed effect [1].
There are a number of important metrics, such as the Minimally Important Difference (MID), which help to determine if a difference between groups is large enough to be clinically meaningful [6, 7]. When a clinician is able to identify (1) the magnitude of effect within a study, and (2) the MID (the smallest change in the outcome that a patient would deem meaningful), they are far more capable of understanding the effects of a treatment, and can further articulate the pros and cons of a treatment option to patients with reference to treatment effects that can be considered clinically valuable.
The role of confidence intervals
Confidence intervals are estimates that provide a lower and upper threshold to the estimate of the magnitude of effect. By convention, 95% confidence intervals are most typically reported. These intervals represent the range within which we can, with 95% confidence, expect the treatment effect to fall. For example, a mean difference in visual acuity of 8 (95% confidence interval: 6 to 10) suggests that the best estimate of the difference between the two study groups is 8 letters, and we have 95% certainty that the true value is between 6 and 10 letters. When interpreting this clinically, one can consider the different clinical scenarios at each end of the confidence interval: if the patient’s outcome were the most conservative, in this case an improvement of 6 letters, would the importance to the patient be different than if the patient’s outcome were the most optimistic, 10 letters in this example? When the clinical value of the treatment effect does not change when considering the lower versus upper confidence limits, there is enhanced certainty that the treatment effect will be meaningful to the patient [4, 5]. In contrast, if the clinical merits of a treatment appear different when considering the possibility of the lower versus the upper confidence limits, one may be more cautious about the benefits to be anticipated with treatment [4, 5].
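Putting the two previous ideas together, a small sketch can compare both ends of a confidence interval against a minimally important difference. The 8-letter difference and the 6 to 10 letter CI come from the example above; the 5-letter MID is an assumed value used purely for illustration.

```python
def clinical_reading(ci_lower, ci_upper, mid):
    """Interpret an effect estimate's CI relative to a minimally
    important difference (MID)."""
    if ci_lower >= mid:
        return "Even the most conservative estimate exceeds the MID: likely meaningful benefit."
    if ci_upper < mid:
        return "Even the most optimistic estimate falls below the MID: unlikely to be meaningful."
    return "The CI straddles the MID: the benefit may or may not be clinically important."

# Example above: mean difference of 8 letters, 95% CI 6 to 10; assumed MID of 5 letters
print(clinical_reading(6, 10, mid=5))
```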
There are a number of important details for clinicians to consider when interpreting evidence. Through this editorial, we hope to provide practical insights into fundamental methodological principles that can help guide clinical decision making. P-values are one small component to consider when interpreting study results, with much deeper appreciation of results being available when the treatment effects and associated confidence intervals are also taken into consideration.
Change history
19 January 2022
A Correction to this paper has been published: https://doi.org/10.1038/s41433-021-01914-2
Li G, Walter SD, Thabane L. Shifting the focus away from binary thinking of statistical significance and towards education for key stakeholders: revisiting the debate on whether it’s time to de-emphasize or get rid of statistical significance. J Clin Epidemiol. 2021;137:104–12. https://doi.org/10.1016/j.jclinepi.2021.03.033
Gagnier JJ, Morgenstern H. Misconceptions, misuses, and misinterpretations of p values and significance testing. J Bone Joint Surg Am. 2017;99:1598–603. https://doi.org/10.2106/JBJS.16.01314
Goodman SN. Toward evidence-based medical statistics. 1: the p value fallacy. Ann Intern Med. 1999;130:995–1004. https://doi.org/10.7326/0003-4819-130-12-199906150-00008
Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31:337–50. https://doi.org/10.1007/s10654-016-0149-3
Phillips M. Letter to the editor: editorial: threshold p values in orthopaedic research-we know the problem. What is the solution? Clin Orthop. 2019;477:1756–8. https://doi.org/10.1097/CORR.0000000000000827
Devji T, Carrasco-Labra A, Qasim A, Phillips MR, Johnston BC, Devasenapathy N, et al. Evaluating the credibility of anchor based estimates of minimal important differences for patient reported outcomes: instrument development and reliability study. BMJ. 2020;369:m1714. https://doi.org/10.1136/bmj.m1714
Carrasco-Labra A, Devji T, Qasim A, Phillips MR, Wang Y, Johnston BC, et al. Minimal important difference estimates for patient-reported outcomes: a systematic survey. J Clin Epidemiol. 2020;0. https://doi.org/10.1016/j.jclinepi.2020.11.024
Author information
Authors and affiliations.
Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada
Mark R. Phillips, Lehana Thabane, Mohit Bhandari & Varun Chaudhary
Retina Consultants of Texas (Retina Consultants of America), Houston, TX, USA
Charles C. Wykoff
Blanton Eye Institute, Houston Methodist Hospital, Houston, TX, USA
Biostatistics Unit, St. Joseph’s Healthcare-Hamilton, Hamilton, ON, Canada
Lehana Thabane
Department of Surgery, McMaster University, Hamilton, ON, Canada
Mohit Bhandari & Varun Chaudhary
NIHR Moorfields Biomedical Research Centre, Moorfields Eye Hospital, London, UK
Sobha Sivaprasad
Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA
Peter Kaiser
Retinal Disorders and Ophthalmic Genetics, Stein Eye Institute, University of California, Los Angeles, CA, USA
David Sarraf
Department of Ophthalmology, Mayo Clinic, Rochester, MN, USA
Sophie J. Bakri
The Retina Service at Wills Eye Hospital, Philadelphia, PA, USA
Sunir J. Garg
Center for Ophthalmic Bioinformatics, Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA
Rishi P. Singh
Cleveland Clinic Lerner College of Medicine, Cleveland, OH, USA
Department of Ophthalmology, University of Bonn, Boon, Germany
Frank G. Holz
Singapore Eye Research Institute, Singapore, Singapore
Tien Y. Wong
Singapore National Eye Centre, Duke-NUD Medical School, Singapore, Singapore
Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, East Melbourne, VIC, Australia
Robyn H. Guymer
Department of Surgery (Ophthalmology), The University of Melbourne, Melbourne, VIC, Australia
R.E.T.I.N.A. Study Group: Varun Chaudhary, Mohit Bhandari, Charles C. Wykoff, Sobha Sivaprasad, Lehana Thabane, Peter Kaiser, David Sarraf, Sophie J. Bakri, Sunir J. Garg, Rishi P. Singh, Frank G. Holz, Tien Y. Wong & Robyn H. Guymer
Contributions
MRP was responsible for conception of idea, writing of manuscript and review of manuscript. VC was responsible for conception of idea, writing of manuscript and review of manuscript. MB was responsible for conception of idea, writing of manuscript and review of manuscript. CCW was responsible for critical review and feedback on manuscript. LT was responsible for critical review and feedback on manuscript.
Corresponding author
Correspondence to Varun Chaudhary .
Ethics declarations
Competing interests.
MRP: Nothing to disclose. CCW: Consultant: Acuela, Adverum Biotechnologies, Inc, Aerpio, Alimera Sciences, Allegro Ophthalmics, LLC, Allergan, Apellis Pharmaceuticals, Bayer AG, Chengdu Kanghong Pharmaceuticals Group Co, Ltd, Clearside Biomedical, DORC (Dutch Ophthalmic Research Center), EyePoint Pharmaceuticals, Gentech/Roche, GyroscopeTx, IVERIC bio, Kodiak Sciences Inc, Novartis AG, ONL Therapeutics, Oxurion NV, PolyPhotonix, Recens Medical, Regeron Pharmaceuticals, Inc, REGENXBIO Inc, Santen Pharmaceutical Co, Ltd, and Takeda Pharmaceutical Company Limited; Research funds: Adverum Biotechnologies, Inc, Aerie Pharmaceuticals, Inc, Aerpio, Alimera Sciences, Allergan, Apellis Pharmaceuticals, Chengdu Kanghong Pharmaceutical Group Co, Ltd, Clearside Biomedical, Gemini Therapeutics, Genentech/Roche, Graybug Vision, Inc, GyroscopeTx, Ionis Pharmaceuticals, IVERIC bio, Kodiak Sciences Inc, Neurotech LLC, Novartis AG, Opthea, Outlook Therapeutics, Inc, Recens Medical, Regeneron Pharmaceuticals, Inc, REGENXBIO Inc, Samsung Pharm Co, Ltd, Santen Pharmaceutical Co, Ltd, and Xbrane Biopharma AB—unrelated to this study. LT: Nothing to disclose. MB: Research funds: Pendopharm, Bioventus, Acumed – unrelated to this study. VC: Advisory Board Member: Alcon, Roche, Bayer, Novartis; Grants: Bayer, Novartis – unrelated to this study.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original version of this article was revised: In this article the middle initial in author name Sophie J. Bakri was missing.
About this article
Phillips, M.R., Wykoff, C.C., Thabane, L. et al. The clinician’s guide to p values, confidence intervals, and magnitude of effects. Eye 36 , 341–342 (2022). https://doi.org/10.1038/s41433-021-01863-w
Received: 11 November 2021
Revised: 12 November 2021
Accepted: 15 November 2021
Published: 26 November 2021
Issue Date: February 2022
DOI: https://doi.org/10.1038/s41433-021-01863-w
Confidence intervals: what are they to us, medical doctors?
Vladimir Trkulja, Pero Hrabač
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
In 2000, a BMJ-edition book (1) straightforwardly pointed out the importance of providing effect measures with confidence intervals (CI) when reporting the results of clinical/epidemiological research, and not only the results of statistical tests. However, medical doctors commonly seem to be more aware of formal statistical testing and more fascinated with statistical significance than they are aware of the practical meaning of the location and size (and other properties) of the effect measure. The caveats of scientific reasoning based solely on the results of statistical tests and “the concept of ‘statistical significance’, typically assessed with an index called the P value” (2) have been thoroughly addressed. Here, we would only emphasize: the effect estimate (together with its CI) provides information that is not conveyed by the simple fact of it being “statistically significant” (P < 0.05) or not (P > 0.05) (whatever this, in its essence, would mean) (1) – imagine the effects of two preventive interventions, both associated with “P < 0.05” (“statistically significant”), but one estimated at a 5% relative risk reduction (95% CI 1.0 to 9.0) and the other estimated at a 30% reduction (95% CI 26.0 to 34.0). This shifts the focus away from testing and onto questions about the practical relevance of the observed effects. This becomes a matter of methodological (how the effects were observed) and medical expertise, not just statistical. However, the process requires that some concepts, including CIs, are adequately perceived, since they could be quite confusing for non-statisticians like medical doctors.
Research tends to identify generally applicable (generalizable) principles about relationships between factors that affect disease occurrence and natural history/disease outcomes, or that may help distinguish between health and disease – principles that would support claims like “this lifestyle intervention reduces the risk of type 2 diabetes by 30%” or “if this test is positive, it is 99% probable that the patient is indeed sick.” Such claims pertain to all potentially or actually diseased people at any time (population-wise claims). The inherent obstacle in this process is the fact that one cannot explore these relationships by encompassing the entire population, since a population is not a fixed “body of people” – every day some members of the population of, eg, patients with chronic heart failure (CHF) die, while some are newly diagnosed. Therefore, one explores relationships between factors within samples from the population (even studies designated as population-based or nationwide are performed in samples) and then projects back to the population, ie, one estimates the population.
In 1937, Polish mathematician and statistician Jerzy Neyman introduced the concept of CIs ( 3 ) and defined the problem of estimation in the introductory sentences (quote):
“(i) The statistician is concerned with a population, π, which for some reason or other cannot be studied exhaustively. It is only possible to draw a sample from this population which may be studied in detail and used to form an opinion as to the values of certain constants describing the population π…. … the problem … is the problem of estimation. This problem consists in determining what arithmetical operations should be performed on the observational data in order to obtain a result, to be called an estimate, which presumably does not differ very much from the true value of the numerical character… (ii) The theoretical aspect of the problem of statistical estimation consists primarily in putting in a precise form certain vague notions mentioned in (i)… connected with the sentence describing the meaning of the word estimate. What exactly is meant by the statement that the value of the estimate “presumably” should not differ very much from the estimated number? The only established branch of mathematics dealing with conceptions bearing on the word “presumably” is the calculus of probability.”
Deriving from the calculus of probability, he demonstrated that it was impossible to assume that the estimate (as a single value) would be “ exactly equal to ” (quote) the population parameter (object of estimation), but rather, that the process needed to result in the estimation of limits “ between which the true value [of the population parameter] presumably falls ” (quote) ( 3 ). He named the interval defined by these limits – CI (3).
There are two conceptual (or philosophical) views on estimation and probability, Bayesian and frequentist (the former being older and named after the 18th-century mathematician Thomas Bayes), with some specifics regarding their respective computational methods. Both have wide applications in different scientific disciplines, each with proponents who commonly could also be defined as opponents of the other (although the two philosophies are far from opposing). The concept of CIs and the closely related concept of statistical hypothesis testing [also greatly contributed to by Neyman ( 4 )], together with all the subsequent developments in the field, are core frequentist concepts (the term frequentist comes from the view that probabilities of events or numerical values are defined by their relative frequency of occurrence in an infinite number of observations, ie, their occurrence in the long run). Before we try to outline the main points about the two estimation philosophies, we point out: (i) a parameter is a population value that we want to estimate (“the true effect” or “the true value”); (ii) an effect or statistic refers to any quantity that we determine in a sample in an attempt to estimate the population value: ie, the mean or median of some continuous variable (eg, blood pressure), the proportion of people with some characteristic (eg, cured), a correlation or a regression coefficient indicating an association between variables (eg, between blood glucose and cholesterol levels), a difference in means or medians between groups of subjects, a difference in proportions, or ratio measures (eg, relative risk, odds ratio, hazard ratio, incidence rate ratio), etc.; (iii) a population is defined by sets of characteristics, with no time constraints. For example, the population of people with CHF refers to both the present and the future, and we know that there are at least two (sub)populations of CHF patients ( 5 ) – those with reduced left ventricular ejection fraction (LVEF) and those with preserved LVEF. Based on their characteristics, one of which is the fact that some interventions convey a considerable survival benefit in the former but not in the latter, these are two different populations ( 5 ).
The main points of the Bayesian approach ( 6 ) are (i) parameter is a random variable, hence it needs to be estimated in a form of a probability distribution. This probability, largely but not solely [see (ii)] defined by data observed in the sample, is called posterior probability or posterior probability distribution. The calculation of the posterior distribution is the final result of the Bayesian estimation process. One can define a certain appropriate point of this distribution that is located at the highest density of the probability of the “true effect”, eg, mean or median, ie, the point-estimate, and an interval that contains the “true value” with a certain level of probability. This interval is called the credible interval (CrI), eg, 95% CrI tells us that the true population value is contained between its lower and upper limits with 95% probability. Different types of CrIs (95%) could be constructed, eg, “equal-tail” intervals, where 2.5% probability of the location of the true effect is below the lower interval limit and 2.5% is above the upper limit; or the highest posterior density interval (HPD) – it may not have equal tails, but it is the shortest interval encompassing 95% of the probability and any point contained within the interval “bears” a higher probability of the location of the true effect than any point outside the interval; (ii) posterior probability is determined (calculated) based on three key elements. The first one is the prior probability or prior probability distribution or simply – the prior. It reflects our previous knowledge, or a belief or a hypothesis (H) about the true effect – what we think that the “true value is” before we have seen the data (ie, before we have done the sample-based observation). The second one is the information collected by observing the sample (the data, D). The third element is the likelihood. Probability and likelihood are commonly used synonymously. Here, likelihood means probability (P) of observing exactly what we have observed in the sample, if our prior (our “pre-data” hypothesis, H) were true [likelihood = P (D|H)]. The computational part uses these three key elements to derive the posterior distribution – which, in a way, illustrates how our initial hypothesis was modified by the data and is influenced both by the prior and the observation in the sample. Calculation of the positive and negative predictive values (PPV, NPV, respectively) of a diagnostic test is a clear application of the Bayesian method: both PPV and NPV are posterior probabilities, that is, post-test probabilities of a disease if a test is positive (PPV) or of no disease if the test is negative (NPV) – for any combination of sensitivity and specificity (observed in the sample), they vary depending on the prevalence of a disease in the population (ie, the pre-test or prior probability of the disease/no disease). This is the essence. A simple (hypothetical) example: a randomized controlled trial aims to assess whether a treatment T conveys a survival benefit defined as a difference in 1-year mortality vs placebo (control, C) in patients with advanced stages of CHF with reduced LVEF who are already on their standard therapy. Considering the clinical setting and a potential meaningful effect, it is estimated that 2000 patients need to be included and randomized to T and to C in a 1:1 ratio, with randomization stratified by age and clinical severity stage (since both affect life expectancy). 
At the end, 1-year crude cumulative mortality is 20% with C and 16% with T. The T-C difference needs to be estimated with adjustment for age and baseline disease severity, ie, in a generalized linear model that models ln(risk) of death with T and with C to determine the difference [ln(riskT)-ln(riskC)] (exponentiation of the difference yields relative risk, RR). Bayesian analysis: a) two options may be considered, (i) one with no relevant pre-study input that would result in a meaningful (the so-called informed) prior, hence a so-called non-informative prior is used, and (ii) another one using an informed prior based on a pilot study indicating around 25% relative risk reduction and suggesting 95% probability of the effect being within the range between a 62.5% lower and a 50% higher relative risk; b) based on our input (prior and the data, ie, treatment, outcome, age, and disease severity), the computational algorithm uses complex simulation processes to generate and to randomly sample from a large number of simulated distributions to estimate the posterior distribution of ln(riskT)-ln(riskC): exponentiation of particular points of the distribution retrieves RR (median), equal-tail CrI (2.5 and 97.5 percentile), or HPD CrI. Based on the results ( Table 1 ), we conclude that the risk of 1-year mortality with T is relatively 20% lower than with C and that it is 95% probable that the effect is in the range between 34.4% lower and 4.8% lower. This is our claim about T. Note: SAS 9.4 for Windows (SAS Inc., Cary, NC) proc genmod (log link, binomial distribution) was used with (i) the built-in option for Bayesian analysis with Jeffreys prior; or (ii) a normal prior with mean -0.288 and variance 0.125 for treatment (0, 1e6 for other effects). Visual inspection of the trace plots indicated good Markov chain convergence. The use of informative or non-informative priors in data analysis is a question of debate. In general, for data like the present hypothetical trial, non-informative priors are preferred. Similarly, the preference is toward HPD vs equal-tail CrIs ( https://www.berryconsultants.com/use-bayesian-trial/ , accessed June 15, 2019).
Results of the Bayesian analysis of the hypothetical randomized trial comparing treatment (T) to control (C) regarding 1-y mortality in patients with advanced chronic heart failure. Results are relative risks (RR) with highest posterior density (HPD) or equal-tail (2.5 to 97.5 percentile) 95% credible intervals (CrI) for two scenarios: one using a non-informative prior (Jeffreys) and another one using a normal (mean -0.288, variance 0.125) prior probability distribution for treatment
| | RR | 95% CrI (HPD) | 95% CrI (2.5–97.5) |
|---|---|---|---|
| Non-informative prior | 0.800 | 0.656–0.952 | 0.664–0.964 |
| Informative prior | 0.829 | 0.682–0.981 | 0.692–0.993 |
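As an aside for readers who want to see the prior specification in numbers, the short Python sketch below (our illustration; the article's own analysis was run in SAS proc genmod) converts the stated normal prior on the log relative risk (mean -0.288, variance 0.125) to the relative-risk scale, and shows how equal-tail and HPD 95% credible intervals are read off a posterior sample. The posterior sample here is an arbitrary skewed distribution used only to contrast the two interval types; it is not a re-analysis of the hypothetical trial.

```python
import numpy as np

# --- 1. What the informative prior on ln(RR) implies on the relative-risk scale ---
prior_mean, prior_var = -0.288, 0.125      # normal prior on ln(RR), as stated in the text
prior_sd = np.sqrt(prior_var)
lo, hi = prior_mean - 1.96 * prior_sd, prior_mean + 1.96 * prior_sd
print(f"Prior median RR        : {np.exp(prior_mean):.3f}")               # ~0.75, i.e. ~25% relative risk reduction
print(f"Prior 95% range for RR : {np.exp(lo):.3f} to {np.exp(hi):.3f}")   # ~0.375 to ~1.50

# --- 2. Equal-tail vs HPD 95% credible intervals, read from a posterior sample ---
# A right-skewed sample stands in for a generic posterior; because it is skewed,
# the shortest (HPD) interval differs from the equal-tail interval.
rng = np.random.default_rng(1)
posterior = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)

equal_tail = np.percentile(posterior, [2.5, 97.5])

def hpd(sample, mass=0.95):
    """Shortest interval containing `mass` of the sample (a simple sample-based HPD)."""
    s = np.sort(sample)
    k = int(np.floor(mass * len(s)))
    widths = s[k:] - s[: len(s) - k]
    i = int(np.argmin(widths))
    return s[i], s[i + k]

print("Equal-tail 95% CrI:", np.round(equal_tail, 3))
print("HPD 95% CrI       :", np.round(hpd(posterior), 3))
```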
Results of the frequentist analysis of a hypothetical randomized trial comparing treatment (T) to control (C) regarding 1-year mortality in patients with advanced chronic heart failure. Calculations are based on ln(risk): mean difference ln(riskT) – ln(riskC) = -0.2232; standard error = 0.0962; since the sampling distribution of the ln(relative risk) is normal, 95% confidence interval (CI) for the difference = -0.2232 ± 1.96 × 0.0962, ie, -0.4116 to -0.0347. Relative risk (RR) and its 95% CI are obtained by exponentiation of these values
| | RR | 95% CI |
|---|---|---|
| T vs C | 0.800 | 0.663–0.966 |
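The frequentist calculation in the caption above can be reproduced almost exactly from the crude two-by-two data implied by the text (200 of 1000 deaths with C, 160 of 1000 with T). The Python sketch below is our illustration and ignores the adjustment for age and disease severity, which in this hypothetical example appears to change the numbers very little.

```python
import numpy as np

# Crude 2x2 data implied by the hypothetical trial: 1000 patients per arm,
# 20% one-year mortality with control (C) and 16% with treatment (T).
deaths_T, n_T = 160, 1000
deaths_C, n_C = 200, 1000

risk_T, risk_C = deaths_T / n_T, deaths_C / n_C
log_rr = np.log(risk_T / risk_C)                                   # ln(riskT) - ln(riskC)
se = np.sqrt((1 - risk_T) / deaths_T + (1 - risk_C) / deaths_C)    # standard error of ln(RR)

lo, hi = log_rr - 1.96 * se, log_rr + 1.96 * se
print(f"ln(RR) = {log_rr:.4f}, SE = {se:.4f}")                     # ~ -0.2231 and 0.0962
print(f"RR = {np.exp(log_rr):.3f}, 95% CI {np.exp(lo):.3f} to {np.exp(hi):.3f}")  # ~0.800 (0.663 to 0.966)
```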
With both the Bayesian and the frequentist approach we obtained a point-estimate – and this is our best estimate (given the data) of the “true value.” There is, however, one obvious difference: for the CrI we stated it was 95% probable that it contained the “true value” (based on the whole process we undertook), while here, we make NO probabilistic statement related to the calculated 95% CI. We are tempted to do so, just as for the Bayesian CrI: such a statement would be a “natural answer” to a “natural question” ( 6 ) – we wanted an answer to the question “does T convey any survival benefit?”; for this purpose, we did the trial; when the trial is done and the estimate generated, the natural question is – does the interval contain the true value, or – how probable is it that it does? Next, our (medical) practice is in its essence largely Bayesian ( 10 ). For example, by taking a medical history and by examining a patient, we gradually form an opinion about the probability of a certain diagnosis (vs other possibilities) that drives our decisions about subsequent diagnostic tests (in essence – this is the Bayesian prior probability) and affects our opinion about the probability of the respective diagnosis after seeing the test results (the posterior probability): eg, a slightly elevated value of some laboratory parameter might diminish the probability of a diagnosis when the preceding knowledge (medical history, physical examination) indicated a low probability of the diagnosis (low prior probability), or be considered as confirmatory for the diagnosis when previous knowledge strongly suggested the diagnosis (high prior probability). So, we are continuously dealing with probabilities, and thus are likely inherently prone to perceive the CI around the point-estimate as a probability statement. There has been a lot of discussion about what frequentist CIs are and what they are not ( 6 , 8 , 11 , 12 ), commonly involving the debate about CrIs vs CIs, ie, Bayesian vs frequentist views (and which is better for science). But even if one is to stay within the frequentist framework, many different misinterpretations of CIs have been generated ( 12 ). One of the most common is exactly that of assigning a probabilistic statement to CIs in the way it is done with Bayesian CrIs ( 12 ) – and it appears to be equally common among statisticians and non-statistician scientists and students ( 11 , 13 ). However, it is not correct and was not originally claimed ( 3 , 14 , 15 ). We need to go back to the essence of the concepts: (i) with the Bayesian approach, the parameter is a random variable, hence it needs to be estimated by a probability distribution. The point-estimate and the limits of the CrI are fixed – they are points of a generated posterior distribution – hence, “CrIs make a direct probability statement about themselves”, ie, that it is 95% probable (for 95% CrIs) that they, based on the entire completed process, contain the true value; (ii) with the frequentist approach, the parameter is a fixed point; it does not have a probability distribution. However, the point-estimate and the CIs around it are random variables – they have a theoretical probability distribution (sampling distribution) based on which, for the given study, they are generated. Once calculated, CIs either cover or do not cover the true value – we cannot make any probability statements about this based on the sole fact that we calculated them.
Since the sampling distribution from which they are derived is composed of “many samples, with many respective estimates and CIs” that differ one from another, one also cannot say that 95% of the point-estimates (from the very same sampling distribution) would fall within the one particular 95% CI that we have determined ( 12 ). What we can say, and what is in line with the frequentist philosophy that defines probability as the relative occurrence of events/values over a large number of repeated observations, is this: if the underlying model is correct and the only source of variability is chance (sampling variation) alone (ie, there is no systematic error in the process), then if one were to repeat the entire (valid) process in an unlimited number of independent random samples of the same size and from the same population, at least 95% of the CIs thus generated would cover the true value (or 90% or 99%, if these are the CIs of interest) ( 8 , 12 ). So, a specific CI also “conveys” a “probability message,” not about its own probability of containing the true value, but about coverage probability – ie, the probability that the unobserved intervals (the “unlimited” members of the respective sampling distribution) generated by the same valid procedure cover the true value 95% of the time. In other words, the “95% probability” refers to the procedure, not to the one specific calculated CI. So, what does this mean for us at the very moment of looking at the obtained estimate? What are CIs to us? Why would we care about possible outcomes of unobserved repetitions that anyhow are only hypothetical? How should we view the CIs calculated in this particular study? In their “statistical essence,” CIs are indicators of uncertainty about the point-estimate (remember SE as a “measure” of sampling error, ie, distance of the point-estimate from the true value), which is uncertainty due to chance alone, but they may be problematic because they depend on the correctness of probability models and sampling properties ( 8 ). Some authors consider CIs to be useless, mainly due to their close relationship to hypothesis testing and “P values,” and suggest they should be abandoned in favor of Bayesian CrIs ( 11 ). Regarding the interpretation of (95%) CIs, the 2000 BMJ-edition book ( 1 ) states (p.17): “Put simply, this means that there is 95% chance that the indicated range includes the ‘population value’…”, while another paper ( 12 ), co-authored by one of the co-authors of the cited quote, points out such an interpretation as incorrect (quote: [misinterpretation No. 19] “The specific 95% confidence interval presented by a study has a 95% chance of containing the true effect size. – No!”). The same author, in another source ( 16 ), along with a strictly correct definition of CIs, states: “Little is lost by the common but less pure interpretation of the CI as a range of values within which we can be 95% sure the population value lies.”
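The coverage reading of the “95%” can be illustrated with a short simulation (ours, purely didactic): draw many independent random samples from a population with a known mean, compute a 95% CI from each, and count how often the intervals cover the true value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, sd, n, reps = 100.0, 15.0, 50, 20_000    # a known "population" and a fixed sample size

t_crit = stats.t.ppf(0.975, df=n - 1)               # multiplier for a t-based 95% CI
covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, sd, size=n)
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    if m - t_crit * se <= true_mean <= m + t_crit * se:
        covered += 1

print(f"Share of 95% CIs covering the true mean: {covered / reps:.3f}")   # close to 0.95
```

The "95%" emerges only over the collection of repeated intervals; any single interval from the loop either covers the true mean or it does not.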
Hence, apart from the message that frequentist CIs and Bayesian CrIs convey different information, and that the information conveyed by the former is not the one we commonly think it is – everything else is rather confusing. We may become even more confused when we consider the following two facts: a) Bayesian methods have had wide applications in biomedicine, particularly in the analysis of large complex data (eg, those produced by genomic and other omic analyses) ( 17 ), in pharmacokinetic-pharmacodynamic modeling, and in meta-analysis; their use in clinical trials is growing, particularly in adaptive-design sequential trials, and the situation is similar in epidemiology ( 18 ); b) however, frequentist methods have predominated in, eg, clinical and epidemiological research. We will skip the question of whether this is “good or bad” [because (i) it is beyond our reach; (ii) the answer might be neither, or one or the other, or even a combined approach might be better, depending on the problem addressed; (iii) in some situations CIs and CrIs not only agree but could be interpreted in the same way ( 6 , 8 , 17 , 19 , 20 )], and address another one – does it mean that we have been continuously wrong (given our reliance mainly on frequentist methods) about the relationships of interest? Over the decades, a huge number of estimates of the “frequentist type” have been made, resulting in decisions (diagnostic, therapeutic, prophylactic) that have greatly improved medical practice. So, the concept seems to work. Whether and how statistical estimation could be improved is beyond our reach. However, with respect specifically to the concept of CIs, it seems that, for all practical purposes, medical doctors who need to read and understand published papers on a daily basis could well accept, without “major harm,” the view expressed ( 1 , 16 ), although also dismissed ( 12 ), about the interpretation of CIs from a single study (regardless of how conceptually and factually incorrect it might be): “Little is lost by the common but less pure interpretation of the CI as a range of values within which we can be 95% sure the population value lies” ( 16 ). But it needs to be added: the view that it is plausible to be “95% sure” that one specific actually realized CI, which belongs to a range of hypothetical CIs 95% of which (presumably) do cover the true value, includes the “truth” would hold only if “the process,” ie, the way in which the data were collected and analyzed, was a valid one. By saying this, we need to move away from the strictly statistical (conceptual and computational) views about “estimation of reality,” regardless of how essentially important they are, toward the equally important elements of the process that are inherently easier for us to understand. We will assume that in any individual study, the statisticians have done their work correctly, that they have chosen an adequate approach and methodology for data processing, and that the corresponding interval estimates (CrIs or CIs) are provided. The point is – while this part is clearly important, no statistical method can mend the flaws that occur in the process of data gathering (be it an experimental or observational study) – if the data are invalid, the estimates (whatever their theoretical basis) will be invalid.
So, we need valid data in order to get estimates that are “on target.” An inherent consequence of the fact that no one actually knows what the true value is, is that there is no immediate way to check this – eg, years could go by before we realize that an intervention is not as effective, or that it is more harmful, than initially estimated. Off-target estimates come as a consequence of systematic (bias) and random errors in the process of data gathering and estimation: this is a question of research methodology. Therefore, we need to be familiar with research methodology: the types of studies appropriate for respective questions, their reaches and limitations, the various biases to which such studies are sensitive, and the methods of “protection” against them. A point-estimate generated in a study that, by type and design, has the potential to be accurate and could be judged as “well protected from bias” is highly likely to be close to the “true value,” although we might not necessarily be able to say “how close” and “how likely” (or express this as, eg, a percentage). The interval around it (be it Bayesian or frequentist; 90%, 95%, 99%, or any other) would then be highly likely to cover the “truth.” We would therefore state that CIs (around estimates) provide important information that goes beyond the results of statistical tests and P values, regardless of their inherent close relationship to statistical testing (and P values) and their inherent conceptual limitations/fallacies. They should always be viewed within the entire context – for a study (experimental, observational, or meta-analysis), this refers to its general appropriateness for the question addressed, its design and conduct characteristics and, of course, the adequacy of the implemented statistical procedure. If these elements, ie, this “methodological package,” can be considered valid, then we may consider the resulting CI around the estimate as a reliable indicator of the location and size of the true value, and thus as a reliable basis for considering the practical relevance of this (true) value, regardless of the results of statistical tests. Although it may sound heretical, under such circumstances, for all practical purposes it becomes of less relevance to us whether the interval is a direct probability/certainty/confidence statement or not. In the end, we build our final “certainty” or “confidence” about the size of the true effects based on (i) independent replication of estimates arising from methodologically adequate procedures and (ii) concordant estimates from different types of studies.
- 1. Altman DG, Machin D, Bryant TN, Gardner MJ, editors. Statistics with confidence, 2nd edition. Bristol: BMJ Books; 2000. [ Google Scholar ]
- 2. Wasserstein RL, Lazar NA. The ASA Statement on p-Values: Context, Process, and Purpose. Am Stat. 2016;70:129–133. doi: 10.1080/00031305.2016.1154108. [ DOI ] [ Google Scholar ]
- 3. Neyman J. Outline of a theory of statistical estimation based on the classical theory of probability. Philos Trans R Soc Lond A. 1937;236:333–80. doi: 10.1098/rsta.1937.0005. [ DOI ] [ Google Scholar ]
- 4. Kendall DG, Bartlett MS, Page TL. Jerzy Neyman. Biogr Mem Fellows R Soc. 1982;28:379–412. [ Google Scholar ]
- 5. Ponikowski P, Voors AA, Anker SD, Bueno H, Cleland JG, Coats AJ, et al. 2016 ESC Guidelines for the diagnosis and treatment of acute and chronic heart failure. Eur Heart J. 2016;37:2129–200. doi: 10.1093/eurheartj/ehw128. [ DOI ] [ PubMed ] [ Google Scholar ]
- 6. Bolstad WM. Introduction to Bayesian statistics. 2nd ed. Hoboken, NJ: Wiley; 2007. [ Google Scholar ]
- 7. Statistical inference: populations and samples. In: van Belle G, Fisher LD, Heagerty PJ, Lumley T, eds. Biostatistics. A methodology for health sciences. 2nd ed. Hoboken, NJ: Wiley; 2004, pp. 61-116. [ Google Scholar ]
- 8. Precision and statistics in epidemiologic studies. In: Rothman KJ, Greenland S, Lash TL, eds. Modern epidemiology. 3rd ed. Philadelphia: Lippincott Williams & Wilkins; 2008, pp. 148-167. [ Google Scholar ]
- 9. Sedgwick P. Uncertainty in sample estimates: sampling error. BMJ. 2015;350:h1914. doi: 10.1136/bmj.h1914. [ DOI ] [ PubMed ] [ Google Scholar ]
- 10. Gill CJ, Sabin L, Schmid CH. Why clinicians are natural Bayesians. BMJ. 2005;330:1080–3. doi: 10.1136/bmj.330.7499.1080. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 11. Morey RD, Hoekstra R, Rouder JN, Lee MD, Wagenmakers EJ. The fallacy of placing confidence in confidence intervals. Psychon Bull Rev. 2016;23:103–23. doi: 10.3758/s13423-015-0947-8. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 12. Greenland S, Senn SJ, Rothman K, Carlin JB, Poole C, Goodman SN, et al. Statistical tests, P values, confidence intervals and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31:337–50. doi: 10.1007/s10654-016-0149-3. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 13. Hoekstra R, Morey RD, Rouder JN, Wagenmakers EJ. Robust misinterpretation of confidence intervals. Psychon Bull Rev. 2014;21:1157–64. doi: 10.3758/s13423-013-0572-3. [ DOI ] [ PubMed ] [ Google Scholar ]
- 14. Neyman J. Fiducial argument and the theory of confidence intervals. Biometrika. 1941;32:128–50. doi: 10.1093/biomet/32.2.128. [ DOI ] [ Google Scholar ]
- 15. Neyman J. Frequentist probability and frequentist statistics. Synthese. 1977;36:97–131. doi: 10.1007/BF00485695. [ DOI ] [ Google Scholar ]
- 16. Altman DG. Why we need confidence intervals. World J Surg. 2005;29:554–6. doi: 10.1007/s00268-005-7911-0. [ DOI ] [ PubMed ] [ Google Scholar ]
- 17. Efron B. Bayes’ Theorem in the 21st century. Science. 2013;340:1177–8. doi: 10.1126/science.1236536. [ DOI ] [ PubMed ] [ Google Scholar ]
- 18. Lee JJ, Chu CT. Bayesian clinical trials in action. Stat Med. 2012;31:2955–72. doi: 10.1002/sim.5404. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 19. Gray K, Hampton B, Silveti-Falls T, McConnell A, Bausell C. Comparison of Bayesian credible intervals to frequentist confidence intervals. J Mod Appl Stat Methods. 2015;12:43–52. doi: 10.22237/jmasm/1430453220. [ DOI ] [ Google Scholar ]
- 20. Little RJ. Calibrated Bayes. Am Stat. 2006;60:213–23. doi: 10.1198/000313006X117837. [ DOI ] [ Google Scholar ]
Introduction to Statistical Hypothesis Testing in Nursing Research
Keeler, Courtney PhD, RN; Curtis, Alexa Colgrove PhD, FNP, PMHNP
Editor's note: This is the 16th article in a series on clinical research by nurses. The series is designed to be used as a resource for nurses to understand the concepts and principles essential to research. Each column will present the concepts that underpin evidence-based practice—from research design to data interpretation. To see all the articles in the series, go to https://links.lww.com/AJN/A204 .
Hypothesis Testing, P Values, Confidence Intervals, and Significance
Definition/Introduction
Medical providers often rely on evidence-based medicine to guide decision-making in practice. Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators. Unfortunately, healthcare providers may have different comfort levels in interpreting these findings, which may affect the adequate application of the data.
Issues of Concern
Without a foundational understanding of hypothesis testing, p values, confidence intervals, and the difference between statistical and clinical significance, healthcare providers may struggle to make clinical decisions without relying purely on the level of significance deemed important by the research investigators. Therefore, an overview of these concepts is provided to allow medical professionals to use their expertise to determine whether results are reported sufficiently and whether the study outcomes are clinically appropriate to be applied in healthcare practice.
Hypothesis Testing
Investigators conducting studies need research questions and hypotheses to guide analyses. Starting with broad research questions (RQs), investigators then identify a gap in current clinical practice or research. Any research problem or statement is grounded in a better understanding of relationships between two or more variables. For this article, we will use the following research question example:
Research Question: Is Drug 23 an effective treatment for Disease A?
Research questions do not directly imply specific guesses or predictions; we must formulate research hypotheses. A hypothesis is a predetermined declaration regarding the research question in which the investigator(s) makes a precise, educated guess about a study outcome. This is sometimes called the alternative hypothesis and ultimately allows the researcher to take a stance based on experience or insight from medical literature. An example of a hypothesis is below.
Research Hypothesis: Drug 23 will significantly reduce symptoms associated with Disease A compared to Drug 22.
The null hypothesis states that there is no statistical difference between groups based on the stated research hypothesis.
An example of a null hypothesis is below.
Null Hypothesis: There will be no statistically significant difference in the reduction of symptoms for Disease A between Drug 23 and Drug 22.
The null hypothesis is deemed true until a study presents significant data to support rejecting it. Based on the results, the investigators will either reject the null hypothesis (if they find significant differences or associations) or fail to reject the null hypothesis (if they cannot demonstrate significant differences or associations).
To test a hypothesis, researchers obtain data on a representative sample to determine whether to reject or fail to reject a null hypothesis. In most research studies, it is not feasible to obtain data for an entire population. Using a sampling procedure allows for statistical inference, though this involves a certain possibility of error. [1] When determining whether to reject or fail to reject the null hypothesis, mistakes can be made: Type I and Type II errors. Though it is impossible to ensure that these errors have not occurred, researchers should limit the possibilities of these faults. [2]
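To make the reject/fail-to-reject decision concrete, here is a minimal Python sketch of a two-proportion z-test for the Drug 23 versus Drug 22 example; the symptom counts are hypothetical numbers chosen for illustration, not data from any actual study.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: patients still experiencing symptoms of Disease A after treatment
symptomatic_23, n_23 = 16, 100   # Drug 23 group
symptomatic_22, n_22 = 30, 100   # Drug 22 group

p23, p22 = symptomatic_23 / n_23, symptomatic_22 / n_22
p_pool = (symptomatic_23 + symptomatic_22) / (n_23 + n_22)          # pooled proportion under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_23 + 1 / n_22))

z = (p23 - p22) / se
p_value = 2 * stats.norm.sf(abs(z))                                 # two-sided p-value

alpha = 0.05
print(f"z = {z:.2f}, p = {p_value:.4f}")
print("Reject the null hypothesis" if p_value < alpha else "Fail to reject the null hypothesis")
```

With these made-up counts the p-value falls below 0.05, so the null hypothesis would be rejected; with weaker data the same procedure would fail to reject it.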
Significance
Significance is a term used to describe the substantive importance of medical research. Statistical significance refers to the likelihood that the observed results are due to chance. [3] Healthcare providers should always delineate statistical significance from clinical significance; conflating the two is a common error when reviewing biomedical research. [4] When considering findings reported as either significant or not significant, healthcare providers should not simply accept researchers' results or conclusions without considering the clinical significance. Healthcare professionals should consider the clinical importance of findings and understand both p values and confidence intervals so they do not have to rely on the researchers to determine the level of significance. [5] One criterion often used to determine statistical significance is the p value.
P values are used in research to determine whether the sample estimate is significantly different from a hypothesized value. The p-value is the probability that the observed effect within the study would have occurred by chance if, in reality, there were no true effect. Conventionally, data yielding p<0.05 or p<0.01 are considered statistically significant. While some have debated whether the 0.05 threshold should be lowered, it is still universally practiced. [6] Note that as the number of individuals enrolled in a study (the sample size) increases, the likelihood of finding a statistically significant effect increases; with very large sample sizes, the p-value can be very low even when the underlying difference is small, and hypothesis testing by itself does not tell us the size of the effect.
Examples of findings reported with p values are below:
Statement: Drug 23 reduced patients' symptoms compared to Drug 22. Patients who received Drug 23 (n=100) were 2.1 times less likely than patients who received Drug 22 (n = 100) to experience symptoms of Disease A, p<0.05.
Statement: Individuals who were prescribed Drug 23 experienced fewer symptoms (M = 1.3, SD = 0.7) compared to individuals who were prescribed Drug 22 (M = 5.3, SD = 1.9). This finding was statistically significant, p = 0.02.
For either statement, if the threshold had been set at 0.05, the null hypothesis (that there was no relationship) should be rejected, and we should conclude that there were significant differences. Noticeably, as can be seen in the two statements above, some researchers will report findings with < or > while others will provide an exact p-value (e.g., 0.000001), but never zero. [6] When examining research, readers should understand how p values are reported. The best practice is to report all p values for all variables within a study design, rather than only providing p values for variables with significant findings. [7] The inclusion of all p values provides evidence for study validity and limits suspicion of selective reporting/data mining. Researchers should also be aware of journal recommendations when considering how to report p values, and manuscripts should remain internally consistent.
While researchers have historically used p values, experts who find p values problematic encourage the use of confidence intervals. [8] P-values alone do not allow us to understand the size or the extent of the differences or associations. [3] In March 2016, the American Statistical Association (ASA) released a statement on p values, noting that scientific decision-making and conclusions should not be based on a fixed p-value threshold (e.g., 0.05). They recommend focusing on the significance of results in the context of study design, quality of measurements, and validity of data. Ultimately, the ASA statement noted that in isolation, a p-value does not provide strong evidence. [9]
When conceptualizing clinical work, healthcare professionals should consider p values alongside a concurrent appraisal of study design validity. For example, a p-value from a double-blinded randomized clinical trial (designed to minimize bias) should be given more weight than one from a retrospective observational study. [7] The p-value debate has smoldered since the 1950s, [10] and replacement with confidence intervals has been suggested since the 1980s. [11]
Confidence Intervals
A confidence interval provides a range of values, at a given level of confidence (e.g., 95%), that is expected to contain the true value of the statistical parameter for the targeted population. [12] Most research uses a 95% CI, but investigators can set any level (e.g., 90% CI, 99% CI). [13] A CI provides a range with the lower bound and upper bound limits of a difference or association that would be plausible for a population. [14] Therefore, a 95% CI indicates that if a study were to be carried out 100 times, the calculated ranges would contain the true value in about 95 of them. [15] Compared with p-values, confidence intervals provide more evidence regarding the precision of an estimate. [6]
Continuing the research example provided above, one could make the following statement with a 95% CI:
Statement: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22; the mean difference in days to recovery between the two groups was 4.2 days (95% CI: 1.9 – 7.8).
It is important to note that the width of the CI is affected by the standard error and the sample size; reducing the sample size of a study will result in less precision of the CI (a wider interval). [14] A wider interval indicates a smaller sample size or larger variability. [16] A researcher would generally want to increase the precision of the CI. For example, a 95% CI of 1.43 – 1.47 is much more precise than the one provided in the example above. In research and clinical practice, CIs provide valuable information on whether the interval includes or excludes any clinically significant values. [14]
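The dependence of CI width on sample size is easy to see numerically. The sketch below (our illustration, with arbitrary values) computes a 95% CI for a single mean at several sample sizes while holding the standard deviation fixed; the interval narrows roughly with the square root of n.

```python
import numpy as np
from scipy import stats

mean, sd = 4.2, 3.0                       # arbitrary illustrative values (e.g., days to recovery)
for n in (10, 50, 200, 1000):
    se = sd / np.sqrt(n)                  # standard error shrinks as n grows
    t_crit = stats.t.ppf(0.975, df=n - 1)
    lo, hi = mean - t_crit * se, mean + t_crit * se
    print(f"n = {n:>4}: 95% CI {lo:5.2f} to {hi:5.2f} (width {hi - lo:.2f})")
```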
A CI is often judged against the null value (zero for differences and 1 for ratios). However, CIs provide more information than that. [15] Consider this example: a hospital implements a new protocol that reduced wait time for patients in the emergency department by an average of 25 minutes (95% CI: -2.5 – 41 minutes). Because the range crosses zero, implementing this protocol in different populations could result in no reduction, or even longer wait times; however, most of the range lies on the positive side. Thus, while the p-value used to detect statistical significance here may result in "not significant" findings, individuals should examine this range, consider the study design, and weigh whether or not it is still worth piloting in their workplace.
As with p-values, 95% CIs cannot control for researchers' errors (e.g., study bias or improper data analysis). [14] When deciding whether to report p-values, CIs, or both, researchers should examine journal preferences. When in doubt, reporting both may be beneficial. [13] An example is below:
Reporting both: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22, p = 0.009. The mean difference in days to recovery between the two groups was 4.2 days (95% CI: 1.9 – 7.8).
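As a minimal sketch of producing both pieces of information for a difference in means (simulated recovery-time data, not results from any real trial), the following computes a Welch t-test p-value together with an approximate 95% CI for the mean difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Simulated days-to-recovery; the true group means differ by roughly 4 days.
drug23 = rng.normal(3.0, 2.0, size=100)
drug22 = rng.normal(7.2, 3.0, size=100)

res = stats.ttest_ind(drug22, drug23, equal_var=False)     # Welch two-sample t-test

diff = drug22.mean() - drug23.mean()
se = np.sqrt(drug22.var(ddof=1) / len(drug22) + drug23.var(ddof=1) / len(drug23))
lo, hi = diff - 1.96 * se, diff + 1.96 * se                # normal-approximation 95% CI

print(f"Mean difference = {diff:.1f} days, 95% CI {lo:.1f} to {hi:.1f}, p = {res.pvalue:.4g}")
```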
Clinical Significance
Recall that clinical significance and statistical significance are two different concepts. Healthcare providers should remember that a study with statistically significant differences and a large sample size may be of no interest to clinicians, whereas a study with a smaller sample size and statistically non-significant results could impact clinical practice. [14] Additionally, as previously mentioned, a non-significant finding may reflect the study design itself (e.g., an insufficient sample size) rather than a true absence of a relationship between variables.
Healthcare providers using evidence-based medicine to inform practice should use clinical judgment to determine the practical importance of studies through careful evaluation of the design, sample size, power, likelihood of type I and type II errors, data analysis, and reporting of statistical findings (p values, 95% CIs, or both). [4] Interestingly, some experts have called for the terms "statistically significant" and "not significant" to be excluded from published work, as statistical significance never has been and never will be equivalent to clinical significance. [17]
The decision on what is clinically significant can be challenging and depends on the provider's experience and, especially, the severity of the disease. Providers should use their knowledge and experience to determine the meaningfulness of study results and make inferences based not only on whether researchers report results as significant or non-significant but also on their own understanding of study limitations and practical implications.
Nursing, Allied Health, and Interprofessional Team Interventions
All physicians, nurses, pharmacists, and other healthcare professionals should strive to understand the concepts in this chapter. These individuals should maintain the ability to review and incorporate new literature for evidence-based and safe care.
Jones M, Gebski V, Onslow M, Packman A. Statistical power in stuttering research: a tutorial. Journal of speech, language, and hearing research : JSLHR. 2002 Apr:45(2):243-55 [PubMed PMID: 12003508]
Sedgwick P. Pitfalls of statistical hypothesis testing: type I and type II errors. BMJ (Clinical research ed.). 2014 Jul 3:349():g4287. doi: 10.1136/bmj.g4287. Epub 2014 Jul 3 [PubMed PMID: 24994622]
Fethney J. Statistical and clinical significance, and how to use confidence intervals to help interpret both. Australian critical care : official journal of the Confederation of Australian Critical Care Nurses. 2010 May:23(2):93-7. doi: 10.1016/j.aucc.2010.03.001. Epub 2010 Mar 29 [PubMed PMID: 20347326]
Hayat MJ. Understanding statistical significance. Nursing research. 2010 May-Jun:59(3):219-23. doi: 10.1097/NNR.0b013e3181dbb2cc. Epub [PubMed PMID: 20445438]
Ferrill MJ, Brown DA, Kyle JA. Clinical versus statistical significance: interpreting P values and confidence intervals related to measures of association to guide decision making. Journal of pharmacy practice. 2010 Aug:23(4):344-51. doi: 10.1177/0897190009358774. Epub 2010 Apr 13 [PubMed PMID: 21507834]
Infanger D, Schmidt-Trucksäss A. P value functions: An underused method to present research results and to promote quantitative reasoning. Statistics in medicine. 2019 Sep 20:38(21):4189-4197. doi: 10.1002/sim.8293. Epub 2019 Jul 3 [PubMed PMID: 31270842]
Dorey F. Statistics in brief: Interpretation and use of p values: all p values are not equal. Clinical orthopaedics and related research. 2011 Nov:469(11):3259-61. doi: 10.1007/s11999-011-2053-1. Epub [PubMed PMID: 21918804]
Liu XS. Implications of statistical power for confidence intervals. The British journal of mathematical and statistical psychology. 2012 Nov:65(3):427-37. doi: 10.1111/j.2044-8317.2011.02035.x. Epub 2011 Oct 25 [PubMed PMID: 22026811]
Tijssen JG, Kolm P. Demystifying the New Statistical Recommendations: The Use and Reporting of p Values. Journal of the American College of Cardiology. 2016 Jul 12:68(2):231-3. doi: 10.1016/j.jacc.2016.05.026. Epub [PubMed PMID: 27386779]
Spanos A. Recurring controversies about P values and confidence intervals revisited. Ecology. 2014 Mar:95(3):645-51 [PubMed PMID: 24804448]
Freire APCF, Elkins MR, Ramos EMC, Moseley AM. Use of 95% confidence intervals in the reporting of between-group differences in randomized controlled trials: analysis of a representative sample of 200 physical therapy trials. Brazilian journal of physical therapy. 2019 Jul-Aug:23(4):302-310. doi: 10.1016/j.bjpt.2018.10.004. Epub 2018 Oct 16 [PubMed PMID: 30366845]
Dorey FJ. In brief: statistics in brief: Confidence intervals: what is the real result in the target population? Clinical orthopaedics and related research. 2010 Nov:468(11):3137-8. doi: 10.1007/s11999-010-1407-4. Epub [PubMed PMID: 20532716]
Porcher R. Reporting results of orthopaedic research: confidence intervals and p values. Clinical orthopaedics and related research. 2009 Oct:467(10):2736-7. doi: 10.1007/s11999-009-0952-1. Epub 2009 Jun 30 [PubMed PMID: 19565303]
Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. British medical journal (Clinical research ed.). 1986 Mar 15:292(6522):746-50 [PubMed PMID: 3082422]
Cooper RJ, Wears RL, Schriger DL. Reporting research results: recommendations for improving communication. Annals of emergency medicine. 2003 Apr:41(4):561-4 [PubMed PMID: 12658257]
Doll H, Carney S. Statistical approaches to uncertainty: P values and confidence intervals unpacked. Equine veterinary journal. 2007 May:39(3):275-6 [PubMed PMID: 17520981]
Colquhoun D. The reproducibility of research and the misinterpretation of p-values. Royal Society open science. 2017 Dec:4(12):171085. doi: 10.1098/rsos.171085. Epub 2017 Dec 6 [PubMed PMID: 29308247]