  • Published: 29 June 2016

Points of Significance

Logistic regression

Jake Lever, Martin Krzywinski & Naomi Altman

Nature Methods volume 13, pages 541–542 (2016)


  • Research data
  • Statistical methods

Regression can be used on categorical responses to estimate probabilities and to classify.


In recent columns we showed how linear regression can be used to predict a continuous dependent variable given other independent variables [1,2]. When the dependent variable is categorical, a common approach is to use logistic regression, a method that takes its name from the type of curve it uses to fit data. Categorical variables are commonly used in biomedical data to encode a set of discrete states, such as whether a drug was administered or whether a patient has survived. Categorical variables may have more than two values, which may have an implicit order, such as whether a patient never, occasionally or frequently smokes. In addition to predicting the value of a variable (e.g., a patient will survive), logistic regression can also predict the associated probability (e.g., the patient has a 75% chance of survival).

There are many reasons to assess the probability of a state of a categorical variable, and a common application is classification: predicting the class of a new data point. Many methods are available, but regression has the advantage of being relatively simple to perform and interpret. First a training set is used to develop a prediction equation, and then the predicted membership probability is thresholded to predict the class membership for new observations, with the point classified to the most probable class. If the costs of misclassification differ between the two classes, alternative thresholds may be chosen to minimize misclassification costs estimated from the training sample (Fig. 1). For example, in the diagnosis of a deadly but readily treated disease, it is less costly to falsely assign a patient to the treatment group than to the no-treatment group.

Figure 1: Shown are observations of a categorical variable positioned using the predicted probability of being in one of two classes, encoded by open and solid circles, respectively. Top row: when class membership is perfectly separable, a threshold (e.g., 0.5) can be chosen to make classification perfectly accurate. Bottom row: when separation between classes is ambiguous, as shown here with the same predictor values as for the row above, perfect classification accuracy with a single-value threshold is not possible. The threshold is tuned to control false positives (e.g., 0.75) or false negatives (e.g., 0.25).
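The thresholding idea can be sketched in a few lines of Python. The predicted probabilities, class labels and misclassification costs below are made-up illustrative values, not the data behind Fig. 1; the sketch only shows how the numbers of false positives and false negatives move with the threshold, and how a cost-weighted choice might be made.

```python
# Hypothetical predicted probabilities of class 1 and the true classes.
probs = [0.10, 0.20, 0.35, 0.60, 0.80, 0.30, 0.45, 0.65, 0.85, 0.95]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def confusion(threshold):
    """Return (false positives, false negatives) when p >= threshold means class 1."""
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return fp, fn


def total_cost(threshold, fp_cost=1.0, fn_cost=5.0):
    """Total misclassification cost; both cost values are assumptions."""
    fp, fn = confusion(threshold)
    return fp * fp_cost + fn * fn_cost


candidates = [0.25, 0.50, 0.75]
best = min(candidates, key=total_cost)
for t in candidates:
    print(t, confusion(t), total_cost(t))
print("lowest-cost threshold:", best)
```

With a false negative assumed to be five times as costly as a false positive, the lowest of the three candidate thresholds is preferred, which mirrors the deadly-but-treatable disease example.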

In our example of simple linear regression [1], we saw how one continuous variable (weight) could be predicted on the basis of another continuous variable (height). To illustrate classification, here we extend that example to use height to predict the probability that an individual plays professional basketball. Let us assume that professional basketball players have a mean height of 200 cm and that those who do not play professionally have a mean height of 170 cm, with both populations being normal and having an s.d. of 15 cm. First, we create a training data set by randomly sampling the heights of 5 individuals who play professional basketball and 15 who do not (Fig. 2a). We then assign categorical classifications of 1 (plays professional basketball) and 0 (does not play professional basketball). For simplicity, our example is limited to two classes, but more are possible.

Figure 2: (a) The effect of outliers on classification based on linear regression. The plot shows classification using linear regression fit (solid black line) to the training set of those who play professional basketball (solid circles; classification of 1) and those who do not (open circles; classification of 0). When a probability cutoff of 0.5 is used (horizontal dotted line), the fit yields a threshold of 192 cm (dashed black line) as well as one false negative (FN) and one false positive (FP). Including the outlier at H = 100 cm (orange circle) in the fit (solid orange line) increases the threshold to 197 cm (dashed orange line). (b) The effect of outliers on classification based on step and logistic regression. Regression using step and logistic models yields thresholds of 185 cm (solid vertical blue line) and 194 cm (dashed blue line), respectively. The outlier from a does not substantially affect either fit.

Let us first approach this classification using linear regression, which minimizes the sum of squared residuals [1], and fit a line to the data (Fig. 2a). Each data point has one of two distinct y-values (0 and 1), which correspond to the probability of playing professional basketball, and the fit represents the predicted probability as a function of height, increasing from 0 at 159 cm to 1 at 225 cm. The fit line is truncated outside the [0, 1] range because values outside this range cannot be interpreted as probabilities. Using a probability threshold of 0.5 for classification, we find that 192 cm should be the decision boundary for predicting whether an individual plays professional basketball. It gives reasonable classification performance: only one point is misclassified as a false positive, and one point as a false negative (Fig. 2a).

Unfortunately, our linear regression fit is not robust. Consider a child of height H = 100 cm who does not play professional basketball (Fig. 2a). This height is below the threshold of 192 cm and would be classified correctly. However, if this data point is part of the training set, it will greatly influence the fit [3] and increase the classification threshold to 197 cm, which would result in an additional false negative.
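The sensitivity of the linear fit to a single extreme point can be reproduced with a small sketch. The eight heights below are hypothetical values chosen for illustration (they are not the 20 sampled points of Fig. 2a), so the thresholds differ from the 192 cm and 197 cm quoted above; only the direction of the shift is meant to match.

```python
def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def boundary(xs, ys, cutoff=0.5):
    """Height at which the fitted line crosses the probability cutoff."""
    intercept, slope = fit_line(xs, ys)
    return (cutoff - intercept) / slope


# Hypothetical heights (cm): six non-players (0) and two players (1).
heights = [155, 165, 170, 175, 180, 195, 190, 210]
plays = [0, 0, 0, 0, 0, 0, 1, 1]

base = boundary(heights, plays)
# Add one non-playing child of height 100 cm to the training set and refit.
with_outlier = boundary(heights + [100], plays + [0])
print(round(base, 2), round(with_outlier, 2))
```

In this toy data set the outlier flattens the fitted line and pushes the 0.5 crossing point upward by roughly 15 cm, even though the child is classified correctly by either line.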

To improve the robustness and general performance of this classifier, we could fit the data to a curve other than a straight line. One very simple option is the step function (Fig. 2b), which is 1 when greater than a certain value and 0 otherwise. An advantage of the step function is that it defines a decision boundary (185 cm) that is not affected by the outlier (H = 100 cm), but it cannot provide class probabilities other than 0 and 1. This turns out to be sufficient for the purpose of classification, as many classification algorithms do not provide probabilities. However, the step function also does not differentiate between the more extreme observations, which are far from the decision boundary and more likely to be correctly assigned, and those near the decision boundary for which membership in either group is plausible. In addition, the step function is not differentiable at the step, and regression generally requires a function that is differentiable everywhere. To mitigate this issue, smooth sigmoid curves are used. One used commonly in the natural sciences is the logistic curve (Fig. 2b), which readily relates to the odds.

If p is the probability that a person plays professional basketball, then the odds are p/(1 − p), which is the ratio of the probability of playing to the probability of not playing. The log odds is the logarithmic transform of this quantity, ln(p/(1 − p)). Logistic regression models the log odds as a linear combination of the independent variables. For our example, height (H) is the independent variable, the logistic fit parameters are β_0 (intercept) and β_H (slope), and the equation that relates them is ln(p/(1 − p)) = β_0 + β_H H. In general, there may be any number of predictor variables and associated regression parameters (or slopes). Modeling the log odds allows us to estimate the probability of class membership using a linear relationship, similar to linear regression. The log odds can be transformed back to a probability as p(t) = 1/(1 + exp(−t)), where t = β_0 + β_H H. This is an S-shaped (sigmoid) curve, with steepness controlled by β_H, that maps the linear function back to probabilities in [0, 1].
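The two transforms in this paragraph are short enough to write out directly. The sketch below is a minimal illustration; the function names and the coefficient values are assumptions made for the example, not fitted values from this column.

```python
import math


def logit(p):
    """Log odds ln(p / (1 - p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))


def logistic(t):
    """Inverse of logit: maps any real t to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def prob_plays(height_cm, b0, bh):
    """Probability from the linear predictor t = b0 + bh * H."""
    return logistic(b0 + bh * height_cm)


# Illustrative coefficients only (assumed, not estimates from the article):
# they place the p = 0.5 point at 190 cm because b0 + bh * 190 = 0.
b0, bh = -19.0, 0.1
for h in (170, 190, 210):
    print(h, round(prob_plays(h, b0, bh), 3))
```

Because the linear predictor is zero at the decision boundary, the height at which p = 0.5 is simply −b0/bh, and a larger bh makes the curve steeper around that point.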

As in linear regression, we need to estimate the regression parameters. These estimates are denoted by b_0 and b_H to distinguish them from the true but unknown intercept β_0 and slope β_H. Unlike linear regression [1], which yields an exact analytical solution for the estimated regression coefficients, logistic regression requires numerical optimization to find the optimal estimate, such as the iterative approach shown in Figure 3a. For our example, this would correspond to finding the maximum likelihood estimates, the pair of estimates b_0 and b_H that maximize the likelihood of the observed data (or, equivalently, minimize the negative log likelihood). Once these estimates are found, we can calculate the membership probability, which is a function of these estimates as well as of our predictor H.
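One way to see what such a numerical search involves is plain gradient ascent on the log likelihood. This is a minimal sketch under stated assumptions: the eight data points are hypothetical, the step size and iteration count are arbitrary choices that happen to be stable for this rescaled data, and statistical software typically uses faster Newton-type iterations rather than this simple scheme.

```python
import math

# Hypothetical training set: height in cm and class (1 = plays professionally).
heights = [155, 165, 170, 175, 180, 195, 190, 210]
plays = [0, 0, 0, 0, 0, 0, 1, 1]

# Centre and rescale height so that a fixed step size is stable.
xs = [(h - 180.0) / 10.0 for h in heights]


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def neg_log_likelihood(b0, b1):
    total = 0.0
    for x, y in zip(xs, plays):
        p = logistic(b0 + b1 * x)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def gradient(b0, b1):
    """Gradient of the log likelihood with respect to (b0, b1)."""
    g0 = g1 = 0.0
    for x, y in zip(xs, plays):
        resid = y - logistic(b0 + b1 * x)
        g0 += resid
        g1 += resid * x
    return g0, g1


b0, b1 = 0.0, 0.0
for _ in range(20000):
    g0, g1 = gradient(b0, b1)
    b0 += 0.1 * g0  # step uphill on the log likelihood
    b1 += 0.1 * g1

print("b0, b1 on the rescaled height:", round(b0, 3), round(b1, 3))
print("negative log likelihood:", round(neg_log_likelihood(b0, b1), 3))
```

The search stops improving when the gradient is essentially zero, which is the condition that defines the maximum likelihood estimates; the slope on the original centimetre scale is b1 divided by 10.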

Figure 3: The slope parameter for each logistic curve (upper plot) is indicated by a correspondingly colored point in the lower plot, shown with its associated negative log likelihood. (a) A non-separable data set with different logistic curves using a single slope parameter. A minimum is found for the ideal curve (blue). (b) A perfectly separable data set for which no minimum exists. Attempts at a solution create increasingly steeper curves: the negative log likelihood asymptotically decreases toward zero, and the estimated slope tends toward infinity.

In most cases, the maximum-likelihood estimates are unique and optimal. However, when the classes are perfectly separable, this iterative approach fails because there is an infinite number of solutions with equivalent predictive power that can perfectly predict class membership for the training set. Here, we cannot estimate the regression parameters (Fig. 3b) or assign a probability of class membership.
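The failure under perfect separation can be checked numerically. In the sketch below, the four data points are hypothetical and the intercept is fixed at zero for simplicity; it evaluates the negative log likelihood for ever steeper logistic curves.

```python
import math

# Hypothetical, perfectly separable data: x < 0 is class 0, x > 0 is class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]


def neg_log_likelihood(slope, intercept=0.0):
    """Negative log likelihood of a logistic curve with the given parameters."""
    total = 0.0
    for x, y in zip(xs, ys):
        t = intercept + slope * x
        # -ln p for class 1 is ln(1 + e^-t); -ln(1 - p) for class 0 is ln(1 + e^t).
        total += math.log1p(math.exp(-t if y == 1 else t))
    return total


slopes = [1, 2, 4, 8, 16]
values = [neg_log_likelihood(s) for s in slopes]
for s, v in zip(slopes, values):
    print(s, v)
```

Every doubling of the slope lowers the negative log likelihood further without ever reaching zero, so an optimizer keeps increasing the slope and never settles on a finite estimate.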

The interpretation of logistic regression shares some similarities with that of linear regression; for instance, variables given the greatest importance may be reliable predictors but might not actually be causal. Logistic regression parameters can be used to understand the relative predictive power of different variables, assuming that the variables have already been normalized to have a mean of 0 and variance of 1. It is important to understand the effect that a change to an independent variable will have on the results of a regression. In linear regression the coefficients have an additive effect on the predicted value, which increases by β_i when the ith independent variable increases by one unit. In logistic regression the coefficients have an additive effect on the log odds rather than on the predicted probability.
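To make the last point concrete, the effect of a one-unit increase in a predictor can be worked out from the model equation. The short derivation below uses the single-predictor height model from earlier in this column and follows directly from the definition of the log odds.

```latex
\begin{align*}
\ln\frac{p(H)}{1-p(H)} &= \beta_0 + \beta_H H \\
\ln\frac{p(H+1)}{1-p(H+1)} - \ln\frac{p(H)}{1-p(H)} &= \beta_H \\
\frac{p(H+1)/\bigl(1-p(H+1)\bigr)}{p(H)/\bigl(1-p(H)\bigr)} &= e^{\beta_H}
\end{align*}
```

Each additional unit of H therefore multiplies the odds by exp(β_H), whereas the resulting change in the probability itself depends on where on the sigmoid curve H lies.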

Similar to linear regression, correlation among multiple predictors is a challenge to fitting logistic regression. For instance, if we are fitting a logistic regression for professional basketball using height and weight, we must be aware that these variables are highly positively correlated. Either one of them already gives insight into the value of the other. If two variables are perfectly correlated, then there would be multiple solutions to the logistic regression that would give exactly the same fit. Correlated features also make interpretation of coefficients much more difficult. Discussion of the quality of the fit of the logistic model and of classification accuracy will be left to a later column.
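The statement about perfectly correlated predictors can be illustrated with a small sketch. Everything here is hypothetical: weight is constructed as an exact linear function of height, and both coefficient sets are assumed values chosen so that their linear predictors coincide.

```python
import math


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


# Hypothetical case: weight is an exact linear function of height.
heights = [160.0, 175.0, 190.0, 205.0]
weights = [0.5 * h - 10.0 for h in heights]


def predict(b0, b_height, b_weight):
    """Predicted probabilities for every individual in the toy data set."""
    return [logistic(b0 + b_height * h + b_weight * w)
            for h, w in zip(heights, weights)]


# Two different coefficient sets with the same linear predictor:
# -18.6 + 0.08*H + 0.04*(0.5*H - 10) simplifies to -19 + 0.10*H.
only_height = predict(-19.0, 0.10, 0.00)
shared = predict(-18.6, 0.08, 0.04)
print(only_height)
print(shared)
```

Both coefficient sets give identical probabilities for every individual, so the data cannot distinguish between them, and the individual coefficients lose their interpretation.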

Logistic regression is a powerful tool for predicting class probabilities and for classification using predictor variables. For example, one can model the lethality of a new drug protocol in mice by predicting the probability of survival or, with an appropriate probability threshold, by classifying on the basis of survival outcome. Multiple factors of an experiment can be included, such as dosing information, animal weight and diet data, but care must be taken in interpretation to include the possibility of correlation.

References

1. Altman, N. & Krzywinski, M. Nat. Methods 12, 999–1000 (2015).
2. Krzywinski, M. & Altman, N. Nat. Methods 12, 1103–1104 (2015).
3. Altman, N. & Krzywinski, M. Nat. Methods 13, 281–282 (2016).

Author information

Authors and affiliations

Jake Lever is a PhD candidate at Canada's Michael Smith Genome Sciences Centre.

Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.

Naomi Altman is a Professor of Statistics at The Pennsylvania State University.

Ethics declarations

Competing interests.

The authors declare no competing financial interests.


About this article

Cite this article.

Lever, J., Krzywinski, M. & Altman, N. Logistic regression. Nat Methods 13 , 541–542 (2016). https://doi.org/10.1038/nmeth.3904


Issue Date : July 2016

DOI : https://doi.org/10.1038/nmeth.3904


Logistic Regression: A Basic Approach

Conference paper
First Online: 31 May 2023

Naman Kaur & Himanshu

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 623)

Included in the following conference series:

  • International Conference on Information and Communication Technology for Competitive Strategies


In data mining, and in binary classification in particular, logistic regression is among the most widely used procedures. Although it is most often applied to dichotomous dependent variables, logistic regression can also be applied to dependent variables with more than two categories. The goal of this paper is to provide an overview of the logistic regression model: why it is needed in addition to linear regression, how the two methods are similar and how they differ, how independent variables are selected and how many should be included, and the main assumptions on which logistic regression rests. The overview aims to cover the components of logistic regression that matter most for data modeling. It also shows how logistic regression behaves when the data are irregular or contain rare events.


Author information

Authors and affiliations.

Department of Mathematics, Chandigarh University, Sahibzada Ajit Nagar, 140413, India

Naman Kaur &  Himanshu

Department of Computer Science & Engineering, Chandigarh University, Mohali, India


Corresponding author

Correspondence to Naman Kaur.

Editor information

Editors and affiliations.

Global Knowledge Research Foundation, Ahmedabad, Gujarat, India

Nottingham Trent University, Nottingham, UK

Mufti Mahmud

University of Peradeniya, Kandy, Sri Lanka

Roshan G. Ragel


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

Kaur, N., Himanshu (2023). Logistic Regression: A Basic Approach. In: Joshi, A., Mahmud, M., Ragel, R.G. (eds) Information and Communication Technology for Competitive Strategies (ICTCS 2022). ICTCS 2022. Lecture Notes in Networks and Systems, vol 623. Springer, Singapore. https://doi.org/10.1007/978-981-19-9638-2_41


Publisher Name: Springer, Singapore

Print ISBN: 978-981-19-9637-5

Online ISBN: 978-981-19-9638-2

eBook Packages: Engineering, Engineering (R0)


Logistic regression

A simple primer.

Pal, Ankita

Mahamana Pandit Madan Mohan Malaviya Cancer Center, and Homi Bhabha Cancer Hospital, Tata Memorial Center, Varanasi, Uttar Pradesh, India

Address for correspondence: Ankita Pal, Mahamana Pandit Madan Mohan Malaviya Cancer Centre, Varanasi, Uttar Pradesh, India.

Received July 17, 2021

Received in revised form August 31, 2021

Accepted September 18, 2021

This is an open-access article distributed under the terms of the Creative Commons Attribution-Noncommercial-Share Alike 4.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Logistic regression is used to obtain the odds ratio in the presence of more than one explanatory variable. This procedure is quite similar to multiple linear regression, with the only exception that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage of performing logistic regression is to avoid the effects of confounders by analyzing the association of all the variables together. In this article, we explain how to perform a logistic regression using practical examples. After defining the technique, the assumptions that need to be checked are explained, along with the process of checking them using the R software.

INTRODUCTION

Most of our understanding of biological effects and their determinants is gained through statistical analysis. Clinical studies that evaluate the relative contributions of various factors to a binary outcome, such as death or disease, are the most common way of gaining this understanding. In this article, we aim to provide a brief and simplified outline of how to perform a logistic regression, which should be sufficient to permit clinicians who are unfamiliar with regression methodology to understand and interpret the results.[1]

LOGISTIC REGRESSION

The Multivariate Logistic Regression is the statistical technique used when we wish to estimate the probability of a dichotomous outcome, such as the presence or absence of disease or of death. The probability of the outcome is referred to as the dependent variable, and the various factors that influence it are the independent variables, sometimes termed risk factors.

The probability of an outcome is expressed as a proportion or a percentage. For instance, suppose there were 600 patients with cancer, of whom 30 died. The proportion of deaths is 30/600, or 0.05 or 5%. In general, the results of logistic regression are presented in terms of the odds rather than the probability of the outcome. There is a direct relationship between probabilities and odds, that is, the odds of the occurrence are the probability of the outcome occurring divided by the probability of the outcome not occurring. In this example, the odds of death were obtained by dividing 0.05, the proportion of deaths, by 0.95, the proportion of survivors, and determined to be 1:19. The probability of death can be obtained from the odds simply by dividing the odds by 1 plus the odds, or (1/19)/(1 + 1/19) = 0.05.[2]
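The conversion between probability and odds in this example can be checked with a short sketch. The function names are chosen here for illustration; the numbers are the 30 deaths among 600 patients from the paragraph above.

```python
def odds_from_probability(p):
    """Odds = probability of the outcome / probability of no outcome."""
    return p / (1.0 - p)


def probability_from_odds(odds):
    """Probability = odds / (1 + odds)."""
    return odds / (1.0 + odds)


deaths, patients = 30, 600
p_death = deaths / patients
odds_death = odds_from_probability(p_death)
print(p_death, odds_death, probability_from_odds(odds_death))
```

The odds come out as 1/19, and converting them back returns the original 5% probability, so the two scales carry the same information.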

Logistic regression uses the past experience of a group of patients to estimate the odds of an outcome by mathematically modeling that experience and describing it by means of a regression equation. Symbolically, a logistic regression equation with two predictors is given as

ℓ = ln(p / (1 − p)) = β_0 + β_1 x_1 + β_2 x_2

where

  • x_1 and x_2 are the two predictor variables
  • Y is a binary (Bernoulli) response variable, and p = P(Y = 1) is the probability that it equals 1
  • ℓ is the log-odds
  • β_i are the parameters of the model (i = 0, 1, 2).

A key feature in modeling a clinical experience is the selection of the independent variables that influence the result. The method for calculating the regression coefficients takes all the independent variables into consideration together. It chooses the coefficients that maximize the likelihood of the observed data, so that, for any individual with a specific combination of independent variables, the predicted chance of the outcome is as close as possible to the outcome actually observed among individuals with the same combination of independent variables.[2]

The general form of the logistic regression equation is similar to that of multivariate linear regression; however, the logarithm of the odds of the outcome termed the logit or log-odds, is used as the dependent variable. The regression coefficients also are expressed as natural logarithms.

LOGISTIC REGRESSION DIAGNOSTICS

Many assumptions need to be checked before performing a Logistic Regression analysis. The assumptions are listed below along with a guide on how to check them with the help of R.

Dependent variable

The first assumption is that the binary logistic regression requires the dependent variable to be binary, and in case of ordinal logistic regression, the dependent variable needs to be ordinal.

Independent observations

In order to perform a logistic regression, the observations need to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.

Large sample size

In general, logistic regression requires a large sample size. There is a general guideline that one needs a minimum of 10 cases with the least frequent outcome for each independent variable in a model. For example, if there are 5 independent variables and the expected probability of the least frequent outcome is 0.10, then a minimum sample size of 500 (10 × 5/0.10) will be required.[3]
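The guideline above is simple arithmetic and can be wrapped in a small helper. The function name and its default of 10 events per variable are illustrative choices that follow the rule as stated in the text.

```python
import math


def min_sample_size(n_predictors, p_least_frequent, events_per_variable=10):
    """Smallest n with n * p_least_frequent >= events_per_variable * n_predictors."""
    required = events_per_variable * n_predictors / p_least_frequent
    # Subtract a tiny tolerance so floating-point noise does not add a spurious case.
    return math.ceil(required - 1e-9)


print(min_sample_size(5, 0.10))
```

The rarer the least frequent outcome, the larger the sample that is needed for the same number of predictors.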

Linearity assumption

The linear relationship between the continuous predictor variables and the logit of the outcome is checked. This can be done by visually inspecting the scatter plot between each predictor and the logit values.


If the scatter plot shows non-linearity, other modeling approaches are needed, such as including squared or cubic terms, fractional polynomials, or spline functions.

Influential values

Influential values are extreme individual data points that can alter the quality of the logistic regression model. The most extreme values in the data can be examined by visualizing the Cook’s distance values. Here we label the top 3 largest values.


A point to be noted is that not all outliers are influential observations. To check whether the data contain potential influential observations, the standardized residual error can be inspected. Data points with an absolute standardized residual above 3 represent possible outliers and may deserve closer attention.

The standardized residuals (.std.resid) and the Cook’s distance (.cooksd) can be computed using the R function augment() [broom package].

When outliers are present in a continuous predictor, the potential solutions include:

  • Removing the concerned records
  • Transforming the data into a log scale
  • Using non-parametric methods

Multicollinearity

Multicollinearity corresponds to a situation in which the data contain highly correlated predictor variables. Multicollinearity is an important issue in regression analysis and should be fixed by removing the concerned variables. It can be assessed using the R function vif() [car package], which computes the variance inflation factors.

As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
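For intuition about what a variance inflation factor measures, the textbook definition for a model with exactly two predictors is VIF = 1/(1 − r²), where r is the correlation between them. The sketch below implements only that special case on made-up data; it is not a reimplementation of vif() from the car package, which works on the fitted model and may report somewhat different values for a logistic model.

```python
import math


def pearson_r(xs, zs):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, mz = sum(xs) / n, sum(zs) / n
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    return sxz / math.sqrt(sxx * szz)


def vif_two_predictors(xs, zs):
    """VIF of either predictor when the model has exactly two: 1 / (1 - r^2)."""
    r = pearson_r(xs, zs)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictors: z_close nearly duplicates x, z_unrelated does not.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
z_close = [1.1, 1.9, 3.2, 3.9, 5.1]
z_unrelated = [2.0, 1.0, 3.0, 1.0, 2.0]

print(vif_two_predictors(x, z_close))
print(vif_two_predictors(x, z_unrelated))
```

The nearly duplicated predictor gives a VIF far above the thresholds of 5 or 10, while the uncorrelated one gives a VIF of 1, the smallest possible value.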

LOGISTIC REGRESSION IN R

The general mathematical equation for logistic regression that is used in R software is

y = 1 / (1 + e^(−(a + bx)))

where

  • y is the response variable
  • x is the predictor variable
  • a and b are the coefficients, which are numeric constants.

The function used to create the regression model is the glm() function.

The basic syntax for the glm() function from the stats package in logistic regression is

glm(formula, family, data)

The parameters of the above function are described below.

  • formula: an object of class “formula” (or one that can be coerced to that class); a symbolic description of the model to be fitted.
  • family: a description of the error distribution and link function to be used in the model. For glm this can be a character string naming a family function or the result of a call to a family function.
  • data: an optional data frame, list or environment containing the variables in the model.

Real-life example

Consider a real-life dataset, the Cleveland Heart Disease dataset, which contains information about patients who have or do not have heart disease. The dataset contains many medical indicators. It contains 76 attributes that capture the medical history of patients of Hungarian and Swiss origin. The dataset is available online at: https://archive.ics.uci.edu/ml/datasets/heart+Disease.

The aim is to predict if a person has heart disease or not based on attributes such as blood pressure, heart rate, and others. Here, the dependent/response variable is target (whether the patient has heart disease or not) which is a binary variable, as it only takes the values 0 (= No) or 1 (= Yes). All the other variables are independent/predictor variables that will be used for predicting the response variable.

Therefore, a logistic regression model is built in R with a call to glm() that has the following parts.

glm is the generalized linear model function we will be using.

target ~ . means that we want to model target using (~) every available feature (.).

family = binomial() is used because we are predicting a binary outcome.

The output of the fitted model reports a P value for each variable (denoted as Pr(>|z|)), and it shows that many of the variables are not significant. Hence, the variables that are not significant will be removed one by one, starting with the least significant, and the glm function will be applied again each time to check for the best model. Once the best logistic regression model is obtained, it will be used for predicting the response variable.

CONCLUSIONS

No one knows better than a doctor how multiple factors can combine to produce patient outcomes. Logistic regression analysis is a powerful tool for assessing the relative importance of factors that determine outcome. It is increasingly used in clinical medicine to develop diagnostic algorithms and evaluate prognosis. Yet, this tool is both imperfect and subject to misuse. An article by Shahian et al.[4] describes the deficiencies of the method as currently employed in the production of “report cards.” A basic understanding of logistic regression analysis is the first step to appreciating both the usefulness and the limitations of the technique.

Financial support and sponsorship

Conflicts of interest.

There are no conflicts of interest.


Diagnostics; logistic regression; odds ratio; R; regression analysis


  • Open access
  • Published: 28 December 2018

A logistic regression investigation of the relationship between the Learning Assistant model and failure rates in introductory STEM courses

Jessica L. Alzen (ORCID: orcid.org/0000-0002-1706-2975), Laurie S. Langdon & Valerie K. Otero

International Journal of STEM Education volume 5, Article number: 56 (2018)


Large introductory STEM courses historically have high failure rates, and failing such courses often leads students to change majors or even drop out of college. Instructional innovations such as the Learning Assistant model can influence this trend by changing institutional norms. In collaboration with faculty who teach large-enrollment introductory STEM courses, undergraduate learning assistants (LAs) use research-based instructional strategies designed to encourage active student engagement and elicit student thinking. These instructional innovations help students master the types of skills necessary for college success such as critical thinking and defending ideas. In this study, we use logistic regression with pre-existing institutional data to investigate the relationship between exposure to LA support in large introductory STEM courses and general failure rates in these same and other introductory courses at University of Colorado Boulder.

Our results indicate that exposure to LA support in any STEM gateway course is associated with a 63% reduction in odds of failure for males and a 55% reduction in odds of failure for females in subsequent STEM gateway courses.

Conclusions

The LA program appears related to lower course failure rates in introductory STEM courses, but each department involved in this study implements the LA program in different ways. We hypothesize that these differences may influence student experiences in ways that are not apparent in the current analysis, but more work is necessary to support this hypothesis. Despite this potential limitation, we see that the LA program is consistently associated with lower failure rates in introductory STEM courses. These results extend the research base regarding the relationship between the LA program and positive student outcomes.

Science, technology, engineering, and mathematics (STEM) departments at institutes of higher education historically offer introductory courses that can serve up to 1000 students per semester. Introductory courses of this size, often referred to as “gateway courses,” are cost-effective due to the number of students able to receive instruction in each semester, but they often lend themselves to lecture as the primary method of instruction. Thus, there are few opportunities for substantive interaction between the instructor and students or among students (Matz et al., 2017; Talbot, Hartley, Marzetta, & Wee, 2015). Further, these courses typically have high failure rates (Webb, Stade, & Grover, 2014) and lead many students who begin as STEM majors to either switch majors or drop out of college without a degree (Crisp, Nora, & Taggart, 2009). In efforts to address these issues, STEM departments across the nation now implement active engagement strategies in their classes such as peer instruction and interactive student response systems (i.e., clicker questions) during large lecture meetings (Caldwell, 2007; Chan & Bauer, 2015; Mitchell, Ippolito, & Lewis, 2012; Wilson & Varma-Nelson, 2016). In addition to classroom-specific active engagement interventions, there are programs designed to guide larger instructional innovations at the institution level, such as the Learning Assistant (LA) model.

The LA model was established at University of Colorado Boulder in 2001. The program represents an effort to change institutional values and practices through a low-stakes, bottom-up system of course assistance. The program supports faculty to facilitate increased learner-centered instruction in ways that are most valued by the individual faculty member. A key component of the LA model is undergraduate learning assistants (LAs). LAs are undergraduate students who, through guidance, encourage active engagement in classes. LAs facilitate discussions, help students manage course material, offer study tips, and motivate students. LAs also benefit as they develop content mastery, teaching, and leadership skills. LAs get a monthly stipend for working 10 h per week, and they also receive training in teaching and learning theories by enrolling in a math and science education seminar taught by discipline-based education researchers. In addition, LAs meet with faculty members once a week to develop deeper understanding of the content, share insights about how students are learning, and prepare for future class meetings (Otero, 2015 ).

LAs are not peer tutors and typically do not work one-on-one with students. They do not provide direct answers to questions or systematically work out problems with students. Instead, LAs facilitate discussion about conceptual problems among groups of students and they focus on eliciting student thinking and helping students make connections between concepts. This is typically done both in the larger lecture section of the course as well as smaller meetings after the weekly lectures, often referred to as recitation. LAs guide students in learning specific content, but also in developing and defending ideas—important skills for higher-order learning in general. The model for training LAs and the design of the LA program at large are aimed at making a difference in the ways students think and learn in college overall and not just in specific courses. That is, we expect exposure to the program to influence student success in college generally.

Prior research indicates a positive relationship between exposure to LAs and course learning outcomes in STEM courses (Pollock, 2009 ; Talbot et al., 2015 ). Other research suggests that modifying instruction to be more learner-centered helps to address high failure rates (Cracolice & Deming, 2001 ; Close, Mailloux-Huberdeau, Close, & Donnelly, 2018 ; Webb et al., 2014 ). This study seeks to further understand the relationship between the LA program and probability of student success. Specifically, we answer the following research question: How do failure rates in STEM gateway courses compare for students who do and do not receive LA support in any STEM gateway course? We investigate this question because, as a model for institutional change, we expect that LAs help students develop skills and dispositions necessary for success in college such as higher-order thinking skills, navigating course content, articulating and defending ideas, and feelings of self-efficacy. Since skills such as these extend beyond a single course, we investigate the extent to which students exposed to the LA program have lower failure rates in STEM gateway courses generally than students who are not exposed to the program.

Literature review

The LA model is not itself a research-based instructional strategy. Instead, it is a model of social and structural organization that induces and supports the adoption of existing (or creation of new) research-based instructional strategies that require increased teacher-student ratio. The LA program is at its core, a faculty development program. However, it does not push specific reforms or try to change faculty directly. Instead, the opt-in program offers resources and structures that lead to changes in values and practices among faculty, departments, students, and the institution (Close et al., 2018 ; Sewell, 1992 ). Faculty members write proposals to receive LAs (these proposals must involve course innovation using active engagement and student collaboration), students apply to be LAs, and departments match funding for their faculty’s requests for LAs. Thus, the LA program has become a valued part of the campus community.

The body of research that documents the relationship between student outcomes and the LA program is growing. Pollock ( 2006 ) provided evidence regarding the relationship between instructional innovation including LAs and course outcomes in introductory physics courses at University of Colorado Boulder by comparing three different introductory physics course models (outlined in Table  1 ).

Pollock provides two sources of evidence related to student outcomes regarding the relative effectiveness of these three course models. First, he discussed average normalized learning gains on the force and motion concept evaluation (FMCE; Thornton & Sokoloff, 1998 ) generally. The FMCE is a concept inventory commonly used in undergraduate physics education to provide information about student learning on the topics of force and motion. Normalized learning gains are calculated by finding the difference in average post-test and pre-test in a class and dividing that value by the difference between 100 and the average pre-test score. It is conceptualized as the amount the students learned divided by the amount they could have learned (Hake, 1998 ).
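The normalized learning gain described above can be sketched in a few lines of Python. This is only an illustration of the formula as stated in the text; the function name and the example class averages are hypothetical and are not data from Pollock's study.

```python
def normalized_gain(pre_mean: float, post_mean: float) -> float:
    """Average normalized learning gain from class-average percent scores.

    Computes (post - pre) / (100 - pre): the amount learned divided by
    the amount that could have been learned.
    """
    if not 0 <= pre_mean < 100:
        raise ValueError("pre_mean must be in the range [0, 100)")
    return (post_mean - pre_mean) / (100.0 - pre_mean)


# Hypothetical class averages (percent correct), chosen only for illustration.
example_gain = normalized_gain(pre_mean=40.0, post_mean=70.0)
print(f"{example_gain:.2f}")  # 30 points gained out of a possible 60
```

A class that starts at 40% and ends at 70% has gained half of what it could have gained, so the normalized gain is 0.5 (50%).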

Prior research suggests that traditional instructional strategies yield an average normalized learning gain of about 15% and research-based instructional methods such as active engagement and collaborative learning yield average normalized learning gains of about 63% (Thornton, Kuhl, Cummings, & Marx, 2009). The approach using the University of Washington Tutorials with LAs saw a normalized learning gain of 66% on the FMCE from pre-test to post-test. Average normalized learning gains for the approach using Knight’s (2004) workbooks with TAs were about 59%, and average normalized learning gains for the traditional approach were about 45%. The average normalized learning gains for all three methods in Pollock’s study are much higher than what the literature would expect from traditional instruction, but the course model including LAs is aligned with what is expected from research-based instructional strategies. Second, Pollock further investigated the impact of the different course implementations on higher and lower achieving students on FMCE scores. To do this, he considered students with high pre-test scores (those with pre-test scores > 50%) and students with low pre-test scores (those with pre-test scores < 15%). For both groups of students, the course implementation that included recitation facilitated by trained TAs and LAs had the highest normalized learning gains as measured by the FMCE.

In a similar study at Florida International University, Goertzen et al. ( 2011 ) investigated the influence of instructional innovations through the LA program in introductory physics. As opposed to the University of Washington Tutorials in the Pollock ( 2006 ) study, the research-based curriculum materials used by Florida International University were Open Source Tutorials (Elby, Scherr, Goertzen, & Conlin, 2008 ) developed at University of Maryland, College Park. Goertzen et al. ( 2011 ) used the Force Concept Inventory (FCI; Hestenes, Wells, & Swackhamer, 1992 ) as the outcome of interest in their study. Despite the different curriculum from the Pollock ( 2006 ) context, Goertzen et al. found that those students exposed to the LA-supported courses had a 0.24 increase in mean raw gain in scores from pre-test to post-test while students in classes that did not include instructional innovations only saw raw gains of 0.16.

In an attempt to understand the broader relationship between the LA program and student outcomes, White et al. (2016) investigated the impacts of the LA model on student learning in physics across institutions. In their study, White et al. used paired pre-/post-tests from four concept inventories (FCI, FMCE, Brief Electricity and Magnetism Assessment [BEMA; Ding, Chabay, Sherwood, & Beichner, 2006], and Conceptual Survey of Electricity and Magnetism [CSEM]) at 17 different institutions. Researchers used data contributed to the Learning Assistant Alliance through their online assessment tool, Learning About STEM Student Outcomes (LASSO). This platform allows institutions to administer several common concept inventories, with data securely stored on a central database to make investigation across institutions possible (Learning Assistant Alliance, 2018). In order to identify differences in learning gains for students who did and did not receive LA support, White et al. tested differences in course mean effect sizes between the two groups using a two-sample t test. Across all of the concept inventories, White et al. found average Cohen’s d effect sizes 1.4 times higher for LA-supported courses compared to courses that did not receive LA support.

The research about the LA model shows that students exposed to the model tend to have better outcomes than those in more traditional lecture-based learning environments. However, due to the design of the program and the goals of the LA model, there is a reason to expect that there are implications for more long-term outcomes. LAs are trained to help students develop skills such as developing and defending ideas, making connections between concepts, and solving conceptual problems. Prior research suggests that skills such as these develop higher-order thinking for students. Martin et al. ( 2007 ) compared learning outcomes and innovative problem-solving for biomedical engineering students in inquiry-based, active engagement and traditional lecture biotransport courses. They found that both groups reached similar learning gains but that the active engagement group showed greater improvement in innovative thinking abilities. In a similar study, Jensen and Lawson ( 2011 ) investigated achievement and reasoning gains for students in either inquiry-based, active engagement or lecture-based, didactic instruction in undergraduate biology. Results indicated that students in active engagement environments outperformed students in didactic environments on more cognitively demanding items, while the groups performed equally well on items requiring low levels of cognition. In addition, students in active engagement groups showed greater ability to transfer reasoning among contexts.

This research suggests that active engagement such as what is facilitated with the LA model may do more than help students gain knowledge in a particular discipline in a particular course. Over and above, active engagement helps learners grow in reasoning and transfer abilities generally. This increase in higher-order thinking may help students to develop skills that extend beyond the immediate course. However, there is only one study focused on the LA model that investigates long-term outcomes related to the program. Pollock ( 2009 ) investigated the potential long-term relationship between exposure to the LA program and conceptual understanding in physics. In this line of inquiry, Pollock compared BEMA assessment scores for those upper-division physics majors who did and did not receive LA support in their introductory Physics II course, the course in which electricity and magnetism is first covered. Pollock’s results indicate that those students who received LA support in Physics II had higher BEMA scores following upper-division physics courses than those students who did not receive LA support in Physics II. This research provides some evidence to the long-term relationship between exposure to the LA program and conceptual learning. In the current study, we continue this line of inquiry by investigating the relationship between receiving LA support in a gateway course and the potential relationship to course failure in subsequent gateway courses. This study also contributes to the literature on the LA program as no prior research attempts to examine the relationship between taking LA-supported courses and student outcomes while controlling for variables that may confound this relationship. This study thus represents an extension of the previous work regarding the LA model in terms of both the methodology and the outcome of interest.

Data for this study come from administrative records at University of Colorado Boulder. We focus on 16 cohorts of students who entered the university as full-time freshmen for the first time each fall semester from 2001 to 2016 and took Physics I/II, General Chemistry I/II, Calculus I/II (Math department), and/or Calculus I/II for Engineers (Applied Math department). The dataset includes information for 32,071 unique students, 23,074 of whom took at least one of the above courses with LA support. Student-level data includes information such as race/ethnicity, gender, first-generation status, and whether a student ever received financial aid. Additional variables include number of credits upon enrollment, high school grade point average (GPA), and admissions test scores. We translate SAT total scores to ACT Composite Scores using a concordance table provided by the College Board to have a common admissions test score for all students (College Board, 2016 ). We exclude students with no admissions test scores (about 6% of the sample). We also have data on the instructor of record for each course. The outcome of interest in this study is failing an introductory STEM course. We define failing as receiving either a D or an F or withdrawing from the course altogether after the university drop date (i.e., “DFW”).
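The outcome definition above (a D, an F, or a late withdrawal counts as failure) could be coded as a binary indicator along the following lines. This is a minimal Python sketch, not the authors' code: the grade strings, the use of "W" for a withdrawal after the drop date, and the treatment of plus/minus grades are all assumptions made for illustration.

```python
# Assumed grade codes: "W" stands for a withdrawal after the drop date.
FAILING_LETTERS = {"D", "F", "W"}


def is_dfw(grade: str) -> int:
    """Return 1 if the grade counts as a DFW outcome, else 0.

    Only the leading letter is inspected, so a hypothetical "D+" or "D-"
    is treated as a D. That choice is an assumption, not from the source.
    """
    letter = grade.strip().upper()[:1]
    return 1 if letter in FAILING_LETTERS else 0


# Hypothetical transcript entries for one student.
grades = ["B+", "D", "W", "A-", "F"]
outcomes = [is_dfw(g) for g in grades]
print(outcomes)
```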

An important consideration in creating the data set for this study is the timing of receiving LA support relative to taking any STEM gateway course. The data begin with all students who took at least one of the courses included in this study. We keep all students who took all of their STEM gateway courses with LA support or all of them without it. We also include all students who received LA support in the very first STEM gateway course they took, regardless of whether they had LA support in subsequent STEM gateway courses. We exclude any student who took a STEM gateway course without LA support and then took another STEM gateway course in a subsequent semester with LA support.

This data limitation ensures that exposure to the LA program happened before or at the same time as the opportunity to fail any STEM gateway course. If it were the case that a student failed a STEM gateway course without LA support, say, in their first year and then took LA-supported courses in the second year, this student would be indicated as an LA student in the data, but the courses taken during the first year would not have been affected by the LA program. Students with experiences such as this would misrepresent the relationship between being exposed to the LA program and probability of course failure. Conveniently, there were not any students with this experience in the current dataset. In other words, for every student in our study who took more than one of the courses of interest, their first experience with any of the STEM gateway courses under consideration included LA support if there was ever exposure to the LA program. Although we did not have to exclude any students from our study for timing reasons, other institutions carrying out similar studies should carefully consider such cases when finalizing their data for analysis.
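Institutions applying the same timing rule to their own records might express it roughly as follows. This Python sketch assumes each student's gateway enrollments are available as (term index, had LA support) pairs; that data layout and the function name are hypothetical.

```python
from typing import Iterable, Tuple


def keep_student(enrollments: Iterable[Tuple[int, bool]]) -> bool:
    """Decide whether a student satisfies the timing rule.

    enrollments: (term_index, had_la_support) pairs, one per gateway course.
    A student is dropped only if some course WITHOUT LA support was taken
    in a term strictly before the student's first LA-supported term.
    """
    records = list(enrollments)
    la_terms = [term for term, had_la in records if had_la]
    if not la_terms:
        return True  # never exposed: belongs to the comparison group
    first_la_term = min(la_terms)
    # Every non-LA course must fall in or after the first LA-supported term.
    return all(had_la or term >= first_la_term for term, had_la in records)


# Hypothetical students: terms are numbered in chronological order.
print(keep_student([(1, True), (2, False)]))   # exposed first: kept
print(keep_student([(1, False), (2, True)]))   # non-LA course first: dropped
```

Treating same-term courses as acceptable follows the text's requirement that exposure happen "before or at the same time as" the opportunity to fail.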

We provide Fig.  1 as a way for readers to gain a better understanding of the adoption of the LA program in each of the departments in this study. This figure also gives information regarding the number of students exposed to LAs or not in each department, course, and term in our study.

Fig. 1 Course enrollment over time by LA exposure

Ideally, we would design a controlled experiment to estimate the causal effect of LA exposure on the probability of failing introductory STEM courses. To do this, we would need two groups of students: first, those who were exposed to LA support in a STEM gateway course, and second, a comparable group, on average, that significantly differed only in that they were not exposed to LA support in any STEM gateway course. However, many institutions do not begin their LA programs with such studies in mind, so the available data do not come from a controlled experiment. Instead, we must rely on historical institutional data that was not gathered for this type of study. Thus, this study not only contributes to the body of literature regarding the relationship between LA exposure and student outcomes, but it also serves as a model for other institutions with LA programs that would like to use historical institutional data for similar investigations.

Selection bias

The ways students are assigned to receive LA support in each of the departments represented in this study are not random, and the ways LAs are used in each department are not identical. These characteristics of pre-existing institutional data manifest themselves as issues related to selection bias within a study. For example, in the chemistry department, LA support was only offered in the “on semester” sections of chemistry from 2008 to 2013. “On semester” indicates General Chemistry I in the fall and General Chemistry II in the spring. Thus, there were few opportunities for those students who took the sequence in the “off semester,” or General Chemistry I in the spring and General Chemistry II in the fall to receive LA support in these courses during the span of time covered in this analysis. The most typical reasons why students take classes in the “off semester” are that they simply prioritize other courses more in the fall semester, so there is insufficient space to take General Chemistry I; they do not feel prepared for General Chemistry I in the fall and take a more introductory chemistry class first; or they fail General Chemistry I the first time in the fall and re-take General Chemistry I in the spring. This method of assignment to receiving LA support may overstate the relationship between receiving LA support and course failure in this department. That is, it might be the case that those students who received LA support were those who were more likely to pass introductory chemistry to begin with. Our analysis includes prior achievement variables (described below) to attempt to address these selection bias issues.

In chemistry, LAs attend the weekly lecture meetings and assist small groups of students during activities such as answering clicker questions. Instructors present questions designed to elicit student levels of conceptual understanding. The questions are presented to the students; they discuss the questions in groups and then respond using individual clickers based on their selection from one of several multiple-choice options. LAs help students think about and answer these questions in the large lecture meetings. In addition, every student enrolled in General Chemistry I and II is also enrolled in a recitation section. Recitations are smaller group meetings of approximately 20 students. In these recitation sections, LAs work with graduate TAs to facilitate small group activities related to the weekly lecture material. The materials for these recitation sections are created by the lead instructor for the course and are designed to help students investigate common areas of confusion related to the weekly material.

In the physics and math departments, the introductory courses went from no LA support in any section in any semester to all sections in all semesters receiving LA support. This historical issue affects selection bias in a different way than the off-semester chemistry sequence. One interpretation of decreased course failure rates could be that LA support caused the difference. However, we could not rule out the possibility that failure rates decreased due to other factors that also changed over time. It could be that the university implemented other student supports in addition to the LA model at the same time or that the types of students who enrolled in STEM courses changed. There is no way to determine conclusively which of these (or other) factors may have caused changes in failure rates. Thus, causal estimates of the effect of LA support on failure rates would be threatened by any historic changes that occurred. We have no way of knowing if we might over or underestimate the relationship between LA exposure and course failure rates due to the ways students were exposed (or not) to the LA program in these departments. In order to address this issue, we control for student cohort. This adjustment, described below, attempts to account for differences that might exist among cohorts of students that might be related to probability of failing a course.

The use of LAs in the math department only occurs during weekly recitation meetings. During this weekly meeting, students work in small groups to complete carefully constructed activities designed to enhance conceptual understanding of the materials covered during the weekly lecture. An anomaly in the math department is that though Calculus I/II are considered gateway courses, the math department at this institution is committed to keeping course enrollment under 40. This means that LA support is tied to smaller class sizes in this department. However, since this condition is constant across the timeframe in our study, it does not influence selection bias.

Similar to the math department, the physics department only uses LAs in the weekly recitation meeting. An additional anomaly in physics is that, not incidentally, the switch to the LA model happened concurrently with the adoption of the University of Washington Tutorials in introductory physics (McDermott & Shaffer, 2002 ). LAs facilitate small group work with the materials in the University of Washington Tutorials during recitation meetings. In other words, it is not possible to separate the effects of the content presentation in the Tutorials from the LAs facilitating the learning of the content in this department. Thus, data from this department might overestimate the relationship between receiving LA support and course failure. However, it should be noted that the University of Washington Tutorials require a low student-teacher ratio, and proper implementation of this curriculum is not possible without the undergraduate LAs helping to make that ratio possible.

Finally, every student in every section of Calculus I and II in the applied math department had the opportunity to be exposed to LA support. This is because LAs are not used in lecture or required recitation meetings, but instead facilitate an additional weekly one-unit course, called workgroup, that is open to all students. Thus, students who sign up for workgroup not only gain exposure to LA support, but they also gain an additional 90 min of time each week formally engaging in calculus material. It is not possible to know if lower failure rates might be due to the additional time on task generally, or exposure to LAs during that time specifically. This might cause us to overestimate the relationship between LA support and course failure. Additionally, those students who are expected to struggle in calculus (based on placement scores on the Assessment and LEarning in Knowledge Spaces [ALEKS] assessment) or are not confident in their own math abilities are more strongly encouraged to sign up for the weekly meeting by their instructors and advisors. Thus, those students who sign up for LA support might be more likely to fail calculus. This might lead us to underestimate the relationship between LA exposure and course failure. Similar to the chemistry department, we use prior achievement variables (described below) to address this issue to the best of our abilities.

We mention one final assumption about the LA model before describing our methods of statistical adjustment. Our data span 32 semesters of 8 courses (see Fig.  1 ). Although it is surely the case that the LA model adapted and changed in some ways over the course of this time, we make the assumption that the program was relatively stable within department throughout the time period represented in this study.

Statistical adjustment

Although we do not have a controlled experiment that warrants causal claims, we desire to estimate a causal effect. The current study includes a control group, but it is not ideal because of the potential selection bias in each department described above. However, this study is warranted because it takes advantage of historical data. Our analytic approach is to control for some sources of selection bias. Specifically, we use R to control for standardized high school GPA, standardized admissions test scores, and standardized credits at entry to try and account for issues related to prior aptitude. This helps to address the selection bias issues in the chemistry and applied math departments. Additionally, we control for student cohort to account for some of the historical bias in the physics and math departments. We also control for instructor and course as well as gender (coded 1 = female; 0 = male), race/ethnicity (coded 1 = nonwhite; 0 = white), first-generation status (coded 1 = first-generation college student; 0 = not first-generation college student), and financial aid status (coded 1 = received financial aid ever; 0 = never received financial aid) to disentangle other factors that might bias our results in any department. Finally, we consider possible interaction effects between exposure to LA support and various student characteristics. Table  2 presents the successive model specifications explored in this study. Model 1 controls only for student characteristics. Model 2 adds course, cohort, and instructor factor variables. Model 3 adds an interaction between exposure to the LA program and gender to the model 2 specification.
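Standardizing a covariate such as high school GPA means rescaling it to mean 0 and standard deviation 1 before it enters the model. The Python sketch below shows one way to do this with the standard library. The text does not say whether the population or the sample standard deviation was used, so the population form here is an assumption, and the GPA values are made up.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Convert raw values to z-scores: (value - mean) / standard deviation.

    Uses the population standard deviation (an assumption; the source
    does not specify which form was used).
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]


# Hypothetical high school GPAs, for illustration only.
gpas = [2.0, 3.0, 4.0]
z_scores = standardize(gpas)
print([round(z, 3) for z in z_scores])
```

With standardized predictors, each coefficient describes the change in the log odds of failure per one standard deviation of that covariate, which makes GPA, test scores, and credits at entry easier to compare.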

The control variables in Table  2 help to account for the selection bias described above as well as other unobserved bias in our samples, but we are limited by the availability of observed covariates. Thus, the results presented here lie somewhere between “true” causal effects and correlations. We know that our results tell us more than simple correlations, but we also know that we are surely missing key control variables that are typically not collected by institutes of higher education such as a measure of student self-efficacy, social and emotional health, or family support. Thus, we anticipate weak model fit, and the results presented here are not direct causal effects. Instead, they provide information about the partial association between course failure and LA support.

We begin our analysis by providing raw counts of failure rates for the students who did and did not receive LA support in STEM gateway courses. Next, we describe the differences between those students who did and did not receive LA support with respect to available covariates. If it is the case that we see large differences in our covariates between the group of students who did and did not receive LA support, we expect that controlling for those factors in the regression analysis will affect our results in meaningful ways. Thus, we close with estimating logistic regression models to disentangle some of the relationship between LA-support and course failure. The variable of most interest in this analysis is the indicator for exposure to the LA program. A student received a “1” for this variable if they were exposed to the LA program either concurrently or prior to taking STEM gateway courses, and a 0 if they took any classes in the study but never had any LA support in those classes.

Table 3 includes raw pass and failure rates across all courses. Students are counted every time they enrolled in one of the courses included in our study. We see that those students who were exposed to the LA program in at least one STEM gateway course had 6% lower failure rates in concurrent or subsequent STEM gateway courses. We also provide the unadjusted odds ratios for ease of comparison with the logistic regression results. The odds ratio represents the odds that course failure will occur given exposure to the LA program, compared to the odds of course failure occurring without LA exposure. An odds ratio equal to 1.0 indicates that the odds of failure are the same for both groups. An odds ratio less than 1.0 indicates that exposure to LA support is associated with a lower chance of failing, while an odds ratio greater than 1.0 indicates that exposure to LA support is associated with a higher chance of failing. Thus, the odds ratio of 0.65 in Table 3 indicates a lower chance of failure with LA exposure compared to no LA exposure.
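An unadjusted odds ratio of this kind can be computed directly from a two-by-two table of counts. The Python sketch below uses invented counts, not the values in Table 3, and the function and argument names are hypothetical.

```python
def odds_ratio(fail_exposed: int, pass_exposed: int,
               fail_unexposed: int, pass_unexposed: int) -> float:
    """Unadjusted odds ratio of failure, exposed versus unexposed.

    The odds of failure in a group are failures divided by passes.
    The odds ratio divides the exposed odds by the unexposed odds.
    """
    odds_exposed = fail_exposed / pass_exposed
    odds_unexposed = fail_unexposed / pass_unexposed
    return odds_exposed / odds_unexposed


# Hypothetical counts: 20 of 100 exposed students fail, 30 of 100 unexposed fail.
example_or = odds_ratio(fail_exposed=20, pass_exposed=80,
                        fail_unexposed=30, pass_unexposed=70)
print(f"{example_or:.2f}")
```

In this made-up example the odds are 0.25 for the exposed group and about 0.43 for the unexposed group, giving an odds ratio below 1.0, which reads as lower odds of failure with exposure.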

Although the raw data indicate that students exposed to LA support have lower course failure rates, these differences could be due, at least in part, to factors outside of LA support. To explore this possibility, we next examine demographic and academic achievement differences between the groups. In Table 4, we present the mean values for all of our predictor variables for students who did and did not receive LA support. The top panel presents all of the binary variables, so averages indicate the percentage of students who identify with the respective characteristics. The bottom panel shows the average for the continuous variables. The p values are for the comparisons of means from a t test across the two groups for each variable. Table 4 indicates that students exposed to the LA program were more likely to be male, nonwhite, non-first-generation students who did not receive financial aid. They also had more credits at entry, higher high school GPAs, and higher admissions test scores. These higher prior achievement variables might lead us to think that students exposed to LA support are more likely to pass STEM gateway courses. If this is true, then the relationship between LA exposure and failure in Table 3 may overestimate the actual relationship between exposure to LAs and probability of course failure. Thus, we next use logistic regression to control for potentially confounding variables and investigate any resulting change in the odds ratio.

R calculates logistic regression estimates in logits, but these estimates are often expressed in odds ratios. We present abbreviated logit estimates in the Appendix and abbreviated odds ratios estimates in Table  5 . Estimates for all factor variables (i.e., course, cohort, and instructor) are suppressed in these tables for ease of presentation. In order to make the transformation from logits to odds ratios, the logit estimates were exponentiated to calculate the odds ratios presented in Table  5 . For example, the logit estimate for exposure to LA in model 1 from the Appendix converts to the odds ratio estimate in Table  5 by finding exp(− 1.41) = 0.24.
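The conversion between logits and odds ratios described above can be sketched in a few lines of Python. The helper names are illustrative assumptions; the only values used are the ones quoted in the text.

```python
import math


def logit_to_odds_ratio(logit):
    """Exponentiate a logit (log-odds) coefficient to obtain an odds ratio."""
    return math.exp(logit)


def odds_ratio_to_logit(odds_ratio):
    """Inverse transformation: the natural log of an odds ratio is the logit."""
    return math.log(odds_ratio)


# Model 1 logit estimate for LA exposure quoted in the text.
print(round(logit_to_odds_ratio(-1.41), 2))  # 0.24, as stated in the text
```

A logit of 0 maps to an odds ratio of 1.0, so negative logits correspond to odds ratios below 1.0 and positive logits to odds ratios above 1.0.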

We start by discussing the results for model 3, as it is the full model for this analysis. Discussion of models 1 and 2 is saved for the discussion of model fit below. The results in model 3 provide information about what we can expect, on average, across all courses and instructors in the sample. We include confidence intervals with the odds ratios. Confidence intervals that include 1.0 suggest results that are not statistically significant (Long, 1997 ). The odds ratio estimate in Table 5 for model 3 is 0.367 for LA exposure, with a confidence interval of 0.337–0.400. Since the odds ratio is less than 1.0, LA exposure is associated with a lower probability of failing, on average, and the relationship is statistically significant because the confidence interval does not include 1.0. Compared to the odds ratio in Table 3 (0.65), these results indicate that covariate adjustment has a large impact on this odds ratio. Failure to adjust for possible sources of confounding leads to an understatement of the “effect” of exposure to the LA program on course failure.

Our results show that LA exposure is associated with lower odds of failing STEM gateway courses. We also see that the interaction between exposure to the LA program and gender is statistically significant. The odds ratio of 0.37 for exposure to LA support in Table 5 is for male students. In order to find the relationship for female students, we must exponentiate the sum of the logit estimates for exposure to the LA program, female, and the interaction between the two variables (i.e., exp[−1.002 − 0.092 + 0.297] = 0.45; see the Appendix ). This means that the LA program actually lowers the odds of failing for male students slightly more than for female students. Recall that Table 3 illustrated that the raw odds ratio for failure when exposed to LA support was 0.65. Our results show that after controlling for possibly confounding variables, the relationship between LA support and odds of course failure is better for both male (0.37) and female (0.45) students.
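The arithmetic behind the two odds ratios can be reproduced with the short sketch below. It assumes the model 3 logit estimates are −1.002 (LA exposure), −0.092 (female), and 0.297 (interaction); these are read off the worked calculation in the text and the reported odds ratio of 0.367, and the variable names are illustrative.

```python
import math

# Assumed model 3 logit estimates, inferred from the calculation in the text.
B_LA = -1.002           # exposure to the LA program
B_FEMALE = -0.092       # female indicator
B_LA_X_FEMALE = 0.297   # LA exposure x female interaction

# Odds ratio reported for male students: only the LA term applies.
or_male = math.exp(B_LA)

# Odds ratio reported for female students: the three logits are summed
# before exponentiating, following the calculation given in the text.
or_female = math.exp(B_LA + B_FEMALE + B_LA_X_FEMALE)

print(round(or_male, 2), round(or_female, 2))  # 0.37 0.45
```

Logits are summed before exponentiating because the model is additive on the log-odds scale; exponentiating each term separately and adding the results would not give the same quantity.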

Discussion and limitations

Throughout this paper, we have been upfront about the limitations of the current analysis. Secondary analysis of institutional data for longstanding programs is complex and difficult. In this penultimate section, we mention a few other limitations to the study as well as identify some ideas for future research that could potentially bolster the results found here or identify where this analysis may have gone astray.

First, and most closely related to the results presented above, is model fit. The McFadden pseudo R-squared (Verbeek, 2008 ) values for the three models are 0.0708, 0.1793, and 0.1797, respectively. These values indicate two things: (1) that the data do not fit any of the models well and (2) that the addition of the interaction term does little to improve model fit. This is also seen in the comparison of AIC and log likelihood values in Table 5. We spend significant time on the front end of this paper describing why these data are not ideal for understanding the relationship between exposure to the LA program and probability of failing, so we do not spend additional time here discussing this lack of goodness-of-fit. Instead, we acknowledge this as a limitation of the current analysis and reiterate the desire to conduct a similar type of analysis to what is presented here with data more likely to fit the model. Such situations would include institutions that have the ability to compare, for example, large samples of students with and without LA exposure within the same semester, course, and instructor. Another way to improve such data would be to include a way to control for student confidence and feelings of self-efficacy. For example, the descriptions of selection bias above indicate that students in Applied Math might systematically differ in terms of self-confidence. Data that could control for such factors would better facilitate understanding of the relationship between exposure to LA support and course failure. Alternatively, it may be more appropriate to consider the nested structure of the data (i.e., students nested within courses nested within departments) in a context with data better suited for such analysis. Hierarchical linear modeling might even be appropriate for a within-department study, with students nested within classes, provided there is sufficient sample size at the instructor level.
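For readers unfamiliar with the McFadden pseudo R-squared, it is commonly defined as one minus the ratio of the fitted model's log likelihood to that of an intercept-only (null) model. The sketch below illustrates that definition; the log-likelihood values are hypothetical and are not the values from Table 5.

```python
def mcfadden_pseudo_r2(ll_model, ll_null):
    """McFadden pseudo R-squared: 1 - (log likelihood of model / log likelihood of null)."""
    return 1.0 - ll_model / ll_null


# Hypothetical log likelihoods for illustration only (NOT from Table 5).
ll_null = -1000.0   # intercept-only model
ll_full = -820.0    # model with predictors

print(f"{mcfadden_pseudo_r2(ll_full, ll_null):.4f}")
```

Because log likelihoods are negative, a model that barely improves on the null model has a ratio close to 1 and therefore a pseudo R-squared close to 0.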

Second, in addition to a measure of student self-efficacy, there are other variables that might be interesting to investigate such as transfer, out-of-state, or international student status; if students live on-campus; and a better measure of socioeconomic status than receiving financial aid. These are other important student characteristics that might uncover differential relationships between the LA program and particular types of students. Such analysis is important because persistence and retention in gateway courses—particularly for students from traditionally marginalized groups—are an important concern for institutions generally and STEM departments specifically. If we are to maintain and even build diversity in these departments, it is crucial we have solid and clear work in these areas.

Third, although this study controls for course- and instructor-level factors, there are surely complications introduced into this study due to the differential way the LA program is implemented in each department. A more careful study within a department is another interesting and valuable approach to understanding the influence of the LA program, but one that these data are not well-suited for. Again, there is a need for data that include students exposed and not exposed to the LA program within the same term, course, and instructor to better disentangle the relationship. Due to the way the LA program was taken up at University of Colorado Boulder, we do not have the appropriate data for such an analysis.

Finally, an interesting consideration is the choice of outcome variable made in this analysis. Course failure rates are particularly important in gateway courses because failing such a course can lead students to switch majors or drop out of college. We do see a relationship between the LA model and lower failure rates in the current analysis. However, other approaches to course outcomes include course grades, pass rates, average GPA in other courses, and average grade anomaly (Freeman et al., 2014 ; Haak et al., 2011 ; Matz et al., 2017 ; Webb, Stade, & Grover, 2014 ). Similar investigations to what is presented here with other course outcomes are also of interest. For example, course grades would provide more nuanced information regarding how the LA model influences student outcomes. A measure such as Matz et al.’s ( 2017 ) average GPA in other courses could provide more information about how the LA program impacts courses other than the ones in which the LA exposure occurred. In either of these situations, it would be interesting to see if the LA program would continue to appear to have a greater impact for male students than for female students. In short, there are a wide variety of student outcomes that have yet to be fully investigated with data from the LA model, and more nuanced information would be a valuable contribution to the research literature.

In this study, we attempt to disentangle the relationship between LA support and course failure in introductory STEM courses. Our results indicate that failure to control for confounding variables underestimates the relationship between exposure to the LA program and course failure. The results here extend the prior literature regarding the LA model by providing evidence to suggest that exposure to the program improves student outcomes in subsequent as well as current courses. Programs such as the LA model that facilitate instructional innovations where students are more likely to be successful increase student retention.

Preliminary qualitative work suggests potential hypotheses for the relationship between LA support and student success. Observations of student-LA interactions indicate that LAs develop safe yet vulnerable environments necessary for learning. Undergraduates are more comfortable revealing their thinking to LAs than to TAs and instructors and are therefore better able to receive input about their ideas. Researchers find that LAs exhibit pedagogical skills introduced in the pedagogy course and course experience that promote deep understanding of relevant content as well as critical thinking and questioning needed in higher education (Top, Schoonraad, & Otero, 2018 ). Also, through their interactions with LAs, faculty seem to be learning how to embrace the diversity of student identities and structure educational experiences accordingly. Finally, institutional norms are changing as more courses adopt new ways of teaching students. For example, the applied math department provides additional time on task because of the LA program. Although we do not know if it is the additional time on task, the presence of LAs, or a combination of both that drives the relationship between LA exposure and lower course failure rates, both the additional time and LA exposure occur because of the LA program generally.

Further work is necessary to more fully understand the relationship between the LA program and student success. Although we controlled for several student-level variables, we surely missed key variables that contribute to these relationships. Despite this limitation, the regression analysis represents an improvement over unadjusted comparisons. We used the available institutional data to control for variables related to the selection bias present in each department’s method of assigning students to receive LA support. More research is needed to identify if the emerging themes in the present study are apparent at other institutions. Additional research with data better suited to isolate potential causal effects is also needed to bolster the results presented here. Despite the limitations discussed here, the current findings are encouraging for further development and implementation of the LA program in STEM gateway courses. Identifying relationships between models for change and lower course failure rates is helpful for informing future decisions regarding those models.

For more information about joining LASSO and resources available to support LA programs, visit https://www.learningassistantalliance.org/

Abbreviations

BEMA: Brief Electricity and Magnetism Assessment

CSEM: Conceptual Survey of Electricity and Magnetism

FCI: Force Concept Inventory

FMCE: Force and Motion Concept Evaluation

LA model: Learning Assistant model

LAs: Learning assistants

PLTL: Peer-led team learning

STEM: Science, technology, engineering, and mathematics

References

Caldwell, J. E. (2007). Clickers in the large classroom: current research and best-practice tips. CBE-Life Sci Educ, 6 (1), 9–20.

Chan, J. Y., & Bauer, C. F. (2015). Effect of peer-led team learning (PLTL) on student achievement, attitude, and self-concept in college general chemistry in randomized and quasi experimental designs. J Res Sci Teach, 52 (3), 319–346.

Close, E. W., Mailloux-Huberdeau, J. M., Close, H. G., & Donnelly, D. (2018). Characterization of time scale for detecting impacts of reforms in an undergraduate physics program. In L. Ding, A. Traxler, & Y. Cao (Eds.), AIP Conference Proceedings: 2017 Physics Education Research Conference .

College Board. (2016). Concordance tables. Retrieved from https://collegereadiness.collegeboard.org/pdf/higher-ed-brief-sat-concordance.pdf

Cracolice, M. S., & Deming, J. C. (2001). Peer-led team learning. Sci Teach, 68 (1), 20.

Crisp, G., Nora, A., & Taggart, A. (2009). Student characteristics, pre-college, college, and environmental factors as predictors of majoring in and earning a STEM degree: an analysis of students attending a Hispanic serving institution. Am Educ Res J, 46 (4), 924–942 Retrieved from http://www.jstor.org/stable/40284742 .

Ding, L., Chabay, R., Sherwood, B., & Beichner, R. (2006). Evaluating an electricity and magnetism assessment tool: brief electricity and magnetism assessment. Physical Rev Special Topics Physics Educ Res, 2 (1), 010105.

Elby, A., Scherr, R. E., Goertzen, R. M., & Conlin, L. (2008). Open-source tutorials in physics sense making. Retrieved from http://umdperg.pbworks.com/w/page/10511238/Tutorials%20from%20the%20UMd%20PERG

Freeman, S., Eddy, S. L., McDonough, M., Smith, M. K., Okoroafor, N., Jordt, H., & Wenderoth, M. P. (2014). Active learning increases student performance in science, engineering, and mathematics. Proc Nat Acad Sci, 111 (23), 8410–8415.

Goertzen, R. M., Brewe, E., Kramer, L. H., Wells, L., & Jones, D. (2011). Moving toward change: institutionalizing reform through implementation of the Learning Assistant model and Open Source Tutorials. Physical Rev Special Topics Physics Education Research, 7 (2), 020105.

Haak, D. C., HilleRisLambers, J., Pitre, E., & Freeman, S. (2011). Increased structure and active learning reduce the achievement gap in introductory biology. Science, 332 (6034), 1213–1216.

Hake, R. R. (1998). Interactive-engagement versus traditional methods: a six-thousand-student survey of mechanics test data for introductory physics courses. Am J Physics, 66 (1), 64–74.

Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. Physics Teach, 30 (3), 141–158.

Jensen, J. L., & Lawson, A. (2011). Effects of collaborative group composition and inquiry instruction on reasoning gains and achievement in undergraduate biology. CBE-Life Sci Educ, 10 (1), 64–73.

Knight, R. (2004). Physics for scientists and engineers: A strategic approach. Upper Saddle River, NJ: Pearson/Addison Wesley.

Learning Assistant Alliance. (2018). About LASSO. Retrieved from https://www.learningassistantalliance.org/modules/public/lasso.php

Long, J. S. (1997). Advanced quantitative techniques in the social sciences series, Vol. 7. Regression models for categorical and limited dependent variables. Thousand Oaks, CA, US.

Martin, T., Rivale, S. D., & Diller, K. R. (2007). Comparison of student learning in challenge-based and traditional instruction in biomedical engineering. Annals of Biomedical Engineering, 35 (8), 1312–1323.

Matz, R. L., Koester, B. P., Fiorini, S., Grom, G., Shepard, L., Stangor, C. G., et al. (2017). Patterns of gendered performance differences in large introductory courses at five research universities. AERA Open, 3 (4), 2332858417743754.

McDermott, L. C., & Shaffer, P. S. (2002). Tutorials in introductory physics. Upper Saddle River, NJ: Prentice Hall.

Mitchell, Y. D., Ippolito, J., & Lewis, S. E. (2012). Evaluating peer-led team learning across the two semester general chemistry sequence. Chemistry Education Research and Practice, 13 (3), 378–383.

Otero, V. K. (2015). Effective practices in preservice teacher education. In C. Sandifer & E. Brewe (Eds.), Recruiting and educating future physics teachers: case studies and effective practices (pp. 107–127). College Park: American Physical Society.

Pollock, S. J. (2006). Transferring transformations: Learning gains, student attitudes, and the impacts of multiple instructors in large lecture courses. In P. Heron, L. McCullough, & J. Marx (Eds.), Proceedings of 2005 Physics Education Research Conference (pp. 141–144). Salt Lake City, Utah.

Pollock, S. J. (2009). Longitudinal study of student conceptual understanding in electricity and magnetism. Physical Review Special Topics-Physics Education Research, 5 (2), 1–8.

Talbot, R. M., Hartley, L. M., Marzetta, K., & Wee, B. S. (2015). Transforming undergraduate science education with learning assistants: student satisfaction in large-enrollment courses. J College Sci Teach, 44 (5), 24–30.

Thornton, R. K., & Sokoloff, D. R. (1998). Assessing student learning of Newton’s laws: the force and motion conceptual evaluation and the evaluation of active learning laboratory and lecture curricula. Am J Physics, 66 (4), 338–352.

Thornton, R. K., Kuhl, D., Cummings, K., & Marx, J. (2009). Comparing the force and motion conceptual evaluation and the force concept inventory. Physical review special topics-Physics education research, 5(1), 010105.

Top, L., Schoonraad, S., & Otero, V. (2018). Development of pedagogical knowledge among learning assistants. Int J STEM Educ, 5 (1). https://doi.org/10.1186/s40594-017-0097-9 .

Verbeek, M. (2008). A guide to modern econometrics . West Sussex: Wiley.

Webb, D. C., Stade, E., & Grover, R. (2014). Rousing students’ minds in postsecondary mathematics: the undergraduate learning assistant model. J Math Educ Teach College, 5 (2).

White, J. S. S., Van Dusen, B., & Roualdes, E. A. (2016). The impacts of learning assistants on student learning of physics. arXiv preprint arXiv:1607.07469. Retrieved from https://arxiv.org/ftp/arxiv/papers/1607/1607.07469.pdf .

Wilson, S. B., & Varma-Nelson, P. (2016). Small groups, significant impact: a review of peer-led team learning research with implications for STEM education researchers and faculty. J Chem Educ, 93 (10), 1686–1702.

Sewell, W. H. (1992). A theory of structure: duality, agency, and transformation. Am J Sociol, 98 (1), 1–29.

Acknowledgements

There is no funding for this study.

Availability of data and materials

The datasets generated and/or analyzed during the current study are available in the LAs and Subsequent Course Failure repository, https://github.com/jalzen/LAs-and-Subsequent-Course-Failure .

Author information

Authors and affiliations.

University of Colorado Boulder, 249 UCB, Boulder, CO, 80309, USA

Jessica L. Alzen, Laurie S. Langdon & Valerie K. Otero

Contributions

JLA managed the data collection and analysis. All authors participated in writing, revising, and approving the final manuscript.

Corresponding author

Correspondence to Jessica L. Alzen.

Ethics declarations

Ethics approval and consent to participate.

The IRB at University of Colorado Boulder (FWA 00003492) determined that this study did not involve human subjects research. The approval letter specifically stated the following:

The IRB determined that the proposed activity is not research involving human subjects as defined by DHHS and/or FDA regulations. IRB review and approval by this organization is not required. This determination applies only to the activities described in the IRB submission and does not apply should any changes be made. If changes are made and there are questions about whether these activities are research involving human subjects in which the organization is engaged, please submit a new request to the IRB for a determination.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article.

Alzen, J.L., Langdon, L.S. & Otero, V.K. A logistic regression investigation of the relationship between the Learning Assistant model and failure rates in introductory STEM courses. IJ STEM Ed 5 , 56 (2018). https://doi.org/10.1186/s40594-018-0152-1

Received : 29 August 2018

Accepted : 10 December 2018

Published : 28 December 2018

DOI : https://doi.org/10.1186/s40594-018-0152-1

  • Learning assistant
  • Underrepresented students

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

Source: scikit-learn
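To make the role of the logistic function concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation; the function names, the single-predictor setup, and the coefficient values are assumptions chosen for illustration.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, intercept, slope):
    """Probability of the positive class for one predictor value x.

    The linear predictor (intercept + slope * x) is on the log-odds scale;
    the logistic function converts it to a probability.
    """
    return logistic(intercept + slope * x)


# Hypothetical coefficients for illustration only.
p = predict_probability(x=2.0, intercept=-1.0, slope=0.8)
predicted_class = 1 if p >= 0.5 else 0  # threshold the probability to classify
print(round(p, 3), predicted_class)
```

The model is linear in the log-odds, which is why it is described as a linear model even though the predicted probabilities follow an S-shaped curve.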

Understanding logistic regression analysis

  • February 2014
  • Biochemia Medica 24(1):12-8
  • CC BY-NC-ND 3.0

Sandro Sperandei at Western Sydney University

PMCID: PMC11361657

Lifestyle risk behavior and atherosclerotic cardiovascular risk: An analysis using the Korea National Health and Nutrition Examination Survey

1 Department of Neurology, Hallym University Sacred Heart Hospital, Anyang, Korea

Hyo-Jeong Ahn

2 Health Insurance Review and Assessment Research Institute, Health Insurance Review & Assessment Service, Wonju, Korea

Su Jung Lee

3 Research Institute on Nursing Science, School of Nursing, Hallym University, Chuncheon, Korea

Pum-Jun Kim

4 Department of Artificial Intelligence, Ulsan National Institute of Science and Technology, Ulsan, Korea

5 Department of Neurology, Chuncheon Sacred Heart Hospital, Chuncheon, Korea

6 Institute of New Frontier Research Team, Hallym University College of Medicine, Chuncheon, Korea

Sang-Hwa Lee

Jong-Hee Sohn, Jae-Jun Lee

7 Department of Anaesthesiology and Pain Medicine, Chuncheon Sacred Heart Hospital, Chuncheon, Korea

Associated Data

We used the Korean National Health and Nutrition Examination Survey (KNHANES) data, which can be publicly accessed from the website ( https://knhanes.kdca.go.kr/knhanes/sub03/sub03_02_05.do ). Our data are available through our IRB upon reasonable request ( chuncheonirb@hallym.or.kr ).

Clustering lifestyle risk behaviors is important for predicting cardiovascular disease risk. However, it is unclear which behavior mediates other ones to influence cardiovascular disease risk. We aimed to assess the causal inference of each lifestyle risk behavior for the atherosclerotic cardiovascular disease (ASCVD) risk of the general population.

We performed a Bayesian network mediation analysis using data from the Korea National Health and Nutrition Examination Survey from 2014 to 2019. The main exposure was a combination of lifestyle risk behaviors including unhealthy weight, heavy alcohol consumption, inadequate sleep, physical inactivity, excessive sodium intake, and current smoking among subjects 40 to 79 years of age. The high risk of ASCVD (≥7.5% for the 10-year risk) was assessed using logistic regression, Bayesian networks, and structural equation models to examine the causal relationships between these six lifestyle risk behaviors.

Among all participants, the most prevalent lifestyle risk behavior for those at high risk for ASCVD was excessive sodium intake (95.6%), followed by inadequate sleep (49.9%) and physical inactivity (43.8%). Older age (65–79 years) and male sex were directly associated with a high risk for ASCVD. Physical inactivity, current smoking, excessive sodium intake, and unhealthy weight indirectly mediated the effects of older age (8.2% of the older age) and male sex (39.9% of males) to high ASCVD risk. Physical inactivity, current smoking, excessive sodium intake, and unhealthy weight particularly mediated the high ASCVD risk sequentially. Heavy alcohol consumption and inadequate sleep were not directly associated with high ASCVD risk and did not indirectly mediate the effects of older age and males on the high ASCVD risk.

Lifestyle risk behaviors mediated the atherosclerotic cardiovascular disease risk in different ways. In particular, physical inactivity preceded current smoking, excessive sodium intake, and unhealthy weight in relation to high ASCVD risk, and this causal relationship differed according to age and sex. Therefore, strategies tailored to specific target populations may be needed to effectively reduce the high ASCVD risk.

Introduction

Cardiovascular disease (CVD) is one of the leading causes of death worldwide [ 1 , 2 ]. In 2019, approximately 18.6 million people died from CVD [ 3 ], thus leading to a greater burden on global health than any other chronic disease [ 4 ]. Although the age-standardized mortality rates of CVD are consistently decreasing, the crude CVD mortality rates are increasing continuously because of global aging, advancements in the primary prevention of CVD, and changes in lifestyle and living environment [ 5 ].

To quantify and manage the CVD risk, atherosclerotic cardiovascular disease (ASCVD) risk estimation has been widely used during the past decades [ 6 ]. The primary method of minimizing the future risk of ASCVD is improving the lifestyle by modulating risk factors, especially smoking, unhealthy dietary patterns, and physical inactivity [ 7 ]. More than 50% of individuals at high risk for ASCVD have multiple lifestyle risk behaviors (LRB) [ 8 , 9 ]. Recently, proper control of LRBs has been reported to effectively prevent recurrent cardiovascular events and reduce cardiovascular mortality [ 10 ]. Many prospective cohort studies have reported that the sum of LRBs affects CVD outcomes [ 7 , 11 ]; however, knowledge of how each LRB affects the ASCVD risk is limited [ 12 ].

The evaluation of combinations of LRBs associated with CVD risk is important for preventive medicine because LRBs, unlike non-modifiable risk factors such as age and sex, can be changed. Important modifiable risk factors can be sufficiently corrected to reduce CVD risk [ 13 , 14 ]. Additionally, assessing associations between each LRB and risk is important for targeting which LRB should be corrected to effectively reduce the risk of specific diseases [ 15 ]. In other words, there may be a causal relationship between lifestyle factors and increased CVD risk. For example, deciding whether to correct smoking or control body weight is important for the intensive implementation of primary CVD prevention. Therefore, this study aimed to analyze the mediating effects of LRBs on the ASCVD risk and determine how LRBs influence each other in the general population.

Study population

During this study, we used data from the Korean National Health and Nutrition Examination Survey (KNHANES) conducted annually by the Division of Chronic Disease Surveillance under the guidance of the Korea Centers for Disease Control. The KNHANES is a nationwide surveillance program designed to evaluate the health and nutrition status of a representative sample of the Korean population through health interviews, nutrition surveys, and health examinations conducted by professionally trained staff. Additionally, the survey applied a stratified, multistage probability sampling method to collect a representative sample of the study population.

This retrospective cross-sectional study assessed the relationships among LRBs using network mediation analysis. We gathered and analyzed data from 2014 to 2019 in the KNHANES database. We included 25,639 participants 40 to 79 years of age so that the 10-year ASCVD risk score could be calculated. Of these, 7309 observations with incomplete survey answers or incomplete physical examination results were excluded ( Fig 1 ).

Fig 1 (pone.0307677.g001.jpg).

ASCVD risk estimation

The 10-year risk of ASCVD was estimated using clinical factors, demographic factors, and laboratory results, as defined by the American College of Cardiology/American Heart Association [ 6 ]. The ASCVD risk equation for the white population was applied during this study because the estimation based on the white race has been widely used for other races in previous studies [ 16 , 17 ]. The primary outcome was an estimated 10-year ASCVD risk score of 7.5% or more [ 18 ].

Assessment of lifestyle risk behaviors

Because our analyses required different types of predictors, variables were transformed into categorical or continuous form as appropriate. Body mass index (BMI) was calculated as weight in kilograms divided by the square of height in meters (kg/m²) according to Asian Pacific World Health Organization criteria [ 19 ]. Unhealthy weight was defined as a BMI less than 18.5 kg/m² or more than 25 kg/m² [ 20 ]. Heavy alcohol consumption was defined as drinking 14 or more alcoholic beverages per week for men and 10 or more alcoholic beverages per week for women [ 21 ]. Current smoking was defined as having smoked more than five packs of cigarettes during one's lifetime and currently smoking at the time of the survey [ 21 , 22 ]. Physical activity was calculated as the sum of the minutes of exercise per week, taking exercise intensity (vigorous and moderate) into account. Vigorous-intensity and moderate-intensity exercise were combined by calculating the minutes exercised per week as follows: 2 × moderate activity (min/week) + vigorous activity (min/week). Physical inactivity was defined as performing less than 150 minutes of moderate-intensity physical activity per week [ 23 ]. We measured sleep duration by checking the number of hours of sleep per week; we defined inadequate sleep as less than 7 hours or 9 hours or more [ 24 ]. Sodium intake was estimated with Tanaka's equation from the measured spot urine sodium (mmol/L) and creatinine (mg/dL), extrapolated to 24 hours. Excessive sodium intake was defined as more than 87 mmol of urine sodium excretion over the course of 24 hours [ 25 ]. Age was dichotomized into ages 40 to 64 years and 65 to 79 years. Additional definitions of conventional cardiovascular risk factors and demographic variables are presented in the Supplementary material online ( S1 Text ).
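The dichotomization rules above can be sketched in a few lines of Python. This is only an illustration of the stated cut-offs, not the code used in the study: the `Participant` record, its field names, and the function names are hypothetical, sleep is passed in as a single number of hours to be compared with the 7-hour and 9-hour limits, and the activity weighting is copied from the text as written.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    # Hypothetical record layout; these are not KNHANES variable names.
    sex: str                   # "M" or "F"
    weight_kg: float
    height_m: float
    drinks_per_week: float
    sleep_hours: float         # value compared with the 7 h and 9 h limits
    moderate_min_week: float
    vigorous_min_week: float


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def lifestyle_risk_flags(p: Participant) -> dict:
    """Apply the cut-offs stated in the text to one participant."""
    b = bmi(p.weight_kg, p.height_m)
    heavy_cutoff = 14 if p.sex == "M" else 10
    # Weighting exactly as written in the text: 2 x moderate + vigorous.
    activity = 2 * p.moderate_min_week + p.vigorous_min_week
    return {
        "unhealthy_weight": b < 18.5 or b > 25,
        "heavy_alcohol": p.drinks_per_week >= heavy_cutoff,
        "inadequate_sleep": p.sleep_hours < 7 or p.sleep_hours >= 9,
        "physical_inactivity": activity < 150,
    }


if __name__ == "__main__":
    example = Participant("F", 70.0, 1.60, 10, 6.5, 30, 20)
    print(lifestyle_risk_flags(example))
```

Current smoking and excessive sodium intake are left out of the sketch because they depend on questionnaire history and on Tanaka's equation, neither of which is specified in enough detail here to reproduce.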

Statistical analysis

We compared the baseline characteristics of the dichotomized ASCVD risk score groups (≥7.5% vs. <7.5%) using the χ² test and Student's t-test. We used the following two-step statistical approach to assess how each LRB or any correlation of LRBs affects the ASCVD risk. First, a binary logistic regression analysis was performed to classify the subjects with high ASCVD risk scores (≥7.5%) and to check whether a significant covariance shift existed when using categorized predictors in the model. As stated in the introduction, we set the outcome to be the ASCVD risk, not the development of ASCVD itself. Therefore, we did not include vascular risk factors in the logistic regression analysis: had they been entered into the model to determine the association between the ASCVD risk and lifestyle risk behaviors, the results would only reflect the importance of variables that directly enter the ASCVD risk, such as age, gender, hypertension, diabetes, and dyslipidemia.

Second, we investigated the causal effects of LRBs on the ASCVD risk using Bayesian network mediation analysis and a structural equation model (SEM). Bayesian network mediation is a statistical method for investigating causal relationships between variables in a dataset. Specifically, it allows researchers to determine whether the relationship between two variables is mediated by one or more additional variables. A Bayesian network is a probabilistic graphical model that represents the relationships between variables in a dataset. In a Bayesian network, nodes represent variables and edges represent conditional dependencies between variables in a directed acyclic graph (DAG). The Markov blanket of a variable is the set containing its direct causes (parents), its direct effects (children), and the other direct causes of those effects [ 26 ]. We used a constraint-based algorithm called the grow-shrink Markov blanket to learn the Bayesian network model structure [ 27 ]. The bootstrap approach was repeated 200 times, and an average DAG representing the final Bayesian network was drawn. We used sex and age as prior nodes (non-modifiable variables) of conditional dependency in the Bayesian network model and combinations of LRBs (modifiable variables) as mediators of probabilistic inference of the ASCVD risk. The detailed procedure for the Bayesian network mediation analysis is summarized in the Supplementary material online ( S2 Text ). Additionally, we used the SEM to quantify the direct effects of age and sex on the ASCVD risk and their indirect effects through LRBs. We used the R² value to express the explanatory power of the SEM. All analyses were performed using the bnlearn (version 4.7) and lavaan (version 0.6–9) packages for R software (R Foundation for Statistical Computing version 4.0.3).
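To make the notion of a Markov blanket concrete, the following Python sketch reads the blanket of a node off a DAG that is already known. It does not perform the grow-shrink structure learning or the bootstrap averaging done in bnlearn; the function name and the small edge list are hypothetical and chosen only for illustration.

```python
def markov_blanket(edges, node):
    """Return parents, children, and the children's other parents of `node`.

    `edges` is an iterable of (cause, effect) pairs describing a DAG.
    """
    edges = list(edges)
    parents = {a for a, b in edges if b == node}
    children = {b for a, b in edges if a == node}
    # "Spouses": other direct causes of the node's direct effects.
    spouses = {a for a, b in edges if b in children and a != node}
    return parents | children | spouses


# Hypothetical toy DAG, for illustration only (not the network in Fig 2).
toy_edges = [
    ("male", "smoking"),
    ("male", "inactivity"),
    ("inactivity", "smoking"),
    ("older_age", "smoking"),
    ("smoking", "sodium"),
    ("male", "sodium"),
    ("sodium", "weight"),
    ("older_age", "weight"),
    ("smoking", "ascvd_risk"),
    ("weight", "ascvd_risk"),
]

if __name__ == "__main__":
    print(sorted(markov_blanket(toy_edges, "sodium")))
```

Conditional on its Markov blanket, a node is independent of every other node in the network, which is why constraint-based algorithms search for this set first.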

Ethics statement

This study was approved by the Institutional Review Board of Chuncheon Sacred Heart Hospital (IRB no. 2021-12-006), and the need for informed consent was waived because we used the fully deidentified public database.

This study included 18,330 participants (mean age, 58.3±10.8 years; 41.8% males). The clinical and demographic characteristics of the predefined ASCVD risk score groups are shown in Table 1 . The proportion of participants with high ASCVD scores was 42.7% (7832/18,330). Age and male sex were positively associated with a high ASCVD risk. However, years of education, household income level, white-collar occupation, and being married were negatively associated with the ASCVD risk. The most prevalent LRB in the high ASCVD risk group was excessive sodium intake (95.6%), followed by inadequate sleep (49.9%), physical inactivity (43.8%), and unhealthy weight (43.3%).

| Variable | ASCVD Score <7.5% (N = 10498) | ASCVD Score ≥7.5% (N = 7832) | p-value |
| Male | 2879 (27.4%) | 4806 (61.4%) | <0.001 |
| Age, years | | | <0.001 |
|   40–64 | 9946 (94.7%) | 2583 (33.0%) | |
|   65–79 | 552 (5.3%) | 5249 (67.0%) | |
| Educational year | | | <0.001 |
|   <6 | 1301 (12.4%) | 3452 (44.1%) | |
|   6–9 | 1251 (11.9%) | 1273 (16.2%) | |
|   9–12 | 3954 (37.7%) | 1837 (23.5%) | |
|   ≥12 | 3992 (38.0%) | 1270 (16.2%) | |
| Level of income | | | 0.009 |
|   Low | 2384 (22.7%) | 1911 (24.4%) | |
|   Mid-low | 2629 (25.1%) | 1959 (25.0%) | |
|   Mid-high | 2680 (25.5%) | 2015 (25.7%) | |
|   High | 2805 (26.7%) | 1947 (24.9%) | |
| Occupation | | | <0.001 |
|   White collar | 4693 (44.7%) | 1255 (16.0%) | |
|   Blue collar | 2507 (23.9%) | 2652 (33.9%) | |
|   Unemployed | 3298 (31.4%) | 3925 (50.1%) | |
| Married | 5419 (84.9%) | 4217 (75.4%) | <0.001 |
| Unhealthy weight | 3654 (34.8%) | 3394 (43.3%) | <0.001 |
| Heavy alcohol consumption | 936 (8.9%) | 881 (11.2%) | <0.001 |
| Inadequate sleep | 4525 (43.1%) | 3910 (49.9%) | <0.001 |
| Physical inactivity | 3927 (37.4%) | 3430 (43.8%) | <0.001 |
| Excessive sodium intake | 9963 (94.9%) | 7486 (95.6%) | 0.037 |
| Current smoking | 967 (9.2%) | 1798 (23.0%) | <0.001 |

The p-values of the χ² test represent the statistical significance of categorical variables. ASCVD, atherosclerotic cardiovascular disease; BMI, body mass index.
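As a worked example of the χ² test on a 2 × 2 table, the sketch below applies it to the excessive sodium intake counts in Table 1. It assumes Yates's continuity correction, which is not stated in the text; under that assumption the result comes out close to the tabulated p-value of 0.037. The function name is hypothetical, and the p-value uses the identity that the upper tail of a χ² distribution with one degree of freedom equals erfc(√(χ²/2)).

```python
import math


def chi2_yates_2x2(a, b, c, d):
    """Chi-square test with Yates's correction for the 2x2 table [[a, b], [c, d]].

    Returns the statistic and the p-value for one degree of freedom.
    """
    n = a + b + c + d
    # Continuity correction: shrink |ad - bc| by n/2, never below zero.
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Rows: ASCVD score <7.5% and >=7.5%; columns: excessive sodium yes / no.
    low_yes, low_no = 9963, 10498 - 9963
    high_yes, high_no = 7486, 7832 - 7486
    chi2, p = chi2_yates_2x2(low_yes, low_no, high_yes, high_no)
    print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```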

The associations between each LRB and the risk of having a high ASCVD score (≥7.5%) in the binary logistic regression analyses are shown in Table 2 . All six LRBs were statistically significant factors in the univariate analyses. Among them, current smoking (odds ratio [OR], 3.73; 95% confidence interval [CI], 3.31–4.22), unhealthy weight (OR, 1.51; 95% CI, 1.37–1.66), and inadequate sleep (OR, 1.11; 95% CI, 1.01–1.22) remained significant in the multivariate analysis after adjusting for age, sex, educational status, occupation, household income, and marital status.

OR (95% CI)
| Lifestyle risk behavior | Crude Model | Model 1 | Model 2 | Model 3 |
| Unhealthy weight | 1.43 (1.35–1.52) | 1.40 (1.32–1.49) | 1.50 (1.37–1.64) | 1.51 (1.37–1.66) |
| Heavy alcohol consumption | 1.29 (1.17–1.43) | 0.91 (0.82–1.01) | 1.02 (0.90–1.17) | 1.01 (0.88–1.16) |
| Inadequate sleep | 1.32 (1.24–1.40) | 1.31 (1.23–1.39) | 1.19 (1.09–1.31) | 1.11 (1.01–1.22) |
| Physical inactivity | 1.30 (1.23–1.38) | 1.25 (1.17–1.32) | 1.26 (1.15–1.38) | 1.09 (0.99–1.21) |
| Excessive sodium intake | 1.16 (1.01–1.34) | 1.22 (1.06–1.41) | 1.01 (0.81–1.25) | 1.09 (0.87–1.37) |
| Current smoking | 2.94 (2.70–3.20) | 2.95 (2.71–3.23) | 3.58 (3.20–4.02) | 3.73 (3.31–4.22) |

Crude model: lifestyle risk behaviors 1–6 are considered univariate predictors. Model 1: lifestyle risk behaviors 1–6 are considered multivariate predictors. Model 2: lifestyle risk behaviors 1–6 are considered multivariate predictors after adjusting for age and sex. Model 3: lifestyle risk behaviors 1–6 are considered multivariate predictors after adjusting for age, sex, occupation, household income, and marital status.
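For a single binary predictor, the crude odds ratio from logistic regression equals the cross-product ratio of the corresponding 2 × 2 table, so the crude model column can be checked against the counts in Table 1. The Python sketch below does this with a Wald 95% confidence interval; the function name is hypothetical and the interval method is an assumption, although for current smoking and unhealthy weight it agrees with the tabulated values to two decimals.

```python
import math


def crude_odds_ratio(exposed_high, n_high, exposed_low, n_low, z=1.96):
    """Crude OR and Wald CI for an exposure in high- vs low-risk groups."""
    a = exposed_high            # exposed, high ASCVD risk
    b = n_high - exposed_high   # unexposed, high ASCVD risk
    c = exposed_low             # exposed, low ASCVD risk
    d = n_low - exposed_low     # unexposed, low ASCVD risk
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Counts taken from Table 1 (N = 7832 high risk, N = 10498 low risk).
    for name, high, low in [("Current smoking", 1798, 967),
                            ("Unhealthy weight", 3394, 3654)]:
        or_, lo, hi = crude_odds_ratio(high, 7832, low, 10498)
        print(f"{name}: OR {or_:.2f} ({lo:.2f}-{hi:.2f})")
```

The adjusted models (Models 1 to 3) cannot be reproduced this way because they require the individual-level data.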

Fig 2 shows the results of the Bayesian network model of the association of LRBs and high ASCVD risk, which provides probabilistic inference pathways that could be mediated by age, sex, and several combinations of LRBs. In this network model, older age (65–79 years) and male sex were directly associated with a high ASCVD risk, and physical inactivity, current smoking, excessive sodium intake, and unhealthy weight indirectly mediated the effects of older age and male sex on a high ASCVD risk. The older age group was positively associated with unhealthy weight and negatively associated with current smoking. Male sex was positively associated with current smoking, excessive sodium intake, unhealthy weight, and heavy alcohol consumption, and it was negatively associated with physical inactivity. Excessive sodium intake was not directly associated with a high ASCVD risk; however, it was indirectly associated with ASCVD only through the mediation of unhealthy weight. Heavy alcohol consumption and inadequate sleep were not directly associated with high ASCVD risk and did not indirectly mediate the effects of older age and male sex on the high ASCVD risk. Additionally, physical inactivity, current smoking, excessive sodium intake, and unhealthy weight were shown to mediate the high ASCVD risk sequentially, and excessive sodium intake negatively mediated the effect of current smoking on the high ASCVD risk.

Fig 2 (pone.0307677.g002.jpg).

ASCVD, atherosclerotic cardiovascular disease. In the model, male sex and older age are placed as antecedents because they are non-modifiable risk factors. Only the variables that entered the logistic regression analysis were passed to the grow-shrink Markov blanket algorithm to determine the predictor (antecedent, start point of the arrow) and the outcome (end point of the arrow).

Table 3 shows the indirect effects of LRBs on developing high risk for ASCVD in the SEM. The overall paths indicated significant indirect effects of physical inactivity, excessive sodium intake, current smoking, and unhealthy weight on the high ASCVD risk. The direct effects of older age and male sex mainly accounted for the development of the high ASCVD risk. However, indirect mediation by LRBs corresponded to 8.2% and 39.9% of the total effects of older age and male sex, respectively ( Table 4 ). The R² value for the SEM was 0.75, which indicated that a high percentage of the variance of the ASCVD risk was explained by older age, male sex, and LRBs.

Indirect effect
| Independent Variable | Path | β | SE | p-value |
| Older age group (65–79 years) | CS → ASCVD | -0.225 | 0.015 | <0.001 |
| | UW → ASCVD | 0.008 | 0.003 | 0.010 |
| Male | CS → ASCVD | 0.598 | 0.026 | <0.001 |
| | PI → CS → ASCVD | -0.008 | 0.002 | <0.001 |
| | UW → ASCVD | 0.023 | 0.004 | <0.001 |
| | ES → UW → ASCVD | 0.004 | 0.001 | 0.001 |
| | CS → ES → UW → ASCVD | -0.002 | 0.001 | 0.001 |

CS = current smoking; IS = inadequate sleep; HAC = heavy alcohol consumption; UW = unhealthy weight; PI = physical inactivity; ES = excessive sodium intake; ASCVD, atherosclerotic cardiovascular disease; β , standardized regression coefficient; SE, standard error.

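| Independent Variable | Direct Effect β | SE | p-value | Total Indirect Effect β | SE | p-value | Total Effect β | SE | p-value | Mediated Proportion % |
| Older age group (65–79 years) | 2.875 | 0.037 | <0.001 | -0.217 | 0.016 | <0.001 | 2.658 | 0.035 | <0.001 | 8.2% |
| Male | 0.922 | 0.038 | <0.001 | 0.229 | 0.024 | <0.001 | 1.150 | 0.043 | <0.001 | 39.9% |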

The mediated proportion is presented as the total indirect effect/total effect. β , standardized regression coefficient; SE, standard error.
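The decomposition used in Tables 3 and 4 can be illustrated with the older-age row: the path-specific indirect effects add up to the total indirect effect, the total effect is the direct effect plus the total indirect effect, and the mediated proportion is their ratio. The Python sketch below works through that row only; the function name is hypothetical, and taking the absolute value of the (negative) indirect effect is an assumption made so that the ratio can be expressed as a positive percentage.

```python
def mediation_summary(direct, indirect_paths):
    """Combine a direct effect with path-specific indirect effects.

    Returns (total indirect effect, total effect, mediated proportion),
    where the proportion is |total indirect| / total effect.
    """
    total_indirect = sum(indirect_paths)
    total_effect = direct + total_indirect
    proportion = abs(total_indirect) / total_effect
    return total_indirect, total_effect, proportion


if __name__ == "__main__":
    # Older age group: direct effect from Table 4, paths from Table 3
    # (CS -> ASCVD and UW -> ASCVD).
    indirect, total, share = mediation_summary(2.875, [-0.225, 0.008])
    print(f"indirect = {indirect:.3f}, total = {total:.3f}, "
          f"mediated = {100 * share:.1f}%")
```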

During this study, we examined the causal relationship of each LRB with the ASCVD risk in the general population and demonstrated that nearly half of the participants had a 10-year ASCVD risk of 7.5% or more. Moreover, high-risk LRBs were highly prevalent in this general population. Excessive sodium intake was the most prevalent LRB, followed by inadequate sleep, physical inactivity, and unhealthy weight. In our Bayesian network model and SEM used to assess the mediation effects of LRBs, old age and male sex accounted for most of the ASCVD risk. However, physical inactivity increased the ASCVD risk through the mediation of current smoking, excessive sodium intake, and unhealthy weight. The direct effects of old age and male sex on the ASCVD risk represented most of the risk, but the LRBs were found to be statistically significant mediators that increased the ASCVD risk. To reduce the additional risk of ASCVD mediated by LRBs, targeting physical inactivity and current smoking, rather than excessive sodium intake and unhealthy weight, may be an effective prevention method.

Lifestyle modifications, including smoking cessation and maintaining a high level of physical activity, are effective preventive measures that can reduce the CVD risk and maintain overall health. As described, many studies have reported that the CVD risk can be reduced by correcting multiple LRBs. However, it is clinically relevant to evaluate which of the LRBs precedes the effect modifier to reduce the CVD risk using limited preventive resources. To the best of our knowledge, only a few studies have reported the causal relationship between several LRBs and CVD risk in a prospective cohort. Although this study used cross-sectional and retrospective data, the Bayesian network model was used to show which LRB component can probabilistically precede the other components.

According to our results, physical inactivity preceded current smoking and was shown to affect the high ASCVD risk of the male subjects. Results of a systematic review of physical activity for smoking cessation showed that exercise intervention did not have sufficient preventive power to stop smoking in 20 clinical trial populations [ 28 ]. However, the small sample size of these trials raised the possibility that intense physical exercise intervention may be sufficiently effective for smoking cessation [ 28 ]. According to a previous study, a low level of physical activity was associated with depression and physical dependency on nicotine [ 29 ]. Another study also supported the evidence that physical activity reduces nicotine cravings and withdrawal symptoms, which in turn leads to successful smoking cessation [ 30 ]. A negative impact of cigarette smoking on appetite (sodium intake) has been reported by many experimental and clinical studies [ 31 , 32 ]. In addition, heavy alcohol consumption was not a modulator of the high ASCVD risk according to our results. However, a prospective Russian cohort study showed that frequent heavy alcohol consumption (alcohol intake ≥80 g/day and ≥3 times/week) was associated with a two-fold increase in CVD mortality [ 33 ]. An insufficient adjustment of confounders, including LRBs and frequent heavy alcohol consumption episodes during this study, might partially explain the weak or absent association between heavy alcohol consumption and high ASCVD risk shown by our data. Nevertheless, a dose-response relationship between alcohol consumption and CVD risk is well established in the usual setting.

In our Bayesian network model, antecedent current smoking was negatively associated with consequent excessive sodium intake ( Fig 2 ). Several studies have reported that smoking has an appetite-lowering effect. Increasing oral nicotine intake gradually decreased food intake in a clinical trial [ 34 ]. Moreover, adult smokers tend to have a lower BMI and unhealthy eating habits compared to non-smokers, and smoking cessation was associated with weight gain [ 35 ]. The results of our Bayesian network graph suggesting that smoking had a negative effect on excessive sodium intake are consistent with these reports. Bayesian network mediation analysis using the grow-shrink Markov blanket is a powerful method that can help researchers identify causal pathways between variables in a dataset.

According to our data, male sex was directly and indirectly associated with a high ASCVD risk. Vyssoulis et al. analyzed the CVD risk factor profiles of 21,280 Greek patients with hypertension [ 36 ]. Among those subjects, smoking, diabetes, and high triglyceride concentrations were more prevalent in men, and these differences were more evident in young subjects. According to the European Action on Secondary and Primary Prevention through Intervention to Reduce Events survey that included 7998 patients with established coronary artery disease from 24 European Union countries, clustering of cardiovascular risk factors such as smoking, obesity, high blood pressure, high low-density lipoprotein, and diabetes was more prevalent in women than in men [ 37 ]. The high prevalence of the CVD risk factor profile in this survey could be explained by the slightly higher proportion of older women than older men (39.1% vs. 27.3%). In addition to this age gap between men and women, they suggested that sex differences in CVD risk factor profiles could be explained by residual confounders such as educational status, target achievement for risk factor control, and LRBs [ 37 ]. According to our data, male subjects were more likely than women to have unhealthy LRBs. Therefore, our data support the hypothesis that LRBs could be confounders for this sex difference in ASCVD risk estimation.

We used the Bayesian network and the SEM models to assess the causal relationship of each LRB with high ASCVD risk using cross-sectional data. The Bayesian network can simultaneously evaluate the relationships between one or more variables, including dependency and independency. With multilevel models, the covariance estimation of random probability effects involves two mediation regression equations with different dependent variables. Therefore, we obtained estimates of the coefficients in regression models, and the resulting DAG was interpreted as causal nodes of relationships. Therefore, the Bayesian network can measure causal relationships of multiple factors and effectively identify the effect modifiers and confounders with complex pathways. Although there is no universally accepted method of constructing the Bayesian network from data, an additional SEM could effectively assess the interactions of the mediator variables, as in our study.

Our study had several limitations. First, the LRB information was not prospectively collected; therefore, there might have been significant sectional bias because of the unmeasured confounders. However, we sufficiently adjusted for demographic and socioeconomic factors. Second, we only assessed the ASCVD risk and not actual ASCVD events. While ASCVD risk is an important predictor of future cardiovascular events, it is important to acknowledge that actual events may differ from predicted risk. Therefore, the effects of LRBs on the ASCVD risk might be different compared to other clinical studies using real CVD outcomes.

Despite these limitations, our study had several strengths. First, we used nationally representative data, and complex multistage sampling was performed to reflect the fundamental demographic characteristics in Korea. Second, the advantage of using KNHANES data is that all participants were of a single ethnic group; therefore, we did not have to consider the effects of multi-ethnicity on LRBs when conducting this study. Finally, the dietary intake level (e.g., excessive sodium intake) was quantified by a laboratory test instead of a dietary questionnaire.

In conclusion, this nationally representative population-based study revealed that highly prevalent LRBs were directly and indirectly associated with high ASCVD risk. Using Bayesian network mediation and the SEM models, we identified that physical inactivity might precede current smoking, excessive sodium intake, and unhealthy weight. Although most ASCVD risks were directly caused by old age and male sex, important mediators, including physical inactivity, current smoking, excessive sodium intake, and unhealthy weight, were also identified as high ASCVD risk factors. Our results provide evidence that LRBs mediate ASCVD risks in the Bayesian network model.

Supporting information

Funding statement.

This research was supported by the Ministry of Education / "Regional Innovation Strategy (RIS)" through the National Research Foundation of Korea (NRF, 2022RIS-005, awarded to CK), and by the Ministry of Health and Welfare / the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI, HR21C0198, awarded to JJL). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability

  • PLoS One. 2024; 19(8): e0307677.

Decision Letter 0

25 Jun 2024

PONE-D-24-16325

Lifestyle Risk Behavior and Atherosclerotic Cardiovascular Risk: An Analysis Using the Korea National Health and Nutrition Examination Survey

PLOS ONE

Dear Dr. Kim,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 09 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Hidetaka Hamasaki

Academic Editor

Journal Requirements:

1. When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have indicated that there are restrictions to data sharing for this study. For studies involving human research participant data or other sensitive data, we encourage authors to share de-identified or anonymized data. However, when data cannot be publicly shared for ethical reasons, we allow authors to make their data sets available upon request. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. 

Before we proceed with your manuscript, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., a Research Ethics Committee or Institutional Review Board, etc.). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of recommended repositories, please see https://journals.plos.org/plosone/s/recommended-repositories . You also have the option of uploading the data as Supporting Information files, but we would recommend depositing data directly to a data repository if possible.

Please update your Data Availability statement in the submission form accordingly.

3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information .

4. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear editor,

Thank you for allowing me to review this manuscript.

The authors present a manuscript with valuable insights into the relationship between lifestyle risk behaviors and ASCVD risk in the Korean population, employing robust statistical methods and utilizing nationally representative data from KNHANES.

Here are my minor comments:

- In the discussion section; the second paragraph lacks references. “Lifestyle modifications, including smoking cessation and maintaining a high level of physical activity, are effective preventive measures that can reduce the CVD risk and maintain overall health. As described, many studies have reported that the CVD risk can be reduced by correcting multiple LRBs. However, it is clinically relevant to evaluate which of the LRBs precedes the effect modifier to reduce the CVD risk using limited preventive resources. To our best knowledge, only a few studies have reported the causal relationship between several LRBs and CVD risk in a prospective cohort. Although this study used cross-sectional and retrospective data, the Bayesian network model was used to present which LRB component can probabilistically precede the other components.”

- I encourage the authors to provide a paragraph on the discussion section addressing how can healthcare professionals utilize this information to improve CVD risk assessment and prevention strategies.

Reviewer #2: 1. The manuscript is technically sound, and the data supported the conclusions.

2. The statistical analysis has been performed appropriately and rigorously.

3. The authors have made all data underlying the findings in their manuscript fully available.

4. The manuscript is presented in an intelligible fashion and written in standard English.

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy .

Reviewer #1: No

Reviewer #2:  Yes:  Kassa Demissie Abdi (PhD)

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

Submitted filename: comments plos one ASCVD.docx

Submitted filename: Final Manuscript.docx

Author response to Decision Letter 0

MS ID#: PONE-D-24-16325

MS TITLE: Lifestyle Risk Behavior and Atherosclerotic Cardiovascular Risk: An Analysis Using the Korea National Health and Nutrition Examination Survey

Dear Dr. Hidetaka Hamasaki

Thank you for providing us with the opportunity to revise and resubmit our manuscript entitled ‘Lifestyle Risk Behavior and Atherosclerotic Cardiovascular Risk: An Analysis Using the Korea National Health and Nutrition Examination Survey’ (MS ID: PONE-D-24-16325) for possible publication in PLOS ONE. We are sincerely grateful to the reviewers for their insightful comments, which have significantly contributed to the enhancement of our manuscript.

Enclosed, please find our revised manuscript. We have provided point-by-point responses to the reviewers’ comments below. The reviewers’ feedback is highlighted in blue text and set in an 11-point font, while our response can be found directly beneath each reviewer’s comment. For your convenience, all changes made to the manuscript are highlighted with a yellow background.

It is our hope that, with these revisions, our manuscript will be deemed suitable for publication in PLOS ONE. Should you require any further information or clarification, please do not hesitate to contact me.

Sincerely yours,

Chulho Kim, MD, PhD

Department of Neurology, Chuncheon Sacred Heart Hospital, Hallym University College of Medicine, 77 Sakju-ro, 24253 Chuncheon, Korea

Telephone: +82-33-240-5255

Fax: +82-33-255-1338

E-mail: moc.revan@25lodmug

Reviewer #1

In the discussion section, the second paragraph lacks references. “Lifestyle modifications, including smoking cessation and maintaining a high level of physical activity, are effective preventive measures that can reduce the CVD risk and maintain overall health. As described, many studies have reported that the CVD risk can be reduced by correcting multiple LRBs. However, it is clinically relevant to evaluate which of the LRBs precedes the effect modifier to reduce the CVD risk using limited preventive resources. To our best knowledge, only a few studies have reported the causal relationship between several LRBs and CVD risk in a prospective cohort. Although this study used cross-sectional and retrospective data, the Bayesian network model was used to present which LRB component can probabilistically precede the other components.”

Response #1

We thank the reviewer for this insightful comment. To address this, we have cited appropriate articles we have reviewed while writing the manuscript in the paragraph.

Lifestyle modifications, including smoking cessation and maintaining a high level of physical activity, are effective preventive measures that can reduce the CVD risk and maintain overall health7,10,11. As described, many studies have reported that the CVD risk can be reduced by correcting multiple LRBs10,11. However, it is clinically relevant to evaluate which of the LRBs precedes the effect modifier to reduce the CVD risk using limited preventive resources. To the best of our knowledge, only a few studies have reported the causal relationship between several LRBs and CVD risk in a prospective cohort28,29. Although this study used cross-sectional and retrospective data, the Bayesian network model was used to present which LRB component can probabilistically precede the other components.

28. Liu G, Li Y, Hu Y, Zong G, Li S, Rimm EB, et al. Influence of lifestyle on incident cardiovascular disease and mortality in patients with diabetes mellitus. Journal of the American College of Cardiology. 2018;71:2867-2876.

29. Nambo R, Karashima S, Mizoguchi R, Konishi S, Hashimoto A, Aono D, et al. Prediction and causal inference of cardiovascular and cerebrovascular diseases based on lifestyle questionnaires. Sci Rep. 2024;14:10492.

I encourage the authors to add a paragraph to the discussion section addressing how healthcare professionals can utilize this information to improve CVD risk assessment and prevention strategies.

Response #2

We thank the reviewer for this valuable suggestion. In response, we have added a paragraph in the discussion section that addresses how healthcare professionals can utilize the findings of our study to improve CVD risk assessment and prevention strategies. The new paragraph has been included below:

[Discussion, Last paragraph]

Our study results can be utilized by healthcare professionals to promote healthy lifestyles for patients, with a particular focus on encouraging regular physical activity to mitigate sequential LRBs, especially among older males, given the significant role of LRBs in mediating ASCVD risk. Additionally, understanding the interplay between different LRBs allows for more personalized and effective preventive measures, ultimately optimizing resource allocation and enhancing patient outcomes.

Reviewer #2

1. The manuscript is technically sound, and the data supported the conclusions.

We thank the reviewer for their positive comments and appreciation of our work.

Submitted filename: Response to the Reviewers.docx

Decision Letter 1

10 Jul 2024

Lifestyle Risk Behavior and Atherosclerotic Cardiovascular Risk: An Analysis Using the Korea National Health and Nutrition Examination Survey

PONE-D-24-16325R1

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org .

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible, and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org .

Additional Editor Comments (optional):

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions?

3. Has the statistical analysis been performed appropriately and rigorously?

4. Have the authors made all data underlying the findings in their manuscript fully available?

5. Is the manuscript presented in an intelligible fashion and written in standard English?

6. Review Comments to the Author

Reviewer #2: 1. The authors have adequately addressed my comments raised in a previous round of review and I feel that this manuscript is now acceptable for publication.

2. The manuscript is technically sound and the data support the conclusions.

3. The statistical analysis has been performed appropriately and rigorously.

4. The authors have made all data underlying the findings in their manuscript fully available.

5. The manuscript is presented in an intelligible fashion and written in standard English.

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

Acceptance letter

18 Jul 2024

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org .

If we can help with anything else, please email us at customercare@plos.org .

Thank you for submitting your work to PLOS ONE and supporting open access.

PLOS ONE Editorial Office Staff

on behalf of

Dr. Hidetaka Hamasaki

COMMENTS

  1. (PDF) Logistic regression in data analysis: An overview

    2 The Logistic Regression Model. Let $X \in \mathbb{R}^{n \times d}$ be a data matrix where $n$ is the number of instances (examples) and $d$ is the number of features (parameters or attributes), and $y$ be a binary outcomes ...

  2. Logistic Regression in Medical Research

    Logistic regression is used to estimate the association of one or more independent (predictor) variables with a binary dependent (outcome) variable. 2 A binary (or dichotomous) variable is a categorical variable that can only take 2 different values or levels, such as "positive for hypoxemia versus negative for hypoxemia" or "dead versus ...

  3. Understanding logistic regression analysis

    Understanding logistic regression analysis - PMC

  4. PDF Logistic Regression

    Logistic regression is an excellent tool for modeling relationships with outcomes that are not measured on a continuous scale (a key requirement for linear regression). Logistic regression is often leveraged to model the probability of observations belonging to different classes of a categorical outcome, and this type of modeling is known as ...

  5. PDF Logistic Regression: From Art to Science

    Logistic regression is a common classification method when the response variable is binary. Given a response vector $y_{n \times 1}$, a model matrix $X = [X_1, \ldots, X_n] \in \mathbb{R}^{n \times p}$, and regression coefficients $\beta \in \mathbb{R}^{p \times 1}$, the logistic regression model assumes $\log\left(P(y_i = 1 \mid x_i) / P(y_i = 0 \mid x_i)\right) = \beta^{\top} x_i$. Logistic regression minimizes the negative log-likelihood of ...

  6. Logistic Regression: A Brief Primer

    Logistic Regression: A Brief Primer - Stoltzfus - 2011

  7. Primer on binary logistic regression

    Primer on binary logistic regression - PMC

  8. An Introduction to Logistic Regression Analysis and Reporting

    This article demonstrates the preferred pattern for the application of logistic methods with an illustration of logistic regression applied to a data set in testing a research hypothesis ...

  9. Logistic regression

    Logistic regression | Nature Methods

  10. Logistic regression in data analysis: an overview

    Logistic regression (LR) continues to be one of the most widely used methods in data mining in general and binary data classification in particular. This paper is focused on providing an overview of the most important aspects of LR when used in data analysis, specifically from an algorithmic and machine learning perspective and how LR can be applied to imbalanced and rare events data.

  11. Logistic Regression: A Basic Approach

    2 Logistic Regression. A "supervised machine learning" approach that uses data to predict the occurrence of a given event or class is called Logistic Regression. This technique is applicable when the data are linearly separable and the output is dichotomous or binary. As a result, Logistic Regression is generally used ...

  12. PDF An Introduction to Logistic Regression: From Basic Concepts to

    Logistic regression, sometimes called the logistic model or logit model, analyzes the relationship between multiple independent variables and a categorical dependent variable, and estimates the probability of occurrence of an event by fitting data to a logistic curve. There are two models of logistic regression, binary logistic regression and ...

  13. PDF CHAPTER Logistic Regression

    Logistic Regression

  14. A review of the application of logistic regression in educational

    This study reviews the international literature of empirical educational research to examine the application of logistic regression. The aim is to examine common practices of the report and ...

  15. PDF An Introduction to Logistic Regression Analysis and Reporting

    formats of logistic regression results and the minimum observation-to-predictor ratio. The remainder of this article is divided into five sections: (1) Logistic Regression Models, (2) Illustration of Logistic Regression Analysis and Reporting, (3) Guidelines and Recommendations, (4) Evaluations of Eight Articles Using Logistic Regression, and (5)

  16. Logistic regression: A simple primer : Cancer Research ...

    Abstract. Logistic regression is used to obtain the odds ratio in the presence of more than one explanatory variable. This procedure is quite similar to multiple linear regression, with the only exception that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest.

  17. A logistic regression investigation of the relationship between the

    These results extend the research base regarding the relationship between the LA program and positive student outcomes. ... we use logistic regression with pre-existing institutional data to investigate the relationship between exposure to LA support in large introductory STEM courses and general failure rates in these same and other ...

  18. Linear and logistic regression models: when to use and how to interpret

    Linear and logistic regressions are widely used statistical methods to assess the association between variables in medical research. These methods estimate if there is an association between the independent variable (also called predictor, exposure, or risk factor) and the dependent variable (outcome). 2. The association between two variables ...

  19. Logistic Regression Explained

    Logistic Regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. Source: scikit-learn Image ...

  20. Logistic Regression Model Optimization and Case Analysis

    Traditional logistic regression analysis is widely used in the binary classification problem, but it has many iterations and it takes a long time to train large

  21. Common pitfalls in statistical analysis: Logistic regression

    Logistic regression analysis is a statistical technique to evaluate the relationship between various predictor variables (either categorical or continuous) and an outcome which is binary (dichotomous). In this article, we discuss logistic regression analysis and the limitations of this technique. Keywords: Biostatistics, logistic models ...

  22. (PDF) Understanding logistic regression analysis

    The two classification-based QSAR models reported in this paper were developed using binary logistic regression method. Some of the main attractions of using binary logistic regression algorithm ...

  23. Application of logistic regression models to assess household financial

    The logistic regression model allows one to examine the influence of many independent variables $X_1, \ldots, X_k$ on the dependent variable Y. The variable Y takes only two values and is dichotomous. ... .” Research Papers of Wrocław University of Economics / Prace Naukowe Uniwersytetu Ekonomicznego ...

  24. Lifestyle risk behavior and atherosclerotic cardiovascular risk: An

    First, a binary logistic regression analysis was performed to classify the subjects with high ASCVD risk scores (≥7.5%) and to check whether a significant covariance shift existed when using categorized predictors in the model. As stated in the introduction, we set the outcome to be the ASCVD risk, not the development of the ASCVD itself.
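
Several of the snippets above describe the same core idea: the model is linear in the log-odds, the coefficients are found by minimizing the negative log-likelihood, the exponentiated coefficient is read as an odds ratio, and a probability threshold turns the fit into a classifier. The following is a minimal Python sketch of those steps, not the method of any listed paper. The toy data, the function names, the plain gradient-descent fit, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(beta, x):
    """P(y = 1 | x) under the model log-odds = beta . x."""
    return sigmoid(sum(b * xj for b, xj in zip(beta, x)))


def negative_log_likelihood(beta, X, y):
    """Sum of -[y log p + (1 - y) log(1 - p)] over all observations."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(beta, xi), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def fit(X, y, lr=0.1, steps=5000):
    """Minimize the mean negative log-likelihood by gradient descent."""
    n = len(X)
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = predict_proba(beta, xi) - yi  # gradient term (p - y) * x
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta


# Invented toy data: first column is the intercept, second is one predictor.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0],
     [1.0, 2.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

beta = fit(X, y)
odds_ratio = math.exp(beta[1])  # odds multiplier per unit increase in x
labels = [1 if predict_proba(beta, xi) >= 0.5 else 0 for xi in X]

print("coefficients:", beta)
print("odds ratio for the predictor:", odds_ratio)
print("predicted classes:", labels)
```

In practice a statistics package would normally be used instead of hand-written gradient descent, since it also reports standard errors and confidence intervals for the odds ratio. The threshold can be moved away from 0.5 when the two kinds of misclassification carry different costs.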