Bayesian Logistic Regression Application in Disease Prediction

Capstone Project 2025

Author

Namita Mishra (Advisor: Dr. Ashraf Cohen)

Published

September 14, 2025

Namita’s Literature review

Introduction

  1. Bayesian Hierarchical Model (Disease reclassification and prediction)

What is the goal of the paper?

The authors aimed to develop a Bayesian hierarchical model for multivariate longitudinal data to predict health status, trajectories, and intervention effects at the individual level, in line with the PCORI mission of addressing patients’ and clinicians’ questions about health status.

Why is it important?

Healthcare data (DNA sequences, functional images of the brain, patient-reported outcomes, and electronic health records with patients’ sequences of health measurements, diagnoses, and treatments) are complex, and the standard approaches are not adequate for clinical data analysis. Electronic health records (EHRs) could improve diagnostic accuracy and predict treatment effects. Visualizations of characteristics of posterior distributions can be immediately understood by clinicians and patients as relevant to their decision. Combining prior knowledge and patient data with evidence could predict the patient’s health status, trajectory, and/or likely benefits of intervention.

How is it solved?

Method: Bayesian hierarchical regression was applied to multivariate longitudinal patient data using R packages. The model has two levels, (1) time within person and (2) persons within a population, along with covariates and interventions, combining exogenous factors (e.g., age, clinical history) and endogenous variables (e.g., current treatment) to model each individual’s multivariate health measurements and the effects of health measurements at one time on subsequent interventions.

The model provided an estimate of the posterior distribution for each predictor variable and of the marginal distribution of each regression coefficient measuring the outcome (health status). In a larger sample, the likelihood dominates the prior distribution for the regression coefficients. The Bayesian hierarchical model is a likelihood-based approach; it uses priors (from prior laboratory and clinical-trial data) that provided the assay sensitivities and, through these prior assumptions, made the model identifiable, while Markov chain Monte Carlo (MCMC) integration estimated the posterior distributions and handled missing data and complex outcome measurements.
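As a hedged illustration only (not the authors’ code), a two-level hierarchical regression of this kind could be specified in R with the brms package; the data frame `dat` and the variable names `outcome`, `time`, `age`, `treatment`, and `id` are assumed for the sketch:

Code
# Minimal sketch: level 1 = time within person, level 2 = persons within
# the population; person-specific intercepts and time slopes.
library(brms)

fit <- brm(
  outcome ~ time + age + treatment + (1 + time | id),  # assumed variable names
  data   = dat,
  family = gaussian(),
  prior  = prior(normal(0, 1), class = "b"),  # weakly informative priors
  chains = 4, iter = 2000
)
summary(fit)  # MCMC-based posterior summaries for each coefficient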

Results

Three case studies (pneumonia etiology in children, prostate cancer, and mental disorders) were chosen for model development. The models identified low-risk patient populations and reduced the risk of overtreatment, complications, adverse effects, and financial burden for patients (disease reclassification). The prostate cancer software was then implemented within the JHM EHR.

Limitation:

The models were entirely parametric; extensions to nonparametric or more flexible parametric models were recommended to improve approaches for neuroimaging or genomic data.


Applications:

  • to scale and address a particular unmet need across a larger, more diverse population of patients and clinicians

  • use in autoimmune diseases, sudden cardiac arrest, and diabetes.

  • embed tools to acquire and use the most relevant information, agnostic to its level of measurement, to improve population and individual health decisions for better outcomes at an affordable cost. Zeger et al. (2020)

  2. Bayesian Inference (parametric vs non-parametric)

What is the goal of the paper?

The authors calculated the posterior probability of disease diagnosis by applying Bayesian inference to develop three modules comparing parametric distributions (with a fixed set of parameters) and nonparametric distributions (which make no a priori assumptions). National Health and Nutrition Examination Survey data from two separate diagnostic tests on both diseased and non-diseased populations were used for model development.

Why is it important?

Medical diagnosis, treatment, and management decisions are crucial, and conventional methods for diagnosis use clinical criteria and fixed numerical thresholds that limit the information captured related to the intricate relationship between diagnostic tests and the varying prevalence of diseases. The probability distributions associated with quantitative diagnostic test outcomes have some overlap between the diseased and nondiseased groups. The dichotomous method fails to capture the complexity and heterogeneity of disease presentations across diverse populations. The applicability of the normal distribution (conventional method) is critiqued, especially in dealing with clinical measurands having skewness, bimodality, or multimodality.

How is it solved?

Methods: Bayesian nonparametric (vs. parametric) diagnostic modeling: flexible distributional modeling of test outcomes, yielding posterior disease probabilities.

The authors developed models employing Bayesian inference (a Bayesian diagnostic approach) in the Wolfram Language to calculate the posterior probability of disease diagnosis, integrating prior probabilities of disease with distributions of diagnostic measurands in both diseased and non-diseased populations. The approach enabled the evaluation of combined data from multiple diagnostic tests, which improved diagnostic accuracy, precision, and adaptability, and the model showed flexibility and versatility in diagnosis.
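A minimal sketch of the underlying calculation (my own R illustration; the paper used the Wolfram Language), assuming Normal test-value distributions and made-up numbers:

Code
# Posterior probability of disease from one quantitative test via Bayes' theorem
posterior_prob <- function(x, prev, mu_d, sd_d, mu_nd, sd_nd) {
  lik_d  <- dnorm(x, mu_d, sd_d)    # density of the test value if diseased
  lik_nd <- dnorm(x, mu_nd, sd_nd)  # density if non-diseased
  prev * lik_d / (prev * lik_d + (1 - prev) * lik_nd)
}
posterior_prob(x = 7.2, prev = 0.10, mu_d = 8, sd_d = 1.5, mu_nd = 5, sd_nd = 1)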

Results

Nonparametric Bayesian models tend to fit data distributions better, especially given limited existing literature, and are more robust in capturing complex data patterns.

These models produce multimodal probability patterns for disease, unlike the bimodal, double-sigmoidal curves seen with parametric models.

Limitations

  • Reliance on parametric models: A need to extend to nonparametric or more flexible parametric models for medical data.

  • Limited scholarly publications and over-dependence on prior probabilities increase uncertainties, resulting in broader confidence intervals for posterior probabilities. Systemic bias (unrepresentative datasets) compromises the accuracy of Bayesian calculations. For incomplete datasets, Bayesian methods combined with other statistical and computational techniques could enhance diagnostic capabilities.

  • The foundational data is crucial to compare new diagnostic measurements. Absence of normative data compromises the reliability and validity of Bayesian diagnostic methods. Chatzimichail and Hatjimihail (2023)

  3. Bayesian methodology overview - stages, development and advantages

What is the goal of the paper?

The study describes the stages of Bayesian analysis, covering the importance of the priors, data modeling, inference, model checking and refinement, selection of a proper technique for sampling from a posterior distribution, variational inference, and variable selection, along with applications across various research fields. The study proposes strategies for reproducibility and reporting standards, outlining an updated WAMBS (When to Worry and how to Avoid the Misuse of Bayesian Statistics) checklist, and outlines the future impact of Bayesian analysis on artificial intelligence.

Why is it important?

Bayesian statistics is used across many different fields (social sciences, ecology, genetics, medicine).

How is it solved?

In the study, priors are categorized as informative, weakly informative, or diffuse based on the degree of (un)certainty (hyperparameters) surrounding the population parameter. A prior distribution \(\mathcal{N}(\mu_0, \sigma_0^2)\) with a larger variance represents a greater amount of uncertainty about the parameter.
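As a toy illustration (not from the paper), the prior variance controls how informative the prior is:

Code
# Same prior mean, increasing variance: informative -> weakly informative -> diffuse
curve(dnorm(x, 0, 0.5), -10, 10, ylab = "density")   # informative
curve(dnorm(x, 0, 2.5), add = TRUE, lty = 2)         # weakly informative
curve(dnorm(x, 0, 10),  add = TRUE, lty = 3)         # diffuse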

Prior elicitation to construct a prior distribution can draw on experts, generic expert knowledge, data-based approaches, sample data using maximum likelihood or sample statistics, etc.

A prior sensitivity analysis of the likelihood helps examine different forms of the model and assess how the priors and the likelihood align and impact the posterior estimates, reflecting variation not captured by the prior or the likelihood alone.

Prior estimation allows data-informed shrinkage, regularizing or influencing algorithms toward a likely high-density region, and improves estimation efficiency.

A small sample conveys less information than the priors, which quantify the strength of support that the observed data lend to possible values of the unknown parameter(s). Knowing the exact probabilistic specification of the priors is therefore especially important for complex models with smaller sample sizes.

Frequentists do not consider probability statements about the unknown parameters useful and regard the parameters as fixed; the likelihood is the conditional probability distribution p(y|θ) of the data y given the fixed parameters θ.

In contrast, in Bayesian inference, unknown parameters (random variables) have varied values, while the (observed) data have fixed values, and the likelihood is a function of θ for the fixed data y. Therefore, the likelihood function summarizes a statistical model that stochastically generates a range of possible values for θ and the observed data y. With priors and the likelihood of the observed data, the resulting posterior distribution provides an estimate of the unknown parameters, along with capturing the primary factors and improving our understanding.

Monte Carlo techniques approximate integrals using values sampled from a given distribution via computer simulation. The R packages brms and blavaan interface with the probabilistic programming language Stan.
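A minimal sketch of the idea, with simulated draws standing in for MCMC output:

Code
# Monte Carlo approximation of posterior summaries from sampled values
set.seed(1)
draws <- rnorm(1e5, mean = 2, sd = 0.5)    # stand-in for posterior draws
mean(draws)                                 # Monte Carlo estimate of E[theta]
quantile(draws, c(0.025, 0.975))            # 95% credible interval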

Variables are selected after checking the correlations among the variables in the model (e.g., gene-to-gene interactions to predict genes in biomedical research).

Spatial and temporal variability are factored into Bayesian generalized linear models. New data can be simulated conditional on the posterior distribution (the posterior predictive distribution) to assess whether the model provides valid predictions for extrapolating to future events.
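For instance, with brms this posterior predictive assessment can be sketched as follows (assuming a fitted model object `fit`, as in the earlier sketch):

Code
# Posterior predictive check: simulate new data conditional on the posterior
library(brms)
pp_check(fit, ndraws = 100)      # overlay replicated and observed outcomes
yrep <- posterior_predict(fit)   # matrix of simulated datasets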

Results and application

The Bayesian approach analyzes large-scale cancer genomic data, identifying novel molecular changes in cancer initiation and progression, interactions between mutated genes, and mutational signatures, highlighting key genetic-interaction components and allowing genomics-based patient stratification in clinical trials, in the personalized use of therapeutics, and in understanding cancer and its evolutionary processes.

Limitations:

In temporal models, the spatial and/or temporal dependencies (autocorrelation of parameters over time) are a challenge for posterior inference. Schoot et al. (2021)

  4. Bayesian Normal linear regression: core parametric (conjugate) model with Normal–Inverse-Gamma prior

What is the goal of the paper?

The authors provide guidance on Bayesian inference by performing Bayesian Normal linear regression in metrology (to calibrate instruments, evaluate inter-laboratory comparisons, and determine fundamental constants), emphasizing prior elicitation, analytical posteriors, and robustness checks.

Why is it important?

Measurement errors are assumed to be additive, independent, and identically distributed according to a Gaussian distribution with mean zero and variance \(\sigma^2\), which is usually unknown. Regression is used to calibrate instruments, evaluate inter-laboratory comparisons, and determine fundamental constants, but the regression model cannot be uniquely formulated as a measurement function, so the Guide to the Expression of Uncertainty in Measurement (GUM) and its supplements are not directly applicable.

How is it solved?

Methods: Bayesian inference has the advantage of accounting for additional a priori information, which robustifies the analyses. Three steps (prior elicitation, posterior calculation, and robustness to prior uncertainty and model adequacy) and model assumptions are critical to Bayesian inference.

In Bayesian inference, all unknowns, observables (data) as well as unobservables (parameters and auxiliary variables), are considered random and are assigned probability distributions that summarize the available information; prior knowledge about the unobservables is updated with the information about them contained in the data. The prior distribution and likelihood function are accompanied by simple graphical displays, sensitivity analyses, or model checking that enhance the elicitation and interpretation process.

For Normal linear regression problems, (1) a family of prior distributions for θ and σ² is the Normal-Inverse-Gamma (NIG) distribution, which leads to a posterior from the same (NIG) family, i.e., the NIG prior is conjugate (with a known variance σ² of the observations, a Normal prior alone is conjugate). Vague or non-informative prior distributions can be derived from the NIG prior.

(2) Alternative families of prior distributions include hierarchical priors, which assign an additional layer of distributions to uncertain prior parameters, and non-parametric priors.
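For reference, the standard conjugate NIG update for the Normal linear model \(y = X\theta + \varepsilon\) (a textbook sketch, not reproduced from the paper) is:

\[ \theta \mid \sigma^2 \sim \mathcal{N}(\theta_0, \sigma^2 V_0), \qquad \sigma^2 \sim \mathrm{IG}(a_0, b_0) \]

\[ V_n = \left(V_0^{-1} + X^\top X\right)^{-1}, \qquad \theta_n = V_n\left(V_0^{-1}\theta_0 + X^\top y\right) \]

\[ a_n = a_0 + \frac{n}{2}, \qquad b_n = b_0 + \frac{1}{2}\left(y^\top y + \theta_0^\top V_0^{-1}\theta_0 - \theta_n^\top V_n^{-1}\theta_n\right) \]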

Bayesian inference is influenced by

  • the uncertainty in the transformation of prior knowledge to prior distributions

  • the assumptions of the statistical model

  • the mistakes in data acquisition

Results and Application

Knowledge from related previous experiments (via Normal-Inverse-Gamma distributions) allows analytic posterior calculations for many quantities of interest. Klauenberg et al. (2015)

  5. Bayesian Hierarchical / meta-analytic linear regression and priors (exchangeable and unexchangeable)

What is the goal of the paper?

The study developed and tested a formal method for augmenting data in linear regression analyses by incorporating both exchangeable and unexchangeable information on regression coefficients (and standard errors) from previous studies.

Why is it important?

Null-hypothesis significance testing with frequent multiple testing has relatively low statistical power. Linear regression analyses typically do not account for published results and summary statistics from similar previous studies. Ignoring this relevant, readily available information on parameters affects the stability and precision of the parameter estimates, which are lower than they could have been, resulting in conclusions that are less certain and more affected by sampling variation. With multiple linear regression, separate significance tests for all regression coefficients, and modest sample sizes, different studies find different sets of statistically significant predictors, and addressing the issue with larger samples is practically unrealistic.

How is it solved?

Methods: Bayesian linear regression accommodates prior knowledge. Overcoming the absence of formal methods, it handles the sample-size issue by augmenting the data of a new study with regression coefficients and standard errors from previous similar studies.

To address the limits of univariate analysis, Bayesian linear regression combines the evidence on specific predictors from different linear regression analyses (meta-analysis); the authors found it a better method for simultaneously combining multiple regression parameters per study, whereas univariate combination ignores the relationships between the regression coefficients. By including summary statistics from previous studies, Bayesian linear regression provides a more acceptable solution when raw data from previous studies are not (realistically) obtainable.

Based on the predictors in the previous and current data, the models are categorized as (1) exchangeable, when the current data and the previous studies have the same set of predictors, and (2) unexchangeable, when the predictors differ between the two.

To yield the posterior density that reflects the updated knowledge about the model parameters after having observed the data, the steps of Bayesian linear regression include:

  1. Calculate the probability density function of the data given the unknown model parameters (the likelihood);

  2. Specify the second ingredient, the prior density of the model parameters, which quantifies what is assumed to be known about the model parameters before observing the data; for the standard multiple linear regression model, the prior is integrated with the likelihood and the joint posterior density is obtained using the Gibbs sampler (see the sketch after this list);

  3. Use a hierarchical model version to analyze parameters when studies are categorized as not exchangeable.
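A minimal Gibbs-sampler sketch for the conjugate Normal linear model (my own illustration, not the paper’s code; `X`, `y`, and the prior settings are assumed):

Code
# Gibbs sampler for Bayesian linear regression with a zero-mean Normal
# prior on beta and an Inverse-Gamma prior on sigma^2
gibbs_lm <- function(X, y, n_iter = 2000, a0 = 1, b0 = 1) {
  p <- ncol(X); n <- nrow(X)
  beta <- rep(0, p); sigma2 <- 1
  draws <- matrix(NA, n_iter, p + 1)
  V0_inv <- diag(1e-2, p)                        # vague prior precision
  for (t in 1:n_iter) {
    # beta | sigma2, y ~ Normal
    Vn <- solve(V0_inv + crossprod(X) / sigma2)
    mn <- Vn %*% (crossprod(X, y) / sigma2)
    beta <- as.vector(mn + t(chol(Vn)) %*% rnorm(p))
    # sigma2 | beta, y ~ Inverse-Gamma
    resid <- y - X %*% beta
    sigma2 <- 1 / rgamma(1, a0 + n / 2, b0 + sum(resid^2) / 2)
    draws[t, ] <- c(beta, sigma2)
  }
  draws
}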

Results

Incorporating priors from previous studies in a linear regression on new data yields significantly better parameter estimates with an adequate approximation. Performance was encouraging, with the largest gains obtained when data from previous studies were included.

Performance of the two versions (exchangeable and unexchangeable) of the replication model was consistently superior to using the current data alone.

The models developed using exchangeable and unexchangeable priors offer better parameter estimates in a linear regression setting without the need to expend a large amount of time and energy obtaining raw data from previous studies.

The hierarchical unexchangeable model version offers the advantage of being able to address questions about differences between studies and thus allows explicit testing of the exchangeability assumption.

Limitations:

  • All studies need to have the same set of predictors.

  • The issue of correlation between predictor variables. Leeuw and Klugkist (2012)

  6. Bayesian logistic regression (Bayesian GLM) (Sequential clinical reasoning approach)

What is the goal of the paper?

To develop a model on participants free of CVD at baseline, followed for over five years, to identify new CVD cases. The study analyzed a longitudinal prospective cohort to predict the risk of incident cardiovascular disease by incorporating (1) demographic features (basic model), (2) six metabolic syndrome components (metabolic score model), and (3) conventional risk factors (enhanced model).

The logistic regression application included priors on the coefficients and sequential updating to predict individual-level CVD risk.

Why is it important?

Early diagnosis and prevention, by identifying subjects in the high-risk category for cardiovascular disease (CVD), impact health interventions. The limited availability of molecular information in clinical practice, due to cost and unavailability, hampers efficient disease diagnosis.

An alternative approach is required to analyze the data and efficiently identify a high-risk population based on routinely checked biological markers before performing these expensive molecular tests.

The tailored Framingham Risk Score method is not sufficient because of differences in ethnic group, location, and socio-economic status, which require populations to construct their own models. Heterogeneity (geographic and ethnic-group variation, and different characteristics of the social contextual network) is often unobservable and unmeasurable.

How is it solved?

Methods: Subjects enrolled in the Keelung Community-based Integrated Screening (KCIS) program, a mass screening of residents aged 20–79 years in Keelung City, Taiwan, were followed for 5 years to identify incident cancers and chronic diseases (including cardiovascular disease).

The study classified the risk of incident CVD or death from CVD using (1) available and calculated standardized risk scores for the MetS components (fasting glucose, blood pressure, HDL-C, triglycerides, and waist circumference), together with (2) conventional risk factors (gender, heredity, smoking, alcohol drinking, family history of parental CVD, betel quid use, and other relevant factors).

Emulating a clinician’s evaluation process, the Bayesian clinical reasoning approach was developed in a sequential manner and applied in three models.

The approach modeled the regression coefficients of all predictors as normally distributed, allowing for uncertainty in the clinical weights; credible intervals for the predicted risk estimates were obtained by averaging over this uncertainty. In the model, individual risk is elicited from a prior speculation (first impression) that is updated by objective observed data (patient history and laboratory findings); the regression coefficients for computing the risk score were treated as random variables with a statistical distribution (e.g., normal) rather than as fixed values (as in traditional frequentist risk prediction models). Updating the prior distribution with the likelihood of the current data yields a posterior distribution for predicting the risk of the specific disease. The sequential approach included the following (a toy updating sketch follows the list):

  1. The basic model, developed via logistic regression, used prior information constructed on gender, age, age², and time period.

  2. The classical model (metabolic score model: MS model) included the six MetS components.

  3. The third (enhanced) model incorporated information on smoking, drinking, betel quid, and family history of CVD.
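A toy sketch of the sequential idea (not the authors’ code), using rstanarm; the data frame `d` and all variable names are assumed, and the posterior of the basic model informs the priors of the enhanced model:

Code
library(rstanarm)

# Step 1: basic model with weakly informative priors
basic <- stan_glm(cvd ~ gender + age, data = d,
                  family = binomial(link = "logit"),
                  prior = normal(0, 1))

# Step 2: carry posterior means of shared coefficients forward as prior locations
prior_means <- coef(basic)[-1]   # drop intercept
enhanced <- stan_glm(cvd ~ gender + age + mets_score + smoking, data = d,
                     family = binomial(link = "logit"),
                     prior = normal(location = c(prior_means, 0, 0), scale = 1))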

Results

Compared with the basic and classical models, the enhanced model performed better. The proposed models predicted CVD incidence at the individual level by incorporating routine information with a sequential Bayesian clinical reasoning approach. Patients’ backgrounds contribute significantly to baseline risk. Even with ecological heterogeneity, the regression model accommodates individual characteristics and makes individual risk predictions for CVD incidence.

Limitations:

  • It remains unclear whether interactions between age, gender, metabolic score, and other risk factors should be included.

  • The use of an enhanced model should be validated through external validation by applying the proposed models to new subjects not included in the training of the model parameters. Liu et al. (2013)

  7. Bayesian regression for count data (discrete Weibull distribution)

What is the goal of the paper?

In this paper, Bayesian parametrization was performed: the parameters of the discrete Weibull distribution were conditioned on the predictors under a uniform non-informative prior to produce the posterior distribution. The model promises wide applicability for analyzing count data using the R package BDWreg.

Why is it important?

A regression model for a discrete variable based on the discrete Weibull distribution is a good fit for count data in comparison with other distributions.

The important features of the discrete Weibull distribution that make it a valuable alternative to the more traditional Poisson and Negative Binomial distributions and their extensions (Poisson mixtures, Poisson-Tweedie, zero-inflated semiparametric regression, and COM-Poisson) are its ability to capture both over- and under-dispersion and a closed-form analytical expression for the quantiles of the conditional distribution.

How is it solved?

They considered non-informative priors (Jeffreys and uniform) and the case of Laplace priors with a hyper penalty parameter, and proved that the posterior distribution is proper with finite moments under a uniform non-informative prior.

Results

The advantages of Bayesian approaches over classical maximum likelihood inference are: (1) the possibility of taking prior information into account, such as sparsity or information from historical data; and (2) the procedure automatically returns the distribution of all parameters, from which credible intervals can easily be obtained.

Limitations:

Application

The study compared the proposed Bayesian discrete Weibull (BDW) model with the Bayesian Poisson (BPoisson) and Bayesian Negative Binomial (BNB) models on three datasets (inhaler use, health survey, health registry), where the BDW models showed superior performance. The Bayesian discrete Weibull model thus shows applicability in analyzing count data from the medical domain.

The analysis was applied to datasets (number of visits to doctors/specialists, an indicator of healthcare demand) with a discrete response variable and a skewed distribution, under-dispersion, over-dispersion, and an excess of zeros.

Methods

  • Detail the models or algorithms used.

  • Justify your choices based on the problem and data.

The common non-parametric regression model is \(Y_i = m(X_i) + \varepsilon_i\), where \(Y_i\) is the sum of the regression function value \(m(X_i)\) and an error term \(\varepsilon_i\), with \(m(\cdot)\) unknown. Under this definition, \(m(x)\) can be estimated by local averaging, i.e., by averaging the \(Y_i\) whose \(X_i\) are near \(x\). In other words, we recover the curve through the data points with the help of the surrounding data points. The estimation formula is given below:

Method and Data Preparation

Statistical software: R, with the required packages installed. We used R libraries to run the analysis. The .XPT files were converted to data frames using the haven package, which reads .XPT files.
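A minimal sketch of this step (the .XPT file names below are assumed, not the project’s actual files):

Code
library(haven)
library(dplyr)

demo <- read_xpt("DEMO_J.XPT")   # demographics (assumed file name)
mcq  <- read_xpt("MCQ_J.XPT")    # medical conditions questionnaire (assumed)
merged <- inner_join(demo, mcq, by = "SEQN")  # join on NHANES respondent ID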

Statistical Method Used:

Bayesian logistic regression (GLM) will be used to calculate the probability of CVD, with CVD as the binary response variable and priors on the coefficients:

a. A Bernoulli likelihood with a logit link is the standard, interpretable model for risk.
b. Weakly informative priors (Normal(0, 1)) shrink extreme estimates, preventing infinite or unstable odds ratios when some subgroups are rare or perfectly predict the outcome.

1. Priors:
• Shrinkage is possible when including predictors (BMI, smoking, diet, demographics, labs).
• Clinical beliefs (e.g., a positive association for age/BMI, a protective effect for HDL) can be encoded as priors to stabilize estimates and improve calibration.
2. Bayesian modeling combines well with multiple or joint imputation, propagating uncertainty instead of silently dropping cases.
3. Posterior distributions and credible intervals for both coefficients and individual risks will support clinical interpretation and shared decision-making.
4. The Bayesian logistic GLM gives an interpretable, survey-aware, uncertainty-quantified CVD risk model that is robust to sparse subgroups and easy to update.
5. A new NHANES cycle can update the calculated posterior rather than refitting from scratch, yielding a live risk model.
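A hedged sketch of the planned model with rstanarm, using the variable names defined later in this report (the final specification may differ):

Code
library(rstanarm)

fit_cvd <- stan_glm(
  CVD ~ smoking_cat + BMI_cat + age_cat + RIAGENDR + RIDRETH1,
  data   = merged_data_1,
  family = binomial(link = "logit"),
  prior  = normal(0, 1),              # weakly informative Normal(0, 1) priors
  prior_intercept = normal(0, 2.5),
  chains = 4, iter = 2000, seed = 123
)
summary(fit_cvd)   # posterior means and credible intervals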

\[ M_n(x) = \sum_{i=1}^{n} W_n(X_i)\, Y_i \tag{1} \]

Here \(W_n(X_i)\) are real-valued weights; a weight is positive and small when \(X_i\) is far from \(x\).
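A small sketch of one such local-averaging estimator (Nadaraya-Watson with a Gaussian kernel; the bandwidth `h` and all names are illustrative):

Code
# Local averaging: weights shrink as X_i moves away from x
nw_estimate <- function(x, X, Y, h = 0.5) {
  w <- dnorm((X - x) / h)     # small when X_i is far from x
  sum(w / sum(w) * Y)         # weighted average of Y_i, as in Eq. (1)
}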

Another equation:

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

Analysis and Results: Data Exploration and Visualization

  • Describe your data sources and collection process.

Data Source

NHANES 2-year cross-sectional data are used; all variables are measured at the same visit, across 5 datasets (demographics, labs, exam, diet, questionnaire), to classify current CVD status using a Bayesian logistic GLM.

Data Management

Download data frames -> select variables -> merge data frames -> create combined data frame for analysis -> code missing and special values -> categorize variables

Categorization and Creation of new variable

  1. Outcome Variable

CVD
• Composite cardiovascular disease outcome (CVD) derived by combining the MCQ160 series (“MCQ160B”, “MCQ160C”, “MCQ160D”, “MCQ160E”, “MCQ160F”).
• The original values (Yes/No) were transformed into numeric codes: “Yes” → 1, “No” → 0, Missing → NA.

  2. Predictors

Smoking Variable (SMQ030) categorization

• The numeric variable in the original dataset (SMQ030, age started smoking regularly, in years) was transformed into a categorical variable with 3 levels.
• Derived categorical variable (smoking_cat):
o Never smoker: SMQ030 = NA
o Early starter: SMQ030 ≤ 18 years
o Late starter: SMQ030 > 18 years
We found this reasonable because participants who never smoked do not report an age (never smokers), and the further categorization into early vs. late starters allows assessment of age at initiation as a risk factor. Smoking categories are included in the logistic regression model with “Never smoker” as the reference group.

Body Mass Index (BMI)

• The original variable BMDBMIC (measured BMI) in the data is a categorized variable (BMI_cat) based on percentiles with 4 levels:
o Underweight (<5th percentile)
o Normal (5th–<85th percentile)
o Overweight (85th–<95th percentile)
o Obese (≥95th percentile)
o Missing
We kept it as is, since the categorization provides clinically interpretable groups.

Age categorization: age_cat (factor, 4 levels): <20, 20-39, 40-59, 60+

Final Analytic Dataset (Response and Predictor variables)
• Outcome: CVD (binary)
• Predictors:
o SEQN (num; respondent ID)
o CVD (binary: 1 = any “Yes”, 0 = all “No”, NA if all missing)
o smoking_cat (factor, 3 levels): Never (no and missing, assuming they never started), Early (≤18 years), Late (>18 years)
o BMI_cat (factor, 4 levels): Underweight, Normal, Overweight, Obese
• Co-variates:
o Gender (factor, 2 levels): Male, Female
o Ethnicity (factor, 5 levels): Mexican American, Non-Hispanic White, Non-Hispanic Black, Other Hispanic, Other Race - Including Multi-Racial
o age_cat (factor, 4 levels): <20, 20-39, 40-59, 60+

This final dataset is interpretable, with good statistical power, and with proper handling of missing values within the data and within each variable.

Data Preparation

Variable Selection and merging of the subsets

Subsets were created from each data frame with the selected variables; NAs were counted in each column of each subset, and summed across all columns, before merging the subsets.

All variables from the different datasets were merged into a single data frame for analysis.

Data Cleaning

• Silently dropping cases means throwing out any rows with missing values (complete-case analysis). Losing sample size (power) can introduce bias if the missingness is not completely at random. In the dataset, missing values (NA) are kept as NA, and we used complete-case analysis (“listwise deletion”): in a regression with glm(), R automatically drops any row with NA in the model variables. It was the simplest approach to handle missingness, and we are aware of losing sample size and of introducing bias if the missingness is informative.

• NHANES special codes (below) were transformed to NAs, keeping the number of rows intact, since special codes are not random and cannot simply be dropped.

Type                     | Codes / Values    | Meaning
Refused                  | 7, 77, 777, 7777  | Participant refused to answer
Don’t know / Unknown     | 9, 99, 999, 9999  | Participant didn’t know or unknown
Not applicable / skipped | 8, 88, 888, 8888  | Variable not applicable or skipped
Other                    | 0                 | Sometimes “No” depending on variable (check codebook)

Exclusions

• Participants missing the CVD response in all MCQ160 series were coded as “No”.
• Participants missing smoking_cat were coded as “Never” to preserve the number of rows.
• The CDQ009F (pain in left chest) column, with a high percentage of missingness, was excluded to avoid overlap with the CVD response.

Names of the selected variables.

Code
# Handling special codes
special_codes <- c(7, 9, 77, 99)
missing_summary <- sapply(merged_data_1, function(x) sum(is.na(x) | x %in% special_codes))
missing_summary
    SEQN RIDAGEYR RIAGENDR RIDRETH1   SMD030   DSD010  CDQ009F  MCQ160B 
       0      484        0        0     7630        0    10123     4406 
 MCQ160C  MCQ160D  MCQ160E  MCQ160F  LBDTCSI  BMDBMIC 
    4406     4406     4406     4406     2552     6652 
Code
## Special codes: "Refused" = 7, 77; "Don't know" = 9, 99
special_codes <- c(7, 9, 77, 99, 888)
cols_with_special_codes <- sapply(merged_data_1, function(x) any(x %in% special_codes, na.rm = TRUE))
cols_with_special_codes[cols_with_special_codes == TRUE]
RIDAGEYR   SMD030  LBDTCSI 
    TRUE     TRUE     TRUE 
Code
# Variables to clean
vars_to_clean <- c("DSD010", "MCQ160B", "MCQ160C", "MCQ160D", "MCQ160E", "MCQ160F", "SMD030")
special_codes <- c(7, 9, 77, 99, 777, 999)

# Replace special codes with NA
for (v in vars_to_clean) {
  merged_data_1[[v]][merged_data_1[[v]] %in% special_codes] <- NA
}

# Check the number of NAs in each variable
sapply(merged_data_1[, vars_to_clean], function(x) sum(is.na(x)))
 DSD010 MCQ160B MCQ160C MCQ160D MCQ160E MCQ160F  SMD030 
      0    4406    4406    4406    4406    4406    7641 
Code
# Summary and structure of the final data frame after managing special codes
#summary(merged_data_1[, vars_to_clean])
#str(merged_data_1)
#summary(merged_data_1)

# Exclude CDQ009F (chest pain): overlaps with the CVD outcome and has >70% missing values
merged_data_1$CDQ009F <- NULL
#summary(merged_data_1)


  • Present initial findings and insights through visualizations.

Summary statistics, histograms (continuous variables), and bar plots (categorical variables) were examined before the regression analysis.

Code
table(merged_data_1$age_cat, useNA = "ifany")    # Check table distribution

  <20 20-39 40-59   60+ 
 4406  1954  1974  1841 
Code
# Tables for all variable in the merged data frame for analysis

table(merged_data_1$CVD)

   0    1 
5173  595 
Code
table(merged_data_1$smoking_cat)

Never Early  Late 
 7641  1746   788 
Code
table(merged_data_1$age_cat)

  <20 20-39 40-59   60+ 
 4406  1954  1974  1841 
Code
table(merged_data_1$RIAGENDR)

  Male Female 
  5003   5172 
Code
table(merged_data_1$BMDBMIC)

  Underweight Normal weight    Overweight         Obese 
          132          2167           595           629 
Code
table(merged_data_1$RIDRETH1)

                   Mexican American                      Other Hispanic 
                               1730                                 960 
                 Non-Hispanic White                  Non-Hispanic Black 
                               3674                                2267 
Other Race - Including Multi-Racial 
                               1544 
Code
# bar plot for categorical and histo for continuous variable

plot_bar <- function(df, var) {
  ggplot(df, aes_string(x = var)) +
    geom_bar(fill = "steelblue") +
    theme_minimal() +
    labs(title = paste("Distribution of", var),
         x = var,
         y = "Count") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}


plot_bar(merged_data_1, "CVD")

Code
plot_bar(merged_data_1, "smoking_cat")

Code
plot_bar(merged_data_1, "age_cat")

Code
plot_bar(merged_data_1, "RIAGENDR")

Code
plot_bar(merged_data_1, "BMDBMIC")

Code
plot_bar(merged_data_1, "RIDRETH1")

Code
hist(merged_data_1$RIDAGEYR)

Code
hist(merged_data_1$SMD030)

Code
hist(merged_data_1$LBDTCSI)

Unexpected reports, patterns or anomalies.

  • Age showed a skewed histogram: most participants were in the <20-year range, followed by the 40-59-year range, with a mean age of 26 years.
  • Age when smoking was first started regularly shows a concentration below 20 years.
  • The lab report on total cholesterol had a mean of 4.53 mmol/L, with a maximum of 21 mmol/L.
  • Non-Hispanic Whites were the majority in the study cohort.
  • The majority reported normal weight, followed by a high count in the obese category.
  • Females slightly outnumbered males.
  • Most participants reported never starting smoking; among smokers, most started regularly at an early age (≤18 years).
  • Most participants reported “No” when asked whether they had ever been told they have CVD (stroke, coronary disease, heart attack, infarction).

Modeling and Results

  • Explain your data preprocessing and cleaning steps.

  • Present your key findings in a clear and concise manner.

  • Use visuals to support your claims.

  • Tell a story about what the data reveals.

Conclusion

  • Summarize your key findings.

  • Discuss the implications of your results.

References

Chatzimichail, Theodora, and Aristides T. Hatjimihail. 2023. “A Bayesian Inference Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19). https://doi.org/10.3390/DIAGNOSTICS13193135.
Klauenberg, Katy, Gerd Wübbeler, Bodo Mickan, Peter Harris, and Clemens Elster. 2015. “A Tutorial on Bayesian Normal Linear Regression.” Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Leeuw, Christiaan de, and Irene Klugkist. 2012. “Augmenting Data With Published Results in Bayesian Linear Regression.” Multivariate Behavioral Research 47 (3): 369–91. https://doi.org/10.1080/00273171.2012.673957.
Liu, Yi Ming, Sam Li Sheng Chen, Amy Ming Fang Yen, and Hsiu Hsi Chen. 2013. “Individual Risk Prediction Model for Incident Cardiovascular Disease: A Bayesian Clinical Reasoning Approach.” International Journal of Cardiology 167 (5): 2008–12. https://doi.org/10.1016/J.IJCARD.2012.05.016.
Schoot, Rens van de, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G. Tadesse, Marina Vannucci, et al. 2021. “Bayesian Statistics and Modelling.” Nature Reviews Methods Primers 1 (1): 1. https://doi.org/10.1038/s43586-020-00001-2.
Zeger, Scott L., Zhenke Wu, Yates Coley, Anthony Todd Fojo, Bal Carter, Katherine O’Brien, Peter Zandi, et al. 2020. “Using a Bayesian Approach to Predict Patients’ Health and Response to Treatment.” http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=reference&D=medp&NEWS=N&AN=37708307.