Statistical Analysis and JupyterLab Tips

1 Learning Objectives

In this last session, we will perform statistical tests and fit models that are common in medical research. Detailed discussions on statistics are beyond the scope of this workshop. If you want to learn more about these tests, check out links provided in Additional Resources.

At the end of Session 4, learners will be able to:

Perform t-test, ANOVA, chi-square tests in R.
Fit regression models: simple linear regression, multiple linear regression, logistic regression
Obtain odds ratios from logistic regression models.
Familiarize with navigating JupyterLab

2 Setup

We will load packages and read in the data file. In addition to tidyverse, we will also import broom package which has convenient functions to make results output very readable.

library(tidyverse)
library(broom)

df <- read_csv("https://raw.githubusercontent.com/yesols/aou-workshop/refs/heads/main/data/final_dataset2.csv")
head(df)

# A tibble: 6 × 14
  person_id year_of_birth race    sex_at_birth condition_concept_id
      <dbl>         <dbl> <chr>   <chr>                       <dbl>
1         1          1987 Unknown Male                           NA
2         2          1987 Asian   Female                         NA
3         3          2004 Asian   Male                           NA
4         4          1944 Black   Male                           NA
5         5          1941 Asian   Male                           NA
6         6          2006 Other   Male                           NA
# ℹ 9 more variables: condition_start_date <date>,
#   condition_source_value <chr>, drug_concept_id <dbl>,
#   drug_exposure_start_date <date>, true_exposure_status <chr>, SBP <dbl>,
#   age <dbl>, condition_status <dbl>, exposure_status <dbl>

3 Two-Sample T-Test

We use t-test for continuous vs binary variables. Let’s test if there is a statistically significant difference in mean Systolic Blood Pressure (SBP) between participants with and without the condition (condition_status).

By default, R executes a Welch two-sample t-test, which does not assume equal variances between groups. This is generally the preferred approach for clinical data.

t_result <- t.test(SBP ~ condition_status, data = df)
print(t_result)


    Welch Two Sample t-test

data:  SBP by condition_status
t = 0.20451, df = 74.577, p-value = 0.8385
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -4.953265  6.086527
sample estimates:
mean in group 0 mean in group 1 
       117.3881        116.8214

4 One-Way ANOVA

For testing continuous variables in more than 2 categories in a variable, we use one-way ANOVA. Let’s test if there is a statistical difference in mean SBP among sex groups (male, female, other)

# Execute One-Way ANOVA
anova_model <- aov(SBP ~ sex_at_birth, data = df)

# Output the ANOVA table (Degrees of freedom, Sum of Squares, F-value, p-value)
summary(anova_model)

             Df Sum Sq Mean Sq F value Pr(>F)
sex_at_birth  2    174   87.08   0.416  0.661
Residuals    92  19258  209.33               
5 observations deleted due to missingness

5 Pearson’s Chi-Square Test

If you want to test associations between categorical variables, use Pearson’s Chi-square test. In our example, we will see if drug exposure (exposure_status) is associated with MCI/dementia (condition_status).

To execute a Chi-squared test, you must pass a contingency table into the chisq.test() function.

# Generate the base R contingency table and pass it to the test
chi_sq_result <- chisq.test(table(df$condition_status, df$exposure_status))
print(chi_sq_result)


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(df$condition_status, df$exposure_status)
X-squared = 20.931, df = 1, p-value = 4.76e-06

Note

If any expected cell count is less than 5, R will generate a warning. In such cases, fisher.test() should be used instead.

6 Simple Linear Regression

To see the relationship a continuous dependent variable and a continuous independent variable, we can use simple linear regression. We will fit a linear regression model for SBP and age. We use lm() function for fitting linear regression models.

# Fit the linear model
lm_model <- lm(SBP ~ age, data = df)

# Base R summary (verbose)
summary(lm_model)


Call:
lm(formula = SBP ~ age, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-30.1453  -6.0564   0.9969   5.9673  22.3843 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 90.34318    2.92574  30.879  < 2e-16 ***
age          0.50592    0.05148   9.827 4.75e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.12 on 93 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.5094,    Adjusted R-squared:  0.5041 
F-statistic: 96.57 on 1 and 93 DF,  p-value: 4.754e-16

You can use tidy() function from broom package to display the model neatly as a dataframe:

library(broom)
tidy(lm_model)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   90.3      2.93       30.9  1.13e-50
2 age            0.506    0.0515      9.83 4.75e-16

We learned how to create scatter plots in the last session. We can overlay regression fit to the scatter plot easily with ggplot2, using geom_smooth().

ggplot(df, aes(x = age, y = SBP)) +
  geom_point(alpha = 0.5, color = "darkblue") +
  geom_smooth(
    method = "lm", 
    color = "red", 
    fill = "lightcoral", 
    alpha = 0.3
  ) +
  labs(
    title = "Linear Regression: Systolic Blood Pressure by Age",
    x = "Age (Years)",
    y = "Systolic Blood Pressure (mmHg)"
  ) +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).

7 Multiple Linear Regression

We use multiple linear regression to see the relationship between one continuous dependent variable and two or more independent variables. In medical research, multiple linear regression is often used as a way to adjust for confounding variables.

We will fit a model to describe the relationship between SBP and age again, but will add sex. To include more independent variables, we simply add variable names to the formula using +:

lm_adj_model <- lm(SBP ~ age + sex_at_birth, data = df)
summary(lm_adj_model)


Call:
lm(formula = SBP ~ age + sex_at_birth, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-29.3473  -5.4471   0.5786   6.7308  21.8585 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       89.63749    3.13686  28.576  < 2e-16 ***
age                0.50411    0.05207   9.681 1.19e-15 ***
sex_at_birthMale   1.31476    2.19149   0.600    0.550    
sex_at_birthOther  1.86637    4.18336   0.446    0.657    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.21 on 91 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.5118,    Adjusted R-squared:  0.4957 
F-statistic:  31.8 on 3 and 91 DF,  p-value: 3.764e-14

Using broom:

tidy(lm_adj_model)

# A tibble: 4 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)         89.6      3.14      28.6   3.14e-47
2 age                  0.504    0.0521     9.68  1.19e-15
3 sex_at_birthMale     1.31     2.19       0.600 5.50e- 1
4 sex_at_birthOther    1.87     4.18       0.446 6.57e- 1

8 Logistic Regression

Because logistic regression is not discussed in your Research Methods course, I will provide a little bit of background before we learn the R command.

8.1 Theoretical Background

Standard linear regression predicts a continuous outcome \(Y\) based on a linear combination of predictors \(X\): \[Y=\beta_0+\beta_1X_1+\dots+\beta_pX_p\]

When applied to a binary clinical outcome (absence or presence of a disease, benign or malignant. alive or dead), the linear fit does not make sense when trying to model binary outcomes.

Suppose we have a dataset of tumor sizes from 200 patients and whether they are benign (0) or malignant (1). A linear regression model (red line) would look like this:

It does not seem like a good fit at all!

For this kind of scenarios, we use logistic regression. In a logistic regression model, the outcome \(Y\) is “logit”-transformed, i.e., turned into log-odds. \[\ln\left(\frac{p}{1-p}\right)=\beta_0+\beta_1X_1+\dots+\beta_pX_p\] Solving this equation for \(p\), the probability of a given outcome (tumor is malignant): \[p=\frac{1}{1+e^{-(\beta_0+\beta_1X_1+\dots+\beta_pX_p)}}\]

Now we have a valid probability that enables us to predict malignancy based on the tumor size.

The fact that logistic regression models the log-odds of a binary outcome provides a useful mathematical property: the coefficients, \(\beta_1, \beta_2, \dots\), of the independent variables, \(X_1, X_2, \dots\), represent the log of the odds ratio for a one-unit increase in those variables. If the independent variables are binary (e.g., \(1\) for exposed and \(0\) for unexposed), the coefficient specifically represents the log odds ratio of the outcome for the exposed group compared to the unexposed group. That means the exponent of a coefficient, i.e., \(e^{\beta_1}\), etc, represent odds ratio of the corresponding variable.

So, logistic regression is primarily utilized for two purposes: predictive modeling and effect estimation. In predictive modeling, it is used to calculate the probability of a binary outcome. In inferential statistics, it is used to calculate adjusted odds ratios to measure the association between exposures and outcomes. This makes it the standard modeling approach for retrospective case-control studies, where calculating relative risk is mathematically invalid.

8.2 Application

We will fit a logistic model on our practice dataset now, to see if there is any association between drug exposure and dementia outcome, while adjusting for age and sex. We use glm() function for this purpose. The argument, family = "binomial", specifies that we want logit link (i.e., \(\ln\left(\frac{p}{1-p}\right)\)), which is what makes it logistic regression as opposed to other types.

# Fit the multivariable logistic regression model
logistic_model <- glm(
  condition_status ~ exposure_status + age + sex_at_birth,
  data = df,
  family = "binomial"
)
summary(logistic_model)


Call:
glm(formula = condition_status ~ exposure_status + age + sex_at_birth, 
    family = "binomial", data = df)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -2.18139    0.83531  -2.611  0.00902 ** 
exposure_status    2.15673    0.50529   4.268 1.97e-05 ***
age                0.01018    0.01301   0.783  0.43383    
sex_at_birthMale  -0.39916    0.52993  -0.753  0.45131    
sex_at_birthOther  0.22907    0.90841   0.252  0.80091    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 122.173  on 99  degrees of freedom
Residual deviance:  98.024  on 95  degrees of freedom
AIC: 108.02

Number of Fisher Scoring iterations: 4

You can obtain the odds ratio by expoentiating the coefficients manually, using exp() function, or the tidy() function from broom package can do this for you as well as confidence intervals.

# Extract exponentiated coefficients (adjusted Odds Ratios) and CIs
library(broom)
tidy_results <- tidy(
  logistic_model,
  exponentiate = TRUE,
  conf.int = TRUE
)
print(tidy_results)

# A tibble: 5 × 7
  term              estimate std.error statistic   p.value conf.low conf.high
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)          0.113    0.835     -2.61  0.00902     0.0191     0.525
2 exposure_status      8.64     0.505      4.27  0.0000197   3.31      24.3  
3 age                  1.01     0.0130     0.783 0.434       0.985      1.04 
4 sex_at_birthMale     0.671    0.530     -0.753 0.451       0.235      1.91 
5 sex_at_birthOther    1.26     0.908      0.252 0.801       0.193      7.35

Model Diagnostics

Regression methods such as linear regression and logistic regression rely on statistical assumptions about underlying data. Before making an inference based on a modeling result, it is critical to check these assumptoms through model diagnostics. Refer to your statistics text or Additional Resources link on the left sidebar to learn more.

9 Navigating the Notebook: Modes and Shortcuts

When you click inside a cell to type code or text, you enter Edit Mode (indicated by a blinking cursor and often a highlighted border). However, if you press the Esc key or click outside the text box, you switch to Command Mode which allows you delete, move, insert cells.

Mastering a few Command Mode keyboard shortcuts will drastically speed up your workflow. Press Esc to ensure you are in Command Mode, and try these keystrokes:

A: Insert a new cell Above the current one.
B: Insert a new cell Below the current one.
D, D (press D twice quickly): Delete the current cell.
Z: Undo a deleted cell.
Shift + Enter: Execute the code (or format the Markdown) in the current cell and advance to the next one.

10 Documenting with Markdown Cells

Markdown is a lightweight markup language that is easy to use for simple text formatting. In Jupyter Notebooks, markdown is often used for sectioning the documents and annotating the code. You can change a cell from “Code” to “Markdown”, type in the section headings or your comments, and running the cell will render the markdown formatting.

Here are a few common markdown examples:

# Main Heading
## Subheading
### Smaller Heading

**Bold Text**
*Italic Text*

- Bullet point 1
- Bullet point 2
  - Indented bullet point

1. Numbered list
2. Numbered list

And here is what that exact formatting looks like once the cell is executed:

Main Heading

Subheading

Smaller Heading

Bold Text
Italic Text

Bullet point 1
Bullet point 2
- Indented bullet point

Numbered list
Numbered list

11 Pivoting Survey Data (Long to Wide)

If you pull multiple survey items, the query result will be in a “long” format, where each row corresponds to a single survey item per person. This means that there will be multiple rows per person. We typically want a dataframe where each row is a unique participant and questions are columns with answers as values. We can reshape the dataframe from a long format to a wide format using pivot_wider() function.

First, let’s look at a mock example of what your raw survey data might look like:

# Create a mock "long" survey dataframe
survey_long <- data.frame(
  person_id = c(1, 1, 1, 2, 2, 2),
  question = c("1535717", "1585723", "1585729", "1535717", "1585723", "1585729"),
  T_DISP_answer = c("Yes", "Never", "Good", "No", "Always", "Fair")
)

head(survey_long)

  person_id question T_DISP_answer
1         1  1535717           Yes
2         1  1585723         Never
3         1  1585729          Good
4         2  1535717            No
5         2  1585723        Always
6         2  1585729          Fair

Now, we will pivot this dataframe so the question IDs become our new column names, and the T_DISP_answers become the values inside those columns.

# Reshape to wide format
survey_wide <- survey_long %>%
  pivot_wider(
    id_cols = person_id,               # The column that identifies unique rows
    names_from = question,             # The column containing your new column names
    values_from = T_DISP_answer,       # The column containing the actual answers
  )

head(survey_wide)

# A tibble: 2 × 4
  person_id `1535717` `1585723` `1585729`
      <dbl> <chr>     <chr>     <chr>    
1         1 Yes       Never     Good     
2         2 No        Always    Fair

12 Other Tips

Tab completion: Use tab key to see suggested functions or to drill down a file path.

Datetime: In All of Us data, you will often see a datetime column with values in a format, “YYYY-MM-DD 00:00:00 UTC”. When you read in a CSV file, this type may be read as character data type. In that case, use as_date() function from lubridate package (included when you load tidyverse).

R objects vs. files: When you run commands like df <- read_csv("my_data.csv"), you are reading a copy of the file contents into the Jupyter environment as an R object (the dataframe df). This object exists solely in memory (RAM), not in any directories unlike files. To list all those objects that are currently in your environment, use ls() function. Also note that if your notebook kernel restarts or your session times out, all R objects are erased from memory and must be reloaded by re-running your code.

%in% operator: %in% checks whether an element exists in a vector. This can be useful when you want to filter rows with certain values. For example: person_df %>% filter(sex_at_birth %in% c("Male", "Female"))

Where am I?: Use getwd() function to check the current directory you’re in. This may become relevant when you’re constructing a relative path of a file.