Exploratory Data Analysis and Visualization

1 Learning Objectives

In this session, we will learn common functions for exploring and visualizing data, using the synthetic dataset we worked with last time.

At the end of Session 3, you will be able to:

  1. Use R commands to understand data.
  2. Perform descriptive statistics.
  3. Visualize data with box plots, histograms, and scatter plots.

2 Setup

We will begin by loading packages and reading in a file.

library(tidyverse)
df <- read_csv("https://raw.githubusercontent.com/yesols/aou-workshop/refs/heads/main/data/final_dataset.csv")
head(df)
# A tibble: 6 × 10
  person_id year_of_birth race    sex_at_birth condition_concept_id
      <dbl>         <dbl> <chr>   <chr>                       <dbl>
1         1          1987 Unknown Male                           NA
2         2          1987 Asian   Female                         NA
3         3          2004 Asian   Male                           NA
4         4          1944 Black   Male                           NA
5         5          1941 Asian   Male                           NA
6         6          2006 Other   Male                           NA
# ℹ 5 more variables: condition_start_date <date>,
#   condition_source_value <chr>, drug_concept_id <dbl>,
#   drug_exposure_start_date <date>, SBP <dbl>

This is the file we saved after merging data frames in the previous session, except that I added a column, SBP, for systolic blood pressure.

3 Create Necessary Variables for Analysis

We will create an age column from year_of_birth.

df <- df %>% mutate(
  age = 2026 - year_of_birth
)
head(df)
# A tibble: 6 × 11
  person_id year_of_birth race    sex_at_birth condition_concept_id
      <dbl>         <dbl> <chr>   <chr>                       <dbl>
1         1          1987 Unknown Male                           NA
2         2          1987 Asian   Female                         NA
3         3          2004 Asian   Male                           NA
4         4          1944 Black   Male                           NA
5         5          1941 Asian   Male                           NA
6         6          2006 Other   Male                           NA
# ℹ 6 more variables: condition_start_date <date>,
#   condition_source_value <chr>, drug_concept_id <dbl>,
#   drug_exposure_start_date <date>, SBP <dbl>, age <dbl>
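
The year 2026 is hard-coded in the calculation above. If you would rather have it follow the calendar, one option is to take the current year from Sys.Date(). Below is a minimal sketch on a made-up three-row tibble; toy_df and current_year are illustrative names, not part of the workshop data.

```r
library(dplyr)

# Made-up example rows; the values are illustrative only
toy_df <- tibble(person_id = 1:3, year_of_birth = c(1987, 2004, 1944))

# Current calendar year as a number (for example, 2026 when run in 2026)
current_year <- as.numeric(format(Sys.Date(), "%Y"))

toy_df <- toy_df %>%
  mutate(age = current_year - year_of_birth)

toy_df
```

Either way, this is age by calendar year only. It ignores whether the birthday has passed, which is all that year_of_birth allows.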

It is also useful to have binary variables that indicate each participant's disease status and drug exposure status. We will use the if_else() function for this purpose. if_else() takes three arguments: the condition to evaluate, the value to return if the condition is true, and the value to return if it is false.

df <- df %>%
  mutate(
    condition_status = if_else(is.na(condition_concept_id), 0, 1), # 0 if condition_concept_id is NA, 1 if not
    exposure_status = if_else(is.na(drug_concept_id), 0, 1) # 0 if drug_concept_id is NA, 1 if not
  )
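
To see the three arguments in action, here is a minimal sketch on a made-up vector. The name concept_id and its values are used only as sample input.

```r
library(dplyr)

# Hypothetical vector: NA means there is no condition record for that person
concept_id <- c(NA, 4182210, NA, 4128031)

# Arguments in order: condition, value if TRUE, value if FALSE
status <- if_else(is.na(concept_id), 0, 1)
status
# [1] 0 1 0 1
```

Unlike base R's ifelse(), if_else() requires the true and false values to have compatible types, so mixing a number and a character string (for example 0 and "yes") raises an error instead of silently converting.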

4 Contingency Table

Let’s take a quick look at the breakdown of MCI/dementia cases by sex.

df %>% count(sex_at_birth, condition_status)
# A tibble: 6 × 3
  sex_at_birth condition_status     n
  <chr>                   <dbl> <int>
1 Female                      0    28
2 Female                      1    16
3 Male                        0    37
4 Male                        1    11
5 Other                       0     5
6 Other                       1     3

The janitor package has a nice cross-tabulation function, tabyl():

library(janitor)
df %>% tabyl(sex_at_birth, condition_status)
 sex_at_birth  0  1
       Female 28 16
         Male 37 11
        Other  5  3
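
Raw counts are hard to compare when the groups differ in size, so it can help to look at proportions within each group. The sketch below rebuilds the counts shown above by hand (so that it runs on its own) and adds a within-group proportion using dplyr only; counts, props, and prop are illustrative names.

```r
library(dplyr)

# Counts copied from the table above
counts <- tibble(
  sex_at_birth     = c("Female", "Female", "Male", "Male", "Other", "Other"),
  condition_status = c(0, 1, 0, 1, 0, 1),
  n                = c(28, 16, 37, 11, 5, 3)
)

# Share of each condition_status within each sex_at_birth group
props <- counts %>%
  group_by(sex_at_birth) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

props
```

With the real data frame, the same pipeline can start from df %>% count(sex_at_birth, condition_status). The janitor package also offers helpers such as adorn_percentages("row") that can be piped after tabyl() for a similar result.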

5 Summarize Data

The summary() function gives a quick overview of the data and its distribution. It also shows the count of NAs in each column that has any. You can use summary() on a vector (or a single column of a data frame) or on the entire data frame.

summary(df$SBP)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   84.0   108.0   119.0   117.2   127.0   149.0       5 
summary(df)
   person_id      year_of_birth      race           sex_at_birth      
 Min.   :  1.00   Min.   :1936   Length:100         Length:100        
 1st Qu.: 25.75   1st Qu.:1957   Class :character   Class :character  
 Median : 50.50   Median :1972   Mode  :character   Mode  :character  
 Mean   : 50.50   Mean   :1973                                        
 3rd Qu.: 75.25   3rd Qu.:1990                                        
 Max.   :100.00   Max.   :2007                                        
                                                                      
 condition_concept_id condition_start_date condition_source_value
 Min.   :4128031      Min.   :1988-02-15   Length:100            
 1st Qu.:4128031      1st Qu.:2016-04-30   Class :character      
 Median :4128031      Median :2026-12-22   Mode  :character      
 Mean   :4151509      Mean   :2028-10-20                         
 3rd Qu.:4182210      3rd Qu.:2041-07-23                         
 Max.   :4182210      Max.   :2071-04-19                         
 NA's   :70           NA's   :70                                 
 drug_concept_id    drug_exposure_start_date      SBP             age       
 Min.   :19073183   Min.   :1987-09-16       Min.   : 84.0   Min.   :19.00  
 1st Qu.:19073183   1st Qu.:2013-02-12       1st Qu.:108.0   1st Qu.:35.75  
 Median :19073183   Median :2025-05-28       Median :119.0   Median :54.50  
 Mean   :19073183   Mean   :2029-02-16       Mean   :117.2   Mean   :52.91  
 3rd Qu.:19073183   3rd Qu.:2043-04-05       3rd Qu.:127.0   3rd Qu.:69.00  
 Max.   :19073183   Max.   :2080-10-17       Max.   :149.0   Max.   :90.00  
 NA's   :65         NA's   :65               NA's   :5                      
 condition_status exposure_status
 Min.   :0.0      Min.   :0.00   
 1st Qu.:0.0      1st Qu.:0.00   
 Median :0.0      Median :0.00   
 Mean   :0.3      Mean   :0.35   
 3rd Qu.:1.0      3rd Qu.:1.00   
 Max.   :1.0      Max.   :1.00   
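
summary() reports NA counts only for numeric and date columns. If you want the number of missing values for every column in one place, one possible approach is summarise() with across(). This is a minimal sketch on made-up data; toy_df and na_counts are illustrative names.

```r
library(dplyr)

# Made-up data with some missing values
toy_df <- tibble(
  SBP             = c(120, NA, 135, 110, NA),
  drug_concept_id = c(19073183, NA, NA, 19073183, NA),
  age             = c(39, 22, 82, 85, 20)
)

# One row, with one column per variable holding that variable's NA count
na_counts <- toy_df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

The base R equivalent is colSums(is.na(toy_df)), which returns a named vector instead of a tibble.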
                                 

Grouped summaries using group_by() and summarise() can be helpful.

df %>%
  group_by(sex_at_birth) %>%
  summarise(
    n = n(),
    mean_SBP = mean(SBP, na.rm = TRUE),
    sd_SBP = sd(SBP, na.rm = TRUE),
    median_age = median(age, na.rm = TRUE)
  )
# A tibble: 3 × 5
  sex_at_birth     n mean_SBP sd_SBP median_age
  <chr>        <int>    <dbl>  <dbl>      <dbl>
1 Female          44     116.  16.6        51.5
2 Male            48     118.  13.1        56.5
3 Other            8     117.   9.05       59.5
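
The na.rm = TRUE argument matters here because SBP contains missing values. A short illustration with made-up numbers:

```r
sbp <- c(120, NA, 110)   # made-up values

mean(sbp)                # NA: a single missing value makes the result missing
mean(sbp, na.rm = TRUE)  # 115: computed from the two non-missing values
sum(!is.na(sbp))         # 2: how many values the mean is actually based on
```

Note that n = n() counts all rows in a group, including those with missing SBP. If you want to report how many values each mean is based on, a line such as n_SBP = sum(!is.na(SBP)) can be added inside summarise().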

6 Histogram

A histogram is a good way to visualize the distribution of continuous data. I like base R commands for quick visualization on the fly, but ggplot() from the ggplot2 package (already loaded with tidyverse) is highly customizable and recommended for publication-quality plots. Below, you can see the code for both.

# Base R histogram
hist(df$age, breaks = 20) # breaks suggests the number of bins; raise it for narrower bins, lower it for wider ones

# ggplot2 histogram
ggplot(df, aes(x = age)) +
  geom_histogram(
    binwidth = 5, 
    fill = "steelblue", 
    color = "black"
  ) +
  labs(
    title = "Age Distribution",
    x = "Age (Years)",
    y = "Frequency"
  ) +
  theme_minimal()
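
The two versions control bins differently. In geom_histogram(), binwidth sets the width of each bin directly. In base R, a single number passed to breaks is only a suggestion, and hist() chooses round break points close to it. One way to check what was actually used is to ask hist() for its result without drawing the plot. The sketch below uses randomly generated ages, not the workshop data.

```r
set.seed(1)
ages <- sample(19:90, 100, replace = TRUE)  # made-up ages

# plot = FALSE returns the histogram object without drawing it
h <- hist(ages, breaks = 20, plot = FALSE)

h$breaks          # the bin edges R actually chose
length(h$counts)  # the number of bins, which may differ from 20
sum(h$counts)     # 100: every value falls in exactly one bin
```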

7 Box Plot

Box plots are good for visualizing continuous data broken down by group. Here, we will visualize the distribution of SBP for each sex_at_birth category. Again, I will show both the base R version and the ggplot2 version.

# Base R box plot using the formula interface
boxplot(
  SBP ~ sex_at_birth, 
  data = df
)

# ggplot2 box plot
ggplot(df, aes(x = sex_at_birth, y = SBP, fill = sex_at_birth)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Systolic Blood Pressure by Sex",
    x = "Sex at Birth",
    y = "SBP (mmHg)"
  ) +
  theme_minimal() +
  theme(legend.position = "none") # Suppress redundant legend
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Warning

Take note of the warning from the ggplot2 version saying that 5 rows were removed. This usually happens when the plotted variable contains NAs, and we already know from summary() that the SBP column has 5 of them. If a warning like this is unexpected, inspect the data and handle the missing or out-of-range values before plotting.
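
One way to make the removal explicit, rather than leaving it to ggplot2, is to count the missing values and filter them out before plotting. This is a minimal sketch on made-up data; toy_df, n_missing, and plot_df are illustrative names.

```r
library(dplyr)

# Made-up data with two missing SBP values
toy_df <- tibble(
  sex_at_birth = c("Female", "Male", "Male", "Other", "Female"),
  SBP          = c(118, NA, 127, 109, NA)
)

# How many rows would be dropped from the plot?
n_missing <- sum(is.na(toy_df$SBP))

# Keep only rows with a recorded SBP
plot_df <- toy_df %>% filter(!is.na(SBP))

n_missing      # 2
nrow(plot_df)  # 3
```

Passing plot_df to ggplot() instead of the original data frame draws the same box plot without the warning, and n_missing documents how many rows were left out.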

8 Scatter Plot

We use scatter plots for visualizing the relationship between two continuous variables. We will plot SBP against age.

# Base R scatter plot
plot(df$age, df$SBP)

# ggplot2 scatter plot
ggplot(df, aes(x = age, y = SBP)) +
  geom_point(color = "darkblue", alpha = 0.5) +
  labs(
    title = "Systolic Blood Pressure vs. Age",
    x = "Age (Years)",
    y = "SBP (mmHg)"
  ) +
  theme_minimal()
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).
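
A common variation is to color the points by a grouping variable by mapping it inside aes(). The sketch below is one possible version using made-up values, so toy_df and its numbers are assumptions for illustration only.

```r
library(ggplot2)

# Made-up values for illustration
toy_df <- data.frame(
  age          = c(25, 40, 55, 70, 85, 33),
  SBP          = c(105, 118, 122, 131, 140, 112),
  sex_at_birth = c("Female", "Male", "Female", "Male", "Other", "Other")
)

# Mapping color inside aes() gives each group its own color and adds a legend
p <- ggplot(toy_df, aes(x = age, y = SBP, color = sex_at_birth)) +
  geom_point(alpha = 0.7) +
  labs(x = "Age (Years)", y = "SBP (mmHg)", color = "Sex at Birth") +
  theme_minimal()

p
```

Note the difference from color = "darkblue" in the plot above: a color set outside aes() is a fixed value applied to all points, while a color mapped inside aes() varies with the data.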

9 Preparing for Next Session

In Session 4, we will learn how to perform basic statistical tests in R. Because this workshop focuses on the technical aspects of programming, discussion of statistical concepts will be minimal. Review the following methods as needed before the next session:

  • t-test and ANOVA
  • chi-square test
  • linear regression
  • logistic regression (the WWAMI Research Methods course does not cover this, so we will go over it in a little more detail than the others)

We will also apply the graphical and statistical functions to the real All of Us data. If you have not already done so, make sure you have a dataset prepared for your own project by the next session. It does not have to be perfect; constructing a dataset always takes a long time. At a minimum, the dataset should have demographic variables and an outcome variable.