Introduction to R | Yesol Sapozhnikov

This short tutorial introduces basic R syntax and useful commands to help you get started with manipulating and analyzing All of Us datasets. To learn more about R programming, check out the following online books:

Example 1
Example 2

1. R Basics

Hello World and Variables

In R, we typically use the assignment operator <- to create variables.

# Displaying a simple string
print("Hello, All of Us!")

# Variables and basic arithmetic
age <- 30
years_in_study <- 5
total_time <- age + years_in_study

# Printing the variable value
print(total_time)

Output:

[1] "Hello, All of Us!"
[1] 35
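
As a small supplementary sketch, variables can hold values other than numbers, and class() reports what kind of value a variable holds. The names participant_name and is_enrolled are hypothetical and used only for illustration.

```r
# Numeric, character, and logical values (names are illustrative only)
age <- 30
participant_name <- "Example participant"
is_enrolled <- TRUE

# class() reports the type of each variable
print(class(age))
print(class(participant_name))
print(class(is_enrolled))

# paste() joins values into one string, separated by a space by default
summary_text <- paste("Age in five years:", age + 5)
print(summary_text)
```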

Data Structures: Vectors and Data Frames

Vectors are sequences of data elements of the same basic type (created with c()).
Data Frames are tables where columns can contain different types of data.

# A vector of participant IDs
participant_ids <- c(1001, 1002, 1003, 1004)
print(participant_ids[1]) # R indexes start at 1, not 0!

# Creating a Data Frame from scratch
df <- data.frame(
  person_id = c(1001, 1002, 1003, 1004),
  age_at_enrollment = c(65, 42, 71, 35),
  gender_concept_id = c(8507, 8532, 8507, 8532),
  blood_pressure = c(130, 120, 145, 110)
)
print(df$age_at_enrollment) # Access a column using $

Output:

[1] 1001
[1] 65 42 71 35
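
To illustrate the "same basic type" rule for vectors, the sketch below mixes numbers and a string in c() and then inspects a small data frame with str(). The site column is a made-up example and not part of any All of Us table.

```r
# Mixing numbers and a string in c() coerces every element to character
mixed <- c(1001, "1002", 1003)
print(class(mixed))

# A data frame keeps a separate type for each column
example_df <- data.frame(
  person_id = c(1001, 1002),
  site = c("A", "B"),          # hypothetical column, for illustration only
  stringsAsFactors = FALSE     # keep strings as character on older R versions
)

# str() shows the type of every column at a glance
str(example_df)
```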

2. Introduction to Tidyverse

The tidyverse is a collection of R packages (including dplyr for manipulation and ggplot2 for visualization) that share a common design philosophy.

Loading the Library

# Load the core tidyverse packages
library(tidyverse)

Reading and Writing Data Files

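The example below assumes that a file named my_data.csv exists in your working directory and contains the same columns as the data frame created above. If it does not, you can create it first with write_csv(df, "my_data.csv").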

# Reading a CSV file into a Data Frame (tibble)
# read_csv() (from readr) is typically faster than base R's read.csv() and returns a tibble
df <- read_csv("my_data.csv")

# Display the first few rows
print(head(df))

# Saving the data frame back to a new CSV file
write_csv(df, "modified_data.csv")
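
If you want to try reading and writing without an existing file, the following sketch does a full round trip through a temporary file. The table contents are illustrative, and show_col_types = FALSE assumes readr 2.0 or later.

```r
library(readr)

# Small illustrative table with two of the tutorial's columns
example_df <- data.frame(
  person_id = c(1001, 1002),
  age_at_enrollment = c(65, 42)
)

# tempfile() gives a throwaway path, so nothing in your workspace is overwritten
csv_path <- tempfile(fileext = ".csv")
write_csv(example_df, csv_path)

# show_col_types = FALSE silences the column-specification message
round_trip <- read_csv(csv_path, show_col_types = FALSE)
print(round_trip)
```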

3. Data Manipulation: Selection and Recoding

Filtering Rows

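The filter() function keeps only the rows that satisfy one or more conditions. The pipe operator %>% (from magrittr, loaded with the tidyverse) passes the result of one step as the first argument of the next. R 4.1 and later also provide the native pipe |>, which behaves the same way in the examples here.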

# Filter rows where age is over 60
older_participants <- df %>%
  filter(age_at_enrollment > 60)

print(older_participants)

Output:

# A tibble: 2 × 4
  person_id age_at_enrollment gender_concept_id blood_pressure
      <dbl>             <dbl>             <dbl>          <dbl>
1      1001                65              8507            130
2      1003                71              8507            145
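
filter() also accepts several conditions at once. The sketch below rebuilds a small copy of the example data so that it runs on its own; the thresholds are arbitrary and chosen only for illustration.

```r
library(dplyr)

example_df <- data.frame(
  person_id = c(1001, 1002, 1003, 1004),
  age_at_enrollment = c(65, 42, 71, 35),
  blood_pressure = c(130, 120, 145, 110)
)

# Comma-separated conditions are combined with AND
older_high_bp <- example_df %>%
  filter(age_at_enrollment > 60, blood_pressure > 140)

# Use | for OR
young_or_high_bp <- example_df %>%
  filter(age_at_enrollment < 40 | blood_pressure > 140)

print(older_high_bp$person_id)
print(young_or_high_bp$person_id)
```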

Recoding Values

Use mutate() to create or modify columns, and case_match() (available in dplyr 1.1.0 and later; recode() is the older alternative) to map concept IDs to labels.

# Create a new column 'gender_label' based on concept IDs
df <- df %>%
  mutate(gender_label = case_match(gender_concept_id,
    8507 ~ "Male",
    8532 ~ "Female",
    .default = "Other/Unknown"
  ))

# Select specific columns to view
print(df %>% select(person_id, gender_concept_id, gender_label))

Output:

# A tibble: 4 × 3
  person_id gender_concept_id gender_label
      <dbl>             <dbl> <chr>       
1      1001              8507 Male        
2      1002              8532 Female      
3      1003              8507 Male        
4      1004              8532 Female      
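
When the new column depends on a condition rather than an exact value, case_when() is a related option inside mutate(). The following is a minimal sketch; the age cutoffs and group labels are hypothetical.

```r
library(dplyr)

example_df <- data.frame(
  person_id = c(1001, 1002, 1003, 1004),
  age_at_enrollment = c(65, 42, 71, 35)
)

# Conditions are checked top to bottom; the first one that matches wins
example_df <- example_df %>%
  mutate(age_group = case_when(
    age_at_enrollment >= 65 ~ "65+",
    age_at_enrollment >= 40 ~ "40-64",
    TRUE ~ "Under 40"   # fallback for every remaining row
  ))

print(example_df)
```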

4. Merging Data Frames

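Join functions such as inner_join() and left_join() merge two tables on a common key column. inner_join() keeps only rows whose key appears in both tables, while left_join() keeps every row of the first table.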

# Create a second data frame for lab results
labs_df <- data.frame(
  person_id = c(1001, 1004, 1005),
  cholesterol = c(210, 185, 225)
)

# Perform an inner join (keeps only matching IDs: 1001 and 1004)
merged_df <- inner_join(df, labs_df, by = "person_id")

print(merged_df)

Output:

# A tibble: 2 × 6
  person_id age_at_enrollment gender_concept_id blood_pressure gender_label cholesterol
      <dbl>             <dbl>             <dbl>          <dbl> <chr>              <dbl>
1      1001                65              8507            130 Male                 210
2      1004                35              8532            110 Female               185
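
For comparison with the inner join, here is a sketch of left_join() and anti_join() on the same kind of data. The tables are rebuilt under the names participants and labs (chosen for this example) so that the block runs on its own.

```r
library(dplyr)

participants <- data.frame(
  person_id = c(1001, 1002, 1003, 1004),
  age_at_enrollment = c(65, 42, 71, 35)
)
labs <- data.frame(
  person_id = c(1001, 1004, 1005),
  cholesterol = c(210, 185, 225)
)

# left_join keeps every row of the left table; rows without a match get NA
left_merged <- left_join(participants, labs, by = "person_id")
print(left_merged)

# anti_join returns the left-table rows that have no match in the right table
no_labs <- anti_join(participants, labs, by = "person_id")
print(no_labs$person_id)
```

Checking the anti_join() result before an inner join is one way to see how many participants would be dropped by the merge.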

5. Descriptive Statistics and Visualization

Basic Statistics

Use group_by() and summarize() to calculate aggregated statistics.

# Calculate mean age and count by gender
summary_stats <- df %>%
  group_by(gender_label) %>%
  summarize(
    mean_age = mean(age_at_enrollment),
    count = n()
  )

print(summary_stats)

Output:

# A tibble: 2 × 3
  gender_label mean_age count
  <chr>           <dbl> <int>
1 Female           38.5     2
2 Male             68       2
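
Real datasets often contain missing values, and mean() returns NA if any input is NA. The sketch below shows one way to handle this inside summarize(); the missing blood pressure value is inserted artificially for illustration.

```r
library(dplyr)

example_df <- data.frame(
  gender_label = c("Male", "Female", "Male", "Female"),
  blood_pressure = c(130, NA, 145, 110)   # NA added on purpose
)

summary_bp <- example_df %>%
  group_by(gender_label) %>%
  summarize(
    mean_bp = mean(blood_pressure, na.rm = TRUE),   # ignore missing values
    n_missing = sum(is.na(blood_pressure))          # count them separately
  )

print(summary_bp)
```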

Simple Visualization

ggplot2 is the tidyverse's visualization package. A plot is built by adding layers with +.

# Create a bar plot of participant counts by gender
# (named gender_plot so it does not share a name with base R's plot() function)
gender_plot <- ggplot(df, aes(x = gender_label)) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Participant Count by Gender",
    x = "Gender",
    y = "Count"
  ) +
  theme_minimal()

# Display the plot
print(gender_plot)

# Save the plot to a file
ggsave("gender_counts_bar.png", gender_plot)