# Load packages
library(tidyverse)
library(lubridate)
Introduction to R Programming and Data Wrangling
1 Learning Objectives
Session 2 will focus on learning the basics of R programming. We will first work with synthetic data to practice common R functions and syntax, followed by logging into All of Us Workbench and running the queries we set up last time.
Unless indicated otherwise, we will practice R on Posit Cloud.
At the end of Session 2, learners will be able to:
- Manipulate different data types in R.
- Read in CSV files as a dataframe and save dataframes as CSV files.
- Run auto-generated queries and obtain data.
- Inspect and clean up data.
2 Preliminaries
Concepts:
- file system: “directory”, “working directory”, “file path”
- input and output
- functions and arguments
- variables
- errors vs. warnings
Best practice in programming:
- comment copiously (You think you will remember what you were doing but you won’t)
- write reproducible code
- feel free to utilize AI to help you with coding but make sure you understand every line (That’s why you should still learn the basics. No “vibe coding” in this workshop and your research!)
3 Loading Libraries
“Libraries” or “packages” are collections of pre-written functions that provide convenience and efficiency in coding. Common R libraries we use for All of Us projects are:
- tidyverse: collection of packages (dplyr, readr, ggplot2,…) for file handling, working with dataframes, visualizations.
- bigrquery: facilitates the connection between R and Google BigQuery database - the type of database All of Us data are stored in.
- lubridate: for handling dates and times
We use the library() function to load these packages.
These common packages are standard in cloud platforms. Occasionally, a package you want to use may not be available in the virtual machine. Attempting to load a package with library() when it is not installed will result in an error. In that case, you need to install it first using the install.packages() function.
# Do not run this. This is just to show you the code for installing a package.
install.packages("MatchIt") #MatchIt is a package used for propensity score matching
library("MatchIt")
4 Data Types
There are 5 data types you must be familiar with in programming with R. Knowing the data types of the values in your dataframe, for example, can be critical in performing computations or troubleshooting errors.
4.1 Numeric
Numeric data can be either double (dbl) for numbers with decimals such as weight or lab values, or integer (int) for whole numbers such as participant ID numbers.
4.2 Character
Also known as strings, character (chr) data types are non-numeric text data. In All of Us data, condition names or drug names, for example, are character types. Character values are enclosed in " ". So if you have values like “1”, “2”, “3”, they are character type even though they look like numbers, and R will not be able to perform computations that require numeric data types.
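As a quick illustration of this pitfall (a minimal base R sketch with made-up values, not part of our workshop data):

```r
x <- c("1", "2", "3")   # character values, even though they look like numbers
class(x)                # "character"
x_num <- as.numeric(x)  # convert to numeric before computing
mean(x_num)             # 2
```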
4.3 Factor
Factor (fct) data types are used for categorical data. They often look just like characters which is a source of common confusion. However, under the hood, factors are stored as integers with associated “levels”. Some statistical functions may require factors as input.
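A minimal base R sketch of what “integers with associated levels” means (example values are made up):

```r
f <- factor(c("low", "high", "low", "medium"))
levels(f)      # levels are sorted alphabetically by default: "high" "low" "medium"
as.integer(f)  # the underlying integer codes: 2 1 2 3
```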
4.4 Logical
Logicals are binary data with values TRUE and FALSE. While this type of data is rarely seen as values in a dataframe, logicals are the output of filtering or conditional operations. Interestingly, TRUE equals 1 while FALSE equals 0. This property can be useful in checking data.
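For example, because TRUE counts as 1 and FALSE as 0, summing a logical vector counts the TRUEs (a small made-up sketch):

```r
is_adult <- c(TRUE, FALSE, TRUE, TRUE)
sum(is_adult)   # 3, the number of TRUE values
mean(is_adult)  # 0.75, the proportion of TRUE values
```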
4.5 Date
Date types store calendar dates such as condition start dates. Internally, R stores a Date as the number of days since 1970-01-01, which makes date arithmetic, such as computing the number of days between two dates, straightforward. The lubridate package provides convenient functions for parsing and manipulating dates.
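A minimal base R sketch of date arithmetic (the dates here are made up for illustration):

```r
d1 <- as.Date("2020-03-15")
d2 <- as.Date("2020-04-01")
d2 - d1         # Time difference of 17 days
as.numeric(d1)  # days since 1970-01-01
```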
5 Data Structures
5.1 Vector
A vector is a sequence of data where all elements are of the same data type. We can create a vector using c().
# Create character vectors
names <- c("Josh", "Evelyn", "Olivia", "Liam", "Adrian", "Tamara", "Teddy", "Allison")
cities <- c("Moscow", "Moscow", "Troy", "Viola", "Moscow", "Pullman", "Lewiston", "Moscow")
# Create a numeric vector
ages <- c(23, 18, 32, 56, 15, 25, 24, 16)
Use print() to see the values.
print(names)
[1] "Josh" "Evelyn" "Olivia" "Liam" "Adrian" "Tamara" "Teddy"
[8] "Allison"
Use length() to see how many elements there are in a vector.
length(names)
[1] 8
You can use indexing to access a specific element in a vector.
names[2] # Access the second element in the names vector.
[1] "Evelyn"
For numeric values, we can perform simple computations.
mean(ages) # Average of the elements in ages vector
[1] 26.125
sd(ages) # Standard deviation
[1] 13.28197
Other common functions for numeric data include median(), min(), max().
5.2 List
Lists look similar to vectors but they can contain mixed data types, even lists within a list. Lists are very versatile objects for programming purposes, but we won’t go into them too much as they are not necessary for typical statistical analyses. I will just mention that we use double brackets ([[ ]]) to access a specific element in a list instead of single brackets as with vectors.
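A small sketch of a list with mixed data types and double-bracket indexing (the values are made up):

```r
mixed <- list(name = "Josh", age = 23, scores = c(88, 95))
mixed[[1]]        # "Josh" — double brackets return the element itself
mixed[["scores"]] # elements can also be accessed by name
```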
5.3 Dataframe
A dataframe is like a table in an Excel sheet. Each column is a vector. Therefore, values in a column must be of the same data type, but different columns can have different types. We will create a dataframe using the vectors we created earlier to see what it looks like.
# Create a dataframe with existing vectors
df <- data.frame(Name = names, Age = ages, City = cities)
# Inspect
print(df)
  Name Age     City
1 Josh 23 Moscow
2 Evelyn 18 Moscow
3 Olivia 32 Troy
4 Liam 56 Viola
5 Adrian 15 Moscow
6 Tamara 25 Pullman
7 Teddy 24 Lewiston
8 Allison 16 Moscow
In reality, you will rarely create a dataframe from scratch like this. Rather, you will often read a CSV file into a dataframe which we will learn later in this session. For now, we will use this small dataframe to learn a few essential skills in working with dataframes.
6 Dataframe Basics
6.1 Inspect the dataframe
dim() to see the dimensions. It returns the number of rows and the number of columns.
dim(df)
[1] 8 3
Alternatively, you can also use nrow() and ncol().
str() to see the structure. This is handy for inspecting data types and previewing values of all columns.
str(df)
'data.frame': 8 obs. of 3 variables:
$ Name: chr "Josh" "Evelyn" "Olivia" "Liam" ...
$ Age : num 23 18 32 56 15 25 24 16
$ City: chr "Moscow" "Moscow" "Troy" "Viola" ...
names() to list column names.
names(df)
[1] "Name" "Age" "City"
head() or tail() to inspect the first few rows or the last few rows.
head(df)
  Name Age     City
1 Josh 23 Moscow
2 Evelyn 18 Moscow
3 Olivia 32 Troy
4 Liam 56 Viola
5 Adrian 15 Moscow
6 Tamara 25 Pullman
The default is to show 6 rows, but you can specify the number of rows inside the function. For example, head(df, 10) will show the first 10 rows.
table() is a quick way to see value counts in a column.
table(df$City)
Lewiston Moscow Pullman Troy Viola
1 4 1 1 1
6.2 Subsetting
$ signifies names in R. We can use $ to pull out values for a specific column.
df$Age # Show all the values in Age column
[1] 23 18 32 56 15 25 24 16
head(df$Age) # Show the first few values in Age column
[1] 23 18 32 56 15 25
mean(df$Age) # Compute average age
[1] 26.125
select() function can also pull out columns.
df %>% select(Name, Age) # Subset just Name and Age columns
  Name Age
1 Josh 23
2 Evelyn 18
3 Olivia 32
4 Liam 56
5 Adrian 15
6 Tamara 25
7 Teddy 24
8 Allison 16
Symbols such as %>% and |> in R are called “pipes”. Pipes are used to string together multiple operations in sequence. In the above code block, we take the dataframe df and then apply select() function on it.
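To see that a pipe is just another way of writing a function call, here is a small sketch using the native pipe |> and the ages values from earlier:

```r
ages <- c(23, 18, 32, 56, 15, 25, 24, 16)
mean(ages)      # standard function call
ages |> mean()  # the same call written with the native pipe: 26.125
```

The %>% pipe from the tidyverse works the same way for this kind of simple chaining.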
filter() function is used to subset rows based on some logical criteria.
df %>% filter(Age >= 21) # Subset rows for ages 21 or older
  Name Age     City
1 Josh 23 Moscow
2 Olivia 32 Troy
3 Liam 56 Viola
4 Tamara 25 Pullman
5 Teddy 24 Lewiston
Earlier, we briefly discussed the logical data type. When we use filter(), R evaluates each row - whether its Age value is 21 or greater - and creates a temporary vector of logical data, e.g., c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE). And then it outputs rows that correspond to TRUE.
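We can see this logical vector directly by running the comparison on the values ourselves (a sketch using the ages from df):

```r
ages <- c(23, 18, 32, 56, 15, 25, 24, 16)
ages >= 21       # TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
sum(ages >= 21)  # 5 — the number of rows filter() would keep
```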
We can filter by more than one criterion as well. We use the symbol | for OR and the symbol & for AND.
df %>% filter(Age >= 21 & City == "Moscow") # Subset rows for ages 21 or older AND city is Moscow
  Name Age   City
1 Josh 23 Moscow
We can then save the resulting filtered dataframe by assigning a new variable.
df_filtered <- df %>% filter(Age >= 21 & City == "Moscow")
print(df_filtered)
  Name Age   City
1 Josh 23 Moscow
6.3 Join
Working with All of Us data frequently involves combining two dataframes. Common join types are left_join and inner_join. We’ll create a new dataframe to demonstrate these joins.
# Create a BMI dataframe
bmi_df <- data.frame(Name = c("Evelyn", "Liam", "Olivia", "Tamara"),
BMI = c(20.2, 27.9, 32.1, 21.6))
bmi_df
    Name  BMI
1 Evelyn 20.2
2 Liam 27.9
3 Olivia 32.1
4 Tamara 21.6
inner_join keeps only the rows where there is a match in both tables. So this type of join can work as a filter.
# Join df and bmi_df by Name and only keep matched rows
df_inner_joined <- inner_join(df, bmi_df, by = "Name")
df_inner_joined
    Name Age     City  BMI
1 Evelyn 18 Moscow 20.2
2 Olivia 32 Troy 32.1
3 Liam 56 Viola 27.9
4 Tamara 25 Pullman 21.6
left_join is used most often when working with All of Us data. This join keeps all the rows from the primary (left) table and adds any matched data from the right table.
df_left_joined <- left_join(df, bmi_df, by = "Name")
df_left_joined
     Name Age     City  BMI
1 Josh 23 Moscow NA
2 Evelyn 18 Moscow 20.2
3 Olivia 32 Troy 32.1
4 Liam 56 Viola 27.9
5 Adrian 15 Moscow NA
6 Tamara 25 Pullman 21.6
7 Teddy 24 Lewiston NA
8 Allison 16 Moscow NA
Note how rows without a match from bmi_df have values NA for the BMI column. NA indicates missing values and must be handled carefully in data analysis. We will discuss NA more in the Missing Values section shortly.
As a side note, the join syntax in the above code blocks can be rewritten using a pipe %>%. For example, left_join() is often written this way:
df_left_joined <- df %>% left_join(bmi_df, by = "Name")
df_left_joined
     Name Age     City  BMI
1 Josh 23 Moscow NA
2 Evelyn 18 Moscow 20.2
3 Olivia 32 Troy 32.1
4 Liam 56 Viola 27.9
5 Adrian 15 Moscow NA
6 Tamara 25 Pullman 21.6
7 Teddy 24 Lewiston NA
8 Allison 16 Moscow NA
6.4 Mutate
mutate() function is used to create or transform variables. For example, we will create a column to show overweight status based on BMI.
df_left_joined %>% mutate(Overweight = BMI > 25)
     Name Age     City  BMI Overweight
1 Josh 23 Moscow NA NA
2 Evelyn 18 Moscow 20.2 FALSE
3 Olivia 32 Troy 32.1 TRUE
4 Liam 56 Viola 27.9 TRUE
5 Adrian 15 Moscow NA NA
6 Tamara 25 Pullman 21.6 FALSE
7 Teddy 24 Lewiston NA NA
8 Allison 16 Moscow NA NA
Other use cases of mutate() include arithmetic calculation of some numeric values. For example, if we had a dataframe with a year-of-birth column (YOB) and no age column, we can use mutate() to compute current age like this: mutate(current_age = 2025 - YOB). Similarly, if we had columns Weight in kg and Height in cm, but no BMI, we could create BMI column by: mutate(BMI = Weight / (Height/100)^2).
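A sketch of these two calculations on a small made-up dataframe (the YOB, Weight, and Height columns here are hypothetical, not part of our toy data):

```r
library(dplyr)

toy <- data.frame(YOB = c(1980, 1995), Weight = c(70, 85), Height = c(175, 180))
toy %>%
  mutate(
    current_age = 2025 - YOB,        # age from year of birth
    BMI = Weight / (Height / 100)^2  # BMI from weight (kg) and height (cm)
  )
```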
7 Missing Values
In a small dataframe like our toy example, NAs are easy to see. However, they are much harder to spot in a large dataset. This poses a problem because many statistical functions cannot deal with NAs automatically. NAs can also cause errors in unexpected ways. Therefore, researchers must always be vigilant about NAs in their data.
As an example, let’s try computing the average of BMI values.
mean(df_left_joined$BMI)
[1] NA
What happened? Many functions, including mean(), return “unknown” if any of the input values are unknown (i.e., missing). In the case of mean(), we must tell R explicitly to ignore NAs by using the optional argument na.rm = TRUE.
mean(df_left_joined$BMI, na.rm = TRUE)
[1] 25.45
NAs are very common in electronic health records, and they can also be introduced during joins, as we saw earlier.
Always check for NAs in your data before statistical analyses!
There are various ways to check for NAs. Here are some examples.
colSums(is.na(df_left_joined)) # Returns a count of NAs in every column
Name  Age City  BMI
0 0 0 4
sum(is.na(df_left_joined$BMI)) # Returns a sum of NA values in a specific column
[1] 4
Sometimes you want to see the entire rows that contain NA.
df_left_joined %>% filter(is.na(BMI)) # Show rows where BMI values are NAs
     Name Age     City BMI
1 Josh 23 Moscow NA
2 Adrian 15 Moscow NA
3 Teddy 24 Lewiston NA
4 Allison 16 Moscow NA
For categorical variables, we can use count() to see value counts including NAs. count() is similar to table(), but it shows NAs by default, whereas table() does not count NAs unless told to. The example below will not return any NAs because our City column does not have any, but it shows the syntax.
df_left_joined %>% count(City) # Show value counts of the column City, including possible NAs
      City n
1 Lewiston 1
2 Moscow 4
3 Pullman 1
4 Troy 1
5 Viola 1
If a column of interest turns out to have too many NAs, you may have to remove that variable from your analysis. If there are only a few NAs and the variable is critical to your analysis, you might decide to drop the rows containing NAs.
df_left_joined %>% drop_na(BMI)
    Name Age    City  BMI
1 Evelyn 18 Moscow 20.2
2 Olivia 32 Troy 32.1
3 Liam 56 Viola 27.9
4 Tamara 25 Pullman 21.6
If the variable is critical and you want to keep all the rows, you can look into advanced statistical methods called “imputation” which fills in missing values.
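As a toy illustration only (proper imputation methods are more involved than this), one very simple approach fills NAs with the column mean using tidyr’s replace_na(); the BMI values below are made up:

```r
library(dplyr)
library(tidyr)

toy <- data.frame(BMI = c(20.2, NA, 32.1, NA))
toy %>%
  mutate(BMI = replace_na(BMI, mean(BMI, na.rm = TRUE)))  # fill NAs with the mean
```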
8 Reading Files
In programming, plain-text files are preferred over application-dependent file formats such as Word or Excel files. Comma-separated values (CSV) format is a popular file format for tabular data. We use this file type extensively to work with All of Us data.
We use read_csv() function to read CSV files into dataframes so we can work with them. Typically, we put the file path inside the function, which will look like read_csv("file.csv") or read_csv("directory/file.csv"). For this practice, I placed practice files on the web so that they can be directly read via file URLs. Copy/paste and run the following code to obtain the data:
library(readr)
# Read directly from public URLs
person_df <- read_csv("https://raw.githubusercontent.com/yesols/aou-workshop/refs/heads/main/data/demographic_data.csv")
condition_df <- read_csv("https://raw.githubusercontent.com/yesols/aou-workshop/refs/heads/main/data/conditions.csv")
drug_df <- read_csv("https://raw.githubusercontent.com/yesols/aou-workshop/refs/heads/main/data/drug_exposures.csv")
The data in these files are randomly generated and completely fake, but designed to resemble All of Us data, although much simplified.
Now, try commands such as dim(), head(), tail(), and str() to inspect these three dataframes.
In subsequent sections, we will learn how to clean up data and merge these three dataframes. Ultimately, we will analyze the dataset to investigate the association between anticholinergic medications and dementia.
9 Recoding Values
Sometimes we have to recode values in categorical variables. For example, race in All of Us often has categories you might want to collapse into one. Let’s take a look at our example data.
# Inspect the first few lines of the demographics data
head(person_df)
# A tibble: 6 × 4
person_id year_of_birth race sex_at_birth
<dbl> <dbl> <chr> <chr>
1 1 1987 Skip Male
2 2 1987 Asian Female
3 3 2004 Asian Male
4 4 1944 Black or African American Male
5 5 1941 Asian Male
6 6 2006 More than one population Male
person_df %>% count(race) # Count the values in race column
# A tibble: 7 × 2
race n
<chr> <int>
1 Asian 20
2 Black or African American 23
3 More than one population 6
4 None of these 5
5 Other 4
6 Skip 4
7 White 38
“More than one population” and “None of these” can be recoded as “Other”. We could also change “Skip” to “Unknown” and “Black or African American” to “Black” for simplicity. We can use dplyr’s case_match() inside mutate() to recode values; its .default argument keeps any values we do not list unchanged, which is what we want since we are only tweaking some of the categories.
# Recode race category
person_df <- person_df %>%
  mutate(
    race = case_match(
      race,
      "Black or African American" ~ "Black",
      "Skip" ~ "Unknown",
      c("More than one population", "None of these") ~ "Other",
      .default = race # keep all other values unchanged
    )
  )
Check the counts again to make sure it worked. It is also good practice to verify that NAs were not introduced as a result of a coding mistake.
person_df %>% count(race)
# A tibble: 5 × 2
race n
<chr> <int>
1 Asian 20
2 Black 23
3 Other 15
4 Unknown 4
5 White 38
Similarly, we’ll also clean up sex_at_birth categories. We will change anything other than “Female” or “Male” to “Other”.
# Recode sex_at_birth
person_df <- person_df %>%
  mutate(
    sex_at_birth = case_match(
      sex_at_birth,
      c("Intersex", "Prefer not to answer", "Skip") ~ "Other",
      .default = sex_at_birth # keep Female and Male unchanged
    )
  )
Check the result:
person_df %>% count(sex_at_birth)
# A tibble: 3 × 2
sex_at_birth n
<chr> <int>
1 Female 44
2 Male 48
3 Other 8
10 Selecting Earliest Occurrence
When you extract conditions, drug exposures, or lab measurement data in All of Us, there are often multiple entries per person. In our practice data, we will pick the earliest occurrence of the condition and the earliest occurrence of drug exposure. slice() and its variants, slice_head(), slice_sample(), slice_min(), and slice_max(), are useful for subsetting a dataframe in a variety of ways. We will use slice_min() to pick the earliest date for each person. The group_by() function allows us to perform slice_min() on rows grouped by person_id rather than globally.
earliest_condition <- condition_df %>%
group_by(person_id) %>% # group rows by person_id
slice_min(
order_by = condition_start_date, # specify which column to evaluate
n = 1, # pick one row
with_ties = FALSE # keep only one in case of ties
) %>%
  ungroup()
We do the same for the drug dataframe:
earliest_drug <- drug_df %>%
group_by(person_id) %>%
slice_min(
order_by = drug_exposure_start_date,
n = 1,
with_ties = FALSE
) %>%
  ungroup()
11 Merging Demographics with Clinical Data
When you extract All of Us data with auto-generated queries, you will have separate dataframes for “person”, “conditions”, “drug exposures”, etc., and you must combine them for analysis. We will use left_join(), which we learned earlier, to merge the earliest-occurrence condition and drug dataframes into person_df.
combined_df <- person_df %>%
left_join(earliest_condition, by = "person_id") %>%
left_join(earliest_drug, by = "person_id")
head(combined_df) # Check result
# A tibble: 6 × 9
person_id year_of_birth race sex_at_birth condition_concept_id
<dbl> <dbl> <chr> <chr> <dbl>
1 1 1987 Unknown Male NA
2 2 1987 Asian Female NA
3 3 2004 Asian Male NA
4 4 1944 Black Male NA
5 5 1941 Asian Male NA
6 6 2006 Other Male NA
# ℹ 4 more variables: condition_start_date <date>,
# condition_source_value <chr>, drug_concept_id <dbl>,
# drug_exposure_start_date <date>
12 Writing and Saving Files
In order to save our data, we use write_csv() function to save the dataframe as a CSV file. Its arguments are the dataframe name and a file path:
write_csv(combined_df, "cleaned_dataset.csv") # file path will be different
13 Running Queries and Working with Queried Data
Demonstration in class.
14 Saving to and Reading from Workspace Bucket
In the All of Us Workbench, running code like write_csv(combined_df, "cleaned_dataset.csv") will create a CSV file in your current working directory. For permanent and secure storage, we need to copy the file to the Workspace Bucket. The code below shows what saving a file to the bucket looks like.
Do not run these code blocks in Posit Cloud or local RStudio. It will only work within the All of Us Workbench.
# Retrieve the environment variable for the bucket path
my_bucket <- Sys.getenv("WORKSPACE_BUCKET")
# Save the file to a "data" directory in the workspace bucket
system(paste0("gsutil cp cleaned_dataset.csv ", my_bucket, "/data/"))
In the second line of the code above, the paste0() function concatenates its arguments into a single character string. Here it constructs a system command that uses gsutil, a command-line tool for interacting with the bucket; cp means “copy”. The constructed command is then passed to system(), which invokes operating system commands directly from the R console.
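To make the string construction concrete, here is what paste0() produces with a made-up bucket path (the real value comes from Sys.getenv("WORKSPACE_BUCKET")):

```r
my_bucket <- "gs://example-bucket"  # hypothetical bucket path for illustration
paste0("gsutil cp cleaned_dataset.csv ", my_bucket, "/data/")
# "gsutil cp cleaned_dataset.csv gs://example-bucket/data/"
```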
Conversely, when you want to read a file from the bucket, you will do something like this:
# Copy the file in the bucket to the current working directory
system(paste0("gsutil cp ", my_bucket, "/data/cleaned_dataset.csv ."))
# Read the csv file as a dataframe
df <- read_csv("cleaned_dataset.csv")
15 Preparing for Next Session
Before the next session, apply what you learned today to your All of Us datasets to clean and construct an analysis-ready dataset. Make a note of any issues. We will begin with a brief Q & A next time.