Introduction to R Programming and Data Wrangling

1 Learning Objectives

Session 2 will focus on the basics of R programming. We will first work with synthetic data to practice common R functions and syntax, then log into the All of Us Workbench and run the queries we set up last time.

Unless indicated otherwise, we will practice R on Posit Cloud.

At the end of Session 2, learners will be able to:

  1. Manipulate different data types in R.
  2. Read in CSV files as a dataframe and save dataframes as CSV files.
  3. Run auto-generated queries and obtain data.
  4. Inspect and clean up data.

2 Preliminaries

Concepts:

  • file system: “directory”, “working directory”, “file path”
  • input and output
  • functions and arguments
  • variables
  • errors vs. warnings

Best practice in programming:

  • comment copiously (You think you will remember what you were doing but you won’t)
  • write reproducible code
  • feel free to utilize AI to help you with coding but make sure you understand every line (That’s why you should still learn the basics. No “vibe coding” in this workshop and your research!)

3 Loading Libraries

“Libraries” or “packages” are collections of pre-written functions that provide convenience and efficiency in coding. Common R libraries we use for All of Us projects are:

  • tidyverse: collection of packages (dplyr, readr, ggplot2,…) for file handling, working with dataframes, visualizations.
  • bigrquery: facilitates the connection between R and Google BigQuery database - the type of database All of Us data are stored in.
  • lubridate: for handling dates and times

We use the library() function to load these packages.

# Load packages
library(tidyverse)
library(lubridate)

These common packages are standard in cloud platforms. Occasionally, a package you want to use may not be available in the virtual machine. Attempting to load a missing package with library() results in an error. In that case, you must first install it using the install.packages() function.

# Do not run this. This is just to show you the code for installing a package.
install.packages("MatchIt") #MatchIt is a package used for propensity score matching
library("MatchIt")

4 Data Types

There are five data types you should be familiar with when programming in R. Knowing the data types of the values in your dataframe, for example, can be critical for performing computations or troubleshooting errors.

4.1 Numeric

Numeric data can be either double (dbl) for numbers with decimals such as weight or lab values, or integer (int) for whole numbers such as participant ID numbers.

4.2 Character

Also known as strings, character (chr) data types are non-numeric, text data. In All of Us data, condition names or drug names, for example, are character types. Character values are enclosed in " ". So if you have values like "1", "2", "3", they are character type even though they look like numbers, and R will not be able to perform computations that require numeric data types.
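For instance, here is a quick sketch of converting character numbers to numeric with as.numeric() (the vector x is made up for illustration):

```r
# Values that look like numbers but are stored as characters
x <- c("1", "2", "3")
# mean(x) will not work because x is not numeric; convert first
x_num <- as.numeric(x)
mean(x_num)  # 2
```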

4.3 Factor

Factor (fct) data types are used for categorical data. They often look just like characters, which is a common source of confusion. However, under the hood, factors are stored as integers with associated “levels”. Some statistical functions may require factors as input.
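A small sketch of what is "under the hood" (the city values are made up for illustration):

```r
# A factor looks like character data but is stored as integers with levels
city_fct <- factor(c("Moscow", "Troy", "Moscow"))
levels(city_fct)      # "Moscow" "Troy"
as.integer(city_fct)  # 1 2 1 -- the underlying integer codes
```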

4.4 Logical

Logicals are binary data with values TRUE and FALSE. While this type of data is rarely seen as values in a dataframe, logicals are the output of filtering and conditional operations. Interestingly, TRUE equals 1 while FALSE equals 0. This property can be useful in checking data.
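For example, a quick sketch of using this property (the logical vector here is made up for illustration):

```r
# TRUE counts as 1 and FALSE as 0 in arithmetic
x <- c(TRUE, FALSE, TRUE, TRUE)
sum(x)   # 3 -- number of TRUE values
mean(x)  # 0.75 -- proportion of TRUE values
```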

4.5 Date
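Date types store calendar dates, which allows R to sort records chronologically and compute time intervals. Here is a minimal base-R sketch with made-up dates (the lubridate package we loaded earlier provides friendlier helpers such as ymd()):

```r
# Convert character strings to Date objects
start <- as.Date("1987-05-14")
end   <- as.Date("1987-06-14")
class(start)  # "Date"
end - start   # Time difference of 31 days
```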

5 Data Structures

5.1 Vector

A vector is a sequence of data where all elements are of the same data type. We can create a vector using c().

# Create character vectors
names <- c("Josh", "Evelyn", "Olivia", "Liam", "Adrian", "Tamara", "Teddy", "Allison")
cities <- c("Moscow", "Moscow", "Troy", "Viola", "Moscow", "Pullman", "Lewiston", "Moscow")
# Create a numeric vector
ages <- c(23, 18, 32, 56, 15, 25, 24, 16)

Use print() to see the values.

print(names)
[1] "Josh"    "Evelyn"  "Olivia"  "Liam"    "Adrian"  "Tamara"  "Teddy"  
[8] "Allison"

Use length() to see how many elements there are in a vector.

length(names)
[1] 8

You can use indexing to access a specific element in a vector.

names[2]  # Access the second element in the names vector.
[1] "Evelyn"

For numeric values, we can perform simple computations.

mean(ages)  # Average of the elements in ages vector
[1] 26.125
sd(ages)  # Standard deviation 
[1] 13.28197

Other common functions for numeric data include median(), min(), max().

5.2 List

Lists look similar to vectors, but they can contain mixed data types, even lists within a list. Lists are very versatile objects for programming purposes, but we won’t go into them too much as they are not necessary for typical statistical analyses. I will just mention that we use double brackets ([[ ]]) to access a specific element in a list instead of single brackets as with vectors.
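A brief sketch (the person list here is made up for illustration):

```r
# A list can mix data types and even nest other structures
person <- list(name = "Josh", age = 23, hobbies = c("chess", "hiking"))
person[[2]]     # 23 -- double brackets return the element itself
person$hobbies  # named elements can also be accessed with $
```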

5.3 Dataframe

A dataframe is like a table in an Excel sheet. Each column is a vector; therefore, values in a column must be of the same data type, but different columns can have different types. We will create a dataframe using the vectors we created earlier to see what it looks like.

# Create a dataframe with existing vectors
df <- data.frame(Name = names, Age = ages, City = cities)
# Inspect
print(df)
     Name Age     City
1    Josh  23   Moscow
2  Evelyn  18   Moscow
3  Olivia  32     Troy
4    Liam  56    Viola
5  Adrian  15   Moscow
6  Tamara  25  Pullman
7   Teddy  24 Lewiston
8 Allison  16   Moscow

In reality, you will rarely create a dataframe from scratch like this. Rather, you will often read a CSV file into a dataframe which we will learn later in this session. For now, we will use this small dataframe to learn a few essential skills in working with dataframes.

6 Dataframe Basics

6.1 Inspect the dataframe

Use dim() to see the dimensions. It returns the number of rows and the number of columns.

dim(df)
[1] 8 3

Alternatively, you can also use nrow() and ncol().

Use str() to see the structure. This is handy for inspecting data types and previewing the values of all columns.

str(df)
'data.frame':   8 obs. of  3 variables:
 $ Name: chr  "Josh" "Evelyn" "Olivia" "Liam" ...
 $ Age : num  23 18 32 56 15 25 24 16
 $ City: chr  "Moscow" "Moscow" "Troy" "Viola" ...

Use names() to list the column names.

names(df)
[1] "Name" "Age"  "City"

Use head() or tail() to inspect the first few rows or the last few rows.

head(df)
    Name Age    City
1   Josh  23  Moscow
2 Evelyn  18  Moscow
3 Olivia  32    Troy
4   Liam  56   Viola
5 Adrian  15  Moscow
6 Tamara  25 Pullman

The default is to show 6 rows, but you can specify the number of rows inside the function. For example, head(df, 10) will show the first 10 rows.

table() is a quick way to see value counts in a column.

table(df$City)

Lewiston   Moscow  Pullman     Troy    Viola 
       1        4        1        1        1 

6.2 Subsetting

$ signifies names in R. We can use $ to pull out values for a specific column.

df$Age   # Show all the values in Age column
[1] 23 18 32 56 15 25 24 16
head(df$Age)  # Show the first few values in Age column
[1] 23 18 32 56 15 25
mean(df$Age)  # Compute average age  
[1] 26.125

The select() function can also pull out columns.

df %>% select(Name, Age)  # Subset just Name and Age columns
     Name Age
1    Josh  23
2  Evelyn  18
3  Olivia  32
4    Liam  56
5  Adrian  15
6  Tamara  25
7   Teddy  24
8 Allison  16
Note: Pipe Operator

Symbols such as %>% and |> in R are called “pipes”. Pipes are used to string together multiple operations in sequence. In the above code block, we take the dataframe df and then apply the select() function to it.
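The same idea can be sketched with the base pipe |> (available in R 4.1 and later); %>% from the tidyverse behaves the same way for our purposes:

```r
ages <- c(23, 18, 32)
# The pipe passes the left-hand side as the first argument of the function
mean(ages)      # function-call form
ages |> mean()  # piped form -- identical result
```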

The filter() function is used to subset rows based on logical criteria.

df %>% filter(Age >= 21)  # Subset rows for ages 21 or older
    Name Age     City
1   Josh  23   Moscow
2 Olivia  32     Troy
3   Liam  56    Viola
4 Tamara  25  Pullman
5  Teddy  24 Lewiston

Earlier, we briefly discussed the logical data type. When we use filter(), R evaluates each row - whether its Age value is 21 or greater - and creates a temporary vector of logical data, e.g., c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE). It then outputs the rows that correspond to TRUE.

We can filter by more than one criterion as well. We use the symbol | for OR and the symbol & for AND.

df %>% filter(Age >= 21 & City == "Moscow")  # Subset rows for ages 21 or older AND city is Moscow
  Name Age   City
1 Josh  23 Moscow

We can then save the resulting filtered dataframe by assigning a new variable.

df_filtered <- df %>% filter(Age >= 21 & City == "Moscow")
print(df_filtered)
  Name Age   City
1 Josh  23 Moscow

6.3 Join

Working with All of Us data frequently involves combining two dataframes. Common join types are left_join and inner_join. We’ll create a new dataframe to demonstrate these joins.

# Create a BMI dataframe
bmi_df <- data.frame(Name = c("Evelyn", "Liam", "Olivia", "Tamara"),
                     BMI = c(20.2, 27.9, 32.1, 21.6))
bmi_df
    Name  BMI
1 Evelyn 20.2
2   Liam 27.9
3 Olivia 32.1
4 Tamara 21.6

inner_join keeps only the rows that have a match in both tables, so this type of join can work as a filter.

# Join df and bmi_df by Name and only keep matched rows
df_inner_joined <- inner_join(df, bmi_df, by = "Name")
df_inner_joined
    Name Age    City  BMI
1 Evelyn  18  Moscow 20.2
2 Olivia  32    Troy 32.1
3   Liam  56   Viola 27.9
4 Tamara  25 Pullman 21.6

left_join is used most often when working with All of Us data. This join keeps all the rows from the primary (left) table and adds any matched data from the right table.

df_left_joined <- left_join(df, bmi_df, by = "Name")
df_left_joined
     Name Age     City  BMI
1    Josh  23   Moscow   NA
2  Evelyn  18   Moscow 20.2
3  Olivia  32     Troy 32.1
4    Liam  56    Viola 27.9
5  Adrian  15   Moscow   NA
6  Tamara  25  Pullman 21.6
7   Teddy  24 Lewiston   NA
8 Allison  16   Moscow   NA

Note how rows without a match from bmi_df have values NA for the BMI column. NA indicates missing values and must be handled carefully in data analysis. We will discuss NA more in the Missing Values section shortly.

As a side note, the join syntax in the above code blocks can also be written using a pipe %>%. For example, left_join() is often written this way:

df_left_joined <- df %>% left_join(bmi_df, by = "Name")
df_left_joined
     Name Age     City  BMI
1    Josh  23   Moscow   NA
2  Evelyn  18   Moscow 20.2
3  Olivia  32     Troy 32.1
4    Liam  56    Viola 27.9
5  Adrian  15   Moscow   NA
6  Tamara  25  Pullman 21.6
7   Teddy  24 Lewiston   NA
8 Allison  16   Moscow   NA

6.4 Mutate

The mutate() function is used to create or transform variables. For example, we will create a column showing overweight status based on BMI.

df_left_joined %>% mutate(Overweight = BMI > 25)
     Name Age     City  BMI Overweight
1    Josh  23   Moscow   NA         NA
2  Evelyn  18   Moscow 20.2      FALSE
3  Olivia  32     Troy 32.1       TRUE
4    Liam  56    Viola 27.9       TRUE
5  Adrian  15   Moscow   NA         NA
6  Tamara  25  Pullman 21.6      FALSE
7   Teddy  24 Lewiston   NA         NA
8 Allison  16   Moscow   NA         NA

Other use cases of mutate() include arithmetic calculations on numeric values. For example, if we had a dataframe with a year-of-birth column (YOB) and no age column, we could use mutate() to compute current age like this: mutate(current_age = 2025 - YOB). Similarly, if we had columns Weight in kg and Height in cm, but no BMI, we could create a BMI column with: mutate(BMI = Weight / (Height/100)^2).
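As a sketch of the BMI calculation with made-up Weight and Height columns (not part of our practice data):

```r
library(dplyr)

# Hypothetical measurements: Weight in kg, Height in cm
toy <- data.frame(Weight = c(70, 85), Height = c(175, 180))
# Compute BMI from the existing columns
toy <- toy %>% mutate(BMI = Weight / (Height / 100)^2)
round(toy$BMI, 1)  # 22.9 26.2
```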

7 Missing Values

In a small dataframe like our toy example, NAs are easy to see. However, they are much harder to spot in a large dataset. This poses a problem because many statistical functions cannot deal with NAs automatically. NAs can also cause errors in unexpected ways. Therefore, researchers must always be vigilant about NAs in their data.

As an example, let’s try computing the average of BMI values.

mean(df_left_joined$BMI)
[1] NA

What happened? Many functions, including mean(), return “unknown” if any of the input values are unknown (i.e., missing). In the case of mean(), we must tell R explicitly to ignore NAs by using the optional argument na.rm = TRUE.

mean(df_left_joined$BMI, na.rm = TRUE)
[1] 25.45

NAs are very common in electronic health records, and they can also be introduced during joins as we saw earlier.

Warning

Always check for NAs in your data before statistical analyses!

There are various ways to check for NAs. Here are some examples.

colSums(is.na(df_left_joined))  # Returns a count of NAs in every column
Name  Age City  BMI 
   0    0    0    4 
sum(is.na(df_left_joined$BMI))  # Returns the number of NAs in a specific column
[1] 4

Sometimes you want to see the entire rows that contain NA.

df_left_joined %>% filter(is.na(BMI))  # Show rows where BMI values are NAs
     Name Age     City BMI
1    Josh  23   Moscow  NA
2  Adrian  15   Moscow  NA
3   Teddy  24 Lewiston  NA
4 Allison  16   Moscow  NA

For categorical variables, we can use count() to see value counts including NAs. count() is similar to table(), but it includes NAs by default, whereas table() does not. The example below will not return any NA rows because our City column has none, but it shows the syntax.

df_left_joined %>% count(City)  # Show value counts of the column City, including possible NAs
      City n
1 Lewiston 1
2   Moscow 4
3  Pullman 1
4     Troy 1
5    Viola 1
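To see the difference between the two functions, here is a sketch with a made-up column that does contain an NA:

```r
library(dplyr)

toy <- data.frame(City = c("Moscow", "Troy", NA))
table(toy$City)      # NA is silently dropped from the counts
toy %>% count(City)  # includes a row for NA with n = 1
```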

If a column of interest turns out to have too many NAs, you may have to remove that variable from your analysis. If there are only a few NAs and the variable is critical to your analysis, you might decide to drop the rows containing them.

df_left_joined %>% drop_na(BMI)
    Name Age    City  BMI
1 Evelyn  18  Moscow 20.2
2 Olivia  32    Troy 32.1
3   Liam  56   Viola 27.9
4 Tamara  25 Pullman 21.6

If the variable is critical and you want to keep all the rows, you can look into advanced statistical methods called “imputation”, which fill in missing values.

8 Reading Files

In programming, plain-text files are preferred over application-dependent file formats such as Word or Excel files. Comma-separated values (CSV) format is a popular file format for tabular data. We use this file type extensively to work with All of Us data.

We use the read_csv() function to read CSV files into dataframes so we can work with them. Typically, we put the file path inside the function, which looks like read_csv("file.csv") or read_csv("directory/file.csv"). For this practice, I placed the practice files on the web so that they can be read directly via file URLs. Copy, paste, and run the following code to obtain the data:

library(readr)

# Read directly from public URLs
person_df <- read_csv("https://raw.githubusercontent.com/yesols/aou-workshop/refs/heads/main/data/demographic_data.csv")
condition_df <- read_csv("https://raw.githubusercontent.com/yesols/aou-workshop/refs/heads/main/data/conditions.csv")
drug_df <- read_csv("https://raw.githubusercontent.com/yesols/aou-workshop/refs/heads/main/data/drug_exposures.csv")
Note

The data in these files are randomly generated and completely fake, but designed to resemble All of Us data although much simplified.

Now, try the commands dim(), head(), tail(), str(), etc., to inspect these three dataframes.

In subsequent sections, we will learn how to clean up data and merge these three dataframes. Ultimately, we will analyze the dataset to investigate the association between anticholinergic medications and dementia.

9 Recoding Values

Sometimes we have to recode values in categorical variables. For example, the race variable in All of Us often has categories you might want to collapse into one. Let’s take a look at our example data.

# Inspect the first few lines of the demographics data
head(person_df)
# A tibble: 6 × 4
  person_id year_of_birth race                      sex_at_birth
      <dbl>         <dbl> <chr>                     <chr>       
1         1          1987 Skip                      Male        
2         2          1987 Asian                     Female      
3         3          2004 Asian                     Male        
4         4          1944 Black or African American Male        
5         5          1941 Asian                     Male        
6         6          2006 More than one population  Male        
person_df %>% count(race) # Count the values in race column
# A tibble: 7 × 2
  race                          n
  <chr>                     <int>
1 Asian                        20
2 Black or African American    23
3 More than one population      6
4 None of these                 5
5 Other                         4
6 Skip                          4
7 White                        38

“More than one population” and “None of these” can be recoded as “Other”. We could also change “Skip” to “Unknown” and “Black or African American” to “Black” for simplicity. We can use recode_values() if we are changing all values or replace_values() if just updating some of the values. In our case, we just want to tweak some of the categories, so we’ll use replace_values().

# Recode race category
person_df <- person_df %>%
  mutate(
    race = replace_values(
      race,
      "Black or African American" ~ "Black",
      "Skip" ~ "Unknown",
      c("More than one population", "None of these") ~ "Other"
    )
  )

Check the counts again to make sure it worked. This is also good practice to ensure NAs were not introduced as a result of a coding mistake.

person_df %>% count(race)
# A tibble: 5 × 2
  race        n
  <chr>   <int>
1 Asian      20
2 Black      23
3 Other      15
4 Unknown     4
5 White      38

Similarly, we’ll also clean up sex_at_birth categories. We will change anything other than “Female” or “Male” to “Other”.

# Recode sex_at_birth
person_df <- person_df %>%
  mutate(
    sex_at_birth = replace_values(
      sex_at_birth,
      c("Intersex", "Prefer not to answer", "Skip") ~ "Other"
    )
  )

Check the result:

person_df %>% count(sex_at_birth)
# A tibble: 3 × 2
  sex_at_birth     n
  <chr>        <int>
1 Female          44
2 Male            48
3 Other            8

10 Selecting Earliest Occurrence

When you extract conditions, drug exposures, or lab measurement data in All of Us, there are often multiple entries per person. In our practice data, we will pick the earliest occurrence of the condition and the earliest occurrence of drug exposure. slice() and its variations, slice_head(), slice_sample(), slice_min(), slice_max(), are useful for subsetting a dataframe in a variety of ways. We will use slice_min() to pick the earliest date for each person. The group_by() function allows us to perform slice_min() on rows grouped by person_id rather than globally.

earliest_condition <- condition_df %>%
  group_by(person_id) %>%               # group rows by person_id
  slice_min(                            
    order_by = condition_start_date,    # specify which column to evaluate
    n = 1,                              # pick one row
    with_ties = FALSE                   # keep only one in case of ties
  ) %>%
  ungroup()

We do the same for the drug dataframe:

earliest_drug <- drug_df %>%
  group_by(person_id) %>%
  slice_min(
    order_by = drug_exposure_start_date, 
    n = 1, 
    with_ties = FALSE
  ) %>%
  ungroup()

11 Merging Demographics with Clinical Data

When you extract All of Us data with auto-generated queries, you will have separate dataframes for “person”, “conditions”, “drug exposures”, etc., and you must combine them for analysis. We will use left_join(), which we learned earlier, to merge the earliest condition and drug records into person_df.

combined_df <- person_df %>% 
  left_join(earliest_condition, by = "person_id") %>%
  left_join(earliest_drug, by = "person_id")
head(combined_df)  # Check result
# A tibble: 6 × 9
  person_id year_of_birth race    sex_at_birth condition_concept_id
      <dbl>         <dbl> <chr>   <chr>                       <dbl>
1         1          1987 Unknown Male                           NA
2         2          1987 Asian   Female                         NA
3         3          2004 Asian   Male                           NA
4         4          1944 Black   Male                           NA
5         5          1941 Asian   Male                           NA
6         6          2006 Other   Male                           NA
# ℹ 4 more variables: condition_start_date <date>,
#   condition_source_value <chr>, drug_concept_id <dbl>,
#   drug_exposure_start_date <date>

12 Writing and Saving Files

To save our data, we use the write_csv() function, which saves a dataframe as a CSV file. Its arguments are the dataframe name and a file path:

write_csv(combined_df, "cleaned_dataset.csv")  # file path will be different

13 Running Queries and Working with Queried Data

Demonstration in class.

14 Saving to and Reading from Workspace Bucket

In the All of Us Workbench, running code like write_csv(combined_df, "cleaned_dataset.csv") will create a CSV file in your current working directory. For permanent and secure storage, we need to copy the file to the Workspace Bucket. The code below shows how to save a file to the bucket.

Warning

Do not run these code blocks in Posit Cloud or local RStudio. They will only work within the All of Us Workbench.

# Retrieve the environment variable for the bucket path
my_bucket <- Sys.getenv("WORKSPACE_BUCKET")
# Save the file to a "data" directory in the workspace bucket
system(paste0("gsutil cp cleaned_dataset.csv ", my_bucket, "/data/"))

In the second line of the code above, the paste0() function constructs a single string from its arguments. Here, it builds a gsutil command; gsutil is a utility that lets us interact with the bucket, and cp is its “copy” command. The constructed command is then passed to system(), which invokes operating system commands directly from the R console.
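To see what paste0() produces, here is a sketch with a placeholder bucket path (in the Workbench, the real value comes from Sys.getenv()):

```r
my_bucket <- "gs://example-bucket"  # placeholder value for illustration
cmd <- paste0("gsutil cp cleaned_dataset.csv ", my_bucket, "/data/")
cmd  # "gsutil cp cleaned_dataset.csv gs://example-bucket/data/"
# system(cmd) would then run this shell command (Workbench only)
```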

Conversely, when you want to read a file from the bucket, you will do something like this:

# Copy the file in the bucket to the current working directory
system(paste0("gsutil cp ", my_bucket, "/data/cleaned_dataset.csv ."))
# Read the csv file as a dataframe
df <- read_csv("cleaned_dataset.csv")

15 Preparing for Next Session

Before the next session, apply what you learned today to your All of Us datasets to clean and construct an analysis-ready dataset. Make a note of any issues. We will begin with a brief Q & A next time.