Introduction to Cloud Computing and All of Us Workbench
1 Learning Objectives
In this session, we will start with an overview of cloud computing concepts. We will then log into the All of Us Workbench and build the cohort and datasets for your project.
At the end of Session 1, learners will be able to:
- Define the concept of cloud computing and its advantages in working with large and sensitive data.
- Identify and navigate the components of Researcher Workbench.
- Demonstrate proper steps of starting and ending analysis sessions in the Workbench.
- Demonstrate using the Workbench tools to create datasets
2 What is Cloud Computing?
Cloud computing is the on-demand availability of computing resources (servers, storage, software, etc) over the internet. Common services include AWS by Amazon, Azure by Microsoft, and Google Cloud Platform (GCP). The All of Us Researcher Workbench is built on GCP. Cloud computing allows users to only pay for what they use without having to maintain the computing resources themselves.
“Cloud Computing” by Sam Johnston is licensed under CC BY-SA 3.0.
Cloud computing is especially advantageous for researchers to share and analyze large data as they do not have to buy and maintain expensive hardware. Cloud computing is also secure and provides better protection for sensitive data than if it were stored in individual researchers’ own machines.
3 OMOP Common Data Model
Raw data collected from various EHR systems lack standardization that makes analysis challenging. The All of Us Research Program harmonizes the data in accordance to the OMOP CDM framework in which clinical terms such as medication names, diagnoses, procedures, etc, are mapped to the OMOP standard vocabulary. For example, SNOMED is the standard vocabulary for conditions and RxNorm is the standard vocabulary for drugs. Then, all concepts are assigned an “OMOP Concept ID”, which is globally unique across all concept domains, allowing an accurate and efficient query. Refer to this support article for more details: Understanding OMOP Basics
A good way to look up a concept is Athena database. I often use Athena to:
- obtain the code or concept ID for a term
- inspect hierarchical relationship of a concept
- translate between different vocabularies (for example, between ICD-10 and SNOMED)
4 Relational Database
All of Us data is organized in so-called “relational database.” In a relational database, data is not stored in a single, massive spreadsheet, but rather organized into multiple, specialized tables that are linked together through shared identifiers called “keys.” This structure ensures data integrity and efficiency.
(Full diagram at: http://ohdsi.github.io/CommonDataModel/cdm54erd.html)
Structured Query Language (SQL) is a programming language for interacting with relational databases. In All of Us context, point-and-click data extraction tools in the Workbench generates SQL queries automatically, which are then pasted to a notebook and run to extract data. Advanced users can write their own SQL queries themselves.
However, knowledge of SQL is not necessary to conduct research projects with All of Us and we will not learn it in this workshop. If you are interested in learning how to use SQL to extract data, check out this support article: Exploring Concepts with OMOP and SQL
Sometimes, you might want to look up exact data tables and associated column names in All of Us. Check on the Data Dictionary which shows you exactly how all the data are organized. It also contains information about suppressed data types.
With the new Verily Workbench (“Workbench 2.0”), there are extra tables that facilitate faster queries. So when you query using Data Explorer, the resulting dataset may contain table names and column names that do not appear in the official OMOP CDM.
6 File Management
- Bucket for storage
- Deleting files: exercise caution!
7 Billing
- Pods
- Initial credits
- What happens when initial credits expire
- 3 components:
- Data extraction
- Computation
- Storage
- Monitoring cost
8 Using Data Explorer
Demonstration in class:
- Selecting cohorts
- Adding concepts
- Saving data snapshot
9 Preparing for Next Session
In Session 2, we will learn basic R programming. We will use Google Colab, a cloud-based Jupyter environment. This way, you do not have to deal with installations on your own machine and can start coding right away from a browser window.