Introduction to Python
This tutorial covers the fundamentals of Python and the Pandas library, essential tools for working with structured data like that found in the All of Us Research Program datasets.
1. Python Basics
Hello World and Variables
The print() function writes output to the screen. A variable is a name bound to a value, created by assignment with =.
# Displaying a simple string
print("Hello, All of Us!")
# Variables and basic arithmetic
age = 30
years_in_study = 5
total_time = age + years_in_study
# Printing the variable value
print(total_time)
Output:
Hello, All of Us!
35
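As a small extension of the example above, the sketch below labels the printed value with an f-string and checks a variable's type with type(). The name summary is introduced here purely for illustration.

```python
# Same variables as in the example above
age = 30
years_in_study = 5
total_time = age + years_in_study

# An f-string embeds the value of a variable inside a string
summary = f"Age plus years in study: {total_time}"
print(summary)

# type() reports what kind of value a variable holds
print(type(total_time))  # <class 'int'>
```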
Data Structures: Lists and Dictionaries
- Lists are ordered, mutable sequences.
- Dictionaries are mutable collections of key-value pairs; since Python 3.7 they preserve insertion order.
# A list of participant IDs
participant_ids = [1001, 1002, 1003, 1004]
print(participant_ids[0]) # Access the first element (index 0)
# A dictionary storing participant information
participant_info = {
    "1001": {"age": 55, "sex": "Male"},
    "1002": {"age": 62, "sex": "Female"}
}
print(participant_info["1001"]["sex"])  # Access a value by its key
Output:
1001
Male
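To illustrate what "mutable" means in practice, here is a minimal sketch that changes a list and a dictionary after they are created. The ID 1005, the "1003" entry, and the key "9999" are made-up values used only for this example.

```python
# Lists can grow after creation
participant_ids = [1001, 1002, 1003, 1004]
participant_ids.append(1005)   # hypothetical new participant ID
print(len(participant_ids))    # 5
print(participant_ids[-1])     # negative indices count from the end: 1005

# Dictionaries can gain new key-value pairs
participant_info = {
    "1001": {"age": 55, "sex": "Male"},
    "1002": {"age": 62, "sex": "Female"},
}
participant_info["1003"] = {"age": 47, "sex": "Female"}  # hypothetical entry

# .get() returns None instead of raising KeyError when a key is missing
missing = participant_info.get("9999")
print(missing)  # None

# .items() iterates over key-value pairs in insertion order
for pid, info in participant_info.items():
    print(pid, info["age"])
```

Using .get() is one way to look up a key that may not exist; indexing with square brackets raises a KeyError in that case.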
2. Introduction to Pandas
Importing Pandas
Pandas is the primary library for working with DataFrames in Python. A DataFrame is a two-dimensional, labeled table of rows and columns, and it is how we typically organize a dataset for analysis.
We typically import Pandas using the alias pd.
import pandas as pd
Reading and Inspecting Data Files
Assume a file named my_data.csv exists in the current working directory.
# Reading a CSV file into a DataFrame
df = pd.read_csv('my_data.csv')
# Display the first few rows of the DataFrame
print(df.head())
# Write the DataFrame to a new CSV file (index=False omits the row index column)
df.to_csv('modified_data.csv', index=False)
Example Data Input (my_data.csv):
| person_id | age_at_enrollment | gender_concept_id | blood_pressure |
|---|---|---|---|
| 1001 | 65 | 8507 | 130 |
| 1002 | 42 | 8532 | 120 |
| 1003 | 71 | 8507 | 145 |
| 1004 | 35 | 8532 | 110 |
Expected Output (df.head()):
person_id age_at_enrollment gender_concept_id blood_pressure
0 1001 65 8507 130
1 1002 42 8532 120
2 1003 71 8507 145
3 1004 35 8532 110
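If my_data.csv is not available, the same DataFrame can be reproduced without a file. The sketch below passes the example data to pd.read_csv() as in-memory text via io.StringIO, then inspects it with .shape, .columns, and .dtypes. The variable name csv_text is an assumption made for this example.

```python
import io

import pandas as pd

# The example data from the table above, written as CSV text
csv_text = """person_id,age_at_enrollment,gender_concept_id,blood_pressure
1001,65,8507,130
1002,42,8532,120
1003,71,8507,145
1004,35,8532,110
"""

# read_csv accepts any file-like object, not only a path on disk
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (rows, columns): (4, 4)
print(df.columns.tolist())  # column names as a plain list
print(df.dtypes)            # data type inferred for each column
```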
3. Data Manipulation: Selection and Recoding
Filtering Rows
Filter for participants older than 60.
# Filter rows where age is over 60 (Boolean indexing)
older_participants = df[df['age_at_enrollment'] > 60]
print(older_participants)
Output:
person_id age_at_enrollment gender_concept_id blood_pressure
0 1001 65 8507 130
2 1003 71 8507 145
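Boolean indexing also supports combined conditions. The sketch below rebuilds the example data inline and keeps participants who are older than 60 and have a blood pressure of at least 140. The 140 threshold and the name older_high_bp are illustrative assumptions, not values from the tutorial.

```python
import pandas as pd

# Rebuild the example data so this block runs on its own
df = pd.DataFrame({
    "person_id": [1001, 1002, 1003, 1004],
    "age_at_enrollment": [65, 42, 71, 35],
    "gender_concept_id": [8507, 8532, 8507, 8532],
    "blood_pressure": [130, 120, 145, 110],
})

# Combine conditions with & (and) or | (or); each condition needs its own
# parentheses because & binds more tightly than comparison operators
mask = (df["age_at_enrollment"] > 60) & (df["blood_pressure"] >= 140)

# .loc selects the matching rows and, optionally, a subset of columns
older_high_bp = df.loc[mask, ["person_id", "age_at_enrollment", "blood_pressure"]]
print(older_high_bp)
```

Only participant 1003 satisfies both conditions in this sample.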
Recoding Values
Recode the numeric gender_concept_id values to descriptive labels using .replace(). Values that have no entry in the mapping are left unchanged.
# Recode gender_concept_id (e.g., 8507 is Male, 8532 is Female)
df['gender_label'] = df['gender_concept_id'].replace({
    8507: 'Male',
    8532: 'Female',
    9000: 'Other/Unknown'
})
print(df[['person_id', 'gender_concept_id', 'gender_label']].head())
Output:
person_id gender_concept_id gender_label
0 1001 8507 Male
1 1002 8532 Female
2 1003 8507 Male
3 1004 8532 Female
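One detail worth knowing when recoding: .replace() and .map() treat unmapped values differently. The sketch below uses a made-up code, 9999, that is absent from the mapping; the names codes and labels are also assumptions for this example.

```python
import pandas as pd

# 9999 is a hypothetical code with no entry in the mapping
codes = pd.Series([8507, 8532, 9999])
labels = {8507: "Male", 8532: "Female"}

# .replace() leaves unmapped values as they are
replaced = codes.replace(labels)
print(replaced.tolist())  # ['Male', 'Female', 9999]

# .map() turns unmapped values into NaN, which can then be filled explicitly
mapped = codes.map(labels)
mapped_filled = mapped.fillna("Other/Unknown")
print(mapped_filled.tolist())  # ['Male', 'Female', 'Other/Unknown']
```

With .map() followed by .fillna(), unexpected codes end up under an explicit label rather than passing through as raw numbers.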
4. Merging DataFrames
The pd.merge() function joins data from two different tables using a common identifier like person_id.
Assume a second DataFrame (labs_df) is loaded:
| person_id | cholesterol |
|---|---|
| 1001 | 210 |
| 1004 | 185 |
| 1005 | 225 |
# Perform an inner merge: only includes IDs present in BOTH tables (1001, 1004)
merged_df = pd.merge(
    df,        # Left DF (demographics)
    labs_df,   # Right DF (labs)
    on='person_id',
    how='inner'  # Change to 'left', 'right', or 'outer' as needed
)
print(merged_df)
Output (merged_df):
person_id age_at_enrollment gender_concept_id blood_pressure gender_label cholesterol
0 1001 65 8507 130 Male 210
1 1004 35 8532 110 Female 185
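To see how the how argument changes the result, the sketch below repeats the merge as a left join and an outer join on a trimmed-down copy of the example tables. The indicator=True option adds a _merge column that records where each row came from. The names left_df and outer_df are assumptions for this example.

```python
import pandas as pd

# Trimmed-down copies of the two example tables
df = pd.DataFrame({
    "person_id": [1001, 1002, 1003, 1004],
    "age_at_enrollment": [65, 42, 71, 35],
})
labs_df = pd.DataFrame({
    "person_id": [1001, 1004, 1005],
    "cholesterol": [210, 185, 225],
})

# Left merge: keep every row of df; cholesterol is NaN where no lab row exists
left_df = pd.merge(df, labs_df, on="person_id", how="left", indicator=True)
print(left_df)

# Outer merge: keep every ID that appears in either table
outer_df = pd.merge(df, labs_df, on="person_id", how="outer")
print(len(outer_df))  # 5 distinct person_id values across both tables
```

In the left merge, participants 1002 and 1003 are marked left_only and have no cholesterol value, while 1005 is dropped because it appears only in labs_df.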
5. Descriptive Statistics and Visualization
Basic Statistics
Use .groupby() to calculate statistics per category.
# Mean age grouped by the new gender label
mean_age_by_gender = df.groupby('gender_label')['age_at_enrollment'].mean()
print("Mean Age by Gender:\n", mean_age_by_gender, sep="")
# Count of participants in each gender group
gender_counts = df['gender_label'].value_counts()
print("\nGender Counts:\n", gender_counts, sep="")
Output:
Mean Age by Gender:
gender_label
Female 38.5
Male 68.0
Name: age_at_enrollment, dtype: float64
Gender Counts:
gender_label
Male 2
Female 2
Name: count, dtype: int64
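Several statistics can be computed per group in one call with .agg(). The sketch below rebuilds the relevant columns inline; the name age_summary is an assumption for this example.

```python
import pandas as pd

# Relevant columns from the example data, with labels already recoded
df = pd.DataFrame({
    "person_id": [1001, 1002, 1003, 1004],
    "age_at_enrollment": [65, 42, 71, 35],
    "gender_label": ["Male", "Female", "Male", "Female"],
})

# .agg() takes a list of statistic names and returns one column per statistic
age_summary = df.groupby("gender_label")["age_at_enrollment"].agg(
    ["mean", "min", "max", "count"]
)
print(age_summary)
```

The result is a DataFrame indexed by gender_label, so a single value can be read with, for example, age_summary.loc["Male", "mean"].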
Simple Visualization
Pandas has a built-in .plot() method that uses Matplotlib for quick visualizations.
import matplotlib.pyplot as plt
# Bar plot of participant counts by gender
df['gender_label'].value_counts().plot(
    kind='bar',
    title='Participant Count by Gender',
    xlabel='Gender',
    ylabel='Count',
    rot=0
)
# Save the plot to a file
plt.savefig('gender_counts_bar.png')
plt.close()  # Close the figure to free memory and avoid displaying it inline
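Because .plot() returns a Matplotlib Axes object, the figure can be customized further with regular Matplotlib calls. The sketch below plots mean age per group and, as an assumption made to keep the example free of file writes, saves the image to an in-memory buffer instead of a file. The non-interactive "Agg" backend is also an assumption so the code runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Relevant columns from the example data
df = pd.DataFrame({
    "age_at_enrollment": [65, 42, 71, 35],
    "gender_label": ["Male", "Female", "Male", "Female"],
})

# Mean age per gender label, plotted as a bar chart
mean_age = df.groupby("gender_label")["age_at_enrollment"].mean()
ax = mean_age.plot(kind="bar", rot=0)

# The returned Axes accepts the usual Matplotlib customizations
ax.set_title("Mean Age at Enrollment by Gender")
ax.set_xlabel("Gender")
ax.set_ylabel("Mean age (years)")

# Save to an in-memory buffer; pass a filename instead to write a PNG file
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
plt.close(ax.figure)

png_bytes = buffer.getvalue()
print(len(png_bytes) > 0)
```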