Team HealthCAir

Overview

Air pollution is a pervasive, global issue worsening due to climate change, increasing population, deforestation, and other causes. Among the many types of air pollution, PM2.5 is known to be the most deadly (Cardenas 2021) and contributed to over 4 million deaths worldwide in 2019 (GBD 2019 Risk Factor Collaborators). PM2.5 consists of fine particulate matter smaller than 2.5 microns and poses significant health effects as it can enter the bloodstream.

Prior studies have found that proximity to sources of pollution is associated with increased risk of childhood cancer, cardiovascular and respiratory illnesses, end-stage renal disease, and diabetes (Brender, Maantay, and Chakraborty, 2011), but the breadth and extent of contemporaneous and long-term impacts of PM2.5 exposure on chronic health conditions among children is not well-understood.

We develop a causal model for estimating the effects of pollution exposure on pediatric health outcomes using variation in local air pollution driven by wind speed, wind direction, pollution emitted by pollution sites, and proximity of pollution sites to schools. We study these effects among the pediatric population in the state of California during the years 2002 through 2017.

Project Aims

We aimed to examine the link between PM2.5 pollution exposure and adverse pediatric health outcomes throughout the state of California from 2002 to 2017.

Aim 1: Issues of air quality can affect all children. However, some may be disproportionately impacted by having high concentrations of PM2.5 at or near their school. To understand this issue better, we built unique high-quality data of PM2.5 pollution exposure at every school in CA.

Aim 2: We established an empirical relationship between PM2.5 levels recorded at schools in CA and being upwind from key point sources of pollution.

Aim 3: We estimated the causal effect of a school being located downwind from key pollution sources on select chronic health conditions among children under the age of 19.

Aim 4: We conducted a counterfactual study to compare and quantify the benefits of various air pollution mitigation strategies.

Datasets

Wind

Wind data forms the basis for constructing our instrument, which predicts the amount of PM2.5 at each school. We find that as wind blows from a pollution source towards a school, we see higher levels of PM2.5 measurements.

The wind data used in our study comes from ERA5, which contains hourly wind observations measured every 31 km across the globe. Each measurement provides information such as wind direction and speed. For each school and pollution point source, we find the nearest wind measurement and use this information to predict PM2.5 levels at each school.

This diagram to the right shows a sample of a few hours of wind measurements across California. Faster wind speeds are represented by larger wind arrows, and purple points on the map represent each school used in our study.

Medical Diagnoses

Pediatric medical diagnosis data come from the California Department of Public Health (CDPH) and the California Office of Statewide Health Planning and Development (OSHPD). This dataset contains HIPAA-protected medical data on diagnoses and procedures for each hospital stay and ER visit in the state of California from 2000 through 2018. For our project, we aggregate these data to month and zip code levels and aim to predict the proportion of the population that will experience certain medical diseases.

Because we are not medical specialists, we consulted with medical doctors to better understand which pediatric health outcomes would be plausible to have causal links with PM2.5 exposure. Following research and discussion, we decided to focus on the following pediatric medical conditions in our study:

Hematopoietic cancer
Type 1 diabetes
All malignant cancers
All non-blood malignant cancers
Cardiorespiratory cancer
Blood diseases
Blood vessel diseases
Both blood and blood vessel diseases
Injuries (used as a control diagnosis for which we expect to see no statistical significance)

PM2.5 Readings

PM2.5 data come from NASA MODIS, MISR, and SeaWIFS instruments and were assembled by researchers at Washington University in St. Louis. The data are collected using chemical transport modeling, satellite remote sensing, and ground-based monitoring.

This data visualization shows the average PM2.5 levels from 2000 through 2017. Only the zip codes which had an open school during the time period of our study (2002-2017) were considered in our analysis and are displayed in the graph. We observe that PM2.5 levels are generally the highest in Los Angeles County with average PM2.5 levels as high as 18 tons per year. Similarly, the Central Valley of California has also seen high levels of PM2.5 with average PM2.5 levels in the range of 12-17 tons per year.

Pollution Sources

In order to predict the levels of PM2.5 at each school, we looked at the 15 point sources of pollution geographically nearest to each school and the amount of PM2.5 emitted from each of those point sources. These data come from the National Emissions Inventory (NEI) provided by the Environmental Protection Agency (EPA), which contains information on stationary pollution point sources and the amount of PM2.5 emitted measured in tons per year (TPY).

The NEI list contains a wide variety of pollution point sources and the amounts of PM2.5 emitted. These sources are primarily large industrial facilities, power plants, airports, and commercial facilities. Some polluters emit over a thousand tons of PM2.5 per year, while others emit a very insignificant amount. To ensure that we adequately capture significant polluters, we filter the dataset to use only pollution point sources with PM2.5 emissions above 4 TPY.

The data visualization to the right maps these polluters, with larger dots representing those emitting higher amounts of PM2.5 each year. You can click and drag on the map to select a subset of polluters, and the histogram will update to represent the selected subset.

Schools

The California schools data come from both the California Department of Education and the California School Campus Database (CSCD), which supplemented the state data with better campus centroid information. The data set contains over 10,000 California schools ranging from kindergartens to high schools. Each record contains information such as the date the school opened, its closing date if applicable, school district, address, and grades offered.

We made the simplifying assumption that every child attends a school in the same zip code as their home and is exposed to the same amount of PM2.5 pollution. Our analysis uses the PM2.5 exposures measured at each school to predict the diagnosis rate of selected medical outcomes.

Population

Population data come from the U.S. Census Bureau. The Census Bureau performs an official census every 10 years. Thus, we used these official census counts for 2000 and 2010. Between the years 2001 through 2009, we linearly interpolated the populations for each individual zip code assuming a constant growth rate. For the years 2011 through 2017, we used estimated census counts calculated by the American Community Survey (ACS). The ACS contacts over 3.5 million US households every year and uses these results to estimate the population by zip code.

We use the population data in each zip code to convert the sum of medical diagnoses to a per capita rate. This allows us to compare the diagnosis rates across zip codes with varying population levels.

Combining Datasets

After collecting and cleaning data from our many sources, we joined them together so we could find nearby pollution point sources and wind measurements to each school. This map displays a snapshot of just one hour of data. Cyan points represent schools. Pollution sources are in red, with larger circles representing higher amounts of PM2.5. Gray arrows represent wind direction, with larger arrows representing higher wind speeds.

The schools, pollution sources, and PM2.5 levels can be digested more easily by viewing separate maps side-by-side. This visualization shows these quantities by zip code.

Modeling

Instrumental Variables Regression

We model each of our health outcomes using a two-stage least squares instrumental variables (2SLS IV) regression. The two modeling stages are as follows:

Model PM2.5 using our instrument and fixed effects:

\[ \widehat{PM}_{2.5}=\beta_0 + \beta_1 \kern 0.05em I + \beta_2 \kern 0.05em year + \beta_3 \kern 0.05em month + \beta_4 \kern 0.05em county + \beta_5 \kern 0.05em temperature \]

\[ \kern 1.5em + \beta_6 \kern 0.05em elevation + \beta_7 \kern 0.05em income + \beta_8 \kern 0.05em population + \beta_9 \kern 0.05em wspd_{avg} + \epsilon \]

Model the 12-month change in medical diagnoses using this predicted PM2.5 and same fixed effects:

\[ Y = \Delta \frac{\sum Diagnoses}{Pop_{zipcode, 0-19}} = \delta_0 + \delta_1 \kern 0.05em \widehat{PM}_{2.5} + \delta_2 \kern 0.05em year + \delta_3 \kern 0.05em month + \delta_4 \kern 0.05em county + \delta_5 \kern 0.05em temperature \]

\[ \kern 6em + \delta_6 \kern 0.05em elevation + \delta_7 \kern 0.05em income + \delta_8 \kern 0.05em population + \delta_9 \kern 0.05em wspd_{avg} + \epsilon \]

\( \widehat{PM}_{2.5} \) = the predicted PM2.5 9-month average for every zip code and month
\( \beta_i \) = the first stage regression coefficient for variable i
\( \delta_i \) = the second stage regression coefficient for variable i

The medical diagnoses in our study are likely to develop over long term exposures to PM2.5. This is why we used a rolling 9 month average of PM2.5 to predict each medical diagnosis. The medical outcomes are measured as the change in the number of diagnoses per 1,000 children over a 12-month period, where the medical counts in a given month are measured as the sum of total diagnoses over a 3-month period. We applied this regression framework to each of our ten medical outcomes of interest.

The instrument, \(I\), is calculated using several different variables involving wind, pollution, and distance variables. We considered the nearest 15 pollution point sources to each school, and for each, we used their PM2.5 emissions measured in tons per year (TPY), distance to the school, wind speed, and wind direction to compute this instrument:

\[ I_{zmy} = \sum_{ps^{*}=1}^{15} \sum_{d_{m}=1}^{D_{m}}\sum_{H_{d}=1}^{24}\theta_{downstream_{zd_{m}}} \times \widetilde{TPY}\kern -0.2em_{ps} \times S_{zd_{m}} \]

\( zmy \) = zip code, month, and year combination
\( ps^{*} \) = pollution point source prefiltered to only include those with average TPY levels greater than 4
\( D_m \) = number of days in month
\( d_m \) = day of month
\( H_d \) = hour of day
\( \theta_{downstream_{zd_{m}}} \) = wind bearing in relation to each school
\( \widetilde{TPY}\kern -0.2em_{ps} \) = point source pollution output (TPY), scaled using standard scaler, followed by min-max scaler
\( S_{zd_m} \) = wind speed

Instrument Demonstration

We have included a video which helps to illustrate this instrument calculation and how it was used to predict PM2.5 levels at each school.

Results

Stage 1 Regression

The first stage regression model predicts a rolling 9-month average PM2.5 value for each zip code and month. While this model includes several fixed effects for its predictions, the primary variable of interest in this equation is our instrument, I. Because our instrument combines 4 different variables, its unit of measurement is uninterpretable. However, the low p-value indicates that the model is confident that the true impact of our instrument on PM2.5 is not 0.

To examine the strength of our instrument, we ran an F-test to measure the confidence that our instrument was able to improve PM2.5 predictions over using the fixed effects alone. Generally, an instrument is considered to be sufficiently strong if the F-statistic is greater than 10. With an F-statistic of over 1,000, we can be confident that our instrument is helping us to predict PM2.5 values more accurately.

Instrument Coefficient	p-value	95% Confidence Interval
-0.000107	4.02 e-07	[-0.00015, -0.00007]

F-Statistic	p-value
1314.9	3.41 e-287

Stage 2 Regression

The medical outcomes are measured as the change in the number of diagnoses per 1,000 children over a 12-month period. The results of the stage 2 regression are in the table below. For each medical outcome, we have included the summary statistics for both the predicted PM2.5 value (\( \widehat{PM}_{2.5} \)) and the true PM2.5 value (\( PM_{2.5} \)).

We observe that \( \widehat{PM}_{2.5} \) is a statistically significant predictor of blood vessel diseases and all malignant cancers at the 95% confidence interval level. As an example, we can interpret the coefficient for blood vessel diseases to be as follows: as the rolling 9-month average PM2.5 rises by 1 TPY, we expect the annual change in diagnoses of blood vessel diseases to increase by 0.0072 cases per 1,000 children. The model is 95% confident that the true impact of PM2.5 lies between 0.0019 and 0.0135.

The statistics under the \( PM_{2.5} \) section show the results of the OLS regression using the same fixed effects as before, but by substituting the predicted PM2.5 with the true PM2.5 value. Doing so produces a much more biased estimate of the impact of PM2.5 than the instrumental variable results. Many coefficients from this regression are not statistically significant and have wide confidence intervals.

Disease Category	Summary Statistics for \( \widehat{PM}_{2.5} \)			Summary Statistics for \( {PM}_{2.5} \)
Disease Category	Coefficient	p-value	95% Confidence Interval	Coefficient	p-value	95% Confidence Interval
Blood Vessel Diseases	0.00772	0.00993	[0.0019, 0.0135]	0.00035	0.825	[-0.0028, 0.0035]
All Malignant Cancers	0.00788	0.0372	[0.00048, 0.0153]	0.000677	0.145	[-0.0024, 0.0016]
Cardioresp. Cancer	0.00250	0.412	[-0.0037, 0.0086]	0.00078	0.459	[-0.0014, 0.0029]
Type 1 Diabetes	0.00383	0.475	[-0.0069, 0.0145]	0.00038	0.195	[-0.0020, 0.0095]
Non-Blood Cancers	-0.00129	0.718	[-0.0084, 0.0058]	0.00028	0.183	[-0.0014, 0.0070]
Blood Cancers	0.00782	0.227	[-0.0050, 0.021]	0.00517	0.148	[-0.0019, 0.0012]
Blood Diseases	-0.00530	0.131	[-0.0122, 0.0016]	0.00291	0.234	[-0.0019, 0.0078]
Blood & B.V. Diseases	-0.00121	0.750	[-0.0088, 0.0064]	0.00238	0.369	[-0.0029, 0.0077]
Injuries/Accidents	0.01039	0.868	[-0.114, 0.135]	0.00795	0.811	[-0.053, 0.069]

Counterfactual Analysis

After determining which medical outcomes are most related to PM2.5 through the instrumental variables regression framework, we conducted a counterfactual analysis to see how the predicted number of medical diagnoses would differ under various levels of PM2.5. The purpose of the counterfactual analysis is to show how the predicted medical outcome would change under scenarios with lower PM2.5 values. We chose to conduct this counterfactual analysis on blood vessel disease because this medical outcome showed statistically significant results from our instrumental variables regression.

In order to make these predictions for the counterfactual analysis, we used XGBoost. XGBoost allows us to model the nonlinear relationships between our variables, which provides an advantage over OLS. We first used XGBoost to predict PM2.5 values in the first stage regression, with the same fixed effects that we used in our two-stage OLS model. Next, we used these predicted PM2.5 values with the same fixed effects to train an XGBoost model to predict blood vessel disease. Then, we altered the test data by decreasing the PM2.5 values by certain percentages to see how the blood vessel disease cases would change.

The choropleth maps to the right show the PM2.5 levels in red. The zip codes that show the biggest decreases in blood vessel outcomes as PM2.5 levels decrease are highlighted in blue.

Counterfactual Changes in Average Health Outcomes

Stanford University Maternal and Child Health Research Institute Symposium

stanford_mchri

Our team had the opportunity to present our research at Stanford University's Maternal and Child Health Research Institute (MCHRI) Symposium. While there, we spoke to researchers and gathered feedback about our results. We are grateful to our advisors for recommending that we submit our work to the symposium!

Resources

Team

Trevor Johnson

trevorj@berkeley.edu

Matt Lyons

mattslyons@berkeley.edu

Anand Patel

anand.patel@berkeley.edu

Michelle Shen

michelleshen@berkeley.edu

Acknowledgments

Several subject matter experts and professors were instrumental in aiding the success of our project. We'd like to thank the following people:

Cornelia Ilin, Ph.D. for the project ideation, continual guidance throughout the process, and for encouraging us to submit our abstract to Stanford University's Maternal and Child Health Research Institute Poster Session.
Alberto Todeschini, Ph.D. for his guidance and for encouraging us to present at UC Berkeley's AI Research (BAIR).
Dr. Matthew Meyer and Dr. Kyle Enfield of UVA Health for helping us narrow down our list of pediatric medical diagnoses to study in our research.
Robert Davis, Ph.D. for sharing his expertise in bioclimatology and for helping us understand the factors for spreading PM2.5.

HealthCAir

Quantifying the impact of air pollution on pediatric health outcomes