Enhancing data quality: Integration of internal and external data sources

Researcher(s)

  • Araf Jahin, Computer Science, University of Delaware

Faculty Mentor(s)

  • Yuchen Zhang, Helen F. Graham Cancer Center & Research Institute, ChristianaCare

Abstract

This study aims to enhance the data repository that brings together social determinants of health, clinical, and environmental data. The generated data will be integrated into the ChristianaCare centralized database to support future clinical research and hospital quality improvement efforts. The project uses the R programming language to build data frames, maps, and visualizations from United States Census Bureau, American Community Survey (ACS), and United States Environmental Protection Agency (EPA) data for use in healthcare research at ChristianaCare. From the 2010 decennial census, data was extracted by sex and age category for all counties in Delaware, and nine metrics were created at the tract and block group levels. Maps of the tracts in New Castle County were produced to visualize the data. The 2019 ACS data was used to develop a similar data frame for tracts, which required additional aggregation because ACS and decennial data are structured differently. EPA data was also retrieved and modified to create tools for brownfields, Toxics Release Inventory (TRI) sites, and National Air Toxics Assessment (NATA) data in New Castle County. Brownfields in the county were mapped from a data frame, with differently colored points representing the generator location. TRI sites were also mapped, with point size and color indicating the amount of toxic waste released. NATA data was used to generate two maps showing area-level data, as opposed to point-level data: the first illustrates population density and the second indicates total cancer risk. The R script developed makes the extraction, transformation, and visualization of these data sources reproducible and automated. Altogether, this project’s integration of internal and external data sources contributes to enhancing data quality and informs future research and quality improvement efforts at ChristianaCare.
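
The aggregation step mentioned for the ACS data can be illustrated with a minimal base R sketch. It is not the project's script. The tract identifiers, column names, counts, and the `aggregate_bands` helper are all hypothetical, chosen only to show how fine-grained sex-by-age columns might be rolled up into broader categories so that an ACS-style table lines up with decennial-style metrics.

```r
# Hypothetical ACS-style tract table: one row per tract, one column per
# sex-by-age band. Identifiers, column names, and counts are invented.
acs <- data.frame(
  GEOID        = c("tract_A", "tract_B"),
  male_0_4     = c(50, 30),
  male_5_9     = c(40, 35),
  male_10_14   = c(45, 20),
  male_15_17   = c(25, 15),
  female_0_4   = c(48, 28),
  female_5_9   = c(42, 33),
  female_10_14 = c(44, 22),
  female_15_17 = c(27, 14),
  stringsAsFactors = FALSE
)

# Assumed mapping from each broader metric to the fine-grained columns
# that roll up into it.
groups <- list(
  male_under_18   = c("male_0_4", "male_5_9", "male_10_14", "male_15_17"),
  female_under_18 = c("female_0_4", "female_5_9", "female_10_14", "female_15_17")
)

# Hypothetical helper: sum each group of columns row by row, keeping the
# tract identifier so the result can be joined to other tract-level data.
aggregate_bands <- function(df, groups, id_col = "GEOID") {
  out <- df[, id_col, drop = FALSE]
  for (name in names(groups)) {
    out[[name]] <- unname(rowSums(df[, groups[[name]], drop = FALSE]))
  }
  out
}

tract_metrics <- aggregate_bands(acs, groups)

# A combined metric built from the two aggregated columns.
tract_metrics$total_under_18 <-
  tract_metrics$male_under_18 + tract_metrics$female_under_18

print(tract_metrics)
```

Keeping the column mapping in a named list separates the definition of each metric from the summing logic, so the same helper could in principle be reused with a different mapping for another survey year or geography.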