Addressing Missing Data in SDOH: Imputation and Translation
Presentation Type
Poster
Student
Yes
Abstract
Missing data in social determinants of health (SDOH) can hinder the effectiveness of statistical models aimed at understanding and addressing health disparities. This project focuses on testing and implementing methods for imputing SDOH data that are missing at random and for translating SDOH data that are missing by design. Several approaches, including Bayesian regression, linear regression, and predictive mean matching, were tested with the R package mice (Multivariate Imputation by Chained Equations) and evaluated on a training dataset. Each method was evaluated using root mean squared error (RMSE), correlation between the imputed and actual values, mean absolute percentage error (MAPE), and computation time. In terms of RMSE and correlation, no model consistently showed a significant advantage over the others. In terms of MAPE, the models using predictive mean matching were consistently better than those using Bayesian and linear regression. In terms of computation time, the Bayesian approach was the fastest, though not significantly faster than linear regression, and predictive mean matching took the longest. Because the goal of this project is to create values that can fill in missing data without changing the underlying patterns and relationships in the data, while still preserving its variability, predictive mean matching was determined to be the best imputation method for the missing-at-random SDOH data.
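As an illustration of the evaluation described above, the following is a minimal sketch in Python rather than in the R package used in the project. It assumes a single numeric variable with one predictor and synthetic data; the function names (`rmse`, `mape`, `correlation`, `pmm_impute`) and all parameter values are hypothetical. The matching step is a simplified form of predictive mean matching that omits the random draw of regression parameters that full multiple-imputation implementations typically include, so it should not be read as the project's actual procedure.

```python
import numpy as np


def rmse(actual, imputed):
    """Root mean squared error between true and imputed values."""
    actual, imputed = np.asarray(actual, float), np.asarray(imputed, float)
    return float(np.sqrt(np.mean((imputed - actual) ** 2)))


def mape(actual, imputed):
    """Mean absolute percentage error (undefined if any true value is zero)."""
    actual, imputed = np.asarray(actual, float), np.asarray(imputed, float)
    return float(np.mean(np.abs((imputed - actual) / actual)) * 100.0)


def correlation(actual, imputed):
    """Pearson correlation between true and imputed values."""
    return float(np.corrcoef(actual, imputed)[0, 1])


def pmm_impute(x, y, k=5, seed=0):
    """Simplified predictive mean matching; NaN entries of y are imputed."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    missing = np.isnan(y)
    observed = ~missing

    # Fit an ordinary least-squares line on the observed rows only.
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    predicted = design @ beta

    donor_pred, donor_y = predicted[observed], y[observed]
    imputed = y.copy()
    for i in np.flatnonzero(missing):
        # Take the k observed rows whose predictions are closest, then
        # borrow the observed value of one randomly chosen donor.
        nearest = np.argsort(np.abs(donor_pred - predicted[i]))[:k]
        imputed[i] = donor_y[rng.choice(nearest)]
    return imputed


# Synthetic example: hide about 20% of the values completely at random.
rng = np.random.default_rng(42)
x = rng.normal(50, 10, 200)
y_true = 20 + 2.0 * x + rng.normal(0, 5, 200)
missing = rng.random(200) < 0.2
y_masked = np.where(missing, np.nan, y_true)

y_imp = pmm_impute(x, y_masked)
print("RMSE:", rmse(y_true[missing], y_imp[missing]))
print("MAPE (%):", mape(y_true[missing], y_imp[missing]))
print("Correlation:", correlation(y_true[missing], y_imp[missing]))
```

Because each imputed value is copied from an observed donor, the imputations stay within the range of the real data, which is one reason predictive mean matching tends to preserve variability.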
ACKNOWLEDGMENTS: The research reported in this abstract was supported by South Dakota State University, AIM-AHEAD Coordinating Center, award number OTA-21-017, and was, in part, funded by the National Institutes of Health Agreement No. 1OT2OD032581. The work is solely the responsibility of the authors and does not necessarily represent the official view of AIM-AHEAD or the National Institutes of Health.
Start Date
2-7-2025 1:00 PM
End Date
2-7-2025 2:30 PM
Volstorff A