Understanding what puts Veterans at an increased risk for chronic diseases and illnesses is critical to protecting and treating the more than 22 million US Veterans alive today. To better understand the factors that impact Veteran health, the Veterans Health Administration Innovation Ecosystem (VHA IE) is using artificial data – also known as synthetic data – to measure a broad array of health risks, with the goal of improving Veteran care.
Many VHA initiatives are using synthetic data with promising results. Recently, VHA IE partnered with MDClone to generate synthetic data sets for use across VHA. Synthetic data is being used to improve patient care pathways in suicide prevention and chronic disease management, and to track trends in population health.
VHA IE’s partnership with PrecisionFDA used synthetic data to launch an AI challenge to create predictive models for COVID-19 in Veteran populations. For Veterans, work of this kind can lead to improved patient care and clinical outcomes, while keeping their personal health information secure.
What is synthetic data?
Synthetic data is not linked to real people, events, or circumstances. Instead, it is generated via computer programs that mirror real-world information by observing attributes of real data. Synthetic data sets can be reliable and accurate when they are based on trends and traits of real data.
Synthetic health data has all the characteristics of health records – such as information about blood pressure, diabetes, weight and illnesses – without personally identifiable information, like names, social security numbers and contact information. Synthetic data is different from de-identified data, which simply hides attached identities but still uses real Veteran information. De-identification has been shown to be time-consuming and difficult, and it may still carry the risk of re-identification with modern computing techniques. With synthetic data, there are no identities to hide.
While the algorithms that build synthetic data are quite complex, the concept is simple. Imagine a data set of one thousand heart rate readings, each attached to a real person. This is the original data set. To generate synthetic data, a computer algorithm will observe the trends of the original data set: it will look at the average heart rate, the trends in heart rates as associated with age and gender, and the ways the heart rates change over time. Then, it will generate artificial data based on characteristics of the real data. The result is a data set of heart rate readings that are not attached to any real identities but are clinically accurate and useful.
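For readers who want to see the heart rate example in concrete terms, here is a minimal Python sketch. It is an illustration only, not the method used by MDClone or VHA IE. It assumes three hypothetical age groups, assumes heart rates within each group follow a bell curve, and uses a simulated stand-in for the "original" data set (no real Veteran data is involved). The function names are made up for this example. The sketch covers only the average and the age-related trend; it leaves out gender and changes over time, which real tools also have to handle.

```python
import random
import statistics
from collections import defaultdict


def make_original(n=1000, seed=0):
    """Simulated stand-in for the 'original' data set.

    Returns n (age_group, heart_rate) pairs. The groups and baselines
    are assumptions chosen for illustration, not clinical values.
    """
    rng = random.Random(seed)
    baseline = {"18-39": 68.0, "40-64": 72.0, "65+": 75.0}
    groups = list(baseline)
    records = []
    for _ in range(n):
        group = rng.choice(groups)
        records.append((group, rng.gauss(baseline[group], 8.0)))
    return records


def fit_summary(records):
    """Observe trends: share, mean and spread of heart rate per age group."""
    by_group = defaultdict(list)
    for group, heart_rate in records:
        by_group[group].append(heart_rate)
    total = len(records)
    return {
        group: (len(values) / total,
                statistics.mean(values),
                statistics.stdev(values))
        for group, values in by_group.items()
    }


def generate_synthetic(summary, n, seed=1):
    """Draw new readings from the summary only; no original row is copied."""
    rng = random.Random(seed)
    groups = list(summary)
    weights = [summary[g][0] for g in groups]
    synthetic = []
    for _ in range(n):
        group = rng.choices(groups, weights=weights)[0]
        _, mean, spread = summary[group]
        synthetic.append((group, rng.gauss(mean, spread)))
    return synthetic


original = make_original()
summary = fit_summary(original)
synthetic = generate_synthetic(summary, n=1000)

# Compare the overall averages of the two data sets.
print(round(statistics.mean(hr for _, hr in original), 1))
print(round(statistics.mean(hr for _, hr in synthetic), 1))
```

The two printed averages should be close to each other, even though the second data set was built only from the summary statistics and none of its readings belongs to a record in the first.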
Why is VA using synthetic data?
VA protects Veteran health data and privacy, meaning there are multiple steps involved for anyone who wishes to use Veteran data for their work. This can often hamper innovation and on-the-ground development. For quality improvement leaders, researchers and medical providers who need to quickly access Veteran health information, especially those on the frontlines of the COVID-19 pandemic, synthetic data can be a reliable, private and quick alternative to real data.
VA is constantly looking forward, innovating with the goal of improving Veteran care. Synthetic data allows data scientists to simulate events or circumstances that have not yet happened in the real world but might in the future. Where real data does not exist, synthetic data can be used to model and test how different interventions may work if certain real-world events happen, like a future pandemic.
As VA continues to innovate using synthetic data, there will be greater opportunities to partner with health technology and research companies to find new ways to train VA providers and improve Veteran health care. Those interested in learning more about the promise of synthetic data to improve Veteran care are encouraged to visit https://precision.fda.gov/challenges/11.
Amanda Purnell is a VHA senior innovation fellow who helped coordinate the 2020 Data Summit for Suicide Prevention.