Synthea Synthetic Data Overview

Learning objectives
  1. What is synthetic data?
    Synthetic data is data that has been artificially generated (“synthesized”) by a computer or person rather than collected from real events.
  2. What is Synthea?
    Synthea is a software tool that generates realistic but not real synthetic electronic health records.
Relevant roles:
  • Investigator
  • Informaticist
  • Software Engineer
  • Clinician Scientist/Trainee

Accessing real healthcare data for research purposes is difficult for many reasons, including privacy, cost, and other restrictions.

These factors and more limit what researchers can do with real healthcare data.

Synthetic data is an alternative to real healthcare data that avoids these challenges. It is artificially generated, by computer or by hand, rather than collected from the real world. When using real healthcare data isn’t feasible due to privacy, cost, or other restrictions, synthetic data can often stand in for it.

Synthetic data is not deidentified data

Researchers often use real healthcare data that has had personal identifiers removed. This includes:

  • Deidentified data
  • Anonymized data
  • Pseudonymized data

Because this is real data, it is valuable for research. However, this data also carries the risk of re-identification.1

In contrast, synthetic data is constructed so there is no privacy risk. When no individual’s data was used to create a dataset, no individual’s data can be in the dataset.

1 Synthea

Synthea™ is a synthetic data generator that models the life and medical history of synthetic patients. It creates realistic, but not real, synthetic electronic health records. The records are intended to be realistic at the individual level and population level.

Synthea is open source and built from publicly available information, so the resulting records are free of cost and free of privacy restrictions.

Basic Data Architecture of Synthea, showing how various sources of data are used to create a synthetic population

Synthea starts with demographic information for a region based on the US Census. Using these demographics, Synthea randomly creates individuals with realistic race, sex, target age, etc., for the region.

Synthea simulates each individual independently from birth until their death or the current day. As each individual lives out their synthetic life, they flow through disease modules that represent the progression and treatment of various diseases. Disease modules are built from publicly available incidence and prevalence statistics, along with care guidelines from medical institutions. No real person’s data is ever used to create a Synthea module.

Once the simulation is complete, the patient record is exported into industry-standard formats such as FHIR®, C-CDA®, CSV, or plain text.

{
  "resourceType": "Patient",
  "id": "2497ee24-c7c3-5d9a-3425-85da2e9e8b23",
  "meta": {
    "profile": [ "http://hl7.org/fhir/us/core/StructureDefinition/us-core-patient" ]
  },
  "text": {
    "status": "generated",
    "div": "<div xmlns=\"http://www.w3.org/1999/xhtml\">Generated by <a href=\"https://github.com/synthetichealth/synthea\">Synthea</a>.Version identifier: v3.1.0-354-g3a6a93487\n .   Person seed: -2317076407365535282  Population seed: 123</div>"
  },
  "extension": [ {
    "url": "http://hl7.org/fhir/us/core/StructureDefinition/us-core-race",
    "extension": [ {
      "url": "ombCategory",
      "valueCoding": {
        "system": "urn:oid:2.16.840.1.113883.6.238",
        "code": "2106-3",
        "display": "White"
      }
    }, {
      "url": "text",
      "valueString": "White"
    } ]
  }, {
    "url": "http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity",
    "extension": [ {
      "url": "ombCategory",
      "valueCoding": {
        "system": "urn:oid:2.16.840.1.113883.6.238",
        "code": "2186-5",
        "display": "Not Hispanic or Latino"
      }
    }, {
      "url": "text",
      "valueString": "Not Hispanic or Latino"
    } ]
  }, {
    "url": "http://hl7.org/fhir/StructureDefinition/patient-mothersMaidenName",
    "valueString": "Nadine465 Wunsch504"
  }, {
    "url": "http://hl7.org/fhir/us/core/StructureDefinition/us-core-birthsex",
    "valueCode": "M"
  }, {
    "url": "http://hl7.org/fhir/StructureDefinition/patient-birthPlace",
    "valueAddress": {
      "city": "Boston",
      "state": "Massachusetts",
      "country": "US"
    }
  }, {
    "url": "http://synthetichealth.github.io/synthea/disability-adjusted-life-years",
    "valueDecimal": 0.0
  }, {
    "url": "http://synthetichealth.github.io/synthea/quality-adjusted-life-years",
    "valueDecimal": 18.0
  } ],
  "identifier": [ {
    "system": "https://github.com/synthetichealth/synthea",
    "value": "2497ee24-c7c3-5d9a-3425-85da2e9e8b23"
  }, {
    "type": {
      "coding": [ {
        "system": "http://terminology.hl7.org/CodeSystem/v2-0203",
        "code": "MR",
        "display": "Medical Record Number"
      } ],
      "text": "Medical Record Number"
    },
    "system": "http://hospital.smarthealthit.org",
    "value": "2497ee24-c7c3-5d9a-3425-85da2e9e8b23"
  }, {
    "type": {
      "coding": [ {
        "system": "http://terminology.hl7.org/CodeSystem/v2-0203",
        "code": "SS",
        "display": "Social Security Number"
      } ],
      "text": "Social Security Number"
    },
    "system": "http://hl7.org/fhir/sid/us-ssn",
    "value": "999-32-6148"
  }, {
    "type": {
      "coding": [ {
        "system": "http://terminology.hl7.org/CodeSystem/v2-0203",
        "code": "DL",
        "display": "Driver's license number"
      } ],
      "text": "Driver's license number"
    },
    "system": "urn:oid:2.16.840.1.113883.4.3.25",
    "value": "S99930905"
  } ],
  "name": [ {
    "use": "official",
    "family": "Stoltenberg489",
    "given": [ "Mitchell808" ],
    "prefix": [ "Mr." ]
  } ],
  "telecom": [ {
    "system": "phone",
    "value": "555-726-6485",
    "use": "home"
  } ],
  "gender": "male",
  "birthDate": "2004-05-11",
  "address": [ {
    "extension": [ {
      "url": "http://hl7.org/fhir/StructureDefinition/geolocation",
      "extension": [ {
        "url": "latitude",
        "valueDecimal": 42.40293333299843
      }, {
        "url": "longitude",
        "valueDecimal": -71.68746648659892
      } ]
    } ],
    "line": [ "352 Bailey Neck Apt 40" ],
    "city": "Clinton",
    "state": "MA",
    "postalCode": "01510",
    "country": "US"
  } ],
  "maritalStatus": {
    "coding": [ {
      "system": "http://terminology.hl7.org/CodeSystem/v3-MaritalStatus",
      "code": "S",
      "display": "Never Married"
    } ],
    "text": "Never Married"
  },
  "multipleBirthBoolean": false,
  "communication": [ {
    "language": {
      "coding": [ {
        "system": "urn:ietf:bcp:47",
        "code": "en-US",
        "display": "English (United States)"
      } ],
      "text": "English (United States)"
    }
  } ]
}
{
  "resourceType": "Observation",
  "id": "f83286e0-5797-e51d-a7e6-6708d8085623",
  "status": "final",
  "category": [ {
    "coding": [ {
      "system": "http://terminology.hl7.org/CodeSystem/observation-category",
      "code": "laboratory",
      "display": "Laboratory"
    } ]
  } ],
  "code": {
    "coding": [ {
      "system": "http://loinc.org",
      "code": "4548-4",
      "display": "Hemoglobin A1c/Hemoglobin.total in Blood"
    } ],
    "text": "Hemoglobin A1c/Hemoglobin.total in Blood"
  },
  "subject": {
    "reference": "urn:uuid:3daf29a9-f7b1-9d9f-45ba-4be258308a75"
  },
  "encounter": {
    "reference": "urn:uuid:91fe93c0-52ae-98bc-e14c-29df89c8119d"
  },
  "effectiveDateTime": "2013-10-05T08:13:20-04:00",
  "issued": "2013-10-05T08:13:20.014-04:00",
  "valueQuantity": {
    "value": 6.38,
    "unit": "%",
    "system": "http://unitsofmeasure.org",
    "code": "%"
  }
}
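
Exported resources like the ones above are plain JSON, so they can be read with standard tooling. The following is a minimal Python sketch, not part of Synthea itself, that pulls a few fields out of the Observation shown above. The helper name summarize_observation and the trimmed copy of the resource are assumptions made for illustration.

```python
import json

# A trimmed copy of the Observation above, keeping only the fields read below.
OBSERVATION_JSON = """
{
  "resourceType": "Observation",
  "status": "final",
  "code": {
    "coding": [ {
      "system": "http://loinc.org",
      "code": "4548-4",
      "display": "Hemoglobin A1c/Hemoglobin.total in Blood"
    } ]
  },
  "effectiveDateTime": "2013-10-05T08:13:20-04:00",
  "valueQuantity": { "value": 6.38, "unit": "%" }
}
"""


def summarize_observation(resource):
    """Return a flat summary of one Observation (hypothetical helper)."""
    coding = resource["code"]["coding"][0]
    # valueQuantity is optional in FHIR, so fall back to an empty dict.
    quantity = resource.get("valueQuantity", {})
    return {
        "loinc": coding["code"],
        "name": coding["display"],
        "date": resource["effectiveDateTime"][:10],  # keep the YYYY-MM-DD part
        "value": quantity.get("value"),
        "unit": quantity.get("unit"),
    }


summary = summarize_observation(json.loads(OBSERVATION_JSON))
print(summary)
```

The same approach should work on a full exported resource, since the sketch only reads fields that appear in the example above.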

This page gives an overview of Synthea’s key features. If you’d like more information, visit the Synthea wiki on GitHub.

2 Generic Modules

At Synthea’s core is a set of disease modules, representing the progression and treatment of various conditions. Below is a small snippet of the Appendicitis module.

A section of the Synthea Appendicitis module, demonstrating various state types and transitions available in Synthea modules

Disease modules are state transition machines where each individual flows through the modules based on logical conditions and weighted randomness. Behind the scenes, modules are stored as JavaScript Object Notation (JSON) files, but nearly all users view or edit modules exclusively using a graphical interface called the Synthea Module Builder.

Every synthetic patient starts in each module’s Initial state at birth and immediately begins progressing through the module’s states. Each state represents a spot where something happens. States fall into two broad categories.

Control states drive the flow of a patient through the module, for example:

  • Delay: Wait a certain amount of time before progressing, commonly used to represent how the risk of certain conditions changes with age.
  • Guard: Wait until given criteria become true before progressing.

Clinical states add entries to a patient’s health record, for example:

  • ConditionOnset: Represent the spot where the patient acquires a given condition, not necessarily where it is diagnosed.
  • Procedure: Represent the point in time in a healthcare encounter that a procedure is performed.

For a full list of state types, see the Synthea wiki.

Each state has a transition, which points to the state the patient will progress to next:

  • Direct transitions: Point to a single state.
  • Distributed transitions: Point to multiple states, each with a weighted probability. A patient progresses to a randomly chosen state.
  • Conditional transitions: Include logical rules showing which path to follow.
  • Complex transitions: Are a combination of conditional and distributed transitions.

Modules will run until either the simulation ends (at patient death or when it reaches the current date) or until the module reaches a Terminal state.2
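
The states and transitions described above can be sketched as a toy interpreter in Python. This is an illustration under stated assumptions, not Synthea’s engine. The module contents, state names, and probabilities are invented. The keys are loosely modeled on Synthea’s module JSON and should be treated as assumptions, and "years" is a simplification invented for this sketch. Only direct and distributed transitions are handled. Guard states and conditional or complex transitions are left out.

```python
import random

# Toy module: state names and probabilities are invented for illustration.
TOY_MODULE = {
    "Initial": {"type": "Initial", "direct_transition": "Wait_For_Onset"},
    "Wait_For_Onset": {
        "type": "Delay",
        "years": 5,  # simplified delay field, an assumption of this sketch
        "distributed_transition": [
            {"distribution": 0.3, "transition": "Condition_Onset"},
            {"distribution": 0.7, "transition": "Terminal"},
        ],
    },
    "Condition_Onset": {"type": "ConditionOnset", "direct_transition": "Surgery"},
    "Surgery": {"type": "Procedure", "direct_transition": "Terminal"},
    "Terminal": {"type": "Terminal"},
}


def next_state(state, rng):
    """Follow a direct transition, or pick a weighted random target."""
    if "direct_transition" in state:
        return state["direct_transition"]
    options = state["distributed_transition"]
    targets = [option["transition"] for option in options]
    weights = [option["distribution"] for option in options]
    return rng.choices(targets, weights=weights, k=1)[0]


def run_module(module, rng):
    """Walk one patient from Initial to Terminal; return (age, record)."""
    name, age, record = "Initial", 0, []
    while True:
        state = module[name]
        kind = state["type"]
        if kind == "Terminal":
            return age, record
        if kind == "Delay":
            # Control state: time passes, nothing is written to the record.
            age += state["years"]
        elif kind in ("ConditionOnset", "Procedure"):
            # Clinical state: add an entry to the patient's record.
            record.append((age, kind, name))
        name = next_state(state, rng)


age, record = run_module(TOY_MODULE, random.Random(123))
print(age, record)
```

Passing a seeded random.Random makes each run repeatable, which keeps a sketch like this easy to test.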

Combining these simple concepts allows module developers to build robust and detailed models of disease progression and treatment.

Because Synthea is open source and accepts contributions from a global user base, the level of detail varies across modules. For instance, the Appendicitis module was the first to be created, and its level of detail is minimal. On the other hand, the COVID-19 module and its submodules were designed to replicate the disease’s progression as closely as possible, and together they are probably the largest and most detailed module.

The number of disease modules is also limited. Early efforts focused on the “top ten” causes of premature death and reasons people see their primary care provider. Later contributions have added a large number of modules representing common conditions, but rarer and more complex conditions may not be represented.

The Synthea community encourages and welcomes users to create new modules representing conditions of interest or to improve the detail and realism of existing modules.

You can view, modify, and create Synthea modules with no programming experience using the Synthea Module Builder. For more information, read Customizing Synthea. There is also a short video introduction to the Module Builder, and a tutorial on the Synthea Wiki.

3 FHIR Resources

Synthea generates basic FHIR resources: it includes required fields but rarely populates optional fields. If you require fields that Synthea doesn’t populate, you can customize Synthea to add those fields. Read Customizing Synthea for more information.

By default, Synthea exports one file per patient, as a Bundle with type: transaction. This Bundle contains a single Patient resource as the first entry, followed by other patient-specific resources such as Conditions, Observations, Procedures, etc., roughly grouped by Encounter in chronological order. Synthea exports Organizations and Practitioners separately since these resources may be referenced by multiple patients’ resources. Synthea may also be configured to export FHIR Bulk Data.
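
As a rough illustration of the Bundle layout described above, the following Python sketch counts the resource types in a Bundle and checks that the first entry is a Patient. The Bundle here is a small hand-written stand-in with invented ids, not real Synthea output, and the helper names are hypothetical.

```python
from collections import Counter

# Hand-written stand-in for a patient Bundle; real exports are far larger.
bundle = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "patient-1"}},
        {"resource": {"resourceType": "Encounter", "id": "encounter-1"}},
        {"resource": {"resourceType": "Condition", "id": "condition-1"}},
        {"resource": {"resourceType": "Observation", "id": "observation-1"}},
        {"resource": {"resourceType": "Observation", "id": "observation-2"}},
    ],
}


def count_resource_types(bundle):
    """Count how many resources of each type the Bundle contains."""
    return Counter(
        entry["resource"]["resourceType"] for entry in bundle.get("entry", [])
    )


def first_patient(bundle):
    """Return the Patient resource, expected to be the first entry."""
    first = bundle["entry"][0]["resource"]
    if first["resourceType"] != "Patient":
        raise ValueError("expected the first entry to be a Patient resource")
    return first


counts = count_resource_types(bundle)
patient = first_patient(bundle)
print(patient["id"], dict(counts))
```

To try this on real output, replace the stand-in dictionary with the result of json.load on a patient file that Synthea produced.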

As of April 2023, Synthea can produce the following resource types:

  • AllergyIntolerance
  • Bundle
  • CarePlan
  • CareTeam
  • Claim
  • Condition
  • Coverage
  • Device
  • DiagnosticReport
  • DocumentReference
  • Encounter
  • ExplanationOfBenefit
  • Goal
  • ImagingStudy
  • Immunization
  • Location
  • Medication
  • MedicationRequest
  • MedicationAdministration
  • Observation
  • Organization
  • Patient
  • Practitioner
  • PractitionerRole
  • Procedure
  • Provenance

Note that not all patient records will contain instances of every resource type, and certain resource types will only be produced if certain settings are enabled. See Customizing Synthea for more information on settings.

4 Pre-generated Datasets

Instead of running Synthea yourself, you can use a pre-generated dataset. Pre-generated datasets are available at the following locations:

References

Ohm, Paul. 2009. “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization.” UCLA Law Review, Vol. 57, p. 1701, 2010, U of Colorado Law Legal Studies Research Paper No. 9-12. https://ssrn.com/abstract=1450006.

Footnotes

  1. Perhaps the most well-known example was an instance in 1997 where Latanya Sweeney re-identified the record belonging to then-Governor of Massachusetts William Weld from a dataset where identifiers had been removed. See Ohm (2009) for details.

  2. Terminal here means “the end of this module”, not “the patient has a terminal condition and died”. For instance, the Appendicitis module terminates after the patient has an appendectomy. Compare to the Sore Throat module which does not have a Terminal state since people are always at risk of common viral conditions that present as sore throat.