Analyzing FHIR Data in a Tabular Format With Python

Learning objectives
  1. Understand the high-level approaches for converting FHIR-formatted data into tabular for analysis in Python.
  2. Learn how the FHIR-PYrate library facilitates requesting data from a FHIR server, and creating tidy tabular data tables.
Relevant roles:
  • Informaticist

Data analysis approaches in Python often use Pandas DataFrames to store tabular data. There are two primary approaches to loading FHIR-formatted data into Pandas DataFrames:

  1. Writing Python code to manually convert FHIR instances in JSON format into DataFrames.

    This does not require any special skills beyond data manipulation in Python, but in practice can be laborious (especially with large number of data elements) and prone to bugs.

  2. Using a purpose-built library like FHIR-PYrate to automatically convert FHIR instances into DataFrames.

    It is recommended to try this approach first, and only fall back to (1) if needed.

To use FHIR-PYrate, you will need a Python 3 runtime with FHIR-PYrate and Pandas installed.

1 FHIR testing server

The examples in this module use a FHIR testing server populated with Synthea data in FHIR R4 format via Logica Health’s Sandbox service.

The endpoint for this testing server is:

https://api.logicahealth.org/FHIRResearchSynthea/open

However, any FHIR server loaded with testing data can be used. See Standing up a FHIR Testing Server for instructions to set up your own test server.

The code blocks in the following section show sample output immediately after. This is similar to the code cells and results in a Jupyter notebook.

2 Retrieving FHIR data

Once your environment is set up, you can run the following Python code to retrieve instances of the Patient resource from a test server:

# Load dependencies
from fhir_pyrate import Pirate
import pandas as pd

# Instantiate a Pirate object using the FHIR-PYrate library to query a test FHIR server
search = Pirate(
    auth=None,
    base_url="https://api.logicahealth.org/FHIRResearchSynthea/open/",
    print_request_url=True,
)

# Use the whimsically named `steal_bundles()` method to instantiate a search interaction
#
# For more information, see https://github.com/UMEssen/FHIR-PYrate/#pirate
bundles = search.steal_bundles(
    resource_type='Patient',
    request_params={
        "_count": 10 # Get 10 instances per page
    },
    num_pages=1 # Get 1 page (so a total of 10 instances)
)

# Execute the search and convert to a Pandas DataFrame
df = search.bundles_to_dataframe(bundles)

df.head(5)
https://api.logicahealth.org/FHIRResearchSynthea/open/Patient?_count=10
Query (Patient):   0%|          | 0/1 [00:00<?, ?it/s]Query (Patient): 100%|██████████| 1/1 [00:00<00:00, 889.19it/s]
resourceType id meta_versionId meta_lastUpdated meta_source text_status text_div extension_0_url extension_0_extension_0_url extension_0_extension_0_valueCoding_system ... address_0_country maritalStatus_coding_0_system maritalStatus_coding_0_code maritalStatus_coding_0_display maritalStatus_text multipleBirthBoolean communication_0_language_coding_0_system communication_0_language_coding_0_code communication_0_language_coding_0_display communication_0_language_text
0 Patient 1 1 2023-04-06T20:52:11.000+00:00 #wQwWCylvgEiNKNbB generated <div xmlns="http://www.w3.org/1999/xhtml">Gene... http://hl7.org/fhir/us/core/StructureDefinitio... ombCategory urn:oid:2.16.840.1.113883.6.238 ... US http://terminology.hl7.org/CodeSystem/v3-Marit... M M M False urn:ietf:bcp:47 en-US English English
1 Patient 180 1 2023-04-06T20:53:35.000+00:00 #N1sYXaxE6wGQilNA generated <div xmlns="http://www.w3.org/1999/xhtml">Gene... http://hl7.org/fhir/us/core/StructureDefinitio... ombCategory urn:oid:2.16.840.1.113883.6.238 ... US http://terminology.hl7.org/CodeSystem/v3-Marit... S Never Married Never Married False urn:ietf:bcp:47 en-US English English
2 Patient 293 1 2023-04-06T20:56:31.000+00:00 #ZAAeVXzDXxeetcQS generated <div xmlns="http://www.w3.org/1999/xhtml">Gene... http://hl7.org/fhir/us/core/StructureDefinitio... ombCategory urn:oid:2.16.840.1.113883.6.238 ... US http://terminology.hl7.org/CodeSystem/v3-Marit... S S S False urn:ietf:bcp:47 en-US English English
3 Patient 612 1 2023-04-06T20:57:01.000+00:00 #sTSlRxyopDL1X4fa generated <div xmlns="http://www.w3.org/1999/xhtml">Gene... http://hl7.org/fhir/us/core/StructureDefinitio... ombCategory urn:oid:2.16.840.1.113883.6.238 ... US http://terminology.hl7.org/CodeSystem/v3-Marit... S S S False urn:ietf:bcp:47 en-US English English
4 Patient 931 1 2023-04-06T20:57:13.000+00:00 #ElqD7MaHoYuk6qSo generated <div xmlns="http://www.w3.org/1999/xhtml">Gene... http://hl7.org/fhir/us/core/StructureDefinitio... ombCategory urn:oid:2.16.840.1.113883.6.238 ... US http://terminology.hl7.org/CodeSystem/v3-Marit... S S S False urn:ietf:bcp:47 en-US English English

5 rows × 87 columns

It is easier to see the contents of this DataFrame by printing out its first row vertically:

# Print the first row of the DataFrame vertically for easier reading.
pd.set_option('display.max_rows', 100)  # Show all rows
df.head(1).T
0
resourceType Patient
id 1
meta_versionId 1
meta_lastUpdated 2023-04-06T20:52:11.000+00:00
meta_source #wQwWCylvgEiNKNbB
text_status generated
text_div <div xmlns="http://www.w3.org/1999/xhtml">Gene...
extension_0_url http://hl7.org/fhir/us/core/StructureDefinitio...
extension_0_extension_0_url ombCategory
extension_0_extension_0_valueCoding_system urn:oid:2.16.840.1.113883.6.238
extension_0_extension_0_valueCoding_code 2054-5
extension_0_extension_0_valueCoding_display Black or African American
extension_0_extension_1_url text
extension_0_extension_1_valueString Black or African American
extension_1_url http://hl7.org/fhir/us/core/StructureDefinitio...
extension_1_extension_0_url ombCategory
extension_1_extension_0_valueCoding_system urn:oid:2.16.840.1.113883.6.238
extension_1_extension_0_valueCoding_code 2186-5
extension_1_extension_0_valueCoding_display Not Hispanic or Latino
extension_1_extension_1_url text
extension_1_extension_1_valueString Not Hispanic or Latino
extension_2_url http://hl7.org/fhir/StructureDefinition/patien...
extension_2_valueString Delois358 Hintz995
extension_3_url http://hl7.org/fhir/us/core/StructureDefinitio...
extension_3_valueCode M
extension_4_url http://hl7.org/fhir/StructureDefinition/patien...
extension_4_valueAddress_city Lawrence
extension_4_valueAddress_state Massachusetts
extension_4_valueAddress_country US
extension_5_url http://synthetichealth.github.io/synthea/disab...
extension_5_valueDecimal 0.098274
extension_6_url http://synthetichealth.github.io/synthea/quali...
extension_6_valueDecimal 62.901726
identifier_0_system https://github.com/synthetichealth/synthea
identifier_0_value 439b24b4-6f25-4093-b101-47a39bd061ca
identifier_1_type_coding_0_system http://terminology.hl7.org/CodeSystem/v2-0203
identifier_1_type_coding_0_code MR
identifier_1_type_coding_0_display Medical Record Number
identifier_1_type_text Medical Record Number
identifier_1_system http://hospital.smarthealthit.org
identifier_1_value 439b24b4-6f25-4093-b101-47a39bd061ca
identifier_2_type_coding_0_system http://terminology.hl7.org/CodeSystem/v2-0203
identifier_2_type_coding_0_code SS
identifier_2_type_coding_0_display Social Security Number
identifier_2_type_text Social Security Number
identifier_2_system http://hl7.org/fhir/sid/us-ssn
identifier_2_value 999-57-3355
identifier_3_type_coding_0_system http://terminology.hl7.org/CodeSystem/v2-0203
identifier_3_type_coding_0_code DL
identifier_3_type_coding_0_display Driver's License
identifier_3_type_text Driver's License
identifier_3_system urn:oid:2.16.840.1.113883.4.3.25
identifier_3_value S99925942
identifier_4_type_coding_0_system http://terminology.hl7.org/CodeSystem/v2-0203
identifier_4_type_coding_0_code PPN
identifier_4_type_coding_0_display Passport Number
identifier_4_type_text Passport Number
identifier_4_system http://standardhealthrecord.org/fhir/Structure...
identifier_4_value X42032818X
name_0_use official
name_0_family Moen819
name_0_given_0 Willian804
name_0_prefix_0 Mr.
telecom_0_system phone
telecom_0_value 555-135-7303
telecom_0_use home
gender male
birthDate 1955-10-09
address_0_extension_0_url http://hl7.org/fhir/StructureDefinition/geoloc...
address_0_extension_0_extension_0_url latitude
address_0_extension_0_extension_0_valueDecimal 42.319305
address_0_extension_0_extension_1_url longitude
address_0_extension_0_extension_1_valueDecimal -71.173653
address_0_line_0 545 Tromp Port Unit 55
address_0_city Needham
address_0_state Massachusetts
address_0_postalCode 02492
address_0_country US
maritalStatus_coding_0_system http://terminology.hl7.org/CodeSystem/v3-Marit...
maritalStatus_coding_0_code M
maritalStatus_coding_0_display M
maritalStatus_text M
multipleBirthBoolean False
communication_0_language_coding_0_system urn:ietf:bcp:47
communication_0_language_coding_0_code en-US
communication_0_language_coding_0_display English
communication_0_language_text English

If you look at the output above, you can see FHIR-PYrate collapsed the hierarchical FHIR data structure into DataFrame columns. FHIR-PYrate does this by taking an element from the FHIR-formatted data like Patient.identifier[0].value and converting to an underscore-delimited column name like identifier_0_value. (Note that Patient.identifier has multiple values in the FHIR data, so there are multiple identifier_N_... columns in the DataFrame.)

3 Selecting specific columns

Usually not every single value from a FHIR instance is needed for analysis. There are two ways to get a more concise DataFrame:

  1. Use the approach above to load all elements into a DataFrame, remove the unneeded columns, and rename the remaining columns as needed. The process_function capability in FHIR-PYrate allows you to integrate this approach into the bundles_to_dataframe() method call.
  2. Use FHIRPath to select specific elements and map them onto column names.

The second approach is typically more concise. For example, to generate a DataFrame like this…

id gender date_of_birth marital_status

…you could use the following code:

# Instantiate and perform the FHIR search interaction in a single function call
df = search.steal_bundles_to_dataframe(
    resource_type='Patient',
    request_params={
        "_count": 10  # Get 10 instances per page
    },
    num_pages=1,  # Get 1 page (so a total of 10 instances)
    fhir_paths=[
        ("id", "identifier[0].value"),
        ("gender", "gender"),
        ("date_of_birth", "birthDate"),
        ("marital_status", "maritalStatus.coding[0].code")
    ])
df
https://api.logicahealth.org/FHIRResearchSynthea/open/Patient?_count=10
Query & Build DF (Patient):   0%|          | 0/1 [00:00<?, ?it/s]Query & Build DF (Patient): 100%|██████████| 1/1 [00:00<00:00, 247.70it/s]
id gender date_of_birth marital_status
0 439b24b4-6f25-4093-b101-47a39bd061ca male 1955-10-09 M
1 99285aac-e5e3-4f5b-857d-f67271c97304 female 2017-12-02 S
2 2fa15bc7-8866-461a-9000-f739e425860a male 1945-12-10 S
3 2fa15bc7-8866-461a-9000-f739e425860a male 1945-12-10 S
4 2fa15bc7-8866-461a-9000-f739e425860a male 1945-12-10 S
5 41166989-975d-4d17-b9de-17f94cb3eec1 male 1946-03-29 M
6 2b083021-e93f-4991-bf49-fd4f20060ef8 female 2002-10-24 S
7 29e51479-f742-4474-8f8e-d2607d5269f6 male 1999-12-12 S
8 262b819a-5193-404a-9787-b7f599358035 male 2002-04-15 S
9 aff8f143-2375-416f-901d-b0e4c73e3e58 male 1997-12-26 S

While FHIRPath can be quite complex, its use in FHIR-PYrate is often straight forward. Nested elements are separated with ., and elements with multiple sub-values are identified by [N] where N is an integer starting at 0. The element paths can typically be constructed by loading all elements into a DataFrame and then manually deriving the FHIRPaths from the column names, or by looking at the hierarchy resource pages in the FHIR specification (see Key FHIR Resources for more information on reading the FHIR specification).

4 Elements with multiple sub-values

There are multiple identifier[N].value values for each instance of Patient in this dataset.

# Instantiate and perform the FHIR search interaction in a single function call
df = search.steal_bundles_to_dataframe(
    resource_type='Patient',
    request_params={
        "_count": 10  # Get 10 instances per page
    },
    num_pages=1,  # Get 1 page (so a total of 10 instances)
    fhir_paths=[
        ("id", "identifier[0].value"),
        ("identifiers", "identifier.value")
    ])
df
https://api.logicahealth.org/FHIRResearchSynthea/open/Patient?_count=10
Query & Build DF (Patient):   0%|          | 0/1 [00:00<?, ?it/s]Query & Build DF (Patient): 100%|██████████| 1/1 [00:00<00:00, 484.95it/s]
id identifiers
0 439b24b4-6f25-4093-b101-47a39bd061ca [439b24b4-6f25-4093-b101-47a39bd061ca, 439b24b...
1 99285aac-e5e3-4f5b-857d-f67271c97304 [99285aac-e5e3-4f5b-857d-f67271c97304, 99285aa...
2 2fa15bc7-8866-461a-9000-f739e425860a [2fa15bc7-8866-461a-9000-f739e425860a, 2fa15bc...
3 2fa15bc7-8866-461a-9000-f739e425860a [2fa15bc7-8866-461a-9000-f739e425860a, 2fa15bc...
4 2fa15bc7-8866-461a-9000-f739e425860a [2fa15bc7-8866-461a-9000-f739e425860a, 2fa15bc...
5 41166989-975d-4d17-b9de-17f94cb3eec1 [41166989-975d-4d17-b9de-17f94cb3eec1, 4116698...
6 2b083021-e93f-4991-bf49-fd4f20060ef8 [2b083021-e93f-4991-bf49-fd4f20060ef8, 2b08302...
7 29e51479-f742-4474-8f8e-d2607d5269f6 [29e51479-f742-4474-8f8e-d2607d5269f6, 29e5147...
8 262b819a-5193-404a-9787-b7f599358035 [262b819a-5193-404a-9787-b7f599358035, 262b819...
9 aff8f143-2375-416f-901d-b0e4c73e3e58 [aff8f143-2375-416f-901d-b0e4c73e3e58, aff8f14...

To convert to separate columns, you can do the following:

df.join(
    pd.DataFrame(
        df.pop('identifiers').values.tolist()
    ).add_prefix('identifier_')
)
id identifier_0 identifier_1 identifier_2 identifier_3 identifier_4
0 439b24b4-6f25-4093-b101-47a39bd061ca 439b24b4-6f25-4093-b101-47a39bd061ca 439b24b4-6f25-4093-b101-47a39bd061ca 999-57-3355 S99925942 X42032818X
1 99285aac-e5e3-4f5b-857d-f67271c97304 99285aac-e5e3-4f5b-857d-f67271c97304 99285aac-e5e3-4f5b-857d-f67271c97304 999-62-8033 None None
2 2fa15bc7-8866-461a-9000-f739e425860a 2fa15bc7-8866-461a-9000-f739e425860a 2fa15bc7-8866-461a-9000-f739e425860a 999-93-7537 S99948707 X14078167X
3 2fa15bc7-8866-461a-9000-f739e425860a 2fa15bc7-8866-461a-9000-f739e425860a 2fa15bc7-8866-461a-9000-f739e425860a 999-93-7537 S99948707 X14078167X
4 2fa15bc7-8866-461a-9000-f739e425860a 2fa15bc7-8866-461a-9000-f739e425860a 2fa15bc7-8866-461a-9000-f739e425860a 999-93-7537 S99948707 X14078167X
5 41166989-975d-4d17-b9de-17f94cb3eec1 41166989-975d-4d17-b9de-17f94cb3eec1 41166989-975d-4d17-b9de-17f94cb3eec1 999-17-8717 S99933732 X75257608X
6 2b083021-e93f-4991-bf49-fd4f20060ef8 2b083021-e93f-4991-bf49-fd4f20060ef8 2b083021-e93f-4991-bf49-fd4f20060ef8 999-83-6040 S99998967 None
7 29e51479-f742-4474-8f8e-d2607d5269f6 29e51479-f742-4474-8f8e-d2607d5269f6 29e51479-f742-4474-8f8e-d2607d5269f6 999-48-8328 S99955051 None
8 262b819a-5193-404a-9787-b7f599358035 262b819a-5193-404a-9787-b7f599358035 262b819a-5193-404a-9787-b7f599358035 999-19-7941 S99986287 None
9 aff8f143-2375-416f-901d-b0e4c73e3e58 aff8f143-2375-416f-901d-b0e4c73e3e58 aff8f143-2375-416f-901d-b0e4c73e3e58 999-35-9084 S99998334 X10377495X

This will give you separate identifier_0, identifier_1, … columns for each Patient.identifier[N] value.

6 Additional resources

NIH’s Office of Data Science Strategy has online exercises for converting FHIR-formatted data into tabular format for further analysis. These exercises include implementations in both Python and R.

Note that the Python version of these exercises was created before the FHIR-PYrate library existed, so they use the “writing Python code to manually convert FHIR instances in JSON Format into DataFrames” approach. This approach may still be useful for use cases that are not well-supported by FHIR-PYrate or for integrating FHIR data with data from other web APIs.