Analyzing FHIR Data in a Tabular Format With Python
Not yet updated for 2026
Roles: Informaticist
Learning objectives
Understand the high-level approaches for converting FHIR-formatted data into tabular form for analysis in Python.
Learn how the FHIR-PYrate library facilitates requesting data from a FHIR server and creating tidy data tables.
Data analysis approaches in Python often use Pandas DataFrames to store tabular data. There are two primary approaches to loading FHIR-formatted data into Pandas DataFrames:
1. Writing Python code to manually convert FHIR instances in JSON format into DataFrames.
   This does not require any special skills beyond data manipulation in Python, but in practice it can be laborious (especially with a large number of data elements) and prone to bugs.
2. Using a purpose-built library like FHIR-PYrate to automatically convert FHIR instances into DataFrames.
   It is recommended to try this approach first, and only fall back to (1) if needed.
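As a rough illustration of the manual approach, the sketch below pulls a few elements out of a FHIR Patient instance in JSON form and builds one row per instance. The sample instance and the `patient_to_row` helper are hypothetical and cover only a handful of elements; they are not part of FHIR-PYrate.

```python
import json

# Hypothetical, trimmed Patient instance (real instances carry many more elements)
patient_json = """
{
  "resourceType": "Patient",
  "id": "example-1",
  "identifier": [
    {"system": "https://github.com/synthetichealth/synthea", "value": "example-1"}
  ],
  "name": [{"use": "official", "family": "Soto", "given": ["Lisa"]}],
  "gender": "female",
  "birthDate": "1970-07-10"
}
"""


def patient_to_row(patient: dict) -> dict:
    """Pick a few elements from a Patient instance, tolerating missing ones."""
    names = patient.get("name") or [{}]
    return {
        "id": patient.get("id"),
        "family_name": names[0].get("family"),
        "given_name": " ".join(names[0].get("given", [])) or None,
        "gender": patient.get("gender"),
        "date_of_birth": patient.get("birthDate"),
    }


rows = [patient_to_row(json.loads(patient_json))]
print(rows[0])
```

A list of such rows can be passed to `pd.DataFrame(rows)`. Every optional or repeating element needs its own guard like the ones above, which is where this approach becomes laborious.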
To use FHIR-PYrate, you will need a Python 3 runtime with FHIR-PYrate and Pandas installed.
Any FHIR server loaded with testing data can be used; the examples below query the public test server at https://hapi.fhir.org/baseR4. See Standing up a FHIR Testing Server for instructions to set up your own test server.
The code blocks in the following section show sample output immediately after. This is similar to the code cells and results in a Jupyter notebook.
2 Retrieving FHIR data
Once your environment is set up, you can run the following Python code to retrieve instances of the Patient resource from a test server:
# Load dependencies
from fhir_pyrate import Pirate
import pandas as pd

# Instantiate a Pirate object using the FHIR-PYrate library to query a test FHIR server
search = Pirate(
    auth=None,
    base_url="https://hapi.fhir.org/baseR4",
    print_request_url=True,
)

# Use the whimsically named `steal_bundles()` method to instantiate a search interaction
#
# For more information, see https://github.com/UMEssen/FHIR-PYrate/#pirate
bundles = search.steal_bundles(
    resource_type="Patient",
    request_params={
        "_count": 10,  # Get 10 instances per page
        "identifier": "https://github.com/synthetichealth/synthea|",
    },
    num_pages=1,  # Get 1 page (so a total of 10 instances)
)

# Execute the search and convert to a Pandas DataFrame
df = search.bundles_to_dataframe(bundles)
df.head(5)
It is easier to see the contents of this DataFrame by printing out its first row vertically:
# Print the first row of the DataFrame vertically for easier reading.
pd.set_option("display.max_rows", 100)  # Show up to 100 rows without truncation
df.head(1).T
                                    0
resourceType                        Patient
id                                  129c6ac7-8d06-89de-ad63-0204a93e76c3
meta_versionId                      4
meta_lastUpdated                    2026-06-06T10:10:02.293+00:00
meta_source                         #pOh0yLEonU9VLYRE
meta_tag_0_system                   http://terminology.hl7.org/CodeSystem/v3-Obser...
meta_tag_0_code                     SUBSETTED
meta_tag_0_display                  Resource encoded in summary mode
identifier_0_system                 https://github.com/synthetichealth/synthea
identifier_0_value                  129c6ac7-8d06-89de-ad63-0204a93e76c3
identifier_1_type_coding_0_system   http://terminology.hl7.org/CodeSystem/v2-0203
identifier_1_type_coding_0_code     MR
identifier_1_type_coding_0_display  Medical Record Number
identifier_1_type_text              Medical Record Number
identifier_1_system                 http://hospital.smarthealthit.org
identifier_1_value                  129c6ac7-8d06-89de-ad63-0204a93e76c3
identifier_2_type_coding_0_system   http://terminology.hl7.org/CodeSystem/v2-0203
identifier_2_type_coding_0_code     SS
identifier_2_type_coding_0_display  Social Security Number
identifier_2_type_text              Social Security Number
identifier_2_system                 http://hl7.org/fhir/sid/us-ssn
identifier_2_value                  999-94-5397
identifier_3_type_coding_0_system   http://terminology.hl7.org/CodeSystem/v2-0203
identifier_3_type_coding_0_code     DL
identifier_3_type_coding_0_display  Driver's license number
identifier_3_type_text              Driver's license number
identifier_3_system                 urn:oid:2.16.840.1.113883.4.3.25
identifier_3_value                  S99940903
identifier_4_type_coding_0_system   http://terminology.hl7.org/CodeSystem/v2-0203
identifier_4_type_coding_0_code     PPN
identifier_4_type_coding_0_display  Passport Number
identifier_4_type_text              Passport Number
identifier_4_system                 http://standardhealthrecord.org/fhir/Structure...
identifier_4_value                  X53631011X
identifier_5_use                    official
identifier_5_type_coding_0_system   http://terminology.hl7.org/CodeSystem/v2-0203
identifier_5_type_coding_0_code     MR
identifier_5_type_coding_0_display  Medical Record Number
identifier_5_type_text              MRN
identifier_5_system                 http://hospital.example.org/mrn
identifier_5_value                  9447890438
name_0_use                          official
name_0_family                       Soto
name_0_given_0                      Lisa
gender                              female
birthDate                           1970-07-10
identifier_3_use                    NaN
The output above shows that FHIR-PYrate collapsed the hierarchical FHIR data structure into DataFrame columns. FHIR-PYrate does this by taking an element from the FHIR-formatted data like Patient.identifier[0].value and converting it to an underscore-delimited column name like identifier_0_value. (Note that Patient.identifier has multiple values in the FHIR data, so there are multiple identifier_N_... columns in the DataFrame.)
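The naming scheme can be reproduced with a small recursive function. The sketch below is an illustration of the idea only, not FHIR-PYrate's actual implementation; the `flatten` helper and the sample instance are hypothetical.

```python
def flatten(element, prefix=""):
    """Collapse nested dicts and lists into a flat dict with underscore-delimited keys."""
    flat = {}
    if isinstance(element, dict):
        for key, value in element.items():
            flat.update(flatten(value, f"{prefix}{key}_"))
    elif isinstance(element, list):
        # List positions become part of the key: identifier[0] -> identifier_0
        for index, value in enumerate(element):
            flat.update(flatten(value, f"{prefix}{index}_"))
    else:
        flat[prefix[:-1]] = element  # drop the trailing underscore
    return flat


# Hypothetical, trimmed Patient instance
patient = {
    "resourceType": "Patient",
    "identifier": [
        {"system": "https://github.com/synthetichealth/synthea", "value": "example-1"},
        {"system": "http://hl7.org/fhir/sid/us-ssn", "value": "999-94-5397"},
    ],
    "name": [{"family": "Soto", "given": ["Lisa"]}],
}

flat_patient = flatten(patient)
for column, value in flat_patient.items():
    print(f"{column}: {value}")
```

Because instances differ in how many repetitions they carry, flattening several of them produces different key sets, which is why some columns in the combined DataFrame contain NaN.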
3 Selecting specific columns
Usually not every single value from a FHIR instance is needed for analysis. There are two ways to get a more concise DataFrame:
1. Use the approach above to load all elements into a DataFrame, remove the unneeded columns, and rename the remaining columns as needed. The process_function capability in FHIR-PYrate allows you to integrate this approach into the bundles_to_dataframe() method call.
2. Use FHIRPath to select specific elements and map them onto column names.
The second approach is typically more concise. For example, to generate a DataFrame like this…
id    gender    date_of_birth    marital_status
…     …         …                …
…you could use the following code:
# Instantiate and perform the FHIR search interaction in a single function call
df = search.steal_bundles_to_dataframe(
    resource_type="Patient",
    request_params={
        "_count": 10,  # Get 10 instances per page
        "identifier": "https://github.com/synthetichealth/synthea|",
    },
    num_pages=1,  # Get 1 page (so a total of 10 instances)
    fhir_paths=[
        ("id", "identifier[0].value"),
        ("gender", "gender"),
        ("date_of_birth", "birthDate"),
        ("marital_status", "maritalStatus.coding[0].code"),
    ],
)
df
While FHIRPath can be quite complex, its use in FHIR-PYrate is often straightforward. Nested elements are separated with ., and elements with multiple sub-values are indexed with [N], where N is an integer starting at 0. The element paths can typically be constructed by loading all elements into a DataFrame and then manually deriving the FHIRPaths from the column names, or by looking at the element hierarchy on the resource pages in the FHIR specification (see Key FHIR Resources for more information on reading the FHIR specification).
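Deriving a FHIRPath from a flattened column name is mechanical enough to sketch in a few lines. The `column_to_fhirpath` helper below is hypothetical and assumes that element names contain no underscores and that purely numeric segments are list indices; it only produces the simple path forms used in this tutorial.

```python
def column_to_fhirpath(column: str) -> str:
    """Turn a flattened column name like identifier_0_value into identifier[0].value."""
    path = ""
    for part in column.split("_"):
        if part.isdigit():
            path += f"[{part}]"  # numeric segment: index into the previous element
        elif path:
            path += f".{part}"  # nested element
        else:
            path = part  # first element in the path
    return path


for name in ["identifier_0_value", "maritalStatus_coding_0_code", "birthDate"]:
    print(name, "->", column_to_fhirpath(name))
```

The resulting strings can be used as the second item of each `fhir_paths` tuple, as in the example above.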
4 Elements with multiple sub-values
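There are multiple identifier[N].value values for each instance of Patient in this dataset. Omitting the index from the FHIRPath (identifier.value instead of identifier[0].value) selects all of them: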
# Instantiate and perform the FHIR search interaction in a single function call
df = search.steal_bundles_to_dataframe(
    resource_type="Patient",
    request_params={
        "_count": 10,  # Get 10 instances per page
        "identifier": "https://github.com/synthetichealth/synthea|",
    },
    num_pages=1,  # Get 1 page (so a total of 10 instances)
    fhir_paths=[("id", "identifier[0].value"), ("identifiers", "identifier.value")],
)
df
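The identifiers column then holds all of the Patient.identifier[N].value values for each instance. Splitting that column out will give you separate identifier_0, identifier_1, … columns for each Patient.identifier[N] value.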
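The separate identifier_0, identifier_1, … columns can be produced from a list-valued column with plain Pandas. The sketch below uses made-up sample data in place of a server response and assumes every cell of `identifiers` holds a Python list; rows with fewer identifiers are padded with missing values.

```python
import pandas as pd

# Made-up stand-in for a DataFrame with a list-valued `identifiers` column
df_example = pd.DataFrame(
    {
        "id": ["example-1", "example-2"],
        "identifiers": [
            ["example-1", "999-94-5397", "S99940903"],
            ["example-2", "999-11-1111"],
        ],
    }
)

# Build one column per list position, then name them identifier_0, identifier_1, ...
expanded = pd.DataFrame(
    df_example["identifiers"].tolist(), index=df_example.index
).add_prefix("identifier_")

# Replace the list-valued column with the expanded columns
df_wide = df_example.drop(columns="identifiers").join(expanded)
print(df_wide)
```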
5 Retrieving related data
To retrieve instances of related resources, additional request_params can be added. See Using the FHIR API to Access Data for more information on constructing the parameters for FHIR search interactions.
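It can help to see how a `request_params` dictionary maps onto a FHIR search URL. The sketch below builds such a URL with the standard library only; it illustrates the general shape of the request and is not a copy of what FHIR-PYrate sends, which you can inspect by setting `print_request_url=True` on the Pirate object.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://hapi.fhir.org/baseR4"
request_params = {
    "_revinclude": "Observation:patient",  # also return Observations that point at the matched Patients
    "identifier": "https://github.com/synthetichealth/synthea|",
    "_count": 10,
}

# Search interactions are GET requests of the form [base]/[resource type]?[parameters]
search_url = f"{base_url}/Patient?{urlencode(request_params)}"
print(search_url)

# Decode the query string again to confirm that nothing was lost in the encoding
decoded = parse_qs(urlsplit(search_url).query)
print(decoded)
```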
In the example below, instances of Patient and instances of related Observation resources are requested:
# Instantiate and perform the FHIR search interaction in a single function call
dfs = search.steal_bundles_to_dataframe(
    resource_type="Patient",
    request_params={
        # Get instances of Observation whose `patient` search parameter refers to a fetched Patient instance
        "_revinclude": "Observation:patient",
        "identifier": "https://github.com/synthetichealth/synthea|",
        "_count": 10,  # Get 10 instances per page
    },
    num_pages=1,  # Get 1 page (so a total of 10 instances)
)

# `dfs` is a dictionary where the key is the FHIR resource type, and the value is the DataFrame
#
# Split these into separate variables for easy access:
df_patients = dfs["Patient"]
df_observations = dfs["Observation"]

# Look at the first row of the Observations DataFrame
df_observations.head(1).T
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/work/fhir-for-research/fhir-for-research/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3811 try:
-> 3812     return self._engine.get_loc(casted_key)
   3813 except KeyError as err:

File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7096, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Patient'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[6], line 16
      2 dfs = search.steal_bundles_to_dataframe(
      3     resource_type="Patient",
      4     request_params={
   (...)
     10     num_pages=1,  # Get 1 page (so a total of 10 instances)
     11 )
     13 # `dfs` is a dictionary where the key is the FHIR resource type, and the value is the DataFrame
     14 #
     15 # Split these into separate variables for easy access:
---> 16 df_patients = dfs["Patient"]
     17 df_observations = dfs["Observation"]
     19 # Look at the first row of the Observations DataFrame

File ~/work/fhir-for-research/fhir-for-research/.venv/lib/python3.12/site-packages/pandas/core/frame.py:4113, in DataFrame.__getitem__(self, key)
   4111 if self.columns.nlevels > 1:
   4112     return self._getitem_multilevel(key)
-> 4113 indexer = self.columns.get_loc(key)
   4114 if is_integer(indexer):
   4115     indexer = [indexer]

File ~/work/fhir-for-research/fhir-for-research/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
   3814     if isinstance(casted_key, slice) or (
   3815         isinstance(casted_key, abc.Iterable)
   3816         and any(isinstance(x, slice) for x in casted_key)
   3817     ):
   3818         raise InvalidIndexError(key)
-> 3819     raise KeyError(key) from err
   3820 except TypeError:
   3821     # If we have a listlike key, _check_indexing_error will raise
   3822     # InvalidIndexError. Otherwise we fall through and re-raise
   3823     # the TypeError.
   3824     self._check_indexing_error(key)

KeyError: 'Patient'
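The traceback passes through DataFrame.__getitem__, which indicates that in this run steal_bundles_to_dataframe() returned a single DataFrame rather than a dictionary keyed by resource type, so dfs["Patient"] was treated as a column lookup and failed. To work around this, you can instead iterate over the rows of a DataFrame of patients and request related resources for each row using trade_rows_for_dataframe(). The call below expects df_patients to exist; because the cell above failed before defining it, the sample output here is a NameError: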
df_observations2 = search.trade_rows_for_dataframe(
    df_patients,
    resource_type="Observation",
    request_params={
        "_count": "10",  # Get 10 instances per page
    },
    num_pages=1,
    # Load Observations where `Observation.subject` references the instance of Patient
    # identified by `id` in the `df_patients` DataFrame
    df_constraints={"subject": "id"},
    fhir_paths=[
        ("observation_id", "id"),
        ("patient", "subject.reference"),
        ("status", "status"),
        ("code", "code.coding[0].code"),
        ("code_display", "code.coding[0].display"),
        ("value", "valueQuantity.value"),
        ("value_units", "valueQuantity.unit"),
        ("datetime", "effectiveDateTime"),
    ],
)

# Look at the first row of the Observations DataFrame
df_observations2.head(15)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[7], line 2
      1 df_observations2 = search.trade_rows_for_dataframe(
----> 2     df_patients,
      3     resource_type="Observation",
      4     request_params={
      5         "_count": "10",  # Get 10 instances per page
      6     },
      7     num_pages=1,
      8     # Load Observations where `Observation.subject` references the instance of Patient
      9     # identified by `id` in the `df_patients` DataFrame
     10     df_constraints={"subject": "id"},
     11     fhir_paths=[
     12         ("observation_id", "id"),
     13         ("patient", "subject.reference"),
     14         ("status", "status"),
     15         ("code", "code.coding[0].code"),
     16         ("code_display", "code.coding[0].display"),
     17         ("value", "valueQuantity.value"),
     18         ("value_units", "valueQuantity.unit"),
     19         ("datetime", "effectiveDateTime"),
     20     ],
     21 )
     23 # Look at the first row of the Observations DataFrame
     24 df_observations2.head(15)

NameError: name 'df_patients' is not defined
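The `patient` column requested above comes from `subject.reference`, which holds a reference string such as `Patient/<id>` rather than a bare id. To line Observations up with patients, that string has to be reduced to the id first. The sketch below shows one way to do this on made-up rows; the `reference_to_id` helper is hypothetical and handles only relative and absolute references ending in `Patient/<id>` (not versioned or `urn:uuid:` references).

```python
def reference_to_id(reference: str, resource_type: str = "Patient"):
    """Return the logical id from a reference like 'Patient/123', or None if it does not match."""
    parts = reference.rstrip("/").split("/")
    if len(parts) >= 2 and parts[-2] == resource_type:
        return parts[-1]
    return None


# Made-up rows standing in for the Observations DataFrame
observations = [
    {"observation_id": "obs-1", "patient": "Patient/example-1"},
    {"observation_id": "obs-2", "patient": "https://hapi.fhir.org/baseR4/Patient/example-1"},
    {"observation_id": "obs-3", "patient": "Patient/example-2"},
]

# Group observation ids by the patient they refer to
by_patient = {}
for row in observations:
    patient_id = reference_to_id(row["patient"])
    by_patient.setdefault(patient_id, []).append(row["observation_id"])

print(by_patient)
```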
Note that this will only populate the value column for instances of Observation that record a value in Observation.valueQuantity. Typically, you would filter by Observation.code and then choose the appropriate data type for Observation.value[x] to import. For example, http://loinc.org|72166-2 is the LOINC code for smoking status. To get smoking status records by searching for Observations directly, without limiting the search to the patients in df_patients:
# Directly search for smoking status observations
df_observations2 = search.steal_bundles_to_dataframe(
    resource_type="Observation",
    request_params={
        "code": "http://loinc.org|72166-2",  # LOINC code for smoking status
        "_count": 20,  # Get more observations since we're not limiting by patient
    },
    num_pages=1,
    fhir_paths=[
        ("observation_id", "id"),
        ("patient", "subject.reference"),
        ("status", "status"),
        ("code", "code.coding[0].code"),
        ("code_display", "code.coding[0].display"),
        ("value", "valueCodeableConcept.coding[0].code"),
        ("value_display", "valueCodeableConcept.coding[0].display"),
        ("datetime", "effectiveDateTime"),
    ],
)

# Look at the first 15 rows of the Observations DataFrame
df_observations2.head(15)
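When a dataset mixes Observations with different value[x] types, the choice of data type can also be made per instance. The sketch below works on Observation instances as plain dictionaries and returns a (value, detail) pair from whichever of three common value[x] elements is present. The `observation_value` helper and the sample instances are hypothetical, the codes in them are placeholders, and other value[x] types are not handled.

```python
def observation_value(observation: dict):
    """Return (value, unit or display) from whichever supported value[x] element is present."""
    if "valueQuantity" in observation:
        quantity = observation["valueQuantity"]
        return quantity.get("value"), quantity.get("unit")
    if "valueCodeableConcept" in observation:
        codings = observation["valueCodeableConcept"].get("coding") or [{}]
        return codings[0].get("code"), codings[0].get("display")
    if "valueString" in observation:
        return observation["valueString"], None
    return None, None  # no supported value[x] element recorded


# Hypothetical instances with placeholder codes
weight = {"resourceType": "Observation", "valueQuantity": {"value": 70.5, "unit": "kg"}}
smoking = {
    "resourceType": "Observation",
    "valueCodeableConcept": {
        "coding": [{"code": "placeholder-code", "display": "Placeholder display"}]
    },
}

print(observation_value(weight))
print(observation_value(smoking))
```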