# Load dependencies
library(fhircrackr)
library(tidyverse) # Not strictly necessary, but helpful for working with data in R
# Define the URL of the FHIR server and the request that will be made
request <- fhir_url(url = "https://api.logicahealth.org/FHIRResearchSynthea/open", resource = "Patient")
# Perform the request
patient_bundle <- fhir_search(request = request, max_bundles = 1, verbose = 0)
# This method defines the mapping from FHIR to data frame columns.
# If the `cols` argument is omitted, all data elements will be included in the data frame.
table_desc_patient <- fhir_table_description(
  resource = "Patient"
)
# Convert to R data frame
df_patient <- fhir_crack(bundles = patient_bundle, design = table_desc_patient, verbose = 0)
# Show the first five rows
df_patient %>% head(5)
Analyzing FHIR Data in a Tabular Format With R
Data analysis in R typically uses data frames to store tabular data. There are two primary approaches to loading FHIR-formatted data into R data frames:
- Writing R code to manually convert FHIR instances in JSON format into data frames.
- Using a purpose-built library like fhircrackr to automatically convert FHIR instances into data frames.
It is recommended to try the second approach (fhircrackr) first. If fhircrackr does not fit your use case, it may be easier to convert the data from FHIR to tabular format using Python and then export it to R than to do the conversion entirely in R. The reticulate package may facilitate this by allowing Python and R code to share data objects within RStudio.
To use fhircrackr, you will need an R runtime with fhircrackr installed. R users typically work in the RStudio IDE, but this is not strictly necessary.
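If fhircrackr is not yet installed, a minimal setup sketch might look like the following. It assumes the package is available from CRAN and that your machine can reach a CRAN mirror; adjust it for your own environment.

```r
# Assumption: fhircrackr is available from CRAN and a CRAN mirror is reachable.
# Install the package only if it is not already present.
if (!requireNamespace("fhircrackr", quietly = TRUE)) {
  install.packages("fhircrackr")
}

# Confirm which version is installed
packageVersion("fhircrackr")
```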
1 FHIR testing server
The examples in this module use a FHIR testing server populated with Synthea data in FHIR R4 format via Logica Health’s Sandbox service.
The endpoint for this testing server is:
https://api.logicahealth.org/FHIRResearchSynthea/open
However, any FHIR server loaded with testing data can be used. See Standing up a FHIR Testing Server for instructions to set up your own test server.
The code blocks in the following section show sample output immediately after. This is similar to the code blocks and results in a rendered RMarkdown file.
2 Retrieving FHIR data
Once your environment is set up, you can run the R code shown at the top of this page to retrieve instances of the Patient resource from a test server.
It is easier to see the contents of this data frame by printing out its first row vertically:
df_patient[1,] %>% t
1
address.city "Needham"
address.country "US"
address.extension "http://hl7.org/fhir/StructureDefinition/geolocation"
address.extension.extension "latitude:::longitude"
address.extension.extension.valueDecimal "42.319304553912225:::-71.17365303910063"
address.line "545 Tromp Port Unit 55"
address.postalCode "02492"
address.state "Massachusetts"
birthDate "1955-10-09"
communication.language.coding.code "en-US"
communication.language.coding.display "English"
communication.language.coding.system "urn:ietf:bcp:47"
communication.language.text "English"
deceasedDateTime NA
extension "http://hl7.org/fhir/us/core/StructureDefinition/us-core-race:::http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity:::http://hl7.org/fhir/StructureDefinition/patient-mothersMaidenName:::http://hl7.org/fhir/us/core/StructureDefinition/us-core-birthsex:::http://hl7.org/fhir/StructureDefinition/patient-birthPlace:::http://synthetichealth.github.io/synthea/disability-adjusted-life-years:::http://synthetichealth.github.io/synthea/quality-adjusted-life-years"
extension.extension "ombCategory:::text:::ombCategory:::text"
extension.extension.valueCoding.code "2054-5:::2186-5"
extension.extension.valueCoding.display "Black or African American:::Not Hispanic or Latino"
extension.extension.valueCoding.system "urn:oid:2.16.840.1.113883.6.238:::urn:oid:2.16.840.1.113883.6.238"
extension.extension.valueString "Black or African American:::Not Hispanic or Latino"
extension.valueAddress.city "Lawrence"
extension.valueAddress.country "US"
extension.valueAddress.state "Massachusetts"
extension.valueCode "M"
extension.valueDecimal "0.09827400029822156:::62.90172599970178"
extension.valueString "Delois358 Hintz995"
gender "male"
id "1"
identifier.system "https://github.com/synthetichealth/synthea:::http://hospital.smarthealthit.org:::http://hl7.org/fhir/sid/us-ssn:::urn:oid:2.16.840.1.113883.4.3.25:::http://standardhealthrecord.org/fhir/StructureDefinition/passportNumber"
identifier.type.coding.code "MR:::SS:::DL:::PPN"
identifier.type.coding.display "Medical Record Number:::Social Security Number:::Driver's License:::Passport Number"
identifier.type.coding.system "http://terminology.hl7.org/CodeSystem/v2-0203:::http://terminology.hl7.org/CodeSystem/v2-0203:::http://terminology.hl7.org/CodeSystem/v2-0203:::http://terminology.hl7.org/CodeSystem/v2-0203"
identifier.type.text "Medical Record Number:::Social Security Number:::Driver's License:::Passport Number"
identifier.value "439b24b4-6f25-4093-b101-47a39bd061ca:::439b24b4-6f25-4093-b101-47a39bd061ca:::999-57-3355:::S99925942:::X42032818X"
maritalStatus.coding.code "M"
maritalStatus.coding.display "M"
maritalStatus.coding.system "http://terminology.hl7.org/CodeSystem/v3-MaritalStatus"
maritalStatus.text "M"
meta.lastUpdated "2023-04-06T20:52:11.000+00:00"
meta.source "#wQwWCylvgEiNKNbB"
meta.versionId "1"
multipleBirthBoolean "false"
multipleBirthInteger NA
name.family "Moen819"
name.given "Willian804"
name.prefix "Mr."
name.use "official"
telecom.system "phone"
telecom.use "home"
telecom.value "555-135-7303"
text.status "generated"
If you look at the output above, you can see that fhircrackr collapsed the hierarchical FHIR data structure into data frame columns, with multiple values delimited by ::: by default. For example, Patient.identifier has multiple values that appear in the data frame as:
Column name | Example Values |
---|---|
identifier.type.text | Medical Record Number:::Social Security Number:::Driver's License:::Passport Number |
identifier.value | 439b24b4-6f25-4093-b101-47a39bd061ca:::439b24b4-6f25-4093-b101-47a39bd061ca:::999-57-3355:::S99925942:::X42032818X |
Splitting up these values is discussed below.
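To make the delimiter concrete, here is a minimal base R sketch that splits one such cell by hand. The variable names are illustrative only, and this is not how fhircrackr itself processes the values.

```r
# One cell copied from the identifier.type.text column above
cell <- "Medical Record Number:::Social Security Number:::Driver's License:::Passport Number"

# fixed = TRUE treats ":::" as a literal string rather than a regular expression
values <- strsplit(cell, split = ":::", fixed = TRUE)[[1]]

print(values)
print(length(values))
```

Note that identifier.value holds five values in this row while identifier.type.text holds four, because the first identifier has no type. Splitting by position alone would therefore misalign the two columns, which is one reason to prefer fhircrackr's index-aware tools over manual splitting.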
3 Selecting specific columns
Usually not every single value from a FHIR instance is needed for analysis. There are two ways to get a more concise data frame:
- Use the approach above to load all elements into a data frame, remove the unneeded columns, and rename the remaining columns as needed.
- Use XPath to select specific elements and map them onto column names.
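As a point of comparison, the first approach might be sketched as follows with dplyr. The toy data frame stands in for the output of fhir_crack() so that the snippet runs without a FHIR server; its single row reuses values from the output above, and the names df_all and df_concise are hypothetical.

```r
library(dplyr)

# Stand-in for a data frame produced by fhir_crack() with all columns included
df_all <- data.frame(
  id = "1",
  gender = "male",
  birthDate = "1955-10-09",
  maritalStatus.coding.code = "M",
  name.family = "Moen819",
  stringsAsFactors = FALSE
)

# select() keeps only the listed columns and renames them in the same step
df_concise <- df_all %>%
  select(
    id,
    gender,
    date_of_birth = birthDate,
    marital_status = maritalStatus.coding.code
  )

print(df_concise)
```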
The second approach is typically more concise. For example, to generate a data frame like this…
id | gender | date_of_birth | marital_status |
---|---|---|---|
… | … | … | … |
…you could use the following code:
table_desc_patient <- fhir_table_description(
  resource = "Patient",
  cols = c(
    id = "id",
    gender = "gender",
    date_of_birth = "birthDate",
    # Rather than having fhircrackr concatenate all `Patient.maritalStatus` values
    # into one cell, you can select a specific value with XPath:
    marital_status = "maritalStatus/coding[1]/code"
  )
)
df_patient <- fhir_crack(bundles = patient_bundle, design = table_desc_patient, verbose = 0)
df_patient %>% head(5)
While XPath expressions can be quite complex, their use in fhircrackr is often straightforward. Nested elements are separated with /, and elements with multiple sub-values are identified by [N], where N is an integer starting at 1.
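These two rules can be tried out on a small XML fragment with the xml2 package. This is only a sketch: the fragment is hand-written rather than returned by a server, the second coding is invented to show the effect of [N], and the variable names are hypothetical.

```r
library(xml2)

# Hand-written fragment; the second <coding> is hypothetical
patient_xml <- read_xml('<Patient>
  <maritalStatus>
    <coding><code value="M"/></coding>
    <coding><code value="S"/></coding>
  </maritalStatus>
</Patient>')

# "/" steps into nested elements; [1] picks the first <coding>
first_code <- xml_attr(
  xml_find_first(patient_xml, "maritalStatus/coding[1]/code"),
  "value"
)

# [2] picks the second <coding>
second_code <- xml_attr(
  xml_find_first(patient_xml, "maritalStatus/coding[2]/code"),
  "value"
)

print(c(first_code, second_code))
```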
There are two approaches to identifying element paths to construct XPath expressions:
- Look at the FHIR specification or the relevant FHIR Implementation Guide to determine the paths of available data elements. For example, the Patient page in the FHIR specification describes the elements and their hierarchy for instances of Patient.
- Print out the raw data returned by the FHIR server. Fhircrackr uses XML-formatted data, and the following code will print out one of the instances of Patient requested above:
xml2::xml_find_first(x = patient_bundle[[1]], xpath = "./entry[1]/resource") %>% paste0 %>% cat
<resource> <Patient> <id value="1"/> <meta> <versionId value="1"/> <lastUpdated value="2023-04-06T20:52:11.000+00:00"/> <source value="#wQwWCylvgEiNKNbB"/> </meta> <text> <status value="generated"/> </text> <extension url="http://hl7.org/fhir/us/core/StructureDefinition/us-core-race"> <extension url="ombCategory"> <valueCoding> <system value="urn:oid:2.16.840.1.113883.6.238"/> <code value="2054-5"/> <display value="Black or African American"/> </valueCoding> </extension> <extension url="text"> <valueString value="Black or African American"/> </extension> </extension> <extension url="http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity"> <extension url="ombCategory"> <valueCoding> <system value="urn:oid:2.16.840.1.113883.6.238"/> <code value="2186-5"/> <display value="Not Hispanic or Latino"/> </valueCoding> </extension> <extension url="text"> <valueString value="Not Hispanic or Latino"/> </extension> </extension> <extension url="http://hl7.org/fhir/StructureDefinition/patient-mothersMaidenName"> <valueString value="Delois358 Hintz995"/> </extension> <extension url="http://hl7.org/fhir/us/core/StructureDefinition/us-core-birthsex"> <valueCode value="M"/> </extension> <extension url="http://hl7.org/fhir/StructureDefinition/patient-birthPlace"> <valueAddress> <city value="Lawrence"/> <state value="Massachusetts"/> <country value="US"/> </valueAddress> </extension> <extension url="http://synthetichealth.github.io/synthea/disability-adjusted-life-years"> <valueDecimal value="0.09827400029822156"/> </extension> <extension url="http://synthetichealth.github.io/synthea/quality-adjusted-life-years"> <valueDecimal value="62.90172599970178"/> </extension> <identifier> <system value="https://github.com/synthetichealth/synthea"/> <value value="439b24b4-6f25-4093-b101-47a39bd061ca"/> </identifier> <identifier> <type> <coding> <system value="http://terminology.hl7.org/CodeSystem/v2-0203"/> <code value="MR"/> <display value="Medical Record Number"/> </coding> <text 
value="Medical Record Number"/> </type> <system value="http://hospital.smarthealthit.org"/> <value value="439b24b4-6f25-4093-b101-47a39bd061ca"/> </identifier> <identifier> <type> <coding> <system value="http://terminology.hl7.org/CodeSystem/v2-0203"/> <code value="SS"/> <display value="Social Security Number"/> </coding> <text value="Social Security Number"/> </type> <system value="http://hl7.org/fhir/sid/us-ssn"/> <value value="999-57-3355"/> </identifier> <identifier> <type> <coding> <system value="http://terminology.hl7.org/CodeSystem/v2-0203"/> <code value="DL"/> <display value="Driver's License"/> </coding> <text value="Driver's License"/> </type> <system value="urn:oid:2.16.840.1.113883.4.3.25"/> <value value="S99925942"/> </identifier> <identifier> <type> <coding> <system value="http://terminology.hl7.org/CodeSystem/v2-0203"/> <code value="PPN"/> <display value="Passport Number"/> </coding> <text value="Passport Number"/> </type> <system value="http://standardhealthrecord.org/fhir/StructureDefinition/passportNumber"/> <value value="X42032818X"/> </identifier> <name> <use value="official"/> <family value="Moen819"/> <given value="Willian804"/> <prefix value="Mr."/> </name> <telecom> <system value="phone"/> <value value="555-135-7303"/> <use value="home"/> </telecom> <gender value="male"/> <birthDate value="1955-10-09"/> <address> <extension url="http://hl7.org/fhir/StructureDefinition/geolocation"> <extension url="latitude"> <valueDecimal value="42.319304553912225"/> </extension> <extension url="longitude"> <valueDecimal value="-71.17365303910063"/> </extension> </extension> <line value="545 Tromp Port Unit 55"/> <city value="Needham"/> <state value="Massachusetts"/> <postalCode value="02492"/> <country value="US"/> </address> <maritalStatus> <coding> <system value="http://terminology.hl7.org/CodeSystem/v3-MaritalStatus"/> <code value="M"/> <display value="M"/> </coding> <text value="M"/> </maritalStatus> <multipleBirthBoolean value="false"/> <communication> 
<language> <coding> <system value="urn:ietf:bcp:47"/> <code value="en-US"/> <display value="English"/> </coding> <text value="English"/> </language> </communication> </Patient> </resource>
In some cases, you may need to construct more complex expressions like the one to extract marital_status from Patient.maritalStatus.coding[0].code (the first coding, written as coding[1] in XPath because XPath indices start at 1). You can use a tool like this XPath tester to help generate XPath expressions, though online tools such as these should not be used with real patient data. For more information on XPath, see this guide.
4 Elements with multiple sub-values
There are multiple identifier[N].value values for each instance of Patient in this dataset. By default, fhircrackr will concatenate these into a single cell per row, delimited with ::: (this is configurable; use fhir_table_description(..., sep = ' | ', ...) to delimit with | instead).
Fhircrackr provides some tools to split up multiple values stored in the same cell into separate rows in a “long” data frame:
table_desc_patient <- fhir_table_description(
  resource = "Patient",
  # Prefix values in cells with indices to facilitate handling cells that contain
  # multiple values
  brackets = c("[", "]")
)
df_patient_indexed <- fhir_crack(bundles = patient_bundle, design = table_desc_patient, verbose = 0)
df_patient_identifiers <- fhir_melt(
  indexed_data_frame = df_patient_indexed,
  columns = c("identifier.type.text", "identifier.value"),
  brackets = c("[", "]"),
  sep = ":::",
  all_columns = FALSE
)
df_patient_identifiers %>% head(10)
The df_patient_identifiers data frame printed above has one row for each value of Patient.identifier for each instance of Patient. The in-cell indices (surrounded by [ ]) can be removed:
df_patient_identifiers <- fhir_rm_indices(indexed_data_frame = df_patient_identifiers, brackets = c("[", "]"))
df_patient_identifiers %>% head(10)
These can then be merged back into the original data frame as needed. For example, if you want to include the synthetic “Social Security Number” in the original data:
df_patient %>%
  # Add in row numbers for joining
  mutate(
    row_number = row_number()
  ) %>%
  left_join(
    df_patient_identifiers %>%
      # Note: this assumes there is just one social security number for each patient in the data.
      # If this was not true, it would be necessary to remove extra data before joining so there
      # was one row per patient.
      filter(`identifier.type.text` == "Social Security Number") %>%
      rename(
        "ssn" = "identifier.value"
      ) %>%
      # Exclude the `identifier.type.text` column so it doesn't appear in the joined data frame
      select(resource_identifier, ssn) %>%
      # Fhircrackr generates the `resource_identifier` column as a string, but it needs to be
      # an integer for joining.
      mutate(resource_identifier = as.integer(resource_identifier)),
    by = c("row_number" = "resource_identifier")
  ) %>%
  head(5)
You can see that the synthetic SSNs are now split out into a separate column.
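The join relies on each patient having at most one “Social Security Number” row. A quick way to check that assumption before joining might look like the sketch below. The toy data frame mimics the shape of df_patient_identifiers after fhir_rm_indices(); patient 2 and its values are invented for illustration, and the names identifiers and duplicated_ssn are hypothetical.

```r
library(dplyr)

# Toy stand-in for df_patient_identifiers; patient "2" is hypothetical
identifiers <- data.frame(
  resource_identifier = c("1", "1", "2", "2"),
  identifier.type.text = c(
    "Medical Record Number", "Social Security Number",
    "Social Security Number", "Social Security Number"
  ),
  identifier.value = c(
    "439b24b4-6f25-4093-b101-47a39bd061ca", "999-57-3355",
    "000-00-0001", "000-00-0002"
  ),
  stringsAsFactors = FALSE
)

# Count SSN rows per patient and keep only patients with more than one
duplicated_ssn <- identifiers %>%
  filter(identifier.type.text == "Social Security Number") %>%
  count(resource_identifier, name = "n_ssn") %>%
  filter(n_ssn > 1)

print(duplicated_ssn)
```

An empty result means the assumption holds. Otherwise, the extra rows would need to be resolved (for example, with a rule for choosing one value per patient) before the join.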
6 Additional resources
NIH’s Office of Data Science Strategy has online exercises for converting FHIR-formatted data into tabular format for further analysis. These exercises include implementations in both Python and R. The R exercises go into greater depth on using fhircrackr to access FHIR data in R, including integrating FHIR data with data from other web APIs.