vignettes/utilities.Rmd
The CDC provides a SAS macro for computing BMI percentiles and Z-scores.
The function `ext_bmiz()`, included in `growthcleanr`, provides an
equivalent feature. `ext_bmiz()` calculates sigma (the scale parameter for
the half-normal distribution), the extended BMI percentile, the extended
BMIz, and the CDC LMS z-scores for weight, height, and BMI. Note that for
BMIs ≤ 95th percentile of the CDC growth charts, the extended values for
BMI are equal to the LMS values. The extended values differ only for
children who have a BMI > 95th percentile.
The function assumes a variable `sex` and variables for age in months,
weight (kg), height (cm), and BMI (weight / height², in kg/m²). Please be
careful with age: the units should be months, and you should use the most
accurate information available (e.g., 23.4928 months). The extended BMIz
is the inverse cumulative distribution function (CDF) of the extended BMI
percentile. If the extended percentile is very close to 100, the `qnorm`
function in R produces an infinite value. This occurs only if the extended
BMI percentile is > 99.99999999999999, which is infrequent (for example, a
48-month-old with a BMI > 39), and it is likely that these BMIs represent
data entry errors. For these cases, extended BMIz is set to 8.21, a value
that is slightly greater than the largest value that can be calculated.
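The capping behavior described above can be illustrated in base R. This is a minimal sketch, not the code used inside `growthcleanr`: the helper names `ext_pct_sketch()` and `cap_bmiz_sketch()` are hypothetical, and the percentile formula for BMIs at or above the 95th percentile, 90 + 10 × Φ((BMI − p95) / sigma), is an assumption based on the extended method described in the NCHS report rather than something stated in this document. The `p95` and `sigma` inputs are the values shown for the 47-month-old girl in the example `ext_bmiz()` output in this document.

```r
# Hypothetical helper: extended BMI percentile for a BMI at or above the
# 95th percentile (assumed formula, see the note above).
ext_pct_sketch <- function(bmi, p95, sigma) {
  90 + 10 * pnorm((bmi - p95) / sigma)
}

# Hypothetical helper: convert a percentile to a z-score, replacing the
# positive infinite values that qnorm() returns at 100% with a fixed cap.
cap_bmiz_sketch <- function(pct, cap = 8.21) {
  z <- qnorm(pct / 100)
  z[is.infinite(z) & z > 0] <- cap
  z
}

# Largest finite z-score: qnorm() at the largest double below 1
z_max <- qnorm(1 - 2^-53)

# p95 and sigma for a girl aged about 47 months (from the example output)
pct_at_p95  <- ext_pct_sketch(18.02950, p95 = 18.02950, sigma = 2.274792)
pct_extreme <- ext_pct_sketch(39,       p95 = 18.02950, sigma = 2.274792)

z_at_p95  <- cap_bmiz_sketch(pct_at_p95)   # finite, equal to qnorm(0.95)
z_extreme <- cap_bmiz_sketch(pct_extreme)  # percentile rounds to 100, so capped
```

In this sketch, `z_max` falls just below 8.21, which is why 8.21 works as a cap that sorts above every value that can actually be computed.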
The `longwide()` function provides a convenient way to prepare data for
use with `ext_bmiz()`. The example in the next section shows a potential
workflow for taking the “long” (one observation per row) data from
`cleangrowth()` and converting it to the “wide” (height and weight on one
line) format required by `ext_bmiz()`. An additional function,
`recode_sex()`, supports recoding coded values for `sex` from one value
set to another.
The CDC SAS macro was updated in December 2022, according to the findings
of this NCHS report. The `ext_bmiz()` function has been updated to match
it as of `growthcleanr` v2.1.0.
Because `ext_bmiz()` performs cross-sectional analysis of BMI, observation
data must be in a wide format, i.e., with height and weight information on
the same row. This is distinct from `cleangrowth()`, which performs
longitudinal analysis on all observations for each subject, presented in a
long format with one observation per row. To facilitate use of both
functions, `growthcleanr` includes utility functions to transform data
used with `cleangrowth()` for use with `ext_bmiz()`. They are optimized to
move data directly from the output of `cleangrowth()` into input for
`ext_bmiz()`, but have options to support independent use as well.
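The difference between the two layouts can be sketched in base R. The example below is illustrative only and is not how `longwide()` is implemented. It uses the standard `growthcleanr` column names with a handful of rows for a single subject, keeps the observations flagged `Include`, and joins heights and weights taken at the same age. The age-in-months conversion is an assumption made for this sketch.

```r
# Toy long-format data: one measurement per row (illustrative values)
long_df <- data.frame(
  id = c(1510, 1511, 1518, 1512, 1519, 1513),
  subjid = 775155,
  sex = 0,
  agedays = c(889, 889, 889, 1071, 1071, 1253),
  param = c("HEIGHTCM", "HEIGHTCM", "WEIGHTKG",
            "HEIGHTCM", "WEIGHTKG", "HEIGHTCM"),
  measurement = c(84.90, 89.06, 13.10, 92.50, 14.70, 96.20),
  gcr_result = c("Exclude-Extraneous-Same-Day", rep("Include", 5)),
  stringsAsFactors = FALSE
)

# Keep only observations flagged for inclusion
included <- long_df[long_df$gcr_result == "Include", ]

# Split by parameter; rename the measurement and observation id columns
keys <- c("subjid", "sex", "agedays")
ht <- included[included$param == "HEIGHTCM", c(keys, "measurement", "id")]
wt <- included[included$param == "WEIGHTKG", c(keys, "measurement", "id")]
names(ht)[4:5] <- c("ht", "ht_id")
names(wt)[4:5] <- c("wt", "wt_id")

# Inner join: only ages with both a height and a weight survive
wide_df <- merge(wt, ht, by = keys)

# merge() may order multi-column keys as text, so sort explicitly
wide_df <- wide_df[order(wide_df$subjid, wide_df$agedays), ]

# Age in months (assumed conversion; growthcleanr's rounding may differ)
wide_df$agem <- wide_df$agedays / 365.25 * 12
```

Six long rows become two wide rows here: the excluded height is dropped, and the height at 1253 days has no matching weight.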
Using the `syngrowth` example dataset, to convert the data after it has
been cleaned by `cleangrowth()` for use with `ext_bmiz()`, use
`longwide()` and `simple_bmi()`:
# Use the built-in utility function to convert the input observations to wide
# format for BMI calculation
cleaned_data_wide <- longwide(cleaned_data)
# Compute simple BMI values (adds column "bmi")
cleaned_data_bmi <- simple_bmi(cleaned_data_wide)
# Compute Z-scores and percentiles
cleaned_data_bmiz <- ext_bmiz(cleaned_data_bmi)
Note that this assumes that `cleaned_data` has the same structure as
described in Quickstart - Data preparation:
names(cleaned_data)
[1] "id" "subjid" "sex" "agedays" "param" "measurement" "gcr_result"
The wide dataset `cleaned_data_wide` will include rows with aligned height
and weight measurements drawn from the observations in `cleaned_data`
marked by `cleangrowth()` for inclusion. As such, it will be a shorter
dataset (fewer rows) based on fewer observations.
dim(cleaned_data)
[1] 85728 7
dim(cleaned_data_wide)
[1] 26701 9
head(cleaned_data_wide)
subjid agey agem sex wt wt_id ht ht_id agedays
1 002986c5-354d-bb9d-c180-4ce26813ca28 56.0964 673.1568 2 71.7 83331 151.1 83330 20489.22
2 002986c5-354d-bb9d-c180-4ce26813ca28 57.1122 685.3464 2 73.2 83333 151.1 83332 20860.22
3 002986c5-354d-bb9d-c180-4ce26813ca28 58.1279 697.5348 2 74.6 83336 151.1 83335 21231.22
4 002986c5-354d-bb9d-c180-4ce26813ca28 59.1437 709.7244 2 72.8 83338 151.1 83337 21602.22
5 002986c5-354d-bb9d-c180-4ce26813ca28 59.2012 710.4144 2 72.4 83340 151.1 83339 21623.22
6 002986c5-354d-bb9d-c180-4ce26813ca28 60.1594 721.9128 2 69.4 83343 151.1 83342 21973.22
In this example, subject identifiers are kept in the `subjid` column, and
the identifiers of the individual observations from the long data are
carried over as `wt_id` and `ht_id`, so each wide row records which weight
and height observations it combines.
`longwide()` can be called with name mapping parameters if your input set
uses different column names. For example, if `my_cleaned_data` specifies
age in days as `aged` and parameter type as `type`, specify each, with
quotes:
head(my_cleaned_data)
id subjid sex aged type measurement gcr_result
1: 1510 775155 0 889 HEIGHTCM 84.90 Exclude-Extraneous-Same-Day
2: 1511 775155 0 889 HEIGHTCM 89.06 Include
3: 1518 775155 0 889 WEIGHTKG 13.10 Include
4: 1512 775155 0 1071 HEIGHTCM 92.50 Include
5: 1519 775155 0 1071 WEIGHTKG 14.70 Include
6: 1513 775155 0 1253 HEIGHTCM 96.20 Include
longwide(my_cleaned_data, agedays = "aged", param = "type")
By default, `longwide()` will only transform records flagged by
`cleangrowth()` for inclusion. To include more categories assigned by
`cleangrowth()`, use the `inclusion_types` option. For example, to include
carried forward values along with included records for the BMI
calculation:
cleaned_data_wide_cf <- longwide(
long_df = cleaned_data,
inclusion_types=c("Include", "Exclude-Carried-Forward")
)
Another option, `include_all`, is set to `FALSE` by default; setting it to
`TRUE` includes all observations in the transformation, whatever result
`cleangrowth()` assigned to them. Additional options provide flexibility
to preserve additional columns and unmatched observation rows.
See `longwide()` for full details.
With wide data in hand, output taken directly from `longwide()` can have
BMI added with `simple_bmi()`, and then the output can be passed to
`ext_bmiz()`, as shown in the simple example above. Alternatively, you can
provide a similarly formatted data frame directly to `ext_bmiz()`.
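For reference, BMI is weight in kilograms divided by the square of height in meters. The sketch below uses a hypothetical helper, `bmi_sketch()`, rather than `simple_bmi()` itself, and applies it to the weights and heights from the first two rows of the `ext_bmiz()` example output in this document. The results agree with the `bmi` values listed there to the precision displayed.

```r
# Hypothetical helper: BMI from weight in kg and height in cm
bmi_sketch <- function(wt_kg, ht_cm) {
  wt_kg / (ht_cm / 100)^2
}

# Weight and height taken from the first two rows of the example output
wide_df <- data.frame(wt = c(35.4, 39.2), ht = c(141.6, 147.9))
wide_df$bmi <- bmi_sketch(wide_df$wt, wide_df$ht)

round(wide_df$bmi, 5)
```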
Note that `ext_bmiz()` allows for the `sex` variable to be coded using a
range of possible values, but not the same `0` and `1` values as
`cleangrowth()`. This difference from the `growthcleanr` data preparation
specification maintains compatibility with the CDC SAS macro. The
`longwide()` function will handle this conversion from `growthcleanr`’s
`0` (male) or `1` (female), but not from other coded values.
If you are using input data with different value codes for `sex` with
`ext_bmiz()`, use `recode_sex()` to ensure your values are recoded first.
For example, if you have data in the PCORnet CDM format (using `M` and
`F`), and want to prepare it for `ext_bmiz()`:
recode_sex(
input_data = cdm_formatted,
sourcecol = "sex",
sourcem = "M",
sourcef = "F",
targetm = 1L,
targetf = 2L
)
`recode_sex()` can also be used for other purposes, such as recoding
values in preparation for cleaning with `cleangrowth()`, or transforming
`growthcleanr` output to match external specifications.
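The idea behind this kind of recoding can be sketched in a few lines of base R. The function `recode_sex_sketch()` below is hypothetical and is not the package's `recode_sex()`: it works on a plain vector rather than a data frame, and mapping unrecognized codes to `NA` is an assumption made for this sketch.

```r
# Hypothetical helper: map source codes for male/female to target codes.
# Codes that match neither source value become NA (an assumption here).
recode_sex_sketch <- function(x, source_m = "M", source_f = "F",
                              target_m = 1L, target_f = 2L) {
  out <- rep(NA_integer_, length(x))
  out[x %in% source_m] <- target_m
  out[x %in% source_f] <- target_f
  out
}

# PCORnet-style codes to the 1/2 coding accepted by ext_bmiz()
recode_sex_sketch(c("M", "F", "F", "U"))

# growthcleanr's 0/1 coding to 1/2
recode_sex_sketch(c(0, 1, 1), source_m = 0, source_f = 1)
```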
With data in wide format with BMI, and with `sex` values properly coded
(as any of ‘1’, ‘b’, ‘B’, ‘Boys’, ‘m’, ‘M’, ‘male’, or ‘Male’ for male
subjects and any of ‘2’, ‘g’, ‘G’, ‘Girls’, ‘f’, ‘F’, ‘female’, or
‘Female’ for female subjects), `ext_bmiz()` can be called:
cleaned_data_bmiz <- ext_bmiz(cleaned_data_bmi)
head(cleaned_data_bmiz)
subjid agey age sex wt wt_id ht ht_id agedays bmi bmiz
<char> <num> <num> <int> <num> <int> <num> <int> <int> <num> <num>
1: 001aa16d-bf0e-a077-3b3d-5ab8b58545ad 10.0233 120.2796 2 35.4 17 141.6 15 3661 17.65537 0.3236612
2: 001aa16d-bf0e-a077-3b3d-5ab8b58545ad 11.0390 132.4680 2 39.2 19 147.9 18 4032 17.92048 0.1734315
3: 001aa16d-bf0e-a077-3b3d-5ab8b58545ad 12.0548 144.6576 2 44.8 21 155.1 20 4403 18.62320 0.1832443
4: 001aa16d-bf0e-a077-3b3d-5ab8b58545ad 12.5914 151.0968 2 47.8 23 158.7 22 4599 18.97903 0.1829183
5: 001aa16d-bf0e-a077-3b3d-5ab8b58545ad 13.0705 156.8460 2 50.5 26 160.8 24 4774 19.53077 0.2586449
6: 001aa16d-bf0e-a077-3b3d-5ab8b58545ad 3.9288 47.1456 2 16.6 2 102.6 1 1435 15.76933 0.3453978
bmip waz wp haz hp p95 p97 bmip95 mod_bmiz mod_waz mod_haz
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: 62.69027 0.3498817 63.67862 0.5140553 69.63933 22.96109 24.57902 76.89254 0.18485501 0.2399201 0.5008944
2: 56.88438 0.2311645 59.14065 0.5002225 69.15408 24.13836 25.90525 74.24067 0.09543668 0.1556259 0.4955022
3: 57.26968 0.3237374 62.69316 0.4803298 68.45036 25.26981 27.17179 73.69745 0.10065274 0.2201655 0.4855100
4: 57.25689 0.3812700 64.84986 0.5153244 69.68368 25.83904 27.80781 73.45100 0.10011666 0.2596479 0.5212281
5: 60.20454 0.4440465 67.14955 0.4849566 68.61464 26.32757 28.35405 74.18371 0.14353771 0.3024342 0.4885546
6: 63.51023 0.4369963 66.89430 0.5348018 70.36065 18.02950 18.60078 87.46409 0.24462581 0.3332011 0.5224816
sigma original_bmip original_bmiz sev_obese obese
<num> <num> <num> <int> <int>
1: 4.443536 62.69027 0.3236612 0 0
2: 4.797031 56.88438 0.1734315 0 0
3: 5.148292 57.26968 0.1832443 0 0
4: 5.332930 57.25689 0.1829183 0 0
5: 5.497248 60.20454 0.2586449 0 0
6: 2.274792 63.51023 0.3453978 0 0
The output columns include:
variable | description |
---|---|
bmi | BMI |
bmiz, bmip | LMS / Extended z-score and percentile |
waz, wp | LMS Weight-for-sex/age z-score and percentile |
haz, hp | LMS Height-for-sex/age z-score and percentile |
p95, p97 | 95th and 97th percentile of BMI in growth charts |
bmip95 | BMI expressed as a percentage of the 95th percentile. A value ≥ 120 is widely used as the cut point for severe obesity. |
mod_bmiz, mod_waz, mod_haz | Modified BMI-for-age, Weight-for-age, and Height-for-age z-scores for identifying outliers (see the information in the CDC SAS growth charts program website) |
sigma | Scale parameter for half-normal distribution |
original_bmiz, original_bmip | LMS BMI-for-sex/age z-score and percentile |
sev_obese | BMI >= 120% of 95th percentile (0/1) |
obese | BMI >= 95th percentile (0/1) |
For convenience, these labels are available on the output of `ext_bmiz()`,
e.g., when viewed in RStudio with `View(cleaned_data_bmiz)`.
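Several of these columns are simple functions of one another, which can be useful when sanity-checking output. The sketch below recomputes `bmip`, `bmip95`, `obese`, and `sev_obese` from the `bmi`, `bmiz`, and `p95` values of rows 1 and 6 in the example output. It follows the column descriptions in the table and is not the package's own code.

```r
# bmi, bmiz and p95 copied from rows 1 and 6 of the example output
out <- data.frame(
  bmi  = c(17.65537, 15.76933),
  bmiz = c(0.3236612, 0.3453978),
  p95  = c(22.96109, 18.02950)
)

# Percentile is the normal CDF of the z-score, expressed in percent
out$bmip <- 100 * pnorm(out$bmiz)

# BMI as a percentage of the 95th percentile
out$bmip95 <- 100 * out$bmi / out$p95

# Indicator columns, following the definitions in the table
out$obese     <- as.integer(out$bmi >= out$p95)
out$sev_obese <- as.integer(out$bmip95 >= 120)

out
```

The recomputed `bmip` and `bmip95` values agree with the example output up to the rounding of the copied inputs.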
Like `longwide()`, `ext_bmiz()` also includes options for mapping
alternate column names for age, weight, height, and BMI. The default
column names are the same as the output from `longwide()` for convenience.
If you have different column names, specify the column names without
quotes. For example, for a dataset using “heightcm” and “weightkg” instead
of “ht” and “wt”:
my_cleaned_data_bmiz <- ext_bmiz(my_cleaned_data_wide_bmi, ht = heightcm, wt = weightkg)
For `ext_bmiz()`, use the most precise age in months available. If an
input dataset only has age in months as integer values, by default
`ext_bmiz()` will automatically convert these to double values and add
`0.5` to account for the distribution of actual ages over the range of
days within a month. This is enabled with the option `adjust.integer.age`,
set to `TRUE` by default. Specify `FALSE` to disable.
my_cleaned_data_bmiz <- ext_bmiz(my_cleaned_data_wide_bmi, adjust.integer.age = FALSE)
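The effect of this adjustment can be sketched as follows. `adjust_integer_age_sketch()` is a hypothetical helper, not the logic inside `ext_bmiz()`. In particular, this sketch treats ages as integer when every value is a whole number, which is an assumption; the package may detect integer ages differently.

```r
# Hypothetical helper: shift whole-month ages to the middle of the month
adjust_integer_age_sketch <- function(agem, adjust = TRUE) {
  agem <- as.numeric(agem)
  # Assumption: ages count as "integer" when all values are whole numbers
  all_whole <- all(agem == floor(agem), na.rm = TRUE)
  if (adjust && all_whole) agem + 0.5 else agem
}

adjust_integer_age_sketch(c(24L, 36L))                  # shifted by 0.5
adjust_integer_age_sketch(c(24, 36.2))                  # left unchanged
adjust_integer_age_sketch(c(24L, 36L), adjust = FALSE)  # left unchanged
```

A child recorded as 24 months old could be anywhere from 24.0 to just under 25.0 months, so 24.5 is the least biased single value when nothing more precise is known.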
`ext_bmiz()` uses reference data provided by the CDC, included in the
`growthcleanr` package as `inst/extdata/CDCref_d.csv`. This file is
automatically loaded and used by default. If you are working with a
different reference dataset or developing the `growthcleanr` package,
specify an alternate path to this file with `ref.data.path`, as for
`cleangrowth()`.
The CDC provides a SAS Program for the 2000 CDC Growth Charts, which can
also be used to identify biologically implausible values using a different
approach. That approach is also implemented in `growthcleanr` in the
function `ext_bmiz()`, described above. The SAS program was updated in
December 2022, according to the findings of this NCHS report, and
`ext_bmiz()` has been updated to match it as of `growthcleanr` v2.1.0.
GrowthViz provides insights into how `growthcleanr` assesses data,
packaged in a Jupyter notebook. It ships with the same `syngrowth`
synthetic example dataset as `growthcleanr`, with cleaning results
included.