To get started with growthcleanr
, the
growthcleanr
package must be installed. To install the
latest growthcleanr
release from CRAN:
install.packages("growthcleanr")
To install the latest development version from GitHub using
devtools
:
devtools::install_github("carriedaymont/growthcleanr", ref="main")
Installing growthcleanr
will install several additional
packages in turn.
Further installation details and notes can be found under Installation.
Once the growthcleanr
package and its dependencies are
installed, the minimal input data vectors needed for cleaning data
are:
subjid
- a unique identifier for each subject; if you
have long (64-bit) integer subjid
values, add the
bit64
R package to your install list to ensure R will
process these correctlyparam
- a string designating the measurement being
‘HEIGHTCM
’, ‘LENGTHCM
’, or
‘WEIGHTKG
’agedays
- a numeric value for age in days at time of
measurement; will be rounded down (floor) to nearest integer if needed
(note that there are 30.4375 days in one month; see also the additional
notes
on calculating agemos
from whole numbers of monthssex
- a numeric value of 0
(male) or
1
(female)measurement
- numeric value of measurement in cm for
height or kg for weightNote: the dataset should be sorted by
subjid
, param
, and agedays
, as
demonstrated in the Example below.
Starting with a dataset such as the following:
subjid | param | agedays | sex | measurement |
---|---|---|---|---|
1 | HEIGHTCM | 2790 | 0 | 118.5 |
1 | HEIGHTCM | 3677 | 0 | 148.22 |
1 | WEIGHTKG | 2790 | 0 | 23.8 |
1 | WEIGHTKG | 3677 | 0 | 38.41 |
2 | HEIGHTCM | 2112 | 1 | 118.2 |
2 | HEIGHTCM | 2410 | 1 | 123.9 |
2 | HEIGHTCM | 2708 | 1 | 128.03 |
2 | HEIGHTCM | 3029 | 1 | 135.3 |
2 | WEIGHTKG | 2112 | 1 | 25.84 |
2 | WEIGHTKG | 2410 | 1 | 28.61 |
2 | WEIGHTKG | 2708 | 1 | 28.61 |
2 | WEIGHTKG | 3029 | 1 | 34.5 |
… | … | … | … | … |
Note that in this example, we see data from two subjects, sorted as
described above, with two pairs of height/weight measurements for the
first patient and four pairs for the second. Most of these values pass
the “eyeball test”, but the third weight measurement for
subjid
2 looks like it might be a value that has been
carried forward from the previous value, given that their same ageday
height has changed by several centimeters but their third height value
is exactly the same as their second. This is an example of an
implausible value and exclusion type that growthcleanr
might identify.
The cleangrowth()
function provided by
growthcleanr
will return a vector that specifies, for each
measurement, whether to:
Include
” the measurement,Missing
”,Exclude
” the measurement (numerous variations of
Exclude
exist, as specified in Understanding growthcleanr output).For a data.frame object source_data
containing growth
data:
library(growthcleanr)
# prepare data as a data.table
data <- as.data.table(source_data)
# set the data.table key for better indexing
setkey(data, subjid, param, agedays)
# generate new exclusion flag field using function
cleaned_data <- data[, gcr_result := cleangrowth(subjid, param, agedays, sex, measurement)]
# extract data limited only to values flagged for inclusion:
only_included_data <- cleaned_data[gcr_result == "Include"]
If our example dataset above were named source_data
,
examining cleaned_data
would show:
subjid | param | agedays | sex | measurement | gcr_result |
---|---|---|---|---|---|
1 | HEIGHTCM | 2790 | 0 | 118.5 | Include |
1 | HEIGHTCM | 3677 | 0 | 148.22 | Include |
1 | WEIGHTKG | 2790 | 0 | 23.8 | Include |
1 | WEIGHTKG | 3677 | 0 | 38.41 | Include |
2 | HEIGHTCM | 2112 | 1 | 118.2 | Include |
2 | HEIGHTCM | 2410 | 1 | 123.9 | Include |
2 | HEIGHTCM | 2708 | 1 | 128.03 | Include |
2 | HEIGHTCM | 3029 | 1 | 135.3 | Include |
2 | WEIGHTKG | 2112 | 1 | 25.84 | Include |
2 | WEIGHTKG | 2410 | 1 | 28.61 | Include |
2 | WEIGHTKG | 2708 | 1 | 28.61 | Exclude-Carried-Forward |
2 | WEIGHTKG | 3029 | 1 | 34.5 | Include |
… | … | … | … | … | … |
In this sample output, we see that growthcleanr
has
indeed flagged the third weight measurement for subjid 2
as
an apparent carried forward value. The measurement remains in the
dataset, and it is up to the researcher to determine which exclusions to
accept directly or investigate further.
This example uses the default configuration options for
cleangrowth()
. Longer examples demonstrating some common
options are in Usage, and full details about
all options are documented in Configuration.
To compute percentiles and Z-scores using the included CDC macro, continue with these next steps:
# Use the built-in utility function to convert the input observations to wide
# format for BMI calculation
cleaned_data_wide <- longwide(cleaned_data)
# Compute simple BMI
cleaned_data_bmi <- simple_bmi(cleaned_data_wide)
# Compute Z-scores and percentiles
cleaned_data_bmiz <- ext_bmiz(cleaned_data_bmi)
The longwide()
function defaults to converting only
records included by cleangrowth()
, with options to change
this default, and ext_bmiz()
will return the wide version
of the cleaned data with many metrics added. Both of these functions are
described in Computing BMI percentiles and
Z-scores.