With the growthcleanr package installed as described in Installation, it can be loaded like any other package:
library(growthcleanr)
For convenience, growthcleanr ships with an example synthetic dataset created using Synthea, with support for simulated growth measurement errors based on the protocol included as supplementary material to the Daymont et al. paper. This dataset is automatically loaded with the growthcleanr library, and is called syngrowth.
dim(syngrowth)
## [1] 77721 6
head(syngrowth)
## id subjid param agedays sex measurement
## 1 1 001aa16d-bf0e-a077-3b3d-5ab8b58545ad HEIGHTCM 1435 1 102.6
## 2 2 001aa16d-bf0e-a077-3b3d-5ab8b58545ad WEIGHTKG 1435 1 16.6
## 3 3 001aa16d-bf0e-a077-3b3d-5ab8b58545ad HEIGHTCM 1806 1 102.6
## 4 4 001aa16d-bf0e-a077-3b3d-5ab8b58545ad WEIGHTKG 1806 1 16.6
## 5 5 001aa16d-bf0e-a077-3b3d-5ab8b58545ad WEIGHTKG 1806 1 19.6
## 6 6 001aa16d-bf0e-a077-3b3d-5ab8b58545ad HEIGHTCM 2177 1 117.0
It can be used as in the following example. Note that processing the example data with cleangrowth() will likely take a few minutes to complete.
library(data.table)
library(dplyr)
# Convert the `syngrowth` data frame to a `data.table`
data <- as.data.table(syngrowth)
# `setkey()` creates an efficient sorting key on the `data.table`; this is required
# for `cleangrowth()`
setkey(data, subjid, param, agedays)
# Add a column `gcr_result` using `cleangrowth`
cleaned_data <- data[, gcr_result := cleangrowth(subjid, param, agedays, sex, measurement)]
# View a sample of the results
head(cleaned_data)
id subjid sex agedays param measurement gcr_result
1: 83330 002986c5-354d-bb9d-c180-4ce26813ca28 1 20489.22 HEIGHTCM 151.1 Include
2: 83332 002986c5-354d-bb9d-c180-4ce26813ca28 1 20860.22 HEIGHTCM 151.1 Include
3: 83334 002986c5-354d-bb9d-c180-4ce26813ca28 1 20860.22 HEIGHTCM 150.6 Exclude-Same-Day-Extraneous
4: 83335 002986c5-354d-bb9d-c180-4ce26813ca28 1 21231.22 HEIGHTCM 151.1 Include
5: 83337 002986c5-354d-bb9d-c180-4ce26813ca28 1 21602.22 HEIGHTCM 151.1 Include
6: 83339 002986c5-354d-bb9d-c180-4ce26813ca28 1 21623.22 HEIGHTCM 151.1 Include
# Summarize results by result type
cleaned_data %>% group_by(gcr_result) %>% tally(sort=TRUE)
# A tibble: 26 x 2
gcr_result n
<fct> <int>
1 Include 61652
2 Exclude-Extraneous-Same-Day 11263
3 Exclude-Carried-Forward 7093
4 Exclude-Same-Day-Extraneous 4010
5 Exclude-Same-Day-Identical 623
6 Exclude-SD-Cutoff 175
7 Exclude-EWMA-8 139
8 Exclude-Distinct-3-Or-More 125
9 Exclude-BIV 108
10 Exclude-EWMA-Extreme 99
# … with 16 more rows
If you are able to run these steps and see a similar result, you have the growthcleanr package installed correctly. The resulting cleaned_data can be reviewed, subsetted, and compared in more detail using all the tools R provides.
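As a minimal sketch of that kind of review, the following uses a small hand-made table in place of the real cleaned_data so that it runs in seconds without calling cleangrowth(). The toy rows and the names included and counts are illustrative assumptions; only the column names and the result labels come from the output shown above.

```r
library(data.table)

# Toy stand-in for `cleaned_data`; the rows are illustrative, not real output
cleaned_data <- data.table(
  subjid      = c("a", "a", "a", "b"),
  param       = c("HEIGHTCM", "HEIGHTCM", "WEIGHTKG", "WEIGHTKG"),
  agedays     = c(100, 100, 100, 200),
  measurement = c(60.1, 61.0, 5.9, 7.2),
  gcr_result  = factor(c("Include", "Exclude-Same-Day-Extraneous",
                         "Include", "Include"))
)

# Keep only the measurements that were not excluded
included <- cleaned_data[gcr_result == "Include"]

# Count result types separately for each parameter
counts <- cleaned_data[, .N, by = .(param, gcr_result)]
print(counts)
```

The same two expressions should work unchanged on the real cleaned_data, since they rely only on the param and gcr_result columns.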
For complete information about the options that can be set on the cleangrowth() function, see Configuration. Below are a few additional examples:
This example shows three configuration options in use:
cleaned_data <- data[,
  gcr_result_both := cleangrowth(
    subjid, param, agedays, sex, measurement,
    lt3.exclude.mode = "flag.both",
    ref.data.path = "inst/extdata/",
    quietly = FALSE
  )
]
lt3.exclude.mode = "flag.both" will exclude both measurements for a subject if the subject has only two unexcluded measurements of one parameter type, at least one of which is implausible, and no same age-day measurements of the other parameter.
ref.data.path = "inst/extdata/" shouldn’t be necessary if you have the growthcleanr package installed, but if you are running it from its source directly you may need to specify its full path.
quietly = FALSE enables verbose output, marking the progress of the algorithm through its many processing steps. This can be very helpful while testing.
This example shows built-in options for processing data in parallel batches, which can speed up processing of large datasets:
cleaned_data <- data[,
  gcr_result_both := cleangrowth(
    subjid, param, agedays, sex, measurement,
    parallel = TRUE,
    num.batches = 4,
    log.path = "logs"
  )
]
parallel tells cleangrowth() to run in parallel; the default is FALSE.
num.batches specifies how many batches to process in parallel.
The best num.batches value for your environment may vary depending on the computing resources you have available. If you do not specify num.batches, growthcleanr will estimate a batch count based on R functions for checking the system hardware. If you are working with large datasets, you may need to experiment with these options to determine the best settings for your needs. You may also find it helpful to review the additional notes on Working with large datasets.
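When experimenting, it can help to check what your machine reports before picking a value. The sketch below is one way to derive a starting point with detectCores() from base R's parallel package. The helper guess_num_batches() and its reserve argument are hypothetical, and this is not necessarily how growthcleanr computes its own estimate.

```r
library(parallel)

# Hypothetical helper, not part of growthcleanr: suggest a batch count
# from the number of cores R can detect, leaving `reserve` cores free.
guess_num_batches <- function(reserve = 1L) {
  cores <- detectCores()
  # detectCores() can return NA on platforms where detection fails
  if (is.na(cores)) {
    return(1L)
  }
  max(1L, as.integer(cores) - reserve)
}

num_batches <- guess_num_batches()
print(num_batches)
```

The resulting value could then be passed as num.batches and adjusted up or down after timing a few runs.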
The default value of log.path is ".", the current working directory. growthcleanr will write batch-specific log files to the log.path directory, and will create the directory if necessary.
Note that if you run growthcleanr in parallel with multiple batches and see warnings such as the following, they can be ignored:
Warning messages:
1: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
2: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
This set of warnings comes from one of growthcleanr’s dependencies, and does not indicate either failure or improper execution.
This final example demonstrates two options specific to adult subjects, adult_cutpoint and weight_cap.
cleaned_data <- data[,
  gcr_result_both := cleangrowth(
    subjid, param, agedays, sex, measurement,
    adult_cutpoint = 18,
    weight_cap = 181.4
  )
]
adult_cutpoint defines a subject age in years at which to cut growthcleanr analysis into separate sets for pediatric and adult subjects. The default is 20 years, and this may be lowered to 18, as shown above.
weight_cap specifies a known hard limit clamp for weight values. Sometimes data shared for research may have measurement values clamped to hard limits for privacy protection. The growthcleanr adult algorithm can be configured to watch for a known weight cap and adjust its assessment accordingly.
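Before setting these two options, it may be useful to see how many rows they would affect. The following sketch uses toy data and assumes that age in years can be approximated as agedays / 365.25; how growthcleanr converts ages internally is not described here, so treat that conversion, the toy rows, and the added column names as illustrative only.

```r
library(data.table)

adult_cutpoint <- 18     # years, as in the example above
weight_cap     <- 181.4  # kg, as in the example above

# Toy measurements; values are illustrative, not real data
toy <- data.table(
  subjid      = c("a", "a", "b", "b"),
  param       = c("HEIGHTCM", "WEIGHTKG", "WEIGHTKG", "WEIGHTKG"),
  agedays     = c(3650, 3650, 9000, 9400),
  measurement = c(138.0, 32.5, 181.4, 181.4)
)

# Assumption: approximate age in years from age in days
toy[, age_years := agedays / 365.25]

# Rows at or above the cutpoint would fall in the adult set
toy[, is_adult := age_years >= adult_cutpoint]

# Weights sitting exactly at the cap may have been clamped upstream
toy[, at_cap := param == "WEIGHTKG" & abs(measurement - weight_cap) < 1e-8]

print(toy)
```

A cluster of weights at exactly one value, as in subject "b" here, is the kind of pattern that suggests a cap was applied before the data were shared.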
The full algorithm for assessing adult measurements is described in detail in Adult algorithm.