growthcleanr
offers several options for configuration,
with default values set to address common cases. You may wish to
experiment with the settings to discover which work best for your
research and your dataset.
All of the following options may be set as additional parameters in
the call to cleangrowth()
.
The following options change the behavior of the growthcleanr algorithm.
recover.unit.error
- default FALSE
;
when FALSE
, measurements identified as unit errors (e.g.,
apparent height values in inches instead of centimeters) will be flagged
but not corrected, when TRUE
these values will be corrected
and included as valid measurements for cleaning.
sd.extreme
- default 25
; a very extreme
value check on modified (recentered) Z-scores used as a first-pass
elimination of clearly implausible values, often due to misplaced
decimals.
z.extreme
- default 25
; similar usage
as sd.extreme
, for absolute Z-scores.
NOTE: many different steps in the
growthcleanr
algorithm use highly refined techniques to
identify implausible values that require more subtlety to detect. These
default values for sd.extreme
and z.extreme
are set very high by design to eliminate completely implausible values
early in the process; these extreme values require none of
growthcleanr
’s additional, more subtle approaches to be
identified as exclusions. If sd.extreme
and
z.extreme
are configured to be lower values, measurements
eliminated at the early step which checks against these extreme values
will not be further considered using later techniques.
include.carryforward
- default FALSE
;
if set to TRUE
, growthcleanr
will skip
algorithm step 9, which identifies carried forward measurements, and
will not flag these values for exclusion.
ewma.exp
- default -1.5
; the exponent
used for weighting measurements when calculating exponentially weighted
moving average (EWMA). This exponent should be negative to weight growth
measurements closer to the measurement being evaluated more strongly.
Exponents that are further from zero (e.g., -3
) will
increase the relative influence of measurements close in time to the
measurement being evaluated compared to using the default
exponent.
ref.data.path
- defaults to using CDC reference data
from year 2000; supply a file path to use alternate reference data. Note
that when running from an installed growthcleanr
package
(e.g. having called library(growthcleanr)
), this path does
not need to be specified. Developers testing the source code directly
from the source directory will need to specify this as well.
error.load.mincount
- default 2
;
minimum count of exclusions on parameter for one subject before
considering excluding all measurements.
error.load.threshold
- default 0.5
;
threshold to exceed for percentage of excluded measurement count
relative to count of included other measurement (e.g., if 3 of 5 WTs are
excluded, and 5 corresponding HTs are included, this exceeds the 0.5
threshold and the other two WTs will be excluded).
lt3.exclude.mode
- default default
;
determines type of exclusion procedure to use for 1 or 2 measurements of
one type without matching same ageday measurements for the other
parameter. Options include:
default
- standard growthcleanr approachflag.both
- in case of two measurements with at least
one beyond thresholds, flag both instead of one (as in default)sd.recenter
- default NA
; specifies how
to recenter medians. May be a data frame or table w/median SD-scores per
day of life by gender and parameter, or “nhanes
” or
“derive
” as a character vector.
sd.recenter
is specified as a data set, use the data
setsd.recenter
is specified as “nhanes
”,
use NHANES reference medianssd.recenter
is specified as “derive
”,
derive from inputsd.recenter
is not specified or NA
:
If specifying a data set, columns must include param, sex, agedays, and sd.median (referred to elsewhere as “modified Z-score”), and those medians will be used for centering. This data set must include a row for every ageday present in the dataset to be cleaned; the NHANES reference medians include a row for every ageday in the range (731-7305 days). A summary of how the NHANES reference medians were derived is below under NHANES reference data.
adult_cutpoint
- default 20
; number
between 18 and 20, describing age limit in years above which the
pediatric algorithm should not apply (< adult_cutpoint), and the
adult algorithm should apply (>= adult_cutpoint). Numbers outside
this range will be changed to the closest number within the
range.
weight_cap
- default Inf
; weight_cap
Positive number, describing a weight cap in kg (rounded to the nearest
.1, +/- .1) within an adult dataset. This may be used when a dataset
shared for research is known to clamp values for privacy reasons. If
there is no weight cap, set to Inf (default). This option is not used
with pediatric subjects.
The following options change the execution of the program overall, with no effect on the algorithm itself.
parallel
- default FALSE
; set to
TRUE
to run growthcleanr
in parallel. Running
in parallel will split the input data into batches (while keeping all
records for each subject together) and process each batch on a different
processor/core to maximize throughput. Recommended for large datasets
with more than 100K rows.
num.batches
- specifies the number of batches to run
in parallel; only applies if parallel
is set to
TRUE
. Defaults to the number of workers returned by the
getDoParWorkers()
function in the foreach
package. Note that processing in parallel may affect overall system
performance.
sdmedian.filename
- filename for optionally saving
sd.median data calculated on the input dataset to as CSV. Defaults to
""
, for which this data will not be saved. Use for
extracting medians for parallel processing scenarios other than the
built-in parallel option. See notes on large data
sets for details.
sdrecentered.filename
- filename to save re-centered
data to as CSV. Defaults to ““, for which this data will not be saved.
Useful for post-processing and debugging.
adult_columns_filename
- default ""
;
Name of file to save original adult data, with additional output columns
to as CSV. Defaults to ““, for which this data will not be saved. Useful
for post-analysis. For more information on this output, please see
README.
quietly
- default TRUE
; when
TRUE
, displays function messages and will output log files
when parallel
is TRUE
.
log.path
- default "."
; sets directory
for batch log file output when processing with
parallel = TRUE
. A new directory will be created if
necessary.
growthcleanr
releases
up to 1.2.4 offered two options for recentering medians, either the
default of deriving medians from the input set, or supplying an
externally-defined set of medians. These left out an option for
researchers working with either small datasets or with data which might
otherwise not be representative of the population, as deriving medians
from the input set in those cases might be problematic. To provide a
standard default reference to address these latter cases, a set of
medians were derived from the National Health and
Nutrition Examination Survey (NHANES). A summary of that process is
below. As of release 1.2.5, the default behavior is:
sd.recenter
is specified as a data set, use the data
setsd.recenter
is specified as nhanes
, use
NHANESsd.recenter
is specified as derive
,
derive from inputsd.recenter
is not specified or NA
:
With the verbose cleangrowth()
option
quietly = FALSE
, the recentering medians approach used will
be noted in the output. If the input set has fewer than 100 observations
for any age-year, this will also be noted in the output.
The NHANES reference medians are based primarily on data from NHANES 2009-2010 through 2017-2018, including approximately 39,000 heights/lengths and weights from children and adolescents between the ages of 0 months and <240 months. Weight and height SD scores were calculated from the L, M, and S parameters for the CDC growth charts were used as the reference to calculate weight and height SD scores for the NHANES 2009-2010 through 2017-2018 samples. Based on the distributions of age-days in children at 0 months, an age adjustment was made based on the median number of days among these infants. This adjustment was made after consultation with the National Center for Health Statistics confirmed that a general assumption of ages occurring at the midpoint of the indicated integer month of age did not apply to children recorded as 0 months, and uses 0.75 months instead.
Weights were supplemented with a random sample of birthweights from
NCHS’s Vital
Statistics Natality Birth Data for 2018. These had sample weights
assigned so that the sum of the sample weights for the sample equalled
the sum of the sample weights for each month for infants in NHANES, as
NHANES is a multi-stage complex survey. The reference data was then
smoothed using the svysmooth()
function in the R survey
package to estimate the weight and height SD scores for each day up to
7,305 days, with a bandwidth chosen to balance between over- and
under-fitting, and interpolation between the estimates from this
function was used to obtain an estimate for each day of age. Predictions
from a regression model fit to smoothed height SDs between 23 and 365
days (the youngest child in NHANES had an estimated age in days of 23)
were used to extend smoothed height SD scores to children between 1 and
22 days of age.