Clean growth measurements
cleangrowth(
subjid,
param,
agedays,
sex,
measurement,
recover.unit.error = FALSE,
sd.extreme = 25,
z.extreme = 25,
lt3.exclude.mode = "default",
height.tolerance.cm = 2.5,
error.load.mincount = 2,
error.load.threshold = 0.5,
sd.recenter = NA,
sdmedian.filename = "",
sdrecentered.filename = "",
include.carryforward = FALSE,
ewma.exp = -1.5,
ref.data.path = "",
log.path = NA,
parallel = FALSE,
num.batches = NA,
quietly = TRUE,
adult_cutpoint = 20,
weight_cap = Inf,
adult_columns_filename = "",
prelim_infants = FALSE
)
Vector of unique identifiers for each subject in the database.
Vector identifying each measurement; may be 'WEIGHTKG', 'WEIGHTLBS', 'HEIGHTCM', 'HEIGHTIN', 'LENGTHCM', or 'HEADCM'. 'HEIGHTCM'/'HEIGHTIN' vs. 'LENGTHCM' only affects z-score calculations between ages 24 and 35 months (730 to 1095 days). All linear measurements below 731 days of life (age 0-23 months) are interpreted as supine length, and all linear measurements above 1095 days of life (age 36+ months) are interpreted as standing height. Note: at the moment, all LENGTHCM measurements are converted to HEIGHTCM. In the future, the algorithm will be updated to consider this difference. Additionally, imperial 'HEIGHTIN' and 'WEIGHTLBS' measurements are converted to metric during algorithm calculations.
Numeric vector containing the age in days at each measurement.
Vector identifying the gender of the subject, may be 'M', 'm', or 0 for males, vs. 'F', 'f' or 1 for females.
Numeric vector containing the actual measurement data. Weight must be in kilograms (kg), and linear measurements (height vs. length) in centimeters (cm).
Indicates whether the cleaning algorithm should attempt to identify unit errors (i.e., inches vs. cm, lbs vs. kg). If unit errors are identified, the value will be corrected and retained within the cleaning algorithm as a valid measurement. Defaults to FALSE.
Measurements more than sd.extreme standard deviations from the mean (either above or below) will be flagged as invalid. Defaults to 25.
Measurements with an absolute z-score greater than z.extreme will be flagged as invalid. Defaults to 25.
Determines the type of exclusion procedure to use for 1 or 2 measurements of one type without matching same-ageday measurements for the other parameter. Options include "default" (standard growthcleanr approach) and "flag.both" (in case of two measurements of one type without matching values for the other parameter, flag both for exclusion if beyond threshold). Defaults to "default".
Maximum decrease in height, in cm, tolerated for sequential measurements. Defaults to 2.5.
Minimum count of exclusions on a parameter before considering excluding all measurements. Defaults to 2.
Threshold for the ratio of excluded to included measurement counts that must be exceeded before excluding all measurements of either parameter. Defaults to 0.5.
Specifies how to recenter medians. May be a data frame or table with median SD-scores per day of life by gender and parameter, or "NHANES" or "derive" as a character vector.
If sd.recenter is specified as a data set, use the data set.
If sd.recenter is specified as "nhanes", use NHANES reference medians.
If sd.recenter is specified as "derive", derive from input.
If sd.recenter is not specified or NA:
    If the input set has at least 5,000 observations, derive medians from input.
    If the input set has fewer than 5,000 observations, use NHANES.
If specifying a data set, columns must include param, sex, agedays, and sd.median (referred to elsewhere as "modified Z-score"), and those medians will be used for recentering. A summary of how the NHANES reference medians were derived is available in README.md. Defaults to NA.
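The selection rules for sd.recenter described above can be sketched in base R. This is only an illustration of the documented behavior, not the package's implementation: `resolve_recenter()` and its `n_obs` argument are hypothetical names, and treating the character options as case-insensitive is an assumption.

```r
# Hypothetical helper (not part of growthcleanr): mirrors the documented
# rules for deciding which medians are used for recentering.
resolve_recenter <- function(sd.recenter, n_obs) {
  # A data set is used as supplied, provided it has the documented columns
  if (is.data.frame(sd.recenter)) {
    required <- c("param", "sex", "agedays", "sd.median")
    missing_cols <- setdiff(required, names(sd.recenter))
    if (length(missing_cols) > 0) {
      stop("sd.recenter is missing columns: ",
           paste(missing_cols, collapse = ", "))
    }
    return("user-supplied")
  }
  # Character options; assumption: matching ignores case
  if (is.character(sd.recenter) && !is.na(sd.recenter)) {
    choice <- tolower(sd.recenter)
    if (choice %in% c("nhanes", "derive")) {
      return(choice)
    }
    stop("unrecognized sd.recenter value: ", sd.recenter)
  }
  # Not specified or NA: decide from the size of the input
  if (n_obs >= 5000) "derive" else "nhanes"
}

resolve_recenter(NA, n_obs = 1200)       # "nhanes"
resolve_recenter(NA, n_obs = 5000)       # "derive"
resolve_recenter("NHANES", n_obs = 10)   # "nhanes"
```

A helper like this can be handy for recording, before a run, which recentering source a given input is expected to trigger.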
Name of the CSV file in which to save the sd.median data calculated on the input dataset. Defaults to "", in which case this data is not saved. Use for extracting medians for parallel processing scenarios other than the built-in parallel option.
Name of the CSV file in which to save the re-centered data. Defaults to "", in which case this data is not saved. Useful for post-processing and debugging.
Determines whether Carry-Forward values are kept in the output. Defaults to FALSE.
Exponent to use for weighting measurements in the exponentially weighted moving average calculations. Defaults to -1.5. This exponent should be negative in order to weight growth measurements closer to the measurement being evaluated more strongly. Exponents that are further from zero (e.g. -3) will increase the relative influence of measurements close in time to the measurement being evaluated compared to using the default exponent.
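To make the effect of the exponent concrete, the sketch below compares normalized weights for three neighboring measurements under two exponents. It assumes a weight of the form (age gap in days)^ewma.exp; the package's actual weighting may include additional terms, and `ewma_weights()` is a hypothetical helper written for this illustration.

```r
# Illustrative only: relative weights for neighbours at given age gaps (days),
# assuming weight = gap ^ ewma.exp. Not growthcleanr's own code.
ewma_weights <- function(gap_days, ewma.exp = -1.5) {
  w <- gap_days ^ ewma.exp
  w / sum(w)  # normalise so the weights sum to 1
}

# Days between each neighbour and the measurement being evaluated
gaps <- c(30, 90, 365)

w_default <- ewma_weights(gaps, ewma.exp = -1.5)
w_steeper <- ewma_weights(gaps, ewma.exp = -3)

# Side-by-side comparison of the weight shares
print(round(rbind(default = w_default, steeper = w_steeper), 3))
```

Under these assumptions, the nearest neighbor (30 days away) takes a larger share of the total weight with an exponent of -3 than with the default of -1.5.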
Path to reference data. If not supplied, the year 2000 Centers for Disease Control (CDC) reference data will be used.
Path to log file output when running in parallel (non-quiet mode). Default is NA. A new directory will be created if necessary. Set to NA to disable log files.
Determines if function runs in parallel. Defaults to FALSE.
Specify the number of batches to run in parallel. Only applies if parallel is set to TRUE. Defaults to the number of workers returned by the getDoParWorkers function in the foreach package.
Determines if function messages are to be displayed and if log files (parallel only) are to be generated. Defaults to TRUE.
Number between 18 and 20 giving the age, in years, at which the adult algorithm takes over: the pediatric algorithm is applied to younger ages (< adult_cutpoint), and the adult algorithm to ages at or above it (>= adult_cutpoint). Numbers outside this range will be changed to the closest number within the range. Defaults to 20.
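The range adjustment and the pediatric/adult split can be sketched as follows. Both functions are hypothetical helpers for illustration, and converting agedays to years with 365.25 days per year is an assumption rather than something stated in this documentation.

```r
# Hypothetical helpers illustrating the documented adult_cutpoint behaviour.

# Values outside [18, 20] are moved to the closest bound
clamp_cutpoint <- function(adult_cutpoint) {
  min(max(adult_cutpoint, 18), 20)
}

# Assumption: age in years = agedays / 365.25
assign_algorithm <- function(agedays, adult_cutpoint = 20) {
  cutpoint <- clamp_cutpoint(adult_cutpoint)
  ifelse(agedays / 365.25 >= cutpoint, "adult", "pediatric")
}

clamp_cutpoint(25)  # 20
clamp_cutpoint(17)  # 18

# Roughly 16.4, 19.2, and 20.3 years of age
assign_algorithm(c(6000, 7000, 7400), adult_cutpoint = 19)
```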
Positive number, describing a weight cap in kg (rounded to the nearest .1, +/- .1) within the adult dataset. If there is no weight cap, set to Inf. Defaults to Inf.
Name of the CSV file in which to save the original adult data along with additional output columns. Defaults to "", in which case this data is not saved. Useful for post-analysis. For more information on this output, please see the README.
TRUE/FALSE. Run the in-development release of the infants algorithm (expands pediatric algorithm to improve performance for children 0 – 2 years). Not recommended for use in research. For more information regarding the logic of the algorithm, see the vignette 'Preliminary Infants Algorithm.' Defaults to FALSE.
Vector of exclusion codes for each of the input measurements.
Possible values for each code are:
'Include', 'Unit-Error-High', 'Unit-Error-Low', 'Swapped-Measurements', 'Missing',
'Exclude-Carried-Forward', 'Exclude-SD-Cutoff', 'Exclude-EWMA-Extreme', 'Exclude-EWMA-Extreme-Pair',
'Exclude-Extraneous-Same-Day',
'Exclude-EWMA-8', 'Exclude-EWMA-9', 'Exclude-EWMA-10', 'Exclude-EWMA-11', 'Exclude-EWMA-12', 'Exclude-EWMA-13', 'Exclude-EWMA-14',
'Exclude-Min-Height-Change', 'Exclude-Max-Height-Change',
'Exclude-Pair-Delta-17', 'Exclude-Pair-Delta-18', 'Exclude-Pair-Delta-19',
'Exclude-Single-Outlier', 'Exclude-Too-Many-Errors', 'Exclude-Too-Many-Errors-Other-Parameter'
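Once a vector of exclusion codes is available, a quick tabulation shows which exclusion reasons dominate. The sketch below uses a small made-up `codes` vector as a stand-in for real output, so it runs without the package installed.

```r
# Hypothetical result vector, standing in for the output of cleangrowth()
codes <- c("Include", "Include", "Exclude-Carried-Forward",
           "Exclude-SD-Cutoff", "Include", "Exclude-EWMA-8")

# Count how often each code occurs, most frequent first
code_counts <- sort(table(codes), decreasing = TRUE)
print(code_counts)

# Logical mask of the measurements to keep
keep <- codes == "Include"
sum(keep)  # 3
```

The same `keep` mask can be used to subset the original data frame row by row, since the codes are returned in the order of the input measurements.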
# \donttest{
# Run calculation using a small subset of given data
df_stats <- as.data.frame(syngrowth)
df_stats <- df_stats[df_stats$subjid %in% unique(df_stats[, "subjid"])[1:5], ]
clean_stats <- cleangrowth(subjid = df_stats$subjid,
                           param = df_stats$param,
                           agedays = df_stats$agedays,
                           sex = df_stats$sex,
                           measurement = df_stats$measurement)
# Once processed you can filter data based on result value
df_stats <- cbind(df_stats, "clean_result" = clean_stats)
clean_df_stats <- df_stats[df_stats$clean_result == "Include", ]
# Parallel processing: run using 2 cores and batches
clean_stats <- cleangrowth(subjid = df_stats$subjid,
                           param = df_stats$param,
                           agedays = df_stats$agedays,
                           sex = df_stats$sex,
                           measurement = df_stats$measurement,
                           parallel = TRUE,
                           num.batches = 2)
#> Warning: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
# }