The nature of the growthcleanr algorithm is repetitive. It runs many
checks over each individual subject’s measurements, in a specific
sequence, and revises its assessments as it goes. Because of this, it
can take some time to process even a modest dataset. For reference, the
syngrowth synthetic data example packaged with growthcleanr takes 2-3
minutes to process on a contemporary laptop. growthcleanr uses optimized
libraries (e.g. data.table) to go as fast as possible, but there are
limits given the repeated passes over data required. If you have under
one million records, using growthcleanr may just necessitate planning to
wait a little while the job processes.
If you have a much larger dataset, there are several strategies to improve performance you might consider:
- The cleangrowth() option parallel should take advantage of multiple CPU cores when your hardware allows.
- Running growthcleanr on a machine with more CPU cores and more RAM can improve this further.
- Check that supporting libraries such as data.table are installed properly for your machine.
- Split the input into smaller files and process them with the gcdriver.R script described below.
For very large datasets, on the order of millions of records or more,
cleaning data can take several hours or more. When running this kind of
job, the risk of failures due to external factors such as power outages
increases. The current structure of growthcleanr does not checkpoint
results as progress is made, so if a job has to run overnight and fails
in the middle, a researcher would likely need to start over from
scratch.
Because growthcleanr operates for the most part on individual subjects
one at a time, however, this issue might be mitigated by splitting the
input data into many small files, then running growthcleanr separately
on each file, with results re-combined at the end. The primary benefit
of this strategy is that the results for each file are saved as it
completes, allowing a researcher whose job fails overnight to resume
processing on only the input files that are not yet complete. A
secondary benefit might be slightly better use of available RAM.
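The resume and re-combine steps can be sketched in shell. This is a minimal sketch, not part of growthcleanr: the helper names pending_inputs and combine_cleaned are hypothetical, and it assumes each cleaned output is named by appending -clean.csv to its input filename and that every output file carries the same header row.

```shell
#!/bin/sh
# Hypothetical helpers; the "<input>-clean.csv" naming is an assumption.

# Print every input file that has no non-empty cleaned output yet.
pending_inputs() {
  for infile in "$@"; do
    [ -f "$infile" ] || continue   # skip glob patterns that matched nothing
    [ -s "${infile}-clean.csv" ] || printf '%s\n' "$infile"
  done
}

# Concatenate cleaned files into one CSV, keeping only the first header row.
combine_cleaned() {
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Example (not run here):
#   pending_inputs mydata.?????.csv | while read -r f; do
#     Rscript exec/gcdriver.R --quietly --sdrecenter nhanes "$f" "$f-clean.csv"
#   done
#   combine_cleaned mydata.?????.csv-clean.csv > mydata-all-clean.csv
```

A job interrupted while writing could leave a truncated but non-empty output, which this check would treat as complete. Writing to a temporary name and renaming on success is one way to guard against that.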
Adopting this approach might require some custom code, and there are a few pitfalls to avoid. The following lays out a rough approach:
- Ensure that growthcleanr uses a consistent set of recentering means
for each run. This can be achieved a few ways, the simplest of which is
to specify the cleangrowth() option sd.recenter as "NHANES", which will
use a built-in reference set of recentering means derived from NHANES.
Another option is to run cleangrowth() once on your entire dataset with
the sdmedian.filename option, which saves the sd.median values computed
for the dataset as a whole. This creates a recentering means file you
can then specify using sd.recenter on each run on a subset of your data.
Either way, addressing this concern is critical, as growthcleanr may
otherwise generate these values itself for a large input dataset if they
are not specified. In other words, if a large dataset is split into
1,000 smaller files to be cleaned separately, each of those 1,000
growthcleanr jobs needs to use the same sd.median values to recenter or
the results will be inconsistent.
- Split the input data into smaller files, keeping each subject’s
records together in a single file. The splitinput() function will
perform this split safely, e.g.:
library(growthcleanr)
count <- splitinput(syngrowth, fname = "mydata", fdir = tempdir())
count
## [1] 7
list.files(tempdir(), pattern = "mydata.*")
## [1] "mydata.00000.csv" "mydata.00001.csv" "mydata.00002.csv" "mydata.00003.csv"
## [5] "mydata.00004.csv" "mydata.00005.csv" "mydata.00006.csv" "mydata.00007.csv"
- Use a driver script such as exec/gcdriver.R to execute growthcleanr on
each separate file, then write out its results when complete. An example
is below.
To invoke exec/gcdriver.R on a single input file using Rscript:
Rscript exec/gcdriver.R --quietly --sdrecenter nhanes mydata.csv mydata-cleaned.csv
If you have many small files saved, for example, as mydata.00001.csv,
mydata.00002.csv, etc., this driver can be invoked using the parallel
command-line tool (GNU Parallel, not the cleangrowth() option of the
same name):
ls mydata.?????.csv | parallel -j2 --eta \
"Rscript exec/gcdriver.R --quietly --sdrecenter nhanes {} {}-clean.csv"
This lists your input files and passes the filename list to parallel,
which invokes the driver script once per file and runs the jobs in
parallel as resources are available. Each job saves the cleaned output
with -clean.csv appended to the input filename, and as each completes,
the next file on the list is started until all are complete. A few
things to note:
- The -j2 option specifies running two jobs at once; -j0 tells parallel
to run as many jobs at once as it can. This might need to vary to match
specific hardware.
- The --eta option will report on progress.
- The --quietly option on the driver script will make it easier to
monitor progress with less verbose output coming from growthcleanr.
- The --sdrecenter option on the driver script should be set to ensure
each individual file is recentered using the same set for the entire
input. This example uses the built-in NHANES reference set; you could
substitute your own.
- Otherwise, this works like a single run of growthcleanr, with the main
difference being the saving of intermediate output as each smaller file
completes.
If your data to be cleaned is very large, it might help to store it
compressed, for example with gzip and its corresponding .gz filename
extension. growthcleanr can read in .gz input, but you might need to
install the R.utils package first. R will provide a message if this is
required.
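As a minimal sketch of the compression step, the hypothetical helper below (compress_inputs is not part of growthcleanr) writes a .gz copy next to each CSV and leaves the original in place. The commented invocation assumes the driver script accepts a .gz path the same way cleangrowth() input reading does.

```shell
#!/bin/sh
# Hypothetical helper: compress each CSV to <name>.gz, keeping the original.
compress_inputs() {
  for infile in "$@"; do
    [ -f "$infile" ] || continue   # skip glob patterns that matched nothing
    gzip -c "$infile" > "${infile}.gz"
  done
}

# Example (not run here; assumes the driver reads .gz input directly):
#   compress_inputs mydata.?????.csv
#   Rscript exec/gcdriver.R --quietly --sdrecenter nhanes \
#     mydata.00001.csv.gz mydata.00001-clean.csv
```

Once the compressed copies have been verified, the uncompressed originals can be removed to save disk space.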
The adult algorithm is new as of release 2.0.0 and may require further
optimization to improve speed. Testing found that performing an inline
data split (as opposed to physically splitting the data into many small
input sets and processing each separately, as described above) can
substantially improve performance on a single machine with many cores
available. Two options to the exec/gcdriver.R script take advantage of
this performance improvement.
The following example may be used on a large dataset of adult data.
Rscript exec/gcdriver.R --numbatches 4 --adult_split 50 my-large-input.csv my-large-input-cleaned.csv
- --numbatches 4 may be a good starting point for testing on a
desktop-class machine with four virtual cores.
- --adult_split 50 will divide the input set into 50 smaller subsets,
keeping all records for individual subjects together.
You may find that adjusting these numbers to take advantage of your available hardware can make a difference in overall run time.
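One way to explore these settings is to generate a small grid of commands and time each one. The sketch below is a dry run only: gc_tuning_commands is a hypothetical helper that prints one gcdriver.R command per combination, and the output filename pattern is an assumption made for illustration. It expects filenames without spaces.

```shell
#!/bin/sh
# Hypothetical helper: print one gcdriver.R command per combination of
# --numbatches and --adult_split values. Nothing is executed here.
# Arguments: infile, space-separated batch counts, space-separated split counts.
gc_tuning_commands() {
  infile=$1
  for nb in $2; do
    for split in $3; do
      printf 'Rscript exec/gcdriver.R --numbatches %s --adult_split %s %s %s\n' \
        "$nb" "$split" "$infile" "${infile%.csv}-nb${nb}-split${split}.csv"
    done
  done
}

# Example: gc_tuning_commands my-large-input.csv "2 4" "25 50"
```

Each printed line can then be run under the shell's time command on a representative sample of the data to compare run times before committing to a full job.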
gcdriver.R
All available options for gcdriver.R are described below.
Rscript exec/gcdriver.R --help
usage: gcdriver.R [--] [--help] [--quietly] [--opts OPTS] [--sdrecenter
       SDRECENTER] [--adult_cutpoint ADULT_CUTPOINT] [--weightcap
       WEIGHTCAP] [--numbatches NUMBATCHES] [--adult_split ADULT_SPLIT]
       infile outfile

CLI driver for growthcleanr

positional arguments:
  infile                input file
  outfile               output file

flags:
  -h, --help            show this help message and exit
  -q, --quietly         Disable verbose output

optional arguments:
  -x, --opts            RDS file containing argument values
  -s, --sdrecenter      sd.recenter data file [default: ]
  -a, --adult_cutpoint  adult cutpoint [default: 20]
  -w, --weightcap       weight cap [default: Inf]
  -n, --numbatches      Number of batches [default: 1]
  --adult_split         Number of splits to run data on [default: Inf]