Working with large data sets

The nature of the growthcleanr algorithm is repetitive. It runs many checks over each individual subject’s measurements, in a specific sequence, and revises its assessments as it goes. Because of this, it can take some time to process even a modest dataset. For reference, the syngrowth synthetic data example packaged with growthcleanr takes 2-3 minutes to process on a contemporary laptop. growthcleanr uses optimized libraries (e.g. data.table) to go as fast as possible, but there are limits given the repeated passes over data required. If you have under one million records, using growthcleanr may just necessitate planning to wait a little while the job processes.

If you have a much larger dataset, there are several strategies to improve performance you might consider:

The built-in cleangrowth() option parallel should take advantage of multiple CPU cores when your hardware allows.
Running growthcleanr on a machine with more CPU cores and more RAM can improve this further.
Verify that supporting libraries like data.table are installed properly for your machine
If you are cleaning very large datasets with several million observations or more, you might consider the techniques for splitting input and processing in parallel jobs described below.
If you are cleaning adult data, an in-memory data-splitting technique is implemented in the gcdriver.R script described below.
Anecdotally, Windows users have sometimes seen slower performance when compared with similarly-appointed hardware running Linux or macOS.

Splitting input datasets

For very large datasets, on the order of millions of records or more, cleaning data can take several hours or more. When running this kind of job, the risk of failures due to external factors such as power outages increases. The current structure of growthcleanr does not checkpoint results as progress is made, so if a job has to run overnight and fails in the middle, a researcher would likely need to start over from scratch.

Because growthcleanr operates for the most part on individual subjects one at a time, however, this issue might be mitigated by splitting the input data into many small files, then running growthcleanr separately on each file, with results re-combined at the end. The primary benefit of this strategy would be the saving of the results of each file as it completes, allowing a researcher whose job fails overnight to resume processing on only the as yet incomplete input files. A secondary benefit might be a slight improvement of use of available RAM.

Adopting this approach might require some custom code, and there are few pitfalls to avoid. The following lays out a rough approach:

For cleaning pediatric data, ensure that growthcleanr uses a consistent set of recentering means for each run. This can be achieved a few ways, the simplest of which is to specify the cleangrowth() option sd.recenter as "NHANES", which will use a built-in reference set of recentering means derived from NHANES. Another option is to run cleangrowth() once on your entire dataset to extract sd.median values for the entire dataset as a whole using the sdmedian.filename option. This would then create a recentering means file you could specify using sd.recenter on each run on a subset of your data. Either way, addressing this concern is critical, as growthcleanr will may otherwise generate these values itself for a large input dataset if they are not otherwise specified. In other words, if a large dataset is split into 1,000 smaller files to be cleaned separately, each of those 1,000 growthcleanr jobs needs to use the same sd.median values to recenter or the results will be inconsistent.
Split the data into many small files, but keep each individual subject’s measurements together in one file. If, for example, a subject has 12 measurements, all 12 should be in one and only one of the smaller files to maximize the longitudinal analysis of their values. The splitinput() function will perform this split safely, e.g.:

library(growthcleanr)
count <- splitinput(syngrowth, fname = "mydata", fdir = tempdir())
count

## [1] 7

list.files(tempdir(), pattern = "mydata.*")

## [1] "mydata.00000.csv" "mydata.00001.csv" "mydata.00002.csv" "mydata.00003.csv"
## [5] "mydata.00004.csv" "mydata.00005.csv" "mydata.00006.csv" "mydata.00007.csv"

Use the standalone driver script exec/gcdriver.R to execute growthcleanr` on each separate file, then write out its results when complete. An example is below.
Invoke the job using a tool like GNU Parallel, which can be run on all major platforms. An example invocation is below as well.
If a failure occurs, set aside both the completed inputs and their results. Then re-run the job on only the remaining input data.
When complete, re-combine the data into one file or as otherwise appropriate.

To invoke exec/gcdriver.R on a single input file using Rscript:

Rscript exec/gcdriver.R --quietly --sdrecenter nhanes mydata.csv mydata-cleaned.csv

If you have many small files saved, for example, as mydata.00001.csv, mydata.00002.csv, etc., this driver can be invoked using parallel:

ls mydata.?????.csv | parallel -j2 --eta \
  "Rscript exec/gcdriver.R --quietly --sdrecenter nhanes {} {}-clean.csv"

This lists your input files, and passes the filename list to parallel to use in invoking the driver script, one file at a time, and to process as parallel jobs as resources are available. Each run of the job then saves the cleaned output with -clean.csv appended to the input filename, and as each completes, the next file on the list will be started until all are complete. A few things to note:

The careful use of naming and wildcards when listing input files and naming output files can save accidental re-running of data from output files
The -j2 option specifies running two jobs at once; -j0 would use as many CPU cores as a machine has available. This might need to vary to match specific hardware.
The --eta option will report on progress.
The --quietly option on the driver script will make it easier to monitor progress with less verbose output coming from growthcleanr.
The --sdrecenter option on the driver script should be set to ensure each individual file is recentered using the same set for the entire input. This example uses the built-in NHANES reference set; you could substitute your own.
If multiple cores are available, the job should proceed with speedup roughly similar to what can be gained using the parallel batching feature built in to growthcleanr, with the main difference being the saving of intermediate output as each smaller file completes.

If your data to be cleaned is very large, it might help to store it compressed, for example with gzip and its corresponding .gz filename extension. growthcleanr can read in .gz input, but you might need to install the R.utils package first. R will provide a message if this is required.

Batch splitting for adult data

The adult algorithm is new as of release 2.0.0 and may require further optimization to improve speed performance. Testing discovered that performing an inline data split (as opposed to physically splitting the data into many small input sets and processing each separately, as described above) can gain substantial performance on a single machine with many cores available. Two options to the exec/gcdriver.R script take advantage of this performance improvement.

The following example may be used on a large dataset of adult data.

Rscript exec/gcdriver.R --numbatches 4 --adult_split 50 my-large-input.csv my-large-input-cleaned.csv

--numbatches 4 may be a good starting point for testing on a desktop-class machine with four virtual cores.
--adult_split 50 will divide the input set into 50 smaller subsets, keeping all records for individual subjects together

You may find that adjusting these numbers to take advantage of your available hardware can make a difference in overall run time.

Reference for `gcdriver.R`

All available options for gcdriver.R are described below.

Rscript exec/gcdriver.R --help
usage: gcdriver.R [--] [--help] [--quietly] [--opts OPTS] [--sdrecenter
       SDRECENTER] [--adult_cutpoint ADULT_CUTPOINT] [--weightcap
       WEIGHTCAP] [--numbatches NUMBATCHES] [--adult_split ADULT_SPLIT]
       infile outfile

CLI driver for growthcleanr

positional arguments:
  infile                input file
  outfile               output file

flags:
  -h, --help            show this help message and exit
  -q, --quietly         Disable verbose output

optional arguments:
  -x, --opts            RDS file containing argument values
  -s, --sdrecenter      sd.recenter data file [default: ]
  -a, --adult_cutpoint  adult cutpoint [default: 20]
  -w, --weightcap       weight cap [default: Inf]
  -n, --numbatches      Number of batches [default: 1]
  --adult_split         Number of splits to run data on [default: Inf]

2024-02-02

Splitting input datasets

Batch splitting for adult data

Reference for gcdriver.R

Reference for `gcdriver.R`