A package to extend the capabilities available in the
sparklyr package with support for working with nested data.
Note that per the
sparklyr installation instructions, you will need to install Spark if you have not already done so or are not using a cluster where it is already installed.
Full documentation is available here: https://mitre.github.io/sparklyr.nested/
sparklyr package makes working with Spark in R easy. The goal of this package is to extend
sparklyr so that working with nested data is easy. The flagship functions are
Suppose I have data about aircraft phase of flight (e.g., climb, cruise, descent). The data is somewhat complex, storing radar data points marked as the start and end points of a given phase. Furthermore, the data is structured such that for a given flight, there are several phases (disjoint in time) in a nested array.
This is a data set that is not very natural for more R use cases (though the
tidyr package helps close this gap) but is fairly typical for Hadoop storage (e.g., using Avro or Parquet). The schema viewer (coupled with a json schema getter
sdf_schema_json) makes understanding the structure of the data simple.
spark_data is a Spark data frame. The structure may be understood by explanding/collapsing the schema via
It is also handy to use the schema viewer to quickly inspect what a pipeline of operations is going to return. This enables you to anticipate the structure of the output you are going to get (and make sure the operations are valid with respect to schema modification) without doing any actual computation on your data. For example:
sdf_select function makes it possible to select nested elements. For example, given the schema displayed above, we may be interested in only the phase start point; and within that, only the time and altitude fields. Grabbing these elements is not possible with a simple
select, but can be done via:
In java dots are not valid characters in a field name so the dot-operator is handled. However, since the dollar sign is typically used for this purpose in R, that is supported as well. The following will trigger the same operation as the above:
What if you want to keep all top level fields, except phases, from which you still want to start point time and altitude? This can be accomplished via:
In this particular example things are not so bad. But imagine a case where you have a wide data set. Listing out each top level field would quickly become cumbersome. To ease this pain, the
dplyr selection helpers are supported, so the above operation may be accomplished using:
The example above would work, but would return something of a mess. The
altitude fields would contain vectors of time and altitude values (respectively) for each row-wise element. Thus for a single
aircraft_id you would have N values in the corresponding
altitude fields. The explode functionality will flatten the data in the sense that it will replicate the top level record once for each element in a nested array along which you are exploding. It will not change the schema beyond transforming arrays to structures.
The above example could be repeated like so to get something more typical for R where there is a single value per field per “row” (record). In this case the
aircraft_id values would be replicated N times instead of having vectors in the
Unnesting here works essentially the same way as in the
tidyr package. It is necessary here because
tidyr functions cannot be called directly on spark data frames. An unnest operation is a combination of an explode with a corresponding nested select to promote the (exploded) nested fields up one level of the data schema.
Thus the following operations are equivalent:
spark_data %>% sdf_explode(phases) %>% sdf_select(aircraft_id, terrain_fusion_key, vertical_taxonomy_key, sequence, start_point=phases.start_point, start_point=phases.start_point, phase=phases.phase, takeoff_mode=phases.takeoff_model, landing_model=phases.landing_model, phases_primary_key=phases.primary_key, primary_key)
There are a few things to notice about how
sdf_unnest does things:
It will only dig one level deep into the schema. If there are fields nested 2 levels deep, they will still be nested (albeit only down 1 level) after
sdf_unnest is executed. Therefore you are not guaranteed totally flat data after calling
It will promote every one-level-deep field, even if you only care about a few of them.
In the event of a name conflict (e.g., there is a top level
primary_key and a
primary_key nested inside the
phases field) then all of the nested fields will be disambiguated using the name of the field in which it was nested.