Unnest data along a column

Unnesting is an (optional) explode operation coupled with a nested select to promote the sub-fields of the exploded top level array/map/struct to the top level. Hence, given a, an array with fields a1, a2, a3, then codesdf_explode(df, a) will produce output with each record replicated for every element in the a array and with the fields a1, a2, a3 (but not a) at the top level. Similar to tidyr::unnest.

sdf_unnest(x, column, keep_all = FALSE)

Arguments

x	An object (usually a `spark_tbl`) coercible to a Spark DataFrame.
column	The field to explode
keep_all	Logical. If `FALSE` then records where the exploded value is empty/null will be dropped.

Details

Note that this is a less precise tool than using sdf_explode and sdf_select directly because all fields of the exploded array will be kept and promoted. Direct calls to these methods allows for more targetted use of sdf_select to promote only those fields that are wanted to the top level of the data frame.

Additionally, though sdf_select allows users to reach arbitrarily far into a nested structure, this function will only reach one layer deep. It may well be that the unnested fields are themselves nested structures that need to be dealt with accordingly.

Note that map types are supported, but there is no is_map argument. This is because the function is doing schema interrogation of the input data anyway to determine whether an explode operation is required (it is of maps and arrays, but not for bare structs). Given this the result of the schema interrogation drives the value o is_map provided to sdf_explode.

Examples

# NOT RUN {
# first get some nested data
iris2 <- copy_to(sc, iris, name="iris")
iris_nst <- iris2 %>%
  sdf_nest(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width, .key="data") %>%
  group_by(Species) %>%
  summarize(data=collect_list(data))

# then explode it
iris_nst %>% sdf_unnest(data)
# }

Arguments

Details

Examples

Contents