The select function works well for keeping or dropping top-level fields. It does not, however, support access to nested data. sdf_select accepts complex field names such as x.y.z, where z is a field nested within y, which is in turn nested within x. Since R uses "$" to access nested elements and Java/Scala use ".", sdf_select(data, x.y.z) and sdf_select(data, x$y$z) are equivalent.

sdf_select(x, ..., .aliases, .drop_parents = TRUE, .full_name = FALSE)

Arguments

x

An object (usually a spark_tbl) coercible to a Spark DataFrame.

...

Fields to select

.aliases

Character. Optional. If provided, these names are matched positionally with the selected fields given in the ... argument. This is more useful when calling sdf_select from within another function and less natural when calling it directly. It is likely to get you into trouble if you are using dplyr select helpers, since a helper can match a variable number of fields. The alternative for direct calls is to put the alias on the left side of the expression (e.g. sdf_select(df, fld_alias=parent.child.fld)).

.drop_parents

Logical. If TRUE, any field from which nested elements are extracted will be dropped, even if it was included in the selected fields given in the ... argument. This better supports using dplyr field matching helpers like everything() and starts_with().

.full_name

Logical. If TRUE then nested field names that are not named (either using a LHS name=field_name construct or the .aliases argument) will be disambiguated using the parent field name. For example sdf_select(df, x.y) will return a field named x_y. If FALSE then the parent field name is dropped unless it is needed to avoid duplicate names.
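
The naming options described above can be compared side by side. The following is a minimal sketch rather than a definitive reference: it assumes a local Spark installation reachable through spark_connect(master = "local"), that sdf_select and sdf_nest come from the sparklyr.nested package, and that .aliases takes one name per selected field. The alias names (species_name, sepal_w) are made up for illustration.

```r
library(sparklyr)
library(sparklyr.nested)
library(dplyr)

# assumption: a local Spark installation is available
sc <- spark_connect(master = "local")

# nest the two sepal measurements under a single "Sepal" field
iris_nst <- copy_to(sc, iris, name = "iris", overwrite = TRUE) %>%
  sdf_nest(Sepal_Length, Sepal_Width, .key = "Sepal")

# alias given on the left-hand side of the expression
by_lhs <- iris_nst %>%
  sdf_select(Species, sepal_w = Sepal.Sepal_Width)

# aliases matched positionally with the selected fields
# (assumption: one alias is supplied per selected field)
by_aliases <- iris_nst %>%
  sdf_select(Species, Sepal.Sepal_Width,
             .aliases = c("species_name", "sepal_w"))

# unnamed nested field disambiguated with its parent field name
full <- iris_nst %>%
  sdf_select(Species, Sepal.Sepal_Width, .full_name = TRUE)

# capture the resulting field names for inspection
names_lhs     <- colnames(by_lhs)
names_aliases <- colnames(by_aliases)
names_full    <- colnames(full)

spark_disconnect(sc)
```

Putting the alias on the left-hand side keeps each name next to the field it labels, which is why it tends to read better in direct calls than a separate .aliases vector.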

Selection Helpers

dplyr allows the use of selection helpers (e.g., see everything). These helpers only work for top-level fields, however. For now, all nested fields that should be promoted need to be explicitly identified.
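
As a sketch of how a helper and an explicitly named nested field might be combined, the example below selects every top-level field with everything() and promotes one nested field by name. It assumes a local Spark installation and that sdf_select and sdf_nest come from the sparklyr.nested package. The behaviour noted in the comments follows the .drop_parents description above and has not been verified against a particular package version.

```r
library(sparklyr)
library(sparklyr.nested)
library(dplyr)

# assumption: a local Spark installation is available
sc <- spark_connect(master = "local")

# nest the two sepal measurements under a single "Sepal" field
iris_nst <- copy_to(sc, iris, name = "iris", overwrite = TRUE) %>%
  sdf_nest(Sepal_Length, Sepal_Width, .key = "Sepal")

# everything() matches top-level fields only, including the parent "Sepal";
# the nested field to promote is named explicitly.
# With the default .drop_parents = TRUE the parent should be dropped.
promoted <- iris_nst %>%
  sdf_select(everything(), Sepal.Sepal_Width)

# With .drop_parents = FALSE the parent should be kept next to the
# promoted field.
kept <- iris_nst %>%
  sdf_select(everything(), Sepal.Sepal_Width, .drop_parents = FALSE)

# capture the resulting field names for inspection
names_promoted <- colnames(promoted)
names_kept     <- colnames(kept)

spark_disconnect(sc)
```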

Examples

# NOT RUN {
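# connect to Spark (a local master is assumed here; adjust as needed)
library(sparklyr)
library(sparklyr.nested)
sc <- spark_connect(master = "local")

# copy iris to Spark, then nest the two sepal measurements under a
# single field named "Sepal"
iris2 <- copy_to(sc, iris, name="iris")
iris_nst <- iris2 %>%
  sdf_nest(Sepal_Length, Sepal_Width, .key="Sepal")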

# using java-like dot-notation
iris_nst %>%
  sdf_select(Species, Petal_Width, Sepal.Sepal_Width)

# using R-like dollar-sign-notation
iris_nst %>%
  sdf_select(Species, Petal_Width, Sepal$Sepal_Width)

# using dplyr selection helpers
iris_nst %>%
  sdf_select(Species, matches("Petal"), Sepal$Sepal_Width)
# }