column_transformation.R

Note: For better comprehension, this function should be read after understanding the mungebit and mungepiece classes defined in this package.

In general, transformations of a single data.frame into another data.frame fall in three categories:

Column transformations. A one-variable function applied to an atomic vector (a column) that yields a new vector of the same length.
Row transformations. A column transformation with a matrix transposition composed to the left and right of it (i.e., operating on rows instead of columns).
Whole dataframe transformations. Any transformation that cannot be expressed as a column or row transformation: for example, a transposition or multiple imputation.

The column_transformation function is a helper that takes a function with at least one argument–the atomic vector (column) being operated on, with additional arguments acting as further parametrization–and turns that function into a function suitable for use with a mungebit that will operate on an entire data.frame. For example,

stripper <- column_transformation(function(x) {
  gsub("[[:space:]]", "", x)
})
new_dataset <- stripper(dataset, c("column1", "column2"))

The function produced, stripper, accepts a data.frame as its first argument and as its second argument a vector of column names (or several other formats; see the standard_column_format helper).

The argument name is reserved, and if you create a column transformation from a function that includes this argument, its value will be set to the name of the column:

adjoin_name <- column_transformation(function(x, name) {
  paste0(x, "_", name)
})
new_dataset <- adjoin_name(dataset, c("column1", "column2"))
# If column1 and column2 are character vectors, they will now
# have all their values prefixed with `column1_` and `column2_`,
# respectively.

TODO: (RK) List many more examples here and explain column transformation standard column format usage.

              #' Pure column transformations.
#'
#' A mungebit which affects multiple columns identically and independently
#' can be abstracted into a column transformation. This function allows one
#' to specify what happens to an individual column, and the mungebit will be
#' the resulting column transformation applied to an arbitrary combination of
#' columns.
#'
#' @param transformation function. The function's first argument will
#'    receive an atomic vector derived from some \code{data.frame}. If the
#'    \code{transformation} has a \code{name} argument, it will receive
#'    the column name. Any other arguments will be received as the
#'    \code{list(...)} from calling the function produced by 
#'    \code{column_transformation}.
#' @param nonstandard logical. If \code{TRUE}, nonstandard evaluation support
#'    will be provided for the derived function, so it will be possible
#'    to capture the calling expression for each column. By default \code{FALSE}.
#'    Note this will slow the transformation by 0.1ms on each column.
#' @return a function which takes a data.frame and a vector of column
#'    names (or several other formats, see \code{\link{standard_column_format}})
#'    and applies the \code{transformation}.
#' @seealso \code{\link{multi_column_transformation}}, \code{\link{standard_column_format}}
#' @note The function produced by calling \code{column_transformation} will
#'    not run independently. It must be used a train or predict function for
#'    a \code{\link{mungebit}}.
#' @export
#' @examples
#' doubler <- column_transformation(function(x) { 2 * x })
#' # doubles the Sepal.Length column in the iris dataset
#' iris2 <- mungebit$new(doubler)$run(iris, c("Sepal.Length")) 
column_transformation <- function(transformation, nonstandard = FALSE) {

We will construct a function from scratch. Since R is almost LISP under the hood, it is possible to construct a function piece-by-piece.

In general, an R function consists of three components:

Formals. The arguments to the function. You can access these for any function using the formals helper. This is a named list of expressions, with the values being the defaults for each argument.
Body. The body of the function. In R, a block of code can be represented within R itself as a language object. Specifically, using quote can be used to construct the body of a function, as in quote({ a <- 1; print(a); return(a) }). This is a form of reflection.
Environment. The R environment the function has access to when looking for local variables. In other words, its lexical environment as a closure.

For a column_transformation, its derived transformation will be a new function that takes a data argument and a vector of columns, and executes the transformation on each column.

Note we have to inject a few helpers like %||% and list2env_safe, which are defined in the mungebits2 package internals, since these may not be available when a mungebit is serialized and exported out of the active R session (if mungebits2 is not attached to the search path).

                full_transformation <- function(data, columns = colnames(data), ...) { }
  was_debugged <- isdebugged(transformation)
  environment(transformation) <- list2env(list(
    input = NULL, trained = NULL, has_no_null = NULL
  ), parent = environment(transformation) %||% baseenv())

We create a copy of the standard_column_format helper in this package so it can accompany the column_transformation to R sessions even where this package is not present.

                standard_column_format_dup <- standard_column_format
  environment(standard_column_format_dup) <- globalenv()

  environment(full_transformation) <- list2env(
    list(transformation = transformation, nonstandard = isTRUE(nonstandard),
         "%||%" = `%||%`, list2env_safe = list2env_safe,
         named = is.element("name", names(formals(transformation))),
         env = environment(transformation), was_debugged = was_debugged,
         standard_column_format = standard_column_format_dup),
    parent = globalenv()
  )
  body(full_transformation) <- column_transformation_body
  # Add some convenient metadata for overloading `debug` and `print`.
  class(full_transformation) <- c("column_transformation", "transformation", "function")
  full_transformation
}

As promised, we specify the body for the derived transformation generated by calling column_transformation. Since this will not change, we can store it in the package namespace.

              column_transformation_body <- quote({
  # Recall that `data` and `columns` are formals.

In this function, optimization matters. Column transformations will run millions of times over various datasets, so even microsecond shaved off is valuable. Throughout, note the code may be slightly improved for readability but at a speed cost. When developing new packages, one should follow the old adage to first make it functional, then make it beautiful, then make it fast. In this case, we prefer speed over beauty!

If we are supporting non-standard evaluation, we precompute the expression used, or we will lose it upon first reference of data.

                if (nonstandard) {
    data_expr <- substitute(data)
  }

  if (!isTRUE(trained)) {

The dataset passed in may look different depending on whether we are running the mungebit in train or predict mode. If columns are 1:4 and the dataset is shuffled, we will be referencing the wrong columns after running the mungebit the second time in predict mode! To avoid this problem, keeping in mind that R data.frames have unique column names by design, we store the character vector of column names in the mungebit input so that we know exactly which columns this transformation should apply to in predict mode.

If you require operating totally different column names during training versus prediction, it is by definition not the same mathematical transformation, and thus a mungebit is likely not the appropriate tool for your problem.

                  input$columns <- intersect(colnames(data), standard_column_format(columns, data))
  }

  indices <- match(input$columns, names(data))

  # An optimization trick to avoid the slow `[.data.frame` operator.
  old_class   <- class(data)

Try to run print(`[.data.frame`) from your R console. Notice how much code is run to perform data.frame subsetting! The same is true for print(`[[<-.data.frame`), data.frame element assignment. Since we use this operation below, we want to skip over the typical checks for the sake of performance and use straight-up list subsetting (which will use underlying C code).

                class(data) <- "list" 

  env$trained <- trained

If we wish to pass along the expression the transformation was called with so we can use substitute correctly, the only effective way to perform this capture is to use alist and retain the parent.frame() during do.call below.

                if (nonstandard) {
    arguments  <- c(list(NULL), eval(substitute(alist(...))))
    eval_frame <- parent.frame()
  }

If the mungebit has not been trained yet (recall that trained is injected during the mungebit$train and mungebit$predict functions), we create a vector of environments, one for each column the transformation is run on, so that each respective run has access to an input environment to store computations that will be required during predict (e.g., storing the mean of the column during imputation).

                if (!isTRUE(trained)) {
    input$sub_inputs <- structure(replicate(
      length(input$columns), new.env(parent = emptyenv()), simplify = FALSE
    ), .Names = input$columns)
  }

Dataframe subset assignment ([<-.data.frame) does not behave in the same manner as list assignment ([<-). Since we stripped the data.frame of its class earlier, the next line will perform list assignment. This has advantageous speedups, but in particular if we drop some of the columns by including NULL in the output of the transformation, this will corrupt the data.frame with actual NULL values instead of dropping columns. We work around this with performance considerations by recording whether any of the values in the inner loop are NULL.

                env$has_no_null <- TRUE

  data[indices] <- lapply(seq_along(indices), function(j, ...) {

Since indices match the column names to iterate over on the nose, sub_inputs[[j]] will be the correct environment to use for the jth column. Here, .subset2 is a trick to speed things up a tiny bit by calling the C function that does the actual subsetting.

                  env$input <- .subset2(.subset2(input, "sub_inputs"), j)

Assigning a function's environment clears its internal debug flag, so if the function was previously being debugged we retain this property.

                  if (was_debugged) {
      debug(transformation)
    }

And the non-standard evaluation trick! Imagine a user had called a column transformation with the code below.

ct <- column_transformation(nonstandard = TRUE, function(x) { y <- substitute(x) })
some_data <- data.frame(first = 1:2, second = c("a", "b"))
mungebit$new(ct)$run(some_data)

Then substitute(x) would be precisely the expression some_data[["first"]] during the first call and some_data[["second"]] during the second call (in other words, it is equivalent to y <- quote(some_data[["first"]]) in the first call, etc.).

                  # Support non-standard evaluation at a slight speed cost.
    if (nonstandard) {
      if (named) {

Recall that if the transformation has a formal argument called “name”, we must pass along the column name.

                      arguments$name <- .subset2(names(data), .subset2(indices, j))
      }

We replace the first argument with the column to apply the transformation to.

                    arguments[[1L]] <- bquote(.(data_expr)[[.(
        if (named) arguments$name else .subset2(names(data), .subset2(indices, j))
      )]])
      result <- .Internal(do.call(transformation, arguments, eval_frame))
    } else {

If NSE should not be carried over we do not bother with the magic and simply send the function the value.

                    if (named) {
        result <- transformation(.subset2(data, .subset2(indices, j)), ...,
                       name = .subset2(names(data), .subset2(indices, j)))
      } else {
        result <- transformation(.subset2(data, .subset2(indices, j)), ...)
      }
    }

Using a has_no_null flag is slightly faster than has_null, since we can save on a call to ! in the condition below.

                  if (env$has_no_null && is.null(result)) {
      env$has_no_null <- FALSE
    }

    result
  }, ...)

After training, we lock the input environments so that the user cannot modify them during predict.

                if (!isTRUE(trained)) {
    lapply(input$sub_inputs, lockEnvironment, bindings = TRUE)
  }

Finally, if some of the columns were dropped, explicitly remove them from the dataframe using [[<- list assignment. This ensures that we do not drop any attributes and is faster than subsetting to non-NULL columns.

                if (!env$has_no_null) {
    count <- 0
    for (i in which(vapply(data, is.null, logical(1)))) {

As we're dropping columns, we need to “shift” the indices.

                    data[[i - count]] <- NULL
      count <- count + 1
    }
  }

Finally, we reset the class to data.frame after stripping it for a speed optimization. If you study the code of (`[.data.frame`), you will see this is exactly the same trick the R base library uses to delegate to the list subsetting after the data.frame-specific checks have been completed.

                class(data) <- old_class
  data
})

#' @export
print.column_transformation <- function(x, ...) {
  # `print_transformation` parameters include `indent = 0L, full = FALSE`.
  print_transformation(x, ..., byline = "Column transformation")
}

#' @method all.equal transformation
#' @export
all.equal.transformation <- function(target, current, ...) {
  identical(parent.env(environment(target))$transformation,
            parent.env(environment(current))$transformation)
}

debug.R

Debugging the train and predict function of a mungebit should be transparent to the user. Say we have a mungebit called value_replacer. By calling debug(value_replacer), we should be able to simultaneously set debug hooks on both the train_function and predict_function of the mungebit. Calling undebug(value_replacer) will remove the hooks.

R has a tremendous array of debugging tools. You should familiarize yourself with them to make your life much simpler. A great resource is chapter 8 of The R Inferno.

              #' Generic debugging.
#'
#' @inheritParams base::debug
#' @seealso \code{\link[base]{debug}}
#' @export
debug <- function(fun, text = "", condition = NULL) {

The standard S3 generic.

                UseMethod("debug")
}

By default, debugging should preserve the behavior from the base package.

              #' @export
debug.default <- function(fun, text = "", condition = NULL) {
  base::debug(fun, text = "", condition = NULL)
}

#' @export 
debug.mungebit <- function(fun, text = "", condition = NULL) {

To debug a mungebit, we loop over the train and predict functions and set their internal debugging flag. The if statement is necessary in case either are NULL (e.g., there is no train or predict step).

                for (fn in list(fun$.train_function, fun$.predict_function)) {
    if (is.function(fn)) {
      debug(fn, text, condition)
    }
  }
}

#' @export 
debug.mungepiece <- function(fun, text = "", condition = NULL) {

To debug a mungepiece, we delegate all the work to the mungebit.

                debug(fun$mungebit(), text, condition)
}

#' Generic removal of debugging.
#'
#' @inheritParams base::undebug
#' @seealso \code{\link[base]{undebug}}
#' @export
undebug <- function(fun) {
  UseMethod("undebug")
}

#' @export
undebug.default <- function(fun) {
  base::undebug(fun)
}

#' @export 
undebug.mungebit <- function(fun) {

To undebug a mungebit, we loop over the train and predict functions and unset their internal debugging flag. The if statement is necessary in case either are NULL (e.g., there is no train or predict step), and to avoid throwing a warning if the function isn't already being debugged.

                for (fn in list(fun$.train_function, fun$.predict_function)) {
    if (is.function(fn) && isdebugged(fn)) {
      undebug(fn)
    }
  }
}

#' @export 
undebug.mungepiece <- function(fun) {

To undebug a mungepiece, we delegate all the work to the mungebit.

                undebug(fun$mungebit())
}

#' @export
debug.transformation <- function(fun, text, condition) {
  debug(get("transformation", envir = environment(fun)))
}

#' @export
undebug.transformation <- function(fun) {
  transformation <- get("transformation", envir = environment(fun))
  if (isdebugged(transformation)) {
    undebug(transformation)
  }
}

messages.R

A dictionary of messages used by the package. We separate these into its own file to avoid cluttering the R code with a multitude of strings.

              messages <- list(
  parse_mungepiece_dual_error = c(
    "When using a fully named list to construct a mungepiece, ",
    "it must consist solely of ", sQuote("train"), " and ",
    sQuote("predict"), " elements giving a list with the ",
    "respective train or predict function and any additional ",

Note the use of {{{error}}} in conjunction with whisker below.

                  "train or predict arguments. ", crayon::red("{{{error}}}."),
    " For example,\n\n",
    crayon::green(paste("list(train = list(discretize, cuts = 5),",
                        "predict = list(restore_levels))")),
    "\n\nwill specify to use the ", sQuote("discretize"),
    " function in training with the argument ",
    sQuote("cuts"), " equal to 5, while the ",
    sQuote("restore_levels"), " function will be used without ",
    "arguments during prediction.\n"
  ),

  parse_mungepiece_dual_error_type = c(
    "When using the explicit train/predict syntax to construct a mungepiece ",
    "you must pass a list on both sides:\n\n",
    crayon::green(paste("list(train = list(discretize, cuts = 5),",
                        "predict = list(restore_levels))")),
    "\n\nInstead, I got a ", crayon::red("{{{class}}}"), " on the ",
    "{{{type}}} side."
  ),

  parse_mungepiece_dual_error_unnamed = c(
    "When using the explicit train/predict syntax to construct a mungepiece ",
    "you must have at least one unnamed element on the {{{type}}} side ",
    "which will be used for the {{{type}}} function."
  ),

  parse_mungepiece_dual_error_nonfunction = c(
    "When using the explicit train/predict syntax to construct a mungepiece, ",
    "the first unnamed element on the {{{type}}} side must be a function. ",
    "Instead, I got a ", crayon::red("{{{class}}}"), "."
  ),

  parse_mungepiece_hybrid_error = c(
    "When using a non-function as the first argument when constructing a ",
    "mungepiece, the only accepted format is a pair of two functions, ",
    "with either one but not both NULL.\n\n",
    crayon::green(paste("list(list(discretize, restore_levels), variables)")),
    "\n\nThe first function will be used for training and the second for ",
    "prediction. Please double check your syntax."
  ),

  munge_type_error = c(
    "The second parameter to ", sQuote("munge"),
    " must be a list; instead I got a ",
    crayon::red("{{{class}}}")
  ),

  munge_lack_of_mungepieces_attribute = c(
    "If the second parameter to ", sQuote("munge"),
    " is a data.frame, it must have a ", sQuote("mungepieces"),
    " attribute (usually created automatically during a previous ",
    "run through ", sQuote("munge"), ")"
  )
)

Cleanse the message a little after fetching it from the messages list.

              msg <- function(name) {
  stopifnot(name %in% names(messages))

The gsub will squish multiple spaces into a single space, while the paste(collapse = "", ...) usage will ensure we can take vectors of characters in the above messages list.

                paste(collapse = "", gsub("[ ]+", " ", messages[[name]]))
}

We use the whisker templating engine to inject any additional values into the message string. For example,

m("parse_mungepiece_dual_error", error = "Bloop")

would return the appropriate error with the string “Bloop” injected in the appropriate place.

              m <- function(name, ...) {

Note the use of do.call, a very handy R metaprogramming tool when we do not know exactly which arguments we will pass.

                do.call(whisker::whisker.render, list(msg(name), list(...)))
}

multi_column_transformation.R

Note: For better comprehension, this function should be read after understanding the mungebit and mungepiece classes and the column_transformation function defined in this package.

Recall that in general, transformations of a single data.frame into another data.frame fall in three categories:

Column transformations. A one-variable function applied to an atomic vector (a column) that yields a new vector of the same length.
Row transformations. A column transformation with a matrix transposition composed to the left and right of it (i.e., operating on rows instead of columns).
Whole dataframe transformations. Any transformation that cannot be expressed as a column or row transformation: for example, a transposition or multiple imputation.

The third class can further be broken down into:

Multi-column transformations. Transformations that take a subset of the columns in the dataset and map to another subset of columns (some possibly new).
Exceptional transformations. Functions that legitimately require the entire data set as input and cannot be broken down into any of the previous transformations. The nice thing here is that this class includes transposition and similar transformations which are rarely “natural” operations to perform for data wrangling.

The multi_column_transformation function is a helper that takes a function with a fixed number of arguments, with each argument corresponding to a column in the dataset, and produces another set of columns. This can be used, for example, to compute feature engineering that unifies multiple columns:

divider <- multi_column_transformation(function(x, y) {
  x / y
})
new_dataset <- divider(dataset, c("column1", "column2"), "column_ratio")
# A new variable `column_ratio` is created that is the ratio of 
# `column1` and `column2`.

Note the above example can be written more succinctly as multi_column_transformation(`/`), that is, one can pass any function to the helper, including primitive R functions.

The function produced, divider, accepts a data.frame as its first argument and as its second and third argument a vector of column names of inputs and outputs, respectively (or several other formats; see the standard_column_format helper).

If there is more than one output column given, the function should produce an ordered or named list giving the output column values.

If you think about it, a multi-column transformation with a single input and output column that equal each other is actually just a column_transformation, so this function is strictly more general.

              #' Multi column transformations.
#'
#' A mungebit which takes a fixed group of columns and produces a new
#' group of columns (or a single new column) can be abstracted into a 
#' multi-column transformation. This functions allows one to specify what
#' happens to a fixed list of columns, and the mungebit will be the
#' resulting multi-column transformation applied to an arbitrary combination
#' of columns. An arity-1 multi-column transformation with a single output
#' column equal to its original input column is simply a
#' \code{\link{column_transformation}}.
#' can be abstracted into a column transformation. This function allows one
#' to specify what happens to an individual column, and the mungebit will be
#' the resulting column transformation applied to an arbitrary combination of
#' columns.
#'
#' @param transformation function. The function's first argument will
#'    receive atomic vectors derived from some \code{data.frame}.
#'    Any other arguments will be received as the
#'    \code{list(...)} from calling the function produced by 
#'    \code{multi_column_transformation}.
#' @param nonstandard logical. If \code{TRUE}, nonstandard evaluation support
#'    will be provided for the derived function, so it will be possible
#'    to capture the calling expression for each column. By default \code{FALSE}.
#'    Note this will slow the transformation by 0.1ms on each column.
#' @return a function which takes a data.frame and a vector of column
#'    names (or several other formats, see \code{\link{standard_column_format}})
#'    and applies the \code{transformation}.
#' @seealso \code{\link{column_transformation}}, \code{\link{standard_column_format}}
#' @note The function produced by calling \code{multi_column_transformation} will
#'    not run independently. It must be used a train or predict function for
#'    a \code{\link{mungebit}}.
#' @export
#' @examples
#' divider <- multi_column_transformation(function(x, y) { x / y })
#' # Determines the ratio of Sepal.Length and Sepal.Width in the iris dataset.
#' iris2 <- mungebit$new(divider)$run(iris, c("Sepal.Length", "Sepal.Width"), "Sepal.Ratio") 
multi_column_transformation <- function(transformation, nonstandard = FALSE) {
  full_transformation <- function(data, input_columns, output_columns, ...) { }
  was_debugged <- isdebugged(transformation)
  environment(transformation) <- list2env(list(
    trained = NULL
  ), parent = environment(transformation) %||% baseenv())

  environment(full_transformation) <- list2env(
    list(transformation = transformation, nonstandard = isTRUE(nonstandard),
         "%||%" = `%||%`, is.simple_character_vector = is.simple_character_vector,
         named = is.element("names", names(formals(transformation))),
         env = environment(transformation), was_debugged = was_debugged),
    parent = globalenv()
  )
  body(full_transformation) <- multi_column_transformation_body
  # Add some convenient metadata for overloading `debug` and `print`.
  class(full_transformation) <- c("multi_column_transformation", "transformation", "function")
  full_transformation
}

multi_column_transformation_body <- quote({
  # Recall that `data`, `input_columns`, and `output_columns` are
  # the arguments to this function.

In this function, optimization matters. Column transformations will run millions of times over various datasets, so even microsecond shaved off is valuable. Throughout, note the code may be slightly improved for readability but at a speed cost. When developing new packages, one should follow the old adage to first make it functional, then make it beautiful, then make it fast. In this case, we prefer speed over beauty!

If we are supporting non-standard evaluation, we precompute the expression used, or we will lose it upon first reference of data.

                if (nonstandard) {
    data_expr <- substitute(data)

Unfortunately, we forcibly have to disable nonstandard evaluation support if a call was passed in instead of an atomic symbol, since then we could be re-computing side effectful computations!

                  if (!is.name(data_expr)) nonstandard <- FALSE
  }

  if (!isTRUE(trained)) {
    if (!is.simple_character_vector(input_columns)) {
      stop("The ", sQuote("input_columns"), " for a ",
           sQuote("multi_column_transformation"), " must be given by a ",
           "non-zero character vector of non-NA, non-blank unique strings.")
    }
    input$columns <- input_columns
  }

  if (!is.simple_character_vector(output_columns)) {
    stop("The ", sQuote("output_columns"), " for a ",
         sQuote("multi_column_transformation"), " must be given by a ",
         "non-zero character vector of non-NA, non-blank unique strings.")
  }

If the data.frame has duplicate column names, a rare but possible corruption, the for loop below that applies the transformations will malfunction, so we should error.

                indices <- match(input$columns, names(data))

Try to run print(`[.data.frame`) from your R console. Notice how much code is run to perform data.frame subsetting! The same is true for print(`[[<-.data.frame`), data.frame element assignment. Since we use this operation below, we want to skip over the typical checks for the sake of performance and use straight-up list subsetting (which will use underlying C code).

                # An optimization trick to avoid the slow `[.data.frame` operator.
  old_class <- class(data)
  class(data) <- "list" 

  env$trained <- trained

  if (nonstandard) {

We reserve the first few arguments for the input columns.

                  arguments <- c(vector("list", length(input$columns)),

This standard trick allows us to capture the unevaluated expressions in the ... parameter.

                                 eval(substitute(alist(...))))
  } else {
    arguments <- c(vector("list", length(input$columns)), list(...))
  }
  eval_frame <- parent.frame()

  env$input <- input

Assigning a function's environment clears its internal debug flag, so if the function was previously being debugged we retain this property.

                if (was_debugged) {
    debug(transformation)
  }

  if (named) {
    arguments$names <- input$columns
  }
  arguments[seq_along(input$columns)] <- data[indices]

  if (length(output_columns) == 1) {
    data[[output_columns]] <- .Internal(do.call(transformation, arguments, eval_frame))
  } else {
    data[output_columns] <- .Internal(do.call(transformation, arguments, eval_frame))
  }

  if (!isTRUE(trained)) {
    lockEnvironment(env$input, bindings = TRUE) 
  }

Finally, we reset the class to data.frame after stripping it for a speed optimization. If you study the code of (`[.data.frame`), you will see this is exactly the same trick the R base library uses to delegate to the list subsetting after the data.frame-specific checks have been completed.

                class(data) <- old_class
  data
})

#' @export
print.multi_column_transformation <- function(x, ...) {
  # `print_transformation` parameters include `indent = 0L, full = FALSE`.
  print_transformation(x, ..., byline = "Multi column transformation")
}

munge.R

Usually, we wish to perform more than one cleaning operation on a dataset before it is ready to be fed to a machine learning classifier.

The munge function defined in this file allows for a quick way to apply a list of mungepieces to a dataframe.

munged_data <- munge(raw_data, list(
  "Drop useless vars" = list(list(drop_vars, vector_of_variables),
                             list(drop_vars, c(vector_variables, "dep_var"))),
  "Impute variables"  = list(imputer, imputed_vars),
  "Discretize vars"   = list(list(discretize, restore_levels), discretized_vars)
))

Translated in English, we are saying:

Drop a static list of useless variables from the data set. When the model is trained, drop the dependent variable as well since we will no longer need it.
Impute the variables in the static list of imputed_vars. When the model is trained, the imputer will have some logic to restore the means obtained during training of the mungepiece (assuming we are using mean imputation).
Discretize the static list of variables in the discretized_vars character vector. After model training, when new data points come in, the original training set is no longer available. The discretize method stored the necessary cuts for each variable in the mungebit's input, which the restore_levels function uses to bin the numerical features in the list of discretized_vars into factor (categorical) variables.

Instead of building the mungepieces and bits by hand by calling the mungepiece$new and mungebit$new constructors (which is another alternative), we use this convenient format to construct and apply the mungepieces on-the-fly.

The end result is the munged_data, which is the final cleaned and ready-to-model data, together with an attribute “mungepieces” which stores the list of trained mungepieces. In other words, the munged data remembers how it was obtained from the raw data. The list of trained mungepieces, informally called a munge procedure, can be used to replicate the munging in a real-time streaming production system without having to remember the full training set:

munged_single_row <- munge(single_row, attr(munged_data, "mungepieces"))

# A syntactic shortcut enabled by the munge helper. It knows to look for
# the mungepieces attribute.
munged_single_row <- munge(single_row, munged_data)

We can feed single rows of data (i.e., streaming records coming through in a production system) to the trained munge procedure and it will correctly replicate the same munging it performed during model training.

              #' Apply a list of mungepieces to data.
#'
#' The \code{munge} function allows a convenient format for applying a
#' sequence of \code{\link{mungepiece}} objects to a dataset.
#'
#' The \code{munge} helper accepts a raw, pre-munged (pre-cleaned)
#' dataset and a list of lists. Each sublist represents the code
#' and hyperparameters necessary to clean the dataset. For example,
#' the first row could consist of an imputation function and a list
#' of variables to apply the imputation to. It is important to
#' understand what a \code{\link{mungebit}} and \code{\link{mungepiece}}
#' does before using the \code{munge} helper, as it constructs these
#' objects on-the-fly for its operation.
#'
#' The end result of calling \code{munge} is a fully cleaned data set
#' (i.e., one to whom all the mungepieces have been applied and trained)
#' adjoined with a \code{"mungepieces"} attribute: the list of trained
#' mungepieces.
#'
#' For each sublist in the list of pre-mungepieces passed to \code{munge},
#' the following format is available. See the examples for a more hands-on
#' example.
#'
#' \enumerate{
#'   \item{\code{list(train_fn, ...)}}{ -- If the first element of \code{args} is
#'     a function followed by other arguments, the constructed mungepiece
#'     will use the \code{train_fn} as both the \emph{train and predict}
#'     function for the mungebit, and \code{list(...)} (that is, the remaining
#'     elements in the list) will be used as both the train and predict
#'     arguments in the mungepiece. In other words, using this format
#'     specifies you would like \emph{exactly the same behavior in
#'     training as in prediction}. This is appropriate for mungebits
#'     that operate in place and do not need information obtained
#'     from the training set, such as simple value replacement or
#'     column removal.
#'   }
#'   \item{\code{list(list(train_fn, predict_fn), ...)}}{
#'     -- If \code{args} consists of a two-element pair in its first
#'     element, it must be a pair of either \code{NULL}s or functions,
#'     with not both elements \code{NULL}. If the \code{train_fn}
#'     or \code{predict_fn}, resp., is \code{NULL}, this will signify to have
#'     \emph{no effect} during training or prediction, resp.
#'
#'     The remaining arguments, that is \code{list(...)}, will be used
#'     as both the training and prediction arguments.
#'
#'     This structure is ideal if the behavior during training and prediction
#'     has an identical parametrization but very different implementation,
#'     such as imputation, so you can pass two different functions.
#'
#'     It is also useful if you wish to have no effect during prediction,
#'     such as removing faulty rows during training, or no effect during
#'     training, such as making a few transformations that are only
#'     necessary on raw production data rather than the training data.
#'   }
#'   \item{\code{list(train = list(train_fn, ...), predict = list(predict_fn, ...))}}{
#'     If \code{args} consists of a list consisting of exactly two named
#'     elements with names "train" and "predict", then the first format will be
#'     used for the respective fields. In other words, a mungepiece will
#'     be constructed consisting of a mungebit with \code{train_fn} as the
#'     training function, \code{predict_fn} as the predict fuction, and
#'     the mungepiece train arguments will be the train list of additional
#'     arguments \code{list(...)}, and similarly the predict arguments will be
#'     the predict list of additional arguments \code{list(...)}.
#'
#'     Note \code{train_fn} and \code{predict_fn} must \emph{both} be functions
#'     and not \code{NULL}, since then we could simply use the second format
#'     described above.
#'
#'     This format is ideal when the parametrization differs during training and
#'     prediction. In this case, \code{train_fn} usually should be the same
#'     as \code{predict_fn}, but the additional arguments in each list can
#'     be used to identify the parametrized discrepancies. For example, to
#'     sanitize a dataset one may wish to drop unnecessary variables. During
#'     training, this excludes the dependent variable, but during prediction
#'     we may wish to drop the dependent as well.
#'
#'     This format can also be used to perform totally different behavior on
#'     the dataset during training and prediction (different functions and
#'     parameters), but mungebits should by definition achieve the same
#'     operation during training and prediction, so this use case is rare
#'     and should be handled carefully.
#'   }
#' }
#'
#' @export
#' @param data data.frame. Raw, uncleaned data.
#' @param mungelist list. A list of lists which will be translated to a
#'    list of mungepieces. It is also possible to pass a list of mungepieces,
#'    but often the special syntax is more convenient. See the examples section.
#' @param stagerunner logical or list. Either \code{TRUE} or \code{FALSE}, by default
#'    the latter. If \code{TRUE}, a \code{\link[stagerunner]{stagerunner}}
#'    object will be returned whose context will contain a key \code{data}
#'    after being ran, namely the munged data set (with a "mungepieces"
#'    attribute).
#'
#'    One can also provide a list with a \code{remember} parameter,
#'    which will be used to construct a stagerunner with the same value
#'    for its \code{remember} parameter.
#' @param list logical. Whether or not to return the list of mungepieces
#'    instead of executing them on the \code{data}. By default \code{FALSE}.
#' @param parse logical. Whether or not to pre-parse the \code{mungelist}
#'    using \code{\link{parse_mungepiece}}. Note that if this is \code{TRUE},
#'    any trained mungepieces will be duplicated and marked as untrained.
#'    By default, \code{TRUE}.
#' @return A cleaned \code{data.frame}, the result of applying each
#'    \code{\link{mungepiece}} constructed from the \code{mungelist}.
#' @seealso \code{\link{mungebit}}, \code{\link{mungepiece}},
#'    \code{\link{parse_mungepiece}}
#' @examples
#' # First, we show off the various formats that the parse_mungepiece
#' # helper accepts. For this exercise, we can use dummy train and
#' # predict functions and arguments.
#' train_fn   <- predict_fn   <- function(x, ...) { x }
#' train_arg1 <- predict_arg1 <- dual_arg1 <- TRUE # Can be any parameter value.
#'
#' # The typical way to construct mungepieces would be using the constructor.
#' piece <- mungepiece$new(
#'   mungebit$new(train_fn, predict_fn),
#'   list(train_arg1), list(predict_arg1)
#' )
#'
#' # This is tedious and can be simplified with the munge syntax, which
#' # allows one to specify a nested list that defines all the mungebits
#' # and enclosing mungepieces at once.
#'
#' raw_data <- iris
#' munged_data <- munge(raw_data, list(
#'   # If the train function with train args is the same as the predict function
#'   # with predict args, we use this syntax. The first element should be
#'   # the funtion we use for both training and prediction. The remaining
#'   # arguments will be used as both the `train_args` and `predict_args`
#'   # for the resulting mungepiece.
#'   "Same train and predict" = list(train_fn, train_arg1, train_arg2 = "blah"),
#'
#'   # If the train and predict arguments to the mungepiece match, but we
#'   # wish to use a different train versus predict function for the mungebit.
#'   "Different functions, same args" =
#'     list(list(train_fn, predict_fn), dual_arg1, dual_arg2 = "blah"),
#'
#'   # If we wish to only run this mungepiece during training.
#'   "Only run in train" = list(list(train_fn, NULL), train_arg1, train_arg2 = "blah"),
#'
#'   # If we wish to only run this mungepiece during prediction.
#'   "Only run in predict" = list(list(NULL, predict_fn), predict_arg1, predict_arg2 = "blah"),
#'
#'   # If we wish to run different arguments but the same function during
#'   # training versus prediction.
#'   "Totally different train and predict args, but same functions" =
#'      list(train = list(train_fn, train_arg1),
#'           predict = list(train_fn, predict_arg1)),
#'
#'   # If we wish to run different arguments with different functions during
#'   # training versus prediction.
#'   "Totally different train and predict function and args" =
#'     list(train = list(train_fn, train_arg1),
#'                       predict = list(predict_fn, predict_arg1))
#' )) # End the call to munge()
#'
#' # This is an abstract example that was purely meant to illustrate syntax
#' # The munged_data variable will have the transformed data set along
#' # with a "mungepieces" attribute recording a list of trained mungepieces
#' # derived from the above syntax.
#'
#' # A slightly more real-life example.
#' \dontrun{
#' munged_data <- munge(raw_data, list(
#'   "Drop useless vars" = list(list(drop_vars, vector_of_variables),
#'                              list(drop_vars, c(vector_variables, "dep_var"))),
#'   "Impute variables"  = list(imputer, imputed_vars),
#'   "Discretize vars"   = list(list(discretize, restore_levels), discretized_vars)
#' ))
#'
#' # Here, we have requested to munge the raw_data by dropping useless variables,
#' # including the dependent variable dep_var after model training,
#' # imputing a static list of imputed_vars, discretizing a static list
#' # of discretized_vars being careful to use separate logic when merely
#' # using the computed discretization cuts to bin the numeric features into
#' # categorical features. The end result is a munged_data set with an
#' # attribute "mungepieces" that holds the list of mungepieces used for
#' # munging the data, and can be used to perform the exact same set of
#' # operations on a single row dataset coming through in a real-time production
#' # system.
#' munged_single_row_of_data <- munge(single_row_raw_data, munged_data)
#' }
#' # The munge function uses the attached "mungepieces" attribute, a list of
#' # trained mungepieces.
munge <- function(data, mungelist, stagerunner = FALSE, list = FALSE, parse = TRUE) {
  stopifnot(is.data.frame(data) ||
    (is.environment(data) &&
     (!identical(stagerunner, FALSE) || any(ls(data) == "data"))))

  if (length(mungelist) == 0L) {
    return(data)
  }

  if (!is.list(mungelist)) {

This error is grabbed from the messages.R file.

                  stop(m("munge_type_error", class = class(mungelist)[1L]))
  }

We allow munging in prediction mode using an existing trained data.frame. For example, if we had ran iris2 <- munge(iris, some_list_of_mungepieces), an attributes would be created on iris2 with the name "mungepieces". The munge function is capable of re-using this attribute, a list of trained mungepieces, and apply it to a new dataset. In English, it is asking to “perform the exact same munging that was done on iris2”.

                if (is.data.frame(mungelist)) {
    if (!is.element("mungepieces", names(attributes(mungelist)))) {
      stop(m("munge_lack_of_mungepieces_attribute"))
    }
    Recall(data, attr(mungelist, "mungepieces"), stagerunner = stagerunner,
           list = list, parse = FALSE)
  } else if (methods::is(mungelist, "tundraContainer")) {
    # An optional interaction effect with the tundra package.
    Recall(data, mungelist$munge_procedure, stagerunner = stagerunner,
           list = list, parse = FALSE)
  } else {
    munge_(data, mungelist, stagerunner, list, parse)
  }
}

# Assume proper arguments.
munge_ <- function(data, mungelist, stagerunner, list_output, parse) {
  if (isTRUE(parse)) {

It is possible to have nested stagerunners of mungeprocedures, but this is a more advanced feature. We skip stagerunners when parsing the munge procedure using parse_mungepiece.

                  runners <- vapply(mungelist, is, logical(1), "stageRunner")
    # TODO: (RK) Intercept errors and inject with name for helpfulness!
    mungelist[!runners] <- lapply(mungelist[!runners], parse_mungepiece)
  }

  if (isTRUE(list_output)) {
    return(mungelist)
  }

We will construct a stagerunner and execute it on a context (environment) containing just a data key.

The stages of the stagerunner will be one for each mungepiece, defined by the mungepiece_stages helper.

                stages <- mungepiece_stages(mungelist)
  if (is.environment(data)) {
    context <- data
  } else {
    context <- list2env(list(data = data), parent = emptyenv())
  }

  remember <- is.list(stagerunner) && isTRUE(stagerunner$remember)
  runner <- stagerunner::stageRunner$new(context, stages, remember = remember)

  if (identical(stagerunner, FALSE)) {
    runner$run()
    context$data
  } else {
    runner
  }
}

mungepiece_stages <- function(mungelist, contiguous = FALSE) {

As before, remember that a munge procedure can consist of mungepieces but also stagerunners earlier produced by munge. If we have a mungelist that looks like list(mungepiece, runner, mungepiece, mungepiece, runner, ...) then each contiguous block of mungepieces needs to be transformed into a sequence of stages. We will see later why this is necessary: we have to append the mungepieces to the data.frame after the last mungepiece has been executed.

                if (!isTRUE(contiguous)) {
    singles <- which(vapply(mungelist, Negate(is), logical(1), "stageRunner"))
    groups  <- cumsum(diff(c(singles[1L] - 1, singles)) != 1)
    split(mungelist[singles], groups) <- lapply(
      split(mungelist[singles], groups), mungepiece_stages, contiguous = TRUE
    )
    mungelist
  } else {
    mungepiece_stages_contiguous(mungelist)
  }
}

mungepiece_stages_contiguous <- function(mungelist) {
  shared_context <- list2env(parent = globalenv(),
    list(size = length(mungelist), mungepieces = mungelist,
         newpieces = list())
  )

  mungepiece_names <- names(mungelist) %||% character(length(mungelist))
  lapply(Map(list, seq_along(mungelist), mungepiece_names), mungepiece_stage, shared_context)
}

mungepiece_stage <- function(mungepiece_index_name_pair, context) {

  stage <- function(env) { }

For mungebits2 objects (as opposed to mungebits) we have to use different logic to allow backwards-compatible mixing with legacy mungebits. We set the body of the stage accordingly by checking whether the mungebit2 is an R6 object.

                if (methods::is(context$mungepieces[[mungepiece_index_name_pair[[1]]]], "R6")) {
    body(stage) <- mungepiece_stage_body()
  } else {
    body(stage) <- legacy_mungepiece_stage_body()
  }

  environment(stage) <- list2env(parent = context,
    list(mungepiece_index = mungepiece_index_name_pair[[1]],
         mungepiece_name  = mungepiece_index_name_pair[[2]])
  )
  stage
}

mungepiece_stage_body <- function() {
  quote({

Each mungepiece will correspond to one stage in a stagerunner. We will construct a new mungepiece on-the-fly to avoid sharing state with other mungepiece objects, run that mungepiece, and then modify the newpieces to store the trained mungepiece.

                  # Make a fresh copy to avoid shared stage problems.
    piece <- mungepieces[[mungepiece_index]]$duplicate()
    piece$run(env)
    newpieces[[mungepiece_index]] <<- piece
    if (isTRUE(nzchar(mungepiece_name) & !is.na(mungepiece_name))) {
      names(newpieces)[mungepiece_index] <<- mungepiece_name
    }

When we are out of mungepieces, that is, when the current index equals the number of mungepieces in the actively processed contiguous chain of mungepieces, we append the mungepieces to the dataframe's "mungepieces" attribute. This is what allows us to later replay the munging actions on new data by passing the dataframe as the second argument to the munge function.

                  if (mungepiece_index == size) {
      attr(env$data, "mungepieces") <-
        append(attr(env$data, "mungepieces"), newpieces)
    }
  })
}

To achieve backwards-compatibility with mungebits), we use different parsing logic for legacy mungepieces.

              legacy_mungepiece_stage_body <- function() {
  quote({

This code is taken directly from legacy mungebits.

                  if (!requireNamespace("mungebits", quietly = TRUE)) {
      stop("To use legacy mungebits with mungebits2, make sure you have ",
           "the mungebits package installed.")
    }
    reference_piece <- mungepieces[[mungepiece_index]]
    bit <- mungebits:::mungebit$new(
      reference_piece$bit$train_function, reference_piece$bit$predict_function,
      enforce_train = reference_piece$bit$enforce_train
    )
    bit$trained <- reference_piece$bit$trained
    bit$inputs  <- reference_piece$bit$inputs

    piece <- mungebits:::mungepiece$new(
      bit, reference_piece$train_args, reference_piece$predict_args
    )

    newpieces[[mungepiece_index]] <<- piece

    piece$run(env)

    if (mungepiece_index == size) {
      names(newpieces) <<- names(mungepieces)
      attr(env$data, "mungepieces") <-
        append(attr(env$data, "mungepieces"), newpieces)
    }
  })
}

mungebit-initialize.R

              #' Constructor for mungebit class.
#'
#' Mungebits are atomic data transformations of a data.frame that,
#' loosely speaking, aim to modify "one thing" about a variable or
#' collection of variables. This is pretty loosely defined, but examples
#' include dropping variables, mapping values, discretization, etc.
#'
#' @param train_function function. This specifies the behavior to perform
#'    on the dataset when preparing for model training. A value of NULL
#'    specifies that there should be no training step, i.e., the data
#'    should remain untouched.
#' @param predict_function function. This specifies the behavior to perform
#'    on the dataset when preparing for model prediction. A value of NULL
#'    specifies that there should be no prediction step, i.e., the data
#'    should remain untouched.
#' @param enforce_train logical. Whether or not to flip the trained flag
#'    during runtime. Set this to FALSE if you are experimenting with
#'    or debugging the mungebit.
#' @param nse logical. Whether or not we expect to use non-standard evaluation
#'    with this mungebit. Non-standard evaluation allows us to obtain the
#'    correct R expression when using \code{substitute} from within the body
#'    of a train or predict function for the mungebit. By default, \code{FALSE},
#'    meaning non-standard evaluation will not be available to the train and
#'    predict functions, but this ability can be switched on at a slight speed
#'    detriment (2-3x prediction slowdown for the fastest functions, somewhat
#'    negligible for slower functions).
#' @examples
#' mb <- mungebit$new(column_transformation(function(column, scale = NULL) {
#'   # `trained` is a helper provided by mungebits indicating TRUE or FALSE
#'   # according as the mungebit has been run on a dataset.
#'   if (!trained) {
#'     cat("Column scaled by ", input$scale, "\n")
#'   } else {
#'     # `input` is a helper provided by mungebits. We remember the
#'     # the `scale` so we can re-use it during prediction.
#'     input$scale <- scale
#'   }
#'   column * input$scale
#' }))
#' 
#' # We make a lightweight wrapper to keep track of our data so
#' # the mungebit can perform side effects (i.e., modify the data without an
#' # explicit assignment <- operator).
#' irisp <- list2env(list(data = iris))
#' #mb$run(irisp, 'Sepal.Length', 2)
#'
#' #head(mp$data[[1]] / iris[[1]])
#' # > [1] 2 2 2 2 2 2
#' #mb$run(mp, 'Sepal.Length')
#' # > Column scaled by 2
#' #head(mp$data[[1]] / iris[[1]])
#' # > [1] 4 4 4 4 4 4 
mungebit_initialize <- function(train_function   = base::identity,
                                predict_function = train_function,
                                enforce_train    = TRUE, nse = FALSE) {
  stopifnot(isTRUE(enforce_train) || identical(enforce_train, FALSE),
            isTRUE(nse) || identical(nse, FALSE))

  if (!is.acceptable_function(train_function)) {
    stop("To create a new mungebit, please pass a ",
         sQuote("function"), " as the first argument. I received ",
         "something of class ", sQuote(crayon::red(class(train_function)[1L])), ".")
  }

  if (!is.acceptable_function(predict_function)) {
    stop("To create a new mungebit, please pass a ",
         sQuote("function"), " as the second argument. I received ",
         "something of class ", sQuote(crayon::red(class(predict_function)[1L])), ".")
  }

  self$.input            <- new.env(parent = emptyenv())
  self$.train_function   <- to_function(train_function, "train")
  if (!is.null(self$.train_function)) {
    environment(self$.train_function) <- list2env(list(
      input = self$.input, trained = FALSE),
      parent = environment(self$.train_function) %||% globalenv())
  }
  self$.predict_function <- to_function(predict_function, "predict")
  if (!is.null(self$.predict_function)) {
    environment(self$.predict_function) <- list2env(list(
      input = self$.input, trained = FALSE),
      parent = environment(self$.predict_function) %||% globalenv())
  }
  self$.trained          <- FALSE
  self$.enforce_train    <- enforce_train
  self$.nse              <- nse
}

mungebit-run.R

Imagine running an imputation script on a dataset. On a training set, we have to compute the mean and replace the NAs with its value. However, when a single row comes in through a streaming production system, we merely need to memorize the computed mean and replace a variable with it if it is NA.

Calling the run method on a mungebit will store any data it needs for production, such as imputed means, in the input member. The second time $run is called (i.e., during prediction or real-time production use), it will be using the predict_function rather than the train_function, which will be less computationally expensive since it does not have to operate in reference to a training set and can use the memorized results in input to achieve the same transformation as the train_function.

              #' Run a mungebit.
#' 
#' Imagine flipping a switch on a set of train tracks. A mungebit
#' behaves like this: once the \code{trained} switch is flipped,
#' it can only run the \code{predict_function}, otherwise it will
#' run the \code{train_function}.
#'
#' @rdname mungebit
#' @param data environment or data.frame. Essentially an environment
#'   containing a \code{data} variable. In this case, that \code{data} variable
#'   will have a side effect enacted on it. If a \code{data.frame}, then 
#'   the return value will be the modified \code{data.frame} and the mungebit
#'   will record any results it must memorize in its \code{input}.
#' @param ... additional arguments to the mungebit's \code{train_function} or
#'   \code{predict_function}.
#' @return The modified \code{data}, whether it is an \code{environment}
#'   or \code{data.frame}.
mungebit_run <- function(data, ...) {
  if (is.environment(data)) {
    if (!get("exists", envir = parent.frame())("data", envir = data, inherits = FALSE)) {
      stop("If you are passing an environment to a mungebit, you must ",
           "provide one that contains a ", sQuote("data"), " key.")
    }

    if (self$.nse) {
      args <- c(list(bquote(.(substitute(data))$data)), eval(substitute(alist(...))))
      data$data <- do.call(self$run, args, envir = parent.frame())
    } else {
      data$data <- Recall(data$data, ...)
    }
    data
  } else if (isTRUE(self$.trained)) {
    if (is.null(self$.predict_function)) {
      data
    } else if (self$.nse) {
      args <- c(list(substitute(data)), eval(substitute(alist(...))))
      args$`_envir` <- parent.frame()
      do.call(self$predict, args, envir = parent.frame())
    } else {
      self$predict(data, ...)
    }
  } else {
    if (self$.nse) {
      args <- c(list(substitute(data)), eval(substitute(alist(...))))
      do.call(self$train, args, envir = parent.frame())
    } else {
      self$train(data, ...)
    }
  }
}

mungebit-train_predict.R

              #' Run the train function on a mungebit.
#'
#' The train function is responsible for performing a munging step and
#' storing metadata that can replicate the munging step in a live
#' production environment without the full reference data set.
#'
#' The purpose of the train function is to
#'
#' \enumerate{
#'   \item{Perform some munging on the data set, such as renaming
#'     columns, creating derived features, performing principal component
#'     analysis, replacing some values, removing outliers, etc.}
#'   \item{Storing the metadata necessary to replicate the munging operation
#'     after the original training set is no longer available. For example,
#'     if we are imputing a variable, we would need to remember its mean
#'     so we can use it later to replace \code{NA} values.}
#' }
#'
#' @rdname mungebit
#' @inheritParams mungebit_run
#' @param _envir environment. Internal argument used for determining
#'   the execution context of the invoked \code{train_function} or
#'   \code{predict_function}.
#' @return The modified \code{data}, whether it is an \code{environment}
#'   or \code{data.frame}. Side effects on the \code{input} local variable
#'   provided to the \code{train_function} will be recorded on the mungebit
#'   object.
mungebit_train <- function(data, ..., `_envir` = parent.frame()) {
  if (self$.enforce_train) {
    if (isTRUE(self$.trained)) {
      stop("This mungebit has already been trained, cannot re-train.")
    }
    on.exit(self$.trained <- TRUE, add = TRUE)
  }

The input environment used by the mungebit to record metadata from the munging performed at training-time is the only opportunity for affecting the input environment. Afterwards, we lock it so that we are confident the user does not modify it during prediction time (i.e., when it is run in a real-time production system).

                if (isTRUE(self$.enforce_train)) {
    on.exit({
      lockEnvironment(self$.input, TRUE)
      if (!is.null(self$.predict_function)) {
        environment(self$.predict_function)$trained <- TRUE
      }
    }, add = TRUE)
  }

  if (is.null(self$.train_function)) {
    data
  } else if (self$.nse) {
    args <- c(list(substitute(data)), eval(substitute(alist(...))))
    do.call(self$.train_function, args, envir = `_envir`)
  } else {
    self$.train_function(data, ...)
  }
}

#' Run the predict function on a mungebit.
#'
#' The predict function is responsible for performing a munging step
#' using metadata it computed during an earlier training step.
#' This is usually done in a live production environment setting.
#'
#' The purpose of the predict function is to
#'
#' \enumerate{
#'   \item{Perform some munging on the data set, such as renaming
#'     columns, creating derived features, performing principal component
#'     analysis, replacing some values, removing outliers, etc.}
#'   \item{Use the metadata computed during the \code{train} step
#'    to correctly perform this munging.}
#' }
#'
#' @rdname mungebit
#' @inheritParams mungebit_run
#' @return The modified \code{data}, whether it is an \code{environment}
#'   or \code{data.frame}. Side effects on the \code{input} local variable
#'   provided to the \code{predict_function} will be recorded on the mungebit
#'   object.
mungebit_predict <- function(data, ..., `_envir` = parent.frame()) {
  # For some reason, accessing this takes some time..
  if (!isTRUE(self$.trained)) {
    stop("This mungebit cannot predict because it has not been trained.")
  }

  if (self$.nse) {
    args <- c(list(substitute(data)), eval(substitute(alist(...))))
    do.call(self$.predict_function, args, envir = `_envir`)
  } else {
    self$.predict_function(data, ...)
  }
}

mungebit.R

              #' @include mungebit-initialize.R mungebit-run.R mungebit-train_predict.R
NULL

The idea behind mungebits grew out of a year-long session attempting to productionize R code without translating it into another programming language.

Almost every package that implements a statistical predictor requires the user to provide a wrangled dataset, that is, one stripped of outliers, with correctly coerced types, and an array of other “data preparation” aspects that may affect the final performance of the model.

Consider, for example, making use of a categorical variable that has many unique values, some of which occur commonly and others incredibly rarely. It may improve performance of some classifiers to take the rare values, say those which occur with a frequency of less than 5% in the data set, and label them as the value “OTHER”.

The choice of which variables make it into the “OTHER” label is determined by the training set, which may differ across random cross-validation splits and change as an organization gathers more data or the distribution shifts, such as due to a changing consumer base or market conditions.

When one refits a model with the new dataset, it would be ideal if the data preparation automatically reflected the updated values by picking the set of labels that occur with greater than 5% frequency and labeling all others as “OTHER”.

In code, we may say that

during_training <- function(factor_column) {
  frequencies <- table(factor_column)
  most_common <- names(which(frequencies / length(factor_column) > 0.05))
  factor_column <- factor(
    ifelse(factor_column %in% most_common, factor_column, "OTHER"),
    levels = c(most_common, "OTHER")
  )
  list(new_column = factor_column, most_common = most_common)
}

# Let's create an example variable.
factor_column <- factor(rep(1:20, 1:20))
output <- during_training(factor_column)
factor_column <- output$new_column

# We would hold on to output$most_common and "feed it" to
# munging code that ran in production on single data points.
during_prediction <- function(factor_column, most_common) {
  factor(ifelse(factor_column %in% most_common, factor_column, "OTHER"),
    levels = c(most_common, "OTHER"))
}

# Notice we have re-used our earlier code for constructing the new
# column. We will have to use the above function for munging in
# production and supply it the list `most_common` levels computed
# earlier during training.

single_data_point <- 5
stopifnot(identical(
  during_prediction(5, output$most_common),
  factor("OTHER", levels = c(as.character(11:20), "OTHER"))
))

single_data_point <- 15
stopifnot(identical(
  during_prediction(15, output$most_common),
  factor("15", levels = c(as.character(11:20), "OTHER"))
))

# In a real setting, we would want to operate on full data.frames
# instead of only on atomic vectors.

It may seem silly to create a factor variable with a single value and a surplus of unused levels, but that is only the case if you have never tried to productionize your data science models! Remember, even if you trained a simple regression, your factor columns will need to be converted to 0/1 columns using something like the model.matrix helper function, and this will yell at you if the correct levels are not there on the factor column.

The point of mungebits is to replace all that hard work–which in the experience of the author has sometimes spanned data preparation procedures composed of hundreds of steps like the above for collections of thousands of variables–with the much simplified

# During offline training.
replace_uncommon_levels_mungebit$run(dataset)

The mungebit has now been “trained” and remembers the common_levels defined earlier. In a production system, we will be able to run the exact same code on a single row of data, as long as we serialize the mungebit object and recall it during production. This gives us a streaming machine learning engine that includes hard data wrangling work–in R.

# During real-time prediction.
replace_uncommon_levels_mungebit$run(dataset)

After understanding mungebits, data science will stop being data janitor work and you will get back to the math.

              #' Construct a new mungebit.
#'
#' The majority of data projects are overcome by the burden of excessive
#' data wrangling. Part of the problem lies in the fact that when new
#' data is introduced that was drawn from the same source as the original,
#' such as a training set for a statistical model, \emph{different} code
#' needs to be written to achieve the same transformations. Mungebits solve
#' this problem by forcing the user to determine how to correctly munge
#' on out-of-sample data (such as live streaming data in the form of one-row
#' data.frames) at "munge-time", when the reason for the wrangling is still
#' apparent. A frequent source of data errors is treating this process as an
#' afterthought.
#'
#' Consider the following problem. Imagine we wish to discretize a variable,
#' say determined algorithmically with cuts [0, 0.5), [0.5, 1.5), [1.5, 3).
#' When we apply the same transformation on a new data set, we cannot run
#' the same discretization code, since it may produce new cutoffs, and hence
#' invalidate the results if, for example, we had trained a model on the
#' prior cutoffs. To ensure the exact same mathematical transformation
#' is performed on new data--whether a new test set derived from recent
#' data or a one-row data.frame representing a single record streaming
#' through a production system--we must run \emph{different code} on
#' the "original" set versus the new set.
#'
#' Mathematically speaking, a transformation of a data set can be represented
#' by a single mathematical function that is implemented differently during
#' "training" versus "prediction." Here, "training" refers to the first
#' time the transformation is performed, and "prediction" refers to 
#' subsequent times, such as on newly obtained data or a one-row data.frame
#' representing a single new record in a production system.
#'
#' Therefore, the \emph{correct} approach to data preparation, if you
#' wish to re-use it in the future on new data sets or in a live production
#' environment, is to treat it as a collection of tuples
#' \code{(train_function, predict_function, input)}, where
#' \code{train_function} represents the original code, \code{input} represents
#' an arbitrary R object such as a list, used for storing "metadata"
#' necessary to re-create the original transformation, and the
#' \code{predict_function} takes this \code{input} metadata and produces
#' the identical transformation on an out-of-sample data set.
#'
#' For example, if we wish to impute a data set, \code{train_function}
#' might compute the mean, store it in \code{input$mean}, replace
#' the \code{NA} values with the mean, and return the dataset. Meanwhile,
#' the \code{predict_function} simply replaces the \code{NA} values
#' with the cached \code{input$mean}.
#'
#' Usually, these steps would be in disjoint code bases: the modeler
#' would perform the ad-hoc munging while playing with the dataset,
#' and a software engineer would take the computed \code{input$mean}
#' and hard code it into a "data pipeline". It would be infeasible
#' to recompute the mean on-the-fly since \emph{it depends on the
#' original data set}, which may be prohibitively large. However,
#' while it may require a lot of space and time to compute the
#' original \code{input}, as they are parameterized potentially by
#' a very large data set, usually the \code{input} itself is small
#' and the resulting \code{predict_function} is inexpensive. 
#'
#' The fundamental problem of data preparation, and the reason why
#' \href{http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html}{data scientists spend over 90\% of their time on data preparation},
#' is a lack of respect for this dichotomy. Using mungebits makes
#' this duality blatantly apparent in all circumstances and will hopefully
#' reduce the amount of time wasted on cumbersome wrangling.
#'
#' @docType class
#' @format NULL
#' @name mungebit
#' @export
#' @examples
#' \dontrun{
#' mb <- mungebit(column_transformation(function(col, scale = NULL) {
#'   if (!isTRUE(trained)) { # trained is an injected keyword
#'    cat("Column scaled by ", input$scale, "\n")
#'   } else {
#'    input$scale <- scale
#'   }
#'  
#'   col * input$scale
#' }))
#' 
#' iris2 <- mb$run(iris, "Sepal.Length", 2)
#' # iris2 now contains a copy of iris with Sepal.Length doubled.
#' iris3 <- mb$run(iris2, "Sepal.Length")
#' # > Column scaled by 2
#' head(iris3[[1]] / iris[[1]])
#' # > [1] 4 4 4 4 4 4 
#' }
mungebit <- R6::R6Class("mungebit",
  public = list(
    .train_function   = NULL,  # Function or NULL
    .predict_function = NULL,  # Function or NULL
    .input            = NULL,  # Environment
    .trained          = FALSE, # Logical
    .enforce_train    = TRUE,  # Logical
    .nse              = FALSE, # Logicl

    initialize = mungebit_initialize,
    run        = mungebit_run,
    train      = mungebit_train,
    predict    = mungebit_predict,

    debug      = function() { debug(self) },
    undebug    = function() { undebug(self) },
    train_function   = function() { self$.train_function   },
    predict_function = function() { self$.predict_function },
    trained    = function(val) {
      if (missing(val)) self$.trained
      else {
        if (!is.null(self$.train_function) && !is.null(environment(self$.train_function))) {
          environment(self$.train_function)$trained <- isTRUE(val)
        }
        if (!is.null(self$.predict_function) && !is.null(environment(self$.predict_function))) {
          environment(self$.predict_function)$trained <- isTRUE(val)
        }
        self$.trained <- isTRUE(val)
      }
    },
    input       = function(val, list = TRUE) {
      if (missing(val) && isTRUE(list)) as.list(self$.input)
      else if (missing(val) && !isTRUE(list)) self$.input
      else if (is.environment(val)) self$.input <- val
      else self$.input <- list2env(val, parent = parent.env(self$.input))
    },
    nonstandard = function() { isTRUE(self$.nse) },
    duplicate   = function(...) { duplicate_mungebit(self, ...) }
  )
)

A helper used to make a fresh untrained replica of an existing mungebit

              duplicate_mungebit <- function(bit, private = FALSE) {
  newbit <- mungebit$new(bit$train_function(), bit$predict_function())
  if (isTRUE(private)) {
    copy_env(newbit$.input, bit$input(list = FALSE))
    newbit$trained(bit$trained())
  }
  newbit
}

#' Determine whether an object is a mungebit.
#'
#' @keywords typecheck
#' @param x ANY. An R object to check.
#' @return TRUE or FALSE according as it has class mungebit.
#' @export
is.mungebit <- function(x) {
  inherits(x, "mungebit")
}

#' Copy one environment into another recursively.
#' 
#' @param to environment. The new environment.
#' @param from environment. The old environment.
#' @note Both \code{to} and \code{from} must be pre-existing environments
#'   or this function will error.
copy_env <- function(to, from) {
  stopifnot(is.environment(to) && is.environment(from))
  rm(list = ls(to, all.names = TRUE), envir = to)
  for (name in ls(from, all.names = TRUE)) {
    swap_environments <- function(obj) {
      # TODO: (RK) Attributes?
      if (is.environment(obj)) {
        env <- new.env(parent = parent.env(obj))
        copy_env(env, obj)
        env
      } else if (is.recursive(obj)) {
        for (i in seq_along(obj)) {
          obj[[i]] <- Recall(obj[[i]])
        }
        obj
      } else { obj }
    }

    # Copy sub-environments in full.
    assign(name, swap_environments(from[[name]]), envir = to)
  }
}

#' @export
print.mungebit <- function(x, ...) {
  print_mungebit(x, ...)
}

mungepiece-initialize.R

              #' Construct a new mungepiece.
#'
#' A mungebit defines an atomic data transformation of an \emph{arbitrary}
#' data set. In order to specify the parameters that may be relevant for
#' a \emph{particular} data set (such as restricting its effect to
#' specific columns, fixing certain parameters such as the imputation
#' method, or providing information that may contain domain knowledge)
#' one uses a mungepiece. A mungepiece defined a \emph{domain-specific}
#' atomic transformation of a data set.
#'
#' A mungepiece is defined by the collection of
#'
#' \enumerate{
#'   \item{A mungebit. }{The mungebit determines the qualitative nature
#'      of the data transformation. The mungebit may represent
#'      a discretization method, principal component analysis,
#'      replacement of outliers or special values, and so on.
#'
#'      If a training set represents automobile data and there are
#'      variables like "weight" or "make," these variables should not be
#'      hardcoded in the mungebit's \code{train} and \code{predict}
#'      functions. The mungebit should only represent that abstract
#'      \emph{mathematical} operation performed on the data set.}
#'   \item{Training arguments. }{While the mungebit represents the code
#'      necessary for performing some \emph{abstract mathematical operation}
#'      on the data set, the training arguments record the metadata
#'      necessary to perform the operation on a \emph{particular}
#'      data set.
#'
#'      For example, if we have an automobile data set and know the
#'      "weight" column has some missing values, we might pass a vector
#'      of column names that includes "weight" to an imputation mungebit
#'      and create an imputation-for-this-automobile-data mungepiece.
#'
#'      If we have a medical data set that includes special patient type
#'      codes and some of the codes were mistyped during data entry or
#'      are synonyms for the same underlying "type," we could pass a list
#'      of character vectors to a "grouper" mungebit that would condense
#'      the categorical feature by grouping like types.
#'
#'      If we know that some set of variables is predictive for modeling a
#'      particular statistical question but are unsure about the remaining
#'      variables, we could use this intuition to pass the list of known
#'      variables as exceptions to a "sure independence screening" mungebit.
#'      The mungebit would run a univariate regression against each variable
#'      not contained in the exceptions list and drop those totally uncorrelated
#'      with the dependent variable. This is a typical technique for high
#'      dimensionality reduction. Knowledge of the exceptions would reduce
#'      the computation time necessary for recording which variables are
#'      nonpredictive, an operation that may be very computationally expensive.
#'
#'      In short, the mungebit records what we are doing to the data set
#'      from an abstract level and does not contain any domain knowledge.
#'      The training arguments, the arguments passed to the mungebit's 
#'      \code{train_function}, record the details that pinpoint the
#'      abstract transformation to a \emph{particular} training set intended for
#'      use with a predictive model.}
#'   \item{Prediction arguments. }{It is important to understand the 
#'      train-predict dichotomy of the mungebit. If we are performing an
#'      imputation, the mungebit will record the means computed from the
#'      variables in the training set for the purpose of replacing \code{NA}
#'      values. The training arguments might be used for specifying the columns
#'      to which the imputation should be restricted.
#'
#'      The prediction arguments, by default the same as the training arguments,
#'      are metadata passed to the mungebit's \code{predict_function}, such as
#'      again the variables the imputation applies to. Sometimes the prediction
#'      arguments may differ slightly from the training arguments, such as when
#'      handling the dependent variable (which will not be present during
#'      prediction) or when the code used for prediction needs some further
#'      parametrization to replicate the behavior of the \code{train_function}
#'      on one-row data sets (i.e., real-time points in a production setting).}
#' } 
#'
#' In short, mungepieces \emph{parametrize} a \emph{single transformation}
#' of a data set \emph{for that particular data set}. While a mungebit is
#' abstract and domain-independent and may represent computations like 
#' imputation, outlier detection, and dimensionality reduction, a mungepiece
#' records the human touch and domain knowledge that is necessary for
#' ensuring the mungebits operate on the appropriate features and optimize
#' space-time tradeoffs (for example, the modeler may know that certain
#' columns do not contain missing values and avoid passing them to the
#' imputation mungebit).
#'
#' Informally speaking, you can think of a mungebit as the \emph{raw mold}
#' for a transformation and a mungepiece as the
#' \emph{cemented product constructed from the mold} that is specific to
#' a particular data set. A mungepiece affixes a mungebit so it works on
#' a specific data set, and domain knowledge may be necessary to construct
#' the mungepiece optimally.
#'
#' @param mungebit mungebit. A mungebit \code{\link{mungebit}} representing
#'    an abstract transformation of a data set, such as type conversion,
#'    imputation, outlier detection, dimensionality reduction,
#'    value replacement, and so on.
#' @param train_args list. Arguments to pass to the mungebit when it is
#'    run for the first time, i.e., on a \emph{training set} that will be
#'    fed to a predictive model and may be quite large. These arguments,
#'    passed directly to the mungebit's \code{train_function}, should 
#'    contain domain-specific metadata that is necessary to apply the
#'    mungebit to this specific data set.
#'
#'    For example, if the modeler knows certain columns do not contain 
#'    missing values, they might pass a character vector of column names
#'    to an imputation mungebit that avoids attempting to impute the
#'    columns guaranteed to be fully present. Doing this heuristically might
#'    require an unnecessary pass over the data, potentially expensive if
#'    the data consists of thousands of features; domain-specific knowledge
#'    might be used to pinpoint the few features that require imputation.
#' @param predict_args list. Arguments to pass to the mungebit when it
#'    is run for the second or subsequent time, i.e., on a \code{prediction set}
#'    that will usually be coming from model validation code or a real-time
#'    production environment. After the mungebit has been trained on the
#'    training set, it should be capable of predicting on
#'    \emph{single row data sets}, i.e., new "points" coming through in
#'    a live production setting.
#'
#'    Usually, the prediction arguments will be the same as the training
#'    arguments for the mungepiece.
#' @examples
#' \dontrun{
#'   doubler <- mungebit$new(column_transformation(function(x) x * 2))
#'   cols    <- c("Sepal.Length", "Petal.Length")
#'   mp      <- mungepiece$new(doubler, list(cols))
#'   iris2   <- mp$run(iris)
#'   stopifnot(identical(iris2[cols], 2 * iris[cols]))
#' }
mungepiece_initialize <- function(mungebit     = NULL,
                                  train_args   = list(),
                                  predict_args = train_args) {

  if (!is.mungebit(mungebit)) {
    stop("To create a new mungepiece, please pass a ",
         sQuote("mungebit"), " as the first argument. I received ",
         "something of class ", sQuote(crayon::red(class(mungebit)[1L])), ".")
  }

  if (!is.list(train_args)) {
    stop("To create a new mungepiece, please pass a list (of training ",
         "arguments) as the second argument. I received something of ",
         "class ", sQuote(crayon::red(class(train_args)[1L])), ".")
  }

  if (!is.list(predict_args)) {
    stop("To create a new mungepiece, please pass a list (of prediction ",
         "arguments) as the third argument. I received something of ",
         "class ", sQuote(crayon::red(class(predict_args)[1L])), ".")
  }

  self$.mungebit     <- mungebit
  self$.train_args   <- make_env(train_args)
  self$.predict_args <- make_env(predict_args)

  lockEnvironment(self$.train_args,   bindings = TRUE)
  lockEnvironment(self$.predict_args, bindings = TRUE)
}

mungepiece-run.R

Running a mungepiece respects the same laws as running a mungebit. During training, the goal is to record the necessary metadata the mungebit needs in order to run during prediction (i.e., on one row data sets in a real-time production environment).

The first time mungepiece$run is called, the call is delegated to the mungebit attached to the mungepiece with the appropriate training arguments.

For example, imagine we have a mungebit that discretizes a variable.

discretizer_train <- function(data, columns, breaks = 5) {
  # Recall that the first argument to a mungebit's train function
  # is *always* the data set. The additional arguments, in this 
  # case the column names to discretize, will be the list of
  # training arguments on the mungepiece.
  stopifnot(is.character(columns), all(columns %in% colnames(data)))

  # We record the columns that were discretized.
  input$`_columns` <- columns

  for (column in columns) {
    # Record the values to discretize at, i.e., the bounds of each interval.
    quantiles <- quantile(data[[column]], breaks = breaks)
    # `cuts` will be the discretized variable using R's `base::cut`.
    cuts <- cut(data[[column]], breaks = quantiles)
    # We need to remember the cut points and levels to discretize during 
    # prediction correctly.
    input[[column]] <- list(cuts = quantiles, levels = levels(cuts)) 
    # We assume there are no missing values.
    data[[column]]  <- cuts
  }

  data
}

# This function will be pretty slow in R. You can rewrite it in Rcpp.
# It also suffers from a few bugs on the boundaries due to open/closed
# interval issues, but a full implementation defeats the illustration.
discretizer_predict <- function(data, columns, ...) {
  # We leave the last argument as ... in case the user left the train 
  # arguments the same as the predict arguments so that we may absorb
  # the `breaks` argument without error.
  if (missing(columns)) columns <- input$`_columns`

  # We only allow columns that were discretized during training and are
  # present in the dataset. A more strict approach would throw an error.
  columns <- intersect(intersect(columns, input$`_columns`), colnames(data))
  # Some helper functions.
  coalesce <- function(x, y) { if (length(x) == 0) y[1L] else x[1L] }
  min2     <- function(x) { if (length(x) == 0) NULL else min(x) }

  for (column in columns) {
    cuts <- vapply(data[[column]], function(value) {
      # Convince yourself that `ix` will be the index of the correct
      # label. For example, if value is `2.5` and `levels` are [0, 1],
      # (1, 2], (2, 3], (3, 4], then `ix` will be 3.
      ix <- max(1, coalesce(
        min2(which(c(-Inf, input[[column]]$cuts[-1L]) >= value)),
        length(input[[column]]$levels) + 1
       ) - 1)
      input[[column]]$levels[ix]
    }, character(1))
    data[[column]] <- factor(cuts, levels = input[[column]]$levels)
  }

  data   
}

bit <- mungebit$new(discretizer_train, discretizer_predict)

Note that the code to implement discretization during training and prediction is quite different! We can turn this mungebit into a mungepiece that operates on the iris dataset.

piece <- mungepiece$new(bit, list(c("Sepal.Width", "Sepal.Length")))
iris2 <- mungepiece$run(iris) # Train the mungepiece.
head(iris2$Sepal.Length)
# [1] (4.3,5.1] (4.3,5.1] (4.3,5.1] (4.3,5.1] (4.3,5.1] (5.1,5.8]
# Levels: (4.3,5.1] (5.1,5.8] (5.8,6.4] (6.4,7.9]
iris3 <- piece$run(iris[1:6, ]) # It has been trained and run live.
print(iris3$Sepal.Length)
# [1] (4.3,5.1] (4.3,5.1] (4.3,5.1] (4.3,5.1] (4.3,5.1] (5.1,5.8]
# Levels: (4.3,5.1] (5.1,5.8] (5.8,6.4] (6.4,7.9] 
stopifnot(identical(head(iris2$Sepal.Length), iris3$Sepal.Length))

The mungepiece also works correctly on outliers.

irisc <- iris; irisc[1:2, 1] <- c(0, 10)
print(piece$run(irisc[1:2, ])$Sepal.Length)
# [1] (4.3,5.1] (6.4,7.9]
# Levels: (4.3,5.1] (5.1,5.8] (5.8,6.4] (6.4,7.9]

It is important to handle such cases if new points in a live production setting have values that are outside the observed range of the training set.

              #' Run a mungepiece and prepare it for a live production setting.
#'
#' Running a mungepiece achieves the same effect as running the mungebit
#' attached to the mungepiece: the first time it is run, we \emph{train}
#' the mungebit so it remembers metadata it will need to replicate the
#' operation in a live production setting on a single row of data. The
#' second and subsequent times we run the mungepiece, it will execute
#' the predict function of the underlying mungebit.
#'
#' @inheritParams mungebit_run
#' @param _envir environment. The calling environment for the train
#'    or predict function on the underlying mungebit. This is an internal
#'    argument and is \code{parent.frame()} by default.
#' @return If the \code{data} parameter is an environment, the transformed
#'    environment (i.e., the transformed data in the environment) after 
#'    application of the underlying mungebit. If \code{data} is a data.frame,
#'    the transformed data.frame is returned.
mungepiece_run <- function(data, ..., `_envir` = parent.frame()) {
  # TODO: (RK) Document literately.
  if (self$.mungebit$trained()) {
    calling_environment <- self$.predict_args
    reference_function  <- self$.mungebit$predict_function()
  } else {
    calling_environment <- self$.train_args
    reference_function  <- self$.mungebit$train_function()
  }

  args <- eval(substitute(alist(...)))
  args <- two_way_argument_merge(strip_arguments(reference_function, 1),
                                 calling_environment, args)

  parent.env(calling_environment) <- `_envir`
  on.exit(parent.env(calling_environment) <- emptyenv(), add = TRUE)

  args <- c(list(substitute(data)), args)

  do.call(self$.mungebit$run, args, envir = calling_environment)
}

strip_arguments <- function(fun, n) {
  if (length(formals(fun)) > 0L) {
    formals(fun) <- formals(fun)[setdiff(seq_along(formals(fun)), seq_len(n))]
    fun
  } else {
    fun
  }
}

two_way_argument_merge <- function(reference_function, calling_environment, args) {
  if (length(formals(reference_function)) == 0L) {
    # If reference_function is `[`, calling match.call gives an
    # "invalid definition" error.
    reference_function <- function() { }
    if (length(args) > 0L) {
      args
    } else {
      default_args <- env2listcall(calling_environment)
      names(default_args) <- attr(calling_environment, "initial_names")
      default_args
    }
  } else {
    call <- as.call(c(alist(self), args))
    base_args <- as.list(match.call(reference_function, call)[-1L])

    default_args <- env2listcall(calling_environment)
    names(default_args) <- attr(calling_environment, "initial_names")
    call         <- as.call(c(alist(self), default_args))
    default_args <- as.list(match.call(reference_function, call)[-1L])

    if (unnamed_count(default_args) > 0 && unnamed_count(base_args) > 0) {
      default_args[unnamed(default_args)] <- NULL
    }

    list_merge(default_args, base_args)
  }
}

mungepiece.R

              #' @include mungepiece-initialize.R mungepiece-run.R
NULL

Mungebits are intended to record the dichotomy between computations that occur at training time, such as computing the means of variables during imputation, and prediction time, such as restoring NA values with the precomputed means at prediction time.

While a mungebit records the general computation that can apply to arbitrary datasets, a mungepiece records the training and prediction arguments applicable to the mungebit. For example, if we used an imputation mungebit that looked as follows

imputation_mungebit <- mungebit$new(function(data, columns) {
  data[columns] <- lapply(columns, function(column) {
    if (isTRUE(trained)) {
      input[[column]] <- mean(data[[column]], na.rm = TRUE)
    }
    ifelse(is.na(data[[column]]), input[[column]], data[[column]])
  })
})

then we may wish to record the columns to which the imputation applies. In this case, we can use a mungepiece.

piece <- mungepiece$new(imputation_mungebit, imputed_columns, imputed_columns)

To run the mungepiece on our data set we can say piece$run(data, column_names). The advantage of this approach is that after the mungepiece has been trained, it will remember the means and can be used on single row data.frames (i.e., those coming in from production) without a change in syntax or calling convention. This means that the typical disproportion of spending the majority of one's time “munging data” is drastically reduced and no further code has to be written to ensure the transformations run correctly in a production setting.

              #' Mungepiece.
#'
#' @name mungepiece
#' @docType class
#' @export
mungepiece <- R6::R6Class("mungepiece", 
  public = list(
    .mungebit     = NULL, # mungebit
    .train_args   = NULL, # list
    .predict_args = NULL,

    initialize = mungepiece_initialize,
    run        = mungepiece_run,

    debug      = function() { debug(self$.mungebit) },
    undebug    = function() { undebug(self$.mungebit) },
    trained    = function() { self$.mungebit$trained() },
    mungebit   = function() { self$.mungebit },

    train_args   = function() { env2list(self$.train_args) },
    predict_args = function() { env2list(self$.predict_args) },

    duplicate  = function(...) { duplicate_mungepiece(self, ...) }
  )
)

A helper used to make a fresh untrained replica of an existing mungepiece.

              duplicate_mungepiece <- function(piece, ...) {

To ensure backwards compatibility with legacy mungebits, we perform nothing is the piece is not an R6 object (and hence a new mungepiece in the style of this package).

                if (is.legacy_mungepiece(piece)) {
    piece
  } else {
    mungepiece$new(piece$mungebit()$duplicate(..., private = piece$trained()),
                   piece$train_args(), piece$predict_args())
  }
}

#' Determine whether an object is a mungepiece.
#'
#' @keywords typecheck
#' @param x ANY. An R object to check.
#' @return TRUE or FALSE according as it has class mungepiece
#' @export
is.mungepiece <- function(x) {
  inherits(x, "mungepiece")
}

#' @export
print.mungepiece <- function(x, ...) {
  print_mungepiece(x, ...)
}

package.mungebits2.R

              #' An approach to data preparation that is compatible with production systems.
#'
#' Mungebits2 defines a way of thinking about data preparation that
#' couples the definition of what happens in batch processing versus
#' online prediction so that both can be described by the same codebase.
#'
#' For example, consider the simple example of imputation. While the
#' general concept of imputing a variable works on arbitrary codebases,
#' a \emph{separate} data transformation will have to be defined for
#' each model that uses imputation in a production setting. This is 
#' because the imputed value depends inherently on the dataset.
#' We must remember the mean of the data set encountered during
#' training, and recall this value when performing replacement in
#' a production setting.
#'
#' Mungebits provide a sort of "train track switch" that allows one
#' to write data preparation offline, but ensure it works online 
#' (on a stream of new data, such as one-row data.frames).
#'
#' By reframing data preparation as the process of constructing
#' a "munge procedure", a list of trained mungebits that can
#' reproduce the same mathematical operation on a dataset in
#' a production environment without additional code, the process
#' of productionizing a machine learning model should become
#' significantly simplified.
#'
#' @name mungebits2
#' @import crayon lazyeval R6 stagerunner whisker
#' @docType package
NULL

utils::globalVariables(c("self", "newpieces", "mungepieces", "size"))

parse_mungepiece.R

Constructing a mungepiece is not incredibly straightforward. First, we must construct the mungebit, which represents the code that will be executed when running the mungepiece on a training dataset to later feed to a machine learning classifier (i.e., the train function) in conjunction with the code that executes on streaming records coming through in a production system performing the same mathematical operation on a 1-row dataset (i.e., the predict function).

Next, we must determine the training and prediction arguments to the mungebit that specify the difference in how to use the mungebit on offline training versus realtime prediction data.

Thus, constructing a mungepiece looks something like:

piece <- mungepiece$new(
  mungebit$new(train_function, predict_function),
  train_args, predict_args
)

In particular, we have to invoke the mungebit constructor every time we create a mungepiece. Instead the parse_mungepiece helper defined in this file provides a more convenient format:

# If the train function with train args is the same as the predict function
# with predict args.
piece <- parse_mungepiece((list(train_fn, train_arg1, train_arg2 = "blah"))

# If the train and predict arguments to the mungepiece match, but we
# wish to use a different train versus predict function for the mungebit.
piece <- parse_mungepiece(list(list(train_fn, predict_fn), dual_arg1, dual_arg2 = "blah"))

# If we wish to only run this mungepiece during training.
piece <- parse_mungepiece(list(list(train_fn, NULL), train_arg1, train_arg2 = "blah"))

# If we wish to only run this mungepiece during prediction
piece <- parse_mungepiece(list(list(NULL, predict_fn), predict_arg1, predict_arg2 = "blah"))

# If we wish to run different arguments but the same function during
# training versus prediction.
piece <- parse_mungepiece(list(train = list(train_fn, train_arg1),
                               predict = list(train_fn, predict_arg1)))

# If we wish to run different arguments with different functions during
# training versus prediction.
piece <- parse_mungepiece(list(train = list(train_fn, train_arg1),
                               predict = list(predict_fn, predict_arg1)))

This is a full partition of the potential arguments used to initialize a mungebit + mungepiece combo. Using this syntax in conjunction with the munge helper function speeds up coding of munge procedures (lists of mungepieces) and increases the readability of munging code.

# The munge function calls out to the parse_mungepiece helper.
munged_data <- munge(raw_data, list(
  "Drop useless vars" = list(list(drop_vars, vector_of_variables),
                             list(drop_vars, c(vector_variables, "dep_var"))),
  "Impute variables"  = list(imputer, imputed_vars),
  "Discretize vars"   = list(list(discretize, restore_levels), discretized_vars)
))

Translated in English, we are saying:

Drop a static list of useless variables from the data set. When the model is trained, drop the dependent variable as well since we will no longer need it.
Impute the variables in the static list of imputed_vars. When the model is trained, the imputer will have some logic to restore the means obtained during training of the mungepiece (assuming we are using mean imputation).
Discretize the static list of variables in the discretized_vars character vector. After model training, when new data points come in, the original training set is no longer available. The discretize method stored the necessary cuts for each variable in the mungebit's input, which the restore_levels function uses to bin the numerical features in the list of discretized_vars into factor (categorical) variables.

If one took an initial (training) data set, ran it through the munge helper as above, took the resulting list of mungepieces, and ran the original data set through them a second time (so they are running in “predict mode”), we should obtain the same result.

We can use the list of trained mungepieces to replicate the munging on the training set in a production system on single row data sets (i.e., new records being streamed in real-time).

              #' Translate a list of train / predict functions and arguments to a mungepiece.
#'
#' Constructing mungepieces and mungebits by hand is a little tedious.
#' To simplify the process, we introduce a tiny DSL that allows for
#' easier construction of mungebits. The intention is for this function
#' to be used in conjuction with a list passed to the \code{\link{munge}}
#' helper.
#'
#' @note To understand the documentation of this helper, please read
#'   the documentation on \code{\link{mungebit}} and \code{\link{mungepiece}}
#'   first.
#' @param args list. A list of arguments. This can be one of the following formats
#'   
#'   \enumerate{
#'     \item{\code{list(train_fn, ...)}}{ -- If the first element of \code{args} is
#'       a function followed by other arguments, the constructed mungepiece
#'       will use the \code{train_fn} as both the \emph{train and predict}
#'       function for the mungebit, and \code{list(...)} (that is, the remaining 
#'       elements in the list) will be used as both the train and predict
#'       arguments in the mungepiece. In other words, using this format
#'       specifies you would like \emph{exactly the same behavior in
#'       training as in prediction}. This is appropriate for mungebits
#'       that operate in place and do not need information obtained
#'       from the training set, such as simple value replacement or
#'       column removal.
#'     }
#'     \item{\code{list(list(train_fn, predict_fn), ...)}}{
#'       -- If \code{args} consists of a two-element pair in its first
#'       element, it must be a pair of either \code{NULL}s or functions,
#'       with not both elements \code{NULL}. If the \code{train_fn}
#'       or \code{predict_fn}, resp., is \code{NULL}, this will signify to have
#'       \emph{no effect} during training or prediction, resp.
#'        
#'       The remaining arguments, that is \code{list(...)}, will be used
#'       as both the training and prediction arguments.
#'   
#'       This structure is ideal if the behavior during training and prediction
#'       has an identical parametrization but very different implementation,
#'       such as imputation, so you can pass two different functions.
#'
#'       It is also useful if you wish to have no effect during prediction,
#'       such as removing faulty rows during training, or no effect during
#'       training, such as making a few transformations that are only
#'       necessary on raw production data rather than the training data.
#'     }
#'     \item{\code{list(train = list(train_fn, ...), predict = list(predict_fn, ...))}}{
#'       If \code{args} consists of a list consisting of exactly two named
#'       elements with names "train" and "predict", then the first format will be
#'       used for the respective fields. In other words, a mungepiece will
#'       be constructed consisting of a mungebit with \code{train_fn} as the
#'       training function, \code{predict_fn} as the predict fuction, and
#'       the mungepiece train arguments will be the train list of additional
#'       arguments \code{list(...)}, and similarly the predict arguments will be
#'       the predict list of additional arguments \code{list(...)}.
#'  
#'       Note \code{train_fn} and \code{predict_fn} must \emph{both} be functions
#'       and not \code{NULL}, since then we could simply use the second format
#'       described above.
#'
#'       This format is ideal when the parametrization differs during training and
#'       prediction. In this case, \code{train_fn} usually should be the same
#'       as \code{predict_fn}, but the additional arguments in each list can
#'       be used to identify the parametrized discrepancies. For example, to
#'       sanitize a dataset one may wish to drop unnecessary variables. During
#'       training, this excludes the dependent variable, but during prediction
#'       we may wish to drop the dependent as well.
#'
#'       This format can also be used to perform totally different behavior on
#'       the dataset during training and prediction (different functions and
#'       parameters), but mungebits should by definition achieve the same
#'       operation during training and prediction, so this use case is rare
#'       and should be handled carefully.
#'     }
#'   }
#'
#'   Note that the above trichotomy is exhaustive: any mungepiece can be
#'   constructed using this helper, regardless of its mungebit's
#'   train or predict function or its own train or predict arguments.
#'   In the first two formats, the first unnamed list element is always
#'   reserved and will never belong to the \code{train_args} or \code{predict_args}
#'   of the mungepiece.
#'
#'   Also note that in the first two formats, the first list element must be
#'   unnamed.
#' @return The constructed \code{\link{mungepiece}}.
#' @seealso \code{\link{mungepiece}}, \code{\link{mungebit}}.
#' @export
#' @examples
#' # First, we show off the various formats that the parse_mungepiece
#' # helper accepts. For this exercise, we can use dummy train and
#' # predict functions and arguments.
#' train_fn   <- predict_fn   <- function(x, ...) { x }
#' train_arg1 <- predict_arg1 <- dual_arg1 <- TRUE # Can be any parameter value.
#'
#' # If the train function with train args is the same as the predict function
#' # with predict args.
#' piece <- parse_mungepiece(list(train_fn, train_arg1, train_arg2 = "blah"))
#'
#' # If the train and predict arguments to the mungepiece match, but we
#' # wish to use a different train versus predict function for the mungebit.
#' piece <- parse_mungepiece(list(list(train_fn, predict_fn), dual_arg1, dual_arg2 = "blah"))
#' 
#' # If we wish to only run this mungepiece during training.
#' piece <- parse_mungepiece(list(list(train_fn, NULL), train_arg1, train_arg2 = "blah"))
#' 
#' # If we wish to only run this mungepiece during prediction
#' piece <- parse_mungepiece(list(list(NULL, predict_fn), predict_arg1, predict_arg2 = "blah"))
#'
#' # If we wish to run different arguments but the same function during
#' # training versus prediction.
#' piece <- parse_mungepiece(list(train = list(train_fn, train_arg1),
#'                                predict = list(train_fn, predict_arg1)))
#'
#' # If we wish to run different arguments with different functions during
#' # training versus prediction.
#' piece <- parse_mungepiece(list(train = list(train_fn, train_arg1),
#'                                predict = list(predict_fn, predict_arg1)))
#'
#' # The munge function uses the format defined in parse_mungepiece to create
#' # and execute a list of mungepieces on a dataset.
#' \dontrun{
#' munged_data <- munge(raw_data, list(
#'   "Drop useless vars" = list(list(drop_vars, vector_of_variables),
#'                              list(drop_vars, c(vector_variables, "dep_var"))),
#'   "Impute variables"  = list(imputer, imputed_vars),
#'   "Discretize vars"   = list(list(discretize, restore_levels), discretized_vars)
#' ))
#' 
#' 
#' # Here, we have requested to munge the raw_data by dropping useless variables,
#' # including the dependent variable dep_var after model training,
#' # imputing a static list of imputed_vars, discretizing a static list
#' # of discretized_vars being careful to use separate logic when merely
#' # using the computed discretization cuts to bin the numeric features into
#' # categorical features. The end result is a munged_data set with an 
#' # attribute "mungepieces" that holds the list of mungepieces used for
#' # munging the data, and can be used to perform the exact same set of
#' # operations on a single row dataset coming through in a real-time production
#' # system.
#' munged_single_row_of_data <- munge(single_row_raw_data, munged_data)
#' }
#' # The munge function uses the attached "mungepieces" attribute, a list of
#' # trained mungepieces.
parse_mungepiece <- function(args) {
  if (is.mungepiece(args) || is.mungebit(args)) { args <- list(args) }

  if (length(args) == 1L && is.mungepiece(args[[1L]])) {

We duplicate the mungepiece to avoid training it.

                  duplicate_mungepiece(args[[1L]])
  } else if (length(args) == 1L && is.mungebit(args[[1L]])) {

This case is technically handled already in parse_mungepiece_single, but we make it explicit here.

                  create_mungepiece(to_function(args[[1L]], "train"),
                      to_function(args[[1L]], "predict"),
                      legacy = is.legacy_mungebit(args[[1L]]))

The third permissible format requires no unnamed arguments, since it must be a list consisting of a “train” and “predict” key.

                } else if (is.list(args) && length(args) > 0L) {
      if (unnamed_count(args) == 0L) {

For this case, we delegate the work to parse_mungepiece_dual.

                    parse_mungepiece_dual(args)
    } else {

Otherwise, the training and prediction arguments are the same.

                    parse_mungepiece_single(args)
    }
  } else {
    stop("Invalid format passed to ", sQuote("parse_mungepiece"))
  }
}

This is used for the third format.

piece <- parse_mungepiece(list(
  train = list(train_fn, train_arg1),
  predict = list(train_fn, predict_arg1))
)

              parse_mungepiece_dual <- function(args) {
  if (!setequal(c("train", "predict"), names(args))) {

This check ensures the list has names exactly equal to “train” and “predict”.

                  if (length(args) == 2) {
      error <- paste("Instead, you provided a list with keys",
        paste(sapply(names(args), sQuote), collapse = " and "))
    } else {
      error <- paste0("Instead, you provided a list of length ", length(args))
    }

We use the m helper defined in the messages.R file to provide a descriptive error.

                  stop(m("parse_mungepiece_dual_error", error = error))
  }

We use the Map built-in to perform a zip operation and translate a pair of list(train_function, train_args) and list(predict_function, predict_args) to a pair list(train_function, predict_function) and list(train_args, predict_args).

                args <- Map(list, parse_mungepiece_dual_chunk(args$train,   type = "train"),
                    parse_mungepiece_dual_chunk(args$predict, type = "predict"))

This is the format we need to use the mungebit and mungepiece constructors.

                do.call(create_mungepiece, c(args[[1L]], args[[2]]))
}

We perform type dispatch to support accepting both a function and a list(train_or_predict_function, ...) when we need a non-zero number of train or predict arguments.

              parse_mungepiece_dual_chunk <- function(args, type) {
  UseMethod("parse_mungepiece_dual_chunk")
}

This route would have been called for the “train” side if we had done

parse_mungepiece(list(train = identity, predict = list(foo, "1")))

We interpret train = identity to mean train = list(identity), (i.e. run a mungebit with identity as its train function and no train_args on the mungepiece).

              parse_mungepiece_dual_chunk.function <- function(args, type) {
  list(args, list())
}

parse_mungepiece_dual_chunk.mungebit <- function(args, type) {
  list(to_function(args, type), list())
}

parse_mungepiece_dual_chunk.mungepiece <- function(args, type) {
  list(to_function(args, type),
       if (type == "train") args$train_args() else args$predict_args())
}

In the above example, the predict side would be parsed through this route.

              parse_mungepiece_dual_chunk.list <- function(args, type) {

If there are no unnamed arguments, it violates our convention, since we are unable to determine the train/predict function.

                if (unnamed_count(args) == 0) {
    stop(m("parse_mungepiece_dual_error_unnamed", type = type))
  }

  fn_index <- unnamed(args)[1L]
  fn       <- args[[fn_index]]
  
  if (!is.convertible_to_function(fn)) {
    stop(m("parse_mungepiece_dual_error_nonfunction", type = type,
           class = class(fn)[1L]))
  }
  fn <- to_function(fn, type)

Otherwise, we extract a list of function and additional arguments, whether for train or for predict.

                list(fn, args[-fn_index])
}

If none of the below two formats were passed, we error.

parse_mungepiece(list(train = identity, predict = list(foo, "1")))

              parse_mungepiece_dual_chunk.default <- function(args, type) {
  stop(m("parse_mungepiece_dual_error_type", type = type, 
         class = class(args)[1L]))
}

We will use this parsing strategy if we have an unnamed element, as in:

# Using parse_mungepiece_simple
parse_mungepiece(list(train_fn, ...))

# Using parse_mungepiece_hybrid
parse_mungepiece(list(list(train_fn, predict_fn), ...))

              parse_mungepiece_single <- function(args) {
  fn_index  <- unnamed(args)[1L]

Extract the first unnamed element and use it as the train/predict function.

                fn <- args[[fn_index]]

We can substitute a function with a mungebit or mungepiece, since its underlying train or predict function can be extracted.

                if (is.convertible_to_function(fn)) {
    parse_mungepiece_simple(args[-fn_index], fn)
  } else {
    parse_mungepiece_hybrid(args[-fn_index], fn)
  }
}

parse_mungepiece_simple <- function(args, func) {

There is no real work to be done in the simple case when we call parse_mungepiece(list(train_fn, ...)).

                if (is.mungebit(func) || is.mungepiece(func)) {
    create_mungepiece(train_function   = to_function(func, "train"),
                      predict_function = to_function(func, "predict"),
                      train_args       = args)
  } else {
    create_mungepiece(train_function = func, train_args = args)
  }
}

parse_mungepiece_hybrid <- function(args, funcs) {

If we called parse_mungepiece(list(list(train_fn, predict_fn), ...)), we have to check that the pair in the first element is valid. It must consist of two functions, one of which (but not both) may be NULL. The functions could also be substituted with mungebits, in which the train or predict function (depending on whether the mungebit is on the left or right hand side) will be extracted. This is not an encouraged format but is implemented for convenience.

                if (!is.acceptable_hybrid_pair(funcs)) {
    stop(m("parse_mungepiece_hybrid_error"))
  }

We will avoid passing train or predict arguments, respectively, to the mungepiece constructor if there is no train or predict function, respectively, to avoid bloating the mungepiece object.

                dual_args <- list(if (!is.null(funcs[[1L]])) args else list(), 
                    if (!is.null(funcs[[2L]])) args else list())

We use the to_function helper to extract the train / predict function in case the user passed a mungebit.

                funcs <- list(to_function(funcs[[1L]], "train"),
                to_function(funcs[[2L]], "predict"))

  create_mungepiece(funcs[[1]], funcs[[2]], dual_args[[1]], dual_args[[2]])
}

To support backwards-compatibility with legacy mungebits, we allow creation of both legacy and new mungebits using the munge method.

              create_mungepiece <- function(train_function, predict_function = train_function,
                              train_args = list(), predict_args = train_args, legacy) {
  missing_legacy <- missing(legacy)
  is.invalid_pair <- function(fn1, fn2) {
    if (!missing_legacy) FALSE
    else is.legacy_mungebit_function(fn1) &&
      !is.null(fn2) && !is.legacy_mungebit_function(fn2)
  }

If we are creating a legacy mungebit, both the train and predict function must be decorated with the "legacy_mungebit_function" class.

                if (is.invalid_pair(train_function, predict_function) ||
      is.invalid_pair(predict_function, train_function)) {
    stop("Cannot mix new and legacy mungebit train or predict functions.")
  } else if ((!missing_legacy && isTRUE(legacy)) ||
             is.legacy_mungebit_function(train_function) ||
             is.legacy_mungebit_function(predict_function)) {
    ensure_legacy_mungebits_package()
    getFromNamespace("mungepiece", "mungebits")$new(
      getFromNamespace("mungebit", "mungebits")$new(train_function, predict_function),
      train_args, predict_args
    )
  } else {
    mungepiece$new(mungebit$new(train_function, predict_function),
                   train_args, predict_args)
  }
}

to_function <- function(func, type) {
  UseMethod("to_function")
}

to_function.mungepiece <- function(func, type) {
  to_function(func$mungebit(), type)
}

to_function.mungebit <- function(func, type) {
  if (is.legacy_mungebit(func)) {
    if (type == "train") {
      func$train_function
    } else {
      func$predict_function
    }
  } else {
    if (type == "train") {
      func$train_function()
    }
    else {
      func$predict_function()
    }
  }
}

to_function.default <- function(func, type) {
  func
}

is.convertible_to_function <- function(obj, null_ok = FALSE) {
  is.function(obj) || is.mungebit(obj) || is.mungepiece(obj) ||
  (isTRUE(null_ok) && is.null(obj))
}

is.acceptable_hybrid_pair <- function(funcs) {

funcs must consist of two functions, one of which (but not both) may be NULL.

                is.list(funcs) && length(funcs) == 2 &&
  is.convertible_to_function(funcs[[1L]], null_ok = TRUE) && 
  is.convertible_to_function(funcs[[2L]], null_ok = TRUE) && 
  !(is.null(funcs[[1L]]) && is.null(funcs[[2L]]))
}

print.R

Elegant printing is hard work! This file contains some helpers to make outputting mungebits objects beautiful.

              # Print a `mungepiece` object.
print_mungepiece <- function(x, ...) {
  if (is.legacy_mungepiece(x)) {
    cat(crayon::blue$bold("Legacy mungepiece"))
    return()
  }
  cat(crayon::blue$bold("Mungepiece"), "with:\n")
  if (length(x$train_args()) > 0 && identical(x$train_args(), x$predict_args())) {
    print_args(x$train_args(), "train and predict", "green", ...)
  } else {
    if (length(x$train_args()) > 0) {
      print_args(x$train_args(), "train", "green", ...)
    }
    if (length(x$predict_args()) > 0) {
      print_args(x$predict_args(), "predict", "green", ...)
    }
  }
  print(x$mungebit(), ..., indent = 1L, prefix2 = "* ")
}

print_args <- function(args, type, color, ..., full = FALSE, label = "arguments") {

A dynamic way to fetch the color palette from the crayon package.

                style <- getFromNamespace(color, "crayon")$bold
  cat(sep = "", "  * ", style(paste(type, label)), ":\n")
  max_lines <- if (isTRUE(full)) Inf else 5L
  cat(crayon::silver(deparse2(args, max_lines = max_lines, indent = 3L)), "\n")
}

# Print a `mungebit` object.
print_mungebit <- function(x, ..., indent = 0L, prefix2 = "", show_trained = TRUE, full = FALSE) {
  if (is.legacy_mungebit(x)) {
    cat(crayon::blue$bold("Legacy mungebit"))
    return()
  }
  prefix <- paste(rep("  ", indent), collapse = "")
  trained <- function() {
    if (isTRUE(show_trained)) {
      paste0(" (", (if (x$trained()) crayon::green$bold("trained")
                    else  crayon::red$bold("untrained")), ") ")
    } else " "
  }
  cat(sep = "", prefix, prefix2, crayon::green("Mungebit"), trained(), "with",
      if (x$nonstandard()) " nonstandard evaluation", ":\n")
  if (length(x$input()) > 0) {
    cat(sep = "", prefix, "  * ", crayon::magenta$bold("input"), ": \n")
    max_lines <- if (isTRUE(full)) Inf else 5L
    cat(crayon::silver(deparse2(x$input(), max_lines = max_lines, indent = indent + 2L)), "\n")
  }
  if (isTRUE(all.equal(x$train_function(), x$predict_function()))) {
    print_mungebit_function(x$train_function(), "train and predict",
                            "green", indent + 1L, ..., full = full)
  } else {
    print_mungebit_function(x$train_function(),   "train",   "green",  indent + 1L, ..., full = full)
    print_mungebit_function(x$predict_function(), "predict", "yellow", indent + 1L, ..., full = full)
  }
}

print_mungebit_function <- function(fn, type, color, indent, ..., full = FALSE) {
  prefix <- paste(rep("  ", indent), collapse = "")

A dynamic way to fetch the color palette from the crayon package.

                style <- getFromNamespace(color, "crayon")$bold
  if (is.null(fn)) {
    cat(sep = "", prefix, "* ", style(paste0("No ", type, " function.")), "\n")
  } else {
    cat(sep = "", prefix, "* ", style(paste0(type, " function")), ":\n")
    if (methods::is(fn, "transformation")) {

We delegate the printing work to the transformation.

                    print(fn, indent = indent, ..., full = isTRUE(full))
    } else {
      max_lines <- if (isTRUE(full)) Inf else 5L
      cat(crayon::silver(function_snippet(fn, max_lines = max_lines, indent = indent + 1L)), "\n")
    }
  }
}

This is the general helper used to print both column_transformation and multi_column_transformation objects.

              print_transformation <- function(x, ..., indent = 0L, full = FALSE,
                                        byline = "Transformation") {
  prefix <- paste(rep("  ", indent), collapse = "")
  cat(sep = "", prefix, crayon::yellow(byline),
      if (isTRUE(environment(x)$nonstandard)) " with non-standard evaluation", ":")

A little helper function to convert the function x into a character vector of length 1.

                snippet <- function(full. = full) {
    function_snippet(unclass(get("transformation", envir = environment(x))),
                     indent = indent + 1L,
                     max_lines = if (isTRUE(full.)) Inf else 5L)
  }

If the snippet generated by trimming long bodies does not equal the snippet generated by printing the full function, show the user how to print the full body (by passing full = TRUE to print).

                if (!isTRUE(full) && !identical(snippet(FALSE), snippet(TRUE))) {
    cat(" use", crayon::bold("print(fn, full = TRUE)"), "to show full body)")
  }

  cat(sep = "", "\n", crayon::silver(snippet()), "\n")
}

A helper function to turn functions into their string representations for convenient printing.

              function_snippet <- function(fn, indent = 0L, max_lines = 5L) {
  prefix <- paste(rep("  ", indent), collapse = "")

Note that utils::head will convert the function to a string for us. We use this to get a character representation of the formals of the function along with its body.

                str_fn <- as.character(utils::head(fn, 10000))

However, utils::head uses four spaces per tab instead of two.

                str_fn <- gsub("    ", "  ", str_fn)

  if (!is.call(body(fn)) || !identical(body(fn)[[1L]], as.name("{"))) {

If the function does not have braces { surrounding its body, squish the last two lines into a single line, so e.g., function(x)\n x becomes function(x) x.

                  str_fn[length(str_fn) - 1] <-
      c(paste(str_fn[seq(length(str_fn) - 1, length(str_fn))], collapse = ""))
    str_fn <- str_fn[seq_len(length(str_fn) - 1)]

If the function body spills over, trim it.

                  if (length(str_fn) > max_lines + 1) {
      str_fn <- c(str_fn[seq_len(max_lines)], paste0("..."))
    }
  } else {

Squish the { onto a single line.

                  braces <- str_fn == "{"

Note the first line can never be just { since it is the formals of the function.

                  str_fn[which(braces) - 1L] <- vapply(str_fn[which(braces) - 1L],
      function(s) paste0(s, "{"), character(1))
    str_fn <- str_fn[!braces]

If the function body spills over, trim it.

                  if (length(str_fn) > max_lines + 2) {
      str_fn <- c(str_fn[seq_len(max_lines)], "  # Rest of body...", "}")
    }
  }

  paste(vapply(str_fn, function(s) paste0(prefix, s), character(1)), collapse = "\n")
}

Instead of translating list(a = 1) to the rather overcumbersome string structure(list(a = 1), .Names = "a"), this helper will simply turn it to list(a = 1).

              deparse2 <- function(obj, collapse = "\n", indent = 0L, max_lines = 5L) {
  conn <- textConnection(NULL, "w")

Avoid printing unnecessary attributes.

                dput(obj, conn, control = c("keepNA", "keepInteger"))
  out <- textConnectionValue(conn)
  close(conn)

dput uses four-space instead of two-space tabs.

                out <- gsub("    ", "  ", out)
  prefix <- paste(rep("  ", indent), collapse = "")
  out <- vapply(out, function(s) paste0(prefix, s), character(1))
  if (length(out) > max_lines + 1L) {
    out <- c(out[seq_len(max_lines)], paste0(prefix, "..."))
  }
  paste(out, collapse = collapse)
}

standard_column_format.R

Let's say we called munge with

munge(data, list(column_transformation(function(x) 2 * x), 1:3))

This will double the first three columns. Alternatively, we could say

munge(data, list(column_transformation(function(x) 2 * x),
                 list(is.numeric, 1:3)))

where the innermost list serves as conjunction and says “of the first three columns, double those which are numeric.” The standard_column_format helper figures out this logic given the data set and the selection, e.g. 1:3 or list(is.numeric, 1:3).

              #' Converts a logical / numeric / character vector or a function
#' into a character vector of column names for a dataframe.
#'
#' If a function is provided, it will be applied to each column of
#' the dataframe and must return a logical; those with resulting value TRUE
#' will be returned as a character vector.
#'
#' @param cols a vector or function. If logical / numeric / character vector,
#'    it will attempt to find those columns matching these values. If \code{cols}
#'    is a function, it will apply this function to each column of the dataframe
#'    and return the names of columns for which it was \code{TRUE}. Additionally,
#'    \code{cols} can be a list of any combination of the above, which will 
#'    require all the conditions to be met.
#' @param dataframe a reference dataframe. Necessary for computing the
#'    column names if a numeric or logical vector is specified for \code{cols}.
#' @export
#' @examples
#' standard_column_format(c(1,5), iris)  # c('Sepal.Length', 'Species')
#' standard_column_format(c(TRUE,FALSE,FALSE,FALSE,TRUE), iris)  # c('Sepal.Length', 'Species')

This function is rather messy, but we cannot split up its body as it will be injected into functions generated using column_transformation, which should be portable even when the mungebits package is not installed (for example, if the user wishes to send their list of mungepieces to a friend, or install it in production).

              #' standard_column_format('Sepal.Length', iris)  # 'Sepal.Length'
#' standard_column_format(list(is.numeric, c(1,5)), iris)  # 'Sepal.Length'
#' # TODO: (RK) Explain except()
standard_column_format <- function(cols, dataframe) {

If no columns are provided, we assume we are running on the entire data set.

                if (missing(cols)) colnames(dataframe) 
  else {
    process1 <- function(subcols) {

Let's say we called munge with

munge(data, list(column_transformation(as.numeric), is.character))

to convert our remaining character columns to numerics. However, say we are performing a multi-class classification and expect the dependent variable, dep_var, to be character but not numeric. We can then include it as an exception:

munge(data, list(column_transformation(as.numeric),
                 list(except("dep_var"), is.character)))

If we had simply written except("dep_var"), it would mean “all variables except dep_var”, or if we had written except(is.character), it would mean “all variables except the character variables.”

                    if (methods::is(subcols, "except")) {
        unexcepted <- unexcept(subcols)
        if (!is.list(unexcepted)) unexcepted <- list(unexcepted)
        setdiff(colnames(dataframe), process(unexcepted))
      } else if (is.function(subcols)) {
        # Much faster than lapply here.
        colnames(dataframe)[local({
          ix <- logical(length(dataframe))
          if (is.element("name", names(formals(subcols)))) {
            for (i in seq_along(dataframe)) {
              ix[i] <- subcols(.subset2(dataframe, i), name = .subset2(colnames(dataframe), i))
            }
          } else {

The [ and [[ operator internally call out to .subset2, which references the actual C function and is thus faster (avoiding unnecessary checks).

                          for (i in seq_along(dataframe)) ix[i] <- subcols(.subset2(dataframe, i))
          }
          ix

If you scroll up a little, the section for is.function(subcols) applies to the notation

munge(data, list(column_transformation(as.numeric), is.character))

where we take the is.character function and apply it to each column. It must always return TRUE or FALSE and will only apply the column_transformation to columns satisfying the condition.

                      })]
      }
      else if (is.character(subcols)) force(subcols) 
      else if (is.list(subcols)) process(subcols)
      else colnames(dataframe)[subcols]
    }

    process <- function(xcols) {
      Reduce(intersect, lapply(xcols, process1))
    }

Lots of recursion tricks here! Even if I tried to explain what is going on, I'd likely fail, so just take this function home and study it. Usually, we'd break it up into many smaller pieces, but as mentioned before that would be inconvenient here since we must include it in full in column_transformations to ensure they are portable.

                  if (methods::is(cols, "except")) {
      setdiff(colnames(dataframe), process(list(unexcept(cols))))
    } else if (is.list(cols)) {
      process(cols)
    } else {
      process1(cols)
    }
  }
}

We use the "except" S3 class to tag any inputs to standard_column_format whose meaning should be negated (i.e., do not apply to these columns).

              #' Ignore during standard column format.
#'
#' @param x ANY. An R object.
#' @export
except <- function(x) {
  class(x) <- c("except", class(x))
  x
}

unexcept <- function(x) {
  class(x) <- setdiff(class(x), "except")
  x
}

utils.R

              # TODO: (RK) Document this file literately.

#' Merge two lists and overwrite latter entries with former entries
#' if names are the same.
#'
#' For example, \code{list_merge(list(a = 1, b = 2), list(b = 3, c = 4))}
#' will be \code{list(a = 1, b = 3, c = 4)}.
#' @param list1 list
#' @param list2 list
#' @return the merged list.
#' @export
#' @examples
#' stopifnot(identical(list_merge(list(a = 1, b = 2), list(b = 3, c = 4)),
#'                     list(a = 1, b = 3, c = 4)))
#' stopifnot(identical(list_merge(NULL, list(a = 1)), list(a = 1)))
list_merge <- function(list1, list2) {
  list1 <- list1 %||% list()
  # Pre-allocate memory to make this slightly faster.
  list1[Filter(function(x) nchar(x) > 0, names(list2) %||% c())] <- NULL
  for (i in seq_along(list2)) {
    name <- names(list2)[i]
    if (!identical(name, NULL) && !identical(name, "")) list1[[name]] <- list2[[i]]
    else list1 <- append(list1, list(list2[[i]]))
  }
  list1
}

`%||%` <- function(x, y) if (is.null(x)) y else x

is.acceptable_function <- function(x) {
  is.function(x) || 
  is.null(x)     ||
  is.mungebit(x)
}

is.simple_character_vector <- function(x) {
  is.character(x) && all(nzchar(x)) &&
  !any(is.na(x)) && length(x) > 0 &&
  length(unique(x)) == length(x)
}

# If an environment contains variables "a" and "b",
# create a list (a = quote(a), b = quote(b)).
env2listcall <- function(env) {
  names <- ls(env)
  if ("name_order" %in% names(attributes(env))) {
    names <- names[attr(env, "name_order")]
  }
  setNames(lapply(names, as.name), nm = names)
}

# Revert the operation in mungepiece initialization that turns a list
# into an environment.
env2list <- function(env) {
  if (length(ls(env)) == 0L) {
    list()
  } else {
    lst <- as.list(env)
    lst <- lst[match(names(lst), attr(env, "parsed_names"))]
    if (any(nzchar(attr(env, "initial_names")))) {
      names(lst) <- attr(env, "initial_names")
    } else {
      names(lst) <- NULL
    }
    lst
  }
}

make_env <- function(lst, parent = emptyenv()) {
  initial_names <- names(lst) %||% character(length(lst))
  names(lst) <- ifelse(unnamed(lst),
    paste0("_", seq_along(lst)),
    paste0("_", initial_names)
  )

  if (anyDuplicated(names(lst))) {
    stop("Cannot accept lists with duplicate names")
  }

  if (length(lst) == 0) {
    env <- new.env(parent = parent)
  } else {
    env <- list2env(lst, parent = parent)
  }

  name_order <- match(names(lst), ls(env))
  attr(env, "name_order")    <- name_order
  attr(env, "initial_names") <- initial_names
  attr(env, "parsed_names")  <- names(lst)
  env
}

list2env_safe <- function(lst, ...) {
  if (length(lst) > 0L) {
    list2env(lst)
  } else {
    new.env(...) 
  }
}

unnamed <- function(el) {
  "" == (names(el) %||% character(length(el)))
}

unnamed_count <- function(el) {
  sum(unnamed(el))
}

To ensure backwards compatibility with legacy mungebits, we perform nothing in many cases if the piece is not an R6 object (and hence a new mungepiece in the style of this package).

              #' Whether a mungepiece is a legacy mungepiece (from the mungepieces package).
#'
#' @param x ANY. An R object to test.
#' @export
#' @return TRUE or FALSE according as the mungepiece is a legacy mungepiece.
is.legacy_mungepiece <- function(x) {
  methods::is(x, "mungepiece") && !methods::is(x, "R6")
}

#' Whether a mungebit is a legacy mungebit (from the mungebits package).
#'
#' @param x ANY. An R object to test.
#' @export
#' @return TRUE or FALSE according as the mungebit is a legacy mungebit.
is.legacy_mungebit <- function(x) {
  methods::is(x, "mungebit") && !methods::is(x, "R6")
}

#' Whether a train or predict function is a legacy function (from the mungebits package).
#'
#' Note that only functions constructed by the \code{munge} helper
#' will be identifiable using this method.
#'
#' @param x ANY. An R object to test.
#' @export
#' @return TRUE or FALSE according as the mungebit is a legacy train or
#'    predict function, determined using the \code{"legacy_mungebit_function"}
#"    class.
is.legacy_mungebit_function <- function(x) {
  methods::is(x, "legacy_mungebit_function")
}

ensure_legacy_mungebits_package <- function() {
  if (!requireNamespace("mungebits", quietly = TRUE)) {
    stop("The legacy mungebits package is required to create legacy mungebits.")
  }
}

#' Tag a function as a legacy mungebit function.
#'
#' @param x function. An R function to tag.
#' @return \code{x} with additional class "legacy_mungebit_function".
as.legacy_function <- function(x) {
  class(x) <- c("legacy_mungebit_function", class(x))
  x
}

#' Determine whether or not a given object is a transformation.
#'
#' Transformations can be either column or multi column transformations.
#'
#' @param x ANY. R object to test.
#' @return \code{TRUE} or \code{FALSE} according as it is a transformation.
#' @export
is.transformation <- function(x) {
  inherits(x, "transformation")
}

zzz.R

In R 3.1, calling a function like

function(data) {
  data[[1]] <- 2 * data[[1]]
  data
}

will create a copy of the first column of data. Before R 3.1, it will create a copy of the entire dataset, even if it has thousands of other columns. This affects the performance of train and predict functions in mungebit objects. There is a workaround in the mungebits package at the expense of using non-standard evaluation and making every train and predict function look like

function(data) {
  eval.parent(substitute({ # Evaluating in the calling environment
                           # prevents creation of a copy.
    data[[1]] <- 2 * data[[1]]
    data
  })
}

Gross!

              .onLoad <- function(libPath, pkg) {
  if (as.package_version(R.version) < as.package_version("3.1.0")) {
    packageStartupMessage(crayon::red(paste0(
      "Using the mungebits2 package with R version < 3.1 will result ",
      "in dramatic performance slowdowns: use the mungebits package instead ",
      "(https://github.com/robertzk/mungebits)"
    )))
  }
}

If mungebits is attached after mungebits2, it will overwrite many functions from mungebits2, since they share the same name. This lack of compatible namespacing is a design flaw in R, but for now we just alert the user.

This trick was shamelessly borrowed / stolen from dplyr. :)

              .onAttach <- function(libname, pkgname) {
  setHook(packageEvent("mungebits", "attach"), function(...) {
    packageStartupMessage(crayon::red$bold(paste0(
      "You have loaded mungebits after mungebits2 - ",
      "this is likely to cause problems.\nIf you need functions from both ",
      "mungebits and mungebits2 (which is unlikely), please load mungebits first, ",
      "then mungebits2:\nlibrary(mungebits); library(mungebits2)")))
  })
}