Module 6 Model objects

  • Define model objects
  • Learn about methods specific to model objects
  • Apply a model to a DatasetExperiment object

6.1 The model template

The model template defined by struct is the fundamental building block of data processing workflows built with struct. The template itself isn’t used directly; it is extended by model-specific templates. It defines the methods and slots common to all models so that they can function together in a workflow.

For example, in the last module we introduced the PCA object. It is a model object, and the model template is itself derived from the struct_class template. Because of this, all struct_class and model methods work for the PCA object. By extending the model template, the PCA object provides PCA-specific input parameters, such as number_components, and overrides default methods (such as model_train; see next section) to apply PCA-specific calculations.
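The inheritance described above can be checked directly from the console. The following is a minimal sketch, assuming the PCA object is provided by the structToolbox package and that the class names are "model" and "struct_class" (both are assumptions, not stated in this module):

```r
# Assumption: PCA comes from structToolbox; the templates come from struct
library(struct)
library(structToolbox)

# create a PCA model object
M = PCA(number_components = 2)

# a PCA object should also be a model and a struct_class object
is_model = is(M, 'model')
is_struct = is(M, 'struct_class')

# list the input parameters defined for the PCA object
ids = param_ids(M)
print(ids)
```

If the assumptions hold, number_components should appear among the listed parameter ids.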

6.1.1 model object methods

The model template defines four key methods used by all models:

  1. The model_train method. This method allows you to train a model using an input dataset. For example, we might train a PCA model using the iris dataset. Or we might calculate a threshold based on some training data. Some output slots of a model object will be populated after model_train is used.

  2. The model_predict method. This method allows you to apply a trained model to a second dataset. For the PCA example this means projecting the new data onto the existing model (instead of training a new one). For the threshold example we might remove samples below the computed threshold. Some output slots of a model object will be populated after model_predict is used.

  3. The model_apply method. This method applies model_train and then model_predict to the same data (sometimes called “autoprediction”). For the PCA example we might use this if we want to explore the training data in more detail. For the threshold example we might want to filter our training data based on the threshold we calculated.

  4. The predicted method. This method returns the pred slot of a model object. This is the default output of the object when connecting models in a workflow (we will discuss this further in a future section).
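The relationship between the first three methods can be sketched as follows. This is an illustrative example only, assuming PCA is loaded from structToolbox; it uses the iris data for both training and prediction so that the two-step route can be compared with model_apply:

```r
# Assumption: PCA comes from structToolbox; iris_DatasetExperiment from struct
library(struct)
library(structToolbox)

DE = iris_DatasetExperiment()

# two-step route: train the model, then predict using the same data
M1 = PCA(number_components = 2)
M1 = model_train(M1, DE)
M1 = model_predict(M1, DE)

# one-step route: model_apply trains and predicts in a single call
M2 = PCA(number_components = 2)
M2 = model_apply(M2, DE)

# compare the default outputs of the two routes
same = all.equal(predicted(M1)$data, predicted(M2)$data)
print(same)
```

In practice model_train and model_predict are most useful when the second dataset differs from the training data, for example when projecting test samples onto a model built from training samples.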

6.1.2 Applying models

To use any of the first three methods above we have to provide a model and a dataset. For example, to apply PCA to the Iris dataset:

# prepare data
DE = iris_DatasetExperiment()

# prepare model
M = PCA(number_components = 2)

# apply model M to data DE
M = model_apply(M,DE)

The PCA model extends the model template by providing PCA-specific input and output parameters, and by providing model_train and model_predict methods that apply PCA to the input data when the model passed to those methods is a PCA model.

For the model_apply method to work, the data must be in DatasetExperiment format and the model must be based on the model template.

To use the predicted method we only need to provide the model we want the results from.

predicted(M)
A "DatasetExperiment" object
----------------------------
name:          
description:   
data:          150 rows x 2 columns
sample_meta:   150 rows x 1 columns
variable_meta: 2 rows x 1 columns
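Because predicted returns a DatasetExperiment, its contents can be inspected like any other dataset. A minimal sketch, assuming PCA is loaded from structToolbox and that the data table is accessed with the $data accessor:

```r
# Assumption: PCA comes from structToolbox; iris_DatasetExperiment from struct
library(struct)
library(structToolbox)

# apply a two-component PCA model to the iris data
DE = iris_DatasetExperiment()
M = PCA(number_components = 2)
M = model_apply(M, DE)

# extract the data table from the default output of the model
P = predicted(M)
scores = P$data

# one row per sample and one column per component
print(dim(scores))
print(head(scores))
```

The dimensions should match the 150 rows x 2 columns reported in the printed summary above.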

6.2 Exercise

Apply PCA to the MTBLS79 Dataset

In this exercise you will create a PCA object and apply it to the MTBLS79 dataset. PCA doesn’t support missing values, so we need to impute (estimate) them, which we can do using the knn_impute model.

  1. Import the filtered MTBLS79 dataset into your environment.
  2. Create a knn_impute model with 5 neighbours and apply it to the MTBLS79 data.
  3. Create a PCA model with 8 components and apply it to the imputed MTBLS79 data.
  4. Print a summary of the computed PCA scores to the console.
  5. How would the PCA scores compare using model_train and model_predict instead of model_apply?