Module 6 Model objects

Define model objects
Learn about methods specific to model objects
Apply a model to a DatasetExperiment object

6.1 The `model` template

The model template defined by struct is the fundamental building block of data processing workflows using struct. The model template itself isn’t used directly; it is extended by model-specific templates. It defines the methods and slots common to all models so that they can function together in a workflow.

For example, in the last module we introduced the PCA object. It is a model object, which itself is derived from the struct_class template. Because of this, all struct_class and model methods work for the PCA object. By extending the model template, the PCA object provides PCA-specific input parameters, such as number_components, and overrides default methods (such as model_train; see next section) to apply PCA-specific calculations.
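To make the idea of extending a template concrete, here is a minimal sketch using R's S4 class system. It is not struct's actual source: the class names `toy_model` and `toy_centre` and the generic `toy_train` are hypothetical, chosen only to show how a derived object can inherit from a base template and override a default method.

```r
library(methods)

# Hypothetical base "template" class with slots shared by all toy models
setClass("toy_model", representation(name = "character", pred = "list"))

# Generic with a default method that derived classes are expected to override
setGeneric("toy_train", function(M, X) standardGeneric("toy_train"))
setMethod("toy_train", "toy_model", function(M, X) {
  stop("toy_train is not implemented for the base template")
})

# Hypothetical derived class: adds its own slot for a learned parameter
setClass("toy_centre", contains = "toy_model",
         representation(centre = "numeric"))

# Override the default with a calculation specific to the derived class
setMethod("toy_train", "toy_centre", function(M, X) {
  M@centre <- colMeans(X)
  M
})

M <- new("toy_centre", name = "mean centring")
M <- toy_train(M, as.matrix(iris[, 1:4]))

M@centre             # column means learned from the training data
is(M, "toy_model")   # the derived object is still a toy_model
```

Because `toy_centre` contains `toy_model`, any method written for the base class also accepts the derived object. This is the same property that lets struct_class and model methods work on a PCA object.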

6.1.1 `model` object methods

The model template defines four key methods used by all models:

The model_train method. This method allows you to train a model using an input dataset. For example, we might train a PCA model using the iris dataset. Or we might calculate a threshold based on some training data. Some output slots of a model object will be populated after model_train is used.
The model_predict method. This method allows you to apply a trained model to a second dataset. For the PCA example this means projecting the new data onto the existing model (instead of training a new one). For the threshold example we might remove samples below the computed threshold. Some output slots of a model object will be populated after model_predict is used.
The model_apply method. This method applies model_predict immediately after model_train to the same data (sometimes called “autoprediction”). For the PCA example we might use this if we want to explore the training data in more detail. For the second example we might want to filter our training data based on the threshold we calculated.
The predicted method. This method returns the pred slot of a model object. This is the default output of the object when connecting models in a workflow (we will discuss this further in a future section).
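The difference between training, predicting and autoprediction can be sketched outside struct with base R. The example below uses `prcomp` and its `predict` method as a stand-in for a PCA model object. The odd/even split of the iris rows is an arbitrary choice for illustration, and none of this is meant to show how struct implements PCA internally.

```r
# Base R illustration only: prcomp stands in for a struct PCA model
X <- as.matrix(iris[, 1:4])
train_idx <- seq(1, nrow(X), by = 2)   # odd rows act as "training" data
X_train <- X[train_idx, ]
X_new   <- X[-train_idx, ]             # even rows act as a second dataset

# "train": estimate centring and loadings from the training data only
fit <- prcomp(X_train, center = TRUE, scale. = FALSE)

# "predict": project the second dataset onto the existing loadings
scores_new <- predict(fit, newdata = X_new)[, 1:2]

# "apply" (autoprediction): project the training data onto its own model
scores_train <- predict(fit, newdata = X_train)[, 1:2]

dim(scores_new)
dim(scores_train)
```

In struct the same three steps correspond to model_train, model_predict and model_apply, with the results stored in the model object's output slots rather than in separate variables.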

6.1.2 Applying models

To use any of the first three methods above we have to provide a model and a dataset. For example, to apply PCA to the Iris dataset:

```
# prepare data
DE = iris_DatasetExperiment()

# prepare model
M = PCA(number_components = 2)

# apply model M to data DE
M = model_apply(M,DE)
```

The PCA model has extended the model template by providing PCA-specific input and output parameters, and by providing model_train and model_predict methods that apply PCA to the input data when the input model to those methods is a PCA model.

The data must be in DatasetExperiment format, and the model must be based on the model template for the model_apply method to work.

To use the predicted method we only need to provide the model we want the results from.

```
predicted(M)
```

```
A "DatasetExperiment" object
----------------------------
name:          
description:   
data:          150 rows x 2 columns
sample_meta:   150 rows x 1 columns
variable_meta: 2 rows x 1 columns
```

6.2 Exercise

Apply PCA to the MTBLS79 Dataset

In this exercise you will create a PCA object and apply it to the MTBLS79 dataset. PCA doesn't support missing values, so we need to impute (estimate) them, which we can do using the knn_impute model.

Import the filtered MTBLS79 dataset into your environment.
Create a knn_impute model with 5 neighbours and apply it to the MTBLS79 data.
Create a PCA model with 8 components and apply it to the imputed MTBLS79 data.
Print a summary of the computed PCA scores to the console.
How would the PCA scores compare using model_train and model_predict instead of model_apply?

Hints

See ?knn_impute for help using this object.
See ?predicted.

Solutions

We can import this data exactly as we did in Module 5, with the filtered = TRUE input.
```
# import data
DE = MTBLS79_DatasetExperiment(filtered = TRUE)
```
The knn_impute object uses the model template, so we can apply it using model_apply. We can specify the number of neighbours; other parameters will use their default values unless provided.
```
# prepare the model
K = knn_impute(neighbours = 5)

# apply model K to data DE
K = model_apply(K,DE)
```
We can create the PCA model exactly as we did in Module 4. We apply it to the predicted output of the knn_impute object, which contains the imputed data.
```
# prepare PCA model
P = PCA(number_components = 8)

# apply P to imputed data
P = model_apply(P,predicted(K))
```

The scores slot of the PCA model is a DatasetExperiment object, so we can use the show method to summarise it on the console.

```
show(P$scores)
```

```
A "DatasetExperiment" object
----------------------------
name:          
description:   
data:          172 rows x 8 columns
sample_meta:   172 rows x 7 columns
variable_meta: 8 rows x 1 columns
```

The model_apply method is shorthand for using model_train followed immediately by model_predict, so the PCA scores will be identical.

```
# train the model
N = model_train(P,predicted(K)) # NB stored in a new variable

# use the *trained model N* to get predictions for training data
N = model_predict(N,predicted(K))

# check all scores are identical to using model_apply
all(P$scores$data == N$scores$data) # expect TRUE
```

```
[1] TRUE
```
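
The exact `==` comparison above returned TRUE, but when comparing floating point results a tolerance-based check is often the safer habit. As a hedged base R analogue (using `prcomp` rather than the struct PCA object), the sketch below checks that the scores stored at training time match the scores obtained by projecting the same data onto the fitted model, using `all.equal`.

```r
# Base R illustration only: prcomp stands in for a struct PCA model
X <- as.matrix(iris[, 1:4])

# fit PCA; prcomp stores the training scores in fit$x
fit <- prcomp(X, center = TRUE, scale. = FALSE)

# project the same data onto the fitted loadings
projected <- predict(fit, newdata = X)

# all.equal allows for small rounding differences, unlike ==
same <- isTRUE(all.equal(fit$x, projected))
same
```

The same pattern could be applied to `P$scores$data` and `N$scores$data` if an exact comparison ever proves too strict.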
