- Define model objects
- Learn about methods specific to model objects
- Apply a model to a DatasetExperiment object
Module 6 Model objects
6.1 The model template
The model template defined by struct is the fundamental building block of data processing workflows using struct. The model template itself isn’t used directly; it is extended by model-specific templates. It defines the methods and slots common to all models so that they can function together in a workflow.
For example, in the last module we introduced the PCA object. It is a model object, which itself is derived from the struct_class template. Because of this, all struct_class and model methods work for the PCA object. By extending the model template, the PCA object provides PCA-specific input parameters, such as number_components, and overrides default methods (such as model_train; see next section) to apply PCA-specific calculations.
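The idea of a template that is extended by specific models can be sketched with plain S4 classes from base R. This is only an illustration of the pattern, not struct's actual code: the names toy_model, toy_scaler, toy_train and toy_predicted are hypothetical and do not exist in struct.

```r
library(methods)

# Hypothetical template: slots shared by every toy model
setClass("toy_model", slots = c(name = "character", pred = "matrix"))

# Generic with a default method that derived models are expected to override
setGeneric("toy_train", function(M, D) standardGeneric("toy_train"))
setMethod("toy_train", "toy_model", function(M, D) M)

# Defined once on the template, inherited unchanged by every derived model
setGeneric("toy_predicted", function(M) standardGeneric("toy_predicted"))
setMethod("toy_predicted", "toy_model", function(M) M@pred)

# A derived model adds its own input parameter and overrides toy_train
setClass("toy_scaler", contains = "toy_model", slots = c(center = "logical"))
setMethod("toy_train", "toy_scaler", function(M, D) {
  # model-specific calculation: optionally mean-centre each column
  M@pred <- scale(D, center = M@center, scale = FALSE)
  M
})

S <- new("toy_scaler", name = "scaler", center = TRUE)
S <- toy_train(S, as.matrix(iris[, 1:4]))
head(toy_predicted(S), 3)
```

Because toy_scaler contains toy_model, the inherited toy_predicted method works on it without being redefined, which mirrors how the PCA object can use every model method.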
6.1.1 model object methods
The model template defines four key methods used by all models:
- The model_train method. This method allows you to train a model using an input dataset. For example, we might train a PCA model using the iris dataset. Or we might calculate a threshold based on some training data. Some output slots of a model object will be populated after model_train is used.
- The model_predict method. This method allows you to apply a trained model to a second dataset. For the PCA example this means projecting the new data on to the existing model (instead of training a new one). Or for the second example we might remove samples below the computed threshold. Some output slots of a model object will be populated after model_predict is used.
- The model_apply method. This method applies model_predict immediately after model_train to the same data (sometimes called “autoprediction”). For the PCA example we might use this if we want to explore the training data in more detail. For the second example we might want to filter our training data based on the threshold we calculated.
- The predicted method. This method returns the pred slot of a model object. This is the default output of the object when connecting models in a workflow (we will discuss this further in a future section).
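The difference between training, predicting and autoprediction can be sketched without struct, using prcomp from base R. The helper names pca_train, pca_predict and pca_apply are hypothetical stand-ins for the roles played by model_train, model_predict and model_apply; they are not how struct implements them.

```r
# "train": estimate centring and loadings from the training data only
pca_train <- function(D, k) prcomp(D, center = TRUE, scale. = FALSE, rank. = k)

# "predict": project data on to an existing model instead of fitting a new one
pca_predict <- function(fit, D) predict(fit, newdata = D)

# "apply": train, then immediately predict on the same data (autoprediction)
pca_apply <- function(D, k) pca_predict(pca_train(D, k), D)

X <- as.matrix(iris[, 1:4])
train_rows <- seq(1, nrow(X), by = 2)   # every second sample as training data

fit <- pca_train(X[train_rows, ], k = 2)
new_scores <- pca_predict(fit, X[-train_rows, ])   # scores for unseen samples
auto_scores <- pca_apply(X[train_rows, ], k = 2)   # scores for training samples

# autoprediction reproduces the scores computed during training
same <- isTRUE(all.equal(unname(auto_scores), unname(fit$x)))
```

The held-out samples are projected using the centring and loadings estimated from the training samples, so predicting never changes the fitted model.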
6.1.2 Applying models
To use any of the first three methods above we have to provide a model and a dataset. For example, to apply PCA to the Iris dataset:
# prepare data
DE = iris_DatasetExperiment()
# prepare model
M = PCA(number_components = 2)
# apply model M to data DE
M = model_apply(M,DE)
The PCA model has extended the model template by providing PCA-specific input and output parameters, and by providing model_train and model_predict methods that apply PCA to the input data when the input model to those methods is a PCA model.
The data must be in DatasetExperiment format, and the model must be based on the model template for the model_apply method to work.
To use the predicted method we only need to provide the model we want the results from. For example, calling predicted(M) on the PCA model above prints:
A "DatasetExperiment" object
----------------------------
name:
description:
data: 150 rows x 2 columns
sample_meta: 150 rows x 1 columns
variable_meta: 2 rows x 1 columns
6.2 Exercise
Apply PCA to the MTBLS79 Dataset
In this exercise you will create a PCA object and apply it to the MTBLS79 dataset.
PCA doesn't support missing values, so we need to impute (estimate) them, which we can do using the knn_impute model.
- Import the filtered MTBLS79 dataset into your environment
- Create a knn_impute model with 5 neighbours and apply it to the MTBLS79 data
- Create a PCA model with 8 components and apply it to the imputed MTBLS79 data
- Print a summary of the computed PCA scores to the console
- How would the PCA scores compare using model_train and model_predict instead of model_apply?
- see ?knn_impute for help using this object
- see ?predicted
We can import this data exactly as we did in Module 5, with the filtered = TRUE input.
The knn_impute object uses the model template, so we can apply it using model_apply. We can specify the number of neighbours; other parameters will use the default values unless provided.
We can create the PCA model exactly as we did in Module 4. We apply it to the predicted output of the knn_impute object, which contains the imputed data.
The scores slot of the PCA model is a DatasetExperiment object, so we can use the show method to summarise it on the console.
A "DatasetExperiment" object
----------------------------
name:
description:
data: 172 rows x 8 columns
sample_meta: 172 rows x 7 columns
variable_meta: 8 rows x 1 columns
The model_apply method is shorthand for using model_train followed immediately by model_predict, so the PCA scores will be identical.
# train the model
N = model_train(P,predicted(K)) # NB stored in a new variable
# use the *trained model N* to get predictions for training data
N = model_predict(N,predicted(K))
# check all scores are identical to using model_apply
all(P$scores$data == N$scores$data) # expect TRUE
[1] TRUE