- Learn how a data processing workflow is defined in the context of this course
- Learn about templates and objects defined by the struct package
- Interact with some of the base objects/templates from the struct package
Module 4 Workflows, templates and objects
4.1 What is a data processing workflow?
There are many definitions of a workflow, so it's useful to be specific about what we mean by a workflow in this course.
Here, a workflow is a sequence of steps, or modules, where each step carries out a different data processing operation.
For example, one step in the workflow might be to normalise the data, another to scale it. The workflow steps can be connected in any order, and the data flows into one step, is processed, and then the output is used as an input to the next step. For this example, the raw data would be first normalised, and the normalised data would then be scaled.
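The normalise-then-scale example can be sketched in plain R, without struct. This is only an illustration: the function names normalise_rows and scale_columns and the toy data are hypothetical, and the particular normalisation (dividing each sample by its total) is an assumption chosen for simplicity.

```r
# Hypothetical step 1: normalise each sample (row) so its values sum to 1
normalise_rows <- function(x) {
  x / rowSums(x)
}

# Hypothetical step 2: scale each feature (column) to zero mean and unit variance
scale_columns <- function(x) {
  scale(x, center = TRUE, scale = TRUE)
}

# Toy raw data: 3 samples (rows) by 2 features (columns)
raw <- matrix(c(1, 2, 3, 3, 2, 5), nrow = 3)

# The output of one step becomes the input of the next step
normalised <- normalise_rows(raw)
scaled <- scale_columns(normalised)
```

Swapping the two calls would give a different result, which is why the order of steps is part of the workflow definition.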
Your workflow
- What data processing steps do you think you might need for your workflow?
- What statistical analysis might you like to include as part of your workflow?
- Quality filtering, normalisation, scaling, transformation, imputation
- Univariate significance tests, multivariate visualisation
4.2 Templates and Objects
As you will already have seen, there is a large number of R packages. This is one of the great strengths of R. However, when writing code to carry out a number of workflow steps it can be difficult to integrate all of the required packages together in a way that is transparent and reproducible.
The struct package tries to address this by defining a number of “templates”. These templates define what the data structure should be, how data will flow in and out of a workflow step, and which output is transferred to the input of the next object.
In the figure below the green boxes represent workflow steps. Data flows in and out of the workflow steps before finally generating some charts. Each workflow step uses the same basic template, but applies a different data processing step by overriding the template defaults.

None of the templates in struct implement any data processing steps; they only define what a step should look like. The idea is that other packages can implement their code inside a template to make it compatible with other workflow steps. For example, the structToolbox package uses the struct templates to wrap processing steps from other packages as workflow steps. All workflow steps are then compatible with each other, so you no longer need to worry about how to implement the steps and can focus on which steps to use and in what order.
4.2.1 Template slots
Slots, or fields, of a template are how struct defines the content of a template. All struct templates have a name and a description slot. They also have a citation slot that contains a list of relevant literature in BibTeX format, a libraries slot naming any additional packages needed to use the object, and an ontology slot that provides links to FAIR descriptions of the method(s) implemented by the template.
struct also provides a mechanism to flexibly include additional slots in the form of input parameters params and output parameters outputs. Input and output slots are specific to the method being implemented by the template.
For example, a “Principal Component Analysis” template might include a “number_components” slot for the user to specify how many components to use in the computation.
The results of a computation will be put in output slots; until the object is used with some data the output slots will be empty. For the PCA example, the computed component scores might be stored in the scores output slot.
You will find out how to apply computations to a DatasetExperiment later in the course.
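As a rough illustration of input and output slots, the sketch below lists the slot names of a PCA object. It assumes the structToolbox package is installed (loading it also attaches struct) and uses the param_ids and output_ids functions from struct.

```r
library(structToolbox)  # assumed to be installed; also attaches struct

# Create a PCA object with default settings
M <- PCA()

# Names of the input parameter slots
param_ids(M)

# Names of the output slots (these hold no results until the object is used with data)
output_ids(M)
```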
4.2.2 Template methods
As well as the templates, struct also provides a number of “methods” for each template. Methods are functions that change how they work depending on the input template. Some methods defined by struct enable easy access to the inputs and outputs of the templates, while others, such as show, define what is displayed when the template is printed to the console.
struct defines a $ method that allows you to easily get and set slot values. You will see examples of this in the next section.
4.3 Working with Templates and Objects
All templates “inherit” functionality from parent templates. You can list the inherited templates for an object using the is function. As an example here we show the inherited templates of the PCA template.
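A call along the following lines should produce the listing shown next, assuming structToolbox is installed and loaded so that the PCA template is available:

```r
library(structToolbox)  # assumed to be installed; provides PCA and attaches struct

# List the templates that a PCA object inherits from
is(PCA())
```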
[1] "PCA" "model" "struct_class"
[4] "model_OR_model_seq" "model_OR_iterator"
All struct templates have struct_class as a parent, so the methods defined for this template will work for all struct templates. You will learn about some of the other templates in later modules.
You will never normally use the struct_class template directly, but we can use it here to demonstrate some of the features of a template. To initialise an instance of a template you use its name as a function call. When a template is initialised it becomes an “object”. You can print the details of an object to the console using the show command.
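A minimal sketch of these two steps, assuming the struct package is installed; the variable name S is an arbitrary choice:

```r
library(struct)  # assumed to be installed

# Initialise the template by calling its name as a function
S <- struct_class()

# Print the details of the object to the console
show(S)
```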
A "struct_class" object
-----------------------
name:
description:
We can access slots of the object using dollar notation. For example here we change the values of the name and description slots:
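For instance, assuming struct is installed and using S as an arbitrary variable name, the slots could be set like this:

```r
library(struct)  # assumed to be installed

# Start from an empty struct_class object
S <- struct_class()

# Set slot values using dollar notation
S$name <- "example name"
S$description <- "example description"

# Print the updated object
show(S)
```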
A "struct_class" object
-----------------------
name: example name
description: example description
You can also set slot values when initialising the object, e.g.:
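A sketch of setting the slots in the initialising call, again assuming struct is installed and using S as an arbitrary variable name:

```r
library(struct)  # assumed to be installed

# Provide slot values as named arguments when creating the object
S <- struct_class(
  name = "second example name",
  description = "second description"
)

# Print the object
show(S)
```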
A "struct_class" object
-----------------------
name: second example name
description: second description
This approach will be used a lot when creating workflows in later modules.
4.4 Exercise
Exploring the PCA object
In this exercise you will test your understanding of struct_class objects
using a PCA object, which is based on the struct_class template. You will find out
more about the PCA object later in the course.
- Create a new PCA object using the PCA() command.
- Print a summary of the object to the console.
- How many output parameters does the PCA object have?
- What kind of parameter is number_components?
- How many components are set by default?
- Change the number of components to 10.
- How do we initialise a PCA object with 8 components?
- What do you think will happen if you try to set number_components = "cake"? Why does this happen?
- make sure you have activated the structToolbox library
- methods that work for struct_class objects will work for any object based off of that template
We can initialise a PCA object in the same way that struct_class was initialised in the course text. Here we store it as M in the environment.
We can use the show method to print the object to the console.
A "PCA" object
--------------
name: Principal Component Analysis (PCA)
description: PCA is a multivariate data reduction technique. It summarises the data in a smaller number of Principal Components that maximise variance.
input params: number_components
outputs: scores, loadings, eigenvalues, ssx, correlation, that
predicted: that
seq_in: data
The console output indicates that the PCA object has 6 output slots.
The number_components slot is an input parameter.
We can retrieve the value of the number_components slot using dollar notation.
[1] 2
We can also set the value of the number_components slot using dollar notation.
We can set a value for a slot by specifying it inside the call that creates the object.
You will receive an error when trying to set number_components = "cake". This is because the PCA template requires that number_components is either a numeric or an integer. You can see this in the documentation (e.g. ?PCA).
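The steps of the solution might be written as follows. This is a sketch that assumes structToolbox is installed; the variable names M8, default_n and result are arbitrary, and the tryCatch wrapper is only there so that the script keeps running when the invalid assignment fails.

```r
library(structToolbox)  # assumed to be installed; also attaches struct

# Create a PCA object with default settings and print a summary
M <- PCA()
show(M)

# Retrieve the default number of components using dollar notation
default_n <- M$number_components
default_n

# Change the number of components to 10
M$number_components <- 10

# Initialise a second PCA object with 8 components
M8 <- PCA(number_components = 8)

# Try an invalid value; capture the error message instead of stopping
result <- tryCatch(
  {
    M$number_components <- "cake"
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(result)
```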