Module 4 Workflows, templates and objects

  • Learn how a data processing workflow is defined in the context of this course
  • Learn about templates and objects defined by the struct package
  • Interact with some of the base objects/templates from the struct package

4.1 What is a data processing workflow?

There are many definitions of a workflow, so it's useful to be specific about what we mean by a workflow in this course.

Here, a workflow is a sequence of steps, or modules, where each step implements a different data processing operation.

For example, one step in the workflow might be to normalise the data, another to scale it. The workflow steps can be connected in any order: data flows into one step, is processed, and the output is then used as the input to the next step. For this example, the raw data would first be normalised, and the normalised data would then be scaled.
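The normalise-then-scale example can be sketched in base R. This is only an illustration of data flowing from one step to the next and does not use struct; the toy data and the function names normalise_step and scale_step are made up for this sketch.

```r
# toy raw data: 3 samples (rows) by 2 features (columns); values are made up
raw = matrix(c(1, 3,
               2, 6,
               4, 4), nrow = 3, byrow = TRUE)

# step 1: normalise each sample so that its values sum to 1
normalise_step = function(x) x / rowSums(x)

# step 2: scale each feature to zero mean and unit standard deviation
scale_step = function(x) scale(x, center = TRUE, scale = TRUE)

# a workflow is an ordered list of steps
workflow = list(normalise_step, scale_step)

# pass the data through the steps in order; each output feeds the next step
result = Reduce(function(data, step) step(data), workflow, raw)
```

Swapping the order of the two entries in workflow would scale the raw data first and normalise afterwards, which generally gives a different result.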

Your workflow

  • What data processing steps do you think you might need for your workflow?
  • What statistical analysis might you like to include as part of your workflow?

4.2 Templates and Objects

As you will already have seen, there is a large number of R packages. This is one of the great strengths of R. However, when writing code to carry out a number of workflow steps, it can be difficult to integrate all of the required packages together in a way that is transparent and reproducible.

The struct package tries to address this by defining a number of “templates”. These templates define what the data structure should be, how data will flow in and out of a workflow step, and which output is transferred to the input of the next object.

In the figure below the green boxes represent workflow steps. Data flows in and out of the workflow steps before finally generating some charts. Each workflow step uses the same basic template, but applies a different data processing step by overriding the template defaults.

None of the templates in struct implement any data processing steps; they only define what a step should look like. The idea is that other packages can implement their code inside a template to make it compatible with other workflow steps. For example, the structToolbox package uses the struct templates to wrap processing steps from other packages as workflow steps. All workflow steps are then compatible with each other, so you no longer need to worry about how to implement the steps and can focus on which steps to use and in what order.

4.2.1 Template slots

Slots, or fields, of a template are how struct defines the content of a template. All struct templates have a name and a description slot. They also have a citation slot that contains a list of relevant literature in BibTeX format, a libraries slot naming any additional packages needed to use the object, and an ontology slot that provides links to FAIR descriptions of the method(s) implemented by the template.

struct also provides a mechanism to flexibly include additional slots in the form of input parameters (params) and output parameters (outputs). Input and output slots are specific to the method being implemented by the template.

For example, a “Principal Component Analysis” template might include a “number_of_components” slot for the user to specify how many components to use in the computation.

The results of a computation will be put in output slots; until the object is used with some data the output slots will be empty. For the PCA example, the computed component scores might be stored in the scores output slot.
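As a rough sketch of how these slots look in practice, the code below inspects a PCA object. It assumes the struct and structToolbox packages are installed, and that the PCA template in structToolbox names its input parameter number_components and provides a scores output; check the package documentation if your version differs.

```r
library(struct)
library(structToolbox)   # assumed to provide the PCA template

# create a PCA object, setting an input parameter at initialisation
M = PCA(number_components = 2)

# slots common to all struct templates
M$name
M$description

# names of the input parameters and outputs defined for this template
param_ids(M)
output_ids(M)

# get and set an input parameter using dollar notation
M$number_components
M$number_components = 3
```

Because no data has been passed to the object yet, its outputs hold only empty placeholder values at this point.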

You will find out how to apply computations to a DatasetExperiment later in the course.

4.2.2 Template methods

As well as the templates, struct also provides a number of “methods” for each template. Methods are functions that change how they work depending on the input template. Some methods defined by struct enable easy access to the inputs and outputs of the templates, while others, such as show, define what is displayed when the template is printed to the console.

struct defines a $ method that allows you to easily get and set slot values. You will see examples of this in the next section.
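To make the idea of a method concrete, the toy sketch below uses base R's S4 system (the methods package), which is the kind of class system struct builds on. The classes toy_step and toy_scaler and the generic describe are hypothetical names invented for this illustration; they are not part of struct, and struct's own implementation is more involved.

```r
library(methods)

# toy parent "template" with two slots (hypothetical, not part of struct)
setClass("toy_step",
    representation(name = "character", description = "character"))

# toy child template: inherits the parent's slots and adds a parameter slot
setClass("toy_scaler", contains = "toy_step",
    representation(centre = "logical"))

# a generic function whose behaviour depends on the class of its input
setGeneric("describe", function(obj) standardGeneric("describe"))

setMethod("describe", "toy_step", function(obj) {
    paste0("step: ", obj@name)
})

setMethod("describe", "toy_scaler", function(obj) {
    paste0("scaler: ", obj@name, " (centre = ", obj@centre, ")")
})

# a $ method defined once on the parent also works for the child
setMethod("$", "toy_step", function(x, name) slot(x, name))

S = new("toy_step", name = "generic step", description = "does nothing")
K = new("toy_scaler", name = "autoscale",
    description = "scales columns", centre = TRUE)

describe(S)   # uses the toy_step method
describe(K)   # uses the toy_scaler method
K$name        # dollar access inherited from toy_step
```

The same call, describe, runs different code for S and K, and the $ method written for the parent is available to the child without being redefined.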

4.3 Working with Templates and Objects

All templates “inherit” functionality from parent templates. You can list the inherited templates for an object using the is function. As an example, here we show the inherited templates of the PCA template.

# inheritance of PCA
is(PCA())
[1] "PCA"                "model"              "struct_class"      
[4] "model_OR_model_seq" "model_OR_iterator" 

All struct templates have struct_class as a parent, so the methods defined for this template will work for all struct templates. You will learn about some of the other templates in later modules.
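The is function also accepts a class name as a second argument, which lets you test for one particular parent instead of listing them all. The sketch below assumes the struct and structToolbox packages are installed; given the inheritance listed above, the first two calls should return TRUE and the last FALSE.

```r
library(struct)
library(structToolbox)   # assumed to provide the PCA template

# does a PCA object inherit from struct_class?
pca_is_struct = is(PCA(), "struct_class")

# does a PCA object inherit from the model template?
pca_is_model = is(PCA(), "model")

# struct_class is the parent, so it does not inherit from model
struct_is_model = is(struct_class(), "model")
```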

You will not normally use the struct_class template directly, but we can use it here to demonstrate some of the features of a template. To initialise an instance of a template, you use its name as a function call. When a template is initialised it becomes an “object”. You can print the details of an object to the console using the show command.

# create an instance of a struct_class template
SC = struct_class()

# print to console
show(SC)
A "struct_class" object
-----------------------
name:          
description:   

We can access slots of the object using dollar notation. For example, here we change the values of the name and description slots:

SC$name = 'example name'
SC$description = 'example description'
show(SC)
A "struct_class" object
-----------------------
name:          example name
description:   example description

You can also set slot values when initialising the object, e.g.:

SC = struct_class(
        name = 'second example name', 
        description = 'second description')
show(SC)
A "struct_class" object
-----------------------
name:          second example name
description:   second description

This approach will be used a lot when creating workflows in later modules.

4.4 Exercise

Exploring the PCA object

In this exercise you will test your understanding of struct_class objects using a PCA object, which is based on the struct_class template. You will find out more about the PCA object later in the course.