Module 5 DatasetExperiment objects

learn about the DatasetExperiment object
find out why we use the DatasetExperiment object
use a DatasetExperiment object containing example metabolomics data

5.1 The DatasetExperiment template

The DatasetExperiment template defines the format that data must follow to be compatible with other struct objects. Because the input data follows a strict template, struct can ensure that it is compatible with every step in a workflow.

The DatasetExperiment template consists of three key elements: data, sample_meta and variable_meta.

data: a table of peak areas/heights
The DatasetExperiment template defines that the data should be formatted as a data.frame with features (metabolites) in the columns, and samples in the rows. Each row of the data.frame therefore contains the peak areas for all metabolites measured in that sample.

sample_meta: meta data for the samples
Contains information about the samples in addition to the sample names. It can come in many forms. For example it could be categorical: e.g. whether the sample was a control sample or a treated sample, or it could be continuous: e.g. the BMI of the subject. The DatasetExperiment template defines that the sample meta data should be a data.frame where the samples are in rows, and each column corresponds to one piece of meta data.

variable_meta: meta data for the features
Like the sample meta data, except that it contains additional information about each feature (variable, metabolite), such as m/z and retention time, or annotation information. The DatasetExperiment template defines that this should be a data.frame where the features are in rows and each column corresponds to one piece of meta data.

The above definitions are for untargeted metabolomics, but the format is compatible with other data types provided you can arrange the data into a table with samples in rows and variables in columns. For example, the data for an NMR study might have total areas for each ppm bucket in the data table instead of metabolite peak areas, and ppm ranges in the variable meta data instead of m/z and retention time.
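
To make the three elements concrete, here is a minimal sketch that builds them as plain data.frames and checks that they line up. All values, sample names and feature names are hypothetical and only serve as an illustration. The final step assumes the struct package is installed and that its DatasetExperiment() constructor accepts data, sample_meta and variable_meta arguments; it is skipped otherwise.

```r
# Hypothetical toy values; real peak areas would come from your own study.

# data: samples in rows, features (metabolites) in columns
peak_data = data.frame(
  feature_1 = c(1200.5, 980.1, 1105.7),
  feature_2 = c(310.2, 295.8, 402.3),
  row.names = c('sample_1', 'sample_2', 'sample_3')
)

# sample_meta: one row per sample, one column per piece of meta data
sample_meta = data.frame(
  class = factor(c('control', 'treated', 'treated')),  # categorical
  bmi   = c(22.4, 27.1, 24.8),                          # continuous
  row.names = rownames(peak_data)
)

# variable_meta: one row per feature, one column per piece of meta data
variable_meta = data.frame(
  mz = c(180.0634, 132.1019),  # made-up m/z values
  rt = c(65.2, 48.9),          # made-up retention times
  row.names = colnames(peak_data)
)

# the three tables must agree on sample and feature identifiers
stopifnot(
  identical(rownames(sample_meta), rownames(peak_data)),
  identical(rownames(variable_meta), colnames(peak_data))
)

# optional: wrap the tables in a DatasetExperiment if struct is available
if (requireNamespace('struct', quietly = TRUE)) {
  DE_toy = struct::DatasetExperiment(
    data = peak_data,
    sample_meta = sample_meta,
    variable_meta = variable_meta,
    name = 'Toy dataset',
    description = 'Three samples and two features with made-up values'
  )
  print(DE_toy)
}
```

Using the sample names as row names in both data and sample_meta, and the feature names as row names in variable_meta, keeps the three tables aligned when samples or features are later removed.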

5.2 Iris DatasetExperiment object

For the struct package, Fisher’s classic Iris dataset has been converted to DatasetExperiment format. You can import it into your environment as follows:

# import Fisher's iris data
DE = iris_DatasetExperiment()

You can examine the DE object in your environment using the show function.

show(DE)

A "DatasetExperiment" object
----------------------------
name:          Fisher's Iris dataset
description:   This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of
                 the variables sepal length and width and petal length and width,
                 respectively, for 50 flowers from each of 3 species of iris. The species are
                 Iris setosa, versicolor, and virginica.
data:          150 rows x 4 columns
sample_meta:   150 rows x 1 columns
variable_meta: 4 rows x 1 columns

From this output you can see that the DE object has a name and a description defined. These fields are common to all struct objects and can be accessed using dollar notation:

DE$name

[1] "Fisher's Iris dataset"

DE$description

[1] "This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica."

Fields (formally called “slots”) in a struct object can be assigned new values using similar notation:

# change the name
DE$name  = 'Fisher/Anderson Iris dataset'
# show updated object
DE

A "DatasetExperiment" object
----------------------------
name:          Fisher/Anderson Iris dataset
description:   This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of
                 the variables sepal length and width and petal length and width,
                 respectively, for 50 flowers from each of 3 species of iris. The species are
                 Iris setosa, versicolor, and virginica.
data:          150 rows x 4 columns
sample_meta:   150 rows x 1 columns
variable_meta: 4 rows x 1 columns

Typing the name of an object on its own prints it using show, so here we skipped the explicit call to show.

For this DatasetExperiment object we can see that there are 4 columns of data for 150 samples. There is a single column of meta data for the samples, and a single column of meta data for the variables. These slots can be accessed in the same way as the other slots, using dollar notation. Here we show the first 6 rows of the data.

head(DE$data)

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4
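
The two meta data slots can be inspected in the same way. The following is a short sketch, assuming the struct package is installed; it prints the sample and variable meta data and confirms that their dimensions agree with the data table, as reported by show.

```r
library(struct)

DE = iris_DatasetExperiment()

# meta data for the samples: one row per sample
head(DE$sample_meta)

# meta data for the variables: one row per feature
DE$variable_meta

# rows of data correspond to rows of sample_meta
nrow(DE$data) == nrow(DE$sample_meta)

# columns of data correspond to rows of variable_meta
ncol(DE$data) == nrow(DE$variable_meta)
```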

5.3 Exercise

MTBLS79 DatasetExperiment

In this exercise you will be able to test your understanding of DatasetExperiment objects.

Task

The structToolbox package provides the MTBLS79 dataset as an example DatasetExperiment object.

find out about MTBLS79_DatasetExperiment() using the package documentation.
import MTBLS79 into your environment and display a summary of the contents.
update the description to better reflect the documentation.
compare the data with filtered = TRUE to the data when filtered = FALSE.

Hints

You can view the help by preceding the function name with a question mark
all slots of a DatasetExperiment can be accessed using $ notation
how many rows/columns does each dataset have?

Solutions

We can put a question mark in front of the function name to obtain its documentation.
```
# display documentation for MTBLS79
?MTBLS79_DatasetExperiment
```
Note that in RStudio the documentation will be displayed in the Help tab (by default, a tab in the bottom right panel).

The show function summarises the contents of a struct object.

# use show to display a summary of contents
DE = MTBLS79_DatasetExperiment()
show(DE)

A "DatasetExperiment" object
----------------------------
name:          MTBLS79
description:   Converted from SE provided by the pmp package
data:          172 rows x 2063 columns
sample_meta:   172 rows x 7 columns
variable_meta: 2063 rows x 0 columns

We can use dollar notation to get and set values for a slot.

# update description e.g.
DE$description = 'A systematic evaluation of the reproducibility of a multi-batch DIMS metabolomics study of cardiac tissue extracts'
show(DE)

A "DatasetExperiment" object
----------------------------
name:          MTBLS79
description:   A systematic evaluation of the reproducibility of a multi-batch DIMS metabolomics study of
                 cardiac tissue extracts
data:          172 rows x 2063 columns
sample_meta:   172 rows x 7 columns
variable_meta: 2063 rows x 0 columns

We can use filtered = TRUE as an input to the function and compare the output of show for the filtered and unfiltered data.

DEfiltered = MTBLS79_DatasetExperiment(filtered=TRUE)
show(DEfiltered)

A "DatasetExperiment" object
----------------------------
name:          MTBLS79
description:   Converted from SE provided by the pmp package
data:          172 rows x 1579 columns
sample_meta:   172 rows x 7 columns
variable_meta: 1579 rows x 0 columns

The unfiltered data has 172 samples and 2063 features. After filtering the dataset has 1579 features.
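
The same comparison can be made without reading the numbers off the show output. This is a minimal sketch, assuming the structToolbox package (and the pmp package it draws the data from) is installed; it uses dim, nrow and ncol on the data slot of each object.

```r
library(structToolbox)

DE = MTBLS79_DatasetExperiment()
DEfiltered = MTBLS79_DatasetExperiment(filtered = TRUE)

# rows (samples) and columns (features) of each data table
dim(DE$data)
dim(DEfiltered$data)

# number of features removed by filtering
# (2063 - 1579 = 484 according to the summaries above)
ncol(DE$data) - ncol(DEfiltered$data)

# filtering removes features only, so the number of samples is unchanged
nrow(DE$data) == nrow(DEfiltered$data)
```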