Exploring the MTox700+ library
Gavin Rhys Lloyd
2024-01-19
Source:vignettes/exploring_mtox.Rmd
Getting Started
The latest versions of struct and MetMashR that are compatible with your current R version can be installed using BiocManager.
# install BiocManager if not present
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# install MetMashR and dependencies
BiocManager::install("MetMashR")
Once installed you can activate the packages in the usual way:
# load the packages
library(MetMashR)
library(ggplot2)
library(structToolbox)
library(dplyr)
library(DT)
Introduction
MTox700+ is a list of toxicologically relevant metabolites derived from publications, public databases and relevant toxicological assays.
In this vignette we import the MTox700+ database and combine, merge and “mash” it with other databases to explore its contents and its coverage of chemical, biological and toxicological space.
Importing the MTox700+ database
The MTox700+ database can be imported using the MTox700plus_database object. It can be imported to a data.frame using the read_database method.
# prep object
MT = MTox700plus_database(
version = 'latest',
tag = 'MTox700+'
)
# import
df = read_database(MT)
# show contents
datatable(df)
# prepare workflow that uses MTox700+ as a source
# (the steps are chained with "+" so that both are applied in sequence)
M = import_source() +
    trim_whitespace(
        column_names = '.all',
        which = 'both',
        whitespace = '[\\h\\v]'
    )
# apply
M = model_apply(M,MT)
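The trim_whitespace step is documented as a wrapper for trimws(). As a minimal base R sketch of the same idea (not the MetMashR implementation), the snippet below trims every character column of a small made-up table using the same which and whitespace settings. The table and its contents are hypothetical.

```r
# Toy table with stray whitespace (hypothetical data, not MTox700+)
toy <- data.frame(
    metabolite_name = c("  Glycine", "Serine\t", " Alanine \n"),
    inchikey = c("KEY-A ", " KEY-B", "KEY-C"),
    stringsAsFactors = FALSE
)

# Identify the character columns
is_chr <- vapply(toy, is.character, logical(1))

# Trim both ends of each character column; '[\\h\\v]' matches any
# horizontal or vertical whitespace character (see ?trimws)
toy[is_chr] <- lapply(
    toy[is_chr],
    trimws,
    which = "both",
    whitespace = "[\\h\\v]"
)

toy$metabolite_name
```

Trimming before any matching step matters because identifiers such as inchikeys are compared as exact strings, so a trailing space would prevent a match.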
Exploring the chemical space
The chemical (or “metabolite”) space covered by the MTox700+ database can be explored in several ways using the data included in the database.
For example, we can generate images of the molecules using the SMILES included in the database. Here we generate images of the first 6 metabolites in the database.
# prepare chart
C = openbabel_structure(
smiles_column = 'smiles',
row_index = 1,
scale_to_fit = FALSE,
image_size = 300,
title_column = 'metabolite_name',
view_port = 400
)
# first six
G = list()
for (k in 1:6) {
# set row idx
C$row_index=k
# plot
G[[k]] = chart_plot(C,predicted(M))
}
# layout
cowplot::plot_grid(plotlist = G,nrow=2)
The MTox700+ database also contains information about the structural classification of the metabolites based on ChemOnt (a chemical taxonomy) and ClassyFire (software to compute the taxonomy of a structure) [10.1186/s13321-016-0174-y].
In this plot we show the number of metabolites in the MTox700+ database that are assigned to a “superclass” of molecules.
# initialise chart object
C = annotation_bar_chart(
factor_name = 'superclass',
label_rotation = TRUE,
label_location = 'outside',
label_type = 'percent',
legend = TRUE
)
# plot
g = chart_plot(C,predicted(M)) + ylim(c(0,600)) +
guides(fill=guide_legend(nrow=6,title = element_blank())) +
theme(legend.position = 'bottom',legend.margin=margin())
# layout
leg = cowplot::get_legend(g)
cowplot::plot_grid(g+theme(legend.position = 'none'),leg,nrow=2,rel_heights = c(75,25))
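The counts and percentage labels shown on a chart like this can be checked by hand with base R. The snippet below is a rough sketch on a made-up vector of class labels (the values are hypothetical); it is not the code behind annotation_bar_chart.

```r
# Hypothetical superclass assignments for six metabolites
superclass <- c(
    "Lipids", "Organic acids", "Lipids",
    "Benzenoids", "Organic acids", "Lipids"
)

# Count the members of each class and convert to percentages
counts <- table(superclass)
percent <- round(100 * prop.table(counts), 1)

# Collect into a table, largest class first
summary_table <- data.frame(
    superclass = names(counts),
    count = as.vector(counts),
    percent = as.vector(percent),
    stringsAsFactors = FALSE
)
summary_table <- summary_table[order(-summary_table$count), ]
summary_table
```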
Exploring the biological space
To explore the biological space covered by the metabolites in MTox700+ we need to mash the database with additional information about the biological pathways that the metabolites are part of.
We use PathBank for this purpose. A struct_database object for PathBank is already included in MetMashR.
Importing PathBank
MetMashR provides the PathBank_metabolite_database object to import the PathBank database. You can choose to import:
- The “primary” database. This is a smaller version of the database restricted to primary pathways.
- The “complete” database, which includes all pathways in the database.
The “complete” database is a >50 MB download, and unzipped is >1 GB. Unzipping and caching of the database is handled by BiocFileCache.
For the vignette we restrict to the “primary” PathBank database to keep file sizes and downloads to a minimum.
We can use the database in two ways:
- convert it to a source and “mash” it with other sources
- use it as a lookup table to add information to an existing source.
To explore the biological space covered by MTox700+ we will do both.
Comparing PathBank and MTox700+
It is useful to visualise the overlap between PathBank and MTox700+. MTox700+ is a much smaller database because it is a curated list of metabolites with toxicological relevance, whereas PathBank is more general.
In the example below we import PathBank as a source, and use a Venn diagram to compare the overlap between inchikey identifiers in PathBank and MTox700+.
# object M already contains the MTox700+ database as a source
# prepare PathBank as a source
P = PathBank_metabolite_database(
version = 'primary',
tag = 'PathBank'
)
# import
P = read_source(P)
# prepare chart
C = annotation_venn_chart(
factor_name = c('inchikey','InChI.Key'),legend = FALSE,
fill_colour = '.group',
line_colour = 'white'
)
# plot
chart_plot(C,predicted(M),P)
The diagram shows that less than half of the metabolites in MTox700+ are also present in the PathBank database for primary pathways.
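The regions of a two-set Venn diagram correspond to simple set operations on the identifier columns. The sketch below illustrates this with two short, made-up vectors of keys; the key values are hypothetical placeholders, not real inchikeys.

```r
# Hypothetical identifiers standing in for the two inchikey columns
mtox_keys <- c("KEY-A", "KEY-B", "KEY-C", "KEY-D", "KEY-E")
pathbank_keys <- c("KEY-B", "KEY-D", "KEY-X", "KEY-Y", "KEY-B")

# A Venn diagram compares sets, so drop duplicate identifiers first
mtox_keys <- unique(mtox_keys)
pathbank_keys <- unique(pathbank_keys)

# The three regions of the diagram
shared <- intersect(mtox_keys, pathbank_keys)
only_mtox <- setdiff(mtox_keys, pathbank_keys)
only_pathbank <- setdiff(pathbank_keys, mtox_keys)

region_sizes <- c(
    shared = length(shared),
    only_mtox = length(only_mtox),
    only_pathbank = length(only_pathbank)
)
region_sizes

# Fraction of the smaller database that is also in the larger one
fraction_shared <- length(shared) / length(mtox_keys)
fraction_shared
```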
Combining MTox700+ with PathBank
To combine the pathway information in PathBank with the MTox700+ database we can use PathBank as a lookup table based on inchikeys. To do this we use the database_lookup object.
Note that PathBank is not downloaded a second time; it is automatically retrieved from the cache.
We request a number of columns from PathBank, including pathway information and additional identifiers such as HMDB ID and KEGG ID.
# prepare object
X = database_lookup(
query_column = 'inchikey',
database = P$data,
database_column = 'InChI.Key',
include = c(
"PathBank.ID","Pathway.Name","Pathway.Subject","Species",
"HMDB.ID","KEGG.ID","ChEBI.ID","DrugBank.ID","SMILES"),
suffix = ''
)
# apply
X = model_apply(X,predicted(M))
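Conceptually, a lookup of this kind resembles a join between two tables whose key columns have different names. The base R sketch below shows that idea with merge() on made-up tables. It is an illustration only, not how database_lookup is implemented, and keeping unmatched rows (all.x = TRUE) is an assumption made here. All table contents are hypothetical.

```r
# Hypothetical source table (stands in for the MTox700+ records)
source_table <- data.frame(
    metabolite_name = c("met A", "met B", "met C"),
    inchikey = c("KEY-A", "KEY-B", "KEY-C"),
    stringsAsFactors = FALSE
)

# Hypothetical lookup table (stands in for PathBank)
lookup_table <- data.frame(
    InChI.Key = c("KEY-A", "KEY-A", "KEY-B"),
    Pathway.Name = c("Pathway 1", "Pathway 2", "Pathway 1"),
    Species = c("Homo sapiens", "Homo sapiens", "Mus musculus"),
    Other = c("x", "y", "z"),
    stringsAsFactors = FALSE
)

# Only these lookup columns are carried over
include <- c("Pathway.Name", "Species")

# Join on keys that are named differently in the two tables
annotated <- merge(
    source_table,
    lookup_table[c("InChI.Key", include)],
    by.x = "inchikey",
    by.y = "InChI.Key",
    all.x = TRUE
)
annotated
```

In this sketch "met C" has no match and receives NA in the added columns, while "met A" appears once per matching lookup row.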
We can now visualise e.g. the subject of the pathways captured by the MTox700+ database.
C = annotation_bar_chart(
factor_name = 'Pathway.Subject',
label_rotation = TRUE,
label_location = 'outside',
label_type = 'percent',
legend = TRUE
)
chart_plot(C,predicted(X))+ylim(c(0,17500))
We can see that MTox700+ largely focuses on metabolites related to Disease metabolism and general metabolism, which is consistent with the database being curated to contain metabolites relevant to toxicology in humans.
Combining records
Metabolites can appear in multiple pathways. The PathBank database therefore contains multiple records for the same metabolite, and the relationship between MTox700+ and PathBank is one-to-many.
After obtaining pathway information from PathBank the new table has many more rows than the original MTox700+ database, as each MTox700+ record has been replicated for each match in the PathBank database.
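The increase can be seen by comparing the number of records after importing MTox700+, given by nrow(predicted(M)$data), with the number of records after combining with PathBank, given by nrow(predicted(X)$data).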
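How much a table grows in a one-to-many lookup depends on the number of matches per key. The following sketch, on made-up tables, counts the matches for each key and compares the row count before and after a join. It assumes that records without a match are kept once; the data are hypothetical.

```r
# Hypothetical one-record-per-metabolite table
source_table <- data.frame(
    inchikey = c("KEY-A", "KEY-B", "KEY-C"),
    metabolite_name = c("met A", "met B", "met C"),
    stringsAsFactors = FALSE
)

# Hypothetical pathway table with repeated keys
lookup_table <- data.frame(
    inchikey = c("KEY-A", "KEY-A", "KEY-A", "KEY-B"),
    pathway = c("P1", "P2", "P3", "P1"),
    stringsAsFactors = FALSE
)

# Number of lookup rows matching each source key (zero if none)
matches <- table(factor(lookup_table$inchikey, levels = source_table$inchikey))

# Each record is replicated once per match; unmatched records stay once
expected_rows <- sum(pmax(as.vector(matches), 1))

# Compare with an actual join
joined <- merge(source_table, lookup_table, by = "inchikey", all.x = TRUE)
n_before <- nrow(source_table)
n_after <- nrow(joined)
c(before = n_before, after = n_after, expected = expected_rows)
```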
Sometimes it is useful to collapse this information into a single record per metabolite. We can use the combine_records object and its helper functions to do this in a MetMashR workflow.
# prepare object
X = database_lookup(
query_column = 'inchikey',
database = P$data,
database_column = 'InChI.Key',
include = c(
"PathBank.ID","Pathway.Name","Pathway.Subject","Species",
"HMDB.ID","KEGG.ID","ChEBI.ID","DrugBank.ID","SMILES"),
suffix = '') +
combine_records(
group_by = 'inchikey',
default_fcn = .unique(' || ')
)
# apply
X = model_apply(X,predicted(M))
We have used the .unique helper function so that records for each inchikey are combined into a single record by only retaining unique values in each field (column). If there are multiple unique values for a field then they are combined into a single string using the “ || ” separator.
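The collapsing behaviour described above can be mimicked in base R with aggregate(). This is a sketch under the assumption that every column is treated the same way; collapse_unique is a hypothetical helper written for this example and is not the MetMashR .unique function. The data are made up.

```r
# Hypothetical table with several rows per metabolite
joined <- data.frame(
    inchikey = c("KEY-A", "KEY-A", "KEY-A", "KEY-B"),
    name = c("met A", "met A", "met A", "met B"),
    pathway = c("P1", "P2", "P1", "P3"),
    stringsAsFactors = FALSE
)

# Keep unique values only and join them with a separator
# (hypothetical helper, for illustration)
collapse_unique <- function(x, sep = " || ") {
    paste(unique(x), collapse = sep)
}

# One row per inchikey, each remaining column collapsed
combined <- aggregate(
    joined[c("name", "pathway")],
    by = list(inchikey = joined$inchikey),
    FUN = collapse_unique
)
combined
```

Because duplicates are removed before pasting, the repeated "P1" for KEY-A appears only once in the combined pathway string.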
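We can now extract the pathways associated with a particular metabolite, for example Glycolic acid.
# locate the record (assumes names are stored in the metabolite_name column)
w = which(predicted(X)$data$metabolite_name == 'Glycolic acid')
The pathways associated with Glycolic acid are:
# print list of pathways
predicted(X)$data$Pathway.Name[w]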
#> [1] "Inner Membrane Transport || Glycolate and Glyoxylate Degradation || D-Arabinose Degradation I || Ethylene Glycol Degradation"
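A combined field can be split back into individual values when needed. The short example below applies strsplit() to the pathway string printed above; fixed = TRUE is used so that the "|" characters are treated literally and not as regular expression operators.

```r
# The combined pathway string, copied from the output above
pathway_string <- paste(
    "Inner Membrane Transport",
    "Glycolate and Glyoxylate Degradation",
    "D-Arabinose Degradation I",
    "Ethylene Glycol Degradation",
    sep = " || "
)

# Split on the literal separator
pathways <- strsplit(pathway_string, " || ", fixed = TRUE)[[1]]
pathways
length(pathways)
```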
Session Info
sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] DT_0.31 dplyr_1.1.4 structToolbox_1.14.0
#> [4] ggplot2_3.4.4 MetMasheR_0.1.0 struct_1.14.0
#> [7] BiocStyle_2.30.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.1 bitops_1.0-7
#> [3] gridExtra_2.3 rlang_1.1.3
#> [5] magrittr_2.0.3 matrixStats_1.2.0
#> [7] e1071_1.7-14 compiler_4.3.2
#> [9] RSQLite_2.3.4 systemfonts_1.0.5
#> [11] vctrs_0.6.5 stringr_1.5.1
#> [13] pkgconfig_2.0.3 crayon_1.5.2
#> [15] fastmap_1.1.1 dbplyr_2.4.0
#> [17] magick_2.8.2 XVector_0.42.0
#> [19] ellipsis_0.3.2 labeling_0.4.3
#> [21] utf8_1.2.4 rmarkdown_2.25
#> [23] ragg_1.2.7 purrr_1.0.2
#> [25] bit_4.0.5 xfun_0.41
#> [27] zlibbioc_1.48.0 cachem_1.0.8
#> [29] GenomeInfoDb_1.38.5 jsonlite_1.8.8
#> [31] RVenn_1.1.0 blob_1.2.4
#> [33] highr_0.10 DelayedArray_0.28.0
#> [35] R6_2.5.1 bslib_0.6.1
#> [37] stringi_1.8.3 GenomicRanges_1.54.1
#> [39] jquerylib_0.1.4 Rcpp_1.0.12
#> [41] bookdown_0.37 SummarizedExperiment_1.32.0
#> [43] knitr_1.45 IRanges_2.36.0
#> [45] Matrix_1.6-5 tidyselect_1.2.0
#> [47] abind_1.4-5 yaml_2.3.8
#> [49] ggVennDiagram_1.5.0 curl_5.2.0
#> [51] lattice_0.22-5 tibble_3.2.1
#> [53] Biobase_2.62.0 withr_3.0.0
#> [55] evaluate_0.23 ontologyIndex_2.11
#> [57] desc_1.4.3 sf_1.0-15
#> [59] units_0.8-5 proxy_0.4-27
#> [61] BiocFileCache_2.10.1 pillar_1.9.0
#> [63] BiocManager_1.30.22 filelock_1.0.3
#> [65] MatrixGenerics_1.14.0 KernSmooth_2.23-22
#> [67] stats4_4.3.2 ChemmineOB_1.40.0
#> [69] generics_0.1.3 sp_2.1-2
#> [71] RCurl_1.98-1.14 S4Vectors_0.40.2
#> [73] munsell_0.5.0 scales_1.3.0
#> [75] class_7.3-22 glue_1.7.0
#> [77] tools_4.3.2 fs_1.6.3
#> [79] cowplot_1.1.2 grid_4.3.2
#> [81] crosstalk_1.2.1 colorspace_2.1-0
#> [83] GenomeInfoDbData_1.2.11 cli_3.6.2
#> [85] textshaping_0.3.7 rsvg_2.6.0
#> [87] fansi_1.0.6 ggthemes_5.0.0
#> [89] S4Arrays_1.2.0 gtable_0.3.4
#> [91] sass_0.4.8 digest_0.6.34
#> [93] BiocGenerics_0.48.1 classInt_0.4-10
#> [95] SparseArray_1.2.3 farver_2.1.1
#> [97] htmlwidgets_1.6.4 memoise_2.0.1
#> [99] htmltools_0.5.7 pkgdown_2.0.7
#> [101] lifecycle_1.0.4 httr_1.4.7
#> [103] bit64_4.0.5