Use case • neighborfinder

library(rmarkdown)
library(knitr)
library(neighborfinder)
#> 
#> Attaching package: 'neighborfinder'
#> The following object is masked from 'package:rmarkdown':
#> 
#>     metadata

Presentation of dataset: CRC Example

The data provided in the package was extracted from this repository: Taxonomic profiles, functional profiles and manually curated metadata of human fecal metagenomes from public projects coming from colorectal cancer studies

Three subgroups are defined following 3 geographic areas: Japan, China and Europe (gathering Italy, Austria, Germany and France). The focus is on patients with colorectal cancer.

Preview of the data

Below, each object needed in this tutorial is loaded and a preview is displayed.

1) The abundance table

Here is a preview of what an abundance table from the data object looks like, with species as rows and samples as columns. The first column is the taxonomic annotation at the species level, the second one is the module ID, and then each column is a sample.

data(data)
data$CRC_JPN[1:5, 1:5] %>% kable()

species	msp_id	DRS091121	DRS091048	DRS091122
Blastocystis sp. subtype 3	msp_0001	0e+00	0e+00	0e+00
Blastocystis sp. subtype 1	msp_0002	0e+00	0e+00	0e+00
Bacteroides cellulosilyticus	msp_0003	0e+00	8e-07	2e-07
Blastocystis sp. subtype 2	msp_0004	0e+00	0e+00	0e+00
Escherichia coli	msp_0005	1e-07	9e-07	0e+00

2) The metadata

The metadata gathers characteristics (in columns) for each sample (in rows).

data(metadata)
metadata$CRC_JPN[1:5, 1:5] %>% kable()

sample	HQ_clean_read_count	gut_mapped_read_count	gut_mapped_pc	oral_mapped_read_count
DRS091371	39159954	32390300	82.7128142183211	32390300
DRS091370	35443586	28999607	81.8190546520885	28999607
DRS091369	43922552	36841645	83.8786530436574	36841645
DRS091368	29627016	25334944	85.5129790998864	25334944
DRS091366	49606076	41361270	83.3794432762632	41361270

3) The taxonomy

The taxonomic file indicates for each msp (i.e. metagenomic species) its species and genus annotation as well as the catalog in which the msp is mostly found.

data(taxo)
taxo[1:5, ] %>% kable()

msp_id	species	genus	catalogue
msp_0001	Blastocystis sp. subtype 3	Blastocystis	gut
msp_0002	Blastocystis sp. subtype 1	Blastocystis	gut
msp_0003	Bacteroides cellulosilyticus	Bacteroides	gut
msp_0004	Blastocystis sp. subtype 2	Blastocystis	gut
msp_0005	Escherichia coli	Escherichia	gut & oral

4) The graph

This dataframe is an adjacency matrix encoding the graph with “cluster-like” structure produced with graph_step(). A value of 1 corresponds to an edge. This object is needed to simulate semi-synthetic data.

data(graphs)
graphs$CRC_JPN[1:5, 1:5] %>% kable()

	msp_0003	msp_0005	msp_0007	msp_0008	msp_0009
msp_0003	0	0	0	0	0
msp_0005	0	0	0	0	0
msp_0007	0	0	0	0	0
msp_0008	0	0	0	0	0
msp_0009	0	0	0	0	0

graphs$CRC_JPN[5:10, 87:90] %>% kable()

	msp_0145	msp_0146c	msp_0147
msp_0009	0	0	0
msp_0010	0	0	0
msp_0011	0	0	0
msp_0012	0	1	0
msp_0013	0	0	0
msp_0014	1	0	1

Aim of this use case

In this example, we are focusing on Escherichia coli because it is one bacterium that is associated with colorectal cancer (CRC). Here is the article mentionning it. Some E. coli strains can produce colibactin, which is a toxin inducing DNA damage that may lead to colorectal cancer.

The goal here is to explore the ecosystem around E. coli by identifying its direct neighbors in patients with colorectal cancer.

Test if the default parameters of NeighborFinder are suitable for your species of interest & dataset

This function enables one to test the NeighborFinder method for different parameter values thanks to simulated data based on the provided dataset. There will be different performance scores depending on the species of interest and the dataset provided.

To do this step, generating a graph is needed.

It is recommended to save it to make next steps quicker. This step was commented since it takes few minutes to execute and also because it is included in the ‘graphs’ object.

# G <- graph_step(data_with_annotation = data$CRC_JPN_CHN_EUR,
#                 col_module_id = "msp_id",
#                 annotation_level = "species"
#                 )

G <- graphs$CRC_JPN_CHN_EUR

NeighborFinder uses by default the level of prevalence of 30% and filtering the top 20%. If the results from the table rendered by choose_params_values() indicate better performance scores at other values of these parameters, the user can adjust the prev_level and filtering_top parameters in the function apply_NeighborFinder().

choose_params_values(
  data_with_annotation = data$CRC_JPN,
  object_of_interest = "Escherichia coli",
  sample_size = 100,
  prev_list = c(0.20, 0.25, 0.30),
  filtering_list = c(10, 20, 30),
  graph_file = graphs$CRC_JPN,
  col_module_id = "msp_id",
  annotation_level = "species"
) %>%
  dplyr::mutate(filtering_top = as.numeric(filtering_top)) %>%
  as.data.frame() %>%
  kable()
#> Defining and saving true neighbors...
#> Calculating scores...

prev_level	filtering_top	F1_before	F1_after
0.20	10	0.0120	0.00
0.20	20	0.0120	0.00
0.20	30	0.0120	1.00
0.25	10	0.0130	0.67
0.25	20	0.0130	1.00
0.25	30	0.0130	1.00
0.30	10	0.0058	0.67
0.30	20	0.0058	1.00
0.30	30	0.0058	1.00

The displayed table indicates the F1 scores (harmonic mean of precision and recall) before and after applying the NeighborFinder method, for each prev_level and filtering_top parameters tested. In this example, one of the combinations enabling a satisfying performance is prev_level=0.30 and filtering_top=30.

Note that this step is optional, and a summary of the expected performance scores (calculated on 8 semi-synthetic simulated datasets) is shown in Fig.4 of the Tech report.

Apply NeighborFinder & look for Escherichia coli neighbors in CRC patients

Running apply_NeighborFinder() renders a dataframe gathering all neighbors of your species of interest. The required data format in input is as follows: species are the rows and samples are the columns. The first column must be the species name, the second is the msps name, and each subsequent column is a sample (see above for the abundance table preview).

# JAPAN
res_CRC_JPN <- apply_NeighborFinder(
  data_with_annotation = data$CRC_JPN,
  object_of_interest = "Escherichia coli",
  col_module_id = "msp_id",
  annotation_level = "species",
  prev_level = 0.30,
  filtering_top = 30
)
res_CRC_JPN %>% kable()

node1	node2	coef
msp_0005	msp_0030	0.0872394
msp_0005	msp_0103	0.0997908
msp_0005	msp_0208	-0.1143868
msp_0005	msp_0467	-0.0870316
msp_0005	msp_2448	0.0988908

Visualize the corresponding network

Here is a visualization of NeighborFinder results.

It is possible to adjust the size of nodes and labels. One can also define the color of nodes corresponding to their species of interest. The taxo_option generates the same network but with species names instead of msps ones.

visualize_network(
  res_CRC_JPN,
  taxo,
  object_of_interest = "Escherichia coli",
  col_module_id = "msp_id",
  annotation_level = "species",
  label_size = 5
)


visualize_network(
  res_CRC_JPN,
  taxo,
  object_of_interest = "Escherichia coli",
  col_module_id = "msp_id",
  annotation_level = "species",
  label_size = 5,
  annotation_option = TRUE,
  seed = 2
)

Apply NeighborFinder with covariate option

Including covariates makes it possible to correct for the effect of external factors (e.g. the sequencing technology used to generate the data, the extraction method) that can introduce bias or variability into the observations. This allows the model to better distinguish between what is part of the true structure in the network and what is simply due to technical or contextual differences.

Here is an example of applying the same method, but giving a covariate: in this case, the name of the study. To include a covariate, the function would run only if the following all 3 arguments are given:

covar: takes a formula or the name of the column of the covariate in the metadata table. Note that “study_accession” is equivalent to ~study_accession.
meta_df: the dataframe giving metadata information.
sample_col: the name of the column in metadata indicating the sample names, it should be consistent with the colnames of “data_with_taxo”.

# On CRC patients
# CHINA
res_CRC_CHN <- apply_NeighborFinder(
  data$CRC_CHN,
  object_of_interest = "Escherichia coli",
  col_module_id = "msp_id",
  annotation_level = "species",
  prev_level = 0.30,
  filtering_top = 30,
  covar = ~study_accession,
  meta_df = metadata$CRC_CHN,
  sample_col = "secondary_sample_accession"
)

# EUROPE
res_CRC_EUR <- apply_NeighborFinder(
  data$CRC_EUR,
  object_of_interest = "Escherichia coli",
  col_module_id = "msp_id",
  annotation_level = "species",
  prev_level = 0.30,
  filtering_top = 30,
  covar = ~study_accession,
  meta_df = metadata$CRC_EUR,
  sample_col = "secondary_sample_accession"
)

visualize_network(
  res_CRC_CHN,
  taxo,
  object_of_interest = "Escherichia coli",
  col_module_id = "msp_id",
  annotation_level = "species",
  label_size = 5
)


visualize_network(
  res_CRC_EUR,
  taxo,
  object_of_interest = "Escherichia coli",
  col_module_id = "msp_id",
  annotation_level = "species",
  label_size = 5
)

Look at the intersection of neighbors found in the 3 subgroups

Here is an example of how to merge the results of several datasets into a single network. This operation can be performed with 2 or more datasets. One can choose the visualization threshold, corresponding to the minimum number of datasets in which neighbors have been found.

intersections_network(
  res_list = list(res_CRC_JPN, res_CRC_CHN, res_CRC_EUR),
  taxo,
  threshold = 2,
  "Escherichia coli",
  col_module_id = "msp_id",
  annotation_level = "species",
  label_size = 7,
  edge_label_size = 4,
  node_size = 15,
  annotation_option = TRUE,
  seed = 3
)

Another function used here is equivalent to the network thanks to a summary table, indicating in which datasets intersections were found above the set threshold.

intersections_table(
  res_list = list(res_CRC_JPN, res_CRC_CHN, res_CRC_EUR),
  threshold = 2,
  taxo,
  col_module_id = "msp_id",
  annotation_level = "species",
  "Escherichia coli"
) %>% kable()

node1	module1	node2	module2	datasets	intersections	mean_coef
msp_0005	Escherichia coli	msp_0030	Klebsiella pneumoniae	n_ 1, n_ 2	2	0.1006331
msp_0005	Escherichia coli	msp_0103	Clostridium_AQ innocuum	n_ 1, n_ 2	2	0.0744551
msp_0005	Escherichia coli	msp_0208	Blautia_A faecis	n_ 1, n_ 3	2	-0.1269152

One of the resulting neighbors is Klebsiella pneumoniae. It is also described in this article as a bacterium associated with CRC, producing the same toxin.

The article summarized the prevalence of pks positive E. coli and K. pneumoniae in CRC patient samples. pks stands for polyketide synthase, K. pneumoniae genes can synthetize a bacterial toxin which is colibactin.

“It seems that pks positive bacteria can induce mutation of CRC driver genes and, therefore, pks may become a marker of CRC carcinogenesis and therapy.” (N. Strakova, K. Korena, R Karpiskova, 2021)