Collecting and Identification the Outbreak Cluster

The goal of Collecting and Identification the Outbreak Cluster (caIRA) is to find the cluster based tree and metadata of the data. The package is based on the paper:

Ragonnet-Cronin, M., Hodcroft, E., Hué, S. et al. Automated analysis of phylogenetic clusters. BMC Bioinformatics 14, 317 (2013). https://doi.org/10.1186/1471-2105-14-317

Hall M, Woolhouse M, Rambaut A (2015) Epidemic Reconstruction in a Phylogenetics Framework: Transmission Trees as Partitions of the Node Set. PLoS Comput Biol 11(12): e1004613. https://doi.org/10.1371/journal.pcbi.1004613

This package is the part of Dhihram Tenrisau, MSc Health Data Science summer project, ‘Phylodynamic of Norovirus in UK 2003-2023’. The project is supervised by Stéphane Hué.

By using this package, the hope will be emerging (not the the disease emerging) like the lyrics of the song ‘Ah Ça Ira’:

Ah ça ira, réjouis-toi! (It’ll be fine, rejoice!)

Ah ça ira, le bon temps viendra (It’ll be fine, good times will come)

Installation

You can install the development version of huebreaker from GitHub with:

# install.packages("devtools")
devtools::install_github("Dhihram/caIRA")

Example

This package needs the additional package tidyverse, ape, treeio, and dplyr

library(tidyverse)
library(ape)
library(dplyr)
library(treeio)
library(caIRA)

Data

You need the 2 data in this file, the first data is tree files (newick or nexus) and metadata file. The metadata file consists of label, location, and date columns. The label column must be same with the label in tree file.

#open data
tree <- read.tree("https://raw.githubusercontent.com/Dhihram/huebreaker/master/data/random_tree.nwk")
metat <- read.csv("https://raw.githubusercontent.com/Dhihram/huebreaker/master/data/random_metadata.csv")

#change date format
metat$date <- as.Date(metat$date, format = "%m/%d/%Y")

#check the data
str(tree)
#> List of 5
#>  $ edge       : int [1:38, 1:2] 21 22 23 24 24 23 25 26 27 27 ...
#>  $ edge.length: num [1:38] 0.217 0.217 0.389 0.942 0.963 ...
#>  $ Nnode      : int 19
#>  $ node.label : chr [1:19] "59" "87" "97" "100" ...
#>  $ tip.label  : chr [1:20] "t8" "t5" "t1" "t17" ...
#>  - attr(*, "class")= chr "phylo"
#>  - attr(*, "order")= chr "cladewise"
str(metat)
#> 'data.frame':    20 obs. of  4 variables:
#>  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ label   : chr  "t8" "t5" "t1" "t17" ...
#>  $ location: chr  "Area1" "Area1" "Area1" "Area1" ...
#>  $ date    : Date, format: "2020-01-31" "2020-01-05" ...

The intial tree can be seen below with ggtree package

library(ggtree)
gg <- ggtree(tree) + geom_tiplab(size = 2)+
  geom_text2(aes(subset=!isTip, label=label),
             size = 3,
             color = "#0063B1",
             hjust = 1,
             vjust = -1.5) + ggtitle("Random Phylogenetic Tree with Bootstrap Values")
gg

Package Utilization

This part of package, genclus will utilize:

  1. Finding and clustering the monophylectic groups in the tree

  2. Add the parameter of the clusters: bootstrap_treshold, data_range, and samearea

  3. Keep the maximum monophylectic groups in the cluster identify

The bootstrap_treshold is the minimum bootstrap value to be considered as a cluster. The data_range is the range of the days to be considered as a cluster. The samearea is the boolean value to consider the same area as a cluster.

res <- genclus(tree, metat, bootstrap_threshold = 80, date_range = 30, samearea = TRUE)
knitr::kable(res)
Group Tips Bootstrap ParentNode AreaName MinDate MaxDate diff NumTips
t1, t17 t1, t17 93 27 Area1 2020-01-20 2020-01-28 8 2
t14, t18 t14, t18 84 39 Area3 2020-01-23 2020-01-26 3 2

We can check the clusters in tree

# Highlight all parent nodes
for (node in res$ParentNode) {
  gg_clus <- gg +
    geom_hilight(node = res$ParentNode, fill = "gold", alpha = 0.5)
}
gg_clus

Generating from BEAST tree

Since BEAST tree has different structure of the tree, we developed beastclus. In BEAST tree, the bootstrapping value is posterior probability with scale 0-1. The beast tree is exported using read.beast from treeio In this example, I will use the Monkeypox phylogenetic tree and metadata from GISAID:

#beast tree

beast_tree <- read.beast('https://raw.githubusercontent.com/Dhihram/caIRA/refs/heads/master/inst/extdata/pox_strict_comb.tree')

#metadata
metadata<-read.csv('https://raw.githubusercontent.com/Dhihram/caIRA/refs/heads/master/inst/extdata/metadata_samp.csv')

Generating cluster

res2 <- beastclus(beast_tree, metadata, post_threshold = 0.50, date_range = 90, samearea = TRUE)
knitr::kable(res2)
ParentNode label Posterior AreaName min_date max_date NumTips dif
79 hMpxV/Indonesia/BT-Biokes-MP303/2023, hMpxV/Indonesia/JK-NIHRD-MP015/2023 0.83 Jabodetabek 2023-10-20 2023-12-12 2 53
64 hMpxV/Indonesia/JB-NIHRD-MP254/2023, hMpxV/Indonesia/JK-Biokes-MP310/2023, hMpxV/Indonesia/JK-NIHRD-MP022/2023, hMpxV/Indonesia/JK-NIHRD-MP039/2023, hMpxV/Indonesia/JK-NIHRD-MP080/2023, hMpxV/Indonesia/JK-NIHRD-MP081/2023, hMpxV/Indonesia/JK-NIHRD-MP159/2023, hMpxV/Indonesia/JK-NIHRD-MP218/2023, hMpxV/Indonesia/JK-NIHRD-MP225/2023, hMpxV/Indonesia/JK-NIHRD-MP246/2023, hMpxV/Indonesia/JK-NIHRD-MP252/2023 0.77 Jabodetabek 2023-10-23 2023-12-15 11 53
89 hMpxV/Indonesia/JK-Biokes-MP010/2024, hMpxV/Indonesia/JK-Biokes-MP011/2024, hMpxV/Indonesia/JK-Biokes-MP020/2024 0.99 Jabodetabek 2024-01-22 2024-02-17 3 26

Also, we can check the clusters in tree

tr <- ggtree(beast_tree, mrsd = min(metadata$date)) +
  theme_tree2() 
# Replace `highlight_nodes` with the node numbers or clade labels you want to highlight
n <- as.numeric(res2$ParentNode)
tr + 
  geom_highlight(node = n, fill = "lightblue", alpha = 0.5)