Collecting and Identification the Outbreak Cluster

The goal of Collecting and Identification the Outbreak Cluster (caIRA) is to find the cluster based tree and metadata of the data. The package is based on the paper:

Ragonnet-Cronin, M., Hodcroft, E., Hué, S. et al. Automated analysis of phylogenetic clusters. BMC Bioinformatics 14, 317 (2013). https://doi.org/10.1186/1471-2105-14-317

Hall M, Woolhouse M, Rambaut A (2015) Epidemic Reconstruction in a Phylogenetics Framework: Transmission Trees as Partitions of the Node Set. PLoS Comput Biol 11(12): e1004613. https://doi.org/10.1371/journal.pcbi.1004613

This package is the part of Dhihram Tenrisau, MSc Health Data Science summer project, ‘Phylodynamic of Norovirus in UK 2003-2023’. The project is supervised by Stéphane Hué.

By using this package, the hope will be emerging (not the the disease emerging) like the lyrics of the song ‘Ah Ça Ira’:

Installation

You can install the development version of huebreaker from GitHub with:

# install.packages("devtools")
devtools::install_github("Dhihram/caIRA")

Example

This package needs the additional package tidyverse, ape, treeio, and dplyr

library(tidyverse)
library(ape)
library(dplyr)
library(treeio)
library(caIRA)

Data

You need the 2 data in this file, the first data is tree files (newick or nexus) and metadata file. The metadata file consists of label, location, and date columns. The label column must be same with the label in tree file.

#open data
tree <- read.tree("https://raw.githubusercontent.com/Dhihram/huebreaker/master/data/random_tree.nwk")
metat <- read.csv("https://raw.githubusercontent.com/Dhihram/huebreaker/master/data/random_metadata.csv")

#change date format
metat$date <- as.Date(metat$date, format = "%m/%d/%Y")

#check the data
str(tree)
#> List of 5
#>  $ edge       : int [1:38, 1:2] 21 22 23 24 24 23 25 26 27 27 ...
#>  $ edge.length: num [1:38] 0.217 0.217 0.389 0.942 0.963 ...
#>  $ Nnode      : int 19
#>  $ node.label : chr [1:19] "59" "87" "97" "100" ...
#>  $ tip.label  : chr [1:20] "t8" "t5" "t1" "t17" ...
#>  - attr(*, "class")= chr "phylo"
#>  - attr(*, "order")= chr "cladewise"
str(metat)
#> 'data.frame':    20 obs. of  4 variables:
#>  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ label   : chr  "t8" "t5" "t1" "t17" ...
#>  $ location: chr  "Area1" "Area1" "Area1" "Area1" ...
#>  $ date    : Date, format: "2020-01-31" "2020-01-05" ...

The intial tree can be seen below with ggtree package

library(ggtree)
gg <- ggtree(tree) + geom_tiplab(size = 2)+
  geom_text2(aes(subset=!isTip, label=label),
             size = 3,
             color = "#0063B1",
             hjust = 1,
             vjust = -1.5) + ggtitle("Random Phylogenetic Tree with Bootstrap Values")
gg

Package Utilization

This part of package, genclus will utilize:

Finding and clustering the monophylectic groups in the tree
Add the parameter of the clusters: bootstrap_treshold, data_range, and samearea
Keep the maximum monophylectic groups in the cluster identify

The bootstrap_treshold is the minimum bootstrap value to be considered as a cluster. The data_range is the range of the days to be considered as a cluster. The samearea is the boolean value to consider the same area as a cluster.

res <- genclus(tree, metat, bootstrap_threshold = 80, date_range = 30, samearea = TRUE)
knitr::kable(res)

Group	Tips	Bootstrap	ParentNode	AreaName	MinDate	MaxDate	diff	NumTips
t1, t17	t1, t17	93	27	Area1	2020-01-20	2020-01-28	8	2
t14, t18	t14, t18	84	39	Area3	2020-01-23	2020-01-26	3	2

We can check the clusters in tree

# Highlight all parent nodes
for (node in res$ParentNode) {
  gg_clus <- gg +
    geom_hilight(node = res$ParentNode, fill = "gold", alpha = 0.5)
}
gg_clus

Generating from BEAST tree

Since BEAST tree has different structure of the tree, we developed beastclus. In BEAST tree, the bootstrapping value is posterior probability with scale 0-1. The beast tree is exported using read.beast from treeio In this example, I will use the Monkeypox phylogenetic tree and metadata from GISAID:

#beast tree

beast_tree <- read.beast('https://raw.githubusercontent.com/Dhihram/caIRA/refs/heads/master/inst/extdata/pox_strict_comb.tree')

#metadata
metadata<-read.csv('https://raw.githubusercontent.com/Dhihram/caIRA/refs/heads/master/inst/extdata/metadata_samp.csv')

Generating cluster

res2 <- beastclus(beast_tree, metadata, post_threshold = 0.50, date_range = 90, samearea = TRUE)
knitr::kable(res2)

ParentNode	label	Posterior	AreaName	min_date	max_date	NumTips	dif
79	hMpxV/Indonesia/BT-Biokes-MP303/2023, hMpxV/Indonesia/JK-NIHRD-MP015/2023	0.83	Jabodetabek	2023-10-20	2023-12-12	2	53
64	hMpxV/Indonesia/JB-NIHRD-MP254/2023, hMpxV/Indonesia/JK-Biokes-MP310/2023, hMpxV/Indonesia/JK-NIHRD-MP022/2023, hMpxV/Indonesia/JK-NIHRD-MP039/2023, hMpxV/Indonesia/JK-NIHRD-MP080/2023, hMpxV/Indonesia/JK-NIHRD-MP081/2023, hMpxV/Indonesia/JK-NIHRD-MP159/2023, hMpxV/Indonesia/JK-NIHRD-MP218/2023, hMpxV/Indonesia/JK-NIHRD-MP225/2023, hMpxV/Indonesia/JK-NIHRD-MP246/2023, hMpxV/Indonesia/JK-NIHRD-MP252/2023	0.77	Jabodetabek	2023-10-23	2023-12-15	11	53
89	hMpxV/Indonesia/JK-Biokes-MP010/2024, hMpxV/Indonesia/JK-Biokes-MP011/2024, hMpxV/Indonesia/JK-Biokes-MP020/2024	0.99	Jabodetabek	2024-01-22	2024-02-17	3	26

Also, we can check the clusters in tree

tr <- ggtree(beast_tree, mrsd = min(metadata$date)) +
  theme_tree2() 
# Replace `highlight_nodes` with the node numbers or clade labels you want to highlight
n <- as.numeric(res2$ParentNode)
tr + 
  geom_highlight(node = n, fill = "lightblue", alpha = 0.5)