The goal of Collecting and Identification the Outbreak Cluster (caIRA) is to find the cluster based tree and metadata of the data. The package is based on the paper:
Ragonnet-Cronin, M., Hodcroft, E., Hué, S. et al. Automated analysis of phylogenetic clusters. BMC Bioinformatics 14, 317 (2013). https://doi.org/10.1186/1471-2105-14-317
Hall M, Woolhouse M, Rambaut A (2015) Epidemic Reconstruction in a Phylogenetics Framework: Transmission Trees as Partitions of the Node Set. PLoS Comput Biol 11(12): e1004613. https://doi.org/10.1371/journal.pcbi.1004613
This package is the part of Dhihram Tenrisau, MSc Health Data Science summer project, ‘Phylodynamic of Norovirus in UK 2003-2023’. The project is supervised by Stéphane Hué.
By using this package, the hope will be emerging (not the the disease emerging) like the lyrics of the song ‘Ah Ça Ira’:
Ah ça ira, réjouis-toi! (It’ll be fine, rejoice!)
Ah ça ira, le bon temps viendra (It’ll be fine, good times will come)
You can install the development version of huebreaker from GitHub with:
This package needs the additional package tidyverse,
ape, treeio, and dplyr
You need the 2 data in this file, the first data is tree files
(newick or nexus) and metadata file. The metadata file consists of
label, location, and date
columns. The label column must be same with the label in
tree file.
#open data
tree <- read.tree("https://raw.githubusercontent.com/Dhihram/huebreaker/master/data/random_tree.nwk")
metat <- read.csv("https://raw.githubusercontent.com/Dhihram/huebreaker/master/data/random_metadata.csv")
#change date format
metat$date <- as.Date(metat$date, format = "%m/%d/%Y")
#check the data
str(tree)
#> List of 5
#> $ edge : int [1:38, 1:2] 21 22 23 24 24 23 25 26 27 27 ...
#> $ edge.length: num [1:38] 0.217 0.217 0.389 0.942 0.963 ...
#> $ Nnode : int 19
#> $ node.label : chr [1:19] "59" "87" "97" "100" ...
#> $ tip.label : chr [1:20] "t8" "t5" "t1" "t17" ...
#> - attr(*, "class")= chr "phylo"
#> - attr(*, "order")= chr "cladewise"
str(metat)
#> 'data.frame': 20 obs. of 4 variables:
#> $ X : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ label : chr "t8" "t5" "t1" "t17" ...
#> $ location: chr "Area1" "Area1" "Area1" "Area1" ...
#> $ date : Date, format: "2020-01-31" "2020-01-05" ...The intial tree can be seen below with ggtree
package
library(ggtree)
gg <- ggtree(tree) + geom_tiplab(size = 2)+
geom_text2(aes(subset=!isTip, label=label),
size = 3,
color = "#0063B1",
hjust = 1,
vjust = -1.5) + ggtitle("Random Phylogenetic Tree with Bootstrap Values")
ggThis part of package, genclus will utilize:
Finding and clustering the monophylectic groups in the tree
Add the parameter of the clusters:
bootstrap_treshold, data_range, and
samearea
Keep the maximum monophylectic groups in the cluster identify
The bootstrap_treshold is the minimum bootstrap value to
be considered as a cluster. The data_range is the range of
the days to be considered as a cluster. The samearea is the
boolean value to consider the same area as a cluster.
res <- genclus(tree, metat, bootstrap_threshold = 80, date_range = 30, samearea = TRUE)
knitr::kable(res)| Group | Tips | Bootstrap | ParentNode | AreaName | MinDate | MaxDate | diff | NumTips |
|---|---|---|---|---|---|---|---|---|
| t1, t17 | t1, t17 | 93 | 27 | Area1 | 2020-01-20 | 2020-01-28 | 8 | 2 |
| t14, t18 | t14, t18 | 84 | 39 | Area3 | 2020-01-23 | 2020-01-26 | 3 | 2 |
We can check the clusters in tree
# Highlight all parent nodes
for (node in res$ParentNode) {
gg_clus <- gg +
geom_hilight(node = res$ParentNode, fill = "gold", alpha = 0.5)
}
gg_clusSince BEAST tree has different structure of the tree, we developed
beastclus. In BEAST tree, the bootstrapping value is
posterior probability with scale 0-1. The beast tree is exported using
read.beast from treeio In this example, I will
use the Monkeypox phylogenetic tree and metadata from GISAID:
#beast tree
beast_tree <- read.beast('https://raw.githubusercontent.com/Dhihram/caIRA/refs/heads/master/inst/extdata/pox_strict_comb.tree')
#metadata
metadata<-read.csv('https://raw.githubusercontent.com/Dhihram/caIRA/refs/heads/master/inst/extdata/metadata_samp.csv')Generating cluster
res2 <- beastclus(beast_tree, metadata, post_threshold = 0.50, date_range = 90, samearea = TRUE)
knitr::kable(res2)| ParentNode | label | Posterior | AreaName | min_date | max_date | NumTips | dif |
|---|---|---|---|---|---|---|---|
| 79 | hMpxV/Indonesia/BT-Biokes-MP303/2023, hMpxV/Indonesia/JK-NIHRD-MP015/2023 | 0.83 | Jabodetabek | 2023-10-20 | 2023-12-12 | 2 | 53 |
| 64 | hMpxV/Indonesia/JB-NIHRD-MP254/2023, hMpxV/Indonesia/JK-Biokes-MP310/2023, hMpxV/Indonesia/JK-NIHRD-MP022/2023, hMpxV/Indonesia/JK-NIHRD-MP039/2023, hMpxV/Indonesia/JK-NIHRD-MP080/2023, hMpxV/Indonesia/JK-NIHRD-MP081/2023, hMpxV/Indonesia/JK-NIHRD-MP159/2023, hMpxV/Indonesia/JK-NIHRD-MP218/2023, hMpxV/Indonesia/JK-NIHRD-MP225/2023, hMpxV/Indonesia/JK-NIHRD-MP246/2023, hMpxV/Indonesia/JK-NIHRD-MP252/2023 | 0.77 | Jabodetabek | 2023-10-23 | 2023-12-15 | 11 | 53 |
| 89 | hMpxV/Indonesia/JK-Biokes-MP010/2024, hMpxV/Indonesia/JK-Biokes-MP011/2024, hMpxV/Indonesia/JK-Biokes-MP020/2024 | 0.99 | Jabodetabek | 2024-01-22 | 2024-02-17 | 3 | 26 |
Also, we can check the clusters in tree
tr <- ggtree(beast_tree, mrsd = min(metadata$date)) +
theme_tree2()
# Replace `highlight_nodes` with the node numbers or clade labels you want to highlight
n <- as.numeric(res2$ParentNode)
tr +
geom_highlight(node = n, fill = "lightblue", alpha = 0.5)