Deelder, WA; (2022) Machine learning methods for infectious diseases: Applications for Tuberculosis and Malaria. PhD thesis, London School of Hygiene & Tropical Medicine. DOI: https://doi.org/10.17037/PUBS.04670671
Abstract
Infectious diseases such as malaria (caused by Plasmodium spp parasites) and tuberculosis (TB, caused by Mycobacterium bacteria) are major public health challenges, being leading causes of death worldwide, particularly in low-income countries. The genomes of the underlying causal pathogens contain valuable information to guide clinical treatment and programmatic control decision making. Whole genome sequencing (WGS) has therefore emerged as an increasingly common approach to characterize genetic mutations (e.g., single nucleotide polymorphisms; SNPs) and understand the diversity of these pathogens. However, WGS leads to high dimensional datasets (“big data”). Some established statistical methods are less suited to such big data analysis, and machine learning (ML) approaches offer a promising alternative for modelling and inference. In this thesis, I explore the application of ML methods, including deep learning, to WGS datasets for malaria parasites (P. falciparum and P. vivax) and M. tuberculosis bacteria. For M. tuberculosis (n=17k; >100k SNPs; genome size 4.4 Mbp), I applied non-parametric classification-tree and gradient-boosted-tree models to predict drug resistance across 14 anti-TB drugs. For established first-line drugs, the models had high predictive ability (area under the receiver operating curve > 0.85), and included SNPs in candidate genes linked to drug-resistance. For drugs with less established knowledge, I developed a customized decision tree approach (“Treesist-TB”), which performs TB drug resistance prediction by extracting and evaluating genomic variants across multiple studies. Treesist-TB revealed both known and novel putative SNPs for resistance and had improved predictive sensitivity compared to a widely-used TB mutation database (TB-Profiler tool). For P. falciparum (n>1,100; >74k SNPs; genome size 26.8 Mbp) and P. vivax (n>350, >125k SNPs; genome size 23.3 Mbp), I developed an image-based convolutional neural network (CNN) approach (“DeepSweep”), with the aim of identifying genetic regions subject to recent positive selection, such as those linked to the onset of antimalarial drug resistance. DeepSweep detected genetic regions proximal to known and suspected drug resistance loci for both P. falciparum (e.g., pfcrt, pfdhps and pfmdr1) and P. vivax (e.g., pvmrp1), and detected signals overlapping with those from two established extended haplotype homozygosity methods. Finally, I applied ML approaches, including CNNs, to predict the geographic origin of P. falciparum and P. vivax infections at different levels of geographic granularity (continents, countries, GPS locations). Classification methods had the lowest distance errors, and >90% accuracy at a country level, thereby demonstrating the utility of ML approaches for detecting imported infections and the geo-classification of malaria parasites. Overall, these applications demonstrate the potential of ML methods to extract new insights from large WGS datasets and assist infection control. However, there are risks in applying ML methods on WGS data “out of the box” without context-specific adaptation of the algorithms. My work demonstrates how adaptation of standard ML methods can lead to better predictions and more interpretable results, offering greater assistance to infection control decision making.
Item Type | Thesis |
---|---|
Thesis Type | Doctoral |
Thesis Name | PhD |
Contributors | Clark, TG and Palla, L |
Faculty and Department | Faculty of Epidemiology and Population Health > Dept of Medical Statistics |
Copyright Holders | Wouter André Deelder |
Download
Filename: 2022_EPH_PhD_Deelder_W.pdf
Licence: Creative Commons: Attribution-Noncommercial-No Derivative Works 4.0