Hines, O; (2023) Assumption-Lean Inference for Causal and Statistical Questions in the Era of Machine Learning. PhD thesis, London School of Hygiene & Tropical Medicine. DOI: https://doi.org/10.17037/PUBS.04670988
Abstract
Owing to the advent of sophisticated machine learning methods that excel at prediction modelling tasks, the field of statistics finds itself at a crossroads. Rather than pure prediction, the goal of statistics is usually more fundamental: to answer scientifically motivated questions of interest, e.g. in the fields of epidemiology, sociology, psychology and economics. Traditionally, parametric statistical models have been used to frame and answer such questions, since model parameters often act as convenient and interpretable summaries of the aspects of the data which are of interest. This has led to an uneasy tension between choosing complicated models that more accurately reflect the relationships between the variables of interest versus choosing simpler models that provide greater scientific interpretability. To overcome this tension, a so-called ‘roadmap’ was developed in which analysis is centred around target ‘estimands’ rather than model parameters. In this context, estimands are nonparametrically defined mappings of the true data generating distribution, which quantitatively answer scientific questions of interest. According to the roadmap, estimand inference is carried out using machine learning based estimators for requisite statistical functionals, or else, more rarely, under limited semi-parametric assumptions. These developments are quite revolutionary and have heralded new directions in how data is analysed. It is my view that for the roadmap to be successful it is necessary to enrich the space of available estimands, which at present is relatively unexplored. More often than not, estimands are proposed and interpreted within the framework of causal inference, with the average treatment effect of a binary exposure on an outcome being a canonical example. Extensions related to treatment effect heterogeneity and continuous exposures, however, are limited and this thesis makes contributions in both of these settings.
Moreover, when considering potential estimands, it remains unclear to what extent efficiency and model extrapolation concerns should be prioritised against the scientific relevance of the estimand. This thesis studies questions of this type, e.g. by considering optimal estimands that minimise nonparametric efficiency bounds, and by considering score-based inference approaches that perform well when normality of the estimator breaks down. I argue that, in many cases, greater scientific insight can be gained by focussing on estimands that are less ambitious, in the sense that they pose questions about counterfactual worlds which are more similar to our own. These estimands can often be estimated with greater efficiency and with a lesser reliance on correct modelling of statistical functionals.
Item Type | Thesis |
---|---|
Thesis Type | Doctoral |
Thesis Name | PhD |
Contributors | Díaz-Ordaz, K and Vansteelandt, S |
Faculty and Department | Faculty of Epidemiology and Population Health > Dept of Medical Statistics |
Funder Name | Medical Research Council |
Copyright Holders | Oliver Hines |
Download
Filename: 2022_EPH_PhD_HINES_O.pdf
Licence: Creative Commons: Attribution-Noncommercial-No Derivative Works 4.0