Module Aims:
This core module aims to provide students with a knowledge of the fundamentals of statistical theory and some experience of analysing data using statistical software. Students following the Health Data Science theme are required to take this core module instead of Principles of Biostatistics. Other students with sufficient pre-requisite knowledge may also choose to take this module.
Module Learning Outcomes:
By the end of the module, students should be able to:
- Compare and contrast frequentist and Bayesian statistical theory
- Choose appropriate models and methods for analysing a given dataset
- Explain the assumptions made by these models and methods
- Interpret the results obtained from application of these methods
- Describe how the parameters of these models are estimated
- Fit these models and apply these methods using R, and interpret the output.
Pre-requisites:
Understanding of the concepts of integration and differentiation, logarithms and exponents, and matrix inversion; ability to perform simple manipulation of vectors and matrices (e.g. addition and multiplication); good knowledge of basic concepts of probability theory (e.g. probability density functions, marginal and conditional distributions, random variables, expectation and variance).
Teaching Strategy:
Lectures and computer practicals. Some preliminary reading may be required.
Assessment:
Written assessment at the end of the module (50% of module grade).
A take-home assignment involving a dataset and a structured analysis plan, with students providing code, tables/figures/results, and written interpretation (50% of module grade).
Session List:
- General introduction to key concepts in health data science, and overview of statistics and machine learning.
- Introduction to R (basic operations, reading in data from Excel, use of R libraries).
- Probability theory and probability distributions: interpretation of probability, probability spaces, random variables, expectation, variance and covariance, conditional probability; commonly used probability distributions.
- Statistical models and likelihood: populations and samples; parametric, semi-parametric and non-parametric models; point estimation; likelihood function; maximum likelihood estimator (MLE); estimating equations.
- Frequentist framework I: repeated sampling; loss functions and risk; asymptotic properties of MLE.
- Frequentist framework II: confidence intervals; hypothesis testing; power calculations.
- Bayesian framework I: Bayes’ rule as rational updating; point estimation and credible intervals; Bayesian hypothesis testing.
- Bayesian framework II: comparison of frequentist and Bayesian frameworks; choosing prior distributions; Bayesian computation.
- Linear regression: t-tests; linear regression model; analysis of variance; modelling categorical covariates, interactions and higher-order effects; assessing model fit.
- Contingency tables and logistic regression: contingency tables for two variables; testing independence of categorical variables; logistic regression; multinomial and ordinal logistic regression; log-linear models.
- Generalised linear models (GLMs): Poisson regression; single-parameter GLMs; multi-parameter GLMs; multivariate linear regression.
- Survival analysis: censoring; Kaplan-Meier estimator; parametric proportional hazards models; Cox proportional hazards model; other survival models.
- Semi-parametric models: Cox proportional hazards model (revisited) and marginal models; quasi-likelihood and generalised estimating equations.
- Permutation tests and bootstrap: Wilcoxon rank-sum tests; permutation tests; parametric and non-parametric bootstrap.
- Recap of this module and preview of Advanced Biostatistics for HDS and Introduction to Machine Learning modules.
- Introduction to optional modules.
Module Length: 8 days over 4 weeks