The power of machine learning lies in its ability to predict outcomes and uncover relationships through an automated process. Machine learning is applicable to Ki because it can identify relationships and select appropriate models to best predict continuous or categorical outcomes. The diverse statistical backgrounds of Ki's team of data scientists made it difficult to select mutually agreed upon models for analyses. The super learner algorithm, one approach to machine learning, allows for a collaborative and objective approach to evaluating models.
Machine learning is an automated process to detect patterns and predict outcomes based on given data.^{[1, 2]}
There are two main types of machine learning:^{[1]}
Predictive or supervised learning maps the relationship between covariates and outcomes. This is the type of machine learning primarily used for Ki.
Descriptive or unsupervised learning solely utilizes inputs to identify patterns in the data. This type of machine learning is also referred to as “knowledge discovery” and does not provide predictions or utilize outcome data.
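The distinction between the two types can be sketched in code. The following minimal example uses scikit-learn; the simulated data and the particular model choices (logistic regression, k-means) are illustrative assumptions, not part of the Ki workflow:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))             # covariates (inputs)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # outcome, used only by supervised learning

# Supervised: maps the relationship between covariates and the outcome, then predicts.
clf = LogisticRegression().fit(X, y)
preds = clf.predict(X)

# Unsupervised: uses only the inputs to discover structure (here, two clusters);
# no outcome data are involved and no outcome predictions are produced.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Note that the clustering step never sees `y`; it only describes patterns among the inputs.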
Commonly used machine learning algorithms include random forests, neural networks, and decision trees.^{[3]}
Key features of machine learning include:
Optimization is the process used to train machine learning algorithms: the algorithm makes predictions, the places where it makes mistakes are identified, the algorithm is adjusted, and the cycle is repeated hundreds or thousands of times to obtain a highly predictive algorithm.
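This repeated predict-evaluate-adjust cycle can be sketched with gradient descent on a simple linear model; this is a minimal illustration in NumPy on simulated data, not a description of any specific Ki analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)           # initial, poorly predictive parameters
lr = 0.1                  # step size for each adjustment
for _ in range(1000):     # repeat the cycle many times
    residual = X @ w - y                  # identify where mistakes are made
    grad = 2 * X.T @ residual / len(y)    # direction that reduces the mistakes
    w -= lr * grad                        # adjust the algorithm

# After many repetitions, w is close to the coefficients that best predict y.
```

Each pass through the loop is one repetition of the optimization process described above.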
Models created by machine learning can be highly accurate in prediction of outcomes.^{[4]}
Machine learning can use “wide” data (repeated measures for the same individual) and correlated predictor variables (with similar or related values).^{[3]}
Decision trees are a popular approach to machine learning. A tree is constructed by first determining which variable, when split into two groups, provides the best fit to the data. The process is then repeated in each of these two groups, in each of the four resulting groups, and so on, until a predetermined stopping criterion is reached.
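This recursive splitting can be sketched with scikit-learn's `DecisionTreeClassifier`, using a maximum depth as the predetermined stopping criterion; the data below are simulated purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = (X[:, 0] > 0.5).astype(int)   # outcome driven by a single binary split

# max_depth=2 is the stopping criterion: split the data once,
# then once more within each of the two resulting groups.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
accuracy = tree.score(X, y)
```

Because the simulated outcome is determined by one split on the first covariate, even a shallow tree recovers the rule.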
Figure 1 provides an example decision tree. Predictor covariates are root and parent nodes to predict the outcome of interest in the terminal leaf nodes.
Using pre-specified algorithms, a binary split is chosen for each covariate of interest; the categorization at each parent node determines which of the following leaf nodes an observation falls into.
Key requirements for utilizing a decision tree include a pre-specified rule for choosing binary splits and a predetermined stopping criterion.
Decision trees are attractive because they are interpretable: a prediction can be traced logically from the root node outward to a leaf node (Figure 1).
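That interpretability can be made concrete: scikit-learn's `export_text` prints a fitted tree as root-to-leaf if/then rules. This is a sketch on simulated data, and the covariate names ("age", "weight") are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
# Outcome is 1 only when both covariates exceed 0.5.
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int)

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
rules = export_text(tree, feature_names=["age", "weight"])
print(rules)   # each printed path from root to leaf reads as an if/then rule
```

Reading the printed rules top to bottom reproduces exactly the logical walk from root node to leaf node that makes trees easy to interpret.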
The term “ensemble methods” refers to a process by which multiple prediction functions are combined into a single prediction function.
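The simplest ensemble is an unweighted average of two fitted models' predictions, sketched below with scikit-learn on simulated data (the super learner, discussed next, instead learns the weights from the data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=150)

# Two different prediction functions fit to the same data.
linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Ensemble: combine the two prediction functions into a single one.
ensemble_pred = (linear.predict(X) + tree.predict(X)) / 2
```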
Super learner is an algorithm that finds an optimal combination of models through an objective, data-driven approach based on cross-validation.
Cross-validation splits the data into two parts, a training set and a validation set, and a regression fit relating the predictors to the outcome is developed on the training data. Evaluating each model's predictive performance on the validation data informs the final selection of models by the super learner.
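The train/validate cycle can be sketched with scikit-learn's `KFold`; the simulated data and the use of mean squared error as the performance measure are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)

mse_per_fold = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Training part: develop the regression fit.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Validation part: evaluate predictive performance on held-out data.
    resid = model.predict(X[valid_idx]) - y[valid_idx]
    mse_per_fold.append(np.mean(resid ** 2))

cv_mse = np.mean(mse_per_fold)   # the performance estimate used to compare models
```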
Super learner uses cross-validation to assess the fit of multiple models and estimates the best weighted average of the predictions made by those models. Previous research has proven that the super learner is optimal in the sense that it will predict outcomes as well as the best, unknown combination of the included models.^{[6]}
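A simplified sketch of the idea, not the full super learner implementation: obtain cross-validated predictions from each candidate model, then choose the convex weight that minimizes cross-validated squared error. The candidate models and the grid search over a single weight are illustrative assumptions for the two-model case:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X[:, 0] + np.sin(2 * X[:, 1]) + rng.normal(scale=0.1, size=300)

# Cross-validated predictions from each candidate model.
p_lin = cross_val_predict(LinearRegression(), X, y, cv=5)
p_tree = cross_val_predict(DecisionTreeRegressor(max_depth=4, random_state=0), X, y, cv=5)

# Estimate the best weighted average of the two prediction functions.
weights = np.linspace(0, 1, 101)
cv_mse = [np.mean((w * p_lin + (1 - w) * p_tree - y) ** 2) for w in weights]
best_w = weights[int(np.argmin(cv_mse))]

# best_w * linear + (1 - best_w) * tree is the combined prediction function.
```

By construction, the weighted combination performs at least as well (in cross-validated error) as either candidate model alone.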