Machine learning algorithms
Why does Ki utilize machine learning?
The power of machine learning lies in its ability to predict outcomes and uncover relationships through an automated process. Machine learning is applicable to Ki because it can identify relationships and select appropriate models to best predict continuous or categorical outcomes. The diverse statistical backgrounds of Ki's team of data scientists made it difficult to select mutually agreed-upon models for analyses. The super learner algorithm, one approach to machine learning, allows for a collaborative and objective approach to evaluating models.
WHAT IS MACHINE LEARNING?
Machine learning is an automated process to detect patterns and predict outcomes based on given data.[1, 2]
There are two main types of machine learning:
Predictive or supervised learning maps the relationship between covariates and outcomes. This is the type of machine learning primarily used for Ki.
Descriptive or unsupervised learning solely utilizes inputs to identify patterns in the data. This type of machine learning is also referred to as “knowledge discovery” and does not provide predictions or utilize outcome data.
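The distinction can be sketched with a toy example. In this hypothetical snippet, the same one-dimensional data are handled both ways: the supervised approach uses the outcome labels to learn a classification rule, while the unsupervised approach ignores the labels and groups points purely by proximity. The threshold rule and two-center grouping are illustrative simplifications, not methods used by Ki.

```python
# Toy 1-D data: three low values and three high values.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = [0, 0, 0, 1, 1, 1]  # outcomes, available only to supervised learning

# Supervised: use the labels to learn a classification rule (a midpoint threshold).
cut = (max(p for p, l in zip(points, labels) if l == 0) +
       min(p for p, l in zip(points, labels) if l == 1)) / 2

def predict(x):
    return int(x > cut)

# Unsupervised: ignore labels; group each point by its nearer of two centers.
lo, hi = min(points), max(points)
groups = [int(abs(p - hi) < abs(p - lo)) for p in points]

print(predict(3.0), groups)  # → 0 [0, 0, 0, 1, 1, 1]
```

Note that the unsupervised grouping happens to recover the same two clusters as the labels, but it never produces an outcome prediction — it only describes structure in the inputs.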
Commonly used machine learning algorithms include: random forests, neural networks, and decision trees.
Assumptions of machine learning include:
The data used to develop a prediction algorithm come from the same population to whom the prediction algorithm will be applied.
As with any statistical method, correlation does not necessarily imply causation, and particular care must be taken not to over-interpret results.
Some machine learning methods require more data to achieve stability than traditional statistical approaches do.
Optimization is the process used to train machine learning algorithms: hundreds or thousands of times, the algorithm's mistakes are identified, the algorithm is adjusted, and the cycle is repeated to obtain a highly predictive algorithm.
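This identify-adjust-repeat cycle can be sketched with one of the simplest optimizers, gradient descent on a one-parameter model y ≈ w·x. The data, learning rate, and model are illustrative assumptions, not anything specific to Ki.

```python
# Toy data where the outcome is exactly 2 * covariate.
data = [(1, 2), (2, 4), (3, 6)]
w, rate = 0.0, 0.02  # initial parameter and learning rate

for step in range(1000):  # repeat the cycle many times
    # Identify mistakes: the gradient of the mean squared error at the current w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    # Adjust the algorithm's parameter, then repeat.
    w -= rate * grad

print(round(w, 3))  # → 2.0
```

Each pass measures where the current model errs (the gradient) and nudges the parameter to reduce that error; after many repetitions the parameter converges to the value that predicts best.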
Advantages of machine learning
Models created by machine learning can be highly accurate in prediction of outcomes.
Machine learning can use “wide” data (repeated measures for the same individual) and correlated predictor variables (with similar or related values).
Disadvantages of machine learning
Models produced through this methodology can be difficult to understand from a statistical perspective.
Specific predictor parameter estimates may have minimal or no direct interpretation to explain biological relationships. This contrasts with linear or logistic regression, which provide interpretable parameter estimates (assuming a properly specified model).
Ki UTILIZATION OF MACHINE LEARNING
Ensemble of decision trees
Decision trees are a popular approach to machine learning. Decision trees are constructed by first determining which variable provides the best fit to the data when split into two groups. The process is then repeated in each of these two groups, in each of the four resulting groups, and so on, until a predetermined stopping criterion is reached.
Figure 1 provides an example decision tree. Predictor covariates form the root and parent nodes, which are used to predict the outcome of interest at the terminal leaf nodes.
Root node (RN) is the starting node in the decision tree.
Parent node (PN) follows the root node and leaf nodes follow the parent node.
Leaf nodes (LN), depending on the diversity of the data, can be the terminal node or can be further partitioned.
Figure 1. Example of a decision tree
Using pre-specified algorithms, binary splits for each covariate of interest inform how each parent node categorization influences the following leaf node.
Key requirements for utilizing a decision tree include:
It is supervised learning, and therefore requires pre-identified covariates and outcome(s).
The covariates of the training data (a subset of the analysis data) should be heterogeneous (spanning a wide range of covariate values) and numerous to ensure a wide cross-section of records is represented.
Decision trees are optimized once: the best split among all variables is identified for all the data, and the process then proceeds down each branch of the tree until the predetermined stopping criterion is reached. This approach is considered “greedy,” as each split is optimized locally.
The outcome variable identified can be discrete (binary or categorical) or continuous.
Decision trees are attractive because of their interpretability: one can logically work from the root node outward to the leaf nodes (Figure 1).
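The greedy construction described above can be sketched for a binary outcome and a single numeric covariate. The misclassification criterion, the toy data, and the maximum-depth stopping rule are illustrative assumptions for brevity; real implementations use richer impurity measures and handle many covariates.

```python
def misclass(rows):
    """Misclassification count if a node predicts its majority class."""
    ones = sum(y for _, y in rows)
    return min(ones, len(rows) - ones)

def best_split(rows):
    """Locally optimal ('greedy') binary split, or None if no split improves fit."""
    best_t, best_err = None, misclass(rows)
    for t in sorted({x for x, _ in rows}):
        left = [r for r in rows if r[0] <= t]
        right = [r for r in rows if r[0] > t]
        if left and right and misclass(left) + misclass(right) < best_err:
            best_t, best_err = t, misclass(left) + misclass(right)
    return best_t

def grow(rows, depth=0, max_depth=2):
    """Split recursively until pure, no improving split, or the stopping criterion."""
    t = best_split(rows)
    if t is None or depth == max_depth:
        # Terminal leaf node: predict the majority class.
        return round(sum(y for _, y in rows) / len(rows))
    return (t,
            grow([r for r in rows if r[0] <= t], depth + 1, max_depth),
            grow([r for r in rows if r[0] > t], depth + 1, max_depth))

def predict(tree, x):
    """Work from the root node outward to a leaf node."""
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree

data = [(1, 0), (2, 0), (3, 1), (4, 1), (5, 1), (6, 0)]
tree = grow(data)
print([predict(tree, x) for x, _ in data])  # → [0, 0, 1, 1, 1, 0]
```

The first split (at x ≤ 2) is chosen because it gives the best fit over all the data; each resulting group is then split again independently, which is exactly the locally optimized, “greedy” behavior the text describes.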
The term “ensemble methods” refers to a process by which multiple prediction functions are combined into a single prediction function.
An unweighted ensemble of decision trees, also referred to as a random forest, is constructed from many decision trees and results in a model with less variance. This approach minimizes the overfitting that occurs with the use of a single decision tree.
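The ensemble idea can be sketched with decision stumps (single-split trees) standing in for full trees — a simplification; a real random forest also samples a random subset of covariates at each split. Each stump is fit to a bootstrap resample of the data, and the ensemble prediction is the unweighted average of the stumps' votes. All names and the toy data are illustrative.

```python
import random
random.seed(0)

# Toy data: one covariate x and a binary outcome y (y = 1 when x > 5).
data = [(x, 1 if x > 5 else 0) for x in range(11)]

def fit_stump(sample):
    """Fit a one-split tree: the threshold with the fewest misclassifications."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in sample}):
        errs = sum(int(x > t) != y for x, y in sample)
        if best_err is None or errs < best_err:
            best_t, best_err = t, errs
    return best_t

def forest_predict(x, thresholds):
    """Unweighted ensemble: average the votes of all stumps."""
    return sum(x > t for t in thresholds) / len(thresholds)

# Bootstrap aggregation: fit each stump to a resample drawn with replacement.
thresholds = [fit_stump(random.choices(data, k=len(data))) for _ in range(25)]
print(forest_predict(8, thresholds), forest_predict(2, thresholds))
```

Any single bootstrap stump may place its split poorly, but averaging many of them smooths those errors out — the variance reduction the text attributes to the ensemble.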
Super learner is an algorithm for finding optimal combinations of models with an objective, data-driven approach based on cross validation.
Cross-validation includes two parts, training and validation, to develop a regression fit relating the predictor covariate(s) to the outcome. Evaluating how well each fitted model predicts informs the super learner's final selection of models.
Training is conducted utilizing subsets of a data set to train models and determine how well they work within the subset.
Validation assesses the performance of various models in prediction of outcomes with a remaining subset of the full data set that was not utilized for training.
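The train/validate split can be sketched as V-fold cross-validation on toy data, with a deliberately trivial “model” (predict the mean outcome of the training subset) so the mechanics stand out. The data set, fold scheme, and model are illustrative assumptions.

```python
# Toy (covariate, outcome) pairs with outcome = 2 * covariate.
data = [(x, 2 * x) for x in range(10)]
V = 5  # number of folds

def fit(train):
    """'Train' a trivial model: the mean outcome of the training subset."""
    return sum(y for _, y in train) / len(train)

fold_errors = []
for v in range(V):
    # Validation fold: every V-th record; training subset: all the rest.
    valid = data[v::V]
    train = [r for i, r in enumerate(data) if i % V != v]
    model = fit(train)
    # Validation: mean squared error on records never used for training.
    mse = sum((y - model) ** 2 for _, y in valid) / len(valid)
    fold_errors.append(mse)

cv_risk = sum(fold_errors) / V
print(round(cv_risk, 2))  # → 37.5
```

Because each record is predicted by a model that never saw it during training, the averaged fold error is an honest estimate of how the model would perform on new data — which is what the super learner uses to compare candidate models.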
Super learner uses cross-validation to assess the fit of multiple models and estimates the best weighted average of the predictions made by those models. Previous research has proven that the super learner is optimal in the sense that it predicts outcomes as well as the unknown best combination of the included models.
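The weighting step can be sketched with two toy candidate learners (a mean model and a through-the-origin line) and a grid search over convex weights that minimizes cross-validated error. A real super learner implementation uses richer candidate libraries and a proper optimizer for the weights; everything here is an illustrative assumption.

```python
# Toy data: roughly linear outcome with a small alternating disturbance.
data = [(x, 2 * x + (-1) ** x) for x in range(10)]
V = 5  # number of cross-validation folds

def fit_mean(train):
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def fit_line(train):
    # Least-squares line through the origin: y ≈ a * x.
    a = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return lambda x, a=a: a * x

# Cross-validated predictions: each record is predicted by candidate models
# trained with its fold held out.
preds = {}
for v in range(V):
    train = [r for i, r in enumerate(data) if i % V != v]
    models = (fit_mean(train), fit_line(train))
    for i in range(v, len(data), V):
        preds[i] = [m(data[i][0]) for m in models]

def cv_mse(w):
    """Cross-validated risk of the weighted average: w*mean + (1-w)*line."""
    return sum((w * preds[i][0] + (1 - w) * preds[i][1] - data[i][1]) ** 2
               for i in range(len(data))) / len(data)

# Grid search for the convex weight minimizing cross-validated risk.
best_w = min((w / 20 for w in range(21)), key=cv_mse)
print(best_w)
```

Because the line model predicts these data far better than the mean model, the data-driven search places nearly all the weight on the line — the super learner's objective way of letting model performance, rather than analyst preference, decide the final combination.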