Machine learning algorithms

Why does Ki utilize machine learning?

The power of machine learning is in its ability to predict outcomes and uncover relationships in an automated process. Machine learning is applicable to Ki because of its ability to identify relationships and appropriate models to best predict continuous or categorical outcomes. The diversity of statistical backgrounds of the team of data scientists within Ki made it difficult to select mutually agreed upon models for analyses. The super learner algorithm, one approach to machine learning, allows for a collaborative and objective approach to evaluating models.


Machine learning is an automated process to detect patterns and predict outcomes based on given data.[1, 2]

There are two main types of machine learning:[1]

  1. Predictive or supervised learning maps the relationship between covariates and outcomes. This is the type of machine learning primarily used for Ki.

  2. Descriptive or unsupervised learning solely utilizes inputs to identify patterns in the data. This type of machine learning is also referred to as “knowledge discovery” and does not provide predictions or utilize outcome data.

Commonly used machine learning algorithms include: random forests, neural networks, and decision trees.[3]

Assumptions of machine learning include:

Optimization is a process repeated hundreds or thousands of times to train machine learning algorithms, identifying where mistakes are made during each repetition, adjusting the algorithm, and repeating the process to obtain a highly predictive algorithm.


Advantages of machine learning

Disadvantages of machine learning



Ensemble of decision trees

Decision trees are a popular approach to machine learning. Decision trees are constructed by first determining which variable provides the best fit to the data when split into two groups. The process is then repeated in each of these two groups, in each of the four resulting groups, and so on, until a predetermined stopping criterion is reached.

Figure 1 provides an example decision tree. Predictor covariates are root and parent nodes to predict the outcome of interest in the terminal leaf nodes.

Figure 1. Example of a decision tree

Using pre-specified algorithms, binary splits for each covariate of interest inform how each parent node categorization influences the following leaf node.

Key requirements for utilizing a decision tree include:

Decision trees are attractive due to their interpretability of logically working from root nodes outward to leaf nodes (Figure 1).

The term “ensemble methods” refers to a process by which multiple prediction functions are combined into a single prediction function.


Super learner

Super learner is an algorithm for finding optimal combinations of models with an objective, data-driven approach based on cross validation.

Cross-validation includes two parts, training and validation, to develop a regression fit of the predictor(s) and outcome covariates. Evaluation of the performance of the model fit for prediction informs the final selection of models by the super learner.

Super learner uses cross-validation to assess the fit of multiple models and estimates the best weighted average of the predictions made by the models. Previous research has proven the super learner is optimal in the sense that it will predict outcomes as well as the unknown best combination of models included.[6]

Resource Links


  1. Murphy K. Machine learning: A probabilistic perspective. Cambridge, Massachusetts. 2012.
  2. LeDell E. Ensemble Learning at Scale: Software, hardware and algorithmic approaches. Presented at: Machine Learning Conference. September 2016; Seattle, WA.
  3. SAS. Machine Learning: What it is & why it matters. Accessed October 14, 2017.
  4. Larose D, Larose C. Data Mining and Predictive Analytics. 2nd ed: IEEE Press 2015.
  5. Personal Communication. Sergey Feldman. January 12, 2018.
  6. Van der Laan MJ, Polley EC, Hubbard AE. Super learner. Statistical applications in genetics and molecular biology. 2007;6:Article25.