- The binomial distribution is the probability distribution for the number of successful outcomes in a fixed number of independent trials, each with two possible outcomes. When the sample size is large, this distribution is well approximated by a normal distribution.
- Statistical inference uses maximum likelihood estimation to identify the parameter values (for example, the coefficients of a logistic regression) that make the observed data most likely.
- The parameter estimate (the maximum likelihood estimate) is tied directly to the probability of the outcome.
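As a minimal sketch (using made-up counts), the idea of maximum likelihood can be shown for a single binomial probability: a grid search over candidate values recovers the same answer as the closed-form estimate, the observed proportion k/n.

```python
import math

def binom_loglik(p, k, n):
    """Binomial log-likelihood in p (combinatorial constant dropped)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Hypothetical data: 37 stunted infants out of 230
k, n = 37, 230

# Grid search over candidate probabilities for the maximizer
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: binom_loglik(p, k, n))

# The maximum likelihood estimate matches the observed proportion k / n
print(round(p_hat, 3), round(k / n, 3))
```

The grid search is only illustrative; in practice the estimate is obtained analytically or by a numerical optimizer.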
- The likelihood ratio test is a hypothesis test to evaluate the difference between the observed parameter and its null value, which is usually zero (the parameter has no impact on the probability). Another way of conceptualizing the likelihood ratio test is as a test of whether the odds ratio confidence interval includes one, that is, whether there is no increase in the odds of the outcome in the presence of the predictor. For example, we can have two logistic regression models of stunting, one with no predictors and one with maternal height (below median/above median) as a predictor. We then compare the two models' likelihoods by comparing twice the log of their ratio to a chi-square distribution with one degree of freedom (one parameter added in the model testing maternal height).
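The maternal-height example above can be sketched with made-up counts (all numbers hypothetical). For binary data with a single binary predictor, the log-likelihoods of the null (common probability) and alternative (one probability per group) models have closed forms, and the likelihood ratio statistic is compared to a chi-square distribution with one degree of freedom:

```python
import math

def loglik(k, n):
    """Bernoulli log-likelihood evaluated at the MLE p = k / n."""
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Hypothetical counts: stunted / total, by maternal height group
k_below, n_below = 40, 100   # mothers below median height
k_above, n_above = 20, 100   # mothers above median height

# Null model: one common probability of stunting for everyone
ll_null = loglik(k_below + k_above, n_below + n_above)

# Alternative model: a separate probability per height group
# (one extra parameter, hence one degree of freedom)
ll_alt = loglik(k_below, n_below) + loglik(k_above, n_above)

# Likelihood ratio statistic: twice the log of the likelihood ratio
lr_stat = 2 * (ll_alt - ll_null)

# Upper-tail chi-square probability with 1 df: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_stat / 2))
print(round(lr_stat, 2), round(p_value, 4))
```

With these counts the statistic is large and the p-value small, so maternal height would be retained as a predictor; in practice the log-likelihoods come from fitted logistic regression models rather than closed-form proportions.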
- The chi-square (χ2) distribution is a probability density function that is right skewed. Its shape depends on the number of degrees of freedom (defined as n [observations] minus 1), and the total area under the curve equals one, as for any probability density function. Some of its statistical applications are:
- Pearson’s Chi-squared test: This is a hypothesis test used to determine whether the observed frequencies in one or more categories differ from the expected frequencies. Example applications of this test include:
- Independence test: A hypothesis test to determine whether an association is present between two categorical variables summarized in a contingency table (Table 1).
- Homogeneity test: This is a hypothesis test to determine whether the distribution of a variable differs across multiple populations. For example, this test could be used to evaluate whether the prevalence of stunting differs when comparing several geographic regions.
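A minimal sketch of Pearson's chi-squared test of independence on a hypothetical 2×2 table (all counts made up): expected counts are computed under independence, and the statistic is the usual sum of squared deviations scaled by the expected counts.

```python
import math

# Hypothetical 2x2 contingency table: rows = region A/B, columns = stunted yes/no
observed = [[30, 70],
            [18, 82]]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
grand = sum(row_tot)

# Expected count under independence: (row total * column total) / grand total
stat = 0.0
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / grand
        stat += (observed[i][j] - expected) ** 2 / expected

# A 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom;
# upper-tail chi-square probability with 1 df: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(stat / 2))
print(round(stat, 2), round(p_value, 3))
```

For larger tables the degrees of freedom become (rows − 1) × (columns − 1) and the tail probability is usually taken from a statistics library rather than the 1-df shortcut used here.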
Ordinal Logistic Regression
Ordinal logistic regression uses independent variables (predictors) to predict the odds of the outcome falling into each category of the dependent (response) variable, when that variable has ordered categories.
This model assumes the proportionality of odds across the categories of the response variable. In other words, the effect of the predictor is the same across the different categories: for a given change in the predictor, the odds of passing from one category to the next are the same regardless of the starting category. The test for proportionality is discussed further and displayed in the HBGDki example below, and the assumption can be relaxed if it does not hold.
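A minimal sketch (with made-up thresholds and coefficient) of how the proportional-odds model turns cumulative logits into category probabilities, and why the same odds ratio applies at every category cutoff:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def logit(p):
    return math.log(p / (1 - p))

def category_probs(x, thresholds, beta):
    """Proportional-odds model: P(Y <= j) = sigmoid(theta_j - beta * x);
    category probabilities are successive differences of the cumulative ones."""
    cum = [sigmoid(th - beta * x) for th in thresholds] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical setup: 3 ordered categories (2 cutoffs), one predictor
thresholds = [-1.0, 0.5]
beta = 0.8

p0 = category_probs(0.0, thresholds, beta)   # probabilities at x = 0
p1 = category_probs(1.0, thresholds, beta)   # probabilities at x = 1

# The cumulative log-odds at each cutoff shift by exactly -beta * (x1 - x0),
# i.e., one odds ratio describes the predictor's effect at every cutoff
shifts = [logit(sigmoid(th - beta * 1.0)) - logit(sigmoid(th - beta * 0.0))
          for th in thresholds]
print([round(s, 6) for s in shifts])   # both shifts equal -beta
```

If the shifts differed across cutoffs, the proportional odds assumption would be violated and a less restrictive model (for example, one coefficient per cutoff) could be used instead.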
Advantages of categorical methods
- The parameters are easily interpretable (probabilities or odds of the outcome).
Disadvantages of categorical methods
- If continuous variables are categorized, a level of detail is lost.
- Categories may lack clinical relevance, may include too few observations, or may result in empty cells when many categories are created.
Ki UTILIZATION OF CATEGORICAL METHODS
Ordered categorical model for LAZ
As an example, an HBGDki model with a categorical outcome variable for LAZ (stunted (LAZ < -2), at risk for stunting (LAZ between -2 and -1), and not stunted (LAZ ≥ -1)) is regressed on continuous and categorical predictors (including age, mother’s height, presence of enteric pathogens in stool, % energy from protein, enrollment LAZ, and other important variables). To illustrate the categorization of the LAZ, see Figure 1.
The LAZ categories are created according to these cutoffs, and the percentage in each category is then calculated across months of age.
For example (Figure 1), at age 0 months, 37 of 230 infants (shown in gray) had LAZ below -2 (green points). This translates to the 16% shown in the lower section of Figure 1. The probability of being stunted (LAZ < -2) increases over time.
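The categorization step can be sketched with a handful of made-up LAZ values (the cutoffs are the ones defined above; everything else is hypothetical):

```python
# Hypothetical LAZ values for a few infants at one month of age
laz = [-2.6, -1.5, -0.3, 0.4, -2.1, -1.1, -0.8, 1.2]

def categorize(z):
    """Apply the report's cutoffs: stunted, at risk, not stunted."""
    if z < -2:
        return "stunted"
    elif z < -1:
        return "at risk"
    return "not stunted"

cats = [categorize(z) for z in laz]

# Percentage in each ordered category, as plotted per month of age in Figure 1
for label in ("stunted", "at risk", "not stunted"):
    pct = 100 * cats.count(label) / len(cats)
    print(f"{label}: {pct:.0f}%")
```

Repeating this at every month of age yields the per-category percentages shown in the lower section of Figure 1.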
Ordinal regression analysis is utilized because of the natural order of the constructed LAZ categories.
A piecewise linear spline of age with breakpoints at 6-month intervals was necessary to describe the nonlinear relationship between age and the probability of each LAZ category.
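A minimal sketch of such a spline basis (knot locations follow the 6-month intervals described above; the coefficients are made up): each knot adds a hinge term, so the slope of age can change at every breakpoint while the curve stays continuous.

```python
def spline_basis(age, knots):
    """Piecewise linear (truncated line) basis: age plus one hinge per knot."""
    return [age] + [max(age - k, 0.0) for k in knots]

# Knots every 6 months of age, as in the model described above
knots = [6, 12, 18]

# Hypothetical coefficients: one per basis term (intercept omitted for brevity)
coefs = [0.05, -0.03, 0.02, -0.01]

def linear_predictor(age):
    """Contribution of age to the model's linear predictor."""
    return sum(c * b for c, b in zip(coefs, spline_basis(age, knots)))

# Slope before the first knot is 0.05; between 6 and 12 months the first
# hinge is active, so the slope becomes 0.05 - 0.03 = 0.02
print(round(linear_predictor(7) - linear_predictor(6), 4))   # 0.02
```

In the ordinal regression, this age contribution enters the cumulative logits alongside the other predictors, letting the modeled category probabilities bend at each 6-month breakpoint.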
Figure 2 illustrates the data over time and how to assess goodness of fit. The model fit for age (x-axis) as a predictor of the probability of each LAZ category (y-axis) demonstrates that the model fits well, shown by the overlap between the gray 95% confidence intervals and the observed data (circular points).
Figure 3 assesses the proportional odds assumption by the overlap between the overall odds (solid squares) and the odds estimated separately at the two LAZ category cutoffs (triangles). The departures from proportionality were not large enough to influence the effect estimates, as demonstrated by the substantial overlap of the confidence intervals. This was illustrated using the 5 most important predictors.
FIGURE 1. Categorization of LAZ outcome variable 
FIGURE 2. Goodness of Fit example from MAL-ED study. The median 95% CI helps visualize the fit of the model.
FIGURE 3. Proportionality of odds assumption example from MAL-ED study