# Categorical methods

### Why does Ki utilize categorical methods?

Categorical analysis is often the best fit for outcome variables with nominal or ordinal properties when describing associations relevant to healthy birth, growth, and development. Additionally, Ki can combine outcome variables from collaborators by using categories to align data that were collected differently (continuous vs. categorical) across data sources.

### WHAT ARE CATEGORICAL METHODS?

• Regression models incorporate binary or multi-category outcome variables to determine risk or log odds of one category of an outcome compared to the reference category of the same outcome. Predictors can be either categorical or continuous.
• Statistical methods that utilize categorical variables require different approaches, compared to continuous variables, due to the differences in statistical summaries and distributions.
• Categorical variables are best summarized using frequencies and percentages, which can be thought of as probabilities. The frequencies, by definition, cannot be negative (they are zero or greater).
• The natural logarithm is a commonly used transformation to ensure that modeled values stay positive while the transformed outcome is expressed as a linear function of the predictor variable(s). However, this linearity assumption can be relaxed by using flexible splines or other nonlinear functions.
• The probabilities by definition must be between 0 and 1, and several transformation “link” functions can be used to map the probabilities from (0, 1) into an unconstrained scale (-∞, +∞). Common transformations include logit and probit.
• Contingency tables, or two-by-two tables (a special case for two variables with two categories each), can provide a useful summary of categorical predictor and outcome variables. For example, Table 1 shows stunting status at birth with the corresponding maternal height category.
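
The link functions mentioned above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names are chosen here for readability and do not come from any Ki codebase.

```python
import math
from statistics import NormalDist


def logit(p: float) -> float:
    """Map a probability in (0, 1) to the log-odds scale (-inf, +inf)."""
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    """Map any real number back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def probit(p: float) -> float:
    """Map a probability in (0, 1) to the matching standard normal quantile."""
    return NormalDist().inv_cdf(p)


def inv_probit(x: float) -> float:
    """Map any real number back to a probability via the normal CDF."""
    return NormalDist().cdf(x)


if __name__ == "__main__":
    # Both links send 0.5 to zero and spread out probabilities near 0 and 1.
    for p in (0.1, 0.5, 0.9):
        print(p, round(logit(p), 3), round(probit(p), 3))
```

Both transformations are monotone, so a model fitted on the unconstrained scale can always be converted back to probabilities with the matching inverse function.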

### Distributions

• The binomial distribution is the probability distribution for the number of successful outcomes in a fixed set of independent trials, each with two possible outcomes. This distribution approximates a normal distribution when the sample size is large.
• Statistical inference uses maximum likelihood estimation to identify the parameter values (for example, the coefficients of a logistic regression) that make the observed data most likely.
• The likelihood, and therefore the parameter estimate, depends on the probability of the outcome.
• The likelihood ratio test is a hypothesis test that evaluates the difference between the estimated parameter and its null value, which is usually zero (the parameter has no impact on the probability). A related way of conceptualizing the test is to ask whether the odds ratio confidence interval includes one, meaning there is no increased odds of the outcome in the presence of the predictor. For example, we can fit two logistic regression models of stunting, one with no predictors and one with maternal height (below median/above median) as a predictor. We then compare the two models' likelihoods by referring minus twice the log of their ratio to a chi-square distribution with one degree of freedom (one parameter added to the model for maternal height).
• The chi-square (χ²) distribution is a right-skewed probability density function. Its shape depends on the number of degrees of freedom (in the simplest case n - 1, where n is the number of observations), and the total area under the curve, as for any other probability density function, equals one. Some of its statistical applications are:
  • Pearson's chi-squared test: a hypothesis test used to determine whether there is a statistical difference between the expected and observed frequencies in one or more categories. Example applications of this test include:
    • Independence test: a hypothesis test to determine whether an association is present between the two categorical variables of a contingency table (Table 1).
    • Homogeneity test: a hypothesis test to determine whether the distribution of a categorical variable differs between multiple populations. For example, this test could be used to evaluate whether the prevalence of stunting differs across several geographic regions.
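
As a rough sketch of the tests described in this list, the following Python computes Pearson's chi-squared statistic and the likelihood ratio statistic for a two-by-two table. The counts are invented for illustration and are not the values of Table 1; the function names are likewise hypothetical. For a two-by-two table the likelihood ratio statistic computed here matches the comparison of a logistic regression with one binary predictor against the model with no predictors.

```python
import math

# Hypothetical counts (NOT the values of Table 1).
# Rows: maternal height below / above median. Columns: stunted / not stunted.
HYPOTHETICAL_TABLE = [[30, 70], [15, 85]]


def expected_counts(table):
    """Expected cell counts if the row and column variables were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi2(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    exp = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
    )


def likelihood_ratio_g2(table):
    """Likelihood ratio statistic: 2 * sum(observed * ln(observed / expected))."""
    exp = expected_counts(table)
    return 2.0 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
        if o > 0  # empty cells contribute nothing to the sum
    )


def p_value_1df(stat):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


if __name__ == "__main__":
    chi2 = pearson_chi2(HYPOTHETICAL_TABLE)
    g2 = likelihood_ratio_g2(HYPOTHETICAL_TABLE)
    print("Pearson chi-square:", round(chi2, 3), "p =", round(p_value_1df(chi2), 4))
    print("Likelihood ratio G2:", round(g2, 3), "p =", round(p_value_1df(g2), 4))
```

The closed-form p-value only applies to one degree of freedom, which is the case for a two-by-two table; larger tables have (rows - 1) × (columns - 1) degrees of freedom and would need a general chi-square tail function from a statistics library.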

#### Ordinal Logistic Regression

Ordinal logistic regression uses independent variables (predictors) to predict the odds of the outcome falling in a given response category when the dependent variable has ordered categories.

This model assumes proportionality of the odds across the categories of the response variable. In other words, the effect of the predictor is the same across the different categories, which means that for a given change in the predictor, the odds of moving from one category to the next are the same regardless of the starting category. The test for proportionality is discussed further and displayed in the HBGDki example below; the assumption can be relaxed if it does not hold.
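
To make the proportional odds idea concrete, here is a minimal sketch of a cumulative logit model with three ordered categories. It assumes one common parameterization, P(Y ≤ j | x) = inv_logit(cutpoint_j − beta · x); the cutpoints and coefficient are hypothetical numbers chosen for illustration, not estimates from any HBGDki model.

```python
import math

# Hypothetical parameters for illustration only.
CUTPOINTS = [-1.5, 0.5]  # one cutpoint between each pair of adjacent categories
BETA = 0.4               # a single slope shared by every cutpoint


def inv_logit(x: float) -> float:
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probabilities(cutpoints, beta, x):
    """Probability of each ordered category under the proportional odds model."""
    cumulative = [inv_logit(c - beta * x) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for cum in cumulative:
        probs.append(cum - previous)  # difference of adjacent cumulative probabilities
        previous = cum
    return probs


def cumulative_odds(cutpoints, beta, x):
    """Odds of being at or below each cutpoint: P(Y <= j) / P(Y > j)."""
    return [math.exp(c - beta * x) for c in cutpoints]


if __name__ == "__main__":
    print(category_probabilities(CUTPOINTS, BETA, x=1.0))
    odds_at_0 = cumulative_odds(CUTPOINTS, BETA, x=0.0)
    odds_at_1 = cumulative_odds(CUTPOINTS, BETA, x=1.0)
    # Under proportional odds, the odds ratio is identical at every cutpoint.
    print([after / before for before, after in zip(odds_at_0, odds_at_1)])
```

Because the same slope is applied at every cutpoint, a one-unit change in the predictor multiplies each cumulative odds by the same factor, which is exactly what the proportionality assumption states.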

The parameters are easily interpretable (probabilities or odds of outcome).

• If continuous variables are categorized, a level of detail is lost.
• Categories may lack clinical relevance, may include too few observations, or may result in empty cells when many categories are created.

### Ki UTILIZATION OF CATEGORICAL METHODS

#### Ordered categorical model for LAZ

As an example, an HBGDki model with a categorical outcome variable for LAZ (stunted (LAZ < -2), at risk for stunting (LAZ between -2 and -1), and not stunted (LAZ ≥ -1)) is regressed on continuous and categorical predictors (including age, mother's height, presence of enteric pathogens in stool, % energy from protein, enrollment LAZ, and other important variables). Figure 1 illustrates the categorization of LAZ.

The LAZ categories are created according to the cutoffs, then the percent of each category is calculated across months of age.

For example (Figure 1), at age 0 months, 37 infants were below -2 (green points) out of a total of 230 infants (shown in gray). This translates to the 16% shown in the lower section of Figure 1. The probability of being stunted (LAZ < -2) increases with age.
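
The categorization and percentage calculation can be sketched as follows. The cutoffs and the 37 of 230 figure come from the text above; the function names and the category labels are illustrative assumptions, and a LAZ of exactly -2 is treated as at risk because stunting is defined as LAZ < -2.

```python
from collections import Counter

CATEGORIES = ("stunted", "at risk", "not stunted")


def laz_category(laz: float) -> str:
    """Assign a LAZ value to one of the three ordered categories."""
    if laz < -2:
        return "stunted"
    if laz < -1:
        return "at risk"
    return "not stunted"


def category_percentages(laz_values):
    """Percentage of observations in each category for one age group."""
    counts = Counter(laz_category(v) for v in laz_values)
    n = len(laz_values)
    return {cat: 100.0 * counts[cat] / n for cat in CATEGORIES}


if __name__ == "__main__":
    # Counts reported in the text for age 0 months.
    stunted_at_birth, total_at_birth = 37, 230
    print(round(100 * stunted_at_birth / total_at_birth))  # 16, matching Figure 1
```

Applying `category_percentages` separately to the LAZ values observed in each month of age would give the kind of percentages by age that the lower section of Figure 1 is described as showing.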

Ordinal regression analysis is utilized because of the natural order of the constructed LAZ categories.

A piecewise linear spline for age with breakpoints at 6-month intervals was necessary to describe the nonlinear relationship between age and the probability of each LAZ category.
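
One way to build such a spline is with hinge terms, where each breakpoint adds a term that is zero before the breakpoint and grows linearly after it. The sketch below assumes breakpoints at 6, 12, and 18 months (the upper age limit is an assumption made here, not stated in the text), and the coefficients in the example are hypothetical.

```python
# Assumed breakpoints in months; the covered age range is hypothetical.
KNOTS = [6.0, 12.0, 18.0]


def linear_spline_basis(age_months: float, knots=KNOTS):
    """Return [age, max(0, age - k1), max(0, age - k2), ...] for one age."""
    return [age_months] + [max(0.0, age_months - k) for k in knots]


def spline_value(age_months: float, intercept: float, coefs, knots=KNOTS):
    """Linear predictor: intercept plus the weighted sum of the basis terms."""
    basis = linear_spline_basis(age_months, knots)
    return intercept + sum(c * b for c, b in zip(coefs, basis))


if __name__ == "__main__":
    # Hypothetical coefficients: slope 1 until 6 months, then flat afterwards.
    coefs = [1.0, -1.0, 0.0, 0.0]
    for age in (0.0, 3.0, 6.0, 12.0, 15.0):
        print(age, linear_spline_basis(age), spline_value(age, 0.0, coefs))
```

Each coefficient after the first represents the change in slope at its breakpoint, so the fitted curve stays continuous while its slope is allowed to differ within each 6-month interval.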

Figure 2 illustrates the data over time and how to assess goodness of fit. It shows the model fit for age (x-axis) as a predictor of the probability of each LAZ category (y-axis) and demonstrates that the model fits well, as indicated by the overlap between the gray 95% confidence intervals and the observed circular points.

Figure 3 assesses the proportional odds assumption through the overlap between the odds (solid square) and the two LAZ categories (triangles). Departures from proportionality were not large enough to influence the effect estimates, as demonstrated by the substantial overlap in the confidence intervals. This is illustrated using the 5 most important predictors.

###### FIGURE 1. Categorization of LAZ outcome variable