Categorical Data Analysis


Applied Clinical Trials

05-01-2011
Volume 20
Issue 5

A review of different statistical methods for data types and models used during research.

In a clinical trial, substantial amounts of data are recorded on each subject, such as the patient's demographic characteristics, disease-related risk factors, medical history, biochemical markers, medical therapies, and outcome or endpoint data at various time points. These data can be categorical or continuous. Understanding the type of data is important because it determines which method of data analysis to use and how to report the results.1

For the assessment of the effectiveness and safety of an investigational pharmaceutical entity, ideally the scale for the primary clinical endpoint should be numerically continuous to provide an accurate and reliable assessment. In practice, however, it is often impossible, or extremely expensive, to measure responses quantitatively. On the other hand, patients' responses to treatments can be easily documented according to the occurrence of some meaningful and well-defined event, such as death, infection, cure of a certain disease, or any serious adverse event. In addition, the intensity of these events can be graded according to predefined categories. Therefore, categorical data can be useful surrogate endpoints for some unobserved latent continuous variables in clinical trials. Sometimes, to provide an easier analysis or a better presentation of the results, continuous data are transformed to categorical data with respect to some predefined criteria. As a result, many efficacy and safety endpoints in clinical trials are in the form of categorical data on either a nominal or ordinal scale.2 Different statistical methods are used to analyze these data in clinical trials.

Categorical data arise when individuals are categorized into one of two or more mutually exclusive groups.3 Data can be divided into two main types: quantitative and qualitative. Quantitative data tend to be either continuous variables that one can measure (such as height, weight, or blood pressure) or discrete variables (such as number of children per family or number of asthma attacks per child per month). Count data are discrete and quantitative. Qualitative data tend to be categories: people are male or female; European, American, or Japanese; and they have a disease or are in good health. These are examples of categorical data. Four types of scales appear in the social sciences: nominal, ordinal, interval, and ratio. They fall into two groups, categorical and continuous: nominal and ordinal scales are categorical, and interval and ratio scales are continuous. Categorical data with unordered categories are on a nominal scale; hair color is a good example. Categorical data with ordered categories are on an ordinal scale; rank is an example. The two should be distinguished because the method of data analysis depends on the scale of measurement.4


The statistical methods used to analyze categorical data are described below, together with the limitations of each method.

Univariate and bivariate statistical methods

Chi-square test. The chi-square test of independence is used to test the association between two categorical variables. The idea behind the test is to compare the observed frequencies with the frequencies that would be expected if the null hypothesis of no association (statistical independence) were true. Under the assumption that the variables are independent, an expected frequency can be computed for each cell in the contingency table. A large value of the test statistic indicates poor agreement between the observed and expected frequencies, and the null hypothesis of independence is rejected. In clinical trials, for example, the test can be used to assess the association between the occurrence of an adverse event and the treatment used. The chi-square test assumes independent random sampling, that no more than 20% of the cells have an expected frequency less than five, and that there are no empty cells. A significant result shows that an association exists, but the test does not give the degree or strength of that association. When more than 20% of the cells have an expected frequency less than five, the usual chi-square test is not valid, and Fisher's exact test is used instead to test the association among variables. This method also fails to give the strength of the association.
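The expected-frequency calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a validated implementation: the function names and the counts are hypothetical, the p-value shortcut applies only to one degree of freedom (a 2x2 table), and the two-sided Fisher's exact p-value assumes the common convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
import math

def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

def chi_square_p_value_df1(statistic):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed
        return math.comb(col1, k) * math.comb(n - col1, row1 - k) / math.comb(n, row1)

    p_observed = prob(a)
    lowest, highest = max(0, row1 + col1 - n), min(row1, col1)
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_observed * (1 + 1e-9))

# Hypothetical counts: rows = treatment, placebo; columns = adverse event yes, no
table = [[30, 70], [15, 85]]
statistic, df, expected = chi_square_independence(table)
# Count cells that would violate the "expected frequency less than five" rule
cells_below_five = sum(e < 5 for row in expected for e in row)
print(statistic, df, chi_square_p_value_df1(statistic), cells_below_five)
```

Checking the expected counts before reading the p-value mirrors the assumption stated above; when too many cells fall below five, the exact test is the fallback.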

Cochran-Armitage trend test. An investigator may be interested in the trend in response rate across the different doses of a drug. The Cochran-Armitage trend test is used for this purpose. It tests for trends in binomial proportions across levels of a single factor or covariate. The test is appropriate for a two-way table in which one variable has two levels and the other variable is ordinal. The two-level variable represents the response, and the other variable represents an explanatory variable with ordered levels.
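As a rough sketch of how such a trend statistic can be computed, the Python function below uses the usual large-sample form of the test with a two-sided normal p-value. The function name, the dose scores (0, 1, 2), and the counts are assumptions made for illustration only.

```python
import math

def cochran_armitage(events, totals, scores):
    """Trend test for binomial proportions across ordered groups.

    events[i] = number of responders in group i
    totals[i] = number of subjects in group i
    scores[i] = numeric score assigned to ordered level i (e.g., dose)
    """
    n = sum(totals)
    p_bar = sum(events) / n  # pooled response rate under the null hypothesis
    # Score-weighted sum of (observed - expected) responders
    t = sum(s * (r - m * p_bar) for s, r, m in zip(scores, events, totals))
    weighted_scores = sum(m * s for s, m in zip(scores, totals))
    weighted_squares = sum(m * s * s for s, m in zip(scores, totals))
    variance = p_bar * (1 - p_bar) * (weighted_squares - weighted_scores ** 2 / n)
    z = t / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return z, p_value

# Hypothetical trial: three increasing doses, 50 subjects each
z, p_value = cochran_armitage(events=[10, 20, 30], totals=[50, 50, 50], scores=[0, 1, 2])
print(z, p_value)
```

The choice of scores is part of the analysis plan; equally spaced scores are used here only because the example has no real dose values.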

McNemar test. In clinical trials, we often test for an improvement in response rate after a particular treatment. The statistical test used for this purpose is the McNemar test. It is a test on a 2x2 classification table, used to test the difference between paired proportions (e.g., in studies in which patients serve as their own control, or in studies with a before-and-after design). The assumptions are that sample members are randomly drawn from the population, that within-group sample scores are independent of each other, and that no expected frequency is less than five. The Cochran Q test is an extension of the McNemar test for related samples that provides a method for testing the differences between three or more matched sets of frequencies or proportions.
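The McNemar statistic depends only on the two discordant cells of the paired 2x2 table. The sketch below assumes the uncorrected chi-square form with one degree of freedom; the function name and the counts are hypothetical.

```python
import math

def mcnemar(b, c):
    """McNemar chi-square (no continuity correction).

    b = pairs that changed from negative to positive
    c = pairs that changed from positive to negative
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the statistic is undefined")
    statistic = (b - c) ** 2 / (b + c)
    # Upper-tail p-value of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical before/after study: 25 patients improved, 10 worsened
statistic, p_value = mcnemar(b=25, c=10)
print(statistic, p_value)
```

Concordant pairs (same response before and after) carry no information about change, which is why they do not appear in the formula.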

Kappa statistics. Kappa statistics are used for measuring agreement among raters. In clinical measurement, comparison of a new measurement technique with an established one is often needed to check whether they agree sufficiently for the new to replace the old. Correlation is often misleading for this purpose because it measures association rather than agreement.5

The kappa coefficient (κ) is used to assess inter-rater agreement. One of the most important features of the kappa statistic is that it is a measure of agreement that naturally controls for chance. If there is complete agreement, κ = 1. If the observed agreement is greater than or equal to chance agreement, κ ≥ 0. If the observed agreement is less than or equal to chance agreement, κ ≤ 0. The kappa coefficient (κ) can be classified as follows:

  • Poor agreement = Less than 0.20

  • Fair agreement = 0.20 to 0.40

  • Moderate agreement = 0.40 to 0.60

  • Good agreement = 0.60 to 0.80

  • Very good agreement = 0.80 to 1.00
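To make the chance correction concrete, the following Python sketch computes kappa for two raters from a square agreement table and attaches the labels listed above. The function names and counts are hypothetical, and the handling of the shared boundary values (0.20, 0.40, and so on) is an assumption, since the list above does not say which band they belong to.

```python
def cohen_kappa(table):
    """Kappa for two raters; table[i][j] = subjects rated i by rater 1 and j by rater 2."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    observed = sum(table[i][i] for i in range(len(table))) / n
    # Agreement expected by chance if the two raters were independent
    chance = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - chance) / (1 - chance)

def agreement_label(kappa):
    """Map kappa to the bands listed above (boundary handling is an assumption)."""
    if kappa < 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"

# Hypothetical ratings of 100 subjects by two raters (positive / negative)
table = [[40, 10], [5, 45]]
kappa = cohen_kappa(table)
print(kappa, agreement_label(kappa))
```

In this example the raters agree on 85% of subjects, but half of that agreement would be expected by chance alone, so kappa is well below 0.85.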

Wilcoxon signed-rank test. The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test for the case of two related samples or repeated measurements on a single sample. It can be used as an alternative to the paired Student's t-test when the population cannot be assumed to be normally distributed.

Mann–Whitney U test. The Mann–Whitney U test (also called the Mann–Whitney–Wilcoxon (MWW) test, Wilcoxon rank-sum test, or Wilcoxon–Mann–Whitney test) is a non-parametric test for assessing whether two independent samples of observations come from the same distribution. It is one of the best-known non-parametric significance tests. It was proposed initially by Wilcoxon (1945) for equal sample sizes, and extended to arbitrary sample sizes and in other ways by Mann and Whitney (1947). It can be used as an alternative to the independent-samples Student's t-test when the population cannot be assumed to be normally distributed.
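The U statistic itself has a simple counting interpretation, sketched below in Python: it is the number of (x, y) pairs in which the observation from the first sample exceeds the one from the second, with ties counted as one half. The function name and data are hypothetical, and the sketch stops at the statistic rather than computing a p-value.

```python
def mann_whitney_u(x, y):
    """Return (U for sample x, U for sample y) by direct pair counting."""
    u_x = sum(
        1.0 if xi > yj else 0.5 if xi == yj else 0.0
        for xi in x
        for yj in y
    )
    # The two U statistics always add up to the number of pairs
    u_y = len(x) * len(y) - u_x
    return u_x, u_y

# Hypothetical measurements from two independent treatment groups
group_a = [1.1, 2.3, 3.0]
group_b = [2.0, 4.5, 5.2, 6.1]
print(mann_whitney_u(group_a, group_b))
```

A U value close to zero or close to the total number of pairs indicates that one sample tends to lie entirely above or below the other.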

Kruskal-Wallis test. The Kruskal-Wallis test is a non-parametric method for testing equality of population medians among groups. Intuitively, it is identical to a one-way analysis of variance with the data replaced by their ranks. It is an extension of the Mann-Whitney U test to three or more groups.

Friedman's test. A non-parametric test (distribution-free) used to compare observations repeated on the same subjects. This test is an alternative to the repeated measures ANOVA, when the assumption of normality or equality of variance is not met. This, like many non-parametric tests, uses the ranks of the data rather than their raw values to calculate the statistic. If there are only two measures for this test, it is equivalent to the sign test.

Odds ratio (OR). The odds ratio is the ratio of the odds of an event occurring in one group to the odds of it occurring in another group. It is used to assess the risk of a particular outcome (or disease) if a certain factor (or exposure) is present. The odds ratio is a relative measure of risk, telling us how much more likely it is that someone who is exposed to the factor under study will develop the outcome as compared to someone who is not exposed. For a 2x2 table:

  • OR=1 means the exposed group is as likely as the unexposed group to get the disease.

  • OR>1 means the exposed group is more likely than the unexposed group to get the disease.

  • OR<1 means the exposed group is less likely than the unexposed group to get the disease.

The odds ratio can be used in both retrospective and prospective studies.

Relative risk (RR). Relative risk is a ratio of the probability of the event occurring in the exposed group versus a non-exposed group. In clinical trials, it is used to compare the risk of developing a disease in people not receiving the treatment (or receiving a placebo) versus people who are receiving the treatment. Alternatively, it is used to compare the risk of developing a side effect in people receiving a drug as compared to the people who are not receiving the treatment. For a 2x2 table:

  • RR=1 means there is no difference in the risk of getting the disease between the two groups.

  • RR>1 means the risk of getting the disease is higher in the exposed group than in the unexposed group.

  • RR<1 means the risk of getting the disease is lower in the exposed group than in the unexposed group.

Relative risk can be used only in prospective studies. For a rare disease, the RR is approximately equal to the OR.
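Both measures can be computed directly from the four cells of a 2x2 table. The Python sketch below uses hypothetical function names and counts, with a = exposed with disease, b = exposed without disease, c = unexposed with disease, and d = unexposed without disease; the second example illustrates how the two measures converge when the disease is rare.

```python
def odds_ratio(a, b, c, d):
    """Odds of disease among the exposed divided by odds among the unexposed."""
    return (a * d) / (b * c)

def relative_risk(a, b, c, d):
    """Risk of disease among the exposed divided by risk among the unexposed."""
    return (a / (a + b)) / (c / (c + d))

# Hypothetical common outcome: 20% risk if exposed, 10% if unexposed
print(odds_ratio(20, 80, 10, 90), relative_risk(20, 80, 10, 90))

# Hypothetical rare outcome: 0.2% risk if exposed, 0.1% if unexposed
print(odds_ratio(2, 998, 1, 999), relative_risk(2, 998, 1, 999))
```

With the common outcome the odds ratio overstates the relative risk, whereas with the rare outcome the two are nearly identical.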

Sensitivity, specificity, PPV, and NPV

Sensitivity. The probability of the test finding disease among those who have the disease or the proportion of people with disease that are positive with the test.

Specificity. The probability of the test finding no disease among those who do not have the disease or the proportion of people free of a disease who have a negative test.

Positive predictive value (PPV). The percentage of people with a positive test result who actually have the disease.

Negative predictive value (NPV). The percentage of people with a negative test who do not have the disease.
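The four definitions above translate directly into ratios of the cells of a 2x2 table of test result against true disease status. The sketch below is illustrative only; the function name and the counts are hypothetical.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV, and NPV from a 2x2 diagnostic table.

    tp = diseased, test positive      fp = not diseased, test positive
    fn = diseased, test negative      tn = not diseased, test negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # positives among the diseased
        "specificity": tn / (tn + fp),  # negatives among the non-diseased
        "ppv": tp / (tp + fp),          # diseased among the test positives
        "npv": tn / (tn + fn),          # non-diseased among the test negatives
    }

# Hypothetical screening study: 100 diseased and 900 disease-free subjects
measures = diagnostic_measures(tp=90, fp=30, fn=10, tn=870)
for name, value in measures.items():
    print(name, round(value, 4))
```

Sensitivity and specificity condition on the true disease status, while PPV and NPV condition on the test result, so the predictive values change with the prevalence of disease in the sample.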

Regression methods used in clinical trials

Binary logistic regression. A form of regression used when the dependent variable is dichotomous; that is, the dependent variable takes the value 1 with probability of success θ, or the value 0 with probability of failure 1 − θ.

The independent or predictor variables in logistic regression can take any form. That is, logistic regression makes no assumption about the distribution of the independent variables. They do not have to be normally distributed, linearly related, or of equal variance within each group. The relationship between the predictor and response variables is not a linear function in logistic regression; instead, the logistic regression function is used, which is the logit transformation of θ:

logit(θ) = ln[θ / (1 − θ)] = α + β1x1 + β2x2 + ... + βkxk

where α is the constant of the equation, β1, ..., βk are the coefficients of the predictor variables, and x1, ..., xk are the predictor variables.
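As a small illustration of the logit transformation and its inverse, the Python sketch below converts a constant α and coefficients β into a predicted probability. The function names and the parameter values are hypothetical and are not estimates from any real trial.

```python
import math

def logit(theta):
    """Log-odds of a probability theta (0 < theta < 1)."""
    return math.log(theta / (1.0 - theta))

def inverse_logit(eta):
    """Probability corresponding to a log-odds value eta."""
    return 1.0 / (1.0 + math.exp(-eta))

def predicted_probability(alpha, betas, xs):
    """Probability of success for predictor values xs under a logistic model."""
    eta = alpha + sum(b * x for b, x in zip(betas, xs))  # linear predictor
    return inverse_logit(eta)

# Hypothetical model with one predictor: alpha = -1.0, beta = 0.5
alpha, betas = -1.0, [0.5]
for x in (0.0, 2.0, 4.0):
    print(x, predicted_probability(alpha, betas, [x]))

# exp(beta) is the odds ratio for a one-unit increase in the predictor
print(math.exp(betas[0]))
```

The linear predictor can take any real value, while the inverse logit always returns a probability between 0 and 1, which is why the model is linear on the log-odds scale rather than on the probability scale.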

Goodness of fit test6

  • Hosmer and Lemeshow chi-square test of goodness of fit. If the p value of H-L goodness of fit test is greater than 0.05, we fail to reject the null hypothesis that there is no difference between observed and model predicted values, implying that the model's estimates fit the data at an acceptable level.

  • Omnibus tests of model coefficients. It tests if the model with the predictors is significantly different from the model with only the intercept. The omnibus test may be interpreted as a test of the capability of all predictors in the model jointly to predict the response variable. A finding of significance corresponds to a research conclusion that there is adequate fit of the data to the model, meaning that at least one of the predictors is significantly related to the response variable.

  • Lowest AIC (Akaike Information Criterion). AIC is commonly used to compare models. The model with the lower AIC is the better model.

  • Lowest BIC (Bayesian Information Criterion). BIC is also used to compare models; the model with the lower BIC is the better model.
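The two information criteria can be computed from a model's maximized log-likelihood using their standard definitions, AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, where k is the number of estimated parameters and n the number of observations. The Python sketch below compares two hypothetical models; the log-likelihood values are made up for illustration.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fits to the same 200 subjects
small_model = {"log_likelihood": -120.5, "n_params": 3}
large_model = {"log_likelihood": -119.8, "n_params": 5}

for name, model in (("small", small_model), ("large", large_model)):
    print(name, aic(**model), bic(**model, n_obs=200))
```

Here the larger model improves the log-likelihood only slightly, so both criteria favor the smaller model; BIC penalizes the extra parameters more heavily than AIC once n exceeds about 7.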

Conditional logistic regression. It is used to investigate the relationship between an outcome and a set of prognostic factors in matched case-control studies. The outcome is whether the subject is a case or a control. If each matched set contains only one case and one control, the matching is 1:1.

Multinomial logistic regression. An extension of binary logistic regression, used when the dependent variable has more than two nominal (unordered) categories. The dependent variable is dummy coded into multiple 1/0 variables, one for every category but one, so M categories give M-1 dummy variables. Each category's dummy variable has a value of 1 for its category and 0 for all others. One category, the reference category, does not need its own dummy variable because it is uniquely identified by all the other variables being 0. Multinomial logistic regression then estimates a separate binary logistic regression model for each of the dummy variables, giving M-1 binary logistic regression models. Each one gives the effect of the predictors on the probability of success in that category in comparison with the reference category. Each model has its own intercept and regression coefficients, so the predictors can affect each category differently.
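Given the M-1 linear predictors, one per non-reference category, the probabilities of all M categories can be recovered as sketched below. This assumes the usual baseline-category formulation in which the reference category has a linear predictor of zero; the function name and the input values are hypothetical.

```python
import math

def multinomial_probabilities(linear_predictors):
    """Category probabilities from M-1 log-odds versus the reference category.

    The returned list has M entries; the first is the reference category.
    """
    exps = [math.exp(eta) for eta in linear_predictors]
    denominator = 1.0 + sum(exps)  # the 1.0 is exp(0) for the reference category
    return [1.0 / denominator] + [e / denominator for e in exps]

# Hypothetical three-category outcome: log-odds of categories 2 and 3 versus category 1
probabilities = multinomial_probabilities([0.0, math.log(2.0)])
print(probabilities, sum(probabilities))
```

Each linear predictor would itself be an intercept plus coefficients times predictor values, with a separate set of coefficients for each non-reference category.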

Poisson regression. A form of regression analysis used to model count data. It assumes that the response variable y has a Poisson distribution and that the logarithm of its expected value can be modeled by a linear combination of unknown parameters. The assumptions include:7 the logarithm of the disease rate changes linearly with equal increments in the exposure variable; changes in the rate from the combined effects of different exposures or risk factors are multiplicative; at each level of the covariates, the number of cases has variance equal to the mean; and observations are independent.

Model goodness of fit7

  • A plot of observed versus predicted values.

  • Deviance residuals versus the predicted values.

  • Global goodness-of-fit statistics, which test the null hypothesis that the model fits against the alternative hypothesis that it does not, are given by the Pearson chi-squared and deviance test statistics in the SAS PROC GENMOD output. Large values of these statistics, with small p-values, are evidence that the model does not fit the observed data.

  • The Poisson regression model assumes that the variance is equal to the mean. A violation of this assumption can be detected by dividing the chi-squared test statistic by its degrees of freedom. If the Poisson assumption holds, the ratio should be approximately one; a ratio larger than one indicates over-dispersed counts.
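The dispersion check in the last item can be sketched as follows, using the Pearson chi-squared statistic divided by its degrees of freedom (number of observations minus number of estimated parameters). The function name and counts are hypothetical, and for simplicity the fitted values come from an intercept-only model, for which the fitted mean is just the sample mean.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-squared statistic divided by its degrees of freedom."""
    pearson = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    degrees_of_freedom = len(observed) - n_params
    return pearson / degrees_of_freedom

def intercept_only_fit(observed):
    """Fitted means of a Poisson model with no covariates (the sample mean)."""
    mean = sum(observed) / len(observed)
    return [mean] * len(observed)

# Hypothetical counts roughly consistent with a Poisson distribution
counts = [0, 1, 2, 3, 4]
print(pearson_dispersion(counts, intercept_only_fit(counts), n_params=1))

# Hypothetical counts with far more spread than the mean allows
spread_counts = [0, 0, 1, 9, 10]
print(pearson_dispersion(spread_counts, intercept_only_fit(spread_counts), n_params=1))
```

In practice the fitted values would come from the regression model itself, and the same ratio can be formed with the deviance in place of the Pearson statistic.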

Loglinear models. The loglinear model is one of the specialized cases of generalized linear models for Poisson-distributed data. Loglinear analysis is an extension of the two-way contingency table in which the conditional relationship between two or more discrete, categorical variables is analyzed by taking the natural logarithm of the cell frequencies within a contingency table.

Cochran-Mantel-Haenszel (CMH) test. It is used to test conditional independence in 2x2xK tables. It is a non-model-based test used to identify confounders and to control for confounding in the statistical analysis. The CMH test can be generalized to IxJxK tables.
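For the 2x2xK case, the statistic pools the difference between observed and expected counts across the K strata. The Python sketch below assumes the form without a continuity correction and a chi-square reference distribution with one degree of freedom; the function name and the stratum counts are hypothetical.

```python
import math

def cmh_test(strata):
    """CMH statistic for a list of 2x2 tables [[a, b], [c, d]], one per stratum."""
    sum_observed = sum_expected = sum_variance = 0.0
    for (a, b), (c, d) in strata:
        n1, n2 = a + b, c + d          # row totals (e.g., treatment groups)
        m1, m2 = a + c, b + d          # column totals (e.g., event yes / no)
        n = n1 + n2
        sum_observed += a
        sum_expected += n1 * m1 / n    # expected top-left count under independence
        sum_variance += n1 * n2 * m1 * m2 / (n ** 2 * (n - 1))
    statistic = (sum_observed - sum_expected) ** 2 / sum_variance
    p_value = math.erfc(math.sqrt(statistic / 2.0))  # chi-square, 1 df
    return statistic, p_value

# Hypothetical two-center trial: rows = treatment, control; columns = response yes, no
strata = [
    [[10, 10], [5, 15]],
    [[8, 12], [4, 16]],
]
print(cmh_test(strata))
```

Because observed and expected counts are compared within each stratum before pooling, the stratifying variable cannot confound the comparison.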

Linear mixed models (LMM). LMM handles data in which observations are not independent. That is, LMM correctly models correlated errors, whereas procedures in the general linear model (GLM) family usually do not. LMM is a further generalization of GLM. Random factors are categorical variables for which only a random sample of the possible category values is measured. Random effects models have one or more random factors, and optional covariates, as their only predictors. Fixed factors are categorical variables for which all possible category values (levels) are measured. Fixed effects models have only fixed factors and optional covariates as predictors. Mixed models have both fixed and random factors, as well as optional covariates, as predictors. Hierarchical linear models (HLM) are a type of mixed model for hierarchical data, that is, data that exist at more than one level. Random coefficients (RC) models, also called multi-level regression models, are a type of mixed model for hierarchical data in which each group at the higher level is assumed to have its own regression slopes as well as its own intercept for predicting an individual-level dependent variable.

Generalized estimating equations (GEE). The method of generalized linear models (GLM) is an integral part of the data analyst's toolkit, as it brings many models under one roof: logistic and probit regression, ordinary least squares, ordinal outcome regression, regression models for the analysis of survival data, and others. However, it is inadequate when the data are longitudinal or otherwise grouped, so that observations within the same group are expected to be correlated. The method of generalized estimating equations (GEE) is a generalization of GLM that takes this within-group correlation into account. The GEE method is a practical strategy for the analysis of repeated measurements, particularly categorical repeated measurements. It provides a way to handle continuous explanatory variables, a moderate number of explanatory categorical variables, and time-dependent explanatory variables. It handles missing values; that is, the number of measurements in each cluster can vary from 1 to t. The following are the important properties of the GEE method:8

  • GEEs reduce to GLM estimating equations for ti = 1, that is, when each cluster i contains a single measurement.

  • GEEs are the maximum likelihood score equations for multivariate Gaussian data when you specify unstructured correlation.

  • The regression parameter estimates are consistent as the number of clusters becomes large, even if you have misspecified the working correlation matrix, as long as the model for the mean is correct.

  • The empirical sandwich estimator of the covariance matrix of the regression parameter estimates is also consistent as the number of clusters becomes large, even if you have misspecified the working correlation matrix, as long as the model for the mean is correct.

Correlation structure. In choosing the best correlation structure, we offer the following general guidelines.8 If the size of the panels is small and the data are complete, use the unstructured correlation specification. If the observations within a panel are collected for the same primary sampling unit (PSU) over time, use a specification that also has time dependence. If the observations are clustered (not collected over time), use the exchangeable correlation structure. If the number of panels is small, the independence model may be the best, but calculate the sandwich estimate of variance for use with hypothesis tests and interpretation of coefficients. If more than one correlation specification satisfies the above descriptions, use the QIC measure to discern the best choice.

Of course, if there is motivating scientific evidence of a particular correlation structure, then that specification should be used. The QIC measure, like any model selection criterion, should not be blindly followed.
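To show what these working correlation specifications look like, the Python sketch below builds the matrices for a panel of t repeated measurements. The function names and the value of rho are hypothetical, and the first-order autoregressive (AR(1)) form is used here only as one common example of a structure with time dependence; it is an assumption, not a structure named in the guidelines above.

```python
def independence(t):
    """No within-panel correlation: the identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(t)] for i in range(t)]

def exchangeable(t, rho):
    """Every pair of measurements in a panel shares the same correlation rho."""
    return [[1.0 if i == j else rho for j in range(t)] for i in range(t)]

def ar1(t, rho):
    """Correlation decays as rho ** (time lag) between measurements."""
    return [[float(rho ** abs(i - j)) for j in range(t)] for i in range(t)]

# Hypothetical panel of three visits with rho = 0.5
for name, matrix in (
    ("independence", independence(3)),
    ("exchangeable", exchangeable(3, 0.5)),
    ("AR(1)", ar1(3, 0.5)),
):
    print(name)
    for row in matrix:
        print("  ", row)
```

An unstructured specification would instead estimate a separate correlation for every pair of time points, which is why it is practical only when panels are small and complete.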

Proportional odds model. In the proportional odds model, we consider each category in turn and compare the frequency of response up to and including that point on the ordinal scale with the frequency for all points higher on the scale. The first category is compared to all the rest combined, then the first and second combined are compared to all the rest combined, and so on. In this way, the original table with an I-category ordinal scale is converted into a series of I-1 sub-tables, each with a binary categorization: lower or higher than the point on the scale. We then have three types of variables: the new binary response variable, indicating more or less on the ordinal scale; a variable indexing the sub-tables, corresponding to the points on the ordinal scale; and the explanatory variables. An advantage of this construction is that the interpretation of conclusions is not modified when the number of ordinal response categories is changed. The model is given by logit(θi) = αi + xβ, where θ1 = π1 is the probability of the first ordered category; θ2 = π1 + π2 is the probability of the first or second ordered category; and, in general, θi = π1 + π2 + π3 + ... + πi is the probability of any of the first i ordered categories.

Thus we allow the intercept to be different for different cumulative logit functions, but the effect of the explanatory variables is the same across the different logit functions. That is, we allow a different αi for each of the cumulative odds, but only one set of β's for all the cumulative odds. This is the proportionality assumption, and this is why this type of model is called the proportional odds model. Also notice that although this is a model in terms of cumulative odds, we can always recover the probabilities of each response category.
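Recovering the category probabilities from the cumulative odds amounts to differencing the cumulative probabilities. The Python sketch below assumes one intercept per cumulative logit (called alphas here), a single shared set of slopes (called beta), and intercepts given in increasing order; the names and values are hypothetical.

```python
import math

def inverse_logit(eta):
    """Probability corresponding to a log-odds value eta."""
    return 1.0 / (1.0 + math.exp(-eta))

def category_probabilities(alphas, beta, x):
    """Probabilities of the I ordered categories from I-1 cumulative logits.

    alphas: I-1 intercepts in increasing order, one per cumulative logit
    beta:   slopes shared by all cumulative logits (proportional odds)
    x:      explanatory variable values, same length as beta
    """
    shared_effect = sum(b * v for b, v in zip(beta, x))
    # Cumulative probabilities of being in category i or lower; the last is 1
    cumulative = [inverse_logit(a + shared_effect) for a in alphas] + [1.0]
    probabilities = [cumulative[0]]
    probabilities += [
        cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))
    ]
    return probabilities

# Hypothetical three-category scale with one explanatory variable
for x_value in (0.0, 1.0):
    print(x_value, category_probabilities([-1.0, 1.0], [0.5], [x_value]))
```

Changing x shifts every cumulative logit by the same amount, which is exactly the proportionality assumption; only the intercepts distinguish one cut point from another.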


Different statistical methods are used to analyze categorical data in different situations. Each method has its limitations, and another method is often used to overcome them. In clinical trials, most of these methods play a very important role in the analysis. Before performing the statistical analysis, we need to check the assumptions and study the situation. When the data are categorical, these methods can help obtain appropriate results on which to base decisions about the study objectives. Many software packages are available to perform these analyses with ease.

Devadiga Raghavendra* is Associate Biostatistician, and Grace Maria Antony is Director-Statistics, both for the Department of Biometrics, Makrocare Clinical Research Limited, Kavuri Hills, Madhapur, Hyderabad-500033, India.

*To whom all correspondence should be addressed.


1. Duolao Wang and Ameet Bakhai, Clinical Trials: A Practical Guide to Design, Analysis, and Reporting (Remedica Publishing, USA, 2006).

2. Alan Agresti, Categorical Data Analysis, 2nd Ed. (John Wiley & Sons, New Jersey, 2002).

3. Douglas G. Altman, Practical Statistics for Medical Research (Chapman & Hall, London, 1991).

4. Michael J. Campbell, Statistics at Square Two, 2nd Ed. (Blackwell, USA, 2006).

5. J. M. Bland and D. G. Altman, "Statistical Methods for Assessing Agreement Between Two Methods of Clinical Measurement," The Lancet (i) 307-310 (1986).

6. David W. Hosmer and Stanley Lemeshow, Applied Logistic Regression, 2nd Ed. (John Wiley & Sons, NJ, 2000).

7. "Research Methods II: Multivariate Analysis," Journal of Tropical Pediatrics, 136-143 (2009).

8. Maura E. Stokes, Categorical Data Analysis Using the SAS System, 2nd Ed. (John Wiley & Sons, USA, 2003).
