## Categorical Data Analysis

A review of different statistical methods for data types and models used during research.
May 01, 2011

In a clinical trial, substantial amounts of data are recorded on each subject, such as the patient's demographic characteristics, disease-related risk factors, medical history, biochemical markers, medical therapies, and outcome or endpoint data at various time points. These data can be categorical or continuous. Understanding the type of data is important because it determines which method of analysis to use and how to report the results.1

To assess the effectiveness and safety of an investigational pharmaceutical entity, the scale for the primary clinical endpoint should ideally be numerically continuous, so that it provides an accurate and reliable assessment. In practice, however, it is often impossible, or extremely expensive, to measure responses quantitatively. On the other hand, patients' responses to treatment can be easily documented according to the occurrence of some meaningful and well-defined event, such as death, infection, cure of a certain disease, or any serious adverse event. In addition, the intensity of these events can be graded according to predefined categories. Categorical data can therefore serve as useful surrogate endpoints for unobserved latent continuous variables in clinical trials. Sometimes, to simplify the analysis or improve the presentation of the results, continuous data are transformed to categorical data according to predefined criteria. As a result, many efficacy and safety endpoints in clinical trials take the form of categorical data on either a nominal or an ordinal scale.2 Different statistical methods are used to analyze these data in clinical trials.

Categorical data arise when individuals are categorized into one of two or more mutually exclusive groups.3 Data can be divided into two main types: quantitative and qualitative. Quantitative data tend to be either continuous variables that one can measure (such as height, weight, or blood pressure) or discrete variables (such as the number of children per family or the number of asthma attacks per child per month). Count data are discrete and quantitative. Qualitative data tend to be categories: people are male or female; European, American, or Japanese; and they either have a disease or are in good health. These are examples of categorical data.

There are four types of measurement scales that appear in the social sciences: nominal, ordinal, interval, and ratio. They fall into two groups: nominal and ordinal scales are categorical data; interval and ratio scales are continuous data. Categorical data with unordered categories are on a nominal scale; hair color is a good example. Categorical data with ordered categories are on an ordinal scale; rank is an example. The distinction matters because the appropriate method of data analysis depends on the scale of measurement.4

The statistical methods used to analyze categorical data are described in the sections that follow, together with the limitations of each method.

### Univariate and bivariate statistical methods

Chi-square test. The chi-square test of independence is used to test the association between two categorical variables. The idea behind the test is to compare the observed frequencies with the frequencies that would be expected if the null hypothesis of no association (statistical independence) were true. Assuming the variables are independent, the expected frequency for each cell of the contingency table is the product of its row total and column total divided by the overall sample size. If the value of the test statistic is too large, it indicates poor agreement between the observed and expected frequencies, and the null hypothesis of independence is rejected. In clinical trials, for example, the test can be used to assess the association between the occurrence of an adverse event and the treatment received. The chi-square test rests on several assumptions: independent random sampling, no more than 20% of the cells with an expected frequency less than five, and no empty cells. A significant chi-square result shows only that an association exists; it does not measure the degree or strength of that association. The usual chi-square test is also not valid when more than 20% of the cells have an expected frequency less than five. In that case, the Fisher exact test is used instead to test the association between the variables. The p-value from this method likewise does not give the strength of the association.
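The comparison of observed and expected frequencies, and the fallback to the Fisher exact test when expected counts are small, can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a substitute for a validated statistical package. The function names and the two tables of counts are hypothetical. The p-value helper is restricted to 2x2 tables (one degree of freedom), and the sketch assumes every row and column total is greater than zero.

```python
from math import comb, erfc, sqrt


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Pearson statistic: sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def fraction_small_expected(table, threshold=5.0):
    """Share of cells whose expected count falls below the threshold."""
    cells = [e for row in expected_counts(table) for e in row]
    return sum(e < threshold for e in cells) / len(cells)


def chi_square_p_value_2x2(table):
    """Upper-tail p-value for a 2x2 table (1 degree of freedom only)."""
    return erfc(sqrt(chi_square_statistic(table) / 2.0))


def fisher_exact_2x2(table):
    """Two-sided Fisher exact p-value for a 2x2 table with fixed margins."""
    (a, b), (c, d) = table
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts: rows = treatment arms, columns = (adverse event, none).
large_trial = [[12, 38], [5, 45]]
small_trial = [[3, 1], [1, 3]]

for name, table in [("large", large_trial), ("small", small_trial)]:
    if fraction_small_expected(table) > 0.20:
        print(name, "Fisher exact p =", round(fisher_exact_2x2(table), 4))
    else:
        print(name, "chi-square p =", round(chi_square_p_value_2x2(table), 4))
```

In the smaller table every expected count is 2, so the 20% rule is violated and the sketch switches to the exact test. Neither p-value says how strong the association is; a separate measure of association would be needed for that.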

Cochran-Armitage trend test. The investigator may be interested in whether the response rate shows a trend across the different doses of a drug. The Cochran-Armitage trend test is used for this purpose. It tests for a trend in binomial proportions across the levels of a single factor or covariate. The test is appropriate for a two-way table in which one variable has two levels and the other is ordinal. The two-level variable represents the response, and the ordinal variable represents an explanatory variable with ordered levels.
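As an illustration, the trend statistic can be computed directly from the number of responders and the number of subjects at each dose level. The sketch below uses one common large-sample form of the test, with equally spaced scores 0, 1, 2, ... assigned to the ordered levels by default and a two-sided p-value from the normal approximation. The function name, the choice of scores, and the dose-response counts are assumptions made for the example.

```python
from math import erfc, sqrt


def cochran_armitage_trend(responders, totals, scores=None):
    """Return (z, two-sided p) for a trend in proportions across ordered groups.

    responders[i] and totals[i] are the responder count and group size at
    ordered level i; scores default to 0, 1, 2, ... (an assumption).
    """
    if scores is None:
        scores = list(range(len(totals)))
    n = sum(totals)
    p_bar = sum(responders) / n  # pooled response rate under the null

    # Score-weighted departure of observed responders from the pooled rate.
    t = sum(s * (r - m * p_bar) for s, r, m in zip(scores, responders, totals))

    # Variance of t under the null hypothesis of no trend.
    sum_ms = sum(m * s for s, m in zip(scores, totals))
    sum_ms2 = sum(m * s * s for s, m in zip(scores, totals))
    variance = p_bar * (1 - p_bar) * (sum_ms2 - sum_ms ** 2 / n)

    z = t / sqrt(variance)
    p_two_sided = erfc(abs(z) / sqrt(2))
    return z, p_two_sided


# Hypothetical dose-response data: four increasing doses, 50 subjects each.
z, p = cochran_armitage_trend(responders=[10, 15, 20, 25], totals=[50, 50, 50, 50])
print(f"z = {z:.3f}, two-sided p = {p:.5f}")
```

A positive z here means the response rate rises with the score. Other score choices, such as the actual dose amounts, can be passed through `scores` and may change the result.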

McNemar test. In clinical trials, we often want to test whether the response rate improves after a particular treatment. The statistical test used for this purpose is the McNemar test. It is applied to a 2x2 classification table and tests the difference between paired proportions (e.g., in studies in which patients serve as their own controls, or in studies with a before-and-after design). Its assumptions are that the sample members are randomly drawn from the population, that the scores within each sample are independent of one another, and that no expected frequency is less than five. The Cochran Q test is an extension of the McNemar test for related samples; it provides a method for testing the differences among three or more matched sets of frequencies or proportions.
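The paired comparison can be sketched as follows. The McNemar statistic depends only on the discordant pairs, that is, the subjects whose response changed between the two measurements. The sketch returns the uncorrected large-sample statistic with its p-value and, as an addition not discussed above, an exact binomial p-value that is sometimes preferred when there are few discordant pairs. A Cochran Q statistic for three or more matched binary measurements is included without a p-value; it is usually referred to a chi-square distribution with k - 1 degrees of freedom. All function names and counts are hypothetical.

```python
from math import comb, erfc, sqrt


def mcnemar(b, c):
    """McNemar test from the two discordant-pair counts of a paired 2x2 table.

    b = pairs positive on the first measurement only,
    c = pairs positive on the second measurement only.
    Returns (statistic, asymptotic p, exact binomial p).
    """
    n = b + c
    if n == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    statistic = (b - c) ** 2 / n  # chi-square with 1 degree of freedom
    p_asymptotic = erfc(sqrt(statistic / 2.0))
    # Exact version: under the null, b follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    p_exact = min(1.0, 2.0 * tail)
    return statistic, p_asymptotic, p_exact


def cochran_q(rows):
    """Cochran Q statistic; rows[i][j] is the 0/1 response of subject i
    under condition j. Assumes at least one subject's responses differ."""
    k = len(rows[0])
    col_totals = [sum(col) for col in zip(*rows)]
    row_totals = [sum(row) for row in rows]
    n = sum(row_totals)
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - n * n)
    denominator = k * n - sum(r * r for r in row_totals)
    return numerator / denominator


# Hypothetical before/after study: 15 subjects responded only before
# treatment and 5 responded only after.
stat, p_asym, p_exact = mcnemar(b=15, c=5)
print(f"McNemar statistic = {stat:.2f}, p = {p_asym:.4f}, exact p = {p_exact:.4f}")

# With only two conditions, Cochran Q reduces to the McNemar statistic.
paired = [[1, 1]] * 20 + [[1, 0]] * 15 + [[0, 1]] * 5 + [[0, 0]] * 10
print("Cochran Q on the same pairs =", cochran_q(paired))
```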