Coding & Designing a Clinical Database


Applied Clinical Trials

Applied Clinical TrialsApplied Clinical Trials-03-01-2012
Volume 21
Issue 3

Proper questionnaire coding and design can help set the stage for a successful trial.

Questionnaires are used in a wide range of settings to gather information about opinions, outcomes, and the behavior of individuals. Often they are the only cost effective way to reach a number of respondents large enough to allow statistical analysis of the results. Currently every clinical study, no matter what therapeutic area it belongs, utilizes at least one type of questionnaire.

This article discusses database design and coding aspects of data collection via questionnaires. Due to the complexity of the database design process it should be viewed as a multi-stage process, beginning with the definition of all the aspects to be examined, and ending with the interpretation of the results.

An outcomes system database serves two primary goals: accurate collection of all necessary information and analysis support. According to ICH GCP E6 5.5.3, the accuracy, completeness, legibility, and timeliness of the data should be present.1, 2 Easy interpretation, verification, and reporting of clinical trial data is also part of the data quality concept. Principle 11 in the Handbook for Good Clinical Research Practice embraces this concept, providing appropriate procedures are followed. It is also good practice to involve data management, statistical, and programming experts from the very beginning to avoid critical mistakes later. As Mark Kosinski mentioned in his article "Survey Says: Get the Right Data"3 just a few common missteps can cause companies millions of dollars and major delays. For example, mislabeling might lead to the wrong scoring.

Ideally, a database should be designed to allow flexibility with respect to questionnaire modifications. The process of questionnaire data collection can be divided into three major steps: requirement analysis, logical, and physical design.

Data collection schema

Breaking processes into smaller parts makes it easier to foresee and resolve potential issues. There are three major steps in this process:

Requirement analysis. The database requirements are determined by interviewing both the producers and users of data. Gathered information is used to produce a formal requirements specification.

The initial part of the requirement analysis step involves the development of a detailed plan for data collection, analysis, and reporting. Source documents which help to identify what data should be collected are: study protocol, statistical analysis plan, and the data handling plan. Overall the first stages should answer the following questions:

  • What data should be collected?

  • What are the key questions to be answered?

  • How will the data be analyzed?

  • Who is the primary audience for the questionnaire?

  • What reports will be used?

  • What formats will be used?

  • How often will a new report be generated?

For example, in the Blaise philosophy, the start of a survey is the design of a questionnaire, provided with detailed specification containing information about questions, possible answers, conditions on the route through the questionnaire, and the relationships between answers.4 Some data management systems use the term "conditional branching" to define relationships between some questions.

Logical design. A conceptual data model diagram, that shows all the data and their relationships, is developed by referencing many organizations' procedures such as ER, CDISC, or UML. With the theoretical approach, the data model constructs must be transformed into normalized relations. Ultimately, design of the study database also may be restricted by the need to be compatible with other databases,5 and some other internal guidelines.

Physical design. This step involves selection of indexes (access methods), partitioning, and the clustering of data. The purpose of physical design is to optimize performance as closely as possible,6 and to set up an actual database.

The database designer is the key person who ensures successful transition of the logical database model into its final stage-physical design. Therefore, a clear understanding of the questionnaire elements basics by database professionals is beneficial for the whole clinical team.



Questionnaire elements

In general, there are two primary questionnaire elements: questions and answers. Other columns, such as questionnaire categories, sub-categories, respondent types, etc. are considered to be attributes for these two primary ones. Questions and answers are the foundation for any questionnaire design.

Questions. Questionnaires consist of two types of questions: close ended and open ended. Each of these two question types require different attributes in the database:

Close-ended questions (structured): Questions offering limited choice of answers or require a specific pre-determined value as an answer. This question type usually has a predefined format or formats for answers. Sometimes a list of possible answers is hard coded providing specific options. Most of the key questions in the clinical trials are close ended.

Open-ended questions (unstructured): Most common examples of open-ended questions are questions that respondents can answer in an unlimited number of ways (e.g., Why did you decide to participate in this trial?). Other possible questions are work associations and sentence or story completions. Some of these questions will not be part of the analysis data and would not have to be captured by the database. Other questions would have to be in the database. The trial statistician can be asked to advise here.

The database designer must be aware that some questionnaires may contain general questions which are not suitable for analysis and therefore do not have to be captured by the database. This is especially true for some psychiatry or neuroscience studies. It is important for general questions to be neutral questions to establish rapport and put the respondent at ease (e.g., Were there any problems finding our office today? or Have you visited this location before?).

Answers. Answers are dependent on question types. From the database prospective, multiple-choice questions with a predefined list of answers belong to the close-ended (structured) type. However, some multiple-choice questions contain an "other" option where a respondent can provide a free entry which belongs in an open-ended (unstructured) type. In this case "other" becomes a separate question with a free-text answer. There are two answer categories unstructured and structured.

Unstructured (open ended): Unstructured answers are free-text or free-entry-words and/or combination of words and numbers. Free text answers are common for some questionnaires; it is also very common not to have them as part of the database. Free text usually requires manual review by medical professionals. Therefore in most cases it does not make any sense to keep these answers in the database. Sometimes the database designer creates an indicator to check if a question was answered, not answered, or not applicable to the respondent. Providing indicators usually helps during data cleaning and clinical reporting activities.

Structured (close ended): The structured answer category represents the biggest part of questionnaires, there are three main forms: scoring/constant sum, multiple or single choice (coded), and predefined answers (numeric, likert scale).

There are four answer types: dichotomous, multichotomous, ranking answers, and constant sum/scoring.

Dichotomous, or single choice answers, belong to questions with just one possible answer from a predefined list (e.g., yes/no answers). Date or number range questions are also associated with dichotomous answers; answer options allow only one choice. For example, questions regarding family annual income, spending ranges, or age groups (see Table 1). Ranges must never overlap each other. This type of answer has one record per patient, per visit, per respondent in the database.

Multichotomous (multiple choice questions) allow multiple answers (see Table 1) as well as a single answer. These type of answers may have multiple records per patient, per visit, per respondent. However, there is also the option of having just one record if a respondent provided just one answer. The difference is that dichotomous (singe choice) are restricted to have more than one answer, while multichotomus can have either one.

Ranking answers (semantic differential) questions allow a certain set of answers to be ranked based upon a specific attribute or characteristic. For example, it might be requested that the listed options be ranked based on a particular attribute. Perhaps we know that Pioneer, Samsung, LG, or a Phillips DVD player are most likely to be purchased. "Respondents can be asked to rank their preferred brand up to a predetermined number and their least preferred, or those that they don't like at all, if it is more appropriate"9 (e.g., mark the most reliable brand as 1, next reliable as 2, and so on). These type of answers in the database will be stored similar to multichotomous (multiple choice) answers but will have an additional ranking attribute in a separate column.

Constant sum, or scoring questions, permit the collection of "ratio" data, meaning that the data is able to express the relative value or importance of the options. These answers also will be stored in the database with multiple records per patient, per visit, per respondent with an additional column storing option. Total sum can be calculated depending on study design: either during analysis, data management manipulations, or as derived value in additional record/column. So, even though constant sum/scoring answers are logically different from ranking answers (see Table 1 for examples) they are handled in the same manner as the database design prospective.



Physical database design

Physical design is the third and final step in creating a database. Although, in design, there is no single right answer, nor an algorithm for producing the best answer, certain principles exist. It always starts with reviewing documents and specifications generated during requirement analysis and logic modeling. Based on this review, GCP, and internal company practices, the database professional picks the appropriate design, identifies all attributes, and develops the coding schema.

Figure 1. Efficient coding may preserve the simplicity and integrity of the database.

Database structure design. Traditionally, data is stored horizontally in the database, but this approach does not work as well for questionnaires. Its hierarchical data tends to lose its flexibility and logical purity while stored horizontally.7 Based on the mini-questionnaire (Figure 1) you can see the fragment of traditional horizontal database design (Table 2).

What if some questions related to a certain category need to be pulled? The only option here is to hard code the full list of question codes related to the category, and if any new questions were added to the category than the new codes have to be added and the database updated as well. A vertical database structure allows an additional attribute column for questionnaire type, enabling easy pulling of all the questions related to a category. Therefore if a new question is added, it will be automatically associated with the category. Another challenge for a horizontal database is the indicators for invalid answers. This task is very easy with the vertical approach, and almost impossible with a horizontal structure. Even though "specify" questions were not answered, in most cases there is still space reserved in the database. A vertical database just does not have this information, saving database space. Another good reason for utilizing vertical design is adding new questions to the questionnaire after database release. This situation is very common and a horizontal database will require changes of table structure for a new question-another column or columns for new questions need to be added. A vertical structure will be exactly the same except that a new question will add an extra record (row). This is not a full list of benefits for using a vertical database structure, but it gives an idea of the advantages.

Table 2. Traditionally, data is stored horizontally in the database, but this approach does not work as well for questionnaires.

Assigning attributes. As mentioned before, vertical data structure contains two main columns: questions and answers. Some databases will have multiple answer columns to enable various answer formats, either numeric or character. Additional attribute columns should be added to ensure integrity and further analysis:

  • Question level number

  • Validation attribute

  • Main question ID (for sub-questions)

  • Questionnaire type and subtype (if used)

  • Respondent types

The list of attributes may vary depending on specific trial needs. Some types of questionnaires may require data collection time or facility type. Various combinations of attribute columns will allow some meaningful manipulation later. For example, "question, level, number" in combination with "main question ID," would allow the data to be pulled off sequentially. By pulling questions by "main question ID," a report with a full set of answers to concerned questions including answers to related dependable sub-questions, despite the number of sub-questions which were actually answered by every respondent, can be created later.

Validation attributes might be especially useful for information collected online or captured directly into the database without preliminary validation. It helps to distinguish later invalid entries for the questions and make a decision if they should be excluded from the analysis.

Validation attributes usually have numeric values specified in the documentation (e.g., data transfer specifications, database specifications, etc.). It might be equal to one if the record is valid, and equal to zero if the record is not valid. For example, the lab record for hemoglobin results was transferred in the database but appeared to be invalid due to improper storage or sample leaking. Another sample was collected later, and the valid result was transferred into the database. Now the database contains two records for the same analysis. In fact, only the second record should be used in further analysis, while the first one should be excluded as invalid. The best way to achieve it is through assigning a validation attribute. Identifying valid results as one and invalid as zero will allow the easy pulling up of only valid records from the database. Programmatically speaking, just those which have validation attributes equal to one will be considered as valid results for the test.

Questionnaire category is often used to cover different aspects of the problem. For example, screening questionnaire for 13 DSM-IV Axis I disorders might contain questions related to five different areas: eating, mood, anxiety, substance use, and somatoform disorders. A set of questions for each area can be arranged as separate questionnaire. Each type of questionnaire can have assigned a number, (e.g., one for eating disorder, two for mood disorder, etc.).

Questionnaire subtypes might be used to divide the same questionnaire into multiple categories in the sample questionnaire (Figure 1). For example, the "quality of life" questionnaire can be divided into multiple sub-categories: mental health, living environment, income, meaningful work, etc.

Respondent types can be used to increase the depth of the system. Depending on the domain of interest, using multiple sources of information may be necessary: clinical staff, childcare provider, and/or collateral to the patient (e.g. friends, spouse, employer etc.).6 Most of the questionnaires determining existence of ADD disorders require responses from multiple-care providers as well as the patient themselves. This attribute allows a separate row to store the answers for the same questions from different respondents within the same questionnaire. The horizontal structure would require either separate tables for every respondent or have additional sets of the same columns for every potential respondent.

Nothing can compliment the physical design of the database as much as meaningful and consistent coding. Below efficient coding that might work to preserve simplicity and integrity of the database using the questionnaire sample from Figure 1 is explored.



Coding schemas

"Coding is the process of converting questionnaire data into meaningful categories to facilitate analysis. You need to think about your coding scheme at the beginning of the study and, wherever possible, build it into your questionnaire, for example by numbering the response tick boxes for each question."8

Coding schema should be carefully documented and followed precisely. Related study documentation must contain the following information: database space to be reserved for each attribute and format, full description (decoding) of every code used in the study, and mapping of elements to items in data collection instruments (e.g., case report form). This information belongs to database specifications if the data is collected via CRFs, or data transfer specifications if data is collected via IVRS or transferred from outside vendor. Sometimes the clinical team might require some additional instructions or specifications if questionnaire is complicated.

In some cases coding can be used as a tool to help handle metric data in the same manner as categorical data. In the statistical analysis of a survey, an important distinction is made between metric or interval-type data, and nominal or categorical data, since they require different statistical treatment. Implementing appropriate codes for nominal data might help-it can be used later in the analysis, avoiding any harsh manipulation. Once the data analysis methods are developed, the trial statistician might request a special type of coding for some variables, or certain formatting for others. Hence, the overall design, as well as coding scheme, should be discussed with a statistician at this stage so that any mistakes can be rectified before going to the printers.8 The following practices are highly recommended for developing coding schema:

  • Each questionnaire element or group of elements depending on coding schema that have an assigned code (ID number): questionnaire type, questionnaire subtype, question, answer etc (Table 2).

  • Codes (ID numbers) should be consistent in all questions (e.g., yes=1, no=2). This is particularly important when using scales.

  • Space for coding should be left if additional options will be added later. For example, in the beginning there can be two options for answers to the question "How often do you watch TV while eating?"-"always" and "never." You can assign "one" to "always" and "two" to "never." After the database release the clinical team may realize that they need to add "sometimes" as an additional option, and it must be in between the existing options. In this case, one of your options is to reassign the existing code "2" to newly added "sometimes," and assign "3" to "never." Unfortunately, in most cases, it is impossible to do it after the database release, besides it will dramatically increase maintenance work. This issue could be easily handled, if you have had assigned code "10" to "always," and "20" to "never" in the very beginning. Than you would be able to easily squeeze "sometimes" in between the original two options with code "15." That is called "leaving space" in database coding.

Indeed, coding is a very powerful tool and can significantly affect the amount of database space required for data capturing. A big part of developing coding schema is the decision regarding code consolidation. There are also multiple approaches to how it can be achieved. One choice is to consolidate questions, answers, and questionnaire attributes into one coding field wherever possible, or keep separate columns for question coding itself and other questionnaire elements coding (Figure 2). Keeping separate columns for answers and questions will require additional database space but might allow some flexibility in later trials (e.g., the ability to reuse items previously created in the database for other studies).

Figure 2. Keeping separate columns for questions and answers will require additional database space but might allow you some flexibility in later trials.

As with any other questionnaire, Figure 1 contains various types of questions and answers. The first question is a single choice with a predefined set of coded answers. Since only one answer is allowed for this question it will have just one record in the database per patient, per visit, per respondent. The next question (1.1) is also a single choice, free-entry question. The question is conditional and requires entry only if the question is answered "no." Therefore patients who answered "yes" to the first question might have no records in the database for question 1.1.

Questions 3 and 4 are challenging questions in terms of database and coding design. Both look like regular single-answer option questions, but there is an additional free-entry option for those who answered "other" to question 3. This free-entry option will be stored in the database separately. Hence, if "other" is the answer of choice then free-entry will be stored in the database as a separate record. Ultimately, question 3 will have either just one record in the database with the value of "1" or "2", or one value of "33" in the coded field and additionally in the "answer free-text" column for free-entry answers.

Question 4 is similar, but should contain multiple records per patient, per visit because of the scoring distribution between suggested answers. Therefore each of four options which were actually rated by the patient will be stored in the database in an "answer coded" field as "8" for cleaning, "9" for cooking, "10" for bathing and "33" for other. Values of scoring distribution assigned by the patient will be stored in the "answer numeric" field. This design allows respondents to capture the distributed score over three or less listed activities, or add one not-listed activity and rate it among others.

The database designer can also use either the separated or consolidated approach to code every item separately or combine them (Table 3). Advantages of consolidation are reduction in database space and an increase in data warehouse efficiency. The downside is that consolidated items can not really be reused in other studies. There are no defined rules of how exactly coding should be done; sometimes it is more of an art. Since combination of coding and design can have significant effect on the database outcome, the database designer should carefully review full CRF or a questionnaires book before starting to develop coding schema.

Figure 2. Keeping separate columns for questions and answers will require additional database space but might allow you some flexibility in later trials.

Table 3 illustrates how different coding approaches work to reduce the amount of required database space. Elements were combined according to the schema suggested on Figure 2. The last letter in this coding identifies format type of correlated answer: "C" stands for coded answer, "T" for free text, "M" for multiple (possible combination of different answer types).

This is just a small sample of how various coding approaches can significantly affect database design. Even the sample mini-questionnaire provided here can be coded and designed in many different ways. As Parson and Wand commented in their work "an interesting consequence ... different class structures can be used to model precisely the same domain of instances and properties."

It is important to be familiar with all the above described principles and to have a full set of questionnaire elements to insure proper questionnaire design.

Vera Pomerantseva, MS, is Senior Database Programming/CRF Specialist at Novartis Pharmaceuticals Corp, East Hanover, NJ, e-mail:




1. Gerhard Fortwengel, Guide for Clinical Trial Staff. Implementing Good Clinical Practice, (Karger, Basel, Switzerland, 2004).

2. World Health Organization, Handbook for Good Clinical Research Practice (GCP), Principle 11 and 14 (World Health Organization, Geneva, Switzerland, 2002).

3. M. Kosinski, "Survey Says: Get the Data Right," Applied Clinical Trials Online, (May 1, 2009),

4. Jelke Betlehem, "Business Process Reengineering with the Blaise System for Computer-Assisted Survey Processing" in In Search of Structure-Essays in Social Science and Methodology, eds. Meindert Fennema, Cess van der Eijk, and Huibert Schijt, (Het Spinhuis Publishers, The Netherlands, 1998), 109-120.

5. Mark E. Maruish, The Use of Psychological Testing for Treatment Planning and Outcomes Assessment, 2nd Edition (Lawrence Erlbaum Associates Inc., Mahwah, NJ, 1999).

6. Toby J. Teorey, Sam Lightstone, Tom Nadeua, and Morgan Kaufmann, "The Database Life Cycle" and "Conceptual Data Modeling," in Database Modeling and Design, 4th Edition, (Morgan Kaufmann, California, 2006).

7. I. Brace, "Data Types," in Questionnaire Design: How to Plan, Structure, and Write Survey Material for Effective Market Research, 2nd Edition, (Market Research in Practice, London, 2008).

8. V. Pomerantseva, "How to Pick the Right Database Design for Questionnaires," Information Managment, (2009),

9. A. Williams, "How to...Write and Analyse a Questionnaire," Journal of Orthodontics, 3 (9) 245-252 (2003).

Related Content
© 2024 MJH Life Sciences

All rights reserved.