Anticipating Careless Responders in Survey Design and Analysis

Clinical trials must rely on internet surveys to incorporate the needs of industry and patients, but how can results be improved?

The clinical trials industry depends on internet surveys for competitor analysis, customer satisfaction/feedback, and benchmarking.1,2,3 The results of these surveys provide insights that guide business and clinical decisions. Like any information from the web, however, internet surveys are prone to noise and misinformation.

In this article, we will show you how to improve the quality of your internet survey data by identifying these sources of noise and misinformation using careless responding analysis. A fundamental cause of many forms of internet misinformation is carelessness.4 People who pass along misinformation on the internet are lazy rather than biased: they respond to survey items without carefully reading them or thinking deeply about their responses.4,5 The result is careless responding that degrades data quality.

What is careless responding?

Careless responding occurs when a respondent is within the sampling frame but does not give an accurate response due to inattention or inadequate cognitive processing.6,7 More specifically, careless respondents may not carefully read and comprehend survey items, retrieve the appropriate information from memory, make a deliberate judgement, or provide an accurate response that aligns with that judgement.8 Carelessness may be induced by the survey structure (e.g., a long survey with complex, poorly written items), respondent characteristics (e.g., fatigue or limited reading ability), or contextual factors (e.g., distractions or responding on a smartphone).6,7,8,9,10 Any of these factors can lead to inaccurate responses due to carelessness.

Careless responding biases many internet surveys.11,12 Up to 30% to 50% of survey responses may be careless,6,13 and carelessness begins to bias results when about 10% of the respondents are careless.8 Specifically, careless responding attenuates means and correlations, decreases the reliability of measures, reduces validity, and increases the risk of Type II error.6 Put another way, careless responding creates noise that obscures your ability to detect a signal in the data.14

Identifying careless respondents through survey construction

Survey construction can help deter careless responding and enhance your ability to identify careless respondents. In this section, we describe a variety of tools that can be built into a survey as careless responding detectors. You may not need every tool but should adjust how you use these tools based on the purposes of your research.

Qualify each respondent. Build into the survey some items that will allow you to assess the qualifications of the respondent. This step is analogous to specifying the inclusion/exclusion criteria in a clinical trial. Items that are constructed as continuous variables (e.g., years of experience or ‘rate your expertise in this area’ on a 7-point scale) are more useful than dichotomous response items (e.g., ‘are you qualified’ yes/no). These qualifying items don’t need to go up front in the survey, even though you may think about them first. They may end up at the back of the survey with other demographic items.

Measure the time. Most of the common survey tools will allow you to measure the amount of time it took to complete the survey, but not the specific time-per-item. Still, you can calculate the average time-per-item. As you are writing items, consider the complexity of the questions and estimate the time respondents should need on each item. Respondents often take less time than you think, but this estimate will give you a starting point for analyzing the amount of time taken, which we address in the next section.

Keep surveys short.15 Carelessness increases significantly as survey length increases; rates of random responding have been found to double toward the end of long surveys.16 A good rule of thumb is to keep surveys under 150 items, including all demographic, qualification, and careless detection items. Reduce the number of items further if you have long instructions or complex items. We realize the organizational pressures that lead to long surveys. Have a strong governance structure in place beforehand to keep the survey focused on the research question.

Be careful with incentives. You can improve the number of respondents with generous incentives, but you might also attract careless respondents. Think about your target respondent and see if you can identify something that will be uniquely valuable to them while not attracting the attention of professional survey-takers.

Write good instructions. Typical basic instructions emphasize honesty and accuracy. Adding a warning (e.g., 'your responses will be analyzed for quality') will reduce carelessness.7 An occasional reminder, whether written or through a virtual presence, can boost the effect of warning instructions. The decision to assure respondent anonymity is tricky. When respondents enter their name, carelessness is reduced. But identified respondents can become overly agreeable, tending to agree with items even when agreement is not logical.6,9

Include careless detection items. These are items not related to the purpose of your research but included to detect careless responding. They increase cognitive engagement by warning respondents to look out for tricks.17 For example, you can include instructed response items (e.g., 'fill in a 3 here') to detect carelessness.7

A different type of careless detection item is a pair of items that ask the same question. The simplest approach is to repeat an item later in the survey to see if you get the same answer. You can also change the wording with synonyms or antonyms to check agreement, or reverse-code the repeated items. In general, items are scaled so that a higher number means more of the construct, or that it is better. In reverse-coding, the scaling is flipped, so that a rating of more or better requires a lower number. Reverse-coded items are a good way to detect carelessness, but they can become so convoluted that they lead to misresponses: honest mistakes that come from poorly worded items.17 When using careless detection items, it is good practice to include one early in the survey to serve as a speed bump or warning and then one every 50-75 items, with a maximum of three reverse-coded items per survey.6
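To make the reverse-coding check concrete, here is a minimal sketch in Python. The 7-point scale, the one-point tolerance, and the example ratings are our illustrative assumptions, not fixed standards:

```python
# Sketch: checking agreement between an item and its reverse-coded pair.
# Assumes a 7-point Likert scale; ratings and tolerance are illustrative.

def recode_reverse(value, scale_max=7):
    """Flip a reverse-coded rating back onto the original scale."""
    return scale_max + 1 - value

def pair_disagreement(original, reversed_item, scale_max=7, tolerance=1):
    """True if the pair disagrees by more than `tolerance` points
    after recoding, suggesting possible carelessness."""
    recoded = recode_reverse(reversed_item, scale_max)
    return abs(original - recoded) > tolerance

# A respondent who rates 6 on an item and 6 on its reverse-coded twin
# is inconsistent: the reversed 6 recodes to 2, a 4-point gap.
print(pair_disagreement(6, 6))  # True (inconsistent)
print(pair_disagreement(6, 2))  # False (reversed 2 recodes to 6)
```

On a 7-point scale a reversed rating of r recodes to 8 - r, so agreeing with both an item and its opposite produces a large gap and a flag.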

Identifying careless respondents in the analysis

In this section, we will discuss simple analytic procedures that you can use to detect careless respondents. When you find a careless respondent, don't delete them from the dataset. Instead, create a 0/1 categorical variable where a 1 signifies a careless respondent. Someone may ask you to do a post hoc analysis to understand the effect of removing the careless respondents from the analysis. Whether you should comply with such a request is a different story. Carelessness is a type of data contamination.18 These respondents don't belong in the sample, so a request to examine the effects of including them is the same as asking 'What is the effect of including respondents outside the sampling frame?' Still, it is good to retain the data in case you need it.
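As a minimal sketch of this flag-don't-delete approach (the respondent IDs and items are invented for illustration):

```python
# Sketch: flag careless respondents with a 0/1 indicator instead of
# deleting them, preserving the raw data for any post hoc analysis.
# Respondent IDs and items are hypothetical.

rows = [
    {"id": "r01", "q1": 4, "q2": 5},
    {"id": "r02", "q1": 3, "q2": 3},
    {"id": "r03", "q1": 6, "q2": 2},
]

# IDs identified by the screens described in this section
careless_ids = {"r02"}

for row in rows:
    row["careless"] = 1 if row["id"] in careless_ids else 0

flagged = sum(r["careless"] for r in rows)
print(flagged, "of", len(rows), "respondents flagged")
```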

Assess qualifications. The definition of careless responding rests on the assumption that the respondents are qualified to be sampled. The qualification analysis will vary based on how you structured the survey and the burden of proof you specified in the SAP prior to launching the survey. If you created the qualifying variable(s) as continuous, examine a frequency table or histogram to see how the respondents cluster, assess how many rated themselves completely unqualified (e.g., a '1' on a 7-point scale) or marginally qualified, and identify a logical cutoff point. It may help to recode the continuous variable into a categorical one because it makes the data presentation more understandable.18 Hopefully, though, you won't need to categorize this variable because all your respondents were highly qualified.
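A simple sketch of this qualification screen, with made-up self-ratings on a 7-point scale and a hypothetical cutoff of 3:

```python
# Sketch: screening a continuous self-rated qualification item
# (1 = unqualified, 7 = expert). The ratings and the cutoff of 3 are
# illustrative; choose your own cutoff from where respondents cluster.
from collections import Counter

ratings = [7, 6, 6, 1, 7, 2, 5, 6, 1, 7, 6, 3]

# A text frequency table stands in for a histogram of self-ratings
freq = Counter(ratings)
for rating in sorted(freq):
    print(rating, "#" * freq[rating])

CUTOFF = 3  # ratings at or below this are flagged, not deleted
unqualified = [1 if r <= CUTOFF else 0 for r in ratings]
print(sum(unqualified), "of", len(ratings), "respondents flagged")
```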

Assess the time. Ideally, you want to know the amount of time the respondent spent on each item. Most of the commercially available survey tools don’t provide this information. Instead, you usually see the survey start and finish times. From this, you can calculate the total amount of time taken on the survey and then an average time per item.

What is too fast, and therefore eligible to be considered careless? We don't have a fixed speed limit. First, sort the data from shortest to longest and look for a breakpoint between respondents carelessly speeding through the survey and careful but fast readers. If you don't see one, try graphing the time-per-item variable. We usually see a break around 3 seconds per item, with an average time per item of around 8 or 9 seconds. If you think about all the steps involved in responding to an item (reading, accessing memory, judging, and responding), a respondent is clearly not being careful if they are averaging less than 2 or 3 seconds.
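The timing screen above can be sketched as follows; the timestamps, 120-item survey length, and 3-second cutoff are illustrative assumptions:

```python
# Sketch: compute average seconds-per-item from survey start/finish
# stamps and flag anyone below a speed cutoff. Values are made up.
from datetime import datetime

N_ITEMS = 120          # items in the survey
FAST_CUTOFF = 3.0      # seconds per item; set yours at the breakpoint

# start/finish stamps of the kind most survey tools export
times = {
    "r01": ("2024-05-01 09:00:00", "2024-05-01 09:04:30"),  # speeder
    "r02": ("2024-05-01 09:00:00", "2024-05-01 09:18:00"),  # careful
}

fmt = "%Y-%m-%d %H:%M:%S"
flags = {}
for rid, (start, finish) in times.items():
    elapsed = (datetime.strptime(finish, fmt)
               - datetime.strptime(start, fmt)).total_seconds()
    per_item = elapsed / N_ITEMS
    flags[rid] = 1 if per_item < FAST_CUTOFF else 0  # flag, don't delete
    print(rid, round(per_item, 2), "sec/item, careless_flag =", flags[rid])
```

Here r01 finishes 120 items in 4.5 minutes (2.25 seconds per item) and is flagged, while r02 averages 9 seconds per item and is not.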

What about respondents who take a very long time on the survey? We commonly see respondents who take a couple of days to complete the survey! Obviously, these people left the survey and came back to it, or abandoned the survey and closed their browser tab later. We can usually see a breakpoint in the histogram on the long side of survey time as well, typically somewhere under an hour. We look at the long responses and often find they were abandoned, but occasionally see someone who was just interrupted and seems to be responding carefully. Don't be too hasty to label a long respondent as careless. Check each one to assess the quality of their responses. We can usually see the point where they left the survey because the quality of the responses suddenly gets much worse.

Screen for missing data & abandonment. The end of the survey typically has some innocuous items like demographics that everyone should be able to respond to. Sometimes careful respondents get interrupted by an important phone call and never get back to the survey. Most of the time, however, we find abandoned surveys are careless. Also check the number of missing data points. Again, this is helpful, but not definitive. You might have a highly qualified respondent who is a specialist and only responding to some topics within their area. Most cases that look like they were abandoned with a lot of missing data, however, will probably be careless respondents.
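A minimal sketch of this missingness screen (the 50% threshold and the response data are our assumptions for illustration):

```python
# Sketch: count skipped items per respondent and flag likely
# abandonment. None marks a skipped item; the threshold is illustrative.

surveys = {
    "r01": [4, 5, 4, None, 5, 4, 5, 4],                 # one skip
    "r02": [5, 4, None, None, None, None, None, None],  # abandoned?
}

MISSING_LIMIT = 0.5  # flag if more than half the items are missing

abandon_flags = {}
for rid, answers in surveys.items():
    rate = sum(a is None for a in answers) / len(answers)
    abandon_flags[rid] = int(rate > MISSING_LIMIT)
    print(rid, f"missing rate = {rate:.0%}, abandon_flag = {abandon_flags[rid]}")
```

As the text notes, a flag here is helpful but not definitive: review the flagged cases before concluding they are careless.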

Look for patterns. Careless respondents often default to a pattern of responses instead of reading and evaluating each item. The most common pattern we see is monotonic responding (e.g., 3, 3, 3, 3...). Less commonly, respondents will alternate their responses (e.g., 3, 4, 3, 4...) or use a progression (e.g., 1, 2, 3, 4, 1, 2, 3, 4...). In a small dataset, you can just scan the data matrix for patterns. In a larger dataset, screen each respondent's standard deviation and response frequencies. The standard deviation will pick up monotonic responding and the frequencies will identify progressions.
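A sketch of this pattern screen using per-respondent standard deviation and value frequencies (the respondent IDs and data are invented):

```python
# Sketch: screen each respondent with two statistics. A standard
# deviation near 0 catches monotonic responding; several values with
# identical counts hint at a repeating progression. Data are made up.
from collections import Counter
from statistics import pstdev

respondents = {
    "r01": [3, 3, 3, 3, 3, 3, 3, 3],  # monotonic
    "r02": [1, 2, 3, 4, 1, 2, 3, 4],  # progression
    "r03": [2, 5, 4, 1, 6, 3, 5, 2],  # varied (looks careful)
}

stats = {}
for rid, answers in respondents.items():
    stats[rid] = (pstdev(answers), Counter(answers))
    print(rid, "sd =", round(stats[rid][0], 2), dict(stats[rid][1]))
```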

Check your careless detection items. If you have instructed-response items, check those first. Then check the agreement between the repeated items, synonyms, and antonyms to detect inconsistency. If you have blocks of similar items (i.e., loading on a single construct), you can check the even-odd correlations.19 What is an acceptable level of agreement or correlation? Depending on the burden of proof you are using (addressed in the next section), we recommend levels of at least .5 to .75. A cutpoint of .5 is a low burden of proof: it is a coin flip. A cutpoint of .75 seems more reasonable; at this level, the items share more variance with each other than with unrelated items.20 These consistency items will help you identify the most subtle cases of carelessness.
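A minimal sketch of the even-odd consistency check for one respondent; the three item blocks and ratings are invented, and pearson is a hand-rolled helper rather than a library call:

```python
# Sketch: even-odd consistency for one respondent. Average the odd-
# and even-positioned items within each block of similar items, then
# correlate the two sets of half-scores across blocks. Data are made up.
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Three 4-item blocks for one respondent; each block taps one construct
blocks = [[6, 5, 6, 6], [2, 3, 2, 2], [5, 4, 5, 5]]

odd_halves = [mean(b[0::2]) for b in blocks]   # 1st and 3rd item of each block
even_halves = [mean(b[1::2]) for b in blocks]  # 2nd and 4th item of each block

r = pearson(odd_halves, even_halves)
print("even-odd r =", round(r, 2))  # compare against your .5 to .75 cutpoint
```

A careful respondent answers similar items similarly, so the half-scores track each other and r lands near 1; a careless one drives r toward 0.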

What is the burden of proof?

Imagine that you conduct your analysis and find some evidence of possible carelessness. What should be your burden of proof? In other words, how strong does the evidence need to be and how much certainty should you have to declare a respondent careless? The standard that you apply should depend on the type of research that you are doing and is analogous to how courts require a different level of proof in trials. In exploratory research, the standard for carelessness should be quite high, meaning you should only label obvious cases as careless. This is analogous to how courts require proof beyond a reasonable doubt in criminal cases. The prosecution must show there is no other reasonable explanation for the evidence. In exploratory research, you can tolerate some noise in your data because you are just trying to identify the issues involved with this research question. You want to identify all the relevant issues, so you are willing to accept some carelessness to get more honest, open, and free-wheeling responses for an exploratory research question.

Descriptive research requires a lower burden of proof. In descriptive research, you are observing what is happening and estimating how often it happens.21 Noise in your data from careless responding begins to obscure your observations. As a result, you need to be more sensitive to careless responding, much as courts apply the standard of clear and convincing evidence in many family or administrative law cases. Clear and convincing evidence requires showing that a claim is highly likely to be true. In descriptive research, you do not need to be 100% convinced, but the evidence should make it highly likely that the respondent was careless.

Causal research demands the lowest burden of proof. With causal research, you must be very sensitive to carelessness because you are trying to explain why something happens. These causal mechanisms are subtle and easily obscured by noise in the data, so you should use the lowest burden of proof to establish carelessness. This is analogous to the burden of proof in civil lawsuits, where you need only be convinced that something is more likely than not. You still need meaningful evidence that a respondent was careless, but being 'mostly sure' is enough.

Conclusion 

Although the clinical trials industry depends on internet surveys, these studies are prone to bias from careless responding. This article describes how clinical trials executives can clear out the noise from careless respondents by how they structure the survey and conduct the analysis.

Removing careless respondents might make trained statistical analysts uncomfortable. They have received many warnings against data manipulation and case deletion. But removing careless respondents is not case deletion, nor is it deleting extreme outliers. Careless respondents are a type of data contamination, and "...contaminated observations can and should be minimized by careful research procedure and data preparation."18 Removing careless respondents is good research practice, is not research misconduct, and is ethical.

We realize this message goes against the current data culture where larger datasets are considered better. In this view, if you have a dataset of 200 responses to a survey that is noisy, the best solution is to increase the number of responses to 1,000 or even 10,000. In our view, a better approach is a careless responding analysis that might cut the n to 100 to obtain a clear signal from the analysis. In this way, careless responding analysis can enhance research efficiency. Rather than focus on the size of your dataset, we argue that the care with which you collect and analyze data is more important.

Clinical trial strategy and operations must rely on internet surveys to remain responsive to industry changes and patient needs. Noisy data can create misinformation about industry competitor analysis, customer satisfaction, or patient outcomes. Careless responding analysis is a way to clear out the noise in survey data to get a clear signal in your analysis.

Michael Howley, PA-C, MBA, PhD, Clinical Professor, LeBow College of Business, Drexel University, and Peter Malamis, MBA, Senior Director, Market Development, Phreesia, Inc.

References

  1. Dumais, K., & Raymond, S. (2021). Considering Patient Burden in Oncology. Applied Clinical Trials, 30(12).
  2. Getz, K., & Kim, J. (2022). Measuring Patient Satisfaction as a Primary Outcome for Patient-Centric Initiatives. Applied Clinical Trials, 31(5).
  3. Henderson, L. (2022). Salary Survey: The Age of COVID. Applied Clinical Trials, 31(1/2).
  4. Pennycook, G., & Rand, D. G. (2019). Lazy, not biased: Susceptibility to partisan fake news is better explained by lack of reasoning than by motivated reasoning. Cognition, 188, 39-50.
  5. Lazer, D. M., Baum, M. A., Benkler, Y., Berinsky, A. J., Greenhill, K. M., Menczer, F., . . . Rothschild, D. (2018). The science of fake news. Science, 359(6380), 1094-1096.
  6. Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data. Psychological methods, 17(3), 437.
  7. Ward, M. K., & Pond III, S. B. (2015). Using virtual presence and survey instructions to minimize careless responding on Internet-based surveys. Computers in Human Behavior, 48, 554-568.
  8. Swain, S. D., Weathers, D., & Niedrich, R. W. (2008). Assessing three sources of misresponse to reversed Likert items. Journal of Marketing Research, 45(1), 116-131.
  9. Huang, J. L., Curran, P. G., Keeney, J., Poposki, E. M., & DeShon, R. P. (2012). Detecting and deterring insufficient effort responding to surveys. Journal of Business and Psychology, 27(1), 99-114.
  10. Parush, A., & Yuviler-Gavish, N. (2004). Web navigation structures in cellular phones: the depth/breadth trade-off issue. International Journal of Human-Computer Studies, 60(5-6), 753-770.
  11. Barge, S., & Gehlbach, H. (2012). Using the theory of satisficing to evaluate the quality of survey data. Research in Higher Education, 53(2), 182-200.
  12. Tuten, T. L., Urban, D. J., & Bosnjak, M. (2002). Internet surveys and data quality: A review. Online social sciences, 1, 7-26.
  13. Nichols, A. L., & Edlund, J. E. (2020). Why don’t we care more about carelessness? Understanding the causes and consequences of careless participants. International Journal of Social Research Methodology, 23(6), 625-638.
  14. Silver, N. (2012). The signal and the noise: Why so many predictions fail—but some don't. Penguin.
  15. Dillman, D. A., Smyth, J. D., & Christian, L. M. (2014). Internet, phone, mail, and mixed-mode surveys: The tailored design method. John Wiley & Sons.
  16. Berry, D. T., Baer, R. A., & Harris, M. J. (1991). Detection of malingering on the MMPI: A meta-analysis. Clinical Psychology Review, 11(5), 585-598.
  17. Baumgartner, H., Weijters, B., & Pieters, R. (2018). Misresponse to survey questions: A conceptual framework and empirical test of the effects of reversals, negations, and polar opposite core concepts. Journal of Marketing Research, 55(6), 869-883.
  18. Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2013). Applied multiple regression/correlation analysis for the behavioral sciences: Routledge.
  19. Johnson, J. A. (2005). Ascertaining the validity of individual protocols from web-based personality inventories. Journal of research in personality, 39(1), 103-129.
  20. Netemeyer, R. G., Bearden, W. O., & Sharma, S. (2003). Scaling procedures: Issues and applications. Sage Publications.
  21. Babbie, E., & Mouton, J. (2001). The practice of social science research. Belmont, CA: Wadsworth.