How unwanted emails can deepen our understanding of trial probabilities.

Very few people can say anything good about spam (other than spammers, that is). I am a passionate spam-hater who goes to great lengths to avoid any exposure to the frequently offensive emails known as spam (named, of course, for a classic Monty Python comedy routine involving Vikings). With that introduction, let me tell you something good about spam. Understanding spam, and the ways we deal with it, can help us understand how to identify and manage problematic adverse events in large, untidy databases of spontaneously reported data. If that seems a stretch, let me explain.

Paul Bleicher

Spam is unsolicited commercial email (UCE), sent to thousands or millions of email addresses simultaneously, that typically advertises a commercial service or product or promotes a political viewpoint. It is a huge problem in the world of business, for many reasons. The sheer cost of scanning through and deleting spam messages is enormous, often estimated in the hundreds of thousands of dollars annually for a large company. In addition to productivity issues, spam can create significant security issues, allowing viruses, "malware," and other dangerous programs into a company. Finally, spam can create a hostile work environment for employees, creating a liability for companies.

UCE is estimated to account for 40% to 80% of all email received in the United States. The reason it is such a huge problem (compared with regular "junk" mail) is that it is free to send email. A commonsense solution for the problem would be to charge for the number of emails sent. Unfortunately, most spam is sent through relays and through the illegal hijacking of innocent computers; charging for these emails wouldn't affect the spammers or their pocketbooks.

Whether an email is spam or "not spam" is sometimes in the eye of the beholder, which certainly complicates company and group antispam strategies. Almost everybody would consider advertisements for enlargement of body parts to be spam, while reasonable people might differ about whether an advertisement for a low-rate mortgage is spam. Most noxious spam emails cannot be traced to a legitimate Web site or company, and an attempt to "unsubscribe" will likely increase the number of emails the recipient receives.

Shortly after the first spam emails began arriving in the early 1990s, programmers began developing "anti-spam" strategies, email filters, and programs. In a constant game of cat and mouse, each new strategy to identify and eliminate spam was met by more sophisticated methods of escaping spam detection. The earliest strategies used blacklists, initially personal ones, but later shared blacklists that could collect reports of spam and allow the blocking of emails from particular email addresses or domains. While these strategies do block some spam, spammers typically change their email addresses and even domains regularly, or use domains that can't be blocked because much legitimate mail originates from them (e.g., hotmail.com or yahoo.com). Later methods involved the creation of a "fingerprint" of the spam that could be stored on a centralized server. These "distributed checksum clearinghouses" (DCCs) can compare all incoming emails to known spam and reject those that match. Unfortunately, the success of the DCC strategy depends on ongoing reporting of spam and on static spam content. Neither of these criteria is reliable in the real world.
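To make the weakness of fingerprinting concrete, here is a minimal sketch in Python. It assumes an exact hash of the normalized message body, which is a simplification chosen for illustration; the function names and the sample messages are hypothetical and do not describe any particular clearinghouse.

```python
import hashlib

def fingerprint(body: str) -> str:
    """Hash a message body after collapsing case and whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hypothetical shared store of fingerprints that users have reported as spam.
known_spam = {fingerprint("Buy cheap pills now! Click here.")}

def is_known_spam(body: str) -> bool:
    """True when the body matches a previously reported fingerprint."""
    return fingerprint(body) in known_spam

# Cosmetic changes in case and spacing still match the stored fingerprint.
same_message = is_known_spam("buy  cheap pills NOW! click here.")
# Changing a single character produces a different fingerprint.
altered_message = is_known_spam("Buy cheap pi11s now! Click here.")
```

In this sketch the first check matches and the second does not, which is the "static spam content" problem in miniature: the filter only catches what has already been reported, unchanged.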

As spammers became more sophisticated, spam-blocking programs developed scoring systems that could eliminate spam email based upon the words and formatting used in the messages. Unfortunately, the rules were available to the spam generators as well, and spam strategies arose to circumvent these algorithms. For example, screening for the word "Viagra" led spammers to begin using V1agra, V!@gra, and many other variations. If a rule looked for a predominance of certain words or phrases, spammers added "nonsense" phrases or text to dilute the actual content. Algorithms which incorporate many different rules are somewhat arbitrary in their weighting criteria, and are prone to errors in overreporting (false positives) and underreporting (false negatives) spam.
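A rule-based scorer of this kind might look roughly like the following Python sketch. The phrases, weights, and threshold are invented for illustration and are not taken from any real filter.

```python
# Hypothetical rule weights and threshold; real filters combined many more rules.
RULE_WEIGHTS = {"viagra": 3.0, "free": 1.0, "click here": 1.5}
THRESHOLD = 3.0

def rule_score(message: str) -> float:
    """Sum the weights of every rule phrase that appears in the message."""
    text = message.lower()
    return sum(weight for phrase, weight in RULE_WEIGHTS.items() if phrase in text)

def is_flagged(message: str) -> bool:
    """Flag the message as spam when its score reaches the threshold."""
    return rule_score(message) >= THRESHOLD

# Plain spam is caught.
caught = is_flagged("Viagra at low prices, click here")
# Disguised spelling slips past every rule (a false negative).
missed = is_flagged("V1agra at low prices, cl1ck here")
# A legitimate message that happens to use the words is flagged (a false positive).
wrongly_flagged = is_flagged("Feel free to call; the Viagra dosing study is attached")
```

The hand-picked weights are exactly the arbitrary element described above: nothing in the data tells the rule writer that "viagra" deserves 3.0 rather than 2.0.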

Spam filtering tools dramatically improved with the application of Bayesian statistics to the methodology. In fact, Bayesian filtering is behind most of the highly successful spam filtering tools available today. The power of Bayesian filtering is that it identifies and weeds out spam at its very core, the message itself, through a simple method that automatically learns and adapts as the spam message changes. According to a seminal article on the use of Bayesian filtering for spam ("A Plan for Spam," Paul Graham, http://www.paulgraham.com/spam.html), the only way for a spam message to evade a Bayesian filter over time is to make the message content more and more like a normal message, with no unpleasant or sales content at all.

All of this may sound like magic, but it isn't. It is a very powerful application of probability theory that doesn't require advanced mathematics to understand.

Bayes' theorem was developed by the mysterious Reverend Thomas Bayes, who wrote exactly one paper on the theorem; it was published in 1763, two years after his death in 1761. This work was rediscovered in a more generalized form by Laplace less than 20 years later and has been used in a wide variety of analyses since, only taking on the name "Bayesian" in the 1950s.

The principle behind Bayesian analysis is the derivation of empirical (often historical) probabilities of the occurrence of some event from existing data. These empirical probabilities are then used to predict the occurrence of future events, or the meaning of undefined current events. I originally learned of Bayesian analysis back in medical school in the 1970s, when a small group of physicians began to think about the interpretation of laboratory tests based on Bayesian principles. Examples of this application of Bayesian analysis can be very clear, and can help us set the stage for understanding the use of Bayesian analysis in spam and later in the analysis of adverse events.

Consider the use of mammography in the screening of 40-year-old women for breast cancer. This example is taken from a very in-depth introduction to Bayes' theorem at http://yudkowsky.net/bayes/bayes.html. All estimates are for discussion purposes only, and are not intended to be accurate. Empirically, about 1% of these women will have breast cancer, and 80% of the women with breast cancer will have a "positive" mammogram at a given screening, indicating the need for further evaluation. Unfortunately, 9.6% of nonaffected women will also have a positive mammogram. The question that becomes relevant to many people is "What can a physician tell a 40-year-old woman who has a positive mammogram on a random screening?" The real answer surprises a great many physicians and lay people alike. Given these numbers, most physicians guess that 70% of women with a positive mammogram on screening will later be shown to have breast cancer. Only 15% or so guess the correct answer: 7.8%, or about one in 13, of the positive results actually indicates a woman with breast cancer.

To understand how we get this result, let's look at 1000 women screened under these conditions. Ten would have breast cancer (as stated, 1%), and of these, eight would be screened as positive, as described above. However, of the 990 women who don't have the disease, 95 would also be screened positive, given the 9.6% rate mentioned. Of 1000 women screened, eight would have both a positive result and breast cancer, and 103 women in total would have a positive result. Dividing 8 by 103 gives 7.8% as the likelihood that a woman with a positive mammogram actually has a malignancy. This is known as the posterior probability.
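The same calculation can be written directly in terms of probabilities rather than head counts. The short Python sketch below uses the illustrative figures from the example; the function and parameter names are my own.

```python
def posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """P(disease | positive test), computed with Bayes' theorem."""
    true_positives = prior * sensitivity
    false_positives = (1.0 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Illustrative screening figures from the text: 1% prevalence,
# 80% of affected women test positive, 9.6% of nonaffected women test positive.
screening = posterior(prior=0.01, sensitivity=0.80, false_positive_rate=0.096)
print(f"{screening:.1%}")  # prints 7.8%
```

Because so few of the women screened have the disease, the false positives from the large nonaffected group swamp the true positives, which is why the answer is so much lower than most people guess.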

The key to determining the posterior probability of a particular test result is to have a good estimate of the "prior probability" of the population that you are going to test. Consider how different the results would be for a patient with a strong family history of breast cancer and a palpable lump on exam. Here, an individual physician might estimate a 60% prior probability of the patient having a malignancy. Working the numbers through for 1000 such patients (600 with a malignancy and 400 without), this would create a post-test probability of 480/(38 + 480), or a 92.7% likelihood that a positive mammogram represented a malignancy. Furthermore, it is likely that 120/(120 + 362), or 24.9%, of the women who test negative under these circumstances will still harbor a malignancy. Further testing or a biopsy could be warranted under these circumstances.
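For readers who want to check the counts, this Python sketch splits 1000 hypothetical high-risk patients into the four possible outcomes, using the same illustrative test characteristics as before. The function name and the rounding to whole patients are assumptions made to mirror the arithmetic in the text.

```python
def screen_counts(n: int, prior: float, sensitivity: float, false_positive_rate: float):
    """Split n screened patients into the four test outcomes, rounded to whole patients."""
    diseased = round(n * prior)
    healthy = n - diseased
    true_pos = round(diseased * sensitivity)
    false_neg = diseased - true_pos
    false_pos = round(healthy * false_positive_rate)
    true_neg = healthy - false_pos
    return true_pos, false_neg, false_pos, true_neg

# 60% prior probability, 80% sensitivity, 9.6% false positive rate.
true_pos, false_neg, false_pos, true_neg = screen_counts(1000, 0.60, 0.80, 0.096)

# Share of positive results that reflect a malignancy: 480 / (480 + 38).
positive_is_cancer = true_pos / (true_pos + false_pos)
# Share of negative results that still hide a malignancy: 120 / (120 + 362).
negative_is_cancer = false_neg / (false_neg + true_neg)
```

With the higher prior, a positive result is now very likely to be real, but a negative result no longer offers much reassurance.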

The power of Bayes lies in having some empirical understanding of the likelihood of a result in an individual patient BEFORE a test is done. Bayesian reasoning can help the clinician decide the meaning of a lab result, whether to test a given patient, or even whether to screen populations for patients based on the specificity, sensitivity, and population prevalence of the condition being tested.

Back to spam. The same mathematical principles in the example above can be applied to the screening for spam. By examining spam emails and normal emails, it is possible to learn the frequency of particular words in both types of email. Certain words (for example, "click") may be found in a high percentage of spam emails but in very few normal emails. Similarly, there are other words that are common in normal emails but rare in spam. We are, in effect, calculating the prior probability of an email being spam, based on the occurrence of a single word. Thus, an email containing several of these high prior probability words might have greater than a 99% posterior probability (or likelihood) of being spam. Sophisticated Bayesian spam filters have various strategies to examine the 10 or 15 highest scoring words suggesting spam and the 10 or 15 highest scoring words suggesting normal email. They can then calculate an overall spam score for the email.
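A toy version of this scoring can be sketched in Python. It is loosely in the spirit of the approach described in "A Plan for Spam," but the word counts, the clamping bounds, the equal weighting of spam and normal mail, and all names below are illustrative assumptions rather than any real filter's settings.

```python
import re
from math import prod

# Hypothetical training data: how many messages of each kind contain each word.
SPAM_TOTAL, NORMAL_TOTAL = 1000, 1000
SPAM_COUNTS = {"click": 400, "free": 300, "protocol": 2, "meeting": 5}
NORMAL_COUNTS = {"click": 20, "free": 60, "protocol": 150, "meeting": 250}

def word_spam_probability(word: str) -> float:
    """P(spam | word), assuming spam and normal mail are equally common."""
    spam_rate = SPAM_COUNTS.get(word, 0) / SPAM_TOTAL
    normal_rate = NORMAL_COUNTS.get(word, 0) / NORMAL_TOTAL
    if spam_rate + normal_rate == 0:
        return 0.5  # unseen word: no evidence either way
    p = spam_rate / (spam_rate + normal_rate)
    return min(0.99, max(0.01, p))  # clamp so no single word is decisive

def spam_score(message: str, top_n: int = 15) -> float:
    """Combine the top_n most telling word probabilities into one score."""
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    probs = sorted((word_spam_probability(w) for w in words),
                   key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    spam_like = prod(probs)
    normal_like = prod(1.0 - p for p in probs)
    return spam_like / (spam_like + normal_like)

sales_pitch = spam_score("Click here for a free offer")
work_email = spam_score("Protocol meeting moved to Friday")
```

With these made-up counts, two telling words are enough to push the sales pitch above a 99% spam score, while the work email scores well below 1%. Words the filter has never seen contribute nothing either way.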

Bayesian spam detection is very powerful, detecting greater than 99.8% of spam with virtually no false positives. Importantly, if false positives and false negatives are identified and "reported" to the spam engine, the quality of spam detection will continue to improve. In fact, as new spam and normal emails come in they will be examined and added to the calculations. Thus, the Bayesian spam engine will "learn" about new spam emails when they first come out, and will adapt. Even strange spelling and other tricks can't fool the filter—they will be easily discovered and screened out. The hardest spams for a Bayesian filter to identify are those that are nearly identical to a normal email—hardly effective spam!
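The "learning" described here amounts to updating word counts whenever a message is reported. The following Python sketch shows the idea with a toy filter; the class, its methods, and the sample messages are all hypothetical, and the scoring is deliberately simplified.

```python
import re
from collections import Counter
from math import prod

class AdaptiveFilter:
    """Toy Bayesian filter whose word counts grow with every reported message."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # keyed by is_spam
        self.totals = {True: 0, False: 0}

    @staticmethod
    def tokens(message: str) -> set:
        return set(re.findall(r"[a-z0-9]+", message.lower()))

    def train(self, message: str, is_spam: bool) -> None:
        """Record one reported message as spam or normal mail."""
        self.counts[is_spam].update(self.tokens(message))
        self.totals[is_spam] += 1

    def word_probability(self, word: str) -> float:
        spam_rate = self.counts[True][word] / max(self.totals[True], 1)
        normal_rate = self.counts[False][word] / max(self.totals[False], 1)
        if spam_rate + normal_rate == 0:
            return 0.5  # unseen word: no evidence either way
        return min(0.99, max(0.01, spam_rate / (spam_rate + normal_rate)))

    def score(self, message: str) -> float:
        probs = [self.word_probability(w) for w in self.tokens(message)]
        spam_like = prod(probs)
        normal_like = prod(1.0 - p for p in probs)
        return spam_like / (spam_like + normal_like)

spam_filter = AdaptiveFilter()
spam_filter.train("cheap viagra click now", True)
spam_filter.train("minutes from the protocol meeting", False)

new_trick = "v1agra shipped discreetly"
before = spam_filter.score(new_trick)   # every word is unseen: score is 0.5
spam_filter.train(new_trick, True)      # a user reports the missed message
after = spam_filter.score("v1agra shipped today")
```

Once a single misspelled message has been reported, the odd spelling itself becomes strong evidence of spam, which is why such tricks tend to backfire.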

Bayesian analysis can be applied in many places where it is possible to develop an estimate of the expected probability of an occurrence. Currently, there is interest (and some experience) in the use of Bayesian statistics for determination of efficacy in clinical trials based in part on historical data. While this approach requires the acceptance and buy-in of regulatory reviewers, there is another important application for Bayes in clinical data: the data mining of large databases of adverse events. The statistics and principles of this process are far beyond the scope of this column, but it is worth considering the general concept.

In looking at very large databases of reported adverse experiences, it is easy to identify all of the AEs reported with any particular drug, and even to rank these by the frequency of their reports. For example, headache might be reported in 10% of the AE reports regarding a particular drug or drug class. This is not particularly meaningful without an understanding of the expected frequency of headache reports. Using a calculated prior probability for headaches (perhaps based on the frequency of headaches in ALL reported AEs), Bayesian statistics can be applied to calculate an adjusted relative risk for headache associated with a particular drug. When other possible confounding variables are eliminated and a visualization and/or threshold is applied, one can identify adverse events that may require further study and analysis to determine if there is a meaningful clinical association between the drug and the AE. This type of Bayesian analysis can also be applied to determine if two drugs together might be associated with particular AEs.
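The general idea can be sketched in a few lines of Python. This is a deliberately crude stand-in for the Bayesian methods alluded to above, which this column does not specify: it compares observed with expected report counts and uses a simple pseudo-count to pull the ratio toward 1 when there are few reports. The database sizes, the pseudo-count of 0.5, and the function names are all assumptions.

```python
def expected_count(drug_reports: int, event_reports: int, total_reports: int) -> float:
    """Reports expected for a drug-event pair if the drug shared the database-wide event rate."""
    return drug_reports * event_reports / total_reports

def shrunk_ratio(observed: int, expected: float, pseudo: float = 0.5) -> float:
    """Observed-to-expected ratio, pulled toward 1 when the counts are small."""
    return (observed + pseudo) / (expected + pseudo)

# Hypothetical database: 1,000,000 AE reports, 40,000 of which mention headache (4%).
TOTAL, HEADACHE = 1_000_000, 40_000

# Drug A: 2,000 reports, 200 mentioning headache (10%, as in the example).
ratio_a = shrunk_ratio(200, expected_count(2_000, HEADACHE, TOTAL))
# Drug B: only 10 reports, 1 mentioning headache (also 10%).
ratio_b = shrunk_ratio(1, expected_count(10, HEADACHE, TOTAL))
```

Both drugs show headache in 10% of their reports against a 4% background, a raw ratio of 2.5. With the adjustment, drug A stays near 2.5 while drug B falls to roughly 1.7, reflecting how little a single report should count as a signal.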

When applied to a massive data set with appropriate visualization techniques, Bayesian techniques allow a nontechnical person to sift through an enormous amount of data and relatively quickly identify important AE signals amidst a tremendous amount of noise. Thought of in this way, the task clearly shares characteristics with the identification of useful emails amongst a sea of spam. The techniques for finding the "signal" amongst the "noise" are very different. Yet, for both of these applications, we owe a debt of gratitude to a very private 18th century mathematician.

**Paul Bleicher**, MD, PhD, is the founder and chairman of Phase Forward, 880 Winter Street, Waltham, MA 02451, (888) 703-1122, paul.bleicher@phaseforward.com, www.phaseforward.com. He is a member of the *Applied Clinical Trials* Editorial Advisory Board.
