With the inherent limitations of traditional data sources, this overview explores the definitions and strategic fundamentals that inform machine learning and modeling techniques.
Machine learning, which has become a celebrated phrase, is a branch of computational science regarded by some as a panacea for our imperfect understanding of disease classification and the prediction of clinical outcomes. Others, particularly those for whom these methods seem mechanical, artificial, and enigmatic, take an unjustifiably dim view of the potential they hold. As usual, the truth resides at neither of these extremes. While these tools are potent and often utilitarian, they are not flawless, and, if used or interpreted incorrectly, can lead to grievous mistakes. In the proper hands, however, one can optimize model performance and simulations to useful advantage, while appreciating the limitations of the results. One prediction can be made with reasonably high certainty, however. Namely, machine learning and modeling are likely to transform biological research and the practice of clinical medicine over the ensuing years and decades. We should welcome this, because the amount of knowledge about disease pathogenesis, mechanisms of drug and biologic effects, and biology in general with its vast networks, is growing at an enormous pace, and has reached the point that we have exceeded the capacity of any single human intellect to form a detailed, comprehensive, and integrated understanding of the complexity that must be mastered to achieve optimal clinical results.
Ultimately, with finer parsing of patient populations into more refined diagnostic categories, each tethered to specific risks, risk levels, and clinical outcomes, we will be in a better position to more precisely intervene, both preemptively and therapeutically, with an eye toward enhancing human health. Otherwise stated, if we can confirm that the group of patients with an overarching diagnosis of “X” actually consists of multiple subsets (e.g., X1, X2, X3, etc.), each attended by greater or lesser degrees of risk for pathophysiological derangements/events that eventuate in certain disease complications, we can surgically tailor interventions to the specific patient with disease X1 rather than intervene on the basis of the “averaged knowledge” of future risks for the entire group (i.e., X) as a whole. As has been said, no individual patient is “average.” More granular insights into a patient’s condition and the possible perils he or she faces in the future, based on a swath of clinical and –omics data, will have implications for target product profiles in drug or biologic development programs, engendering better, safer, and more precise interventions in a given clinical case.
Traditionally, our efforts to discern differences between patient groups and outcomes that can be predicted based on patient characteristics have relied upon subset analyses of large trials, meta-analyses, or epidemiologic databases using, quite often, simple post-hoc descriptive statistics or logistic regression techniques. Sometimes, small underpowered studies of rare patient groups/subgroups or case reports are relied upon by clinicians, despite the inherent limitations of these data sources. Machine learning techniques, however, improve our ability to interrogate large amounts of multidimensional data to make improved predictions.
This short monograph provides a high-level overview of basic principles of machine learning and modeling approaches geared toward clinical scientists and others in the biotechnology and biopharmaceutical sector, but will not cover the mathematical underpinnings of specific techniques, each of which would require separate in-depth treatment. Note also that we will not dwell at length on systems biology-based modeling as a means of elucidating biological complexity, as this, too, would constitute a separate survey. Instead, we will confine ourselves to basic definitions and strategic fundamentals that inform machine learning techniques. The importance of the latter cannot be over-emphasized. There is a tendency on the part of some to regard specific machine learning methods, especially newer ones with impressive-sounding names such as deep neural nets, as “better” than other techniques. While newer methods are extraordinarily powerful, they are not always appropriate in a given case, and carry major risks if used when they should be left in the toolbox in favor of a more suitable implement. We would not use a sledgehammer to place a nail in a wall made of particle board in order to hang a picture. Much of the art of machine learning relies upon knowing when, precisely, to use a tool and, importantly, what steps must be taken prior to and during the use of that hammer (or wrench or screwdriver). These steps will be described and explained herein, but include means of mitigating the hazards of model “overfitting” (to be defined later) and tactics that enable early cross-validation of models to improve performance.
Basic concepts and definitions
Machine learning comes in three major flavors: (1) supervised learning (in which predictive algorithms/models or “classifiers” are developed with the knowledge of the class to which an example belongs in a model “training data” set), (2) unsupervised learning (in which the algorithm clusters training examples without any such foreknowledge), and (3) reinforcement-based learning. In this paper, we will concentrate on an explanation of “supervised” learning, which produces algorithms that serve as “classifiers.”
The first step in creating a machine learning model or algorithm for prediction or classification consists of “training” a well-selected type of algorithm. Before we discuss training (or “fitting” of data), we must describe the typical “substrate” on which such fitting depends, namely, the training data set itself. In brief, a training data set consists of “examples” (e.g., patients) characterized by certain feature values (e.g., elevated temperature [yes/no], leukocytosis [yes/no], rapid respirations [yes/no] etc.). In “supervised” machine learning, we have a priori knowledge of the “class” to which each example actually belongs. A toy example is illustrative. Table 1 depicts a “feature matrix” for a hypothetical training data set, with each row corresponding to a patient, and each column containing data for a specific variable in the training set (e.g., fever, etc.). These variables are also referred to as “features” or “attributes” of the examples.
Distinct from this “feature matrix,” which may include either dichotomous or continuous variables (the latter circumscribe the number of techniques we can deploy), the training data set also includes a single “target vector.” In simplistic terms, the target vector can be thought of as a list of known classification assignments for each example (e.g., patient in this case) in the training set. There is one “class” (also known as “label”) entry for each patient in the target vector, and the class (e.g., imaging study evidence of pneumonia [yes/no]) is what we will seek to predict based purely on the feature values in the feature matrix. This “one dimensional” target vector might, for example, include a “Y” (i.e., radiological evidence of pneumonia confirmed) for patients 2, 3, and 4, but an “N” for patients 1 and N because the latter two had evidence of a pulmonary embolism on CT angiogram (which would fit with the elevated d-dimer levels in these two examples) and clear chest X-rays (CXRs). Obviously, this is merely a toy example of limited clinical relevance, as a CXR result in an ED setting would likely be available before a d-dimer value, but it illustrates the points we need to make.
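As a minimal sketch of these two objects in Python, the layout below assumes four yes/no features and five patients. The feature names and every value are hypothetical, chosen only to echo the pneumonia example; they are not taken from Table 1.

```python
# Hypothetical feature names (the columns of the feature matrix).
feature_names = ["fever", "leukocytosis", "rapid_respirations", "elevated_d_dimer"]

# Feature matrix: one row per patient, one column per feature (1 = yes, 0 = no).
# All values are invented for illustration.
feature_matrix = [
    [0, 0, 1, 1],  # patient 1
    [1, 1, 1, 0],  # patient 2
    [1, 1, 0, 0],  # patient 3
    [1, 0, 1, 0],  # patient 4
    [0, 0, 1, 1],  # patient 5
]

# Target vector: one known class label per patient
# (1 = radiological evidence of pneumonia, 0 = none).
target_vector = [0, 1, 1, 1, 0]

# Every patient needs exactly one label and a value for every feature.
assert len(feature_matrix) == len(target_vector)
assert all(len(row) == len(feature_names) for row in feature_matrix)

for row, label in zip(feature_matrix, target_vector):
    print(dict(zip(feature_names, row)), "->", "pneumonia" if label else "no pneumonia")
```

The model only ever sees the rows of the feature matrix as inputs; the target vector is what supervised training tries to reproduce.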
A model must be “fitted” to training data before we can use it. After a linear model (i.e., one that, in a sense, provides a line that best separates classes based on features) is optimally fitted we acquire, for example, values for multiple coefficients, each linked to a specific feature that, in aggregate, allow us to calculate a score which determines the class to which a given example is likely to belong, based solely on the feature values for that example. This is an iterative process, and, during the fitting process, the values for these coefficients are continually adjusted until the prediction error rate is minimized. In a very rough way, this can be thought of as follows. Let’s say we would like to know whether someone (John Q. Patient) infected with bacterium X is likely to be admitted to the ICU. We have scads of features at admission to hospital for John Q. Patient (some of which are relevant and some of which are not). To construct a predictive model, we consider a number of other such infected individuals for whom we know the outcome (a “training set” with data available for the same set of feature variables), select an appropriate machine learning tool and “train” that tool by adjusting weights based on the training set. For the purposes of this high-level overview, we don’t need to delve here into the neat mathematical details of the means by which those weights are adjusted but, at the end of the exercise, we have something akin to an extremely simple equation like the following:
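A form consistent with this description, writing $w_A, w_B, \dots$ for the fitted weights (this notation is an assumption introduced here), would be:

```latex
w_A \cdot A + w_B \cdot B + w_C \cdot C + \dots = X
```

Here $A$, $B$, $C$, and so on are the feature values for one patient, and $X$ is the resulting score.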
With our trained algorithm in hand (note that variables with more “predictive heft” will be tethered to larger weights at the end of the day), we then insert John Q. Patient’s individual data for the respective variables (A, B, etc.) and compute the summed score “X” on the right. In contrast to the training examples that we used to fit the model, we don’t know, for example, if he will die within 30 days of a diagnosis of sepsis. The model then compares John’s score with the model’s threshold and determines whether the score is above or below that threshold. Depending on John’s score relative to that boundary line, we have a simple “yes/no” answer (as a prediction). That is, John either appears to belong to the class of patients that will die within 30 days, or he does not. This is useful! This knowledge might affect monitoring and patient care. It could also be used as an eligibility criterion to enrich a patient population for a clinical trial seeking to evaluate the effectiveness of intervention Y in sepsis. We have chosen a simple example of a “linear classifier” in this instance, but there are many other types of machine learning tools.
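The scoring-and-threshold step can be sketched in a few lines of Python. The weights, the threshold, and John's feature values below are all hypothetical; in practice the weights and threshold would come out of the fitting process rather than being typed in.

```python
# Hypothetical fitted weights and decision threshold; a real model would
# obtain these from training data.
weights = {"A": 0.5, "B": 1.5, "C": -0.25}
threshold = 1.0

def score(features, weights):
    """Weighted sum of feature values (the score X)."""
    return sum(weights[name] * value for name, value in features.items())

def classify(features, weights, threshold):
    """Return True ('yes') when the score exceeds the threshold."""
    return score(features, weights) > threshold

# Hypothetical feature values for one new patient.
john = {"A": 1, "B": 1, "C": 0}
print(score(john, weights))                # 2.0
print(classify(john, weights, threshold))  # True
```

Feature B carries the largest weight here, so it has the most influence on which side of the threshold a patient lands.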
Here, we must underscore one extremely important point. Once a model has been trained (as above) and we have a specific algorithm in hand, we will, of course, be able to say how well it classifies all of the examples in a training set (sensitivity, specificity, and accuracy). In other words, if a training set consists of 50 Beatles songs and 50 Rolling Stones songs (for millennials, feel free to swap in Arcade Fire or Ed Sheeran for one of those groups), and those tunes are classified by an algorithm based on a variety of features (timbre of the lead vocalist, reliance on harmony, etc.), we can easily calculate the proportion of Beatles songs in the training set, for example, correctly identified as Beatles songs by the algorithm. Let’s say that percentage is 90%. Do we have a clever algorithm? Not necessarily. The merit of a machine learning algorithm is measured not by its ability to perfectly classify examples in a training data set, but by its ability to correctly make such predictions for novel examples not included in the training data set (i.e., an independent “test” set). In other words, we are primarily interested in how well the model “generalizes” to albums and songs it has never heard before. Fortunately, there are ways to enhance the likelihood of generalizability as we shall learn.
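To make the three performance measures concrete, the sketch below computes them for a training set and for an independent test set. The prediction vectors are invented purely to illustrate a model that looks better on its training data than on new songs; they are not results from any real classifier.

```python
def classification_metrics(actual, predicted):
    """Sensitivity, specificity, and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),   # positives correctly identified
        "specificity": tn / (tn + fp),   # negatives correctly identified
        "accuracy": (tp + tn) / len(pairs),
    }

# Training set: 50 Beatles songs (1) and 50 Rolling Stones songs (0).
train_actual = [1] * 50 + [0] * 50
train_predicted = [1] * 45 + [0] * 5 + [0] * 45 + [1] * 5   # hypothetical

# Independent test set of 40 songs the model has never seen.
test_actual = [1] * 20 + [0] * 20
test_predicted = [1] * 13 + [0] * 7 + [0] * 12 + [1] * 8    # hypothetical

train_metrics = classification_metrics(train_actual, train_predicted)
test_metrics = classification_metrics(test_actual, test_predicted)
print(train_metrics)  # sensitivity 0.9, specificity 0.9, accuracy 0.9
print(test_metrics)   # sensitivity 0.65, specificity 0.6, accuracy 0.625
```

The gap between the two sets of figures, rather than the training figures alone, is what tells us how well the model generalizes.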
The major peril of overfitting a model
In any given case, there is an optimal degree of model complexity. More features (e.g., hundreds of transcriptomic, proteomic, and clinical model parameters) are not necessarily better and can be deleterious to model performance. Furthermore, some model types (e.g., polynomial classifiers and deep neural nets) are inherently prone to complexity. A model that is overly complex given the data (the term we use is “overfitted”) makes perfect or near-perfect predictions/classifications for the data set upon which it has been trained. Yet such an opaque and/or counterintuitive model of byzantine complexity may be misguided, reflecting random peculiarities in the training set features, thereby leading to abject predictions when the model is applied to an independent test data set (which is what truly counts). A model that is too simple and sparse, on the other hand, does not adequately capture the predictive value of key data features that are useful in making predictions (i.e., it is “underfitted” and omits critical parameters). There are various techniques to optimize this trade-off between what is referred to as high “bias” (i.e., an underfitted model) and high “variance” (i.e., an overfitted model marred by extra “Trojan Horse” variables and parameters that sabotage model performance when the algorithm is applied to independent test data sets). In general, to avoid overfitting, we strive for the simplest model that accomplishes the goal. To better understand the concept of overfitting, let’s consider a very straightforward toy example. Suppose we have 80 pieces of fruit in a basket, consisting of 40 Ambrosia apples (first image) and 40 Honeycrisp apples (second image):
Now, in inspecting these two images, one appreciates some fairly obvious distinctions between the two types of apples that are often used to help differentiate the classes. The Ambrosia variety is dual-colored and elongate or conical in shape, while the Honeycrisp is predominantly red and, in this example at least, squat in appearance. One also sees that the Honeycrisp can have stripes. All is well, we have candidate features for an algorithm that seeks to discriminate between the two types of apples, such as primarily red (Y/N), elongated (Y/N), and striped (Y/N). It should be noted, however, that single apple features (like individual clinical symptoms) in many cases can’t be relied upon exclusively to serve as diagnostic traits indicative of a class. For example, when a Honeycrisp is not ripe, it is not almost entirely red, and the yellow under-color is more prominent. That could cause some confusion in differentiating it from the yellow-red Ambrosia. Similarly, like the Ambrosia, the Honeycrisp can, in some cases, also be elongate in shape. Therefore, an algorithm that leverages multiple appropriately weighted features is likely to perform better than a single discriminator in telling the two breeds apart.
What types of features do we typically like to include? Those that are predictive and that provide what is sometimes referred to as “independent” or “orthogonal” information. For example, if we were to add a taste variable to the set of features, that would be completely orthogonal. On the other hand, if we included another “appearance” variable that was highly correlated with an appearance variable we already have in our suite of features in the feature matrix, that would be much less helpful.
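One simple way to flag a redundant feature is to compute its correlation with the features already in hand. The sketch below uses the Pearson correlation coefficient, which is an assumption on our part (the notion of “orthogonal” information is looser than zero linear correlation), and the feature names and measurements are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements on six apples.
redness = [1, 2, 3, 4, 5, 6]
hue_score = [2, 4, 6, 8, 10, 12]   # a second appearance variable
sweetness = [3, 1, 2, 2, 1, 3]     # a taste variable

print(pearson(redness, hue_score))  # close to 1: largely redundant
print(pearson(redness, sweetness))  # close to 0: new information
```

A candidate whose correlation with an existing feature is near 1 or -1 adds little; one near 0 is more likely to contribute independent information.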
But what if we toss in one or two random monkey wrenches? Suppose, purely by chance, the apple picker at Orchard B (the source of the Honeycrisps) was more prone to collect apples that had fallen to the ground and that were bruised than the picker at Orchard A. Let’s say, then, that 65% of the Honeycrisp apples he contributes to our basket have bruises, while only 5% of the Ambrosia apples from the other orchard have contusions. Or, perhaps, again by fluke, 50% of the apples from Orchard A harbor apple maggots whereas only 3% of apples from Orchard B play host to these “railroad worms.” If we included railroad worms and bruises as additional features in our model (or classifier – we will loosely use those terms interchangeably here), we may have a predictive algorithm that, with 100% accuracy, specifically distinguishes Ambrosia apples from Honeycrisp apples in the training set, but which, as one would expect with the inclusion of such “random nonsense” features, fails miserably when applied to other mixed apple baskets containing Ambrosia and Honeycrisp apples from a variety of other orchards.
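The effect of a fluke feature can be shown with a deliberately crude sketch. It isolates the bruise feature alone (so it does not reach the 100% training accuracy described above), uses the bruise percentages from the example for the training basket, and assumes a hypothetical second basket in which bruising is unrelated to variety.

```python
def make_basket(n_honeycrisp, bruised_honeycrisp, n_ambrosia, bruised_ambrosia):
    """Return (variety, bruised) pairs for a basket with the given counts."""
    basket = [("Honeycrisp", i < bruised_honeycrisp) for i in range(n_honeycrisp)]
    basket += [("Ambrosia", i < bruised_ambrosia) for i in range(n_ambrosia)]
    return basket

def bruise_rule(bruised):
    """Spurious rule picked up from the training basket: bruised means Honeycrisp."""
    return "Honeycrisp" if bruised else "Ambrosia"

def accuracy(basket):
    """Fraction of apples whose variety the bruise rule gets right."""
    correct = sum(1 for variety, bruised in basket if bruise_rule(bruised) == variety)
    return correct / len(basket)

# Training basket: 65% of 40 Honeycrisps (26) and 5% of 40 Ambrosias (2) are bruised.
training_basket = make_basket(40, 26, 40, 2)
# Hypothetical basket from other orchards: 10% of each variety is bruised.
new_basket = make_basket(40, 4, 40, 4)

print(accuracy(training_basket))  # 0.8
print(accuracy(new_basket))       # 0.5
```

On the new basket the rule does no better than a coin toss, because the association it relied on was an accident of how the training apples were collected.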
In a similar vein, if we were to use a complex polynomial classifier (rather than a simple linear classifier) with multiple features including exponents of variables and products of variables, we may perfectly separate Rheumatoid Arthritis patients with outcome X versus those with outcome Y in a tiny training set with a potentially “tortured” non-linear boundary between classes that has an appearance like this:
What we see above is an odd-appearing and serpentine decision boundary for this classifier that almost seems to battle to separate patients with the two different outcomes into discrete camps. Overfitting is likely here, and some would call the model “opaque” as it is not intuitive. It is not, obviously, a clean linear classifier, with examples neatly separated by a straight line. This is not to say that polynomial classifiers have no use in predictive health analytics. There are cases, particularly with larger data sets that defy simple linear classifiers, where a polynomial classifier may come to the rescue. We will learn, in a very general sense, how to tailor the complexity of a model type to the size of a data set shortly. In short, unleashing a complex neural net or a polynomial classifier on a small data set, especially with numerous features, is practically always ill-advised, since we would court the “overfitting demon” and be left with a model that performs wretchedly when the rubber hits the road (i.e., when confronted with an independent data set for which the clinical outcomes are not known).
How, then, do we avoid the dangers of overfitting?
To mitigate the risk of overfitting, one can deploy a variety of techniques. Some models lend themselves to “regularization” as a means of reducing that hazard, the detailed mathematical basis for which we will skip here but the gist of which we will attempt to briefly explain. Suffice it to say that weights or coefficients for variables in a predictive model are often adjusted during the fitting process in accordance with the rate of change of a “cost function” (which is higher when model predictions are incorrect) with respect to that given weight. This amounts to a partial derivative. The objective is to decrease model cost and thereby generate a model that classifies or predicts more accurately. Simply put, when cost declines rapidly with respect to a specific variable weight, the weight in question is more substantially increased relative to other weights during the fit. This can be thought of as preferentially “growing” a weight for a specific variable that is heavily responsible for driving down cost (and, by corollary, enhancing model prediction accuracy). Conversely, when the rate of cost decline changes little with respect to a weight, that weight is not accentuated a great deal in the next go-around. Regularization, a means of reducing overfitting, amounts, in essence, to levying a greater penalty on parameter weights of larger absolute magnitude, with the result that costs are more inflated for those weights. In other words, when regularization strength (set by the modeler through a specific parameter) is increased, there is more of a cost penalty as the weight for that variable (feature) grows. During fitting, the model is “tricked” into believing that augmenting an already large weight for a variable achieves less by way of enhanced model accuracy (i.e., cost reduction) than it actually does. 
One can think of this metaphorically as preferentially blackening the character of larger weights (i.e., corresponding to variables with more explanatory power) such that fewer weights and variables are invited to the after-party (i.e., the final model). Or the weighted variables in the model that is being fitted can be thought of as participating in a race. If larger weights are more hamstrung by regularization during the race, then it becomes progressively more difficult for a weight to “cross the finish line” into the final model. This results in a model that is more parsimonious, with fewer weights (and attendant variables). If, on the other hand, the regularization strength is attenuated, then there is less incremental penalty as a weight grows, with a tendency to retain more parameters in the model. There are different types of regularization, but those distinctions are beyond the scope of this primer.
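The interplay between the cost gradient and the regularization penalty can be sketched with a small gradient-descent fit. Several choices here are assumptions made for brevity: a squared-error cost on a numeric target rather than a classification cost, an L2 (“ridge”) penalty as one common type of regularization, and a four-row data set invented for the purpose. An L2 penalty shrinks weights toward zero rather than removing them outright.

```python
def fit_linear(rows, targets, reg_strength, learning_rate=0.1, steps=2000):
    """Gradient descent on mean squared error plus an L2 penalty on the weights."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    for _ in range(steps):
        # Prediction error for every example under the current weights.
        errors = [sum(w * x for w, x in zip(weights, row)) - t
                  for row, t in zip(rows, targets)]
        for j in range(len(weights)):
            # Partial derivative of the data-fit cost with respect to weight j ...
            grad = sum(e * row[j] for e, row in zip(errors, rows)) / n
            # ... plus the penalty term, which grows with the weight itself.
            grad += reg_strength * weights[j]
            weights[j] -= learning_rate * grad
    return weights

# Hypothetical data generated exactly by target = 2 * x1 + 1 * x2.
rows = [[1, 0], [0, 1], [1, 1], [2, 1]]
targets = [2, 1, 3, 5]

unregularized = fit_linear(rows, targets, reg_strength=0.0)
regularized = fit_linear(rows, targets, reg_strength=1.0)
print([round(w, 3) for w in unregularized])  # [2.0, 1.0]
print([round(w, 3) for w in regularized])    # [1.279, 0.738]
```

Raising `reg_strength` pulls both weights toward zero, and the larger weight gives up more in absolute terms, which is the penalty on large weights described above.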
A separate set of tools we deploy to lessen the threat of overfitting is “feature selection” and/or “feature extraction.” Here the general idea is quite simple. Choose or determine the features that are most promising and/or carry the most predictive bang for the buck. Fewer features or “dimensions” in the feature matrix (described in the basic concepts section) translate into a smaller chance of overfitting. These techniques are especially important for machine learning algorithms to which regularization cannot be applied (e.g., k-nearest neighbor and decision tree algorithms, which will be listed in a tabular “menu” of machine learning choices toward the end of this monograph). Feature selection may be based on expert insights or a variety of other more systematic techniques, whereas feature extraction (a subset of feature selection techniques) is typically used to refer specifically to certain mathematically rigorous ways of condensing the feature space (e.g., through principal component analysis or linear discriminant analysis). Some of these “dimensionality reduction” techniques are listed below (again, we have avoided any mathematical treatment of these topics, but it is worth knowing of their existence and general features):
1. Sequential Backward Selection (SBS). We can simply establish a criterion for a “permissible” erosion of classification performance for a model once a single feature is removed. A paltry 1% incremental gain in prediction accuracy bestowed by a feature in a training set is more likely to result in compromise of model generalizability (to independent data sets on which the model has not been trained) than a feature that confers a 20% increase in prediction accuracy in the training set. As such, it may be best to drop the “1%” feature.
2. Biological pathway analysis. Here we leverage feature variables from discrete pathways to maximize “orthogonal information.” That is to say, reliance on features that serve as independent predictors (e.g., representing pathophysiological perturbations of different pathways involved in the same disease process, for example) is often desirable.
3. Evaluation of feature importance to guide feature selection using:
4. Feature extraction (also known as data compression)
5. Mathematical techniques (e.g., dynamic Bayesian network inference) to identify features that appear to have a causal link to outcomes.
6. Time series analysis of biological analytes that may serve as feature candidates can be useful in winnowing the players in a feature matrix. Rapid response kinetics (e.g., post-exposure to a pathogen) with durable up- or down-regulation may be preferable for a clinical laboratory feature. This enables population of the model with variables that provide early predictive value, and that are harder to miss than ephemerally expressed transcripts, etc.
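Of the techniques listed above, sequential backward selection is the easiest to sketch in code. The version below follows the “permissible erosion” idea from item 1: it removes the least useful feature as long as the loss in score stays within a tolerance. The function name, the `max_drop` parameter, and the toy scoring function are hypothetical; a real scoring function would refit the model on each feature subset and measure performance on validation data.

```python
def sequential_backward_selection(features, score_fn, max_drop):
    """Drop features one at a time while the score loss stays within max_drop."""
    selected = list(features)
    while len(selected) > 1:
        current = score_fn(selected)
        # Score each candidate subset obtained by removing a single feature.
        candidates = [(score_fn([f for f in selected if f != removed]), removed)
                      for removed in selected]
        best_score, removed = max(candidates)
        if current - best_score > max_drop:
            break  # every possible removal costs too much accuracy; stop
        selected.remove(removed)
    return selected

# Hypothetical accuracy gain contributed by each feature.
gain = {"red": 0.20, "elongated": 0.15, "striped": 0.10, "bruised": 0.01}

def toy_score(subset):
    """Stand-in for 'refit the model and measure validation accuracy'."""
    return 0.5 + sum(gain[f] for f in subset)

kept = sequential_backward_selection(list(gain), toy_score, max_drop=0.02)
print(kept)  # ['red', 'elongated', 'striped']
```

The feature worth only one percentage point is discarded, while removing any of the remaining three would cost more than the tolerance allows.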
The last major means of avoiding the jeopardy of overfitting is the use of a specific technique to “match” model type complexity to the size of the data set. Recall, larger data sets can handle more complex models, should those be needed to optimize performance. Again, it should be borne in mind that the goal is to minimize prediction or classification error for independent data sets to which the model is “naïve” (i.e., on which it was not originally trained). If a training data set is smaller, it is prudent not to attempt to wring too much from it, because the risk of overfitting is substantial with a complex model (e.g., a neural net or polynomial classifier). In such cases, it is desirable to know how much complexity a training set of a given size is likely to accommodate. Technically, this impels us into the realm of machine learning theory but, at a basic level, it is worth knowing that the question is answerable by leveraging a model-type-dependent measure of complexity known as the Vapnik-Chervonenkis (VC) dimension. The VC dimension is higher, for example, for a (more complex) polynomial classifier compared with a linear classifier. The likely upper bound of the test error for an algorithm on an independent data set is determined by an equation that relies upon the VC dimension and the training sample size. We can calculate the VC dimension that minimizes this test error. With this in hand, we can obtain a prospective sense of what general types of predictive models we can bring to bear given the size of our training data set.
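One commonly quoted form of such a bound, due to Vapnik, is shown here as an illustration; which form the discussion above has in mind is not specified, and the symbols ($h$ for the VC dimension, $N$ for the training sample size, $\eta$ for the tolerated probability that the bound fails) are notation assumed here. With probability at least $1 - \eta$, and for $h < N$:

```latex
R_{\text{test}} \;\le\; R_{\text{train}} + \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\frac{\eta}{4}}{N}}
```

The square-root term shrinks as $N$ grows and expands as $h$ grows, so a more complex model family needs a larger training set before its training error becomes a trustworthy guide to its test error.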
A machine learning menu
Now that some of the basics are under our belt, we can look at a menu of specific machine learning techniques. Each would merit a standalone article in and of itself to convey the essentials, even if we were to eschew the math, but knowing the names of the major dishes might guide further reading for those who are interested in the ingredients. We concentrate here on supervised learning, in which an algorithm is trained on a data set for which a defined “gold standard” target vector exists (i.e., we know the classes to which examples belong in the training set). With the accretion of time and deeper insights into numerous –omic markers, unsupervised learning is likely to generate further insights into clinically relevant patient clusters that had heretofore been undefined. These novel clusters will splinter traditional diagnostic categories into progressively finer patient subpopulations, a compartmentalization that will drive precision medicine, but we will not cover unsupervised learning (e.g., cluster and topological data analysis) here.
A number of companies are exploring machine learning as predictive health analytic tools. For the most part, these algorithms have sought to predict a single clinical outcome (e.g., 30-day hospital readmission) or condition. Various machine learning methods that may be employed in health analytics are listed in the table below. In general, clinicians may feel most comfortable with more intuitive models, such as linear regression and decision trees. Deep neural nets, on the other hand, although capable of fitting data to an extraordinarily fine degree, amount to a mathematically well-defined but extremely complex and counter-intuitive “black box” between the input and the output.
Improved model performance through cross-validation and validation curve construction
Our final topic is that of model cross-validation, which involves partitioning a training data set into a (smaller) “training subset” of examples and a “validation data set,” often in multiple different ways. In doing so, we create one or more mock independent test data set(s). One can consider this as a trial run (or runs) to optimize a model before settling on a final version that we believe is best-equipped to make predictions in the wild. As we have stressed, the objective is to generate a model that is less sensitive to the characteristics of a specific training set as a means of improving model performance when applied to completely independent test data sets.
The simplest form of cross-validation splits a training set once into a single training subset and a validation subset. The extent to which the model “generalizes” to the validation set after it has been trained on the training subset provides a foreshadowing of how it might perform when confronted with a fresh independent data set. Think of the validation subset of the training set as a stand-in for a hypothetical novel data set on which the model would be unleashed. To gain an even finer sense of model generalizability, one can perform the partitioning of a training set into training and validation subsets iteratively. This is the basis for the commonly used k-fold cross-validation technique. In this case, the training set is randomly divided into an arbitrary number of “folds” of equal size, let’s say five, with 30 examples (e.g., patients) per fold:
One fold (designated in red font above) is reserved as a validation data set and the other four are combined and used as a training set. The model is trained using folds 2, 3, 4, and 5 (in aggregate) to produce “model Number 1.” In subsequent iterations, each fold, in turn, is given an opportunity to serve as a validation set while the other four are amalgamated into a training set. Ultimately, this results in 5 “versions” of the model: five variations on a theme (e.g., five linear models with slightly different variable coefficients by virtue of having been fitted with somewhat different training subsets). The model error rates (i.e., false positive + false negative classifications divided by total number of examples in the validation set) are computed for the five models and then averaged to yield an estimated error rate that is robust to the particular characteristics of a given data set. Compared with an error estimate based on a single division into a training set and a validation set, this average estimated error rate is likely to be closer to the error rate one would encounter in classifying examples in an independent test set. Typically, smaller fold sizes are used for smaller training sets. In the extreme, a fold size of n=1 examples per fold equates to what is referred to as “leave-one-out” cross-validation.
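The procedure can be sketched in plain Python. The classifier used here (a single threshold placed midway between the two class means) and the 150-patient data set are hypothetical stand-ins chosen to keep the example short; any model with a fit step and a predict step could be dropped into the same loop.

```python
import random

def k_fold_indices(n_examples, k, seed=0):
    """Randomly partition example indices into k folds of (near-)equal size."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_threshold(values, labels):
    """Toy one-feature classifier: threshold midway between the class means."""
    mean0 = sum(v for v, y in zip(values, labels) if y == 0) / labels.count(0)
    mean1 = sum(v for v, y in zip(values, labels) if y == 1) / labels.count(1)
    return (mean0 + mean1) / 2

def k_fold_error(values, labels, k):
    """Average validation error rate over k folds."""
    folds = k_fold_indices(len(values), k)
    errors = []
    for held_out in folds:
        # Train on every fold except the one held out for validation.
        train = [i for fold in folds if fold is not held_out for i in fold]
        threshold = fit_threshold([values[i] for i in train],
                                  [labels[i] for i in train])
        # An example is misclassified when the prediction disagrees with its label.
        wrong = sum(1 for i in held_out if (values[i] > threshold) != (labels[i] == 1))
        errors.append(wrong / len(held_out))
    return sum(errors) / k

# Hypothetical data: 150 patients, one numeric feature, overlapping classes.
values = [i % 10 for i in range(75)] + [6 + i % 10 for i in range(75)]
labels = [0] * 75 + [1] * 75
print(k_fold_error(values, labels, k=5))
```

With five folds of 30 patients each, the printed figure is the average of five validation error rates, each obtained from a model that never saw the patients it was scored on.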
The fineness of the partitioning in k-fold cross-validation (i.e., the proportion of the training set examples in a single fold) has implications for the accuracy of the error estimate using cross-validation compared with the “true” error rate in a completely new (i.e., independent test case) scenario. Simply stated, a larger fold number generally enables a more reliable estimate of the true error rate, but at the price of higher computational cost. Furthermore, the extreme of small fold sizes (leave-one-out cross validation) is not a silver bullet (the interested reader is referred to Shao J. Linear model selection by cross-validation. Journal of the American Statistical Association 1993: 88: 486-494). Stratified k-fold cross validation, which preserves class proportions in each training set (e.g., pneumonia [Y/N] to use our prior toy example) provides superior bias/variance characteristics when class labels are unequally distributed in a training set.
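A minimal sketch of stratification, assuming a hypothetical training set in which 20 of 100 patients have pneumonia: the indices of each class are shuffled separately and then dealt out across the folds, so every fold inherits the overall class mix.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign example indices to k folds so each fold mirrors the class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    rng = random.Random(seed)
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class out round-robin, like cards, across the folds.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds

# Hypothetical imbalanced training set: 20 pneumonia cases among 100 patients.
labels = ["Y"] * 20 + ["N"] * 80
for fold in stratified_folds(labels, k=5):
    print(Counter(labels[i] for i in fold))  # 4 "Y" and 16 "N" in every fold
```

Without stratification, a purely random split of so few positive cases could leave a fold with almost none of them, which would distort that fold's error estimate.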
Monte Carlo cross-validation is a different technique in which a training subset is randomly selected from the complete training set in a repeated fashion (with overlap among training sets permitted). A discussion of differential bias/variance compared with k-fold cross validation is beyond the scope of this overview.
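The splitting step of Monte Carlo cross-validation can be sketched as follows. The function name, the 80/20 split, and the number of repeats are hypothetical choices; the point is only that each repeat draws a fresh random split, so the same patient may appear in many training subsets.

```python
import random

def monte_carlo_splits(n_examples, train_fraction, repeats, seed=0):
    """Yield (train, validation) index lists from repeated random splits."""
    rng = random.Random(seed)
    n_train = round(n_examples * train_fraction)
    for _ in range(repeats):
        indices = list(range(n_examples))
        rng.shuffle(indices)
        # The first n_train shuffled indices train the model; the rest validate it.
        yield indices[:n_train], indices[n_train:]

splits = list(monte_carlo_splits(n_examples=150, train_fraction=0.8, repeats=10))
for train, validation in splits[:3]:
    print(len(train), len(validation))  # 120 30

# Unlike k-fold, training subsets from different repeats overlap.
shared = set(splits[0][0]) & set(splits[1][0])
print(len(shared))
```

Each split would be used exactly as a single fold is used in k-fold cross-validation: fit on the training indices, score on the validation indices, then average the scores over all repeats.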
One of the major benefits of cross-validation is the ability to estimate model performance when one varies what are referred to as “hyperparameters” in the model. These factor values, which are set by the modeler and are not influenced by the data or the fitting process, include regularization strength (see prior discussion) and, for radial basis function kernel SVM, the gamma parameter (which influences the decision boundary). One can use k-fold cross-validation and plot validation curves for varying hyperparameter values to identify those that optimize model performance. Multiple hyperparameters can be interrogated simultaneously using the grid search function from a tool such as scikit-learn.
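The idea behind a grid search can be sketched without any library: score every combination of hyperparameter values by cross-validation and keep the best. To stay short, this sketch swaps in a k-nearest-neighbor classifier with two hyperparameters (the number of neighbors `k` and the distance order `p`) instead of the regularization strength and gamma mentioned above, and the data are randomly generated. In scikit-learn, the `GridSearchCV` class in `sklearn.model_selection` performs this kind of search.

```python
import itertools
import random

def knn_predict(train_x, train_y, point, k, p):
    """Majority vote among the k nearest training points (Minkowski order p)."""
    def distance(a, b):
        # The p-th root is omitted because it does not change the ranking.
        return sum(abs(u - v) ** p for u, v in zip(a, b))
    nearest = sorted(range(len(train_x)),
                     key=lambda i: distance(train_x[i], point))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0

def cv_accuracy(xs, ys, k_folds, **params):
    """Mean validation accuracy over k_folds folds for one hyperparameter setting."""
    # Strided folds suffice here because the examples are already in random order.
    folds = [list(range(i, len(xs), k_folds)) for i in range(k_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        hits = sum(knn_predict(train_x, train_y, xs[i], **params) == ys[i]
                   for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k_folds

def grid_search(xs, ys, grid, k_folds=5):
    """Score every combination in the grid; return the best setting and all scores."""
    names = sorted(grid)
    results = {}
    for combo in itertools.product(*(grid[name] for name in names)):
        results[combo] = cv_accuracy(xs, ys, k_folds, **dict(zip(names, combo)))
    best = max(results, key=results.get)
    return dict(zip(names, best)), results

# Hypothetical data: two overlapping classes described by two numeric features.
rng = random.Random(0)
ys = [i % 2 for i in range(60)]
xs = [(rng.gauss(2 * y, 1.0), rng.gauss(2 * y, 1.0)) for y in ys]

best_params, results = grid_search(xs, ys, {"k": [1, 3, 5], "p": [1, 2]})
print(best_params)
for combo, accuracy in sorted(results.items()):
    print(combo, round(accuracy, 3))
```

Plotting the accuracy in `results` against one hyperparameter while holding the other fixed would give a validation curve of the kind described above.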
Choosing a specific machine learning tool and a given set of candidate predictive features is not tantamount to an arbitrary selection process based, for example, on the familiarity one might have with a specific tool, how cutting-edge, complex, and powerful that implement might be, or the availability of an enormous complement of data features one might have in the form of an array of genomic, transcriptomic, proteomic, and metabolomic databases. As a means of underscoring this point, it is worth noting that data from a multi-site study evaluating >30,000 human and preclinical models of transcriptomic data used by 36 separate teams revealed that choice of a particular machine learning algorithm is of less importance than other factors in determining success, such as team proficiency and the manner in which model algorithms are specifically implemented (MAQC Consortium. “The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models.” Nature Biotechnology, 2010, 28(8): 827-838).
Finally, some of the more significant take-home points from the current review are listed here:
David Sahner, MD, is Senior Clinical Director, Computational Science, SRI International; email: firstname.lastname@example.org