[email to Gregory Cooper sent July 26, 2014, cc’d to Jay Shendure, Martin Kircher, Brad Gulko, and Ilan Gronau]
I’ve spent some time thinking about your letter and have composed a response (below). There were quite a few criticisms and allegations packed into those few paragraphs, so I thought it was best to break out what I saw as the major points and address them one by one. Please let me know if any of these responses are unclear. I would like to make this exchange as constructive as possible for both groups.
I’ve copied Brad Gulko and Ilan Gronau, who have been the leads in my group on the fitCons project.
Claim: You are not really predicting fitness consequences of alleles, and the resolution of the fitCons scores along the genome is very coarse. Your title and the name “fitCons scores” are misleading.
Response: First, note that we have not claimed to predict fitness effects of mutations in terms of allele-specific selective coefficients or some similar measurement. We think we make it quite clear in the title and the manuscript that our predictions are of “probabilities that the nucleotide identity at individual genomic positions influences fitness”. I should add that by this we mean “influence fitness to a sufficient degree to substantially influence patterns of genetic variation within and between species”. We are not making predictions about nearly neutral mutations, with |2Ns| ~ 1, but mutations having |2Ns| of at least about 5-10. (Our papers on INSIGHT explore these issues in some detail.)
Second, you are correct that we have not made separate predictions for each alternative allele at each site, but we do not believe that this is an important difference between our methods and yours. If a user were to insist on allele-specific mutations, we can easily produce them — our model simply assumes that the probability of fitness consequences for each candidate point mutation are equal, so the score for each alternative allele would be the same.
On a philosophical level, we are fundamentally Bayesian in our approach to prediction tasks in computational genomics, and I think there is a clear Bayesian rationale for defining our scores as “probabilities of fitness consequences”. The question we wish to address is the following: given a collection of functional genomic covariates along the genome and a measure of natural selection based on patterns of genetic variation, what is the probability that each nucleotide position is under natural selection, hence that it will have fitness consequences if mutated? Our approach is to group sites into large, relatively coarse-grained categories, to pool data about genetic variation within these groups, then to treat an estimate of the fraction of sites under selection as a probability of fitness consequences per site. Under the assumption of exchangeability of sites within each class, this is a legitimate Bayesian estimator. Obviously, this strategy ignores considerable heterogeneity within each class and results in scores that are “blocky” along the genome. But there is nothing fundamentally unfair or misleading about treating these estimates as our “best guesses” for probabilities of fitness consequences given the data at hand. How good those “best guesses” are is, of course, another question (to be addressed below).
Claim: The assumptions underlying the fitCons scores are unrealistic. For example, your approach effectively assumes that all nucleotide positions in a transcription factor binding site experience the same degree of selective constraint, which is obviously wrong.
Response: Our modeling approach does indeed make some rather strong simplifying assumptions, but we are mindful of the fact that all models are approximate and any prediction method should be guided by what is realistically possible given the information in the data. I am sure we do not have to explain to you that prediction methods that make fewer simplifying assumptions are not necessarily better, because of factors such as overfitting, bias/variance tradeoffs, etc. Frankly, in our view, CADD is being at least as unrealistic as we are by imagining that it will be possible to make meaningful position- and allele-specific predictions of pathogenicity across unannotated noncoding regions of the genome, by simply throwing a large number of covariates into an SVM with a linear kernel. We just don’t believe that there’s enough information in the data to support these types of predictions, and even if there were, that your linear-kernel SVM framework would be adequate for exploiting it.
To illustrate the point, consider the slightly absurd analogy of an insurance company that chooses to predict not only whether a customer will have an automobile accident over the next year and approximately how much it will cost, but also on what day and in what town an accident will occur. Obviously, those predictions would be meaningless, and, worse, potentially misleading for the unsophisticated user. We believe that, at least in noncoding regions, it is more honest and realistic to work at the coarse-grained level than to provide highly specific predictions of the type CADD attempts to make.
In any case, there’s no point in having a philosophical discussion about the costs and benefits of alternative modeling approaches when they can be compared empirically. The proof is in the pudding. Which brings me to the next point…
Claim: Your validation experiments are unfair to CADD because they measure “bulk coverage of elements” rather than allele-specific pathogenicity.
Response: In answer to this claim, I want to make two arguments: first, that CADD and fitCons scores are more similar than they are different and therefore can reasonably be compared, and second, that our evaluation of candidate cis-regulatory elements is a reasonable and fair way to evaluate the performance of both methods in noncoding regions.
Despite differences in resolution and allele-specificity, as discussed above, CADD and fitCons scores both essentially address the problem of measuring the “deleteriousness” of mutations at individual sites along the genome, given a collection of covariates and a “readout” of deleteriousness based on patterns of genetic variation. To be sure, the two methods use rather different modeling and prediction strategies. In particular, instead of explicitly clustering genomic positions into distinct classes, CADD is implicitly grouping sites with similar patterns of covariates using an SVM regression strategy. And instead of a model-based inference strategy for its evolutionary readout, it is using a simulation-based measure of constraint. Nevertheless, we believe that these are two different approaches for addressing the same fundamental problem.
If it is true that two methods are addressing the same, or at least quite similar, problems, then what is an appropriate way to compare their performance? Let’s focus in particular on noncoding regions, which is our main interest (we are happy to concede that CADD will outperform fitCons scores in coding regions).
Your paper considered several validation methods, but, honestly, we found them rather unconvincing about noncoding performance. Several of these were basically anecdotal and did not involve comparisons with conservation-based methods, which would likely have performed quite well in these settings (e.g., correlations with derived allele frequencies, single-locus results for MLL2, HBB, and TP53, correlations with expression changes in saturation mutagenesis experiments, exome-based ASD and intellectual disability analysis). Another was more statistical but also included no comparison (analysis of GWAS hits). As far as we could tell, the only real statistical comparisons with other methods were all based on ClinVar (as shown in your Figures 3 and 4), which is heavily skewed toward coding regions (by our accounting, 95% of pathogenic ClinVar variants overlap CDSs). To me, it is not surprising at all that CADD would outperform standard conservation scores in coding regions. Indeed, it is perhaps somewhat surprising that it does so by so slim a margin! In any case, we could quibble about how informative or noninformative these validation experiments are, but I think you’ll agree that they do not provide a compelling demonstration that CADD outperforms other measures of “pathogenicity” (or selection) at transcription factor binding sites and other regulatory elements.
[NB: Greg pointed out in response to this letter that they did actually compare CADD with conservation-based methods on several of these benchmarks, and found an improvement; results are reported in supplementary tables]
So we need a more direct measure of performance in regulatory elements. We have the problem, of course, that there is quite limited data on the specific pathogenic effects of regulatory variation. Our proposal was to use what we see as the next best thing: reasonably high-quality predictions of cis-regulatory elements. Our assumption is that most mutations that fall in these elements will be disruptive and hence pathogenic. It is not a perfect assumption, but we would argue that this the best we can do at present, and it is certainly a more direct and general evaluation of noncoding pathogenicity than any considered in the CADD paper.
Now, what of your criticism that the resolution used in these tests favors fitCons over CADD? There is some merit to this argument if in fact CADD is effectively differentiating among, say, informative and degenerate positions in transcription factor binding sites (which, I want to reiterate, remains an unproven assumption). However, even if this is true, I do not believe the penalty to CADD will be as great as you imagine. First, I don’t think this argument is relevant for our eQTL tests, which apply to individual nucleotides. Not all of these eQTLs are indeed causal loci, of course, but many of them are, and false positive eQTLs shouldn’t alter the relative performance of the different scoring systems. Notice that the fitCons scores outperform CADD scores in eQTLs by nearly as high a margin as in the other elements. For the TFBSs, keep in mind that we are examining quite short binding sites (6-8 bp) under ChIP-seq peaks, after trimming off noninformative flanks of the motifs. It is true that these TFBSs may contain some degenerate positions, but I don’t think they could possibly account for more than about a third of sites. So in the best case, with perfect discrimination between informative and noninformative positions within binding sites, the sensitivity of CADD shown figures 5, S2, and S3 would rise by a factor of 3/2 if only informative sites were considered, which would not be sufficient for it to outperform fitCons. (We actually quite like the idea of performing a version of this experiment that considers only informative positions in motifs and plan to add this to the manuscript soon.) The Gerstein enhancers are admittedly a rather low-resolution set but keep in mind that a low density of functional sites hurts fitCons as well as helping it, because it means that the scores will tend to be lower and it will be harder to obtain good sensitivity without compromising specificity (that is the beauty of ROC plots). So even if it is not a great validation set, I don’t think that it is unfair to consider it.
[NB: We did in fact carry out follow-up experiments with eQTLs and high-information-content positions in TFBS and found very little change in our results, as reported in the published manuscript.]
Claim: CADD is not biased toward coding regions
Response: We can only speculate on this issue, of course, but I think there is a problem with the way you fit the CADD model to the data that is likely to make it perform considerably better in coding regions than in noncoding regions. As I understand it, CADD is trained by a global strategy, in the sense that a single set of parameter values are selected (for a given choice of training set and generalization parameter) to obtain an optimal fit, on average, across all examples in the training set. Thus, if it is true that different covariates are relevant in coding and noncoding regions, as we expect, then the method will have to make tradeoffs between these types of sites. If the “signal” for training (i.e., the contribution to the SVM’s objective function) is stronger in one type of region than another, it is likely that the tradeoff will favor that type of region. Because constraint will be strongest, on average, in coding regions, leading to higher rates of difference between simulated and observed variants in these regions, it seems likely that these regions will indeed dominate in the training procedure, which will tend to make CADD sacrifice performance in noncoding regions in order to improve performance in coding regions.