Our paper on our new computational method for computing probabilities of fitness consequences for point mutations in the human genome, originally released as a preprint on the bioRxiv, appeared online today in Nature Genetics. As I described in an earlier blog post, this paper is the fruit of more than three years of labor led by Brad Gulko and Ilan Gronau in my research group. It is satisfying to finally see it in print not only because this has been a long, arduous project with a number of false starts, but also because this work neatly weaves together several threads in the research my group has pursued over the last decade, including the use of evolutionary conservation to characterize genomic function, the detection of natural selection based on joint patterns of polymorphism and divergence, estimation of the share of the human genome that is under selection, and characterization of the impact of evolutionary turnover on patterns of genetic variation.
In addition, this paper provides an excellent example of a publication model about which I am increasingly enthusiastic: rapid publication of a preprint in a fully open fashion, improvement of the manuscript based on community feedback, rigorous peer review, further revision, and publication in a high-quality journal. Whatever opinions one holds about the Nature Publishing Group—and they are undoubtedly controversial in the circles I move in—one must acknowledge that they deserve some credit for their willingness to consider and publish papers, like ours, that have been fully “in the open” since before submission.
I want to turn now to the less pleasant issue of a critique of our article that was posted on Haldane’s Sieve by Greg Cooper and colleagues a couple of months ago. Let me begin with my perspective on the background for this critique.
While preparing our manuscript, we came across a paper by Cooper, Jay Shendure, Martin Kircher and colleagues on a method called CADD (an acronym for the rather cryptic name “Combined Annotation-Dependent Depletion”) that purports to solve a similar problem to the one addressed by our fitCons method. In the months prior to publication, CADD was heavily promoted in meetings and conferences and at the time of publication it was beginning to show signs of serious “market penetration” in the human genetics community. We spent some time closely evaluating the CADD paper and, in all honesty, found the approach more than a little naive from both a machine learning and a molecular evolution standpoint. In addition, while CADD evidently does perform quite well in coding regions, we found the claims of its effectiveness in predicting “pathogenicity” in noncoding regions to be highly unconvincing. As we studied the paper more closely and carried out some of our own analyses of the CADD scores, we also identified some troubling features of the reported validation experiments. In our hands, the CADD scores did not seem to perform nearly as well outside of coding regions as suggested in the paper.
We decided to approach this issue head-on by including CADD in our validation tests and comparing it directly with fitCons (along with several other methods), focusing on prediction performance at putative cis regulatory elements. We also included a paragraph on CADD in our discussion speculating on the reasons for its much poorer performance on our tests than on those reported in the CADD paper. Among other things, this paragraph called out the CADD paper for some features of their validation tests that we found misleading.
Greg Cooper was unhappy with our treatment of CADD, and within a few days responded to our preprint on the bioRxiv with a long, highly critical email. Among other things, Greg felt that we had inaccurately described what our fitCons scores actually represent, unfairly compared fitCons with CADD, and used validation experiments that were strongly biased in favor of fitCons.
I took some time to mull over these criticisms during a trip to Oxford, and then responded about two weeks later with a rather carefully written letter that spelled out the principles behind our approach, why we think the two methods are fundamentally very similar and deserve to be compared, and the rationale behind our validation experiments. I have posted the full text of this letter below, because I think it still does a reasonably good job of capturing my perspective on these issues.
There followed a series of emails that were at times cordial and constructive, but never resulted in resolution of our differing points of view. During this exchange, my group performed two sets of substantial additional validation experiments to address concerns that Greg had raised: an analysis of high information content positions within TFBSs and a reanalysis of eQTLs that corrected for certain ascertainment biases. We felt that these analyses provided a convincing refutation of Greg’s most serious criticisms, and we incorporated them into a revised preprint and, eventually, the published manuscript. Nevertheless, Greg remained unconvinced by our arguments, and he evidently decided to go on the offensive and publish his critique of our paper as an extended blog post.
At this point, it hardly needs to be stated that we disagree with almost everything in Greg’s critique. His blog post mischaracterizes our goals, our claims, our approach, and our results. And it fails to recognize several highly relevant points we had communicated clearly to him by email, for example, concerning the rationale behind our approach and our validation experiments, and the reasons why fitCons performs poorly at the three loci that were experimentally analyzed by the Shendure group.
I recognize that few readers will have patience for this level of detail, but, in the interest of setting the record straight, I provide a point-by-point rebuttal of Greg’s critique in the post that follows. I want it to be clear that we stand by our paper, all of our results, and our comparative evaluation of fitCons and CADD. And I fully expect that independent evaluations of performance of these methods in characterizing fitness consequences of mutations in regulatory regions will support our findings.
Thoughts on: Probabilities of Fitness Consequences for Point Mutations Across the Human Genome
Posted on October 23, 2014 by Joe Pickrell
This guest post is by Greg Cooper, Martin Kircher, Daniela Witten, and Jay Shendure, and is a comment on Probabilities of Fitness Consequences for Point Mutations Across the Human Genome by Gulko et al.
Recently, Gulko et al. (2014) described an approach, FitCons, to estimate fitness consequences for point mutations using a combination of functional genomic annotations and inferences of selection based on human variant frequency spectra.
This is not quite an adequate summary of our approach. We consider patterns of divergence between primates as well as allele frequencies in human populations. Our inferences of natural selection are based on patterns of divergence and polymorphism at the sites of interest in contrast to those at nearby neutral sites. We interpret estimates of fractions of sites under selection within a class of genomic positions as probabilities that point mutations within that class will have fitness consequences.
On the basis of comparisons with several maps of regulatory element features, they concluded that FitCons is substantially better at inferring deleterious regulatory effects of variants than other metrics, including an annotation we developed named Combined Annotation Dependent Depletion (CADD, Kircher et al. 2014). However, we believe that the comparisons of FitCons and CADD for detecting deleterious regulatory variation are misleading, and that methods to predict fitness effects of point mutations should evaluate variants with demonstrable effects rather than variants assumed to have an effect by virtue of being within a functional element.
This is a distorted characterization of the difference between our validation approaches. Cooper et al. have not evaluated variants having any more “demonstrable effects” than the ones we have used, particularly vis-a-vis cis-regulatory elements. To the contrary, we believe that, on the whole, our validation experiments are more direct and more comprehensive than those used in the CADD paper (see arguments below).
We find that FitCons is substantially less effective than CADD at separating variants, both coding and regulatory, with functional and/or phenotypic effects from functionally inert and/or organismally benign variants. For example, CADD is much more effective at enriching for mutations in two enhancers and one promoter that have been experimentally shown to have large transcriptional effects.
As Cooper and colleagues know from our correspondence with them, the results of their saturation mutagenesis analysis at their three hand-picked loci are largely a consequence of the gene expression patterns at these loci. Our approach is designed to make use of cell-type-specific data. Consequently, it has little predictive power at loci not active in the analyzed cell types, and that happens to be the case in our cell types for two of these three loci (see details below). Furthermore, the correlation coefficients cited by Cooper and colleagues suggest that all of the methods tested are performing very poorly by their benchmarks. It is hard to see how the authors can justify claiming a great victory when their method appears to explain less than 10% of the variance in their own benchmarking statistic!
Further, in contrast with CADD, FitCons does not separate highly deleterious variants that cause Mendelian disease from high-frequency benign variants, nor does it separate complex-trait associated variants, which are enriched for deleterious regulatory effects, from matched control variants. We believe that it would be more appropriate to characterize FitCons as a predictor of cell-type specific regulatory elements, and to compare it to other tools directed specifically at this task, rather than variant fitness consequences.
Notice the leap of faith above to prediction of “highly deleterious variants that cause Mendelian disease” and “complex-trait associated variants”. These are distant goals well beyond what we have claimed to address in our paper, and Cooper et al.’s “demonstration” that CADD is achieving them requires a good deal of creative license (as we discuss more specifically below). In fact, both CADD and fitCons are really just predicting the local action of purifying selection over relatively long evolutionary time periods, based on various types of covariates along the genome. There are differences in the particular covariates considered, the prediction methods used, and the evolutionary time periods represented, but the set-up for both methods is fundamentally the same. Thus, in both cases, any predictive power for Mendelian or complex traits depends on correlations between these traits and long-term purifying selection on individual nucleotides. We expect that purifying selection will be at least somewhat useful as a proxy for these phenotypes, but many phenotype-associated variants will not show signs of selection, and many sites under selection will not have direct phenotypic links. For this reason, we think it is more meaningful to speak of “probabilities of fitness consequences” than “pathogenicity”. In addition, we think that it is more useful and informative to evaluate the performance of these predictors on molecular phenotypes in particular cell types, such as ChIP-seq-based transcription factor binding sites, rather than on statistical associations with complex phenotypes through unknown mechanisms.
FitCons, recently developed by Gulko et al (2014), is a method to estimate fitness effects of point mutations at both coding and non-coding positions in human genomes. FitCons works by first defining regional boundaries on the basis of clusters of functional genomic signals (“fingerprints”) and then estimating selective effects, inferred from allele frequency distributions within human populations, for variants with the same fingerprint. On the basis of comparisons with enhancers, transcription factor binding sites (TFBSs), and expression quantitative trait loci (eQTLs), Gulko et al. concluded that FitCons is substantially better at inferring variant regulatory effects than other metrics, including an annotation we developed named Combined Annotation Dependent Depletion (CADD, Kircher et al. 2014).
We never describe our method as “estimating fitness effects of point mutations” but instead say that it estimates the “probabilities that point mutations will have fitness effects” across groups of sites. This apparently subtle difference is important because the first phrase suggests prediction of specific effects of specific mutations, whereas we claim only to produce a useful average measure of the probability that individual mutations will have fitness effects across coarse-grained classes of sites. Clearly, the more specific type of prediction would be more valuable than the more general one, were it possible to do accurately, but we believe even these coarser predictions of average effects represent an important step forward in noncoding regions of the genome. As we discuss later in this response, we also think that the evidence is very weak that CADD’s highly specific predictions carry much useful information about the actual impact of regulatory mutations.
While FitCons is an interesting approach with potentially useful attributes, we believe that the comparisons of FitCons and CADD for detecting deleterious regulatory variation are misleading. Clarification is needed as to the purposes and performances of these metrics. Below, we first describe what we believe to be important general distinctions between CADD and FitCons and then detail their relative effectiveness at differentiating several sets of functional and/or pathogenic variants from inert and/or benign variants. Finally, we consider how correlations between the bin definitions and validation datasets used in Gulko et al., rather than fitness per se, may underlie the performance of FitCons for cell-type specific regulatory element prediction.
CADD is an allele-specific measure of variant deleteriousness that accounts for a wide variety of discrete and quantitative information at the allelic, site, and regional levels (Kircher et al. 2014). CADD scores can vary among the possible alleles at a given site, across sites within a given region or functional element, and between and across variants within differing classes of functional elements.
It is true that CADD considers many sources of information as input and that it produces different scores for different alleles. However, it is not at all clear that it is really able to make good use of all of that information or that those allele-specific differences in score are meaningful, particularly in noncoding regions. In other words, while CADD has access to this information, whether or not it truly “accounts” for it is an open question. Our evaluations suggest, to the contrary, that CADD’s scores have quite limited predictive power for fitness consequences at cis-regulatory elements.
FitCons, on the other hand, is driven by a small number of cell-type-specific, regional features with reduced or absent variation within regions. As a result, FitCons is in practice a segmentation method: the median length of uniformly scored segments is 72 bases, the average segment length is 196 bases, and 50% of all scored bases in hg19 lie within a segment over 950 bases long. Furthermore, 30% of all bases in hg19 are assigned to the mode value (0.062), 60% are assigned one of two FitCons values, and over 80% are assigned one of 10 possible values. Thus, FitCons is in practice a regional annotation of cell-type-specific molecular activity, not a site- or allele-specific metric of variant deleteriousness.
FitCons first clusters sites according to several functional genomic covariates, then predicts average probabilities of mutational fitness consequences within each cluster. As we have stated clearly in the paper, these clusters are fairly coarse-grained. This feature of the approach is by design. In our view, the coarse-grained nature of the clusters is well matched to the information in the data. There is simply very little information about the precise fitness consequences of particular, position-specific mutations across most of the noncoding genome. We opt to produce a meaningful low-resolution score instead of a higher resolution score that produces a misleading impression of precision.
That a large fraction of sites are assigned a uniform low value is also part of the design. These are the sites in our “null” class, with no signal from DNAse-seq, RNA-seq, or histone modifications, and they are assigned a “background” value representing our estimate of average probabilities of fitness consequences in the absence of informative functional genomic data.
These observations should not be surprising to anyone who has taken the trouble to read our paper. The blockiness of the scores and the fact that relatively few distinct scores dominate the genome-wide distribution are natural consequences of the functional genomic covariates considered and the approach for estimating class-wise probabilities of fitness consequences. Our genome-wide tracks and ROC plots demonstrate that this simple design leads to remarkably informative scores.
The basic structures of FitCons and CADD are crucial to interpreting the data presented by Gulko et al. In particular, they measure utility by assessing coverage of bases within functional elements, namely TFBSs and enhancers, relative to genomic background. While such an approach is reasonable to evaluate a method to annotate functional elements, it is not informative for a method to estimate organismal deleteriousness since many mutations within functional elements are evolutionarily neutral, including many that lack even a molecular effect. To wit, by FitCons/INSIGHT estimates, most sites within the enhancers and TFBSs evaluated have fitness effect probabilities below 0.2 (Gulko et al. 2014). While likely somewhat higher among high-information TF-binding motif positions and lower among the enhancers used (mean size of 888 bp), a decisive majority of positions in these nucleotide groups are mutable without consequence. Performance evaluations that reward uniformly high coverage of bases in these regions, rather than the particular subset of variants therein that actually have deleterious molecular effects, are therefore not meaningful for estimates of point mutation fitness consequences.
More of the same. Cooper and colleagues have an admirable ideal in mind—a computational method that can precisely predict the fitness consequences of any possible mutation at any position across the genome, distinguishing mutations that alter TF binding affinity or miRNA binding preferences or ncRNA structure from those that do not. The problem is that their dream so far has no basis in reality. The evidence that CADD, in particular, is making fine distinctions of the kind they describe across the noncoding genome is exceedingly weak. FitCons is at least producing coarse-grained predictions of average effects that are useful in practice for distinguishing signal from noise across the genome.
We firmly believe that methods to predict functional or fitness effects of mutations should be evaluated on mutations for which we have data relevant to function and fitness, not large aggregates of genomic regions or bases within which mutations are simply assumed to be phenotypically relevant. When tested on such mutation sets, we find that FitCons fails to capture a considerable amount of site- and allele-specific information that is captured by CADD (and between-species conservation metrics to a lesser extent). This loss of information, in turn, has profound effects on FitCons’ ability to identify variants with functional, pathogenic, or deleterious effects, including for regulatory changes.
We acknowledge that our benchmarks rely on indirect measures of “function and fitness effects of mutations”, but the measures of performance Cooper and colleagues have actually examined—correlations of scores with reporter-gene-based saturation mutagenesis experiments at three loci, ROC plots for very small numbers of ClinVar variants, and enrichments for GWAS hits—are no more direct than our evaluations of transcription factor binding sites, eQTLs, and enhancers. Moreover, Cooper et al. fail to mention that, in response to their privately relayed criticisms of our validation experiments, we performed a follow-up analysis that considered only the high-information-content positions within our ChIP-seq-supported transcription factor binding sites, which should be strongly enriched for the particular positions at which fitness consequences are most severe. As shown in the latest version of our paper on the bioRxiv (and the version published in Nature Genetics), we observed no substantial difference in our ROC plots when focusing on this set, indicating that the excellent performance of fitCons in those experiments is not an artifact of our focus on whole regulatory elements (rather than individual positions) in our tests.
First, FitCons has no predictive power for separating pathogenic variants in ClinVar (Landrum et al. 2014) from benign, high-frequency polymorphisms matched for genic effect category (e.g., missense, nonsense, etc): the distributions of FitCons scores for pathogenic and benign variants are nearly identical (Figure 1). While most of these variants are protein-altering, this same pattern holds for the subset of pathogenic/benign variants that do not directly alter proteins (Figure 1, right). In contrast, CADD and conservation measures like GERP (Cooper et al. 2005) strongly differentiate pathogenic from high-frequency variants, and, although more weakly, also differentiate non-protein-altering pathogenic from benign variants (for further details, see Kircher et al. 2014). The inability of FitCons to distinguish these highly pathogenic/deleterious variants from clearly benign variants runs counter to the general narrative in Gulko et al. in which FitCons scores are claimed to correlate with mutational fitness effect probabilities.
Here Cooper et al. double down on ClinVar, which, as we have pointed out, is highly problematic in an evaluation of genome-wide performance because it is dominated by protein-coding variants. In the analysis above, Cooper and colleagues try to address this criticism by looking at sites in ClinVar outside of coding regions, but what they don’t mention is just how tiny this set is.
When we last reviewed this database we found that, of roughly 10,000 autosomal pathogenic positions in ClinVar, only about 3.4% fall outside of coding sequences (CDSs) and fewer than 1% are more than 100bp away from a CDS. As far as we could tell, ClinVar also gives no indication of whether these are true (transcription-associated) cis-regulatory variants, as opposed to, say, variants that disrupt splicing or variants that fall in CDSs in alternative isoforms of a transcript. So the analysis of noncoding variants above reflects at most a few hundred variants, with some unknown mixture of variants in cis-regulatory elements and variants more directly associated with protein-coding function, and, presumably, highly variable activity across cell types. In short, this database is totally inadequate for the kind of validation we are interested in, and it is not surprising that fitCons would perform poorly on it. We expect that the elevation in CADD scores at these variants is largely driven by conservation, proximity to exon boundaries, mutations in known splice sites, and other gene-annotation-based features, rather than by an ability to accurately estimate fitness effects associated with cis-regulatory function.
Second, as Gulko et al. emphasize the detection of regulatory variation (title of the manuscript notwithstanding), we performed a detailed examination of three regulatory elements for which saturation mutagenesis data exist (Patwardhan et al. 2009; 2012). While not global, these data comprise directly measured, not assumed, regulatory effects.
These saturation mutagenesis data are undoubtedly informative, but they are hardly definitive as functional characterizations of regulatory elements. They are based on in vitro reporter gene experiments that may or may not capture in vivo regulatory patterns in particular cellular contexts. The contribution of experimental noise in these measurements also appears to be relatively large. Overall, it is not clear to us that these data are all that much more direct in measuring “pathogenicity” than are binding patterns of transcription factors as assessed by ChIP-seq and informatics analysis (the basis of our first set of validation experiments). And when one considers the much larger size of our ChIP-seq data set (55,844 TFBSs compared to three elements), we think it is clear that our benchmarks are much more likely to be representative of the genome than are the ones reported here.
Nevertheless, why does fitCons perform so poorly at the three loci examined by Cooper et al.? As we have explained to the authors, what fitCons is doing here is no mystery: the fitCons scores at these loci are a direct consequence of the expression patterns and functional genomic marks at these loci in the three cell types we have examined. It just so happens that two of the three loci (the HBB promoter and ALDOB enhancer) have almost no functional genomic signal in our data—they mostly fall in our “0” classes for RNA-seq and DNase-seq and are annotated as “quiescent” by ChromHMM in all three cell types (with the exception of a weak RNA-seq signal in H1-hESC over part of the ALDOB enhancer). This absence of signal can be clearly seen in the fitCons tracks on our browser mirror (http://genome-mirror.cshl.edu/) at approximately chr11:5,248,210-5,248,401 (HBB) and chr9:104,195,570-104,195,828 (ALDOB) in the hg19 assembly. As a result, fitCons scores these regions at or slightly above the genomic background across these elements, exactly as designed. We have made the decision to produce cell-type-specific scores based on cell-type-specific functional genomic information. As we show in our paper, this design leads to excellent performance in many cases, but it does have the necessary and inevitable consequence of eliminating any prediction power at genomic loci that are not active in any of the cell types we have examined.
In the 70-bp promoter region of HBB (Patwardhan et al. 2009), FitCons assigns all bases to the genome-wide mode (0.062). However, mutations in this region exhibit substantial variation in both transcriptional and disease consequences. Mutational effects on in vitro promoter activity range from no effect to a >2-fold change in transcription, and some of the strong in vitro effect mutations cause beta-thalassemia by disrupting normal transcript regulation. CADD and GERP correlate significantly with the regulatory (CADD Spearman’s rho=0.23, GERP rho=0.11) and disease consequences of these mutations (details in Kircher et al. 2014).
Incidentally, CADD’s Spearman’s rho of 0.23 in this region seems to support our point that it is at best weakly informative about regulatory function. While this value is significantly different from zero, it represents a very poor correlation in absolute terms. Spearman’s rho is simply Pearson’s r for ranks. Therefore, it is fair to say that CADD has an r^2 of 0.23^2 = 0.053 for rank order in transcriptional output as assessed by saturation mutagenesis. In other words, it is able to explain only about 5% of the total variance in ranks.
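For readers who want to check the arithmetic, the relationship between Spearman’s rho and “variance in ranks explained” can be illustrated with a short, self-contained sketch (the data here are synthetic; only the rho values 0.23 and 0.33 come from the correlations quoted above):

```python
# Spearman's rho is simply Pearson's r computed on ranks, so rho^2 is the
# fraction of variance in *ranks* explained. Toy implementation, no libraries.

def ranks(xs):
    # 1-based ranks, averaging ranks over ties
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# rho of 0.23 or 0.33 corresponds to only ~5-11% of rank variance explained:
for rho in (0.23, 0.33):
    print(f"rho={rho}: rank variance explained = {rho**2:.3f}")
```

Run on the HBB or ECR11 mutagenesis measurements, `spearman` would reproduce the quoted correlations; the point is simply that squaring them yields the small fractions of explained variance discussed above.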
Within each of two enhancers tested by saturation mutagenesis (Patwardhan et al. 2012), FitCons scores are correlated with mutational effect (ECR11 rho=0.32, ALDOB rho=0.26) at magnitudes similar to CADD (ECR11 rho=0.25, ALDOB rho=0.36). However, in both elements, the FitCons correlation is due to a higher score segment overlapping a more transcriptionally active region (Figure 2); no predictive power within the active regions exists. For example, most of the mutations with regulatory effects in the ECR11 enhancer reside in the last ~100 bases, which in turn reside within a single 168-bp FitCons segment. Within this segment, considerable mutational effect variation exists: 209 of 504 possible SNVs, distributed across 110 of the 168 sites, have no discernible effect on transcription (p >= 0.1). Concordantly, these inert mutations have significantly lower CADD scores (Wilcoxon test p=5.9 x 10^-25) than their interdigitated SNV neighbors with at least some evidence for functional effect. Furthermore, within the set of mutations that have at least some evidence for effect (p<0.1; other arbitrary thresholds yield similar results), transcriptional effect sizes vary considerably and correlate with CADD (rho=0.33).
The ECR11 locus is the only one of the three at which a functional genomic signal is present in our data—namely, we have promoter and weak-enhancer ChromHMM states and a moderate DNase-seq signal (broad peaks) here in the GM12878 cell type (see chr2:169,939,082-169,939,701 in the browser), leading to three blocks of modestly elevated scores (as shown in Cooper’s Figure 2). We would argue that the fitCons scores are doing exactly what they are designed to do at this locus, by distinguishing regions with stronger average functional effects from those with weaker effects, without providing misleadingly specific predictions of fitness effects. CADD, by contrast, makes very precise predictions about the functional consequences of each possible mutation at each position, but there is little support for most of these predictions. Indeed, in this case, CADD explains only 6%-13% of the variance in ranks in transcriptional output. In addition, it is very difficult to tell what is driving these predictions—CADD is completely a “black box” for the user.
Next, as suggested by Gulko et al., we quantified coverage of discretely thresholded regulatory variants to evaluate the extent to which FitCons and CADD could enrich for “large-effect” regulatory mutations. Specifically, there are 108 mutations that alter transcriptional activity by at least two-fold within the three elements tested (29 mutations across 19 bases in ECR11, 76 mutations across 41 bases in ALDOB, and 3 mutations across 3 sites in HBB). We compared coverage of these 108 mutations at various thresholds relative to coverage of hg19, and find that CADD is much more effective at enriching for them than is FitCons (Figure 3). For example, 95 (88%) of the large-effect regulatory variants have a scaled CADD score above 10, a threshold that includes 10% of all possible hg19 SNVs (~9-fold enrichment above genomic background). Enrichment peaks at a CADD score of 15, a threshold that includes 53.7% of large-effect regulatory variants but only 3.2% of hg19 SNVs (~17-fold enrichment). In contrast, FitCons enrichment peaks at a threshold of 0.08, wherein only ~27% of all large-effect mutations are covered (~4-fold enrichment above background).
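The fold-enrichment figures above follow from a simple ratio: the fraction of large-effect variants exceeding a score threshold, divided by the fraction of all possible genomic SNVs exceeding the same threshold. A minimal sketch, using only the percentages quoted in the paragraph above (not recomputed data):

```python
# Fold enrichment above genomic background at a given score threshold:
# (fraction of large-effect variants covered) / (fraction of hg19 SNVs covered)

def fold_enrichment(frac_positives_covered, frac_background_covered):
    return frac_positives_covered / frac_background_covered

# CADD >= 10: 88% of large-effect variants vs 10% of hg19 SNVs
print(fold_enrichment(0.88, 0.10))    # -> 8.8, i.e. ~9-fold
# CADD >= 15: 53.7% vs 3.2%
print(fold_enrichment(0.537, 0.032))  # -> ~16.8, i.e. ~17-fold
# fitCons >= 0.08: ~27% of large-effect variants at ~4-fold enrichment
# implies a background coverage of roughly 0.27 / 4, i.e. ~7% of hg19 SNVs
```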
We next evaluated the ability of FitCons to distinguish trait-associated SNPs identified in genome-wide association studies (GWAS). Such SNPs are as a group enriched for regulatory variants with pathogenic, likely deleterious effects, a hypothesis supported by numerous anecdotal and systematic studies (e.g., Hindorff et al. 2009; Musunuru et al. 2010; Nicolae et al. 2010; ENCODE Project Consortium et al. 2012). These variants are overwhelmingly non-protein-altering (98%), with ~83% being intronic or intergenic SNPs not near an exon. We previously showed that CADD scores significantly separate lead GWAS SNPs from matched neighboring control SNPs (Wilcoxon p=5.3 x 10^-12). This separation remains highly significant for the 83% that are intronic or intergenic (p=1.26 x 10^-9), indicating it is not driven solely by coding or near-coding variation. In contrast, FitCons scores do not separate lead GWAS SNPs from controls, either considering all variants (p=0.32) or intronic/intergenic only (p=0.57).
Cooper et al. are evidently much more impressed with GWAS enrichments than we are. The problem with analyses like this one, which show that a group of sites statistically enriched for one property is also statistically enriched for another, is that they are notoriously hard to interpret. Typically the fold enrichments are modest (it is telling that they were not reported above), and very often the statistical differences can be explained by relatively uninteresting covariates (GC content, overall conservation, proximity to exons, etc.). The genome is large, and statistically significant differences between collections of sites are easy to find. The real question, in our view, is whether one has any predictive power to identify individual variants of interest. For this reason, we have focused on ROC and ROC-like plots in our validation experiments. In our hands, CADD has very little predictive power for noncoding sites of known functional significance, and what power it does have often appears to be largely a consequence of local evolutionary conservation or proximity to genes.
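To make concrete what we mean by predictive power at the level of individual variants: the quantity underlying an ROC curve can be summarized by the rank-based AUC, the probability that a randomly chosen functional site outscores a randomly chosen control. A toy sketch (hypothetical scores, not actual CADD or fitCons output):

```python
# Rank-based AUC: probability that a randomly chosen functional site
# scores higher than a randomly chosen control site (ties count half).
def auc(functional, controls):
    wins = sum((f > c) + 0.5 * (f == c) for f in functional for c in controls)
    return wins / (len(functional) * len(controls))

# Hypothetical scores: group means can differ while per-site ranking
# power (AUC near 0.5) remains weak, which is the distinction at issue.
print(auc([0.9, 0.8, 0.7], [0.6, 0.5, 0.4]))      # perfect ranking -> 1.0
print(auc([0.6, 0.5, 0.4], [0.55, 0.5, 0.45]))    # no ranking power -> 0.5
```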
With respect to separation of eQTL SNPs from controls, Gulko et al. used all common variants that were part of the study as a background/control set. We believe the results from such a test are difficult to interpret. They do not control for effects of minor allele frequency, for example, a key property that correlates with both technical (e.g., eQTL discovery power) and biological effects (e.g., eQTL deleteriousness). Additionally, they do not control for genomic location. By virtue of the annotations it uses, FitCons scores will tend to be higher near transcribed genes in the cell-type of choice, which are in turn the only genes for which eQTLs can be identified. Therefore, this analysis confounds the information content resulting from a focus on cis, rather than trans, eQTLs, with that intrinsic to the scores themselves. While this may be an advantage, relative to a more general predictor like CADD, for predicting cell-type-specific function, it likely comes at a cost of reduced accuracy in terms of predicting deleteriousness per se (see below). Furthermore, in practical terms it is not likely to be useful given that cis-effects are the first and major focus of most eQTL discovery efforts, and it is unclear that FitCons would outperform other cell-type specific regulatory element annotations, such as those from integrative predictions of enhancers and promoters (Ernst and Kellis 2012; Hoffman et al. 2012). In any case, an analysis in which eQTL SNPs are matched for MAF, genomic location, distance to TSS and other confounders would provide a more meaningful evaluation of the utility of FitCons in a realistic eQTL discovery/fine-mapping analysis.
The conjecture that fitCons only shows good performance at eQTLs because we have not controlled for minor allele frequency or genomic location is wrong. We have repeated these experiments after stratifying the data by proximity to TSS and MAF and see very little difference in the ROC plots.
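The stratified analysis we describe amounts to binning eQTL and control SNPs by the suspected confounder (MAF, distance to TSS) and then checking that the score still separates the two groups within each bin. A minimal sketch with made-up data (not our actual pipeline or scores):

```python
from collections import defaultdict

# Stratify sites by a confounder bin (e.g., minor allele frequency),
# then compute rank-based separation of eQTLs vs. controls within each bin.
def stratified_auc(sites):
    """sites: list of (confounder_bin, score, is_eqtl) tuples."""
    strata = defaultdict(lambda: ([], []))
    for bin_label, score, is_eqtl in sites:
        strata[bin_label][0 if is_eqtl else 1].append(score)
    result = {}
    for bin_label, (eqtl, ctrl) in strata.items():
        wins = sum((e > c) + 0.5 * (e == c) for e in eqtl for c in ctrl)
        result[bin_label] = wins / (len(eqtl) * len(ctrl))
    return result

# Made-up data: if separation survives within every bin, the
# confounder cannot be what drives the signal.
sites = [("low", 0.08, True), ("low", 0.06, False),
         ("high", 0.09, True), ("high", 0.06, False)]
print(stratified_auc(sites))  # {'low': 1.0, 'high': 1.0}
```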
Finally, we believe that the discrepancies in performance metrics defined here vs. within Gulko et al. are also influenced by the potential circularity of cell-type-specific information within both the model definition and validation. Indeed, while INSIGHT adds value by scoring the 624 potential feature-defined bins, the correlations between the bin definitions (expression, DNase, and ChromHMM states) and validation data (TFBSs, enhancers, eQTL candidates) are quite likely the primary drivers of performance of FitCons as measured in Gulko et al.
But a correlation between the definitions of the clusters and the validation data is only a problem if it is not driven by the underlying biology. In a sense, defining the clusters so that they are correlated with true cell-type-specific functional roles is the whole point of the approach! And we have taken care to use measures of cis-regulatory function (such as eQTL association and GRO-seq-based enhancer identification) that are as different as possible from the covariates used to define the scores.
In fact, the strong correlation between INSIGHT and PhyloP as a bin scoring method suggests that metrics of evolutionary constraint could replace INSIGHT in the FitCons framework.
As discussed in our comparison of the fitCons scores with their divergence-based counterpart (called fitConsD), much of the information in fitCons does come from divergence but there is also an important contribution from polymorphism. The difference between the fitCons and fitConsD scores is the basis of our analysis of evolutionary turnover.
More generally, the use of cell-type specific features in both the metric definition and validation obscures a crucial trade-off in analyses of this sort. Evolutionary constraint-driven metrics (including CADD, which is strongly influenced by evolutionary constraint) emphasize variant effects on organismal fitness, which often have nothing to do with molecular function in any given cell-type; this means that constraint-driven metrics may be sub-optimal when trying to predict molecular function within said cell-type. However, the converse is also true, in that efforts to predict deleteriousness that emphasize cell-type specific molecular functionality will often be misleading when a variant has no molecular effect in that cell-type but strong fitness consequences due to functional effects elsewhere and/or has a molecular effect with no fitness consequence. Obviously, optimal choices in this trade-off depend greatly on analytical context and goals, but in our opinion the goal of predicting “fitness consequences for point mutations” dictates that performance metrics focused on organismal deleteriousness are more appropriate.
We don’t disagree that there are tradeoffs concerning the use of cell-type-specific data, and, indeed, were quite frank in our paper that improved methods for integrating across cell types are an important area for improvement of our methods (see last paragraph of Discussion). CADD also attempts to use cell-type-specific data for prediction—it just does it in a rather crude way, by simply including summaries across cell types as covariates and hoping that its SVM can sort them out. We also agree that the use of cell-type-specific data for validation provides an imperfect measure of organismal fitness. But, as discussed above, we believe that the measures considered in the CADD paper are at least as imperfect.
As a final illustrative example, Weedon et al. (2014) identified a set of noncoding SNVs that disrupt the function of an enhancer of PTF1A and cause pancreatic agenesis. CADD scores these variants between 23.2 and 24.5, higher than 99.5% of all possible hg19 SNVs and higher than 56% of pathogenic SNVs in ClinVar (most of which are protein-altering); much of the CADD signal for these variants results from measures of mammalian constraint (not shown). FitCons, on the other hand, places these variants in a 5-kb block of sites all scored at the genome-wide mode (0.062). This is in part a result of not having functional genomic data from cells in which the enhancer is active; however, the absence of such data in disease studies is common given that the relevant cell-types are frequently unknown or inaccessible. Further, even if DNase, RNA, and ChromHMM data were all generated for this cell type, given the general distributions of FitCons scores within regulatory elements observed in other cell types and lack of inter-species conservation information, it is unlikely that FitCons would have ranked these variants within the top 0.5% of all possible SNVs.
We are not trying to duplicate the functionality of tools such as phastCons, phyloP, or GERP (which presumably work fairly well at this locus) but to provide something complementary, based on cell-type-specific functional genomic data. At loci unexpressed or inactive in the cell types we have analyzed, one is clearly better off using evolutionary constraint than fitCons. It is unclear, from our experiments, how much value CADD adds at such loci.
In any case, Gulko et al. demonstrate that FitCons is reasonably effective, and more so than CADD, at predicting the approximate boundaries of regulatory elements in cell types on which it is trained. However, claims that it better predicts functional or fitness effects of variants in either coding or non-coding regions are unsupported. Indeed, when challenged to separate point mutations with demonstrable effects from appropriate sets of control SNVs, CADD and other metrics that include evolutionary constraint information are substantially better as predictors of both coding and non-coding variant impact.
It is simply wrong to dismiss our benchmarking analyses by saying that they merely demonstrate that fitCons can find “approximate boundaries of regulatory elements in cell types on which it is trained.” We have shown that fitCons outperforms CADD at high-information-content positions of TFBSs and at eQTLs, both of which should be highly enriched for positions at which point mutations have fitness effects. We have also shown that our methods generalize reasonably well to cell types other than the ones we have trained on (Supplementary Figure 10).
We suggest that it would be more appropriate to characterize FitCons as a predictor of cell-type specific regulatory elements rather than variant fitness consequences, and to compare it to other tools directed at this task, such as ChromHMM (Ernst and Kellis 2012) or Segway (Hoffman et al. 2012).
ChromHMM and Segway are essentially clustering methods, with no use of evolutionary information. They produce multivariate outputs (assignments of genomic positions to multiple classes) rather than a univariate score meant to serve as a ranking of genomic positions for follow-up study, as CADD and fitCons do. We maintain that the comparison with CADD is appropriate.