Phylogenetic correlations have limited effect on coevolution-based contact prediction in proteins

Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop two strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. An analysis of these data shows that the strongest coevolutionary couplings, i.e. those used by Direct Coupling Analysis to predict contacts, are only weakly influenced by phylogeny. However, phylogeny-induced spurious couplings are of similar size to the bulk of coevolutionary couplings, and dissecting functional from phylogeny-induced couplings might lead to more accurate contact predictions in the range of intermediate-size couplings. The code is available at https://github.com/ed-rodh/Null_models_I_and_II. Author summary Many homologous protein families contain thousands of highly diverged amino-acid sequences, which fold in close-to-identical three-dimensional structures and fulfill almost identical biological tasks. Global coevolutionary models, like those inferred by the Direct Coupling Analysis (DCA), assume that families can be considered as samples of some unknown statistical model, and that the parameters of these models represent evolutionary constraints acting on protein sequences. To learn these models from data, DCA and related approaches have to also assume that the distinct sequences in a protein family are close to independent, while in reality they are characterized by involved hierarchical phylogenetic relationships. Here we propose Null models for sequence alignments, which maintain patterns of amino-acid conservation and phylogeny contained in the data, but destroy any coevolutionary couplings, frequently used in protein structure prediction. We find that phylogeny actually induces spurious non-zero couplings. These are, however, significantly smaller that the largest couplings derived from natural sequences, and therefore have only little influence on the first predicted contacts. However, in the range of intermediate couplings, they may lead to statistically significant effects. Dissecting phylogenetic from functional couplings might therefore extend the range of accurately predicted structural contacts down to smaller coupling strengths than those currently used.


Introduction
Global coevolutionary modeling approaches have recently seen a lot of interest [1,2], 20 either directly for predicting residue-residue contacts from sequence ensembles 21 corresponding to homologous protein families [3][4][5], in predicting mutational 22 effects [6][7][8], or even in designing artificial but functional protein sequences [9][10][11], or as 23 an input to deep-learning based protein structure prediction. The latter approach has 24 recently lead to a breakthrough in prediction protein structure from sequence [12][13][14][15][16]. 25 The basic idea of coevolutionary models, like the Direct-Coupling Analysis (DCA) [6], 26 is that the amino-acid sequences, typically given in the form of a multiple-sequence 27 alignment (MSA) of width (or aligned sequence length) L, can be considered as a 28 sample drawn from some unknown probability distribution P (a 1 , ..., a L ), with 29 (a 1 , ..., a L ) being an aligned amino-acid sequence. This probabilistic model is typically 30 parameterized as P (a 1 , ..., a L ) ∝ exp i<j J ij (a i , a j ) + h i (a i ) via biases (or fields) 31 h i (a i ) representing site-specificities in amino-acid usage (i.e. patterns of amino-acid 32 conservation), and via statistical couplings J ij (a i , a j ), which represent coevolutionary 33 constraints and cause correlated amino-acid usage in positions i and j [17]. 34 In most cases, the parameters of these models are inferred by (approximate) 35 maximum-likelihood, under the assumption that the sequences in the MSA are (almost) 36 independently and identically distributed according to P (a 1 , ..., a L ). On one hand, this 37 assumption is needed to make model inference from MSA technically feasible. On the 38 other hand, it is in obvious contradiction to the fact that sequences in homologous 39 protein families share common ancestry in evolution, and therefore typically show 40 considerable phylogenetic correlations, which can be used to infer this unknown ancestry 41 from data [18]. Phylogeny induces highly non-trivial correlations between MSA 42 columns [19], which however do not represent any functional relationship. 43 Disentangling correlations induced by functional or structural couplings from 44 phylogeny-caused correlations turns out to be a highly non-trivial task [19][20][21]. Simple 45 statistical corrections have been proposed, like down-weighting similar sequences in 46 determining statistical correlations [3], or the average-product correction applied to the 47 final coevolutionary coupling scores [22]. While sequence weighting has initially 48 reported to significantly improve contact prediction, recent works show little effect [23], 49 probably due to the fact that, e.g., Pfam [24] is now based on reference proteomes and 50 therefore less redundant than databases used to be about a decade ago.

51
Average-product correction was shown to be more of a correction of biases related to 52 amino-acid conservation than to phylogeny [25].

53
To make progress, we suggest a complementary approach. Instead of removing 54 phylogenetic correlations from DCA-type analyses, we suggest null models having the 55 same conservation and phylogenetic patterns of the original MSA of the protein family 56 under consideration, but strictly lack any functional or structural couplings.
The spurious couplings induced by phylogeny are, however, non-zero, and may limit the 62 accuracy of contact prediction when going beyond the first few strongest couplings.

63
The paper is organized as follows. After this introduction, we provide the Materials 64 and Methods, with a short review of DCA, but most importantly with the presentation 65 of three null models. The Results section compares the spectral properties of the 66 residue-residue covariance matrix of the real data with those of MSA generated by the 67 null models, followed by an assessment of the couplings inferred by DCA and their alignments [24], and are therefore typical removed from the MSA before statistical  parameterized via pairwise coevolutionary residue-residue couplings J ij (a i , a j ) and 93 single-residue biases (or fields) h i (a i ), while Z is a normalization factor also known as 94 partition function. In the simplest setting, these parameters are inferred from the data 95 via maximum likelihood, i.e.
This maximization leads directly to the fact, that the model P reproduces the empirical 97 statistics of single MSA columns and of column pairs, where f i (a) represents the fraction of amino acids a in column i, i.e. the hierarchical phylogenetic relations. Phylogeny by itself leads to a non-trivial correlation 112 structure between different residue positions with a power-law spectrum [19], and this 113 July 15, 2020 4/19 leads to non-zero but also non-functional residue-residue couplings when using DCA.

114
These may interfere with the functional couplings, which are e.g. used for residue-residue 115 contact prediction from MSA data, and thereby negatively impact prediction accuracy. 116 It is notoriously hard to disentangle the two, cf. [20,21]. The problem is that 117 evolution is a non-equilibrium stochastic process, whose dynamics in principle depends 118 on the evolutionary constraints represented, e.g., by the couplings and fields in the DCA 119 model. Global model inference from phylogenetically correlated data remains an open 120 questions, since current approximation schemes have only very limited and 121 non-systematic influence on the accuracy of residue contact prediction.

122
Here we follow a different route. We define different null models, which take residue 123 conservation and in part also phylogeny into account, but do not show any intrinsic 124 amino-acid covariation. The null models allow us to create large numbers of suitably 125 randomized sequence ensembles, on which standard DCA can be run. The couplings 126 resulting from randomized data can be used to assess the statistical significance of the 127 couplings resulting from the real MSA D, and therefore to discard purely 128 phylogeny-caused couplings. While, to the best of our knowledge, this has never been 129 done in the context of protein families and DCA, somehow similar techniques have been 130 proposed in the context of phylogenetic profiling [28], but applied to correlations rather 131 than couplings.
So in principle there are no couplings between different residues at all. However, when 141 running DCA on this sample, inferred couplings will be non-zero due to finite sample 142 size. They will take distinct values from one randomization to the next, but there may 143 be systematic biases due to the distinct conservation levels, which are maintained as sequences, as is done in distance-based phylogeny reconstruction [18,29].

152
The aim of the second null model is to construct a randomized MSA which preserves 153 both the sequence profile given by the position-specific frequencies f i (a), and the matrix 154 {D H mn } of pairwise Hamming distances between sequences. This can be achieved by the 155 following Markov chain Monte Carlo (MCMC) procedure acting on the entire alignment. 156 Our method is initialized using a sample of null model I, i.e. all coevolutionary and preserved. The resulting randomized MSA after t MCMC steps is called In step t → t + 1, one column i ∈ {1, ..., L} is selected randomly, as well as two rows 161 m, n ∈ {1, ..., M }. An exchange of the entriesã m i andã n i is attempted, to obtain a new 162 matrixD t+1 . This matrix is accepted with a Metropolis-Hastings acceptance 163 probability p acc of non-trivial properties in agreement with [19], and DCA is expected to be able to 182 reproduce this correlation structure via couplings J ij (a, b). Repeating the 183 randomization many times, we can assess the statistics of phylogeny-generated 184 couplings, and thereby the significance of the couplings found by running DCA on the 185 original protein sequences collected in D.

186
Note also that, in the limit where the formal inverse temperature in Eq. (5) is set to 187 β = 0, i.e. in the case of infinite formal temperature T = β −1 , we recover Null model I. 188 One could use β therefore as an interpolating parameter between these two Null models. 189 with µ being the mutation rate and δ a,b the Kronecker symbol, which equals one if and 203 only if the two arguments are equal, and zero else. In this model, there is no mutation 204 with probability e −µt , and the amino acid in position i does not change, or at least one 205 mutation with probability 1 − e −µt . In the latter case, the new amino acid b is emitted 206 with its equilibrium probability ω i (b). While being simple, the Felsenstein model of 207 evolution is frequently used in phylogenetic inference.

208
The algorithm proceeds in the following way, using the implementation of [21]:

209
• A phylogenetic tree T is inferred from the MSA D using FastTree [31]. Instead of 210 using inter-sequence Hamming distances (like in Null model II), FastTree is using 211 a maximum-likelihood approach.

212
• The mutation rate µ and all site-specific frequencies {ω i (a)} are inferred using 213 maximum likelihood.

214
• To resample the MSA according to this model, the root sequence is drawn 215 randomly from P ω , and stochastically evolved on the branches of T using the 216 transition probability Eq. (7).

217
• The resampled MSA is composed by the sequences resulting in the leaves of T .

218
This procedure allows thus to emit many artificial MSA being evolved on the same 219 phylogeny and the same stationary sequence distribution as the one inferred from the 220 natural sequences given in D. Note that these MSA are expected to be more noisy than 221 the ones of Null model II. In particular the column statistics will differ from f i (a), and 222 also the inter-protein Hamming distances D H will differ more from the ones in the  Following the mathematical derivations published in [19], we would expect that the is strongly impacted by phylogenetic correlations in the 240 data. More precisely, while totally random data would lead to the Marchenkov-Pastur 241 distribution for the eigenvalue spectrum of C, the hierarchical structure of data on the 242 leaves of a phylogenetic tree leads to a power-law tail of large eigenvalues.

243
It is thus not very astonishing, that both Null models II and III show fat tails in the 244 spectrum of their data covariance matrices C (even if Null model II does not fulfill the 245 mathematical conditions of the derivation in [19] because not generated according to a 246 hierarchical process), while the spectrum of Null model I has a substantially more information, e.g. used for PCA or for the identifcation of protein sectors [32,33], defined 256 as multi-residue groups of coherent evolution.

257
Phylogenetic effects induce couplings in DCA, but these are 258 smaller than couplings found in natural sequences 259 However, the couplings derived by DCA are not directly related to the largest 260 eigenvalues of the residue-residue covariance matrix. Actually, the computationally most 261 efficient DCA approximations based on mean-field [3] or Gaussian [34] approximations, 262 relate the couplings J to the negative of the inverse of C. The DCA couplings are 263 therefore dominated by the smallest eigenvalues of C, cf. also [35].

264
Here we use plmDCA, the resulting couplings therefore lack any simple relation to This observation becomes even more interesting, when we compare the couplings of 280 residue-residue contacts and non-contacts. In Fig. 4, where this is shown for the DCA 281 couplings derived from the natural MSA, the histogram of contacts is much broader.

282
Large couplings above a DCA score of about 0.2-0.3 are mostly contacts, in agreement 283 with earlier findings using plmDCA [36]and the closely related GREMLIN [37]. Some 284 proteins have many residue-pairs above this threshold and thus an accurate contact 285 prediction, other have only few pairs above this threshold, predicted contacts are very 286 sparse, and contact-map prediction will require more sophisticated supervised contact 287 predictors, like e.g. the recent deep-learning based methods [16,38].

288
When looking to Null model II, cf. high-confidence contact prediction. This idea is also corroborated by the quantitative 300 assessment of the statistical significance in the couplings derived from natural sequences, 301 as compared to the ones generated by the Null models. To this aim, we assign a z-score 302 to each residue pair (i, j): Using 50 samples of Null model II, we determine the mean 303 and standard deviation of couplings derived from Null model II, individually for each 304 pair (i, j). We use these values to determine the z-score, i.e. the number of standard 305 deviations, the actual couplings (from natural MSA) is away from the means for Null 306 model II. In Fig. 6, we observe, that this z-score is highly correlated with the plmDCA 307 score derived from natural MSA, across all families. Almost all DCA scores above 0.2 308 have highly significant z-scores above 3 or even more. Even larger correlations between 309 DCA and z-scores are observed in Null model III, cf. Fig. S8. As a consequence, we 310 conclude that the influence of phylogeny on these couplings remains rather limited. Only pairs with linear separation |i − j| > 4 along the chain are taken into account. It becomes evident that any signal related to contacts is totally destroyed by the randomization procedure in Null model II. Interestingly, the null model generates almost no couplings with scores above 0.2-0.3, which was seen in Fig. 4 as a cutoff for high-accuracy contact prediction.  coevolutionary covariation between sites and phylogenetic correlation between sequences. 320 Statistical corrections may improve the situation slightly, but they are to simple to take 321 the hierarchical correlation structure into account, which is generated by the 322 evolutionary dynamics on a phylogenetic tree.

323
Here we have proposed to approach this problem in a complementary way, by 324 introducing null models -i.e. randomized or re-emitted multiple-sequence alignments -325 which reproduce conservation and phylogeny, but do not contain any real coevolutionary 326 signal. When applying Direct Coupling Analysis as a prototypical global coevolutionary 327 modeling approach, we observe that phylogenetic correlations between sequences lead to 328 a changed residue-residue correlation structure, represented by a fat tail in the 329 eigenvalue spectrum of the data covariance matrix. It leads also to distributed 330 couplings, which, however, are smaller than the largest couplings found when applying 331 DCA to natural sequence data, i.e. smaller than the couplings used for residue-residue 332 contact prediction. The latter are significantly larger than couplings resulting from 333 phylogeny, i.e. we can conclude that contact prediction is influenced only to a very 334 limited degree by phylogenetic couplings.

335
However, it is also striking that, across several protein families, the phylogeny-caused 336 couplings in Null models II and III almost reach the DCA-score threshold found before 337 for accurate contact prediction. This suggests that the suppression of phylogenetic 338 biases in the data (or their better consideration in model inference), may shift this 339 threshold down and therefore allow for predicting much more contacts. Only pairs with linear separation |i − j| > 4 along the chain are taken into account. It becomes evident that any signal related to contacts is totally destroyed by the randomization procedure in Null model III. Interestingly, the null model generates almost no couplings with scores above 0.2-0.3, which was seen in Fig. 4 as a cutoff for high-accuracy contact prediction.