Integrated Network Approach to Protein Function Prediction

One of the main problems in functional genomics is the prediction of unknown gene and protein functions. With the rapid growth of high-throughput technologies, a vast amount of biological data describing different aspects of cellular functioning has become available, which can serve as additional information sources for function prediction and improve its accuracy. In this paper, we describe an approach to protein function prediction based on the integration of several biological datasets. Initially, each dataset is represented as a graph (or network), where the nodes represent genes or their products and the edges represent physical, functional or chemical relationships between nodes. The integration process estimates the importance of each network for the prediction of a particular function, taking into account the imbalance between functional annotations, notably the disproportion between positively and negatively annotated proteins. Protein function prediction then consists in applying the label propagation algorithm to the integrated biological network in order to annotate unknown proteins or to assign new functions to already known proteins. A comparative analysis of prediction efficiency with several integration schemes shows a positive effect in terms of several performance measures.


I. INTRODUCTION
In the past decade, due to biotechnological advances, various types of molecular data have been produced at a genome-wide scale, but despite the large number of fully sequenced genomes, the function of many proteins still remains unknown. The classical way is to find homologies between a protein and other proteins in protein databases using programs such as BLAST and PSI-BLAST [1] and then to predict functions based on sequence homology. However, roughly 20 % to 40 % of proteins in newly sequenced genomes do not have statistically significant sequence similarity to functionally annotated proteins. In addition, sequence similarity does not necessarily imply functional equivalence, and thus BLAST-based annotations can be erroneous [2]. Therefore, additional sources of information from a variety of high-throughput experiments have been extensively used for the study of protein functions. Among them are gene expression patterns, phylogenetic profiles, protein fusions and protein-protein interactions (PPI) [3]. Clustering analysis of gene expression data has been used to predict functions of unannotated proteins based on the idea that co-expressed genes are more likely to have similar functions [4]. The great popularity of PPI data can be explained by the information they provide on the biological context of protein functions. As a rule, proteins do not operate in isolation but interact with one another in order to provide cell functioning, taking part in some metabolic pathway or biological process. Therefore, it is possible to deduce the functions of a protein from the functions of its interaction partners [3]. PPI functional linkage networks (graphs), where nodes represent proteins and edges represent the detected interactions, are extensively used for deriving the functions of unannotated proteins using different probabilistic and graph algorithms. Probabilistic analysis of graph neighbourhoods in a protein-protein interaction network is described in [2]. In [5], [6], the network propagation algorithm for protein function prediction is proposed. The algorithm allows obtaining functional evidence from non-neighbouring nodes in functional-linkage graphs.
However, each information source can contain inherent noise. For example, protein interaction databases such as MIPS [7], BioGRID [8] and STRING [9], which have assembled a large collection of putative functional links between proteins by including information provided by diverse computational and experimental screens, can produce large numbers of both false positive and false negative interactions. Additionally, each type of data describes only one part of cellular activity; therefore, it was proposed to combine heterogeneous data sources in order to increase the coverage and the accuracy of protein function prediction [4], [10]. The need to integrate several sources of information has increased the complexity of the task, and more computationally efficient approaches must be developed. The existing methods can be roughly divided into two groups: kernel methods and functional linkage network methods [10]. In the first group, the similarities between proteins are determined for each data source using a kernel similarity matrix, and different kernel integration methods are applied in order to combine heterogeneous data sources; e.g., in [11] a support vector machine (SVM) was used to predict protein functions on the basis of the integrated similarity kernel. In the second group, each data source is represented as a functional linkage graph and a network integration algorithm is applied. After that, probabilistic graphical models or network-based classification algorithms are used to infer the annotations for unknown proteins [3], [4], [12], [13]. There are also approaches where individual classifiers are trained on each network and an ensemble learning technique is then used to combine the classifiers [10], [14].
As there is no ready solution for integrating data sources in an optimal way that both increases the prediction accuracy and deals with the unbalanced labels in GO functional categories, we have described and analysed the performance capabilities of a two-step approach to protein function prediction, which is promising in accounting for the label imbalance and is computationally efficient because it relies on the sparseness of functional association networks. It has an advantage over simple averaging of individual networks and over the correlation-based network weighting method described below. The first step consists in constructing the integrated functional network from heterogeneous data sources. It is based on the integration of single functional association networks using a form of kernel-target alignment [15] between the composite network and a "target" network constructed from the function label vector. The alignment task is formulated as a linear regression task with constraints, which allows determining the weights for each data source and simultaneously excluding non-informative ones [16]. The second step consists in assigning functions to unannotated proteins from a single composite network using the label propagation algorithm [16], [17].

Information Technology and Management Science, 2018/21

II. METHODS
In our study, we consider the task of protein function prediction as a binary classification task, where the labels correspond to the presence or absence of a specific function. Each data source is initially represented as a functional association network, which encodes information on shared protein functions from high-throughput proteomic (or genomic) data sources (e.g., protein-protein interactions (PPI), protein domains). In this representation, a node in the network corresponds to a protein, and the weight of an edge between connected nodes corresponds to their similarity computed by a specific similarity metric for a given data source. These individual networks are then combined, through a weighted sum, into a composite network, where the weights are optimised using labels, each label corresponding to a distinct protein function. The weight of a network reflects its usefulness in predicting a given function of interest. Next, a network-based classification algorithm is applied to the composite network in order to compute the association score of a specific function label for the unannotated proteins. We have applied the label propagation algorithm [17] in order to derive the protein functions on the basis of the integrated functional association network.

A. Data Sources and Pre-processing
We have conducted several experiments on the MouseFunc I benchmark data [18] in order to evaluate the two-step approach to protein function prediction. Ten association networks were constructed from ten data sources, including gene expression, protein annotations from Pfam [19] and InterPro [20], protein-protein interactions, phylogenetic profiles, and disease profiles from OMIM [21]. The data sources cover 21603 mouse genes (Table I).
We have constructed networks from each profile-based high-throughput data source using the Pearson correlation coefficient (PCC). For network-based data (e.g., protein interactions), we have used both the binary matrix of protein interactions and the matrix of distances between proteins. In order to sparsify the resulting networks and increase the computational efficiency without degrading the accuracy, we have limited the number of links for each gene to K = 50. Each functional network $W_i$ has been normalised using the expression $\tilde{W}_i = D_i^{-1/2} W_i D_i^{-1/2}$, where $D_i$ is the diagonal row sum matrix of $W_i$. The same normalisation has been applied to the integrated functional network.
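The construction described above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function name `build_network`, the random toy profiles, the clipping of negative correlations to zero, the symmetrisation of the top-K selection and the symmetric form of the normalisation are all assumptions introduced here.

```python
import numpy as np

def build_network(profiles, k=50):
    """Sparse, normalised association network from a gene-by-feature matrix."""
    # Pearson correlation between every pair of genes (rows of `profiles`).
    w = np.corrcoef(profiles)
    np.fill_diagonal(w, 0.0)      # no self-links
    w = np.clip(w, 0.0, None)     # assumption: keep positive correlations only

    # Keep the k strongest links of each gene; a link survives if it is
    # among the top k of either endpoint (assumed symmetrisation rule).
    n = w.shape[0]
    k = min(k, n - 1)
    top = np.argsort(-w, axis=1)[:, :k]
    keep = np.zeros((n, n), dtype=bool)
    keep[np.arange(n)[:, None], top] = True
    w = np.where(keep | keep.T, w, 0.0)

    # Assumed symmetric normalisation D^{-1/2} W D^{-1/2},
    # where D is the diagonal row-sum matrix of W.
    d = w.sum(axis=1)
    inv_sqrt = np.zeros_like(d)
    inv_sqrt[d > 0] = 1.0 / np.sqrt(d[d > 0])
    return w * inv_sqrt[:, None] * inv_sqrt[None, :]

# Toy usage: 6 hypothetical genes measured under 20 conditions.
rng = np.random.default_rng(0)
profiles = rng.normal(size=(6, 20))
W = build_network(profiles, k=2)
```

For data of the size used here (tens of thousands of genes), a dense correlation matrix would be impractical, and a sparse matrix representation would be the natural choice; the dense version is kept only for readability.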

B. Integration of Data Sources
In order to integrate the individual functional networks, we have applied kernel-target alignment, which can be formulated as a linear regression task [23]:

$$\alpha^{*} = \arg\min_{\alpha} \operatorname{tr}\left( (K - W)^{T} (K - W) \right), \quad \alpha_m \geq 0, \qquad (1)$$

where $W = \sum_{m=1}^{M} \alpha_m W_m$ is the composite network and $K$ is the target network of the functional label, computed from the numbers of positives and negatives, $n^{+}$ and $n^{-}$, in the label vector $y$, where positive and negative genes are labelled as +1 and −1, respectively. Pairs of negatively labelled genes have no influence in determining the weights. In order to exclude the negative-negative pairs of proteins from consideration, the entries in $K$ and in each network $W_m$ that correspond to negative pairs of genes are removed.
By minimising (1), larger weights are assigned to networks that connect highly similar proteins sharing the function of interest, and smaller weights to networks that connect highly similar proteins not sharing the function. As a result, networks that are coherent with the functional labels get higher weights. By using the equality $\operatorname{tr}(K^{T}K) = \operatorname{vec}(K)^{T}\operatorname{vec}(K)$, where $\operatorname{vec}(K)$ is the vectorization operator that stacks the columns of $K$ on top of each other, we can write (1) as a nonnegative unregularized linear regression problem:

$$\alpha^{*} = \arg\min_{\alpha} \left( \operatorname{vec}(K) - \Omega\alpha \right)^{T} \left( \operatorname{vec}(K) - \Omega\alpha \right), \quad \alpha_m \geq 0, \qquad (3)$$

where $\Omega = [\operatorname{vec}(W_1), \ldots, \operatorname{vec}(W_M)]$.

We have also used constraints in the linear regression in order to increase the robustness to the inclusion of irrelevant and redundant networks. This helps to deal with the different levels of importance of the individual data sources for the prediction of a particular functional class. In this case, to obtain the weight vector $\alpha$, we solve the following ridge regression problem:

$$\alpha^{*} = \arg\min_{\alpha} \left( \operatorname{vec}(K) - \Omega\alpha \right)^{T} \left( \operatorname{vec}(K) - \Omega\alpha \right) + J(\alpha), \quad \alpha_m \geq 0, \qquad (4)$$

where $J \geq 0$ is the regularization function. For ridge regression with the prior, the regularization function is as follows:

$$J(\alpha) = \sum_{m=1}^{M} s_m (\alpha_m - \nu_m)^{2}, \qquad (5)$$

where $\nu$ is the prior weight vector and $s_m$ is the strength of the regularization on $\alpha_m$. For ridge regression with a uniform prior, we set $\nu_m = 1$. When all the $s_m$, $1 \leq m \leq M$, are set to zero, the cost function (4) is unregularized and solving for $\alpha$ becomes equivalent to unregularized linear regression. Solving (3) and (4) requires at most $M$ iterations, where each iteration involves solving a system of linear equations with $M$ variables.
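To make the weighting step concrete, the following Python sketch estimates nonnegative network weights on a toy example. Several parts are assumptions made for illustration only: the target network is built as the outer product of centred labels ($n^{-}/n$ for positives, $-n^{+}/n$ for negatives, with $n = n^{+} + n^{-}$), which is one common kernel-target alignment construction; the nonnegative ridge problem is solved by plain coordinate descent rather than by the iterative scheme mentioned above; and the names `target_network` and `network_weights` are hypothetical.

```python
import numpy as np

def target_network(y):
    """Target network from labels in {+1, -1} (assumed centred-label form)."""
    y = np.asarray(y)
    n_pos, n_neg = np.sum(y > 0), np.sum(y < 0)
    n = n_pos + n_neg
    centred = np.where(y > 0, n_neg / n, -n_pos / n)
    return np.outer(centred, centred)

def network_weights(networks, y, strength=0.0, prior=1.0, n_sweeps=200):
    """Nonnegative (ridge) regression of the target network on the networks."""
    y = np.asarray(y)
    K = target_network(y)
    # Drop negative-negative pairs, as described in the text.
    neg = y < 0
    mask = ~(neg[:, None] & neg[None, :])
    t = K[mask]
    omega = np.column_stack([W[mask] for W in networks])
    M = omega.shape[1]
    s = np.full(M, strength, dtype=float)   # regularization strengths s_m
    nu = np.full(M, prior, dtype=float)     # prior weights nu_m
    alpha = np.zeros(M)
    col_sq = np.sum(omega ** 2, axis=0)
    # Coordinate descent: minimise over one weight at a time, clipping at 0.
    for _ in range(n_sweeps):
        for m in range(M):
            denom = col_sq[m] + s[m]
            if denom == 0.0:
                continue
            residual = t - omega @ alpha + omega[:, m] * alpha[m]
            alpha[m] = max(0.0, (omega[:, m] @ residual + s[m] * nu[m]) / denom)
    return alpha

# Toy usage: one network linking positives, one linking positives to negatives.
y = np.array([1, 1, 1, -1, -1, -1])
pos = y > 0
informative = (pos[:, None] & pos[None, :]).astype(float)
np.fill_diagonal(informative, 0.0)
misleading = ((pos[:, None] & ~pos[None, :]) | (~pos[:, None] & pos[None, :])).astype(float)

alpha_unreg = network_weights([informative, misleading], y)
alpha_ridge = network_weights([informative, misleading], y, strength=1.0, prior=1.0)
```

In this toy case the network that links positives to negatives receives zero weight under both schemes, while the ridge prior pulls the weight of the informative network towards the prior value.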

C. Network-Based Prediction Algorithm
The first approach to predicting protein function from a functional association network, based on the "guilt by association" principle, annotates the unknown protein with the functions of its neighbours, which can lead to errors. In our experiment, we have used the network propagation algorithm, which uses the global topology of the entire interaction network instead of the local neighbourhood and thus increases the reliability of prediction. The initial association value $k$ of unknown genes is set to

$$k = \frac{n^{+} - n^{-}}{n}, \qquad (6)$$

where $n^{+}$, $n^{-}$ are the numbers of positive and negative labels. The initial association values $k$ for unknown genes in (6) help account for label imbalance, where as a rule only a small number of genes is annotated with the gene function of interest.
The label propagation algorithm is applied to the composite network $W$ to predict the functions of the unknown proteins. Using the algorithm, the scores $f$ (discriminant values) for each node in the network are computed using the following optimization function:

$$f^{*} = \arg\min_{f} \sum_{i} (f_i - y_i)^{2} + \sum_{i}\sum_{j} W_{ij} (f_i - f_j)^{2}, \qquad (7)$$

which consists of two terms, where the first term penalizes the differences between the discriminant values $f_i$ of nodes and their initial labels $y_i$, and the second term penalizes the differences between the discriminant values of neighbouring nodes in the network. In this way, the labelling information propagates through the network, allowing one to determine the labels of unknown proteins not directly connected to the positive nodes. The conjugate gradient (CG) method is used to solve the system of linear equations that gives the solution to the optimization task in (7). Due to the sparseness of the composite network $W$, the conjugate gradient method is very efficient in solving (7); potentially, the runtime of CG depends only on the number of connections in $W$, and it is possible to get very close to the exact solution in fewer than a few dozen iterations.
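As an illustration of the propagation step, the sketch below solves a small instance with a hand-written conjugate gradient routine. It assumes the two-term quadratic objective described above with a symmetric $W$, for which setting the gradient to zero gives the linear system $(I + 2L) f = y$ with $L = D - W$. It also assumes that the initial value of unknown nodes is the mean of the known labels. The function names are hypothetical, and a dense matrix is used only for brevity.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Plain conjugate gradient for a symmetric positive definite system."""
    x = np.zeros(len(b))
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = A @ p
        step = rs / (p @ Ap)
        x = x + step * p
        r = r - step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def label_propagation(W, labels):
    """labels: +1 (positive), -1 (negative), 0 (unknown)."""
    labels = np.asarray(labels, dtype=float)
    n_pos, n_neg = np.sum(labels > 0), np.sum(labels < 0)
    # Assumed bias for unknown nodes: mean of the known labels.
    k = (n_pos - n_neg) / (n_pos + n_neg)
    y = np.where(labels == 0, k, labels)
    laplacian = np.diag(W.sum(axis=1)) - W
    A = np.eye(len(y)) + 2.0 * laplacian   # stationarity condition of the objective
    return conjugate_gradient(A, y), y, A

# Toy usage: a chain of five proteins, positive at one end, negative at the other.
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
f, y, A = label_propagation(W, [1, 0, 0, 0, -1])
```

The scores decay smoothly along the chain, so the two unknown proteins next to the labelled ends receive scores of the corresponding sign even though the middle protein is equally far from both.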

III. RESULTS
We have carried out a comparative analysis of different integration schemes, including unregularized linear regression, ridge regression with a uniform prior weight vector, and a network combination with uniform weights (equal weighting), where the network weights are all set to $1/M$ and $M$ is the number of networks. We have also compared the results with the correlation-based network weighting method. In this method, each network weight corresponds to the kernel-target alignment score for this network:

$$\alpha_m = \frac{\langle W_m, K \rangle_F}{\sqrt{\langle W_m, W_m \rangle_F \, \langle K, K \rangle_F}},$$

where $\langle A, B \rangle_F = \operatorname{tr}(A^{T}B)$.

Figure 1 depicts the performance of each analysed integration scheme in five categories: predicting GO functions that have [3-10], [11-30], [31-100] and [101-300] positive annotations, and the whole set of [3-300] positive annotations. Figure 1 shows that ridge regression significantly outperforms unregularized regression and equal weighting in terms of AUP ($P = 1.60 \times 10^{-4}$ and $P = 7.70 \times 10^{-4}$, Wilcoxon paired signed rank test) when all the GO functions are taken together. The same tendency holds for the AUC values ($P = 2.50 \times 10^{-10}$ and $P = 1.80 \times 10^{-5}$, Wilcoxon paired signed rank test). The weights assigned by the correlation method give the worst prediction according to both the AUC and AUP measures. It can be explained by the inability of the method to register redundancy between the networks. The performance of the compared schemes in terms of AUP improves with the increase of the number of positive annotations, and ridge regression is in the top position in all the evaluation categories. The AUC values are higher for the category with [3-10] annotations, which is the result of a low number of positive annotations and their possible absence from the testing set. In all the other categories, the AUC values increase with the number of positive annotations.
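The redundancy issue of correlation-based weighting can be illustrated with a short Python sketch. It uses the standard kernel-target alignment score (normalised Frobenius inner product); the centred-label target network and the function name `alignment_score` are assumptions made for this example only.

```python
import numpy as np

def alignment_score(W, K):
    """Kernel-target alignment: normalised Frobenius inner product of W and K."""
    denom = np.sqrt(np.sum(W * W) * np.sum(K * K))
    return float(np.sum(W * K) / denom) if denom > 0 else 0.0

# Assumed target network: outer product of centred labels.
y = np.array([1, 1, -1, -1, -1, -1])
n_pos, n_neg = np.sum(y > 0), np.sum(y < 0)
centred = np.where(y > 0, n_neg / len(y), -n_pos / len(y))
K = np.outer(centred, centred)

# A network linking the two positives, and an exact duplicate of it.
pos = y > 0
W_a = (pos[:, None] & pos[None, :]).astype(float)
np.fill_diagonal(W_a, 0.0)
W_b = W_a.copy()

weights = [alignment_score(W, K) for W in (W_a, W_b)]
self_alignment = alignment_score(K, K)
```

Because each score is computed independently of the other networks, the duplicate receives exactly the same weight as the original, so the same evidence is counted twice in the composite network. A regression-based scheme, by contrast, fits all weights jointly.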
We have also compared the network weights that were assigned by different integration schemes to individual networks (Fig. 2). Unregularized linear regression is the most selective and assigns non-zero weights to only a few functional networks for each evaluation category. The composite network constructed using ridge regression with uniform priors includes more networks with positive weights. Therefore, the efficiency of ridge regression and that of the equal weighting scheme differ to a lesser extent.
The correlation weights show the opposite tendency to unregularized regression. The possible redundancy of gene expression and phylogenetic profiles is not taken into consideration. The average weights assigned to these data sources are much higher than those assigned by linear regression.
In all the schemes, a high proportion of the weights is assigned to the networks derived from gene expression and protein-protein interaction data sources.

IV. CONCLUSION
In this paper, we have analysed the performance capabilities of the two-step approach to protein function prediction, which is promising in accounting for the label imbalance and is computationally efficient because it relies on the sparseness of functional association networks. The experiments were conducted on the MouseFunc I benchmark data and GO evaluation categories of protein annotations. Two different performance measures were applied, namely AUC and AUP. The AUP measure is more appropriate for evaluating the results of binary classification tasks with significant label imbalance, i.e., a small number of positive cases in comparison to negative cases.
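The difference between the two measures under label imbalance can be seen on a small synthetic example. The sketch below is illustrative only: it computes AUC from pairwise rankings and uses average precision as an estimate of AUP, which is an assumption about how the area under the precision-recall curve is computed; the data and function names are hypothetical.

```python
import numpy as np

def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision values at the ranks of the positive examples."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="stable")
    hits = labels[order]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision[hits].mean()

# One positive among 100 genes, ranked fifth by the classifier.
scores = np.linspace(1.0, 0.0, 100)
labels = np.zeros(100, dtype=bool)
labels[4] = True

auc = roc_auc(scores, labels)            # 95 of 99 negatives ranked below it
aup = average_precision(scores, labels)  # precision at rank 5
```

Here AUC is about 0.96 while the average precision is only 0.2: four false positives ahead of the single true positive barely affect AUC but reduce precision fivefold.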
In the first step, the integrated functional network is constructed from heterogeneous data sources. The weights for individual networks correspond to the solution of a linear regression task with constraints, which is formulated on the basis of the kernel-target alignment method and takes into account the known protein annotations. The second step predicts the protein functions on the basis of the integrated network using the label propagation algorithm.
Several experiments with different integration schemes have shown that the scheme based on ridge regression with uniform priors is preferable to the widely assumed equal weighting. The correlation weighting method was the worst in all the evaluation categories. It can be partly explained by the inability of this scheme to filter out information redundancy. The application of unregularized regression to the integration of individual functional networks was restricted to the selection of only a few networks with positive weights, which could explain the loss in performance in comparison to ridge regression.
The experiments have shown that the selection of the integration scheme has a great influence on the accuracy of protein function prediction, and more extensive experiments with different priors for ridge regression must be conducted. A possible future research direction is the development of an approach that takes into account the hierarchical organisation of the GO ontology and makes simultaneous predictions for groups of functional terms. This direction can lead to an improvement in the accuracy of protein function prediction from multiple networks.
of each individual network, which determine the accuracy of protein function prediction and can be estimated by the kernel-target alignment in the form
matrix of the composite functional network. Non-zero elements of $W$ correspond to the strength of association between the connected proteins; association is absent when $W_{ij} = 0$. Weights $\alpha_m$ represent the relevance of the m-th network for the prediction task. Among $n$ nodes in $W$ we have $l$ proteins labelled with a specific function and $u$ unlabelled proteins. The labels are used to specify the label vector and unlabelled nodes. The following expression is used to specify $k$:

Fig. 1. Efficiency measures of protein function prediction using different network integration schemes (ridge regression with uniform priors, unregularized regression, equal weighting and correlation weighting): (a) area under the precision-recall curve; (b) average area under the ROC curve. The bars indicate average performance in evaluation categories with a different number of positive annotations.