FuSeS Predict Function of Unknown Proteins | National Agricultural Library

Objective

#1. Identify protein functional residues using in silico mutagenesis. Defining a functional role for every residue facilitates understanding of protein molecular/cellular functions. However, there is no quantitative way to compare the importance of a "catalytic residue" with that of a "binding hot-spot", or to differentiate these from structurally important core positions. Existing tools define functional sequence (FuSe) residue importance using evolutionary/family and structural information. We will develop a generic way to score the importance of FuSe positions by evaluating the functional effects of their modifications. Experimentally probing each residue is costly and not feasible on a large scale. We have previously built an accurate and efficient sequence-based computational tool (SNAP) to predict the functional effects of non-synonymous SNPs (nsSNPs). A 20-mer vector of SNAP scores describes the functional impact of having every possible residue at a given sequence position. We will use these vectors to create a single well-calibrated index of each position's functional importance.
#2. Identify functional sequence signatures (FuSeS). The ability to recognize FuSeS, sets of residues that together are responsible for a specific function, elucidates functional mechanisms. Sequence or structural families/domains are often used to indirectly approximate functionality of unknown proteins. FuSeS will instead characterize molecular function directly. Using multiple sequence alignments we will identify sets of FuSe residues consistently present in all proteins of experimentally annotated similar function. We will quantitatively evaluate the validity of FuSeS by their ability to precisely identify other functionally related sequences. For qualitative inference support, we'll also map FuSeS onto available protein 3D structures and manually inspect their correlation with the probable locations of functionality-defining sites. We'll extend the FuSeS concept to sequences not yet studied experimentally for further elucidation of molecular functions of the human proteome. 
#3. Build a database of FuSeS and implement a protein sequence-scanning tool. FuSeS and their functional annotations, gleaned from source sequences, will be stored in a freely available database - DFuSeS. Additionally, FuSeScanner, a methodology for searching protein sequences for known FuSeS, will be developed and tested. We will use FuSeScanner to annotate all protein sequences predicted from existing metagenomes. The resulting annotations will be stored in DFuSeS and referenced to the corresponding FuSe Signatures.
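As a rough illustration of objective #1, the sketch below collapses the 20 SNAP scores of one sequence position into a single number. It is not the calibrated index the project proposes to develop. The function name `position_importance`, the assumed SNAP score range of -100 to 100 (higher meaning a stronger predicted effect), and the choice to average the rescaled scores of the 19 non-native residues are all assumptions made here for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_importance(snap_vector, native, score_min=-100.0, score_max=100.0):
    """Collapse one position's 20 SNAP scores into a value in [0, 1].

    snap_vector: dict mapping each of the 20 amino acids to the SNAP score
                 of substituting it at this position (hypothetical input).
    native:      the residue actually present at this position.
    The score range is an assumption; adjust it to the SNAP version used.
    """
    if set(snap_vector) != set(AMINO_ACIDS):
        raise ValueError("snap_vector must have exactly the 20 amino acids as keys")
    if native not in AMINO_ACIDS:
        raise ValueError("native must be one of the 20 amino acids")

    # Ignore the native residue: keeping it is not a substitution.
    substitutions = [aa for aa in AMINO_ACIDS if aa != native]

    # Rescale each score to [0, 1] and average over the 19 substitutions.
    span = score_max - score_min
    scaled = [(snap_vector[aa] - score_min) / span for aa in substitutions]
    return sum(scaled) / len(scaled)
```

Under these assumptions, a position where every substitution is predicted to have a strong effect scores close to 1, and a position that tolerates any residue scores close to 0. A plain average is only a placeholder for the trained, calibrated predictor described in the approach.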

More information

NON-TECHNICAL SUMMARY: Extensive sequencing is increasingly leaving us with a potential goldmine of "not-quite-useful-yet" data. Existing tools that gauge the specifics of protein sequence-encoded functionality lack both functional specificity and sensitivity. Our long-range goal is to develop novel computational methods that can accurately identify residues specifically relevant for protein function, reveal the range of functions encoded by a given meta-, gen-, ex-, transcript-ome, and reduce the experimental work needed to describe the variome-mediated functional differences. The particular objective of this proposal is to elucidate the molecular functional make-up of the currently available meta-proteomes using per-residue functional significance predictions to profile/cluster protein sequences. We suggest a three-tiered approach. First, predict functional sequence (FuSe) residues using in silico mutagenesis. Then, align experimentally annotated orthologues and close paralogues to extract FuSe Signatures (FuSeS), sets of FuSe residues representative of specific protein functions, and use FuSeS to gauge functions of available un-annotated sequences. Finally, cluster the pool of FuSeS-less proteins to build a collection of new FuSeS defining yet unknown functions. Note that while all aims are logically interconnected, the project is modular and the completion of one aim/module is sufficiently independent of the others; i.e., data collection and proofs of concept for all aims may proceed simultaneously, while modules unsuccessful in development may be replaced. The expected outcome of this project is a database of protein functional signatures (FuSeS) and a corresponding computational tool (FuSeScanner) for protein function annotation from sequence alone.
The innovation of FuSeS is in building on established methodologies to create a novel and highly informative functional view of existing proteome data. This is also significant because FuSeS can be used to generate new experimentally testable hypotheses about the make-up and optimization of specific microbiotic environments. Understanding the human gut microbiome, for instance, could facilitate research in the directions of food safety and childhood obesity. Deeper knowledge of electron transfer chains in microbial communities could potentially aid research and development of sustainable energy resources. FuSeS will also be easily, cheaply, and accurately applicable to any -omic study requiring a more succinct annotation of protein function.
APPROACH: We'll evaluate the accuracy of annotating functional sites based on experimental mutagenesis data. We'll use pre-defined explicitly functional sites, structurally functional sites, and negative controls (all other residues) to evaluate FuSe (functional sequence) predictive abilities of experimental mutagenesis. We'll further collect protein per-residue functional annotations from various sources to create a training/testing set. We have previously developed SNAP, a neural network based method for evaluating the functional effect of single amino acid substitutions. We'll train a standard feed-forward neural network to recognize FuSe residues from SNAP vectors of all possible substitutions plus additional features, such as conservation scores and predicted secondary structure. We'll vary feature sets, AI algorithms, and their parameters to create multiple prediction methods. These will be compared to each other, to the residue conservation baseline, and to other methods using our testing sets and external data of the same type. We expect to develop a fast/accurate in silico mutagenesis-based method for computing well-calibrated scores representative of per-residue FuSe propensities.
We'll extract from SwissProt all enzymes with EC numbers. The sequences will be split into subsets at every digit of all assigned ECs. We'll also extract proteins with manually assigned GO "molecular function" terms. First, we'll build MSAs for all full four-digit EC number subsets and all GO subsets using MAFFT. For all protein groups we'll extract FuSeS: 1) Select from the MSA columns where >90% of sequences contain a FuSe. 2) Eliminate from this set columns where at least one FuSe residue isn't SNAP-conserved. 3) Randomly split the sequence set into ten subsets and iteratively recombine nine of ten, leaving a different one out each time; build MSAs of the ten 90%-subsets and repeat steps 1 and 2 on each. 4) Define FuSeS per protein as a set of residues in columns selected by all ten MSAs to which it belongs. For each protein of known function, we expect to determine a set of functional sequence signatures (FuSeS) defining said function. Newly created FuSeS will be stored in the DFuSeS database together with their corresponding functions.
We will use these FuSeS to annotate functions of all UniProt proteins as follows (FuSeScanner): 1) PSI-BLAST all sequences used in FuSeS building against UniProt, 2) validate each hit against its query sequence, checking SNAP conservation of all FuSeS residues, 3) in hits where the FuSeS are not conserved, re-align sequences first using ClustalW and, failing that, Smith-Waterman, and 4) transfer function annotation from the query sequence to the hit if all FuSeS residues are conserved. We'll evaluate the ability of FuSeScanner to annotate the function of SwissProt proteins and that of the electronically GO-annotated proteins. Numerous recent metagenomic studies have produced tens of millions of protein sequences. We will use FuSeScanner to annotate functions of as many proteins predicted from this data as possible. We also expect to build DFuSeS, a database of FuSeS, their functional annotations, and references to known protein sequences.
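The FuSeS extraction steps 1 to 4 described in the approach can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the project's implementation. The inputs `fuse` and `conserved` are hypothetical: each maps a protein identifier to the set of alignment columns where that protein carries a predicted FuSe residue, or a SNAP-conserved residue, respectively. The sketch reuses one fixed alignment for every leave-one-out subset, whereas the approach calls for rebuilding the MSAs (with MAFFT). It also reads "all ten MSAs to which it belongs" as the full MSA plus the nine subset MSAs that contain the protein, which is an interpretation.

```python
import random


def select_columns(members, fuse, conserved, n_cols, threshold=0.9):
    """Steps 1-2: keep columns where more than `threshold` of the members
    carry a FuSe residue and every such residue is SNAP-conserved."""
    selected = set()
    for col in range(n_cols):
        carriers = [m for m in members if col in fuse[m]]
        # Step 1: the column must be a FuSe in >90% of the sequences.
        if len(carriers) <= threshold * len(members):
            continue
        # Step 2: drop the column if any FuSe residue is not conserved.
        if all(col in conserved[m] for m in carriers):
            selected.add(col)
    return selected


def fuses_per_protein(fuse, conserved, n_cols, n_folds=10, seed=0):
    """Steps 3-4: intersect the columns selected on the full set with those
    selected on every leave-one-fold-out subset containing the protein."""
    proteins = sorted(fuse)
    shuffled = proteins[:]
    random.Random(seed).shuffle(shuffled)
    # Step 3: random split into n_folds subsets of near-equal size.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]

    full = select_columns(proteins, fuse, conserved, n_cols)
    result = {p: set(full) for p in proteins}

    for held_out in folds:
        held = set(held_out)
        members = [p for p in proteins if p not in held]
        # Assumption: same alignment columns; a real run would re-align.
        cols = select_columns(members, fuse, conserved, n_cols)
        for p in members:
            result[p] &= cols

    # Step 4: per-protein signature as the surviving column indices.
    return result
```

The function returns alignment column indices per protein. Mapping those columns back to residue positions in each ungapped sequence is left out of the sketch.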
PROGRESS: 2012/10 TO 2013/09
Target Audience: Nothing Reported
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Nothing Reported
How have the results been disseminated to communities of interest? Results have been disseminated via publications in major journals and via an informal report at SNP-SIG'13, a meeting co-organized by PI Bromberg and attended by >100 computational biologists in Berlin, July 2013.
What do you plan to do during the next reporting period to accomplish the goals? We will continue developing our methods for identifying protein functional sites.
PROGRESS: 2011/10/01 TO 2012/09/30
OUTPUTS: As a direct result of my research for the FuSeS project, the annual SNP-SIG meeting that I co-chair (2012 edition, Long Beach, CA) had a specific subfocus on the impact of mutations in functionally significant sites.
PARTICIPANTS:
Yana Bromberg -- Principal Investigator, Rutgers
Chris Rusnak -- Undergraduate student, Rutgers; data extraction and data model building
Burkhard Rost -- non-formal collaborator, Technical University of Munich
Christian Schaefer -- Co-supervised Ph.D. student in the lab of Dr. Burkhard Rost, Technical University of Munich; data collection, manuscript write-up
TARGET AUDIENCES: Nothing significant to report during this reporting period.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Investigators

Bromberg, Yana

Institution

Rutgers University

Start date

2012

End date

2017

Funding Source

Nat'l. Inst. of Food and Agriculture

Project number

NJ01150

Accession number

228906