MoTIF SPACE

Discovering Spatial Motifs from the Protein Structure Space

MotifSpace

This is the home page of the MotifSpace research project, currently under way in the Department of Computer Science at the University of North Carolina at Chapel Hill. It provides an overview of the research and of the related applications being built for the project.


What's New

  • Upload of all PDB Files to database completed

The Challenge

A protein is a long sequence of amino acid residues that folds into a stable three-dimensional structure. A central tenet of modern biology is that a protein's function is determined by its structure. Often, local substructures within the protein determine its function. These substructures, composed of a small number of amino acid residues, often have conserved spatial arrangements across a group of proteins with the same function, and are referred to as spatial motifs. The challenge is to develop automated techniques to identify spatial motifs in proteins.

The Solution

MotifSpace is a collection of computational methods for discovering, cataloging, querying, and visualizing spatial motifs.  The tools can be used to study the locations of spatial motifs within protein structures, build predictive models for protein function inference, and construct hypotheses to guide the design of biological experiments.

We represent a protein structure as a labeled multigraph and detect spatial motifs by searching for subgraphs common to a group of protein graphs. In this representation, a node stands for an amino acid residue and is labeled with the amino acid identity. An edge connects two residues and is labeled by (1) the discretized Euclidean distance between them and (2) the potential interaction between them. A spatial motif corresponds to a subgraph whose edges are labeled by distance intervals; the intervals allow some perturbation of the amino acids to account for dynamics and for uncertainty in structure determination.
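As a rough illustration of the graph representation described above, the following Python sketch builds a labeled graph from a list of residues. It is only a minimal sketch under stated assumptions: the bin edges in `BIN_EDGES`, the function names `distance_bin` and `build_protein_graph`, and the use of one representative point per residue are all hypothetical and not taken from the project. The second edge label (potential interaction) is omitted for brevity.

```python
from itertools import combinations
from math import dist

# Hypothetical upper bin edges in angstroms. The actual discretization
# used by the project is not stated in the text.
BIN_EDGES = (4.0, 6.0, 8.5)


def distance_bin(d):
    """Return the index of the first bin whose upper edge is >= d,
    or None if d lies beyond the last edge (no edge is created)."""
    for index, upper in enumerate(BIN_EDGES):
        if d <= upper:
            return index
    return None


def build_protein_graph(residues):
    """Build a labeled graph from residues.

    residues: list of (amino_acid_code, (x, y, z)) tuples, with one
    representative point per residue (an assumption of this sketch).
    Returns (labels, edges): labels[i] is the node label of residue i,
    and edges maps (i, j) with i < j to a discretized distance bin.
    """
    labels = [amino_acid for amino_acid, _ in residues]
    edges = {}
    for (i, (_, p)), (j, (_, q)) in combinations(enumerate(residues), 2):
        bin_index = distance_bin(dist(p, q))
        if bin_index is not None:
            edges[(i, j)] = bin_index
    return labels, edges


# Toy coordinates chosen for illustration, not taken from a real structure.
toy_residues = [
    ("SER", (0.0, 0.0, 0.0)),
    ("HIS", (3.0, 0.0, 0.0)),
    ("ASP", (3.0, 4.0, 0.0)),
    ("GLY", (30.0, 0.0, 0.0)),
]
toy_labels, toy_edges = build_protein_graph(toy_residues)
```

Because each edge stores a bin index rather than an exact distance, two structures whose residues are slightly displaced can still yield identical edge labels, which is the tolerance the distance intervals are meant to provide.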

We have developed frequent subgraph mining methods that search for all subgraphs appearing in at least a given fraction of the graphs in a group. For protein graphs, such subgraphs represent spatial motifs. As a proof of concept, we locate patterns with known biological functions, such as the catalytic triad in serine proteases (see the image below), the catalytic dyad and the hydrophobic binding pocket in papain-like cysteine proteases, the ligand binding sites in nuclear binding domains, and the cofactor binding sites in NADP binding proteins.
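The notion of support used above (a subgraph is frequent if it appears in at least a given fraction of the graphs) can be sketched for the smallest interesting case, fully connected three-node patterns such as a triad. This is a brute-force illustration only, not the project's mining algorithms, which handle arbitrary subgraphs. The function names `triangle_patterns` and `frequent_patterns`, the input format, and the toy graphs are assumptions made for this example.

```python
from itertools import combinations, permutations


def triangle_patterns(labels, edges):
    """Return the set of canonical labeled triangles found in one graph.

    labels: list of node labels.
    edges: dict mapping (i, j) with i < j to an edge label.
    """
    def edge(a, b):
        return edges.get((min(a, b), max(a, b)))

    found = set()
    for trio in combinations(range(len(labels)), 3):
        a, b, c = trio
        if None in (edge(a, b), edge(b, c), edge(a, c)):
            continue  # the three nodes are not fully connected
        # Canonical form: the smallest encoding over all six node orderings,
        # so the same pattern gets the same key regardless of node order.
        forms = [
            (labels[x], labels[y], labels[z], edge(x, y), edge(y, z), edge(x, z))
            for x, y, z in permutations(trio)
        ]
        found.add(min(forms))
    return found


def frequent_patterns(graphs, min_support):
    """Return {pattern: support} for patterns that occur in at least
    a min_support fraction of the graphs."""
    counts = {}
    for labels, edges in graphs:
        # A set is returned, so each pattern counts once per graph.
        for pattern in triangle_patterns(labels, edges):
            counts[pattern] = counts.get(pattern, 0) + 1
    threshold = min_support * len(graphs)
    return {p: n / len(graphs) for p, n in counts.items() if n >= threshold}


# Toy graphs for illustration: the same SER-HIS-ASP triangle appears in the
# first two graphs under different node orderings. The third has no triangle.
graph_1 = (["SER", "HIS", "ASP"], {(0, 1): 0, (0, 2): 1, (1, 2): 0})
graph_2 = (["ASP", "SER", "HIS", "GLY"],
           {(0, 1): 1, (1, 2): 0, (0, 2): 0, (2, 3): 0})
graph_3 = (["GLY", "ALA", "SER"], {(0, 1): 0})
frequent = frequent_patterns([graph_1, graph_2, graph_3], min_support=0.5)
```

Enumerating every triple is cubic in the number of residues and does not extend to larger patterns, which is why dedicated frequent subgraph mining methods are needed in practice.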

We have identified more than six million spatial motifs from thousands of representative proteins in the Protein Data Bank (PDB). MotifSpace will provide an integrated view of information and knowledge about spatial motifs.

Project Leaders

Wei Wang, Associate Professor & Director

Leonard McMillan, Associate Professor

Jan Prins, Professor

Jack Snoeyink, Professor

Alexander Tropsha, Professor (School of Pharmacy)

Graduate Research Assistants

Luke Huan, Deepak Bandyopadhyay, Yetian Chen (School of Pharmacy), Jinze Liu, Ruchir Shah (School of Pharmacy), Kiranjit Sidhu, David Williams, Tao Xie, Jindan Zhang

Research Sponsors

National Science Foundation

National Institutes of Health

Microsoft Research
