This is the home page for the MotifSpace project, research currently under way in the
Department of Computer Science at the University of North Carolina
at Chapel Hill. It provides an overview of the research and of the related applications
being built for the project.
MotifSpace is a collection of computational
methods for discovering, cataloging, querying, and visualizing spatial motifs.
The tools can be used to study the locations of spatial motifs within protein structures,
build predictive models for protein function inference, and construct hypotheses
to guide the design of biological experiments.
We represent a protein structure by a
labeled multigraph and detect spatial motifs by searching for subgraphs
common to a group of protein graphs. In our representation, a node stands for
an amino acid residue in the protein structure and is labeled with the amino acid
identity. An edge connects two amino acid residues and is labeled by (1) the
discretized Euclidean distance between the two residues and (2) the
potential interaction between them. A spatial motif corresponds
to a subgraph whose edges are labeled by distance intervals that allow some perturbation
of the amino acids, to account for dynamics and uncertainty in structure determination.
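The graph representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each residue is reduced to a single coordinate, the bin boundaries in BIN_EDGES are hypothetical (the page does not give the actual discretization), and the second edge label, the potential interaction, is omitted. The names distance_bin and build_protein_graph are invented for this sketch.

```python
import math
from itertools import combinations

# Hypothetical upper bounds (in angstroms) of the distance bins; the
# discretization actually used by MotifSpace is not stated on this page.
BIN_EDGES = [4.0, 6.0, 8.5, 12.0]


def distance_bin(d, edges=BIN_EDGES):
    """Return the index of the first bin whose upper bound is >= d, else None."""
    for index, upper in enumerate(edges):
        if d <= upper:
            return index
    return None  # residues too far apart: no edge is created


def build_protein_graph(residues):
    """Build a labeled graph from (amino_acid_label, (x, y, z)) tuples.

    Returns (node_labels, edge_labels), where edge_labels maps an index
    pair (i, j) with i < j to a distance-bin index.
    """
    node_labels = [label for label, _ in residues]
    edge_labels = {}
    for (i, (_, p)), (j, (_, q)) in combinations(enumerate(residues), 2):
        b = distance_bin(math.dist(p, q))
        if b is not None:
            edge_labels[(i, j)] = b
    return node_labels, edge_labels


# Toy coordinates chosen for illustration, not taken from a real structure.
toy = [
    ("SER", (0.0, 0.0, 0.0)),
    ("HIS", (3.0, 0.0, 0.0)),
    ("ASP", (3.0, 7.0, 0.0)),
    ("GLY", (40.0, 0.0, 0.0)),
]
nodes, edges = build_protein_graph(toy)
```

In this toy input the glycine lies beyond the last bin boundary from every other residue, so it appears as a node with no edges.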
We have developed frequent subgraph mining
methods that find all subgraphs appearing in at least a given fraction of the members of
a group of graphs. For protein graphs, such subgraphs represent spatial motifs.
As a proof of concept, we locate patterns with known biological functions, such as
the catalytic triad in serine proteases (see the image below), the catalytic dyad
and the hydrophobic binding pocket in papain-like cysteine proteases, the ligand
binding sites in nuclear binding domains, and the co-factor binding sites in NADP
binding proteins.
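The frequency criterion described above (a subgraph must occur in at least a given fraction of the graphs in a group) can be illustrated with a brute-force check. This is a sketch for small toy graphs only, not the mining algorithms developed in the project: it tests one candidate pattern at a time by trying every node assignment, and it matches edge labels exactly rather than by distance interval. The graph encoding (a list of node labels plus a dictionary of edge labels keyed by index pairs) and all function names are assumptions made for this example.

```python
from itertools import permutations


def edge_label(edges, u, v):
    """Look up the label of the undirected edge {u, v}, or None if absent."""
    return edges.get((min(u, v), max(u, v)))


def contains(graph, pattern):
    """Return True if pattern occurs in graph with matching node and edge labels."""
    g_nodes, g_edges = graph
    p_nodes, p_edges = pattern
    # Try every injective assignment of pattern nodes to graph nodes.
    for mapping in permutations(range(len(g_nodes)), len(p_nodes)):
        if any(g_nodes[mapping[k]] != p_nodes[k] for k in range(len(p_nodes))):
            continue
        if all(
            edge_label(g_edges, mapping[a], mapping[b]) == label
            for (a, b), label in p_edges.items()
        ):
            return True
    return False


def support(graphs, pattern):
    """Fraction of graphs in the group that contain the pattern."""
    return sum(contains(g, pattern) for g in graphs) / len(graphs)


def is_frequent(graphs, pattern, min_support):
    """A pattern is frequent if its support reaches the chosen threshold."""
    return support(graphs, pattern) >= min_support


# Toy data: edge labels are hypothetical distance-bin indices.
triad = (["SER", "HIS", "ASP"], {(0, 1): 0, (0, 2): 2, (1, 2): 2})
graphs = [
    (["SER", "HIS", "ASP"], {(0, 1): 0, (0, 2): 2, (1, 2): 2}),
    (["GLY", "ASP", "HIS", "SER"], {(2, 3): 0, (1, 3): 2, (1, 2): 2, (0, 1): 1}),
    (["SER", "HIS", "ALA"], {(0, 1): 0, (1, 2): 1}),
]
```

Here the pattern occurs in the first two toy graphs but not the third, so it is frequent at a threshold of 0.5 and not at 0.9. The exhaustive search grows factorially with graph size, which is why dedicated frequent subgraph mining methods are needed for real protein graphs.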
We have identified more than six million spatial motifs from thousands of representative
proteins in the Protein Data Bank (PDB). MotifSpace will provide an integrated view
of information and knowledge about spatial motifs.
Project Leaders
Wei Wang, Associate Professor & Director
Leonard McMillan, Associate Professor
Jan Prins, Professor
Jack Snoeyink, Professor
Alexander Tropsha, Professor (School of Pharmacy)
Graduate Research Assistants
Luke Huan, Deepak Bandyopadhyay, Yetian Chen (School of Pharmacy), Jinze Liu, Ruchir
Shah (School of Pharmacy), Kiranjit Sidhu, David Williams, Tao Xie, Jindan Zhang
Research Sponsors
National Science Foundation
National Institutes of Health
Microsoft Research
References:
[1] D. Bandyopadhyay, J. Huan, J. Liu, J. Prins, J. Snoeyink,
W. Wang, A. Tropsha, Structure-based function inference using protein family-specific
fingerprints, to appear in Protein Science, 2006.
[2] J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the
presence of isomorphism. In Proc. International Conference on Data Mining,
pages 549-552, 2003.
[3] J. Huan, W. Wang, A. Washington, J. Prins, R. Shah, and A. Tropsha. Accurate
classification of protein structural families based on coherent subgraph analysis.
In Proc. Pacific Symposium on Biocomputing (PSB), pages 411-422, 2004.
[4] J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha.
Mining protein family specific residue packing patterns from protein structure graphs.
In Eighth Annual International Conference on Research in Computational Molecular
Biology (RECOMB), pages 308–315, 2004.
[5] J. Huan, W. Wang, J. Prins, and J. Yang. Spin: Mining maximal frequent subgraphs
from graph databases. ACM SIGKDD, pages 581-586, 2004.
[6] A. Murzin, S. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification
of proteins database for the investigation of sequences and structures. Journal of
Molecular Biology, 247:536–40, 1995.
[7] H. Weissig, I. Shindyalov, and P. Bourne. The protein data bank. Nucleic Acids
Research, 28:235–42, 2000.