MoTIF SPACE

Discovering Spatial Motifs from the Protein Structure Space

Fast Frequent Subgraph Mining

Introduction:


The FFSM (Fast Frequent Subgraph Mining) algorithm is part of our on-going effort to develop effective and efficient algorithms for knowledge discovery in complex data. The current executable FFSM V. 1 solves the frequent subgraph mining problem, which is, given a collection of graphs D and a threshold  f between 0 (exclusive) and 1 (inclusive), to enumerate subgraphs that occur in at least a fraction f of graphs in D. In searching for frequent subgraphs, we make no assumption about the size, topology, or location of these patterns, and hence our result is an exclusive enumeration of all frequent patterns.  Further algorithmic details about FFSM can be found in [1]; the application of FFSM in protein structural comparison, one of our driving applications, can be found in [2].


Download & Install:


Please send email to Dr. Wei Wang and cc to Dr. Jan Prins (prins@cs.unc.edu) and Jun Huan (huan@cs.unc.edu). The executable, together with two sample data sets and a README file, is distributed as a zip file (ffsm.tar.gz). After saving the zip file into your local directory,  please run commands: gunzip ffsm.tar.gz;  tar -xvf ffsm.tar.


Command Line Parameters:

$FFSM <node file> <edge file> <frequency> 
Example: FFSM d1n d1e 27
or FFSM d2n d2e 36


Input Format:

There are two input files: node file and edge file with the following formats:

node file:
node <g_index> <n_index> <n_label>
<g_index> : an integer from 0 to n-1 where n is the size of the graph database.
<n_index> : the index of a node in a  graph
<n_label>  : a positive integer label of the related node

edge file:
edge <g_index> <n1_index> <n2_index> <e_label>
<g_index>       : an integer from 0 to n-1 where n is the size of the graph database.
<n1/2_indices>: the indices of the nodes which the edge connects
<e_label>        : a positive integer  label of the edge

<frequency> is a positive integer.

Sample Output:

=========Welcome using FFSM: Fast Frequent Subgraph Mining=========
Developed by the University of North Carolina at Chapel Hill
graph database size 1083 threshold 27
......
Total frequent patterns: 348514 Search Time 94.5196 seconds

Sample Datasets:

D1: Confirmed moderate active chemicals from NIH anti-HIV virus drug screen.
D2: SCOP serine proteases. The list of the protein identifiers, which can be used to retrieve the related proteins from the Protein DataBank (PDB, http://www.pdb.org), is included.

 

Platform:

The FFSM executable was built in Red Hat Linux.

Limitations & Disclaimer:

FFSM supports graph databases that contain maximal 32k graphs; each graph may have up to 32k nodes. Single-node patterns are not included in results. The current software is confined to performance comparison only; full functional one is available upon request. The executable is provided "as it is" and we assume no responsibility for any damage that it may cause to your system/files.

 

References:

Algorithm/Dataset D1:
[1] Jun Huan, Wei Wang, and Jan Prins. "Efficient Mining of Frequent Subgraph in the Presence of Isomorphism", in Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), pp. 549-552, 2003. PDF

[2] Jun Huan, Wei Wang, and Jan Prins. "Efficient Mining of Frequent Subgraph in the Presence of Isomorphism", in Technical Reports produced by the Department of Computer Science at the University of North Carolina, Chapel Hill, 2003. PDF

Dataset D2:

[3] Jun Huan, Wei Wang, Deepak Bandyopadhyay, Jack Snoeyink, Jan Prins, and Alexander Tropsha. "Mining Family Specific Residue Packing Patterns from Protein Structure Graphs ", in Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pp. 308-315, 2004. PDF