Fast Frequent Subgraph Mining
Introduction:
The FFSM (Fast Frequent Subgraph Mining)
algorithm is part of our on-going effort to develop effective and efficient algorithms
for knowledge discovery in complex data. The current executable FFSM V. 1 solves
the frequent subgraph mining problem, which
is, given a collection of graphs D and a
threshold f between 0 (exclusive)
and 1 (inclusive), to enumerate subgraphs that occur in at least a
fraction f of graphs in
D.
In searching for frequent subgraphs, we make no assumption about the
size, topology, or location of these patterns, and hence our result is an exclusive
enumeration of all frequent patterns.
Further algorithmic details about FFSM can be found in [1]; the application of FFSM
in protein structural comparison, one of our driving applications, can be found
in [2].
Download
& Install:
Please send email to Dr. Wei Wang and cc to Dr. Jan Prins (prins@cs.unc.edu)
and Jun Huan (huan@cs.unc.edu). The executable, together with two sample data sets
and a README file, is distributed as a zip file (ffsm.tar.gz). After saving the
zip file into your local directory, please run commands: gunzip ffsm.tar.gz;
tar -xvf ffsm.tar.
Command Line Parameters:
$FFSM <node file> <edge file> <frequency>
Example: FFSM d1n d1e 27
or FFSM d2n d2e 36
Input Format:
There are two input files: node file and edge file with the following formats:
node file:
node <g_index> <n_index> <n_label>
<g_index> : an integer from 0 to n-1
where n is the size of the graph database.
<n_index> : the index of a node in a graph
<n_label> : a positive integer label of the related node
edge file:
edge <g_index> <n1_index> <n2_index> <e_label>
<g_index> : an integer from 0 to
n-1 where n is the size of the
graph database.
<n1/2_indices>: the indices of the nodes which the edge connects
<e_label> : a positive integer label of the
edge
<frequency> is a positive integer.
Sample Output:
=========Welcome using FFSM: Fast Frequent Subgraph Mining=========
Developed by the University of North Carolina at Chapel Hill
graph database size 1083 threshold 27
......
Total frequent patterns: 348514 Search Time 94.5196 seconds
Sample Datasets:
D1: Confirmed moderate active
chemicals from NIH anti-HIV virus drug screen.
D2: SCOP serine proteases. The
list of the protein identifiers, which can be used to retrieve the related
proteins from the Protein DataBank (PDB, http://www.pdb.org), is included.
Platform:
The FFSM executable was built in Red Hat Linux.
Limitations & Disclaimer:
FFSM supports graph databases that contain maximal 32k
graphs; each graph may have up to 32k nodes. Single-node patterns are not included
in results. The current software is confined to performance comparison only; full
functional one is available upon request. The executable is provided "as it is"
and we assume no responsibility for any damage that it may cause to your system/files.
References:
Algorithm/Dataset D1:
[1] Jun Huan, Wei Wang, and Jan Prins. "Efficient Mining of Frequent Subgraph in
the Presence of Isomorphism", in Proceedings of the 3rd IEEE International Conference
on Data Mining (ICDM), pp. 549-552, 2003.
PDF
[2] Jun Huan, Wei Wang, and Jan Prins. "Efficient Mining
of Frequent Subgraph in the Presence of Isomorphism", in Technical Reports produced
by the Department of Computer Science at the University of North Carolina, Chapel
Hill, 2003. PDF
Dataset D2:
[3] Jun Huan, Wei Wang, Deepak Bandyopadhyay,
Jack Snoeyink, Jan Prins, and Alexander Tropsha. "Mining Family Specific Residue
Packing Patterns from Protein Structure Graphs ", in Proceedings of the 8th Annual
International Conference on Research in Computational Molecular Biology (RECOMB),
pp. 308-315, 2004. PDF