Overview SUBDUE is a graphbased knowledge discovery system that finds structural, relational patterns in data representing entities and relationships. SUBDUE represents data using a labeled, directed graph in which entities are represented by labeled vertices or subgraphs, and relationships are represented by labeled edges between the entities. SUBDUE uses the minimum description length (MDL) principle to identify patterns that minimize the number of bits needed to describe the input graph after being compressed by the pattern. SUBDUE can perform several learning tasks, including unsupervised learning, supervised learning, clustering and graph grammar learning. SUBDUE has been successfully applied in a number of areas, including bioinformatics, web structure mining, counterterrorism, social network analysis, aviation and geology. 

Graphbased Unsupervised Learning (Discovery) 

In discovery mode SUBDUE uses heuristic search guided by MDL to find patterns minimizing the description length of the entire graph compressed with the pattern. Once a pattern is found, SUBDUE can compress the graph using this pattern and repeat the process on the compressed graph to look for more abstract patterns, possibly defined in terms of previouslydiscovered patterns. The illustration below shows how SUBDUE finds four instances of a pattern S_{1} in the input graph, the graph compressed with pattern S_{1}, the pattern S_{2} found in the next iteration, and the graph compressed with pattern S_{2}.




Graph Based Supervised Learning If graphs depicting both positive and negative examples of a phenomenon are available for input, then SUBDUE enters supervisedlearning mode, searching for a pattern that compresses the positive graphs, but not the negative graphs. For example, given positive graphs describing criminal networks and negative graphs describing normal social networks, SUBDUE can learn patterns distinguishing the two, and these patterns can be used as a predictive model to identify emerging criminal networks. 



Graph Based Hierarchical Clustering The ability of SUBDUE to iteratively discover patterns and compress the graph can be used to generate a clustering of the input graph. Essentially, clustering mode forces SUBDUE to iterate until the input graph can be compressed no further. The resulting patterns form a cluster lattice, such that if a pattern S is defined in terms of one or more previouslydiscovered patterns, then these patterns are parents of S in the lattice. In the above example, S_{2} is defined in terms of S_{1}. Each cluster in the lattice is defined conceptually by the graphical pattern discovered by SUBDUE, producing a hierarchical, conceptual clustering of the input data. 



Graph Grammar Learning Graph grammars are similar to string grammars, but where terminals and nonterminals can represent arbitrary graphs. SUBDUE learns contextfree, nodereplacement graph grammars by looking for common connections between the instances of a substructure S. 

