
The Tree and Network Inference module
Cluster analysis, also called unsupervised learning, is an indispensable tool in bioinformatics. BioNumerics brings together the power and flexibility of its relational database, the contribution of multiple techniques, and a wide range of clustering algorithms in a clustering module with unique capabilities.

The Comparison window
The heart of BioNumerics' analysis functions is the Comparison window, presenting a comprehensive overview of all available experiments for a selection of entries and enabling the user to show and compare any combination of experiments. Similarity or distance matrices and dendrograms can be calculated for any selected experiment, and the obtained groupings can be compared with patterns or characters obtained from other experiments. A large number of similarity and distance coefficients and clustering methods are available, in order to provide the most appropriate clustering for all data types and clustering purposes.
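
To make the notion of a similarity matrix concrete, the following is a minimal sketch of two of the band-matching coefficients named in the specifications (Dice and Jaccard), applied to presence/absence band patterns. It is not BioNumerics code: the function names, the `patterns` dictionary and the strain names are hypothetical, and tolerance and optimization settings for band matching are ignored.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient: shared bands / bands present in either pattern."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two patterns without any bands are treated as identical.
    return shared / either if either else 1.0


def dice(a, b):
    """Dice coefficient: 2 * shared bands / total bands in both patterns."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    return 2 * shared / total if total else 1.0


def similarity_matrix(patterns, coefficient):
    """Symmetric similarity matrix as a nested dict keyed by entry name."""
    names = list(patterns)
    matrix = {name: {name: 1.0} for name in names}
    for p, q in combinations(names, 2):
        matrix[p][q] = matrix[q][p] = coefficient(patterns[p], patterns[q])
    return matrix


# Hypothetical band patterns: 1 = band present at that position, 0 = absent.
patterns = {
    "strain_A": [1, 1, 0, 1, 0],
    "strain_B": [1, 1, 0, 0, 0],
    "strain_C": [0, 0, 1, 1, 1],
}

dice_matrix = similarity_matrix(patterns, dice)
for name, row in dice_matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

A matrix of this kind is the input that a clustering method such as UPGMA would then turn into a dendrogram.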

Dendrogram features
Interpreting trees of up to 20,000 entries is not a simple task. BioNumerics offers a comprehensive set of features for interpreting and mining complex data sets, including viewing tools such as two-way zoom-sliders, swapping and abridging of branches, rerooting of trees, displaying data (characters, patterns, curves or sequences) in various modes, assigning colors or symbols to groups, etc. Furthermore, adding entries to, or deleting entries from, large clusterings is facilitated by the incremental clustering feature. Rather than recalculating matrices and trees from scratch, BioNumerics updates them automatically, so that adding or deleting entries becomes a matter of a few seconds.

Composite clustering
Data from multiple techniques can be combined into one composite clustering. Similarities can be adopted from the individual experiments and averaged using different weighting strategies. Alternatively, all characters from the individual experiments can be pooled to form one global data set, which can be clustered. Using a mathematical linearization model, a consensus similarity matrix and dendrogram can be calculated based upon individual matrices from different experiments.
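
The averaging strategy can be illustrated with a short sketch. It assumes each experiment has already produced a similarity matrix over the same entries in the same order, and it combines them as a weighted mean. The function name, the example matrices and the weights are hypothetical; the linearization model mentioned above is not covered.

```python
def composite_similarity(matrices, weights=None):
    """Weighted average of per-experiment similarity matrices.

    `matrices` is a list of square lists-of-lists over the same entries.
    `weights` defaults to equal weighting; it could instead hold the number
    of characters per experiment or user-defined values.
    """
    if weights is None:
        weights = [1.0] * len(matrices)
    if len(weights) != len(matrices):
        raise ValueError("one weight is required per matrix")
    total = sum(weights)
    size = len(matrices[0])
    return [
        [
            sum(w * m[i][j] for w, m in zip(weights, matrices)) / total
            for j in range(size)
        ]
        for i in range(size)
    ]


# Hypothetical similarity matrices for three entries from two experiments.
fingerprint = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.4], [0.2, 0.4, 1.0]]
character = [[1.0, 0.6, 0.5], [0.6, 1.0, 0.3], [0.5, 0.3, 1.0]]

equal = composite_similarity([fingerprint, character])
# Weighting by an assumed number of characters (20 bands versus 5 tests).
by_size = composite_similarity([fingerprint, character], weights=[20, 5])
print(round(equal[0][1], 3), round(by_size[0][1], 3))
```

Weighting by the number of characters lets a large experiment dominate the composite, whereas equal weights treat every technique as one vote regardless of its size.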

Phylogenetic inference
BioNumerics offers Maximum Parsimony and Maximum Likelihood as phylogenetic inference methods. Besides standard algorithms, the optimal trees can be calculated using simulated annealing or quartet puzzling. Both methods result in an unrooted tree, which can be converted into a rooted tree after assignment of a root. To correct phylogenetic distance scaling, the Jukes & Cantor or Kimura 2-parameter correction factors can be chosen.
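
The two distance corrections follow well-known formulas, sketched below for a pair of aligned nucleotide sequences. With p the proportion of differing sites, Jukes & Cantor gives d = -3/4 ln(1 - 4p/3); with P and Q the proportions of transitions and transversions, Kimura 2-parameter gives d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)). The function names and example sequences are hypothetical, and skipping gapped or ambiguous sites is an assumption of this sketch rather than documented BioNumerics behavior.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def count_differences(seq1, seq2):
    """Return (sites, transitions, transversions) for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    sites = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        # Assumption: gaps and ambiguity codes are simply skipped.
        if x not in NUCLEOTIDES or y not in NUCLEOTIDES:
            continue
        sites += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    return sites, transitions, transversions


def jukes_cantor(seq1, seq2):
    sites, ts, tv = count_differences(seq1, seq2)
    p = (ts + tv) / sites
    if p >= 0.75:
        raise ValueError("sequences too divergent for Jukes & Cantor")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1, seq2):
    sites, ts, tv = count_differences(seq1, seq2)
    P, Q = ts / sites, tv / sites
    a, b = 1 - 2 * P - Q, 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for Kimura 2-parameter")
    return -0.5 * math.log(a * math.sqrt(b))


# Hypothetical aligned sequences: two transitions and one transversion.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "GCGTATGTACTTACGTACGT"
print(count_differences(seq_a, seq_b))
print(round(jukes_cantor(seq_a, seq_b), 4), round(kimura_2p(seq_a, seq_b), 4))
```

Both corrected distances exceed the raw proportion of differences, because they estimate the substitutions hidden by multiple changes at the same site.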

Minimum Spanning Trees
Whereas parsimony and maximum likelihood techniques are suitable for inferring deeper phylogenetic relationships, the Minimum Spanning Tree (MST) algorithm allows short-term divergence and micro-evolution in populations to be reconstructed based upon sampled data. The MST technique as implemented in BioNumerics is an excellent tool for analyzing genetic subtyping data such as those derived from MLST, MLVA and other allele-comparison techniques. The MST interface offers great interaction with the database and other techniques and is the ideal platform for plotting epidemic divergence against other factors such as geographical distribution, date of sampling, serotypes, etc.
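
As an illustration of the general idea, the sketch below builds a minimum spanning tree over allelic profiles with Prim's algorithm, using the number of differing loci as a categorical distance. This is a textbook MST, not the BioNumerics implementation: the profile names and allele numbers are hypothetical, and ties between equally short edges are resolved by iteration order, whereas a production tool may apply its own rules.

```python
def allelic_distance(profile1, profile2):
    """Categorical distance: number of loci with different allele numbers."""
    return sum(1 for a, b in zip(profile1, profile2) if a != b)


def minimum_spanning_tree(profiles):
    """Prim's algorithm; returns a list of (from, to, distance) edges."""
    names = sorted(profiles)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        best = None
        # Find the shortest edge joining the tree to an entry outside it.
        for u in sorted(in_tree):
            for v in names:
                if v in in_tree:
                    continue
                d = allelic_distance(profiles[u], profiles[v])
                if best is None or d < best[0]:
                    best = (d, u, v)
        d, u, v = best
        edges.append((u, v, d))
        in_tree.add(v)
    return edges


# Hypothetical seven-locus MLST profiles.
profiles = {
    "ST1": [1, 1, 1, 1, 1, 1, 1],
    "ST2": [1, 1, 1, 1, 1, 1, 2],
    "ST3": [1, 1, 1, 1, 1, 3, 2],
    "ST4": [4, 4, 1, 1, 1, 1, 1],
}

tree = minimum_spanning_tree(profiles)
for source, target, distance in tree:
    print(f"{source} -- {target} ({distance} locus differences)")
```

Each edge links a profile to its closest relative already in the tree, which is why chains of single-locus variants tend to appear as paths radiating from a central type.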

Cluster significance tools
BioNumerics employs proprietary technology to assess the reliability of clusters for any clustering algorithm and data set. The method is based on resampling/permutation techniques operating at the data level or at the similarity level and is designed as a framework encompassing all available clustering algorithms in BioNumerics. The method quantifies the reliability of dendrograms or networks as a function of degeneracies as well as poorly resolved clusters, and can calculate consensus trees or networks that impose a minimum reliability threshold on each resolved cluster. This method sheds new light on the problem of cluster significance and the reliability of tree and network inference algorithms and is an invaluable asset in interpreting clustering trees and networks.
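
The proprietary method itself is not described here, so the following is only a generic illustration of resampling at the data level. It is a plain bootstrap over characters that estimates how often two entries remain each other's nearest neighbour. Every name in it is hypothetical, the "mutual nearest neighbour" criterion is a simplification chosen for brevity, and it should not be read as the BioNumerics algorithm.

```python
import random


def matching_similarity(a, b):
    """Fraction of characters with identical states (categorical matching)."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def nearest_neighbour(name, data, columns):
    """Entry most similar to `name` on the given columns, or None on a tie."""
    scores = []
    for other in data:
        if other == name:
            continue
        a = [data[name][c] for c in columns]
        b = [data[other][c] for c in columns]
        scores.append((matching_similarity(a, b), other))
    scores.sort(reverse=True)
    if len(scores) > 1 and scores[0][0] == scores[1][0]:
        return None  # ambiguous (degenerate) case: no unique nearest neighbour
    return scores[0][1]


def bootstrap_pair_support(data, first, second, replicates=200, seed=0):
    """Share of bootstrap replicates in which the pair are mutual nearest neighbours."""
    rng = random.Random(seed)
    n_chars = len(next(iter(data.values())))
    hits = 0
    for _ in range(replicates):
        # Resample character columns with replacement.
        columns = [rng.randrange(n_chars) for _ in range(n_chars)]
        if (nearest_neighbour(first, data, columns) == second
                and nearest_neighbour(second, data, columns) == first):
            hits += 1
    return hits / replicates


# Hypothetical categorical data: A and B are identical, C/D/E overlap partly.
data = {
    "A": [1, 1, 1, 1, 1, 1],
    "B": [1, 1, 1, 1, 1, 1],
    "C": [2, 2, 2, 3, 3, 3],
    "D": [2, 2, 2, 4, 4, 4],
    "E": [2, 5, 5, 3, 4, 5],
}

print("support A-B:", bootstrap_pair_support(data, "A", "B"))
print("support C-D:", bootstrap_pair_support(data, "C", "D"))
```

A pair that is grouped in nearly every replicate is well supported by the data, while a pair whose grouping depends on a few characters receives a lower value.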

Specifications:
- Methods. Comparisons of up to 20,000 database entries. Various similarity/distance coefficients for different data types: Pearson product-moment correlation, cosine correlation, Dice or Nei and Li, Jaccard, Jeffrey's X, Ochiai, etc., with fuzzy logic and area sensitivity for banding patterns. Gower, Rank correlation, Canberra metric, Simple Matching, Bray-Curtis, Chebyshev etc. for character data. Categorical coefficient for multi-state data (VNTR, MLST, AB resistance patterns, etc.). Similarity-based clustering: Unweighted pair-grouping (UPGMA), complete linkage (furthest neighbor), single linkage (nearest neighbor), Ward, Centroid, Median, Neighbor Joining, Bio-Neighbor Joining, NeighborNet clustering. Correlation Eliminator and Partial Correlation Eliminator methods. Adjustable trace-to-trace optimization and tolerance settings for banding patterns. Statistical determination of most suitable tolerance settings for banding patterns. Interactive wizard-driven input of parameters, options and choices makes the clustering window more intuitive for users with little statistical background.
- Phylogenetic inference methods. Generalized Parsimony and Maximum Likelihood with standard and simulated annealing or quartet puzzling calculation. Population modeling: Analysis of categorical data such as MLST or VNTR (MLVA) using Minimum Spanning Trees to reconstruct evolutionary models. Advanced presentation and editing tools, including faithful tree representation ('rendered trees').
- Interpretation. Combined display of character images, sequences and normalized pattern images with similarity matrices, sorted according to the dendrogram(s). Indication of statistical error at all linkage levels and calculation of cophenetic correlation. Unrooted and rooted representation for all tree inference methods. Bootstrap analysis for single or composite datasets. Display of sorted similarity matrices, shaded or with numerical similarity values. Numerous edit and publishing functions. Enhanced presentation and printing facilities, in a WYSIWYG environment. Direct interaction between database and dendrogram. Incremental and decremental clustering: new entries can be added to or deleted from existing cluster analyses, without having to recalculate the complete analysis. Transversal clustering: characters and entries can simultaneously be clustered based upon the swapped data matrix. All features of a comparison can be stored to disk.
- Cluster significance. Patented method allowing the reliability of clusters to be calculated and visualized for any clustering algorithm and data set.
- Congruence between techniques. Calculation of global similarity or congruence between different techniques as matrix or dendrogram. Easy visualization of taxonomic depth or level of each technique by pairwise regression plots of similarities.
- Composite cluster analysis. Different data sets of the same type and of different types (fingerprint, character, sequence and matrix) can be combined into one consensus clustering. Calculation of global similarity by merging characters or by averaging experiment-related similarities. Optional weighting based on number of characters or defined by the user.
- Plots and graphs. Creation of 2-D and 3-D bar graphs, contingency tables, 2-D and 3-D scatterplots from database fields and characters. Professional presentation, printing and exporting tools.
© 2009 Applied Maths NV