Mutual information (MI) as a function of the number of gene pairs swapped between clusters.
At each permutation, two genes are chosen at random, one from each of two randomly chosen clusters (there are 30 clusters in all). The genes are swapped, and the MI between cluster membership and attribute possession is recomputed. For convenience, the MI is shown as a fraction of its initial value.
MI decreases monotonically as the genes are swapped, illustrating that it is a good gauge of the quality of the clusters. It does not fall to zero because even with random assignment of genes to clusters, genes are likely to end up in the same cluster coincidentally. Clusters are taken from Tavazoie et al. Two characteristics are evident. First, as expected, MI decreases as the clusters become increasingly disordered with respect to function. Second, after a large enough number of random swaps, MI reaches a nonzero baseline value, reflecting the fact that even for data chosen at random, when the number of clusters is much smaller than the number of genes, there is some degree of mutual information between membership in a particular cluster and possession of certain attributes.
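The swap procedure described in this caption can be sketched as follows. This is a toy Python illustration, not the paper's implementation (which was written in Perl and C); the gene count, cluster assignment, and attribute are our own assumptions, and MI is computed in bits from the empirical joint distribution of cluster label and attribute possession.

```python
import random
from collections import Counter
from math import log2

def mutual_information(clusters, attrs):
    """MI (in bits) between cluster membership and attribute possession."""
    n = len(clusters)
    pc, pa = Counter(clusters), Counter(attrs)
    joint = Counter(zip(clusters, attrs))
    return sum(nca / n * log2((nca / n) / ((pc[c] / n) * (pa[a] / n)))
               for (c, a), nca in joint.items())

random.seed(0)
k = 30                                       # 30 clusters, as in the figure
clusters = [g % k for g in range(600)]       # toy assignment: 20 genes/cluster
attrs = [1 if c < 10 else 0 for c in clusters]   # attribute tied to clusters

mi0 = mutual_information(clusters, attrs)    # initial MI
for _ in range(1000):                        # repeated random swaps
    c1, c2 = random.sample(range(k), 2)      # two random clusters
    i = random.choice([g for g in range(600) if clusters[g] == c1])
    j = random.choice([g for g in range(600) if clusters[g] == c2])
    clusters[i], clusters[j] = clusters[j], clusters[i]   # swap the pair
frac = mutual_information(clusters, attrs) / mi0   # fraction of initial MI
```

After many swaps, `frac` sits well below 1 but above 0, matching the monotone decay to a nonzero baseline described above. Note that swapping labels preserves cluster sizes, so the baseline reflects only the randomization of membership.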
The z-score can then be interpreted as a standardized distance between the MI value obtained by clustering and the MI values obtained by random assignment of genes to clusters. The larger the z-score, the greater this distance; higher scores indicate clustering results more significantly related to gene function.
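A minimal sketch of this standardized score follows; the function and variable names, the number of random assignments, and the toy data in the usage note are our assumptions for illustration.

```python
import random
import statistics
from collections import Counter
from math import log2

def mutual_information(clusters, attrs):
    """MI (in bits) between cluster membership and attribute possession."""
    n = len(clusters)
    pc, pa = Counter(clusters), Counter(attrs)
    joint = Counter(zip(clusters, attrs))
    return sum(nca / n * log2((nca / n) / ((pc[c] / n) * (pa[a] / n)))
               for (c, a), nca in joint.items())

def clustering_zscore(clusters, attrs, n_random=100, seed=0):
    """z = (MI_real - mean(MI_random)) / stdev(MI_random), where MI_random
    comes from random reassignments of genes to the same-sized clusters."""
    mi_real = mutual_information(clusters, attrs)
    rng = random.Random(seed)
    labels = list(clusters)
    mis = []
    for _ in range(n_random):
        rng.shuffle(labels)        # random assignment, cluster sizes preserved
        mis.append(mutual_information(labels, attrs))
    return (mi_real - statistics.mean(mis)) / statistics.stdev(mis)
```

Shuffling the label vector keeps cluster sizes fixed, so the null distribution captures only the randomization of membership, in keeping with the caveat about cluster-size distributions below.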
Clusters to which genes were randomly assigned were chosen to be as nearly uniform in size as possible, so that some of the success of a clustering algorithm relative to random may derive from producing nonuniform cluster size distributions.
It is reasonable to assume that those using clustering methods are seeking fine structure rather than broad structure. For example, in cell-cycle data, genes might be broadly classified according to the phase of the cell cycle in which they peak, yielding perhaps no more than five clusters, corresponding to early G1, late G1, S, G2, and M phases (Cho et al.).
Certainly, this is a correct answer, but it yields little new knowledge. It would be more useful to find those (probably small) groups of genes sharing rather specific biological functions (e.g.). On the other hand, it is no help to classify each gene into its own cluster. Many properties are desirable in an annotation database to improve assessment of relatedness between clustering results and annotation.
We have already described the algorithm and parameters used to reduce the database from the complete annotated genome to a subset of relatively independent attributes that are neither too general (e.g.) nor too specific. We have acknowledged that we are biasing the optimal cluster number in this manner.
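The database-reduction step described above can be sketched as follows. The parameter names N_min, N_max, and U_max come from the text, but the precise overlap criterion, the greedy keep-largest-first order, and the default values here are our assumptions, not the paper's definitions.

```python
def filter_attributes(attr_genes, n_min=5, n_max=200, u_max=0.8):
    """Reduce an annotation database to relatively independent attributes:
    drop attributes held by too few (< n_min) or too many (> n_max) genes,
    then drop attributes whose overlap with an already-kept attribute
    exceeds u_max. The overlap measure and defaults are illustrative."""
    kept = {}
    # consider larger attributes first, so each overlap group's broadest
    # surviving attribute is the one retained
    for name, genes in sorted(attr_genes.items(), key=lambda kv: -len(kv[1])):
        genes = set(genes)
        if not n_min <= len(genes) <= n_max:
            continue                          # too specific or too general
        overlap_ok = all(
            len(genes & other) / min(len(genes), len(other)) <= u_max
            for other in kept.values())
        if overlap_ok:
            kept[name] = genes
    return kept
```

For example, given two identical attributes, only the first encountered is kept, while a disjoint attribute of acceptable size survives alongside it.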
How sensitive are the results to the particular parameter values we choose (N_min, N_max, U_max)?
We have constructed several databases, based on a variety of choices for N_min and U_max. We find that although the particular scores obtained do change with differing choices of these parameters, the basic shape (location of the peak, rolloff at higher k-values, ranking of clustering methods) does not. Also, the relative success of different distance measures is insensitive to the parameters used to filter the attribute database.
First of all, missing expression-data values were imputed using the KNNimpute program (Troyanskaya et al.). Ratio-style data were then log-transformed, and arrays were median-normalized to account for interarray differences. Each gene was median-centered and ranked by standard deviation across arrays. The top genes in this ranking were selected for clustering and standardized across all arrays, so that each gene's expression profile had zero median and unit variance.
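The preprocessing pipeline above (after imputation) can be sketched with NumPy; the number of genes retained and all variable names are placeholders, and imputation is assumed to have been done already.

```python
import numpy as np

def preprocess(ratios, n_top=200):
    """Log-transform ratio data, median-normalize arrays, median-center
    genes, keep the genes with the largest standard deviation, and
    standardize them to zero median and unit variance. Rows are genes,
    columns are arrays."""
    X = np.log2(ratios)                              # log-transform ratios
    X -= np.median(X, axis=0, keepdims=True)         # median-normalize arrays
    X -= np.median(X, axis=1, keepdims=True)         # median-center genes
    top = np.argsort(X.std(axis=1))[::-1][:n_top]    # most variable genes
    X = X[top]
    X -= np.median(X, axis=1, keepdims=True)         # zero median per gene
    X /= X.std(axis=1, keepdims=True)                # unit variance per gene
    return X
```

Note that standardization here divides by the per-gene standard deviation after median-centering, so each retained gene's profile ends up with median zero and variance one, as described in the text.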
We implemented the k-means algorithm with several different distance measures in the Perl programming language (Wall et al.). Although this algorithm has been implemented for gene clustering before, it has not been available in a form that allows user-defined distance measures to be easily substituted. Numerical performance was improved by up to two orders of magnitude through use of C code for the core algorithm, written in-house and interfaced with Perl using SWIG (Beazley et al.).
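The shape of such a pluggable-distance k-means can be sketched as follows (in Python rather than the paper's Perl/C, with invented names). The centroid update is only strictly appropriate for Euclidean-like distances, so substituting other measures makes this a heuristic.

```python
import math
import random

def kmeans(points, k, distance, n_iter=20, seed=0):
    """Minimal k-means that accepts a user-supplied distance function."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # assignment step: nearest center under the chosen distance
        labels = [min(range(k), key=lambda c: distance(p, centers[c]))
                  for p in points]
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centers

def euclidean(p, q):
    """One example distance; any function of two points can be passed in."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

Because `distance` is an argument rather than hard-coded, swapping in a correlation-based or other measure requires no change to the core loop, which is the design property the text describes.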
Hierarchical trees from Cluster were cut into groups based on distance from the root, again using in-house C code glued to Perl with SWIG. We thank G. Berriz, O. King, J. White, P. D'haeseleer, and P. Kharchenko for helpful discussions. We are grateful to G. Berriz, D. Goldberg, O. King, S. Wong, and S. Komili for critical reading of the manuscript.
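The tree-cutting step described above (flattening a hierarchical tree into groups by distance from the root) has a standard SciPy analogue: cut the dendrogram at a merge-height threshold. The toy data and threshold below are our assumptions; the paper used trees produced by the Cluster program and in-house C code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression profiles forming two well-separated groups of genes
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
              [5.0, 5.1], [5.1, 5.0], [5.05, 5.05]])

Z = linkage(X, method='average')          # agglomerative tree
# cut the tree at a height threshold to obtain flat cluster labels
labels = fcluster(Z, t=2.0, criterion='distance')
```

With the threshold between the within-group and between-group merge heights, the cut recovers the two groups; raising or lowering `t` yields coarser or finer partitions of the same tree.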
We believe this paper was improved by taking account of the suggestions of the anonymous referees, and we thank them. The publication costs of this article were defrayed in part by payment of page charges.
Figure 1. Schematic of dataflow in clustering and evaluation.
Judging the Quality of Gene Expression-Based Clustering Methods Using Gene Annotation
References

Aach, J. Genome Res.
Angelo, M.
Ashburner, M.
Beazley, D. O'Reilly Perl Conference 2.
Ben-Hur, A.; Altman, R. World Scientific, Kauai, HI, pp. 6–.
Brown, P.
Cho, R.
Cohen, B. Cell 13: –.
Cover, T.; Schilling, D. Wiley-Interscience, New York.
DeRisi, J. Science: –.
Eisen, M.
Eisen, M. Methods Enzymol.
Everitt, B. Heinemann, London.
Fraley, C. Which clustering method? Answers via model-based cluster analysis.