15.3: Clustering Algorithms - Biology
To analyze gene expression data, it is common to perform clustering analysis. Partitioning methods such as k-means divide the data into a fixed number of disjoint clusters, while agglomerative clustering methods yield a set of nested clusters organized as a hierarchy, representing structure from broader to finer levels of detail.

K-Means Clustering

The k-means algorithm clusters n objects based on their attributes into k partitions. This is an example of partitioning, where each point is assigned to exactly one cluster such that the sum of distances from each point to its correspondingly labeled center is minimized. The motivation underlying this process is to make the most compact clusters possible, usually in terms of a Euclidean distance metric.

The k-means algorithm, as illustrated in figure 15.8, is implemented as follows:

  1. Assume a fixed number of clusters, k
  2. Initialization: Randomly initialize the k means \(\mu_k\) associated with the clusters and assign each data point \(x_i\) to the nearest cluster, where the distance between \(x_i\) and \(\mu_k\) is given by \(d_{i,k} = (x_i - \mu_k)^{2}\).
  3. Iteration: Recalculate the centroid of the cluster given the points assigned to it: \(\mu_{k}(n+1)=\sum_{x_{i} \in k} \frac{x_{i}}{\left|x^{k}\right|}\), where \(\left|x^{k}\right|\) is the number of points with label \(k\). Reassign data points to the \(k\) new centroids by the given distance metric. The new centers are effectively calculated to be the average of the points assigned to each cluster.
  4. Termination: Iterate until convergence or until a user-specified number of iterations has been reached. Note that the iteration may be trapped at some local optima.
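To make the procedure concrete, here is a minimal sketch of these steps in Python with NumPy (illustrative only; in practice one would typically use an optimized implementation such as scikit-learn's KMeans):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch; X is an (n_points, n_dims) array."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k data points at random as the initial means
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment: label each point with the nearest center (squared Euclidean distance)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update: move each center to the centroid of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Termination: stop once the centers no longer move (or after n_iter iterations)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers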

There are several methods for choosing k: simply looking at the data to identify potential clusters, or iteratively trying values of k while penalizing model complexity. We can always make better clusters by increasing k, but at some point we begin overfitting the data.

We can also think of k-means as trying to minimize a cost criterion associated with the size of each cluster, where the cost increases as the clusters get less compact. However, some points can be almost halfway between two centers, which does not fit well with the hard, all-or-nothing assignments of k-means clustering.
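Formally, this cost criterion is the within-cluster sum of squared distances, written in the notation of this section as

\[ J=\sum_{k} \sum_{x_{i} \in k}\left(x_{i}-\mu_{k}\right)^{2}, \]

which k-means minimizes by alternating the assignment and update steps above.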

Fuzzy K-Means Clustering

In fuzzy clustering, each point has a probability of belonging to each cluster, rather than completely belonging to just one cluster. Fuzzy k-means specifically tries to deal with the problem where points are somewhat in between centers or otherwise ambiguous by replacing distance with probability, which of course could be some function of distance, such as having probability relative to the inverse of the distance. Fuzzy k-means uses a weighted centroid based on those probabilities. Processes of initialization, iteration, and termination are the same as the ones used in k-means. The resulting clusters are best analyzed as probabilistic distributions rather than a hard assignment of labels. One should realize that k-means is a special case of fuzzy k-means when the probability function used is simply 1 if the data point is closest to a centroid and 0 otherwise.

The fuzzy k-means algorithm is the following:

  1. Assume a fixed number of clusters k
  2. Initialization: Randomly initialize the k means \(\mu_k\) associated with the clusters and compute the probability that each data point \(x_i\) is a member of a given cluster \(k\), \(P(x_i \text{ has label } k \mid x_i, k)\).
  3. Iteration: Recalculate the centroid of the cluster as the weighted centroid given the probabilities of membership of all data points \(x_i\): \[\mu_{k}(n+1)=\frac{\sum_{x_{i} \in k} x_{i} \times P\left(\mu_{k} \mid x_{i}\right)^{b}}{\sum_{x_{i} \in k} P\left(\mu_{k} \mid x_{i}\right)^{b}}\] and recalculate the updated memberships \(P\left(\mu_{k} \mid x_{i}\right)\) (there are different ways to define membership; here is just one example): \[P\left(\mu_{k} \mid x_{i}\right)=\left(\sum_{j=1}^{k}\left(\frac{d_{ik}}{d_{jk}}\right)^{\frac{2}{b-1}}\right)^{-1}\]
  4. Termination: Iterate until membership matrix converges or until a user-specified number of iterations has been reached (the iteration may be trapped at some local maxima or minima)
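A sketch of one fuzzy k-means iteration in Python, assuming squared Euclidean distances and the membership definition above (variable and function names are illustrative):

import numpy as np

def fuzzy_kmeans_step(X, centers, b=2.0, eps=1e-12):
    """One iteration: update memberships P(mu_k | x_i), then the weighted centroids."""
    # Squared distances between every point x_i and every center mu_k
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps
    # Membership update: for squared distances the exponent 2/(b-1) becomes 1/(b-1)
    ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (b - 1.0))
    P = 1.0 / ratio.sum(axis=2)                     # shape (n_points, k), rows sum to 1
    # Weighted centroid update: mu_k = sum_i P(mu_k|x_i)^b x_i / sum_i P(mu_k|x_i)^b
    W = P ** b
    new_centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return P, new_centers

Iterating these two updates until the membership matrix stops changing reproduces the initialization/iteration/termination scheme described above.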

The b here is the weighting exponent, which controls the relative weights placed on each partition, i.e., the degree of fuzziness. As b → 1, the partitions that minimize the squared error function become increasingly hard (non-fuzzy), while as b → ∞ the memberships all approach 1/k, which is the fuzziest state. There is no theoretical guidance on how to choose an optimal b; empirically useful values lie in [1, 30], and in most studies 1.5 ≤ b ≤ 3.0 worked well.

K-Means as a Generative Model

A generative model is a model for randomly generating observable-data values, given some hidden parameters. While a generative model is a probability model of all variables, a discriminative model provides a conditional model only of the target variable(s) using the observed variables.

In order to make k-means a generative model, we now look at it in a probabilistic manner, where we assume that data points in cluster k are generated using a Gaussian distribution with mean at the center of the cluster and a variance of 1, which gives

\[P\left(x_{i} \mid \mu_{k}\right)=\frac{1}{\sqrt{2 \pi}} \exp \left\{-\frac{\left(x_{i}-\mu_{k}\right)^{2}}{2}\right\}.\]

This gives a stochastic representation of the data, as shown in figure 15.10. Now this turns into a maximum likelihood problem, which, as we will show below, is exactly equivalent to the original k-means algorithm mentioned above.

In the generating step, we want to find the most likely partition, or assignment of labels, for each \(x_i\) given the mean \(\mu_k\). With the assumption that each point is drawn independently, we can look for the maximum likelihood label for each point separately:

\[\arg \max _{k} P\left(x_{i} \mid \mu_{k}\right)=\arg \max _{k} \frac{1}{\sqrt{2 \pi}} \exp \left\{-\frac{\left(x_{i}-\mu_{k}\right)^{2}}{2}\right\}=\arg \min _{k}\left(x_{i}-\mu_{k}\right)^{2}\]

This is totally equivalent to finding the nearest cluster center in the original k-means algorithm.

In the Estimation step, we look for the maximum likelihood estimate of the cluster mean μk, given the partitions (labels):

\[\arg \max _{\mu}\left\{\log \prod_{i} P\left(x_{i} \mid \mu\right)\right\}=\arg \max _{\mu} \sum_{i}\left\{-\frac{1}{2}\left(x_{i}-\mu\right)^{2}+\log \left(\frac{1}{\sqrt{2 \pi}}\right)\right\}=\arg \min _{\mu} \sum_{i}\left(x_{i}-\mu\right)^{2}\]

Note that the solution of this problem is exactly the centroid of the xi, which is the same procedure as the original k-means algorithm.

Unfortunately, since k-means assumes independence between the axes, covariance and variance are not accounted for, so models such as oblong distributions are not possible. However, this issue can be resolved by generalizing the problem into an expectation maximization (EM) problem.

Expectation Maximization

K-means can be seen as an example of EM (expectation maximization algorithms), as shown in figure 15.11, where the expectation step estimates the hidden labels Q and the maximization step maximizes the expected likelihood given the data and Q. Assigning each point the label of the nearest center corresponds to the E step of estimating the most likely label given the previous parameters. Then, using the assignments produced in the E step as observations, moving each centroid to the average of the points assigned to it corresponds to the M step of maximizing the likelihood of the center given the labels. This case is analogous to Viterbi learning. A similar comparison can be drawn for fuzzy k-means, which is analogous to Baum-Welch training of HMMs. Figure 15.12 compares clustering, HMMs, and motif discovery with respect to the expectation maximization algorithm.

It should be noted that using the EM framework, the k means approach can be generalized to clusters of oblong shape and varying sizes. With k means, data points are always assigned to the nearest cluster center. By introducing a covariance matrix to the Gaussian probability function, we can allow for clusters of different sizes. By setting the variance to be different along different axes, we can even create oblong distributions.
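For example, scikit-learn's GaussianMixture fits exactly this kind of EM model; with a full covariance matrix per component it recovers oblong clusters that k-means cannot (a small sketch on synthetic data; all parameter values are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elongated (oblong) clusters: large spread along x, small along y
a = rng.normal([0.0, 0.0], [3.0, 0.3], size=(200, 2))
b = rng.normal([0.0, 4.0], [3.0, 0.3], size=(200, 2))
X = np.vstack([a, b])

# EM with one full 2x2 covariance matrix per component
gmm = GaussianMixture(n_components=2, covariance_type="full", n_init=5, random_state=0)
hard_labels = gmm.fit_predict(X)     # hard assignments, analogous to k-means labels
soft_labels = gmm.predict_proba(X)   # posterior membership probabilities (E step output)
print(gmm.means_)                    # estimated cluster centers
print(gmm.covariances_)              # per-cluster covariance matrices (capture the oblong shape)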

EM is guaranteed to converge and, at least from an algorithmic point of view, to find the best possible answer. The notable problem with this solution is that the existence of local maxima of the probability density can prevent the algorithm from converging to the global maximum. One approach that may avoid this complication is to attempt multiple initializations to better determine the landscape of probabilities.

The limitations of the K-Means algorithm

The k-means algorithm has a few limitations which are important to keep in mind when using it and before choosing it. First of all, it requires a metric. For example, we cannot use the k-means algorithm on a set of words since we would not have any metric.

The second main limitation of the k-means algorithm is its sensitivity to noise. One way to try to reduce the noise is to run a principal component analysis beforehand. Another way is to weight each variable in order to give less weight to the variables affected by significant noise: the weights are calculated dynamically at each iteration of the k-means algorithm [3].

The third limitation is that the choice of initial centers can influence the results. There exist heuristics to select the initial cluster centers, but none of them are perfect.

Lastly, we need to know a priori the number of classes. As we have seen, there are ways to circumvent this problem, essentially by running the algorithm several times while varying k, or by using the rule of thumb \(k \approx \sqrt{n/2}\) if we are short on the computational side. en.Wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set summarizes well the different techniques to select the number of clusters. Hierarchical clustering also provides a handy approach to choosing the number of clusters.
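As an illustration, the rule of thumb and an elbow-style scan over k can be combined in a few lines (using scikit-learn's KMeans; inertia_ is the within-cluster sum of squares, and the function name is illustrative):

import numpy as np
from sklearn.cluster import KMeans

def choose_k(X, k_max=10):
    # Rule-of-thumb starting point: k ~ sqrt(n / 2)
    k_rule = int(round(np.sqrt(len(X) / 2)))
    # Elbow scan: watch how the within-cluster sum of squares drops as k grows,
    # and stop increasing k once the improvement no longer justifies the added complexity
    inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in range(1, k_max + 1)}
    return k_rule, inertias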

Hierarchical Clustering

While the clustering methods discussed thus far often provide valuable insight into the nature of various data, they generally overlook an essential component of biological data, namely the idea that similarity might exist on multiple levels. To be more precise, similarity is an intrinsically hierarchical property, and this aspect is not addressed by the clustering algorithms discussed thus far. Hierarchical clustering specifically addresses this in a very simple manner, and is perhaps the most widely used algorithm for expression data. As illustrated in figure 15.13, it is implemented as follows:

1. Initialization: Initialize a list containing each point as an independent cluster.
2. Iteration: Create a new cluster containing the two closest clusters in the list. Add this new cluster to the list and remove the two constituent clusters from the list.

One key benefit of using hierarchical clustering and keeping track of the times at which we merge certain clusters is that we can create a tree structure that details the times at which we joined every cluster, as can be seen in figure 15.13. Thus, to get a number of clusters that fits your problem, you simply cut at a cut-level of your choice as in figure 15.13 and that gives you the number of clusters corresponding to that cut-level. However, be aware that one potential pitfall with this approach is that at certain cut-levels, elements that are fairly close in space (such as e and b in figure 15.13), might not be in the same cluster.

Of course, a method for determining distances between clusters is required. The particular metric used varies with context, but (as can be seen in figure 15.14) some common implementations include the maximum, minimum, and average distances between constituent clusters, and the distance between the centroids of the clusters.
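With SciPy, for example, the hierarchy can be built under any of these linkage rules and then cut at a chosen level to obtain a flat clustering (a sketch; the linkage method and cut threshold are illustrative):

from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

# X is an (n_points, n_features) array.
# 'single' = minimum, 'complete' = maximum, 'average', and 'centroid'
# correspond to the inter-cluster distance definitions mentioned above.
Z = linkage(pdist(X), method="average")

# Cut the tree at a distance threshold to recover flat cluster labels
labels = fcluster(Z, t=5.0, criterion="distance")

# dendrogram(Z) draws the tree, with merge heights on the vertical axis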

Note that when choosing the closest clusters, calculating all pairwise distances is very time- and space-consuming, so a better scheme is needed. One possible approach is to: 1) define bounding boxes that divide the feature space into several subspaces; 2) calculate pairwise distances within each box; 3) shift the boundaries of the boxes in different directions and recalculate pairwise distances; 4) choose the closest pair based on the results of all iterations.

Evaluating Cluster Performance

The validity of a particular clustering can be evaluated in a number of different ways. The overrepresentation of a known group of genes in a cluster, or, more generally, correlation between the clustering and confirmed biological associations, is a good indicator of validity and significance. If biological data is not yet available, however, there are ways to assess validity using statistics. For instance, robust clusters will appear from clustering even when only subsets of the total available data are used to generate clusters. In addition, the statistical significance of a clustering can be determined by calculating the probability of a particular distribution having been obtained randomly for each cluster. This calculation utilizes variations on the hypergeometric distribution. As can be seen from figure 15.15, we can do this by calculating the probability that we have more than r +'s when we pick k elements from a total of N elements. http://en.Wikipedia.org/wiki/Cluster...tering_results gives several formulas to assess the quality of the clustering.
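A sketch of this calculation with SciPy (variable names follow the description: N elements in total, a cluster of k elements, and r observed '+'; n_plus, the total number of '+' elements, is not named explicitly in the text above):

from scipy.stats import hypergeom

def cluster_enrichment_pvalue(N, n_plus, k, r):
    """Probability of seeing more than r '+' elements when drawing k of N elements,
    n_plus of which are '+' (hypergeometric survival function)."""
    return hypergeom.sf(r, N, n_plus, k)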


Introduction

Clustering analysis has been an emerging research issue in data mining due to its variety of applications. The advent of many data clustering algorithms in recent years, and their extensive use in a wide variety of applications including image processing, computational biology, mobile communication, medicine, and economics, has led to the popularity of these algorithms. The main problem with data clustering algorithms is that they cannot be standardized. An algorithm may give the best result with one type of data set but may fail or give poor results with data sets of other types. Although there have been many attempts to standardize algorithms that perform well in all scenarios, so far no major accomplishment has been achieved. Many clustering algorithms have been proposed, but each algorithm has its own merits and demerits and cannot work in all real situations. Before exploring various clustering algorithms in detail, let's have a brief overview of what clustering is.

Clustering is a process which partitions a given data set into homogeneous groups based on given features such that similar objects are kept in a group whereas dissimilar objects are in different groups. It is the most important unsupervised learning problem. It deals with finding structure in a collection of unlabeled data. For better understanding please refer to Fig I.


Background

Single-cell RNA sequencing (scRNA-seq) enables researchers to study heterogeneity between individual cells and define cell types from a transcriptomic perspective. One prominent problem in scRNA-seq data analysis is the prevalence of dropouts, caused by failures in amplification during the reverse-transcription step in the RNA-seq experiment. The prevalence of dropouts manifests as an excess of zeros and near zero counts in the data set, which has been shown to create difficulties in scRNA-seq data analysis [1, 2].

Several packages have recently been developed for the various aspects of scRNA-seq data analysis, including cell cycle (cyclone [3] and scLVM [4]), normalization (scran [5]), differential expression analysis (scde [2] and MAST [6]), and temporal analysis (Monocle [7]), but few perform preprocessing steps such as dimensionality reduction and clustering, which are critical steps for studying cell-type heterogeneity.

The state-of-the-art dimensionality-reduction package for scRNA-seq data is ZIFA [1]. It implements a modified probabilistic principal component analysis (PCA) method that incorporates a zero inflated model to account for dropout events. ZIFA uses an iterative expectation-maximization algorithm for inference, which makes it computationally intensive for large scRNA-seq data sets.

Another package t-SNE [8] is popular among biologists, but it is not designed specifically for scRNA-seq data and does not address the issue of dropouts. Other recently developed tools, such as BackSPIN [9], pcaReduce [10], SC3 [11], SNN-Cliq [12], RaceID [13], and BISCUIT [14], were designed to deal with optimal clustering of single cells into meaningful groups or hierarchies. Like ZIFA, these algorithms usually involve statistical modeling, which requires estimates of parameters. These algorithms often make use of iterative methods to achieve local or global optimal solutions, and hence they can be slow when processing large data sets of more than several hundred single cells.

In many practical situations, researchers are interested in fast and intuitive clustering results that they can easily visualize. PCA is a common analytical approach for data visualization for sample heterogeneity, and is often used for dimensionality reduction prior to clustering. Many versions of PCA, such as the implementation prcomp in R, are very fast and have routinely been used for analyzing large gene expression data sets. Nonetheless, standard PCA is not designed to take into account dropouts in scRNA-seq data. In this work, we aim to develop a fast PCA-like algorithm that takes dropouts into account.


Methods

The ensemble clustering transformation to categorical space

This section describes the ensemble clustering (EC) transformation that transforms the original data from its original feature space to a categorical space, as illustrated in Fig. 2. The basic algorithm assumes that points belonging to the same cluster are more similar than points that fall in different clusters. In real-world data, this assumption may not always hold, as illustrated in the example presented in Fig. 1. In this example, the data includes two classes (circles and diamonds). If we cluster the data into two clusters, then the left cluster will include two types of classes while the right one will still have all the points from the same class.

Example of clustering data

As a result, we decided to run the clustering algorithm several times. Points belonging to the same cluster in the multiple runs are considered identical points and define a group that will be classified to the same class.

Let D be a set of labeled points used as training data, and A a set of unlabeled data. First, the GrpClassifierEC algorithm creates a new dataset E combining D and A (i.e., \(E = D \cup A\)); then GrpClassifierEC runs the k-means clustering algorithm several times with different values of k (we refer to this as nmc, the number of clusters) and creates the clustering matrix cMat. cMat is a matrix whose i-th row consists of the clustering results of the i-th point in E. See Table 1 for an example of cMat with 20 points and 10 dimensions of categorical features. The first column contains the results of running k-means with k = 2, while the last column contains the results of running k-means with k = 11. The values are the indices of the clusters assigned by k-means. We record the results from k = 2.

Applying the EC transformation to a point \(x_i \in E\) creates a new point \(x_i^{*} \in cMat\) with categorical values. The dimension of \(x_i^{*}\) is nmc − 1. Therefore, applying the EC transformation to the whole data set generates new categorical data (EC data) consisting of l points with nmc − 1 categorical features.

The new dimension, nmc − 1, is usually much smaller than the original data dimension (nmc − 1 ≪ N in Fig. 2). More interestingly, the new EC data can itself be reduced, since it contains identical points; we explain this in more detail in the section "Reduction of the data". Identical points that share the same clusters over all the iterations of k-means are represented as the same point in cMat; as a result, those points are considered to be one point, and all the identical points define a group. For example, in Table 1, point 11, point 12, and point 20 have the same categorical values, so the vector that represents those 3 points is g = (c0, c2, c2, c2, c4, c5, c6, c5, c5, c4). As a result, we consider those 3 points as a single point g that we refer to as a unique point. In other words, each group is represented by one unique point.

The workflow for creating the EC categorical space based on the k-means clustering algorithm. The original data is the input to the workflow. The outcome is a new dataset named EC data in a categorical space with dimension k. The sign ≪ indicates that k is dramatically smaller than the original data dimension N
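A sketch of the EC transformation as described above (run k-means with k = 2, …, nmc, stack the cluster indices as categorical columns of cMat, and collapse identical rows into groups; function and variable names are illustrative, not the authors' code):

import numpy as np
from sklearn.cluster import KMeans

def ec_transform(E, nmc=30, random_state=0):
    """Build cMat: column j holds the k-means cluster indices obtained with k = j + 2."""
    cols = [KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(E)
            for k in range(2, nmc + 1)]
    cMat = np.column_stack(cols)          # shape (n_points, nmc - 1), categorical values
    # Identical rows of cMat define a group, represented by one unique point
    unique_points, group_ids = np.unique(cMat, axis=0, return_inverse=True)
    return cMat, unique_points, group_ids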

Note that the set E contains labeled and unlabeled points, and as a result the groups may contain labeled and unlabeled points. Generally, there are three possible cases for the identical points in the same group:

The labeled points all have the same class label: the unlabeled points will be classified with this label.

The labeled points have different class labels: here the group points will be classified as the majority class.

None of the points are labeled: in this case, the group is unclassified and the algorithm classifies it based on the nearest labeled group.

To this end, we define a purity measurement for a given group in order to evaluate the purity of the grouping process. The purity measurement is based mainly on the probabilities of the labeled objects as follows:

where group \(i\) is represented by a vector in the matrix \(G\), \(\#classes\) denotes the number of classes appearing in group \(i\), and the formula uses the probability of each class \(j\) within group \(i\). As can be seen, the purity of a group equals 1 when the group is pure and \(\frac{1}{\#classes}\) at the lowest purity, which decreases as the number of classes increases.

The k-means algorithm is known to have a time complexity of O(n²), where n is the input data size. The complexity of the EC transformation is therefore O(k·n²), where k is the number of times we run k-means. In fact, this is the heaviest computational part of the GrpClassifierEC algorithm.

GrpClassifierEC—ensemble clustering based classifier

The GrpClassifierEC pseudo code is presented in Algorithm 2. The input to the classifier is the cMat matrix generated by the EC transformation described in Algorithm 1. The first step of GrpClassifierEC is creating the groups extracted from cMat: groups = {group_i}, where i = 1, …, s and s is the number of groups. The number of groups is influenced by nmc, the number of times we run k-means. For instance, if we run k-means with nmc = 1, then all the points will be assigned to one cluster, which means that we have just one group containing all the data points. As seen from Table 2, for the data Cercopithecidae vs Malvacea we have 449 groups with nmc = 30, while with the same data and nmc = 50 we have 593 groups (in Table 3, #EC_Samples is equal to the number of groups). The number of groups increases as nmc increases and might reach the number of points in the data, which means that each group will host one point in categorical values.

Groups can have different sizes (the size is the number of categorical points belonging to the group). As seen from Table 2, a group can have just one point: we see 305 different groups (unique points) with size 1, while 68 groups (unique points) have size 2. We also see one group with size 31, which is the maximum size in this specific data set.

Following the step of creating the groups, we suggest our novel approach for classification: randomly selecting one point from each group. The label of the selected point becomes the label of all points belonging to the group. The process of selecting a random point and assigning its label to its group is repeated r times. The GrpClassifierEC classifier produces a list named prd_set that contains the prediction results. In order to calculate the performance, we then run a scorer function. The scorer function compares the assigned label and the original label of each point in order to obtain the confusion matrix. Accuracy statistics such as True-Positives, False-Positives, True-Negatives, False-Negatives, Recall, Precision, Sensitivity, Specificity, F-measure, as well as the overall accuracy and Cohen's kappa, are calculated.
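The group-labeling step can be sketched as follows (one randomly selected representative per group passes its label to the whole group; this is an illustrative reading of the description above, not the published implementation, and it omits the r repetitions and the handling of fully unlabeled groups):

import numpy as np

def classify_by_groups(group_ids, labels, seed=0):
    """labels: array of known class labels, with None for unlabeled points."""
    rng = np.random.default_rng(seed)
    predictions = np.array(labels, dtype=object)
    for g in np.unique(group_ids):
        members = np.where(group_ids == g)[0]
        # Pick one member at random; its label (if known) becomes the group's prediction
        chosen = rng.choice(members)
        if labels[chosen] is not None:
            predictions[members] = labels[chosen]
    return predictions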

Reduction of the data

Table 2 shows the output of the EC procedure with k = 30 applied on the data Cercopithecidae vs Malvacea that contains 894 examples (points). The table also shows that the EC data has 449 unique points or groups, a 50% reduction in the size of the original data (449/894 = 0.5).

For each group (unique point), we measure its size, equal to the number of times this unique point appears in the EC data. For example, in Table 2, we have 305 unique points with size 1; all these points appear once in the new data space. In addition, we have 68 unique points of size 2, each of which appears twice in the data. There are 22 points with size 3: each of these 22 unique points appears 3 times in the data. Note that the labels are not included in the EC data. This means that the points of a group in the EC space can have different labels associated with the original points and still share the same group.

Figure 3 shows the distribution of the group sizes for nmc = 30 and nmc = 50, and clearly indicates that as nmc increases, the number of groups with size 1 also increases. The expectation is that, as we increase the value of nmc, the number of groups of size 1 approaches the original number of points; in other words, each point will be hosted in one cluster. This actually raises a scientific question: what is the optimal value of nmc that will improve the performance of the classifier, or more specifically, capture the nature of the data in terms of clusters? Answering this question requires additional future research.

Distribution of the group sizes, comparing nmc = 30 and nmc = 50

Experiments on numerical datasets

To evaluate the performance of the new classifier GrpClassifierEC, we compared its results to the k-nearest neighbors, decision tree, and random forest classification algorithms. We tested it on 10 biological datasets and compared the performance of each algorithm. The results show that the new algorithm using ensemble clustering was superior and outperformed the other baseline algorithms on most of the datasets.

Datasets

The data consists of microRNA precursor sequences, each made up of the 4 nucleotide letters. The length of each precursor sequence is about 70 nucleotides. The source of this data is miRbase [18]. Part of the data we have used comes from other studies [19,20,21], including our own study [16].

One simple way of representing sequences that consist of 4 nucleotide letters is by employing k-mer frequencies. The k-mer counts in a given sequence are normalized by the length of the sequence.
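For instance, normalized k-mer frequencies for a single sequence can be computed as follows (a sketch; the choice k = 3 and the example sequence are illustrative):

from collections import Counter

def kmer_frequencies(seq, k=3):
    """Count every k-mer in seq and normalize by the sequence length."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n / len(seq) for kmer, n in counts.items()}

# Example: kmer_frequencies("UGAGGUAGUAGGUUGUAUAGUU", k=3)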

Our features include k-mer frequencies, distance features recently suggested by Yousef et al. [19], and secondary features suggested by [22]. Many additional features describing pre-miRNAs have also been proposed [23] and are included in the feature set, which numbers 1038 features.

The main data consists of information from 15 clades (Table 4). The Homo sapiens sequences were taken out of the data of their clade, Hominidae. Homologous sequences were removed from the dataset and only one representative was kept. Each clade can serve as the positive examples or as the negative examples. Considering all the different combinations of pairs of clades (positive/negative), it is possible to generate 256 datasets. We selected 10 datasets at random, presented in Table 5.

Implementation

We have implemented GrpClassifierEC in Knime [24]. We decided to use the free and open-source platform Knime due to its simplicity and very useful graphical presentations; additionally, Knime is a highly integrative tool. The Knime workflow consists of two parts. The first part performs the EC transformation as described in Algorithm 1. This part is time consuming: for example, it took 13 min to generate the EC matrix for an input file consisting of 1038 features and 1068 points. The run was performed on a laptop with an Intel® Core™ i7-7600U CPU @ 2.80 GHz (2.90 GHz) with 16 GB RAM.

Model performance evaluation

We tested a different number of EC clusters using the k-means clustering algorithm with nmc values from 10 to 50. For each level, we performed 100 iterations with equal sample size, and then calculated the mean of each of the performance measurements described below.

For each established model we calculated a number of performance measures for the evaluation of the classifier such as sensitivity, specificity, and accuracy according to the following formulas (TP: True Positive, FP: False Positive, TN: True Negative, and FN False Negative classifications):
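In standard form (and presumably the form intended here), these are

\[ \text{Sensitivity}=\frac{TP}{TP+FN}, \qquad \text{Specificity}=\frac{TN}{TN+FP}, \qquad \text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}. \]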


DropClust: efficient clustering of ultra-large scRNA-seq data

Droplet based single cell transcriptomics has recently enabled parallel screening of tens of thousands of single cells. Clustering methods that scale for such high dimensional data without compromising accuracy are scarce. We exploit Locality Sensitive Hashing, an approximate nearest neighbour search technique to develop a de novo clustering algorithm for large-scale single cell data. On a number of real datasets, dropClust outperformed the existing best practice methods in terms of execution time, clustering accuracy and detectability of minor cell sub-types.

Figures

(A) 2D embedding of 20K PBMC transcriptomes, chosen randomly from the…

Barplot depicting the number of estimated Gaussian components for each of the top…

Bars show the ARI indexes obtained by comparing clustering outcomes with cell-type annotations.

Localization of PBMC transcriptomes of same type (based on annotation) on the 2D…

Clustering of ∼68K PBMC data. dropClust based visualization (a modified version of tSNE)…

Trend of increase in analysis (preprocessing, clustering and visualization) time for different pipelines…

Detectability of minor cell types. Bars showing average of F1-scores, obtained…

(A) Boxplots depicting average Silhouette scores computed on 100 bootstrap samples…


The many different clustering algorithms

There are several variants in the family of clustering algorithms: K-means, hierarchical, DBSCAN, spectral, Gaussian mixture, BIRCH, mean shift, and affinity propagation are some of them. Below I am highlighting some key points on the first three algorithms, the most commonly applied ones.

K-means: First, “K” refers to the number of clusters you want. That is, K = n means n number of clusters to be identified. Then there’s something called “centroid”, which is an imaginary/artificial data point (an average of data points) around which each cluster of data is partitioned. So K = 2 means that the algorithm will partition the observations (data) into 2 clusters such that the distances between the centroids and observations are minimized.

Advantages: simple to understand, easy to implement

Disadvantages: sometimes difficult to choose K; outliers can drag the centroid in their direction; scaling the data can change the clusters

Hierarchical clustering: Hierarchical clustering works in two different ways. The first is called "bottom-up" or agglomerative clustering, where each observation starts in its own cluster, then pairs of clusters are merged together to form larger clusters, and so on. The other (a.k.a. "top-down" or divisive clustering) works in the opposite direction: all observations start in one cluster, which is then repeatedly divided into smaller clusters.

Advantages: easy to implement; the number of clusters is easy to identify by looking at the dendrogram; more informative than K-means clustering

Disadvantages: highly sensitive to outliers; can be time consuming for large datasets

DBSCAN: Proposed in 1996, it is a density-based algorithm, where observations are clustered based on how close they are to each other given a minimum number of points. It takes two parameters: (i) ε (epsilon), the radius within which points must lie to belong to the same cluster, and (ii) minPts, the minimum number of points required to form a dense region/cluster. Interestingly enough, the 1996 paper that proposed this algorithm won the "Test of Time Award" at the 2014 KDD conference.

Advantages: unlike K-means and hierarchical clustering, DBSCAN is robust in the presence of outliers and thus can be used for anomaly (i.e., outlier) detection.

Disadvantages: it is sensitive to parameter values (ε and minPts); fails to identify clusters appropriately when data density varies.
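A minimal scikit-learn usage sketch (the eps and min_samples values are illustrative and must be tuned per dataset):

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X)   # X: (n_samples, n_features) array
labels = db.labels_                          # cluster ids; -1 marks noise/outlier points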


Clustering Challenges in Biological Networks

This volume presents a collection of papers dealing with various aspects of clustering in biological networks and other related problems in computational biology. It consists of two parts, with the first part containing surveys of selected topics and the second part presenting original research contributions. This book will be a valuable source of material to faculty, students, and researchers in mathematical programming, data analysis and data mining, as well as people working in bioinformatics, computer science, engineering, and applied mathematics. In addition, the book can be used as a supplement to any course in data mining or computational/systems biology.

  • Surveys of Selected Topics:
    • Fixed-Parameter Algorithms for Graph-Modeled Data Clustering (Hüffner et al.)
    • Probabilistic Distance Clustering: Algorithm and Applications (C Iyigun & A Ben-Israel)
    • Analysis of Regulatory and Interaction Networks from Clusters of Co-expressed Genes (E Yang et al.)
    • Graph-based Approaches for Motif Discovery (E Zaslavsky)
    • Statistical Clustering Analysis: An Introduction (H Zhang)
    • Diversity Graphs (P Blain et al.)
    • Identifying Critical Nodes in Protein-Protein Interaction Networks (V Boginski & C W Commander)
    • Faster Algorithms for Constructing a Concept (Galois) Lattice (V Choi)
    • A Projected Clustering Algorithm and Its Biomedical Application (P Deng & W Wu)
    • Graph Algorithms for Integrated Biological Analysis, with Applications to Type 1 Diabetes Data (J D Eblen et al.)
    • A Novel Similarity-based Modularity Function for Graph Partitioning (Z Feng et al.)
    • Mechanism-based Clustering of Genome-wide RNA Levels: Roles of Transcription and Transcript-Degradation Rates (S Ji et al.)
    • The Complexity of Feature Selection for Consistent Biclustering (O E Kundakcioglu & P M Pardalos)
    • Clustering Electroencephalogram Recordings to Study Mesial Temporal Lobe Epilepsy (C-C Liu et al.)
    • Relating Subjective and Objective Pharmacovigilance Association Measures (R K Pearson)
    • A Novel Clustering Approach: Global Optimum Search with Enhanced Positioning (M P Tan & C A Floudas)


    Fixed-Parameter Algorithms for Graph-Modeled Data Clustering

    Fixed-parameter algorithms can efficiently find optimal solutions to some NP-hard problems, including several problems that arise in graph-modeled data clustering. This survey provides a primer about practical techniques to develop such algorithms; in particular, we discuss the design of kernelizations (data reductions with provable performance guarantees) and depth-bounded search trees. Our investigations are circumstantiated by three concrete problems from the realm of graph-modeled data clustering for which fixed-parameter algorithms have been implemented and experimentally evaluated, namely CLIQUE, CLUSTER EDITING, and CLIQUE COVER.

    Probabilistic Distance Clustering: Algorithm and Applications

    The probabilistic distance clustering method of the authors [2, 8], assumes the cluster membership probabilities given in terms of the distances of the data points from the cluster centers, and the cluster sizes. A resulting extremal principle is then used to update the cluster centers (as convex combinations of the data points), and the cluster sizes (if not given.) Progress is monitored by the joint distance function (JDF), a weighted harmonic mean of the above distances, that approximates the data by capturing the data points in its lowest contours. The method is described, and applied to clustering, location problems, and mixtures of distributions, where it is a viable alternative to the Expectation–Maximization (EM) method. The JDF also helps to determine the “right” number of clusters for a given data set.

    Analysis of Regulatory and Interaction Networks from Clusters of Co-expressed Genes

    Extracting biological insight from high-throughput genomic studies of human diseases remains a major challenge, primarily due to our inability to recognize, evaluate and rationalize the relevant biological processes recorded from vast amounts of data.

    We will discuss an integrated framework combining fine-grained clustering of temporal gene expression data, selection of maximally informative clusters based on their ability to capture the underlying dynamic transcriptional response, and the subsequent analysis of the resulting network of interactions among genes in individual clusters. The latter are developed based on the identification of common regulators among the genes in each cluster through mining literature data. We characterize the structure of the networks in terms of fundamental graph properties, and explore the biological implications of the scale-free character of the resulting graphs. We demonstrate the biological importance of the highly connected hubs of the networks and show how these can be further exploited as targets for potential therapies during the early onset of inflammation and for characterizing the mechanism of action of anti-inflammatory drugs. We conclude by identifying two possible challenges in network biology, namely, the nature of the interactions and the potentially limited information content of the temporal gene expression experiments, and discuss expected implications.

    Graph-based Approaches for Motif Discovery

    Sequence motif finding is a very important and long-studied problem in computational molecular biology. While various motif representations and discovery methods exist, a recent development of graph-based algorithms has allowed practical concerns, such as positional correlations within motifs, to be taken into account. This survey provides an overview of the multi-partite graph formulation of motif finding, and focuses on algorithmic aspects of various motif discovery methodologies.

    Motif finding has been recast as a number of different graph substructure identification problems. First we review a formulation as a maximum-weight clique finding problem, and examine two different integer linear programs to model it. The motif finding algorithms use graph pruning techniques and a cutting planes approach in conjunction with linear programming relaxations. Secondly, we discuss a formulation of motif discovery as that of maximum density subgraph finding, and review a maximum flow based algorithm in an appropriately augmented flow network. Finally, we mention the ‘subtle’ motifs formulation, and define its corresponding graph problem of maximal clique identification. We discuss two different approaches to tackle this problem, one based on winnowing spurious edges and the other on divide-and-conquer sub-clique finding.

    Statistical Clustering Analysis: An Introduction

    Clustering analysis segments the objects in a dataset into meaningful subsets such that objects with high similarity are placed in the same subset and objects with low similarity are placed in different subsets. This chapter introduces three fundamental topics in clustering analysis: the definition of similarity and dissimilarity measures, the clustering algorithm, and determining the number of clusters. For each topic, we introduce the approaches that are most popularly used and emphasize their statistical backgrounds.

    Diversity Graphs

    Bipartite graphs have long been used to study and model matching problems, and in this paper we introduce the bipartite graphs that explain a recent matching problem in computational biology. The problem is to match haplotypes to genotypes in a way that minimizes the number of haplotypes, a problem called the Pure Parsimony problem. The goal of this work is not to address the computational or biological issues but rather to explore the mathematical structure through a study of the underlying graph theory.

    Identifying Critical Nodes in Protein-Protein Interaction Networks

    In recent years, the study of biological networks has increased dramatically. These problems have piqued the interest of researchers in many disciplines from biology to mathematics. In particular, many problems of interest to biological scientists can be modeled as combinatorial optimization problems and studied by operations researchers. In this chapter, we consider the problem of identifying the critical nodes of a network and its potential applications to protein-protein interaction networks. More specifically, we are interested in determining the smallest set of nodes whose removal from the graph maximally disconnects the network. Recent techniques for identifying critical nodes in telecommunication networks are applied to the study of protein-protein interaction graphs and the results are analyzed.

    Faster Algorithms for Constructing a Concept (Galois) Lattice

    In this paper, we present a fast algorithm for constructing a concept (Galois) lattice of a binary relation, including computing all concepts and their lattice order. We also present two efficient variants of the algorithm, one for computing all concepts only, and one for constructing a frequent closed itemset lattice. The running time of our algorithms depends on the lattice structure and is faster than all other existing algorithms for these problems.

    A Projected Clustering Algorithm and Its Biomedical Application

    Projected clustering is concerned with clustering data in high dimensional space where data is more likely correlated in subspaces of full dimensions. Recently, several projected clustering algorithms that focus on finding specific projection for each cluster have been proposed. We find that, besides distance, the closeness of points in different dimensions also depends on the distributions of data along those dimensions. Based on this, we propose a projected clustering algorithm, IPROCLUS (Improved PROCLUS), which is efficient and accurate in handling data in high dimensional space. According to the experimental results on randomly generated synthetic data, our algorithm shows much higher accuracy for the scaled datasets and lower dependence on one of user inputs than PROCLUS. We also apply IPROCLUS on real biomedical data and show that it can achieve much better accuracy than PROCLUS.

    Graph Algorithms for Integrated Biological Analysis, with Applications to Type 1 Diabetes Data

    Graph algorithms can be effective tools for analyzing the immense data sets that frequently arise from high-throughput biological experiments. A major computational goal is to identify dense subgraphs, from which one can often infer some form of biological meaning. In this paper, new techniques are devised and analyzed in an effort to improve the quality and relevance of these subgraphs, and to extend the utility of clique-centric methods that may produce them. Using non-obese diabetic mice as a target organism, the paraclique algorithm is tested on transcriptomic data under various parameters in order to determine how it can best be tuned to applications. The use of proteomic anchors is also discussed in an effort to help guide subgraph selection in the presence of inhomogeneous data, which is an important but notoriously difficult problem in its own right.

    A Novel Similarity-based Modularity Function for Graph Partitioning

    Graph partitioning, or network clustering, is an essential research problem in many areas. Current approaches, however, have difficulty splitting two clusters that are densely connected by one or more “hub” vertices. Further, traditional methods are less able to deal with very confused structures. In this paper we propose a novel similarity-based definition of the quality of a partitioning of a graph. Through theoretical analysis and experimental results we demonstrate that the proposed definition largely overcomes the “hub” problem and outperforms existing approaches on complicated graphs. In addition, we show that this definition can be used with fast agglomerative algorithms to find communities in very large networks.

    Mechanism-based Clustering of Genome-wide RNA Levels: Roles of Transcription and Transcript-Degradation Rates

    DNA array techniques invented over a decade ago enable biologists to measure tens of thousands of mRNA levels in cells simultaneously as functions of environmental perturbations. In a few cases the same technique has been employed to measure not only genome-wide transcript levels (TL) but also the associated transcription rates (TR) simultaneously. Since TL is determined by the balance between two opposing processes, i.e., transcription and transcript degradation, simple theoretical considerations indicate that it would be impossible to determine TR based on TL data alone. This conclusion is supported by the finding that TL and TR do not always vary in parallel. In fact, the genome-wide measurements of TL and TR in budding yeast undergoing glucose-galactose shift indicate that TL can decrease even though TR increases and TL can increase despite the fact that TR decreases. These counter-intuitive findings cannot be accounted for unless transcript-degradation rates (TD) are also taken into account. One of the main objectives of this contribution is to derive a mathematical equation relating TL to TR and TD. Based on this equation, it was predicted that there would be 9 different mechanisms by which TL can be altered in cells. The TL and TR data measured in budding yeast demonstrate that all of the 9 predicted mechanisms are found to be activated in budding yeast during glucose-galactose shift, except Mechanisms 5 (i.e., decreasing TL with no change in TR) and 9 (i.e., no change in TL nor in TR). It was also shown that the opposite changes in the mRNA levels of glycolytic and respiratory genes observed between 5 and 360 minutes following the glucose-galactose shift could be quantitatively accounted for in terms of what is referred to as the transcript-degradation/transcription (D/T) ratios calculated here for the first time. Our results suggest that the predicted 9 mechanisms of controlling TL may be employed to cluster the genome-wide measurements of mRNA levels as a means to characterize the functional states of both normal and diseased cells.

    The Complexity of Feature Selection for Consistent Biclustering

    Biclustering is simultaneous classification of the samples and features in a way that samples from the same class have similar values for that class' characteristic features. A biclustering is consistent if in each sample (feature) from any set, the average expression of features (samples) that belong to the same class is greater than the average expression of features (samples) from other classes. Supervised biclustering uses a training set to classify features whose consistency is achieved by feature selection. The worst case complexity of this feature selection process is studied.

    Clustering Electroencephalogram Recordings to Study Mesial Temporal Lobe Epilepsy

    Brain connectivity is known to have a substantial influence over brain function and its underlying information processes. In this chapter, a novel graph-theoretic approach is introduced to investigate the connectivity among brain regions through electroencephalogram (EEG) recordings acquired from a patient with mesial temporal lobe epilepsy (MTLE). The first step of the proposed approach is to transform the brain connectivity behavior into a complete graph. The connectivity for each pair of brain regions is first quantified by the cross mutual information (CMI) measure, and the maximum clique algorithm is subsequently applied to find the clique of maximum size, which contains a group of highly connected brain regions. The CMI is known to have the ability to capture the connectivity between EEG signals. The adopted maximum clique algorithm can reduce the complexity of the clustering procedure for finding the maximally connected brain regions. The proposed graph-theoretic approach offers better assessments to visualize the structure of the brain connectivity over time. The results indicate that the maximally connected brain regions prior to seizure onsets were where the impending seizure was initiated. Furthermore, the proposed approach may be used to improve the outcome of epilepsy surgery by identifying the seizure onset region(s) correctly.

    Relating Subjective and Objective Pharmacovigilance Association Measures

    The field of pharmacovigilance is concerned with the detection and interpretation of associations between drugs and adverse medical events that may be related to the use of those drugs. These associations can be measured in various ways, and this paper considers five: two are aggregate statistical measures derived from an entire adverse event database, two are case-specific objective measures, and one is a subjective measure related to the way adverse events are reported. Examination of the available data suggests that these measures are all interrelated, but in a complicated manner. This finding motivates the use of cluster analysis to explore these relationships, with the ultimate objective of constructing an index of blame that quantifies the tendency for some drugs to be subjectively blamed for adverse events even in the absence of objective evidence for an association with those events.

    A Novel Clustering Approach: Global Optimum Search with Enhanced Positioning

    Cluster analysis of genome-wide expression data from DNA microarray hybridization studies is a useful tool for identifying biologically relevant gene groupings. It is hence important to apply a rigorous yet intuitive clustering algorithm to uncover these genomic relationships. In this study, we describe a novel clustering algorithm framework based on a variant of the Generalized Benders Decomposition, denoted as the Global Optimum Search [2, 19, 21, 23, 51], which includes a procedure to determine the optimal number of clusters to be used. The approach involves a pre-clustering of data points to define an initial number of clusters and the iterative solution of a Linear Programming problem (the primal problem) and a Mixed-Integer Linear Programming problem (the master problem), which are derived from a Mixed Integer Nonlinear Programming problem formulation. Badly placed data points are removed to form new clusters, thus ensuring tight groupings amongst the data points and incrementing the number of clusters until the optimum number is reached. We apply the proposed clustering algorithm to experimental DNA microarray data centered on the Ras signaling pathway in the yeast Saccharomyces cerevisiae and compare the results to those obtained with some commonly used clustering algorithms. Our algorithm compares favorably against these algorithms in the aspects of intra-cluster similarity and inter-cluster dissimilarity, often considered two key tenets of clustering. Furthermore, our algorithm can predict the optimal number of clusters, and the biological coherence of the predicted clusters is analyzed through gene ontology.


    Chapter 8 Classification

    Imagine you have RNA-seq of a collection of labeled normal lung and lung cancer tissues. Given a new sample of RNA-seq from the lung with unknown diagnosis, will you be able to predict based on the existing labeled samples and the expression data whether the new sample is normal or tumor? This is a sample classification problem, and it could be solved using unsupervised and supervised learning approaches.

    Unsupervised learning is basically clustering or dimension reduction. You can use hierarchical clustering, MDS, or PCA. After clustering and projecting the data to lower dimensions, you examine the labels of the known samples (hopefully they cluster into separate groups by label). Then you can assign a label to the unknown sample based on its distance to the known samples.

    Supervised learning considers the labels of the known samples and tries to identify features that can separate the samples by label. Cross validation is conducted to evaluate the performance of different approaches and avoid overfitting.

    StatQuest has done an amazing job with machine learning, with a full playlist of well-organized videos. While the full playlist is worth a full course, for the purpose of this course we will just highlight a number of widely used approaches. They include logistic regression (considered statistical machine learning), K nearest neighbors, random forest, and support vector machines (considered computer science machine learning).
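A sketch of how such approaches might be compared with cross-validation in scikit-learn (the feature matrix X and label vector y are assumed to come from the labeled RNA-seq samples; model settings are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "random forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "support vector machine": SVC(kernel="rbf"),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f} +/- {scores.std():.3f}")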


    TimesVector: a vectorized clustering approach to the analysis of time series transcriptome data from multiple phenotypes

    Motivation: Identifying biologically meaningful gene expression patterns from time series gene expression data is important to understand the underlying biological mechanisms. To identify significantly perturbed gene sets between different phenotypes, analysis of time series transcriptome data requires consideration of time and sample dimensions. Thus, the analysis of such time series data seeks to search gene sets that exhibit similar or different expression patterns between two or more sample conditions, constituting the three-dimensional data, i.e. gene-time-condition. Computational complexity for analyzing such data is very high, compared to the already difficult NP-hard two dimensional biclustering algorithms. Because of this challenge, traditional time series clustering algorithms are designed to capture co-expressed genes with similar expression pattern in two sample conditions.

    Results: We present a triclustering algorithm, TimesVector, specifically designed for clustering three-dimensional time series data to capture distinctively similar or different gene expression patterns between two or more sample conditions. TimesVector identifies clusters with distinctive expression patterns in three steps: (i) dimension reduction and clustering of time-condition concatenated vectors, (ii) post-processing clusters for detecting similar and distinct expression patterns and (iii) rescuing genes from unclassified clusters. Using four sets of time series gene expression data, generated by both microarray and high throughput sequencing platforms, we demonstrated that TimesVector successfully detected biologically meaningful clusters of high quality. TimesVector improved the clustering quality compared to existing triclustering tools and only TimesVector detected clusters with differential expression patterns across conditions successfully.

    Availability and implementation: The TimesVector software is available at http://biohealth.snu.ac.kr/software/TimesVector/.

    Contact: [email protected]

    Supplementary information: Supplementary data are available at Bioinformatics online.



