Bi-CoPaM

Binarisation of Consensus Partition Matrices (Bi-CoPaM) is a consensus clustering method with distinct features that address important biological aspects.

An implementation is freely available as the ‘uncles’ function in the R package ‘UNCLES’.

For biologists – What does this method do?

Given a set of gene expression datasets, this method aims to identify the subsets of genes that are consistently co-expressed (correlated) in every one of these datasets, that is, the genes whose expression goes up together and goes down together in all of the datasets. Genes that are consistently co-expressed over multiple datasets are likely to be co-regulated as well, that is, regulated together by a common molecular mechanism. However, conventional computational methods, and clustering methods in particular, can only analyse a single dataset at a time. Moreover, merging multiple datasets into a single dataset for analysis by such methods is infeasible, especially when the datasets are heterogeneous and were produced using different technologies (one-colour microarrays, two-colour microarrays, or next-generation sequencing (NGS)), on different platforms, in different years and laboratories, and under different conditions.

Another key and distinctive feature of Bi-CoPaM is that it is tuneable. Although genome-wide data are usually provided to the method, the resulting clusters do not usually include all of the input genes. For example, the input can be the human genome with about 20,000 genes, and the output may be a few clusters containing a few hundred genes in total. The rest of the genes are excluded from all of the clusters in this case. This filtering is embedded within the Bi-CoPaM process, and some parameters can be set manually to further tune how tight or loose the clusters should be.
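The effect of tuning can be illustrated with a minimal Python sketch. It assumes that each gene already has a consensus membership value between zero and one for each cluster, and it uses a single value threshold to decide which genes are kept. The membership values, the gene names, and the `threshold` parameter are hypothetical and chosen only for illustration; the tuning parameters actually used by Bi-CoPaM are described in the reference.

```python
# Hypothetical consensus membership values (0 to 1) for four genes in two clusters.
consensus = {
    "geneA": [1.0, 0.0],
    "geneB": [0.75, 0.25],
    "geneC": [0.5, 0.5],
    "geneD": [0.0, 1.0],
}


def binarise(consensus, threshold):
    """Keep a gene in each cluster where its membership reaches the threshold.

    This simple value threshold is an assumption made for illustration.
    """
    return {
        gene: [c for c, value in enumerate(memberships) if value >= threshold]
        for gene, memberships in consensus.items()
    }


def assigned_genes(assignment):
    """Return the genes that ended up in at least one cluster."""
    return sorted(gene for gene, clusters in assignment.items() if clusters)


tight = binarise(consensus, threshold=1.0)  # only unanimous memberships survive
loose = binarise(consensus, threshold=0.5)  # weaker memberships are accepted

print(assigned_genes(tight))  # ['geneA', 'geneD']
print(assigned_genes(loose))  # ['geneA', 'geneB', 'geneC', 'geneD']
print(loose["geneC"])         # [0, 1]
```

With the tight setting, geneB and geneC are excluded from all clusters; with the loose setting every gene is assigned, and geneC appears in both clusters.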

Brief technical explanation:

Despite the infeasibility of combining multiple heterogeneous gene expression datasets, Bi-CoPaM is able to analyse multiple datasets collectively. This is done by first applying conventional clustering methods to each one of the datasets independently, which in effect maps the data from a gene-sample space into a gene-cluster space, in which the datasets become homogeneous. The mapping is needed because the values of the genes in any given sample of the original datasets represent gene expression, which depends strongly, in terms of its dynamic range, statistical distribution, biological interpretation and other attributes, on the technology and platform adopted in that particular dataset, and on the biological conditions of the samples. Such gene expression values are therefore not directly comparable or combinable with expression values in the rest of the datasets. After clustering has been applied to each of the datasets individually, the resulting partitions assign each gene to each one of the clusters with a membership value ranging from zero (the gene does not belong to the cluster) to unity (the gene belongs to the cluster). These membership, or belongingness, values are directly comparable across the different datasets, and they are combined to produce the final result of the Bi-CoPaM.
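The mapping from a gene-sample space to a gene-cluster space can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not the UNCLES implementation: the two toy datasets, the per-gene normalisation, and the tiny k-means routine with fixed seed genes are all hypothetical. The seeds are the same genes in both datasets, so the cluster labels happen to be aligned; in general, clusters from different runs would need to be matched before they can be compared.

```python
import math


def normalise(profile):
    """Scale one gene's profile to zero mean and unit variance."""
    mean = sum(profile) / len(profile)
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / len(profile))
    if sd == 0:
        return [0.0] * len(profile)
    return [(x - mean) / sd for x in profile]


def kmeans_labels(profiles, seeds, iterations=10):
    """Tiny k-means; `seeds` are indices of genes used as initial centroids."""
    centroids = [list(profiles[i]) for i in seeds]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assign every gene to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in profiles
        ]
        # Move each centroid to the mean of its member genes.
        for c in range(len(centroids)):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def partition_matrix(labels, k):
    """Genes-by-clusters matrix: 1.0 if the gene is in the cluster, else 0.0."""
    return [[1.0 if label == c else 0.0 for c in range(k)] for label in labels]


# Hypothetical datasets: the same four genes, different scales and sample counts.
dataset_a = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.1, 3.9, 5.2],
             [4.0, 3.0, 2.0, 1.0], [6.0, 5.1, 3.8, 3.0]]
dataset_b = [[100, 400, 900], [50, 210, 480],
             [800, 300, 90], [1200, 500, 100]]

matrix_a = partition_matrix(
    kmeans_labels([normalise(g) for g in dataset_a], seeds=[0, 2]), k=2)
matrix_b = partition_matrix(
    kmeans_labels([normalise(g) for g in dataset_b], seeds=[0, 2]), k=2)

print(matrix_a)
print(matrix_a == matrix_b)  # True
```

The raw values of the two datasets cannot be compared directly, but both yield the same 4-by-2 membership matrix, which is what makes the clustering results combinable.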

In addition, applying different clustering methods to the datasets independently in the first step of the Bi-CoPaM is likely to produce different results. Examples of such methods include k-means, self-organising maps (SOMs), hierarchical clustering, self-organising oscillator networks, information-based clustering, and others. Furthermore, applying the same method with different sets of parameters, or even applying the same stochastic method with the same parameters multiple times, tends to produce different results. Thus, Bi-CoPaM recruits multiple methods and/or sets of parameters to be applied to each of the datasets and then combines all of these results into a single consensus result (see Figure below). This allows the datasets to be examined under the different implicit assumptions that these methods make, while producing a result that captures them collectively.

[Figure: Bi-CoPaM]

Further details on the specific techniques for combining the results are found in the reference (Abu-Jamous et al., PLOS ONE, 2013).
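As a rough illustration of the combining step, the Python sketch below averages several partition matrices element by element to obtain consensus membership values. Plain averaging is used here as a simplifying assumption, and the matrices are assumed to have their clusters already aligned with one another; the four example partitions and the function name `consensus_matrix` are hypothetical. The reference above remains the authority on the techniques Bi-CoPaM actually uses.

```python
def consensus_matrix(partitions):
    """Average aligned genes-by-clusters partition matrices element by element."""
    n_genes = len(partitions[0])
    n_clusters = len(partitions[0][0])
    return [
        [
            sum(p[g][c] for p in partitions) / len(partitions)
            for c in range(n_clusters)
        ]
        for g in range(n_genes)
    ]


# Hypothetical results: two datasets, each clustered by two methods.
# Rows are genes, columns are clusters.
partitions = [
    [[1, 0], [1, 0], [0, 1]],  # dataset 1, method A
    [[1, 0], [1, 0], [0, 1]],  # dataset 1, method B
    [[1, 0], [0, 1], [0, 1]],  # dataset 2, method A
    [[1, 0], [1, 0], [0, 1]],  # dataset 2, method B
]

consensus = consensus_matrix(partitions)
for gene, row in zip(["gene1", "gene2", "gene3"], consensus):
    print(gene, row)
# gene1 [1.0, 0.0]
# gene2 [0.75, 0.25]
# gene3 [0.0, 1.0]
```

Here gene1 and gene3 are placed consistently by every method on every dataset, while gene2 receives a fractional membership because one of the four results disagrees.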

Other key issues resolved

What is the correct number of clusters in a given dataset? What are the optimum values for the Bi-CoPaM tuning parameters? How can the results of the Bi-CoPaM be validated or assessed quantitatively? The M-N scatter plots technique, which can be appended to the Bi-CoPaM method as a plug-in, addresses these questions.

Some applications

The method has been applied to various biological areas and produced successful results. Examples include (sorted from the most developed to the least):

Publications:

Basel Abu-Jamous, Rui Fa, David J. Roberts, and Asoke K. Nandi. “Paradigm of Tunable Clustering using Binarization of Consensus Partition Matrices (Bi-CoPaM) for Gene Discovery”. PLOS ONE, 2013, 8(2): e56432, doi: 10.1371/journal.pone.0056432.