Massive numbers of biological datasets have been generated in recent decades without being comprehensively analysed and exploited. The pace of data generation has neither slowed down nor plateaued, and it has always been faster than the pace of developments in data analysis and discovery inference. Moreover, it is well known that for the same biological context (e.g. the budding yeast cell cycle), many datasets have been generated in different laboratories, in different years, using different technologies, and under different specific conditions. Despite such heterogeneity, those datasets carry common information about their shared context. It is therefore crucial to design, implement, and apply a new generation of computational methods that can analyse multiple related datasets collectively, both to speed up data analysis significantly and to extract information that is inaccessible to simple comparative analyses.
In this area, I have developed a complete computational framework for the unsupervised clustering analysis of multiple gene expression datasets simultaneously.
Clust: cutting-edge gene expression clustering
Clust is my cutting-edge method for clustering one or more heterogeneous gene expression datasets. It is freely available as a package that is straightforward to install and use. Despite its ease of use, it encapsulates a sophisticated multi-step computational framework that I have been carefully and continuously refining since 2011.
Clust is available at:
These are some of the computational methods that I developed previously and that eventually contributed to the development of Clust:
- Bi-CoPaM, the Binarisation of Consensus Partition Matrices: Mine multiple gene expression datasets for the subsets of genes consistently correlated (co-expressed) in all of them.
- UNCLES, the UNification of CLustering results from multiple datasets using External Specifications: Mine multiple gene expression datasets for the subsets of genes consistently correlated in one subset of datasets while being poorly correlated in another subset of datasets.
- M-N scatter plots technique: A cluster assessment, validation, and selection technique which aims to select the largest clusters while minimising their within-cluster dispersion. The technique is aided by visual scatter plots and is used to address the issues of setting the parameters of Bi-CoPaM and UNCLES and of defining the most suitable number of clusters.
- F-P scatter plots technique: Like the M-N scatter plots technique, this is a technique for cluster assessment, validation, and selection, but it is only suitable when the ground truth is available. That is, it can be used to validate a proposed clustering method by applying it to datasets whose true clusters are known, such as synthetic datasets.
- Gene expression data synthesis: A technique to synthesise gene expression datasets by using real data measurements. Available for download.
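
To make the idea behind the name "Binarisation of Consensus Partition Matrices" more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the published Bi-CoPaM implementation: it assumes that each clustering result is already a (clusters x genes) membership matrix, that cluster labels have already been aligned across results, that the consensus is a plain average, and that binarisation uses a simple difference threshold. The function names and the `delta` parameter are hypothetical.

```python
import numpy as np


def consensus_partition_matrix(partitions):
    """Average a list of aligned (n_clusters x n_genes) membership matrices.

    Assumption: row i refers to the same cluster in every matrix.
    """
    stacked = np.stack([np.asarray(p, dtype=float) for p in partitions])
    return stacked.mean(axis=0)


def binarise_by_difference(copam, delta=0.0):
    """Turn a fuzzy consensus matrix into a binary one.

    A gene is assigned to its best cluster only if that cluster's consensus
    membership exceeds the runner-up's by at least `delta`; otherwise the
    gene is left unassigned. Requires at least two clusters.
    """
    n_clusters, n_genes = copam.shape
    ranked = np.sort(copam, axis=0)            # ascending within each gene column
    top, runner_up = ranked[-1], ranked[-2]
    winners = copam.argmax(axis=0)             # best cluster index per gene
    keep = (top - runner_up) >= delta          # genes with a clear enough winner
    binary = np.zeros((n_clusters, n_genes), dtype=bool)
    binary[winners[keep], np.flatnonzero(keep)] = True
    return binary


if __name__ == "__main__":
    # Three hypothetical clustering results for 4 genes and 2 clusters.
    p1 = [[1, 1, 0, 0], [0, 0, 1, 1]]
    p2 = [[1, 1, 0, 0], [0, 0, 1, 1]]
    p3 = [[1, 0, 0, 1], [0, 1, 1, 0]]
    copam = consensus_partition_matrix([p1, p2, p3])
    print(binarise_by_difference(copam, delta=0.5))
```

In this sketch, a larger `delta` gives tighter clusters and leaves more genes unassigned, while `delta=0` assigns every gene to its highest-scoring cluster.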