Validating clustering for gene expression data
Konstantinos Kalpakis, Dhiral Gada, and Vasundhara Puttagunta, "Distance Measures for Effective Clustering of ARIMA Time-Series". In the Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM'01), San Jose, CA, November 29-December 2, 2001, pp.

Many environmental and socioeconomic time-series data can be adequately modeled using Auto-Regressive Integrated Moving Average (ARIMA) models. We consider the problem of clustering ARIMA time-series.
In this paper we adapt parameters related to association-rule generation so that they can be used to represent distance. Furthermore, we integrate goal-oriented quantitative attributes into the distance-measure formulation to increase the quality of the results and streamline the decision-making process. As a proof of concept, the newly developed measures are tested, and the results discussed, on both a reference dataset and a large real-life retail dataset.
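As a minimal sketch of the idea, one common way to turn an association-rule statistic (itemset support) into a distance between items is a Jaccard-style measure; the item names and transactions below are invented for illustration, and this is not necessarily the measure developed in the paper:

```python
# Hypothetical example: derive a distance between items from itemset support.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_distance(a, b, transactions):
    """Jaccard-style distance: 1 - supp({a,b}) / supp({a} or {b})."""
    both = support({a, b}, transactions)
    either = support({a}, transactions) + support({b}, transactions) - both
    return 1.0 if either == 0 else 1.0 - both / either

print(rule_distance("bread", "butter", transactions))  # 0.333...
```

Items that frequently co-occur get a small distance, so a standard clustering algorithm can be run directly on the resulting distance matrix.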
I am reading about the difference between k-means clustering and k-medoid clustering.
For example, a few LPC cepstral coefficients are sufficient to discriminate between time-series that are modeled by different ARIMA models.
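A minimal sketch of how such a cepstral distance can be computed for a pure AR(p) model, using the standard LPC-to-cepstrum recursion (the specific AR coefficients below are made up for the example):

```python
def ar_to_cepstrum(a, n_coeffs):
    """LPC cepstral coefficients of an AR(p) model
    x_t = sum_i a[i-1] * x_{t-i} + e_t, via the standard recursion
    c_n = a_n + sum_{m=max(1, n-p)}^{n-1} (m/n) * c_m * a_{n-m}."""
    p = len(a)
    c = []
    for n in range(1, n_coeffs + 1):
        acc = a[n - 1] if n <= p else 0.0
        for m in range(max(1, n - p), n):
            acc += (m / n) * c[m - 1] * a[n - m - 1]
        c.append(acc)
    return c

# Two AR(1) models with different coefficients separate clearly in cepstral space.
c1 = ar_to_cepstrum([0.9], 8)
c2 = ar_to_cepstrum([0.1], 8)
dist = sum((x - y) ** 2 for x, y in zip(c1, c2)) ** 0.5
```

For AR(1) the recursion reduces to c_n = a^n / n, so the coefficients decay quickly and a short prefix of the cepstral vector already characterizes the model.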
It obviously belongs to the cluster around 1000, but k-means will pull the center point away from 1000 and towards 100000. Which do you think is more representative of the data set? If you look up literature on the median, you will see plenty of explanations and examples of why the median is more robust to outliers than the arithmetic mean. Essentially, these explanations and examples also hold for the medoid. The mean has the lower squared error, but assuming that there might be a measurement error in this data set... Technically, the notion of breakdown point is used in statistics: the median has a breakdown point of 50% (i.e., half of the data points can be incorrect, and the result is still unaffected), whereas the mean has a breakdown point of 0 (i.e., a single large observation can yield a bad estimate).
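The effect is easy to demonstrate with the standard library; the data below is a made-up cluster near 1000 with one erroneous reading:

```python
import statistics

# Hypothetical cluster: points near 1000 plus one bad measurement of 100000.
points = [990, 1000, 1010, 1020, 100000]

print(statistics.mean(points))    # 20804 -- dragged far from the cluster
print(statistics.median(points))  # 1010  -- stays with the bulk of the data
```

A single corrupted value moves the mean by four orders of magnitude, while the median is unchanged, which is exactly what the breakdown-point argument predicts.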
Just a tiny note added to @Eli's answer: k-medoids is more robust to noise and outliers than k-means because the latter computes a cluster center that is mostly just a "virtual point", whereas the former chooses an actual object from the cluster.
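The distinction can be sketched in a few lines: the centroid is an arithmetic average that need not coincide with any data point, while the medoid is the member of the set minimizing total distance to the others (the numbers are invented for illustration):

```python
def medoid(points):
    """Return the actual data point minimizing total distance to all others."""
    return min(points, key=lambda p: sum(abs(p - q) for q in points))

# Hypothetical cluster with one outlier.
points = [8, 9, 10, 11, 500]

centroid = sum(points) / len(points)  # 107.6 -- a "virtual point", dragged by the outlier
m = medoid(points)                    # 10    -- an actual object from the cluster
```

Because the medoid must be one of the input objects, it also works with arbitrary dissimilarity measures where no meaningful average exists.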