Statistical performance of agglomerative hierarchical clustering technique via pairing of correlation-based distances and linkage methods
| Main author: | |
|---|---|
| Format: | Thesis |
| Language: | English |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://etd.uum.edu.my/11651/1/permission%20to%20deposit-grant%20the%20permission-s828267.pdf https://etd.uum.edu.my/11651/2/s828267_01.pdf |
| Abstract: | Agglomerative hierarchical clustering groups unlabelled objects into clusters. Clustering quality is high when the clusters are internally homogeneous and externally heterogeneous. However, the technique is sensitive to the choice of distance measure and linkage method, particularly when a correlation-based distance is used, because such distances are influenced by outliers. Choosing the wrong pairing may lead to poor cluster formation and low clustering performance. The main objective was therefore to measure the statistical performance of agglomerative hierarchical clustering using four correlation-based distances and four linkage methods, tested through a simulation study and a real data application. The four correlation-based distances were Pearson, Spearman, Kendall's Tau and Winsorized; the four linkage methods were Average, Ward's, Complete and Single. The simulation study was conducted under various data conditions (number of variables, sample size, percentage of outliers, and data distribution), with performance measured by the Cophenetic Correlation Coefficient (CCC). To validate the clustering model on real data, the Spearman-average algorithm was applied to cluster Juru River basin data based on five water quality parameters. The simulation study showed that Spearman-average performed well under the normal distribution and on contaminated data, while Winsorized-average performed well under the Gamma distribution. Five summary tables for choosing an appropriate clustering algorithm according to data distribution were produced. The real data validation produced five clusters, with a CCC of 0.76 and a Calinski and Harabasz index of 26.879. In addition, the clustering model was able to identify the times at which the highest and lowest levels of river contamination occurred. In conclusion, the performance of clustering depends on the conditions of the data. Agglomerative hierarchical clustering is suitable when the number of variables is not more than 20 and the sample size is smaller than 500. |
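The Spearman-average pairing and the CCC described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the distance definition `1 - rho`, the function name `spearman_average_ccc`, and the random demo data are all assumptions made for illustration, and SciPy is assumed to be available.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import rankdata


def spearman_average_ccc(data, n_clusters):
    """Cluster the rows of `data` with a Spearman-based distance and average linkage.

    Hypothetical helper; the distance 1 - rho is an assumed definition,
    not necessarily the one used in the thesis.
    """
    # Spearman correlation between observations = Pearson correlation of their ranks.
    ranks = rankdata(data, axis=1)
    rho = np.corrcoef(ranks)

    # Convert correlation to a dissimilarity and tidy up floating-point noise.
    dist = np.clip(1.0 - rho, 0.0, 2.0)
    dist = (dist + dist.T) / 2.0
    np.fill_diagonal(dist, 0.0)

    # SciPy's linkage expects the condensed (upper-triangle) form.
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")

    # CCC: correlation between original distances and cophenetic distances.
    ccc, _ = cophenet(tree, condensed)

    # Cut the dendrogram into at most `n_clusters` flat clusters.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return ccc, labels


# Synthetic demo data (30 observations, 8 variables); not the Juru River data.
rng = np.random.default_rng(0)
demo = rng.normal(size=(30, 8))
ccc, labels = spearman_average_ccc(demo, n_clusters=5)
print(f"CCC = {ccc:.3f}, clusters found = {len(set(labels))}")
```

Changing `method="average"` to `"complete"` or `"single"` gives the other linkage pairings. Ward's linkage in SciPy assumes Euclidean distances, so pairing it with a correlation-based distance needs extra care.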