Statistical performance of agglomerative hierarchical clustering technique via pairing of correlation-based distances and linkage methods

Agglomerative hierarchical clustering groups unknown objects into several clusters. Clustering quality is achieved when the clusters are internally homogeneous and externally heterogeneous. However, the technique is sensitive to the choice of the distance measure and li...

Bibliographic Details
Main Author: Nurshaziana, Mohamad Shamsuri
Format: Thesis
Language: English
Published: 2025
Subjects: QA273-280 Probabilities. Mathematical statistics
Online Access: https://etd.uum.edu.my/11651/1/permission%20to%20deposit-grant%20the%20permission-s828267.pdf
https://etd.uum.edu.my/11651/2/s828267_01.pdf
https://etd.uum.edu.my/11651/
author Nurshaziana, Mohamad Shamsuri
description Agglomerative hierarchical clustering groups unknown objects into several clusters. Clustering quality is achieved when the clusters are internally homogeneous and externally heterogeneous. However, the technique is sensitive to the choice of distance measure and linkage method, particularly with correlation-based distances, which are influenced by outliers. Choosing the wrong pairing may lead to poor cluster formation and low clustering performance. Therefore, the main objective was to measure the statistical performance of agglomerative hierarchical clustering using four correlation-based distances and four linkage methods, tested through a simulation study and a real data application. The four correlation-based distances were Pearson, Spearman, Kendall’s Tau and Winsorized, and the four linkage methods were Average, Ward’s, Complete and Single. The simulation study was conducted under various data conditions (number of variables, sample size, percentage of outliers, and data distribution), with performance measured by the Cophenetic Correlation Coefficient (CCC). To validate the clustering model on real data, the Spearman-average algorithm was applied to cluster Juru river basin data based on five water quality parameters. The simulation study showed that Spearman-average performed well under the normal distribution and with contaminated data, while Winsorized-average performed well under the Gamma distribution. Five summary tables for choosing appropriate clustering algorithms according to data distribution were produced. The real data validation produced five clusters, with a CCC of 0.76 and a Calinski and Harabasz index of 26.879. In addition, the clustering model was able to identify the times when the highest and lowest levels of river contamination occurred. In conclusion, clustering performance depends on the conditions of the data. Agglomerative hierarchical clustering is suitable when the number of variables is not more than 20 and the sample size is smaller than 500.
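The pairing described in the abstract (a correlation-based distance combined with a linkage method, scored by the CCC) can be illustrated with a minimal sketch. This is not the thesis's code: it assumes SciPy and NumPy, assumes the distance between two objects is 1 minus Spearman's rho computed across their variables, and uses synthetic data with two opposite trends. The function names `spearman_distance` and `ccc_by_linkage` are hypothetical. SciPy documents Ward's method as correctly defined only for Euclidean distances, so its score here is indicative at best.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import rankdata


def spearman_distance(data):
    """Condensed 1 - rho distances between the rows of `data` (assumed to be objects)."""
    ranks = rankdata(data, axis=1)   # rank each object's values across its variables
    rho = np.corrcoef(ranks)         # Pearson correlation of ranks is Spearman's rho
    dist = 1.0 - rho
    dist = (dist + dist.T) / 2.0     # remove floating-point asymmetry
    np.fill_diagonal(dist, 0.0)
    return squareform(dist, checks=False)


def ccc_by_linkage(data, methods=("average", "ward", "complete", "single")):
    """Cophenetic correlation coefficient for each linkage method."""
    dist = spearman_distance(data)
    scores = {}
    for method in methods:
        tree = linkage(dist, method=method)
        ccc, _ = cophenet(tree, dist)  # correlation of tree heights with input distances
        scores[method] = float(ccc)
    return scores


# Synthetic data (an assumption for illustration): 20 objects, 8 variables,
# half following an increasing trend and half a decreasing one.
rng = np.random.default_rng(0)
trend = np.linspace(0.0, 1.0, 8)
group_a = trend + rng.normal(scale=0.05, size=(10, 8))
group_b = -trend + rng.normal(scale=0.05, size=(10, 8))
data = np.vstack([group_a, group_b])

scores = ccc_by_linkage(data)

# Cut the Spearman-average tree into two clusters.
tree = linkage(spearman_distance(data), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
```

A CCC closer to 1 indicates that the dendrogram preserves the original pairwise distances more faithfully, which is how the abstract compares pairings.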
format Thesis
id oai:etd.uum.edu.my:11651
institution Universiti Utara Malaysia
language English
publishDate 2025
record_format EPrints
record_pdf Restricted
spelling oai:etd.uum.edu.my:11651 2025-05-11T01:34:16Z https://etd.uum.edu.my/11651/ 2025 Thesis NonPeerReviewed Nurshaziana, Mohamad Shamsuri (2025) Statistical performance of agglomerative hierarchical clustering technique via pairing of correlation-based distances and linkage methods. Masters thesis, Universiti Utara Malaysia.
thesis_level Master
title Statistical performance of agglomerative hierarchical clustering technique via pairing of correlation-based distances and linkage methods
topic QA273-280 Probabilities. Mathematical statistics