**Abstract:** *In this article, we discuss a common challenge encountered during the implementation of K-medoids clustering in MATLAB: handling nodes that are assigned to distant clusters. We explore potential solutions and provide a practical example.*

2024-07-13 by On Exception

## Overcoming Distance Constraints in K-Medoids Clustering: A Case Study in MATLAB

Clustering is a fundamental task in data analysis and machine learning. K-medoids clustering is a popular method that partitions a dataset into *k* clusters by selecting *k* representative objects, called medoids. This article explores a case study of implementing K-medoids clustering in MATLAB and overcoming distance constraints that arise in real-world applications.

### Background

The basic idea of K-medoids clustering is to minimize the sum of the distances between each data point and its corresponding medoid. The algorithm starts with an initial set of *k* medoids and iteratively updates them by selecting the data point that minimizes the objective function. The process continues until convergence or a maximum number of iterations is reached.
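The alternating assignment/update loop described above can be sketched in a few lines. The following is a minimal illustration written in Python rather than MATLAB so that it stays self-contained; it operates on a precomputed distance matrix and is a simplified sketch, not the Statistics and Machine Learning Toolbox implementation:

```python
import numpy as np

def k_medoids(dist, k, max_iter=100, seed=0):
    """Minimal K-medoids on a precomputed n-by-n distance matrix.

    Returns (medoid indices, cluster label per point). A sketch of the
    alternating update described above, not a production implementation.
    """
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid
        labels = np.argmin(dist[:, medoids], axis=1)
        # Update step: within each cluster, pick the member that
        # minimizes the total distance to all other members
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):  # converged
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels
```

Because the objective only ever queries pairwise distances, the same loop works unchanged for any distance matrix, which is exactly what makes K-medoids attractive for the non-Euclidean metrics discussed next.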

However, in many applications, the distance metric used in K-medoids clustering is not Euclidean, but rather a more complex function that takes into account domain-specific knowledge or constraints. For example, in geospatial analysis, the great-circle distance may be used instead of the Euclidean distance to account for the curvature of the Earth's surface. In text analysis, the cosine distance may be used instead of the Euclidean distance to account for the high dimensionality and sparsity of the data.

### Case Study: Overcoming Distance Constraints in K-Medoids Clustering

In this case study, we consider a dataset of customer transactions from an e-commerce website. The dataset contains information about the products purchased, the time of purchase, and the location of the customer. The goal is to cluster the customers into groups based on their purchasing behavior and geographical location.

To account for the distance constraint, we use the Haversine formula to calculate the great-circle distance between the customers' locations. The Haversine formula takes into account the curvature of the Earth's surface and is more accurate than the Euclidean distance in this context.
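Written out directly, the Haversine formula looks as follows. This hypothetical helper (in Python, for illustration) returns the great-circle distance in kilometers, assuming a mean Earth radius of 6371 km:

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, lam1, phi2, lam2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

As a sanity check, one degree of longitude along the equator comes out to roughly 111 km, matching the familiar rule of thumb.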

#### Implementation in MATLAB

We implement the K-medoids clustering algorithm in MATLAB using the `kmedoids()` function from the Statistics and Machine Learning Toolbox. We set the distance metric parameter of the function to a custom handle implementing the Haversine formula instead of the default squared Euclidean distance.

```matlab
% Haversine distance as a custom metric in the pdist2/kmedoids convention:
% ZI is a 1-by-2 row vector [lat lon] in radians, ZJ is an m-by-2 matrix.
R = 6371; % mean Earth radius in km
haversine = @(ZI, ZJ) 2 * R * asin(sqrt( ...
    sin((ZJ(:,1) - ZI(1))/2).^2 + ...
    cos(ZI(1)) .* cos(ZJ(:,1)) .* sin((ZJ(:,2) - ZI(2))/2).^2));

% Cluster on the customers' coordinates, converted to radians.
% (The purchasing-behavior features would require a combined metric;
% they are omitted here for clarity.)
coords = deg2rad([latitude, longitude]);

% Choose k initial medoids at random and run K-medoids
k = 5;
seeds = coords(randperm(size(coords, 1), k), :);
[idx, C] = kmedoids(coords, k, 'Distance', haversine, 'Start', seeds);
```

#### Results

After running the K-medoids clustering algorithm with the Haversine distance metric, we obtain *k* clusters of customers based on their purchasing behavior and geographical location. We can visualize the results using a scatter plot, where each point represents a customer and the color represents the cluster assignment.

```matlab
% Plot the results: one marker per customer, colored by cluster assignment
scatter(latitude, longitude, 36, idx, 'filled');
xlabel('Latitude');
ylabel('Longitude');
title('K-Medoids Clustering with Haversine Distance');
```

In this article, we have presented a case study of implementing K-medoids clustering in MATLAB and overcoming distance constraints that arise in real-world applications. By using the Haversine formula to calculate the great-circle distance between customer locations, we were able to cluster the customers into groups based on their purchasing behavior and geographical location. This approach can be applied to other domains where the distance metric is not Euclidean, but rather a more complex function that takes into account domain-specific knowledge or constraints.

- K-medoids clustering is a popular method for partitioning a dataset into *k* clusters.
- In many applications, the distance metric used in K-medoids clustering is not Euclidean, but rather a more complex function that takes into account domain-specific knowledge or constraints.
- In this case study, we implemented K-medoids clustering in MATLAB and modified the distance metric parameter to use the Haversine formula for calculating the great-circle distance between customer locations.
- The results show that K-medoids clustering with the Haversine distance metric can be used to cluster customers into groups based on their purchasing behavior and geographical location.
