<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wikimd.org/index.php?action=history&amp;feed=atom&amp;title=Hierarchical_clustering</id>
	<title>Hierarchical clustering - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wikimd.org/index.php?action=history&amp;feed=atom&amp;title=Hierarchical_clustering"/>
	<link rel="alternate" type="text/html" href="https://wikimd.org/index.php?title=Hierarchical_clustering&amp;action=history"/>
	<updated>2026-04-27T02:19:13Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.44.2</generator>
	<entry>
		<id>https://wikimd.org/index.php?title=Hierarchical_clustering&amp;diff=5653135&amp;oldid=prev</id>
		<title>Prab: CSV import</title>
		<link rel="alternate" type="text/html" href="https://wikimd.org/index.php?title=Hierarchical_clustering&amp;diff=5653135&amp;oldid=prev"/>
		<updated>2024-04-24T01:57:28Z</updated>

		<summary type="html">&lt;p&gt;CSV import&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;[[Image:Clusters.svg|Clusters|thumb]] &amp;#039;&amp;#039;&amp;#039;Hierarchical clustering&amp;#039;&amp;#039;&amp;#039; is a method of [[cluster analysis]] which seeks to build a hierarchy of [[clusters]]. Strategies for hierarchical clustering generally fall into two types: &amp;#039;&amp;#039;&amp;#039;Agglomerative&amp;#039;&amp;#039;&amp;#039;: This is a &amp;quot;bottom-up&amp;quot; approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. &amp;#039;&amp;#039;&amp;#039;Divisive&amp;#039;&amp;#039;&amp;#039;: This is a &amp;quot;top-down&amp;quot; approach where all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. The results of hierarchical clustering are usually presented in a [[dendrogram]].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
In the field of [[data analysis]], hierarchical clustering is a powerful tool that allows the analyst to identify natural groupings or structures within a dataset. Unlike [[k-means clustering]], it does not require the number of clusters to be specified in advance, which makes it particularly useful for exploratory data analysis.&lt;br /&gt;
&lt;br /&gt;
== Algorithm ==&lt;br /&gt;
The algorithm for hierarchical clustering can be described as follows:&lt;br /&gt;
&lt;br /&gt;
=== Agglomerative Clustering ===&lt;br /&gt;
# Start by treating each [[data point]] as a single cluster.&lt;br /&gt;
# Find the closest (most similar) pair of clusters and merge them into a single cluster.&lt;br /&gt;
# Compute distances (similarities) between the new cluster and each of the old clusters.&lt;br /&gt;
# Repeat steps 2 and 3 until all items are clustered into a single cluster of size &amp;#039;&amp;#039;n&amp;#039;&amp;#039;.&lt;br /&gt;
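&lt;br /&gt;
The agglomerative steps above can be sketched in Python. This is an illustrative single-linkage implementation with naive pairwise distances (roughly cubic time), not an optimized library routine; the function names are ours.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from itertools import combinations

def agglomerative(points):
    """Illustrative single-linkage agglomerative clustering with naive
    pairwise distances (roughly O(n^3); fine for small inputs)."""
    clusters = [[i] for i in range(len(points))]   # step 1: one cluster per point
    merges = []
    while len(clusters) != 1:                      # step 4: until one cluster is left
        def linkage(pair):
            # step 3: single linkage -- the distance between two clusters is
            # the minimum distance over all cross-cluster point pairs
            a, b = pair
            return min(math.dist(points[i], points[j]) for i in a for j in b)
        a, b = min(combinations(clusters, 2), key=linkage)  # step 2: closest pair
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)                     # merge into a single cluster
        merges.append((sorted(a), sorted(b)))
    return merges
```
For example, clustering the points (0, 0), (0, 1) and (5, 5) first merges the two nearby points and only then absorbs the distant one.&lt;br /&gt;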
&lt;br /&gt;
=== Divisive Clustering ===&lt;br /&gt;
# Start with all observations in a single cluster.&lt;br /&gt;
# Choose the cluster to split (for example, the one with the largest diameter) and decide how to divide its members.&lt;br /&gt;
# Perform the split to create two new clusters.&lt;br /&gt;
# Repeat steps 2 and 3 until each observation is in its own cluster.&lt;br /&gt;
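&lt;br /&gt;
A minimal top-down sketch of these steps follows. The splitting rule here (pick the cluster with the largest diameter and divide it around its two farthest members) is one simple illustrative choice, not the full DIANA procedure, and it assumes the input points are distinct.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from itertools import combinations

def divisive(points):
    """Illustrative top-down clustering: repeatedly split the cluster with
    the largest diameter around its two farthest members. A simplified
    sketch (not the full DIANA procedure); assumes distinct points."""
    clusters = [list(range(len(points)))]          # step 1: one big cluster
    history = [[c[:] for c in clusters]]
    while max(len(c) for c in clusters) != 1:      # step 4: until singletons
        def diam(cl):
            # largest pairwise distance inside a cluster
            if len(cl) == 1:
                return 0.0
            return max(math.dist(points[i], points[j])
                       for i, j in combinations(cl, 2))
        chosen = max(clusters, key=diam)           # step 2: cluster to split
        # step 3: split around the two farthest points; each member joins
        # whichever of the two seed points it is nearer to
        a, b = max(combinations(chosen, 2),
                   key=lambda p: math.dist(points[p[0]], points[p[1]]))
        left = [i for i in chosen
                if math.dist(points[i], points[a]) ==
                   min(math.dist(points[i], points[a]),
                       math.dist(points[i], points[b]))]
        right = [i for i in chosen if i not in left]
        clusters.remove(chosen)
        clusters.extend([left, right])
        history.append([c[:] for c in clusters])
    return history
```
On the points (0, 0), (0, 1) and (5, 5), the first split separates the distant point, and the second split separates the two nearby points.&lt;br /&gt;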
&lt;br /&gt;
== Distance Measures ==&lt;br /&gt;
The choice of distance measure is a critical step in clustering. It defines how the similarity of two elements is calculated and it will influence the shape of the clusters. The most common distance measures used in hierarchical clustering are:&lt;br /&gt;
&lt;br /&gt;
* [[Euclidean distance]]: The standard distance measure, also known as straight-line distance.&lt;br /&gt;
* [[Manhattan distance]]: The sum of the absolute differences of the Cartesian coordinates, also known as city block distance.&lt;br /&gt;
* [[Cosine similarity]]: The cosine of the angle between two vectors; being a similarity rather than a distance, it is typically converted to a dissimilarity (for example, 1 minus the similarity) before clustering.&lt;br /&gt;
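&lt;br /&gt;
The three measures above can be computed directly; the following Python helpers are a small self-contained sketch (the function names are ours):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def euclidean(p, q):
    # straight-line distance between two points
    return math.dist(p, q)

def manhattan(p, q):
    # city-block distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_similarity(p, q):
    # cosine of the angle between the two vectors (1.0 means same direction)
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.hypot(*p) * math.hypot(*q))
```
For the 2-D points (3, 4) and (4, 3), the Euclidean distance is about 1.41, the Manhattan distance is 2, and the cosine similarity is 0.96.&lt;br /&gt;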
&lt;br /&gt;
== Applications ==&lt;br /&gt;
Hierarchical clustering is widely used in various fields such as:&lt;br /&gt;
&lt;br /&gt;
* [[Biology]], for constructing phylogenetic trees.&lt;br /&gt;
* [[Information retrieval]], for document clustering.&lt;br /&gt;
* [[Social sciences]], for clustering individuals based on their characteristics.&lt;br /&gt;
* [[Market research]], for customer segmentation.&lt;br /&gt;
&lt;br /&gt;
== Advantages and Disadvantages ==&lt;br /&gt;
=== Advantages ===&lt;br /&gt;
* Does not require the number of clusters to be specified in advance.&lt;br /&gt;
* Easy to implement and provides hierarchical relationships among the observations.&lt;br /&gt;
&lt;br /&gt;
=== Disadvantages ===&lt;br /&gt;
* Can be computationally expensive for large datasets; the standard agglomerative algorithm requires O(&amp;#039;&amp;#039;n&amp;#039;&amp;#039;&lt;sup&gt;2&lt;/sup&gt;) memory and O(&amp;#039;&amp;#039;n&amp;#039;&amp;#039;&lt;sup&gt;3&lt;/sup&gt;) time in the general case.&lt;br /&gt;
* The results can be sensitive to the choice of distance measure and linkage criteria.&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
* [[Cluster analysis]]&lt;br /&gt;
* [[K-means clustering]]&lt;br /&gt;
* [[Dendrogram]]&lt;br /&gt;
* [[Data analysis]]&lt;br /&gt;
&lt;br /&gt;
[[Category:Cluster analysis]]&lt;br /&gt;
[[Category:Data mining]]&lt;br /&gt;
[[Category:Machine learning]]&lt;br /&gt;
&lt;br /&gt;
{{stub}}&lt;/div&gt;</summary>
		<author><name>Prab</name></author>
	</entry>
</feed>