# Cluster Analysis 101

Cluster analysis is the art of mathematically split objects into groups or clusters whose members are similar in some way and dissimilar to objects belonging to other groups.

There is a plethora of clustering algorithms available with plenty supporters and detractors of all flavours. I am not going to cover all of them.

This post is about how I normally conduct cluster analyses. The way I use cluster analysis as a previous step for predictive modelling - this mean I create cluster and when new objects are available I try to classify them into groups using multiple techniques.

When possible, I’ll try to provide different approaches to achieve the same result using R.

For this example I’m going to use the water-treatment data set from UCI

This dataset comes from the daily measures of sensors in a urban waste water treatment plant. The objective is to classify the operational state of the plant in order to predict faults through the state variables of the plant at each of the stages of the treatment process.

## Data Set Description

- Number of instances: 527
- Number of Attributes: 38
- All attributes are numeric and continuous

There are some missing values, all are unknown information.

Comments to the data file:

The first element of each line is the day of the data, the rest are the attribute values

###Attributes

```
N. Attrib.
1 Q-E (input flow to plant)
2 ZN-E (input Zinc to plant)
3 PH-E (input pH to plant)
4 DBO-E (input Biological demand of oxygen to plant)
5 DQO-E (input chemical demand of oxygen to plant)
6 SS-E (input suspended solids to plant)
7 SSV-E (input volatile supended solids to plant)
8 SED-E (input sediments to plant)
9 COND-E (input conductivity to plant)
10 PH-P (input pH to primary settler)
11 DBO-P (input Biological demand of oxygen to primary settler)
12 SS-P (input suspended solids to primary settler)
13 SSV-P (input volatile supended solids to primary settler)
14 SED-P (input sediments to primary settler)
15 COND-P (input conductivity to primary settler)
16 PH-D (input pH to secondary settler)
17 DBO-D (input Biological demand of oxygen to secondary settler)
18 DQO-D (input chemical demand of oxygen to secondary settler)
19 SS-D (input suspended solids to secondary settler)
20 SSV-D (input volatile supended solids to secondary settler)
21 SED-D (input sediments to secondary settler)
22 COND-D (input conductivity to secondary settler)
23 PH-S (output pH)
24 DBO-S (output Biological demand of oxygen)
25 DQO-S (output chemical demand of oxygen)
26 SS-S (output suspended solids)
27 SSV-S (output volatile supended solids)
28 SED-S (output sediments)
29 COND-S (output conductivity)
30 RD-DBO-P (performance input Biological demand of oxygen in primary settler)
31 RD-SS-P (performance input suspended solids to primary settler)
32 RD-SED-P (performance input sediments to primary settler)
33 RD-DBO-S (performance input Biological demand of oxygen to secondary settler)
34 RD-DQO-S (performance input chemical demand of oxygen to secondary settler)
35 RD-DBO-G (global performance input Biological demand of oxygen)
36 RD-DQO-G (global performance input chemical demand of oxygen)
37 RD-SS-G (global performance input suspended solids)
38 RD-SED-G (global performance input sediments)
```

## Data Preparation

```
water.treatment = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/water-treatment/water-treatment.data", header = F)
# Remove the first column (day of the data)
water.treatment$V1 = NULL
```

```
# rename variables to something more friendly
colnames(water.treatment) = c("Q_E", "ZN_E", "PH_E", "DBO_E", "DQO_E", "SS_E",
"SSV_E", "SED_E", "COND_E", "PH_P", "DBO_P", "SS_P", "SSV_P", "SED_P", "COND_P",
"PH_D", "DBO_D", "DQO_D", "SS_D", "SSV_D", "SED_D", "COND_D", "PH_S", "DBO_S",
"DQO_S", "SS_S", "SSV_S", "SED_S", "COND_S", "RD_DBO_P", "RD_SS_P", "RD_SED_P",
"RD_DBO_S", "RD_DQO_S", "RD_DBO_G", "RD_DQO_G", "RD_SS_G", "RD_SED_G")
apply(water.treatment, 2, class)
```

```
## Q_E ZN_E PH_E DBO_E DQO_E SS_E
## "character" "character" "character" "character" "character" "character"
## SSV_E SED_E COND_E PH_P DBO_P SS_P
## "character" "character" "character" "character" "character" "character"
## SSV_P SED_P COND_P PH_D DBO_D DQO_D
## "character" "character" "character" "character" "character" "character"
## SS_D SSV_D SED_D COND_D PH_S DBO_S
## "character" "character" "character" "character" "character" "character"
## DQO_S SS_S SSV_S SED_S COND_S RD_DBO_P
## "character" "character" "character" "character" "character" "character"
## RD_SS_P RD_SED_P RD_DBO_S RD_DQO_S RD_DBO_G RD_DQO_G
## "character" "character" "character" "character" "character" "character"
## RD_SS_G RD_SED_G
## "character" "character"
```

```
# attributes are numeric and continuous so convert 'character variable to
# numeric
water.treatment.num = apply(water.treatment, 2, as.numeric)
```

```
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
## Warning in apply(water.treatment, 2, as.numeric): NAs introduced by coercion
```

Now we have a bunch of missing values due to the above conversion so we need to fill these gaps.

```
library(impute)
rseed = 2394 # for reproducibility.
# Note: impute.knn requires a matrix as input data
water.treatment.knn = impute.knn(water.treatment.num, rng.seed = rseed)
# Retrieve imputed matrix from list object
water.treatment.imputed = as.data.frame(water.treatment.knn$data)
```

## Exploratory Analysis

It is always recommendable to remove variables with very low variability as they will not contribute to the formation of clusters.

```
library(caret)
```

```
## Loading required package: lattice
```

```
## Loading required package: ggplot2
```

```
nearZeroVar(water.treatment.imputed)
```

```
## integer(0)
```

Some algorithms like kmeans are susceptible to outliers, badly skewed distribution etc. So checking these things and fixing or improving skewness when possible is a most-do before applying clustering (especially kmeans)

```
summary(water.treatment.imputed)
```

```
## Q_E ZN_E PH_E DBO_E
## Min. :10050 Min. : 0.100 Min. :6.90 Min. : 31.0
## 1st Qu.:32964 1st Qu.: 0.900 1st Qu.:7.60 1st Qu.:148.0
## Median :36010 Median : 1.500 Median :7.80 Median :182.0
## Mean :37221 Mean : 2.357 Mean :7.81 Mean :188.3
## 3rd Qu.:41182 3rd Qu.: 3.000 3rd Qu.:8.00 3rd Qu.:223.0
## Max. :60081 Max. :33.500 Max. :8.70 Max. :438.0
## DQO_E SS_E SSV_E SED_E
## Min. : 81.0 Min. : 98.0 Min. :13.20 Min. : 0.400
## 1st Qu.:326.5 1st Qu.: 170.0 1st Qu.:55.85 1st Qu.: 3.200
## Median :400.0 Median : 196.0 Median :64.30 Median : 4.411
## Mean :407.0 Mean : 227.5 Mean :61.45 Mean : 4.580
## 3rd Qu.:474.5 3rd Qu.: 242.0 3rd Qu.:69.60 3rd Qu.: 5.500
## Max. :941.0 Max. :2008.0 Max. :85.00 Max. :36.000
## COND_E PH_P DBO_P SS_P SSV_P
## Min. : 651 Min. :7.30 Min. : 32.0 Min. : 104 Min. : 7.10
## 1st Qu.:1201 1st Qu.:7.70 1st Qu.:156.0 1st Qu.: 184 1st Qu.:54.20
## Median :1406 Median :7.80 Median :194.0 Median : 220 Median :62.90
## Mean :1479 Mean :7.83 Mean :204.6 Mean : 254 Mean :60.43
## 3rd Qu.:1672 3rd Qu.:8.00 3rd Qu.:238.5 3rd Qu.: 272 3rd Qu.:68.80
## Max. :3230 Max. :8.50 Max. :517.0 Max. :1692 Max. :93.50
## SED_P COND_P PH_D DBO_D DQO_D
## Min. : 1.000 Min. : 646 Min. :7.100 Min. : 26 Min. : 80.0
## 1st Qu.: 3.050 1st Qu.:1217 1st Qu.:7.700 1st Qu.: 98 1st Qu.:221.5
## Median : 4.500 Median :1420 Median :7.800 Median :119 Median :274.0
## Mean : 5.006 Mean :1496 Mean :7.812 Mean :122 Mean :274.3
## 3rd Qu.: 6.000 3rd Qu.:1714 3rd Qu.:7.900 3rd Qu.:147 3rd Qu.:325.0
## Max. :46.000 Max. :3170 Max. :8.400 Max. :285 Max. :511.0
## SS_D SSV_D SED_D COND_D
## Min. : 49.00 Min. : 20.20 Min. :0.0000 Min. : 85
## 1st Qu.: 78.00 1st Qu.: 68.05 1st Qu.:0.2000 1st Qu.:1226
## Median : 92.00 Median : 74.50 Median :0.3000 Median :1428
## Mean : 94.23 Mean : 73.01 Mean :0.4199 Mean :1491
## 3rd Qu.:106.00 3rd Qu.: 79.35 3rd Qu.:0.5000 3rd Qu.:1701
## Max. :244.00 Max. :100.00 Max. :3.5000 Max. :3690
## PH_S DBO_S DQO_S SS_S
## Min. :7.00 Min. : 3.00 Min. : 9.00 Min. : 6.00
## 1st Qu.:7.60 1st Qu.: 14.00 1st Qu.: 65.00 1st Qu.: 14.00
## Median :7.70 Median : 18.00 Median : 85.00 Median : 19.00
## Mean :7.71 Mean : 20.29 Mean : 88.49 Mean : 22.34
## 3rd Qu.:7.80 3rd Qu.: 23.00 3rd Qu.:103.00 3rd Qu.: 24.00
## Max. :9.70 Max. :320.00 Max. :350.00 Max. :238.00
## SSV_S SED_S COND_S RD_DBO_P
## Min. : 29.20 Min. :0.00000 Min. : 683 Min. : 0.60
## 1st Qu.: 75.15 1st Qu.:0.00000 1st Qu.:1235 1st Qu.:30.47
## Median : 81.30 Median :0.01000 Median :1433 Median :37.80
## Mean : 80.21 Mean :0.03646 Mean :1496 Mean :38.62
## 3rd Qu.: 85.70 3rd Qu.:0.02000 3rd Qu.:1694 3rd Qu.:47.65
## Max. :100.00 Max. :3.50000 Max. :3950 Max. :79.10
## RD_SS_P RD_SED_P RD_DBO_S RD_DQO_S
## Min. : 5.30 Min. : 7.7 Min. : 8.20 Min. : 1.40
## 1st Qu.:50.60 1st Qu.: 88.6 1st Qu.:80.50 1st Qu.:62.85
## Median :59.30 Median : 93.3 Median :85.00 Median :69.30
## Mean :58.51 Mean : 90.5 Mean :82.92 Mean :67.46
## 3rd Qu.:66.80 3rd Qu.: 95.7 3rd Qu.:87.70 3rd Qu.:75.00
## Max. :96.10 Max. :100.0 Max. :94.70 Max. :96.80
## RD_DBO_G RD_DQO_G RD_SS_G RD_SED_G
## Min. :19.60 Min. :19.20 Min. :10.30 Min. : 36.40
## 1st Qu.:87.25 1st Qu.:73.35 1st Qu.:87.50 1st Qu.: 99.30
## Median :89.90 Median :78.80 Median :90.70 Median : 99.70
## Mean :88.71 Mean :77.53 Mean :88.93 Mean : 99.11
## 3rd Qu.:92.30 3rd Qu.:83.20 3rd Qu.:93.00 3rd Qu.:100.00
## Max. :97.00 Max. :98.10 Max. :99.40 Max. :100.00
```

```
library(e1071)
sk_ind = apply(water.treatment.imputed, 2, skewness)
sort(sk_ind)
```

```
## RD_SED_G RD_DBO_G RD_SS_G RD_DBO_S RD_SED_P RD_DQO_G
## -11.86161340 -5.24580201 -4.54156409 -4.21695601 -3.32192020 -1.67070923
## RD_DQO_S SSV_E SSV_P SSV_D SSV_S RD_SS_P
## -1.38833850 -1.14369347 -1.04507457 -1.03270973 -0.91373513 -0.35995665
## RD_DBO_P PH_D DQO_D PH_E PH_P DBO_D
## -0.01696595 0.03736981 0.03915236 0.14152839 0.24740400 0.30907457
## Q_E DQO_E DBO_E COND_D COND_P COND_E
## 0.47877146 0.57820675 0.80852710 0.85505040 0.89314698 0.89893174
## DBO_P COND_S SS_D PH_S DQO_S SED_D
## 0.98658135 1.21395916 1.81186759 2.01945917 2.27587139 3.40077458
## ZN_E SS_P SED_E SED_P SS_S SS_E
## 4.69887711 5.37324004 5.49155592 5.79264289 6.29553008 6.61849115
## DBO_S SED_S
## 11.36128989 14.20156424
```

Some variables are badly skewed (positively and negatively). Let’s plot some of them.

```
par(mfrow = c(2, 2))
hist(water.treatment.imputed$RD_SED_G)
hist(water.treatment.imputed$RD_DBO_S)
hist(water.treatment.imputed$RD_DBO_P)
hist(water.treatment.imputed$SED_S)
```

We can use Box-Cox transformations to try to fix/improve skeweness.

```
# from caret use BoxCoxTrans()
RD_DBO_S.trans = BoxCoxTrans(water.treatment.imputed$RD_DBO_S)
RD_DBO_S.trans
```

```
## Box-Cox Transformation
##
## 527 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.20 80.50 85.00 82.92 87.70 94.70
##
## Largest/Smallest: 11.5
## Sample Skewness: -4.22
##
## Estimated Lambda: 2
```

```
par(mfrow = c(1, 2))
# raw variable
hist(water.treatment.imputed$RD_DBO_S)
# After Applying Cox-Box
hist(predict(RD_DBO_S.trans, water.treatment.imputed$RD_DBO_S))
```

This is only for one variable so we need to do the above for all other variables with ill-skewness. Whilst we could wrap up the above piece of code in a function there is another way I call the **FastTrack** using the caret package.

The **FastTrack** uses the preProcess() function in the caret package to apply BoxCox, center & scale and imputation all in one go. It is pretty neat!

```
trans = preProcess(as.data.frame(water.treatment.num), method = c("BoxCox",
"center", "scale", "knnImpute"), k = 10)
# Apply transformations
water.treatment.transf = predict(trans, as.data.frame(water.treatment.num))
```

Note: The only thing you need to do before applying the fastTrack approach is to ensure all your variables are continuous.

## K-means

I always start with k-means as it is easy to implement and understand. I set k to a high value and run k-means from 2 clusters to k clusters and plot the well known elbow graph to have an idea of how many clusters I will need.

For this example I choose k=50.

```
library(useful)
source("~/projects/scripts/R/multiKmeans.R")
source("~/projects/scripts/R//elbowGraph.R")
cl = multiKmeans(water.treatment.transf, 50, 200)
elbow = elbowGraph(cl$css)
elbow
```

Note: You may get a slightly different graph as k-means always return different setups due to initial random selection.

Looking at the above graph I’d say **k** is between 3 and 9 clusters

## Hierarchical Clustering

Initial k-means is giving `3<=k<=9`

clusters and I think 4 or 5 clusters seems right but I’ll use hierarchical clustering to gain more insight on the number of clusters. Please notice that I use other clustering algorithms to gain insight on how many clusters I will choose for my previous k-means.

```
d = dist(water.treatment.transf, method = "euclidean") # distance matrix
fit1 = hclust(d, method = "ward")
```

```
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
```

```
plot(fit1, labels = FALSE) # display dendogram
```

In the above tree graph, what we are looking for is for large stems and it seems like 3,4 or 5 clusters are big candidates.

Let’s see how these different options of clusters looks like.

```
par(mfrow = c(2, 2))
plot(fit1, labels = FALSE)
rect.hclust(fit1, k = 3, border = "red")
plot(fit1, labels = FALSE)
rect.hclust(fit1, k = 4, border = "green")
plot(fit1, labels = FALSE)
rect.hclust(fit1, k = 5, border = "blue")
```

3 clusters seem like a potential solution. Let’s see how looks like 3 clusters

```
clus3 = cl$cluster[[3]]
source("~/projects/scripts/R/plot.kmeans2.R")
plot.kmeans2(clus3, data = water.treatment.transf)
```

## Validating cluster solutions

Another way of validating cluster is using the **clValid** are package. Google it for further information.

```
library(clValid)
```

```
## Loading required package: cluster
```

```
intern = clValid(water.treatment.transf, 3:8, clMethods = c("hierarchical",
"kmeans", "pam"), validation = "internal")
```

```
## Warning in clValid(water.treatment.transf, 3:8, clMethods = c("hierarchical", :
## rownames for data not specified, using 1:nrow(data)
```

```
summary(intern)
```

```
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 3 4 5 6 7 8
##
## Validation Measures:
## 3 4 5 6 7 8
##
## hierarchical Connectivity 7.7159 9.7159 19.0258 19.1369 27.3194 29.3194
## Dunn 0.4385 0.4385 0.3477 0.3477 0.3365 0.3365
## Silhouette 0.4664 0.4631 0.4025 0.3560 0.2693 0.2683
## kmeans Connectivity 216.5996 304.8694 306.8694 421.4171 405.9639 397.2667
## Dunn 0.1245 0.1688 0.1688 0.1424 0.1621 0.2055
## Silhouette 0.1317 0.0928 0.0919 0.0907 0.1005 0.0950
## pam Connectivity 356.3349 460.5222 476.2377 517.5083 548.0472 552.6091
## Dunn 0.0890 0.0807 0.0983 0.0925 0.1010 0.1691
## Silhouette 0.0910 0.0784 0.0792 0.0764 0.0531 0.0616
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 7.7159 hierarchical 3
## Dunn 0.4385 hierarchical 3
## Silhouette 0.4664 hierarchical 3
```

```
stab = clValid(water.treatment.transf, 3:8, clMethods = c("hierarchical", "kmeans",
"pam"), validation = "stability")
```

```
## Warning: did not converge in 10 iterations
```

```
## Warning in clValid(water.treatment.transf, 3:8, clMethods = c("hierarchical", :
## rownames for data not specified, using 1:nrow(data)
```

```
summary(stab)
```

```
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 3 4 5 6 7 8
##
## Validation Measures:
## 3 4 5 6 7 8
##
## hierarchical APN 0.0003 0.0022 0.0037 0.0161 0.0113 0.0251
## AD 7.9868 7.9567 7.8577 7.8426 7.7760 7.7598
## ADM 0.0111 0.0294 0.0656 0.1647 0.1280 0.2199
## FOM 0.9582 0.9517 0.9329 0.9298 0.9227 0.9182
## kmeans APN 0.0544 0.0838 0.0882 0.4702 0.2279 0.2299
## AD 7.4932 7.2738 7.2119 7.3940 6.9624 6.8668
## ADM 0.3272 0.5465 0.4356 2.3369 1.1887 1.1365
## FOM 0.9227 0.9010 0.8841 0.8494 0.8339 0.8250
## pam APN 0.2046 0.3202 0.2715 0.2742 0.2082 0.2428
## AD 7.5714 7.4790 7.2551 7.1482 6.9676 6.8648
## ADM 1.0277 1.5481 1.2421 1.2396 0.9860 1.0749
## FOM 0.9298 0.9111 0.8923 0.8749 0.8703 0.8632
##
## Optimal Scores:
##
## Score Method Clusters
## APN 0.0003 hierarchical 3
## AD 6.8648 pam 8
## ADM 0.0111 hierarchical 3
## FOM 0.8250 kmeans 8
```

Using the stability algorithm it seems like 3 and 8 are good options for the number of clusters - however with 8 clusters there is a lot of overlapping among clusters.

```
clus8 = cl$cluster[[8]]
source("~/projects/scripts/R/plot.kmeans2.R")
plot.kmeans2(clus8, data = water.treatment.transf)
```

So final solution at this stage will be 3 clusters. One also can isolate each cluster and try to find if there are meaningful sub-clusters by repeating the above process.