
L9 Clustering

Tags: OneNote, Pattern Recognition Posted on 2020-08-11 21:10:21 Edited on 2020-08-11 21:10:21 Views: 170

Key Points

Clustering: clustering criteria, k-means, fuzzy c-means, and agglomerative clustering.

 

Clustering Criteria

Intra-cluster distances are minimized.

Inter-cluster distances are maximized.

 

Types of Clustering

  1. Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. (The user may need to specify the number of clusters.)

  2. Hierarchical clustering: a set of nested clusters organized as a hierarchical tree. (It can be described with a dendrogram.)

[Figure: example dendrogram over points p1, p2, p3, p4]

 

K-means

  1. k is the number of clusters, specified by the user.
  2. Each cluster has a center, called the centroid.

 

Algorithm steps

  1. Choose k.
  2. (Randomly) initialize k cluster centers.
  3. Iterate over all sample points and assign each to the cluster whose center is nearest.
  4. Iterate over all clusters and recompute each center.
  5. If no sample point changed cluster in the last iteration, stop; otherwise go back to step 3.

 

Euclidean distance is a natural choice for the distance metric.
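Below is a minimal NumPy sketch of the five steps above; the function name and defaults are my own choices, not from the lecture:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means; X is an (n, d) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize the k cluster centers with randomly chosen points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to the nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop if no point changed cluster in the last iteration.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recompute each center as the mean of its cluster.
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids
```

For example, `labels, centroids = kmeans(np.random.rand(100, 2), k=3)` clusters 100 random 2-D points.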

 

Question: how do we choose k, the number of clusters?

One approximate method is to look for the turning point in the cost, the so-called elbow finding.

[Figure: cost function versus the number of clusters k; the elbow marks a suitable k]

How can this be implemented?

$$\mathrm{SSE}_K = \sum_{i=1}^{K} \sum_{x_j \in \text{Cluster}_i} \| x_j - c_i \|^2$$

Find the K that minimizes $\mathrm{SSE}_K + cK$, where c is a constant: that is, the total squared distance of all sample points to their cluster centers, plus a penalty term on the number of clusters.
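A sketch of how this criterion could be evaluated, reusing the kmeans function above; the search range k_max and the penalty constant c are my own free parameters and must be tuned for the data:

```python
def choose_k(X, k_max=10, c=1.0):
    """Return the K in [1, k_max] minimizing SSE_K + c*K."""
    best_k, best_score = 1, float("inf")
    for k in range(1, k_max + 1):
        labels, centroids = kmeans(X, k)
        # SSE_K: total squared distance of each point to its cluster center.
        sse = ((X - centroids[labels]) ** 2).sum()
        if sse + c * k < best_score:
            best_k, best_score = k, sse + c * k
    return best_k
```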

 

Question: how do we handle outliers?

  1. One approach is to remove data points that are much farther from the cluster center than the other points in that cluster; to be safe, we may want to monitor these points over several iterations before removing them.
  2. Another approach is random sampling: each time we pick only a small subset of the data set, so the probability of picking an outlier is small; the remaining data points are then assigned to clusters by distance or similarity comparison.

 

Question: K-means is sensitive to the choice of the initial K cluster centers.

Run multiple initializations with different random seeds and keep the best result, as in the sketch below.
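A sketch of that restart strategy, reusing the kmeans function above and keeping the run with the lowest SSE:

```python
def kmeans_best_of(X, k, n_init=10):
    """Run k-means n_init times with different seeds; keep the lowest-SSE run."""
    best_sse, best = float("inf"), None
    for seed in range(n_init):
        labels, centroids = kmeans(X, k, seed=seed)
        sse = ((X - centroids[labels]) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (labels, centroids)
    return best
```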

 

Question: K-means is not well suited to finding clusters whose shapes are not hyper-spheres.

 

Evaluation of K-means:

  1. Simple and efficient.
  2. There is no clear evidence that other clustering algorithms are better in general.

 

Fuzzy C-means (FCM)

Core idea: each sample can belong to several clusters at the same time, to different degrees.

FCM minimizes the objective

$$J_m = \sum_{k=1}^{N} \sum_{i=1}^{C} u_{ik}^{m} \, d_{ik}^{2}$$

subject to:

$$\sum_{i=1}^{C} u_{ik} = 1, \quad u_{ik} \in [0, 1], \quad k = 1, \dots, N$$

where:

  1. C is the number of clusters.
  2. N is the number of data points.
  3. $u_{ik}$ is the fuzzy membership of the k-th data point in the i-th cluster.
  4. $d_{ik}$ is the Euclidean distance between the k-th data point and the center of the i-th cluster.
  5. m is a number greater than 1, called the fuzzy weighting factor, which defines the degree of fuzziness of the results (normally m = 2 is chosen, to get an analytical solution).

 

Distance from a sample point to a center:

$$d_{ik} = \| x_k - v_i \|$$

Fuzzy membership, i.e., the degree to which sample point k belongs to cluster i:

$$u_{ik} = \frac{1}{\sum_{j=1}^{C} \left( d_{ik} / d_{jk} \right)^{2/(m-1)}}$$

Computation of the centroid $v_i$ of cluster i:

$$v_i = \frac{\sum_{k=1}^{N} u_{ik}^{m} \, x_k}{\sum_{k=1}^{N} u_{ik}^{m}}$$

 

 

Algorithm steps:

  1. Choose the number of classes C, 2 ≤ C < n; choose m = 2. Initialize $U^{(0)}$ (start with some arbitrary values for the cluster centers and the corresponding partition matrix $U^{(0)}$).
  2. Calculate the cluster centers $v_i$.
  3. Calculate the new partition matrix $U^{(l+1)}$.
  4. Compare $U^{(l)}$ and $U^{(l+1)}$. If the variation of the membership degrees, calculated with an appropriate norm, is smaller than a given threshold, stop the algorithm; otherwise go back to step 2.
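A minimal NumPy sketch of this loop, directly implementing the update equations above; the function name and the tolerance are my own choices:

```python
import numpy as np

def fuzzy_cmeans(X, C, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means; X is an (N, d) array, C the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the partition matrix U with columns summing to 1.
    U = rng.random((C, len(X)))
    U /= U.sum(axis=0)
    for _ in range(max_iter):
        # Step 2: centers v_i = sum_k u_ik^m x_k / sum_k u_ik^m.
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        # Euclidean distances d_ik from every center to every point.
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        D = np.maximum(D, 1e-12)  # guard against division by zero
        # Step 3: new memberships u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).
        U_new = 1.0 / ((D[:, None, :] / D[None, :, :]) ** (2.0 / (m - 1.0))).sum(axis=1)
        # Step 4: stop when the membership change is below the threshold.
        if np.abs(U_new - U).max() < tol:
            return U_new, V
        U = U_new
    return U, V
```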

 

Pros and Cons

Pros:

  1. Unsupervised.
  2. Always converges.

Cons:

  1. Computationally expensive.
  2. Sensitive to the choice of the initial centers.
  3. Sensitive to noise.

 

Hierarchical Clustering

Training approaches:

  1. Agglomerative: start with each point as its own cluster; at every step, merge the two closest clusters, until only one cluster remains.
  2. Divisive: start with one cluster containing all data points; at every step, split off a cluster, until every cluster contains a single point.

Traditional hierarchical algorithms use a similarity matrix / distance matrix.

 

Pros:

  1. The number of clusters need not be specified;
  2. They may correspond to meaningful taxonomies.

Cons:

  1. Once a decision to merge two clusters is made, it cannot be undone.
  2. There is no loss function that is directly minimized.
  3. Possible problems:
    1. Sensitivity to noise and outliers;
    2. Difficulty handling clusters of different sizes and convex shapes;
    3. Breaking large clusters.

 

Agglomerative Clustering Algorithm

Algorithm steps:

  1. Compute the proximity matrix.
  2. Let each data point be a cluster.
  3. Repeat:
  4. Merge the two closest clusters.
  5. Update the proximity matrix.
  6. Until only a single cluster remains.

The key operation is computing the proximity of two clusters, which can be done in several ways (see the SciPy sketch after this list):

  1. Use the distance between the closest points of the two clusters (single link).
    1. Pro: can handle clusters with non-elliptical shapes.
    2. Con: sensitive to noise and outliers.
  2. Use the distance between the farthest points of the two clusters (complete link).
    1. Pro: less susceptible to noise and outliers.
    2. Cons:
      1. Tends to break large clusters;
      2. Biased towards globular clusters.
  3. Use the average distance between pairs of points from the two clusters (average link).
    1. Pro: less susceptible to noise and outliers.
    2. Con: biased towards globular clusters.
  4. Use the distance between the centroids of the two clusters.
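These options map directly onto SciPy's hierarchical clustering routines; a small sketch with made-up sample data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data: two loose groups in the plane (values made up for illustration).
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# method='single' is single link (closest pair between clusters);
# 'complete' is complete link, 'average' is average link, and
# 'centroid' uses the distance between cluster centroids.
Z = linkage(X, method='single')

# Cut the hierarchy into 2 flat clusters; scipy.cluster.hierarchy.dendrogram(Z)
# would draw the corresponding dendrogram.
print(fcluster(Z, t=2, criterion='maxclust'))  # e.g. [1 1 1 2 2 2]
```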

 

Summary

• Unsupervised learning explores the data to find intrinsic structures.
• Clustering analysis groups data points into classes, such that the intra-class similarity is high and the inter-class similarity is low.
• k-means is the most popular algorithm due to its simplicity and efficiency.
• Fuzzy c-means expresses the amount of ambiguity in human thinking.
• Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree.
• Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!

 

Problem 1

Use the k-means algorithm to cluster 8 examples into 3 clusters. The distance matrix based on the Euclidean distance is given below:

[Distance matrix over A1–A8: the values are not recoverable from the source.]

Suppose that the initial seeds are A1, A4 and A7. Run the k-means algorithm for 1 epoch only, and show the new clusters.

 

First iterate over all sample points and assign each to a cluster; then iterate over all clusters and compute each new center.

Seed1 = A1, Seed2 = A4, Seed3 = A7.
A2 ∈ Cluster3, A3 ∈ Cluster2, A5 ∈ Cluster2, A6 ∈ Cluster2, A8 ∈ Cluster2.
New clusters: Cluster1: {A1}, Cluster2: {A3, A4, A5, A6, A8}, Cluster3: {A2, A7}.

 

Problem 2

Distance-based classifier.

a) Design a minimum distance classifier with three classes using the following training data: Class 1: (-2,-1) and (-2,-3); Class 2: (2,1) and (2,-1); Class 3: (-2,1) and (-2,3). Then classify the vector (1,-2) with the trained classifier.

First compute the center of each class, then compute the distance from the point to each center, and assign the point to the class whose center is nearest. The computation is worked out below.
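Working this out from the given training data, the class means are

$$m_1 = (-2,-2), \quad m_2 = (2,0), \quad m_3 = (-2,2)$$

For the test vector $(1,-2)$, the distances to the means are $\|(3,0)\| = 3$, $\|(-1,-2)\| = \sqrt{5} \approx 2.24$, and $\|(3,-4)\| = 5$, so the vector is assigned to Class 2.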

 

b) The decision function for a minimum distance classifier is

$$d_j(x) = x^T m_j - \frac{1}{2} m_j^T m_j$$

where $m_j$ is the mean vector for class $\omega_j$. What is the value of the decision function for each of the three classes in sub-question a) for the test vector (0,-1)?

 

The decision function of each class:

Class 1: $d_1(x) = x^T(-2,-2)^T - 4$;
Class 2: $d_2(x) = x^T(2,0)^T - 2$;
Class 3: $d_3(x) = x^T(-2,2)^T - 4$.

For the test vector $x = (0,-1)$: $d_1(x) = -2$, $d_2(x) = -2$, and $d_3(x) = -6$.
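A quick NumPy check of these numbers (my own verification sketch, not part of the original solution):

```python
import numpy as np

means = {1: np.array([-2.0, -2.0]), 2: np.array([2.0, 0.0]), 3: np.array([-2.0, 2.0])}
x = np.array([0.0, -1.0])
for j, m in means.items():
    # d_j(x) = x^T m_j - (1/2) m_j^T m_j
    print(f"d{j}(x) =", x @ m - 0.5 * (m @ m))  # -2.0, -2.0, -6.0
```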

 

Problem 3

Use single link agglomerative clustering to group the data described by the following distance matrix. Show the dendrogram.

[Distance matrix over A, B, C, D: the values are not recoverable from the source.]

 

Agglomerative -> initially every point is in a cluster of its own, and we merge clusters until we end up with one unique cluster containing all points.

Single link: the distance between two clusters is the shortest distance between a pair of elements from the two clusters.

[Figure: dendrogram over A, B, C, D; the merge heights are not recoverable from the source.]

Comments: we start with each point in its own cluster. Merge {A} and {B}, since A and B are the closest pair: d(A, B). Merge {A, B} and {C}, since B and C are the closest pair: d(B, C). Finally, merge in D.
