
Soft K-Means clustering #3453

Open
joangog opened this issue May 17, 2024 · 1 comment


joangog commented May 17, 2024

Is there a way for K-Means to return an N×K matrix of probabilities, i.e. the probability that each of the N points belongs to each of the K clusters?


hammad7 commented May 22, 2024

@joangog,
To get the N×K matrix, you can apply a softmax on top of the (negated) distances returned by kmeans.index.search(). Note that with the default flat index these are squared L2 distances. Here is working code for the same:

import faiss
import numpy as np


def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e_x / e_x.sum(axis=1, keepdims=True)


def soft_prob(kmeans, data):
    centroids = kmeans.centroids

    assert centroids.ndim == 2, "Centroids must be a 2D array"
    assert data.ndim == 2, "Data must be a 2D array"

    # Search with k = number of centroids so every point gets a (squared L2)
    # distance to every centroid, ordered from nearest to farthest.
    # Equivalent NumPy computation of the squared distances:
    # distances = ((data[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)
    distances, assignments = kmeans.index.search(data, len(centroids))

    # Probability of each data point belonging to each cluster:
    # smaller distance -> larger probability.
    probabilities = softmax(-distances)

    for i in range(len(probabilities)):
        # search() orders each row by distance; reorder row i so that
        # column k holds the probability of cluster k.
        current_assignments = assignments[i]
        sorted_row = [probabilities[i][j] for j in np.argsort(current_assignments)]
        probabilities[i] = sorted_row

    return probabilities
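The per-row reordering loop can also be done in a single vectorized scatter. A minimal sketch using only NumPy (the helper name reorder_by_cluster is mine, not part of faiss), assuming probabilities and assignments both have shape (N, K):

```python
import numpy as np


def reorder_by_cluster(probabilities, assignments):
    """Scatter each row so that column k holds the probability of cluster k.

    probabilities[i, m] is the probability for cluster assignments[i, m];
    this places it at column assignments[i, m] of the output instead.
    """
    out = np.empty_like(probabilities)
    rows = np.arange(probabilities.shape[0])[:, None]  # (N, 1), broadcasts over K
    out[rows, assignments] = probabilities
    return out
```

This avoids the Python-level loop, which matters once N is large.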

Usage:

# Generate dummy data
d = 10
n = 100
k = 5
np.random.seed(1234)
x = np.random.random((n, d)).astype('float32')

# Perform k-means clustering
kmeans = faiss.Kmeans(d, k, niter=25)
kmeans.train(x)

print(soft_prob(kmeans,x))

Output:
[[0.17769898 0.22218941 0.1265448 0.29491347 0.17865327]
[0.31340864 0.17389828 0.18536964 0.18793808 0.13938534]
[0.12620465 0.1935667 0.19170085 0.19852394 0.29000384]
...
...
[0.20580962 0.2837541 0.12138043 0.17246573 0.21659008]
[0.15004209 0.16144404 0.2073633 0.18762918 0.29352143]
[0.27158892 0.2264123 0.17823231 0.09824435 0.22552218]]
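A quick sanity check on any such probability matrix: each row should sum to 1, and the most probable cluster in each row should be the nearest centroid (the hard k-means assignment). A self-contained sketch with hand-made squared distances, no faiss required:

```python
import numpy as np


def softmax(x):
    """Row-wise softmax, numerically stabilized."""
    e_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e_x / e_x.sum(axis=1, keepdims=True)


# Tiny hand-made example: 2 points, 3 centroids, squared distances.
distances = np.array([[0.1, 2.0, 3.0],
                      [4.0, 0.5, 1.0]])
probs = softmax(-distances)

# Each row is a valid probability distribution.
assert np.allclose(probs.sum(axis=1), 1.0)
# The most probable cluster is the nearest centroid.
assert (probs.argmax(axis=1) == distances.argmin(axis=1)).all()
```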
