
Soft K-Means clustering #3453

Open
joangog opened this issue May 17, 2024 · 1 comment


joangog commented May 17, 2024

Is there a way for K-Means to return an N×K matrix of probabilities, i.e. the probability that each of the N points belongs to each of the K clusters?


hammad7 commented May 22, 2024

@joangog,
To get the N×K matrix, you can apply a softmax on top of the (negated) distances returned by kmeans.index.search(). Note that with the default flat index these are squared L2 distances. Here is working code for the same:

import faiss
import numpy as np


def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e_x / e_x.sum(axis=1, keepdims=True)


def soft_prob(kmeans, data):
    centroids = kmeans.centroids

    assert centroids.ndim == 2, "Centroids must be a 2D array"
    assert data.ndim == 2, "Data must be a 2D array"

    # Search with k = number of centroids so every point gets a (squared L2)
    # distance to every centroid, ordered from nearest to farthest.
    # Equivalent NumPy computation of the squared distances:
    # distances = ((data[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)
    distances, assignments = kmeans.index.search(data, len(centroids))

    # Probability of each data point belonging to each cluster:
    # smaller distance -> larger probability.
    probabilities = softmax(-distances)

    for i in range(len(probabilities)):
        # search() orders each row by distance; reorder row i so that
        # column k holds the probability of cluster k.
        current_assignments = assignments[i]
        sorted_row = [probabilities[i][j] for j in np.argsort(current_assignments)]
        probabilities[i] = sorted_row

    return probabilities
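The per-row reordering loop can also be done in a single vectorized scatter. A minimal sketch using only NumPy (the helper name reorder_by_cluster is mine, not part of faiss), assuming probabilities and assignments both have shape (N, K):

```python
import numpy as np


def reorder_by_cluster(probabilities, assignments):
    """Scatter each row so that column k holds the probability of cluster k.

    probabilities[i, m] is the probability for cluster assignments[i, m];
    this places it at column assignments[i, m] of the output instead.
    """
    out = np.empty_like(probabilities)
    rows = np.arange(probabilities.shape[0])[:, None]  # (N, 1), broadcasts over K
    out[rows, assignments] = probabilities
    return out
```

This avoids the Python-level loop, which matters once N is large.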

Usage:

# Generate dummy data
d = 10
n = 100
k = 5
np.random.seed(1234)
x = np.random.random((n, d)).astype('float32')

# Perform k-means clustering
kmeans = faiss.Kmeans(d, k, niter=25)
kmeans.train(x)

print(soft_prob(kmeans,x))

Output:
[[0.17769898 0.22218941 0.1265448 0.29491347 0.17865327]
[0.31340864 0.17389828 0.18536964 0.18793808 0.13938534]
[0.12620465 0.1935667 0.19170085 0.19852394 0.29000384]
...
...
[0.20580962 0.2837541 0.12138043 0.17246573 0.21659008]
[0.15004209 0.16144404 0.2073633 0.18762918 0.29352143]
[0.27158892 0.2264123 0.17823231 0.09824435 0.22552218]]
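A quick sanity check on any such probability matrix: each row should sum to 1, and the most probable cluster in each row should be the nearest centroid (the hard k-means assignment). A self-contained sketch with hand-made squared distances, no faiss required:

```python
import numpy as np


def softmax(x):
    """Row-wise softmax, numerically stabilized."""
    e_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e_x / e_x.sum(axis=1, keepdims=True)


# Tiny hand-made example: 2 points, 3 centroids, squared distances.
distances = np.array([[0.1, 2.0, 3.0],
                      [4.0, 0.5, 1.0]])
probs = softmax(-distances)

# Each row is a valid probability distribution.
assert np.allclose(probs.sum(axis=1), 1.0)
# The most probable cluster is the nearest centroid.
assert (probs.argmax(axis=1) == distances.argmin(axis=1)).all()
```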
