`min_samples` in HDBSCAN #28976

schae211 · 2024-05-08T14:15:59Z

Describe the issue linked to the documentation

I find the description of the min_samples argument in sklearn.cluster.HDBSCAN confusing.

It says "The number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself."

But if I understand everything correctly, `min_samples` corresponds to the $k$ used to compute the core distance $\text{core}_k\left(x\right)$ for every sample $x$, where the core distance of a sample $x$ is defined as the distance to the $k$'th nearest neighbor of $x$ (counting $x$ itself). This is exactly what happens in the code here, where the parameter is called `min_points`: https://github.com/scikit-learn-contrib/hdbscan/blob/fc94241a4ecf5d3668cbe33b36ef03e6160d7ab7/hdbscan/_hdbscan_reachability.pyx#L45-L47
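To make the definition above concrete, here is a minimal sketch of the core distance as described in this comment, with each point counted as its own first neighbor. The helper name `core_distances` and the toy points are made up for illustration, and whether the linked Cython code uses exactly this indexing is not verified here.

```python
import math


def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbor.

    Assumption (taken from the comment above): the point itself counts
    as its own 1st nearest neighbor, at distance 0.
    """
    result = []
    for p in points:
        # Sorted distances from p to every point, including p itself (0.0 first).
        dists = sorted(math.dist(p, q) for q in points)
        result.append(dists[k - 1])
    return result


# Hypothetical toy data: four points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]

print(core_distances(points, k=1))  # [0.0, 0.0, 0.0, 0.0]
print(core_distances(points, k=2))  # [1.0, 1.0, 2.0, 4.0]
print(core_distances(points, k=3))  # [3.0, 2.0, 3.0, 6.0]
```

Under this reading, `k` is not a threshold on how many neighbors a point has but the rank of the neighbor whose distance is recorded.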

I don't understand how both of these descriptions are equivalent. I would assume that other people might find that confusing as well.

Link in Code:

scikit-learn/sklearn/cluster/_hdbscan/hdbscan.py

Lines 441 to 444 in 8721245

    
    min_samples : int, default=None
        The number of samples in a neighborhood for a point
        to be considered as a core point. This includes the point itself.
        When `None`, defaults to `min_cluster_size`.

Link in Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN

Suggest a potential alternative/fix

No response


adrinjalali · 2024-05-13T09:49:50Z

cc @Micky774

glemaitre · 2024-05-15T10:33:58Z

Indeed, we used the docstring of the original implementation, which reused the DBSCAN description. However, the parameter here has a different meaning: it defines the core distance.

So we should make sure to change the various docstrings in the file.
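As a rough illustration of how the two readings relate, the sketch below compares a DBSCAN-style core-point test at a fixed radius `eps` with the core distance defined earlier in the thread. The function names and toy data are hypothetical, and it assumes that a point counts itself and that the neighborhood includes its boundary. Under those assumptions a point passes the DBSCAN-style test exactly when its core distance is at most `eps`; HDBSCAN has no `eps` to fix, which is presumably why the DBSCAN wording reads oddly there.

```python
import math


def is_core_point(points, i, eps, min_samples):
    """DBSCAN-style test: at least min_samples points, including
    points[i] itself, lie within distance eps of points[i]."""
    return sum(math.dist(points[i], q) <= eps for q in points) >= min_samples


def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor,
    counting points[i] itself as the 1st (assumption from the thread)."""
    return sorted(math.dist(points[i], q) for q in points)[k - 1]


# Hypothetical toy data: four points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
k = 3

# For every radius and every point, the two formulations agree.
for eps in (1.0, 2.0, 3.0, 6.0):
    for i in range(len(points)):
        assert is_core_point(points, i, eps, k) == (core_distance(points, i, k) <= eps)
```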

schae211 added the Documentation and Needs Triage labels on May 8, 2024

glemaitre added the help wanted label and removed the Needs Triage label on May 15, 2024
