Indeed, we used the docstring of the original implementation, which reused the DBSCAN description. However, the parameter here has a different meaning: it defines the core distance.
So we should make sure to update the corresponding docstrings in the file.
Describe the issue linked to the documentation
I find the description of the `min_samples` argument in sklearn.cluster.HDBSCAN confusing. It says "The number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself."
But if I understand everything correctly, `min_samples` corresponds to the $k$ used to compute the core distance $\text{core}_k(x)$ for every sample $x$, where the $k$-th core distance of a sample $x$ is defined as the distance to the $k$-th nearest neighbor of $x$ (counting itself). This is exactly what happens in the code here: https://github.com/scikit-learn-contrib/hdbscan/blob/fc94241a4ecf5d3668cbe33b36ef03e6160d7ab7/hdbscan/_hdbscan_reachability.pyx#L45-L47, where it is called `min_points`.
I don't understand how these two descriptions are equivalent. I would assume that other people might find that confusing as well.
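To make the two readings concrete, here is a minimal pure-Python sketch. It is not the scikit-learn or hdbscan implementation: `core_distances` and `is_core_dbscan` are hypothetical helper names, and the toy 1-D points and the `eps` value are made up for illustration. The first function follows the core-distance definition above (distance to the $k$-th nearest neighbor, counting the point itself); the second follows the DBSCAN-style wording quoted from the docstring, which additionally needs a radius `eps`.

```python
import math


def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbor.

    The point itself counts as the 1st neighbor (distance 0.0).
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    result = []
    for p in points:
        # Sorted distances from p to every point, including p itself.
        dists = sorted(math.dist(p, q) for q in points)
        result.append(dists[k - 1])
    return result


def is_core_dbscan(points, eps, min_samples):
    """DBSCAN-style reading: a point is a core point if at least
    min_samples points (itself included) lie within distance eps."""
    flags = []
    for p in points:
        neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
        flags.append(neighbors >= min_samples)
    return flags


# Hypothetical toy data: four points on a line.
points = [(0.0,), (1.0,), (3.0,), (7.0,)]

print(core_distances(points, 2))       # [1.0, 1.0, 2.0, 4.0]
print(is_core_dbscan(points, 2.0, 2))  # [True, True, True, False]
```

In this sketch the two readings agree only once a radius is fixed: a point is a DBSCAN-style core point for `(eps, min_samples)` exactly when its core distance for `k = min_samples` is at most `eps`. The HDBSCAN parameter has no `eps` attached, which is why describing it as a neighborhood count can be confusing.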
Link in Code: scikit-learn/sklearn/cluster/_hdbscan/hdbscan.py, lines 441 to 444 in 8721245
Link in Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
Suggest a potential alternative/fix
No response