what options to use for sampling train data of size around 2 mil records #2653
Unanswered

algomaschine asked this question in Q&A
Replies: 1 comment
- Do you mean "kicking in"? And why do you assume that sampling is not working?
Gents, I'm trying to use the approach below on a large dataset of about 2 million records, but it doesn't help. Can I tweak some random tree sampling parameters to get this to work? Any good suggestions, please? I know CatBoost samples by default, but it seems like the sampling is not kicking in. Thanks for your advice in advance!
```python
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Assuming X, y are already defined
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)

# Initialize the CatBoostClassifier with more advanced parameters
best_model = CatBoostClassifier(
    random_seed=777,
    iterations=20000,                    # Fine-grained learning over more iterations
    learning_rate=0.005,                 # Gradual convergence
    depth=8,                             # Capture complex patterns with deeper trees
    l2_leaf_reg=3,                       # Regularization to mitigate overfitting
    bootstrap_type='Bernoulli',
    subsample=0.8,
    border_count=128,                    # Increase the number of splits for numerical features
    feature_border_type='GreedyLogSum',  # Advanced feature split method
    grow_policy='Lossguide',             # Tree growing policy
    leaf_estimation_method='Newton',     # More accurate leaf estimation at the cost of computation
    auto_class_weights='Balanced',       # Automatically balance class weights
    verbose=500,
    early_stopping_rounds=300,
    thread_count=-1                      # Use all available CPU cores
)

# Fit the model
best_model.fit(
    X_train, y_train,
    eval_set=(X_test, y_test),
    plot=True  # Enable plotting if using a compatible environment
)
```
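For reference, with `bootstrap_type='Bernoulli'` and `subsample=0.8` as configured above, each row is kept independently with probability 0.8 when a tree is built, so each tree sees roughly 80% of the 2M rows. A minimal NumPy sketch of that sampling scheme (this is an illustration of the concept, not CatBoost's internal implementation):

```python
import numpy as np

def bernoulli_subsample(n_rows, subsample, rng):
    """Return the row indices one tree would train on:
    each row is kept independently with probability `subsample`."""
    mask = rng.random(n_rows) < subsample
    return np.flatnonzero(mask)

rng = np.random.default_rng(777)
idx = bernoulli_subsample(2_000_000, 0.8, rng)

# For large n, the kept fraction concentrates tightly around `subsample`
kept_fraction = len(idx) / 2_000_000
print(kept_fraction)
```

If the kept fraction is close to `subsample` as expected, the sampling itself is working; a lack of improvement in metrics is then more likely a modeling issue than a sampling one.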