what options to use for sampling train data of size around 2 mil records #2653
Unanswered

algomaschine asked this question in Q&A
Replies: 1 comment
- Do you mean "kicking in"? And why do you assume that sampling is not working?
Gents, I'm trying to use the approach below on a large dataset of about 2 million records, but it doesn't help. Can I tweak some random tree sampling parameters to get this to work? Any good suggestions, please? I know CatBoost samples by default, but it seems like the sampling is not kicking in. Thanks for your advice in advance!
```python
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Assuming X, y are already defined
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)

# Initialize the CatBoostClassifier with more advanced parameters
best_model = CatBoostClassifier(
    random_seed=777,
    iterations=20000,                    # Fine-grained learning over more iterations
    learning_rate=0.005,                 # Gradual convergence
    depth=8,                             # Capture complex patterns with deeper trees
    l2_leaf_reg=3,                       # Regularization to mitigate overfitting
    bootstrap_type='Bernoulli',
    subsample=0.8,
    border_count=128,                    # Increase the number of splits for numerical features
    feature_border_type='GreedyLogSum',  # Advanced feature split method
    grow_policy='Lossguide',             # Tree growing policy
    leaf_estimation_method='Newton',     # More accurate leaf estimation at the cost of computation
    auto_class_weights='Balanced',       # Automatically balance class weights
    verbose=500,
    early_stopping_rounds=300,
    thread_count=-1                      # Use all available CPU cores
)

# Fit the model
best_model.fit(
    X_train, y_train,
    eval_set=(X_test, y_test),
    plot=True  # Enable plotting if using a compatible environment
)
```
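For reference, with `bootstrap_type='Bernoulli'` and `subsample=0.8` as configured above, each row is kept independently with probability 0.8 when a tree is built, so each tree sees roughly 80% of the 2M rows. A minimal NumPy sketch of that sampling scheme (this is an illustration of the concept, not CatBoost's internal implementation):

```python
import numpy as np

def bernoulli_subsample(n_rows, subsample, rng):
    """Return the row indices one tree would train on:
    each row is kept independently with probability `subsample`."""
    mask = rng.random(n_rows) < subsample
    return np.flatnonzero(mask)

rng = np.random.default_rng(777)
idx = bernoulli_subsample(2_000_000, 0.8, rng)

# For large n, the kept fraction concentrates tightly around `subsample`
kept_fraction = len(idx) / 2_000_000
print(kept_fraction)
```

If the kept fraction is close to `subsample` as expected, the sampling itself is working; a lack of improvement in metrics is then more likely a modeling issue than a sampling one.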