
SLEP005: Resampler API #15

Open · wants to merge 27 commits into main

Conversation

@glemaitre (Member) commented Mar 1, 2019

Here is the SLEP regarding the Outlier Rejection API

@amueller (Member) commented Mar 1, 2019

I was expecting "the fit_resample SLEP" to be about imbalanced data. I think subsampling is a much more common issue than trying to automatically remove outliers.

Do you have a real-world application where this would help? ;)

@GaelVaroquaux (Member)
Indeed, as I was telling Guillaume IRL, I think that this SLEP should be about fit_resample rather than outlier rejection, and outlier rejection should be listed as an application. I am not convinced that proposing a new mixin calls for a SLEP in itself; it's the addition of the fit_resample method that calls for the SLEP.

@glemaitre (Member, Author)
OK, so I will modify the SLEP to make it about resamplers. Basically, I should keep the part about the pipeline implementation and its limitations. @orausch made a good suggestion.

@glemaitre (Member, Author)
Do you have a real-world application where this would help? ;)

@amueller This is indeed a really good point which should be demonstrated in the outlier PR.

@chkoar commented Mar 2, 2019

Will resamplers be applied in transform? I'm confused.

@glemaitre (Member, Author)
Will resamplers be applied in transform? I'm confused.

Nope, a resampler will implement fit_resample and will not implement fit_transform or transform.
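
For illustration, a minimal sketch of what such a resampler could look like under the proposed API (the class below is hypothetical, not existing scikit-learn code):

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.ensemble import IsolationForest

class OutlierRejector(BaseEstimator):
    """Drops samples flagged as outliers; exposes fit_resample only."""

    def __init__(self, detector=None):
        self.detector = detector

    def fit_resample(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        detector = self.detector if self.detector is not None else IsolationForest()
        mask = detector.fit_predict(X) == 1   # IsolationForest: 1 = inlier, -1 = outlier
        return X[mask], y[mask]
```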

@chkoar commented Mar 2, 2019

I wonder because of this 4ecc51b#diff-fa5683f6b9de871ce2af02905affbdaaR80

@orausch commented Mar 2, 2019

I wonder because of this 4ecc51b#diff-fa5683f6b9de871ce2af02905affbdaaR80

Sorry, that is a mistake on my part. They should not be applied on transform.

EDIT: and accordingly, during fit_transform, resamplers are only applied during the fit, and not the transform.

@chkoar commented Mar 2, 2019

EDIT: and accordingly, during fit_transform, resamplers are only applied during the fit, and not the transform.

Yes, this makes sense.

@orausch commented Mar 2, 2019

Currently, my PR applies resamplers on transform (and so does imblearn). It seems that it is not trivial to change the behavior so that, for fit_transform, resamplers are only applied during the fit, and not during the transform. Currently, the pipeline is efficient in the sense that each transformer is only called once during a fit_transform call.

To get the behavior described above, a naive implementation would have to call each transformer after the first resampler twice: once for the fit path, where we apply resamplers, and once for the transform path, where we don't. It seems to me that, in order to do it efficiently, we would need some knowledge of what samples were added/removed by the resamplers in the pipeline.

If we want to make transform not apply resamplers, we would have to address this problem in the pipeline.

This brings me to the next thought: does it even make sense to have resamplers in a transformer-only pipeline? Is there a good use case? One choice would be to simply disallow this behavior (similarly to how we disallow resamplers for fit_predict).

@glemaitre (Member, Author)
So I updated the SLEP towards more of a fit_resample discussion.

I realised (even if it is obvious) that outlier rejection is unsupervised, while resampling for balancing is supervised (and, AFAIK, for binary/multiclass classification). In the latter case, a resampler will need to validate the targets, and we will need to define an API for driving the resampling (i.e. sampling_strategy in imblearn).

Should this API choice be discussed within the SLEP as well, or is it specific to one type of resampler and better handled later on?

@glemaitre (Member, Author)
Currently, my PR applies resamplers on transform (and so does imblearn).

https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/pipeline.py#L487

We are skipping the resampler during transform.

@glemaitre (Member, Author)
@orausch I am a bit confused with your comment.

When calling fit_transform on a pipeline, we will call fit_transform or fit_resample of all estimators but the last. The last will then call fit().transform() or fit_transform().

For transform on a pipeline, we will call transform of all estimators but the last, skipping the ones with fit_resample. Finally, we call transform of the last estimator.
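
Roughly, in code (a simplified sketch of that dispatch; no cloning, caching, or parameter routing, and not the actual imblearn implementation):

```python
def pipeline_fit_transform(steps, X, y):
    # used on the fit / fit_transform path
    for est in steps[:-1]:
        if hasattr(est, "fit_resample"):
            X, y = est.fit_resample(X, y)     # resamplers may change X *and* y
        else:
            X = est.fit_transform(X, y)
    return steps[-1].fit_transform(X, y)

def pipeline_transform(steps, X):
    # used on the transform / predict path
    for est in steps[:-1]:
        if hasattr(est, "fit_resample"):
            continue                          # resamplers are skipped at transform time
        X = est.transform(X)
    return steps[-1].transform(X)
```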

Could you explain where exactly you think we are applying the resampling? Maybe I am missing something.

@orausch commented Mar 2, 2019

We are skipping the resampler during transform.

Sorry, I missed that. I think there is a problem regardless.

When calling fit_transform on a pipeline, we will call fit_transform or fit_resample of all estimators but the last. The last will then call fit().transform() or fit_transform().

If we do this, we have inconsistent behavior between pipe.fit_transform(X, y) and pipe.fit(X, y).transform(X). For example consider:

X = [A, B, C] # A, B, C are feature vectors 
y = [a, b, c]
pipe = make_pipeline(removeB, mult2) # removeB is a resampler that will remove B, mult2 is a transformer

Then pipe.fit_transform(X, y) gives [2A,2C] but pipe.fit(X, y).transform(X) gives [2A, 2B, 2C]
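
To make that concrete with throwaway stand-ins (the classes are purely illustrative, and the two code paths are spelled out by hand since Pipeline does not currently support fit_resample):

```python
import numpy as np

class RemoveB:                       # a "resampler": drops the second sample
    def fit_resample(self, X, y):
        keep = [0, 2]
        return X[keep], y[keep]

class Mult2:                         # a stateless transformer
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return 2 * X
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

X = np.array([[1.0], [2.0], [3.0]])  # samples A, B, C
y = np.array([0, 1, 0])

# fit_transform path: the resampler runs, B is dropped
Xr, yr = RemoveB().fit_resample(X, y)
print(Mult2().fit_transform(Xr, yr))      # output corresponds to [2A, 2C]

# fit(...).transform(X) path: the resampler is skipped at transform time
print(Mult2().fit(Xr, yr).transform(X))   # output corresponds to [2A, 2B, 2C]
```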

@chkoar commented Mar 2, 2019

At least in the imbalanced setting, the last step will usually be either a classifier or a resampler, not a transformer. I suppose the same happens in outlier rejection. In your example it makes sense to resample also in the pipeline's transform, right? But what if you transform again with some new X2? Does it make sense to resample X2 when you only need the transformation? It seems like you only need to resample in transform when you pass the same fitted data again, in order not to break the contract. Thoughts?

@glemaitre changed the title from "SLEP005: Outlier Rejection API" to "SLEP005: Resampler API" on Mar 3, 2019
@orausch commented Mar 4, 2019

Some options to address this:

  • Forbid calling fit_transform on a pipeline containing resamplers. (like we do with fit_predict).
  • Define the API so that transform applies resamplers (then the approach from imblearn works).

Or we can implement the behavior that for fit_transform, we apply resamplers on fit, but not on transform. Here I can also see two options

  • Implement the behavior, but do it inefficiently (simply call fit and then transform).
  • Do it efficiently. Like I mentioned earlier, it seems to me that to do this we need some knowledge of which samples were added/removed by the resamplers. Perhaps something like return_idx as an optional kwarg to fit_resample: if the resampler supports returning indices, we do it efficiently; if not, we call fit, then transform.

Let me know if I missed something.

@glemaitre (Member, Author)
@amueller @jnothman @adrinjalali @GaelVaroquaux @agramfort Any thoughts about the issue raised by @orausch in #15 (comment)?

@adrinjalali (Member)
I don't think there's one solution which works for all use cases; here are two real-world examples (and I hope they're convincing):

  1. In the context of FAT-ML, assume the following pipeline:
  • resample to tackle class imbalance
  • mutate the data to make it more "fair" (which may or may not touch y)
  • usual transformers
  • an estimator

In the above pipeline, during fit I'd like the first two steps to be on, and during predict I'd like them to be off, which we can do if the first two steps are resamplers and they're automatically skipped during predict.

  2. I get periodic data from a bunch of sensors installed in a factory, and I need to do predictive maintenance. Sometimes the data is clearly off the charts, and I know from domain knowledge that I can and should safely ignore those points. The pipeline would look like:
  • exclude outliers
  • usual transformers
  • predict faulty behavior

In contrast to the previous use case, now I'd like the first step to be always on, because I need to exclude those data points from my analysis.

Also, we should consider the fact that once we have this third type of model, we'd have at least three types of pipelines, i.e. estimator, transformer, and resampler pipelines. I feel like this fact, plus the above two use cases, would justify a parameter like resample_on_transform for the pipeline, to tune the behavior of the pipeline regarding the resamplers among its steps. I'm not sure if it completely solves our issues, but it may.

For instance, if the user wants to mix these behaviors, they can put different resamplers into different pipelines, set the resample_on_transform of each pipeline appropriately, and include those pipelines in their main pipeline.

I haven't managed to completely think these different cases through, but I hope I could convey the idea.

@amueller (Member) commented Mar 5, 2019 via email

@amueller (Member) commented Apr 7, 2020

I think @GaelVaroquaux and @glemaitre would have to weigh in on what their thoughts are on SLEP 1 vs SLEP 5. Right now I'm leaning towards SLEP 1. Option A is basically saying "we don't need to address this issue", i.e. that the current API is sufficient. I think this is unlikely to be the solution of choice. It has been the default solution because we couldn't agree on any other so far.

@jnothman (Member)
Right now I'm leaning towards SLEP 1

Does that mean, @amueller, that you prefer its text/discussion, but perhaps not its proposals?? I'm confused by your comment.

I think the core of the current proposal(s) is that there is a new kind of estimator (resampler) that can be plugged into some kind of meta-estimator, which is either a Pipeline extended to handle such things, or a specialised meta-estimator class.

@amueller (Member) commented Apr 20, 2020

As I said in #15 (comment), I think there are two core issues: distinguishing training and test phases, and returning more complex objects. Neither of these seems closely related to sampling, so I think using the framing of this SLEP is confusing. It doesn't even mention these two core changes explicitly.
So I guess I like the discussion in SLEP 1 more. But I also think that adding a new method and a specialized meta-estimator seems like the worst of both worlds. I think adding a new method that does both (is only called during fitting and returns a tuple) and supporting it wherever necessary would be best (i.e. modify pipeline and other meta-estimators).

If we don't call it fit_resample but instead call it fit_modify then we could simplify our stacking meta-estimator which would be sweet (that requires only the distinction of fitting and prediction phase, not returning more complex objects, same for impact/target encoding which is basically a variant of stacking).

While there might be a use case for trans_modify as proposed in SLEP 1, I think that's a bit confusing, as its only use would be to have a more complex return value; it would otherwise behave as transform.

@jnothman (Member) commented Apr 21, 2020 via email

@amueller (Member) commented May 2, 2020

I think the semantics of having something in a pipeline that does something different during training and prediction is not that complicated, but the current scikit-learn API makes it really hard to reason about it because it doesn't distinguish training and test time transformations.

I wasn't suggesting to change the semantics, but I don't think you need anything more general for stacking. The current proposal wants things not to be a resampler and a transformer at the same time, but I don't see why this would be necessary.

I can send along a paper I recently worked on that compares several pipelining languages, and as I said in other places, for MLR this is no problem at all, and I think the semantics are perfectly clear. Mostly because they differentiate between training-time and test-time transformations, and because they don't have fit_transform, which has some pretty weird semantics tbh.

If we were not mixing the two issues of returning targets and distinguishing training and test transformations, I would suggest we remove fit_transform and only have fit_modify, but it would probably be too annoying to always return y and sample weights etc.

So re the minimal proposal: I think a pretty minimal proposal would be to add fit_modify, which is called in a pipeline during fit if it exists and returns X, y, ..., and we'd prohibit fit_transform or fit_predict on the pipeline. That would allow the implementation of stacking with a transformer (with the fit_modify API) and it seems more minimal than this SLEP.
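
For concreteness, a rough sketch of what a stacking step with fit_modify could look like; the method name and the pipeline behaviour are just the proposal above, nothing here is existing scikit-learn API:

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.model_selection import cross_val_predict

class StackedFeatures(BaseEstimator):
    """Appends base-model probabilities as extra features."""

    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def fit_modify(self, X, y, **props):
        # training time: out-of-fold probabilities, so y is not leaked downstream
        proba = cross_val_predict(clone(self.base_estimator), X, y,
                                  method="predict_proba")
        self.estimator_ = clone(self.base_estimator).fit(X, y)
        return np.hstack([X, proba]), y   # returns X *and* y (could also return props)

    def transform(self, X):
        # prediction time: probabilities from the estimator fitted on the full training set
        return np.hstack([X, self.estimator_.predict_proba(X)])
```

A hypothetical pipeline would call fit_modify on such a step during fit and plain transform at prediction time.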

@jnothman (Member)
I can send along a paper I recently worked on that compares several pipelining languages

I'd be interested. I agree that these issues come from the design of fit_transform, fit returning self, transform returning only Xt, etc.

So re the minimal proposal: I think a pretty minimal proposal would be to add fit_modify, which is called in a pipeline during fit if it exists and returns X, y, ..., and we'd prohibit fit_transform or fit_predict on the pipeline. That would allow the implementation of stacking with a transformer (with the fit_modify API) and it seems more minimal than this SLEP.

And at test-time any estimators with fit_modify are treated as the identity transform (and hence fit_modify must return X with same columns as the input, which is one reason why fit_resample was a decent name)? I don't see this as being very different to the current proposal. As well as needing a mechanism to disable fit_transform, you still need to deal with the interpretation of props returned by fit_modify and the routing of props to be used or modified by this Modifier. But I'd be happy to see an implementation of this change and see if we can get this one rolling.

@amueller (Member)
And at test-time any estimators with fit_modify are treated as the identity transform

No, transform (or modify) can still do arbitrary things. For stacking it would not be the identity, it would be predict_proba of the base-estimator.

I don't see this as being very different to the current proposal. As well as needing a mechanism to disable fit_transform, you still need to deal with the interpretation of props returned by fit_modify and the routing of props to be used or modified by this Modifier.

Yes, I agree, you still need to do all of these things. The thing is really awkward because the question is whether you'd also have a modify, which would be the same as transform but without the y or sample props. And you only need to disable fit_transform on meta-estimators that contain fit_modify, which doesn't seem so hard?

In an alternate universe, we could have transform just return an xarray dataset, we'd break the fit().transform() == fit_transform() convention and everything would be fine ;)

Not sure if I'll get around to prototype this, I still have a book to write this week, or something.

@jnothman (Member)
@orausch's implementation of Pipeline handling fit_resample, close to your proposal, is https://github.com/scikit-learn/scikit-learn/pull/13269/files/61fce479352f82b7f3d1136aac24d5598748ec73#diff-dd32b47cafa79252b520b030724ddda9. I think some things there in the implementation could be simplified. There are some things there not yet handled such as resamplers returning kwargs/props.

A remaining question in your design is whether "modifiers" (please could we stick to "resamplers" for now and later propose the name change to modify?) must implement transform to be used at test time. E.g. whether the ResamplerMixin should implement an identity transform as a default?
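
A sketch of the "identity transform as a default" option (ResamplerMixin does not exist in scikit-learn; this only illustrates the open question):

```python
class ResamplerMixin:
    """Mixin for estimators implementing fit_resample."""

    def fit_resample(self, X, y, **props):
        # subclasses implement _fit_resample and return (X_res, y_res[, props])
        return self._fit_resample(X, y, **props)

    def transform(self, X):
        # default test-time behaviour: act as the identity on samples
        return X
```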

@amueller (Member)
Yes I would say they should. I don't see why they wouldn't. Sure, we could special-case it and have the pipeline and other meta-estimators handle it, but I don't see why it would be better to put the logic there instead of putting it in the estimator.

One thing that is a bit strange right now is the relation of a "resampler" to estimators that only have fit_transform but no transform. That means the step doesn't apply to the test set, so there is no point in distinguishing training and test set, even though conceptually what they do is closer to transforming the training set, so fit_modify might be more appropriate. I guess in theory you could come up with a case where you want to return X, y, and sample props for something that only applies to the training set, and then fit_modify would make sense to have without having a transform.

So "resampler" to you is something that returns X and y and sample props and has different behavior during training and fitting? I'm fine using that terminology as long as we agree what it means.

@amueller (Member)
The downside of that proposal is that you can't use a TargetEncoder or stacking and then use TSNE, because we don't allow calling fit_transform because it wouldn't be the same as fit().transform().

If TSNE implemented fit_resample it would actually work (though with the resample name it would be weird). For using DBSCAN I don't see how to fix it unless we also add a new verb for predicting on the training set.

@jnothman (Member)
The downside of that proposal is that you can't use a TargetEncoder or stacking and then use TSNE, because we don't allow calling fit_transform because it wouldn't be the same as fit().transform().

You can call fit and rely on the stored embeddings_. But I don't really understand why you'd use stacking as a transformer before TSNE..?

I think it is appropriate not to use a Pipeline when you want predictions over something other than the input data. For TSNE and DBSCAN after resampling, you'd likely want to either: keep a copy of the data in the resampled space (i.e. break the pipeline before TSNE/DBSCAN), or have a way to impute the TSNE/DBSCAN embeddings back onto the original full dataset, via something like inverse_transform, or both.

@amueller (Member)
You can call fit and rely on the stored embeddings_. But I don't really understand why you'd use stacking as a transformer before TSNE..?

Using the attribute works, though having to change the code if you add something with fit_resample (or whatever it's called) in the pipeline seemed just a bit strange. It wasn't really a realistic example ;) I was trying to come up with an example of a pipeline with something that transforms differently on the training data, followed by something that can only produce output on the training data.

I'm not sure if it would actually ever be an issue in practice, but conceptually it seems a bit ugly, and I find the mental model hard to describe and communicate.
In hindsight, I think it was a bad idea to use fit_transform and fit_predict for things that only apply on the training set if we want to keep the fit_transform == fit().transform().

I think something like @orausch's implementation probably solves most issues in some way. I guess I was hoping for something that results in a simple and easy to understand overall API but that would probably require substantial backward incompatible changes and so maybe that's more appropriate for sklearn 2.0 ;)

What part do you think could be simplified? The changes to the pipeline are pretty small, right?
Oh, an interesting question is: what do the other meta-estimators do? Say ColumnTransformer? The default case for TargetEncoding is within a column transformer, but for resampling it doesn't make sense. I guess anything that changes y doesn't make sense in a column transformer, but things that are different between training and testing do make sense.

@amueller (Member)
Notes from the dev meeting: @jnothman felt that my concern was unrelated to resampling and only relates to fit_transform != fit().transform(), as it doesn't change the sample alignment between input and output.

I felt like resampling is "returning y" + "doing something else on the training set". Can you maybe say how the alignment comes in?

@jnothman (Member) commented Jun 29, 2020

I felt like resampling is "returning y" + "doing something else on the training set". Can you maybe say how the alignment comes in?

A core assumption in [fit_]transform, [fit_]predict, predict_proba, etc. is that the return value has samples that are mapped directly to the input X. This allows the predictions to then be scored against some held out y. This is true for the transformation in stacking and target encoding, but not in resampling, sample reduction, sample rejection or dataset augmentation, which are the focus of this PR.

One reason I like the verb "resample" here is because we are literally changing the set of samples that downstream estimators receive features of. (I'm intentionally distinguishing a "sample" from its features, to focus on the identity of the sample rather than the observation of it.) The fact that we need to be able to modify y, and the fact that we cannot perform this operation at test time are corollaries of changing the set of samples we are operating on.

I hope this makes sense.

@jnothman (Member)
Also: As discussed at today's meeting, I proposed the meta-estimator, ResampledTrainer, because while it seems natural that a resampler should occur in a Pipeline, the fact that it changes the sample set makes its semantics a little hard to reason about (at least in theory). It also complicates the implementation and testing of myriad Pipeline setups.

@jnothman (Member)
An open question here is: is there any value in requiring resamplers to have fit? They're only useful as resamplers if fit_resample is called.

@amueller (Member) commented Aug 17, 2020

I like your distinction above, requiring resamplers not to change the meaning of features and to operate basically on sample identities. That makes sense as a distinct logical concept to me.

However, in terms of API, I'm not sure if we're working on too narrow a solution here, since it will still not let us modify y in other areas and have distinct training and test transformations, two things that I think we are relatively certain we want.

On the one hand it makes sense to me to tackle one problem after the other and have incremental solutions. However, these issues seem connected to me; as you said:

The fact that we need to be able to modify y, and the fact that we cannot perform this operation at test time are corollaries of changing the set of samples we are operating on.

So if we solve this particular case but not the more general case, do we keep tacking on things to cover the other cases, and might we come up against issues that cannot be solved (like the ones I tried to construct above)?

As you said, it might be fine not to use a pipeline in some cases, I guess.

@jnothman (Member)
However, in terms of API, I'm not sure if we're working on too narrow a solution here, since it will still not let us modify y in other areas and have distinct training and test transformations, two things that I think we are relatively certain we want.

I'm still not very aware of categories of use case for modifying y where it's not about:

  • changing the sample identity, which doesn't make sense on inductive models at test time, and so should be doable with fit_resample.
    • In some cases, such as retrieving both X and y from some external data source, this involves modifying the feature space as well as the sample space. I don't mind us extending the current proposal to allow for changing feature identity, which requires transform at test time, but does not involve transforming y at test time.
  • transforming the target, such that the prediction is inverse-transformed (a generalisation of TransformedTargetRegressor)
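
For reference, the existing special case the second bullet generalises:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

reg = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,           # y is transformed before fitting
    inverse_func=np.expm1,   # predictions are inverse-transformed back
)
```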

If there are other cases, but they are not a cohesive category, I see no reason not to tackle them with custom meta-estimators.

I like that this solution circumscribes a group of tools that do similar things, i.e. changing the set of samples used at training time; and it protects the user from doing something inappropriate with test samples.

If we forbid resamplers having transform methods, then it is extensible to later allowing feature spaces to be modified.

And I think this is separate from the need for non-equivalence between fit_transform and transform.

@amueller (Member) commented Aug 17, 2020

Yeah, I have tried to come up with other examples of modifying y that don't fit in these categories but couldn't really think of any. @GaelVaroquaux might have had some use cases in mind?

And I agree, it's probably separate from having non-equivalence between fit_transform and transform.

@drorata commented Aug 18, 2020

I posted an SO question today and @adrinjalali suggested that my concern is related to this SLEP. After a brief chat with him, I was convinced that I can contribute a use case where the Resampler could be helpful.

Assume that we start with a dataset like so, denoted by X:

feat_1 feat_2 feat_3
f_1_1 f_1_2 f_1_3
f_2_1 f_2_2 f_2_3
f_3_1 f_3_2 f_3_3
f_4_1 f_4_2 f_4_3

etc..

I would like to construct a pipeline which performs the following stages (each stage may consist of multiple steps):

  1. Feature engineering: create derived features from those available in the original set.
  2. Extract targets from the dataset. By "extract" I mean applying some logic to the data and creating a new feature that will be the target of the predictions. For example, think of time-series data where the target for some day d_i is the number of transactions 10 days later. The extracted target should become y, and the rest of the pipeline should respect it.
  3. The last stage is nothing special: it gets the (probably modified) X and the targets y from the previous stages.

I hope this is indeed on-topic for the SLEP. I'll be happy to elaborate and discuss further as you need.
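
One way stage 2 could look under the proposed fit_resample API (TargetFromFuture is hypothetical, just to make the use case concrete):

```python
import numpy as np

class TargetFromFuture:
    """Derive y for day d_i as the value observed `horizon` days later."""

    def __init__(self, horizon=10, target_col=0):
        self.horizon = horizon
        self.target_col = target_col

    def fit_resample(self, X, y=None):
        X = np.asarray(X)
        y_new = X[self.horizon:, self.target_col]   # e.g. number of transactions 10 days later
        X_new = X[:-self.horizon]                    # rows without a future value are dropped
        return X_new, y_new
```

Such a step creates y and changes the sample set, so it has no test-time analogue, which seems to match the resampler contract discussed above.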

@jnothman (Member)
@glemaitre did you see my proposed amendments in glemaitre#4?

@jnothman (Member)
@glemaitre, should I just be pushing the changes from glemaitre#4 to this branch?

I see the proposal here as reasonable to implement, without touching much else in the library.

The contrast between the Pipeline and the ResampledTrainer choice is still a bit irksome. Pipeline seems more natural (i.e. what users expect), but the scope of the resampling can be confusing, so I prefer the explicit ResampledTrainer.

But once you've got a specialised metaestimator like ResampledTrainer, why not just allow each resampler to be its own metaestimator, e.g. Oversampled(predictor, n_samples) or NanRemoved(predictor) or OutlierRemover(predictor, outlier_detector), all inheriting from a common base class? This would get rid of the need for a new fit_resample verb, but would consequently make it hard to use imblearn's resamplers directly.
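
A sketch of that alternative, with each resampler as its own meta-estimator inheriting from a common base (all names here are hypothetical):

```python
import numpy as np
from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone

class BaseResampledTrainer(BaseEstimator, MetaEstimatorMixin):
    def __init__(self, predictor):
        self.predictor = predictor

    def _resample(self, X, y):                 # overridden by each concrete resampler
        raise NotImplementedError

    def fit(self, X, y):
        Xr, yr = self._resample(np.asarray(X), np.asarray(y))
        self.predictor_ = clone(self.predictor).fit(Xr, yr)
        return self

    def predict(self, X):                      # prediction is untouched by the resampling
        return self.predictor_.predict(X)

class NanRemoved(BaseResampledTrainer):
    def _resample(self, X, y):
        mask = ~np.isnan(X).any(axis=1)
        return X[mask], y[mask]
```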

A Member left an inline review comment on this snippet of the proposal's example:

```python
    make_pipeline(StandardScaler(), SelectKBest(), SVC()),
)
clf = ResampledTrainer(
    NaNRejector(),  # removes samples containing NaN
```

I'm not sure how this is gonna work, as in, what does the transform of this method actually do?
