Problem defining a custom transformer applied as the second step of a pipeline fed with a pandas dataframe #28917
Describe the workflow you want to enable

I am trying to do the following:

import pandas
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.base import TransformerMixin, BaseEstimator
# Input Data
df = pandas.DataFrame([["car",0.1,0.0],["car",0.2,0.0],["suv",0.0,0.2]],columns=['vehicleType','features_car','features_suv'])
# Custom Transformer
class GetScore(BaseEstimator, TransformerMixin): # type: ignore
    """Apply binarize transform for matching values to filter_value."""
    def __init__(self):
        """Initialize transformer with expected columns."""
        pass
    def dot_product(self, x) -> float:
        """Return 1.0 if input == filter_value, else 0."""
        return x[0]*x[2] + x[1] * x[3]
    def fit(self, X, y=None): # type: ignore
        """Fit the transformer."""
        """Transform the given data."""
        if type(X) == pandas.DataFrame:
            x = X.apply(lambda x: self.dot_product(x), axis=1)
            return x.values.reshape((-1, 1))
    def transform(self, X: pandas.DataFrame):
        """Transform the given data."""
        if type(X) == pandas.DataFrame:
            x = X.apply(lambda x: self.dot_product(x), axis=1)
            return x.values.reshape((-1, 1))
        # elif type(X) == numpy.ndarray:
        #     vector_func = numpy.vectorize(self.dot_product)
        #     x = vector_func(X)
        #     return x.reshape((-1, 1))
    def get_feature_names_out(self) -> None:
        """Return feature names. Required for onnx conversion."""
        pass
onehot = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(categories=[["car", "suv"]], sparse_output=False), ['vehicleType']),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
)
get_score = ColumnTransformer(
    transformers=[
        ("getScore", GetScore(), [0,1,2,3])
    ],
    remainder='passthrough'
)
pipeline = make_pipeline([("onehot", onehot),
                          ("get_score", get_score)])
preprocesses_df = pipeline.fit(df)
print(preprocesses_df)

I basically want to take the one-hot encoded columns from onehot and pass them into GetScore to calculate the dot product with the ['features_car','features_suv'] input features from df. Instead, I get this error:

TypeError: Last step of Pipeline should implement fit or be the string 'passthrough'. '

Is there any easier way to do what I am trying to do?

Describe your proposed solution

Some way to use the intermediate features from a previous transformer in the next one, within a single ColumnTransformer.
Replies: 1 comment
A couple of remarks:
- Your make_pipeline call is invalid: you should either call make_pipeline(onehot, get_score) or call the constructor of the Pipeline class with a list of named steps tuples.
- The fit method should return the estimator itself (return self). See https://scikit-learn.org/dev/developers/develop.html#apis-of-scikit-learn-objects for more details.

EDIT: I just added the first bullet point which is probably the main problem and causes the weird error message you get.