Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StatementParser.parse_decimal delocalizes numbers the wrong way #67

Open
lopter opened this issue Sep 10, 2017 · 11 comments
Open

StatementParser.parse_decimal delocalizes numbers the wrong way #67

lopter opened this issue Sep 10, 2017 · 11 comments

Comments

@lopter
Copy link
Contributor

lopter commented Sep 10, 2017

Hello,

I noticed @andrewshadura ended up pulling that patch I wrote last year, which is really cool, thank you.

The reason why I didn't submit my patch back then was because, as Andrew noticed, (all) plugins need to be fixed, and I don't have a solution for that; but I wanna say that a monorepo approach would really help here, and I'm happy to talk about project management (not on this issue though).

Delocalizing shouldn't be the plugin responsibility since as we'll see it's very error prone. Indeed, @andrewshadura modified my patch to try to cope with localized numbers [1] but it doesn't work. For example: in the US a , is used to separate thousands while a . is used to separate the decimal part, therefore with the current code, a properly localized number would be converted to a bogus number with a bunch of dots.

IMO, a better approach would be to get rid of parse_float and parse_decimal, break all plugins on purpose, and introduce parse_amount since it's really what we are doing here:

import contextlib
import locale

from decimal import Decimal, Decimal as D
from typing import (
    Generator,
    Tuple,
)


@contextlib.contextmanager
def _override(categories: Tuple[int], loc: str) -> Generator[None, None, None]:
    saves = [locale.setlocale(c, loc) for c in categories]
    yield
    for i, c in enumerate(categories):
        locale.setlocale(c, saves[i])


def parse_amount(value: str, loc: str) -> Decimal:
    """Parse the given amount of money according to the given locale.

    .. note:: This function isn't thread-safe.
    """

    with _override((locale.LC_NUMERIC, locale.LC_MONETARY), loc):
        lconv = locale.localeconv()
        value = value.replace(lconv["currency_symbol"], "")
        return D(locale.delocalize(value))

Then we could have a some helpers to help plugins ask the user to configure the right locale and encoding for their system, since none of those identifiers are standard; but that's another can of worms.

Let me know what you think and thank you for writing ofxstatement!

[1] src/ofxstatement/parser.py#L63.

edit: I had a few things wrong and the original title sucked.

@lopter lopter changed the title Parser.parse_decimal shouldn't try to delocalize numbers StatementParser.parse_decimal shouldn't try to delocalize numbers Sep 10, 2017
@lopter lopter changed the title StatementParser.parse_decimal shouldn't try to delocalize numbers StatementParser.parse_decimal delocalize numbers the wrong way Sep 10, 2017
@lopter lopter changed the title StatementParser.parse_decimal delocalize numbers the wrong way StatementParser.parse_decimal delocalizes numbers the wrong way Sep 10, 2017
@kedder
Copy link
Owner

kedder commented Sep 11, 2017

Conceptually, I don't think parse_amount is better than parse_float: I'd expect all numbers in the statement to be formatted the same way, even if they do not represent "amount of money" kind of data. So we'd end up using parse_amount to parse all numbers, just like we do with parse_float now.

I like the ability to specify the locale for parse_float/parse_decimal though. Can we keep both backwards compatibility and introduce this feature though? E.g. if locale is specified, we'd use it to parse the value. Otherwise do the old behavior.

What do you think?

@lopter
Copy link
Contributor Author

lopter commented Sep 25, 2017

Hello,

Sorry for the response time, will try to do better moving forward.

Conceptually, I don't think parse_amount is better than parse_float: I'd expect all numbers in the statement to be formatted the same way, even if they do not represent "amount of money" kind of data. So we'd end up using parse_amount to parse all numbers, just like we do with parse_float now.

Conceptually, I think it's really just a super-set of parse_float that knows how to deal with localized numbers and currency signs. I also think that's an important difference of semantic and would keep the name parse_amount.

I like the ability to specify the locale for parse_float/parse_decimal though. Can we keep both backwards compatibility and introduce this feature though? E.g. if locale is specified, we'd use it to parse the value. Otherwise do the old behavior.

Because, I consider that existing code is broken at this point, I'm not sure if I'd call this a feature, and I'm not sure about the value of backward compatibility.

I believe that we should break the API and force everyone to acknowledge the problem. We can write some good release notes so that people can easily update their plugins.

An hybrid approach could be taken by releasing two versions (1.0 and 0.7), but I would really just break the API.


Anyway, that's my PoV, whatever works for the project will work for me…

@lopter
Copy link
Contributor Author

lopter commented Dec 22, 2017

Hello @kedder, I'm wondering if you had some time to ponder that, especially about breaking the API which I thinks is (unfortunately) the right thing to do here.

Happy holidays!

@mcepl
Copy link
Contributor

mcepl commented Dec 22, 2017

I wanna say that a monorepo approach would really help here, and I'm happy to talk about project management (not on this issue though)

Just to say that I think monorepo is actually one of the worst ideas for an open source project.

For most (see the linked post) monorepo basically means “We don’t care about the API stability, or we don’t care about API at all”. Not surprisingly, the discussion about monorepo is here while discussing the breaking of API (whether in this particular situation it is a good idea or not is another point, which I don’t have much opinion about).

If you are a huge proprietary company which doesn’t care about external contributions (like those mentioned in the linked post) that it might be a good idea. If you have open source project just in the name and it is more “throw our proprietary code over the fence, but we don’t really care about community” (I look at you Chromium; I am not sure how stable Blink API is, perhaps the situation has improved) then you can probably get away with it. If any component in the chain (including SSL library or even kernel for Android) doesn’t fit in your vision, than you just fork your own version.

However, the other project which went this way was Gaia (the UI part of Firefox OS), and as a former user and developer of the system, I can say that monorepo was one of the reasons why I believe the project died. There was absolutely no way how to build any apps outside of the project while using its APIs, because there were no APIs.

So, if you want close yourself to the external contributors, you manage super-complicated (thousands sub-projects complicated) project, you don’t care about external users of your project (e.g., Facebook, Google, Twitter), then perhaps monorepo is a good idea. I don‘t think it is the case for ofxstatement.

@kedder
Copy link
Owner

kedder commented Dec 22, 2017

Sorry guys, this got out of my loop completely.

First of all, I think we can make a compromise here, because I'd really hate to break the other plugins. They apparently work well for their authors and whoever uses them. Such a breaking change will not be seen as "improvement".

OTOH, I acknowledge the problem.

Can we declare parse_float and parse_int deprecated (but still working) and recommend using new and shiny parse_amount instead? Enlightened plugins would switch, old ones would work as before.

The way I see it, plugins are still responsible of parsing dates, amounts and whatever else is in proprietary statements, and ofxstatement is merely providing some reusable helpers to achieve that. I'm pretty sure we cannot make parse_amount to handle all peculiarities of proprietary data formats. We can make a reasonably good approximation though and plugins are free to use it or not.

WRT monorepo, I agree with @mcepl, this would not work too well here. It raises the bar for people who would like to create/maintain their own plugins, puts the burden of maintenance, enforcing coding standards, etc on me. Even if we would have a power to make a change to every plugin, we would have to take responsibility for not breaking them. I.e. we'd have to figure out how to test each and every plugin. I'm sure not all of them have runnable unit tests at very least.

@lopter
Copy link
Contributor Author

lopter commented Dec 23, 2017

Can we declare parse_float and parse_int deprecated (but still working) and recommend using new and shiny parse_amount instead? Enlightened plugins would switch, old ones would work as before.

Sure, I think it's still strictly better than what we have now anyway.

WRT monorepo, I agree with @mcepl, this would not work too well here. It raises the bar for people who would like to create/maintain their own plugins, puts the burden of maintenance, enforcing coding standards, etc on me. Even if we would have a power to make a change to every plugin, we would have to take responsibility for not breaking them. I.e. we'd have to figure out how to test each and every plugin. I'm sure not all of them have runnable unit tests at very least.

This analysis is 100% correct, it does make the project harder to work with for its maintainer/"owner", I'm definitely taking the PoV of an user and/or plugin contributor. From that PoV a monorepo has the following advantages:

  • The plugin contributor doesn't have to publish a package, and just need to make a pull-request;
  • This simplifies plugin search/discovery by removing it altogether (i.e. all plugins are in one canonical place);
  • This forces people to contribute on the same (existing) plugins: e.g. if one plugin goes out of date you are incentivized to fix/replace it here (instead of a starting/adding a new one from scratch);
  • It reduces code/plugin duplication, this goes back to the two previous point but also to the fact that many plugins tend to do similar things, and I think it's easier to share code with one repo than many repos. One reason I didn't publish the plugins I wrote so far is because I couldn't find a satisfying way to share code between them, hopefully I'll revisit that at some point.

@mcepl
Copy link
Contributor

mcepl commented Dec 23, 2017

Can we declare parse_float and parse_int deprecated (but still working) and recommend using new and shiny parse_amount instead? Enlightened plugins would switch, old ones would work as before.

OTOH, we have known number of plugins, haven't we? So, it should be possible just grep through all of them and make pull requests for all occurrences of parse_int/parse_float, shouldn't it?

Yeah, the famous “somebody should”.

@mcepl
Copy link
Contributor

mcepl commented Dec 23, 2017

OK, so I have created a git repository collecting all known ofxstatement plugins (except for ofxstatement-postfinance, which has dead repository) and thus I could search through all plugins in one run and here are results. It seems that plenty of plugins define their own parse_float().

@gerasiov
Copy link
Contributor

gerasiov commented Dec 24, 2017

@mcepl https://github.com/gerasiov/ofxstatement-plugins =)

Even fxstatement-postfinance is there
Oh, you have two more, I missed

@gerasiov
Copy link
Contributor

gerasiov commented Dec 24, 2017

Ah, those two from Petri Savolainen are incompatible with ofxstatement. They are GPLv2 only =(
Filled issues.

@kedder
Copy link
Owner

kedder commented Dec 26, 2017

I did some assessment of what we ended up with in parser.py. We have StatementParser that provides some very trivial framework for parsing "line-based" statements, that users can subclass to save on boilerplate. It also includes a very trivial helpers for parsing dates and numbers. Recently we grew a more opinionated parse_decimal, that attempted to fix "badly formatted" numbers. Obviously, if provided implementation doesn't work for you, it is easy to override it in your own parser - what is most plugins actually seem to do.

I think ofxstatement cannot possibly provide an ultimate amount parser that will work for everyone, so why would we even try? Even the proposed solution will break if bank does not respect a particular locale formatting rules, or uses various currency symbols (e.g. € vs EUR).

Having said that, I realize that ofxstatement plugin might provide some more sophisticated implementation to avoid copy-pasting the same code from one plugin to another. I don't think anything as complicated as proposed implementation should be a default behavior - it simply does too much guesswork of what statement values might look like.

I'll try to come up with the MR for that, but first I have to get rid of buildout - it doesn't work anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants