Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve circular dependency of Rust implemention with geozero #283

Open
pka opened this issue Aug 10, 2023 · 7 comments
Open

Resolve circular dependency of Rust implemention with geozero #283

pka opened this issue Aug 10, 2023 · 7 comments

Comments

@pka
Copy link
Member

pka commented Aug 10, 2023

We have a long standing maintenance problem because of the circular dependency between Rust Flatgeobuf and geozero. I've discussed that with @nyurik and we see the following solutions:

  1. Move rust/flatgeobuf to the geozero repo
  2. Extract the geozero dependent part of rust/flatgeobuf into a geozero driver
  3. Remove all flatgeobuf dependencies from the geozero repo

Pros/Cons:

  • Option 1 is technically the easiest. We just had to give @bjornharrtell all the required Github permissions.
  • Option 2 gets at least difficult in the test dependencies. To make readable tests we need to convert from/to FlatGeoBuf and the only option for that is using geozero. We could move them also to geozero, but that's not nice.
  • Option 3 is also quite a lot of work, because flatgeobuf is used in many places. We have e.g. geozero-cli which converts other geozero supported formats from/to Flatgeobuf, we have benchmarks and so on.

So before going into more detail, is there an option you would rule out, @bjornharrtell?

@nyurik
Copy link
Contributor

nyurik commented Aug 10, 2023

Thanks @pka for an awesome write-up! I agree that option 1 is by far the simplest and safest from the code quality and maintenance perspective.

Fundamentally, we have a bad situation of flatgeobuf having tons of dependencies on geozero, which in turn depends on flatgeobuf in many unit tests, benchmark, and geozero-cli code. This means that any bump in minor or major version causes CI and local builds to fail, especially when both the geozero AND test/cli/bench code is changed at the same time.

Geozero relies on 3rd party libraries to handle the low level processing of the geo data. For example MVT relies on protobuf, and GeoJSON on serde. Yet, flatgeobuf usage is unusual -- instead of geozero relying on flatgeobuf, it is the other way around. In theory, flatgeobuf should offer a low level binary stream encoding/decoding, i.e. being similar to serde or protobuf, and GeoZero should do the higher level processing, e.g. iterate over the stream of data and convert it to the common internal feature representation for geometries and properties.

This essentially is the option 2, and I think it is a better path long term. Yet, to get there, most of the refactoring would need to be done in close sync inside a single repo with a common CI. Thus, I would recommend starting with option 1, and possibly try to move to option 2.

P.S note that fgbutil does not need to move, only the flatgeobuf crate and probably bench/fuzz stuff.

@bjornharrtell
Copy link
Member

bjornharrtell commented Aug 11, 2023

Not sure I want to rule out any option at this point, but gut feeling says I'm not fond of option 1. I think I agree with "In theory, flatgeobuf should offer a low level binary stream encoding/decoding" so that the dependency situation can be similar to other formats for geozero. Instead of option 1 "temp" then option 2 "if possible" due to the need to close proximity and common CI, could that not be a (complex) branch here instead possibly including a fork of geozero for the life time of the branch?

As a side note to perhaps explain my motivations for the above, I seem to have a preference toward centralizing flatgeobuf into this monorepo and concerted releases and versioning for the cross language support of flatgeobuf. But I do see this approach has some disadvantages too...

@michaelkirk
Copy link
Collaborator

michaelkirk commented Sep 12, 2023

The circular dependency is a bit annoying.

It seems like there is some general support for option 2 as an end state, though how we get there might not be for sure. Moving everything to one repo just to move it back later seems not ideal. And a long running branch has it's own problems.

I'm wondering if there's some smaller things we can do to move towards the desired end state that would be incrementally mergable/useful.

One small piece might be the currently ubiquitous use of geozero::{Error, Result} as the error type for all of the flatgeobuf crate. It needs to be used for the geozero impl's, which drive a lot of the functionality of the fgb crate. But to me it seems a bit forced at times - e.g.:

if header.index_node_size() == 0 || header.features_count() == 0 {
    return Err(GeozeroError::Geometry("Index missing".to_string()));
}

Would folks be interested in a local flatgeobuf::{Error, Result} type that can be used for the non-geozero code? That's something I think we could merge in the nearer term.

michaelkirk added a commit to michaelkirk/flatgeobuf that referenced this issue Sep 27, 2023
Motivation:

Currently we have a circular dependency between the geozero and
flatgeobuf crates.

It seems like we're on board with breaking that cycle, but aren't quite
sure how to do that.

Some related discussion in
flatgeobuf#283

What changed:

1. Introduce a new flatgeobuf::Error type and use it for code that
   doesn't depend on Geozero.
2. Relocate code that depends on Geozero into separate
   `geozero_integration` modules. This code still returns GeozeroError.
3. For tests, which might return either flatgeobuf::Error or
   geozero::GeozeroError, use Box<dyn std::error::Error>.
michaelkirk added a commit to michaelkirk/flatgeobuf that referenced this issue Sep 27, 2023
Motivation:

Currently we have a circular dependency between the geozero and
flatgeobuf crates.

It seems like we're on board with breaking that cycle, but aren't quite
sure how to do that.

Some related discussion in
flatgeobuf#283

What changed:

1. Introduce a new flatgeobuf::Error type and use it for code that
   doesn't depend on Geozero.
2. Relocate code that depends on Geozero into separate
   `geozero_integration` modules. This code still returns GeozeroError.
3. For tests, which might return either flatgeobuf::Error or
   geozero::GeozeroError, use Box<dyn std::error::Error>.
michaelkirk added a commit to michaelkirk/flatgeobuf that referenced this issue Sep 27, 2023
Motivation:

Currently we have a circular dependency between the geozero and
flatgeobuf crates.

It seems like we're on board with breaking that cycle, but aren't quite
sure how to do that.

Some related discussion in
flatgeobuf#283

What changed:

1. Introduce a new flatgeobuf::Error type and use it for code that
   doesn't depend on Geozero.
2. Relocate code that depends on Geozero into separate
   `geozero_integration` modules. This code still returns GeozeroError.
3. For tests, which might return either flatgeobuf::Error or
   geozero::GeozeroError, use Box<dyn std::error::Error>.
michaelkirk added a commit to michaelkirk/flatgeobuf that referenced this issue Sep 27, 2023
Motivation:

Currently we have a circular dependency between the geozero and
flatgeobuf crates.

It seems like we're on board with breaking that cycle, but aren't quite
sure how to do that.

Some related discussion in
flatgeobuf#283

What changed:

1. Introduce a new flatgeobuf::Error type and use it for code that
   doesn't depend on Geozero.
2. Relocate code that depends on Geozero into separate
   `geozero_integration` modules. This code still returns GeozeroError.
3. For tests, which might return either flatgeobuf::Error or
   geozero::GeozeroError, use Box<dyn std::error::Error>.
michaelkirk added a commit to michaelkirk/flatgeobuf that referenced this issue Sep 27, 2023
Motivation:

Currently we have a circular dependency between the geozero and
flatgeobuf crates.

It seems like we're on board with breaking that cycle, but aren't quite
sure how to do that.

Some related discussion in
flatgeobuf#283

What changed:

1. Introduce a new flatgeobuf::Error type and use it for code that
   doesn't depend on Geozero.
2. Relocate code that depends on Geozero into separate
   `geozero_integration` modules. This code still returns GeozeroError.
3. For tests, which might return either flatgeobuf::Error or
   geozero::GeozeroError, use Box<dyn std::error::Error>.
@pka
Copy link
Member Author

pka commented Sep 27, 2023

I recently grepped for flatgeobuf in the geozero "core" crate and found so few hits, that I think option 3, removing flatgeobuf from the geozero crate is the way to go.

The other crates would still have a dependency on flatgeobuf. For example geozero-cli:

geozero 0.11 ────►  flatgeobuf v3.23.0

     │                     │
     │                     └─────────────►
     │                                      geozero-cli
     └───────────────────────────────────►

As long we have versionend dependencies, this shouldn't be a problem. After a geozero update we have to stay on the old version as dependency for geozero-cli until we've also updated flatgeobuf.

That all, or am I missing something?

michaelkirk added a commit to michaelkirk/flatgeobuf that referenced this issue Sep 27, 2023
Motivation:

Currently we have a circular dependency between the geozero and
flatgeobuf crates.

It seems like we're on board with breaking that cycle, but aren't quite
sure how to do that.

Some related discussion in
flatgeobuf#283

What changed:

1. Introduce a new flatgeobuf::Error type and use it for code that
   doesn't depend on Geozero.
2. Relocate code that depends on Geozero into separate
   `geozero_integration` modules. This code still returns GeozeroError.
3. For tests, which might return either flatgeobuf::Error or
   geozero::GeozeroError, use Box<dyn std::error::Error>.
@michaelkirk
Copy link
Collaborator

I think you're right that currently more of geozero lives in FGB than FGB lives in geozero. So if the metric is "What is the least lines of code to move", removing fgb from geozero would probably win.

However, I will lay my cards on the table and say:

  1. I prefer the opposite approach - where we remove geozero from FGB and instead have geozero depend on FGB.
  2. Currently, this preference is based primarily on "instincts" and "feelings" about software. The current dependency is reversed from every other format geozero supports (geojson, wkt, mvt), and I think it works better the other way.

That said, it's dumb to manage software projects on only instincts and feelings. 😅

I'm willing to put effort into manifesting my vision here, and think I could have it wrapped up in a week or two, but if someone feels like its a bad idea and wants to do it the other way, I won't stop you. I don't want my notion of perfection to stand in the way of your progress.

Maybe a middle ground is to scrutinize PR's like #310 under the lens of "What if Michael disappears and the big switch never happens?" would we still want this? I think that I can build up some mostly uncontroversial changes that will make the ultimate switchover fairly easy without doing a bunch of "damage" to the current codebase.

@pka
Copy link
Member Author

pka commented Sep 28, 2023

The reason why geozero exists is that I wanted to have a generic processing API for my flatgeobuf implementation in Rust. That's also the reason why most initial tests and examples in geozero use flatgeobuf.

So trying to remove geozero from flatgeobuf feels (!) strange to me, indeed. I could definitely see better APIs and when a geo-iterator API gets real, I'm all for implementing it also for flatgeobuf.

So isolating the geozero usage as you do in #310 is fine for me, but it doesn't solve this issue though.

My point is, that it doesn't make sense that geozero depends on flatgeobuf, but it makes sense the other way around. Or to formulate it more objectively: An API should not depend on a certain storage format, but a format implementation can use a generic processing API.

@michaelkirk
Copy link
Collaborator

An API should not depend on a certain storage format, but a format implementation can use a generic processing API.

I wonder if there's a misunderstanding here. I'm not advocating for making geozero require flatgeobuf, which would be strange indeed. I'm advocating for flatgeobuf to be "just another format" that geozero optionally supports.

Specifically:

  1. The flatgeobuf crate would be stand alone, and not depend on geozero.
  2. The geozero crate would have an optional dependency on flatgeobuf (just like geozero currently has an optional dependency for any other format).

Part of this would require extracting, and potentially duplicating, some things that flatgeobuf relies on that currently live in geozero like ColumnValue.

Duplication isn't great, but I think of it more like decoupling.

For instance, currently, if flatgeobuf wanted to add a new data type, all our existing writers would have to support it, which isn't inherently bad, though a little awkward at times. But the same thing won't happen if any other format (like arrow) wanted to add a new data type. To have flatgeobuf's features be the mediator for how all other formats interact in geozero feels a bit like "the tail wagging the dog".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants