
WATCH_MODE info #288

Open
iferencik opened this issue Dec 7, 2021 · 26 comments · May be fixed by #1179

Comments

@iferencik

Hello there

I'd like to know if setting WATCH_MODE=true makes Martin reload the config file.
I am publishing many PostGIS layers with even more properties/columns, and in order to limit these I need to use a configuration file.
I am thinking of creating a Python script that will generate the config YAML, and it would be really nice if Martin could re-read it in WATCH_MODE.

Thank you for this great piece of software.

@sharkAndshark
Collaborator

If this is on the roadmap, can I try this?

@stepankuzmin
Collaborator

Hi @sharkAndshark,

Sure!

@sharkAndshark
Collaborator

Based on the discussion in #349, will there be no watch mode? @stepankuzmin

@nyurik
Member

nyurik commented Aug 10, 2022

@sharkAndshark I think we may need to rethink the watch mode, and possibly add it back before the next release.

  • Refresh should NOT be initiated on a GET request, as this is a direct path to DDoSing the service, plus it may get cached by some proxy in between, or by the browser
  • Martin should be configurable with regular refresh intervals, e.g. every 10 minutes
  • Martin should be configurable to allow a POST request to refresh -- easier to filter with firewall rules, and it does not get cached by proxies. Configuration should allow restricting this to a subset of IPs, and only to localhost by default.
  • rework refresh using shared mutable state concepts. I think it should be something like "get X without the shared multi-threaded state, unless X does not exist, in which case take the read side of a read/write lock and copy the latest state to the worker's state" (rough sketch after this list). Not certain if this is doable though.
  • Add support for volatile/versioned data sources:
    • tilejson should return a slightly different path for the actual tiles. E.g. if the path is /rpc/getmvt.json, the tile path should be /rpc/getmvt/<version>/<z>/<x>/<y>.pbf, where the <version> is 1, 2, ... This path should be dynamic, and change whenever a new version is detected. Only applies when it is possible to access different versions at the same time. Note that this approach would only work if there is only one instance of Martin, which is not scalable.
    • if the backend is an MBTiles (SQLite) file or a PMTiles file, the file could in theory be renamed while the pooled connections still have it open, and a new file placed in the same location -- so this could work with versions.
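
To make the fourth bullet concrete, here is a rough sketch of a worker-local lookup that only touches the shared read/write lock on a miss. The Source type and the string-keyed map are hypothetical stand-ins, not Martin's actual data model:

    use std::collections::HashMap;
    use std::sync::{Arc, RwLock};

    // Hypothetical stand-in for a resolved tile source.
    #[derive(Clone)]
    struct Source;

    // Per-worker state: lookups hit the local map first; only a miss touches the
    // shared lock, after which the latest shared state is copied into this worker.
    struct WorkerState {
        local: HashMap<String, Source>,
        shared: Arc<RwLock<HashMap<String, Source>>>,
    }

    impl WorkerState {
        fn get(&mut self, id: &str) -> Option<Source> {
            if let Some(src) = self.local.get(id) {
                return Some(src.clone());
            }
            // Miss: take a short read lock and refresh the worker-local copy.
            let shared = self.shared.read().unwrap();
            self.local = shared.clone();
            self.local.get(id).cloned()
        }
    }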

That said, I really feel like the more substantial changes should be handled with Kubernetes/Helm deployments, and that watch mode is a development hack that makes infrastructure too fragile. A better setup would be:

  • set up a Kubernetes+Helm deployment with multiple redundant instances of Martin
  • when new config/data sources are needed, create a new Helm deployment version with the new configuration (these steps could even be automated)
  • initiate a gradual roll-out of the new version, using automated tests to ensure the roll-out is going OK, and rolling back to the last stable configuration if it is not

@sharkAndshark
Collaborator

@nyurik I will give it a shot, and then we can discuss more.

@nyurik
Member

nyurik commented Aug 11, 2022

Sounds good! @stepankuzmin any thoughts?

@stepankuzmin
Collaborator

@sharkAndshark sure, that would be great 👍

@pascal-codetaal

@nyurik We are currently using the latest version of Martin, and there are some real benefits in using this version.
But the one thing that is still a downside in the new version is the removal of watch mode, or of any option to reload the index.
We are running on a Kubernetes cluster and we reload the index by creating a new pod and killing the old one.
But this process can take up to 50 seconds or so, which is the time the user has to wait before being able to see the updated data.
So is there a possibility to add another way to refresh the data?
Perhaps we can try to help with implementing this feature?

@nyurik
Member

nyurik commented Oct 25, 2022

@pascal-codetaal I'm beginning to think we really do need to reimplement it, and some help would be great. So let's have a dedicated POST /reload endpoint.

My only concern is that we would want to avoid a global lock on each request. I'm not yet certain how to do that best (one lock-free possibility is sketched below).
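
One lock-free pattern worth considering (not a settled choice for Martin) is an atomic pointer swap, e.g. via the arc_swap crate. A rough sketch, with a hypothetical Catalog type standing in for the resolved sources:

    use std::sync::Arc;

    use arc_swap::ArcSwap;

    // Hypothetical stand-in for whatever holds the resolved tile sources.
    struct Catalog;

    // Readers load the current Arc without taking any lock; a reload just swaps
    // the pointer, so in-flight requests keep using the snapshot they loaded.
    struct SharedCatalog(ArcSwap<Catalog>);

    impl SharedCatalog {
        fn get(&self) -> Arc<Catalog> {
            self.0.load_full()
        }

        fn reload(&self, new_catalog: Catalog) {
            self.0.store(Arc::new(new_catalog));
        }
    }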

@iferencik
Author

iferencik commented Oct 25, 2022

@nyurik We are currently using the latest version of Martin, and there are some real benefits in using this version. But the one thing that is still a downside in the new version is the removal of watch mode, or of any option to reload the index. We are running on a Kubernetes cluster and we reload the index by creating a new pod and killing the old one. But this process can take up to 50 seconds or so, which is the time the user has to wait before being able to see the updated data. So is there a possibility to add another way to refresh the data? Perhaps we can try to help with implementing this feature?

@pascal-codetaal
if you do a

    kubectl rollout restart deployment <depl_name> 

there is no downtime. Kubernetes will keep the old pod(s) serving while the new one gets deployed.

Just make sure you have enough capacity on the node to spin up a new deployment.

Also, you should ideally have more than one replica.

I have the same issues, and being under time pressure I resorted to something else:
I created a shell script (https://github.com/UNDP-Data/martin-docker/blob/main/reactive_entrypoint.sh) that checks the local config file against a remote one located at a web-accessible URL (an Azure file share).

Env vars specify how often the config is pulled and which URL it is pulled from.

This might not be very elegant, but it works like a charm, especially because I use a tool to create (and optionally upload to Azure) the config files. We actually update the layers on a daily, sometimes hourly basis. In the future I will poll the remote config once a day.

The tool has a quirk: it creates configs only for tables that have "publish=True" in the table description. The same is valid for columns; if you mark a column with publish=False the tool will skip it.
This is not a very good idea, and I think I will scrap it in favor of a more robust technique, like revoking read rights for the user under which Martin connects to the database.

@pascal-codetaal

pascal-codetaal commented Oct 25, 2022

We don't have downtime, it just can take up to 50 seconds before the new pod is up and running...
We let users upload data to our DB, and they don't like to wait an extra 50 seconds before they can see the updated data. I'm going to check your script, thx

@saosebastiao

I would love to be able to reload the schema config without a full restart. The PostgREST project has two cool ways to do this:

https://postgrest.org/en/stable/schema_cache.html#schema-cache-reloading

  • Listening for a SIGUSR1 signal
  • Using Postgres' LISTEN/NOTIFY with a dedicated channel

The second method in particular is interesting because it allows the DB admin to set up custom triggers that would send notifications to the custom channel any time a geospatial table changes.
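
As a rough sketch of the signal-based variant (assuming tokio's Unix signal support; reload_catalog is a hypothetical placeholder for whatever would re-run source discovery in Martin):

    use tokio::signal::unix::{signal, SignalKind};

    async fn watch_for_reload_signal() -> std::io::Result<()> {
        // Wake up on every `kill -USR1 <martin_pid>`, e.g. from a deploy script
        // or a hook fired by a database trigger.
        let mut usr1 = signal(SignalKind::user_defined1())?;
        while usr1.recv().await.is_some() {
            reload_catalog().await;
        }
        Ok(())
    }

    async fn reload_catalog() {
        // Placeholder: re-resolve the configuration and swap in the new sources.
    }

The LISTEN/NOTIFY variant would do the same thing in response to notifications arriving on a dedicated Postgres channel instead of a Unix signal.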

@nyurik
Member

nyurik commented Apr 3, 2023

There are currently two issues I see (I'm sure there are more):

  • Performance - if we introduce a global lock for each request in order to access a singleton configuration, we might not be able to scale efficiently. This might be negligible - we will need to evaluate it with a simple stress test (e.g. a tiny Actix app that returns a static string after accessing some global lock variable, under a massive stress test with the app limited to just a few CPUs).
  • Source ID stability - when reloading, an auto-generated ID could end up different, e.g. in case multiple sources have the same table name. Possible solutions (see the sketch after this list):
    • add an ability to generate a hash param from an internal "long ID": hash(server_name, server_port, db_name, namespace, table, geo_column), and users can set it with id_format: '{table}.{hash}' in the auto_publish settings.
    • add some heuristics to try to match existing IDs to the new IDs based on the internal "long ID"
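
As a toy illustration of the {hash} idea (a proposal, not an existing Martin feature), the hash could be computed over the long-ID components. A real implementation would want a stable hash algorithm, since std's DefaultHasher is not guaranteed to stay stable across Rust releases:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    // Hash the "long ID" components so two same-named tables get distinct IDs.
    fn long_id_hash(
        server_name: &str,
        server_port: u16,
        db_name: &str,
        namespace: &str,
        table: &str,
        geo_column: &str,
    ) -> String {
        let mut h = DefaultHasher::new();
        (server_name, server_port, db_name, namespace, table, geo_column).hash(&mut h);
        // Short hex digest that could be substituted for `{hash}` in id_format.
        format!("{:08x}", h.finish() as u32)
    }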

@tordans

tordans commented Jan 25, 2024

We just ran into this issue after switching from pg_tileserv to Martin. With pg_tileserv all new tables were discovered right away. Now we have to restart our Martin Docker container to get an updated table list.

Our Docker setup https://github.com/FixMyBerlin/atlas-geo/blob/main/docker-compose.yml starts our backend server with db, processing and tiles. It looks like we don't have a good way to restart the tiles container after the app container has finished processing our data…

Is there some API like <martinserver.com>/refresh that we could ping to notify Martin to refresh its tables?

@nyurik
Member

nyurik commented Jan 25, 2024

@tordans we all want that API to happen - I plan to work on it in the future, but there is no certainty on the timeframe -- it is mostly a personal project, so if anyone has time to work on it, I will be happy to guide and assist that effort.

@pascal-codetaal

Hello @nyurik and the Martin contributors,

We've been closely following the discussion about the need for a watch mode or similar feature to enable dynamic config reloading in Martin. We understand the significance of this feature for users like us who need to see updated data without substantial downtime, and we're keen on contributing to its development.

Though we're not currently well-versed in Rust, we're ready to put in the necessary time to become proficient and contribute meaningfully to this project. We're particularly interested in understanding the preferred approach.

Could you provide us with some guidance on where you believe we should start, or if there are any particular areas of the codebase or documentation that we should familiarize ourselves with first?

We're looking forward to collaborating with the community on this feature. Thank you for considering our offer to help.

@nyurik
Member

nyurik commented Jan 25, 2024

@pascal-codetaal I will be happy to help you.

The very first thing I think we should evaluate is the performance aspect outlined in #288 (comment) -- create a tiny Actix app using one of the examples they have as a starting point. Inside that app, create two endpoints, e.g. /lock and /nolock to see the impact of the global lock. You can use the oha tool (used in the justfile) to compare the performance of both endpoints.

This should get you familiarized with the Rust ecosystem, making it much easier to make changes to a bigger project.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use actix_web::get;
use actix_web::web::Data;

#[derive(Clone)]
struct State {
    // Plain map read directly without any locking (shared immutably via Data's Arc).
    pub values: HashMap<String, String>,
    // The same map behind a shared mutex, locked on every read.
    pub values_mutex: Arc<Mutex<HashMap<String, String>>>,
}

impl State {
    pub fn new() -> Self {
        let values = vec![(String::from("key"), String::from("value"))]
            .into_iter()
            .collect::<HashMap<String, String>>();
        Self {
            values_mutex: Arc::new(Mutex::new(values.clone())),
            values,
        }
    }
}

#[get("/lock")]
async fn get_lock(state: Data<State>) -> String {
    // Takes the global mutex on every request.
    let vals = state.values_mutex.lock().unwrap();
    vals.get("key").unwrap().clone()
}

#[get("/nolock")]
async fn get_nolock(state: Data<State>) -> String {
    // Reads the unlocked copy directly.
    state.values.get("key").unwrap().clone()
}
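
For completeness, a minimal (hypothetical) main() that wires these two endpoints up for the stress test, assuming the snippet above is in the same file; the worker count and port are arbitrary:

    // Bind both endpoints and limit the worker count so any lock contention
    // shows up under load (e.g. when hammered with oha).
    #[actix_web::main]
    async fn main() -> std::io::Result<()> {
        use actix_web::{App, HttpServer};
        HttpServer::new(|| {
            App::new()
                .app_data(Data::new(State::new()))
                .service(get_lock)
                .service(get_nolock)
        })
        .workers(2)
        .bind(("127.0.0.1", 8080))?
        .run()
        .await
    }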

@pascal-codetaal

@nyurik the development will be done by @mesudBe-Orbit.
He will start with the separate Actix application, and he will push it to a public repository.

@mesudBe-Orbit

@nyurik @pascal-codetaal the development is complete, and you can access the tiny application and view the results by following this link: https://github.com/mesudBe-Orbit/actix-lock-stress-tester?tab=readme-ov-file#test-results
In short, under these conditions, it seems there is no significant difference.

@nyurik
Member

nyurik commented Jan 30, 2024

Awesome news, thanks! So this should make things easier.

The current config system process:

  • In start(), read the config file if given, using current env vars for any variable expansions.
  • Merge the CLI parameters and the config file together (as well as any env vars)
  • Run .resolve() to match the configuration to the actual state of the database/files, and create a ServerState (tile sources, fonts, etc.):
    let env = OsEnv::default();
    let save_config = args.meta.save_config.clone();
    let mut config = if let Some(ref cfg_filename) = args.meta.config {
        info!("Using {}", cfg_filename.display());
        read_config(cfg_filename, &env)?
    } else {
        info!("Config file is not specified, auto-detecting sources");
        Config::default()
    };
    args.merge_into_config(&mut config, &env)?;
    config.finalize()?;
    let sources = config.resolve().await?;

The resolve method creates a cache and an IdResolver, and uses resolve_tile_sources to process all tile sources in parallel.

pub async fn resolve(&mut self) -> MartinResult<ServerState> {

IdResolver maps "unique full source IDs" (like database.namespace.tablename or even longer) to some short name the user may want, like tablename. If the short name already exists, it will append numbers to it, e.g. tablename.1; but if the same full unique name is passed to IdResolver again, it will re-return the same short name as before - so my point above about stable resolution is mostly done: the same table will get the same short name.
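
As a toy illustration of that mapping (not Martin's actual IdResolver implementation):

    use std::collections::HashMap;

    #[derive(Default)]
    struct ToyIdResolver {
        // short name -> the full unique ID it was assigned to
        names: HashMap<String, String>,
    }

    impl ToyIdResolver {
        fn resolve(&mut self, unique_id: &str, preferred: &str) -> String {
            let mut candidate = preferred.to_string();
            let mut suffix = 0;
            // Try tablename, tablename.1, tablename.2, ... until we find our own
            // previous assignment or a free short name.
            while let Some(existing) = self.names.get(&candidate) {
                if existing == unique_id {
                    // Same full ID asked again: re-return the same short name.
                    return candidate;
                }
                // Collision with a different full ID: bump the numeric suffix.
                suffix += 1;
                candidate = format!("{preferred}.{suffix}");
            }
            // Short name is free: claim it for this full ID.
            self.names.insert(candidate.clone(), unique_id.to_string());
            candidate
        }
    }

For example, resolving db1.public.roads and db2.public.roads, both preferring "roads", yields roads and roads.1, and repeated calls with the same full ID keep returning the same short name.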

PostgreSQL resolution is much more involved than PMTiles/MBTiles. It runs these steps in parallel for all configured Postgres connections:

  • create a PgBuilder instance from the PgConfig instance.
  • in parallel, query PG to discover all existing tables and functions that can be used as sources (e.g. query_available_tables)
  • for all configured tables/funcs, match them with the ones discovered in the database, and create a new source instance - in parallel (e.g. instantiate_tables)
  • if auto-discovery is enabled, instantiate the rest of the tables/funcs that have not been matched up with the configuration

pub async fn resolve(&mut self, id_resolver: IdResolver) -> MartinResult<TileInfoSources> {

So one path to implement watch mode (a rough sketch of this shape follows after the list):

  • Change all endpoint functions to accept sources: Data<RwLock<TileSources>> instead of Data<TileSources>, and take a read lock in all of them. Note that RwLock might not be ideal for our use case, but it should get us started.
  • after parsing and resolving the Config struct, store it (as Arc<Mutex<Config>>) inside actix's Data<...>
  • add a /refresh endpoint that gets Data<Arc<Mutex<Config>>> from Actix
  • refresh should call resolve to compute a new set of sources, and once done, lock and update all tile sources using the write half of the RwLock.
  • Config::resolve should be changed to preserve the IdResolver instance between calls - e.g. move it to the constructor?
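
A rough sketch of what that could look like in Actix Web; Config and TileSources here are hypothetical stand-ins for Martin's own types, and Data already wraps its contents in an Arc:

    use actix_web::{post, web::Data, HttpResponse};
    use tokio::sync::{Mutex, RwLock};

    // Hypothetical stand-ins for Martin's real Config and TileSources types.
    struct Config;
    struct TileSources;

    impl Config {
        // Placeholder for Config::resolve(): rebuild the full set of tile sources.
        async fn resolve(&mut self) -> Result<TileSources, String> {
            Ok(TileSources)
        }
    }

    #[post("/refresh")]
    async fn refresh(
        config: Data<Mutex<Config>>,
        sources: Data<RwLock<TileSources>>,
    ) -> HttpResponse {
        // Hold the config lock only while re-resolving, so concurrent refreshes serialize.
        let new_sources = {
            let mut cfg = config.lock().await;
            match cfg.resolve().await {
                Ok(s) => s,
                Err(e) => return HttpResponse::InternalServerError().body(e),
            }
        };
        // Swap the new sources in under a short write lock; tile requests keep
        // serving from the old sources (via read locks) right up to this point.
        *sources.write().await = new_sources;
        HttpResponse::Ok().finish()
    }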

@nyurik
Member

nyurik commented Jan 30, 2024

@mesudBe-Orbit @pascal-codetaal the development is complete, and you can access the tiny application and view the results by following this link: https://github.com/mesudBe-Orbit/actix-lock-stress-tester?tab=readme-ov-file#test-results In short, under these conditions, it seems there is no significant difference.

Please run your code with cargo run --release, and also you should probably increase the number of requests, e.g. aim for about a 5-minute run (?)

@pascal-codetaal

@nyurik the 5 minute run had the same result for both endpoints.
We started with the implementation of changing the endpoint functions.

Change all endpoint functions to accept sources: Data<RwLock<TileSources>> instead of Data<TileSources>, and take a read lock in all of them. Note that RwLock might not be ideal for our use case, but it should get us started.

@nyurik
Member

nyurik commented Feb 2, 2024

@pascal-codetaal sounds good. One other thing - I would recommend creating a Martin fork, creating a branch there, and opening a draft pull request into Martin. Make sure to commit and push your code often (even if it does not compile). This way everyone can see the progress and make suggestions as you go, without waiting for full completion, and mistakes can be corrected early on. You will also get the benefit of constant CI checks.

@mesudBe-Orbit

@nyurik we have created a draft PR into Martin; you can find it here. We have committed and pushed our latest code (the latest commit addressed some mistakenly added repetition).

@nyurik
Member

nyurik commented Feb 7, 2024

@mesudBe-Orbit yes, thanks, I saw it and already commented on that PR. Keep adding things to it (by pushing to the same branch), and once it is ready to be merged, click the Ready to review button. Make sure to either merge the main branch into it, or rebase on main and force push.

@nyurik
Member

nyurik commented May 6, 2024

PMTiles support for this would be good as well. I outlined how it can be done in stadiamaps/pmtiles-rs#40 - and once that is done, we can use it in Martin too.
