Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Migrate Training Update Process to Sidekiq #5754

Draft
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

ujjwalpathaak
Copy link
Contributor

@ujjwalpathaak ujjwalpathaak commented Apr 5, 2024

fixes #4712
This PR aims to migrate the training update process to Sidekiq. So that there are free resources available on the main thread.

  • Move the update process to Sidekiq
  • Deal with YamlTrainingLoader & InvalidYamlError error?
  • "Another training update process is already in progress. Try again later."
  • Rescue ModuleNotFound error?
  • Changes to allow addition of new content
  • Edge Cases
  • Write Tests

@ujjwalpathaak
Copy link
Contributor Author

ujjwalpathaak commented Apr 5, 2024

Hello @ragesoss, I have tried to move the process to Sidekiq it is working fine. LMKWYT if this seems feasible, I'll continue with this approach to add other features.

@ujjwalpathaak
Copy link
Contributor Author

ujjwalpathaak commented Apr 9, 2024

Hey, I tried a few things to identify if there is a Sidekiq job already running? I am not able to figure out a clean way to do it.

You suggested to handle this on the database level but if I add a boolean column to every training content, will it be a nice way to do things?
As needs_update only updates those courses which have this bool set to true, whereas libraries and modules gets updated on each ./reload request?

@ujjwalpathaak
Copy link
Contributor Author

ujjwalpathaak commented Apr 9, 2024

Few of the options I tried-

  1. Adding "redis" as a gem
  • using redis.set('training_update_in_progress', true) as a check if an older update is already running? This would need cleanup code in case of unexpected server termination as training_update_in_progress is set false after loading of slides is done by sidekiq.
  1. Get all jobs from the default queue and check if "TrainingBaseWorker" job exists.
  • This is not consistent ad it takes time for TrainingBaseWorker to show up int he default queue.
  1. We are using the gem 'sidekiq-unique-jobs' as sidekiq_options lock: :until_executed, it has a few options Link to SidekiqUniqueJobs documentation to solve these conflicts
  • but the problem is that it would not give feedback to the user on client that another job is running but will notify this conflict on server side

@ragesoss
Copy link
Member

ragesoss commented Apr 9, 2024

I don't know the best approach here, but one that might work would be:

Include a column for update status on TrainingLibrary and TrainingModule records, but make it able to handle more details than just a boolean state. (It might or might not make sense to include more than one column.) The idea would be to be able to keep track of whether it was not scheduled for an update, scheduled for an update that hasn't started yet, or started an update that has not successfully completed yet (with a timestamp for when it started). So the update process would update that status in the database as soon as it starts, and then we could use some logic like 'if it's been more than X amount of time since the start of an update, check whether an update process is running and if not, it probably errored'.

Then, except for that case where an update failed but the status still shows it being updated, we wouldn't need to check Redis to show state to the user.

@ujjwalpathaak ujjwalpathaak marked this pull request as draft April 11, 2024 17:11
@ujjwalpathaak
Copy link
Contributor Author

ujjwalpathaak commented Apr 11, 2024

Hi @ragesoss,
So I tried the approach you suggested, let me know if you suggest any changes in it.
I have added a few tasks in the PR desc that i think are still needed to close this PR, I'll add more as i encounter them.

Here is the working ( not written tests for them till now but it should work fine after i write tests and do minor changes accordingly )

  • there are 2 new cols for each table - libraries, modules & slides.
    • update_status -
      • It will have 3 states
        • 0 -> default / not_scheduled state
        • 1 -> scheduled but update not started state
        • 2 -> update started but not completed
    • update_error
      • default / no_error - NULL
      • error - e.message ( that is to be rendered to the user )

For errors like InvalidWikiContentError & NoMatchingWikiPagesFound - all the rows in the table will have the error as this is not slug specific.
For errors like DuplicateSlugError - only the rows which have the errored_slug would contain the message.

After hitting on /reload_trainings a timer (currently for 15min) would run up with all the rows in training_libraries changes their update_status to 1 ( scheduled ).. as we will start to fetch content from wikimedia for a slug it will change state to 2 ( stared updating ) then after calling .inflate on that slug if no errors occur it will again default to 0 showing update complete

if errors occurs it is handled by not setting state of 2 to 0 so we could catch if error occured. After 15min the timer would check for all rows if a row has 2 as update_status then it would render the error stored in update_error of that row to the user

we can also check if a thread of this 15min timer is currently running to see if a update is already running?

I still have a few blockers as to how to deal with yaml content should i entierly exclude it from the new workflow or just add conditions in between and make it work & how to deal with ModuleNotFound error as to where to store if this occurs.

I'll work on them and add new commits as I figure out the issues.

@ujjwalpathaak
Copy link
Contributor Author

@ragesoss I had a doubt.
I was getting a DuplicateSlugError error for "editing-basics" slug while loading it from wiki_pages.
what is the expected behaviour here? should the update_process skip it and go ahead or it should show the error to user and terminal the process?

@ragesoss
Copy link
Member

That typically happens because you have done training updates in both wiki_education true and false modes, and there is one module with that slug that gets pulled from the wiki (in false mode) and another from the .yml files (in true mode). Delete all the TrainingModule and TrainingSlide records and you should be able to proceed.

@ujjwalpathaak
Copy link
Contributor Author

I wanted to ask how should these erros in prod behave? if an error is raised should it stop the workflow or that slug should be skipped and the workflow should continue.

@ragesoss
Copy link
Member

Hopefully such errors won't occur in production, but skipping that slug and continuing would be nice

@ujjwalpathaak
Copy link
Contributor Author

ujjwalpathaak commented Apr 15, 2024

Hey @ragesoss,
For errors like NoMatchingWikiPageFound, NoModuleFound and more... I cannot target a single row, currenlty I am solving this problem using a tacky solution but I am not finding that implementation nice.

I was wondering if it is feasible to make a new table to record the errors that are raised during the background update process so we dont need to iterate on all the rows in all 3 tables to do any operations.

Table Name - training_update_status
------ 1. row ------
content_class: TrainingLibrary
update_status: 0 // not_scheduled
error_message: NULL

------ 2. row ------
content_class: TrainingModule
update_status: 1 // scheduled
error_message: NULL

------ 3. row ------
content_class: TrainingSlide
update_status: 2 // error_occured
error_message: could not get links from '#{@wiki_base_page}'

This would reduce the complexity of the workflow + reduce iterations over the dashboard

@ragesoss
Copy link
Member

I'd rather not add a table just to hold this information. If the idea would be to limit the data to just one entry per record type, perhaps a Setting record could be used to store the status.

@ujjwalpathaak
Copy link
Contributor Author

ujjwalpathaak commented Apr 15, 2024

Okay I'll use setting records to store the data, and I'll start to write tests to test the functionality

@ujjwalpathaak
Copy link
Contributor Author

Hey @ragesoss
I have been trying to tweek a lot of things here and there, but still there are a few loose ends to tie up.
If you could can you have a look and maybe guide me what all should I keep in mind ahead from here in terms of testing and expected behaviour?

@ujjwalpathaak
Copy link
Contributor Author

Hey @ragesoss,
Most of the functionality is now working for me locally, but there is one issue I need some assistance with.

I am adding a new initializer - WikiEduDashboard/config/initializers/training_update_process_cleanup.rb

require_dependency "#{Rails.root}/lib/data_cycle/batch_update_logging"
require_dependency "#{Rails.root}/lib/training/training_base"

include BatchUpdateLogging

delete_pid_file(:training)
TrainingBase.new.update_process_state(0)
TrainingBase.new.clear_error_messages

This is to basically clean any leftovers from the last run ( in development ), this is working locally for me but when I push this to github I get this error -

Run mkdir tmp -m 777
  mkdir tmp -m 777
  cp config/application.example.yml config/application.yml
  cp config/database.example.yml config/database.yml
  bin/rails db:migrate RAILS_ENV=test
  shell: /usr/bin/bash -e {0}
  env:
    RAILS_ENV: test
    DATABASE_PORT: 3306
rails aborted!
ActiveRecord::StatementInvalid: Mysql2::Error: Table 'dashboard_testing.settings' doesn't exist

am I missing a step or doing something wrong?

@ragesoss
Copy link
Member

I don't see that error in the CI log. Where during the build is it happening?

@ujjwalpathaak
Copy link
Contributor Author

Commit - e00b7e0
Test Logs for the above commit - https://github.com/WikiEducationFoundation/WikiEduDashboard/pull/5754/checks?check_run_id=24057578970

Started to after this commit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve training update process
2 participants