Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[backend] Adding a keyword filter #37

Closed
wants to merge 221 commits into from
Closed

[backend] Adding a keyword filter #37

wants to merge 221 commits into from

Conversation

AnomalRoil
Copy link

This is meant to reduce the size of the DB and avoid saving all commits and changelogs:

- without this:
Database Stats
Collections (incl. system.namespaces) 	2
Data Size 	975 KB
Storage Size 	524 KB
Avg Obj Size # 	244 KB
Objects # 	4
Indexes # 	2
Index Size 	73.7 KB

- with this:
Database Stats
Collections (incl. system.namespaces) 	2
Data Size 	562 KB
Storage Size 	279 KB
Avg Obj Size # 	281 KB
Objects # 	2
Indexes # 	2
Index Size 	41.0 KB

Fixes #33

})
});

parse_texts(data)
Copy link
Contributor

@mimoo mimoo Mar 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand correctly you're re-using the UpdateMetadata field instead of creating a new field for this analysis.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is not about analysis yet, this is about saving storage space.
I thought the analysis would come later in the prioritization engine.

fn parse_texts(input: Result<UpdateMetadata>) -> Result<UpdateMetadata> {
if input.is_err() {
return input;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be done on the caller side. You can re-write the caller side as:

 let data: UpdateMetadata = serde_json::from_slice(&output.stdout).map_err(|e| {
        error!("{}", String::from_utf8_lossy(&output.stdout));
        anyhow::Error::msg(e)
    })?;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, better be foolproof in case someone changes the code later, no? I could do the check on the caller side as well, to exit earlier, agreed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how making the check here makes it foolproof. Accepting a Result as argument is not really idiomatic in Rust : o

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I should change it to accept an UpdateMetadata directly, yeah

if std::env::var("RETAIN_ALL").is_ok() && std::env::var("RETAIN_ALL").unwrap() != "" {
println!("DISABLED TEXT PARSING, RETAINING ALL DATA. $RETAIN_ALL={}", std::env::var("RETAIN_ALL").unwrap());
return input;
}
Copy link
Contributor

@mimoo mimoo Mar 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so two things:

  1. do we really want this RETAIN feature? Is it ever going to be useful if we don't really display that information to the user anyway?
  2. I think this if should be on the caller side. Actually, I think it shouldn't even live in this file as this priority engine logic, and this file is just about bindings to dependabot.

wdyt?

commit.message = "".to_string();
}
}
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete the comment

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's left here for the review to have some food for thoughts:

  • do we want to remove the entire commits, or do we want to just truncate uninteresting messages to the null string?

The first method is more effective to save storage space.
The second would allow us to keep the html_url field of all commits... But since I'm not filtering the commits_url: Option<String>, field, we already have all commits urls even these of the commits that were not "flagged".

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the first question we should ask ourselves is: do we want to keep these in? I know we can reduce storage by removing them, but we're still 10MB away from mongodb limit on documents so we can still ask ourselves if it's worth keeping. I don't necessary like having devs make decisions based on changelog and commit because these can be faked, and don't necessarily reflect the reality of what's on crates.io. But it can still be useful to have if we want to allow the dev to make a more informed decision on prioritization. wdyt?

If we do want to display it to the user, we can simply display the thing in the create PR/review page.

If we don't want to display it to the user, we can add serde::skip to these fields to avoid storing in storage, but have still have the fields to do priority analysis on it (but save the analysis in another field).


fn flagged_text(text: &String) -> bool {
for word in FLAGGED_WORDS {
if text.contains(word) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's better if we use regexes, otherwise you'll get words that include the letters "rce" without being the actual "rce" word.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for "sec". I predict we will have a lot of false positive : D

actually, it's fine to have false positives, but it'd be good to see how useful these keywords are. I'm wondering if it would be a good idea to show the user context around the words we grepped. I guess this is what you're doing already by only retaining commits or changelog that passed this test...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I thought "false positives aren't an issue". But we could move to regexes to be me accurate. As you want?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we could have a latter prioritization step that would give different weights to different words, since RCE is prolly worse than "bugfix"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we could have a latter prioritization step that would give different weights to different words, since RCE is prolly worse than "bugfix"

let's keep it simple for now : o but you can that add the idea for later

Yeah I thought "false positives aren't an issue". But we could move to regexes to be me accurate. As you want?

we can keep this for now and see how much false positives we get, from past experience I predict we'll too much but no need to optimize this now anyway

}
}
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't know if we want to care about EOF, we don't allow a file that doesn't end with a linebreak in diem/diem

@@ -23,6 +27,7 @@ services:
environment:
- "GITHUB_TOKEN=$GITHUB_TOKEN" # an optional PAT for Github
- "CARGO_HOME=/cargo" # used with a volume to persist cargo stuff
- "RETAIN_ALL=$RETAIN_ALL" # to disable parsing of the commit and changelog messages
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the problem with this is that currently I imagine that we can prioritize things on the frontend side by checking if there are changelog/commits present in the UpdateMetadata field. But if we enable this, then every update has these fields so we can't prioritize anymore.

Copy link
Contributor

@mimoo mimoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is good as a first pass. We should set some time to figure out how to show that on the frontend. I think we need to figure out these before merging:

  • either remove RETAIN_ALL or figure out a way to keep the prioritization when RETAIN_ALL is enabled. I think using a different field than UpdateMetadata is a good idea: we could have a commit_words: Vec<String> containing the words that were flagged.
  • move the filtering out of dependanbot. Perhaps directly in the priority() function
  • use regexes to parse commits/changelogs

@AnomalRoil
Copy link
Author

AnomalRoil commented Mar 15, 2021

I was thinking that we might have 2 different things:

  • a prioritization engine that flags different keywords with different weights and all.
  • a filter to try and reduce the amount of data we are currently storing.

This tries to address the latter as it's easy and it's better to save some storage from scratch, no?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants