[backend] Adding a keyword filter #37

AnomalRoil · 2021-03-12T19:26:45Z

This is meant to reduce the size of the DB and avoid saving all commits and changelogs:

- without this:
Database Stats
Collections (incl. system.namespaces) 	2
Data Size 	975 KB
Storage Size 	524 KB
Avg Obj Size # 	244 KB
Objects # 	4
Indexes # 	2
Index Size 	73.7 KB

- with this:
Database Stats
Collections (incl. system.namespaces) 	2
Data Size 	562 KB
Storage Size 	279 KB
Avg Obj Size # 	281 KB
Objects # 	2
Indexes # 	2
Index Size 	41.0 KB

Fixes #33

(got error `error: error validating "k8s/overlays/local": error validating data: ValidationError(Deployment.spec.template.spec.containers): invalid type for io.k8s.api.core.v1.PodSpec.containers: got "map", expected "array"; if you choose to ignore these errors, turn validation off with --validate=false`)

mimoo · 2021-03-14T18:32:13Z

web-backend/metrics/src/common/dependabot.rs

-    })
+    });
+
+    parse_texts(data)


So if I understand correctly you're re-using the UpdateMetadata field instead of creating a new field for this analysis.

So, this is not about analysis yet, this is about saving storage space.
I thought the analysis would come later in the prioritization engine.

mimoo · 2021-03-14T18:33:57Z

web-backend/metrics/src/common/dependabot.rs

+fn parse_texts(input: Result<UpdateMetadata>) -> Result<UpdateMetadata> {
+    if input.is_err() {
+        return input;
+    }


this should be done on the caller side. You can re-write the caller side as:

let data: UpdateMetadata = serde_json::from_slice(&output.stdout).map_err(|e| {
    error!("{}", String::from_utf8_lossy(&output.stdout));
    anyhow::Error::msg(e)
})?;

Well, better be foolproof in case someone changes the code later, no? I could do the check on the caller side as well, to exit earlier, agreed.

I'm not sure how making the check here makes it foolproof. Accepting a Result as argument is not really idiomatic in Rust : o

I guess I should change it to accept an UpdateMetadata directly, yeah
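
A minimal sketch of the shape agreed on above: the caller unwraps the Result with `?` and the filter takes the value directly. The `UpdateMetadata` struct here is a hypothetical stand-in with a single field, `decode` stands in for the `serde_json::from_slice` call, and `ParseError` and `load` are illustrative names that are not part of the PR.

```rust
use std::fmt;

// Hypothetical stand-in for the real `UpdateMetadata`; only the shape matters.
#[derive(Debug, Clone, PartialEq)]
pub struct UpdateMetadata {
    pub changelog_text: Option<String>,
}

// Illustrative error type (the PR itself uses `anyhow`).
#[derive(Debug)]
pub struct ParseError(String);

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "parse error: {}", self.0)
    }
}

impl std::error::Error for ParseError {}

// Hypothetical decoder standing in for `serde_json::from_slice`.
pub fn decode(raw: &[u8]) -> Result<UpdateMetadata, ParseError> {
    let text = std::str::from_utf8(raw).map_err(|e| ParseError(e.to_string()))?;
    if text.is_empty() {
        return Err(ParseError("empty output".to_string()));
    }
    Ok(UpdateMetadata {
        changelog_text: Some(text.to_string()),
    })
}

// Takes the value, not a Result: errors were already handled by the caller.
pub fn parse_texts(mut data: UpdateMetadata) -> UpdateMetadata {
    // Placeholder filtering step: drop changelog text that is only whitespace.
    if data
        .changelog_text
        .as_deref()
        .map_or(false, |t| t.trim().is_empty())
    {
        data.changelog_text = None;
    }
    data
}

// Caller side: `?` returns early on error, so `parse_texts` never sees one.
pub fn load(raw: &[u8]) -> Result<UpdateMetadata, ParseError> {
    let data = decode(raw)?;
    Ok(parse_texts(data))
}
```

With this split, `parse_texts` has no error branch to maintain, and the early exit lives in one place.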

mimoo · 2021-03-14T18:37:20Z

web-backend/metrics/src/common/dependabot.rs

+    if std::env::var("RETAIN_ALL").is_ok() &&  std::env::var("RETAIN_ALL").unwrap() != ""  {
+        println!("DISABLED TEXT PARSING, RETAINING ALL DATA. $RETAIN_ALL={}", std::env::var("RETAIN_ALL").unwrap());
+        return input;
+    }


so two things:

- do we really want this RETAIN feature? Is it ever going to be useful if we don't really display that information to the user anyway?

- I think this if should be on the caller side. Actually, I think it shouldn't even live in this file, as this is priority engine logic and this file is just about bindings to dependabot.

wdyt?
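
If the RETAIN_ALL switch is kept, the check in the diff above could read the variable once instead of three times. This is only a sketch: `retain_all_from` and `retain_all` are hypothetical helper names, and the decision is split from the environment lookup so it can be exercised without touching the process environment.

```rust
use std::env;

/// Decides whether filtering is disabled, given the raw value of the variable.
/// Same rule as the diff: the variable must be set and non-empty.
pub fn retain_all_from(value: Option<&str>) -> bool {
    matches!(value, Some(v) if !v.is_empty())
}

/// Reads `RETAIN_ALL` from the environment exactly once.
pub fn retain_all() -> bool {
    retain_all_from(env::var("RETAIN_ALL").ok().as_deref())
}
```

A caller could then write `if retain_all() { return data; }` wherever the decision ends up living.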

mimoo · 2021-03-14T18:37:46Z

web-backend/metrics/src/common/dependabot.rs

+            commit.message = "".to_string();
+        }
+    }
+    */


delete the comment

That's left here for the review as food for thought:

do we want to remove the entire commits, or do we just want to truncate uninteresting messages to the empty string?

The first method is more effective at saving storage space.
The second would allow us to keep the html_url field of all commits... But since I'm not filtering the commits_url: Option<String> field, we already have all commit URLs, even those of the commits that were not "flagged".

WDYT?

So the first question we should ask ourselves is: do we want to keep these in? I know we can reduce storage by removing them, but we're still 10MB away from the mongodb limit on documents, so we can still ask ourselves if it's worth keeping them. I don't necessarily like having devs make decisions based on changelogs and commits, because these can be faked and don't necessarily reflect the reality of what's on crates.io. But they can still be useful to have if we want to allow the dev to make a more informed decision on prioritization. wdyt?

If we do want to display it to the user, we can simply display the thing in the create PR/review page.

If we don't want to display it to the user, we can add serde::skip to these fields to avoid storing them, but still have the fields to do priority analysis on (and save the analysis in another field).
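
The two storage options weighed in this thread (dropping unflagged commits versus blanking their messages) can be sketched as follows. The `Commit` struct and the word list are hypothetical, reduced to the two fields mentioned in the discussion, and `flagged_text` uses the plain substring test from the diff.

```rust
// Hypothetical, reduced commit record: only the fields discussed in the thread.
#[derive(Debug, Clone, PartialEq)]
pub struct Commit {
    pub html_url: String,
    pub message: String,
}

// Illustrative word list; the real FLAGGED_WORDS is not shown in this excerpt.
const FLAGGED_WORDS: &[&str] = &["security", "rce"];

// Substring test, as in the diff under review.
fn flagged_text(text: &str) -> bool {
    FLAGGED_WORDS.iter().any(|w| text.contains(*w))
}

/// Option 1: remove unflagged commits entirely (saves the most space).
pub fn drop_unflagged(commits: &mut Vec<Commit>) {
    commits.retain(|c| flagged_text(&c.message));
}

/// Option 2: keep every commit (and its html_url) but blank unflagged messages.
pub fn blank_unflagged(commits: &mut [Commit]) {
    for c in commits.iter_mut() {
        if !flagged_text(&c.message) {
            c.message.clear();
        }
    }
}
```

Option 1 changes the length of the list, so anything that counts commits downstream would see fewer entries; option 2 keeps the count stable at the cost of storing the URLs.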

mimoo · 2021-03-14T18:38:34Z

web-backend/metrics/src/common/dependabot.rs

+
+fn flagged_text(text: &String) -> bool {
+    for word in FLAGGED_WORDS {
+        if text.contains(word) {


it's better if we use regexes, otherwise you'll get words that include the letters "rce" without being the actual "rce" word.

same for "sec". I predict we will have a lot of false positives : D

actually, it's fine to have false positives, but it'd be good to see how useful these keywords are. I'm wondering if it would be a good idea to show the user context around the words we grepped. I guess this is what you're doing already by only retaining commits or changelog that passed this test...

Yeah I thought "false positives aren't an issue". But we could move to regexes to be more accurate. As you want?

I thought we could have a later prioritization step that would give different weights to different words, since RCE is prolly worse than "bugfix"

> I thought we could have a later prioritization step that would give different weights to different words, since RCE is prolly worse than "bugfix"

let's keep it simple for now : o but you can add that idea for later

> Yeah I thought "false positives aren't an issue". But we could move to regexes to be more accurate. As you want?

we can keep this for now and see how many false positives we get; from past experience I predict we'll get too many, but no need to optimize this now anyway
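
For reference, the false positives predicted above come from substring matching: "source" contains "rce" and "second" contains "sec". A std-only sketch of whole-word matching is shown below; it is a swapped-in technique (splitting on non-alphanumeric characters), not the regex approach suggested in the review, and the word list is hypothetical.

```rust
// Illustrative word list; the real FLAGGED_WORDS is not shown in this excerpt.
const FLAGGED_WORDS: &[&str] = &["rce", "sec", "security"];

/// Returns true if any whole word of `text` equals a flagged word,
/// ignoring ASCII case. Words are split on non-alphanumeric characters.
pub fn flagged_text(text: &str) -> bool {
    text.split(|c: char| !c.is_alphanumeric())
        .any(|token| FLAGGED_WORDS.iter().any(|w| token.eq_ignore_ascii_case(w)))
}
```

Taking `&str` rather than `&String` also lets callers pass string literals and slices without allocating.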

mimoo · 2021-03-14T18:45:21Z

web-backend/metrics/src/common/dependabot.rs

+        }
+    }
+    return false;
+}


don't know if we want to care about EOF, we don't allow a file that doesn't end with a linebreak in diem/diem

mimoo · 2021-03-14T18:48:10Z

docker-compose.yml

@@ -23,6 +27,7 @@ services:
    environment:
      - "GITHUB_TOKEN=$GITHUB_TOKEN" # an optional PAT for Github
      - "CARGO_HOME=/cargo" # used with a volume to persist cargo stuff
+      - "RETAIN_ALL=$RETAIN_ALL" # to disable parsing of the commit and changelog messages


So the problem with this is that currently I imagine that we can prioritize things on the frontend side by checking if there are changelog/commits present in the UpdateMetadata field. But if we enable this, then every update has these fields so we can't prioritize anymore.

mimoo

So this is good as a first pass. We should set aside some time to figure out how to show that on the frontend. I think we need to figure out these before merging:

- either remove RETAIN_ALL or figure out a way to keep the prioritization when RETAIN_ALL is enabled. I think using a different field than UpdateMetadata is a good idea: we could have a commit_words: Vec<String> containing the words that were flagged.
- move the filtering out of dependabot. Perhaps directly in the priority() function
- use regexes to parse commits/changelogs
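
A rough sketch of the separate-field idea from the first point: the flagged words are collected into their own `commit_words` list, so the raw commits can be retained or dropped independently. The `Analysis` struct, the `analyze` function and the word list are assumptions made for illustration; only the `commit_words: Vec<String>` field comes from the comment above.

```rust
// Illustrative word list; the real FLAGGED_WORDS is not shown in this excerpt.
const FLAGGED_WORDS: &[&str] = &["security", "rce", "vulnerability"];

// Hypothetical container for the analysis result, kept apart from UpdateMetadata.
#[derive(Debug, Default)]
pub struct Analysis {
    pub commit_words: Vec<String>,
}

/// Collects each flagged word at most once, in order of first appearance.
/// Uses case-insensitive substring matching for brevity.
pub fn analyze(messages: &[&str]) -> Analysis {
    let mut commit_words: Vec<String> = Vec::new();
    for message in messages {
        let lower = message.to_lowercase();
        for word in FLAGGED_WORDS {
            let seen = commit_words.iter().any(|w| w.as_str() == *word);
            if lower.contains(*word) && !seen {
                commit_words.push((*word).to_string());
            }
        }
    }
    Analysis { commit_words }
}
```

With this shape, the frontend could prioritize on whether `commit_words` is empty, regardless of whether RETAIN_ALL kept the full commit data.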

AnomalRoil · 2021-03-15T12:25:55Z

I was thinking that we might have 2 different things:

- a prioritization engine that flags different keywords with different weights and all.
- a filter to try and reduce the amount of data we are currently storing.

This tries to address the latter as it's easy and it's better to save some storage from scratch, no?
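
The first of the two pieces, a weighted prioritization step, is out of scope for this PR, but a minimal sketch may help frame the later discussion. The words, the weights and the `priority_score` name are all invented for illustration; nothing here reflects an agreed design.

```rust
// Hypothetical weights: higher means more urgent.
const WEIGHTED_WORDS: &[(&str, u32)] = &[("rce", 10), ("security", 5), ("bugfix", 1)];

/// Sums the weight of every flagged word found in `text`.
/// Case-insensitive substring matching, so the false-positive caveat
/// from the review (e.g. "source" containing "rce") applies here too.
pub fn priority_score(text: &str) -> u32 {
    let lower = text.to_lowercase();
    WEIGHTED_WORDS
        .iter()
        .filter(|(word, _)| lower.contains(*word))
        .map(|(_, weight)| *weight)
        .sum()
}
```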

mimoo added 30 commits January 19, 2021 14:30

first commit

c570214

better README

0032e91

Adding Vue3 frontend

6fc0b45

added dockerfile for frontend

e4cc9b7

added makefile

25c58fc

adding cronjobs

47da43e

update docker compose with front end

b2cfc16

cleaning up

a314a39

adding dashboard component to vue

a95b834

change cronjob to daily

8c2aff0

gitignores

20ba0dd

moving things around

21e09b9

cleaning up + writing up metrics abstraction

72fd124

README: adding fluff on metrics service

89ae111

refactor metrics

61bb5e0

adding database folder (just a README for now)

403b5ce

using a bounded channel between backend and metrics service

ff121ec

fix backend

bf0d899

moving metrics in backend

4e81024

added diagram for metrics service

cf1a941

pinning backend to nightly (needed by rocket)

9448621

fixing errors

35c2cfa

fix metrics diagram

feacd41

update metrics README

793d76f

update Makefile to avoid re-building

542afbe

added testing resource

a6c30f7

work on metrics

fd0d52c

moving things around

17e5672

re-adding rust stuff

e9630e6

refactoring metrics

a47c1d7

mimoo and others added 19 commits February 17, 2021 20:54

[backend] re-introduced rust-toolchain, but not sure if it'll work si…

0546617

…nce we don't use rustup...

[backend] fixed clippy lints

1febd90

[backend] fix rustup/cargo_home

82fec80

WIP kustomize

353c1e4

move mongo-express to local deploys only

4704109

fix scripts; update Makefile for kustomize

b94791d

[k8s] added README

4599724

Merge pull request #31 from jnaulty/jnaulty/kustomize

a70f19f

WIP kustomize

[k8s] move doc/ stuff to k8s/

c395566

[k8s] added link to main README

b67a482

[k8s] cleaning README

9c58d79

[k8s] fix port frontend

90f9cce

[frontend] fixed multi-repo with vuex

4e1aaa6

[frontend] fix new rustsec

3362e45

[frontend] fix console errors

33c7349

[frontend] fix breadcrumbs

a20015d

[frontend] cleaner review section + different link if there's a risk

3ed09ae

[backend] Adding a keyword filter

e90db55

to reduce the size of the DB and avoid saving all commits and changelogs

mimoo reviewed Mar 14, 2021

mimoo suggested changes Mar 14, 2021

mimoo closed this Mar 17, 2021

mimoo force-pushed the main branch from 6837e85 to d6f7ed9 Compare March 17, 2021 17:56
