Add log-based alerts for the snapshot service #379

Open
LesnyRumcajs opened this issue Jan 15, 2024 · 2 comments
Labels
daily-snapshot Issues related to the daily snapshot service new-relic Priority: P1 Added to issues and PRs relating to a high severity bugs.

Comments

@LesnyRumcajs
Member

Issue summary

Follow-up to #354

I had some issues creating a proper alert based on the snapshot logs. Although the alert condition is met, the alert does not fire for some reason. Moving this into a separate issue to unblock the infrastructure changes and possibly parallelize the work.

Other information and links

@LesnyRumcajs LesnyRumcajs added daily-snapshot Issues related to the daily snapshot service new-relic Priority: P1 Added to issues and PRs relating to a high severity bugs. labels Jan 15, 2024
@LesnyRumcajs
Member Author

Rules that elevated the frustration levels, but should otherwise work fine:

# This file contains NR event rules used to generate metrics from logs, given that
# the service does not generate metrics by itself.
resource "newrelic_events_to_metrics_rule" "generate_snapshot_attempt_metrics" {
  account_id = var.new_relic_account_id
  for_each   = toset(["mainnet", "calibnet"])

  name        = format("%s %s snapshot generation attempts", var.service_name, each.key)
  description = "Snapshot generation attempts"
  nrql        = "From Log select uniqueCount(message) as '${var.service_name}.${each.key}.snapshot_generation_run' WHERE `hostname` = '${var.service_name}' AND filePath ='/root/logs/${each.key}_log.txt' AND message LIKE '%running snapshot export%'"
}

resource "newrelic_events_to_metrics_rule" "generate_snapshot_success_metrics" {
  account_id = var.new_relic_account_id
  for_each   = toset(["mainnet", "calibnet"])

  name        = format("%s %s snapshot generation success", var.service_name, each.key)
  description = "Success snapshot generations"
  nrql        = "From Log select uniqueCount(message) as '${var.service_name}.${each.key}.snapshot_generation_ok' WHERE `hostname` = '${var.service_name}' AND filePath ='/root/logs/${each.key}_log.txt' AND message LIKE '%Snapshot uploaded for%'"
}

resource "newrelic_events_to_metrics_rule" "generate_snapshot_fail_metrics" {
  account_id = var.new_relic_account_id
  for_each   = toset(["mainnet", "calibnet"])

  name        = format("%s %s snapshot generation failure", var.service_name, each.key)
  description = "Failed snapshot generations"
  nrql        = "From Log select uniqueCount(message) as '${var.service_name}.${each.key}.snapshot_generation_fail' WHERE `hostname` = '${var.service_name}' AND filePath ='/root/logs/${each.key}_log.txt' AND message LIKE '%Snapshot upload failed for%'"
}

@LesnyRumcajs
Member Author

Leftovers that didn't work as expected. Something must be missing (null values?)

resource "newrelic_nrql_alert_condition" "snapshot_frequency_condition" {
  for_each    = toset(["mainnet", "calibnet"])
  policy_id   = newrelic_alert_policy.alert.id
  type        = "static"
  name        = format("Low snapshot generation frequency - %s", each.key)
  description = "Alert when snapshots are not generated within required time interval"
  enabled     = true

  # evaluation_delay = 7200 
  # aggregation_window = 14400
  aggregation_window = 360

  nrql {
    query = format("FROM Metric SELECT count(`${var.service_name}.${each.key}.snapshot_generation_ok`)")
  }

  warning {
    operator  = "below"
    threshold = 1
    # threshold_duration    = 14400
    threshold_duration    = 360
    threshold_occurrences = "ALL"
  }

  critical {
    operator  = "below"
    threshold = 1
    # threshold_duration    = 28800
    threshold_duration    = 720
    threshold_occurrences = "ALL"
  }
}
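
One thing that may be worth trying for the "null values?" suspicion: when no matching log line arrives, the query returns no data points at all rather than 0, so a "below 1" threshold may never be evaluated. Below is a minimal, untested sketch of the same condition with explicit gap filling and loss-of-signal settings. The `fill_option`, `fill_value`, `expiration_duration`, `open_violation_on_expiration` and `close_violations_on_expiration` arguments come from the `newrelic_nrql_alert_condition` resource; the chosen values, and the switch from `count()` to `uniqueCount()` (on the assumption that a metric created with `uniqueCount` should be queried with the same function), are guesses rather than a confirmed fix.

```hcl
# Sketch only: not verified against this account or the provider version in use.
resource "newrelic_nrql_alert_condition" "snapshot_frequency_condition" {
  for_each    = toset(["mainnet", "calibnet"])
  policy_id   = newrelic_alert_policy.alert.id
  type        = "static"
  name        = format("Low snapshot generation frequency - %s", each.key)
  description = "Alert when snapshots are not generated within required time interval"
  enabled     = true

  aggregation_window = 360

  # Assumption: treat aggregation windows with no data points as 0 instead of null,
  # so that "below 1" can actually be evaluated.
  fill_option = "static"
  fill_value  = 0

  # Assumption: also open an incident if the signal disappears entirely.
  expiration_duration            = 720
  open_violation_on_expiration   = true
  close_violations_on_expiration = false

  nrql {
    # Assumption: query the metric with the same function that created it.
    query = "FROM Metric SELECT uniqueCount(`${var.service_name}.${each.key}.snapshot_generation_ok`)"
  }

  critical {
    operator              = "below"
    threshold             = 1
    threshold_duration    = 720
    threshold_occurrences = "all"
  }
}
```

The short windows mirror the test values above; for real use they would presumably go back to the commented-out intervals, within the limits the provider allows.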

@LesnyRumcajs LesnyRumcajs self-assigned this Jan 24, 2024