Missing logs from Nimbus ELK stack #81
While checking Logstash logs I found errors about inconsistent formatting of some messages, but that's not what I'm looking for. |
Cluster and indices appear healthy:
|
Nodes look properly connected, including the dashboard LB node
|
Another logging issue found: status-im/nimbus-eth2#3336 |
I've been trying to check what messages arrive in Logstash from the Nimbus hosts.
Within the |
This definitely should work, since this basic config works:
Maybe the issue is permissions... |
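For context, the sort of "basic config" referred to above might look like this; a hypothetical sketch, since the actual file wasn't captured in the thread, and the port is an assumption:

```sh
# Minimal Logstash pipeline: accept events on a TCP port and dump them to
# stdout, so you can eyeball what actually arrives.
cat > /tmp/minimal.conf <<'EOF'
input {
  tcp {
    port  => 5141          # assumed port
    codec => json_lines
  }
}
output {
  stdout { codec => rubydebug }  # print the full event structure
}
EOF
# Run Logstash in the foreground against just this file:
/usr/share/logstash/bin/logstash -f /tmp/minimal.conf
```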
Yup, found it!
|
What the fuck is happening. I have a clearly defined output.
This makes no fucking sense. |
If I drop all other outputs it loads fine:
But why?! If there's a syntax issue it should just fail to start. |
Ok, so it appears Logstash creates the output only when the first event matching the rule arrives. And because I restarted Logstash, for some reason it stopped receiving logs from our Nimbus hosts unless I restart the sender.
And then it works... |
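The behaviour described above matters when outputs sit behind conditionals. A sketch of the shape such a config takes (field names, hosts, and index pattern are assumptions, not the actual fleet config):

```sh
cat > /etc/logstash/conf.d/90-outputs.conf <<'EOF'
output {
  # An output behind a conditional only sees events when the condition matches.
  if [fleet] == "nimbus.mainnet" {            # assumed field name
    elasticsearch {
      hosts => ["https://es-node-01:9200"]    # placeholder host
      index => "logstash-nimbus-%{+YYYY.MM.dd}"
    }
  } else {
    file { path => "/var/log/logstash/unmatched.log" }  # catch-all for debugging
  }
}
EOF
```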
I made the simplest config possible, consisting of one file:
And it doesn't show up in the file. What the actual fuck. This means the issue has to be on the sending side. |
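A single-file sanity check of this sort, as a hypothetical reconstruction (port and path are assumptions): everything that arrives is appended to one file, so an empty file means nothing is arriving at all:

```sh
cat > /tmp/single.conf <<'EOF'
input  { tcp { port => 5141 } }                       # assumed port
output { file { path => "/tmp/logstash-debug.log" } } # append every event here
EOF
/usr/share/logstash/bin/logstash -f /tmp/single.conf
```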
There's a whole stream of forks of
The last one is the newest and also has a release: |
You can get some stuff from it:
Most metrics are |
Based on some stuff I've read about tuning, I've tuned the Logstash output action configuration: https://github.com/status-im/infra-role-bootstrap-linux/commit/87954443
|
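The exact tuning is in the commit above; if the shipper is rsyslog (an assumption on my part), output tuning typically lands in the forwarding action's queue parameters, roughly like:

```sh
cat > /etc/rsyslog.d/99-logstash.conf <<'EOF'
action(
  type="omfwd"                      # forward over the network
  target="logstash.example.org"     # placeholder host
  port="5141" protocol="tcp"        # assumed port
  queue.type="LinkedList"           # asynchronous in-memory queue
  queue.size="100000"               # events buffered during Logstash hiccups
  action.resumeRetryCount="-1"      # retry forever instead of dropping
)
EOF
systemctl restart rsyslog
```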
I started a local instance of Logstash just to debug that one node, but for some reason I can't get it to open a connection to it:
|
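A few generic checks for "Logstash runs but nothing connects" (not from the thread, just standard debugging; the input port is an assumption, 9600 is Logstash's monitoring API):

```sh
ss -tlnp | grep -E '5141|9600'                 # is the input port actually listening?
curl -s localhost:9600/_node/stats/pipelines   # per-pipeline event counters
nc -vz logstash-host 5141                      # reachability from the sending side
```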
Looks like we're not getting logs from all Nimbus hosts. Not sure what that's about, will figure it out tomorrow. |
I found out why logs are missing. My own mistake when adjusting the journald configuration. Reverted: https://github.com/status-im/infra-role-bootstrap-linux/commit/2da26200 |
For some reason changes to journald's configuration don't take effect as expected: |
"making a change to any of journald's configuration files (/etc/systemd/journald.conf, for example) requires subsequently running systemctl restart systemd-journald in order for the changes to take effect. However, restarting the journald service appears to break logging for all other running services until those services are also restarted" - systemd/systemd#2236 :-) |
Now isn't that lovely. |
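In practice the quote above translates to something like the following after every journald config change (the service names in the second step are placeholders for whatever units lost their logging):

```sh
# Apply the journald.conf change:
systemctl restart systemd-journald
# Per the linked systemd issue, stdout/stderr logging of already-running
# units can now be broken until those units are restarted as well:
systemctl restart beacon-node logstash
```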
I think we have a much bigger problem than just some logs missing. The MAJORITY of logs are not stored in the ES cluster:
That's 6 million entries for just one hour of one node on one host, and here's the size of the index from today:
The index for the whole of today has fewer logs than one hour of one node on one host... |
Ok, I think I can see the issue: we're simply hitting the 1 Gbit/s limit of the link installed in our Hetzner hosts:
|
If I look at traffic sent via the network interface, other hosts show something similar, which from a rough calculation gives us 16 × 70 = 1120 Mb/s, above the 1 Gbit/s link. |
The other option is to try to get a 10 Gb/s link for the Hetzner host. Though for that to work we'd also have to move our ElasticSearch cluster to Hetzner and put it under the same switch. |
So I see two options:
Not sure about the viability with the 5-port host; I'd have to ask Hetzner support. |
Those logs are highly compressible. Can we send them compressed or re-compress them in flight? |
Yes, that's what I already suggested in #81 (comment), but there's no ready-made solution for compressing logs sent to Logstash. So it would require some kind of custom solution, with an intermediate service doing the compression, and it would also need optimized buffering for the compression to actually be effective. |
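For the record, one improvised way to get compression in flight without touching the shipper or Logstash themselves, assuming the shipper can be pointed at localhost (host and port are placeholders):

```sh
# SSH tunnel with compression (-C): the shipper sends to localhost:5141,
# the traffic crosses the wire gzip-compressed, and pops out at Logstash.
ssh -N -C -L 5141:localhost:5141 logstash-host &
```

How much this saves depends entirely on how much data gets batched per write, which is exactly the buffering problem mentioned above.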
In comparison, a … I guess this is related to the number of peers in the network. Or something like that, @arnetheduck? |
a combination of validators and peers |
A very considerable difference, I'd say. I'm going to leave it like that, with logging to ELK for 4 mainnet hosts off, until I resolve this issue. |
Looks like replacing the cluster with one hosted on Hetzner metal nodes did help: https://nimbus-logs.infra.status.im/goto/ba7674a1469e911b2fafab3573eeee4e |
Now the Hetzner hosts are receiving around 170 MB/s in logs from the aggregation hosts. Today's index is ALREADY 22 GB in size (11 GB per host):
Not sure how sustainable this is... this means 24h of logs is going to be ~240 GB, and those hosts have only ~400 GB of storage. We will have to add some rules to Logstash to possibly cut out the most verbose and useless logs. |
The bulk of the volume seems to be one very verbose message. If we could cut that one out we'd save a LOT of space and also make queries much faster. @arnetheduck thoughts? |
ok to filter them from elk |
Yep, the size is ridiculous, I will apply the rules:
|
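The actual rules are in the commit referenced below; the general shape of such a rule is a conditional drop filter, roughly like this (field names and the message text are hypothetical):

```sh
cat > /etc/logstash/conf.d/20-drop-verbose.conf <<'EOF'
filter {
  if [fleet] =~ /^nimbus/ and [msg] == "Some very chatty debug message" {
    drop { }   # discard the event entirely; it never reaches Elasticsearch
  }
}
EOF
```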
The change - https://github.com/status-im/infra-hq/commit/d699e893 - results in about a 96% reduction in log volume. The step in the graph is because I deployed to part of the fleet first. |
lgtm, should also speed up searches considerably |
Looks like we have a new issue on the ES cluster:
|
I can't even list existing indices:
|
I bumped the JVM memory limits to 80% of what the system has: ba2caedb |
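For reference, that kind of bump lands in the Elasticsearch JVM options; the values below are illustrative (e.g. roughly 80% of a 32 GB host), not the ones from the commit:

```sh
cat > /etc/elasticsearch/jvm.options.d/heap.options <<'EOF'
-Xms26g
-Xmx26g
EOF
# Initial and max heap are set to the same value so the heap never resizes.
systemctl restart elasticsearch
```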
Oh, I see it: I used the old format for accessing nested fields. Fixed: https://github.com/status-im/infra-hq/commit/b2505364 |
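The syntax difference in question, illustrated with hypothetical field names (the real change is in the linked commit):

```sh
# Old/wrong: "json.program" is treated as one literal flat field name.
#   if [json.program] == "beacon-node" { ... }
# Correct nested-field reference in Logstash conditionals:
#   if [json][program] == "beacon-node" { ... }
```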
I've also lowered the number of indices (days) kept to 10 to avoid running out of space: fb7bde23 |
Referenced commit: "We have bigger log volume now that we fixed the logging setup" (#81) - Signed-off-by: Jakub Sokołowski <jakub@status.im>
Interestingly enough I've noticed a reduction in disk space used after an index for the new day was created:
My best bet is that this is due to ElasticSearch compression taking effect. Possibly we could get an even better result by adjusting the index codec:
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/index-modules.html |
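The relevant knob on that page is index.codec; for daily indices it would have to go into the index template so new indices pick it up (template name and pattern are assumptions):

```sh
curl -XPUT 'http://localhost:9200/_template/logstash-nimbus' \
  -H 'Content-Type: application/json' -d '{
    "index_patterns": ["logstash-*"],
    "settings": { "index": { "codec": "best_compression" } }
  }'
```

The codec applies to newly written segments, so existing indices keep their current compression.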
Actually, we can save some more space by just disabling replicas, since this is not high-value data: infra-nimbus/ansible/group_vars/logs.nimbus.yml, lines 31 to 34 @ fbb1ee6
This can save quite a lot:
Change: 0398c31a |
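Since number_of_replicas is a dynamic setting, on top of the role change it can also be flipped on already-existing indices directly (the index pattern is an assumption):

```sh
curl -XPUT 'http://localhost:9200/logstash-*/_settings' \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "number_of_replicas": 0 } }'
```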
Referenced commit: "Currently daily indices take up 10 GB per host, so it's fine" (#81) - Signed-off-by: Jakub Sokołowski <jakub@status.im>
Ok, so we're seeing about 30 GB of logs per day, which is ~10 GB per day per node, since we have 3 ES nodes:
This should give us ~30 days of logs easily, so I've increased the limit back to 20 days: 392d4124 |
I consider this resolved. At least for now... |
Some logs are missing when querying them in our Nimbus Kibana instance, specifically:
Querying for these returns nothing:
https://nimbus-logs.infra.status.im/goto/05738788ae13e81a579cbcadc06e4cbb