
feat(agent): output buffer persistence #15221

Open · wants to merge 8 commits into master

Conversation

DStrand1 (Contributor)

Summary

Implements the write-through buffer persistence strategy detailed in the spec added in #14928.

Currently this PR is mostly a draft that separates the output buffer into multiple implementations, along with some experimentation with a WAL file library. The largest outstanding issue is metric serialization to []byte.

  • Using the influx parser+serializer causes a cyclic import and also drops the metric value type field. However, this is probably the cleanest option to investigate, since it would make the WAL files easy to re-import.
  • Another suggestion was to use encoding/gob, which is what the PR currently does. However, gob has issues with unexported fields, so I need to look into how to work around this (see the sketch below).
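
For illustration, a minimal sketch of one way to work around gob's unexported-field limitation: copy the metric into an exported intermediate struct before encoding. The serializedMetric type and this metricToBytes helper are assumptions for the sketch, not part of the PR, and the approach still drops the metric value type just like the influx serializer would.

package models

import (
	"bytes"
	"encoding/gob"
	"time"

	"github.com/influxdata/telegraf"
)

// serializedMetric is a hypothetical exported mirror of a metric, needed
// because encoding/gob cannot encode the unexported fields of the concrete
// telegraf.Metric implementation.
type serializedMetric struct {
	Name   string
	Tags   map[string]string
	Fields map[string]interface{}
	Time   time.Time
}

func init() {
	// Field values travel as interface{}, so gob needs their concrete types
	// registered before encoding.
	gob.Register(int64(0))
	gob.Register(uint64(0))
	gob.Register(float64(0))
	gob.Register("")
	gob.Register(false)
}

// metricToBytes encodes a metric via the exported intermediate struct.
func metricToBytes(m telegraf.Metric) ([]byte, error) {
	sm := serializedMetric{
		Name:   m.Name(),
		Tags:   m.Tags(),
		Fields: m.Fields(),
		Time:   m.Time(),
	}
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(sm); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}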

Checklist

  • No AI generated code was used in this PR

Related issues

Related to #802, #14805

@telegraf-tiger bot added the feat label (Improvement on an existing feature such as adding a new setting/mode to an existing plugin) on Apr 24, 2024
@srebhan srebhan left a comment

Thanks @DStrand1 for this draft. It looks quite nice overall and I left some suggestions in the code.

What concerns me a bit is the underlying WAL implementation. Looking through their code I discovered some fundamental flaws.

  1. The WAL file is not opened with SYNC, so any issue on disk or with Telegraf will lose more metrics than expected; in the worst case (with a lot of RAM) all metrics are lost...
  2. The filenames are saved with the pattern <user defined name>.<index>.<offset> and are string-sorted, so if the index or offset crosses an order boundary (i.e. a digit is added, 9 -> 10) the order is messed up, as lala.10.xyz follows lala.1.xyz instead of lala.9.xyz! This in turn messes up the metric order.
  3. The WAL implementation does not take care of removing files, so if you have WAL file(s) and restart Telegraf multiple times, the metrics in the WAL will be written multiple times without further handling. Doing this handling outside of the WAL implementation is hard, as you cannot know which file was processed.
  4. In its current form the WAL implementation is not capable of truncating the files front-to-back, so metrics are prone to be sent multiple times if the file was not completely flushed...

There are more (smaller) issues, but looking at the issues above I think you should go for another WAL library that is more mature. I looked at https://github.com/tidwall/wal in the past and think it can do what we need. Not a must, just a suggestion. ;-)
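
For reference, a rough sketch of how the tidwall/wal calls could map onto the concerns above; the appendAndTrim helper, its parameters, and the exact option handling are placeholders to be verified against the library, not code from this PR.

package models

import "github.com/tidwall/wal"

// appendAndTrim is a hypothetical helper illustrating the tidwall/wal calls
// relevant to the review comments above.
func appendAndTrim(path string, payload []byte, lastFlushedIndex uint64) error {
	// NoSync: false means every write is synced to disk before returning
	// (addresses point 1).
	w, err := wal.Open(path, &wal.Options{NoSync: false})
	if err != nil {
		return err
	}
	defer w.Close()

	// Entries are addressed by a monotonically increasing uint64 index, so
	// ordering does not depend on string-sorted filenames (point 2).
	last, err := w.LastIndex()
	if err != nil {
		return err
	}
	if err := w.Write(last+1, payload); err != nil {
		return err
	}

	// After a batch has been flushed to the output, drop it from the front of
	// the log so a restart does not replay it (points 3 and 4); assumes
	// lastFlushedIndex is still within the log's range.
	return w.TruncateFront(lastFlushedIndex + 1)
}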

Outdated review threads (resolved) on: config/config.go, models/buffer.go, models/buffer_disk.go, models/buffer_mem.go
@srebhan srebhan self-assigned this Apr 25, 2024
powersj commented Apr 25, 2024

@DStrand1,

Thanks for the PR, super excited to see this. To capture what we talked about in pairs: the plan is for you to address some of the initial comments and switch to the proposed WAL library.

Let's plan to talk through where you are at or any issues with the new library in Monday's pairs.

Thanks!

@DStrand1 DStrand1 requested a review from srebhan April 30, 2024 16:01
@srebhan srebhan left a comment

Thanks @DStrand1 for the nice update! My only concern (despite the test coverage) is the error handling. IMO we should pass errors up to be able to log them, e.g. if we are running out of disk. For unrecoverable errors (e.g. corrupt WAL files, non-serializable metrics, etc.) we probably should panic, but everything else should propagate up to the plugin level (at least).
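
Roughly what that split could look like in the single-metric path; the add name, the error return, and the use of fmt are assumptions for this sketch, not code from the PR.

// Sketch only: panic for unrecoverable problems, propagate everything else.
func (b *DiskBuffer) add(m telegraf.Metric) error {
	// metricToBytes can keep panicking internally: a metric we cannot
	// serialize is an unrecoverable programming error.
	data := b.metricToBytes(m)

	// Disk full, permission problems, etc. are recoverable from the caller's
	// point of view, so pass them up instead of panicking.
	if err := b.walFile.Write(b.writeIndex(), data); err != nil {
		return fmt.Errorf("writing metric to WAL failed: %w", err)
	}
	b.metricAdded()
	return nil
}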

Comment on lines 65 to 96
func (b *DiskBuffer) addSingle(metric telegraf.Metric) bool {
err := b.walFile.Write(b.writeIndex(), b.metricToBytes(metric))
metric.Accept()
if err == nil {
b.metricAdded()
return true
}
return false
}

Is it really worth the code duplication or should we use the batch-interface instead?
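
For illustration, one way the single path could reuse the batch path, assuming addBatch accepts a slice and returns the number of metrics written (a sketch, not code from the PR):

func (b *DiskBuffer) addSingle(m telegraf.Metric) bool {
	// Hypothetical: treat one metric as a batch of one so the write/accept
	// logic lives in a single place.
	return b.addBatch([]telegraf.Metric{m}) == 1
}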

return index
}

func (b *DiskBuffer) Add(metrics ...telegraf.Metric) int {

I believe we should return an error here to log it or do other mitigation...

return written
}

func (b *DiskBuffer) Batch(batchSize int) []telegraf.Metric {

Return the error.

@DStrand1 DStrand1 marked this pull request as ready for review June 4, 2024 18:14
telegraf-tiger bot commented Jun 4, 2024

@srebhan srebhan left a comment

Thanks for the update @DStrand1! Here are my major concerns with the disk implementation:

  1. The error handling is not in good shape yet; we panic in quite a few places, allowing no mitigation or graceful operation. While this is OK for cases where we can be sure no error occurs (like serializing the metric), I do have some headaches with the WAL writes neither producing an error nor panicking, as there are cases where errors can occur, like permission issues, disk full, etc.
  2. We must provide a way to close the WAL file in a clean fashion in case the underlying library relies on this. Please extend the interface accordingly (see the sketch after this list)!
  3. We cannot accept metrics in the current spot, as it is not yet guaranteed that they are written to disk. Please carefully check for those issues, as we really must guarantee that if we say we accepted a metric it will not be lost!
  4. We are currently prone to accepting a metric twice: once in the addBatch path and once when the output calls Accept() for the same metric. IIRC this will cause a panic as the refcount becomes negative... I think each buffer implementation requires its own way of metricWritten...
  5. It is undefined what happens if the output drops or rejects the metric. Do we remove it from disk? If so, the input will never notice that the metric was dropped, as we already accepted it... We must clarify this case (similarly for reject)!
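
For point 2, a minimal sketch of what a clean-shutdown hook could look like, assuming a Close() error method is added to the buffer interface (the in-memory buffer would implement it as a no-op); the name and signature are assumptions:

func (b *DiskBuffer) Close() error {
	// Nothing to do if the WAL was never opened.
	if b.walFile == nil {
		return nil
	}
	return b.walFile.Close()
}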

Comment on lines +55 to +58
case "overflow":
// todo implementme
// todo log currently unimplemented
return NewMemoryBuffer(capacity, bm)

Remove this for now and add it back in a later PR!

Comment on lines +61 to +62
// todo log invalid buffer strategy configuration provided, falling back to memory
return NewMemoryBuffer(capacity, bm)

No. Please fail here with

Suggested change
// todo log invalid buffer strategy configuration provided, falling back to memory
return NewMemoryBuffer(capacity, bm)
return nil, fmt.Errorf("invalid buffer strategy %q", strategy)

Don't try to be clever as this leads to unexpected behavior in a very critical area!

Comment on lines +8 to +11
"github.com/influxdata/telegraf"
"github.com/influxdata/telegraf/metric"
"github.com/tidwall/wal"
)

Suggested change
"github.com/influxdata/telegraf"
"github.com/influxdata/telegraf/metric"
"github.com/tidwall/wal"
)
"github.com/tidwall/wal"
"github.com/influxdata/telegraf"
"github.com/influxdata/telegraf/metric"
)

Comment on lines +17 to +18
walFile *wal.Log
walFilePath string

You might want to shorten the names e.g.

Suggested change
walFile *wal.Log
walFilePath string
file *wal.Log
path string

Comment on lines +73 to +81
dropped := 0
for _, m := range metrics {
if !b.addSingle(m) {
dropped++
}
}
b.BufferSize.Set(int64(b.length()))
return dropped
// todo implement batched writes

Shouldn't you use the addBatch function here?

if err != nil {
panic(err)
}
m.Accept() // accept here, since the metric object is no longer retained from here

I don't think we can accept the metric here. We need to really be sure it's written first, even though this might require another for loop...
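
One possible shape for this, sketched under the assumption that addBatch gains an error return, collects the whole batch first, and only accepts once the WAL write has succeeded (the double-accept question from point 4 above is a separate issue):

func (b *DiskBuffer) addBatch(metrics []telegraf.Metric) (int, error) {
	// Collect the whole batch first...
	batch := new(wal.Batch)
	index := b.writeIndex()
	for _, m := range metrics {
		batch.Write(index, b.metricToBytes(m))
		index++
	}
	// ...write it durably...
	if err := b.walFile.WriteBatch(batch); err != nil {
		return 0, err
	}
	// ...and only then accept, so an accepted metric can never be lost.
	for _, m := range metrics {
		m.Accept()
		b.metricAdded()
	}
	return len(metrics), nil
}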


if b.length() == 0 {
// no metrics in the wal file, so return an empty array
return make([]telegraf.Metric, 0)

Suggested change
return make([]telegraf.Metric, 0)
return []telegraf.Metric{}

I think you might even be able to return nil here...

b.batchFirst = b.readIndex()
metrics := make([]telegraf.Metric, b.batchSize)

for i := 0; i < int(b.batchSize); i++ {

Suggested change
for i := 0; i < int(b.batchSize); i++ {
for i := b.batchFirst; i < b.batchFirst + b.batchSize; i++ {

is easier to use in the code I guess...

Comment on lines +112 to +116
err := b.walFile.WriteBatch(batch)
if err != nil {
return 0 // todo error handle, test if a partial write occur
}
return written

We cannot do this! Either we panic here if we decide this is the right thing to do or we do proper error handling.

Imagine a use case where it is crucial to collect all metrics and not lose any, and your disk runs full. The user might have a Kafka instance for buffering. Now you accepted the metrics in the for-loop above, but writing fails here due to the disk being full... There will be no error, Telegraf will show zero metrics written (just as if no metric arrived at the output), but the metrics are removed from Kafka because we accepted them. Now imagine the fun of debugging this situation! :-)

@@ -0,0 +1,22 @@
package models

Please move this to the tests instead of adding another package.


3 participants