Filebeat (Python) Unit Tests are flaky on Buildkite macOS x86_64 agents especially on `8.13` branch #39613

dliappis · 2024-05-17T10:15:14Z

Flaky Test

Test Name: tests/system/test_shutdown.py but also tests/system/test_registrar.py

Link:

beats/filebeat/tests/system/test_shutdown.py

Lines 17 to 31 in bf13818

    
               def test_shutdown(self): 
        
                   """ 
        
                   Test starting and stopping Filebeat under load. 
        
                   """ 
        
                   self.nasa_logs() 
        
                   self.render_config_template( 
        
                       path=os.path.abspath(self.working_dir) + "/log/*", 
        
                       ignore_older="1h" 
        
                   ) 
        
                   for i in range(1, 5): 
        
                       proc = self.start_beat(logging_args=["-e", "-v"]) 
        
                       time.sleep(.5) 
        
                       proc.check_kill_and_wait()

Branch: 8.13 but also other branches like 8.14
Artifact Link: https://buildkite.com/organizations/elastic/pipelines/filebeat/builds/6343/jobs/018f858e-f9e1-46d0-ac2a-806e13454195/artifacts/018f85a5-23d2-4008-88c8-764765abaab5
Notes: Additional details about the test. e.g. theory as to failure cause

Stack Trace

`tests/system/test_registrar.py`

Buildkite failure link: https://buildkite.com/elastic/filebeat/builds/6343#018f825c-39a5-4588-8257-9bac31116e69/1804-1935

=================================== FAILURES ===================================
________________________ Test.test_restart_state_reset _________________________
self = <test_registrar.Test testMethod=test_restart_state_reset>
    def test_restart_state_reset(self):
        """
        Test that ttl is set to -1 after restart and no inputs covering it
        """
        self.render_config_template(
            path=os.path.abspath(self.working_dir) + "/log/*",
            clean_inactive="10s",
            ignore_older="5s"
        )
        os.mkdir(self.working_dir + "/log/")
        testfile_path = self.working_dir + "/log/test.log"
        with open(testfile_path, 'w') as testfile:
            testfile.write("Hello World\n")
        filebeat = self.start_beat()
        # Wait until state written
>       self.wait_until(
            lambda: self.output_has(lines=1),
            max_timeout=30)
tests/system/test_registrar.py:991:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <test_registrar.Test testMethod=test_restart_state_reset>
cond = <function Test.test_restart_state_reset.<locals>.<lambda> at 0x10eb79870>
max_timeout = 30, poll_interval = 0.1, name = 'cond', err_msg = ''
    def wait_until(self, cond, max_timeout=20, poll_interval=0.1, name="cond", err_msg=""):
        """
        TODO: this can probably be a "wait_until_output_count", among other things, since that could actually use `self`, and this can become an internal function
        Waits until the cond function returns true,
        or until the max_timeout is reached. Calls the cond
        function every poll_interval seconds.
        If the max_timeout is reached before cond() returns
        true, an exception is raised.
        """
        start = datetime.now()
        while not cond():
            if datetime.now() - start > timedelta(seconds=max_timeout):
                print("Test has failed, here are the Beat logs")
                for l in self.get_log_lines():
                    print(l)
>               raise WaitTimeoutError(
                    f"Timeout waiting for condition '{name}'. Waited {max_timeout} seconds: {err_msg}")
E               beat.beat.WaitTimeoutError: Timeout waiting for condition 'cond'. Waited 30 seconds:
../libbeat/tests/system/beat/beat.py:449: WaitTimeoutError

`tests/system/test_shutdown.py`

Buildkite failure link: https://buildkite.com/elastic/filebeat/builds/6343#018f858e-f9e1-46d0-ac2a-806e13454195/1889-1939

______________________________ Test.test_shutdown ______________________________
self = <test_shutdown.Test testMethod=test_shutdown>
    @unittest.skipIf(platform.platform().startswith("Windows-7"),
                     "Flaky test: https://github.com/elastic/beats/issues/22795")
    def test_shutdown(self):
        """
        Test starting and stopping Filebeat under load.
        """
        self.nasa_logs()
        self.render_config_template(
            path=os.path.abspath(self.working_dir) + "/log/*",
            ignore_older="1h"
        )
        for i in range(1, 5):
            proc = self.start_beat(logging_args=["-e", "-v"])
            time.sleep(.5)
>           proc.check_kill_and_wait()
tests/system/test_shutdown.py:31:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../libbeat/tests/system/beat/beat.py:153: in check_kill_and_wait
    return self.check_wait(exit_code=exit_code)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <beat.beat.Proc object at 0x105d7a9e0>, exit_code = 0
    def check_wait(self, exit_code=0):
        """
        check_wait waits for the process to exit, and checks the return code of the process
        """
        actual_exit_code = self.wait()
>       assert actual_exit_code == exit_code, f"Expected exit code to be {exit_code}, but it was {actual_exit_code}"
E       AssertionError: Expected exit code to be 0, but it was -15
../libbeat/tests/system/beat/beat.py:133: AssertionError
- generated xml file: /Users/admin/builds/bk-agent-prod-orka-1715932705515043799/elastic/filebeat/filebeat/build/TEST-python-unit.xml -
============================= slowest 20 durations =============================
27.73s call     filebeat/tests/system/test_crawler.py::Test::test_encodings
15.23s call     filebeat/tests/system/test_harvester.py::Test::test_close_timeout
10.66s call     filebeat/tests/system/test_reload_inputs.py::Test::test_reload_same_input
9.70s call     filebeat/tests/system/test_stdin.py::Test::test_stdin_eof
9.20s call     filebeat/tests/system/test_registrar.py::Test::test_restart_state_reset_ttl_with_space
8.49s call     filebeat/tests/system/test_registrar.py::Test::test_restart_state
8.15s call     filebeat/tests/system/test_registrar.py::Test::test_restart_state_reset_ttl_no_clean_inactive
8.05s call     filebeat/tests/system/test_registrar.py::Test::test_registry_file_default_permissions
7.30s call     filebeat/tests/system/test_registrar.py::Test::test_rotating_file_with_restart
7.28s call     filebeat/tests/system/test_registrar.py::Test::test_rotating_file_inode
7.03s call     filebeat/tests/system/test_registrar.py::Test::test_restart_state_reset_ttl
6.95s call     filebeat/tests/system/test_reload_inputs.py::Test::test_reload_add
6.91s call     filebeat/tests/system/test_reload_inputs.py::Test::test_start_stop
6.87s call     filebeat/tests/system/test_registrar.py::Test::test_state_after_rotation_ignore_older
6.82s call     filebeat/tests/system/test_reload_inputs.py::Test::test_reload_same_config
6.54s call     filebeat/tests/system/test_harvester.py::Test::test_decode_error
6.48s call     filebeat/tests/system/test_crawler.py::Test::test_utf8
6.36s call     filebeat/tests/system/test_input.py::Test::test_rotating_close_inactive_low_write_rate
6.27s call     filebeat/tests/system/test_input.py::Test::test_harvester_limit
6.13s call     filebeat/tests/system/test_input.py::Test::test_input_filter_dropfields
=========================== short test summary info ============================
FAILED tests/system/test_shutdown.py::Test::test_shutdown - AssertionError: Expected exit code to be 0, but it was -15
============ 1 failed, 189 passed, 175 skipped in 608.83s (0:10:08) ============

The text was updated successfully, but these errors were encountered:

elasticmachine · 2024-05-17T10:16:40Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

dliappis · 2024-05-22T08:16:05Z

Also seen in CI runs for merge commits against main e.g. in https://buildkite.com/elastic/filebeat/builds/6466

alexsapran · 2024-06-05T11:24:02Z

@pierrehilbert this is something that is affecting your CI metrics.

dliappis added the flaky-test Unstable or unreliable test cases. label May 17, 2024

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 17, 2024

dliappis added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label May 17, 2024

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Filebeat (Python) Unit Tests are flaky on Buildkite macOS x86_64 agents especially on `8.13` branch #39613

Filebeat (Python) Unit Tests are flaky on Buildkite macOS x86_64 agents especially on `8.13` branch #39613

dliappis commented May 17, 2024

elasticmachine commented May 17, 2024

dliappis commented May 22, 2024

alexsapran commented Jun 5, 2024

Filebeat (Python) Unit Tests are flaky on Buildkite macOS x86_64 agents especially on 8.13 branch #39613

Filebeat (Python) Unit Tests are flaky on Buildkite macOS x86_64 agents especially on 8.13 branch #39613

Comments

dliappis commented May 17, 2024

Flaky Test

Stack Trace

tests/system/test_registrar.py

tests/system/test_shutdown.py

elasticmachine commented May 17, 2024

dliappis commented May 22, 2024

alexsapran commented Jun 5, 2024

Filebeat (Python) Unit Tests are flaky on Buildkite macOS x86_64 agents especially on `8.13` branch #39613

Filebeat (Python) Unit Tests are flaky on Buildkite macOS x86_64 agents especially on `8.13` branch #39613

`tests/system/test_registrar.py`

`tests/system/test_shutdown.py`