Remote task monitoring #6308
Make proper use of the avocado task state machine by detaching from the task runner. This prevents the respective coroutine from spending all of its time at the task spawning stage instead of monitoring the spawned task. While not fatal, the previous behavior also led to "task ended too fast" warnings at the end of the long task spawning wait, when the task had indeed ended but definitely not too fast. While beneficial, this change needs some supporting changes to make the monitoring resilient enough; these come next.
Signed-off-by: Plamen Dimitrov <plamen.dimitrov@intra2net.com>
A potential async yield, caused by a slightly longer nonzero IO wait of a forked command, could result in the task being logged as "successfully spawned" only after it has entirely completed, if the other coroutines spend too much time before control returns to this one. This in turn would once again produce a "task ended too early" warning, simply because the task was revisited at a much later time. Worse yet, it would also skip the monitor stage, so the task result might be awaited indefinitely and any task timeout ignored. Also make remote command execution fully synchronous so that, even though dropping out of the coroutine at the call is highly unlikely (no IO waits), it is now prevented entirely.
Signed-off-by: Plamen Dimitrov <plamen.dimitrov@intra2net.com>
Code Review
This pull request replaces the asynchronous remote command execution with a synchronous implementation and introduces a synchronous retry loop with time.sleep in is_task_alive. The review feedback highlights that these synchronous calls will block the main asyncio event loop, freezing the application and preventing other concurrent tasks from progressing. It is highly recommended to keep these operations asynchronous.
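The concern raised in this review can be reproduced outside avocado. The following is a minimal sketch, not code from this pull request: `heartbeat`, `blocking_check`, `cooperative_check` and `max_gap` are hypothetical names, and the heartbeat merely stands in for the other coroutines that share the event loop. It measures how long those coroutines are starved while a check sleeps synchronously versus asynchronously.

```python
import asyncio
import time


async def heartbeat(ticks, interval=0.01, count=20):
    """Stand-in for other coroutines sharing the event loop."""
    for _ in range(count):
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_check(delay):
    # time.sleep() holds the event loop: no other coroutine runs meanwhile.
    time.sleep(delay)


async def cooperative_check(delay):
    # asyncio.sleep() yields to the loop so other coroutines keep running.
    await asyncio.sleep(delay)


async def max_gap(check, delay=0.2):
    """Return the longest pause between heartbeat ticks while `check` runs."""
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    await check(delay)
    await beat
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    print("blocking gap:   ", asyncio.run(max_gap(blocking_check)))
    print("cooperative gap:", asyncio.run(max_gap(cooperative_check)))
```

With the blocking variant the longest gap is at least the full sleep duration, whereas the cooperative variant keeps the gap close to the heartbeat interval. Whether that starvation is acceptable here depends on how long the synchronous calls actually take, which this sketch does not measure.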
Codecov Report
Coverage diff (master vs. #6308):
            master    #6308     +/-
Coverage    73.60%    72.04%    -1.57%
Files       206       206
Lines       22505     23356     +851
Hits        16565     16826     +261
Misses      5940      6530      +590
There are observable cases where the process takes a short while to appear, and thus for the task to be considered alive, so make sure the overall check makes at least a few tries within a ten-second window.
Signed-off-by: Plamen Dimitrov <plamen.dimitrov@intra2net.com>
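A bounded retry of this kind could look roughly like the sketch below. It is an illustration under assumptions rather than the code from this commit: `retry_until` is a hypothetical helper, and the number of tries and the even spacing are made up, since the commit message only asks for "a few tries" within ten seconds. The `sleep` parameter is injectable so that the helper can be exercised without waiting.

```python
import time


def retry_until(check, window=10.0, tries=5, sleep=time.sleep):
    """Call `check` up to `tries` times, evenly spread over `window` seconds.

    Returns True as soon as `check()` is truthy, False if every try fails.
    """
    interval = window / tries
    for attempt in range(tries):
        if check():
            return True
        # Do not sleep after the final attempt; there is nothing left to wait for.
        if attempt < tries - 1:
            sleep(interval)
    return False


if __name__ == "__main__":
    outcomes = iter([False, False, True])  # process shows up on the third try
    waits = []
    print(retry_until(lambda: next(outcomes), sleep=waits.append), waits)
```

Passing a recording function as `sleep` shows that a process appearing on the third try costs two waits of two seconds each with these defaults.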
Prevent errors where we could not retrieve the command status, like:
File "/usr/lib/python3.13/site-packages/avocado_spawner_remote/__init__.py", line 218, in wait_task
if not RemoteSpawner.is_task_alive(runtime_task):
~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
File "/usr/lib/python3.13/site-packages/avocado_spawner_remote/__init__.py", line 142, in is_task_alive
status, output = session.cmd_status_output(
~~~~~~~~~~~~~~~~~~~~~~~~~^
f"pgrep -r R,S -f {runtime_task.task.identifier}"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/usr/lib/python3.13/site-packages/aexpect/client.py", line 1491, in cmd_status_output
raise ShellStatusError(cmd, out) from error
Use the "safe" flag to handle cases where the shell prompt might be
polluted with "[Done] some-background-process" output from the
detached avocado task process, but also handle any further unexpected
status retrieval errors. There were still rare cases where the safe
flag missed something, yet it does filter out most cases from
needing to catch status errors (which is also worse in terms of
performance compared to a regular boolean check).
Signed-off-by: Plamen Dimitrov <plamen.dimitrov@intra2net.com>
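The combination described in this commit message, the safe flag first and exception handling as a fallback, can be sketched as follows. This is a self-contained illustration, not the spawner's code: `FakeSession` is a hypothetical stand-in for the remote session, the local `ShellStatusError` only mimics the aexpect exception seen in the traceback, the `safe=True` keyword is assumed to be the flag the message refers to, and treating an unreadable status as "not alive" is an assumption made for the sake of the example.

```python
class ShellStatusError(Exception):
    """Local stand-in for the aexpect exception of the same name."""


class FakeSession:
    """Hypothetical session that replays canned cmd_status_output() results."""

    def __init__(self, outcomes):
        self.outcomes = list(outcomes)
        self.calls = []

    def cmd_status_output(self, cmd, safe=False):
        self.calls.append((cmd, safe))
        outcome = self.outcomes.pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome


def is_task_alive(session, identifier):
    """Return True if pgrep finds a running or sleeping matching process."""
    try:
        # safe=True is assumed to be the flag mentioned in the commit message.
        status, _ = session.cmd_status_output(
            f"pgrep -r R,S -f {identifier}", safe=True
        )
    except ShellStatusError:
        # Assumption: an unreadable status is reported as "not alive".
        return False
    # pgrep exits with 0 when at least one process matched.
    return status == 0


if __name__ == "__main__":
    session = FakeSession([(0, "4242\n"), (1, ""), ShellStatusError("no status")])
    print([is_task_alive(session, "1-example-task") for _ in range(3)])
```

The plain status comparison handles the common case, so the exception path is only taken for the rare status retrieval failures that the safe flag does not filter out.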
Force-pushed from 1592cc5 to 678478b
I assume the failures are due to rebasing the original branch on top of 113.0. I was hoping I could just push from a tag and have enough stability, but if it is really needed I will rebase on top of the most recent master and push again. Let me know.
The is_task_alive checks may end up with a false positive outcome if the test process has spawned its own subprocess which contains its name as an argument. This can happen, for example, when checking a VT test task with a Windows 11 VM which might have spawned a TPM 2.0 emulator with a socket server like
2732988 pts/8 S+ 0:00 \_ /usr/bin/swtpm socket --ctrl type=unixio,path=/root/avocado/data/avocado-vt/swtpm/mw111_tpm0_swtpm.sock,mode=0600 --tpmstate dir=/root/avocado/data/avocado-vt/swtpm/mw111_tpm0_state,mode=0600 --terminate --tpm2 --log file=/mnt/local/results/job-name/test-name/vtpm_mw111_tpm0_swtpm.log
Let's prevent this by also filtering for the task-run prefix to identify the exact parent process for the test task.
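The effect of the extra prefix can be illustrated with plain regular expressions. The sketch below only emulates `pgrep -f` style matching against full command lines with Python's `re` module; the command lines, the task identifier and the exact pattern are hypothetical and not taken from this pull request.

```python
import re

# Hypothetical task identifier and process command lines, for illustration only.
IDENTIFIER = "1-boot.mw111"
PROCESSES = [
    "avocado-runner-avocado-vt task-run -i 1-boot.mw111 -k avocado-vt",
    "/usr/bin/swtpm socket --tpm2 --log file=/results/1-boot.mw111/swtpm.log",
]


def matching(pattern, command_lines):
    """Emulate `pgrep -f`: keep command lines where the pattern matches anywhere."""
    return [line for line in command_lines if re.search(pattern, line)]


# Matching on the identifier alone also catches the emulator subprocess.
loose = re.escape(IDENTIFIER)
# Requiring "task-run" before the identifier narrows it to the task process.
strict = r"task-run.*" + re.escape(IDENTIFIER)

if __name__ == "__main__":
    print(len(matching(loose, PROCESSES)), "match(es) without the prefix")
    print(len(matching(strict, PROCESSES)), "match(es) with the prefix")
```

Without the prefix both command lines match, which is the false positive described above; with the prefix only the task process remains.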
Improve the integration between the avocado task state machine and the remote process spawner in order to properly monitor tasks and respect their configured or default timeouts.