fix: delete stale error.pb before uploading outputs on JobSet restart #1204

Open
AdilFayyaz wants to merge 2 commits into main from adil/jobsets-restart-fix

Conversation

@AdilFayyaz

Collaborator

Motivation

When a clustered (JobSet) task fails, it writes an error.pb to the output path. If a later restart attempt succeeds, that stale error file is left behind and the action can be misreported as failed despite a successful run.

Summary

  • Add _clear_stale_clustered_error_if_needed() to upload_outputs: on restart attempts (JOBSET_RESTART_ATTEMPT > 0), delete a leftover error.pb from the output path before writing outputs.
  • Add _delete_path() helper that deletes via the underlying fsspec filesystem, handling both async (_rm_file/_rm) and sync (rm_file/rm) backends.
  • Deletion is best-effort: a missing file is debug-logged, and any other failure is downgraded to a warning so that output upload still proceeds.
  • Attempt 0 is a no-op (no exists/filesystem calls), keeping the non-restart hot path unchanged.
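
The behaviour summarized above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, the assumption that JOBSET_RESTART_ATTEMPT is read from an environment variable, and the choice to rely on FileNotFoundError instead of an explicit existence check are all assumptions made for the example. The only fsspec names it relies on are the removal methods listed in the summary (_rm_file/_rm and rm_file/rm), looked up by duck typing.

```python
import asyncio
import inspect
import logging
import os

logger = logging.getLogger(__name__)

ERROR_FILE = "error.pb"


def restart_attempt() -> int:
    """Read the restart attempt (assumed here to come from an environment variable)."""
    try:
        return int(os.environ.get("JOBSET_RESTART_ATTEMPT", "0"))
    except ValueError:
        return 0


async def delete_path(fs, path: str) -> None:
    """Delete one file using whichever removal method `fs` provides."""
    # Async backends: use the underscore methods only when they are coroutines.
    for name in ("_rm_file", "_rm"):
        method = getattr(fs, name, None)
        if method is not None and inspect.iscoroutinefunction(method):
            await method(path)
            return
    # Sync backends: run the blocking call off the event loop.
    for name in ("rm_file", "rm"):
        method = getattr(fs, name, None)
        if callable(method):
            await asyncio.to_thread(method, path)
            return
    raise NotImplementedError(f"{type(fs).__name__} exposes no removal method")


async def clear_stale_error_if_needed(fs, output_path: str) -> bool:
    """Return True only if a leftover error file was deleted."""
    if restart_attempt() <= 0:
        return False  # attempt 0: no filesystem calls at all
    error_path = f"{output_path.rstrip('/')}/{ERROR_FILE}"
    try:
        await delete_path(fs, error_path)
    except FileNotFoundError:
        logger.debug("No stale %s at %s", ERROR_FILE, error_path)
        return False
    except Exception as exc:  # best-effort: never block the output upload
        logger.warning("Could not delete stale %s: %s", error_path, exc)
        return False
    return True
```

In this sketch the caller would await clear_stale_error_if_needed(fs, output_path) immediately before writing outputs. Because every failure is caught and logged, the upload that follows runs regardless of whether the cleanup succeeded.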

Test Plan

  • New tests/flyte/clustered/test_runtime_io_cleanup.py covers:
      • restart attempt > 0 deletes the stale error, then uploads outputs (correct call order),
      • attempt 0 skips the existence check and delete entirely,
      • a delete failure is soft-failed (warning logged, outputs still uploaded).
  • Run: uv run pytest tests/flyte/clustered/test_runtime_io_cleanup.py -v
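
One way to assert the three cases above, including the call order, is to hang every collaborator off a single parent mock and compare its mock_calls. The sketch below is illustrative only: upload_outputs_sketch is a hypothetical stand-in for the code under test, the exists/delete method names on the fake filesystem are placeholders, and the real test file may be structured differently.

```python
import asyncio
import logging
from unittest.mock import AsyncMock, Mock, call

logger = logging.getLogger("cleanup_sketch")


async def upload_outputs_sketch(attempt: int, fs, error_path: str, upload) -> None:
    """Hypothetical stand-in for the code under test, not the real upload_outputs."""
    if attempt > 0:
        try:
            if await fs.exists(error_path):
                await fs.delete(error_path)
        except Exception as exc:
            # Soft failure: log and fall through to the upload.
            logger.warning("could not delete %s: %s", error_path, exc)
    await upload()


def make_mocks(delete_error=None) -> Mock:
    """Attach all collaborators to one parent so mock_calls records global order."""
    parent = Mock()
    parent.fs.exists = AsyncMock(return_value=True)
    parent.fs.delete = AsyncMock(side_effect=delete_error)
    parent.upload = AsyncMock()
    return parent


def run_case(attempt: int, delete_error=None) -> list:
    """Run the stand-in once and return the ordered list of recorded calls."""
    parent = make_mocks(delete_error)
    asyncio.run(
        upload_outputs_sketch(attempt, parent.fs, "out/error.pb", parent.upload)
    )
    return parent.mock_calls


def test_restart_deletes_then_uploads():
    assert run_case(1) == [
        call.fs.exists("out/error.pb"),
        call.fs.delete("out/error.pb"),
        call.upload(),
    ]


def test_attempt_zero_skips_cleanup():
    # No exists() or delete() call is recorded on the non-restart path.
    assert run_case(0) == [call.upload()]


def test_delete_failure_is_soft():
    calls = run_case(2, delete_error=OSError("boom"))
    assert calls[-1] == call.upload()
```

Under pytest, the warning in the soft-failure case could additionally be checked with the caplog fixture; that assertion is left out here to keep the sketch free of third-party imports.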

Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
@AdilFayyaz AdilFayyaz self-assigned this Jun 12, 2026
Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
@AdilFayyaz AdilFayyaz requested a review from pingsutw June 12, 2026 23:02