4 changes: 2 additions & 2 deletions .github/actions/pr-gate/action.yml
@@ -3,8 +3,8 @@ description: >
Resolve PR metadata for a `pull-request/<N>` push from copy-pr-bot and decide
whether the workflow should run. Sets `should-run=true` only when the pushed
SHA still matches the PR head SHA. If `required_label` is provided, the PR
must also carry that label. For non-`push` events (e.g. `workflow_dispatch`),
always sets `should-run=true`.
must also carry that label. For non-`push` events (e.g. `workflow_dispatch`
and `merge_group`), always sets `should-run=true`.
inputs:
required_label:
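The gate described above reduces to three checks. A minimal sketch, assuming a hypothetical `should_run` helper that takes labels as newline-separated text (the action's real inputs and implementation are not shown in this hunk):

```shell
# Hypothetical helper, not the action's code.
# Args: event name, pushed SHA, PR head SHA, required label (may be empty),
#       PR labels as newline-separated text.
should_run() {
  local event="$1" pushed_sha="$2" head_sha="$3" required_label="$4" labels="$5"

  # Non-push events (e.g. workflow_dispatch, merge_group) always run.
  if [ "$event" != "push" ]; then
    echo true
    return 0
  fi

  # A push whose SHA no longer matches the PR head is stale.
  if [ "$pushed_sha" != "$head_sha" ]; then
    echo false
    return 0
  fi

  # When a label is required, the PR must carry it (exact whole-line match).
  if [ -n "$required_label" ] && ! printf '%s\n' "$labels" | grep -Fxq -- "$required_label"; then
    echo false
    return 0
  fi

  echo true
}

should_run push abc123 abc123 "test:e2e" "test:e2e"   # prints: true
should_run push abc123 def456 "" ""                   # prints: false
should_run merge_group abc123 "" "test:e2e" ""        # prints: true
```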
2 changes: 2 additions & 0 deletions .github/workflows/branch-checks.yml
@@ -1,6 +1,8 @@
name: Branch Checks

on:
merge_group:
types: [checks_requested]
push:
branches:
- "pull-request/[0-9]+"
32 changes: 23 additions & 9 deletions .github/workflows/branch-e2e.yml
@@ -1,6 +1,8 @@
name: Branch E2E Checks

on:
merge_group:
types: [checks_requested]
push:
branches:
- "pull-request/[0-9]+"
@@ -37,15 +39,27 @@ jobs:
shell: bash
run: |
set -euo pipefail
if [ "$EVENT_NAME" != "push" ]; then
run_core_e2e=true
run_gpu_e2e=true
run_kubernetes_ha_e2e=true
else
run_core_e2e="$(jq -r 'index("test:e2e") != null' <<< "$LABELS_JSON")"
run_gpu_e2e="$(jq -r 'index("test:e2e-gpu") != null' <<< "$LABELS_JSON")"
run_kubernetes_ha_e2e="$(jq -r 'index("test:e2e-kubernetes") != null' <<< "$LABELS_JSON")"
fi
case "$EVENT_NAME" in
push)
run_core_e2e="$(jq -r 'index("test:e2e") != null' <<< "$LABELS_JSON")"
run_gpu_e2e="$(jq -r 'index("test:e2e-gpu") != null' <<< "$LABELS_JSON")"
run_kubernetes_ha_e2e="$(jq -r 'index("test:e2e-kubernetes") != null' <<< "$LABELS_JSON")"
;;
merge_group)
run_core_e2e=true
run_gpu_e2e=true
run_kubernetes_ha_e2e=false
Comment on lines +49 to +51

Member Author

I think we should either not run e2e tests in the merge queue, or use the same logic as we do for PRs.

Collaborator

Is it just because waiting for e2e tests can take a while, or because they might flake for infra reasons?

Collaborator

+1 to not running all E2E. We could get started with that and potentially have a smaller set of E2E that runs on merge queue if needed?

For PRs that need e2e (i.e. have the label) we could require them to be up to date with main. GitHub has that as a setting, but it's global, so it wouldn't be enforced.

;;
workflow_dispatch)
run_core_e2e=true
run_gpu_e2e=true
run_kubernetes_ha_e2e=true
;;
*)
echo "::error::Unsupported event '$EVENT_NAME'" >&2
exit 1
;;
Comment on lines +58 to +61

Member Author

Question: Why not just default to the values for workflow_dispatch (which is what this code was doing beforehand)?

esac
if [ "$run_core_e2e" = "true" ] || [ "$run_gpu_e2e" = "true" ] || [ "$run_kubernetes_ha_e2e" = "true" ]; then
run_any_e2e=true
else
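The event-to-suite mapping in this revision of the hunk can be exercised locally. The sketch below is an assumption-laden stand-in: `select_e2e` and `has_label` are hypothetical names, and labels are passed as newline-separated text instead of the JSON array the workflow reads with `jq`. The `merge_group` values are the ones under discussion in the review thread, not a settled choice.

```shell
# Hypothetical stand-in for the workflow step; not the workflow's code.
has_label() {
  # $1: label to look for, $2: newline-separated labels
  if printf '%s\n' "$2" | grep -Fxq -- "$1"; then echo true; else echo false; fi
}

select_e2e() {
  local event="$1" labels="${2:-}"
  local core gpu k8s
  case "$event" in
    push)
      # PR mirror pushes are label-driven.
      core="$(has_label test:e2e "$labels")"
      gpu="$(has_label test:e2e-gpu "$labels")"
      k8s="$(has_label test:e2e-kubernetes "$labels")"
      ;;
    merge_group)
      core=true; gpu=true; k8s=false
      ;;
    workflow_dispatch)
      core=true; gpu=true; k8s=true
      ;;
    *)
      echo "unsupported event '$event'" >&2
      return 1
      ;;
  esac
  echo "core=$core gpu=$gpu kubernetes_ha=$k8s"
}

select_e2e push "test:e2e-gpu"   # prints: core=false gpu=true kubernetes_ha=false
select_e2e merge_group           # prints: core=true gpu=true kubernetes_ha=false
```

Failing on unknown events, as the hunk does, surfaces a missing mapping as soon as a new trigger is added; defaulting to the `workflow_dispatch` values would instead run every suite silently.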
2 changes: 2 additions & 0 deletions .github/workflows/helm-lint.yml
@@ -4,6 +4,8 @@
name: Helm Lint

on:
merge_group:
types: [checks_requested]
push:
branches:
- "pull-request/[0-9]+"
66 changes: 57 additions & 9 deletions .github/workflows/required-ci-gates.yml
@@ -1,6 +1,8 @@
name: Required CI Gates

on:
merge_group:
types: [checks_requested]
pull_request_target:
types: [opened, synchronize, reopened, ready_for_review, labeled, unlabeled]
workflow_run:
@@ -17,7 +19,7 @@ permissions:
statuses: write

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.event.workflow_run.head_sha || github.run_id }}
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.event.workflow_run.head_sha || github.sha || github.run_id }}
cancel-in-progress: true

jobs:
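The `||` chain in the concurrency group yields its first truthy operand, so each event type ends up keyed on a different value. A rough shell analogue, assuming a hypothetical `first_non_empty` helper and made-up values:

```shell
# Hypothetical helper: prints the first non-empty argument, like a chain of
# `a || b || c` in a GitHub Actions expression (for string operands).
first_non_empty() {
  local value
  for value in "$@"; do
    if [ -n "$value" ]; then
      printf '%s\n' "$value"
      return 0
    fi
  done
  return 1
}

# Argument order: PR number, workflow_run head SHA, github.sha, run id.
first_non_empty "17" "" "0123abc" "9001"   # pull_request_target: prints 17
first_non_empty "" "" "0123abc" "9001"     # merge_group: prints 0123abc
```

If `github.sha` is populated for every event this workflow handles, the trailing `github.run_id` fallback would never be reached. That is an observation worth verifying, not something the diff states.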
@@ -33,9 +35,13 @@ jobs:
PR_NUMBER_FROM_EVENT: ${{ github.event.pull_request.number }}
PR_HEAD_SHA_FROM_EVENT: ${{ github.event.pull_request.head.sha }}
PR_LABELS_FROM_EVENT: ${{ toJSON(github.event.pull_request.labels.*.name) }}
GITHUB_SHA_FROM_CONTEXT: ${{ github.sha }}
GITHUB_REF_NAME_FROM_CONTEXT: ${{ github.ref_name }}
GITHUB_RUN_ID_FROM_CONTEXT: ${{ github.run_id }}
WORKFLOW_RUN_HEAD_SHA: ${{ github.event.workflow_run.head_sha }}
WORKFLOW_RUN_HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }}
WORKFLOW_RUN_EVENT: ${{ github.event.workflow_run.event }}
WORKFLOW_RUN_HTML_URL: ${{ github.event.workflow_run.html_url }}
shell: bash
run: |
set -euo pipefail
@@ -67,12 +73,19 @@ jobs:
}

resolve_pull_request_event() {
CONTEXT_KIND="pull_request"
WORKFLOW_EVENT="push"
PR_NUMBER="$PR_NUMBER_FROM_EVENT"
HEAD_SHA="$PR_HEAD_SHA_FROM_EVENT"
LABELS_JSON=$(jq -c . <<< "$PR_LABELS_FROM_EVENT")
MIRROR_REF="pull-request/$PR_NUMBER"
EXPECTED_HEAD_BRANCH="$MIRROR_REF"
STATUS_TARGET_URL="https://github.com/$GH_REPO/pull/$PR_NUMBER"
}

load_pr_context() {
CONTEXT_KIND="pull_request"
WORKFLOW_EVENT="push"
PR_NUMBER="$1"

local pr state
@@ -85,9 +98,35 @@ jobs:

HEAD_SHA=$(jq -r '.head.sha' <<< "$pr")
LABELS_JSON=$(gh api "repos/$GH_REPO/issues/$PR_NUMBER" --jq '[.labels[].name]')
MIRROR_REF="pull-request/$PR_NUMBER"
EXPECTED_HEAD_BRANCH="$MIRROR_REF"
STATUS_TARGET_URL="https://github.com/$GH_REPO/pull/$PR_NUMBER"
}

resolve_merge_group_event() {
CONTEXT_KIND="merge_group"
WORKFLOW_EVENT="merge_group"
HEAD_SHA="$GITHUB_SHA_FROM_CONTEXT"
LABELS_JSON="[]"
EXPECTED_HEAD_BRANCH="$GITHUB_REF_NAME_FROM_CONTEXT"
STATUS_TARGET_URL="https://github.com/$GH_REPO/actions/runs/$GITHUB_RUN_ID_FROM_CONTEXT"
}

resolve_merge_group_workflow_run_event() {
CONTEXT_KIND="merge_group"
WORKFLOW_EVENT="merge_group"
HEAD_SHA="$WORKFLOW_RUN_HEAD_SHA"
LABELS_JSON="[]"
EXPECTED_HEAD_BRANCH="$WORKFLOW_RUN_HEAD_BRANCH"
STATUS_TARGET_URL="${WORKFLOW_RUN_HTML_URL:-https://github.com/$GH_REPO/actions}"
}

resolve_workflow_run_event() {
if [ "$WORKFLOW_RUN_EVENT" = "merge_group" ]; then
resolve_merge_group_workflow_run_event
return
fi

if [ "$WORKFLOW_RUN_EVENT" != "push" ]; then
echo "Ignoring workflow_run from event '$WORKFLOW_RUN_EVENT'."
exit 0
@@ -112,29 +151,34 @@ jobs:
resolve_context() {
if [ "$EVENT_NAME" = "pull_request_target" ]; then
resolve_pull_request_event
elif [ "$EVENT_NAME" = "merge_group" ]; then
resolve_merge_group_event
elif [ "$EVENT_NAME" = "workflow_run" ]; then
resolve_workflow_run_event
else
echo "Unsupported event '$EVENT_NAME'."
exit 1
fi

PR_URL="https://github.com/$GH_REPO/pull/$PR_NUMBER"
MIRROR_REF="pull-request/$PR_NUMBER"
STATUS_TARGET_URL="${STATUS_TARGET_URL:-https://github.com/$GH_REPO/actions}"
}

verify_mirror() {
local context="$1"
local mirror_sha

if [ "$CONTEXT_KIND" = "merge_group" ]; then
return 0
fi

mirror_sha=$(gh api "repos/$GH_REPO/branches/$MIRROR_REF" --jq '.commit.sha' 2>/dev/null || true)
if [ -z "$mirror_sha" ]; then
post_status "$context" pending "Waiting for /ok to test mirror" "$PR_URL"
post_status "$context" pending "Waiting for /ok to test mirror" "$STATUS_TARGET_URL"
return 1
fi

if [ "$mirror_sha" != "$HEAD_SHA" ]; then
post_status "$context" pending "Waiting for /ok to test mirror" "$PR_URL"
post_status "$context" pending "Waiting for /ok to test mirror" "$STATUS_TARGET_URL"
return 1
fi

@@ -149,8 +193,8 @@ jobs:
local required_job_name="${5:-}"
local workflow_url="https://github.com/$GH_REPO/actions/workflows/$workflow_file"

if [ -n "$required_label" ] && ! has_label "$required_label"; then
post_status "$context" success "$required_label not applied" "$PR_URL"
if [ "$CONTEXT_KIND" = "pull_request" ] && [ -n "$required_label" ] && ! has_label "$required_label"; then
post_status "$context" success "$required_label not applied" "$STATUS_TARGET_URL"
return 0
fi

@@ -159,8 +203,12 @@ jobs:
fi

local runs latest run_id status conclusion run_url real_success
runs=$(gh api "repos/$GH_REPO/actions/workflows/$workflow_file/runs?head_sha=$HEAD_SHA&event=push" --jq '.workflow_runs')
latest=$(jq -c --arg branch "$MIRROR_REF" '[.[] | select(.head_branch == $branch)] | sort_by(.created_at) | reverse | .[0] // empty' <<< "$runs")
runs=$(gh api "repos/$GH_REPO/actions/workflows/$workflow_file/runs?head_sha=$HEAD_SHA&event=$WORKFLOW_EVENT" --jq '.workflow_runs')
if [ -n "${EXPECTED_HEAD_BRANCH:-}" ]; then
latest=$(jq -c --arg branch "$EXPECTED_HEAD_BRANCH" '[.[] | select(.head_branch == $branch)] | sort_by(.created_at) | reverse | .[0] // empty' <<< "$runs")
else
latest=$(jq -c 'sort_by(.created_at) | reverse | .[0] // empty' <<< "$runs")
fi

if [ -z "$latest" ]; then
post_status "$context" pending "Waiting for $workflow_name" "$workflow_url"
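The run-selection filters in this hunk can be tried against a small sample payload. This is a sketch only: `latest_run_id` is a hypothetical wrapper, the sample data is made up, and it prints just the run `id` where the workflow keeps the whole run object. It requires `jq`.

```shell
# Sample shaped like the `.workflow_runs` array; all values are invented.
runs='[
  {"id": 1, "head_branch": "pull-request/7", "created_at": "2024-01-01T10:00:00Z"},
  {"id": 2, "head_branch": "pull-request/7", "created_at": "2024-01-01T12:00:00Z"},
  {"id": 3, "head_branch": "other",          "created_at": "2024-01-01T13:00:00Z"}
]'

latest_run_id() {
  # $1: runs JSON array, $2: expected head branch (empty means any branch)
  if [ -n "${2:-}" ]; then
    printf '%s' "$1" | jq -r --arg branch "$2" \
      '[.[] | select(.head_branch == $branch)] | sort_by(.created_at) | reverse | .[0].id // empty'
  else
    printf '%s' "$1" | jq -r 'sort_by(.created_at) | reverse | .[0].id // empty'
  fi
}

latest_run_id "$runs" "pull-request/7"   # newest run on that branch: prints 2
latest_run_id "$runs" ""                 # newest run on any branch: prints 3
latest_run_id "$runs" "missing"          # no match: prints nothing
```

An empty result corresponds to the `-z "$latest"` branch that posts the pending "Waiting for ..." status.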
38 changes: 31 additions & 7 deletions CI.md
@@ -10,6 +10,8 @@ PR CI that runs on NVIDIA self-hosted runners uses NVIDIA's copy-pr-bot. The bot

`Branch Checks` run automatically after copy-pr-bot mirrors the PR. `Required CI Gates` posts PR-head statuses that verify the mirror exists, is current, and ran the expected push-based workflows. E2E suites are opt-in because they are more expensive and publish temporary images.

Merge queue validation is a second integration gate for `main`. After a PR has passed the required PR-head statuses, a maintainer adds it to the merge queue. GitHub creates a temporary merge-group branch that combines the latest `main`, the queued PR, and any earlier queued PRs. The same required `OpenShell / ...` status contexts are then published against the merge-group SHA before GitHub merges it.
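To tell the two kinds of CI branch apart in a script, a small classifier can help. This is a sketch: `classify_ref` is a hypothetical helper, and it assumes GitHub's `gh-readonly-queue/<base>/...` naming for temporary merge-group branches, which this document does not itself specify.

```shell
# Hypothetical helper: classify a branch name as a copy-pr-bot mirror,
# a merge-queue branch, or anything else.
classify_ref() {
  case "$1" in
    pull-request/ | pull-request/*[!0-9]*) echo other ;;   # empty or non-numeric suffix
    pull-request/*) echo pr_mirror ;;                      # pull-request/<N>
    gh-readonly-queue/*) echo merge_group ;;               # assumed queue prefix
    *) echo other ;;
  esac
}

classify_ref "pull-request/42"                       # prints: pr_mirror
classify_ref "gh-readonly-queue/main/pr-42-0123abc"  # prints: merge_group
classify_ref "main"                                  # prints: other
```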

Three opt-in labels enable the long-running E2E suites:

- `test:e2e` runs the standard E2E suite in `Branch E2E Checks`
@@ -75,6 +77,7 @@ Flow:
4. The maintainer opens that link and clicks **Re-run all jobs**. This time `pr_metadata` sees the label and the build/E2E jobs run.
5. When the run finishes, the matching `OpenShell / ...` gate status flips to green automatically.
6. New commits push to the mirror automatically and re-trigger `Branch Checks` plus any labeled E2E jobs in `Branch E2E Checks`.
7. When the PR is ready to merge, use **Add to merge queue** instead of merging directly. The queue validates the final integration state before updating `main`.

### Forked PR

@@ -88,9 +91,30 @@ Flow:
1. Open the PR. The vouch check confirms you are vouched (otherwise the PR is auto-closed).
2. copy-pr-bot does not mirror forks automatically. A maintainer reviews the diff and comments `/ok to test <SHA>` with your latest commit SHA.
3. After `/ok to test`, copy-pr-bot mirrors to `pull-request/<N>`. From here the flow is identical to internal PRs: `Required CI Gates` verifies the mirror and required push workflows, and maintainers apply the E2E label when the extra suites are needed.
4. When the PR is ready to merge, maintainers add it to the merge queue so the queued integration state is tested before it reaches `main`.

Important: every new commit you push requires another `/ok to test <new-SHA>` from a maintainer before push-based CI will run on it. If a label is applied while the mirror is stale, `E2E Label Help` will post a comment explaining what's needed.
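A maintainer can check whether a mirror needs a fresh `/ok to test` by comparing two SHAs. The sketch below uses a hypothetical `mirror_state` helper; the commented `gh api` lookups follow the calls `Required CI Gates` makes and assume `GH_REPO` and `PR_NUMBER` are set.

```shell
# Hypothetical helper: compare the mirror branch SHA with the PR head SHA.
mirror_state() {
  local mirror_sha="$1" head_sha="$2"
  if [ -z "$mirror_sha" ]; then
    echo missing      # no pull-request/<N> branch yet
  elif [ "$mirror_sha" != "$head_sha" ]; then
    echo stale        # mirror exists but points at an older commit
  else
    echo current
  fi
}

# Real lookups would be along these lines (not executed here):
#   mirror_sha=$(gh api "repos/$GH_REPO/branches/pull-request/$PR_NUMBER" --jq '.commit.sha' 2>/dev/null || true)
#   head_sha=$(gh api "repos/$GH_REPO/pulls/$PR_NUMBER" --jq '.head.sha')

mirror_state "" "abc123"         # prints: missing
mirror_state "old999" "abc123"   # prints: stale
mirror_state "abc123" "abc123"   # prints: current
```

Both `missing` and `stale` correspond to the pending "Waiting for /ok to test mirror" status.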

## Merge queue

GitHub merge queue is required for `main`. Repository administrators must enable **Require merge queue** in the branch ruleset for `main` and keep these required status contexts aligned with the PR gates:

- `OpenShell / Branch Checks`
- `OpenShell / E2E`
- `OpenShell / GPU E2E`
- `OpenShell / Helm Lint`

Do not require the underlying workflow job names directly. `Required CI Gates` publishes stable commit statuses for both PR-head mirror commits and merge-group commits.

Merge-group runs use the `merge_group` event. The event is distinct from `pull_request` and `push`, and GitHub will not report required checks for queued PRs unless the workflows include it. In this repository:

- `Branch Checks` runs the standard non-E2E gates on the merge-group SHA.
- `Branch E2E Checks` runs core E2E and GPU E2E for merge groups. Kubernetes HA E2E remains optional and label-driven on PRs.
- `Helm Lint` runs for merge groups without the PR diff optimization, because the merge-group branch is the final integration state.
- `Required CI Gates` posts the same `OpenShell / ...` statuses to the merge-group SHA and does not require a `pull-request/<N>` mirror for merge-group events.

Maintainers should add ready PRs to the queue rather than pressing a direct merge button. GitHub removes a PR from the queue if the merge-group checks fail or time out.
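Because a workflow without the `merge_group` trigger never reports on queued PRs, a quick text check can catch a missing trigger before the queue stalls. This is a heuristic sketch with hypothetical helper names; it greps for the key and does not parse YAML.

```shell
# Hypothetical lint: succeeds if the file has a line starting with `merge_group:`.
has_merge_group_trigger() {
  grep -Eq '^[[:space:]]*merge_group:' "$1"
}

check_workflows() {
  local file missing=0
  for file in "$@"; do
    if ! has_merge_group_trigger "$file"; then
      echo "missing merge_group trigger: $file"
      missing=1
    fi
  done
  return "$missing"
}

# Demo against a throwaway file; in the repository the arguments would be
# the workflow files listed above.
demo="$(mktemp)"
printf 'on:\n  merge_group:\n    types: [checks_requested]\n' > "$demo"
check_workflows "$demo" && echo "all workflows declare merge_group"
rm -f "$demo"
```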

## copy-pr-bot

[copy-pr-bot](https://github.com/apps/copy-pr-bot) is a GitHub App maintained by NVIDIA that solves a specific GitHub Actions security problem: by default, `pull_request`-triggered workflows on a self-hosted runner can run an arbitrary contributor's code on hardware the project owns. For projects that need self-hosted runners (GPU access, ARM hardware, on-prem secrets), GitHub's recommended pattern is to never trigger workflows directly from external `pull_request` events.
@@ -109,12 +133,12 @@ The bot's full administrator documentation is internal to NVIDIA. The only comma

| File | Role |
|---|---|
| `.github/workflows/branch-checks.yml` | Required non-E2E PR checks. Triggers on `push: pull-request/[0-9]+`. |
| `.github/workflows/branch-e2e.yml` | Opt-in standard, GPU, and Kubernetes HA E2E. Triggers on `push: pull-request/[0-9]+` and runs jobs selected by `test:e2e`, `test:e2e-gpu`, or `test:e2e-kubernetes`. |
| `.github/workflows/helm-lint.yml` | Helm chart validation. Triggers on `push: pull-request/[0-9]+` and skips lint jobs unless Helm inputs changed. |
| `.github/actions/pr-gate/action.yml` | Composite action that resolves PR metadata and verifies the required label is set. |
| `.github/workflows/branch-checks.yml` | Required non-E2E checks. Triggers on `push: pull-request/[0-9]+` for PR mirrors and `merge_group` for queued merges. |
| `.github/workflows/branch-e2e.yml` | Standard, GPU, and Kubernetes HA E2E. PR mirror pushes use `test:e2e`, `test:e2e-gpu`, and `test:e2e-kubernetes` labels; merge groups run core and GPU E2E. |
| `.github/workflows/helm-lint.yml` | Helm chart validation. PR mirror pushes skip lint jobs unless Helm inputs changed; merge groups always validate Helm because they represent the final integration state. |
| `.github/actions/pr-gate/action.yml` | Composite action that resolves PR metadata and verifies the required label is set for PR mirror pushes. Non-push events are allowed through. |
| `.github/actions/pr-merge-base/action.yml` | Composite action that resolves and fetches the merge-base commit for `pull-request/<N>` push workflows. |
| `.github/workflows/required-ci-gates.yml` | Posts required PR-head statuses for push-based CI workflows. This is what branch protection should require. |
| `.github/workflows/required-ci-gates.yml` | Posts required PR-head and merge-group statuses for gated CI workflows. This is what branch protection and merge queue should require. |
| `.github/workflows/e2e-label-help.yml` | When a `test:e2e*` label is applied, posts a PR comment telling the maintainer the next manual step (re-run an existing workflow run, or `/ok to test <SHA>` to refresh the mirror). |

## Release workflows
@@ -129,11 +153,11 @@ These workflows run after merge to publish dev/tagged artifacts and verify them.

## Required status contexts

Require these statuses in the branch ruleset for push-based CI:
Require these statuses in the branch ruleset for PR and merge-queue CI:

- `OpenShell / Branch Checks`
- `OpenShell / E2E`
- `OpenShell / GPU E2E`
- `OpenShell / Helm Lint`

Do not require the underlying push workflow jobs directly. Those jobs only appear after copy-pr-bot mirrors trusted code, so they cannot independently prove that an untrusted or stale PR head was tested.
Do not require the underlying workflow jobs directly. PR workflow jobs only appear after copy-pr-bot mirrors trusted code, and merge-group workflow jobs run on temporary queue branches. The stable `OpenShell / ...` contexts prove the expected workflow completed for the commit that GitHub is about to merge.
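A stable context is an ordinary commit status posted to a specific SHA. The sketch below shows one way such a status could be published through GitHub's commit status endpoint. `post_status_sketch` is a hypothetical name, not the workflow's `post_status` function (whose body is not shown here), and `acme/demo` is a placeholder repository. It defaults to a dry run that only prints the command.

```shell
# Hypothetical sketch. With DRY_RUN=1 (the default) the command is printed,
# not executed; set DRY_RUN=0 to call the API through the gh CLI.
post_status_sketch() {
  local sha="$1" context="$2" state="$3" description="$4" target_url="$5"
  set -- gh api "repos/${GH_REPO:-owner/repo}/statuses/$sha" \
    -f "state=$state" -f "context=$context" \
    -f "description=$description" -f "target_url=$target_url"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

GH_REPO=acme/demo post_status_sketch 0123abc "OpenShell / E2E" success \
  "test:e2e not applied" "https://github.com/acme/demo/pull/1"
```

Since the context string is the same for PR-head and merge-group commits, the ruleset only needs to list each `OpenShell / ...` name once.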
9 changes: 5 additions & 4 deletions architecture/build.md
@@ -128,17 +128,18 @@ do not infer from kube context.

## CI and E2E

Required checks run on GitHub Actions. Workflows that use NVIDIA self-hosted runners trigger from copy-pr-bot mirror branches, so trusted PRs are mirrored into `pull-request/<N>` branches before those workflows run.
Required checks run on GitHub Actions. Workflows that use NVIDIA self-hosted runners trigger from copy-pr-bot mirror branches, so trusted PRs are mirrored into `pull-request/<N>` branches before those workflows run. `main` also uses GitHub merge queue so the final queued integration commit is validated before it merges.

The high-level CI model:

1. PR-context gate jobs publish required statuses for the PR head commit.
2. Standard branch checks run from trusted mirror branches.
3. Label-gated E2E, GPU, and Kubernetes checks run from trusted mirror branches.
4. Gate jobs verify that the mirror branch matches the PR head and that the expected non-gate workflow actually ran.
5. Release workflows rebuild and publish binaries, wheels, images, and docs.
4. Merge-group checks run against GitHub's temporary queue branch for the final integration state.
5. Gate jobs verify that the mirror branch matches the PR head, or that the merge-group workflow ran for the queued SHA, and that the expected non-gate workflow actually ran.
6. Release workflows rebuild and publish binaries, wheels, images, and docs.

See `CI.md` for the contributor workflow and labels.
See `CI.md` for the contributor workflow, labels, and maintainer merge-queue workflow.

## Docs Site
