Improve Linux distro rootfs compatibility on Apple Silicon #96

Closed
doanbaotrung wants to merge 1 commit into sysprog21:main from open-sources-port:temp/fix_foot_run

@doanbaotrung doanbaotrung commented Jun 12, 2026

Improve compatibility with real Linux distro rootfs environments on
Apple Silicon hosts. Package-manager and shell workflows need behavior
closer to Linux for credentials, script execution, fork/clone state,
wait handling, pipes, /proc, and shared mappings.

Preserve dynamic guest UID/GID state in auxv instead of always reporting
fixed guest IDs, and allow the initial guest identity to be configured
with ELFUSE_GUEST_UID and ELFUSE_GUEST_GID. This lets distro workflows
such as apt post-install scripts run with root-like guest credentials
when needed.

Probe ELF binaries quietly before falling back to shebang handling, so
script execution does not emit misleading "not an ELF" diagnostics.
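
As an illustration of the quiet probe described above, here is a minimal sketch that classifies a file from its first bytes and prints nothing. The names `exec_kind_t` and `probe_exec_kind` are hypothetical and do not come from the PR; the actual loader may inspect more of the ELF header.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification result; these names are not from the PR. */
typedef enum { EXEC_ELF, EXEC_SCRIPT, EXEC_UNKNOWN } exec_kind_t;

/* Classify a file from its leading bytes without emitting any diagnostics. */
static exec_kind_t probe_exec_kind(const unsigned char *buf, size_t len)
{
    static const unsigned char elf_magic[4] = {0x7f, 'E', 'L', 'F'};

    if (len >= sizeof(elf_magic) &&
        memcmp(buf, elf_magic, sizeof(elf_magic)) == 0)
        return EXEC_ELF;
    if (len >= 2 && buf[0] == '#' && buf[1] == '!')
        return EXEC_SCRIPT; /* shebang: hand off to interpreter handling */
    return EXEC_UNKNOWN;
}
```

A caller would log a "not an ELF" style message only for `EXEC_UNKNOWN`, after the shebang path has also been ruled out.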

Extend fork IPC state and child restore handling to carry more complete
CPU state, including TLS-related registers, PAC keys, clone flags,
child TID handling, TPIDRRO_EL0, TPIDR2_EL0, and the original SPSR. Add
child process monitoring so host child exit can wake Linux-style wait
and signal behavior.

Align non-fixed file-backed MAP_SHARED mappings to 2 MiB stage-2
boundaries to avoid HVF mapping issues on Apple Silicon.

Improve sysroot symlink creation for absolute guest symlink targets, and
add small Linux compatibility behavior for sync_file_range and pipe
F_SETNOSIGPIPE.

These changes were tested with an Ubuntu arm64 rootfs using shell
pipelines, /proc checks, and apt-get update smoke testing.


Summary by cubic

Improves Linux distro rootfs behavior on Apple Silicon so package managers and shell pipelines run cleanly. Tightens identity, exec, fork/clone, memory, pipes, /proc, and symlink handling to better match real Linux.

  • New Features

    • Configurable guest identity: AT_UID/EUID and AT_GID/EGID reflect runtime IDs; set initial IDs via ELFUSE_GUEST_UID/ELFUSE_GUEST_GID. /proc and getgroups return these values. set*id syscalls follow Linux privileged semantics.
    • Exec: quietly probe ELF before shebang so scripts run without “not an ELF” noise.
    • Memory: non-fixed, file-backed MAP_SHARED mmaps align to 2 MiB stage-2 blocks to avoid HVF issues.
    • FS: symlinkat rewrites absolute guest targets to a relative path inside the sysroot; pipe2 sets F_SETNOSIGPIPE when available; sync_file_range is stubbed.
    • Diagnostics: expand syscall debug logging to more path-bearing calls.
  • Bug Fixes

    • Fork/clone correctness: carry TLS regs (TPIDRRO_EL0, TPIDR2_EL0), pointer-auth keys, and original SPSR; support CLONE_CHILD_{SETTID,CLEARTID} and write child_tid; smaller SCM_RIGHTS chunks.
    • Child lifecycle: kqueue-based child monitor raises SIGCHLD on host child exit (for all clones); also signal on proc_mark_child_exited for Linux-style wait behavior.
    • Security: include the dynamic linker’s high mapping in forked children so libc’s stack canary stays valid.
    • PTY keepalive: safer dup/snapshot for live and stale entries; avoid FD leaks and ensure CLOEXEC.
    • Signals/pipes: no manual SIGPIPE on write; rely on EPIPE and F_SETNOSIGPIPE to match Linux behavior.
    • Sysroot paths: validate parent directories before create operations to keep writes inside the sysroot.

Written for commit 22d8532. Summary will update on new commits.

@cubic-dev-ai cubic-dev-ai Bot left a comment

3 issues found and verified against the latest diff

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/syscall/fs.c">

<violation number="1" location="src/syscall/fs.c:335">
P1: `relative_path_between` incorrectly returns EXDEV for single-component sysroot paths</violation>
</file>

<file name="src/syscall/proc-identity.c">

<violation number="1" location="src/syscall/proc-identity.c:39">
P2: Environment UID/GID parsing accepts UINT32_MAX ((uint32_t)-1), a reserved Linux sentinel, as a valid initial identity value</violation>
</file>

Comment thread src/syscall/fs.c
common = i;
}

if (common == 0) {

P1: relative_path_between incorrectly returns EXDEV for single-component sysroot paths

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall/fs.c, line 335:

<comment>`relative_path_between` incorrectly returns EXDEV for single-component sysroot paths</comment>

<file context>
@@ -288,6 +288,133 @@ static int64_t reject_unsupported_fuse_path_op(const path_translation_t *tx)
+            common = i;
+    }
+
+    if (common == 0) {
+        errno = EXDEV;
+        return -1;
</file context>
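
For reference, a relative-path computation can treat "only the root is shared" as an ordinary case rather than an error. The sketch below is not the PR's `relative_path_between`; `relative_path_sketch` and `rel_append` are hypothetical names, and it assumes both inputs are absolute and already normalized (no `.`, `..`, or duplicate slashes). Keeping the result inside the sysroot would still need a separate containment check.

```c
#include <stddef.h>
#include <string.h>

/* Append n bytes of s to out, inserting '/' between components. */
static int rel_append(char *out, size_t size, size_t *len,
                      const char *s, size_t n)
{
    size_t sep = (*len > 0) ? 1 : 0;
    if (*len + sep + n + 1 > size)
        return -1; /* result would not fit */
    if (sep)
        out[(*len)++] = '/';
    memcpy(out + *len, s, n);
    *len += n;
    out[*len] = '\0';
    return 0;
}

/* Write into out the relative path leading from directory from_dir to to.
 * Assumes absolute, normalized inputs. Returns 0, or -1 if out is too small. */
static int relative_path_sketch(const char *from_dir, const char *to,
                                char *out, size_t size)
{
    size_t i = 0, common = 0, len = 0;

    if (size == 0)
        return -1;
    out[0] = '\0';

    /* Find the longest shared prefix that ends on a component boundary. */
    for (;;) {
        char a = from_dir[i], b = to[i];
        if ((a == '/' || a == '\0') && (b == '/' || b == '\0'))
            common = i;
        if (a != b || a == '\0')
            break;
        i++;
    }

    /* One ".." per remaining component of from_dir. */
    for (const char *p = from_dir + common; *p;) {
        while (*p == '/')
            p++;
        if (!*p)
            break;
        while (*p && *p != '/')
            p++;
        if (rel_append(out, size, &len, "..", 2) < 0)
            return -1;
    }

    /* Then whatever is left of the target path. */
    const char *rest = to + common;
    while (*rest == '/')
        rest++;
    if (*rest && rel_append(out, size, &len, rest, strlen(rest)) < 0)
        return -1;

    return len == 0 ? rel_append(out, size, &len, ".", 1) : 0;
}
```

With this walk, `common == 0` simply means the two paths diverge directly below `/`, so a single-component directory such as `/a` still yields `../b` for the target `/b`.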

Comment thread src/syscall/proc-identity.c
errno = 0;
char *end = NULL;
unsigned long parsed = strtoul(value, &end, 10);
if (errno != 0 || end == value || *end != '\0' || parsed > UINT32_MAX)

P2: Environment UID/GID parsing accepts UINT32_MAX ((uint32_t)-1), a reserved Linux sentinel, as a valid initial identity value

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall/proc-identity.c, line 39:

<comment>Environment UID/GID parsing accepts UINT32_MAX ((uint32_t)-1), a reserved Linux sentinel, as a valid initial identity value</comment>

<file context>
@@ -24,16 +27,33 @@ static _Atomic int64_t guest_sid = 1, guest_pgid = 1;
+    errno = 0;
+    char *end = NULL;
+    unsigned long parsed = strtoul(value, &end, 10);
+    if (errno != 0 || end == value || *end != '\0' || parsed > UINT32_MAX)
+        return fallback;
+    return (uint32_t) parsed;
</file context>
Suggested change
-    if (errno != 0 || end == value || *end != '\0' || parsed > UINT32_MAX)
+    if (errno != 0 || end == value || *end != '\0' || parsed >= UINT32_MAX)
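
To make the suggested bound concrete, here is a self-contained sketch of the parser with `>=` applied. The function name `parse_guest_id` is hypothetical, and the leading-digit check is an extra assumption beyond the quoted diff (it rejects the sign and whitespace prefixes that `strtoul` would otherwise accept).

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a decimal UID/GID string; return fallback on any invalid input.
 * (uint32_t)-1 is rejected because Linux reserves it as the "no change"
 * sentinel in calls such as setresuid(). */
static uint32_t parse_guest_id(const char *value, uint32_t fallback)
{
    if (!value)
        return fallback;
    /* Assumption: require a leading digit, since strtoul() would accept
     * leading whitespace and '+'/'-'. */
    if (value[0] < '0' || value[0] > '9')
        return fallback;

    errno = 0;
    char *end = NULL;
    unsigned long parsed = strtoul(value, &end, 10);
    if (errno != 0 || end == value || *end != '\0' || parsed >= UINT32_MAX)
        return fallback;
    return (uint32_t) parsed;
}
```

For example, `parse_guest_id(getenv("ELFUSE_GUEST_UID"), 1000)` would keep the fallback for an unset variable, for `"4294967295"`, and for `"-1"`, while accepting `"0"` as root.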

@jserv jserv requested a review from Max042004 June 12, 2026 16:47
Comment thread src/runtime/forkipc.c

static int fork_child_vfork_notify_fd = -1;

/* Linux clone flags */

Collaborator

This completely duplicates forkipc.c:574–582.

Author

Removed

@doanbaotrung doanbaotrung force-pushed the temp/fix_foot_run branch 3 times, most recently from bc4c7e0 to a2d33ce Compare June 14, 2026 11:32

@jserv jserv left a comment

Contributor

Flagging real Linux semantics divergences and a few behavioral regressions; the pauth/TLS reg plumbing and identity-driven AT_UID changes are good and worth keeping.

Comment thread src/syscall/io.c
if (saved_errno == EPIPE)
signal_queue(LINUX_SIGPIPE);
errno = saved_errno;
return linux_errno();

Contributor

SIGPIPE is no longer queued on EPIPE for sys_write. F_SETNOSIGPIPE in pipe2 only suppresses the host signal; the guest stops seeing SIGPIPE on broken-pipe write. Restore signal_queue(LINUX_SIGPIPE) on EPIPE here.
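
The requested behavior can be sketched as follows: the host signal stays suppressed, but the emulator still queues the guest-visible SIGPIPE when the host write fails with EPIPE. `emulated_write`, `queue_guest_sigpipe`, and `guest_sigpipe_pending` are hypothetical stand-ins for `sys_write` and `signal_queue(LINUX_SIGPIPE)`, not elfuse code.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <unistd.h>

/* Stand-in for the emulator's guest signal queue (hypothetical). */
static bool guest_sigpipe_pending;

static void queue_guest_sigpipe(void)
{
    guest_sigpipe_pending = true;
}

/* Write to a host fd. On EPIPE, queue the guest SIGPIPE and leave errno
 * intact so the caller can still translate it to the Linux errno. */
static ssize_t emulated_write(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);
    if (n < 0) {
        int saved_errno = errno;
        if (saved_errno == EPIPE)
            queue_guest_sigpipe();
        errno = saved_errno;
    }
    return n;
}
```

This only works if the host process does not die from its own SIGPIPE first, for example because the signal is ignored or the pipe was created with F_SETNOSIGPIPE; the guest then sees both the queued signal and the EPIPE return.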

Comment thread src/syscall/syscall.c
}
if (nr == SYS_write && errno == EPIPE)
signal_queue(LINUX_SIGPIPE);
result = linux_errno();

Contributor

Same SIGPIPE regression on the dispatch fast path. Re-queue LINUX_SIGPIPE when errno == EPIPE before returning linux_errno().

Comment thread src/syscall/mem.c

/* Round length up to align size (overflow-safe) */
if (length > UINT64_MAX - (align - 1))
return -LINUX_ENOMEM;

Contributor

Rounding length up to BLOCK_2MIB for file-backed MAP_SHARED turns a 4 KiB shm into a 2 MiB VMA. Tail access past EOF SIGBUSes and the Linux-visible length is wrong. Only 2 MiB-align placement (search start); keep length at PAGE_ALIGN_UP(length, 4 KiB).
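
A minimal sketch of the split suggested here: the search start is rounded to a 2 MiB boundary while the length is rounded only to the 4 KiB page size. The names `placement_t`, `place_shared_mapping`, and `align_up_checked` are hypothetical, and the real allocator would also have to check the chosen range against existing mappings.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_4K  UINT64_C(0x1000)
#define BLOCK_2M UINT64_C(0x200000)

/* Round v up to a power-of-two alignment; returns false on overflow. */
static bool align_up_checked(uint64_t v, uint64_t align, uint64_t *out)
{
    if (v > UINT64_MAX - (align - 1))
        return false;
    *out = (v + align - 1) & ~(align - 1);
    return true;
}

/* Hypothetical result of placing a non-fixed, file-backed MAP_SHARED map. */
typedef struct {
    uint64_t start;  /* 2 MiB-aligned guest address */
    uint64_t length; /* page-aligned, Linux-visible length */
} placement_t;

static bool place_shared_mapping(uint64_t hint, uint64_t length,
                                 placement_t *p)
{
    return align_up_checked(hint, BLOCK_2M, &p->start) &&
           align_up_checked(length, PAGE_4K, &p->length);
}
```

Under this scheme a 4 KiB shared-memory file requested near `0x1234000` lands at `0x1400000` but stays a 4 KiB mapping, so the guest never sees a tail beyond EOF.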

Comment thread src/syscall/fs.c
}
} else {
char dir_host[LINUX_PATH_MAX];
if (fcntl(dir_ref->fd, F_GETPATH, dir_host) < 0)

Contributor

fcntl(dir_ref->fd, F_GETPATH, ...) returns EBADF when dirfd is AT_FDCWD, so symlinkat(target="/abs", AT_FDCWD, rel-linkpath) fails with -EBADF. Branch on dir_ref->fd == AT_FDCWD and use getcwd(); fall back to the original guest target if path recovery still fails.

Comment thread src/runtime/forkipc.c
@@ -762,6 +913,9 @@ static void *thread_create_and_run(void *arg)
} else {
WORKER_HV(hv_vcpu_set_sys_reg(vcpu, HV_SYS_REG_TPIDR_EL0, tca->tpidr));
}

Contributor

Two lines below this added pauth restore, line 942 still has hv_vcpu_set_reg(vcpu, HV_REG_CPSR, 0) /* EL0t */. Same pattern at vm_clone_thread_run line 1252. fork_child_main was fixed to use regs.spsr_el1; these two in-process worker paths should also set HV_REG_CPSR to tca->spsr so parent NZCV/PSTATE survives the clone return.

Comment thread src/runtime/forkipc.c
errno = 0;
} while (kevent(kq, NULL, 0, &kev, 1, NULL) < 0 && errno == EINTR);
close(kq);
signal_queue(LINUX_SIGCHLD);

Contributor

The kqueue loop sets errno=0 then waits; on non-EINTR kevent failure or zero events SIGCHLD is queued anyway. Also: no waitpid/status capture, no pidfd notify, no shutdown hook on exit_group. Gate signal_queue on (ret == 1 && (kev.fflags & NOTE_EXIT)); add pidfd notification; tie monitor lifetime to a shutdown flag.

Comment thread src/runtime/procemu.c
pty_keepalive_table[slot].slave_host_fd = slave_host_fd;
} else {
if (slave_host_fd >= 0)
close(slave_host_fd);

Contributor

Closing slave_host_fd when stale_open_once is false defeats the keepalive's HUP suppression for live entries. If the intent is to let HUP propagate on real child close, split that into an explicit flag distinct from stale_open_once and add a regression test for master HUP behavior.

Comment thread src/syscall/sys.c
-    int ngroups = get_cached_linux_groups();
-    if (ngroups < 0)
-        return linux_errno();
+    const int ngroups = 1;

Contributor

Returning [proc_get_gid()] fabricates membership in the primary gid (Linux supplementary groups are independent of primary gid). Return ngroups=0 until elfuse implements setgroups.

Comment thread src/core/guest.c
* otherwise libc's post-fork canary check observes zeroed guard storage
* and aborts before the child can exec.
*/
if (n < max && g->interp_base > 0 &&

Contributor

Hardcoded BLOCK_2MIB at interp_base may over- or under-copy a future dynamic linker. Track the interpreter's actual load_min..load_max in elf_resolve_interp and emit a region rounded from that. The comment about __stack_chk_guard is also misleading: that symbol normally lives in libc, not ld.so.

Comment thread src/runtime/fork-state.h
* interpret unknown trailing fields.
*/
-#define IPC_VERSION 11
+#define IPC_VERSION 13

Contributor

IPC_VERSION jumps 11 -> 13, skipping wire value 12. The magic-mismatch check still catches old children, but the gap reads like a rebase artifact. Either renumber to 12 or note in the comment that version 12 was rolled into the same release.

Contributor

IPC_VERSION was removed in #93. Don't touch this portion.

@doanbaotrung doanbaotrung deleted the temp/fix_foot_run branch June 16, 2026 13:40