"Hahaha, look at how Rust failed here."

isotopp@infosec.exchange

RE: https://infosec.exchange/@lcamtuf/116517194178120536

Maybe writing a utility like cp without TOCTOU, race conditions, symlink exploits and the like shouldn't be hard. Maybe copying a file shouldn't require more than a single line in userspace.

Maybe the UNIX file API is incomplete and could do with a number of revisions and updates. Maybe, after 40, 50 years we have learned a few things and should go through it with a fine-tooth comb.

Of course we shouldn't break userspace. We can still provide the old, broken calls.

But maybe we should discuss how we can come up with something systematic that doesn't suck and invite these kinds of bugs. In any language.

barubary@infosec.exchange

@isotopp Yes! The file system is just a big ball of race conditions bundled together (and so are processes/PIDs).

aheadofthekrauts@social.tchncs.de

@isotopp Yet you both are right.

iwein@mas.to

@isotopp That shit is hard, and the implementations on all operating systems are weirdly different. If only we could improve that instead of another magic "paradigm shift" eh?

monospace@floss.social

@isotopp There is no hope for a better past. But we can at least learn from it.

masek@infosec.exchange

@isotopp You're aware that this is the start of the next systemd-like discussion?

I may not be 100% in agreement with everything the systemd project does. But the people in that project know these things much better than I do, and even I can clearly see the desperate need for modernization.

So I would give them a lot of leeway and am totally aghast at the amount of hate they receive.

What you (correctly) said above would (as a project) draw 10 times the ire systemd did.

slink@fosstodon.org

@isotopp https://fosstodon.org/@slink/116486258791687186

isotopp@infosec.exchange

@iwein I don't care much. If the Linux kernel does it, the rest will eventually follow. Or not, but they are sidelined already anyway, so who cares.

hikhvar@norden.social

@masek

The biggest thing systemd brought was consistent, well documented behaviour.

That behaviour is questionable sometimes, but consistent for the user.

Start a systemd SYSCALL API and both the fediverse and reddit will DDoS themselves.

@isotopp

isotopp@infosec.exchange

Part of that work is already done.

Linux’s syscall surface has a pattern: take a narrow primitive, remove implicit global state, make it composable, and push work into the kernel to avoid copies or races. clone(), openat(), and splice() fit that pattern well.

There are several other clusters of similar “upgrades”.

First, the *at() family generalizes path-based syscalls to operate relative to a directory file descriptor, which eliminates reliance on the process-wide CWD and closes race windows.

Besides openat(), there are fstatat(), linkat(), renameat(), unlinkat(), mkdirat(), symlinkat(), and more recently openat2() with a struct-based argument that lets you constrain resolution (no symlinks, stay beneath a dir, etc.).

POSIX standardized a subset of this idea in POSIX.1-2008: the basic *at() calls exist there, but Linux-specific extensions like openat2() and its resolution flags are not in POSIX.
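A minimal sketch of the dirfd-relative style, using Python's os module, whose dir_fd parameters map onto openat() and renameat() on Linux. The helper names read_relative and demo and the file names are made up for illustration; openat2() has no wrapper in the os module, so only the plain *at() behaviour is shown.

```python
import os
import tempfile


def read_relative(dirfd: int, name: str) -> bytes:
    """Open `name` relative to an open directory fd and return its contents."""
    # O_NOFOLLOW makes the open fail if the final component is a symlink.
    fd = os.open(name, os.O_RDONLY | os.O_NOFOLLOW, dir_fd=dirfd)
    with open(fd, "rb") as f:  # the file object takes ownership of fd
        return f.read()


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        # After this open, the path string `tmp` is not used again below.
        dirfd = os.open(tmp, os.O_RDONLY | os.O_DIRECTORY)
        try:
            # openat(dirfd, "hello.txt", O_CREAT | O_EXCL, 0600)
            fd = os.open("hello.txt",
                         os.O_WRONLY | os.O_CREAT | os.O_EXCL,
                         0o600, dir_fd=dirfd)
            with open(fd, "wb") as f:
                f.write(b"hi")
            # renameat(dirfd, "hello.txt", dirfd, "renamed.txt")
            os.rename("hello.txt", "renamed.txt",
                      src_dir_fd=dirfd, dst_dir_fd=dirfd)
            return read_relative(dirfd, "renamed.txt")
        finally:
            os.close(dirfd)
```

Neither the CWD nor the original path of the directory matters once dirfd is open. If someone renames the temporary directory in the meantime, the calls still hit the same directory object.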

Second, file-descriptor–centric design is pushed much further than POSIX.

Linux prefers operations that take FDs instead of paths and adds syscalls to obtain stable references: O_PATH, name_to_handle_at() and open_by_handle_at() (exportable file handles), pidfd_open() and the broader pidfd API for race-free process management, and memfd_create() for anonymous in-kernel files.

POSIX largely sticks to PIDs and pathnames; pidfds, memfd, and file handles are Linux-only.

Third, race-free event and I/O multiplexing. Linux moved from select()/poll() to epoll (scalable readiness notification, level-triggered by default with an optional edge-triggered mode) and then to io_uring, which is a much bigger step: shared submission/completion queues, batching, fixed buffers/files, and true async operations with fewer syscalls.

POSIX includes select() and poll(), and optionally AIO (aio_*), but epoll and io_uring are Linux-specific.
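For the epoll part, a short sketch with Python's select.epoll wrapper (Linux only). The function name wait_readable is made up; io_uring has no standard-library binding, so it is not shown.

```python
import os
import select


def wait_readable(timeout: float = 1.0):
    """Register a pipe's read end with epoll and wait until it is readable."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        # Level-triggered by default; OR in select.EPOLLET for edge-triggered.
        ep.register(r, select.EPOLLIN)
        os.write(w, b"x")
        events = ep.poll(timeout)
        # Report, per event, whether it is our fd and whether EPOLLIN is set.
        return [(fd == r, bool(mask & select.EPOLLIN)) for fd, mask in events]
    finally:
        ep.close()
        os.close(r)
        os.close(w)
```

The interest set lives in the kernel, so each poll() call only returns the ready descriptors instead of rescanning the whole list as select() and poll() do.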

Fourth, zero-copy and in-kernel data movement. Beyond sendfile() → splice(), there’s tee() (duplicate a pipe buffer without copying) and vmsplice() (map user pages into a pipe).

These let you build pipelines where data stays in kernel space. sendfile() itself is not in POSIX and exists only as a non-standard extension on some systems; splice/tee/vmsplice are not in POSIX either.
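A sketch of in-kernel data movement with os.sendfile(), assuming Linux, where the destination may be a regular file. The names copy_in_kernel and demo are illustrative; splice(), tee() and vmsplice() follow the same idea but are not shown.

```python
import os
import tempfile


def copy_in_kernel(src_fd: int, dst_fd: int) -> int:
    """Copy all bytes from src_fd to dst_fd without a user-space buffer."""
    size = os.fstat(src_fd).st_size
    offset = 0
    while offset < size:
        # The kernel moves the bytes; only the count comes back to us.
        sent = os.sendfile(dst_fd, src_fd, offset, size - offset)
        if sent == 0:  # source shrank underneath us
            break
        offset += sent
    return offset


def demo():
    with tempfile.TemporaryFile() as src, tempfile.TemporaryFile() as dst:
        src.write(b"payload" * 1000)
        src.flush()
        copied = copy_in_kernel(src.fileno(), dst.fileno())
        dst.seek(0)
        return copied, dst.read()
```

The loop is needed because sendfile() may transfer fewer bytes than requested, just like write().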

Fifth, vector and message-oriented batching. readv()/writev() exist in POSIX, but Linux extends batching with preadv2()/pwritev2() flags, recvmmsg()/sendmmsg() to amortize syscall overhead for datagrams, and various flags for finer control.

The mmsg calls are Linux-specific.
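The POSIX baseline of this cluster, readv()/writev(), can be sketched with os.writev() and os.readv(). The function name gather_scatter is made up; Python's os module does not wrap recvmmsg()/sendmmsg(), so those are not shown.

```python
import os


def gather_scatter():
    """Send two buffers with one writev() and split them with one readv()."""
    r, w = os.pipe()
    try:
        # One syscall writes both pieces, in order.
        written = os.writev(w, [b"head:", b"body"])
        header, body = bytearray(5), bytearray(4)
        # One syscall fills both destination buffers, in order.
        received = os.readv(r, [header, body])
        return written, received, bytes(header), bytes(body)
    finally:
        os.close(r)
        os.close(w)
```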

Sixth, futexes for user-space synchronization. futex() lets user space do uncontended locking without syscalls and only enter the kernel on contention.

This is the basis for efficient pthread mutexes/condvars on Linux.

POSIX defines the pthread APIs, not the futex primitive; futex is Linux-specific.

Seventh, namespaces and capabilities. Syscalls like unshare(), setns(), and clone() flags create per-process views of resources (mount, PID, net, user namespaces).

This is foundational for containers.

POSIX has no concept of namespaces or Linux capabilities.

Eighth, timers, event FDs, and signal improvements. timerfd_create(), eventfd(), and signalfd() turn timers, counters, and signals into file descriptors that integrate with epoll.

POSIX has timers and signals, but not these FD-based forms.

Ninth, process creation refinement. clone3() is a modern, extensible variant of clone() with a struct argument, similar in spirit to openat2().

POSIX sticks with fork() and posix_spawn(); clone* is Linux-specific.

Tenth, memory management extensions: mremap(), madvise() flags beyond POSIX, userfaultfd() (handle page faults in user space), and memfd_secret() (restricted mappings).

POSIX defines mmap()/mprotect()/msync(); the rest are Linux extensions.

Eleventh, mount API overhaul. The newer mount API (open_tree(), move_mount(), fsopen(), fsconfig(), fsmount()) replaces the legacy mount() string interface with FD-based, race-resistant operations.

This is Linux-only.

Twelfth, BPF as a syscall-backed subsystem. The bpf() syscall exposes a programmable kernel data path and observability tools.

Entirely Linux-specific.

On POSIX coverage, the pattern is consistent: when Linux introduces a generalization that reduces races and global state in a way that’s broadly portable, a conservative subset may eventually appear in POSIX (the *at() family, readv/writev, posix_spawn). The more ambitious pieces that depend on Linux’s internal models or aim at performance and containerization (epoll, io_uring, pidfds, namespaces, futex, BPF, new mount API, zero-copy pipe primitives) are not in POSIX and are unlikely to be standardized in their current form.

isotopp@infosec.exchange

File naming has been decoupled from the API that does things with files through these fd-based calls. So once you have an fd you should be set.

But:

Linux at the upper kernel layer does not change POSIX filename requirements. Filenames can be random garbage as long as they do not contain pathsep (the slash) or null bytes.

There is no clean, portable Linux syscall that says: “this directory accepts exactly UTF-8” or “this directory accepts arbitrary bytes.” The safe model is still: treat filenames as byte strings, not text, until you must display or create human-facing names.

That means your programming language must work with byte-arrays as filenames, even when that seems to be silly.

Linux pathname rules are byte-oriented: a pathname is a null-terminated byte sequence, interior null bytes are forbidden, and / is the separator, not a filename byte.

Directory reads likewise return null-terminated d_name entries, not Unicode strings.

For creating new human-facing filenames, emit valid UTF-8, preferably normalized to NFC at the application layer. But still be prepared for EINVAL, ENAMETOOLONG, EEXIST, or filesystem-specific rejection – some filesystems have magic filenames such as nul, prn or con and you won't be able to use them.
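A sketch of that creation rule in Python, assuming a filesystem that accepts UTF-8 names (which is not guaranteed). The names create_human_named and demo are hypothetical, and the error handling only reports the errno name instead of recovering.

```python
import errno
import os
import tempfile
import unicodedata


def create_human_named(dirfd: int, name: str) -> bytes:
    """Create a new file from a human-facing name; return the bytes used."""
    # Normalize at the application layer; the kernel will not do it for us.
    encoded = unicodedata.normalize("NFC", name).encode("utf-8")
    if encoded in (b"", b".", b"..") or b"/" in encoded or b"\0" in encoded:
        raise ValueError("not a usable single path component")
    # O_EXCL: fail with EEXIST instead of reusing an existing entry.
    fd = os.open(encoded, os.O_WRONLY | os.O_CREAT | os.O_EXCL,
                 0o600, dir_fd=dirfd)
    os.close(fd)
    return encoded


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        dirfd = os.open(tmp, os.O_RDONLY | os.O_DIRECTORY)
        try:
            # "e" + combining acute accent normalizes to a single code point.
            first = create_human_named(dirfd, "cafe\u0301.txt")
            try:
                create_human_named(dirfd, "caf\u00e9.txt")
                second = "created"
            except OSError as e:  # EEXIST here; EINVAL etc. on other setups
                second = errno.errorcode.get(e.errno, str(e.errno))
            return first, second
        finally:
            os.close(dirfd)
```

Without the normalization step the two spellings would become two distinct directory entries on a byte-oriented filesystem, which is usually not what a human expects.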

For accepting existing names, accept arbitrary bytes except / and \0. A UTF-8-only application that refuses to operate on invalid names will break on normal Unix trees, backups, tar extractions, old mounts, removable media, and network filesystems.
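A sketch of the "bytes in, bytes out" handling in Python, assuming a typical Linux filesystem (ext4, tmpfs and the like) that accepts arbitrary bytes in names. The function name list_names_as_bytes and the file name are illustrative.

```python
import os
import tempfile


def list_names_as_bytes():
    """Create a name that is not valid UTF-8 and read it back unharmed."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_b = os.fsencode(tmp)
        bad = b"caf\xe9.txt"  # Latin-1 bytes, invalid as UTF-8
        fd = os.open(os.path.join(tmp_b, bad),
                     os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        os.close(fd)
        names = os.listdir(tmp_b)  # bytes path in, bytes names out
        # Decode lossily for display only; never feed this back to the kernel.
        display = [n.decode("utf-8", errors="replace") for n in names]
        # os.fsdecode()/os.fsencode() round-trip arbitrary bytes via
        # surrogateescape, which is how str-based APIs stay lossless.
        lossless = all(os.fsencode(os.fsdecode(n)) == n for n in names)
        return names, display, lossless
```

The display string is for humans; the byte string is the only thing that identifies the file.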

For detecting constraints, the options are weak:

statfs() tells you the filesystem type, so you can special-case ext4, vfat, ntfs3, btrfs, overlayfs, etc., but that is not a semantic contract. You will need to know the rules for each filesystem type, because there is no way to query them.

pathconf(path, _PC_NAME_MAX) tells you name length limits, not encoding.

statx() gives richer file metadata, but not a general filename-encoding capability, and surely not lists of reserved names or other fanciness.

Some filesystems have feature-specific behavior. ext4 casefolding, for example, stores a filesystem-wide encoding model for case-insensitive directories, defaulting to UTF-8 in the kernel documentation. That does not turn Linux pathname handling generally into Unicode.
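Of these weak options, the length limit is the one that can be queried directly. A small sketch using the fd-based variants os.fpathconf() and os.fstatvfs(); the function name name_limits is made up. Python's os module does not expose the statfs() f_type field, so the filesystem type is not shown.

```python
import os


def name_limits(path: str = "."):
    """Return the per-component name length limits for a directory."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        return {
            # fpathconf(fd, _PC_NAME_MAX): maximum name length in bytes
            "pathconf_name_max": os.fpathconf(fd, "PC_NAME_MAX"),
            # fstatvfs(fd).f_namemax: the same limit as the filesystem reports it
            "statvfs_namemax": os.fstatvfs(fd).f_namemax,
        }
    finally:
        os.close(fd)
```

Both values count bytes, not characters, and neither says anything about encoding, normalization or reserved names.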

So we DO have an API that could be race-free if you used it fully, and in order to do that you'd use Linux-specific syscalls.

We LACK an API that can handle structured filenames, and answer questions about naming restrictions properly.

There is no

query_name_policy(dirfd) → accepted encoding, normalization rules, case-sensitivity/case-folding, max component length in bytes and characters, reserved names, forbidden code points/bytes, equivalence rules, stable display form, and whether invalid legacy names may already exist.

isotopp@infosec.exchange

@hllizi See followup post: This has largely already happened (improving the kernel fd API).

isotopp@infosec.exchange

What we do need is a Linux-centric update to W. Richard Stevens's APUE: an APLE book.

It would discuss working with the Linux kernel API correctly, perhaps by implementing a libc or a libc replacement, or a Python or Rust kernel API interface, with proper error handling and actual code.

Stevens had a wonderful writing style, in English and in Code, showcasing the point made in a chapter without compromising on correctness and production-ready error handling.

But his books and the API he describes are old, and from the Linux Kernel PoV also inferior and outdated.

barubary@infosec.exchange

@isotopp How do you read the contents of a directory in a race-free way?

isotopp@infosec.exchange

@barubary You do not read a directory race-free in the snapshot sense.

A directory fd gives you a stable ref to THAT directory object, not a stable list of its children.

You CAN do

dirfd = openat2(parentfd, "subdir", ...);
getdents64(dirfd, ...);      // or fdopendir/readdir
openat(dirfd, name, ...);    // act relative to the same directory

That avoids races involving CWD, replaced parent paths, symlinked path components, and “I checked one path but opened another”.

The entries themselves can still change while you read. Another process can create, delete, rename, or replace name after readdir() returns it and before openat() uses it. Linux does not make readdir() a frozen transaction. A directory fd pins the directory, not its contents.

So you'd

open directory by fd
read entry name as bytes
openat(dirfd, entry_name, flags that express intent)
fstat the returned fd
decide based on the object actually opened
operate on the fd, not the path

For recursive traversal, you extend the same rule: open child directories with openat() or openat2(), reject symlinks with flags/resolution constraints, keep dirfds on a stack, and perform later operations relative to those dirfds.
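The steps above can be sketched in Python as follows. This is an illustration under assumptions, not a hardened tool: walk and demo are made-up names, plain openat() with O_NOFOLLOW stands in for openat2() (which the os module does not wrap), and entries that cannot be opened are silently skipped.

```python
import os
import stat
import tempfile


def walk(dirfd: int, prefix: bytes = b""):
    """List regular files and directories below dirfd without following symlinks."""
    found = []
    for name in map(os.fsencode, os.listdir(dirfd)):  # names as bytes
        # O_NOFOLLOW rejects symlinks; O_NONBLOCK keeps a FIFO from blocking us.
        flags = os.O_RDONLY | os.O_NOFOLLOW | os.O_NONBLOCK
        try:
            fd = os.open(name, flags, dir_fd=dirfd)
        except OSError:
            continue  # vanished, symlink (ELOOP), or not openable
        try:
            # Decide based on the object actually opened, not on the name.
            mode = os.fstat(fd).st_mode
            if stat.S_ISDIR(mode):
                found.append(prefix + name + b"/")
                # The recursion keeps the open dirfds on the call stack.
                found += walk(fd, prefix + name + b"/")
            elif stat.S_ISREG(mode):
                found.append(prefix + name)
        finally:
            os.close(fd)
    return sorted(found)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "sub"))
        for rel in ("a.txt", os.path.join("sub", "b.txt")):
            with open(os.path.join(tmp, rel), "wb") as f:
                f.write(b"x")
        os.symlink("a.txt", os.path.join(tmp, "link"))  # must be skipped
        root = os.open(tmp, os.O_RDONLY | os.O_DIRECTORY)
        try:
            return walk(root)
        finally:
            os.close(root)
```

The listing can still be stale by the time it is returned. What the pattern guarantees is that every decision was made about the object that was actually opened.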

The oss-sec report’s uutils examples are mostly failures of this kind: path-based second operations, permission changes after creation, missing O_NOFOLLOW, missing O_EXCL, or creating too broadly and tightening later.
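The opposite of "create too broadly and tighten later" can be sketched like this: create exclusively with the final restrictive mode, then make any later permission change through the fd. The names create_private and demo are illustrative, and this is not taken from the uutils code or the report.

```python
import os
import stat
import tempfile


def create_private(dirfd: int, name: bytes, mode: int = 0o600) -> int:
    """Create a new file that is never more permissive than `mode`."""
    # O_CREAT | O_EXCL fails if the name exists, including as a symlink,
    # so we can never end up writing into somebody else's object.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    return os.open(name, flags, mode, dir_fd=dirfd)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        dirfd = os.open(tmp, os.O_RDONLY | os.O_DIRECTORY)
        try:
            fd = create_private(dirfd, b"secret")
            try:
                os.write(fd, b"data")
                # Change permissions on the fd, not on a path that may
                # have been swapped since we created the file.
                os.fchmod(fd, 0o640)
                mode = stat.S_IMODE(os.fstat(fd).st_mode)
            finally:
                os.close(fd)
            try:
                os.close(create_private(dirfd, b"secret"))
                second = "created"
            except FileExistsError:
                second = "refused"
            return mode, second
        finally:
            os.close(dirfd)
```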

A truly race-free directory listing would mean one of three things:

A filesystem snapshot.
Kernel support for transactional directory enumeration plus later object resolution against that transaction.
Locking/excluding all concurrent mutation, which Unix generally does not provide as a normal directory API.

None of that is desirable as the default behavior of readdir(), because it will scale like shit.

A snapshot is desirable for backups, indexing, forensics, package database consistency, and reproducible tree copies. But then the right answer is usually “use a filesystem snapshot”, not “make readdir() magic”.
