"Hahaha, look at how Rust failed here."
-
RE: https://infosec.exchange/@lcamtuf/116517194178120536
"Hahaha, look at how Rust failed here."
Maybe writing a utility like cp without TOCTOU, race conditions, symlink exploits and the like shouldn't be hard. Maybe copying a file shouldn't require more than a single line in userspace.
Maybe the UNIX file API is incomplete and could do with a number of revisions and updates. Maybe, after 40, 50 years we have learned a few things and should go through it with a fine-toothed comb.
Of course we shouldn't break userspace. We can still provide the old, broken calls.
But maybe we should discuss how we can come up with something systematic that doesn't suck and invite these kinds of bugs. In any language.
-
@isotopp Yes! The file system is just a big ball of race conditions bundled together (and so are processes/PIDs).
-
@isotopp Yet you both are right.

-
@isotopp That shit is hard, and the implementations on all operating systems are weirdly different. If only we could improve that instead of another magic "paradigm shift" eh?
-
@isotopp There is no hope for a better past. But we can at least learn from it.
-
@isotopp You're aware that this is the start of the next systemd-like discussion?
I may not be 100% in agreement with everything the systemd project does. But the people in that project know things much better than I do, and even I can clearly see the desperate need for modernization.
So I would give them a lot of leeway and am totally aghast at the amount of hate they receive.
What you (correctly) said above would (as a project) draw 10 times the ire systemd did.
-
@iwein I don't care much. If the Linux kernel does it, the rest will eventually follow. Or not, but they are sidelined already anyway, so who cares.
-
Part of that work is already done.
Linux’s syscall surface has a pattern: take a narrow primitive, remove implicit global state, make it composable, and push work into the kernel to avoid copies or races.
clone(), openat(), and splice() fit that pattern well.
There are several other clusters of similar “upgrades”.
First, the *at() family generalizes path-based syscalls to operate relative to a directory file descriptor, which eliminates reliance on the process-wide CWD and closes race windows.
Besides openat(), there are fstatat(), linkat(), renameat(), unlinkat(), mkdirat(), symlinkat(), and more recently openat2() with a struct-based argument that lets you constrain resolution (no symlinks, stay beneath a dir, etc.).
POSIX standardized a subset of this idea in POSIX.1-2008: the basic *at() calls exist there, but Linux-specific extensions like openat2() and its resolution flags are not in POSIX.
Second, file-descriptor–centric design is pushed much further than POSIX.
Linux prefers operations that take FDs instead of paths and adds syscalls to obtain stable references: O_PATH, name_to_handle_at() and open_by_handle_at() (exportable file handles), pidfd_open() and the broader pidfd API for race-free process management, and memfd_create() for anonymous in-kernel files.
POSIX largely sticks to PIDs and pathnames; pidfds, memfd, and file handles are Linux-only.
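To make the "stable reference" idea concrete, here is a minimal Python sketch, assuming Linux and a Python build that exposes os.O_PATH. The helper names pin_object() and demo() are made up for this illustration. It only shows that a descriptor keeps naming the same object after the path is renamed; it is not a complete pattern for safe file handling.

```python
import os
import tempfile


def pin_object(path: str) -> int:
    """Return an O_PATH descriptor that names the object at `path`.

    O_PATH does not open the file for reading or writing; it only pins the
    object so that later fd-based calls refer to it. O_NOFOLLOW means a
    symlink in the final component is pinned itself instead of followed.
    """
    return os.open(path, os.O_PATH | os.O_NOFOLLOW | os.O_CLOEXEC)


def demo() -> bool:
    """Check that the descriptor still names the same inode after a rename."""
    with tempfile.TemporaryDirectory() as tmp:
        old = os.path.join(tmp, "old-name")
        new = os.path.join(tmp, "new-name")
        with open(old, "w") as f:
            f.write("payload")
        fd = pin_object(old)
        try:
            os.rename(old, new)  # the path changes, the object does not
            return os.fstat(fd).st_ino == os.stat(new).st_ino
        finally:
            os.close(fd)


if __name__ == "__main__":
    print(demo())
```

A path is a question you ask the filesystem again on every call; the descriptor is the answer you got once and can keep.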
Third, race-free event and I/O multiplexing. Linux moved from select()/poll() to epoll (edge-triggered, scalable readiness notification) and then to io_uring, which is a much bigger step: shared submission/completion queues, batching, fixed buffers/files, and true async operations with fewer syscalls.
POSIX includes select() and poll(), and optionally AIO (aio_*), but epoll and io_uring are Linux-specific.
Fourth, zero-copy and in-kernel data movement. Beyond sendfile() → splice(), there’s tee() (duplicate a pipe buffer without copying) and vmsplice() (map user pages into a pipe).
These let you build pipelines where data stays in kernel space. POSIX has sendfile() only via non-standard extensions on some systems; splice/tee/vmsplice are not in POSIX.
Fifth, vector and message-oriented batching.
readv()/writev() exist in POSIX, but Linux extends batching with preadv2()/pwritev2() flags, recvmmsg()/sendmmsg() to amortize syscall overhead for datagrams, and various flags for finer control.
The mmsg calls are Linux-specific.
Sixth, futexes for user-space synchronization.
futex() lets user space do uncontended locking without syscalls and only enter the kernel on contention.
This is the basis for efficient pthread mutexes/condvars on Linux.
POSIX defines the pthread APIs, not the futex primitive; futex is Linux-specific.
Seventh, namespaces and capabilities. Syscalls like unshare(), setns(), and clone() flags create per-process views of resources (mount, PID, net, user namespaces).
This is foundational for containers.
POSIX has no concept of namespaces or Linux capabilities.
Eighth, timers, event FDs, and signal improvements.
timerfd_create(), eventfd(), and signalfd() turn timers, counters, and signals into file descriptors that integrate with epoll.
POSIX has timers and signals, but not these FD-based forms.
Ninth, process creation refinement.
clone3() is a modern, extensible variant of clone() with a struct argument, similar in spirit to openat2().
POSIX sticks with fork() and posix_spawn(); clone* is Linux-specific.
Tenth, memory management extensions.
mremap(), madvise() flags beyond POSIX, userfaultfd() (handle page faults in user space), memfd_secret() (restricted mappings).
POSIX defines mmap()/mprotect()/msync(); the rest are Linux extensions.
Eleventh, mount API overhaul. The newer mount API (open_tree(), move_mount(), fsopen(), fsconfig(), fsmount()) replaces the legacy mount() string interface with FD-based, race-resistant operations.
This is Linux-only.
Twelfth, BPF as a syscall-backed subsystem. The bpf() syscall exposes a programmable kernel data path and observability tools.
Entirely Linux-specific.
On POSIX coverage, the pattern is consistent: when Linux introduces a generalization that reduces races and global state in a way that’s broadly portable, a conservative subset may eventually appear in POSIX (the *at() family, readv/writev, posix_spawn). The more ambitious pieces that depend on Linux’s internal models or aim at performance and containerization (epoll, io_uring, pidfds, namespaces, futex, BPF, new mount API, zero-copy pipe primitives) are not in POSIX and are unlikely to be standardized in their current form.
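As an illustration of the in-kernel data movement cluster, here is a rough Python sketch of a file copy built on sendfile(), assuming Linux (where the output descriptor may be a regular file). The function names copy_in_kernel() and copy_file() are invented for this example. This is not a replacement for cp: it ignores metadata, sparse files and directories, and it only guards the final path component.

```python
import os


def copy_in_kernel(src_fd: int, dst_fd: int) -> int:
    """Copy all bytes from src_fd to dst_fd with sendfile(); return the count."""
    size = os.fstat(src_fd).st_size
    offset = 0
    while offset < size:
        # The kernel moves the data; nothing is read into a Python buffer.
        sent = os.sendfile(dst_fd, src_fd, offset, size - offset)
        if sent == 0:  # the source shrank while we were copying
            break
        offset += sent
    return offset


def copy_file(src: str, dst: str) -> int:
    """Copy src to a new file dst without following a final symlink."""
    src_fd = os.open(src, os.O_RDONLY | os.O_NOFOLLOW | os.O_CLOEXEC)
    try:
        # O_EXCL refuses an existing name, including a planted symlink, and
        # the restrictive mode is set at creation instead of tightened later.
        dst_fd = os.open(
            dst, os.O_WRONLY | os.O_CREAT | os.O_EXCL | os.O_CLOEXEC, 0o600
        )
        try:
            return copy_in_kernel(src_fd, dst_fd)
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```

Note that the parent directories in src and dst are still resolved by path here, so this sketch narrows the race window rather than closing it.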
-
File naming has been decoupled from the API that does things with files through these fd-based calls. So once you have an fd you should be set.
But:
Linux at the upper kernel layer does not change POSIX filename requirements. Filenames can be random garbage as long as they do not contain pathsep (the slash) and null bytes.
There is no clean, portable Linux syscall that says: “this directory accepts exactly UTF-8” or “this directory accepts arbitrary bytes.” The safe model is still: treat filenames as byte strings, not text, until you must display or create human-facing names.
That means your programming language must work with byte-arrays as filenames, even when that seems to be silly.
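In Python, for instance, this might look like the following sketch. The helper names is_utf8(), list_names() and display() are made up for the example; the only library behavior relied on is that os.listdir() returns bytes when it is given a bytes path.

```python
import os
from typing import Dict


def is_utf8(name: bytes) -> bool:
    """Report whether a raw filename happens to be valid UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def list_names(directory: bytes) -> Dict[bytes, bool]:
    """Map each raw entry name to whether it is valid UTF-8."""
    # A bytes argument makes os.listdir() return the names as raw bytes.
    return {name: is_utf8(name) for name in os.listdir(directory)}


def display(name: bytes) -> str:
    """Lossy rendering for humans only; never pass the result back to the kernel."""
    return name.decode("utf-8", errors="replace")
```

The design point is that the bytes are the identity of the file, and the decoded string is only a label for showing to people.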
Linux pathname rules are byte-oriented: a pathname is a null-terminated byte sequence, interior null bytes are forbidden, and / is the separator, not a filename byte.
Directory reads likewise return null-terminated d_name entries, not Unicode strings.
For creating new human-facing filenames, emit valid UTF-8, preferably normalized to NFC at the application layer. But still be prepared for EINVAL, ENAMETOOLONG, EEXIST, or filesystem-specific rejection – some filesystems have magic filenames such as nul, prn or con and you won't be able to use them.
For accepting existing names, accept arbitrary bytes except / and \0. A UTF-8-only application that refuses to operate on invalid names will break on normal Unix trees, backups, tar extractions, old mounts, removable media, and network filesystems.
For detecting constraints, the options are weak:
statfs() tells you the filesystem type, so you can special-case ext4, vfat, ntfs3, btrfs, overlayfs, etc., but that is not a semantic contract. You will need to know the rules for each filesystem type; there is no way to query them.
pathconf(path, _PC_NAME_MAX) tells you name length limits, not encoding.
statx() gives richer file metadata, but not a general filename-encoding capability, and surely not lists of reserved names or other fanciness.
Some filesystems have feature-specific behavior. ext4 casefolding, for example, stores a filesystem-wide encoding model for case-insensitive directories, defaulting to UTF-8 in the kernel documentation. That does not turn Linux pathname handling generally into Unicode.
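To show how little the length query gives you, here is a small Python sketch using os.fpathconf() on a directory descriptor. The helpers name_limit() and fits() are hypothetical names for this example. The check covers only the component length and the two forbidden bytes; it says nothing about encoding, case folding or reserved names, which is exactly the gap described above.

```python
import os


def name_limit(dirfd: int) -> int:
    """Maximum name component length that fpathconf() reports for this directory."""
    return os.fpathconf(dirfd, "PC_NAME_MAX")


def fits(dirfd: int, name: bytes) -> bool:
    """Necessary, not sufficient: the filesystem may still reject the name."""
    return (
        0 < len(name) <= name_limit(dirfd)
        and b"/" not in name
        and b"\0" not in name
    )
```

Even when fits() returns True, the create call can still fail, so the error handling around the actual open or mkdir remains the real test.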
So we DO have an API that could be race-free if you used it fully, and in order to do that you'd use Linux-specific syscalls.
We LACK an API that can handle structured filenames, and answer questions about naming restrictions properly.
There is no query_name_policy(dirfd) → accepted encoding, normalization rules, case-sensitivity/case-folding, max component length in bytes and characters, reserved names, forbidden code points/bytes, equivalence rules, stable display form, and whether invalid legacy names may already exist.
-
@hllizi See followup post: This has largely already happened (improving the kernel fd API).
-
What we do need is a Linux-centric update to W. Richard Stevens's APUE, an APLE book.
It would discuss how to work with the Linux kernel API correctly, maybe by implementing a libc or a libc replacement, or a Python or Rust kernel API interface, with proper error handling, in code.
Stevens had a wonderful writing style, in English and in Code, showcasing the point made in a chapter without compromising on correctness and production-ready error handling.
But his books and the API he describes are old, and from the Linux Kernel PoV also inferior and outdated.
-
@isotopp How do you read the contents of a directory in a race-free way?
@barubary You do not read a directory race-free in the snapshot sense.
A directory fd gives you a stable ref to THAT directory object, not a stable list of its children.
You CAN do
dirfd = openat2(parentfd, "subdir", ...);
getdents64(dirfd, ...); // or fdopendir/readdir
openat(dirfd, name, ...); // act relative to the same directory
That avoids races involving CWD, replaced parent paths, symlinked path components, and “I checked one path but opened another”.
The entries themselves can still change while you read. Another process can create, delete, rename, or replace name after readdir() returns it and before openat() uses it. Linux does not make readdir() a frozen transaction. A directory fd pins the directory, not its contents.
So you'd
- open directory by fd
- read entry name as bytes
- openat(dirfd, entry_name, flags that express intent)
- fstat the returned fd
- decide based on the object actually opened
- operate on the fd, not the path
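The steps above can be sketched in Python roughly as follows, assuming Linux. The function names open_entry() and read_regular_files() are invented for this illustration, and it uses plain openat() semantics through dir_fd with O_NOFOLLOW rather than openat2() resolution flags. Entries that vanish or change type between listing and opening are simply skipped.

```python
import os
import stat


def open_entry(dirfd: int, name: bytes) -> int:
    """Open `name` relative to dirfd and return an fd for a regular file.

    Raises OSError if the entry vanished, is a symlink (O_NOFOLLOW), or is
    not accessible; raises ValueError if it is not a regular file.
    """
    # O_NONBLOCK keeps open() from hanging if the name was swapped for a FIFO.
    flags = os.O_RDONLY | os.O_NOFOLLOW | os.O_NONBLOCK | os.O_CLOEXEC
    fd = os.open(name, flags, dir_fd=dirfd)
    try:
        # Decide based on the object actually opened, not on an earlier stat().
        if not stat.S_ISREG(os.fstat(fd).st_mode):
            raise ValueError("not a regular file")
    except BaseException:
        os.close(fd)
        raise
    return fd


def read_regular_files(path: bytes) -> dict:
    """Return {raw name: content} for the regular files directly in `path`."""
    dirfd = os.open(path, os.O_RDONLY | os.O_DIRECTORY | os.O_CLOEXEC)
    try:
        result = {}
        for entry in os.listdir(dirfd):
            # Python returns str for fd-based listings; fsencode() restores
            # the original bytes.
            name = os.fsencode(entry)
            try:
                fd = open_entry(dirfd, name)
            except (OSError, ValueError):
                continue  # vanished, symlink, unreadable, or wrong type
            with os.fdopen(fd, "rb") as f:  # operate on the fd, not the path
                result[name] = f.read()
        return result
    finally:
        os.close(dirfd)
```

The listing is still not a snapshot; the sketch only guarantees that each decision is made about the object that was really opened.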
For recursive traversal, you extend the same rule: open child directories with openat() or openat2(), reject symlinks with flags/resolution constraints, keep dirfds on a stack, and perform later operations relative to those dirfds.
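A minimal Python sketch of that recursive rule, again assuming Linux, could look like this. tree_files() is a hypothetical helper that lists regular files below a root without ever following a symlink. It keeps one open descriptor per pending directory, so a very wide tree could run into the descriptor limit; a production version would have to manage that.

```python
import os
import stat

DIR_FLAGS = os.O_RDONLY | os.O_DIRECTORY | os.O_CLOEXEC
# O_NOFOLLOW rejects a symlink as the final component; O_NONBLOCK avoids
# blocking if an entry turns out to be a FIFO.
ENTRY_FLAGS = os.O_RDONLY | os.O_NOFOLLOW | os.O_NONBLOCK | os.O_CLOEXEC


def tree_files(root: bytes) -> list:
    """Return sorted relative names of regular files below root."""
    found = []
    # Each stack item is (dirfd, relative prefix); the dirfd pins one directory.
    stack = [(os.open(root, DIR_FLAGS), b"")]
    try:
        while stack:
            dirfd, prefix = stack.pop()
            try:
                for entry in os.listdir(dirfd):
                    name = os.fsencode(entry)  # back to the raw bytes
                    try:
                        fd = os.open(name, ENTRY_FLAGS, dir_fd=dirfd)
                    except OSError:
                        continue  # vanished, symlink, or not accessible
                    mode = os.fstat(fd).st_mode
                    if stat.S_ISDIR(mode):
                        # Keep the fd: children are opened relative to it later.
                        stack.append((fd, prefix + name + b"/"))
                    else:
                        if stat.S_ISREG(mode):
                            found.append(prefix + name)
                        os.close(fd)
            finally:
                os.close(dirfd)
    finally:
        for fd, _ in stack:  # only non-empty if an error interrupted the walk
            os.close(fd)
    return sorted(found)
```

Because every child is opened relative to its parent's descriptor, swapping a directory for a symlink mid-walk makes that open fail instead of redirecting the traversal.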
The oss-sec report’s uutils examples are mostly failures of this kind: path-based second operations, permission changes after creation, missing O_NOFOLLOW, missing O_EXCL, or creating too broadly and tightening later.
A truly race-free directory listing would mean one of three things:
- A filesystem snapshot.
- Kernel support for transactional directory enumeration plus later object resolution against that transaction.
- Locking/excluding all concurrent mutation, which Unix generally does not provide as a normal directory API.
None of that is desirable as the default behavior, because it will scale like shit.
A snapshot is desirable for backups, indexing, forensics, package database consistency, and reproducible tree copies. But then the right answer is usually “use a filesystem snapshot”, not “make readdir() magic”.