Two Objects Not Namespaced by the Linux Kernel (2017)

blog.jessfraz.com

169 points by setra 7 years ago · 58 comments

> The current set of namespaces in the kernel are: mount, pid, uts, ipc, net, user, and cgroup. [...] [Time is] not namespaced. [...] The kernel keyring is another item not namespaced.
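
As a rough illustration of the list quoted above, here is a minimal Python sketch that prints the namespaces the current process belongs to by reading the symlinks under `/proc/self/ns`. The helper name `current_namespaces` is made up for this example, and it assumes a Linux system with `/proc` mounted; anywhere else it simply returns an empty dict.

```python
import os


def current_namespaces(pid="self"):
    """Return {namespace_name: link_target} for a process.

    Returns an empty dict when /proc/<pid>/ns is unavailable
    (non-Linux system, /proc not mounted, or no such process).
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result
    for name in names:
        try:
            # Each entry is a symlink whose target identifies the namespace.
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return result


if __name__ == "__main__":
    for name, target in current_namespaces().items():
        print(f"{name:20} {target}")
```

Two processes are in the same namespace of a given type when the link targets for that entry match.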

I've always argued that "everything is a file" is an exaggeration. These moments make the extent of that exaggeration clear.

If everything truly was a file, the only thing you would need to namespace is the filesystem. But in reality there are a lot of other kernel objects that are not files at all.

zapita 7 years ago

You are 100% correct. “Everything is a file” was more of an early design insight, which was gradually abandoned as new features were added.
There is a movement of “Unix purists” who lament this deviation from founding principles, and advocate for a return to them. The most notable example is Plan 9.
In Plan 9, everything actually is a file. And exactly as you said, all resources are namespaced via the filesystem. It’s quite elegant and practical.
Sadly Plan 9 has remained a fringe OS, and although it influenced mainstream operating system design in many ways (including the concept of /proc), I wish that influence had been stronger.
- AceJohnny2 7 years ago
  
  I also liked QNX, when I worked with it.
  You really did access devices through the /dev/ system, and device-drivers were userspace programs that created files in /dev/.
  If your driver crashed, you could kill the userspace driver (which deleted the file under /dev) and restart it (assuming hardware blah blah blah).
  - Someone 7 years ago
    
    “device-drivers were userspace programs that created files in /dev/. If your driver crashed, you could kill the userspace driver (which deleted the file under /dev)”
    I think that shows not everything is a file. If everything were, you would start the driver by creating the file (say as a hard link from a file in /dev to the driver executable) and kill the driver by rm-ing the file.
    (Chances are that, if you follow this through, this idea won’t support everything we want to do with drivers, but if so, that’s an indication that “everything is a file” doesn’t work)
    
    zapita 7 years ago
    
    To give you a sense of how far Plan9 took the idea... To open a tcp connection, you create a special “control file” at `/net/tcp/ctl` or some similar path, then write newline-terminated text commands to the file descriptor. That descriptor now represents your socket. You can also browse its contents as a directory (in plan9 each node in the filesystem can be both a regular file and a parent directory).
    
    AceJohnny2 7 years ago
    
    Great point.
- pjmlp 7 years ago
  
  It might have been elegant, but doing high performance graphics rendering wasn't something Rio was able to do.
- temac 7 years ago
  
  hm is NT a purer Unix than Unix then? After all, it has all its objects in a filesystem-like tree...
  - tadfisher 7 years ago
    
    In Windows NT, everything is an object. This is derived from VMS, which is essentially NT's predecessor (a principal designer of both being David Cutler of Digital Research and later Microsoft).
    
    jackfraser 7 years ago
    
    The problem I always had with this was that Windows has this whole layer of objects and what might even be elegance in places all hidden under the hood, and unless you're a C++ hacker you can't actually work with most of it. CMD and the GUI tools never exposed half of it to you; Powershell helps, but it's all still very hidden and hard to get to.
    In comparison, Unix provides all the tools needed to take it apart and put it back together again. When you do need to interact with some syscall interface, there's almost always a complete CLI around it. It really makes it much easier to get into the nooks and crannies, inconsistencies aside.
    
    yjftsjthsd-h 7 years ago
    
    Yeah, as a person who hates Windows and loves all things unix, the NT kernel and underlying system have long struck me as a well-designed, nice system... with a poor userland and a terrible UI on top. But the kernel is nice.
    
    atombender 7 years ago
    
    He's talking about the kernel. The Windows NT kernel and the Windows APIs use handles to represent kernel/API objects, and every handle has things like security associated with it.
    For example, CreateProcess() gives you back a HANDLE value representing the process, and you can close it with CloseHandle(). Everything is a HANDLE: Files, pipes, threads, etc. A notable exception is sockets, which for historical reasons use an API modeled on BSD sockets.
    The object stuff you're talking about is presumably COM, which is different. COM is great, but has nothing to do with the kernel.
    
    hyperman1 7 years ago
    
    He's not talking about COM. The HANDLEs are handles to kernel objects. This means multiple different handles might exist to the same object, and handles in different processes probably have different numeric values for the same object. Think win32 handle = Unix file descriptor (roughly)
    
    atombender 7 years ago
    
    No, the poster said: "unless you're a C++ hacker you can't actually work with most of it." That strongly implies COM. The Windows APIs are C.
    
    martyvis 7 years ago
    
    David Cutler developed VMS at Digital Equipment Corporation (DEC). Digital Research was a different company - it developed CP/M.
caf 7 years ago

> I've always argued that "everything is a file" is an exaggeration.
This is true, but also bear in mind that "everything is a file" didn't mean "everything is represented by a name in the mount tree", it really meant "(almost) everything is referred to by a file descriptor".
It's still true that the most painful things to deal with are the ones that aren't represented by a file descriptor.
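
A small Python sketch of the file-descriptor reading described above: a pipe, a socket pair, and a regular file are all driven through the same `os.write`, `os.read`, and `os.close` calls on plain integer descriptors. The helper names `roundtrip` and `demo` are invented for this example, and it assumes a Unix-like system.

```python
import os
import socket
import tempfile


def roundtrip(write_fd, read_fd, payload=b"hello"):
    """Write payload to one descriptor and read it back from another."""
    os.write(write_fd, payload)
    return os.read(read_fd, len(payload))


def demo():
    results = {}

    # A pipe: two descriptors, one per end.
    r, w = os.pipe()
    results["pipe"] = roundtrip(w, r)
    os.close(r)
    os.close(w)

    # A connected socket pair: the same read/write calls work on its descriptors.
    a, b = socket.socketpair()
    results["socket"] = roundtrip(a.fileno(), b.fileno())
    a.close()
    b.close()

    # A regular file: one descriptor, so seek back before reading.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello")
        os.lseek(fd, 0, os.SEEK_SET)
        results["file"] = os.read(fd, 5)
    finally:
        os.close(fd)
        os.unlink(path)
    return results


if __name__ == "__main__":
    print(demo())
```

Note that only the regular file needed `lseek`; as a later comment points out, not every descriptor supports every operation.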
yayana 7 years ago

I've always thought of it as the preferred interface to Userland when there isn't an overriding factor.
Within a kernel it seems like no one cares how the sausage is made.
DonHopkins 7 years ago

If time were a file, you could gzip it up to compress it, and store it away for later.
Time files like an arrow!
- steffan 7 years ago
  
  Fruit files like an Apple?
tyingq 7 years ago

I agree it's unfortunate, but it doesn't really seem in conflict with "everything is a file".
Making it a file is separate from making it sensible/usable for containers. Like the /proc filesystem. They are "files", but many don't work as expected without something like lxcfs. Like /proc/uptime, for example.
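
To make the /proc/uptime example concrete, here is a minimal Python sketch that reads the file. The function name `read_uptime` is made up for this example. Following the comment above, the assumption is that inside a container without something like lxcfs this reports the host's uptime rather than the container's, which you can check by comparing the value against the container's start time.

```python
def read_uptime(path="/proc/uptime"):
    """Return (uptime_seconds, idle_seconds), or None if the file is unavailable."""
    try:
        with open(path) as f:
            fields = f.read().split()
    except OSError:
        return None
    if len(fields) < 2:
        return None
    # First field: seconds since boot. Second field: cumulative idle seconds.
    return float(fields[0]), float(fields[1])


if __name__ == "__main__":
    print(read_uptime())
```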
sytelus 7 years ago

The abstraction is not really a file but a stream of bytes. It turns out any object with a stream of bytes will need a similar set of operations: open, create, read, write, close, seek, etc. This is a fairly generic and powerful abstraction.
- cyphar 7 years ago
  
  The abstraction is a file descriptor. Not all things represented by file descriptors support read(2) or _llseek(2), but by representing them as file descriptors you can reuse other things like af_unix file descriptor passing.
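
The descriptor passing mentioned in the reply above can be sketched in Python with `socket.send_fds` and `socket.recv_fds` (Python 3.9+, Unix only), which wrap the `SCM_RIGHTS` mechanism of AF_UNIX sockets. For brevity this sketch sends and receives within one process; the function name `pass_fd_demo` is invented here.

```python
import os
import socket


def pass_fd_demo():
    """Send the read end of a pipe over an AF_UNIX socket and use the received copy."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    r, w = os.pipe()
    received = None
    try:
        # The descriptor travels as ancillary data next to a small message.
        socket.send_fds(sender, [b"fd"], [r])
        msg, fds, _flags, _addr = socket.recv_fds(receiver, 1024, 1)
        received = fds[0]

        # The received descriptor refers to the same pipe as the original.
        os.write(w, b"ping")
        data = os.read(received, 4)
        return msg, data, received != r
    finally:
        for fd in (r, w, received):
            if fd is not None:
                os.close(fd)
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(pass_fd_demo())
```

The receiver gets a new descriptor number that points at the same underlying kernel object, which is why the mechanism works for any kind of descriptor, not just regular files.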
fulafel 7 years ago

Yeah, this was one of the headline changes in Plan 9 (the second OS made by the Fathers of Unix).
- pjmlp 7 years ago
  
  And improved in Inferno, the third OS made by the Fathers of Unix that HNers keep forgetting about, which fixed some of Plan 9's flaws.
madhadron 7 years ago

Everything is a file hasn't been true for Unix since almost the beginning. It's kind of like the Unix philosophy of small, independent tools...except for the database where you store all the important data.
lunchables 7 years ago

I thought "everything is a file" referred to user land.

wmf 7 years ago

Since this was written a time namespace was proposed: https://www.phoronix.com/scan.php?page=news_item&px=Linux-Ti...

DonHopkins 7 years ago

This proposal implements clock offsets, but does it support continuous time scaling? One clock-handy use case would be to run your programs really slow or fast (or backwards!), for testing purposes.
Kaleida Labs' ScriptX (a multimedia programming language kinda like Dylan with classes) had built-in support for hierarchical clocks within the container (in the sense of "window" not "vm") hierarchy. The same way every window or node has a 2D or 3D transformation matrix, each clock has a time scale and offset relative to its parent, so anything that consumes time (like a QuickTime player, or a simulation) runs at the scaled and offset time that it inherits through its parent containers. And you can move and scale each container around in time as necessary, to pause movies or simulations. You could even set the scale to a negative number, and it played QuickTime movies backwards! (That was pretty cool in 1995 -- try playing a movie backwards in the web browser today!)
http://www.art.net/~hopkins/Don/lang/scriptx/tech-qa.html
Q: How does the ScriptX core class library compare to class libraries available with other programming languages (e.g. MFC, OWL, MacApp, or TCL)?
A: The ScriptX core class library has many similarities to other object oriented frameworks in that it provides many basic services common to all applications built on them. All frameworks provide classes for creating windows, handling keyboard and mouse events, reading and writing files, etc.
Where ScriptX is unique is in its focus. The ScriptX core classes are oriented towards interactive, media rich applications. For example, clocks and timing are fundamental in the ScriptX class library; most other frameworks have no concept of timing built in.
ScriptX also tightly integrates media data (bitmaps, video, audio) with the class library, and hides the details of storing, retrieving, and presenting media to the user.
Q: What are the major sections of the core class library?
Clocks, players, and animation.
Time is a fundamental element of the core classes. Starting with basic clocks, subclasses extend the capabilities for animation, video, and audio playback.
Clocks can be tied to underlying hardware clocks or slaved in a hierarchical fashion to other clock objects. Varying the rate of a master clock, all sub-clocks will stay synchronized to the master clock, permitting the programmer to precisely control time in a title. Clock hierarchies also free the programmer from dealing with differences in performance between slower and faster CPUs.
Player classes build upon clocks. These classes allow you to create and play sequences of actions that take place over time. These sequences can be used to create animations as well as control other presentation elements such as video or sound.
A special type of clock object, the action list player, can be used to play actions in sequence at specified times. Various action objects are added to an action list specifying the time at which the action is to occur. Action objects are used to move graphic elements on the screen, execute ScriptX code, or modify the action list.
Other player classes provide simple ways to play digital video, audio, and MIDI. As with all players, the clocks underlying these players can be sped up, slowed down, or run backwards.
Q: Can video be synchronized with other events?
A: Yes, internal video players are based on ScriptX clocks and can be slaved together to provide synchronization with animations and other events. For example, buttons can appear in a window at precise times based on video playback.
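
The scale-and-offset hierarchy described above can be sketched in a few lines of Python. This is not the ScriptX API; the `Clock` class and its `scale`, `offset`, and `now` names are hypothetical and only illustrate the idea that each clock derives its time from its parent.

```python
class Clock:
    """A clock whose time is a scaled and offset view of its parent's time."""

    def __init__(self, parent=None, scale=1.0, offset=0.0):
        self.parent = parent  # None means this clock reads the root time directly
        self.scale = scale    # rate relative to the parent; negative runs backwards
        self.offset = offset  # shift relative to the parent, in seconds

    def now(self, root_time):
        """Return this clock's time given the time at the root of the hierarchy."""
        parent_time = self.parent.now(root_time) if self.parent else root_time
        return self.scale * parent_time + self.offset


if __name__ == "__main__":
    root = Clock()
    half_speed = Clock(parent=root, scale=0.5)
    reversed_child = Clock(parent=half_speed, scale=-1.0, offset=10.0)
    for t in (0.0, 4.0, 8.0):
        print(t, half_speed.now(t), reversed_child.now(t))
```

Changing `scale` on a parent changes the rate of every clock below it, which is the synchronization property the Q&A describes; setting a scale of zero pauses the subtree.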
- cyphar 7 years ago
  
  > This proposal implements clock offsets, but does it support continuous time scaling?
  No. The main reason is that it's very difficult to do with the current time-keeping machinery within the kernel. Some people also want the ability to freeze the current time, which is similarly difficult -- and in some cases harder because then what should CLOCK_MONATONIC give you? There's also the fact that there's currently no interface to set the "clock speed" or do any of these things.
  Making time go backwards I think would simply be impossible, due to how many things in the kernel that interact with time probably make the (reasonable) assumption that time goes forwards. Also CLOCK_MONATONIC would do the exact opposite in such circumstances.
  - cperciva 7 years ago
    
    You mean "CLOCK_MONOTONIC", not "CLOCK_MONATONIC". (I'm guessing this is a misspelling, not a typo, since it appeared twice.)
    And the simple answer is that if time stops then CLOCK_MONOTONIC always returns the same time. This is perfectly fine given correct software; CLOCK_MONOTONIC is guaranteed to not go backwards but it is not guaranteed to always go forward. One could imagine, for example, a system with a very inaccurate clock where CLOCK_MONOTONIC simply counts days.
- OskarS 7 years ago
  
  > This proposal implements clock offsets, but does it support continuous time scaling? One clock-handy use case would be to run your programs really slow or fast (or backwards!), for testing purposes.
  What use-case would you have for this? Making sure your program runs properly in the near presence of a black hole?
  - guipsp 7 years ago
    
    Your snark aside, clocks are not perfect, and a malfunctioning clock might speed up or slow down.
    Also, speeding up the clock is a technique already used in testing environments [1].
    [1]: https://github.com/majek/fluxcapacitor
    
    OskarS 7 years ago
    
    Sorry, I honestly didn't mean to be snarky. It was a genuine question. I couldn't really see the real-case justification for testing a clock that slowed down, sped up, or went backwards. But you're right, malfunctioning clocks would be an example.
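
On the CLOCK_MONOTONIC point raised earlier in this subthread (not going backwards is guaranteed, always going forwards is not), here is a minimal Python sketch of what correct software may and may not assume. The helper names are made up, and `time.clock_gettime` is only available on Unix-like systems.

```python
import time


def monotonic_samples(n=1000):
    """Take n consecutive readings of CLOCK_MONOTONIC."""
    return [time.clock_gettime(time.CLOCK_MONOTONIC) for _ in range(n)]


def is_non_decreasing(samples):
    """True if no reading is smaller than the one before it; equal readings are allowed."""
    return all(a <= b for a, b in zip(samples, samples[1:]))


if __name__ == "__main__":
    print(is_non_decreasing(monotonic_samples()))
    # A frozen clock still satisfies the guarantee.
    print(is_non_decreasing([1.0, 1.0, 1.0]))
```

Code that checks `a < b` instead of `a <= b` between two readings is relying on more than the clock promises.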

derefr 7 years ago

I wonder whether namespacing time would also result in those namespaces being able to have separate "clocks" (time backends? time schedulers?) that progress at different rates, or for different reasons.

Being able to put a process into a time namespace with a deterministic "clock" would obviate a large benefit of http://www.zerovm.org/.

Also, having "clock slew" be a matter of perspective—with processes that can handle leap seconds seeing them happen instantaneously; and processes that can't handle leap-seconds, seeing slewed time—would be nice. Then you could have different system facilities that care about monotonic time, vs. synced to calendar time, vs. one second per second time, all having that kind of time available to them as "the time", rather than through different APIs.

rwmj 7 years ago

Accelerated time might also be a way to test programs. It's similar to techniques used to test planes (by repeatedly pressurising and depressurising them). It might, for example, reveal race conditions faster in programs that ordinarily do a lot of sleeping. I wrote a bit more about this (unproven) idea here: https://rwmj.wordpress.com/2010/10/14/half-baked-ideas-accel...
kbenson 7 years ago

> Also, having "clock slew" be a matter of perspective—with processes that can handle leap seconds seeing them happen instantaneously; and processes that can't handle leap-seconds, seeing slewed time—would be nice.
I imagine there might be some really interesting (for meanings of interesting that include shoot me now) and hard to track down bugs as you deal with inconsistent clocks not just across systems within a network, but processes within a single system.
- derefr 7 years ago
  
  > I imagine there might be some really interesting (for meanings of interesting that include shoot me now) and hard to track down bugs as you deal with inconsistent clocks not just across systems within a network, but processes within a single system.
  I feel like the "safe assumption" that the other end of a given IPC channel (or even inter-thread communication channel) is on the same machine, is responsible for the vast majority of failures we see in e.g. Jepsen testing of databases.
  After all, in sufficiently-large computers (i.e. HPC clusters that pretend to be one "computer"), you've got NUMA zones that are light-microseconds away from one another, where even threads of the same process can literally end up needing vector clocks to linearize events between themselves.
  It probably wouldn't be too bad a thing if things like the Linux base-system used only internal IPC mechanisms that exposed this unreliability (like e.g. Erlang does with "unreliable async message passing" as its IPC primitive), forcing each component to deal with the fact that its peers may or may not be netsplit away from it.
  Even if that scenario will only come up if you're writing code to get your GPS position from a Dyson sphere of 10-mile-deep Matryoshka brains.
  - kbenson 7 years ago
    
    I bet that assumption is responsible for a large number of problems. I just also think it's correct enough most of the time and relied on enough that if it all of a sudden often wasn't true, we'd see our carefully crafted applications for what they really are, a pile of assumptions that sometimes have little relation to reality.
- chatmasta 7 years ago
  
  IIRC Docker for Mac had a bug like this for a long time where the clocks of containers would become wildly out of date.
  - TheDong 7 years ago
    
    More accurately, the clocks of the linux virtual machine running docker containers would differ from the OSX clock.
    Those aren't really containers skewing from other processes on the same system as the parent describes, but clocks skewing on two different systems (which is a totally normal thing we deal with regularly).
cyphar 7 years ago

There is a time namespace proposal[1], but currently the answer to this question is no. The reason is that timekeeping is incredibly complicated within the kernel (for instance -- when userspace gets the current time, it's read from a vDSO page that the kernel injects into every process and thus is updated by the kernel asynchronously). Adding different clock speeds is already non-trivial, let alone switching out different time backends.
The current time namespace proposal just allows you to set the current time separately from the host, which is actually quite a difficult thing to do already (it takes 20 patches)...
[1]: https://lore.kernel.org/lkml/alpine.DEB.2.21.1810022310360.1...
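
To see the vDSO mapping the comment above refers to, a process can look for the `[vdso]` entry in its own `/proc/self/maps`. This is a minimal Python sketch, assuming Linux with `/proc` mounted; the function name `find_vdso` is invented for the example, and it returns None where no such mapping is visible.

```python
def find_vdso(maps_path="/proc/self/maps"):
    """Return (start, end) addresses of the [vdso] mapping, or None if not found."""
    try:
        with open(maps_path) as f:
            for line in f:
                if line.rstrip().endswith("[vdso]"):
                    # The first column looks like "7ffd1c3f2000-7ffd1c3f4000".
                    start, end = (int(x, 16) for x in line.split()[0].split("-"))
                    return start, end
    except OSError:
        return None
    return None


if __name__ == "__main__":
    mapping = find_vdso()
    if mapping is None:
        print("no [vdso] mapping visible")
    else:
        print(f"[vdso] at {mapping[0]:#x}-{mapping[1]:#x}")
```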
zapita 7 years ago

Is zerovm still active? I loved the concept, but the startup behind it is gone, and it’s built on NaCl which is being deprecated by wasm... I would love to see it ported to wasm and expanded beyond the original “python compute embedded in openstack storage” use case, which was underwhelming. There are so many more exciting applications to server-side wasm. I hope someone actually builds this.
At one point I got my hopes up that Docker would build this as the logical next step after Linux containers... But they seem to be focused on monetizing the containers/kubernetes movement, which makes sense as a business decision but still is disappointing.
- derefr 7 years ago
  
  Considering that PNaCl was made for running untrusted, user-supplied native code in a sandbox-environment resembling that of native Linux binaries; and was used for this in e.g. Google App Engine to build the various first-generation container runtimes...
  ...and considering that GVisor (https://github.com/google/gvisor) is now used by Google for that same use-case...
  ...then perhaps GVisor (or a thin "make everything deterministic" layer on top of it) could be looked at as something like a "spiritual successor" to ZeroVM?
vlovich123 7 years ago

Couldn't that just be done as part of libfaketime? Now granted it's harder to do an entire OS with that but you could run it within a VM that itself is run by libfaketime, no?

theamk 7 years ago

I personally miss core pattern namespacing. I would love to give some of my containers a custom coredump handler, but this is impossible.

And in general, a sysctls settings namespace would be really useful. Sure, sometimes it makes no sense to namespace a setting, but net.ipv4.tcp_congestion_control for example? I'd love to be able to change it without modifying the code.
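
For context on the congestion control example, this Python sketch reads both the system-wide sysctl and the per-socket value. The per-socket route uses the Linux `TCP_CONGESTION` socket option, which is the "modifying the code" path the comment would like to avoid. The function names are made up, and both return None on systems where the setting is not available.

```python
import socket


def system_congestion_control(path="/proc/sys/net/ipv4/tcp_congestion_control"):
    """Return the sysctl value as a string, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def socket_congestion_control(sock):
    """Return the congestion control algorithm of one TCP socket, or None."""
    option = getattr(socket, "TCP_CONGESTION", None)  # only defined on Linux
    if option is None:
        return None
    try:
        raw = sock.getsockopt(socket.IPPROTO_TCP, option, 16)
    except OSError:
        return None
    # The kernel returns a NUL-padded name such as b"cubic\x00...".
    return raw.split(b"\x00", 1)[0].decode() or None


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(system_congestion_control(), socket_congestion_control(s))
```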

vxNsr 7 years ago

meta: This is from 2017.

Super interesting though, the keyring thing especially seems to have broader implications...

tyingq 7 years ago

Syslog seems to be on the proposal list as well.

lalaithion 7 years ago

Why is this the case? No one has bothered to do it? It would break backwards compatibility? Linus thinks it's a bad idea?

jchw 7 years ago

Shouldn't break backwards compatibility. More than anything, my guess is that it's just a result of most of Linux's modern day design having been implemented before the era of containers. Afaik, namespaces+cgroups were never meant to support complete isolation.
simcop2387 7 years ago

The time namespace is being worked on, it's a very difficult problem because of how pervasive time is in the kernel.
Here's a recent in depth LWN article about the topic. https://lwn.net/Articles/766089/
I haven't heard about any work being done on the keychain stuff, but I don't know any reason it shouldn't be doable.
emmelaich 7 years ago

Probably merely because it's hard to do and no one has sufficient motivation.
I can think of one good use case -- y2k style problems.
Also sometimes apps are tied to external events like legislation. It would be good to set the time forward for testing.
You can sort of do this with LD_PRELOAD but it can get hairy.
Also see @wmf's comment above.
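
The "set the time forward for testing" use case can be sketched inside a single Python process. To be clear, this is not LD_PRELOAD or libfaketime: it only patches `time.time` for Python code that looks it up through the `time` module, and it leaves the kernel clock and other processes untouched. The names `shifted_time` and `is_expired` are hypothetical.

```python
import time
from unittest import mock


def shifted_time(offset_seconds):
    """Context manager that makes time.time() report a shifted clock."""
    real_time = time.time  # capture the real function before patching
    return mock.patch("time.time", lambda: real_time() + offset_seconds)


def is_expired(deadline):
    """Example code under test: has the deadline (epoch seconds) passed?"""
    return time.time() >= deadline


if __name__ == "__main__":
    deadline = time.time() + 3600
    print(is_expired(deadline))        # evaluated against the real clock
    with shifted_time(2 * 3600):
        print(is_expired(deadline))    # evaluated two hours in the "future"
```

A time namespace would give a similar effect to whole processes without any cooperation from the code under test, which is what makes it attractive compared to preload tricks.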
- etaoins 7 years ago
  
  Another use case is dealing with tokens that assume globally synchronised clocks such as JWTs and Kerberos/Active Directory. Ideally all clocks would be perfectly synchronised but things happen.
  For example, you might have one container that’s exchanging JWTs with a micro service that should be using AWS’s NTP servers and another that’s joined an Active Directory domain that should be using the AD NTP server. Right now you either need to run them on separate machines or expose yourself to interesting problems if clock skew happens.
- briffle 7 years ago
  
  If there is one thing Y2K taught us, it's to ignore any worry about the 2038 problem until 2036, then make a HUGE deal out of it.
  https://en.wikipedia.org/wiki/Year_2038_problem
  - cyphar 7 years ago
    
    Linux and glibc have been working on 2038 problems for at least the past decade.
    
    pjmlp 7 years ago
    
    There are plenty of other POSIX platforms out there.
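
The arithmetic behind the 2038 problem discussed above is short enough to check directly: a signed 32-bit count of seconds since 1970-01-01 UTC tops out at 2**31 - 1, and one more second wraps to the most negative value. The helper names in this Python sketch are made up.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def from_epoch(seconds):
    """Convert seconds since the Unix epoch to an aware UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


def wrap_int32(value):
    """Simulate storing a value in a signed 32-bit integer with wraparound."""
    return (value + 2**31) % 2**32 - 2**31


if __name__ == "__main__":
    print(from_epoch(INT32_MAX))                  # last representable second
    print(from_epoch(wrap_int32(INT32_MAX + 1)))  # where a wrapped counter lands
```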

Sharlin 7 years ago

I’m not sure that people who think “containers are just like VMs” should have any business working with containers.

timeattack 7 years ago

You can't change time in container, but it's possible to change timezone files.

With generating fake timezones it is possible to change time in container.

cyphar 7 years ago

This doesn't change what gettimeofday(2) gives you (and actually you can't even use ptrace easily to fake the time of day because gettimeofday(2) isn't a real syscall -- it's actually implemented as a read from the vDSO page the kernel maps into every process).
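
The distinction drawn in this exchange can be shown with a small Python sketch. It uses a POSIX-style `TZ` environment string instead of generated timezone files, which is a simplification of the trick described above; the zone name `FAKE-10` and the function `apparent_offset` are invented for the example. `time.tzset` is Unix only.

```python
import calendar
import os
import time


def apparent_offset(tz_string, epoch_seconds=1_000_000_000):
    """Return how far local wall-clock time appears shifted from UTC, in seconds."""
    old = os.environ.get("TZ")
    os.environ["TZ"] = tz_string
    time.tzset()
    try:
        local = time.localtime(epoch_seconds)
        # Reinterpret the local fields as if they were UTC to measure the shift.
        return calendar.timegm(local) - epoch_seconds
    finally:
        # Restore the previous timezone setting.
        if old is None:
            os.environ.pop("TZ", None)
        else:
            os.environ["TZ"] = old
        time.tzset()


if __name__ == "__main__":
    before = time.time()
    print(apparent_offset("FAKE-10"))  # POSIX TZ syntax: this zone is 10 hours east of UTC
    print(time.time() - before)        # the epoch clock itself did not jump
```

Only the conversion to local time is affected; the seconds-since-epoch value that programs get from the kernel stays the same, which is the limitation the reply points out.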
