This article is an overview of accessibility and security efforts in Arcan “the desktop engine”: past, present and those just around the corner. It is not the great and detailed one I had planned, merely a tribute, presented in order to assist ongoing conversations elsewhere and as an overlay to the set of problems described by Casey Reeves here (linux:audio, linux:video).
Because of this the critique of other models is omitted but not forgotten, as is the fiery speech about why UI toolkits need to “just fucking die already!” as well as the one about how strapping a walled garden handsfree to your face hardly augments your reality as much as it might “deglove” your brain — but it is all related.
What should be said about the state of things though is that I strongly disagree with the entire premise of some default “works for me”, a set of trapdoor features (repeatedly pressing shift is not a cue to modal dialog a ‘do you want to enable sticky keys?’ in someone’s face) and a soup of accessibility measures hidden in a settings menu somewhere. I consider it, at best, a poor form of ableism.
Due to the topic at hand this post will be text only. Short-links to the sections and their summaries are as follows:
Philosophy argues that accessibility is a broader perspective on how we attach and detach from computing both new and old.
Security extends the philosophy argument to say that security mechanisms and accessibility tooling need to evolve hand in hand, and not as a sacrifice of one quality in order to improve another.
Thereafter it gets technical:
Desktop Engine is a walk-through of the different layers of Arcan specifically from the angle of building more secure and accessible computing interfaces.
Client-Level covers the specific tools external sources and sinks can leverage to increase their level of cooperation towards accessibility.
Frameservers expands on role-designated clients that provide configurable degrees of protection and segment the desktop privilege level by the kind of data processing being performed.
Shell and TUIs covers the middle ground from command-line shell and system management tools to the composed desktop.
Examples: Durden – Pipeworld highlights preparations made for accessibility in a traditional desktop environment, and follows up with how it extends in a more unconventional one.
Trajectory briefly covers how this is likely to evolve and where the best opportunities to join in and contribute lie.
Philosophy
‘Secure and accessible’ is one of the grand narratives governed by the principles for a diverging desktop future and has been one since before the very first public presentation of Arcan itself. As such it is not the target of any one single principle, but a goal they all combine to deliver.
We view accessibility as a fundamental mission for computing. A rough mission statement would be ‘to increase agency, in spite of our stern and mischievous mother Nature’. It is for all of us or for none of us.
Whatever your preconditions or current state of decay happens to be, computing should still be there to grow the boundaries of your perceivable world. That world can be ever expanded upon, even if you are below 40 years of age and still believe yourself to be more than capable of anything — the mantis shrimp laughs at your crude perception of colours; a computer did not always fit in a pocket.
This reaches into both the future and into the past; gilded memories of youth and retrocomputing alike. Your current desktop and smart phone will eventually become dust in the wind and tears in rain. Being suddenly rejected access to that past is a substantial loss and not just a mere hindrance — as the companies that to this day retain a win95 machine for their abandoned CNC control software would attest; some critical infrastructures still run on OS/2.
Computing history does not simply go away, it just becomes less accessible. This is why we chose to use virtualisation (including, but not limited to, virtual machines), not the toolkit, as the default ‘last resort’ model and compartment for compatibility that needs to be made accessible.
Security
A special topic with substantial overlap is that of security. Some have scoffed in the past at the idea of security having anything to do with the matter. Yet if your threat model grows more complex, directly through you or indirectly via friends or family performing sensitive work; managing access to substantial funds; having a tiff with the local ol’ drug peddler; fighting an abusive spouse or spotting the wrong criminal in the act — you will find out first hand how the need to maintain proper operational security (“opsec”) forms a substantial disability in its own right in a digitally traceable society where targeting information is ridiculously cheap to come by.
A trivial symptom of someone failing to understand this property is the sprinkle of ‘we have just given up, here is a deliverable oh great product owner’ kind of dialog boxes: “Do you allow the file manager to access the file system yes/no/potatosalad?”, “The thing you just downloaded to do things with your camera did not pay us to pretend to trust them on your behalf and is trying to access your camera, which we reject by default to inconvenience you further; exit through the app store. To override this action please go to settings/cellar/display department, press flashlight and do beware of the leopard. We reserve the right to remove this option at any point in the future”.
Such measures already struggle with accessibility tooling, which by its very nature and intent can automate things some lobotomised designer thought should be exclusively interactive. On top of that, the often ignored ‘Denial of Service‘ attack vector is really important, if not paramount.
This is because you do stupid things when put under the wrong kinds of stress and pressure, hence the phrase ‘being under duress’. Stripping someone of agency at choke points is a tried and true tactic for achieving exactly that. With tech advancing, integrating and gate-keeping just about everything, it also provides great opportunity to deny someone service.
Revoking features disguised as gavage-fed updates, reducing interaction or presentation options or dropping compatibility in the name of security is a surefire way of signalling that you do not actually care about it, just pretend to. These are properties that you develop in tandem or not at all.
Desktop Engine
Arcan is presented as a ‘desktop engine’ and not as a ‘display server’ for good reason. In this context it means to consolidate all the computing bits and computer pieces useful when mapping our perception and interaction into an authoritative behaviour and permission model, one entirely controlled by user provided scripts.
Such scripts can be small and short and just enable the bare essentials, as shown in Writing a console replacement using Arcan (2018) as well as reach towards more experimental modes like for VR (Safespaces: An Open Source VR Desktop, 2018). An exclusively aural desktop with spatial sound is fairly simple to achieve in this model, as is one controlled exclusively through eye movement or grunts of arousal, displeasure or both.
This time we are only concerned about how the desktop engine relates to the accessibility topic. To dive deeper, there already is Arcan as Operating System Design (2021) for a broader picture and Arcan versus Xorg – Approaching Feature Parity (2018), Arcan versus Xorg: Feature parity and Beyond (2020) for the display server role.
The core engine has the usual bells and whistles: an OS integration platform; media scene graph traversal; scheduler; scripting runtime; configuration database; supervisory process and interprocess communication library. Most other tools, network translation among them, come as separate processes using this interprocess communication library.
By itself it does nothing. You load a bunch of scripts that have some restrictions on what things are named and how they can access secondary resources like images. This is referred to as an ‘appl’ – more than an app, less than a full blown application. The scripting API is quite powerful, with primitives comparable to those of a web browser, but thankfully layered on Lua instead of JavaScript.
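To make the ‘appl’ concept a bit more concrete, here is a minimal sketch of one; the ‘demo’ name and image path are placeholders, and the structure is simplified:

-- demo/demo.lua : a minimal appl sketch, names are placeholders
-- the engine calls the function matching the appl name once at startup
function demo()
	-- resources resolve against the restricted appl / shared namespaces
	local img = load_image("images/background.png")
	if valid_vid(img) then
		show_image(img)
	end
end

-- optional entry point, receives translated input events from the engine
function demo_input(iotbl)
	if iotbl.translated and iotbl.active then
		return shutdown() -- any keypress exits
	end
end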
The end user, or an aggregate acting on their behalf, selects and tunes the active set of scripts. From the desktop perspective, this becomes your window manager. The set is interchangeable at runtime; sources and sinks (“clients”) adjust accordingly.
The user can augment the appl they chose to run with hook scripts, intended for adding specialised, appl-agnostic modifications of common functions: support for exotic input devices, custom workarounds and so on, basically aggregating your desktop ‘hacks’.
The appl may open named ‘connection points’ (a role, like ‘statusbar’) to which external sources and sinks attach with an expressed, coarse-grained type (e.g. game, sensor or virtual-machine). Sources and sinks can work with both video and audio. They also have an inbound and an outbound event queue. This ties in with the interprocess communication library mentioned above. The relevant parts of the event model are covered further below in the ‘Client Level’ section.
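As a rough sketch of what this looks like from the appl side (the connection point name, the accepted type and the exact event fields are illustrative, not taken from any shipped appl):

-- open a single-use connection point named 'statusbar'
target_alloc("statusbar", function(source, status)
	if status.kind == "registered" then
		-- the source has announced its coarse-grained type
		if status.segkind ~= "application" then
			delete_image(source) -- reject anything unexpected
		end
	elseif status.kind == "resized" then
		show_image(source)
	elseif status.kind == "terminated" then
		delete_image(source)
	end
end)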
Connections over these named points can detach, reattach and redirect. By doing so the system can recover from crashes and transition between local and networked operation transparently.
This has been demonstrated to be sufficient even for fairly complex assistive devices. For more on that see the article on Interfacing with a ‘Stream Deck’ Device (2019) and on experimenting with LEDs (2017).
The most successful application to date has not been its use as a desktop, but something similar to the setup in “AWK” for Multimedia (2017) — industrial machine vision verification and validation setups with hundreds of thousands of accumulated hours, multiple concurrent sources at high frame rates (1000+Hz) and real-time processing, which covered substantial portions of the overall project development costs.
Client Level
As mentioned in the engine section we have the concept of Window-Manager (WM) initiated named connection points (role) to which a client connects and announces a primary type.
These connection points are single-use and need to be explicitly re-opened to encourage the WM to differentiate and rate-limit. This works as a countermeasure to Denial of Service through resource exhaustion. It is also one of the main building blocks for crash recovery (2017) and network migration (2020).
The WM can substitute connection points for named launch targets defined in a locked-down database. This does not change the actual client code, but provides a chain-of-trust for sensitive specialised clients such as input drivers for input method engines and special assistive devices like eye trackers.
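A simplified sketch of the difference: instead of waiting on a connection point, the appl asks the engine to spawn a pre-registered target from the database (the ‘eyetracker’ target and ‘default’ configuration names are hypothetical):

-- binary, arguments and environment come from the locked-down database,
-- not from anything the connecting process claims about itself
local vid = launch_target("eyetracker", "default", LAUNCH_INTERNAL,
	function(source, status)
		if status.kind == "terminated" then
			delete_image(source)
		end
	end
)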
The WM combines connection point and type to pick a set of ‘preroll’ parameters. These parameters contain everything that is supposedly needed to produce a correct first frame; a sketch of how the WM forwards them follows after the list.
This includes:
- Preferred ‘comfortable visible font size’, font, hinting and possible fallbacks for missing glyphs.
- Target display density and related properties for precise font rasterisation.
- Preferred GPU device and access tokens.
- Fallback connection points in the event of a server crash or connection loss.
- Input and Output language preferences as per ISO-3166-1, ISO-639-2.
- Basic colour palette (for dark mode, high contrast, colour vision deficiency).
Any and all such parameters can also be changed dynamically per client.
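Roughly, forwarding a few of these during the preroll stage looks like the following (a simplified sketch; argument lists and units are trimmed down):

target_alloc("demo", function(source, status)
	if status.kind == "preroll" then
		-- density and initial size for correct first-frame rasterisation
		target_displayhint(source, 800, 600, 0, {ppcm = 68.5})
		-- preferred font, size and hinting (units and flags simplified)
		target_fonthint(source, "default.ttf", 3.75, 2)
	end
end)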
The client now has several options to announce further opt-in capabilities on top of a regular event loop:
- labelhints are used to define a set of input tags that the WM can attach to input events, as a way of both communicating default keybindings and letting the WM map them to analog as well as digital input sources. Text-to-speech (T2S) can leverage this and say, for instance, ‘COPY_AT_CURSOR’ instead of CTRL+C.
- contenthints are used to tell how much of the available content is currently visible. This lets the WM decorate or describe boundaries and control seeking, instead of clients embedding notoriously problematic controls like scroll bars, which require precise eye-hand coordination and good edge detection, and have poor discoverability with the current ‘autohide’ trends.
- state/bchunkhints are used to indicate that the client can provide alternate data streams from a set of extensions (clipboard and drag and drop are just specialised versions of this), and whether the exact current state can be snapshotted and serialised for rollback/network migration/persistence.
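On the WM side, the label path boils down to collecting what a client announces and later sending a tagged digital input back. A simplified sketch, with event and field names approximate:

local labels = {}

local function client_handler(source, status)
	if status.kind == "input_label" then
		-- remember what the client says it understands, e.g. 'COPY_AT_CURSOR'
		labels[status.labelhint] = status
	end
end

-- later, e.g. from a voice- or switch-input frontend:
local function send_label(source, name)
	if labels[name] then
		target_input(source, {
			kind = "digital", label = name, active = true,
			devid = 0, subid = 0
		})
	end
end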
The WM now has the option to request alternate representations. This was used to implement a sudo-free debugging path in Leveraging the “Display Server” to Improve Debugging (2020), and the article alludes to other purposes, such as accessibility.
While X11 and others have the idea of a client creating a resource (like a window) and mapping/populating it and so on – Arcan also has a ‘push’ one. The WM can ‘send’ a window to a client. If the window gets mapped/used, it acts as an acknowledgement that the client understands what to do with it. These are typed. In the linked article, the ‘debug’ one was used (though with some extra special sauce).
There is another, loosely defined type: ‘accessibility’. It signals that the client can provide a representation of the source window reduced to a ‘no-nonsense’ form. Thus you can have a main window with the regular bells and whistles of animated dynamic controls and transitions, while also providing a matching secondary output that carries only the attention-deficit-friendly, text-to-speech friendly and information-dense part.
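In the scripting API this push direction is, roughly, the same allocation call pointed at an existing client rather than at a named connection point, with the segment type as a tag (simplified; the debug window in the linked article is wired the same way):

-- push an 'accessibility' output to an existing client (vid);
-- if the client maps it, we get the reduced, information-dense view back
target_alloc(vid, function(source, status)
	if status.kind == "resized" then
		show_image(source) -- client accepted and populated the segment
	elseif status.kind == "terminated" then
		delete_image(source)
	end
end, "accessibility")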
Frameservers
While the scripting itself is in a memory safe environment, it is by definition in a privileged role where caution is warranted regardless — memory is never safe, wear a helmet. For this purpose, traditionally dangerous operations such as parsing untrusted data are delegated to specialised clients referred to as frameservers. This also applies to transfers carrying potentially compromising data, e.g. desktop sharing or a regular image encode where someone might be tempted to embed some tracking metadata.
Frameservers come in build-time defined sets, each designated to perform one task and to terminate with predictable consequences and repeatable results. The engine can act as though the processing they provide is guaranteed. They get specialised scripting API functions, making them easier to use in longer, complicated processing chains.
Frameservers are chainloaded by a short stub that sets up logging and debugging; system call filtering and file system namespacing; hypervisors for hardware-enforced separation all the way to network translation for per-device compartments — pick your own point on the performance versus protection trade-off.
The decode frameserver takes an incoming data stream, be it an image file, input pipe or usb camera feed and outputs a decoded audio or video stream. This is also used for voice synthesis – the data stream would be your instructions for synthesis.
The encode frameserver works in the other direction: from a human-native input stream to a compute-friendly representation. Video streaming is an obvious one, but of more interest here is OCR — a region of interest in an image goes in, text comes out.
Combine these properties and we have covered most bases for uncooperative forms of “screen reading”.
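A simplified sketch of both directions from the appl side; arguments are trimmed and the ‘ocr’ protocol string mirrors how the current OCR path is wired up rather than being a stable contract:

-- decode: spawn a frameserver that turns a device or file into audio/video
local cam = launch_avfeed("device=0", "decode", function(source, status)
	if status.kind == "resized" then show_image(source) end
end)

-- encode: feed that source into the encode frameserver asking for OCR;
-- recognised text comes back as message events on the callback
local dst = alloc_surface(640, 480)
define_recordtarget(dst, "", "protocol=ocr", {cam}, {},
	RENDERTARGET_DETACH, RENDERTARGET_NOSCALE, -1,
	function(source, status)
		if status.kind == "message" then
			print("ocr:", status.message)
		end
	end
)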
Shell and TUIs
One of the longest running subprojects in Arcan has been that of arcan-tui (dream, dawn, day:shell, day:lash) – a migration path away from terminal emulators towards better text-dominant user interfaces that follow the same synch, data and output transfer as everything else, without the terminal-emulator legacy.
On top of exposing the CLI shell to the features mentioned in Client Level above, and the prospects of getting clean separation between the outputs and inputs of shell tasks, text is actually rendered server-side.
This means that the WM understands the formatting of each line of text, not merely a soup of pixels or a sequence of glyphs. It knows where the cursor is and where it has been, and box drawing characters are cell attributes rather than actual parts of the text. Any window, including the accessibility one, can decide to provide this ‘TPACK’ format, deferring rendering to the last possible moment. With that, better composition tactics come cheap. A big win is using signed distance field forms of the glyphs to let one zoom and pan at minimum cost with maximum sharpness.
Just as the WM model is defined entirely through scripts, so is the local job control and presentation for the command-line shell. Not being restricted to terminal emulation allows for discrete parallel output streams, prompt suggestions and event notifications. These different types can all be forwarded and rendered or played back separately, with different voices at different amplitude and head-relative position.
Examples: Durden and Pipeworld
Returning to the Appl “WM” level, we have several ready to build and expand on. The longest running one, ‘Durden’, has been the daily driver workhorse for many years by now, and is the closest to reaching a stable 1.0 release, although documentation and dissemination lag behind its actual state.
While first a humble tiling window manager, it has grown to cover nearly all possible management schemes, including some quite unique ones. What is especially important here is that it is organised as a file system.
Every possible action (and there are well over 600 by now) has a file system path with a logical name, such as /global/display/displays/current/zoom/cursor, a user-intended label, a long form description and rules for validating and setting reasonable values. Keybindings, UI buttons, customised themes and presets are all just references to such paths. This makes it friendly to audio-only discovery, as nothing in the model gets hidden behind a strictly visual form. Even file browsing attaches itself onto the same structure.
Durden is the best starting point for quickly testing out new assistive input and mixed vision methods, and has quite a few of them baked in already. Examples on the internal experimentation board that have yet to be cleaned up for upstream include edge detection and eye tracker guided mouse warping and selection; content-update-aware OCR; and content-adaptive luma stepping to catch ‘flashbangs’.
The upcoming release adds text-to-speech integration, with the t2s tool supporting dynamic voices and the assignment of different roles to different voices.
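As an illustration, triggering an action programmatically is just resolving such a path, which is also what keybindings, timers and UI buttons boil down to (the value-carrying path in the second call is made up for the example):

-- trigger an action exactly as a bound key or UI button would
dispatch_symbol("/global/display/displays/current/zoom/cursor")

-- value-carrying paths work the same way; this particular path is hypothetical
dispatch_symbol("/global/settings/statusbar/position=top")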
Even more experimental is the dataflow WM, Pipeworld. Although the introduction post (2021) painted a broad picture with more fancy visual effects and its zooming-UI, it has real potential as an environment for the visually impaired for a few reasons:
- All actions yield a row:column like address, and the factory expression that sprung them into place is kept and can be modified and repeated.
- It is 2D spatial: there is a cursor-relative ‘above’, ‘below’, ‘left’ and ‘right’ that can map to positional audio for alerts and notifications, and a distance metric can be used for attenuation.
- As with Durden, there is a file system like view of all actions, but also a command-line interface for any kind of processing or navigation.
- Each row and cell has magnification-minification as part of the zooming user interface, with zoom-level guided filtration.
Trajectory
The current boom in machine learning based video/audio analysis and synthesis fits like a glove with the frameserver approach to division of responsibilities. While current OCR is based on Tesseract and the voice synthesis uses eSpeak, each integration is in the 100-200 lines of code range, in a standalone binary that can be swapped out and applied without restarting.
Adding access to more refined versions with ‘your voice’, transcription, automated translation or image content description that then apply across the desktop is about as easy as it gets. Such extensions, along with packaging and demographic-friendly presets, would be suggested points where the project can be leveraged for your own ends without the time investment it would take to get comfortable with all the other project components.
The current topic / release branch is focused on the network protocol and its supported tooling; when that is at a sufficiently high standard, we will return to a number of features with impact for accessibility and overlap with virtual reality.
Examples in that direction would be individually adapted HRTFs and better head tracking for improved spatial sound and better handling of dynamic audio device routing when you have many of them.
A more advanced one would be indoor positioning for ‘geographically bound computing’ with information being stored, ordered, secured and searched based on where you are – as a trivial example, my cooking recipes are mainly accessed in the kitchen, my doomsday weapon schematics in my workshop and my tax and banking in my home office.