Hyphens, minus, and dashes in Debian man pages

It is probably fair to say that most Linux users spend little time thinking about the troff typesetting program, despite that application's groundbreaking role in computing history. Troff (along with nroff) is still with us, though, even if they are called groff these days, and every now and then they make their presence known. A recent groff change created a bit of a tempest within the Debian community, and has effectively been reverted there. It all comes down to the question of what, exactly, is the character used to mark command-line options on Unix systems?

Last July, Sven Joachim filed a bug report regarding a change in groff, and in how it renders man pages for terminals in particular. A change to the handling of the character often referred to as "hyphen", "minus", or "dash" ("-") made many man pages rather harder to work with. To understand the problem, it's worth noting that Unicode provides a plethora of similar characters, some of which are:

    Name          Codepoint  Glyph
    Hyphen-Minus  U+002D     -
    Hyphen        U+2010     ‐
    En Dash       U+2013     –
    Em Dash       U+2014     —
    Minus Sign    U+2212     −

There are many more — Unicode is nothing if not generous in this regard. The term "dashes" will be used to refer to this class of glyphs here.
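Since these glyphs are close to indistinguishable on screen, looking at code points is the dependable way to tell them apart. The following is a minimal sketch using Python's standard unicodedata module; the DASHES list and the describe() helper are illustrative names, not part of any tool discussed here.

```python
import unicodedata

# The five characters listed above, written as escapes so that the
# source itself cannot be confused by look-alike glyphs.
DASHES = ["\u002d", "\u2010", "\u2013", "\u2014", "\u2212"]


def describe(ch: str) -> str:
    """Return a 'U+XXXX NAME' label for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in DASHES:
    print(describe(ch))
```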

The specified behavior of groff is that an ASCII "-" (Hyphen-Minus) in the input becomes a Hyphen in the output. If the desire is to use Hyphen-Minus in the output, then the input should use the sequence "\-" instead. If the author of a man page types "--frobnicate" as an option name, the output will read "‐‐frobnicate" (with Hyphen) rather than "--frobnicate" (with Hyphen-Minus). The two look the same, but there is a crucial difference. A user who searches for "--frobnicate" in a man page will not find it if the wrong type of dash is used and, if that user cuts-and-pastes an example with the wrong dash, it will not work.
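The search failure can be reproduced with a toy model of the rule just described. This sketch is an assumption-laden simplification: render_dashes() is a hypothetical helper that handles only the two inputs "-" and "\-" and ignores everything else groff does.

```python
import re

HYPHEN = "\u2010"        # what a bare "-" becomes in the output
HYPHEN_MINUS = "\u002d"  # what the escaped form "\-" becomes


def render_dashes(source: str) -> str:
    """Toy model of the dash handling only; real groff does far more."""
    def replace(match: re.Match) -> str:
        return HYPHEN_MINUS if match.group() == r"\-" else HYPHEN

    # The two-character escape is tried first so that its "-" is never
    # matched on its own.
    return re.sub(r"\\-|-", replace, source)


unescaped = render_dashes("--frobnicate")
escaped = render_dashes(r"\-\-frobnicate")

# A reader searching the rendered page types ASCII Hyphen-Minus.
print("--frobnicate" in unescaped)  # False
print("--frobnicate" in escaped)    # True
```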

As an example, one can try pasting these two lines into a shell:

    /usr/bin/echo --help
    /usr/bin/echo ‐‐help

The results from one will be rather more helpful than from the other. Use of the wrong type of dash can also break URLs and corrupt file names.
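A pasted command can be checked for look-alike dashes before it is run. The sketch below is one possible approach, not an existing tool: find_lookalike_dashes() is a hypothetical helper that flags any character in Unicode's "dash punctuation" category (Pd) other than ASCII Hyphen-Minus, plus the Minus Sign, which Unicode classifies as a math symbol.

```python
import unicodedata

MINUS_SIGN = "\u2212"  # category Sm, so it needs an explicit check


def find_lookalike_dashes(text: str) -> list[tuple[int, str]]:
    """Return (index, Unicode name) for each dash that is not ASCII '-'."""
    hits = []
    for index, ch in enumerate(text):
        if ch == "-":
            continue  # the real Hyphen-Minus is what the shell expects
        if unicodedata.category(ch) == "Pd" or ch == MINUS_SIGN:
            hits.append((index, unicodedata.name(ch)))
    return hits


# The second example line above, with its two Hyphen characters
# spelled out as escapes.
print(find_lookalike_dashes("/usr/bin/echo \u2010\u2010help"))
print(find_lookalike_dashes("/usr/bin/echo --help"))
```

The first call reports two HYPHEN characters; the second reports nothing, which is why only that command line is parsed as an option.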

Developers of free software are, of course, diligent about writing man pages; they do the job promptly, take their time to get every detail right, and can be expected to use the right kind of dash in every situation, even though the output from using the wrong kind looks exactly the same. They will surely not be bothered by the fact that a format designed to document command-line options contains a trap whereby the failure to add backslashes silently introduces problems for users who are distant in time and space.

Shockingly, this turns out not to be the case, and Linux man pages are overflowing with unescaped dashes. Years ago, the Debian project tried to address this problem by adding a check to its Lintian tool that would issue a warning when unescaped dashes were used. That check was dropped in 2015, though, after Niels Thykier concluded that it was simply being ignored: "The tag has existed since 2004 (commit fb2e7de). To date there are still 2000 packages with the issue." Since then, there has been no warning shown to Debian developers when man pages contain unescaped dashes.
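To give a sense of what such a check involves, here is a rough sketch of a scanner for unescaped option-style dashes in man-page source. It is not Lintian's implementation; the regular expression and the find_unescaped_dashes() helper are assumptions made for illustration, and the heuristic will produce false positives on some roff constructs.

```python
import re

# Heuristic: a "-" that starts a word (not preceded by a backslash or a
# word character) and is followed by another "-" or a word character.
# Hyphenated words such as "well-known" are deliberately not matched.
UNESCAPED_OPTION_DASH = re.compile(r"(?<![\\\w])-(?=[-\w])")


def find_unescaped_dashes(man_source: str) -> list[tuple[int, str]]:
    """Return (line number, line) for lines with a suspicious bare dash."""
    findings = []
    for number, line in enumerate(man_source.splitlines(), start=1):
        if line.startswith('.\\"'):  # roff comment line, skip it
            continue
        if UNESCAPED_OPTION_DASH.search(line):
            findings.append((number, line))
    return findings


sample = "\n".join([
    '.\\" comment mentioning --frobnicate',
    ".B --frobnicate",
    ".B \\-\\-frobnicate",
    "a well-known option",
])
print(find_unescaped_dashes(sample))
```

Only the second line of the sample is reported: the comment is skipped, the escaped form is correct, and the hyphenated word is left alone.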

Given the prevalence of this problem, it would arguably make sense to apply a fix at the processing level. And, indeed, groff has, for many years, duly remapped the Hyphen-Minus character (and a few others) in the man-page macros, making dash characters simply work as many would expect. That helpful behavior ended with the groff 1.23.0 release in July:

The an (man) and doc (mdoc) macro packages no longer remap the -, ', and ` input characters to Basic Latin code points on UTF-8 devices, but treat them as groff normally does (and AT&T troff before it did) for typesetting devices, where they become the hyphen, apostrophe or right single quotation mark, and left single quotation mark, respectively. This change is expected to expose glyph usage errors in man pages. See the "PROBLEMS" file for a recipe that will conceal these errors. A better long-term approach is for man pages to adopt correct input practices.
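The effect of the old remapping can be approximated after the fact on rendered text. The sketch below is only an illustration of the three character pairs named in the release note; groff performs its remapping inside the macro packages, not with anything like this, and to_basic_latin() is a hypothetical name.

```python
# Typographic output character -> Basic Latin character, for the three
# inputs named in the release note.
REMAP = str.maketrans({
    "\u2010": "-",  # HYPHEN -> HYPHEN-MINUS
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK -> APOSTROPHE
    "\u2018": "`",  # LEFT SINGLE QUOTATION MARK -> GRAVE ACCENT
})


def to_basic_latin(rendered: str) -> str:
    """Fold the typographic glyphs back to their ASCII look-alikes."""
    return rendered.translate(REMAP)


print(to_basic_latin("\u2010\u2010help"))
```

Note that the fold is lossy: once applied, a hyphen that was typographically intended can no longer be told apart from an option dash.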

Problems were indeed exposed, and users began to complain; bugs were filed and the topic showed up on the debian-devel mailing list as well. G. Branden Robinson, the upstream maintainer of groff and author of this change, defended the new behavior:

Mapping all hyphens and minus signs to a single character, as people whose blood pressure spikes over this issue tend to promote as a first resort, is an ineluctably information-discarding operation. In my opinion, man page source documents are not the correct place to discard that information.

Among other things, the information discarded by such a mapping includes whether line-breaking is allowed; Hyphen-Minus does not allow it, while Hyphen does.

Others disagreed with Robinson's position; Russ Allbery said:

My opinion is that the world of documents that are handled by man do not encode meaningful distinctions between - and \-, and man should therefore unify those characters.

Colin Watson, who maintains Debian's groff package, admitted that he had overlooked this problem when he updated Debian to the 1.23.0 release:

I was aware of the change, but it somehow fell off my list of things to make a positive decision about when packaging 1.23.0. I'm rather inclined to revert this by adding the rest of the recipe above to debian/mandoc.local (while I agree with the idealized typographical point being made, I have approximately negative appetite for the Sisyphean task of fixing an entire distribution's manual pages in practice).

A few weeks later, he said that his plan was to leave the change in place during the current Debian 13 ("Trixie") development cycle, but then to revert it prior to the pre-release freeze to avoid inflicting problems on Debian's users. That would, in theory, give developers time to fix as many of the problems as possible. After the discussion went on for a while, though, he changed his mind, stating that he was unwilling to have his inbox filled with this discussion for the next year. So the remapping of "-" has been reinstated into Debian's version of groff.

This little episode may well be repeated in other distributions as they catch up with the groff 1.23.0 release. It is also probably not finished within Debian. This situation brings together the problems of documentation writing, typographic correctness, and Unicode look-alike code points, all of which are fertile ground for disagreement. The hopes that removing the remapping in groff would lead to the fixing of all those man pages may have been dashed, but that does not bar another attempt in the future.