The sad state of PDF-Accessibility of LaTeX Documents (2016)
umij.wordpress.com
The thing is: LaTeX might try hard to look like a declarative language for structured documents, but it is not. It is a set of TeX macros. And TeX is a typesetting system.
There is no good reason to put the accessibility into the typesetting. Instead, use a declarative (e.g., any markup) language, translate that a) to LaTeX and b) to accessibility annotations, and then combine the two results. Problem solved.
Unfortunately you will either lose a lot of expressiveness along the way or you have to find a very sophisticated markup language.
Why not simply augment LaTeX with PDF tags which would be inserted manually, in the process of typesetting?
Something like:

    \pdftag{blah}

Common packages could then generate these tags, and very few modifications of TeX source would be needed.
Yeah, one could do that. But then again, one could do that with any other scripting language. A true declarative document would mean a single source of truth and freedom from those technical matters.
An answer, particularly in the sciences, is to also distribute the source *.tex files, which being plain text with markup, can be handled just fine by things like emacspeak, or accessibility tooling for other sensible editors.
This comes up a bit around the blind accessibility issue for mathematics, which is why I suspect it's bubbling up this week on HN.
Standard maths notation is sight first and sight only.
If you want maths for the blind you need to convert to something like s-expressions, which emacspeak can read perfectly.
There's a bunch of counter-argument here from working practising blind mathematicians who read and write raw TeX every day, so there's at least a non-zero audience.
There were a bunch of mathematicians who used Roman numerals for arithmetic between 0 AD and 1200 AD. That they existed is not a counter-argument to the fact that Roman numerals are terrible to do arithmetic in. It is an argument that people can get used to anything, and become proficient enough at it that changing to a new - and better - system would set them back enough for it not to be worth doing - for them. The same is true for modern maths notation.
I am not blind. Neither am I a mathematician. I'm a sighted biomedical engineer.
I have heard blind mathematician colleagues very loudly espouse using LaTeX for everything they do. I'm inclined to believe them!
Speaking of accessibility for math equations, there is some really good work done in MathJax 3: http://docs.mathjax.org/en/latest/basic/a11y-extensions.html... slides from presentation: https://mathjax.github.io/papers/CSUN19/csun2019_talk.pdf
You can try on any of the demos on this page: https://mathjax.github.io/MathJax-demos-web/ (right click on an equation to enable the accessibility options)
It's more than that.
I have read more than one or two IT papers which denoted variable types by using different fonts for the same letter (an A in one font versus an A in another), and that difference was essential for a reasonable understanding of the papers.
Then the notation is overloaded.
Brackets do not even necessarily enclose something, and are not necessarily well balanced.
"Bar"-based brackets and bars used for non-bracket purposes are not necessarily differentiated by non-visual clues. Etc.
It's already often confusing for people who can read the formula, so I'm not surprised that it's really annoying for people who can't normally read the formula.
A super simple example: [a;b[ is the "German"/(EU?) style to write a range with inclusive start and exclusive end. In the US, [a;b) is used instead. But let's be honest, something like rangeIE(a, b) or similar (LaTeX range_{ie}) is much better for a screen reader, I think, and that's a trivial example, not one of the really bad ones.
I think all formulas should be written in a way which represents semantics, not visuals, and then be compiled to a classical visual representation (maybe using some additional non-semantic style annotation block).
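A minimal sketch of that idea in LaTeX terms (the macro name \rangeIE is made up for this example, not an established convention):

    % semantic command: range with inclusive start, exclusive end
    \newcommand{\rangeIE}[2]{\left[#1,\, #2\right)}
    % switch the body to \left[#1;\, #2\right[ for the EU-style rendering
    % usage: $x \in \rangeIE{a}{b}$

A tool working on the source could then, in principle, announce the command name instead of guessing at bracket conventions.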
A friend of mine works with a professor that defines: \be -> \begin{equation}, \ee -> \end{equation}, \ga -> \gamma, \gm -> \gamma, \s -> \section, and so on
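For reference, a sketch of how such shorthands are typically set up in a preamble (the exact definitions here are assumed, not taken from that professor's files):

    \newcommand{\be}{\begin{equation}}  % widely discouraged, but it does work
    \newcommand{\ee}{\end{equation}}
    \newcommand{\ga}{\gamma}
    \newcommand{\gm}{\gamma}
    \newcommand{\s}{\section}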
Personally I think that latex should produce pdf documents with better mappings so that copy-paste and latex-paragraphs are preserved, even if obviously it will still get messed up in complex layouts
Even though LaTeX is still not very close to producing perfectly accessible PDF documents, there is some recent work towards this goal.
- https://ctan.org/pkg/accessibility
I am using the former for some personal documents and found that it improves text selection and copying on Apple devices. (This could be related to how PDFKit handles text.)
Edit: formatting.
The "accessibility" package, which its own author says not to use any more (or at least not until foundational problems are fixed), is mentioned in the article.
Converting LaTeX to HTML may be a route to making it accessible. I'm working on this project: https://github.com/arxiv-vanity/engrafo
It's 80% of the way there, but with 80% more work it could be a pretty complete implementation.
It powers this: https://www.arxiv-vanity.com/
If you want accessibility, it would be better to convert your content to XML and run the LaTeX through MathJax first, using accessibility extensions (https://mathjax.github.io/MathJax-a11y/docs/). Then use a third-party converter such as PrinceXML to generate the PDF from the XML.
One pattern I like to use is to write my documents in Markdown, which can be compiled into PDF via LaTeX with a template of my choosing. It is also capable of compiling to other formats which may be more accessible, such as plain text, HTML, and docx.
[edit to add link to pandoc]
The "accessibility" LaTeX package maintainer is looking for help in this area: https://github.com/AndyClifton/accessibility/issues/42
Ross Moore did a presentation about accessibility and PDF at the 2020 TUG online conference.
The author writes:
"Did I mention that both Word and LibreOffice generate tagged PDFs?"
But then the simple solution is this: Convert your LaTeX to Word or LibreOffice. Then generate the PDF.
Absurdly, the easiest way to convert LaTeX to Word/LibreOffice is to create a PDF first (https://tex.stackexchange.com/questions/111886/how-to-conver...), import that into Word/LibreOffice, and then create your PDF/A from that.
Why must the PDF encapsulate all requirements? My understanding of accessibility requirements is that you must have a version that is amenable to automatic speech, not that all versions must be.
It doesn't have to. But ideally you only want to have to produce and distribute a single file.
Also, a lot of people don't consider accessibility when generating PDFs. If LaTeX produced accessible PDFs, the PDFs would be tagged automatically without the author even thinking about it. Of course, it might still not be perfect, but it would be a lot better than the status quo.
1. Needs (2016) in the title.
2. Even by 2016, pdfTeX had been largely superseded by LuaTeX.
3. The author bizarrely links to "the mess" of the literate source of TeX the program as a WEB file rather than as a typeset document.
4. AIUI, the source code of the TeX engine has nothing/very little to do with adding tags to PDFs, which is the job of LaTeX packages. Admittedly, understanding and writing their source code is a rarer skill than reading the literate source of TeX.
> 2. Even by 2016, pdfTeX had been largely superseded by LuaTeX.
Is that true? My experience was that pdftex is by far the most used one, but while thinking about it, I noticed I have zero data to back that up.
The overwhelming majority of people writing today use pdflatex. But at this point lualatex is pretty fast, and is starting to accumulate packages that leverage it to do good things. It is starting to build momentum.
I can't believe MathML just died and it's like not even part of the conversation about the history of math markup.
MathML isn't dead! Igalia have been doing some pretty great work on getting it upstreamed into Chromium, where there has been no MathML implementation (in contrast to Firefox) for some time.
>Take your average computer science graduate from the last ten years. Do you think anyone would be remotely able to understand what is going on there?
Yes, you can literally read the literate program of TeX and understand what's going on: http://brokestream.com/tex.pdf
I had never learned Pascal, but I managed to edit and compile TeX successfully, and it was easier than trying to understand any of my own non-literate programs.
>My point being that if we wouldn’t rely on TeX itself and use ANT (or whatever alternative) which is written in the quite elegant OCaml, than hacking it would be at least possible for mere mortals. Although I have to admit, despite being in love with OCaml since my PhD days, it’s also a quite niche language. But imagine if the whole thing was written in Python, or at least C.
Imagine if software engineers were actual engineers instead of glorified script kiddies.
>I wish someone would design a new space shuttle because while it's a neat project I only understand MKS units and it's too much effort to use a calculator for converting between them and Imperial units.
It does seem that the OP constructs a bit of a strawman by claiming TeX's source code is too complicated for an average computer science graduate. I doubt even the top 5% of computer graduates could more readily understand the source code of most common programs.
Well, Knuth's programming style is idiosyncratic, to say the least. But in the early 80s, in advance of the publication of The TeXBook, I learned enough TeX from reading the literate source for TeX to put together a 100-page software manual.
I don't really agree with Knuth's version of literate programming, but one can hardly fault the TeX source for being unreadable.
> Imagine if software engineers were actual engineers instead of glorified script kiddies.
Maybe you shouldn't need to be an engineer to write a document with some formulas inside?
We're not talking about the difficulty of writing a document using LaTeX, which I think we can agree is more or less proportional to the degree of control one has over the finished product.
What's at stake here is the difficulty of editing the source code of LaTeX, which would be necessary in order to make the resulting pdf files more compatible with screen readers for the visually impaired.
Back in the day we had engineers writing documents from markup. We called them typesetters. It was a qualified job; you don't expect the task to be something Joe Random could do with 6 weeks of training.
> Back in the day we had engineers writing documents from markup.
Back in the day secretaries were creating electronic documents with troff and nroff:
> The first version of Unix was developed on a PDP-7 which was sitting around Bell Labs. In 1971 the developers wanted to get a PDP-11 for further work on the operating system. In order to justify the cost for this system, they proposed that they would implement a document-formatting system for the Bell Labs patents department[1]. This first formatting program was a reimplementation of McIlroy's roff, written by Joe F. Ossanna.
OP, here. It's not that long ago, I've used troff myself. Or maybe it was that long ago, but it just seems like yesterday.
Back in the days operating a computer also needed weeks of training and was a qualified job. That doesn't mean it always has to stay that way.
At risk of gatekeeping, it doesn’t sound like typesetters merit the title of engineer.
That's because it wasn't.
1 + 1 = 2
Aren't formulas easy to insert in plain text?
The problem with maths notation is that it was invented by sighted people for sighted people as a short hand for very complex ideas, which at the time weren't fully understood.
If you want to type equations, use s-expressions; the clarity you get from saying what you mean is astonishing.

    (define integral
      (lambda (function)
        ;; Implementation of Risch algorithm left
        ;; as an exercise to the motivated reader.
        (error "Risch algorithm not implemented")))   ; stub body so the definition is valid Scheme

    (define definite-integral
      (lambda (lower-bound upper-bound function)
        (let ((anti-derivative (integral function)))
          (- (anti-derivative upper-bound)
             (anti-derivative lower-bound)))))

Hey look, you don't need the dummy variable explicitly any more. It's almost like it's a relic from a time before people understood what function application actually was.
If you want the implicit mess (and incredible power) of higher maths, be prepared to deal with the mess of typography. Which is why you need TeX, or worse.
> Hey look, you don't need the dummy variable explicitly any more. It's almost like it's a relic from a time before people understood what function application actually was.
If you actually perform definite integration, as a limit of Riemann (or Lebesgue, if you like) sums, or even try to approximate it numerically, you're going to have some dummy variable cropping back up again. As your comment indicates, when Risch's theorem proves that there is no elementary anti-derivative, you're going to be out of luck.
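For concreteness, here is the standard Riemann-sum form of the definite integral (nothing beyond a textbook statement); the dummy variable is baked right into it:

    \int_a^b f(x)\,\mathrm{d}x
      = \lim_{n \to \infty} \sum_{i=1}^{n}
        f\!\left(a + i\,\frac{b-a}{n}\right)\frac{b-a}{n}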
You're skimping on some of the complexity by equating a definition with a particular way of computing it, but that's completely inadequate for mathematics, as there are many things we can't compute (either in theory or in practice).
In particular, your definition of integral assumes that integrable functions always have an antiderivative, which is wrong.
I find it odd that you pretend a meta-mathematical question is a mathematical one.
You are dismissing constructivism as not mathematics, and you ignore that the halting problem is a way to deal with a class of results which include non-existence proofs by running an algorithm forever.
I can easily create calculations that will never return results which classical mathematics says are impossible, by the simple fact that they never return any results at all.
The Risch algorithm being a complex example, finding the square root of two in the rational number domain being a simple one. I can still deal with them as though they return results in all calculations though, without the need for baroque semi-mystical notation. Unless you want to claim some sort of divine human essence not present in Turing Machines and Lambda Calculus which lets us transcend their computational capabilities?
I literally don't understand what the hell you are talking about.
But this has nothing to do with constructivism. Even if you only allow constructive definitions and proofs, there is still a world of difference between the definition of an integral and the result you get from evaluating it.
Yeah, sure, in theory you can represent an integral as a function that takes another function and two boundary points and returns a value...
But first, it may not be possible to determine the value of the integral exactly because there is no known method of doing so (the Risch algorithm, apart from it basically being so complex that it's implemented almost nowhere fully, only works for elementary functions!).
And second, if integrals are "just functions", you lose the ability to manipulate them according to known theorems, e.g. additivity, triangle inequality, Cauchy-Schwarz, convergence theorems, ...
So yeah, here's where I get the feeling that some people should do some more maths and spend good parts of their days proving theorems and playing with definitions before they start complaining about how dumb its language is.
>So yeah, here's where I get the feeling that some people should do some more maths and spend good parts of their days proving theorems and playing with definitions before they start complaining about how dumb its language is.
Some people have a PhD in mathematical physics and wrote the higher function code of axiom. I guess those people would be difficult to understand for non-experts.
Ok, I misjudged your experience apparently, but you could still do a better job actually engaging with the arguments.
You sound like a third year maths student who has just been taught the Lebesgue integral and has decided that it is the _real_ definition of definite integration. Quite frankly I don't have the energy or inclination to have adversarial arguments with people who don't understand what I'm saying. Maybe talk to your professors about the generalizations of integration and why none of them are the 'real' way to integrate a function.
Also the integral procedure defined at the top isn't a function, it's an operator. It returns functions as results, not values.
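In symbols, that's the distinction between the operator and the number it ultimately lets you compute (mirroring the two Scheme definitions above):

    \mathcal{I} : f \longmapsto F \quad \text{with } F' = f,
    \qquad \int_a^b f = F(b) - F(a)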
I don't know Lisp, so maybe there were more nuances in your code, but often you want to analyse an integral in symbolic terms (for whatever integration definition you are using).
Expressing the integral operator as a function in code is at odds with how people usually think about functions and code.
The only language I know that properly manages to represent integrals as code is Wolfram Mathematica by using rich rewrite systems.
That is, Integrate(f,a,b) is not code but a data structure to be interpreted by an external (and customizable) integration context that defines numerical types, algorithms, laziness, etc.
From the links, and what I know of Wolfram Mathematica and Lisp, this could well be what you meant, but it is quite different from giving a single integration algorithm.
Many (if not all?) CAS have some internal tree representation of mathematical structures, Mathematica is not the only one. I worked on such a system myself. To define data types for your expressions and then evaluate them via different algorithms is quite natural. So yes, we also used something like Integral(f, from: a, to: b) and then had like a gazillion techniques to actually evaluate that.
Proof assistants do something similar btw, they also encode mathematical expressions as (often recursive) data types and then prove things about those definitions.
edit: to be fair though, the Lisp implementation proposed to use the Risch algorithm, which actually does give you symbolic antiderivatives. So that wouldn't be a valid critique of the implementation. The more salient points are a) that the Risch algorithm only works for a certain class of functions (those that have an elementary antiderivative) and b) that by not separating the definition of an integral from its evaluation, you're not able to manipulate it directly as an expression or to evaluate it via different methods (e.g. symbolic vs. numerical methods).
Just think about inputting $complicatedIntegral - $complicatedIntegral. This is clearly zero, but if your integral is "just a function", you're not able to see that and will spend an unreasonable amount of time computing it (twice, even), or worse, will fail to produce a result.
> b) that by not separating the definition of an integral from its evaluation, you're not able to manipulate it directly as an expression or to evaluate it via different methods (e.g. symbolic vs. numerical methods).
I was mostly referring to this. Mathematica is simply my only experience with this kind of approach.
> In particular, your definition of integral assumes that integrable functions always have an antiderivative, which is wrong.
I was objecting at the same time as you were, but I don't think this is the right objection. It's true that not every integrable function has an elementary anti-derivative, but every integrable function f does have an anti-derivative F, at least in the sense that F is almost everywhere differentiable, and the derivative is almost everywhere equal to f. (And, of course, if f is continuous, then F is everywhere differentiable, and its derivative everywhere equals f.)
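Concretely, the F in question can be taken to be the indefinite integral (a standard construction, stated here only for reference):

    F(x) = \int_a^x f(t)\,\mathrm{d}t

By the fundamental theorem of calculus, F'(x) = f(x) wherever f is continuous; the Lebesgue differentiation theorem gives the almost-everywhere statement for merely integrable f.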
If you say "at least in the sense that F is almost everywhere differentiable", you're already redefining "anti-derivative" to some extent, I feel. But this is just arguing over semantics. Even then, I think your claim that every integrable function has a "generalized antiderivative" is also only true for the Riemann integral. The Dirichlet function is Lebesgue integrable, but it doesn't have an antiderivative even in this weaker sense.[^1] And mathematicians generally prefer the Lebesgue integral.
I think the more important insight here is that integration fundamentally isn't defined through the anti-derivative, and that the two notions are actually related is a deep theorem, rather than just a definition.
And the fact that non-elementary antiderivatives exist is interesting in theory, but in practice you can't use them directly for anything. In particular, in practical situations you will often use numerical methods to integrate a function which will not be based on any notion of anti-derivative at all.
[^1] Edit: I think I was wrong here. If you take the function identically zero, then its derivative is identically zero and as such equal to the Dirichlet function almost everywhere. So this is not a counterexample. I still think it's weird to call that an "antiderivative" though.
My favourite anti-derivatives of the constant zero and Dirichlet function are monotonically increasing.
> If you say "at least in the sense that F is almost everywhere differentiable", you're already redefining "anti-derivative" to some extent, I feel. But this is just arguing over semantics.
It is, but let's! Before we re-define the anti-derivative, we'd have to define it. A sensible definition is: a function F is an anti-derivative of a function f if F is everywhere differentiable, and if F' = f everywhere. By this definition, not every integrable function has an anti-derivative.
On the other hand, we could also just choose to define—not re-define!—an anti-derivative of f to be a function F that is almost everywhere differentiable, and such that F' = f almost everywhere. This definition is more complicated, but also more inclusive; and it handles everything the old definition could.
In this respect it is, and it's no accident, exactly like the Lebesgue integral vis a vis the Riemann integral. Lebesgue's integral has a more complicated definition than Riemann's, and we could call it a re-definition; but, since it handles everything that Riemann's does (with the same answer), we could say in retrospect that Lebesgue's was the correct definition, and Riemann's was just the special case we happened to discover first.
> I think the more important insight here is that integration fundamentally isn't defined through the anti-derivative, and that the two notions are actually related is a deep theorem, rather than just a definition.
Certainly I agree with this!
> And the fact that non-elementary antiderivatives exist is interesting in theory, but in practice you can't use them directly for anything. In particular, in practical situations you will often use numerical methods to integrate a function which will not be based on any notion of anti-derivative at all.
Here again I'd argue over semantics, though I'd concede it's much more a matter of personal preference than my argument above, which I think has mathematical weight behind it. Namely, I'd argue that the numerical integration is doing something directly with the non-elementary anti-derivative, namely, evaluating it at a point—just like we call reading off the value of, say, the sine of an angle from our calculator doing something directly with the sine, even though what we're really doing is summing sufficiently many terms in a Taylor-series approximation.
> [^1] Edit: I think I was wrong here. If you take the function identically zero, then its derivative is identically zero and as such equal to the Dirichlet function almost everywhere. So this is not a counterexample. I still think it's weird to call that an "antiderivative" though.
I agree that it's not a counterexample for the reason you say, and there's no arguing with perceptions of something being weird; it certainly is counter to intuition built out of Riemann integrals. And yet, if we didn't steel ourselves to handle this weirdness, we'd have to say that it didn't have an anti-derivative at all; and why artificially restrict our theorems to match our intuition, rather than expanding our intuition to meet our theorems?
As to your first point: In the sense that it's useful to say "every integrable function f has some antiderivative F so that you may compute the integral by computing F at the endpoints", yes, your definition can be useful. On the other hand, it's also an important question to consider "which functions can be derivatives?" and in that sense, the definition is less useful. But definitions are definitions; the most we could objectively argue about is which one is the more standard one.
> Here again I'd argue over semantics, though I'd concede it's much more a matter of personal preference than my argument above, which I think has mathematical weight behind it. Namely, I'd argue that the numerical integration is doing something directly with the non-elementary anti-derivative, namely, evaluating it at a point—just like we call reading off the value of, say, the sine of an angle from our calculator doing something directly with the sine, even though what we're really doing is summing sufficiently many terms in a Taylor-series approximation.
Fundamentally, at a mathematical level, yes. That's what it means for two definitions to be equivalent. But on an algorithmic level, the process of evaluating an integral numerically and the process of finding an antiderivative (especially symbolically) are quite different things.
But in the end, it doesn't seem that we fundamentally disagree.
You don't.
Maybe, to edit such a complex program that you're using to convert your equations to some other pretty format, you should.
Imagine how much further ahead HTML/CSS would have been if the academic crowd abandoned latex 20 years ago.
Revisit this comment in 5 years.
It wasn't because mathematicians were reluctant to change that math in HTML didn't take off. Rather, it was because browser developers were loath to implement and maintain the enormous and complex pile of code that is MathML, and they said "Why should we, if you mathematicians already have LaTeX?" [0]
Firefox has had MathML support for a long time. Complain to Apple and Google (and vote with your browsing activity, by using the browser that is less driven by commercial considerations).
Safari has supported MathML since 2011. Though apparently the implementation is somewhat buggy.
The real issue is the lack of MathML support by Chrome (and until recently, Edge)
Yes, Chrome had some MathML support but removed it (I think as part of their forking of Blink from WebKit).
Believe it or not, HTML, CSS, and MathML are still going pretty strong in the educational publishing industry.
MathType supports TeX input. MathJax accepts TeX input. The technology is already there, there's just very little mindshare because no one cares about accessibility.
I had a realization a while back that, in my opinion, LaTeX isn't really needed anymore. Pretty much anything you can do with LaTeX, you can do with HTML. Want a PDF? Most browsers will print to PDF now, or you can use a library like this:
https://github.com/dompdf/dompdf
Need a page break? Here you go:
https://developer.mozilla.org/Web/CSS/break-after
I'm not sure what you would do about TikZ and stuff like this, but I have seen some pretty wild stuff in CSS, so surely it's possible:
> I had a realization a while back, that in my opinion LaTeX isnt really needed anymore
Am I correct in assuming that you are not working in academia/on research? In that case, I would argue that from your point of view latex was never needed. On the other hand, if you are working on math/cs/physics research latex is indispensable...
Not even considering the math formatting, HTML is still lacking good footnotes, bibliographies, glossary generation, index generation, and table of contents generation. Browsers also render things atrociously compared to a LaTeX PDF.
There are third-party tools that do all of this.
A LaTeX-generated PDF does not render correctly at all for a blind user.
I'm not saying latex is exemplary in all forms. (I would also like to know if the big browsers render all web pages as accessible PDFs, though).
I simply meant that a plain html document + the browser leaves much to be desired for even non-technical documents.
Tbh, I would like to see a more advanced and open HTML-based ecosystem for documents. LaTeX has many warts, but also a lot of features.
> There are third-party tools that do all of this.
The things that today do this will be all gone in ten years, or replaced with other things that will in turn be replaced ... LaTeX has been here for a long time, and has been strikingly stable.
I wrote my bachelor thesis in markdown with inline mathtex(?) and compiled it from there with pandoc.
I also tried from/to HTML and the result is just bad. HTML isn't suited at all to write scientific documents with proper formatting.
Also I still need to do some things directly in latex and include them with inline latex in markdown.
So no, HTML+CSS isn't suited for this at all.
At least for now, while scientific papers still target real DIN A4 paper.
Maybe it's time to change that. People read papers a lot on 24" monitors, tablets, laptops and e-book readers, and for all of these the current formatting sucks.
You can definitely generate good scientific PDFs from HTML and CSS, combined with tools to generate SVGs from your LaTeX/MathML. I've converted a fair number of textbooks from print to high quality digital and print hybrid PDFs.
The learning curve is pretty high though. If you're not a web developer already, there are better options out there.
I get that things like proper kerning are probably doable with CSS.
LaTeX still does a better job out of the box, which goes a long way for what I understand to be its typical use cases (resumes or long-form reports).
No need for (much) CSS, browsers already support vector graphics: SVG!
In Pandoc, what you would do is use a general LaTeX plugin; TikZ supports PNG or SVG output via 'standalone' (https://tex.stackexchange.com/questions/51757/how-can-i-use-...), and you can either save that to a file and use it as an image, or inline it.
This uses LaTeX at compile-time, but considering the extent to which TikZ is a graphics DSL, I wonder how hard it would be to implement a TikZ->SVG compiler as a standalone tool in a different language? (Or make it available in a variant like MathJax? Like https://github.com/kisonecat/tikzjax except without running an entire TeX engine in the user's browser.)
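For reference, a minimal sketch of the 'standalone' route mentioned above (the document class option and the SVG conversion tools are from memory, so double-check them against the standalone and dvisvgm docs):

    \documentclass[tikz]{standalone}
    \begin{document}
    \begin{tikzpicture}
      % simple axes plus a parabola, just to have something to export
      \draw[->] (0,0) -- (2,0) node[right] {$x$};
      \draw[->] (0,0) -- (0,2) node[above] {$y$};
      \draw[domain=0:1.8, smooth, thick] plot (\x, {0.5*\x*\x});
    \end{tikzpicture}
    \end{document}

Compile to PDF with pdflatex and convert with something like pdftocairo -svg or dvisvgm --pdf; the resulting SVG can then be referenced or inlined from the Pandoc output.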
Well, LaTeX is great for mathematical symbols. I have not seen a typesetting system in the web space that can compare to LaTeX. This is why scientists from the exact sciences use it.
MathJax is routinely used for inserting equations into HTML documents [0]. A side benefit is that it uses LaTeX-like syntax for defining equations.
Static site generators like Sphinx, Hugo, and Jekyll have support for MathJax which allows for inline equations in Markdown/RsT docs. See a Sphinx example here [1].
[1]: https://www.pflotran.org/documentation/theory_guide/mode_th....