Preview in macOS Big Sur is destroying PDFs

annoying.technology

359 points by matrixagent 5 years ago · 325 comments

mulmen 5 years ago

I have learned to be scared of my MacBook. Seemingly safe behavior can cause permanent damage. It does completely unexpected things, apparently by design.

I do not put my pictures in the ~/Pictures directory for fear of what the newest app will do to “improve” them for me. I fully expect it to apply lossy compression to my files without asking. This is after Photos or whatever it was called at the time mangled the dates on a bunch of my vacation photos to 10 years before the actual trip.

Oh and have fun when your photos are automatically uploaded to iCloud to save space locally then silently deleted from iCloud to... save space? My sister lost her first year of baby pictures to that one.

Same with ~/Music after iTunes wiped out a bunch of carefully curated metadata. Yes, I did want that album art.

I fat-fingered some key combination in Messages recently and got a prompt confirming I wanted to delete the entire conversation history. I consider myself lucky it bothered to ask.

I can add “view a PDF” to the list of things likely to leave me holding the bag.

  • cle 5 years ago

    I have run into that Messages fat-finger-delete multiple times, it is infuriating! I still don't know what the key combination is, but IIRC the confirmation defaults to "Delete" when enter or space are pressed, which are...quite common when sending short messages.

    • tekacs 5 years ago

      They've finally removed the keyboard shortcut for this in Big Sur. :)

      It was Cmd + Delete/Backspace before.

      • OkGoDoIt 5 years ago

        Which on Windows is the key combination to delete the most recently typed word, and is basically muscle memory to replace spelling mistakes for me...

    • mulmen 5 years ago

      After my latest iCloud password change Messages has also been giving me the beachball of doom when images are received. I'm terrified of what that implies. I'm looking forward to the announcement of the exploit where a carefully crafted image owns MacBooks.

  • FireBeyond 5 years ago

    Multiple people are complaining that Big Sur is blowing out their speakers on the laptops.

    I submitted something yesterday where Big Sur completely breaks DSC for all non-Apple monitors (and in some cases, even those).

    Oof.

  • p1necone 5 years ago

    The more I learn about macs the more I think the "it just works" crowd really mean "I will sacrifice my system 'working' sometimes in exchange for having zero configuration options".

    • iansinnott 5 years ago

      In the past it really did just work and macs were configurable. The trend of limited configuration is more recent (and yes, it's terrible).

    • netflixandkill 5 years ago

      For a decade or so it pretty much did just work. Alas, nothing gold (or metallic shades of white in this case) can stay.

  • norswap 5 years ago

    Apple, who sold hardware because they had the best software, now sells software because they have the best hardware.

    Funny how things change.

  • dkonofalski 5 years ago

    Those things literally cannot happen without your intervention... ಠ_ಠ

    • kalleboo 5 years ago

      Yeah, macOS does not touch files in Pictures or Music, only files that you've explicitly imported into the Photos or Music/iTunes apps. And it definitely doesn't silently delete photos from iCloud; if that were common, it would be a major bug that would make the news.

      • dkonofalski 5 years ago

        It's interesting to me how people can upvote comments that are pretty much made up nonsense but comments like yours and mine are downvoted because people just want to hate on Apple.

  • robertoandred 5 years ago

    Do you have proof of iCloud deleting photos?

    • meibo 5 years ago

      I assume these are a result of bad UX, at least in my personal experience.

      iPhones used to (and may still; I haven't had the pleasure in a year now) bother you quite heavily to either upgrade or clean out your iCloud storage once you hit the cap. It's not a stretch that some users might not think long enough about the consequences.

      • mulmen 5 years ago

        Yeah I think the deleted from iCloud thing is bad UX. I don't recall the details.

        The ~/Music and ~/Pictures directories are basically owned by Apple though. I don't trust their apps with my files, especially when the rules change between OS versions.

  • Yetanfou 5 years ago

    Install 'Linux' [1] on the machine then? That is, assuming you're using a model which is supported by some form of Linux. That way you get to use the hardware without being bitten by the software. Linux distributions are not perfect either but they offer fewer such 'surprises'. Keep MacOS around for those times you need to run software which is only supported there but do your main work in Linux.

    [1] where 'Linux' stands for any supported Linux distribution

xxpor 5 years ago

Why anyone treats PDF as anything but a write-once format is beyond me. It's so finicky that I'm not shocked bugs like this happen. The only programs I'd be reasonably sure wouldn't screw it up are Acrobat itself, and pdflatex and friends.

I think we need a multi-image container format. It could be something that's literally a bunch of jpgs/pngs/pick your poison in a tar container, and given a new extension. OSes would open it and present it as a gallery in order. There's no value in a non-ocr'd PDF existing. For OCR'd text that gets more complicated, but it feels like we should be able to come up with a common denominator that doesn't have the legacy of a binary format derived from postscript in the early 90s.
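
A minimal sketch of the "images in a tar container" idea, assuming pages are displayed in sorted order of member name. The function names `write_gallery` and `read_gallery` are hypothetical, and no such standard format or extension exists; this only shows how little machinery the container itself would need.

```python
import io
import tarfile
from typing import Dict, List, Tuple


def write_gallery(pages: Dict[str, bytes]) -> bytes:
    """Pack named image payloads into an in-memory, uncompressed tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(pages):
            data = pages[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # tar headers must carry the payload size
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_gallery(archive: bytes) -> List[Tuple[str, bytes]]:
    """Return (name, payload) pairs in display order (sorted by member name)."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        members = [m for m in tar.getmembers() if m.isfile()]
        members.sort(key=lambda m: m.name)
        return [(m.name, tar.extractfile(m).read()) for m in members]
```

A viewer would call `read_gallery` and hand each payload to its image decoder. Zero-padded names such as `001.jpg` keep plain string sorting in page order; anything carrying OCR'd text would need an extra sidecar member that this sketch does not cover.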

  • systemvoltage 5 years ago

    PDF is one of the best things IMO - it's like a docker container for documents. The way the original authors intended it to be, including fonts and all the things that go into making a document.

    For example, I personally like to purchase books that are in PDF format, not epub/mobi. I want to rely on professional typesetting from the publishers, not some front-end engineer's vision of what the ebook should look like and how it should be presented to the user. Reflowable formats work only for novels and long-form reading, where typesetting is not critical. Basically any book that also has a good audiobook version works fine with epub/mobi since visual formatting is a non-issue. For everything else, from programming books to research papers, PDF is great. Can the PDF format be improved? Sure, but the level of adoption and its widespread use is more important than fixing copy paste and content migration aspects of PDFs.

    What I absolutely DO NOT WANT - is web page like format with auto-flowing text and something that fits to the screen with user styling/typesetting. That IMO defeats the purpose of what the container is supposed to do, i.e. contain and not leak. It should most definitely have fixed physical dimensions (or pixels).

    • wtallis 5 years ago

      > PDF is one of the best things IMO - it's like a docker container for documents. The way the original authors intended it to be, including fonts and all the things that go into making a document.

      Sounds like you want the PDF/A standard for archiving: https://en.wikipedia.org/wiki/PDF/A

      It forbids embedding audio and video and JavaScript, requires embedding all the fonts, forbids encryption and patent-encumbered compression. It's basically the PDF format with the most regrettable features stripped out.

    • inetknght 5 years ago

      > I want to rely on professional typesetting from the publishers, not some front-end engineer's vision of what the ebook should look like and how it should be presented to the user.

      I, on the other hand, do not whatsoever trust publishers. PDF runs software and that means that everything under the sun including malware and DRM can run on PDF -- and indeed that has occurred many times. It should be a non-starter for anyone who actually values the ability to read their content on any device they want.

      • Wowfunhappy 5 years ago

        > PDF runs software and that means that everything under the sun including malware and DRM can run on PDF -- and indeed that has occurred many times.

        You can basically avoid this problem by only opening PDFs in programs like Preview and Evince, and never Acrobat.

      • systemvoltage 5 years ago

        There are many other formats, a myriad of text formats from asciidocs to markdown. For embedded devices and such, they work great I presume.

        Please don't mess with PDFs. 20+ year old technology that is shitty but works, I prefer keeping the status-quo.

        • x86_64Ubuntu 5 years ago

          If PDF had an "accident", and someone was able to do to it what Steve Jobs did to Flash, I would be very happy.

          • systemvoltage 5 years ago

            Are there any other "fixed" or "docker-like" formats for documents besides PDF that solve a lot of the problems with the PDF format? From what I understand, PDF is an xml based document standard.

            • pjmlp 5 years ago

              Nope, PDF doesn't have anything to do with XML.

            • kall 5 years ago

              Are web pages/archives really less fixed than PDFs if you don't want them to be? I mean I know they are in practice, but I don't see why you couldn't use them as a fixed document format.

              Some contracts may need to be upheld between creator and consumer. If the consumer doesn't want to, well, that's freedom, isn't it?

              Of course it is the most code executing document format to ever execute code, so if that was the issue...

              Is PostScript (that's what PDFs ultimately are, right?) capable of things that HTML+CSS+SVG isn't?

              Edit: Oh no I completely blanked on the fact that web pages literally don't have pages. Seems like a big one.

              • simion314 5 years ago

                > Are web pages/archives really less fixed than PDFs if you don't want them to be? I mean I know they are in practice, but I don't see why you couldn't use them as a fixed document format.

                Web content is not that fixed; the same code can render differently in different browsers, which is why we sometimes need CSS reset rules to make things render as similarly as possible.

                What is great about PDF is that it should look and print the same everywhere. I always have issues with, for example, .doc files that don't look and print the same on my machine as on the sender's, even when it is just a few lines of normal text, nothing fancy.

                So a PDF replacement should be identical on all machines, which means an extremely strict specification with no place where an implementation could deviate by a pixel.

              • systemvoltage 5 years ago

                Yea, you can use `position: fixed` or `position: absolute` with CSS + HTML but as you said, it doesn't dictate pages. Although, now that I think about it, why do we need pages :) Just fixed layout is good enough.

                • leephillips 5 years ago

                  The pdf2htmlEX program does this. It creates exact renditions of PDFs in HTML, using embedded fonts and positioning each character. Personally, although it’s an amazing demonstration, I don’t really understand the purpose, when you can just distribute the PDF.

                  https://github.com/coolwanglu/pdf2htmlEX

            • arthur2e5 5 years ago

              You are thinking about XPS, Microsoft's OOXML-based page description language.

          • pjmlp 5 years ago

            You mean setting the Web back 10 years, with WebAssembly still not exposing everything that Flash + CrossBridge were capable of, and all the Web tools that try to replicate the Flash IDE still not coming close to it in HTML5?

            Nah, better leave PDF as it is, we don't need any accident.

    • markandrewj 5 years ago

      Generally the PDF copy of a book is designed to fit the format of the final printed book. The PDF is even often the deliverable sent to press.

      If you are using something like an e-reader, or a smartphone, the PDF layout often doesn't translate well. Typesetting is normally also done for ePub/Mobi, but the layout is tailored to the format of the device that the document will be read on. Although there are times when publishers just take the PDF and click 'convert to ePub', which isn't ideal.

      There are also other advantages to different formats, when talking about things like programming books. As an example, working code snippets for web formats. I am thinking of things like Jupyter notebook when I say this.

      As others have mentioned there are also a number of security risks associated with PDF.

      I can't deny there are several books I have read in PDF format however.

      • kall 5 years ago

        I do believe this happens, but I have yet to read a book that looks even reasonably typeset on Apple Books on a phone with anything larger than default.

        I know a PDF can't do that either, but that's the promise of these formats, right? Between the most typography focused tech company and whatever the publisher puts into the file, they don't deliver.

        • omnimus 5 years ago

          There are two issues here:

          1. PDFs are fully baked/static and made for one page format (the one original book was printed on). These will never work.

          2. Ebook formats are basically webpages and they have CSS. There are defaults from the ebook reader app, but often there are also CSS settings from the publisher of the book. The two sets of styles mash together, quite often leading to bad results. They could be very responsive, but I suspect publishers test only on Kindle/ebook readers.

          I am not sure why they don't just make one "master" style that's tested everywhere. But it's probably because they are not software companies. For them epub/mobi just might mean "black and white ebook release".

          Also mobi is probably most of the sales and it is pretty restricted design wise compared to epub.

          So basically blame the publishers.

        • leephillips 5 years ago

          What do you mean “PDF can’t do that either”?

      • Wowfunhappy 5 years ago

        > Generally the PDF copy of a book is designed to fit the format of the final printed book. The PDF is even often the deliverable sent to press.

        I believe this is precisely what the GP finds appealing about PDFs. A PDF (usually) constitutes the original copy, reproduced faithfully.

        It's like the difference between buying a widescreen VHS versus one that uses pan-and-scan to fit the movie to your TV's aspect ratio. Sure, the widescreen copy may not be ideal for your viewing environment, but it's the version the director intended for you to see.

        And for both eBooks and movies, you can work around the problem by using more appropriate hardware.

    • odyssey7 5 years ago

      Knuth as a book author has written about this on his website.

      “Warning: Unfortunately, however, non-PDF versions have also appeared, against my recommendations, and those versions are frankly quite awful. A great deal of expertise and care is necessary to do the job right. If you have been misled into purchasing one of these inferior versions (for example, a Kindle edition), the publishers have told me that they will replace your copy with the PDF edition that I have personally approved. Do not purchase eTAOCP in Kindle format if you expect the mathematics to make sense. (The ePUB format may be just as bad; I really don't want to know, and I am really sorry that it was released.) Please do not tell me about errors that you find in a non-PDF eBook; such mistakes should be reported directly to the publisher.”

      https://www-cs-faculty.stanford.edu/~knuth/taocp.html

      • omnimus 5 years ago

        But that's such a specific example. Math books have so much complex notation and weird inline elements... he had to invent TeX for that. HTML/CSS wouldn't handle it.

        Basically for all normal textbooks you will want epub as the digital format. And if you have a lot of illustrations or a weird-format book then just get the printed version.

    • ernst_klim 5 years ago

      > I personally like to purchase books that are in PDF format, not epub/mobi

      PDF is abysmal for books though. 1) You can't scale fonts to fit various screen sizes. 2) It's waaay more expensive to interpret and render, so it affects battery life.

      • systemvoltage 5 years ago

        1) That's a feature, not a bug :)

        2) Agreed, but that's a challenge for the engineers, not something we should blame the user for. Users have unequivocally spoken (Look up reviews for epub programming books on Amazon, usually they just want a PDF).

        • cratermoon 5 years ago

          Not being able to scale the fonts for different resolutions is something only a young person with 20/20 vision would think is a feature.

          • systemvoltage 5 years ago

            You can zoom the whole thing.

            Scaling fonts while keeping the width of the line the same causes other problems.

      • cjsawyer 5 years ago

        And yet, I prefer it. Textbooks in any other format are miserable to read.

        • Ar-Curunir 5 years ago

          Textbooks are different from other kinds of books in that you have formulae and figures that can't easily be reflowed.

      • sjy 5 years ago

        > It's waaay more expensive to interpret and render, so it affects battery life.

        Interesting – I never noticed a difference in battery life when I was reading MOBIs and PDFs on my Kindle, but the slow refresh rate made scrolling through PDFs a nightmare. I do notice that reading PDFs on my iPad is very light on battery compared to reading web pages, but that’s because those web pages are running a bunch of JavaScript and using my internet connection, whereas with a PDF I download it once and then read it for hours.

    • jcelerier 5 years ago

      > The way the original authors intended it to be, including fonts and all the things that go into making a document.

      cries in random PDFs that end up being printed with letters spaced by 1 cm

    • cratermoon 5 years ago

      What size and resolution is your eReader? I can read PDFs on my computers fine, the screen is large enough. On my eReader I can't both see a whole page and have the text large enough to read, even with my reading glasses. I end up zooming in to bring the text up to readable size and then I'm only seeing part of the page and have to scroll left/right/up/down to read a page.

      • smartbit 5 years ago

        Scrolling left/right through zoomed text can be avoided with Goodreader and these 2 simple steps:

        - With the crop function in Goodreader, remove as much of the left and right margins as possible. This might cut text in the header/footer, e.g. the page number. Optionally crop odd & even pages differently.

        - Turn the iPad landscape and pinch with 2 fingers to zoom the text until the margins snap to the sides of the screen. Now scroll down with a single finger to read the text from top to bottom. Flip to the next page by swiping to the left with a single finger and Goodreader will conveniently start at the top of the next page.

        In my experience this enlarges the text enough for most portrait pages. As a bonus you can use the iPad Smart Cover as a stand in either position. I have used this trick since the original iPad, as Goodreader has been there since day 1.

    • adrianmonk 5 years ago

      > absolutely DO NOT WANT - is web page like format with auto-flowing text

      I would love it, though, if PDF included this as something that's always entirely optional.

      Sometimes I want to read something just as formatted. If my display is big enough, and the formatting has any importance, I probably want that.

      Other times, my screen is smaller (phone) or the wrong shape (small laptop), and I'd rather the text conform to the device rather than vice versa.

      Also, sometimes I use my arrow keys to scroll as I read to keep my place (like a line-oriented instead of page-oriented bookmark). So just because the device is capable of it, I don't necessarily always want page-oriented original formatting because it might have two-column text to deal with or top/bottom margins that serve no purpose for me.

      • Closi 5 years ago

        > I would love it, though, if PDF included this as something that's always entirely optional.

        To be fair that’s what they are working on, with Liquid Mode for PDFs in Adobe Reader. The plan is that it will automatically work for all PDFs using some sort of AI magic.

        • mongol 5 years ago

          Yes but as I understand, that will be a feature of that reader application alone. And it does not exist on Linux

    • higerordermap 5 years ago

      > The way the original authors intended it to be, including fonts and all the things that go into making a document.

      So that it's very very hard to read on mobile, or in a small width window.

      > For example, I personally like to purchase books that are in PDF format, not epub/mobi

      exactly the opposite. I want to be able to properly reflow it in a narrow window side by side with another tab on laptop, often for technological books.

    • ketamine__ 5 years ago

      What device do you use to read PDF's?

      • gpm 5 years ago

        I've really been enjoying a remarkable tablet since I got it a few weeks ago. Great for taking notes, great for reading pdf's (and scribbling on them - which is surprisingly useful!), terrible for literally anything else.

        • rsfinn 5 years ago

          That does sound like a remarkable tablet. What's it called?

          (...oh, now I get it: <https://remarkable.com/>)

        • pklausler 5 years ago

          I love reMarkable 2 for note-taking, but the fact that internal links in PDFs cannot be traversed makes it a non-starter for me as a technical document reader.

        • krn 5 years ago

          > terrible for literally anything else

          I hope it will soon be the best e-reader on the market, thanks to KOReader[1].

          [1] https://old.reddit.com/r/RemarkableTablet/comments/jsb3r7/ko...

          • gpm 5 years ago

            Hopefully, but I don't like counting features before they exist.

            A feature that does exist, and that I maybe should have mentioned, is that one of my housemates finds it really useful for remote teaching: they screen share the reMarkable's screen to their students and use it as something like a blackboard.

        • ketamine__ 5 years ago

          Is it a good ebook reader if you're not a prolific note taker?

          • gpm 5 years ago

            It renders PDFs, lets you scale/move where on the screen the page is rendered, lets you flip through pages, lets you switch to a view of page thumbnails, lets you go to a page number, and remembers the last page you were on.

            If you expect more than that from your ebook reader, it falls short, though only because of the software.

            As a sibling comment mentions, that might be fixed soon. V1 has long had support for third-party apps. V2 is taking a bit of time (because controlling e-ink screens is weird) but will probably have third-party app support soon. At that point you will be able to use one of the well-known open source ebook reading apps.

            Edit: And to be clear, all the third party app support is unofficial, the help menu on the tablet includes the ssh password for root, everything past that is just reverse engineering (it's a pretty typical linux system apart from the display).

      • crazygringo 5 years ago

        I personally use an iPad. It's amazing for it. Especially when you use the Apple Pencil to highlight text, make annotations like circles or question marks, etc.

        • lilactown 5 years ago

          I struggled to find good software for this. Is there a specific app you recommend?

          • e_proxus 5 years ago

            I’ve been a very happy user of LiquidText [1]. You can load multiple PDFs/images/documents into a LiquidText project and annotate and cross reference. There’s a separate work space outside of the documents where you can store clippings and collect summaries.

            It’s great for reading long standard documents and cross reference between them and supporting material like examples etc.

            [1]: https://www.liquidtext.net/

          • crazygringo 5 years ago

            I've gone back and forth between Apple Books and Acrobat Reader a lot.

            Ultimately I currently use Books because I do a lot of multicolored highlighting which is far more consistent in Books, and because Books lets you set bookmarks (to go back and forth between main text and endnotes) which Acrobat inexplicably still does not support.

            Ultimately every PDF app is good at some things and terrible at others, and you're just going to have to find the app that has the best set of features for you. It is pretty sad that all the apps have significant drawbacks for certain common workflows.

          • sjy 5 years ago

            I like GoodReader for this. It’s the only app I’ve found that stores my annotations in the PDF files themselves, and allows me to keep my PDF library in sync over SFTP, Dropbox, etc. so that I can read and annotate documents on non-Apple devices as well.

          • schaefer 5 years ago

            On the iPad, I use goodnotes for that. You can import PDFs and then mark them up.

          • saagarjha 5 years ago

            PDF Viewer has worked well for me, personally.

          • ghufran_syed 5 years ago

            I use notability on the 12.9 inch iPad Pro. You can import the pdf, then annotate

      • mhh__ 5 years ago

        A computer.

        I started off my "learning more than what you get taught in school" reading PDFs on a tiny phone, I now do it on a bigger phone, a big DPI monitor etc. none of these platforms makes a blind bit of difference to me because my eyes aren't the bottleneck.

      • cjsawyer 5 years ago

        I use an ipad pro for reading pdf textbooks. The high refresh rate screen makes it more pleasant to use than the android tablets I’ve tried. Also the iOS books app is great, if a little bloated by the store half

      • systemvoltage 5 years ago

        iPad Pro, 10”.

    • pklausler 5 years ago

      Would you be happy with some kind of paginated image file format, if it were not significantly larger than PDF?

    • fomine3 5 years ago

      On the contrary, what I absolutely WANT is a web-page-like format with auto-flowing text and something that fits to the screen with user styling/typesetting. I wish both options were available.

    • hcurtiss 5 years ago

      I agree completely. Have you found any good resources for acquiring books in pdf?

  • izacus 5 years ago

    > Why anyone treats PDF as anything but a write-once format is beyond me.

    sigh These kinds of ignorant statements really annoy me.

    A massive part of the PDF spec is dedicated to editable features (annotations, form filling and digital signatures) which are used by massive industries daily because they bring them a lot of value. It's REALLY not that hard to think for a second and remember why having a single file which can be:

    - Edited, annotated and commented

    - Approved with explicit markers

    - Digitally signed by auditors and reviewers with tamper protection

    - Rendered on anything that has a semblance of interactive UI with graceful degradation

    - Archived on any kind of file storage

    is insanely useful and valuable for massive worldwide industries.

    • TwoBit 5 years ago

      And yet PDF re-writing keeps getting screwed up. Probably because it really is a bad format for this, with the pages of docs notwithstanding.

      • izacus 5 years ago

        Sure, PDF re-writing is one of very clear non-goals when you read the spec. It's designed to be generated, always consistent and then later annotated and marked up.

        And that doesn't make it read-only. It just makes it not suitable for live text editing just like a PNG isn't suitable for collaborative image work.

        • theamk 5 years ago

          Eh, "remove empty page" is not "live text editing". I have seen tools struggle with seemingly simple operations, like "take page 1,4 from A and pages 4,6 from B, scale to fit paper size and combine into one file". And those are not some exotic requirements, those are one of the most basic operations people want to do with multi-page documents.

          Also, did you want to say JPEG isn't suitable for collaborative image work (because it's lossy?). As PNG is a pretty fine exchange format if two people, who prefer different graphical editors, want to work on the same document.

      • virgilp 5 years ago

        I never saw these screw-ups in Acrobat, so maybe the problem is not the format, but Apple?

        • dkonofalski 5 years ago

          Then you probably don't use Acrobat enough. PDF is a finicky mistress.

          • virgilp 5 years ago

            Maybe; but I do have the full suite. Have used it plenty during the pandemic to fill in forms digitally (some that were designed to be filled in digitally, others that were not), both on an iPad and on computers. Sure, I'm not a hardcore professional user but those wouldn't be using Preview anyway; I'd bet my family usage is consistently above average.

            (also used it before the pandemic - I used to work for Adobe, for 10 years, though not in the Document Cloud group).

            [edit] I should clarify that "I never personally experienced Acrobat bugs that would corrupt my document", not that they don't exist; am not claiming that Acrobat is flawless, and maybe if I went to dig into the bug tracker I could've found document corruption bugs too, who knows.

            The format is certainly messy [+], I don't deny that. However, I think in this instance you're letting Apple off the hook too easily - this one is decidedly their screwup, not Adobe's.

            [+] not the basics of it, that's fine; but the "trying to do too much" part.

            • dkonofalski 5 years ago

              I don't think I'm letting Apple off easy because I don't think it's not their fault. I just think that, until we see the document in question, we can't squarely say that it is their fault. Preview and Acrobat both have ways of dealing with junk/bogus data and I feel like that's what's going on here. There must have been some kind of bit switched when the document was "corrected" and it's possible that Acrobat and Preview just resolve that issue differently while both still outputting "valid" PDF files.

    • theamk 5 years ago

      Sure, but the major text is still "write once" -- you can only add specially designated information to a read-only core.

      As this article shows, a trivial operation on the core text, like "remove a blank page", is still very hard and pretty easy to get wrong.

      I think having general Turing-complete power in a document format is a horrible idea. It's a pity we ended up with it. Something like OpenOffice's presentations ("odp") is also a single file which is laid out per page, and can be annotated, commented, approved, edited and so on, while not being Turing-complete.

    • Igelau 5 years ago

      le sigh the surface changes you're talking about are small potatoes compared to removing a page and altering document structure without breaking anything.

      • izacus 5 years ago

        I've worked on a PDF editing library long enough to have a very good understanding of how PDFs can break. And also a good enough understanding of why they're designed as they are - there are some rather forward thinking approaches the authors took in the 1.0 and 1.2 spec standards.

        Editing actual page contents is hard but the format itself wasn't designed with live editing in mind. Instead, they made a tradeoff to render the document consistently on all devices which has made the format so popular.

  • bigbubba 5 years ago

    In many industries, writable PDFs are commonplace and aren't seen to be a problem because people in those industries don't see any problem with using Acrobat. If you want to unseat Acrobat and PDF, I think you'll have to provide something that has equivalent or greater power.

    As for multi-image container formats, I think we already have that: cbz/cbr. Just a zip or rar archive with images in it, to be displayed in order by name. This 'format' is in common use already for scans of comic books. There are numerous viewers you can use to open and browse these files. Something to consider though: accessibility. A screen reader needs to use OCR to read these files to people with impaired vision. PDF files aren't fantastic for screen readers either, but they're much better than just a JPEG. I'd love to see some sort of subtitle system for images (I think mpv could probably overlay a subtitle sidecar on a still image, but that's not widely supported. Text-based subtitle formats are easy to wire into text to speech though.)

    • gfody 5 years ago

      you can also embed images and fonts in svg, probably not as reliable as pdf for pixel-perfect reproductions everywhere though

      • bigbubba 5 years ago

        Is there any way to embed raster graphics into an svg? If so, maybe svg could be used as a text-capable container for raster images.

        • unilynx 5 years ago

          Yep, they can, at least JPEG and PNG.

          (and if you're not precise enough with your questions, some designers will simply send you a PNG-base64-blob-inside-SVG if you ask them "can we have a SVG version of that icon")

        • gedy 5 years ago

          Sure, data urls in image tag. (Not sure how practical that is for file size)
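A minimal sketch of the data-URL approach mentioned above, assuming Python and a hand-built 1x1 PNG so the example needs no image library. The helper names `tiny_png` and `svg_with_embedded_png` are made up for illustration. On the file-size question: base64 turns every 3 bytes into 4 characters, so the embedded image grows by roughly a third before any compression of the SVG itself.

```python
import base64
import struct
import zlib


def tiny_png(rgb=(255, 0, 0)) -> bytes:
    """Build a valid 1x1 RGB PNG by hand (hypothetical helper for this demo)."""
    def chunk(kind: bytes, data: bytes) -> bytes:
        crc = zlib.crc32(kind + data) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)

    # width=1, height=1, bit depth 8, colour type 2 (RGB), no interlace
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0)
    # one scanline: filter byte 0 followed by the three colour bytes
    idat = zlib.compress(b"\x00" + bytes(rgb))
    return (b"\x89PNG\r\n\x1a\n"
            + chunk(b"IHDR", ihdr) + chunk(b"IDAT", idat) + chunk(b"IEND", b""))


def svg_with_embedded_png(png_bytes: bytes, width: int, height: int) -> str:
    """Return an SVG document carrying the PNG inline as a base64 data URL."""
    payload = base64.b64encode(png_bytes).decode("ascii")
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" '
        'xmlns:xlink="http://www.w3.org/1999/xlink" '
        f'width="{width}" height="{height}">'
        f'<image width="{width}" height="{height}" '
        f'xlink:href="data:image/png;base64,{payload}"/>'
        '</svg>'
    )


if __name__ == "__main__":
    png = tiny_png()
    svg = svg_with_embedded_png(png, 1, 1)
    print(len(png), "PNG bytes ->", len(svg), "SVG characters")
```

The sketch uses `xlink:href` for older SVG 1.1 consumers; SVG 2 also accepts a plain `href` attribute.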

  • monocasa 5 years ago

    The comic book piracy community in fact does have what you're asking for. The formats .cbz, .cbr, etc. are just zip and rar files respectively, with images inside and a standard format for the internal filenames so they're presented in order by your reader.

    Would love something like that with svg allowed inside to support vector drawings.
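As a rough illustration of how little there is to a .cbz, here is a sketch using only Python's standard `zipfile` module. The zero-padded page names and the helper names `build_cbz` and `read_pages` are assumptions for the example, not part of any formal spec; the page payloads are placeholder bytes rather than real images.

```python
import io
import zipfile


def build_cbz(pages: list[bytes], ext: str = "png") -> bytes:
    """Pack page images into a zip whose member names sort in page order."""
    buf = io.BytesIO()
    # ZIP_STORED: JPEG/PNG data is already compressed, so deflating gains little.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for number, data in enumerate(pages, start=1):
            # Zero padding keeps "0010" after "0009" in a plain name sort.
            zf.writestr(f"{number:04d}.{ext}", data)
    return buf.getvalue()


def read_pages(cbz_bytes: bytes) -> list[bytes]:
    """Return the pages in display order, i.e. sorted by member name."""
    with zipfile.ZipFile(io.BytesIO(cbz_bytes)) as zf:
        return [zf.read(name) for name in sorted(zf.namelist())]


if __name__ == "__main__":
    fake_pages = [f"page {n}".encode() for n in range(1, 13)]
    archive = build_cbz(fake_pages)
    assert read_pages(archive) == fake_pages
```

An SVG page would fit the same container as just another member; whether a given reader renders it is a separate question.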

    • ASalazarMX 5 years ago

      And this has become the de facto open standard, supported by many document viewers, probably because it's a very straightforward combination of two widely used tools.

  • yarcob 5 years ago

    Basic PDF editing (adding, removing, reordering, cropping and rotating) has been rock solid in Apple's Preview app. It's something that I dearly miss when I'm on Windows.

    Which makes showstopper bugs like this very unfortunate.

    But then again, it's been common knowledge that you shouldn't upgrade to a .0 release...

    • helmholtz 5 years ago

      I'm not sure how widely you've sampled the field on Windows, but in case you don't know it, I can recommend PDF-Xchange Editor. They forced it upon us at my workplace and it seemed dodgy as hell at first. But I've been using it steadily and now I really like it. I'm a paper-free researcher so I no longer print research papers. Instead, I annotate/highlight/comment using this software, and it's worked so well that I've installed it on my personal machine as well.

      I've tried all of the commands you've mentioned, and so far it's worked without a hitch.

  • crazygringo 5 years ago

    Because you often have no choice.

    You have a PDF of a book and you need to export the pages of a single chapter to a new PDF.

    Or you have 30 different PDF's that you need to combine into a single one.

    Or a full-color large-filesize scanned PDF that you need to convert to a smaller-sized black-and-white one.

    Or you need to copy quotes from a PDF to paste somewhere else.

    I could go on... PDF's are documents, and normal workflows involve all sorts of conversion and rearrangement and processing of documents. That's the whole point.

    • _-david-_ 5 years ago

      None of those require modifying the existing document. A new document could be produced and the existing document(s) would only be read from.

      • crazygringo 5 years ago

        Oh. Well, I mean... sometimes it's just easier to "Save" rather than "Save As"? (Or on macOS now, "Save" rather than "Duplicate" then "Save".)

        It's supposed to be a valid assumption that re-saving a lossless document in its native format won't destroy parts you haven't touched.

  • richard_todd 5 years ago

    There are lots of image-container formats. DjVu is my favorite, and supports OCR text annotations also. But, outside of some niches, it has fallen out of use these days. Adobe closed the quality gap by adding jbig2 and jpeg2000 support to PDF, making it possible to build a "bunch of images" PDF with similar quality to DjVu.

  • Yetanfou 5 years ago

    Strangely enough (?) I have not had many problems with PDF, certainly no more than with other document formats. I often use tools like pdftk [0] (as found on many Linux distributions) to split sections out of PDFs, create new ones out of single-page PDFs created with ImageMagick, create odd-even page versions etc. I generally do not touch the more "advanced features" like embedded JS (which I have disabled in all readers which support it), I just use it as a document format which more or less guarantees the resulting document looks the way it was meant to, plus or minus a few fonts. For that purpose it works well enough.

    The "bunch of jpgs/pngs/pick your poison in a tar container" format you describe exists in a fashion: Comic Book Archive [1], a format meant for and mostly used for comics. It consists of a compressed archive which contains sequentially numbered "pages" which can be JPEG, PNG or other image file formats. For pure image documents it can be used as a replacement for PDF but since it does not support text it cannot be used for scanned OCR'ed documents. DjVu [2] does support a text layer but that comes at the cost of complexity; it is far from the simple container you propose. Since an OCR'ed text layer needs to save not only the text itself but also the location on the image for each character I don't see any way to avoid complexity here.

    [0] https://www.pdflabs.com/tools/pdftk-server/

    [1] https://en.wikipedia.org/wiki/Comic_book_archive

    [2] https://en.wikipedia.org/wiki/DjVu

  • sildur 5 years ago

    Blind and shortsighted people may not like your idea. There are accessibility options in PDF, but even a regular one can be accessed with a screen reader. Your images won't.

    • bigbubba 5 years ago

      It's also generally a pain in the ass for people with good vision too, since text can't be copied out of a JPEG. If you want to forward a paragraph of text from a pure raster document, you have to read and type it all out yourself.

    • theamk 5 years ago

      It's not like PDF is the only one with text layer, or the only options are PDF, JPG or PNG.

      DjVu can have text layers as well, and I think SVG too?

  • qwerty456127 5 years ago

    > Why anyone treats PDF as anything but a write-once format is beyond me.

    Because PDF is supposed to be like electronic paper and you normally can draw on a paper original.

    In an ideal world OSes and apps would be better equipped to separate data and metadata and work with multi-stream files. Every file would have different kinds of pure data (e.g. scanned picture and OCRed text), metadata (e.g. title, author, ISBN/etc IDs, ToC etc) and user-generated annotations in separate streams. But our actual world is not ideal, we mix everything into one stream for every document and allow the apps modify it every time we view it.

  • eddieh 5 years ago

    We have that container format. It is called: TIFF

  • ruined 5 years ago

    There already exist formats as you describe, but regardless, a change of format wouldn't fix anything. The same problem would just surface there, as Preview and whatever add support for those formats, and include editing features.

    If you don't want to break something, just don't try to edit it. That can't be enforced anywhere but at the individual user level.

    Even proprietary formats in first-party software have this problem. Hell, plain-text editors have this problem.

    • pm215 5 years ago

      It would help if Preview wasn't really keen on editing the document, though. I find it tries to modify PDFs quite often -- I think it happens if I accidentally click on a table as I scroll through the document. I have resorted to 'chmod 0444 *.pdf' on my folder of datasheets and manuals so it can't mess with what I want to be a read-only file. Preview is the only pdf viewer I've ever needed to do that for.

      I've never wanted to edit a PDF, so personally I'd rather the feature was gated behind a menu option or something so you had to deliberately ask for it.
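For anyone who prefers a script over the shell one-liner, the same `chmod 0444` workaround can be sketched in Python. The function name `make_read_only` and its defaults are made up here; it sets the mode to exactly 0444 on every match in one folder, which is what the command above does, and it assumes a POSIX file system.

```python
import stat
from pathlib import Path


def make_read_only(folder, pattern: str = "*.pdf") -> list[Path]:
    """Set mode 0444 (read-only for everyone) on each matching file in folder."""
    changed = []
    for path in sorted(Path(folder).glob(pattern)):
        if path.is_file():
            path.chmod(0o444)
            changed.append(path)
    return changed


def is_read_only(path) -> bool:
    """True when no write bit is set for user, group or others."""
    mode = stat.S_IMODE(Path(path).stat().st_mode)
    return mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH) == 0


if __name__ == "__main__":
    for pdf in make_read_only("."):
        print("locked", pdf, is_read_only(pdf))
```

This only stops an application from writing the file in place; it does not stop "Save As" or anything running with elevated rights.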

  • jrochkind1 5 years ago

    Making a PDF include OCR'd text seems to by definition require writing twice. It has to be edited to add the OCR'd text that was not in the original image-only PDF, no?

    I mean, you can say that the second PDF is a different PDF that has only been written once, but you can say that about literally any edit, "write once" no longer means anything if you think about it that way.

    • aardvark179 5 years ago

      You can do the OCRing as a part of the original scanning process. You define a blank font and place the letters at the appropriate points to allow text selection. This is also done for documents that use fonts which are not licensed for embedding; it reduces on-screen rendering quality slightly as you have lost any hinting information, but it’s fine with printing and preserves the intent.

    • tonyedgecombe 5 years ago

      PDF does support appending to the file and it’s not just for adding pages. You can make changes throughout the original document.

      I’m not sure that is what Preview does though.

  • idle_zealot 5 years ago

    > I think we need a multi-image container format. It could be something that's literally a bunch of jpgs/pngs/pick your poison in a tar container

    You've just described the .cbt format!

    https://en.m.wikipedia.org/wiki/Comic_book_archive

    • marcan_42 5 years ago

      And it's a bad format, because .tar does not have an index that allows random access. It also does not have compression, unless you stick it in gzip, which makes it even worse because then you can't even skip entries, you have to decompress the whole thing from the top.

      .zip is better, at least that has an index, but at the end of the file, so it's not linearly streamable.

      For what it's worth, .pdf is structurally ~the same as .zip; the index is at the end, and blocks are compressed individually.

      Choice of archive format matters :-)
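The index difference can be seen with nothing but the standard library. This is a sketch, assuming small in-memory archives, the classic ustar header layout (name in bytes 0-99, octal size in bytes 124-135, 512-byte blocks), and no zip archive comment; the helper names are made up. Finding a member in the tar means walking every header before it, while the zip reader jumps straight to the offset recorded in the central directory at the end of the file.

```python
import io
import tarfile
import zipfile

BLOCK = 512  # tar header and padding granularity


def make_archives(members: dict[str, bytes]) -> tuple[bytes, bytes]:
    """Build the same content as an uncompressed tar and as a zip."""
    tar_buf, zip_buf = io.BytesIO(), io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    with zipfile.ZipFile(zip_buf, "w") as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return tar_buf.getvalue(), zip_buf.getvalue()


def tar_headers_read(tar_bytes: bytes, wanted: str) -> int:
    """Count how many headers must be parsed before `wanted` turns up."""
    offset, count = 0, 0
    while offset + BLOCK <= len(tar_bytes):
        header = tar_bytes[offset:offset + BLOCK]
        if header == bytes(BLOCK):  # an all-zero block marks the end
            break
        count += 1
        name = header[:100].rstrip(b"\0").decode("ascii")
        size = int(header[124:136].rstrip(b"\0 ").decode("ascii"), 8)
        if name == wanted:
            return count
        # skip this header plus the data, rounded up to whole blocks
        offset += BLOCK + -(-size // BLOCK) * BLOCK
    raise KeyError(wanted)


def zip_member_offset(zip_bytes: bytes, wanted: str) -> int:
    """Look the member up in the central directory; no scanning of earlier data."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.getinfo(wanted).header_offset


if __name__ == "__main__":
    pages = {f"{n:04d}.png": bytes(n * 300) for n in range(1, 6)}
    tar_bytes, zip_bytes = make_archives(pages)
    print("tar headers parsed:", tar_headers_read(tar_bytes, "0005.png"))
    print("zip local header at:", zip_member_offset(zip_bytes, "0005.png"))
```

Wrapping the tar in gzip makes the linear walk worse, since the skipped data then has to be decompressed as well.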

      • Freak_NL 5 years ago

        Most comics using this meta-format seem to use .cbr (Rar) or .cbz (Zip). The main benefit is that reader software understands the intended use of these archives, and you can link a comic book reader to these file extensions.

  • nine_k 5 years ago

    PDF is often called "electronic paper", meaning its primary use, typesetting documents on screen and paper alike.

    I'd rather view PDF as an electronic clay tablet. You can put any text or image on it, but once it's formed, you better not try to alter it, lest you break it.

  • m463 5 years ago

    What annoys me is that there are some .pdf manuals I want to refer to, and when I view them in preview and go to quit it asks if I want to keep my changes or revert them.

    WTF? why does viewing a manual - and I only used navigation like page up / page down - cause modifications?

  • ernst_klim 5 years ago

    > Why anyone treats PDF as anything but a write-once format is beyond me.

    Hahaha. Quite a lot of people around me think that PDF is a collection of JPEGs glued together (because they mostly see PDF as scanned non-OCRed docs).

  • thordenmark 5 years ago

    "Why anyone treats PDF as anything but a write-once format is beyond me."

    Some of us work in print publishing and being able to edit PDF's is critical.

  • SoSoRoCoCo 5 years ago

    Here is one good reason: My company uses digital certificate signing on PDF. It is a very common usage. We have yet to migrate to DocuSign.

  • cmiller1 5 years ago

    > The only programs I'd be reasonable sure wouldn't screw it up are Acrobat itself

    Oh no, I've seen acrobat screw with it too

  • db48x 5 years ago

    There are already a handful of multi-image container formats. They're all images inside zipfiles, of course.

  • Siira 5 years ago

    We already have CBR/CBZ.

  • christkv 5 years ago

    CBR and CBZ for comics fits this I think?

  • CamperBob2 5 years ago

    .PDF works just fine. Don't blame the shortcomings of crappy software implementations on the format itself.

    We need a format that will still be readable 100 years or more from now, and .PDF serves that purpose.

crazygringo 5 years ago

I work with a ton of PDF's between my Mac and iPad, and it mostly works but there are still just way too many bugs.

It's a lot of little things, like in Catalina where opening up the sidebar for annotations (comments) seemingly randomized their order. (Big Sur, fortunately, fixed it to be page-order again.)

Or how printing a PDF from a website (in Catalina, also seemingly fixed in Big Sur) would look right on the page... but if you copied and pasted the text from the PDF to somewhere else, something like 10% of the glyphs were scrambled ("lik3 thZs"), like some sort of character table corruption.

Or reading a PDF with Books on my iPad, maybe 10% of the time bookmarking a page... doesn't bookmark it. Or removing a bookmark... doesn't remove it. Or a handful of highlights you just made have inexplicably disappeared the next time you open the file.

Or whenever you open the PDF in Books it remembers which page you were on. Except sometimes it doesn't, so you can't really rely on that for saving your place.

Or in Books, if you select some text to copy but accidentally hit the adjacent "select all" in the pop-up menu, and you're dealing with a 400-page PDF, it just locks up and you have to restart it.

Or in Preview if you want to convert a PDF to black-and-white, there's an option for it but your PDF will balloon in filesize to 10x larger or something.

I mean, I could go on and on. It's weird, because Preview is an incredible app, really. But it really is like they build it and then never bother to test if basic workflows reliably work.

  • inetknght 5 years ago

    > 10% of the glyphs were scrambled ("lik3 thZs"), like some sort of character table corruption.

    As I understand it, that's a form of copy protection trying to prevent exactly what you're doing.

    • crazygringo 5 years ago

      While that type of copy-protection does exist in rare instances, in this case it wasn't -- it was with any normal webpage printed to PDF. It was a straight-up bug, I tried to file it with Chromium in fact [1], but it turned out to be a macOS issue that Big Sur fixed.

      [1] https://bugs.chromium.org/p/chromium/issues/detail?id=112849...

    • Someone 5 years ago

      More likely it’s the effect of font subsetting.

      If you want your PDF to render the same everywhere, you have to embed font information (even if that font is something like Arial, which is available almost everywhere, as ‘almost everywhere’ isn’t ‘everywhere’, and because there are zillions of variations on Arial)

      You don’t want to embed entire fonts, though, certainly not if a PDF uses only a few characters of a font.

      The memory wise cheapest way of embedding a font leaves out the table that maps glyph numbers back to Unicode code points. If you do that, the PDF reader guesses a 1:1 mapping to ascii/Unicode.

      Subsetting means you can’t extract all glyphs in a font from a single file, but AFAIK, that’s just a side effect.

      I guess this bug drops such a table, or messes up that translation table.

    • m463 5 years ago

      Try to screenshot with a movie onscreen.

      You get a grey rectangle.

  • drevil-v2 5 years ago

    That not remembering the last page you were at in Books drives me crazy.

    My iPad would be the perfect device for reading PDF books especially technical and maths text books, if it weren't for the Chinese torture of whether it will remember your page the next time you open it or will I have to scroll through until I recall where I left.

    • Jtsummers 5 years ago

      I've been a happy GoodReader user for years. Syncs with a variety of sources, remembers where I was, bookmarks work. Annotations work, but I don't use that much beyond being able to say it works.

      • drevil-v2 5 years ago

        I like the Apple Books app because

        - It syncs across all my devices. No 3rd party storage like Dropbox required. In fact what I don't like is GoodReader asks for too broad an access to Dropbox/OneDrive if you try to set it up.

        - Books is first party app so absolutely no tracking whatsoever. GoodReader had Facebook SDK and Google Analytics and was trying to contact a bunch of other trackers the last time I installed it.

        • crazygringo 5 years ago

          Oh man, that was another nightmare with Books. I tried keeping syncing on for a year, but ultimately switched back to a local library.

          It worked 99% of the time. But then every few weeks it would overwrite my newer-annotated version of a PDF with an older one, and I'd lose chapters' worth (in one case an entire books' worth) of highlights. It was infuriating.

          It had to be some kind of iDrive sync bug. It seems insane that type of stuff passes QA, I can't even imagine.

          • smartbit 5 years ago

            > I'd lose chapters' worth (in one case an entire books' worth) of highlights. It was infuriating. It had to be some kind of iDrive sync bug.

            That hurts, sorry to hear. iCloud drive supports files sync for files 50GB or less in size. Have you experienced sync issues with iCloud drive too? Or with Dropbox?

            • crazygringo 5 years ago

              Nope, no data loss issues with iCloud Drive files directly (or Dropbox).

              My hypothesis is that Books does something wonky with metadata for extra sync logic, like it depends on reading the last "last page read" record but then sometimes fails to write that, and then loses the pointer to the new version altogether. I can't be sure, but I can't think of any other explanation.

        • smartbit 5 years ago

          > GoodReader had Facebook SDK and Google Analytics and was trying to contact a bunch of other trackers the last time I installed it.

          Longtime fan of Goodreader here. Sorry to hear that Goodreader is trying to contact trackers. How did you find out? Any suggestions on how to prevent Goodreader from contacting those trackers, e.g. pi-hole?

          • drevil-v2 5 years ago

            I run pi-hole with recursive dns so I was able to see what domains any app contacts and then I am able to lock it down on my local network no problem.

            On my devices (iPhone / iPad) I also use the Lockdown app, which allows you to blacklist IP addresses and domains, so once I can see what an app is up to using pi-hole, I manually add those domains to Lockdown so that even when I am out of the house the trackers are blocked.

avalys 5 years ago

This is a clickbait, sensationalist headline. “Saving a PDF with Preview in Big Sur can corrupt OCR text added by a third-party program” is more accurate.

  • matrixagentOP 5 years ago

    That's fair, and I honestly didn't enjoy posting the headline here, but as far as I know I have to use the original title? And the original title is from a personal blog where we talk about annoying things. We're not a professional tech blog, or a bug tracker, or… something other than our own little thing. I chose that headline not to be clickbait, but "sensationalist" is probably true because I'm personally really, really angry about this issue. I wanted to do my usual "scan all my documents for the month" routine 30 minutes before going to bed, and instead it turned into a two hour debugging session. And I can't even use my normal workflow now, possibly until March. I find it completely unacceptable that Apple would break Preview that way again. It's not even the first time. Just thinking about it now gets me going again. That's why the headline sounds like it sounds. I would have no problem at all if it was modified here, and as I said – your assessment is absolutely fair.

  • jabbany 5 years ago

    Kind of agree the current title is a bit misleading.

    I've never used any editing features in Preview (I mean it's called "preview" so...) and reading the title I thought this meant it was mangling files even by opening them which would have been super scary.

    As for non-Acrobat software mangling PDFs after editing... Well that's much less surprising. I've even had Acrobat mangle stuff in PDFs after editing...

    • wil421 5 years ago

      Preview has similar features to Microsoft’s snipping tool. Highlighting, draw basic shapes, draw free hand with a couple colors, and add some annotations. I like it better than Windows most of the time.

    • matrixagentOP 5 years ago

      Well, technically you at least only have a chance of noticing the error after opening a PDF (again). I suppose that's because after saving the old, correct data is still in memory, but I don't know exactly when the corruption happens – would surprise me if it wasn't upon saving, though.

  • refulgentis 5 years ago

    In my very humble opinion, it's accurate: "Saving a PDF with Preview in Big Sur [Preview in Big Sur] can corrupt OCR text added by a third-party program [is irreversibly destroying PDFs]"

    I understand the (fairly common, in these comments) viewpoint that this is the fault of the "third-party program", but since the PDF is readable up until Preview touches it...I find it hard to come around to the viewpoint the third-party program is relevant. Readable bytes -> Preview -> unreadable bytes is my mental model so far.

    Edit: absolutely unacceptable this is downvoted to -4. I've observed for a couple months that participation in Apple-related threads, outside indignation that Apple was involved in the discussion at all, gets down to -5 before getting back to -1 a day or two later. No matter what tone is used, this happens, and it makes the problem even worse in the long run. Been here 10 years, always been a _slight_ problem, but over the last year, it's virtually impossible to participate without continuing to slowly destroy my 11 year old account. Not sure how much longer I can keep trying.

    • birdyrooster 5 years ago

      Preview isn’t breaking files by reading them; as I understand it, people are saving files with Preview and overwriting their ABBYY-compatible PDF. Just because the last four bytes of a file name are “.pdf” doesn’t mean anything that opens files with that suffix will work.

      PDF is not a bitmap, it’s a script like HTML or JS.

      People understand browser incompatibility but somehow this is unconscionable.

      • refulgentis 5 years ago

        I see, the chain isn't as simple as bytes -> Preview -> unreadable bytes: rather, bytes -> Preview, edit -> unreadable bytes. Is that accurate?

        • fastball 5 years ago

          No, the bytes are still readable by Preview, it's just the OCR meta that is apparently no longer copy-pastable.

          No reason why you can't just run this through ABBYY FineReader again and get the exact same OCR you got the first time, so I think "irreversible" is definitely a stretch.

        • birdyrooster 5 years ago

          Yep they are flushing new bits to disk when preview is instructed to save.

    • dkonofalski 5 years ago

      Maybe this can give you some insight. I downvoted your comment simply because it didn't add anything to the discussion and you made an assumption that has several faults in it.

      For one, your "mental model" is off because you assume that the first part of "readable bytes" is accurate. Without actually seeing the PDF in question, you don't know if the "readable bytes" are actually corrupted and Preview is fixing them to make them readable. That would mean that Preview is actually correct in its behavior and the source document is what's flawed.

      On the tail end of your mental model, then, is another assumption which is that this results in "unreadable bytes" but that's not accurate either. The PDF that results from a save in Preview may be accurate to the PDF specification and is perfectly readable as a PDF in any PDF application/reader. What's no longer readable is the extra content that was originally in the file that may not have been saved correctly, in-spec, or may have been corrupt to begin with.

      A big hint as to what's going on here, now that I've had some time to review this, is that the "corruption" happens consistently - the letter "a" is always replaced by the same "corrupted" character, the letter "b" seems to be consistently replaced with the same character, etc. That points, in my opinion, to a lookup that's no longer correct. What side of that lookup is bad is hard to say without seeing the file in question.

    • vzqx 5 years ago

      The title is technically accurate, but it's misleading to non-mac users like myself. I assumed the author was using functionality called "Preview" only to view the documents, rather than save them.

      There's a big difference between "read-only operation is mangling files" vs. "PDF writer is buggy".

    • JxLS-cpgbe0 5 years ago

      > Not sure how much longer I can keep trying

      Keep trying, just with a new account every few months or so. HN has no privacy controls, we must add our own.

  • birdyrooster 5 years ago

    Did they save the file using Preview? If so that’s on them, they chose to write a pdf using Preview and that comes with all of the pitfalls of pdf compatibility. Does plain old PostScript have this problem?

    • caminocorner 5 years ago

      I'm not the original author. The usecase I have for Preview is to open it up, read it, highlight a few things and save the file (with the new highlight overlays). I wouldn't expect that to destroy my underlying OCR (which I also use a 3rd party app for)

      If the behavior changes, that's not on me, that's on Preview.

      I don't have any issue with this today on my Mac, but I'm glad I didn't upgrade to BigSur. I almost did.

    • lilyball 5 years ago

      Yes, they used Preview to modify the PDF.

      • Aperocky 5 years ago

        So misleading.

          There is a chance that a write goes wrong, but Preview is much more geared towards read than write.

        This doesn't excuse Preview, but its name is suggestive enough.

        • tbejn 5 years ago

          Actually, I think it's not a Preview bug. I use the Grammarly plugin in Safari, and the double copy-paste issue is happening there also. It's something more generally broken. I run 11.1

    • CapriciousCptl 5 years ago

      Probably most people who’ve done a bit of PDF work know there’s no guarantee of the same output from different (or even the same) editor. So I don’t think it’s Preview’s fault per se, since the problem is endemic to PDF. But I don’t think you can blame the user either. Really, PDFs are just these enormously useful complex things that are always breaking in unexpected ways and some people haven’t been bitten by its problems enough yet to cope properly.

lilyball 5 years ago

I find posts like this completely pointless when they include no details at all. This is just "there's an incompatibility between third-party software and a version of macOS that the third-party software says they don't support yet, so I'm going to publicly criticize Apple".

If you're not going to do the work to figure out what the corruption is, at least include the two PDFs so other people can look at them and see what happened.

  • dewey 5 years ago

    There’s a list of blog posts about the same problem linked in the article including a radar from 2016 (https://openradar.appspot.com/29786282) on Apple’s bug tracker. It’s not exactly an obscure bug that nobody knows how to reproduce.

    • lilyball 5 years ago

      A radar from 2016 is not useful, that describes an old bug. Just because the symptoms look like something we’ve seen before doesn’t mean it’s the same underlying issue.

      • dewey 5 years ago

        There might be newer Radars too, I can only see that 2016 one because some kind soul set up a public interface to show a subset of submitted Radars. With Apple's opaque and in-transparent bug reporting process there's no way of knowing if someone already reported it. I've submitted numerous feedbacks already but I can't do that for everything or do the homework for a trillion dollar company.

  • matrixagentOP 5 years ago

    > If you're not going to do the work to figure out what the corruption is …

    I'm sorry, but last time I checked neither Apple nor ABBYY pay my salary. I really don't understand these takes. If Apple or ABBYY want my PDFs, they should be able to find my email address rather easily. Your tl;dr version of the post is completely unfair. I publicly criticize Apple because they are breaking something that potentially affects a lot of people who are unlikely to even know about it, and they are doing it for at least the second time now. If you don't think that's worthy of criticism, I don't know what is.

    I also love how so many people assume I didn't already talk to support and file radars. I guess you had better luck in the past than me, but I can assure you, these options aren't always as useful as you might think they are.

    • JumpCrisscross 5 years ago

      > neither Apple nor ABBYY pay my salary

      This is a fair bar for conversation, in person or online. One can be more demanding of a public write-up.

      • matrixagentOP 5 years ago

        Fair enough, though I'd still argue that a public write-up that might warn a few people and save them the trouble is well worth writing on its own. As this blog is just a hobby, my time for it is limited and it's not like a write-up like this doesn't already take up some time. I did provide plenty of resources in the links, to which I don't think sample PDFs from me would add much value. But point taken, I'll try to attach sample data where applicable in the future.

ztravis 5 years ago

My guess is that the output PDF is still valid, but that an embedded (subset) font has had its `ToUnicode` map stripped, so that there's no link between the character codes used in the text elements and the "actual" characters they represent (there are also other ways this corruption could happen, but dropping or mangling the `ToUnicode` map seems like a likely cause).

  • duskwuff 5 years ago

    This is almost certainly it. I've seen similar issues with copy/paste from poorly constructed PDFs, often ones generated by "print to PDF" features.

    • arthur2e5 5 years ago

      Very old LaTeX PDFs tend to have this issue too. Chances are pretty slim for profs to edit PDFs with Preview, I think…

      • duskwuff 5 years ago

        Yep, and in that case it's because those PDFs were often generated through really horrifying pipelines (e.g. TeX to DVI to PS to PDF). Under some workflows, the resulting document wouldn't even contain any characters, as far as PDF was concerned -- it'd just be a bunch of vectors.

        • mkl 5 years ago

          Or not even vectors, but lots of little bitmaps. It's really awful.

  • lrossi 5 years ago

    I agree. If you look closely, you can see certain patterns repeating, they’re just not English letters. But it definitely looks like natural language, and not random binary dump.

    • Marioheld 5 years ago

      Also look at the spaces. The length of the words is the same on both texts. So the content is still present just the characters got shifted.

zepto 5 years ago

They are using software unsupported by the vendor and blaming Apple for the outcome.

“ABBYY says they don’t support Big Sur yet, that’s fine. But Apple didn’t tell me that I can’t upgrade to Big Sur when I use ABBYY. I’d be a lot less angry if there was a changelog or release notes from Apple where it says there is a known problem with OCR’ed PDFs in Preview. Their software is broken, they need to tell me. I don’t care if it only worked because they had workarounds for super shitty PDFs that ABBYY possibly produces, I just need my OS to keep working for me.”

  • jcrawfordor 5 years ago

    So Preview opens a file, which is apparently valid per Preview (Preview handles and displays it correctly). After changes, preview then saves a file that is not valid per Preview (OCR text layer is corrupted).

    It is very difficult for me to construe a situation in which this would not be considered errata in Preview. Even if ABBYY is writing unusual PDFs, it's popular enough software that this issue will be encountered in the real world multiple times. Having to deal with unusually formed PDFs is just a general trait of writing PDF software. If you consider it a non-issue that your software writes corrupt PDFs when the same situation is handled correctly by Acrobat, no matter how unjust you feel the cruel world to be, you should not be in the PDF business. There's a reason most PDF viewers only present a highly constrained feature set, and it's because writing a capable PDF editor is very difficult. Apple has decided they are up to the challenge, and in this case has failed.

    As a general rule, if your software package opens a file fine, then writes a broken version of it, that seems to all the world like a bug in your software package.

    The idea that this behavior in a PDF reader can be excused because the software that generated the PDF was not approved for the operating system the viewer is running on per the vendor... kind of stretches credulity. I don't usually inquire as to the OS used to generate the PDF when I receive one.

    • dkonofalski 5 years ago

      If this is a bug with Preview, then that's really, really bad since Preview is a bedrock of macOS and has been for years.

      However... it sounds like the issue is that FineReader is storing the OCR'd text in the metadata in a way that's not part of the official PDF spec. So, it sounds like Preview is able to open the file by ignoring that metadata and then, upon save, is storing the metadata back, as normal, which then corrupts the OCR data. This reminds me a lot of when people would store metadata like this in MP3 files to include things like album art and booklets. Normal mp3 players would ignore it as just metadata or bogus data but opening it in an audio editor would do this same thing.

      I'm not sure who the "blame" lies with in this case because ABBYY FineReader probably is writing this stuff in a non-supported way but Preview really should just ignore it rather than trying to correct it. It's very likely that the OCR text, post-save, is actually bits from the document itself rather than from the metadata.

    • zepto 5 years ago

      “The idea that this behavior in a PDF reader can be excused because the software that generated the PDF was not approved for the operating system the viewer is running on per the vendor.”

      Nobody is saying that. The suggestion is that the software that generates the PDF produces corrupt documents.

      The fact that the vendor of that software doesn’t approve it for Big Sur suggests that they might be aware that there are problems.

      • jcrawfordor 5 years ago

        I worded it that way for two reasons, one of which is admittedly speculative:

        1) It seems highly unlikely that ABBYY relies on some changed OS behavior in generating PDFs that leads to it producing PDFs that are malformed in such a way that is only revealed when they are rewritten by Preview. Behavior in Preview is by far a more likely cause of the problem. Generally the thing that changed is what broke...

        2) To the user, this looks 100% like a problem in Preview no matter what's happening, and it's Apple's responsibility to not do this kind of thing to users. The PDF opens properly the first time, so de facto it is "valid" as determined by the product that later corrupts it. As I said, handling questionably valid PDFs is part and parcel of writing PDF software, and failure to handle a PDF that otherwise renders correctly looks like a bug on your part... especially when it otherwise renders correctly in your own software.

        • zepto 5 years ago

          I don’t really see why #1 is highly unlikely. PDF is very complex, and it’s easily possible that a generator could have a bug.

          ‘Generally the thing that changed is what broke’

          This has never been true in software engineering. Changed code reveals bugs which need to be fixed elsewhere all the time.

          2. Yes, to the end user it looks like a problem with preview.

          No, the fact that it opens at all doesn’t make it de facto “valid”.

          Yes, handling questionable PDFs is part of writing PDF handling software.

          No, that doesn’t mean that all PDF handling software must or can feasibly handle any and all corruption.

          The very fact that there are many kinds of questionably valid PDFs out there proves the point. Handling the intersection of all the invalid PDFs is impossible.

          Rendering correctly has nothing to do with this. PDFs have many attributes which are not rendered.

          It really is on the document creator to produce a valid document in the first place.

          It’s certainly on ABBYY to have tested this months ago and either fixed it, or publicized it.

        • dkonofalski 5 years ago

          >The PDF opens properly the first time, so de facto it is "valid" as determined by the product that later corrupts it.

          While I agree with your main point that, to the user, this looks like a problem with Preview, I think it's actually because Preview is doing something beneficial to open the file which is to ignore "bogus" data. Preview, from what I can tell, ignores additional data that it doesn't expect specifically to allow for opening PDF files where the document data is fine but the metadata is corrupted. The fact that it opens it means nothing since the file would not be able to be opened at all otherwise and Preview is actually doing the user a favor by salvaging it. Once Preview "fixes" it, though, it looks like the OCR pointer is still there but the metadata that contains the plain-text content is not so it's pointing to binary document data instead. Another option is that Preview has "fixed" a font and the mapping is no longer correct which, while I can't really see the text in the image, would be obvious as you'd see words that map to the same corruptions. In either case, Preview's behavior is "correct" and the fact that it was less strict before does not mean that it's now broken - the source PDF is still what was "broken".

          • interestica 5 years ago

            I think a big issue is that 'preview' is used as just a pre-view. The metaphor is that you haven't actually opened/viewed the file yet 'for real'. Yet, preview has morphed into a program that makes changes to files even without any kind of save dialogue.

            • dkonofalski 5 years ago

              Are you confusing Quick Look with Preview? Preview is not just for PDFs and allows you to view, annotate, and edit lots of document types...

              • interestica 5 years ago

                No, really. Preview. Yes, it can open everything (I don't know where I suggested that it is only for PDFs?)

                But it doesn't act like other editors. There's no save dialogue when you, for instance, rotate a PDF. The term 'Preview' in other programs, like when using a scanner or other text editors, is a non destructive type of viewer. Just a viewer. "preview before you make changes"

                But preview on OSX changes that.

      • smarx007 5 years ago

        This is ridiculous. When I read that FineReader is not supported on Big Sur, I understand it as I cannot run FineReader to _produce_ PDFs on Big Sur. But I expect Big Sur not to trash PDF files that were produced on Catalina, for example.

        By the way, I was forced to purchase PDF Expert a few years ago because of all kinds of problems with Preview (blurry text, weird bugs with annotations, removed ability to print PDFs with annotations, and the list goes on).

  • arvindamirtaa 5 years ago

    Here's the rest of the statement that you left out that answers your point.

    "...I just need my OS to keep working for me. This bug could hit me without even owning a scanner at all – someone sending me a PDF that I then unknowingly break before archiving it. That’s the part I’m mad about."

    • zepto 5 years ago

      It doesn’t really. Firstly, that’s not what they are mad about, since that hasn’t actually happened to them. They are mad that the pdf’s their scanner produces don’t work properly.

      And secondly, that hypothetical could well be using broken software too. We really don’t know how bad the PDFs ABBYY produces are, but as someone who owned this scanner and the software, it’s fairly obvious that the software is barely maintained.

      It really isn’t Apple’s fault if someone else is producing bad files that they happened to previously tolerate, especially if that somebody isn’t maintaining their software.

      • arvindamirtaa 5 years ago

        > "They are using software unsupported by the vendor and blaming Apple for the outcome."

        But by this logic, no software other than what Apple sells you directly (their own or through the MAS) is "supported software". If Apple makes a change and all of it breaks, rendering it useless for anyone using software not bought from them, would you still be making this same apology?

        • zepto 5 years ago

          No this is not correct.

          ABBYY hasn’t updated their software for Big Sur, i.e. ABBYY themselves say it hasn’t been updated yet and is not supported by them on Big Sur.

          That’s what unsupported means. It has nothing to do with whether Apple supports it.

          • arvindamirtaa 5 years ago

            I receive that PDF file that looks like any other PDF file.

            I open it with preview. It opens fine. I close it and open it again. It opens fine.

            I open it a third time and make a small arrow pointing to something and save it (as I would with ANY OTHER PDF).

            It breaks.

            I'm not using ABBYY. I've never heard of it. It's not on my system. It's just a PDF file that I got sent.

            Now what?

            • dayjobpork 5 years ago

              You are opening it wrong /s

            • zepto 5 years ago

              You’re a victim of ABBYY’s poor PDF generation, that’s what.

              Nothing excuses ABBYY if their PDFs are corrupt.

              • mynameisvlad 5 years ago

                Wow, just wow.

                The PDF works fine before saving in Preview. As in, Preview itself will render the file perfectly fine with OCR. By all accounts, the PDF is completely valid and uncorrupted at this point. Making a change to the file and saving it in Preview is what causes the corruption.

                So how the hell could you possibly excuse Preview and call this an ABBYY issue when Preview is the one that causes the issue?

                That's absurd to the highest level.

                • dkonofalski 5 years ago

                  >The PDF works fine before saving in Preview.

                  I think the disagreement here is that there's no evidence of this. Preview's error handling could very well be interpreting bad data and allowing the file to be opened. The question becomes, then, should Preview continue to propagate that bad data on save or should it try to correct it with the possibility that it corrupts just that data. If the PDF was not in-spec prior to Preview touching it but it is in-spec after Preview saves it, is it a good thing that Preview "fixed" the PDF file and made it "proper" or is it bad because it technically lost/corrupted data?

                  In other words, what is the "right" thing for a software to do in this case? Keep bad data and leave the file as "bad" or fix the issue to make a valid PDF and, as a side effect, remove the "bad" data?

                • zepto 5 years ago

                  “By all accounts, the PDF is completely valid and uncorrupted at this point.”

                  Is it? I don’t see any accounts showing this to be a fact.

              • arvindamirtaa 5 years ago

                Hahaha this is just brilliant.

                • zepto 5 years ago

                  It’s not brilliant. It’s just correct if ABBYY is producing corrupt PDFs, which is a possible cause of this problem.

                  • arvindamirtaa 5 years ago

                    If "Corrupt" = works fine until you make an edit and save using Preview, yes. You're 100% correct.

                    • zepto 5 years ago

                      Works fine until you make an edit really doesn’t mean anything at all. PDF is much more complex than just a static image.

            • ehutch79 5 years ago

              Someone buys a bottle of coke

              They shake it up a lot

              They hand me said bottle of coke

              I open it, and get sprayed by fizz

              THAT IS CLEARLY THE COCA-COLA COMPANY'S FAULT!

              • ehutch79 5 years ago

                To be clear, my point here is that just because something is handed off does not mean its existence is reset or something.

      • function_seven 5 years ago

        > It really isn’t Apple’s fault if someone else is producing bad files that they happened to previously tolerate, especially if that somebody isn’t maintaining their software.

        I think it is. Preview used to correctly handle the thing, and now it doesn't. Users cannot be expected to inspect the raw PS code of their PDFs to determine standards compliance. What they can do is assume that something that works yesterday, and appears to work today, is in fact working.

        The PDFs didn't change, the application did. It's the application's fault.

        If you have compatible behavior that you remove, it's on you to alert the user of the reduced functionality.

        • zepto 5 years ago

          Yes, it is in principle possible that Apple was doing something special to support bad PDFs produced by ABBYY which they then removed. If this is what happened, I would agree with you.

          But, there is no evidence yet that they removed anything.

          It’s quite possible (I’d say likely) that ABBYY’s PDFs were corrupt all along.

          Preview may have tolerated them not because of a special feature which was then removed, but because it happened not to depend on correctness before.

          In this case, Apple wouldn’t have been in a position to even know about the problem.

          You know who would? ABBYY, who have had free access to Big Sur for months. You’d think that even cursory testing at their end would have shown this problem.

          They have had months to fix it, months to work with Apple on the issue, and months to publicize it if it really is Apple’s fault.

          • tonyedgecombe 5 years ago

            The trouble is the PDF specification is vague and ambiguous in a number of areas. This has resulted in a lot of ill formed documents. If you are going to write a PDF reader then you are going to have to accept those documents and not corrupt them.

            Even if ABBYY does have a bug and fixes it, people will still have a bunch of older documents they will want to read.

            • zepto 5 years ago

              This is mostly accurate, however it is an idealization to suggest that you can accept and not corrupt all of the ill-formed documents.

              This is in principle impossible.

              If it turns out that there is a widespread problem with many kinds of document that no longer work properly, I’ll be more inclined to say Apple is at fault.

              But as it stands, it just seems like ABBYY has been producing bad PDFs.

          • saagarjha 5 years ago

            I’m sure Apple had much more time with ABBYY than ABBYY had with Big Sur ;)

      • jfim 5 years ago

        > It really isn’t Apple’s fault if someone else is producing bad files that they happened to previously tolerate, especially if that somebody isn’t maintaining their software.

        It's not clear if the issue is caused by the PDF files produced by ABBYY, or if the issue is also present in other OCR'ed PDF files.

        In either case, Preview shouldn't silently corrupt the file. It could display an error saying that it was unable to save the file, or a warning that some data couldn't be saved, so that the user can check if the file still works for them after saving.
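        The "warn instead of silently corrupting" idea can be sketched as a round-trip check around the save step. This is only an illustration: `save` and `extract_text` are hypothetical callables, and nothing here reflects how Preview is actually implemented.

```python
def save_with_check(document, save, extract_text):
    """Save `document`, then report whether its extractable text survived.

    `save` and `extract_text` are hypothetical callables supplied by the
    caller; they stand in for whatever a real application would use.
    """
    before = extract_text(document)
    saved = save(document)
    after = extract_text(saved)
    if before != after:
        return saved, "warning: text layer changed during save"
    return saved, None


# Stand-in "documents" are plain dicts, purely for demonstration.
doc = {"pages": 1, "text": "invoice 2020"}

def faithful_save(d):
    return dict(d)

def lossy_save(d):
    return {**d, "text": "lqyrlfh 5353"}

def get_text(d):
    return d["text"]

_, ok_warning = save_with_check(doc, faithful_save, get_text)
_, bad_warning = save_with_check(doc, lossy_save, get_text)
```

        A check like this only catches the kind of loss it compares for (here, extracted text), so it is a safety net rather than a proof that the saved file is correct.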

        • zepto 5 years ago

          PDF’s are essentially a stack machine based program.

          It’s not clear what’s causing the problem, but verifying correctness of programs is not a solved computer science problem.

          • duskwuff 5 years ago

            PDF is not a stack machine. This is a common misconception.

            PostScript was a stack machine with included operators which implemented a set of vector graphics commands. PDF has a similar rendering pipeline, but it renders a set of primitives described directly in the document -- there is no virtual machine involved.

            • zepto 5 years ago

              My understanding is that PDF still has a stack machine. Just one that is much simpler.

              Having said that, PDF is vastly more complex than it was originally. It can contain forms, JavaScript, audio, video, accessibility metadata, etc.

              It’s an extremely complex format.

              • buckminster 5 years ago

                Yes, but it's easy to generate a valid pdf. You just include the features you need and ignore the rest. It's really hard to transform an arbitrary pdf because you need to handle all the features.

                Without further information this is more likely to be an Apple bug because they are doing the harder job.

                • zepto 5 years ago

                  One point: Transforming an already incorrect file into a correct one is the hardest job of all.

                  Plus some further information:

                  ABBYY has had months to support Big Sur but still hasn’t.

                  That makes it seem like they either don’t care about supporting MacOS users, or they have a very low engineering budget.

                  Either way, that makes it seem more likely to me that they have a bug, than Apple.

                  But, regardless of the balance of probabilities, the point is that the original piece blames Apple entirely.

                  And yet nobody here really seems to dispute the idea that ABBYY, a professional PDF generator, could be producing a buggy file.

      • ihuman 5 years ago

        > They are mad that the pdf’s their scanner produces don’t work properly

        The PDF that FineReader made worked properly. The one that Preview made didn't.

        • dkonofalski 5 years ago

          That doesn't mean it's a valid PDF. Preview has error correction to try and recover document data when metadata is corrupt. It's entirely possible that the documents have always had corrupt metadata and that Preview now expects stricter PDF guidelines.

        • zepto 5 years ago

          How do you know it worked properly?

          • ihuman 5 years ago

            The first paragraph and image in the article. When the author copied the text from FineReader's PDF and pasted it in TextEdit, it had the correct text. When the author tried that with preview's pdf, it pasted garbled data.

            • zepto 5 years ago

              That doesn’t mean there wasn’t file corruption.

              • ihuman 5 years ago

                I never said anything about file corruption. You asked how I know FineReader worked properly, and I explained why the article showed it working properly, and that preview doesn't work properly anymore.

                • zepto 5 years ago

                  File corruption would be a way it would not be working properly that wouldn’t be revealed by what you did.

  • matrixagentOP 5 years ago

    Read it again. This bug could hit you without ever using ABBYY yourself. Apple broke Preview.

    • zepto 5 years ago

      I read it. It doesn’t show that Apple broke Preview. It says Preview stopped being compatible with PDF’s produced by ABBYY.

      Those PDFs could have been buggy all along, and only now be showing up due to improvements in preview.

      It’s possible that Apple broke Preview, but having seen how poorly maintained ABBYY is, I wouldn’t be surprised if it was producing malformed PDFs that just happened to work on older versions of Preview.

      • matrixagentOP 5 years ago

        I honestly don't care who exactly is at fault. I definitely blame Apple for destroying the PDF. There might be plenty of blame left for ABBYY for creating a bad PDF in the first place. That does not change how utterly unacceptable Preview.app's behavior here is. If you honestly think there is no valid criticism for Apple here, I don't know what to tell you. In any case, the whole thing is annoying technology – and seeing some of the comments here, people should more often try to step out of their own mind bubble and try to look at these things from their grandparents' or a similar perspective. I personally am very aware of plenty of ways to work around this issue. But Apple used to be the company you could use if you either are not or don't want to be concerned about things like that.

        And a quick edit: I really dislike people who discuss like you do – all of what I just wrote is already stated in the post itself. Another commenter even called you out for ignoring that part when quoting me. I should not have wasted five minutes spelling it out for you again, your mind won't be changed anyway.

        • zepto 5 years ago

          With all due respect, if you had led with your “footnote”, you wouldn’t have even been able to write such an angry sounding piece. Also, you say that you don’t care who is at fault, but that isn’t how the piece came off. It’s odd you’d say that now.

          I quoted you because you hadn’t incorporated this key information into the main text.

          What I think is that there are many ill-formed PDFs out there and that supporting the intersection of all of them is essentially impossible.

          I also think that it’s every bit the responsibility of a company like ABBYY to generate good PDFs. How can it not be? Relying on Preview to be forgiving when you are a maker of PDF generation software is obviously irresponsible.

          Why do you think ABBYY didn’t announce that this problem existed when the Big Sur betas were available for them to test with?

          For what it’s worth, I stopped recommending Fujitsu scanners years ago. For a while I loved mine, but none of the software was well maintained.

    • masswerk 5 years ago

      "Broke" may be a bit harsh. What appears to be happening is that Preview somehow loses or corrupts the `ToUnicode` map, which is apparently located in the metadata, when saving the PDF. Mind that every application will have to reassemble/reflow the metadata when saving a reflowed document (like after cropping and/or discarding pages). To do so, the application has to interpret and reassemble the metadata before writing it back.

      Now, some algorithms and routines may be more robust and allowing than others. Maybe, an innocent refactoring attempt just lost that critical bit of robustness, required to deal with that particular format produced by this particular application.

      For example, consider an XML-based format, where a particular application delivered a malformed document, like a missing closing quote for some attribute. Most XML interpreters will churn happily along with this, but, after a rewrite of some routine, an application just ignores the malformed tag with the runaway string. Did it break XML? Or did it just fail to interpret a malformed document, it had somehow been able to deal with thanks to some extra robustness present in the previous version?

      Considering this hypothetical case: Should that application be improved by an update to regain its previous robustness? Yes, absolutely. Is it a bug and is the vendor to blame? Probably not. Mind that this might be quite well what is happening here, as well.

      • matrixagentOP 5 years ago

        I think you might interpret the word more harshly than I intend it to be. Something that worked before Big Sur is not working in Big Sur. It's broken. That's all there is to it for me as an end user. Of course there is a difference in reasons for things breaking, and I'm not denying that Apple might have "improved" Preview.app when you judge it by how well it adheres to a PDF spec. But it still "broke" a very common and normal workflow by doing so, and I would at least expect them to acknowledge that. If they decide breaking this is worth it, that's absolutely fine with me. But I don't think they are even aware of it, and that is not something I'm willing to accept from a company with the reach and resources of Apple.

        • zepto 5 years ago

          Are you happy to accept corrupt PDFs being generated by ABBYY, a company that does nothing but write software to produce PDFs and yet hasn’t even commented, let alone maintained their software?

          Do you think ABBYY is even aware of the issue?

          It seems like you expect Apple to test everyone else’s software, and make workarounds for their bugs.

          Regardless of Apple’s scale and resources, this is obviously unreasonable.

          Also you say: “That's all there is to it for me as an end user” as if you didn’t understand anything about the complexities of software development, but reading your comments on other topics, you are obviously skilled in the art. You are not ‘just an end user’ who doesn’t understand the complexities.

    • r00fus 5 years ago

      So is there an active Preview corruption example that doesn't involve ABBYY? I've used FineReader before for a commercial effort, I do remember it being very finicky.

      • apocalyptic0n3 5 years ago

        It's unrelated to OCR, but there have been other Preview issues. We ran into an issue a few years ago where saving a PDF with forms in Preview would set some style setting so that in most other readers, the form fields would have no background color and use white text, making them unreadable. They were still perfectly readable in Preview, but had the style issues in Chrome, Firefox, Acrobat (and Reader), and Foxit. We have a project where we programmatically fill in PDF forms using PDFtk and one day, our editor just started spitting out empty PDFs. After troubleshooting, we traced it back to the style changes Preview was making after another dev had accidentally done a CMD+S on the template file and committed the template.

        In short: NEVER save a PDF with Preview. You should probably just avoid opening it in Preview period, frankly.

      • matrixagentOP 5 years ago

        I personally don't use anything else, but when the problem first occurred a few years ago, it was not limited to PDFs from ABBYY. (Which is not to say that it's purely Preview.app's fault. Maybe all of these PDFs were created in a bad way, would not surprise me at all. Could very well be that Preview.app is actually "improving" and fixing old bugs/cruft, breaking things that worked before but never should have in the first place. As the end user that doesn't really matter for me though, as I said in the post itself.)

        • zepto 5 years ago

          > Could very well be that Preview.app is actually "improving" and fixing old bugs/cruft, breaking things that worked before but never should have in the first place.

          Exactly this.

          So the question is, whose responsibility is it? Apple’s to magically support the intersection of all the broken software?

          You essentially have argued that ABBYY is popular enough that Apple should have tested it.

          Maybe it is popular, but the implication is that Apple would need to regression test against all this popular PDF generating software for any change to the preview engine, since they wouldn’t be able to know for sure what software’s PDFs would be broken by conforming changes.

          What we know they did, was to make a copy of Big Sur available for ABBYY to use to test their own software. That’s pretty standard practice in a case like this and is them behaving responsibly.

          If Preview really was at fault, ABBYY could have raised the issue with Apple, and/or put a warning in their own software. If it’s that popular, you’d think they would have an incentive to do this.

          What isn’t obvious is that Apple should somehow introduce workarounds every time a third party doesn’t fix a bug.

    • ehutch79 5 years ago

      but the file was stuff generated with abbyy, even if you give it to someone else.

  • AlexandrB 5 years ago

    Would be interesting to see if Preview is stripping OCRd text from PDFs not created by ABBYY FineReader.

  • db48x 5 years ago

    Are you arguing that ABBYY Finereader is going to produce different PDF files once they support the new OS? Possibly they will, if only to work around this obvious bug in Preview.

    • zepto 5 years ago

      I’m arguing that yes, they will produce different PDF files.

      It’s not obvious the bug is in Preview. The bug could easily be in ABBYY’s PDF generation code.

      To be fair, I’m not arguing that this is the case. I’m arguing that it just as easily could be as there being a bug in Preview, which is also possible.

userbinator 5 years ago

I remember many years ago distributing PDFs as part of course material, that Adobe's official reader would open just fine, but Mac's built-in one wouldn't (and simply fail with a useless "an error occurred" message.) Only a small subset of the class was using Macs and the built-in reader, so it took a while to discover. The problem eventually turned out to be some oddity in the way it treats whitespace[1], that Adobe and a few other readers were perfectly fine with, but not Preview.

[1] PDF is one of the strangest file formats I've worked with. It is a bizarre mix of binary and text, and some of the other design decisions are also perplexing.

  • rubyn00bie 5 years ago

    > PDF is one of the strangest file formats I've worked with.

    Do you by chance have a "definitely strangest" file formats? Just curious if something out there is vastly weirder, or more perplexing, than PDFs?

    • agersant 5 years ago

      I haven't worked with it myself but I heard Photoshop's PSD format is a good candidate.

      • Someone 5 years ago

        https://github.com/gco/xee/blob/4fa3a6d609dd72b8493e52a68f31...:

        “At this point, I'd like to take a moment to speak to you about the Adobe PSD format. PSD is not a good format. PSD is not even a bad format. Calling it such would be an insult to other bad formats, such as PCX or JPEG. No, PSD is an abysmal format.”

        (Followed by the real rant)

        I’m not sure it’s worse than PDF, though. PDF started life as a text-based postscript replacement that made it possible to index into postscript files, allowing one, for example, to render page 214 of a file without rendering the previous 213.
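
The index that makes this random access possible is the cross-reference table. A reader locates it from a `startxref` keyword followed by a byte offset near the end of the file. A minimal Python sketch, assuming a well-formed trailer within the last 1024 bytes; the helper name `find_xref_offset` and the toy bytes are hypothetical, and real files (incremental updates, cross-reference streams) need more care.

```python
import re


def find_xref_offset(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset recorded after the last 'startxref' keyword."""
    tail = data[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref marker found near end of data")
    # With several markers present, the last one is the one to follow.
    return int(matches[-1])


# Build a toy file whose 'xref' keyword starts exactly where the body ends.
body = b"%PDF-1.4\n...objects...\n"
xref_section = b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\n"
toy = body + xref_section + b"startxref\n" + str(len(body)).encode() + b"\n%%EOF\n"

offset = find_xref_offset(toy)
print(offset, toy[offset:offset + 4])
```

The reader jumps straight to that offset instead of scanning the whole file, which is how page 214 can be rendered without touching the previous 213.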

        When files grew too large, they added compression. Initially, that was base-85 (https://en.wikipedia.org/wiki/Ascii85), keeping the file text-based, but other compressors were added so that, now, the file is binary. They kept the index and all metadata in uncompressed ASCII, though.
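
One nuance worth showing: Ascii85 on its own is an encoding rather than a compressor. It maps every 4 bytes to 5 printable characters so that binary data can sit inside a text file, which makes the data about 25% larger. A small Python sketch using the standard library's `base64.a85encode`; the sample payload is arbitrary.

```python
import base64

# 16 arbitrary bytes, i.e. four 4-byte groups.
payload = bytes(range(16))

encoded = base64.a85encode(payload)
decoded = base64.a85decode(encoded)

print(encoded.decode("ascii"))
print(len(payload), "bytes in ->", len(encoded), "characters out")
```

The round trip is lossless, and every output character is printable ASCII, which is what kept early files text-based.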

        Then, they layered zillions of different features (3D graphics, forms, JavaScript, font embedding, etc) on top of it.

    • maximilianburke 5 years ago

      Yes, Adobe's PSD is definitely more weird and perplexing than PDF.

unfocused 5 years ago

I think the HN crowd has forgotten that the entire legal system uses PDFs, and in addition uses the redaction features of the likes of Adobe Acrobat, as well as others trying to squeeze in like FoxIT.

Redaction is huge in governments that have gone digital. Gone are the days where you print the paper, black it out, and then photocopy it.

I have worked with PDFs for a long time, and if you ever wanted compatibility, you had to use Adobe Pro, since there were so many bad PDFs with weird embedded stuff that only Adobe could read properly... because they were initially created in Adobe. Sigh.

All other products try to catch up, but they can't clean up the mess that Adobe has left behind.

mhh__ 5 years ago

Preview seems like a good example of something that's worth open sourcing. Not only will people end up doing work for you, you get eyes on the code and more direct issue tracking.

Consumers get a product and they still have to go on Mac to use it.

  • bigbubba 5 years ago

    I've been looking for a FOSS desktop agnostic universal file previewer or thumbnail generator for a while now; if anybody has suggestions I'd love to hear them. Ffmpegthumbnailer for video thumbnails or imagemagick for image thumbnails are fine, but what about previews for things like ebooks or PDFs? Something that provides a one-stop-shop for as many common filetypes as possible is what I'm looking for.

    My current solution is controlling a floating mpv window to open image, video or audio files as they are selected. This works well for A/V but not so well with other sorts of documents.

    • hydrox24 5 years ago

      > but what about previews for things like ebooks or PDFs

      MuPDF is a great FOSS application and my go-to PDF reader. It lacks fancy annotation, and doesn't even have great text selection and copy/paste, but it is really fast, and has fast search, manipulation, etc.

      https://mupdf.com/

  • duskwuff 5 years ago

    For what it's worth, Preview is a relatively thin shell around Apple's own PDFKit:

    https://developer.apple.com/documentation/pdfkit

    Whether that could itself be open-sourced is an interesting question. (My concern would be that parts of it might be covered by Adobe NDAs.)

  • arvindamirtaa 5 years ago

    >Consumers get a product and they still have to go on Mac to use it.

    There will be ports to windows and linux in under a month.

    • mhh__ 5 years ago

      How? I assume it's mostly using proprietary macOS APIs, and, ignoring that it doesn't really matter beyond Apple being an abusive partner, lawyers are a powerful tool. (This software is under the Apple don't-take-the-piss licence.)

      • arvindamirtaa 5 years ago

        Assuming the License is that restrictive, only those who are financially invested in Preview working well (and only on macOS) will have any incentive to contribute code.

        I can't think of any company other than Apple that fits this description.

        • mhh__ 5 years ago

          Apple fanboys, and also having the code be public puts pressure on you for it to be good.

    • djxfade 5 years ago

      I wouldn't be too sure. Apple's applications usually rely heavily on proprietary Cocoa APIs not available anywhere else.

    • yakubin 5 years ago

      There already is an equivalent feature in GNOME. Preview's code is of little value outside of Mac. It's also not rocket science; lack of access to the original source code is not what stops other people from doing the same thing. :)

fastball 5 years ago

From what I can tell, there is no reason you can't just run the PDF through ABBYY FineReader again and get the exact same OCR you got the first time, so I think "irreversible" is a bit over-the-top.

Is it as easy as CMD+Z? No. Is it data you can never get back? Also no.

  • matrixagentOP 5 years ago

    In theory that is probably true – in my actual scenario I can't run them through ABBYY again because of the limitations of the bundled version. It only accepts PDFs coming from the scanner software, so running these through ABBYY again would give me an error message. I'd have to buy the full version to be able to try out that workaround.

    • non-nil 5 years ago

      On a totally not entirely unrelated note, I have found ExifTool[0] to be quite useful for many tasks, especially in combination with a bash alias or a simple Automator action, to be used in the services menu, or as a droplet or folder action.

      [0] https://exiftool.org/TagNames/PDF.html

      • matrixagentOP 5 years ago

        That's exactly what I'm planning to explore as a workaround to remove the blank pages until ABBYY and/or Apple sort this out. :)

cprecioso 5 years ago

This happened to me in Catalina as well. This summer I was preparing the paper proceedings for a conference, which were made with InDesign. I had to remove a couple of pages from the output, did so with Preview, and from then on, the text was garbled on copy-pasting. Had to switch to using Acrobat for that step.

juskrey 5 years ago

Preview for PDF manipulation was a nice try at first, until I realized I suddenly had unexpected problems with produced docs, trouble with drag-and-drop, overwritten files, etc.

Now I am using PDFGenius and never looking back.

e40 5 years ago

Let's be real. Every single macOS release, until it reaches x.y.4 or x.y.5 is just in beta and you are the tester.

I upgraded to Catalina when it hit 10.15.6, and I watched for the year since the release all the comments and posts about the horrible things it was doing to their computer, files, apps, etc.

Apple supports the latest 2 versions of macOS. Always be on the "previous" one is my advice. Since my family and friends started following it, they are much happier and more productive.

Let the masses beta test.

  • fastball 5 years ago

    Is that not like, every piece of software ever?

    I don't know very many pieces of widely used / actively developed software that stayed static on X.0.0 for more than a couple weeks after release or so.

    • e40 5 years ago

      No, it's not. I knew I'd get downvotes. Don't mind. I don't say this about macOS lightly. I've been using it since 10.0.

  • krull10 5 years ago

    Seconded for macOS. I usually update when the next major release is about to be announced. By 11 months they usually finally have the bugs worked out. It isn’t always necessary for every yearly release, but once you’ve been burned a few times you learn it’s better to wait for several point releases...

ehutch79 5 years ago

Apple has a lot of shit they need to fix in macOS and the accompanying apps.

That said, the author of this article is clearly an ass, and I have a hard time being sympathetic.

Assuming the PDF is actually in spec, which it's probably not, this shouldn't be happening. That said, if the third-party app vendor says the PDFs they generate are broken in Big Sur, that should tell you they may be broken in other places as well, and it's probably not Apple's issue.

  • matrixagentOP 5 years ago

    Could you explain why or how exactly I'm an ass?

    • ehutch79 5 years ago

      To quote:

      """But Apple didn’t tell me that I can’t upgrade to Big Sur when I use ABBYY"""

      • matrixagentOP 5 years ago

        This is at least the second time that Apple breaks Preview in this way. I outlined why this bug is something a lot of users could run into, and I even mention that I'm not necessarily holding it against them that it breaks, but specifically that they don't even care to mention it breaking. After the first time, which was a big deal at the time, they should have put safeguards in place so that they could notice it happening. If the fault is with ABBYY, then I'm okay with Apple breaking it for the sake of internally improving Preview, and blame for the breakage lies with ABBYY. But I expect Apple to tell me – because it happened before. That blame lies with Apple.

        That is my reasoning, and I don't think that's too high a bar for one of the richest companies on the planet, priding themselves in the details and "it just works". You don't have to agree with that, obviously. Calling me an ass for that is extremely rude and uncalled for, though. I was under the impression that this tone was actually not acceptable here.

cosmotic 5 years ago

The text corruption doesn't appear to be random. The same word gets converted to the same corruption. It's more likely an encoding/decoding bug.
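
As a generic illustration of why an encoding/decoding mismatch produces repeatable rather than random garbage, here is a Python sketch. It uses a UTF-8 versus Latin-1 mix-up purely as a stand-in; it is not a claim about what Preview or ABBYY actually do, and the names `garble` and `ungarble` are made up.

```python
def garble(text: str) -> str:
    """Encode with one codec and decode with another (a classic mismatch)."""
    return text.encode("utf-8").decode("latin-1")


def ungarble(text: str) -> str:
    """Reverse the mismatch, which is possible because nothing was lost."""
    return text.encode("latin-1").decode("utf-8")


word = "Grüße"
first = garble(word)
second = garble(word)

print(repr(first))
print(first == second)          # same input, same corruption every time
print(ungarble(first) == word)  # and the damage is reversible
```

If the corrupted text in the screenshots behaves like this, the original characters may still be recoverable from the saved file.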

dev_tty01 5 years ago

Preview used to be solid, but it has been increasingly fragile in recent years. I found PDF Expert to be a great replacement. I have no affiliation.

nerpderp82 5 years ago

> You have to completely close the file and reopen it, only then will you realize that it has been destroyed.

Someone 5 years ago

At first glance, it’s a substitution cipher. Every ‘a’ becomes a filled square, every ‘b’ a ‘p’, every ‘c’ a ‘(’, every ‘d’ a ‘)’, etc.

However, there are exceptions, for example the first ‘b’ on line 10. It becomes an ‘ä’ on line 21. I guess that’s because that is bold text, and thus a different font.
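
A rough Python sketch of how one could test this hypothesis: align a known original string with its corrupted counterpart, build the character table, and report positions that contradict it (the candidates for "different font"). The function names and sample strings are hypothetical; the samples only reuse the letter mappings named above and are not taken from an affected file.

```python
def infer_substitution(original: str, corrupted: str):
    """Build a char-to-char table and list positions that contradict it."""
    if len(original) != len(corrupted):
        raise ValueError("strings must be aligned character by character")
    table = {}
    conflicts = []
    for pos, (src, dst) in enumerate(zip(original, corrupted)):
        if src in table and table[src] != dst:
            # Same source character, different output: maybe another font.
            conflicts.append((pos, src, table[src], dst))
        else:
            table.setdefault(src, dst)
    return table, conflicts


def decode(corrupted: str, table: dict) -> str:
    """Apply the inverse table; characters without a mapping pass through."""
    inverse = {dst: src for src, dst in table.items()}
    return "".join(inverse.get(ch, ch) for ch in corrupted)


table, conflicts = infer_substitution("abcd bad", "■p() p■)")
print(table, conflicts)
print(decode("■p() p■)", table))

# A second sample where the last 'b' maps differently, as with the bold text.
_, bold_conflicts = infer_substitution("abab", "■p■ä")
print(bold_conflicts)
```

Keeping one table per font would be the natural next step if the conflicts line up with bold or italic runs.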

rubatuga 5 years ago

Once again, the Hacker News comments prove to be more useful and insightful than the article itself.

kekeblom 5 years ago

I had an issue recently where the form contents filled and saved with Preview.app would not show up in acrobat reader. I've encountered this in two cases so far, with two completely different documents.

qwerty456127 5 years ago

I have encountered too many PDFs (mostly digital originals rather than OCRed scans) corrupted this way in recent months. Now I see why...

skissane 5 years ago

I hate Preview's PDF editing features, I wish there was a way to turn them off.

I'm the kind of person who tends to randomly click on things as I read them. In other PDF readers, this is quite harmless. In Preview, it starts editing the PDF. 99.9% of the time I have zero interest in editing or annotating the PDF I am reading. And then when I quit it asks me if I want to save a copy. I never wanted to change it to begin with!

(Maybe it is time I found another PDF reader...)

jordache 5 years ago

Anyone else not able to see sufficient detail in the tiny screenshots? What was the difference?

lisper 5 years ago

Using Apple devices in general seems like a total crap shoot to me nowadays because of the impossibility of down-grading the OS. Every "upgrade" comes with a considerable risk that something that had been working will stop working, and if that happens, you are pretty much SOL.

  • fastball 5 years ago

    What? You can definitely downgrade to an earlier MacOS.

    It's not a one-click downgrade like the upgrade is, but I don't know of any OS with that feature.

    • lisper 5 years ago

      > You can definitely downgrade to an earlier MacOS.

      Sometimes you can, sometimes you can't. Going from Mavericks to Yosemite for example is one-way because it includes a non-backwards-compatible firmware update. Going to Catalina is also one-way because it changes the file system from HFS+ to APFS.

      And iOS is famously non-downgradable.

    • rbanffy 5 years ago

      > What? You can definitely downgrade to an earlier MacOS.

      Unless they got a brand-new M1-based Mac. Macs usually don't install versions of macOS prior to their launches.

0000011111 5 years ago

Use "Adobe Acrobat Reader DC" for pdf work on macOS v11.1

nt2h9uh238h 5 years ago

Is this German?

anonuser123456 5 years ago

Time machine?

  • dewey 5 years ago

    Backups are always great, but if something is broken silently behind your back and you only realize in a few years that your archived documents are not searchable any more that makes it harder to recover.

beamatronic 5 years ago

Preview should not change the file on disk. I would expect it to open the original file as read-only.

  • blacksmith_tb 5 years ago

    Yes, the author says it's "the result after modifying (removed a blank page) and saving that same PDF in Preview." So it's not enough to just view the file in Preview.app I take it, but you need to save it out (which still shouldn't strip anything extra, obviously, but is not what I thought was being claimed).

    • birdyrooster 5 years ago

      So you are saying that they overwrote their file and are upset that the file they overwrote is different from the new file? This is insanity. Clearly a bug in ABBYY that it can’t read a PDF saved in the standard spec.

      PDF is not a bitmap, it’s a script like HTML or JS. People understand browser incompatibility but somehow this is unconscionable.

      • fastball 5 years ago

        Other possibility is that Apple isn't saving to spec.

        Or even more accurately, that neither are using the same spec because PDF just isn't that standardized.

  • throwaway744678 5 years ago

    I understand it does not: the issue occurs when the user removes another (blank) page, then saves the file.

  • MrBuddyCasino 5 years ago

    > In the lower half is the result after modifying (removed a blank page) and saving that same PDF in Preview.

    I don't think this means Preview changes the files just by opening them.

YetAnotherNick 5 years ago

PDFs are not intended to be modified. Preview and other readers use hacks to do the work. In general don't modify the PDF and if you really want to do it buy Acrobat reader.

sn41 5 years ago

There was something in macos Catalina that broke mupdf on my macbook pro. The view would occupy the lower left corner of the window, and something was clipping the view to the lower quadrant.

I tried installing from source, changing the gl library etc. But it was the same.

Am done with Apple for now. M1 is a bit tempting, but I guess I will wait for the technology to mature, buy a Macbook Air, and run Linux on it.

  • ehutch79 5 years ago

    Why would installing from source change things? Without finding/fixing the bug, you're just using the same compiled code as before.

    • sn41 5 years ago

      I was trying to avoid library incompatibilities. I pulled everything from the repository and recompiled with the latest libraries. I also tried a couple of different libraries. I gave up after a week or so. (What I did not do was compile the libraries from source as well.)

      I really like mupdf so it was a big nuisance for me to lose that.
