Why do we convert structured data to PDFs?
Company A has structured data. They input this into a PDF (making it unstructured) and send it to Company B. Company B now has to use PDF parsing software to turn it back into structured data.
Why? Back in the day, company A would send a paper document to company B and naturally somebody would have to retype it. PDF is great for that legacy workflow, or for anything where you need print output, or screen output that exactly resembles print output.

PDF has facilities for tagging documents so that they can be reflowed like HTML and viewed on different-sized screens. It is a boon for accessibility, but framing the discussion around accessibility, as opposed to a better experience for everyone (particularly automated tools), is hard. (In politics there is the analogy of how we "can't have good things" because policies that are good for everyone get framed as policies that benefit a racial or other group perceived as a "special interest.")

I spoke w/ Larry Masinter at Adobe and he told me Adobe would like people who want structured data in their PDF documents to simply attach files to the PDF. A scientific paper could contain a CSV file of the data, for instance, or a business document could contain a JSON or XML document.

Note that "structured" is not a panacea, because the structure might not be the same in the two organizations. For exchange of structured data to take place, the organizations have to agree on some ontology. That happens in some industries some of the time, but it isn't free, and when it is not in place people still have an excuse to continue using paper processes, or processes that emulate paper processes.

Thanks for responding. I'm curious why PDF doesn't have any metadata attached to it that can easily be parsed out by machines.

Sigh. See https://en.wikipedia.org/wiki/Extensible_Metadata_Platform and https://ontology2.com/essays/LookingForMetadataInAllTheWrong...

Thanks for sharing! Why do you think XMP isn't widely adopted yet?

There has been a lot of politics. It's yet another case study for "why we can't have nice things." When XMP first came out, Adobe tools would look at all the metadata in, say, an image file (such as EXIF) and re-express it in XMP format. I liked that a lot, because I could read that XMP packet with my RDF tools and have complete access to all the metadata with very simple software. At some point other people in the industry accused Adobe of undermining other metadata standards, and Adobe was pressured to use XMP only for data that could not be expressed with EXIF and other formats. That takes away complete and easy-to-work-with metadata, unless I write my own tools that convert the EXIF metadata to XMP and merge it with whatever XMP might already be in the document.

The semantic web community also bears some blame here, as it never embraced XMP; if Adobe had had more industry support, it might not have nerfed XMP. I very much like how XMP adopted solutions to problems like keeping track of the order of authors, which communities like the one behind Dublin Core haven't had the moral fortitude to address. That keeps Dublin Core in the category of "metadata for an elementary school library" rather than the world-beating solution that XMP and DC together could have been.

You might like this thesis: http://www.bloechle.ch/jean-luc/pub/Bloechle_Thesis.pdf

I made a HN post on this here: https://news.ycombinator.com/item?id=33674525

Unfortunately I contacted the author via YouTube and the work is proprietary, owned by the business he either created or sold to.

Thanks for sharing -- will dive deeper. This has been keeping me up at night recently...
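A minimal sketch of the "attach the structured data to the PDF" suggestion above, assuming the pypdf library; the file names here are hypothetical:

```python
# Sketch: embed a machine-readable copy of the data alongside the human-readable pages.
# Assumes `pip install pypdf`; "report.pdf" and "report_data.csv" are made-up names.
from pypdf import PdfReader, PdfWriter

writer = PdfWriter()
writer.append(PdfReader("report.pdf"))        # keep the pages people read

with open("report_data.csv", "rb") as f:      # attach the data machines should read
    writer.add_attachment("report_data.csv", f.read())

with open("report_with_data.pdf", "wb") as f:
    writer.write(f)
```

On the receiving side, newer pypdf versions expose embedded files on the reader, so Company B could pull the CSV back out instead of reconstructing it from the page layout.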
Many reasons. In finance, PDF reports are passed between companies instead of JSON/XML, etc., because:

1. PDF is considered tamper-proof. Obviously not true, but legal is OK with that.
2. PDF can be reviewed quickly by non-technical folks, and then parsed and stored into databases.
3. PDF is a flat file that can be archived easily per legal requirements; other formats such as Word documents are used for that as well.

In a sense, PDF is what people want. Structured data is what machines want. Shouldn't both exist? I.e., PDF for the human and data for the machine?

Each report costs money. People don't want to pay extra for the same data.

But they want to pay extra for expensive OCR software to parse it out instead?

It's a one-time fee, and it's probably used in other places.

printers love pdf

There are too many variables and edge cases to parse data. Dozens of text encodings, mixed with dozens of markup languages, mixed with millions of uniquely preserved legacy datasets, result in an exponential number of edge-case requirements in which the world's data is currently stored. And when you consider the high-power companies with financial investments in legacy data, as well as high-power companies protecting the proprietary rights and trademarks of their existing formats, the world has maximum incentive to stick with the status quo: a PostScript-generated PDF which, due to its legacy, happens to lack the structure you want.

On a more philosophical level, the PDF has structure, and it is probably the most generalized structure across all domains: paragraphs of text on a page. Consider that most people barely know how to search a text file for a given word, and a minuscule percentage of those people know how to query a SQL database. People simply do not have the time or resources to learn a separate domain (data structure design and interaction) apart from their own domain. In other words, there are very few people who understand, or even have motivation to use, tools that provide exponential returns on their time (such as manipulating/filtering/working with structured data). Time passes uniformly, and you typically receive no reward other than more work for learning tools to improve your own workflow.

Software engineers have long noticed that we can successfully create "models", "view models", and "views" of data that achieve the separation of concerns you are seeking. A PDF is nothing more than a "view" of data which has passed through a professional who created a "view model" of that data (he/she decided how best to organize the data on the page), and then you read the document and "parse" the data with your intellect. There is a lot of expertise and professionalism embedded in crafting paragraphs (or other graphical representations) that you can't discredit. There are very few software options for treating generalized, domain-specific data in this three-step manner.
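A rough illustration of that model / view model / view separation (not from the thread; the invoice fields and formatting choices are invented) showing how one structured record can feed both the machine-readable exchange and the human-readable page:

```python
# Illustrative sketch of the model / view model / view split described above.
# The Invoice fields and presentation decisions are hypothetical.
from dataclasses import dataclass, asdict
import json

@dataclass
class Invoice:                 # model: the structured data itself
    number: str
    customer: str
    total_cents: int

def to_view_model(inv: Invoice) -> dict:
    # view model: presentation decisions (labels, currency formatting)
    return {
        "title": f"Invoice {inv.number}",
        "lines": [f"Customer: {inv.customer}",
                  f"Total: ${inv.total_cents / 100:.2f}"],
    }

def render_view(vm: dict) -> str:
    # view: the laid-out "page" a human reads (the role a PDF plays)
    return "\n".join([vm["title"], "-" * len(vm["title"]), *vm["lines"]])

inv = Invoice("2023-0042", "Acme Corp", 125000)
print(render_view(to_view_model(inv)))   # what the person receives
print(json.dumps(asdict(inv)))           # what the machine should receive alongside it
```

In the thread's terms, only the rendered view crosses the company boundary; the model is discarded, which is exactly why Company B ends up parsing it back out.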
Data exchange formats are a detail which can, often is, and quite frankly should be specified in partnership and/or vendor contracts. PDF-based interchange of structured data as part of an ongoing relationship ... seems to reflect poor business relationship management. (And yes, there are all manner of organisations which fail to follow good practices, whether on grounds of competence or malice, but generally this is how I'd suggest addressing the issue. I'd also strongly suggest checking to see if such a data exchange option is already available.)

Hm -- don't think this is the case for legacy industries. It's just what they're used to and have been doing for years.

I'd suggest otherwise, if only on the basis that PDF itself is a relatively recent data format. "Legacy industry" to me would be IBM mainframe data formats and the like. Which ... are their own flavour of fun.

Exactly the sales pitch of https://sento.io/. Their platform allows companies to send structured data directly to other companies (circumventing the error-prone and potentially labour-intensive structured data > PDF > structured data transformation). Also interesting as a business case: connecting one company to another requires one connector on each side, and adding another company only requires one connector (not two or three) on the new company's side, since the existing connectors into the sento.io platform remain valid. Note: I'm not affiliated; I just came across them a few months ago and this reminded me of them.

Many businesses DON'T do that, and have adopted structured data transfers. I imagine you're working in an older industry like real estate?

What is an example of this?

I don't work in real estate, but just based on the number of OCR companies doing B2B PDF data extraction, I would imagine it's still a huge problem.