I read your PDF

In 1944, inspired by Rommel’s book, George Patton embarked on his own French adventure. Hollywood responsibly supported the national war spirit in various ways. The WW2 defense effort was bolstered by high-quality paper drawings, statistical process control, and strict enforcement of industry standards.

In 1984, Ronald Reagan won in a landslide. Tom Clancy published “The Hunt for Red October,” and Orwell’s nightmare was brought to the screen, each demonstrating the value of information in their own fashion. Adobe released PostScript, making printing both author-controlled and device-agnostic.

In 1994, the world was supposedly getting “flat,” while Hollywood was definitely shifting left. Adobe had released PDF, a descendant of PostScript, just a year earlier. PDF guaranteed consistent, author-controlled, and device-agnostic client-side viewing and printing, with secure signing added in later versions. Around the same time, S1000D emerged to enforce rules for technical documentation layout.

In 2004, the USA was deeply engaged in the Global War on Terror. Hollywood’s role in generating partisan consent began to overshadow its original entertainment purpose. The concept of the digital thread started to spread through the engineering and manufacturing domains, encompassing CAD, simulations, and requirements. Meanwhile, PDF continued to conquer the world.

PDF became an ISO standard (ISO 32000) in 2008, and Adobe opened its API to external developers. Subsequently, PDF evolved into the default method for information presentation and distribution, especially for legally binding documents.
PDF/A was introduced to address PDF dependencies on potentially unstable external features. Combined with the PDF Document Catalog’s hierarchical embedding of a wide variety of documents (such as STEP/JT, MS Office, etc.), PDF/A is arguably the leading tool for long-term data archiving. Today, the open-source veraPDF project validates PDF/A files for compliance at any level.
3D PDF was born to facilitate the easy read-only sharing of 3D MBD data. With native support in Adobe Acrobat and enthusiastic adoption by the DoD, its success seemed assured. Unfortunately, for reasons too lengthy to cover here, it did not live up to its promise and is now largely abandoned by all but its most devoted proponents.

It is November 2024, and I suddenly feel quite optimistic about many topics. Hollywood’s very existence is under threat. In the ongoing quest to connect the engineering and manufacturing puzzle to the digital thread, the PDF consumption process might be up for disruption. In one such case, recipients either print PDFs authored in various PLM ecosystems onto paper (sic!) or manually retype or copy-paste their content into MES. They do this because PDF descends from a 40-year-old page-description architecture that largely records where glyphs and lines sit on a page rather than what the content means, so PDF data cannot be readily extracted into JSON or XML. The problem affects PDFs created from authoring tools like MS Word through an API, as well as those created from scans.
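
To make the extraction problem concrete, here is a minimal sketch in Python. It assumes a PDF text extractor has already handed us positioned fragments as `(x, y, text)` tuples; the sample data, the function names, and the `y_tolerance` value are all hypothetical and not taken from any particular library. The point is that the row and column structure has to be inferred from coordinates before anything can be written out as JSON.

```python
import json

# Hypothetical positioned fragments, as a PDF text extractor might report them:
# (x, y, text), with y measured upward from the bottom of the page.
# Note that the order in the file need not match the reading order.
fragments = [
    (300.0, 700.0, "Torque"),
    (72.0, 700.0, "Step"),
    (72.0, 680.0, "1"),
    (300.0, 680.0, "25 Nm"),
]


def rows_from_fragments(fragments, y_tolerance=2.0):
    """Group fragments into rows by similar y, then order each row left to right."""
    rows = []
    # Walk the fragments from the top of the page to the bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, sort cells by their x coordinate.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def table_to_records(rows):
    """Treat the first row as a header and the remaining rows as records."""
    header, *body = rows
    return [dict(zip(header, values)) for values in body]


rows = rows_from_fragments(fragments)
print(json.dumps(table_to_records(rows)))
```

Real documents are far messier than this toy table (merged cells, multi-column text, rotated labels, scanned pages with no text layer at all), which is why a simple geometric heuristic like this one breaks down quickly.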

Patton’s exploits in France in 1944 originated from British conceptual musings circa 1924, German and Russian experiments around 1934, and the subsequent German blitzkrieg successes. He hardly invented anything; instead, he was able to orchestrate the already matured stack of technologies, battlefield techniques, and the overwhelming American industrial and logistical advantages in the most creative and consistent manner. We can learn a lot from Patton as we think about the next phase of the industrial revolution.

We still expect to see PDF being used on a grand scale in the MES/MRO domain in 2034, as it will remain extremely cost-efficient, especially in the context of AI. The Senticore team has experimented extensively with several LLMs and a number of public GitHub projects to extract data from PDFs, and we would like to share our conclusions.

So far, we haven’t seen a fully comprehensive and reliable solution for PDF consumption in the MES/MRO domain. The latest research at Google, Microsoft, and several prominent startups seems to pragmatically concentrate on relatively structured data mixes, such as invoices.
Unless there is a qualitative leap with AI, our own concept is to keep humans in the loop as a standard feature, steadily shifting the split between generative AI-led automation and manual work from 20:80 to 80:20. This approach allows us to teach the system to process text, tables, diagrams, and drawings. In a sense, a fusion of generative AI, neural networks, and the right IDE works like Patton’s combined arms warfare against the German lines: it breaks through PDF constraints, identifies data types, and allows other algorithmic tools to come into play and extract them correctly into JSON or XML.
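
The human-in-the-loop idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of our actual system: the `ExtractedItem` type, the confidence scores, the sample payloads, and the 0.9 threshold are all hypothetical. Items the automated step is confident about go straight to JSON, the rest are queued for a human reviewer, and the share handled automatically is the ratio we want to move from 20:80 toward 80:20.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ExtractedItem:
    kind: str          # e.g. "text", "table", "diagram", "drawing"
    payload: dict      # the extracted content (hypothetical shape)
    confidence: float  # 0.0 to 1.0, a hypothetical score from the automated step


def route(items, threshold=0.9):
    """Split items into those accepted automatically and those sent to a human."""
    auto, review = [], []
    for item in items:
        (auto if item.confidence >= threshold else review).append(item)
    return auto, review


def automation_ratio(auto, review):
    """Share of items handled without human review (0.0 when there are no items)."""
    total = len(auto) + len(review)
    return len(auto) / total if total else 0.0


items = [
    ExtractedItem("text", {"value": "Replace seal"}, 0.97),
    ExtractedItem("table", {"rows": 12}, 0.95),
    ExtractedItem("diagram", {"label": "Fig. 3"}, 0.55),
    ExtractedItem("drawing", {"sheet": 1}, 0.40),
    ExtractedItem("text", {"value": "Torque 25 Nm"}, 0.92),
]

auto, review = route(items)
print(json.dumps([asdict(item) for item in auto], indent=2))
print(f"automated: {automation_ratio(auto, review):.0%}, for review: {len(review)} items")
```

In a sketch like this, the reviewer's corrections would be the training signal: each confirmed or corrected item is what lets the threshold be relaxed over time without sacrificing reliability.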

Feeling exhausted from plowing through the avalanche of inbound PDF files? Would you like to integrate the engineering and manufacturing data trapped inside these files into your digital thread ecosystem reliably and at a reasonable cost? Talk to us; like General Patton, we may have a solution for you.