The Baseline: What I Learned from 155 Documents
This is the third in a multi-part series on building a local AI knowledge engine for digital health policy. Part 1 covered the hardware and the why. Part 2 covered the architecture. This part covers what happened when I actually ran it.
There’s a version of this post I could write where everything works smoothly, the system processes all my documents in one clean pass, and I emerge with a set of elegant policy insights formatted for immediate use. That version would be shorter and more satisfying to read. It would also be a lie.
The real version involves four distinct categories of document formatting problems, a model that couldn’t count past three, a legal graph that took 14 seconds to build and will take me considerably longer to fully understand, and one document — a comparative analysis of the AI Act and the MDR — that required three separate model calls before it would co-operate. This is the real version.
154 documents, 2234 images
I started with what I’m calling the Dutch baseline: the corpus of policy documents that represent the current NL position on digital health. Kamerbrieven, implementation plans, legal frameworks, impact assessments, sector analyses, interoperability roadmaps. Things I’ve had in folders for years, in various states of organisation. I fed them all into the conversion pipeline.
The old version of this pipeline — the one I’d been using before — processed 387 PDFs and produced exactly zero extracted images. Not because the documents didn’t have images: they did, extensively. Governance diagrams, implementation timelines, three-plateau NVS architecture diagrams, progress charts. The pipeline just wasn’t configured to extract them. I had it pointed at the wrong flag.
The new pipeline processed 154 documents and extracted 2234 images. That’s an average of nearly 15 per document. Zero failures.
I mention this not to embarrass myself about the old pipeline — though I’ve had worse weeks — but because it illustrates something important about building systems for real documents rather than for demonstrations. The difference between “it works on the happy path” and “it works on your actual archive” is often one configuration flag, three months, and a certain amount of humiliation.
Four patterns, one cleaning function
When a PDF is converted to Markdown, it doesn’t always come out clean. pdfmd — the conversion tool I’m using — faithfully represents everything it finds in the PDF, including the formatting artefacts that the PDF’s original design produced. Over the course of processing 154 documents, I found four distinct failure patterns, each requiring its own fix.
Pattern A is what I call the lone asterisk problem. Some PDFs render their subtitle or caption as italic text, which pdfmd converts to *text*. When this appears as the very first character in the document, the language model enters what I can only describe as an italic-mode fugue state: it sees an opening asterisk before anything else and decides it’s in a formatting context, which means it produces styled Markdown rather than the plain JSON I’m asking for. The fix was to strip the leading asterisk from the first line only. Everything else stays.
Pattern B is the standalone double-asterisk. A certain category of Dutch policy document uses bold formatting as a structural separator rather than inline emphasis. pdfmd faithfully renders this as a line containing nothing but **, sometimes a hundred times through the document. When the language model tries to determine whether it’s inside or outside a bold span at any given point in the document, it gives up and stops producing structured output. The fix was to remove lines consisting only of **, while leaving every **term** with actual content intact.
A variant of Pattern B, discovered during the same diagnostic session, is ** **: two bold markers with a space between them. A different flavour of the same structural separator pattern. One additional regex.
Pattern C is the most interesting, because it’s caused by a feature working correctly. pdfmd extracts every page of a document as an image and includes a reference to that image in the converted Markdown. For most documents this is fine — the references are scattered through the text and don’t dominate the excerpt. But for documents where images are the primary content — highly visual policy reports, strategy documents with large infographic sections — the extraction produces 80 to 135 image references that occupy the majority of the excerpt budget I’m giving to the language model. The model receives a 9000-character window of content in which a third of the characters are image file paths. It can’t usefully process this, and says so, in its own way, by not producing JSON.
This one required genuine thought. My first instinct was to strip all image references. A better instinct was to preserve the semantic signal and strip only the noise. The image references go. The **bold term** markers that indicate which concepts the document considers important stay. The document’s structure is preserved. What’s removed is only what was invisible to a human reader anyway: a list of file paths pointing at images that have already been extracted separately and stored where they belong.
The cleaning function is now about sixty lines long and handles all four patterns, plus long URL compression (European Commission publication URLs alone can be 150 characters, and a document with 90 of them has no budget left for content). The function is called clean_for_extraction() and it runs only on the copy of the content that gets sent to the model. The stored Markdown is never touched.
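For the curious, here is a sketch of roughly what that function does. The regexes and names below are illustrative rather than the production code, and they assume pdfmd emits standard Markdown image syntax for its page references:

```python
import re

def clean_for_extraction(markdown: str) -> str:
    """Clean a *copy* of the Markdown before it goes to the model.
    The stored Markdown is never touched."""
    text = markdown

    # Pattern A: a lone italic marker as the first character of the document
    # pushes the model into formatting mode. Strip it from the first line only.
    first, sep, rest = text.partition("\n")
    text = re.sub(r"^\*(?!\*)", "", first) + sep + rest

    # Pattern B and its variant: lines that are nothing but bold markers,
    # used as structural separators. Both '**' and '** **' go.
    text = re.sub(r"^\s*\*\*(?: \*\*)?\s*$", "", text, flags=re.MULTILINE)

    # Pattern C: drop Markdown image references (the page images have already
    # been extracted and stored separately) while keeping **bold terms** intact.
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)

    # Long URL compression: keep the domain, drop the path, so a document with
    # 90 European Commission links still has excerpt budget left for content.
    text = re.sub(r"https?://([^/\s)]+)/\S{40,}", r"https://\1/…", text)

    return text
```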
The asterisk debate
The most useful conversation I had during this whole project was a short one. I was about to merge a version of clean_for_extraction() that removed all asterisks wholesale — all bold, all italic, clean prose, simple solution. The response I got was: by stripping the asterisks, don’t we lose too much context? The bold terms carry semantic signal about what the document considers important.
This is correct. A document that bolds SNOMED and Eenheid van Taal throughout is telling you, in its formatting, which concepts it considers load-bearing. A document that italicises gegevensuitwisseling every time it appears is treating that term as a term of art. Stripping that away makes the document cleaner but shallower. The model would have had an easier time. It would have learned less.
The actual fix — five lines of regex targeting the specific artefacts that caused the failures — was much smaller than what I’d proposed. This is usually how it goes. The temptation to solve the general case is strong, and usually wrong.
I have flagged this in my notes as a thing worth writing about. Building a system that processes policy documents is not just a software engineering problem. It requires domain understanding — knowing the difference between formatting noise and structured emphasis in government documents. I am, it turns out, not a neutral observer here. The knowledge I’ve accumulated over years of reading these things was actually useful for something.
14 seconds
Ingesting the five Dutch healthcare laws — 215 individual articles from five XML source files downloaded directly from wetten.overheid.nl — took 14 seconds. That includes parsing the XML, extracting every article with its lid numbers and definitions, resolving the cross-references between articles, generating an embedding for each article, and writing everything to both the SQLite database and the graph.
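To give a sense of the shape of that step, here is a minimal sketch. The element names, the table schema, and the find_cross_references() helper are placeholders of my own; the real wetten.overheid.nl XML is considerably more involved than this:

```python
import sqlite3
import xml.etree.ElementTree as ET

def ingest_law(xml_path: str, law_id: str, db: sqlite3.Connection) -> None:
    """Parse one law's XML, store each article, and record cross-reference
    edges. 'artikel' and the schema below are placeholders for the real thing."""
    tree = ET.parse(xml_path)
    for artikel in tree.iter("artikel"):            # placeholder element name
        number = artikel.get("label", "?")
        body = " ".join(artikel.itertext()).strip()
        db.execute(
            "INSERT INTO articles (law, article, body) VALUES (?, ?, ?)",
            (law_id, number, body),
        )
        # embedding = embed(body)  # one embedding per article, stored alongside
        for ref in find_cross_references(body):     # hypothetical helper: finds
            db.execute(                             # "artikel X van de <wet>" refs
                "INSERT INTO cross_references "
                "(src_law, src_article, dst_law, dst_article) VALUES (?, ?, ?, ?)",
                (law_id, number, ref.law, ref.article),
            )
    db.commit()
```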
I spent considerably longer than 14 seconds trying to decide whether this was the right approach. The alternative was to treat the laws as just more documents — convert them to Markdown, feed them to the curator, let it tag them like everything else. That would have been simpler.
What I have instead is 230 resolved cross-references between the five laws, structured as edges in a graph. WEGIZ Article 1.1 defines cliënt by explicit reference to Wkkgz Article 1. Wabvpz Article 7 incorporates the processing conditions from the UAVG. The Wgbo’s patient rights provisions are the foundation on which Wkkgz’s quality requirements are built. This is not a description of how the laws relate to each other. It’s a model of those relationships that a query can traverse.
When an international document talks about patient rights in the context of electronic data exchange, the system can now ask: which specific articles in the Dutch legal framework bear on this? Which of those articles cross-reference each other? What does the full chain of authority look like? These are not questions the semantic embedding layer could answer reliably on its own.
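Traversable, concretely, means something like the following: a breadth-first walk over the cross_references edges from the sketch above, chasing the chain of references out from a starting article.

```python
from collections import deque

def reference_chain(db, start_law: str, start_article: str, max_hops: int = 3):
    """Breadth-first walk over the cross_references edges: every article
    reachable from the starting article within max_hops references."""
    seen = {(start_law, start_article)}
    queue = deque([(start_law, start_article, 0)])
    chain = []
    while queue:
        law, article, depth = queue.popleft()
        if depth >= max_hops:
            continue
        rows = db.execute(
            "SELECT dst_law, dst_article FROM cross_references "
            "WHERE src_law = ? AND src_article = ?",
            (law, article),
        ).fetchall()
        for dst_law, dst_article in rows:
            if (dst_law, dst_article) not in seen:
                seen.add((dst_law, dst_article))
                chain.append((dst_law, dst_article, depth + 1))
                queue.append((dst_law, dst_article, depth + 1))
    return chain
```

With that in place, “what does WEGIZ Article 1.1 ultimately rest on?” becomes a call like reference_chain(db, "WEGIZ", "1.1") and a readable list of (law, article, hops) tuples.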
The laws also gave me an unexpected finding in the reference graph. Of the 434 cross-references in the five laws, 204 point at laws not in our set. The most-referenced absent law — 26 times — is the Algemene wet bestuursrecht, the general administrative law that underpins most of Dutch government. Dutch healthcare law is built on Dutch administrative law which is built on constitutional principles. The graph makes this visible as a structure, not just as a claim.
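The query behind that finding is simple once the edges exist. A sketch, continuing with the db connection and schema from above, and with a placeholder set of law names standing in for the five that were ingested:

```python
# The five ingested laws; the names here are placeholders for the real set.
KNOWN_LAWS = {"WEGIZ", "Wkkgz", "Wabvpz", "Wgbo", "UAVG"}

rows = db.execute(
    "SELECT dst_law, COUNT(*) AS n FROM cross_references "
    "GROUP BY dst_law ORDER BY n DESC"
).fetchall()

external = [(law, n) for law, n in rows if law not in KNOWN_LAWS]
# In my run, the top of this list is the Algemene wet bestuursrecht, at 26.
```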
The model that couldn’t
For the first pass through the 154 baseline documents, I used phi3.5:3.8b for both extraction tasks: structured metadata (what country, what year, what topics) and candidate term detection (what concepts appear repeatedly that aren’t already in my taxonomy). The metadata extraction worked reasonably well. The candidate term detection produced, across 153 documents: two terms. Both with a frequency of one.
The problem is not that phi3.5 is a bad model. It’s a good model for what it’s designed for. But asking a 3.8-billion-parameter model to read a 10,000-character policy document, compare every concept in it against a 43-topic taxonomy, identify novel recurring concepts, and return them as a structured JSON array with context quotes — that’s a genuinely hard comparative reasoning task. It’s not a fill-in-the-template task. The model was capable of producing the right shape of output. It was not capable of producing useful content in it.
The fix was to route the candidate term extraction to qwen2.5:14b. The metadata extraction stays with phi3.5 — it’s faster and the task is well within its capabilities. The harder task goes to the bigger model. The first document processed after this change produced eight candidate terms, including Cross-Domain Collaboration, Terminology Standardization, and Regulatory Landscape. The second produced eight more: AI Act Implications for Medical Devices, Regulatory Overlap between AI Act and MDR/IVDR, Post-Market Monitoring for AI Medical Devices.
None of these terms exist in the OECD Digital Health Policy Framework taxonomy I’m using as my starting ontology. Some of them probably should.
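The routing itself is unglamorous. A sketch, using Ollama’s HTTP API; the prompt texts here are stand-ins for the real ones, but the shape of the JSON they ask for (an array of terms with frequencies and context quotes) is the shape I’m after:

```python
import json
import requests

OLLAMA = "http://localhost:11434/api/generate"

METADATA_PROMPT = (
    "Return JSON with keys country, year, topics for this document:\n\n"
)
CANDIDATE_TERMS_PROMPT = (
    "Return a JSON array of recurring concepts not already covered by the "
    'taxonomy, each as {"term": ..., "frequency": ..., "context": ...}:\n\n'
)

def ask(model: str, prompt: str):
    """One non-streaming call to a local Ollama model, asking for JSON output."""
    r = requests.post(OLLAMA, json={
        "model": model,
        "prompt": prompt,
        "format": "json",   # constrain the model to valid JSON
        "stream": False,
    }, timeout=600)
    r.raise_for_status()
    return json.loads(r.json()["response"])

def extract(excerpt: str):
    # Metadata is a fill-in-the-template task: the small model handles it.
    metadata = ask("phi3.5:3.8b", METADATA_PROMPT + excerpt)
    # Candidate terms need comparative reasoning against the taxonomy:
    # that goes to the bigger model.
    candidates = ask("qwen2.5:14b", CANDIDATE_TERMS_PROMPT + excerpt)
    return metadata, candidates
```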
The last document
The hardest document in the corpus to process was a comparative legal analysis of the AI Act and the MDR/IVDR — the EU regulation for medical devices. It was hard for two separate reasons that required separate fixes.
First: the document is 226,000 characters of dense legal analysis, and the last 9,000 of those characters are an appendix of image references — one per page, 79 pages. The smart excerpt function I use to give the model a manageable window of content takes 70% from the start and 30% from the end. The end, in this case, was pure noise. The model received a reasonable introduction followed by 79 lines of file paths. It declined to produce JSON. This was the correct response, in a sense: there wasn’t enough signal in what it was given.
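The excerpt function itself is nothing exotic. A minimal sketch of the 70/30 split, assuming the 9,000-character budget mentioned earlier:

```python
def smart_excerpt(text: str, budget: int = 9000) -> str:
    """Give the model a fixed-size window: roughly 70% from the start of the
    document and 30% from the end."""
    if len(text) <= budget:
        return text
    head = int(budget * 0.7)
    tail = budget - head
    return text[:head] + "\n[…]\n" + text[-tail:]
```

The fix for this document wasn’t to change the split; it was to make sure the cleaning pass runs before the excerpt is taken, so the tail still contains analysis rather than file paths.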
Second, and separately: even after fixing the excerpt problem, phi3.5 couldn’t extract coherent metadata from this document. It’s written in dense technical English about two overlapping EU regulatory frameworks, contains extensive cross-references to specific recitals and articles, and is structured more like a legal opinion than a policy document. The model returned text. It was thoughtful text. It was not JSON.
The fix for the second problem was a third model call, automatically triggered when both phi3.5 attempts fail: escalate to qwen2.5:14b. This is the same model I’m using for candidate terms. It handles the document cleanly in 35 seconds.
The escalation is now automatic. phi3.5 first, minimal-prompt phi3.5 retry second, qwen2.5:14b third. The system handles this without human intervention. The document gets processed. The log shows which path it took.
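In code, the ladder looks roughly like this, reusing the ask() helper from the routing sketch above; the prompt constants and the logging are illustrative:

```python
import json
import logging

log = logging.getLogger(__name__)

FULL_PROMPT = "Return JSON with keys country, year, topics for this document:\n\n"
MINIMAL_PROMPT = "JSON only. Keys: country, year, topics.\n\n"

def extract_metadata(excerpt: str) -> dict:
    """The escalation ladder: small model, small model with a stripped-down
    prompt, then the big model. ask() is the helper from the routing sketch."""
    attempts = [
        ("phi3.5:3.8b", FULL_PROMPT),
        ("phi3.5:3.8b", MINIMAL_PROMPT),
        ("qwen2.5:14b", FULL_PROMPT),
    ]
    for model, prompt in attempts:
        try:
            result = ask(model, prompt + excerpt)
            log.info("metadata extracted via %s", model)
            return result
        except (json.JSONDecodeError, KeyError, ValueError):
            log.warning("extraction failed on %s, escalating", model)
    raise RuntimeError("all extraction attempts failed")
```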
In retrospect, I should probably have designed it this way from the start. In practice, I designed it this way because the last document in the corpus refused to be processed any other way, and that’s a perfectly good reason.
What the baseline shows
After all of this — the failures, the fixes, the debugging sessions, the asterisk debate — I have a baseline. 155 Dutch digital health policy documents, each tagged with topic relevance scores across ten dimensions, each embedded and comparable to the others, each with candidate terms logged for review. 215 legal articles from five Dutch laws, cross-referenced and graph-indexed.
The candidate term review hasn’t happened yet. The --rerun-candidates job is running as I write this, re-processing all 153 documents that got the unhelpful phi3.5 pass on the first run, now with qwen2.5:14b. By tomorrow morning I’ll have the full candidate term picture, and then I’ll sit down with the ontology review tool and decide which of those concepts deserve a permanent home in the taxonomy.
That review session is where the international corpus work begins. Once I know what the Dutch baseline thinks is important, I can start comparing it to what France thinks is important, and Australia, and Canada, and the WHO, and the G20, and the rest of the 280 documents currently sitting in a folder on my Desktop.
The GDHL exists. It’s empty of international content, still just a well-catalogued Dutch library. But the foundation is right. The architecture is honest about what it knows and what it doesn’t. The legal graph means the system understands the difference between a binding requirement and a policy aspiration. The entropy detection means it will notice when the international landscape starts to move.
That’s enough to start. More to follow.
The code for this project is not yet public. If you’re building something similar and want to compare notes, get in touch.