Parsing webpages with a Large Language Model (LLM) revisited – Hans Dembinski’s blog


I previously wrote about parsing websites and extracting structured data, but that was in January 2025, and a lot has happened in the LLM sphere since then. The pace at which development moves forward is truly mind-boggling. I have since changed my mind about a couple of things, especially regarding the libraries that I would recommend.

llama.cpp > ollama

I won’t praise ollama anymore; instead, I recommend running llama.cpp directly. ollama is great for getting a head start into the world of LLMs, because it makes installing and running your first LLM really easy, but the team behind it has shown behavior that raises red flags.

  • The project is essentially wrapping llama.cpp, but for a long time did not provide proper attribution, see ollama-3697. Even now you need to scroll down the whole README to find that attribution, which seems unfair as all the hard engineering work to make LLMs run fast on consumer hardware is done by the llama.cpp team.
  • ollama introduced their own format for storing LLMs on device for no particular reason, which is incompatible with the standard GGUF format, meaning that you cannot easily switch to other tools to run the same models that you already downloaded.
  • llama.cpp compiled from source performs better than ollama.
  • llama.cpp provides more features and allows for greater control of said features (see the sketch after this list for talking to a running llama-server from Python).
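
If you want to talk to a locally running llama.cpp model from Python, the llama-server binary exposes an OpenAI-compatible HTTP API. The following is only a minimal sketch, not part of the scraping code below: it assumes llama-server is already running on the default port 8080 with some GGUF model loaded, and the prompt and model name are placeholders.

import requests

# Assumes llama-server is already running locally, e.g. started with
#   llama-server -m <model>.gguf --port 8080
# It serves an OpenAI-compatible chat completions endpoint.
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        # The model name is a placeholder; a single-model llama-server
        # answers with whatever model it was started with.
        "model": "local",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.0,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])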

PydanticAI > llama-index

In my post about RAG, I advertised llama-index, a recommendation that was based on a survey of several AI libraries. I have since discovered PydanticAI, which comes from the same team that brought us the fantastic Pydantic. Both libraries abstract away annoying details and boilerplate while giving you layers of control, from the high level down to the fundamentals, if you need it (and in the realm of LLMs, you often need to dig in deep). Most libraries only achieve the former but fail at the latter. Both also have excellent documentation. PydanticAI is great for extracting structured output, so we will use it here.
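
To give a flavor of what that looks like, here is a minimal sketch of structured output extraction with PydanticAI. It assumes a recent release of the library (where the Agent accepts output_type and the result is read from .output) and an OpenAI model as backend; the toy schema and the prompt are placeholders, not the schema used for our task.

from pydantic import BaseModel
from pydantic_ai import Agent


class Movie(BaseModel):
    title: str
    year: int


# The model string is a placeholder; PydanticAI also supports other providers,
# including OpenAI-compatible local servers.
agent = Agent("openai:gpt-4o", output_type=Movie)

result = agent.run_sync("Name one science-fiction movie released in the 1980s.")
print(result.output)  # a validated Movie instance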

The task

With that out of the way, let’s revisit the task. In this post, I will let the LLM parse a web page to extract data and return it in a structured format. More specifically, I will read a couple of web pages from InspireHEP about a few scientific papers on which I am a co-author and then extract structured bibliographic data from these pages. Normally, one would write a parser to solve this task, but with LLMs we can skip that and just describe the task in human language. With the advent of strong coding models, there is also an interesting third option, the hybrid approach, where we let the LLM write the grammar for a parser based on a bunch of example documents. The hybrid approach is arguably the best one if the structure of the source documents changes only rarely, because it provides deterministic outcomes and is much more energy efficient than running an LLM over every document. LLMs are great for one-shot or few-shot tasks, where writing a parser would not make sense.

Disclaimer: I’ll note again that there are easier ways to solve this particular task: InspireHEP allows one to download information about papers in machine readable format (BibTeX and others). The point of this post is to show how to do it with an LLM, because that approach can also be used for other pages that do not offer access to their data in machine-readable format.
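
For reference, that easier route could look roughly like the sketch below. It assumes that the InspireHEP REST API accepts a format=bibtex query parameter as its documentation describes; treat the exact parameters as an assumption, not as verified here.

import requests

# Hypothetical example: fetch the BibTeX record for one paper directly from
# the InspireHEP REST API (assuming the format=bibtex parameter is supported).
recid = 1889335
response = requests.get(
    f"https://inspirehep.net/api/literature/{recid}",
    params={"format": "bibtex"},
    timeout=30,
)
print(response.text)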

Converting dynamic web pages to Markdown

The code for this part was written by ChatGPT. We use Playwright to render the HTML a user would see in an actual browser. That’s important, because many websites are rendered dynamically with JavaScript, so the raw HTML does not contain the information we seek. Since the HTML downloaded by Playwright is still very cluttered and hard to read, we convert it with markdownify into simple Markdown, which is easier for both humans and LLMs to read. This step removes a lot of the HTML noise that deals with formatting; in signal processing terms, we increase the signal-to-noise ratio of the data. We save the Markdown files in the subdirectory scraped.

On Windows, the Playwright code cannot be run inside a Jupyter notebook; this is a long-standing issue. Playwright refuses to use its sync API when it detects that an event loop is running, and its async API fails on Windows with a NotImplementedError.

As a workaround, I run the code in a separate process using joblib. If we weren’t running from a Jupyter notebook, we could also use a concurrent.futures.ProcessPoolExecutor, but that doesn’t work in a notebook; joblib does some magic behind the scenes to make this possible. As a side effect, this lets us scrape multiple websites in parallel. We need to be careful not to overdo it, though, because websites, including Inspire, tend to block IPs that make too many calls in parallel.

from pathlib import Path
import joblib


def scrape_to_markdown(url: str, output_dir: Path):
    # Local imports, so that they are executed inside the joblib worker process.
    from playwright.sync_api import sync_playwright
    from markdownify import markdownify as md

    # Derive a flat filename from the URL, e.g.
    # https://inspirehep.net/literature/1889335 -> inspirehep_net_literature_1889335.md
    output_fn = url[url.index("://") + 3 :].replace("/", "_").replace(".", "_") + ".md"
    ofile = output_dir / output_fn
    if ofile.exists():
        return f"Skipped {ofile}"

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)

        page = browser.new_page()
        page.goto(url)
        # Wait for JavaScript-rendered content to load
        page.wait_for_load_state("networkidle")
        rendered_html = page.content()
        page.close()
        markdown_content = md(rendered_html)

        with open(ofile, "w", encoding="utf-8") as file:
            file.write(markdown_content)

        browser.close()
        return f"Saved {ofile}"


scraped = Path() / "scraped"
scraped.mkdir(exist_ok=True)

urls = """
https://inspirehep.net/literature/1889335
https://inspirehep.net/literature/2512593
https://inspirehep.net/literature/2017107
https://inspirehep.net/literature/2687746
https://inspirehep.net/literature/2727838
""".strip().split("\n")

joblib.Parallel(n_jobs=4)(
    joblib.delayed(scrape_to_markdown)(url, scraped) for url in urls
)
['Skipped scraped\\inspirehep_net_literature_1889335.md',
 'Skipped scraped\\inspirehep_net_literature_2512593.md',
 'Skipped scraped\\inspirehep_net_literature_2017107.md',
 'Skipped scraped\\inspirehep_net_literature_2687746.md',
 'Skipped scraped\\inspirehep_net_literature_2727838.md']

The content of an example file looks like this:

Measurement of prompt charged-particle production in pp collisions at $ \sqrt{\mathrm{s}} $ = 13 TeV - INSPIREYou need to enable JavaScript to run this app.

[INSPIRE Logo](/)

literature

- Help
- Submit
- [Login](/user/login)

[Literature](/literature)

[Authors](/authors)

[Jobs](/jobs)

[Seminars](/seminars)

[Conferences](/conferences)

[Data](/data)BETA

More...

## Measurement of prompt charged-particle production in pp collisions at s \sqrt{\mathrm{s}} s​ = 13 TeV

- [LHCb](/literature?q=collaboration:LHCb)

Collaboration



- [Roel Aaij](/authors/1070843)(

  - [Nikhef, Amsterdam](/institutions/903832)

  )

Show All(972)

Jul 28, 2021

35 pages

Published in:

- _JHEP_ 01 (2022) 166

- Published: Jan 27, 2022

e-Print:

- [2107.10090](//arxiv.org/abs/2107.10090) [hep-ex]

DOI:

- [10.1007/JHEP01(2022)166](<//doi.org/10.1007/JHEP01(2022)166>)

Report number:

- LHCb-PAPER-2021-010,
- CERN-EP-2021-110

Experiments:

- [CERN-LHC-LHCb](/experiments/1110643)

View in:

- [CERN Document Server](http://cds.cern.ch/record/2777220),
- [HAL Science Ouverte](https://hal.science/hal-03315290),
- [ADS Abstract Service](https://ui.adsabs.harvard.edu/abs/arXiv:2107.10090)

pdfciteclaim[datasets](/data/?q=literature.record.$ref:1889335)

[reference search](/literature?q=citedby:recid:1889335)[32 citations](/literature?q=refersto:recid:1889335)

### Citations per year

[...]

The web page also contains all the references cited by the paper, which are not of interest to us, so I skipped that part here. In fact, one should cut that part away in order to help the model focus on the relevant piece of text and to not waste time on processing irrelevant tokens.

The converted Markdown does not look perfect; the conversion garbled the structure of the document. Let’s see whether the LLM can make sense of this raw text. We want it to extract the authors, the journal data, the title, and the DOI.
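
As a sketch of where this is heading, that target structure can be expressed as a Pydantic model, which PydanticAI can then use as the output type of an agent. The field names and optional fields below are illustrative assumptions, not the exact schema developed in the rest of this post.

from pydantic import BaseModel


class PaperInfo(BaseModel):
    # Illustrative target schema; field names are placeholders.
    title: str
    authors: list[str]
    journal: str | None = None
    year: int | None = None
    doi: str | None = None

An agent constructed with output_type=PaperInfo would then return validated instances of this model instead of free-form text.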