The web in 1000 lines of C (Maurycy's blog)


Modern browsers are hugely complex: Chromium (the open source portion of Google Chrome) currently has 49 million lines of code, making it bigger than any other program on my machine.

... but how much of that is needed if I just want to visit websites instead of running multi-gigabyte JavaScript abominations that just happen to render in the browser?

My goal is to read all the blogs on my link list: the pages have to be rendered (not just the raw HTML printed) and links must work.

Let's start by trying to view the list. Here's the URL:

http://maurycyz.com/real_pages/

The http in the URL tells us that the server expects plaintext (unencrypted) HTTP. After connecting to the server, the browser has to send an HTTP request:

GET /real_pages/ HTTP/1.0
User-Agent: MyEpicBrowser/1.0
Host: maurycyz.com
[trailing blank line]
Repeating the hostname in the request might seem redundant, but it's very common to host multiple websites on one IP address (Cloudflare, etc.).
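The two steps above can be sketched in a few lines of C. These helpers are my own illustration, not the browser's actual code: one splits a plain http:// URL into host and path, the other formats the request shown above.

```c
#include <stdio.h>
#include <string.h>

/* Split "http://host/path" into host and path. Returns 0 on success,
 * -1 if the scheme isn't plain "http://" or the host doesn't fit. */
int parse_url(const char *url, char *host, size_t hlen,
              char *path, size_t plen)
{
    if (strncmp(url, "http://", 7) != 0)
        return -1;                          /* only plaintext HTTP here */
    const char *p = url + 7;
    const char *slash = strchr(p, '/');
    size_t n = slash ? (size_t)(slash - p) : strlen(p);
    if (n == 0 || n >= hlen)
        return -1;
    memcpy(host, p, n);
    host[n] = '\0';
    snprintf(path, plen, "%s", slash ? slash : "/");
    return 0;
}

/* Format the minimal HTTP/1.0 request. The trailing blank line
 * (the final \r\n) tells the server the request is complete. */
int build_request(char *buf, size_t len, const char *host, const char *path)
{
    return snprintf(buf, len,
                    "GET %s HTTP/1.0\r\n"
                    "User-Agent: MyEpicBrowser/1.0\r\n"
                    "Host: %s\r\n"
                    "\r\n", path, host);
}
```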

Assuming everything goes well, the server will respond with some metadata, a blank line, and the page:

HTTP/1.0 200 OK
Content-Length: 14
Content-Type: text/plain

Hello, World!
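Since the body starts right after the first blank line, separating it from the headers is a one-liner once the response is buffered (a sketch, assuming the whole response is in memory):

```c
#include <string.h>

/* Return a pointer to the body of a buffered HTTP response,
 * or NULL if the header/body separator never appears. */
const char *response_body(const char *response)
{
    const char *sep = strstr(response, "\r\n\r\n");
    return sep ? sep + 4 : NULL;
}
```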

URLs don't usually include an IP address, so the browser has to perform a DNS lookup before connecting. I used the C library's resolver for this, which I don't consider cheating because DNS is very widely used outside of the web.
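The lookup-then-connect dance looks roughly like this (a sketch using the libc resolver, getaddrinfo, exactly as any non-web program would; the helper name is mine):

```c
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

/* Resolve host:port and connect to the first address that works.
 * Returns a connected socket fd, or -1 on failure. */
int tcp_connect(const char *host, const char *port)
{
    struct addrinfo hints, *res, *rp;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;        /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;
    int fd = -1;
    for (rp = res; rp; rp = rp->ai_next) {
        fd = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, rp->ai_addr, rp->ai_addrlen) == 0)
            break;                      /* connected */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}
```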

So now the browser has the page, but it's not exactly something that can be displayed. Most pages look something like this:

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8" />
        <meta name="norton-safeweb-site-verification" content="24usqpep0ejc5w6hod3dulxwciwp0djs6c6ufp96av3t4whuxovj72wfkdjxu82yacb7430qjm8adbd5ezlt4592dq4zrvadcn9j9n-0btgdzpiojfzno16-fnsnu7xd" />
        <link rel="preconnect" href="https://substackcdn.com" />
            <title data-rh="true">Shining light on photodiodes - lcamtuf’s thing</title>
        <meta data-rh="true" name="theme-color" content="#f5f5f5"/><meta data-rh="true" property="og:type" content="article"/><meta data-rh="true" property="og:title" content="Shining light on photodiodes"/><meta data-rh="true" name="twitter:title" content="Shining light on photodiodes"/><meta data-rh="true" name="description" content="An explanation of how photodiodes work. From diode physics to a working photodiode amplifier."/><meta data-rh="true" property="og:description" content="A quick intro and an application circuit for a versatile but oft-misused light-sensing component."/><meta data-rh="true" name="twitter:description" content="A quick intro and an application circuit for a versatile but oft-misused light-sensing component."/><meta data-rh="true" property="og:image" content="https://substackcdn.com/image/fetch/$s_!Comu!,w_1200,h_675,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694d5853-cc65-4875-8172-f6590c071608_2000x1334.jpeg"/><meta data-rh="true" name="twitter:image" content="https://substackcdn.com/image/fetch/$s_!vCYi!,f_auto,q_auto:best,fl_progressive:steep/https%3A%2F%2Flcamtuf.substack.com%2Fapi%2Fv1%2Fpost_preview%2F149971958%2Ftwitter.jpg%3Fversion%3D4"/><meta data-rh="true" name="twitter:card" content="summary_large_image"/>
        <style>
          @layer legacy, tailwind, pencraftReset, pencraft;
        </style>
        <link rel="preload" as="style" href="https://substackcdn.com/bundle/theme/main.7d9ea079caf96466d476.css" />
        <link rel="preload" as="font" href="https://fonts.gstatic.com/s/lora/v37/0QIvMX1D_JOuMwr7I_FMl_E.woff2" crossorigin />
                <link rel="stylesheet" type="text/css" href="https://substackcdn.com/bundle/static/css/382.1ff9d172.css" />
                <link rel="stylesheet" type="text/css" href="https://substackcdn.com/bundle/static/css/56938.770097cf.css" />
                ... a few hundred of these ... 
                <link rel="stylesheet" type="text/css" href="https://substackcdn.com/bundle/static/css/86379.813be60f.css" />
                <link rel="stylesheet" type="text/css" href="https://substackcdn.com/bundle/static/css/53264.6ac03138.css" />
        <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=0, viewport-fit=cover" />
        <meta name="author" content="lcamtuf" />
        <meta property="og:url" content="https://lcamtuf.substack.com/p/shining-light-on-photodiodes" />
        <link rel="canonical" href="https://lcamtuf.substack.com/p/shining-light-on-photodiodes" />
            <meta name="google-site-verification" content="XQ-ieyX1lItk-wZQ6kV-RzuJzSGt-MKxVrcP9To6P0I" />
            ... and a few thousand lines more

Where's the content? It's all mashed into line 206, starting at column 4097.

On paper, parsing HTML is easy, but there are a huge number of edge cases. For example, this is perfectly valid HTML:

<p>I made a website.
<p>These are some cool things
<ul>
    <li>Thing 1
    <li>Thing 2
    <li>Thing 3
</ul>

... and so is ...

<sTyLe>body {color: gray;} /* </random tag like thing> */<StYlE>

... but of course website authors have never concerned themselves with standards. I've seen a lot of stuff like this:

<p> Yap Yap Yap <a href=/more/yaps/>Link-y Yap<a> <!-- not a closing tag! -->

Chrome is smart enough to figure out that the link without a destination is a fat-fingered closing tag. Fixing broken markup requires a lot of added complexity, but a browser must make an attempt to render broken HTML — otherwise most websites won't work.

My parser is a straightforward recursive implementation that prints the text to the terminal as it goes. It always closes tags in reverse order of their opening.

Self-closing tags (<br>, <img>, etc.) can be handled trivially, but I've made no attempt to follow more complex auto-closing rules like those for <p> elements... because it isn't needed:

All the browser does with paragraphs is insert a blank line between them. Independent paragraph elements are displayed the same as nested ones.
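The recursive shape described above can be sketched in a toy form (my own illustration, far simpler than the real parser): recurse on every open tag, unwind on every close tag, and copy text through. Attributes, entities, and self-closing tags are ignored here.

```c
#include <string.h>

/* Walk the HTML, appending text content to out. Returns a pointer
 * just past the close tag that ended this recursion level. */
static const char *walk(const char *p, char *out, size_t *o)
{
    while (*p) {
        if (*p == '<') {
            int closing = (p[1] == '/');
            const char *gt = strchr(p, '>');
            if (!gt)
                break;                 /* truncated tag: give up */
            p = gt + 1;
            if (closing)
                return p;              /* unwind to the parent */
            p = walk(p, out, o);       /* descend into the child */
        } else {
            out[(*o)++] = *p++;
        }
    }
    return p;
}

/* Strip tags from html into out (caller supplies a big enough buffer). */
void html_to_text(const char *html, char *out)
{
    size_t o = 0;
    walk(html, out, &o);
    out[o] = '\0';
}
```

Because the recursion returns on the first close tag it sees, tags are always closed in reverse order of their opening, matching the behavior described above.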

Typesetting uses a "good enough" approach: if the browser sees a space and its current column is above 70, it inserts a newline. This usually keeps the text within 80 columns, but long words can push it over.
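That wrapping rule fits in a dozen lines (a sketch of the approach, not the exact code): when a space shows up past column 70, emit a newline instead and reset the column counter.

```c
/* "Good enough" word wrap: break at the first space past column 70.
 * Long unbroken words can still exceed 80 columns. */
void wrap(const char *in, char *out)
{
    int col = 0;
    while (*in) {
        if (*in == ' ' && col > 70) {
            *out++ = '\n';            /* newline replaces the space */
            col = 0;
            in++;
        } else {
            *out++ = *in++;
            col++;
        }
    }
    *out = '\0';
}
```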

I didn't have space for a proper interface, so every link is given a number which the user can type to visit it. It's not exactly a good UI, but it works.

These links are actually the reason I used a recursive HTML parser. A somewhat simpler (and more robust) non-recursive parser would work fine for formatting, but it would have required some extra work to put the link numbers after the link text.
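The numbering itself can be as simple as an append-only array that the parser fills in as it finishes each <a> element (a hypothetical sketch, with fixed limits the real code may not share):

```c
#include <stdio.h>

#define MAX_LINKS 256

static char links[MAX_LINKS][512];   /* destination of each numbered link */
static int nlinks = 0;

/* Store the href and return the number to print after the link text,
 * e.g. "Link-y Yap[3]". Returns -1 when the table is full. */
int add_link(const char *href)
{
    if (nlinks >= MAX_LINKS)
        return -1;
    snprintf(links[nlinks], sizeof links[nlinks], "%s", href);
    return ++nlinks;
}
```

When the user types a number n, the browser just fetches links[n - 1].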

Here's the browser displaying one of lcamtuf's blog posts (hosted on Substack).

A small snag is that most sites will refuse plaintext requests, redirecting to an https:// URL. I'm not exactly sure why: there are better mechanisms to inform browsers that you support HTTPS... but redirects have long since become the default.

To come anywhere near my goal, the browser has to support encryption. This means supporting 50 different ciphers and managing public-key infrastructure...

Instead, I just used the OpenSSL library: all the other browsers do it, even Chrome. This still required duplicating all the socket code — adding around a hundred lines:

Lines      Function
107        User interface
127        URL parsing
99 + ...   HTTP ...
    109        ... over SSL
    90         ... over plaintext sockets
364        HTML parsing and typesetting
999        Total + block comments + #includes

Compilation

Source code: tiny.c

gcc tiny.c -o tiny -Wall -lssl -lcrypto -DUSE_SSL

If you don't have OpenSSL, you can compile it without defining USE_SSL:

gcc tiny.c -o tiny -Wall

... at the cost of not being able to access 90% of websites.

It should work on UNIX-like systems: Linux, BSD, macOS, etc. It may work on Windows, but I haven't tried it.

Features:

  • (Kinda) word-wrapped text
  • Bolded and emphasized text
  • Code blocks
  • Automatic cookie and JavaScript blocking
  • Forced dark mode

The browser takes a URL as a command line argument. If none is specified, it defaults to my link list. Once a page is loaded, all the links are assigned numbers. Type the number of a link to visit it. Exit using control-C.

You may have to scroll up to read the page.

To be clear: I don't recommend this for serious use. If you want a small browser, there are far more complete and well-tested options like lynx.