Show HN: Fast and Extensible Parser for Markdown in PHP
parsedown.org
I don't need another Markdown parser. I need another Markdown.
Markdown has outgrown its original spec, yet Gruber both clings on to it and is unwilling to update it. Meanwhile, different websites and different parsers proliferate, each adding new extensions with varying degrees of usefulness and compatibility, all under the name "Markdown" or some variation.
I wish GitHub would drop the name "GitHub flavored Markdown", give it a clever new name, a cleverly branded website and use their bully pulpit to cast off Gruber's shackles and effect change.
> I wish GitHub would drop the name "GitHub flavored Markdown", give it a clever new name
GitDown, GitDown
everybody GitDown now
Agreed. Most new Markdown parsers built for higher speed handle edge cases differently, including the famous marked.
If you don't mind the frustrating syntax tree, you can try Strictdown (not Markdown), and get some insights to make a better one. (I'm too lazy to update it now.) https://github.com/jakwings/strictdown
Strictdown looks thorough: well done! Added to the inventory of Markdown parsers and resources.¹
SkrivML² is another thoughtful take on lightweight markup, next generation.
¹ https://github.com/rhythmus/markdown-resources/ ² http://markup.skriv.org/
Thank you! Wonderful collections!
I would like a complete template language based on Markdown.
AsciiDoc is similar to markdown and offers a much more extensive syntax, but it's nowhere near as common in the wild.
It's important to remember that most markdown implementations (including this one) cannot be used to provide a safe mechanism for authoring user generated content without opening a site up to XSS vulnerabilities, since markdown allows arbitrary HTML markup.
Easily solved by proper use of HTMLPurifier on the output.
Thus negating any speed improvements in the markdown parser....
Considering you have to run any markdown parser through a sanitizer, the speed improvements still matter.
The markdown parser should be able to do it in an ideal world. HTMLPurifier is very slow.
edit:
To whoever downvoted me: I'm sorry, was I wrong? The markdown parser has to look at every input byte, and so does the HTML sanitizer, so it's obviously better to do the HTML sanitization in the parser and combine the two into one pass.
Running HTMLPurifier on the output of the markdown parser is inefficient - it's sanitizing known good elements not just the potentially bad ones, so you're giving it more work to do.
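To make the "one pass" idea concrete, here is a minimal sketch in Python. It is a toy, not how Parsedown or HTMLPurifier work: the grammar (only **bold** and `code` spans) and the name render_inline are made up for illustration, and it escapes all raw HTML instead of whitelisting safe tags the way a real sanitizer would.

```python
import html
import re

# Hypothetical toy grammar: only **bold** and `code` spans are recognized.
_TOKEN = re.compile(r"\*\*(.+?)\*\*|`(.+?)`")


def render_inline(text: str) -> str:
    """Render the toy grammar and escape everything else in the same pass."""
    out = []
    pos = 0
    for match in _TOKEN.finditer(text):
        # Text between recognized tokens is untrusted, so escape it.
        out.append(html.escape(text[pos:match.start()]))
        if match.group(1) is not None:
            out.append("<strong>" + html.escape(match.group(1)) + "</strong>")
        else:
            out.append("<code>" + html.escape(match.group(2)) + "</code>")
        pos = match.end()
    # Escape whatever follows the last token.
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

Only the tags the renderer itself emits reach the output, so there is nothing "known good" left for a second sanitizing pass to re-check. The cost is that users lose inline HTML entirely.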
Different markdown editors seem to disagree on how to parse the following: https://gist.github.com/anonymous/810ae1f7d52bcfffa1ef
If the second empty line marks the end of the list block, the indented HTML (a code block) should preserve its tags.
Yeah it looks like my site fails at this - http://markdownshare.com/view/96996ce5-63bc-45ca-af49-ba18cb...
I remember seeing this on /r/PHP, and one of the top comments there was about it using Regex instead of parsing it like a language.
However, I also recall that it's thanks to using regex that it works so quickly. So I figured I'd get this argument out of the way before someone else brought it up.
Well, the original markdown.pl heavily uses regexps.
From having tilted at this windmill a little myself, I think:
1. It's tricky enough to handle correctly all the under-specified corner cases of basic markdown -- not to mention the popular extensions to it. The cognitive load of doing it with complex regexps gets heavy, quickly.
2. I'm incredibly impressed with all the work that John MacFarlane has put into the problem, for example in [Pandoc] and [Cheapskate].
[Pandoc]: https://github.com/jgm/pandoc
[Cheapskate]: https://github.com/jgm/cheapskate
I think semantic parsing with a lexer and tokens is better for a lot of things, but it's sometimes overkill when the patterns are predictable and simple.
That said, has there ever really been an issue with speed as it pertains to markdown translation? I can't imagine it's an everyday, practical concern.
> That said, has there ever really been an issue with speed as it pertains to markdown translation?
Yes, speed of translation is a big deal. I tried at least 4 Markdown parsers for Python precisely because I needed the right combination of speed and extensibility. When you are constructing a very large static site, a full rebuild can take a long time.
For those wondering, I went with Mistune (http://mistune.readthedocs.org/en/latest/). It is accelerated by Cython.
As with most tech, if there is such a leap in speed (about 10 times), then a lot of other applications become possible. You could remove a layer of caching because it's not needed anymore, thus reducing your app complexity. But apart from that, imagine how many places use markdown. If people all move to a 10 times faster implementation, that's an incredible reduction in wasted CPU cycles.
My point was not that we shouldn't work to produce even small efficiencies (which, yes, cascade into larger aggregate ones).
It was more wondering whether speed in markdown parsing is such a concern that this would merit a marquee 'selling' point.
If you're building a static site from markdown files, and your site consists of thousands of pages, speed will definitely be a concern.
Not all that often, unless either a) you're in the habit of frequently making broad changes, or b) your build tool doesn't take account of modification times.
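For point b), a rough sketch of what "taking account of modification times" could look like in Python. The directory layout (.md sources mirrored as .html outputs) and the function names are assumptions for illustration, not taken from any particular build tool.

```python
from pathlib import Path
from typing import List


def needs_rebuild(source: Path, target: Path) -> bool:
    """True when the target is missing or older than its source."""
    if not target.exists():
        return True
    return source.stat().st_mtime > target.stat().st_mtime


def stale_pages(src_dir: Path, out_dir: Path) -> List[Path]:
    """List the Markdown sources whose HTML output is out of date."""
    stale = []
    for source in sorted(src_dir.rglob("*.md")):
        # Assumed layout: src/a/b.md is rendered to out/a/b.html.
        target = out_dir / source.relative_to(src_dir).with_suffix(".html")
        if needs_rebuild(source, target):
            stale.append(source)
    return stale
```

With a check like this, only the pages returned by stale_pages go through the parser, so parser speed mostly matters after broad changes such as a template edit that touches every page.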
Can you parse HTML with regex?
We started with a server side Markdown parser, but switched to a JavaScript parser (https://github.com/chjj/marked). Really there is no reason to do this work on the server.
I've worked with several markdown implementations and parsedown is my current choice due to my main constraint - speed. Great work and thanks for sharing.
If speed is your top priority you may also want to look at Sundown, which can be installed as a PHP extension and is likely faster since it's just C.
Looks great, happy to see the Markdown Extra extension. With regard to performance, I've always gotten around the slowness of the original Markdown parser by making liberal use of caching, but warming the cache is still painful for a CMS. Will look to migrate to this.
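For anyone unsure what that kind of caching looks like, a minimal Python sketch, assuming an in-memory store keyed by a hash of the source text. RenderCache and its render callable are hypothetical names; a real CMS would more likely persist the cache to disk or a key-value store.

```python
import hashlib
from typing import Callable, Dict


class RenderCache:
    """Cache rendered HTML, keyed by a hash of the Markdown source."""

    def __init__(self, render: Callable[[str], str]) -> None:
        self._render = render          # any Markdown-to-HTML function
        self._store: Dict[str, str] = {}
        self.misses = 0                # how often the parser actually ran

    def get(self, source: str) -> str:
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Cache miss: this is where the parser's speed is paid for.
            self.misses += 1
            self._store[key] = self._render(source)
        return self._store[key]
```

Warming the cache means one miss per distinct document, which is exactly the part a faster parser shortens.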
Parsedown is certainly very fast, but I wouldn't call it "extensible". CeBe's markdown parser is nearly as fast, but focusses on being very easy to extend, so it's trivial to add custom syntax elements, see https://github.com/cebe/markdown
(CeBe's library is inspired by parsedown)
> Parsedown is certainly very fast, but I wouldn't call it "extensible".
Parsedown is extensible and it already has been extended. There's a well working extension of Parsedown that adds support for Markdown Extra. It's called Parsedown Extra. It can be found at https://github.com/erusev/parsedown-extra
It is possible to extend, but extensible requires more. In this case, ParsedownExtra looks to directly extend the Parsedown class. This is fine for a single extension, but it discourages utilization of multiple independent extensions.
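To illustrate the difference, here is a rough Python sketch of the alternative: extensions register rules on a shared renderer instead of subclassing it. This is not how Parsedown or cebe/markdown are built; InlineRenderer, add_strikethrough and add_highlight are invented for the example.

```python
import html
import re
from typing import Callable, List, Tuple


class InlineRenderer:
    """Toy renderer that extensions hook into by registering rules."""

    def __init__(self) -> None:
        self._rules: List[Tuple[re.Pattern, Callable[[re.Match], str]]] = []

    def register(self, pattern: str, handler: Callable[[re.Match], str]) -> None:
        self._rules.append((re.compile(pattern), handler))

    def render(self, text: str) -> str:
        # Escape first, then apply each registered rule in order.
        out = html.escape(text, quote=False)
        for pattern, handler in self._rules:
            out = pattern.sub(handler, out)
        return out


# Two independent, hypothetical extensions. Neither knows about the other.
def add_strikethrough(renderer: InlineRenderer) -> None:
    renderer.register(r"~~(.+?)~~", lambda m: "<del>" + m.group(1) + "</del>")


def add_highlight(renderer: InlineRenderer) -> None:
    renderer.register(r"==(.+?)==", lambda m: "<mark>" + m.group(1) + "</mark>")
```

Both extensions can be enabled on the same renderer with two calls, whereas two subclasses of a single parser class would have to be merged by hand or chained through inheritance.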
Perhaps I am looking at this wrong, but I don't see why you would use a Markdown parser written in PHP if you're looking for speed. Case in point: Parsedown is fast because it makes heavy use of regular expressions, which parse and run faster than code written in the host language. It already relies on a language other than PHP to essentially emulate parts of a well-written lexer.
As debaserab2 says[1], if you are looking for speed, consider PHP extensions.
In my opinion, writing a system like this is a misappropriation of PHP, which evolved from and works best as a hybrid templating/scripting language. It becomes a powerful development platform when its extensive library of C functions is used to do most heavy lifting.
If someone has done it well without compiling a C library, why don't you try it (on a shared server, maybe)?
I don't understand what you mean. Could you also explain the negativity around this comment? I didn't think it was a badly voiced opinion, and karma is not meant to be used to show how much you agree or disagree with someone.
I had no right to downvote your reply. (even now) ;) Please take it easy. I just want to say that using a C library is not always preferred.
I agree, and I did not mean to make any sweeping statement in that regard. But in the case of a standardised markup like markdown, there are already suites of field-tested C libraries that provide much better speed than this library would; for markdown content, this provides a better experience for your users.
Agree, too. Well, plain text sucks. So I like to use smiley symbols now. ;-)
:¬)