Show HN: Fast and Extensible Parser for Markdown in PHP

88 points by erusev 12 years ago · 38 comments


I don't need another Markdown Parser. I need another Markdown.

Markdown has outgrown its original spec, yet Gruber both clings on to it and is unwilling to update it. Meanwhile, different websites and different parsers proliferate, each adding new extensions with varying degrees of usefulness and compatibility, all under the name "Markdown" or some variation.

I wish GitHub would drop the name "GitHub flavored Markdown", give it a clever new name, a cleverly branded website and use their bully pulpit to cast off Gruber's shackles and effect change.

untog 12 years ago

> I wish GitHub would drop the name "GitHub flavored Markdown", give it a clever new name
GitDown, GitDown
everybody GitDown now
NaNaN 12 years ago

Agreed. Most of the newer Markdown parsers built for speed handle edge cases differently, including the famous marked.
If you don't mind the frustrating syntax tree, you can try Strictdown (not Markdown) and get some insights for making a better one. (I'm too lazy to update it now.) https://github.com/jakwings/strictdown
- rhythmvs 12 years ago
  
  Strictdown looks thorough: well-done! Added to the inventory of Markdown parsers and resources.¹
  SkrivML² is another thoughtful take on lightweight markup, next generation.
  ¹ https://github.com/rhythmus/markdown-resources/ ² http://markup.skriv.org/
  - NaNaN 12 years ago
    
    Thank you! Wonderful collections!
k__ 12 years ago

I would like a complete template language based on Markdown.
- phpnode 12 years ago
  
  AsciiDoc is similar to markdown and offers a much more extensive syntax, but it's nowhere near as common in the wild.
  http://www.methods.co.nz/asciidoc/

simonw 12 years ago

It's important to remember that most markdown implementations (including this one) cannot be used to provide a safe mechanism for authoring user-generated content without opening a site up to XSS vulnerabilities, since markdown allows arbitrary HTML markup.
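To make the risk concrete, here is a minimal sketch in Python. The toy renderer below is hypothetical (it only handles `**bold**`) and is not Parsedown's code; it just shows how raw HTML passes straight through a Markdown-style conversion, and how escaping the input first neutralizes it at the cost of disabling inline HTML entirely.

```python
import html
import re


def render_unsafe(text: str) -> str:
    """Toy renderer: converts **bold** and passes raw HTML through unchanged."""
    return re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)


def render_escaped(text: str) -> str:
    """Same toy renderer, but HTML special characters are escaped first."""
    return render_unsafe(html.escape(text, quote=True))


comment = "**hi** <script>alert(1)</script>"
print(render_unsafe(comment))   # <strong>hi</strong> <script>alert(1)</script>
print(render_escaped(comment))  # <strong>hi</strong> &lt;script&gt;alert(1)&lt;/script&gt;
```

Escaping alone would not cover everything a real parser has to worry about (link targets such as `javascript:` URLs, for example), so treat this as an illustration of the problem rather than a complete fix.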

McGlockenshire 12 years ago

Easily solved by proper use of HTMLPurifier on the output.
- phpnode 12 years ago
  
  Thus negating any speed improvements in the markdown parser....
  - Navarr 12 years ago
    
    Considering you have to run any markdown parser through a sanitizer, the speed improvements still matter.
    
    phpnode 12 years ago
    
    The markdown parser should be able to do it in an ideal world. HTMLPurifier is very slow.
    edit:
    To whoever downvoted me: I'm sorry, was I wrong? The markdown parser has to look at every input byte, and so does the HTML sanitizer, so it is obviously better to sanitize at the parser level and combine the two into one pass.
    Running HTMLPurifier on the output of the markdown parser is inefficient: it sanitizes known-good elements as well as the potentially bad ones, so you're giving it more work to do.
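The "one pass" idea can be sketched roughly like this in Python. This is a hypothetical toy that only understands `*emphasis*`; it is not how Parsedown or HTMLPurifier work internally. The point is that the same loop that recognizes Markdown syntax also escapes everything else, so the trusted tags it emits never need a second sanitizing pass.

```python
import html


def render_inline(text: str) -> str:
    """Walk the input once, escaping HTML characters and toggling <em> on '*'."""
    out = []
    in_em = False
    for ch in text:
        if ch == "*":
            # Tags emitted by the parser itself are known-good.
            out.append("</em>" if in_em else "<em>")
            in_em = not in_em
        else:
            # Everything that came from the user is escaped as it is copied.
            out.append(html.escape(ch))
    if in_em:
        out.append("</em>")  # close a dangling emphasis so the output stays balanced
    return "".join(out)


print(render_inline("*hi* <b>"))  # <em>hi</em> &lt;b&gt;
```

A real parser that wants to allow some inline HTML would need an allowlist at this step instead of blanket escaping, which is where most of the difficulty lies.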

brute 12 years ago

Different markdown editors seem to disagree on how to parse the following: https://gist.github.com/anonymous/810ae1f7d52bcfffa1ef

If the second empty line marks the end of the list block, the indented HTML (a code block) should preserve its tags.

stevekemp 12 years ago

Yeah it looks like my site fails at this - http://markdownshare.com/view/96996ce5-63bc-45ca-af49-ba18cb...

Navarr 12 years ago

I remember seeing this on /r/PHP, and one of the top comments there was about it using Regex instead of parsing it like a language.

However, I also recall that it's thanks to using regex that it works so quickly. So I figured I'd get this argument out of the way before someone else brought it up.

greghendershott 12 years ago

Well, the original markdown.pl heavily uses regexps.
From having tilted at this windmill a little myself, I think:
1. It's tricky enough to handle correctly all the under-specified corner cases of basic markdown -- not to mention the popular extensions to it. The cognitive load of doing it with complex regexps gets heavy, quickly.
2. I'm incredibly impressed with all the work that John MacFarlane has put into the problem, for example in [Pandoc] and [Cheapskate].
[Pandoc]: https://github.com/jgm/pandoc
[Cheapskate]: https://github.com/jgm/cheapskate
nkozyra 12 years ago

I think semantic parsing with a lexer and tokens is better for a lot of things, but it's sometimes overkill when the patterns are predictable and simple.
That said, has there ever really been an issue with speed as it pertains to markdown translation? I can't imagine it's an everyday, practical concern.
- chrismonsanto 12 years ago
  
  > That said, has there ever really been an issue with speed as it pertains to markdown translation?
  Yes, speed of translation is a big deal. I tried at least 4 Markdown parsers for Python precisely because I needed the right combination of speed and extensibility. When you are constructing a very large static site, a full rebuild can take a long time.
  For those wondering, I went with Mistune (http://mistune.readthedocs.org/en/latest/). It is accelerated by Cython.
- seer 12 years ago
  
  As with most tech, if there is such a leap in speed (about 10 times), then a lot of other applications become possible. You could remove a layer of caching because it's not needed anymore, thus reducing your app's complexity. But apart from that, imagine how many places use markdown. If people all move to a 10 times faster implementation, that's an incredible reduction in wasted CPU cycles.
  - nkozyra 12 years ago
    
    My point was not that we shouldn't work to produce even small efficiencies (which, yes, cascade into larger aggregate ones).
    It was more wondering whether speed in markdown parsing is such a concern that this would merit a marquee 'selling' point.
- oneeyedpigeon 12 years ago
  
  If you're building a static site from markdown files, and your site consists of thousands of pages, speed will definitely be a concern.
  - aaronem 12 years ago
    
    Not all that often, unless either a) you're in the habit of frequently making broad changes, or b) your build tool doesn't take account of modification times.
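For reference, "taking account of modification times" can be as small as the following Python sketch. The function names and the `.md` to `.html` layout are assumptions made up for illustration, not any particular static site generator's behaviour; `render` stands in for whichever Markdown parser you use.

```python
from pathlib import Path


def needs_rebuild(source: Path, target: Path) -> bool:
    """True if the target is missing or older than its source."""
    if not target.exists():
        return True
    return source.stat().st_mtime > target.stat().st_mtime


def build(src_dir: Path, out_dir: Path, render) -> list:
    """Re-render only stale pages; return the source names that were rebuilt."""
    out_dir.mkdir(parents=True, exist_ok=True)
    rebuilt = []
    for source in sorted(src_dir.glob("*.md")):
        target = out_dir / (source.stem + ".html")
        if needs_rebuild(source, target):
            target.write_text(render(source.read_text()))
            rebuilt.append(source.name)
    return rebuilt
```

With something like this in place, parser speed mostly matters for the first build and for changes that touch every page, such as a shared template.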
coolj 12 years ago

Can you parse HTML with regex?
http://stackoverflow.com/a/1732454

nodesocket 12 years ago

We started with a server-side Markdown parser, but switched to a JavaScript parser (https://github.com/chjj/marked). Really there is no reason to do this work on the server.

zaf 12 years ago

I've worked with several markdown implementations and parsedown is my current choice due to my main constraint - speed. Great work and thanks for sharing.

debaserab2 12 years ago

If speed is your top priority you may also want to look at Sundown, which can be installed as a PHP extension and is likely faster since it's just C.
https://github.com/chobie/php-sundown

alphadevx 12 years ago

Looks great, happy to see the Markdown Extra extension. With regard to performance, I've always gotten around the slowness of the original Markdown parser by making liberal use of caching, but warming the cache is still painful for a CMS. Will look to migrate to this.
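The caching workaround described here might look something like this minimal Python sketch. `RenderCache` is a hypothetical name, the in-memory dict stands in for whatever store a CMS would really use, and `render` is any Markdown-to-HTML function.

```python
import hashlib


class RenderCache:
    """Memoize rendered HTML, keyed by a hash of the Markdown source."""

    def __init__(self, render):
        self._render = render
        self._store = {}
        self.misses = 0  # how often the (slow) parser actually ran

    def get(self, markdown: str) -> str:
        key = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._render(markdown)
        return self._store[key]
```

Every miss still pays the full parsing cost, which is why warming a cold cache hurts with a slow parser and why a faster one helps even when caching stays in place.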

phpnode 12 years ago

Parsedown is certainly very fast, but I wouldn't call it "extensible". CeBe's markdown parser is nearly as fast, but focusses on being very easy to extend, so it's trivial to add custom syntax elements, see https://github.com/cebe/markdown

(CeBe's library is inspired by parsedown)

seer 12 years ago

> Parsedown is certainly very fast, but I wouldn't call it "extensible".
Parsedown is extensible and it already has been extended. There's a well working extension of Parsedown that adds support for Markdown Extra. It's called Parsedown Extra. It can be found at https://github.com/erusev/parsedown-extra
- lacksconfidence 12 years ago
  
  It is possible to extend, but extensible requires more. In this case, ParsedownExtra looks to directly extend the Parsedown class. This is fine for a single extension, but it discourages utilization of multiple independent extensions.
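One way to picture the difference is a parser that accepts independently written rules instead of being subclassed. The Python sketch below is purely illustrative: `Parser`, `register`, `bold` and `strike` are made-up names, and this is not the API of Parsedown, Parsedown Extra, or cebe/markdown.

```python
import html
import re
from typing import Callable, List

InlineRule = Callable[[str], str]


class Parser:
    """Applies independently registered inline rules in registration order."""

    def __init__(self) -> None:
        self._rules: List[InlineRule] = []

    def register(self, rule: InlineRule) -> "Parser":
        self._rules.append(rule)
        return self  # allow chaining

    def render(self, text: str) -> str:
        out = html.escape(text)  # escape once, before any rule adds tags
        for rule in self._rules:
            out = rule(out)
        return out


def bold(text: str) -> str:
    """Hypothetical extension 1: **bold**."""
    return re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)


def strike(text: str) -> str:
    """Hypothetical extension 2: ~~strikethrough~~, written without knowing about bold."""
    return re.sub(r"~~(.+?)~~", r"<del>\1</del>", text)


parser = Parser().register(bold).register(strike)
print(parser.render("**a** ~~b~~"))  # <strong>a</strong> <del>b</del>
```

With single inheritance, combining two extensions from different authors means one has to subclass the other; with a registry, neither needs to know the other exists.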

tshadwell 12 years ago

Perhaps I am looking at this wrong, but I don't see why you would use a Markdown parser written in PHP if you're looking for speed. Case in point: Parsedown is fast because it makes heavy use of regular expressions, which parse and run faster than equivalent code in the host language. It already relies on a language other than PHP to essentially emulate parts of a well-written lexer.

As debaserab2 says[1], if you are looking for speed, consider PHP extensions.

In my opinion, writing a system like this is a misappropriation of PHP, which evolved from and works best as a hybrid templating/scripting language. It becomes a powerful development platform when its extensive library of C functions is used to do most heavy lifting.

[1] https://news.ycombinator.com/item?id=7784219

NaNaN 12 years ago

If someone has done it well without compiling a C library, why don't you try it (on a shared server, maybe)?
- tshadwell 12 years ago
  
  I don't understand what you mean. Could you also explain the negativity around this comment? I didn't think it was a badly voiced opinion, and karma is not meant to be used to show how much you agree or disagree with someone.
  - NaNaN 12 years ago
    
    I had no right to downvote your reply (even now). ;) Please take it easy. I just want to say that using a C library is not always preferred.
    
    tshadwell 12 years ago
    
    I agree, and I did not mean to make a sweeping statement in that regard. But in the case of a standardised markup like markdown, there are already suites of field-tested C libraries that provide much better speed than this library would; for markdown content, that gives your users a better experience.
    
    NaNaN 12 years ago
    
    Agree, too. Well, plain text sucks. So I like to use smiley symbols now. ;-)
    
    tshadwell 12 years ago
    
    :¬)
