YouTube-dl has an interpreter for a subset of JavaScript in 870 lines of Python

473 points by yuuta 3 years ago · 165 comments

To be clear, this is an extremely tiny subset of JS. It looks like they only implemented the features needed to run a very specific function. For example, the only symbol allowed after "new" is "Date", everything else throws an exception.

It's still fun that it's there, but it's not as big a deal as it sounds from the tweet.

krab 3 years ago

It will only grow - as new scripts will need to be interpreted, new features will be added.
- lolinder 3 years ago
  
  I would be horrified if this grew much further. It's perfectly fine for its current scope, but the architecture would not scale at all to a full interpreter without essentially starting from scratch.
  - kelnos 3 years ago
    
    Yeah, at some point you have to question if it's worth spending time maintaining a quirky, error-prone, ever-growing mini-JS interpreter, or just adding a dependency on v8 or node or something. And then you don't have to worry about supporting new scripts, as they'll just always work.
    
    jchw 3 years ago
    
    If you were going to use a C library, the most logical is QuickJS since it has Python bindings, is small, perfectly fast enough for the kind of needs yt-dl has, and it has excellent coverage of the standard and passes conformance tests.
    That said I think a decent Python-native JS interpreter isn't that bad of an idea, it definitely needs a separate project and a more sophisticated architecture but it's an attainable goal.
    
    d0mine 3 years ago
    
    Being pure Python has the advantage that it can run in, e.g., Pythonista 3 for iOS (which makes it possible to implement various functions, such as downloading just the audio or sending a link to be viewed on a separate device).
  - whateveracct 3 years ago
    
    How much does it need to "scale"? It just has to be fast and correct enough for a CLI to work.
    
    lolinder 3 years ago
    
    "Scale" has multiple meanings, and here I'm not talking about speed.
    This interpreter is built around matching specific Regex patterns and then immediately executing hard-coded behaviors with a few slots for parameters. It's missing a whole lot of the skeletal structure that would be necessary to "scale" it up to support a generally useful subset of JS, much less the entire language. Without the necessary structure, it would be a buggy mess that's impossible to maintain, and you can't just take the existing code and structure it better: it's built on the wrong foundation. That's what I mean when I say it won't scale.
    That's not a judgement against the team that wrote it! It meets their needs fine, and choosing a minimal solution is great engineering that takes a lot of guts in the current software culture. I like what they've done here. Just don't take it and try to scale it up into a full JS runtime.
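To make the architecture point concrete, here is a minimal sketch of the "match a regex, then run a hard-coded behavior" style described above. It is not youtube-dl's actual code: the class name `TinyJSInterpreter`, its patterns, and the handful of statements it accepts are all hypothetical and chosen only for illustration.

```python
import operator
import re

# Hypothetical illustration only: none of these names come from youtube-dl.
_OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
}

_NAME = r"[a-zA-Z_$][\w$]*"


class TinyJSInterpreter:
    """Regex-matched statements wired straight to hard-coded behaviors."""

    def __init__(self):
        self.variables = {}

    def eval_expr(self, expr):
        expr = expr.strip()
        if re.fullmatch(r"-?\d+", expr):
            return int(expr)
        if re.fullmatch(_NAME, expr):
            return self.variables[expr]
        # "new" is only accepted when the constructor is exactly "Date".
        m = re.fullmatch(r"new\s+(" + _NAME + r")\((.*)\)", expr)
        if m:
            if m.group(1) != "Date":
                raise NotImplementedError("Unsupported object: " + m.group(1))
            return ("Date", m.group(2))
        # One binary operator at a time: no precedence, no parentheses.
        m = re.fullmatch(r"(.+?)\s*([+\-*])\s*(.+)", expr)
        if m:
            left, op, right = m.groups()
            return _OPERATORS[op](self.eval_expr(left), self.eval_expr(right))
        raise NotImplementedError("Unsupported expression: " + expr)

    def run(self, code):
        result = None
        # Naive statement split: breaks on any ";" inside a string literal.
        for stmt in filter(None, (s.strip() for s in code.split(";"))):
            m = re.fullmatch(r"var\s+(" + _NAME + r")\s*=\s*(.+)", stmt)
            if m:
                self.variables[m.group(1)] = self.eval_expr(m.group(2))
                continue
            m = re.fullmatch(r"return\s+(.+)", stmt)
            if m:
                return self.eval_expr(m.group(1))
            result = self.eval_expr(stmt)
        return result
```

Something like `TinyJSInterpreter().run("var a = 3; var b = a * 4; return b + 1")` works, but the sketch has no tokenizer, no syntax tree, and no operator precedence, so each new language feature means another special-case pattern. That is the sense in which this style does not scale.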
mid-kid 3 years ago

Yeah, it's essentially used as a javascript expression solver. You can see the full extent of its capabilities in the testsuite: https://github.com/ytdl-org/youtube-dl/blob/master/test/test...
The specific site modules in youtube-dl will take care to extract the bare minimum necessary to solve whatever challenge.
em-bee 3 years ago

if it's going to need much more than that then it probably would make more sense to port the whole application to javascript instead.
but then this could be turned into a commandline browser that is able to interpret a whole web-page and save the resulting html structure instead of the source as curl/wget would do.
- pvillano 3 years ago
  
  Eventually, YouTube-dl might have to simulate an entire browser and human user to fool Google. Until then, the usefulness of YouTube-dl is that it's less heavy than a full browser.
  I bet someone's already started a YouTube downloader that uses a headless browser

Uptrenda 3 years ago

Anyone who has ever pulled a website from a script knows the pain that is Javascript. Normally you want to just get some text and work out the API actions but a lot of sites use horribly obfuscated Javascript -- either because that's what modern web development is (lolz) -- or because it's part of their 'security.' That means if you want to write browser-based bots properly -- you ought to use a browser. There are special browsers that run 'headlessly' or are designed mostly for bot use, like https://www.selenium.dev/, which plugs into a few different 'browser engines.'

But now you have another problem. Your simple script goes from being small, simple, self-contained, and elegant gem, to requiring a full browser, specialized drivers, and/or daemons running just to work. If you're using something like Python you just frankly don't have very good packaging. So it's hard to string together all that into a solution and have it magically work for everyone. What YouTube-dl have done is good engineering. Even though it's not a full JS interpreter: they've kept their software lean, self-contained, and easier to use.

Scaevolus 3 years ago

Embedding V8 can work quite well: https://github.com/sqreen/PyMiniRacer
You probably have to emulate some of the DOM, but you can interact directly with whatever obfuscated/packed scripts in a more lightweight and secure way than driving an entire browser.
hansvm 3 years ago

I use pyminiracer to great effect for that sort of scraping.
eurasiantiger 3 years ago

Just npm install puppeteer.
- lolinder 3 years ago
  
  Puppeteer is cool, but it's exactly what OP is warning against: it's a full browser that is downloaded and run through npm. It's remarkably well packaged, but still far more error prone than a simple HTTP request, and far more likely to break on its own just with the passage of time.
  - eurasiantiger 3 years ago
    
    Yes, but:
    ”Your simple script goes from being small, simple, self-contained, and elegant gem, to requiring a full browser, specialized drivers, and/or daemons running just to work”
    Complex problems cannot be solved by simple scripts, but they can be abstracted away to vendor libraries when/if they are well maintained, such as in this case. While it can break with time, at least someone else fixes it for you.
  - ciupicri 3 years ago
    
    There's also puppeteer-core which lets you use your own (Google Chrome) browser and if your own browser is broken then you're having bigger problems than youtube-dl not working.
- ciupicri 3 years ago
  
  By the way there is also Playwright [1] and it has Python bindings too [2].
  [1]: https://playwright.dev/
  [2]: https://playwright.dev/python/docs/intro

delusional 3 years ago

Can we stop the trend of linking to tweets that just contain another link to the content? What's the point? Wouldn't this be 10x better if it was a link directly to the GitHub repo?

derangedHorse 3 years ago

I like the Twitter linking since it's almost like the OP is giving credit to where they found the information from.
- plaguepilled 3 years ago
  
  Agreed. If you only know this from someone else's observation, you should link the observation.
  - mkl 3 years ago
    
    That is against HN guidelines: "Please submit the original source. If a post reports on something found on another site, submit the latter." - https://news.ycombinator.com/newsguidelines.html
    
    plaguepilled 3 years ago
    
    Hot take, but that guideline is just bad. It removes necessary context and may even obscure relevant developments in a public dialogue. Original posts make sense only if the repost is not adding to the dialogue or otherwise making interesting contributions.
    (If you disagree, this may be the one time I will actively ask you to flag this post, so a mod can respond to this point)
    
    rolandog 3 years ago
    
    Perhaps some sort of convention, like /r/science could be followed (like having OP post a link to the article or reference as a top comment).
    
    nixcraft 3 years ago
    
    This. Whenever I post a comic or something funny on Twitter/FB, I add a credit (a URL to their home page or IG), but on Reddit there is no such thing. Most Reddit admins also ban direct linking to sources due to spam or remote server overloading, so instead you need to use Imgur or Reddit's hosting. So I write a comment giving credit, which often gets enough upvotes to push it to the top of the thread. I think those content creators deserve traffic and recognition too. It is good karma for everyone.
    
    Tao3300 3 years ago
    
    Citing the guidelines is against the guidelines, if not by the letter, in spirit. It's boring and it lacks curiosity. It assumes too much about the sharer. "Can we stop this trend" is a dog whistle for the "HN is getting worse" complaint.
    Instead we could be considering if we're meant to read the Twitter conversation as well, or sharing a laugh about the link in the tweet author's bio. Or maybe the sharer didn't feel comfortable enough seeming like they made the claim but still wanted to share it because it's kind of cool.
    AFAIK there's no junior HN mod of the year award.
    
    netheril96 3 years ago
    
    If guidelines were not to be cited, who would it guide?
    
    ciupicri 3 years ago
    
    How can I read the Twitter conversation when Twitter asks me to log in? :-)
    
    boogies 3 years ago
    
    With Nitter: https://github.com/zedeus/nitter/wiki/Instances https://github.com/zedeus/nitter/wiki/Extensions
    
    Tao3300 3 years ago
    
    cf. paywalls
    
    Tao3300 3 years ago
    
    I'm in stitches that my other comment was flagged at 0 min. Don't ever change, you guys ;-)
kelnos 3 years ago

I was thinking the same thing; link to the file on Github, with the same title text as is there now, and it saves me an extra click. And any time I don't have to visit Twitter, I consider that a win.
caned 3 years ago

I often share links to HN instead of the referred link. Many times the comments are as interesting as the content. This applies to sharing Twitter or Reddit links, too, albeit with a lower S/N ratio.
- Firmwarrior 3 years ago
  
  Is there some trick to actually being able to see information on Twitter? When I click a tweet, I get the tweet, then a random smattering of 2-3 semi-related tweets, and then a login popup that breaks the page
  Do you guys use an extension to process it or something?
  (Same issue with Reddit of course)
  - paulmd 3 years ago
    
    replace "twitter.com" with "nitter.net", or for video embedding (discord, etc) use vxtwitter.com or fxtwitter.com. Tweetdeck is what a lot of twitter people use for "serious twittering" (lol).
    For reddit use old.reddit.com instead of www.reddit.com. Reddit is Fun is a great native app for android and on iOS there's Apollo.
    Both sites are laser-focused on driving conversions and engagement which means forcing you into an account and native apps (specifically their shitty native apps), and undoubtedly they'll start breaking the workarounds and third-party clients for realsies at some point.
    But I mean, if users don't even have an account and native app install, how can they possibly get you doomscrolling all day? It's 2022, it's all about the engagement metrics, fuck user experience.

sylware 3 years ago

Nowadays "javascript" refers to the scriptable, grotesquely and absurdly complex and massive web engines, aka the Google-financed Blink and Gecko, then the Apple-financed WebKit, along with their SDKs.

The currently obfuscated javascript media players will try to break yt-dlp by leveraging the complexity and size of those scripted web engines. They will put them out of reach of small teams or individuals, and, even "better", it will force people to use Apple's or Google's web engine, killing any attempt to provide a real alternative.

A standalone javascript interpreter is actually some work, but seems to stay in the "reasonable" realm: look at quickjs from M. Bellard and friends (the guy who created qemu, ffmpeg, tinycc, etc): plain and simple C (no need for a C++ compiler), doing the job more than well enough.

That's why noscript/basic (x)html is so much important.

dtx1 3 years ago

> but seems to stay in the "reasonable" realm
> M. Bellard and friends
Choose one; that dude is a wizard wielding C like a brain surgeon wields a scalpel.
olliej 3 years ago

Yeah I agree with almost all of this - the massive size and complexity of commercial engines makes it seem like JS the language must also be complex.
I also agree with the idea that these sites will probably be able to/want to create JS that breaks these small/lightweight engines requiring constant work :-/
This final point I disagree with entirely. You can't point to Bellard doing something as evidence that it's reasonable. This is a guy that wrote a program that generated a TV signal via a VGA card. :D
axiolite 3 years ago

> quickjs from M. Bellard and friends
Is the M key next to the F key on your particular keyboard by chance? Because I've always called him "Fabrice."
https://en.wikipedia.org/wiki/Fabrice_Bellard
- a_e_k 3 years ago
  
  Could just be the usual abbreviation for Monsieur.
- ganjatech 3 years ago
  
  Monsieur Bellard - M. Bellard
  - sylware 3 years ago
    
    Yeah... M = Monsieur (in french, namely Mister in english), I forgot the 'r'... I should have written Mr. Bellard, I kneel upon the weight of my apology.
randyrand 3 years ago

Chrome and Safari both have open source JS engines…
- userbinator 3 years ago
  
  That's beside the point. Open-source is not useful to the smaller players if it is too complex to comprehend and constantly churned.
  - kelnos 3 years ago
    
    That's not the case, though. There are even python modules that let you evaluate JS code in v8 (Chrome's JS interpreter). It'd be pretty trivial for youtube-dl to make use of that if the author thought it was worth doing.
    
    sylware 3 years ago
    
    It is the case, not to mention v8 pulls in an extremely expensive C++ compiler in the SDK while quickjs can compile with tinycc.
    Open source is not enough anymore, "lean" open source is the way now, SDK included.
oblak 3 years ago

ah, but quickjs is an actual js engine. I have tried a couple of versions with real progress between them. This thing here is not
languageserver 3 years ago

> That's why noscript/basic (x)html is so much important.
xhtml has been dead for a decade

esprehn 3 years ago

This isn't really JS, it's a purpose built evaluator that's only for evaluating a particular script on YouTube, assuming a huge list of things are true about how YouTube JS is written.

Ex. It's got a hard-coded list of methods for String, and it doesn't respect prototypes. It only supports creating Date instances, and won't work if you override the global Date. It parses with regexes and implements all operators with Python's operator module (which has the wrong type semantics), etc. Nearly none of the semantics of JS are implemented.

It's sort of the sandwich categorization problem:

If I write a C# "interpreter" in Perl that's only 200 lines and just handles string.Join, string.Concat and Console.WriteLine, and it doesn't actually try to implement C# syntax or semantics at all and just uses Perl semantics for those operations, is it actually C#? :P

I say "not a sandwich".
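The point about type semantics can be illustrated with a short sketch. In JavaScript, `"1" + 1` evaluates to `"11"`, whereas Python's `operator.add` raises a `TypeError` for the same operands. The `js_add` helper below is hypothetical and deliberately simplified (numbers and strings only, with rough number formatting); it only shows the kind of coercion logic a faithful evaluator would need, not what youtube-dl does.

```python
import operator


def python_add(a, b):
    """What an evaluator gets if it maps JS '+' directly to operator.add."""
    return operator.add(a, b)


def js_number_to_string(n):
    # Simplified: JS prints integral doubles without a trailing ".0".
    # Very large values, NaN and Infinity are not handled faithfully here.
    if isinstance(n, float) and n.is_integer():
        return str(int(n))
    return str(n)


def js_add(a, b):
    """Hypothetical, simplified JS '+' for numbers and strings only."""
    if isinstance(a, str) or isinstance(b, str):
        # If either operand is a string, JS concatenates string forms.
        left = a if isinstance(a, str) else js_number_to_string(a)
        right = b if isinstance(b, str) else js_number_to_string(b)
        return left + right
    return a + b
```

With these definitions, `js_add("1", 1)` returns `"11"` while `python_add("1", 1)` raises `TypeError`. Lists diverge too: Python concatenates `[1, 2] + [3]` into a list, while JavaScript converts both arrays to strings first.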

jraph 3 years ago

And as a user of youtube-dl, I'm quite happy about this. This probably allows a very safe, restricted "subset" of JS. Way better than using a full JS engine. 900 lines is still small and manageable.
- mjevans 3 years ago
  
  yt-dlp sometimes doesn't know how to evaluate the JavaScript/ECMAScript and will call out to an optional dependency, a real JavaScript interpreter, if installed.
- sebzim4500 3 years ago
  
  I'm trying to get the threat model here. Is the concern that YouTube will inject JS into the payload which tries to break out of the youtube-dl JS sandbox using some zero day in whatever JS engine they would use instead?
  - pabs3 3 years ago
    
    One of the reasons people use yt-dlp/youtube-dl (and nitter.net/etc) is to transform the modern proprietary JavaScript web into something more suitable for enthusiasts of the old document web and of FOSS. If the web switched to plain <video> then yt-dlp/youtube-dl would become completely unnecessary. Your browser should not have to run JS to watch an embedded video.
    
    nyanpasu64 3 years ago
    
    On my Ivy Bridge laptop running Linux, enabling hardware video decode in mpv took installing one package and adding one line to mpv.conf. Enabling hardware decoding in Firefox took multiple attempts of Googling frantically, toggling flags in about:config, passing logging environment variables to Firefox, recording a Pernosco trace of multi-process communication, and even asking for help in the gfx-firefox Matrix chat where they pointed out I had disabled media.rdd-process.enabled causing Firefox to print a misleading error message in about:support saying HARDWARE_VIDEO_DECODING was available, but failing at runtime saying WebRender was disabled even though it was enabled. And to my knowledge, hardware decoding in Chromium is simply not possible on Linux right now (maybe possible on Chromebooks, I haven't checked).
    Even after I fixed hardware acceleration, playing a 1080p YouTube video in Firefox using hardware H.264 decoding took more CPU energy (40% of a core) than playing the same video in mpv using software H.264 decoding (20% of a core). Web browsers are just horrifically complex, intractable to understand, and inefficient.
  - rwmj 3 years ago
    
    Google attempting zero days on client computers would be something. It's not totally without precedent (Sony CD rootkits - https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootk...) but would still be major news.
    
    btown 3 years ago
    
    While they likely wouldn't do a zero-day, their JS files, particularly for automated captchas, do push the boundaries of whatever JS engine they're executed inside. See https://github.com/neuroradiology/InsideReCaptcha#the-analys... and note that this analysis is 8 years old. While there's minimal risk if you're either using a full-fledged modern JS engine or a limited-subset interpreter like the OP, an older or non-optimized spec-compliant JS engine might hit pathological performance cases and result in you DOSing yourself.
    
    origin_path 3 years ago
    
    It's interesting to speculate about why they don't use this much more powerful technology to stop ytdl but instead use this much weaker yt specific thing.
    Most likely the reason is that they keep the botguard system for the stuff that matters to them a lot more like account signups and click fraud, and don't want to incentivize the ytdl guys to break it on behalf of spammers/clickfraudsters.
    
    saagarjha 3 years ago
    
    I mean the DOS would be that your youtube-dl invocation hangs, and then you kill it.
  - jraph 3 years ago
    
    Let's say they end up using Node. Node has a quite complete standard library that lets you access files and everything.
    Now if they do it right and only embed some bare JS interpreter, it's still way harder to audit than these < 900 lines, for which it is quite easy to convince oneself that the interpreted script cannot do much.
    
    geysersam 3 years ago
    
    Nowadays they could probably use Deno. Without permissions it doesn't allow network or file access etc.
  - kevingadd 3 years ago
    
    Embedding a whole js engine and then interopping with it from python would be non trivial. Good luck fixing any bugs or corner cases you hit that way. The V8 and spidermonkey embedding apis are both c++ (iirc) and non trivial to use correctly.
    Having full control like this +simple code is probably lower risk and more maintainable, even if there's the challenge of expanding feature set if scripts change.
    The alternative would be a console js shell, but those are very different from browsers so that poses its own challenges.
    
    esprehn 3 years ago
    
    Fwiw there are python bindings for QuickJS and Duktape:
    https://github.com/PetterS/quickjs
    https://github.com/stefano/pyduktape
    https://github.com/amol-/dukpy
    I can't speak to the quality of those bindings, but they do seem maintained.
    
    lloeki 3 years ago
    
    > Embedding a whole js engine and then interopping with it from python would be non trivial.
    Cue libv8-node+mini_racer from which PyMiniRacer was born. It is non-trivial but not as hard as one might think.
    The most painful part is the libv8 build system and Google-centric tooling (depot tools!), which makes it an absolute PITA for libv8 consumers that are not Google/Chrome.
    This is why the libv8 gem was atrocious to keep up to date and to build for several platforms, and why libv8-node was born, because the node build system and source distribution are actually sane (props to their relentless work, on which we piggyback).
    Disclaimer: worked at Sqreen, now maintainer of libv8-node and collaborator of mini_racer
    https://github.com/sqreen/PyMiniRacer
    https://github.com/rubyjs/mini_racer
    https://github.com/rubyjs/libv8-node
    
    kevingadd 3 years ago
    
    Very cool, I'll have to remember that this exists! Looks useful.
    
    em-bee 3 years ago
    
    apparently yt-dlp is somehow calling out to a js engine if available
    
    kevingadd 3 years ago
    
    Yeah, it's possible to install v8 or spidermonkey shells and use them to run code - we use them to run parts of the .NET wasm test suite - but they have a bunch of arbitrary limitations, so if you're trying to emulate a browser I'm not sure I'd bet on them. It's certainly going to be easier than a C++ embedding, so it makes sense that they took that route.
    Another option is to use node, but it also has weird limitations/behaviors when running code.
  - loeg 3 years ago
    
    youtube-dl targets a lot of websites other than Google properties, many of which are a lot sketchier (think, uh, NSFW streaming sites).
- jiggawatts 3 years ago
  
  That’s the exact same logic I hear from developers who say things like:
  Why do I need a full XML parser when I can just extract what I need with regex?
  And:
  All that RPC IDL stuff is overcomplicated, REST is so much easier because I can just write the client by hand.
dang 3 years ago

Ok, we've changed this title to shrink the scope of the interpreter.
Submitted title was "YouTube-dl has a JavaScript interpreter written in 870 lines of Python".
- ec109685 3 years ago
  
  Hence why HN is better than Twitter.
  The number of high-engagement, just plain wrong tweets there is just sad.
tra3 3 years ago

It quacks like a duck at midnight, but it’s actually a frog?
blast 3 years ago

I suppose this means it would be easy for YouTube to fuck with youtube-dl simply by throwing in more features of JS?
- _pn3l 3 years ago
  
  Cat, meet mouse.
  - nyanpasu64 3 years ago
    
    It's unfortunate, https://github.com/mpv-player/mpv/issues/8655#issuecomment-1...:
    > Youtube now throttles requests of more than 10MB at a time, yt-dlp works around it by making many requests of 10MB using Range HTTP headers (yt-dlp calls it the http-chunk-size), but ffmpeg which does the downloading for mpv doesn't support that yet.
    I want to change mpv or yt-dlp to support range-based video URLs (eg. appending &range=333999644-335298975&rn=5&rbuf=0 to URLs) which speed up stream seeking and probably eliminate throttling altogether, but I haven't taken the time to look into how to achieve it. For anyone interested, I have an open bug report at https://github.com/mpv-player/mpv/issues/10601, and have found https://satadalsengupta.github.io/docs/papers/2017_nossdav_y... describing these parameters.
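The chunked workaround quoted above amounts to splitting one large download into several ranged requests. Below is a minimal sketch of just the range arithmetic, assuming the total size is already known. The function names `chunk_ranges` and `range_header` and the 10 MiB default are illustrative and not taken from yt-dlp's code.

```python
def chunk_ranges(total_size, chunk_size=10 * 1024 * 1024):
    """Yield (start, end) byte offsets, inclusive, covering total_size bytes."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    start = 0
    while start < total_size:
        # HTTP byte ranges are inclusive, hence the "- 1".
        end = min(start + chunk_size, total_size) - 1
        yield start, end
        start = end + 1


def range_header(start, end):
    """Build an HTTP Range header for one chunk; both offsets are inclusive."""
    return {"Range": "bytes={}-{}".format(start, end)}
```

A downloader would then issue one request per pair, for example with `range_header(start, end)` merged into its usual request headers, and append the response bodies in order.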
Test0129 3 years ago

This really isn't fair. Just because it doesn't faithfully implement whatever standard JavaScript is on doesn't mean it isn't an interpreter. All an interpreter is is something that executes a script directly rather than requiring compilation. It is a de facto interpreter for a subset of JavaScript. Nothing more, nothing less. The title could be clearer, however.
- blast 3 years ago
  
  esprehn didn't say it isn't an interpreter. They're saying it is an interpreter and what it's interpreting isn't (all of) JS. That's also what you're saying, so you're agreeing with esprehn.
  Edit: You misunderstood baobabKoodaa in the same way. Nobody is arguing about what constitutes an interpreter, except you. The question is what language is being interpreted.
  Before accusing someone of pedantry, it would first be good not to completely misread them.
- baobabKoodaa 3 years ago
  
  There's a huge difference between an interpreter for "JavaScript" and an interpreter for a "subset of JavaScript".
  - Test0129 3 years ago
    
    Making a pedantic argument on what constitutes an interpreter is silly. The title is bad. It is an interpreter. I'll continue to eat downvotes on this because of the pedantry of HN.
    
    jraph 3 years ago
    
    I didn't downvote, but I don't think esprehn is being unfair. Their comment is very informative. They didn't argue that what was implemented is not an interpreter, they did explain why it's not a JavaScript interpreter and not even an interpreter for a subset of JavaScript. It's just a special purpose interpreter suitable for YouTube's code that cannot be re-used for any code that uses the subset that it seems to implement.
    It's not pedantry (or I'm pedantic). It's a reaction to the title that can lead people to believe that a complete JavaScript interpreter has been written in less than a thousand lines of Python. This reaction is perfectly understandable.
    
    khazhoux 3 years ago
    
    Technically, it’s only the pedantry of a subset of HN.
    
    lupire 3 years ago
    
    It's an interpretation of a subset of the pedantry on HN.
    
    baobabKoodaa 3 years ago
    
    > Making a pedantic argument on what constitutes an interpreter is silly. The title is bad. It is an interpreter.
    It's not a pedantic argument. Based on the title I thought that somebody wrote something akin to V8 in 800 lines of Python. After reading the comments I realized those 800 lines just interpret a particular JavaScript function written by Youtube. Those things are different. Pointing out the fact that they are different is not pedantry. The title is misleading and the comments pointing that out are helpful.
    
    chess_buster 3 years ago
    
    I evaluated it with my Pedantic Interpreter which only results in the `pedantic` token.
    
    blondin 3 years ago
    
    my vote is meaningless and i am sorry about that. but just wanted to let you know that what you said made sense. do not let people get to you.
    most of us know that a thousand or so lines of code is not a full JavaScript interpreter and cannot be the real thing.
    there is no argument or conversation to have about it.

haunter 3 years ago

The same in yt-dlp https://github.com/yt-dlp/yt-dlp/blob/master/yt_dlp/jsinterp...

Interesting to see the diffcheck between the two https://www.diffchecker.com/8EJGN27K

cheschire 3 years ago

Is yt-dlp's implementation being better the reason why I have fewer throttling issues than with youtube-dl?
- LeoPanthera 3 years ago
  
  Maybe this isn't true anymore, but for a while they would hit different APIs. yt-dlp was using the Android YouTube API because it had no throttling.

kristopolous 3 years ago

To understand why, I have a far simpler tool that focuses on a subset of sites (adult content video aggregators)

https://github.com/kristopolous/tube-get

It too deals with this problem but does so in a way that'd be easy to maliciously sabotage

Look right about here https://github.com/kristopolous/tube-get/blob/master/tube-ge...

As to why this program exists: it was originally written between about 2010 and 2015 or so, and technically predates the yt-* ecosystem.

The tool still works fine, and it's not a strict subset of yt-dlp or YouTube-dl because it takes a different approach. Although its overall site coverage is smaller, I've used it as a "second try" system when yt-* fails, and it succeeds maybe about half the time.

pabs3 3 years ago

Would you mind switching to subprocess with shell=False? os.popen is obsolete and insecure because it passes the command through the shell.
PS: I found it quite easy to contribute to yt-dlp and the reviewers are ultra-helpful and kind, you might want to migrate all of your extractors there.
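As a sketch of that suggestion: passing the command as a list to `subprocess.run` (where `shell=False` is the default) hands each argument to the program unchanged, so shell metacharacters in a URL or filename are never interpreted. The `echo_argument` helper is hypothetical, and it uses the Python interpreter itself as a stand-in for whatever external tool is really being invoked.

```python
import subprocess
import sys


def echo_argument(untrusted):
    """Run a child process with an argument list, bypassing the shell."""
    completed = subprocess.run(
        # Stand-in command: a child Python that prints its first argument.
        [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted],
        shell=False,  # explicit here; it is also the default
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout.rstrip("\r\n")
```

Calling `echo_argument("video.mp4; echo pwned")` returns the string unchanged, because the `;` is part of a single argument and is never seen by a shell.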
- kristopolous 3 years ago
  
  1. It's ancient code but sure
  2. They're fundamentally not compatible approaches. This is worthless to them

aeyes 3 years ago

They just don't want to use any external dependencies... There is also an AES implementation: https://github.com/ytdl-org/youtube-dl/blob/master/youtube_d...

M30 3 years ago

How should a programming noob interpret this? Be impressed at what was achieved here? Be concerned about security implications using the tool? Something else entirely?

rkangel 3 years ago

This is the compiler writer equivalent of parsing HTML with regex:
It is technically wrong - it isn't a sufficiently rich and powerful approach to handle all JS (HTML) that you might throw at it. It'll work for a while until it eventually barfs when you least expect it.
EXCEPT that if the inputs you are giving it come from some understood source(s) that aren't likely to change, then an approach simpler than the "all singing, all dancing" correct one may be appropriate and justified, e.g. because it might be easier to write, easier to maintain, and/or have less attack surface.
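A small sketch of that trade-off, using the HTML side of the analogy. The pattern `LINK_RE` and the sample snippets are made up for illustration: the regex is tuned to one known layout and silently returns nothing when the markup changes, while the standard library's `html.parser` handles both forms.

```python
import re
from html.parser import HTMLParser

# Tuned to one known page layout: double quotes, href as the only attribute.
LINK_RE = re.compile(r'<a href="([^"]+)">')


def links_by_regex(html):
    """Cheap extraction that only works while the markup keeps its shape."""
    return LINK_RE.findall(html)


class _LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def links_by_parser(html):
    """Slower to write, but indifferent to quoting and attribute order."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


known = '<a href="/watch?v=1">one</a>'
changed = "<a class='x' href='/watch?v=2'>two</a>"
```

On `known` both functions find the link. On `changed` the regex version returns an empty list without any error, which is the "barfs when you least expect it" failure mode, while the parser version still finds `/watch?v=2`.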
- pwdisswordfish9 3 years ago
  
  > some understood source(s) that aren't likely to change
  Does that apply to YouTube? Or any of the other hundreds of supported sites?
  - rkangel 3 years ago
    
    Presumably because it gets tested with those sites and the JS doesn't change that much it can be fixed or adjusted as required.
lolinder 3 years ago

It's an extremely tiny subset of JS—as an example, the only object that can be instantiated is Date. Anything other than "Date" after "new" throws an exception.
It's definitely neat, but not especially useful outside of the confines of its current application, and the security concerns of such a tiny subset will be minimal.
- petters 3 years ago
  
  > Anything other than "Date" after "new" throws an exception
  It's even very sensitive to white space.
smcl 3 years ago

All of the above, really.
chlorion 3 years ago

The "interpreter" in the youtube-dl source is probably safe from a security standpoint.
yt-dlp seems to support running javascript in a full javascript interpreter/headless browser called phantomjs though. Running javascript in a full interpreter like this is a lot more scary from a security standpoint. I am not sure whether phantomjs sandboxes the javascript evaluation from the rest of the system, and if it does, whether the sandbox actually works properly at all. It looks like the project is not being maintained which is another bad sign.
Big projects with lots of manpower behind them such as chromium have trouble keeping javascript evaluation safe, so I would really suggest not trusting phantomjs on untrusted input.
bjt2n3904 3 years ago

The goal of youtube-dl is to download a video off of YouTube for offline storage.
This isn't something YouTube particularly enjoys. They would rather you keep coming back -- every visit is more ad revenue for them. If you have an offline copy, you don't need to visit YouTube anymore.
YouTube has an incentive, therefore, to make it more difficult to download (or "scrape") their content.
I'm not particularly sure of the specific details, but apparently YouTube has added JavaScript (a programming language that executes in the browser) as a hurdle to jump over. A simple python script doesn't have enough brains to execute JavaScript, only enough to realize that it exists. (Clearly, youtube-dl is sophisticated enough to have jumped over it.)
These are the conclusions I come to, having written software for about a decade.
1) Once you give information to someone, be it text, pictures, sound, or video -- they will do whatever they want with it, and you have no control. Oh, yes -- it may be illegal. Maybe unethical. But the fact of the matter is you do not have control over information once it leaves your hands.
2) Adding hurdles to make it harder to access the information does little to stop someone who is dedicated to accessing it.
3) Implementing a subset of JavaScript in such an elegant and tiny manner is quite impressive.
How you interpret these facts depends on your worldviews. If you are a media and content creator, you will view these facts differently than a politician, and a teenager.
As an engineer and amateur philosopher, I certainly support the rights of content creators to be paid for their work. And yet, I fear that more and more, content creators want to lease me a right to listen to their music, instead of letting me own a copy of it.
I used to own CDs, DVDs, movies, and books. What happens if Amazon or YouTube decides to not serve me anymore? Anything I've "purchased" from them, I lose access to.
Furthermore, if I create a song, I used to be able to burn copies onto CDs and distribute them on street corners. Now, you have to sign up to stream on Spotify. This is a double-edged sword -- I get a wide audience, but Spotify will do whatever they want with me.
This troubles me.
Test0129 3 years ago

> How should a programming noob interpret this?
Usually in a virtual machine.
tenebrisalietum 3 years ago

> How should a programming noob interpret this?
The browser is client-facing, and everything there can be reverse engineered and figured out. So if you design a web-based application and depend on client-side JavaScript for any security or distribution enforcement, it can be helpful, but it can ultimately be unwound and cracked even if obfuscated, etc.
> Be impressed at what was achieved here?
Yes. Try to download a YouTube video without it or an online service which is probably using it internally.
- Supermancho 3 years ago
  
  Youtube-dl is impressive. This particular hack is not.
  - pwdisswordfish9 3 years ago
    
    youtube-dl as a whole is not particularly impressive either. It’s a big pile of unresolved technical debt, of hacks-upon-hacks and quick-and-dirty temporary solutions just like this one staying there for years.
Tao3300 3 years ago

In the face of weird shit like this, I give you the permission to go with your gut.

lewisl9029 3 years ago

Another really cool JS dialect I recently learned about is njs from the nginx team: https://github.com/nginx/njs

This video goes into some of the design and tradeoffs: https://www.youtube.com/watch?v=Jc_L6UffFOs

TL;DW: they optimized for fast creation/destruction of low-footprint VMs with no JIT or garbage collection.

homarp 3 years ago

the tests for it: https://github.com/ytdl-org/youtube-dl/blob/master/test/test...

olliej 3 years ago

This is super cool.

Some of the stuff is kind of questionable to me in the sense that I could believe you could probably make some kind of sufficiently wonky JS that this would do the "wrong" thing.

But it's super cool that they are able to do this as I think it shows that claims of JS complexity based on the size of JS engines are overlooking just how much of that size/complexity comes from the "make it fast" drive vs. what the language requires. Here you have a <1000LoC implementation of the core of the JS language, removed from things like regex engines, GCs, etc.

Mad props to them for even attempting it as well - it simply would not have ever occurred to me to say "let's just write a small JS engine" and I would have spent stupid amounts of time attempting to use JSC* from python instead.

[* JSC appears to be the only JS engine with a pure C API, and the API and ABI are stable so on iOS/macOS at least you can just use the system one which reduces binary size+build annoyance. The downside is that C is terrible, and C++ (differently terrible? :D) APIs make for much more pleasant interfaces to the VM - constructors+destructors mean that you get automatic lifetime management so handles to objects aren't miserable, and you can have templates that allow your API to provide handles that have real type information. JSC only has JSValueRef and JSObjectRef, and as a JSObjectRef is a JSValueRef it's actually just a typedef to const JSValueRef :D OTOH I do think JSC's partially conservative GC is superior to Handles for stack/temporary variables for the most part, but it's also absolutely necessary to have an API that isn't absolutely wretched. The real problem with JSC's API is that it has not got any love for many many many .... many years so it doesn't have any way to handle or interact with many modern features without some kludgy wrappers where you push your API objects into JS and have the JS code wrap them up. The API objects are also super slow, as they basically get treated as "oh ffs" objects that obey no rules. I really do wish it would get updated to something more pleasant and really usable.]

esprehn 3 years ago

This doesn't actually implement any of the JS language though, it just reuses all of Python's semantics and hard-codes a tiny list of e.g. String methods.
I also assume you mean mainstream JS engines, but Duktape, JerryScript and QuickJS all have C APIs.
They probably could have used e.g. https://github.com/PetterS/quickjs instead of the hacks in the OP linked file.
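One way to picture "reusing Python's semantics with a hard-coded list of methods" is a dispatch table from JS method names to Python callables. This is a sketch under assumptions, not the code in the linked file: `STRING_METHODS` and `call_string_method` are hypothetical names, and only three methods are covered.

```python
# Hypothetical table: JS String method name -> Python implementation.
STRING_METHODS = {
    # JS 'abc'.split('') yields one element per character; Python's
    # str.split('') raises ValueError, so that case is handled separately.
    'split': lambda s, sep: list(s) if sep == '' else s.split(sep),
    # JS charAt returns '' for an out-of-range index instead of raising.
    'charAt': lambda s, i: s[i] if 0 <= i < len(s) else '',
    'toUpperCase': lambda s: s.upper(),
}


def call_string_method(obj, name, *args):
    """Call a whitelisted JS-style method on a Python str."""
    try:
        impl = STRING_METHODS[name]
    except KeyError:
        raise NotImplementedError('unsupported String method: %s' % name)
    return impl(obj, *args)
```

Anything outside the table fails loudly, which matches the "throws an exception" behavior described elsewhere in the thread, and each entry only has to paper over the places where Python and JS disagree.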
- olliej 3 years ago
  
  Ah, I only briefly scanned the implementation, and it looked like it was doing actual work - is it mostly string replacing to get approximate python equivalent syntax? Regardless that's disappointing.
  You are correct though that I was only thinking of the big engines - bias on my part alas.
  For your suggested alternate engines, JerryScript and QuickJS seem more complete than Duktape but I can't quite work out the GC strategy of JerryScript. Bellard says QuickJS has a cycle detector but I'm generally dubious of them based on prior experience.
  If I was shipping software that had to actually include a JS engine, if perf was not an issue I would probably use JerryScript or QuickJS as binary size I think would be a more critical component.

jraph 3 years ago

I do wonder why YouTube does not try harder to make it difficult to do this computation meant to prove you are a legit YouTube web client. Providing an easy-to-find, simple JS function interpretable with 900 lines of Python is like they don't try at all. They might as well do nothing.

Or is their goal just to make youtube-dl not 100% reliable? Or to be able to say "look, you are running our code in a way we did not intend, you can't do this because you are breaking the EULA"?

zuminator 3 years ago

I'd guess that their efforts to make it harder are limited by the fact that they want YouTube to be able to play on thousands of different low powered set top boxes and cheap phones. So whatever obfuscated code they use has to be simple enough to be run and periodically updated by all these different devices, and that same simplicity makes it emulable.
Arnavion 3 years ago

They do make it harder from time to time. In fact yt-dlp's interpreter has been broken for a month or so now and the devs finally gave up and told users to just install PhantomJS (which itself hasn't been updated since 2016 and probably has bugs / vulns of its own, but whatever).
https://github.com/yt-dlp/yt-dlp/issues/4635#issuecomment-12...
- whywhywhywhy 3 years ago
  
  I mean if this is the direction it’s heading it makes more sense to port yt-dlp to node. It’s already dependent on a scripting language, it may as well be the one YouTube speaks.
Cthulhu_ 3 years ago

I'm guessing the amount of people using it is low enough to not bother with mitigation. Then again, there's a LOT of YT videos that take clips from other videos (which in most cases falls under fair use), which I can imagine would use this tool.

mdaniel 3 years ago

I was expecting this to be about Duktape <https://github.com/svaarala/duktape>, but heh, for sure no. I'd bet $1 there's no way youtube-dl would switch, but I wonder if yt-dlp would?

rcarmo 3 years ago

Awesome. Even if it's likely incomplete, it might come in really handy for some scraping I need to do...

Too 3 years ago

They must have been inspired by this PyCon presentation, where David Beazley live codes a fully working webassembly interpreter, in under one hour. https://youtu.be/VUT386_GKI8

atan2 3 years ago

This seems to be a pretty small subset of JavaScript, but I personally love small projects like this for educational purposes. Removing the noise and keeping things minimal helps my brain reason about things.

Earlier this year I enrolled in an online class called "Building a Programming Language" taught by Roberto Ierusalimschy (creator of Lua) and Gustavo Pezzi (creator of pikuma.com). We created a toy language interpreter/VM and the final code was around 1,800 lines of Lua code. Keeping things as simple (and sometimes naive) as possible was definitely the right choice for me to really wrap my head around the basic theory and connect the dots.

Thanks for the link.

Tao3300 3 years ago

Greenspun's Tenth Rule:

> Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp. [1]

And here we have a complicated Python program with a partial JS implementation in it.

[1] https://en.wikipedia.org/wiki/Greenspun's_tenth_rule

anony23 3 years ago

What purpose does it serve?

rany_ 3 years ago

They need to run a JavaScript function to download YouTube videos at normal speeds.
Edit: it's also required to download music, otherwise it will just fail
Source:
- https://github.com/ytdl-org/youtube-dl/issues/29326#issuecom...
- https://github.com/ytdl-org/youtube-dl/blob/d619dd712f63aab1...
- https://github.com/ytdl-org/youtube-dl/commit/cf001636600430...
- ajkjk 3 years ago
  Wow:
  > Overview of the control flow (already known):
  > The Youtube API provides you with n - your video access token.
  > If their new changes apply to your client (they do for "web") then it is expected your client will modify n based on internal logic. This logic is inside player...base.js.
  > n is modified by a cryptic function.
  > Modified n is sent back to server as proof that we're an official client.
  > If you send n unmodified, the server will eventually throttle you.
  So they can always change the function to keep you on your toes, hence you need to be able to run semi-arbitrary JS in order to keep using the API.
  Waste of human brainpower but I guess that energy is better spent imagining a world where Google isn't in charge instead of kvetching about what they're doing with their influence.
  - isatty 3 years ago
    
    There is a reason Google is able to serve the amount of video bandwidth, and also a reason why there are no worthwhile youtube clones. Some amount of scrape protection is absolutely essential.
    
    Aperocky 3 years ago
    
    Seems like they ultimately failed, youtube-dl is available freely as a pip package, anyone with scraping intent would have been able to use it.
elaus 3 years ago

I'd have to read up on the specifics as well, but I think basically Youtube uses a lot of obfuscated, rapidly and automatically changing Javascript code to fetch the video data. A project like youtube-dl has to run this code to be able to download videos, because that's what's happening in the browser as well.
- temp_account_32 3 years ago
  
  For those interested further, in some of the past few weeks youtube-dl had stopped working intermittently for multiple hours at a time, and it was precisely related to this code.
  We have a custom-made Discord music bot on our server which uses ytdl to stream songs so we can listen together, and at one point we were listening and suddenly got some obscure JavaScript error.
  We began joking that there's some bug in the code which breaks it after 6PM, but later found out that Google had changed some of the obfuscated JS and this basically broke this part of code, which prevented us from fetching the song information.
- londons_explore 3 years ago
  
  If you start a youtube video and then pause it and resume a few days later, you'll notice that the youtube page plays for ~30 seconds (i.e. what's buffered) and then the page refreshes. I'd guess this refresh is to pick up the new javascript and any updates to the HTML code.
  It's kinda annoying if you have a lot of youtube tabs open for a long time and come back to them.
- bitexploder 3 years ago
  
  What is interesting is it seems to be constant cat and mouse. I download a YT vid. It crawls. Update yt-dlp, it flies again. I love yt-dlp and use it a lot.
- lupire 3 years ago
  
  But why not just use a normal JS engine called from Python?
hadrien01 3 years ago

It's used in the YouTube extractor: https://github.com/ytdl-org/youtube-dl/blob/d619dd712f63aab1...
I believe YouTube limits your bitrate if you don't pass a specific calculated value; it's possible youtube-dl has to parse and eval JS to get it.
- RicoElectrico 3 years ago
  
  > I believe YouTube limits your bitrate if you don't pass a specific calculated value
  It's starting to become Widevine bullshit all over again.
  - kevin_thibedeau 3 years ago
    
    It's their platform. They can do with it what they want.
    
    RicoElectrico 3 years ago
    
    Many channels would be more than happy to enable download options, if possible.
    Hell, how is the Creative Commons licence, which they totally give you the option to select, supposed to work for videos that can't be downloaded in any way?
    
    londons_explore 3 years ago
    
    But would the channel owner be happy to enable download options if $0.09 per GB downloaded was subtracted from their ad revenue?
    
    Dylan16807 3 years ago
    
    If you cite a price that high for bulk data then if you get an answer of "no" it won't prove anything. Try asking about a competitive price.
    For ballpark numbers, youtube dedicates 1200kbps to 1080p videos in VP9. Let's say we have a 10 minute video with an RPM of $3.
    We can arrange a CDN to deliver files at $0.005 per GB without even putting effort into it. And that's at a super low scale. The price drops a lot from there as things get bigger. So I'll use that number, and note that it's being generous to google.
    So that's 0.3 cents of revenue per watch, which is 90MB of data that would cost .045 cents to deliver.
    One view would pay for about 7 downloads. And how many downloads are we likely to see? Probably under 10% of viewers.
    I'd turn that option on.
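The ballpark arithmetic above can be re-run in a few lines. This is only a check of the numbers as stated in the comment (1200kbps, a 10 minute video, $3 RPM, $0.005 per GB); the variable names are made up here and decimal units (1000 kB per MB, 1000 MB per GB) are assumed.

```python
BITRATE_KBPS = 1200        # stated bitrate for 1080p VP9
DURATION_S = 10 * 60       # 10 minute video
RPM_USD = 3.0              # revenue per 1000 views
CDN_USD_PER_GB = 0.005     # assumed delivery price from the comment

# kilobits -> megabytes: 8 bits per byte, 1000 kB per MB
size_mb = BITRATE_KBPS * DURATION_S / 8 / 1000
# revenue for a single view, in cents
revenue_cents = RPM_USD / 1000 * 100
# cost to deliver one copy of the file, in cents
delivery_cents = size_mb / 1000 * CDN_USD_PER_GB * 100
# how many downloads one ad-supported view would pay for
downloads_per_view = revenue_cents / delivery_cents

print(size_mb, revenue_cents, delivery_cents, downloads_per_view)
```

The figures come out at 90MB per video, roughly 0.3 cents of revenue and 0.045 cents of delivery cost, so one view covers between six and seven downloads, consistent with the "about 7" above.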
    
    jraph 3 years ago
    
    They've also chosen to be a monopoly.
    
    vukgr 3 years ago
    
    Just because they have the right to do it doesn't make it right.
    
    forchune3 3 years ago
    
    it’s sort of an extension of the state / surveillance
    
    Dylan16807 3 years ago
    
    It's their platform but it's also a web site and that comes with certain expectations of interoperability.
oynqr 3 years ago

You need to run some obscured JS to get decent download speeds from Youtube. Something along the lines of PoW.
- db48x 3 years ago
  
  It’s not like proof of work at all. It’s just a challenge and response; youtube includes a random number in the webpage for each video, and expects to see a request parameter with a particular value calculated from that random number when you request the video. If you don’t do the arithmetic it throttles you to 50kb/s.
  Since the calculation of the response is done in JS, and they occasionally change the formula, some download programs are moving towards running the JS rather than trying to keep up with the changes.
  It’s really just bullshit to make people’s lives harder.
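The challenge-and-response flow described above can be sketched as follows. Everything here is an assumption for illustration: the transform in `respond` is made up (the real formula lives in YouTube's player JS and changes over time), and `ToyServer`, its method names, and the `FULL_SPEED` label are hypothetical. Only the 50kb/s figure comes from the comment.

```python
import secrets

THROTTLED = '50kb/s'
FULL_SPEED = 'full speed'


def respond(challenge):
    """Made-up stand-in for the formula shipped in the player JS."""
    return (challenge * 31 + 7) % 1_000_003


class ToyServer:
    """Toy model of the server side: issue a number, check the answer."""

    def __init__(self):
        self._issued = set()

    def issue_challenge(self):
        """Embed a random number in the page for this video."""
        n = secrets.randbelow(1_000_000)
        self._issued.add(n)
        return n

    def serve(self, challenge, response):
        """Return the speed tier for a request carrying the computed value."""
        if challenge in self._issued and response == respond(challenge):
            return FULL_SPEED
        return THROTTLED
```

A client that calls `respond(n)` on the issued number gets `FULL_SPEED`; one that sends a wrong value gets `THROTTLED`. Since the server can swap the body of `respond` at any time, a downloader either tracks each change by hand or evaluates the shipped JS, which is the trade-off this thread is about.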
  - xg15 3 years ago
    
    Next step will probably be moving the calculation to webassembly or requiring the script to fetch the result via websocket or webrtc...
  - mistrial9 3 years ago
    
    .. pirate determination is a thing to behold, as is crazed-repetitive digital grabs.. It's not a fair or accurate characterization to dismiss it as "making people's lives harder" .. it is remarkable that the Debian distros now include ytdl; let's do what is reasonable to make it continue
    
    db48x 3 years ago
    
    You can’t exactly pirate a youtube video, since they’re all publicly available.
    
    MiguelX413 3 years ago
    
    That's not really how piracy works. I say this as an advocate of it.
  - dannyw 3 years ago
    
    YouTube PM: We need to stop youtube-dl.
    Engineers: make half arsed attempt.
throwaway0984 3 years ago

IIRC it's used to extract/generate the signatures needed for YouTube media URLs

tonetheman 3 years ago

If this got much bigger I would switch it to quickjs
