GitHub's language detection is broken
github.com
Why don't they just let project maintainers say, "This project contains x, y, and z" or something? That'd at least let them get a leg up on doing the categorization right and I don't think many people would mind having that capability.
+1000. Github routinely detects the wrong language for my projects, and there is no way to manually override it. My take is this: If you want to auto-detect the language, fine... but let the owner of the repo override your detection when it's wrong.
It's probably also a bug to even have the notion of "a language" for a repo given the burgeoning polyglot programming trend. So many repos these days contain multiple languages, especially when you consider javascript, that I question if it even makes sense to say 'This project is in language X' at all.
Like you say, the best option really would be to let the repo owners / maintainers just specify this stuff. They are, after all, the ones who know.
I wish I had more up buttons. Sometimes you can be too smart for your own good, and the good old-fashioned way is superior...
Note: I'm not saying they shouldn't have the auto-detection, because it definitely helps if the maintainer doesn't do it, but for those that want to help classify things - let them!
Actually, a way to turn off that feature would be nice. It adds very little value at the cost of it taking days to update. It also marks my dotfiles repo as "VimL" which means any auto-resume tool will assume I know VimL, when I don't. Funny thing is it marks my .vimrc as Perl, not VimL.
I disagree; I think the process should be as streamlined as possible. However, I could see auto-detection balanced with a confidence threshold which, when not met, would ask the user:
"Sorry, I couldn't determine if you had C code in your repo or is that Limbo code?"
You mean like the other source code repository hosts do?
I think the idea of automated language detection is pretty cool, but why doesn't github just give you the option of correcting it, or labelling it with the language you prefer?
For example, I've got JavaScript modules in repositories. For each module, I make a demo version to show what the module does, and that demo includes a bunch of CSS. Apparently, there is more CSS than there is JavaScript, so GitHub labels the module as CSS, but the important part isn't the CSS, the important part is the JavaScript. In order to resolve this, I've had to move the CSS into a different repository, and ignore it in the JavaScript repository. Seems like a long way around, when all I want to do is correct them and say that the module is actually a JavaScript module.
I actually much prefer BitBucket's way of doing things, for exactly this reason. It doesn't even try to detect - it just asks me. Sometimes the simplest solutions are the best.
Language detection as discussed in the link is per-file. I don't think overriding individual files makes sense since it's likely to be more trouble than it's worth. But I can understand the desire to change the detected language of the project.
How 'bout a project-specific property list that looks something like this:
.rb=RealBasic
.m=Mercury
.pl=Prolog
.js=SomeCrapOrOther
…
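A minimal sketch of how such a per-project list might be applied, assuming a plain "ext=Language" format like the one above. The names `parse_overrides` and `language_for` are hypothetical and nothing here is Linguist's actual behaviour; the point is only that an owner-supplied mapping wins over whatever was auto-detected.

```ruby
# Hypothetical sketch: parse "ext=Language" pairs and let them win over detection.
def parse_overrides(text)
  text.split.each_with_object({}) do |pair, map|
    ext, lang = pair.split("=", 2)
    # Skip anything that is not a well-formed ".ext=Language" entry.
    next unless ext&.start_with?(".") && lang && !lang.empty?
    map[ext] = lang
  end
end

# detected is whatever the automatic detector guessed (assumed input).
def language_for(filename, overrides, detected)
  overrides.fetch(File.extname(filename), detected)
end
```

With that list loaded, `language_for("solver.m", overrides, "Objective-C")` would come back as "Mercury", while files with unlisted extensions keep the detected language.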
Seems like more effort than it's worth still to deal with project-specific settings. AFAIK the only two things this practically affects are syntax highlighting and repository stats. That approach would be a good tradeoff though if things are important enough.
This 'lewellyn' person seems to be complaining about the lack of support for the language Limbo, a language for the Inferno OS. Both seem quite outdated and out of use. He also complains about how Github is focusing on 'cool' kid languages, which I am guessing refers to modern, popular languages (if this is the definition of cool, then yes, they are). If I was Github, I would do the same. It's called priorities. I kind of get the vibe that lewellyn is some kind of 'hipster'. His obscure language is better than the 'cool' kids' simply because he's using it. I also wouldn't phrase it as "GitHub's language detection is broken"; it's merely missing a feature/language.
I suspect his - rather labored - point is that there are multiple use cases to show that the design of Linguist's configuration is flawed as a rule and not an exception, and the lack of attention paid to this particular issue is perhaps indicative of a more general Github attitude towards the less trendy languages and technologies out there.
Which I think is an uncharitable way of saying "Github prioritizes working on things that will impact the most people."
From my reading of the issue it seems that he's complaining about a limitation of the tool Linguist. There's a suggested fix that doesn't, as far as I can tell, involve changing how Inferno code is written. My understanding is that primary_extension is simply used to short circuit analysis when it's unique to a language. In this case if the primary extension was .inferno and .m was in other extensions it seems that the sample code would be used by the classifier to distinguish between inferno, MATLAB, and obj-c.
To me this comes off as assuming the worst intentions on behalf of the github developers.
> My understanding is that primary_extension is simply used to short circuit analysis when it's unique to a language.
No, the primary_extension is only used in a gists_helper.rb file outside the Linguist repos. Note that the feature is deprecated anyway.
https://github.com/github/linguist/blob/master/lib/linguist/...
Nobody's arguing that GitHub shouldn't prioritize popular languages. I don't think that the responses to this pull request show 'prioritization' however, they show incompetence and close-mindedness instead.
More likely a lack of time. If you'd like to see it fixed, start contributing to the fix.
Do you mean, a fix like https://github.com/github/linguist/pull/985?
A lack of time doesn't cause problems with multiple languages that only use the trivial ".m" extension. Bad design does.
Luckily I use languages popular enough to be classified correctly.
Yeah, kind of a self-important hipster too.
> Basically, Github needs to be accepting of programmers of all stripes, or they are destined to be irrelevant (or at least doing lots of scrambling) once the trendy kids move on from the trendy things they're doing and the currently-popular languages start falling out of style with a reversion to a previous status quo. Github needs to accept that there is a vast wealth of code out there which predates it and which will easily postdate it.
Okay there, buddy. I don't think lack of Lingo support is going to be GitHub's eventual downfall.
Their language detection is indeed terrible. I have a repository (https://github.com/jperkin/pilights) which is entirely composed of shell scripts and a single markdown README. GitHub's analysis?
Perl 83.5% Shell 16.5%
There is not a single .pl or .pm file, nor a single mention of 'perl' anywhere in the repository, and all scripts begin with #!/bin/sh.
A number of my other repositories have similar problems, but this one is by far the worst.
Sorry, I see a huge blue bar saying 100% shell.
Earlier this morning when I looked, it looked like what OP said... it seems something has changed in the three hours or so intervening.
Huh, yes, they appear to have coincidentally fixed it since I wrote that comment. Maybe I need to start reporting all GitHub bugs as Hacker News comments...
Incriminated comment: https://github.com/github/linguist/pull/748#issuecomment-374...
> if you'd like Mercury language detection on GitHub then with the current implementation of Linguist you need to pick a different (unique as Objective-C already defines this) primary_extension and add .m to the extensions array which will force Linguist into using the other detection methods mentioned above.
what, then, is the point of the primary_extension field?
EDIT: or as I like to yell at Github for Windows when it can't revert out of a merge conflict "WHAT IS EVEN THE POINT OF YOU?!"
It's just a design error in the original implementation where linguist assumes that the "primary_extension" for any particular language will be unique among all primary_extensions. Obviously that was a mistake, but that's where we are. The comment that set people off was perhaps poorly worded, but it was an honest suggestion to work around the design bug.
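To make the uniqueness assumption concrete, here is a small sketch. I don't know how Linguist actually stores this, so the records, `unique_primary_index`, and `extension_index` are all hypothetical; the field names just echo the ones used in this thread. It shows why a one-to-one lookup keyed on the primary extension cannot hold several ".m" languages, while a one-to-many lookup can.

```ruby
# Hypothetical language records; field names echo the thread, data is illustrative.
LANGUAGES = [
  { name: "Objective-C", primary_extension: ".m", extensions: [".m", ".h"] },
  { name: "Mercury",     primary_extension: ".m", extensions: [".m"] },
  { name: "MATLAB",      primary_extension: ".m", extensions: [".m"] }
].freeze

# A one-to-one index keeps only the last language registered per extension.
def unique_primary_index(languages)
  languages.each_with_object({}) { |l, idx| idx[l[:primary_extension]] = l[:name] }
end

# A one-to-many index keeps every candidate for later disambiguation.
def extension_index(languages)
  languages.each_with_object(Hash.new { |h, k| h[k] = [] }) do |l, idx|
    l[:extensions].each { |ext| idx[ext] << l[:name] }
  end
end
```

In the first index two of the three languages silently disappear, which is the situation the "pick a different primary extension" workaround is dancing around.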
A better suggestion: fix the defect. Just delete primary_extension. At best, it does nothing that the extensions array can't do, as it doesn't appear at first glance that any check to primary_extension does not also include a check to extensions.
At worst, it is confusing to implementers and requires chicanery to work around... which is exactly the case we're in. We're in the worst case scenario for this bit of code, and there is no upside to its best-case scenario. Just delete the code.
I bet it's not as easy as "just delete the code". They will probably have to do quite a bit of refactoring to remove this, followed by a probably even larger amount of testing.
Long-term this is probably the right solution, but why go through all this trouble right now if there is a simple workaround? It seems like the only problem right now is a few people's pride.
Of course it's not that easy. They don't have a compiler with a static checker to show them all the places the field was used :P
How much you wanna bet? (https://github.com/github/linguist/issues/985)
If I'm reading this right all this does is define the first extension as the primary extension, which still has to be unique. So they would still have to use a different extension as the first one in the list, it just wouldn't be called "primary" anymore. How would this help?
There is no unicity check anymore; I just kept the first one special because some private code at GitHub seems to rely on the primary_extension property for the Gist editor. As I can't hack this (it is private), I can't remove it entirely.
There's no more requirement for it to be unique as far as I can see.
Probably if the classifier can't determine the language, it will fall back to the primary_extension field - as it should, if the classifier can't determine if a .m file is objc or mercury, it should and will default to objc.
Classification is never 100% accurate.
EDIT: Exact method that it is used is reported here: https://github.com/github/linguist/pull/748#issuecomment-374...
It's the other way around: the extension check is run before the classifier. And given the limitations of file extensions, an extension collision is likely enough (not the most likely case, but likely enough) that it isn't worth bothering with primary_extension; just use the collection of extensions as a culling system to minimize the work for classification.
In other words, it seems like the overall design wouldn't be hurt too much by just extricating primary_extension completely. Best-case scenario, primary_extension is equivalent to always having at least one item in extensions. It does nothing else.
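Roughly what I mean by using the extensions as a culling step, as a sketch under my own assumptions: the table, `candidates_for`, and `detect` are hypothetical names, and the classifier is just any callable. This is not how Linguist is structured as far as I can confirm; it only illustrates extension filtering first, content classification second.

```ruby
# Hypothetical extension table; entries are illustrative, not Linguist's data.
EXTENSIONS = {
  "Objective-C" => [".m", ".h"],
  "MATLAB"      => [".m"],
  "Mercury"     => [".m"],
  "Ruby"        => [".rb"]
}.freeze

# Step 1: cull the candidate list by extension alone.
def candidates_for(filename)
  ext = File.extname(filename)
  EXTENSIONS.select { |_, exts| exts.include?(ext) }.keys
end

# Step 2: run the (expensive) classifier only when the extension is ambiguous.
# classifier is any callable taking (candidates, source) and returning a name.
def detect(filename, source, classifier)
  candidates = candidates_for(filename)
  case candidates.length
  when 0 then nil
  when 1 then candidates.first
  else classifier.call(candidates, source)
  end
end
```

No language needs a special "primary" entry here: an unambiguous extension short-circuits on its own, and an ambiguous one always goes to the classifier.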
Also, this looks like a bug: "if possible_languages.length > 1 ... else possible_languages.first" What if length is 0? That's not greater than 1. I'm not familiar with Ruby, does first return null on an empty array, or does it error? LINQ in .NET has separate First<T> and FirstOrDefault<T> methods: one errors, the other returns default(T) (which is null in the case of reference types). Or is there a default match in the index that occurs when no other language is found? https://github.com/github/linguist/blob/master/lib/linguist/...
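On the Ruby question: `Array#first` returns `nil` on an empty array rather than raising, so it behaves like FirstOrDefault, and `Array#fetch` is the raising counterpart. Whether the calling code copes with `nil` is something I can't confirm from here. The `pick` method below is a hypothetical mirror of the quoted branch, not Linguist's code.

```ruby
# Array#first returns nil for an empty array; it does not raise.
empty = []
p empty.first      # prints nil
p empty.first(1)   # prints [] (the counted form returns an array)

# Array#fetch is the raising counterpart, closer to LINQ's First<T>.
begin
  empty.fetch(0)
rescue IndexError => e
  puts "fetch raised #{e.class}"
end

# Hypothetical mirror of the quoted branch, with the zero case behaving as nil.
def pick(possible_languages)
  if possible_languages.length > 1
    :needs_classifier
  else
    possible_languages.first
  end
end
```

So a zero-length list falls into the `else` branch and quietly yields `nil`, which reads as "no language found" only if the caller treats it that way.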
Not instilling a lot of confidence that someone really thought through this bit of code. I'm not saying I very strictly think through everything I write, but I also don't write software for thousands of users, and I acknowledge that I've grown rather complacent in terms of time spent per unit code.
I'm looking at more of this, and jesus christ, this is completely wrong: https://github.com/github/linguist/blob/master/lib/linguist/...
Code that commonly resides in .asp files is completely different from code that commonly resides in .aspx files. They are not synonyms for each other. Also, I would wager that C# aspx files are a tad more common than VB.NET aspx files.
It's even worse than lumping together .c and .cpp. You at least have some chance of getting .c files to compile in a C++ compiler. There is no chance of running ASP code through the ASP.NET engine.
This is why "ASPX" as a term exists, to differentiate from ASP.
I have a few Perl projects on Github that use Bootstrap. Main language (according to Github): Javascript.
I expect that Javascript's github popularity ranking is (a little bit) inflated due to such issues.
I expect it's a lot inflated. I also have repos that are primarily Groovy, but show up as "Javascript" due to the presence of JQuery, Bootstrap, etc.
Unfortunately, until this core issue is fixed, users can't really submit further pull requests to fix the other issues which would correct the "inflation" we all know and hate.
They actually have a set of files and libraries that they ignore (.DS_Store/jquery/bootstrap/etc.):
https://github.com/github/linguist/blob/master/lib/linguist/...
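For anyone wondering what that kind of ignore list does to the stats, a rough sketch. The patterns, `vendored?`, and `language_bytes` are my own illustrative names, and the input shape is assumed; Linguist's real list lives in the file linked above.

```ruby
# Hypothetical ignore patterns; illustrative only.
VENDORED = [
  /(^|\/)\.DS_Store$/,
  /(^|\/)jquery[-.\w]*\.js$/i,
  /(^|\/)bootstrap[-.\w]*\.(js|css)$/i,
  /(^|\/)vendor\//
].freeze

def vendored?(path)
  VENDORED.any? { |pattern| pattern.match?(path) }
end

# Sum bytes per language, skipping vendored paths.
# files: array of [path, language, size_in_bytes] triples (assumed input shape).
def language_bytes(files)
  files.each_with_object(Hash.new(0)) do |(path, lang, size), totals|
    totals[lang] += size unless vendored?(path)
  end
end
```

A Perl project that bundles a minified jQuery and Bootstrap would then be counted on its Perl bytes alone, provided the bundled files actually match one of the patterns.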
I’ve been in the programming trenches since the early ’90s, fluent in 5 languages at the production level, and I have to say I have never heard of the language ‘Limbo’. I don’t fault GitHub one bit.
I suppose I could Google it and act like I know… naw
I don't think the point was to complain about Limbo being missing. I think the point was to show that saying "Objective-C is the only language which can ever use .m as its primary extension" affects more than just the two languages listed earlier in the PR. The PR itself is about Mercury, after all.
Are people even reading the context of the rest of the PR?
I'm pretty sure there's no one with an encyclopaedic knowledge of programming languages. The industry is enormous, and just because you haven't heard of it doesn't mean that it's irrelevant. "Niche" is not the same as "irrelevant".
And even if it WAS irrelevant and only important to a very small number of people, that doesn't mean it can be ignored.
>And even if it WAS irrelevant and only important to a very small number of people, that doesn't mean it can be ignored.
I don't follow. That sounds like the exact criteria for something to be ignored.
This really explains SV's homeless problem.
Given infinite time and developer bandwidth, sure. But we don't live in that world, so "do the work that gives the most benefit to the most users" remains the preferable real-world strategy.
Except that primary_extension does not serve any purpose.
From the thread it looks like there are over 6 languages that use .m as the filename extension (including both MATLAB and Mathematica which you may have heard of), meaning the whole concept of a unique "primary_extension" is kind of ludicrous.
I thought that Mathematica uses .nb extension?
Mathematica notebooks use .nb; but Mathematica scripts generally use .m [1]
[1] https://reference.wolfram.com/mathematica/tutorial/Mathemati...
True. But I'll bet you've heard of Matlab, which also uses .m, and is just as old as Obj-C. Matlab is everywhere in scientific computing.
Five languages across two decades?
Stand back, gents! This one is a champion!
I don't really care about Limbo, but GitHub seems to think my .m files are all M(UMPS) files, and not Matlab files, the most obvious choice. Highly annoying.
I don't get it - why is a bunch of people trolling the github project with fairly irrelevant arguments interesting? Could someone who upvoted this explain the logic?
How is having a differing opinion "trolling"? Seriously, this word has lost all meaning at this point.
Ignoring the suggested workarounds (setting a unique primary extension and then having the correct extension in the array, for instance) and continuing to rampage in the comments in an attempt to stir up the masses seems like the canonical example of trolling an online community.
Except he actually has a point. GitHub's default behaviour is broken.
In my years of experience online, trolling was specifically riling someone up by saying things the troll doesn't really believe.
Trolling isn't disagreeing that a workaround is sufficient to ignore an actual issue. But that's just my opinion.
"Troll" is like "terrorist" these days. It has absolutely no semantic content beyond "person I disagree with about something".
Seriously? I have an open-source Matlab project from my time in academia that's been misclassified as an Obj-C project in the past. Less popular languages are used all over, especially for more niche industries.
While it is unfortunate that a pull request on this project has been around for 5 months without much progress, I think the commenter is being a bit dramatic. He is acting as if GitHub is blocking all commits with Limbo code. The language can still be under version control, it just might not have syntax highlighting and its own color in the repo stats bar.
GitHub isn't discriminating against certain programmers. Stay calm and keep coding!
> it just might not have syntax highlighting and its own color in the repo stats bar.
It is discriminating, and harmful to all programmers. We need to be able to easily search for these lesser known languages – they are important cultural works. The commenter points out: "Limbo ... seems to have heavily inspired Go (which is currently extremely fashionable)". We are worse off for not having our history readily accessible.
All they are asking is to arbitrarily specify some other extension as the "primary" extension and have ".m" as another extension. Users will still see the same end result.
Unless they use gist.
I miss Mac OS Classic Filetype/Creator codes... Filename extensions are such an ugly hack.
Great example of worse is better in action.
Could they use Bayesian classifiers, trained on a corpus of different languages and primarily concentrating on the symbols used in each language?
I think if I was writing a language detector, it would have these features:
- learning heuristics based on user suggestion.
- extension filtering to differentiate similar languages.
- the algo would use prominence and placement of white space and non-word characters to create the DNA of a language. If the language scores below a threshold against the DNA, it doesn't presume, it asks the user. If a language scores high against this DNA, it still allows user override. Whenever a user submits their indicator, its file source would be used to train the heuristic.
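The features above can be sketched as a toy naive Bayes classifier over non-word characters, with a confidence cutoff that defers to the user. Everything here is an assumption for illustration: `SymbolClassifier`, the 0.8 threshold, and the smoothing choice are mine, and this is not Linguist's classifier.

```ruby
# Minimal naive Bayes over punctuation characters (a toy sketch).
class SymbolClassifier
  def initialize
    @counts = Hash.new { |h, lang| h[lang] = Hash.new(0) } # lang => symbol => count
    @docs   = Hash.new(0)                                  # lang => training samples
  end

  # Features: every character that is neither a word character nor whitespace.
  def symbols(source)
    source.scan(/[^\w\s]/)
  end

  # Also the hook for user corrections: a corrected file is just more training data.
  def train(lang, source)
    @docs[lang] += 1
    symbols(source).each { |sym| @counts[lang][sym] += 1 }
  end

  # Returns [best_language, log_scores]; the language is nil when nothing is trained.
  def classify(source)
    vocab = @counts.values.flat_map(&:keys).uniq.size
    total_docs = @docs.values.sum
    scores = @docs.keys.to_h do |lang|
      total = @counts[lang].values.sum
      prior = Math.log(@docs[lang].to_f / total_docs)
      # Add-one smoothing; the extra 1 reserves mass for symbols never seen in training.
      likelihood = symbols(source).sum do |sym|
        Math.log((@counts[lang][sym] + 1.0) / (total + vocab + 1))
      end
      [lang, prior + likelihood]
    end
    [scores.max_by { |_, s| s }&.first, scores]
  end

  # Converts log scores to probabilities and defers to the user below the cutoff.
  def decide(source, threshold = 0.8)
    best, scores = classify(source)
    return [:ask_user, nil] if best.nil?

    max = scores.values.max
    weights = scores.transform_values { |s| Math.exp(s - max) }
    confidence = weights[best] / weights.values.sum
    confidence >= threshold ? [:accept, best] : [:ask_user, best]
  end
end
```

An owner override would sit on top of `decide`: even an `:accept` result is only a default, and whatever the user picks gets fed back through `train`.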
This is because you likely think before you code.
> My esoteric programming language isn't properly supported by the popular kids' web tool that I'm likely not even paying to use in the first place. I'm OUTRAGED!
Yep, seems about right.
Also, this is untrue. Omgrofl is supported on GitHub, even if nobody uses it.
And posting it to HN of all places is hilarious.
What's interesting about this PR is that this case was actually one of the reasons that I created http://www.gitignore.io. GitHub's original repo for .gitignore templates had nearly 1000 open PRs until around Oct 2013 so I built my own repo that would actually accept PRs. Since then, a few employees have worked on accepting PRs, but I had a similar feeling of frustration. Unfortunately, the OP can't just fork this repo because its features are integral to how GitHub works, whereas I was able to hack around the system and create a separate product.
The rant linked to appears to misunderstand the problem and the workaround. @arfon admits there's what amounts to a design bug in Linguist, and so to identify ".m" files, you have to identify a different extension as the "primary" and put the real extension into the "alternate" list. That's a hacky workaround, but it would make the pull request work.
The alternative is to fix the design issue. But that's going to be a lot harder and require more than a few days.
@arfon doesn't admit that there's a bug, rather he says that "requiring a unique primary_extension isn't really a 'bug', rather it's a consequence of how language detection works in Linguist."
The work to fix the design issue was already done by @nox, who submitted a pull request which is still open: https://github.com/github/linguist/pull/985
I guess the charitable reading is "this isn't a bug, it's more of a bad design choice, and we can't just fix it overnight".
But I honestly can't tell if that's what he meant, or if it was more of a "not my problem" type of response.
Except that it was totally fixed overnight. No, wait. Not overnight. Over two hours.
Of course that PR isn't being accepted either.
I can agree that the detection is broken. C++ gets often recognized as C. PHP with some CSS file gets recognized as mostly CSS, etc.
Personally I'd like to have a fixed language that I can set and that the search will use. Next to that, it would be fine for me to statically show what the repository contains, but please use a better language detection, just going by extensions is quite naive.
> C++ gets often recognized as C.
The disambiguation test for C++ headers is ridiculous:
matches << Language["C++"] if data.include?("#include <cstdint>")
Well, I expect that's why so much C++ is misrecognized. Not enough people write valid C++, in Github's narrow world view. :)
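For comparison, a hedged sketch of what a slightly broader header check could look at. The patterns and `header_language` are my own illustration, not a proposed patch, and any of them can be fooled by real-world headers (comments, macros, C code that happens to mention `std::`).

```ruby
# Illustrative patterns only; real headers can defeat any of them.
CPP_HINTS = [
  /^\s*#\s*include\s*<(c\w+|iostream|string|vector|map|memory)>/, # extensionless std headers
  /^\s*(template\s*<|namespace\s+\w+|class\s+\w+\s*[:{]|using\s+namespace\b)/,
  /\bstd::\w+/,
  /^\s*(public|private|protected)\s*:/
].freeze

# Returns "C++" if any hint matches the header text, otherwise "C".
def header_language(data)
  CPP_HINTS.any? { |hint| hint.match?(data) } ? "C++" : "C"
end
```

The design point is only that several weak signals beat one very specific include; a proper fix would still want the classifier to weigh in.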
I wish I could pick the language so I could upload shell scripts without extensions, but it doesn't even read the shebang line.
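Reading the shebang is not much code. A minimal sketch, assuming a small hand-written interpreter table; `INTERPRETERS` and `language_from_shebang` are hypothetical names and the table is far from complete:

```ruby
# Hypothetical interpreter table; extend as needed.
INTERPRETERS = {
  "sh" => "Shell", "bash" => "Shell", "zsh" => "Shell",
  "perl" => "Perl", "python" => "Python", "python3" => "Python", "ruby" => "Ruby"
}.freeze

# Returns a language name from the first line's shebang, or nil if there is none.
def language_from_shebang(source)
  first_line = source.lines.first.to_s
  return nil unless first_line.start_with?("#!")

  parts = first_line[2..-1].split
  interpreter = File.basename(parts.first.to_s)
  # "#!/usr/bin/env bash" names the real interpreter in the next word.
  interpreter = parts[1].to_s if interpreter == "env"
  INTERPRETERS[interpreter]
end
```

A `nil` result just means "no opinion", so extension-based or content-based detection can still take over for files without a recognizable shebang.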
Sorry, but was there actually something useful in that comment? I couldn't tell over the 6 paragraphs of childish moaning.
Okay, but the automatically updating comments view is pretty cool. I didn't know Github did that. That is pretty awesome.
And that would be part of the problem with Github. Emphasis on "pretty cool" visual flair while letting fundamental architecture fly out that is flatly, and very obviously, just plain broken.
Considering that the comment you're replying to said "that feature is pretty cool", and didn't even need to address the actual linked rant, it seems that not everyone agrees with your "this is just plain broken" viewpoint.
I use Github for the visual flair and cool features. If I wanted to run my own fundamental architecture, I'd be doing that.
"I use Github for the visual flair and cool features."
The software crisis spelled out in a single sentence.
I've since looked at some of the Linguist code, and it's kind of shit.