The shortest pull request ever

133 points by julienpalard 3 years ago · 54 comments

> If we drop those markers (1110 and 10 in front of bytes) and keep the remaining bits we're left with 1111111011111111, which evaluates to 65279, which is in hexadecimal 0xfeff. Yes, you recognize it, it's a BOM. Because yes a BOM is just a ZERO WIDTH NO-BREAK SPACE, isn't it beautiful?

Byte Order Marks have stolen hours and days of my life. Anyone suffering the pain of developing on a Windows box can relate. Windows puts BOMs by default at the front of every file. Windows programs therefore silently ignore them, but then Linux machines run the program and choke on the BOM. You have to specifically ask the editor whether the BOM is even there; it doesn't show up in the editor by default. I have specific lines in my .vimrc[1] that prevent BOMs from ruining my day/week, but they still pop up often. I often joke there will be a byte order mark on my tombstone, along with avahi daemon.

1: https://git.sr.ht/~djha-skin/dotfiles/tree/main/item/dot-con...
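
The bit arithmetic in the quoted passage can be checked by hand. The following is a minimal Python sketch, not code from the article: the helper name `decode_three_byte_utf8` is made up for illustration, and it only handles the three-byte UTF-8 form used by the BOM.

```python
# The UTF-8 encoding of U+FEFF is the three bytes EF BB BF.
BOM_BYTES = bytes([0xEF, 0xBB, 0xBF])


def decode_three_byte_utf8(data: bytes) -> int:
    """Decode one 3-byte UTF-8 sequence by dropping the marker bits by hand.

    Hypothetical helper for illustration only; real code should use bytes.decode().
    """
    if len(data) != 3:
        raise ValueError("expected exactly three bytes")
    b0, b1, b2 = data
    # The lead byte must look like 1110xxxx, continuation bytes like 10xxxxxx.
    if b0 >> 4 != 0b1110 or b1 >> 6 != 0b10 or b2 >> 6 != 0b10:
        raise ValueError("not a three-byte UTF-8 sequence")
    # Keep 4 payload bits from the lead byte and 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_byte_utf8(BOM_BYTES)
print(format(code_point, "016b"), code_point, hex(code_point))
# prints: 1111111011111111 65279 0xfeff
```

The result matches what Python's own decoder gives for the same bytes, since `BOM_BYTES.decode("utf-8")` is the single character U+FEFF.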

ddalcino 3 years ago

> Byte Order Marks have stolen hours and days of my life.
Me too, to some degree. I have discovered them in a Ruby code base at work, in the middle of a line of code (copy pasted), where the Ruby interpreter thinks they are undeclared identifiers. When the code runs, it throws an exception every time that complains of “Undeclared identifier `‘”.
The dad-joke of it is that “You gotta sweep for BOMs before they blow up your code.”
- jftuga 3 years ago
  
  See my other comment in this thread about my project:
  https://github.com/jftuga/chars
jftuga 3 years ago

I have written a cross-platform, standalone CLI program to inspect a file for BOM8 and BOM16. It also detects whether a file uses CRLF or LF. Tab and nul characters are also evaluated. Please see the Examples in my repo:
https://github.com/jftuga/chars
- djha-skin 3 years ago
  
  Love it!
TillE 3 years ago

I've dealt with two elusive bugs which were ultimately caused by Windows stupidly using UTF-8 with BOM by default. Python requires you to take extra steps to decode that garbage, and some C++ libraries can't handle it at all.
I'm sure there were good reasons that BOM sounded like the right idea at Microsoft, but everyone else just used straight UTF-8 and it was fine.
- AceJohnny2 3 years ago
  
  Windows supported Unicode in 1993 (NT 3.1) and 1995 (Win95) via UCS-2, a fixed-width 16-bit encoding.
  In 1996, it was realized that 16 bits weren't enough, and the code space was expanded in Unicode 2.0, which also included UTF-16, a variable-width encoding, which required the BOM.
  Windows 2000 supported UTF-16 on release.
  Why didn't Windows 2000 support UTF-8, which was invented in 1992 and implemented in Plan9 in that same year? Who can say...
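
For the "extra steps" in Python mentioned above, one common approach is the standard `utf-8-sig` codec, which drops a leading BOM if present and otherwise behaves like plain UTF-8. A minimal sketch; the sample bytes and the helper `strip_utf8_bom` are made up for illustration:

```python
import codecs

# Sample input: a UTF-8 BOM followed by one line of text (illustrative data).
raw = codecs.BOM_UTF8 + "print('hi')\n".encode("utf-8")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character in the string.
plain = raw.decode("utf-8")

# "utf-8-sig" strips a leading BOM when there is one.
cleaned = raw.decode("utf-8-sig")


def strip_utf8_bom(data: bytes) -> bytes:
    """Hypothetical helper: remove a leading UTF-8 BOM at the byte level."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(repr(plain[0]), repr(cleaned[0]))
```

The same codec name works when reading files, for example `open(path, encoding="utf-8-sig")`.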
AceJohnny2 3 years ago

> along with avahi daemon
Tell us more!
- djha-skin 3 years ago
  
  Back in 2014 I was trying to set up a Linux machine and bind it to the active directory domain at work. The active directory domain was a .local domain, but the avahi daemon thinks any packet that's bound for a .local address is addressed to it. So it would swallow up all the packets that were headed to the domain controller, look at them, think they were weird, not understand them, and then drop them on the floor. From my perspective it looked like the firewall just hated me.
  It was a week or two before I finally went to my friend and said, "I must be stupid, but I can't do this, it's not working," and he just disabled the avahi daemon and everything started working again.
  Blarg.
  - AceJohnny2 3 years ago
    
    Oof. On the other hand...
    > The active directory domain was a .local domain
    .local is a reserved domain for mDNS (aka ZeroConf or Bonjour, the stuff Avahi handles), standardized in early 2013.
    Then again, 2014 is soon enough after that for the knowledge not to have percolated everywhere, and/or for it to stomp on older networks that had used .local beforehand.
    
    fanf2 3 years ago
    
    Microsoft recommended using .local for active directory domains since the 1990s, I think because back then it was not reasonable to demand that their customers register a domain name at a time when that was a massive hassle. But it was still wrong to squat on a TLD: there were already moves to expand the number of TLDs at the time, but MS were very slow to correct their mistake.
    Then Apple made the same mistake with Bonjour / mDNS, and the IETF standardized Apple’s use of .local and it all became an even worse mess.
    
    yencabulator 3 years ago
    
    I've seen consultants set up Active Directory as "companyname.ad", as if Andorra didn't exist.
Am4TIfIsER0ppos 3 years ago

> nobomb
Sounds like that is a good choice for the option name

donatj 3 years ago

Over the years I've submitted a decent number of pull requests just removing execute bits from files that shouldn't be executable.

They always end up +0-0 - see:

https://github.com/ICanBoogie/Inflector/pull/38

christiangenco 3 years ago

My high school English classes would upload any papers students wrote to a site that would check for plagiarism. I figured out that if I inserted random zero-width no-break spaces in the middle of words my plagiarism score would drop to zero.

Presumably the plagiarism system was just looking for exact matches of long substrings.

benj111 3 years ago

Somewhere someone is adding long lines of BOMs just so if someone else adds long lines of BOMs it gets flagged as plagiarism.
I hope it returns the copied string.
String "" is plagiarised
donatj 3 years ago

Interesting. You could presumably also swap out characters for homoglyphs at random.
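
The guess above, that the checker only looked for exact matches of long substrings, can be illustrated with a toy matcher. This is a sketch under that assumption and not how any real plagiarism service is known to work; `naive_match`, `strip_zero_width`, and the sample sentence are all made up.

```python
# A few invisible characters an obfuscated submission might contain.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_zero_width(text: str) -> str:
    """Remove the zero-width characters listed above."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def naive_match(source: str, submission: str, min_len: int = 20) -> bool:
    """True if any min_len-character window of submission appears verbatim in source."""
    return any(
        submission[i:i + min_len] in source
        for i in range(len(submission) - min_len + 1)
    )


source = "It was the best of times, it was the worst of times."
# Insert U+FEFF after every five characters, so every long window contains one.
obfuscated = "\ufeff".join(source[i:i + 5] for i in range(0, len(source), 5))

print(naive_match(source, source))                        # verbatim copy
print(naive_match(source, obfuscated))                    # obfuscated copy
print(naive_match(source, strip_zero_width(obfuscated)))  # after normalizing
```

Stripping the invisible characters before comparing restores the match. Homoglyph swaps would need a different normalization step.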
frabjoused 3 years ago

The original creator of the first zero width space had to be evil.
- throwaway290 3 years ago
  
  It has uses in typesetting, e.g. for allowing a word to be broken.
  - Doxin 3 years ago
    
    So what's the point of a zero width nonbreaking space then, seeing as that also exists?
    
    throwaway290 3 years ago
    
    Used e.g. to avoid awkward linebreaks
    
    Doxin 3 years ago
    
    You can surely also use no character at all to replace a zero width nonbreaking space for that use case though, or am I missing some subtlety here?
mr337 3 years ago

Wow, this is pretty nifty. I wouldn't, nor would the program authors, think of that use case :D

verandaguy 3 years ago

I mean, it's the shortest _possible_ pull request (since I don't think you can make a git diff of zero bytes, barring some weird quirk), but also probably has the highest PR description : PR diff length ratio of any PR I've seen.

AceJohnny2 3 years ago

Working in embedded, I've seen commits that changed a single bit with pages and pages of background explanation :D
dixie_land 3 years ago

I think file mode changes (eg 0644 -> 0755 to fix a script not running) could be smaller
silverwind 3 years ago

`git commit --allow-empty -m 'empty commit'` will do.
- capableweb 3 years ago
  
  That won't show up in diffs though, while this one does show up, although it's VERY hard to spot the actual difference, as the difference is a space with zero width.
- rlayton2 3 years ago
  
  Rejected, please don't commit code in a broken state.
  - JasonFruit 3 years ago
    
    You can't commit code from New Jersey?
layer8 3 years ago

Given that a BOM is three bytes, I don’t really agree that it’s the shortest. How about replacing a CRLF by LF? That one is invisible in many contexts as well.
- layer8 3 years ago
  
  …or removing a trailing LF at the end of a file where the last line is non-empty.
mjochim 3 years ago

Depends on whether you consider `git commit --allow-empty` a weird quirk ... I guess it would be reasonable to do so ;).
- mywittyname 3 years ago
  
  What's the purpose of this? I can think of ways to use/abuse it, but there has to be a specific reason that it was added as a feature to git.
  - kadoban 3 years ago
    
    The most common reason I see it is to create the initial commit of a repo. This is useful both so you have something to push to a remote, PR against, and because several git commands (most notably rebase variants) need weird switches (eg --root) if you _don't_ have an empty initial commit to refer to instead.
  - dixie_land 3 years ago
    
    git commit --allow-empty -m "trigger CI job"
    
    aflag 3 years ago
    
    Nice. I used to just add a BOM at the beginning of a random line to accomplish that.
    
    sodality2 3 years ago
    
    I think this post is a good reason not to do that ;)
  - mdaniel 3 years ago
    
    I use it extensively for entire PR based development; that way all the code that ended up in the repo has had multiple eyes (and an opportunity to comment) upon it
    git commit --allow-empty -m "initial"
    git push -u origin HEAD
    git checkout -b first-pr
    # type-y type-y
  - bryanlarsen 3 years ago
    
    I've found it useful when interactively rebasing a series of commits to make a nice PR. The code moved to a different commit, but I didn't want to lose the message.
  - maxfurman 3 years ago
    
    I see `--allow-empty` in git tutorials all the time, to demonstrate the concept of a commit. I'm not sure when you'd reach for it in a real repo though
    
    eyelidlessness 3 years ago
    
    The most obvious reason not involving automated systems is to create a repo’s initial ref. This can be useful on teams who are super strict about review process and git history.
    Once you account for automated systems, many other reasons can arise. Not just the “trigger CI” case mentioned in another comment, but also triggering builds or processes based on remote content the code accesses, or generating something either random or seeded by the commit hash/timestamp/message/etc.
    It can even be a Homer Simpson-style drinking bird button press, so finished software doesn’t get mistaken for abandoned.
    Probably whole worlds of things I haven’t imagined because I don’t use git hooks or submodules.
  - ok_dad 3 years ago
    
    For those of us who start projects often, but never finish them:
    git commit --allow-empty -m "Initial commit."
  - ezekg 3 years ago
    
    I use it to trigger CI/CD, and also, for example, when upgrading Heroku's dyno stack, which is applied on the next deployment.
    heroku stack:set heroku-22
    git commit -m 'upgrade to heroku-22 stack' --allow-empty
    git push heroku master
  - gh02t 3 years ago
    
    Maybe so you can make an empty initial commit and push it to a remote like Github as a placeholder? Or a lazy way to trigger a CI job when you're gonna squash and cleanup later.
    I guess conceptually you could use it to represent "I started from nothing."

k470 3 years ago

I love the title. I like to ask for pull requests with this exact description to influence my colleagues to look at it faster when it's something small like a single character. For example, when it's a two character PR I say "hey, the second smallest PR in the world". Guess I was wrong!

remram 3 years ago

You can easily check in your CI that your files are ASCII (code should probably be) with file(1). There is probably an off-the-shelf tool that can validate that all characters are printable, ASCII or Unicode.

pabs3 3 years ago

Is there a tool to check for byte order marks, zero width spaces and other "weird" Unicode characters?
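
As a rough illustration of what such a check could look like, here is a minimal Python sketch that flags invisible format characters (Unicode category `Cf`, which includes U+FEFF and U+200B) and stray control characters. It is not the tool linked in this thread; the function names and the choice of categories are assumptions made for the example.

```python
import unicodedata


def find_suspect_chars(text: str):
    """Return (line, column, code point, name) for invisible or control characters.

    Hypothetical helper: flags category Cf (format) characters and Cc (control)
    characters other than tab, CR, and LF.
    """
    hits = []
    line, col = 1, 0
    for ch in text:
        if ch == "\n":
            line, col = line + 1, 0
            continue
        col += 1
        category = unicodedata.category(ch)
        if category == "Cf" or (category == "Cc" and ch not in "\t\r"):
            # Control characters have no Unicode name, hence the default.
            name = unicodedata.name(ch, "<control>")
            hits.append((line, col, f"U+{ord(ch):04X}", name))
    return hits


def scan_file(path):
    """Scan one file, decoding as plain UTF-8 so a leading BOM is reported too."""
    with open(path, encoding="utf-8", newline="") as handle:
        return find_suspect_chars(handle.read())


sample = "\ufeffdef f():\n    re\u200bturn 1\n"
for hit in find_suspect_chars(sample):
    print(hit)
```

In CI, a wrapper could call `scan_file` on each tracked file and exit non-zero when any hits are returned.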

jftuga 3 years ago

I wrote a cross-platform, standalone CLI program to do this.
It reports the end-of-line format, tabs, BOM, and nul characters:
https://github.com/jftuga/chars
- pabs3 3 years ago
  
  Nice. Would it be possible to have an option to only output the names of files that failed the -f check? i.e. hide the names of files that look "normal" and show the "weird" ones.
  Also, does it detect files that only contain CR as EOL characters? Or files that have different EOL characters on different lines?
  - jftuga 3 years ago
    
    I like the idea of only showing filenames that fail using -f so I created an issue for that. According to:
    https://en.wikipedia.org/wiki/Newline#Representation
    CR does not appear to really be used as EOL. Also, I don't think having different EOL chars within the same file is really a thing.
    
    pabs3 3 years ago
    
    Thanks for the issue, I've subscribed to it.
    https://github.com/jftuga/chars/issues/2
    According to the page, several machines and operating systems used CR as EOL. While the systems are all obsolete, files from that era that use CR as EOL could persist and be transferred to modern systems. Clearly those are weird on modern systems, so they should be warned about in a linting situation, which I would like to use your project 'chars' in.
    Having different EOL chars within the same file is definitely a thing, usually by mistake. I had to fix a bug about this recently:
    https://github.com/EionRobb/purple-discord/pull/416
  - jftuga 3 years ago
    
    I added a new command line switch, -F
    * when used with -f, only display a list of failed files, one per line
    https://github.com/jftuga/chars/releases/tag/v2.3.0
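
The two EOL cases raised in this sub-thread, CR-only files and files that mix line endings, can be detected with a short byte-level count. A minimal Python sketch, unrelated to how `chars` is implemented; `count_eols` and `has_mixed_eols` are made-up names.

```python
import re
from collections import Counter

# Try CRLF first so a "\r\n" pair is counted once rather than as CR plus LF.
_EOL = re.compile(rb"\r\n|\r|\n")
_NAMES = {b"\r\n": "CRLF", b"\r": "CR", b"\n": "LF"}


def count_eols(data: bytes) -> Counter:
    """Count each kind of line ending in raw file bytes."""
    return Counter(_NAMES[match.group()] for match in _EOL.finditer(data))


def has_mixed_eols(data: bytes) -> bool:
    """True when more than one kind of line ending occurs in the data."""
    return len(count_eols(data)) > 1


print(count_eols(b"one\r\ntwo\nthree\rfour"))
print(has_mixed_eols(b"one\ntwo\n"))
```

Reading the file in binary mode (`open(path, "rb")`) matters here, because text mode may translate line endings before the check sees them.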

tmaly 3 years ago

Jesus wept
