EOF is not a character
Like NULL, confusion over EOF is a problem which can be eliminated via algebraic types.
What if instead of a char, getchar() returned an Option<char>? Then you can pattern match, something like this Rust/C mashup:
match getchar() {
Some(c) => putchar(c),
None => break,
}
Magical sentinels crammed into return values — like EOF returned by getchar() or -1 returned by ftell() or NULL returned by malloc() — are one of C's drawbacks.

What always annoyed me about C is that it has all the tools to simulate something approaching this, save for some purely syntactic last-mile shortcomings. We can already return structs; if only there were a way to neatly define a function returning an anonymous struct, and immediately destructure it on the receiving end. Something like:

    #include <stdio.h>

    struct { int err; char c; } myfunc() {
        return { 0, 'a' };
    }

    int main(int argc, const char *argv[]) {
        { int err; char c; } = myfunc();
        if (err) {
            // handle
            return err;
        }
        printf("Hello %c\n", c);
        return 0;
    }

This is (semantically) perfectly possible today; you just have to jump through some syntactic hoops, explicitly naming that return struct type (because, among other things, anonymous structs, even when structurally equivalent, aren't equivalent types unless they're named...). Compilers could easily do that for us! It would be such a simple extension to the standard with, IMO, huge benefits.

Every time I have to check for in-band errors in C, or pass a pointer to a function as a "return value", I think of this and cringe.
You can write that in C++17 with only slightly different syntax (and it is actually really nice for being C++):
    #include <stdio.h>
    #include <tuple>

    std::tuple<int, char> myfunc() {
        return { 0, 'a' };
    }

    int main(int argc, const char *argv[]) {
        auto [ err, c ] = myfunc();
        if (err) {
            // handle
            return err;
        }
        printf("Hello %c\n", c);
        return 0;
    }

You may be interested in tagged unions: a struct with an enum and a union. You can switch on the enum.
More stuff like this in https://pdfs.semanticscholar.org/31ac/b7abaf3a1962b27be9faa2...
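For illustration, a minimal sketch of that idea in C (all names invented for the example): the enum tag records which union member is live, and you switch on it:

    #include <stdio.h>

    /* A hand-rolled Option<char>: the tag says which union member is valid. */
    enum opt_tag { NONE, SOME };

    struct opt_char {
        enum opt_tag tag;
        union { char c; } value;
    };

    struct opt_char some_char(char c) {
        struct opt_char o = { SOME, { c } };
        return o;
    }

    struct opt_char none_char(void) {
        struct opt_char o = { NONE, { 0 } };
        return o;
    }

    int main(void) {
        struct opt_char o = some_char('a');
        switch (o.tag) {            /* switch on the enum */
        case SOME: printf("got %c\n", o.value.c); break;
        case NONE: puts("got nothing");           break;
        }
        return 0;
    }

Unlike a real Option type, nothing here stops you from reading value.c without checking the tag; that enforcement is exactly what the algebraic-type proposals above would add.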
Sounds like you'd like Go, which works this way.
Which is a strictly inferior and botched way to go about it, especially since golang was designed from scratch.
> We can already return structs
AFAIK, no? You can return a pointer to a struct, and you can pass whole structs as arguments, but not, IIRC, return them from functions.
EDIT: Apparently you can, sort of, but not portably; how exactly it is defined to work depends on the compiler, and each compiler might define it differently. This means that if you’re using a library which returns a struct and your program uses a different C compiler than the one the library was compiled with, your program will not work. I.e. there is no single defined stable ABI for functions returning structs.
Therefore I think it’s reasonable to regard it as impossible in practice.
Structs are values and you can return them like any value (or use them as parameters).
I'm not sure what you mean about compilers.
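To make that concrete, a minimal sketch of returning a struct by value (names invented for the example); this is standard C and has been since C89:

    #include <stdio.h>

    struct pair { int err; char c; };

    /* Structs are first-class values: returned and copied like scalars. */
    struct pair myfunc(void) {
        struct pair p = { 0, 'a' };
        return p;
    }

    int main(void) {
        struct pair r = myfunc();
        if (r.err)
            return r.err;       /* handle the error */
        printf("Hello %c\n", r.c);
        return 0;
    }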
Xe is conflating compilers and calling conventions a bit. The way that structure types are returned varies by calling convention, as indeed do a lot of other things. Mismatched calling conventions lead to problems.
But structure type return values are well specified for most calling conventions, and quite a number of compilers support explicitly specifying the calling convention for mixed-language or mixed-compiler situations.
Many calling conventions apparently use a method for returning structs which is inherently non-thread-safe.
Also from that link:
> 32-bit cdecl calling convention
> For return values of structure or class type, there is wide incompatibility amongst compilers. Some make the return thread-safe, by breaking compatibility with the 16-bit cdecl calling convention. Some retain compatibility, at the expense of their 32-bit cdecl calling convention not being thread-safe. The ones that break compatibility don't all agree with one another on how to do so.
See, for instance, here:
https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/Incompatibiliti...
https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/Code-Gen-Option...
https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/Warning-Options...
Oh, I get it.
This is mostly not a practically relevant issue. (Nor are pre-K&R compilers relevant, although something like this could arise among modern compilers.) As far as oddball situations go, it's far from the thorniest to deal with - it doesn't even involve C++.
> What if instead of a char, getchar() returned an Option<char>?
Getchar doesn’t return a char; it returns an int (https://en.cppreference.com/w/c/io/getchar).
⇒ if C didn’t do automatic conversions from int to char, we would have that (in a minimalistic sense)
That wouldn’t work for ftell and malloc (and, in general, most of the calls that set errno), though.
> Getchar doesn’t return a char; it returns an int
Dammit, I knew that. Thank you for flagging my blunder; being precise is really important in this case. The Linux manpage better explains the return value of getchar:
https://linux.die.net/man/3/getchar
"fgetc(), getc() and getchar() return the character read as an unsigned char cast to an int or EOF on end of file or error."
getchar() needs to return an object the width of an unsigned char, but all the values in that range are taken by possible character values. The return type had to be expanded to int in order to accommodate the sentinel.
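That's why the classic copy loop has to keep the result in an int; a sketch of the standard idiom:

    #include <stdio.h>

    int main(void) {
        int c;                          /* int, not char: 257 possible values */
        while ((c = getchar()) != EOF)  /* all 256 byte values plus the sentinel */
            putchar(c);
        return 0;
    }

Declaring c as a plain char instead is a classic bug: depending on whether char is signed, either EOF can never compare equal, or a real 0xFF byte is mistaken for it.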
The alternative of using an algebraic type is superior because the end-of-stream condition has a different type (so to speak), and furthermore, the programmer has no choice but to deal with it because the character value comes wrapped inside an Option which must be stripped away before the character value can be used.
Really, you also want the type system to express all possible error conditions as well, since getchar() returning EOF can mean either that end-of-file was reached or that some other error occurred!
As someone who has written lots of C code and worked hard to account for all possibilities manually, I really appreciate it when the type system and APIs can express all possibilities and back me up.
> Magical sentinels crammed into return values — like EOF returned by getchar() or -1 returned by ftell() or NULL returned by malloc() — are one of C's drawbacks.
They're part of the C standard library. The POSIX I/O APIs don't have these problems. The Linux I/O system calls are even better because they don't have errno.
Honestly, the C standard library just isn't that good. Freestanding C is a better language precisely because it omits the library and allows the programmer to come up with something better.
I think that's being too kind. The C standard library is terrible.
To be fair, the libraries found in other languages aren't much better. Ruby's standard library was the most comfortable in my experience but it still has glaring flaws.
So `read`'s `Ok(0)` result is akin to `getchar`'s `None` result here. A different API gives you a little more to consider, but it generally makes sense.
Option<u8>, given that in C ‘characters’ means bytes, not code points.
A byte could be 6, 7, 8, or 9 bits depending on the platform.
Yes, but Rust doesn’t support those. So on platforms where both C and Rust run, bytes will be 8 bits.
Either way, no platform defines bytes to be Unicode code points.
Or allow multiple return values like Go. EOF gets returned as an explicit error value and the io.Reader interface is standardized and widely used.
It seems weird that Go considers EOF to be an error condition. Reaching the end of a file is a normal, expected outcome of reading files.
By making error values explicit and handling them necessary, Go makes all error conditions expected outcomes.
Whether this is an advantage is heavily domain dependent.
golang's approach is inferior and error prone. We already have better designed languages.
> What if instead of an int, getchar() returned an Option<char>?
That would be the textbook case of stupid over-engineering.
I strongly disagree. The existing getchar() API is not simple at all! All the possible error conditions are still there, they're just obscured by an overstreamlined API which fuses them inappropriately into a single return type. That makes it harder to handle all cases well, because you have to do all the work manually.
The man page for getchar is a single, easy to read paragraph. To understand algebraic types you need a couple of textbooks.
No, encoding additional information in unused bits of an int that you return is stupid over-engineering that needs multiple textbooks to grok. Option<char>, on the other hand, is the simplest possible solution for this problem.
I would also like to note that encoding the Option<char> using the unused bits of the return value is a perfectly valid implementation. But that is exactly what it is: an implementation detail. It could work exactly the same way as today, but the programmer wouldn’t have to care about how it was implemented, just whether they got a char or None.
In fact, that's exactly what Rust already does today! Option<char> takes exactly the same number of bits to store as a plain char, because the compiler has enough information to encode the Option-ness of char in what it knows is a garbage bit of the underlying type.
https://play.rust-lang.org/?version=stable&mode=debug&editio...
Rust guarantees this, actually.
> No, encoding additional information in unused bits of an int that you return is stupid over-engineering that needs multiple textbooks to grok. Option<char>, on the other hand, is the simplest possible solution for this problem.
What kind of wicked education did you have for this to be the case?
My dad taught me about bits and bytes and words when I was a kid, and by 16 I had a quite solid grasp of it (without any textbook). Then I studied several years and got a PhD in applied math (mostly numerical PDE, and that involved a lot of programming). Then I have spent 15 more years doing math and programming in several languages (mostly C and Python) and getting paid for teaching data science and signal processing to people who went on to have fruitful jobs in industry. Today, I read the Wikipedia page about "option type" [1] and the one about type theory [2], which seems a prerequisite, and couldn't understand a word.
Surely if you have a PhD in applied math you've seen that Wikipedia will often foreground dense theoretical issues, even for topics with straightforward practical applications.
You do not need to understand theoretical type theory to understand options. It's just like a pointer that can be NULL except the compiler makes sure you can't accidentally dereference it if it is. Algebraic data types in general are basically just structs and tagged unions, except the compiler makes sure you can't screw the tags up.
Like, dude, by your own account, you're pretty smart; that's the point of your last paragraph, right? There are, at this point, hordes of Rust and Scala and Swift and Kotlin programmers who can figure out how option types work, and don't seem to have too much of a problem with it and pretty much universally think they're great. Are they actually just smarter than you?
> Are they actually just smarter than you?
Sure they are. Or at least they do not hold an irrational, primary hatred of over-abstraction like I do. In math there's also people like this, who work in stuff like category theory, logic and whatnot. Fortunately, they are a mostly controlled minority.
I’m a community college drop out and have never taken a CS class and I use options in my code all the time.
It’s just a wrapper around some value that is either Some(value) or None and you need to unwrap it and handle both possibilities for your code to compile.
You don’t need to know anything about monads or ADT’s to understand it.
But do you understand that some values of a 32-bit int can be used to encode metadata about an 8-bit char?
If you understand both, which one is conceptually easier to you?
Option<T> is conceptually easier because:
1) It applies exactly the same way to any type, not just char.
2) You don't need to read the man page for every single function that returns an int on the off chance that said int actually contains a bool, a char, or a short plus additional flags.
option obviously.
I just do not understand what the problem with Option is.
It's either Some(1) or None. If it's some you have the value, otherwise you handle the fact you don't have it.
It's so simple and basically every modern language uses it to handle nullable types.
Options for sure, at least in rust.
Rust has tons of help dealing with options built into the language.
You can amortize that cost over all of the problems in the language's domain, not just getchar.
This is very well explained in the classic book The UNIX Programming Environment, by Kernighan and Pike, on page 44:
> Programs retrieve the data in a file by a system call ... called read. Each time read is called, it returns the next part of a file ... read also says how many bytes of the file were returned, so end of file is assumed when a read says "zero bytes are being returned" ... Actually, it makes sense not to represent end of file by a special byte value, because, as we said earlier, the meaning of the bytes depends on the interpretation of the file. But all files must end, and since all files must be accessed through read, returning zero is an interpretation-independent way to represent the end of a file without introducing a new special character.
Read what follows in the book if you want to understand Ctrl-D down cold.
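A sketch of that convention in use, with the POSIX read() call; no byte value is reserved, and zero is the end-of-file signal:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        char buf[4096];
        ssize_t n;

        /* read() returns a byte count: positive means data,
           0 means end of file, -1 means error (see errno). */
        while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0)
            fwrite(buf, 1, (size_t)n, stdout);

        if (n == -1)
            perror("read");
        return n == 0 ? 0 : 1;
    }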
In the beginning, there was the int. In K&R C, before function prototypes, all functions returned "int". ("float" and "double" were kludged in, without checking, at some point.) So the character I/O functions returned a 16-bit signed int. There was no way to return a byte, or a "char". That allowed room for out of band signals such as EOF.
It's an artifact of that era. Along with "BREAK", which isn't a character either.
You can still today declare a function without a return type, like this: "a() { return 1; }".
GCC only outputs a warning by default: "warning: return type defaults to ‘int’ [-Wimplicit-int]"
Seems like the confusion arises because getchar() (or its equivalent in languages other than C) can produce an out-of-band result, EOF, which is not a character.
Procedural programmers don't generally have a problem with this -- getchar() returns an int, after all, so of course it can return non-characters, and did you know that IEEE-754 floating point can represent a "negative zero" that you can use for an error code in functions that return float or double?
Functional programmers worry about this much more, and I got a bit of an education a couple of years ago when I dabbled in Haskell, where I engaged with the issue of what to do when a nominally-pure function gets an error.
I'm not sure I really got it, but I started thinking a lot more clearly about some programming concepts.
The amusing thing about it is that C does not guarantee that EOF is out-of-band!
ISO C says that char must be at least 8 bits, and that int must be at least 16. It is entirely legal to have an implementation that has 16-bit signed char and sizeof(int)==1. In which case -1 is a valid char, and there's no way to distinguish between reading it and getting EOF from getchar().
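Which is why strictly portable code can't trust the sentinel alone and has to ask the stream what EOF actually meant; a sketch:

    #include <stdio.h>

    int main(void) {
        int c = getchar();
        if (c == EOF) {
            if (feof(stdin))
                fputs("end of file\n", stderr);
            else if (ferror(stdin))
                fputs("read error\n", stderr);
            else
                /* only reachable on an exotic implementation where
                   EOF collides with a real character value */
                fputs("a character that compares equal to EOF\n", stderr);
        }
        return 0;
    }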
... which is why no system ever implements things this way. There are many portions of the C spec that can be ignored.
Large swaths of the C standard were built during the heyday of computer design, when you had all sorts of wacky sizes, behaviors and abstractions. Lots of "undefined behavior" is effectively deterministic, because all modern computers have converged to do so many things the same way.
TI DSPs with 16-bit char are still being made. It's a niche thing that most people will never need to care about, but it's not just a historical quirk and definitely not "no system ever".
Then there's SHARC with its 32-bit char.
Do architectures like that have non-freestanding C implementations, though? It's kinda moot if there's no getchar()...
> and did you know that IEEE-754 floating point can represent a "negative zero" that you can use for an error code in functions that return float or double?
I am begging, please never ever do this. NaN literally exists for this reason. NaN even allows you to encode additional error context and details into the value.
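A sketch of that approach (the function name is invented), using NaN rather than hijacking a value that could be a legitimate result:

    #include <math.h>   /* NAN and isnan() are C99; link with -lm */
    #include <stdio.h>

    /* Report domain errors as NaN instead of -0.0 or DBL_MAX,
       both of which are valid outputs of real computations. */
    double checked_sqrt(double x) {
        return x < 0.0 ? NAN : sqrt(x);
    }

    int main(void) {
        double r = checked_sqrt(-1.0);
        if (isnan(r))
            puts("domain error");
        else
            printf("%f\n", r);
        return 0;
    }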
+DBL_MAX. Negative zero is an entirely valid, if rare, result of certain computations.
IEEE 754 has infinities as well, no need to constrain yourself to DBL_MAX :)
I taught myself C on MS-DOS in middle school, decades ago. Could have sworn that ASCII 26 was named “EOF” even if modern text files don’t include it.
This is a supplementary source of confusion.
Wikipedia supports this:
> Character 26 was used to mark "End of file" even if the ASCII calls it Substitute, and has other characters for this. Number 28 which is called "File Separator" has also been used for similar purposes. [1]
I think today we would think of character 4 (End of Transmission, Ctrl-D) as the end of file/input marker, but historically Character 26/Ctrl-Z was used, even on disk.
See, this is why you should not believe Wikipedia.
The DOS syscall interface has no concept of an EOF character. ^Z being considered EOF was a feature of the COPY command, later replicated by the runtimes of various languages targetting DOS.
Not just DOS. CP/M also used CTRL-Z, principally because file lengths weren’t stored on disk - just the list of 128-byte blocks. So to get granularity beyond multiples of 128, you need an explicit EOF character.
I think TYPE would also treat ^Z as a terminator of the file. I think it was common in DOS to have binary files with a textual header followed by ^Z, that would hide the binary part.
Yea, it's confusing. https://news.ycombinator.com/item?id=22572703, read EDIT 3, I found this pretty illuminating.
What does "Procedural" vs "Functional" have to do with this? It's a choice in data type.
If by procedural you mean, nonsense, then sure... I agree that a function named `getchar` returning an `int` is procedural. :P
I suspect it was a product of the OP's musing about errors. Side effects are common in programming languages outside of pure functional languages. When you have a pure functional language, what do you do if the type you are returning can't represent an error? You also can't have side effects (for example throw an exception), so it's doubly important that you make sure your return type can encode errors. I suspect that's all they meant. The choice of wording was just unfortunate (especially the use of "procedural" -- what do I do if I can't return values??? ;-) ).
Nothing, apart from the fact that languages with type systems designed more carefully than C happen to be functional languages, to one extent or another.
(Though by the way: having functions that evaluate to a value when executed is itself a feature that belongs to the functional paradigm, although one so trivial and common that it’s not usually thought as such. But a purely imperative/procedural way of returning values would be via out parameters or global variables.)
The simple answer to this is that these days "functional programming" doesn't just mean the absence of side effects. It means strong type systems, algebraic data types, list comprehensions, etc. It is a distinct cultural stream in the development of programming languages. Of course "functional" has an original narrow meaning, but so do "Republican" and "Democrat".
When Rust introduced ADTs they were recognizably a concept from functional programming. It's a place or community of practice, not a purely descriptive adjective.
What they mean to say is: when I was working with a language that enforced pure functions, I had to actually consider purity. It's rare to see a way to enforce purity in procedural languages, whereas most fp langs support it.
Are we talking about even roughly the same concept of functional purity [1]? Nothing is stopping a pure function from representing EOF as -1.
Implementing IO in a "pure" way, is however another discussion.
Mostly, do you know of a single procedural language with a concept of IO monads in its stdlibs?
> If by procedural you mean, nonsense, then sure
Why are you being snarky?
They clearly mean the issue of modelling partial functions, which would normally be handled by a side effect in a procedural language but can’t be in a functional language.
No, they imply that the handling is done by returning a negative number.
I'm being snarky, as is my nature, to highlight the madness of a function called `getchar` returning anything but a `char`.
It's not a great snark given that the C standard considers the signedness of char to be implementation defined, making -1 a valid option, sometimes.
I'm sorry you don't find it great (I still do). Integers are not characters.
Integers are numbers like -1337, 0, and 42.
Characters are things that compose strings of text.
These are not the same kind of thing at all. Just because APIs may be leaky, and some of these APIs are held in very high regard doesn't change that fact.
In the end, integers, floating point numbers, "text\n", emojis etc. are just sequences of bytes. You choose to acknowledge it and take advantage of it, or you don't.
By that argument, why even have types... and what makes bytes so special? Perhaps you'd like to work with bitstreams, (or qbitstreams)?
A bit late, but bytes are very often the smallest addressable unit. That's why they are "special".
Which answers your second question: "bitstreams" would be terrible because they are not well connected with a hardware reality. Unless you have a bitstream-oriented CPU, it is a bad idea for a basic type to go against the hardware.
Why even have types... Well, yes, there are languages without type checking where the notion still exists. For instance Forth has no type checking but two types are implied: the "byte" type and the "machine word size" type, maybe three if you count strings.
a char isn't a character, though. you can't add two characters together and get another character. it's a number.
getchar() gets a char. not a character.
All you've convinced me of is that the + operator isn't defined for characters. Which makes sense. Trying to tell me that something called char is really just a byte in disguise (while true in some popular languages) is just irritating and misleading to me.
> the madness of a function called `getchar` returning anything but a `char`
It’s effectively returning a Maybe(char).
But it's not.
A `Maybe<char>` has exactly one `None` variant, while an `int` has many, many negative values.
Also, just calling it `None` (or similar) makes clear what is meant, while `-1` is some magic value.
> while `-1` is some magic value
It's a documented return value. Nothing magic about it.
CP/M and DOS use ^Z (0x1A) as an EOF indicator. More modern operating systems use the file length (if available). Unix/Linux will treat ^D (0x04) as EOF within a stream, but only if the source is "cooked" and not "raw". (^D is ASCII "End of Transmission", or EOT, so that seems appropriate, except in the world of Unicode.)
That is a common misconception. Strictly speaking, as discussed elsewhere in this thread, ^D can cause a terminal device to signal an EOF condition; other kinds of Unix byte streams don't make this association.

For example,

    $ python3 -c 'print("".join(chr(c) for c in range(10)))' | python3 -c 'print(list(ord(c) for c in input()))'

will confirm that it doesn't happen in a pipe (the ASCII 4 character there is totally unrelated to EOF).
I'm pretty sure the DOS TYPE command (its version of cat) would stop at the first ^Z it encountered, even if the file was longer.
It was sometimes used to have TYPE print something human readable and stop before the remaining (binary) file data would scroll everything away
> It was sometimes used to have TYPE print something human readable and stop before the remaining (binary) file data would scroll everything away
Notably, in the PNG file format (created back when MS-DOS was still very relevant):
"The first eight bytes of a PNG file always contain the following values: [...] The control-Z character stops file display under MS-DOS. [...]" (http://www.libpng.org/pub/png/spec/1.2/PNG-Rationale.html#R....)
Maybe not for DOS, but for CP/M it most certainly is true, since the length of a file in bytes is not stored anywhere; only the number of (typically 128-byte) sectors is.
For binary files, you just assume there is padding at the end of the file to the end of the sector. For text files, the SUB code was used to indicate where the file ended.
It’s not true: plenty of DOS programs stopped I/O operations with Ctrl+Z and exited with Ctrl+C. What you are saying is that obviously there was no physical 1A byte to demarcate the end of the file, but 1A was used pretty much everywhere. And it’s actually a non-printable character: https://en.m.wikipedia.org/wiki/Substitute_character

So I’m missing the point of this article; Ctrl+Z and Ctrl+D are obviously non-printable characters, and of course they are not used anymore to demarcate the actual end of a file.
Using the "file length" as opposed to the "EOF indicator" is like how strings can either be represented as pointer to a contiguous sequence of `char` ending with a NULL byte, or as a tuple of (length, pointer), without the needed NULL byte.
One gives a priori information the other a posteriori.
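A sketch of the two conventions side by side (the struct is invented for the example):

    #include <stdio.h>
    #include <string.h>

    /* A posteriori: the length is discovered by scanning for the NUL byte. */
    static const char *s1 = "hello";

    /* A priori: the length travels with the pointer; no terminator needed. */
    struct str { size_t len; const char *data; };

    int main(void) {
        struct str s2 = { 5, "hello" };
        printf("%zu %zu\n", strlen(s1), s2.len);
        return 0;
    }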
The kernel returns EOF "if k is the current file position and m is the size of a file, performing a read() when k >= m..."
So, is the length of each file stored as an integer, along with the other metadata? This reminds me of how in JavaScript the length of an array is a property, instead of a function that counts it right then, like say in PHP.
Apparently it works. I've never heard of a situation where the file size number did not match the actual file size, nor of a time when the JavaScript array length got messed up. But it seems fragile. File operations would need to be ACID-compliant, like database operations (and likewise do JavaScript array operations). It seems like you would have to guard against race conditions.
Does anyone have a favorite resource that explains how such things are implemented safely?
You are not thinking about it clearly. Ask yourself this: Filesystem formats use blocking and deblocking. How would a filesystem know the file size without having metadata for it?
So what is CP/M-style character 26? Isn’t that documented as end-of-file?
Perhaps a marginally better title would be "EOF is not a character [on Unix]". There are some OS that have an explicit EOF character, but it seems to have been the less common approach historically. CP/M featured an explicit end of file marker because the file system didn't bother to handle the problem of files which were not block-aligned, so the application layer needed to detect where the actual end of the file was located (lest it read the contents of the rest of the block). This is a pretty unusual thing to do, and was definitely a hassle for developers, so CP/M descendants like MS-DOS fixed it.
I think CP/M copied that convention from an even older OS but I can't remember which one.
CP/M was developed on TOPS-10 and copied a lot of concepts from it. I can't immediately tell whether or not this is an example, but for any given eccentricity of CP/M it's a good bet that it came from TOPS-10.
It's amusing that almost the same can be said about NT: for any given eccentricity of Windows NT it's a good bet that it came from VMS, since the two had the same principal designer.
This also afflicts the xmodem protocol.
It's just a convention, it isn't enforced by the OS. The C runtime for example will check for character 26 if you're reading a file opened in text mode but not in binary mode. The underlying OS call to read a file makes no distinction between text and binary.
I'm just reading up on this now. But according to Wikipedia "CP/M used the 7-bit ASCII set", so then character 26 would be the "SUB (substitute)" character. No?
EDIT: Seems like 26 = EOF is a DOS thing.
EDIT 2: Some confusing comments: https://www.perlmonks.org/bare/?node_id=228760
EDIT 3: A pretty good thread (read NigelQ's reply): http://forums.codeguru.com/showthread.php?181171-End-of-File...
Of course it isn't; you couldn't have arbitrary binary files if one of the 256 possible bytes was reserved. That's why getchar returns int and not char: one char wouldn't be enough for 257 possible values (256 possible char values + EOF).
Recently (though mine was the only comment): https://news.ycombinator.com/item?id=22461647
Well then try explaining ctrl+c vs ctrl+d to someone who's never touched a terminal at all. Starts off so easily... "see, one tells the program to stop". The other, well, if you're in a shell... or some programs... oh god. IDK anymore, just assume it works. What was the question?
Maybe you can correct me if I'm wrong, but I've always considered Ctrl+C and Ctrl+D to be signals that you can send a process rather than explicit characters. You might also get some stdout for those key combinations because ???, but they should be thought of as signals rather than as characters you're sending via stdin.
Hoping Cunningham's Law comes into play with this comment. :)
I liked this explanation. https://www.linusakesson.net/programming/tty/
When the TTY device takes (by default) Ctrl+C or Ctrl+D, it sends the signals to the program. The TTY's 'line discipline' (the policy for when the program's STDIN can read from a line of input) can be changed from the default 'cooked' mode to a 'raw' mode. With the raw-mode line discipline, Ctrl+C doesn't send the signal. Presumably that's why e.g. vi or emacs don't just close on Ctrl+C.
> Now you press ^Z. Since the line discipline has been configured to intercept this character (^Z is a single byte, with ASCII code 26), you don't have to wait for the editor to complete its task and start reading from the TTY device. Instead, the line discipline subsystem instantly sends SIGTSTP to the foreground process group.
This helps me, thanks for pointing me back at this great write-up.
Control-C is part of POSIX job control. If a stream (or "cooked" tty) sends a control-C (ASCII End-of-Text or ETX), the foreground process will be sent a SIGINT signal. If that signal is not handled, the default action is to terminate the process. Control-D is just another control character and not part of POSIX job control, but in the "cooked" case above, it will be interpreted as EOF and the process doing the "read" will receive that.
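A sketch of the difference, assuming a cooked tty and glibc's BSD-style signal() (which restarts the interrupted read):

    #include <signal.h>
    #include <stdio.h>

    static volatile sig_atomic_t interrupts = 0;

    static void on_sigint(int sig) {
        (void)sig;
        interrupts++;        /* ^C lands here instead of killing the process */
    }

    int main(void) {
        signal(SIGINT, on_sigint);

        int c;
        while ((c = getchar()) != EOF)   /* ^D at line start ends the loop */
            putchar(c);

        printf("caught %d SIGINTs before EOF\n", (int)interrupts);
        return 0;
    }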
I was thinking the same thing, until I read this:
> 'stty -icanon' still interprets control characters such as Ctrl-C whereas 'stty raw' disables even this and is the real raw mode.
From the very detailed link posted by rgoulter above.
Still, in raw mode, Ctrl+D will send EOT and thus end your shell, while Ctrl+C won't.
This actually doesn't seem that hard. In both, you are telling the computer, not the target program, something. One is to signal the running program you want to interrupt it. The other is to close the input to the program, as you are done giving it data.
You're making a lot of assumptions about the setup. Granted they are all mostly standard.
Assumptions help the world go around. :)
And... the Devil's in the details.
it all depends on your stty settings.
since I am more used to Windows where ctrl-c is copy, I followed other people's suggestion and mapped ctrl-x to do what ctrl-c usually does, with:
stty intr ^X -ixon
This is because X and C are very close, and I couldn't sacrifice ctrl-v (paste) or ctrl-z (background) while I seldom use ctrl-c
I'm sure you could do the same with ctrl-d if you really wanted to.
You could just use Ctrl-Insert/Shift-Insert for copy/paste everywhere.
I find it interesting that Rust's `Read` API for `read_to_end` [1] states that it "Read[s] all bytes until EOF in this source, placing them into buf", and stops on conditions of either `Ok(0)` or various kinds of `ErrorKind`s, including `UnexpectedEof`, which should probably never be the case.
[1]: https://doc.rust-lang.org/std/io/trait.Read.html#method.read...
The reason for that is that, for simplicity's sake, all of the I/O functions share the same error type. `UnexpectedEof` should never be returned from `read_to_end`, but it can be returned from `read_exact`.
That's because `UnexpectedEof` is never returned from `read()`, it's only ever returned from `read_exact()`. In fact, `UnexpectedEof` didn't exist originally, it was added together with `read_exact()` to represent its unique error case (which is: `read()` returned end-of-file, but we still needed more bytes to completely fill the buffer). It's an error to return `UnexpectedEof` from any of the other methods of the `Read` trait, and since it's an error, it makes sense for `read_to_end()` to stop and propagate that error.
(In fact, thinking better about it, there are some cases where `read()` could legitimately return `UnexpectedEof`, like when it's a wrapper for a compressed stream which has fixed-size fields, and that stream was truncated in the middle of one of these fields. It's clear that, in that case, `UnexpectedEof` is not an end-of-file for the wrapper; it should be treated as an I/O error.)
Banged my head against the wall once after trying to figure out why Ctrl+D generates some character in bash but I can't send that character in a pipe to simulate termination.
Fun fact, ctrl-v in bash sets "verbatim insert" mode for the next character, so you can type a ^D "character" by doing "ctrl-v ctrl-d".
It’s not bash, it’s the tty device driver. Applications can switch between the ‘cooked’ mode (which recognises it as EOF) and ‘raw’ mode (which passes it through) by performing some ioctl I don’t really want to look up right now.
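For the curious, it's the termios interface (tcgetattr/tcsetattr wrap those ioctls); a minimal sketch that turns off canonical mode so ^D arrives as a plain 0x04 byte:

    #include <stdio.h>
    #include <termios.h>
    #include <unistd.h>

    int main(void) {
        struct termios saved, raw;
        tcgetattr(STDIN_FILENO, &saved);

        raw = saved;
        raw.c_lflag &= ~(ICANON | ECHO);  /* no line buffering, no echo;
                                             ISIG stays on, so ^C still signals */
        tcsetattr(STDIN_FILENO, TCSANOW, &raw);

        int c = getchar();                /* ^D now reads as byte 4, not EOF */
        printf("read byte %d\n", c);

        tcsetattr(STDIN_FILENO, TCSANOW, &saved);  /* restore cooked mode */
        return 0;
    }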
I think I may still be banging my head on this one. It's just an ioctl difference between my pipe and my terminal's session, right?
> Banged my head against the wall once after trying to figure out why Ctrl+D generates some character in bash but I can't send that character in a pipe to simulate termination.
Yes, you can. You just end your stream by closing the pipe.
Um, no, you can't use Python to infer that "EOF (as seen in C programs) is not a character".
The exception even tells you that "chr() arg not in range(0x110000)" which has nothing to do with range of C's character types.
For me EOF is a boolean state. Either I am at the end of file (stream / memory mapped etc) or not. That's how I was taught when I started programming. Never occurred to me to think of it like a character.
Another weird thing is that sometimes you can read an EOF, then keep reading more real bytes. So EOF doesn't necessarily mean the permanent end.
The EOF condition for stdio functions is supposed to be sticky, although glibc didn't implement it correctly until 2.28:
https://sourceware.org/bugzilla/show_bug.cgi?id=1190
https://sourceware.org/legacy-ml/libc-alpha/2018-08/msg00003...
> All stdio functions now treat end-of-file as a sticky condition. If you read from a file until EOF, and then the file is enlarged by another process, you must call clearerr or another function with the same effect (e.g. fseek, rewind) before you can read the additional data. This corrects a longstanding C99 conformance bug. It is most likely to affect programs that use stdio to read interactive input from a terminal.
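So a tail -f-style reader now has to clear the flag explicitly; a minimal sketch:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        for (;;) {
            int c = fgetc(stdin);
            if (c == EOF) {
                if (ferror(stdin))
                    break;          /* a real error: give up */
                clearerr(stdin);    /* reset the sticky EOF flag... */
                sleep(1);           /* ...wait for the file to grow... */
                continue;           /* ...then try reading again */
            }
            putchar(c);
        }
        return 0;
    }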
Wow, very interesting! That sounds like a somewhat significant change, and I wonder how much stuff will be broken by it.
Although interestingly, somehow I'm still seeing the old behavior in Debian Buster with glibc 2.28 with python3:

    import sys
    while True:
        b = sys.stdin.read(1)
        print(repr(b))

With old glibc, with both python2 and python3, the EOF isn't sticky (as expected). With 2.28 with python2, the EOF is sticky (like you said). With 2.28 with python3, it's not sticky for some reason.

In Python 3, file I/O is implemented using POSIX read(), write() etc., rather than C stdio.
Interesting, and EOF on POSIX read() isn't supposed to be sticky?
That seems like a weird situation, that EOF is sticky in some cases but not others.
And this is why I failed C IO classes. Lack of information and improper abstraction.
Interesting read, I suspected it was like this but I didn’t know for sure
yeah, this author doesn’t know the history. Unix I/O was defined in opposition to practices in other OSes that no longer exist
Clearly, since they barely know the system they are talking about but could you elaborate instead of leaving it vague? Which systems?
there are plenty of other comments that explain it, but: CP/M, VAX, teletypewriters, punch cards all used in-band control characters rather than an external signal
This strikes me as the sort of pedantic and "I'm witty" click bait that occasionally percolates upwards on HN, especially considering the specifics of "EOF" are very much contingent on operating context.
\r \n (0x0d 0x0a, or just one of them, or the combination of them, depending on your OS) is EOL
^D (0x04) is EOT (End of Transmission) and ^C (0x03) is ETX (End of Text): https://www.systutorials.com/ascii-table-and-ascii-code/
So, kinda, but somehow I'm happy it never got turned into weird combinations depending on the OS.
Those are just conventions, and they aren't consistent from one OS to another at all. ASCII tried to standardize it but failed.
That should have been my point - I'm happy there aren't 3 standards to specify what EOFile is.