Numbers, Decimals, Formatting, and Printing: Why I Want To Jump Off A Cliff

As mentioned in I had a brain hiccup with __builtin_parity:

Two things:

I work with bits often on a daily basis.

I’m pretty stupid and my brain occasionally has a hiccup… like it did today.

Only the second point is relevant in today’s hella face-palmy ramble.

I was working on some sockets code and I needed to log some stuff where nanoseconds mattered. Although I wanted to go via the RDTSC route and multiply that with TSC increment frequency, I eventually decided against it for reasons that don’t matter here. It was deemed fine to use the clock_gettime() syscall. This syscall takes two arguments: clockid and a reference to a timespec structure variable.

I set the clockid to CLOCK_MONOTONIC_RAW and passed a blank timespec structure variable, whose definition looks like:

struct timespec {
    time_t  tv_sec;     /* seconds */
    long    tv_nsec;    /* nanoseconds */
};

I needed to get the time as a string, so I made this quick function that was repeatedly called:

/* Get the current time, sourced from monotonic raw, as a string */
int gettime_str(char *time_str)
{
    int ret = 0;
    struct timespec ts;

    ret = clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    if (ret != 0) {
        perror("clock_gettime failed\n");
        return ret;
    }

    sprintf(time_str, "%ld.%ld", (long) ts.tv_sec, ts.tv_nsec);

    return 0;
}

Apart from the blatant lack of checking whether time_str is NULL, some of you may have already identified the problem. For those of you that haven’t, take a look again.

While doing some python3 post processing, I was calculating the difference in timestamps between log entries. However, I noticed that occasionally the program would spit out negative time differences, i.e., a later entry had an earlier timestamp than a previous entry. This was bizarre because I made sure to flush the buffer to the file and ensured that no potential re-ordering was occurring.

I revisited the clock_gettime() man page to see if I missed something because the programmer can NeVeR be wrong (sarcasm, just in case). After some analysis, I realized my mistake — I didn’t format the nanoseconds properly. Here’s an example:

log 0: 216.819
log 1: 216.2102

>>> ...
>>> # after some processing
>>> ...
>>>
>>> timediff = log[1] - log[0]
>>>
>>> # this is essentially:
>>> 216.2102 - 216.819
-0.6088000000000022

Do you see the mistake now? This is when I wanted to jump off a cliff. Log 0 was recorded at (216 seconds, 819 nanoseconds) whereas Log 1 was recorded at (216 seconds, 2102 nanoseconds). OF COURSE WHEN PYTHON3 TRIES TO READ THE NUMBERS FROM NON-ZERO-PADDED STRINGS, IT’S GOING TO THINK THAT 216.819 IS GREATER THAN 216.2102.

Instead, it should have been 216.0819 and 216.2102.

I’m SUCH a fool. To cut me some slack, the logs weren’t as minimal as they are written here. Nanoseconds involve 9 digits, and the logs were staggered, making it hard to notice that a few entries had 8 digits while most had 9 digits. That being said, I’m still a fool.

To fix it, you pad the ts.tv_nsec part of the format specifier with zeros:

sprintf(time_str, "%ld.%09ld", (long) ts.tv_sec, ts.tv_nsec);

</et cetera>