Settings

Theme

Show HN: A portable hash map in C

github.com

176 points by e-dant a year ago · 91 comments

Reader

gritzko a year ago

I recently coded a linear-probing hash map in C and I highly recommend you to use fuzz tests. These evolve naturally from unit tests: next step table unit tests, next step property test, next step fuzzing, all steps are incremental, hence easy.

Having a fuzz test was in-va-lu-ab-le. I caught several bugs right there. The delete function was particularly tricky.

I ended up with two fuzz tests as my hash table has a key feature: convergence. Having same contents, it would have exactly same bits in the buffer. In other words, it is independent of the insertion/deletion order. For this, I added another fuzz test. I would add a third one if I realize there is an important invariant I did not fuzz test. That is not much work, but so much useful!

https://github.com/gritzko/librdx/blob/master/abc/fuzz/HASH....

https://github.com/gritzko/librdx/blob/master/abc/fuzz/HASHd...

P.S. In your case, I recommend clang's libfuzzer with ASAN.

https://llvm.org/docs/LibFuzzer.html#fuzzer-usage

  • eqvinox a year ago

    Here's another example of unit tests for hash tables (and RB-trees and skiplists and…)

    https://github.com/FRRouting/frr/blob/master/tests/lib/test_...

    (this file is #include'd from another, with some #define set up:)

    https://github.com/FRRouting/frr/blob/master/tests/lib/test_...

    The "tribe" of C developers with distaste for the preprocessor will absolutely hate how this is written. But it works and caught real issues. It doesn't use an external fuzzing tool but rather a deterministic PRNG (see prng_* calls) to get a good mix of call combinations. Coverage analysis was done to make sure it reaches everywhere.

    (disclaimer: I wrote this. And it did catch bugs. Also it really should have a lot of comments.)

    • cb321 a year ago

      If you are in a tribe not hating on pp-macros, you might find this approach for testing/debugging data structures (or even the surrounding pp-based "CTL"-like C templating idea or abstract ways of dealing with the BST zoo) interesting:

          https://github.com/c-blake/bst/blob/main/test/bst-interact.c
      
      Essentially, you just write a very simple (some might say trivial) interactive command-shell that can include an auto-checking upon edit of (in that case tree) invariants keyed-off an environment variable. The only missing piece is some little tool to generate random inserts/deletes/etc.

      If the micro-shell has a pretty printer then it is easy to "replay" any failed series of edits with some "p" commands before & after invariant failures.

  • dwattttt a year ago

    fuzz testing vs HN-popular-post testing, go!

drfuchs a year ago

A bug, I believe: If you "put" three colliding strings A and then B and then C, and then "delete" B, you won't be able to find C anymore.

eqvinox a year ago

Here's some food-for-thought questions:

First, a correctness question:

- if a struct contains padding, the value of these padding bytes is undefined. What happens if you use such a struct as a key?

And then some integration questions:

- how do you make this typesafe against mixing up 2 distinct uses of hashmaps in the application?

- how do you make this typesafe for its keys and values, i.e. remove accident-prone casts on read access?

- how do you not call the compare and hash functions through function pointers? (Calling through function pointers is both performance relevant as well as a target for exploitation by overwriting the pointer)

(None of these have "nice" answers. But thinking about them is IMHO extremely valuable since these are the most challenging in organizing and engineering an actual codebase itself.)

  • hoten a year ago

    > - if a struct contains padding, the value of these padding bytes is undefined. What happens if you use such a struct as a key?

    It's undefined for uninitialized objects. If I understand correctly, when an initializer is provided the padding bits are set to 0. Additionally, setting the whole chunk of memory to 0 w/ memset would work.

    https://stackoverflow.com/a/37642061/24042444

    > - how do you make this typesafe against mixing up 2 distinct uses of hashmaps in the application?

    Lacking templates, the user of the library could create slim wrapper methods for each type (and coerce to void* to call the underlying hashmap lib); You could still accidentally call the wrong method but it would help a bit. I suppose you could wrap the real hashmap in a new type to help further (so each set of wrapper methods for a given type would be forced to use the same wrapper struct).

    Alternatively, the library could provide helper macros to accomplish the same.

    • eqvinox a year ago

      > If I understand correctly, when an initializer is provided the padding bits are set to 0.

      https://interrupt.memfault.com/blog/c-struct-padding-initial... looks at this (albeit a bit outdated; clang 13 and GCC 11) and claims to see undefined values even when initializers are given, at higher optimization levels

      > Additionally, setting the whole chunk of memory to 0 w/ memset would work.

      This seems to be the only thing that can be relied on to work (… sadly)

      An alternate answer is: don't treat structs as chunks of bytes when the chunk crosses padding (i.e. create functions to hash specific structs)

      Another alternate answer: remove the padding and add "_pad12[34]" members instead. But then you need to figure out where you actually have padding, which is architecture dependent…

jll29 a year ago

Thanks for sharing. Simplicity is a good thing.

In order to speed it up by reducing the number of malloc() calls, it may be worth adding a simple arena memory allocation measure: by allocating one larger block (e.g. 1 MB) initially and then doubling the memory size each time you run out, all malloc()/calloc() calls can become local salmagundi_alloc() calls that are just macro invocations that return an arena buffer pointer and also increment said pointer as a side effect.

I also recommend you have a look at Chris Hanson's book "C: Interfaces and Implementations", which has a few neat C API tricks that your code could benefit from (e.g. for reducing name space pollution, for avoiding null pointer argument errors, API method naming etc.).

  • attractivechaos a year ago

    If you double a memory block with realloc, the memory may be relocated to a different address (at least on MacOS). Then all pointers pointing to the old block will be invalid. This can be addressed by adding new blocks to a list. It will be more complex than a minimal arena allocator.

    Furthermore, a typical arena allocator only grows but doesn't shrink. A deleted item from the memory block will still hold the memory and will not be released. You will need a more sophisticated allocator for deletion. For fixed sized items, memory pool may be an option.

teo_zero a year ago

Row 100 of hm_put() clears all contents of the item struct, thus losing the pointer to where the previous value was stored. But then row 107 needs this pointer to pass it to realloc().

Additionally, in case of overwrite, the memory previously allocated for the key is leaked.

All in all, I don't think row 100 is really needed, and removing it wouldn't do any harm.

Finally, deletes are hard in hash maps. Either you decide that deletes are not allowed, or you implement them with tombstones or re-insertions. A half-backed solution is not an option.

Hope these suggestions help you.

  • acmj a year ago

    I agree. How to deal with deletions is a 101 of hash table implementation. Many CS majors could get it right or at least know about the caveat. It is amusing that many high-quality posts on hash tables are buried in the "new" section of HN while this half-baked toy floats to the front page, with 83 github stars at the moment. People don't know what they are upvoting for.

  • eqvinox a year ago

    > Finally, deletes are hard in hash maps.

    Deletes are hard in open hash maps like here; they're trivial in chained hash maps.

    (I'm only posting this because this seems to be an exercise/learning-ish HN post and I think it's a relevant point to not forget.)

    • teo_zero a year ago

      Right.

      Thinking about it, I don't grasp why the author has opted for linear probing here. Given the amount of malloc(), they might have used linked lists as well.

      • eqvinox a year ago

        Well, chained hash maps are significantly slower in the face of collisions due to the memory fetch latency hit to grab the next item... even ignoring the academic O(n) lookup, memory is often the primary perf constraint these days...

        And for an exercise, it might be worth implementing linear probing just to get more learning experience out of it(?)

        • teo_zero a year ago

          But in this specific implementation the keys are stored separately, each one at its own malloc()ed address. So its performance is limited by memory latency, cache (non-)locality, etc. just like the chained-lists variety.

codr7 a year ago

Considering there seems to be at least a one memory bug (hm_put/key from another comment), I would strongly recommend running the tests in valgrind or similar if you haven't already. Doing C without it usually ends in some kind of disaster for me.

  • ivanjermakov a year ago

    What kind of disaster? Wouldn't OS security prevent userspace program do destructive things?

    • chlorion a year ago

      It depends.

      You can corrupt memory in such a way that allows executing arbitrary code, in which you could do anything that the process has privilege to do, which is probably reading and writing files as the current user and a lot more.

    • codr7 a year ago

      A userspace process can be capable of pretty destructive things, and they're all on the menu in C, especially without Valgrind.

vintagedave a year ago

This is neat, and remarkably small. I personally need more comments in order to follow, say, the growth logic. I see it rehashes but I don't see how it doesn't potentially overwrite entries on the way.

Algorithm for hash collisions is just to find the next empty slot (linear probing)? What happens when the original entry is removed, are the ones in subsequent slots cycled backwards? I don't see that...

Also the name is new to me! TIL 'salmagundi' is an English word (not even a loan-word) for a type of salad made of many ingredients: an excellent name for a hash map that can contain many effectively random items.

  • e-dantOP a year ago

    I like the exercise of trying to find the simplest reasonable solution to some problem.

    Many of my toy and hobby projects are exactly that. Some make allowances for the sake of performance and generality.

    The hash map, though, is up there with sorting algorithms in the breadth of wild implementations. Salmagundi is unlikely to be the fastest, or the smartest. But it’s cute and portable.

    • vintagedave a year ago

      It's neat. I like it!

      I edited my comment with some more questions, btw, likely as you were answering - apologies if that ended up confusing.

      • e-dantOP a year ago

        There’s no slot cycling logic. When a k,v pair at some index is deleted, and a different k,v pair has a matching hash, there’s a (small?) penalty in the hm_get logic that will eventually the right k,v pair.

        I need a bit more clarification on your first question. There should be no mutation of the underlying map when it grows — we’re just allocating more room for entries and adjusting the old map’s item “ownership” — moving pointers->items around.

        • e-dantOP a year ago

          Nevermind — there are ways for deleting colliding items to cause other colliding items to be “forgotten”, and one solution would be to implement slot cycling on deletion

          • skitter a year ago

            Another solution could be tombstones: When a slot gets cleared and the next slot has an item, insert a placeholder value. During linear probing, treat the placeholder as a normal item (that doesn't match); during insertion & deletion treat the placeholder as an empty slot.

  • andrelaszlo a year ago

    The etymology of 'salmagundi' seems pretty obscure. It might be a loan word?

    https://www.nytimes.com/1978/01/16/archives/the-salmagundi-d...

    https://en.m.wiktionary.org/wiki/salmagundi

cozzyd a year ago

I'm curious about the #ifdef guard, which is of a form I've never encountered.

  #ifndef BD9DF82A4540BB19368E48E4747C0706
  #define BD9DF82A4540BB19368E48E4747C0706
Does some IDE do that for you? Did you just pick a random sequence? Why not just a variant of #ifndef _salmagundi_h that is more common?

Other than that, I strongly echo other's advice about avoiding so many mallocs.

  • quotemstr a year ago

    Amazing how hard some people try to avoid the obviously correct and portable "#pragma once".

    • uecker a year ago

      It is neither portable nor obviously correct.

      • quotemstr a year ago

        Yes, it's both. It works fine in real world million line codebases. I have never, not once, seen a bug or portability issue caused by it. I have seen multiple header guard macro names collide.

        Eschewing pragma once these days is pure ignorance and superstition --- then again, so is writing new code in C at all.

        • uecker a year ago

          This good for you, other people reported problems with pragma once.

          And I think C is one of the better languages you could use to write code today. Also from the programs I use daily, the more stable and reliable ones tend to be written in C.

        • iainmerrick a year ago

          Yep, header guard name collisions, and also typos in header guards so they don’t work as intended.

    • cozzyd a year ago

      for me it's mostly muscle memory that takes a few seconds at the beginning of a file.

      i#ifndef __foo_h <escape>yyp:s/ifndef/define/<enter>o<enter>#endif<escape>

  • tobyhinloopen a year ago

    What's wrong with `#pragma once`? (honest question, i only do C and C++ as a hobby)

    • emsy a year ago

      You can't target ancient compilers.

    • uecker a year ago

      It is not supported everywhere and it uses some heuristic to determine that the file is actually the same one as used before, and this heuristic sometimes fails in complicated scenarios, e.g. with multiple shared directories, link, or because a file was copied.

    • Gibbon1 a year ago

      Nothing is wrong with that.

      • eqvinox a year ago

        One thing is wrong with it: a large part of C developers either doesn't know it exists, or doesn't know what it does.

        … which I'd fix by using it more :D

        • kevin_thibedeau a year ago

          It doesn't exist in C. It's solely a compiler extension. Include guards get the same performance benefit as #pragma once on compilers that support it. There's no point in embracing laziness with no upside.

          • eqvinox a year ago

            There is an upside: complex header files often contain a bunch of #ifdef … #endif blocks, especially when dealing with cross-platform support. Since indentation is generally not used there, they can be rather hard to read, especially when nested.

            Not having the include guard use the same mechanism makes these a bit easier to read and removes a little bit of cognitive load. (And every bit of cognitive load counts since that's a very hard limit of people's brains.)

            Personally (and for the projects I work on), I couldn't care less whether it exists in ISO C or not. I have a list of compilers and platforms I want/need to support. But of course your project may be different.

            • kevin_thibedeau a year ago

              > Since indentation is generally not used there,

              Then indent your macros for readability. clang-format can do it for you if you don't want to maintain them.

              • eqvinox a year ago

                Thanks but I'm not asking for a better way to handle #ifdef's, indenting them is extra work¹ and there's still an unnecessary outermost block for the include guard. You've also not made any further argument against the use of "#pragma once".

                ¹ especially across an open source project with more than a hundred contributors from a multitude of companies.

    • iainmerrick a year ago

      Pedants don’t like it.

  • heavensteeth a year ago

    It's the length of an MD5 hash, which I'd wager isn't a coincidence. It's not "salmagundi", "Salmagundi", or "salmagundi.h" though.

tralarpa a year ago

My C is quite rusty, so apologies for stupid questions.

In the hm_put function, when you overwrite, why do you malloc and copy the key again, and what happens with the old key pointer and the old value pointer? (no free for the key and the memset zeros everything?)

  • codr7 a year ago

    Nice catch :)

    Looks like the key logic is missing an overwrite case, the value is reallocated on overwrite.

alanmoraes a year ago

In the code example, I guess the variable `k_sz` is undefined in the call `hm_get(map, k, k_sz)`. Hope that helps.

ms9598 a year ago

In reviewing hm_put, a few things stand out in the area of handling cases where memory allocation fails. The first problems are at lines 113-116 where memory is allocated and written to, then the allocations are validated in line 117 after they are written to with memcpy. An allocation failure would cause a crash. Secondly, more about semantics, is at line 105 where item->v is free’d after it is found to be NULL, to me it is counter intuitive and unnecessary. Reading further (and back to the first area of concern) at lines 118-119, after an allocation failure, both are free’d which, to me, is acceptable and convenient. In light of that, it seems that line 105 is there for the sake of the reviewer who may not like the apparent asymmetry of the two sites.

e-dantOP a year ago

I want to thank everyone here who pointed out deficiencies in this little library.

Good catches :)

EricRiese a year ago

I followed the link in the docs to the page on the hash function. There's a DJB-2 algorithm on that page, but no DJB-1

zabzonk a year ago

Purely out of interest, and probably my bad, but what is:

    (void)k_sz;
doing? I've seen fussy people doing:

     (void) printf( "hello" );
but not what you have written. As I say, probably ignorance on my part.
  • Sesse__ a year ago

    It's an idiom for telling the compiler “don't give a warning on this argument not being used, I know about it”. Not all compilers will heed it. Other choices include e.g. __attribute__((unused)) (GNU), [[maybe_unused]] (C++17/C23) or just not giving it a name (nonstandard, but very commonly supported).

    • tredre3 a year ago

      > or just not giving it a name (nonstandard, but very commonly supported)

      It's only supported by gcc 11+ and clang 11+ (both from ~2021) because it was a suggested C2x extension. It's been accepted in C23 so other compilers will eventually support it, but I wouldn't hold my breath regarding when that happens.

      • Sesse__ a year ago

        Ah, of course, in C mode it's different. (I nearly always use C++ mode, where at least GCC 10 happily takes it.)

    • zabzonk a year ago

      OK, thank you. But why do the functions have those unused parameters?

      • Sesse__ a year ago

        I would assume that is because the hash table takes in a “hash this key” function pointer and a “compare these keys” function pointer, and those must contain the size for variable-length keys. So even if your user-supplied functions know the length, or only care about the first byte, they have to conform to that API, and then you get an unused parameter.

        • zabzonk a year ago

          Have functions with different names and different numbers of parameters?

          • Sesse__ a year ago

            And then have a union between the two different kinds of pointers, and a tag to tell which one is in use, then test that tag everywhere? Why? What good does it bring for that extra complexity?

            • zabzonk a year ago

              Doesn't the name of the function select for the types and use of the parameters? But I haven't written in C for a long time, and this I think could be done more clearly in C++.

              • Sesse__ a year ago

                You're saying you would write the entire hash table four times, with everything duplicated and no reuse? Again: Why? What do you gain compared to one line to suppress a warning?

          • detaro a year ago

            The point is that you can pass any of the hash functions (or your own compatible) to the datastructure, so it'll obviously call of them with the same set of parameters. And hash functions that need all of them will ignore them.

          • saagarjha a year ago

            That would be a bad API design.

  • lang4d a year ago

    That would be one way to silence an unused variable warning

  • ryanpetrich a year ago

    This is to suppress a compiler warning that the k_sz argument is unused.

  • tester756 a year ago
  • secondcoming a year ago

    It silences 'unused variable' warnings.

david2ndaccount a year ago

If you’re interested in this, I wrote a blogpost on a simple c hash table without deletion.

https://www.davidpriver.com/c-hash-table.html

TheDudeMan a year ago

I suggest improving the probing a little. Something like

  idx = (idx + 1) % map->cap;
becomes

  iterCount++;
  idx = (idx + iterCount) % map->cap;
  • meisel a year ago

    It could be improved even more, performance wise. The potentially expensive modulo could be avoided entirely with an if statement. Or, only use powers of 2 for the capacity, and then you can also use bit wise ops

quotemstr a year ago

This web site is amazing. We have, on one hand, people explaining machine sentience and quantum breakthroughs, and on the other hand, we have people refusing to use "#pragma once" in C because it's still too novel and doesn't work in every compiler yet.

Never change, HN.

Keyboard Shortcuts

j
Next item
k
Previous item
o / Enter
Open selected item
?
Show this help
Esc
Close modal / clear selection