Show HN: A portable hash map in C

176 points by e-dant a year ago · 91 comments

Reader

gritzko a year ago

I recently coded a linear-probing hash map in C and I highly recommend you to use fuzz tests. These evolve naturally from unit tests: next step table unit tests, next step property test, next step fuzzing, all steps are incremental, hence easy.

Having a fuzz test was in-va-lu-ab-le. I caught several bugs right there. The delete function was particularly tricky.

I ended up with two fuzz tests as my hash table has a key feature: convergence. Having same contents, it would have exactly same bits in the buffer. In other words, it is independent of the insertion/deletion order. For this, I added another fuzz test. I would add a third one if I realize there is an important invariant I did not fuzz test. That is not much work, but so much useful!

https://github.com/gritzko/librdx/blob/master/abc/fuzz/HASH....

https://github.com/gritzko/librdx/blob/master/abc/fuzz/HASHd...

P.S. In your case, I recommend clang's libfuzzer with ASAN.

https://llvm.org/docs/LibFuzzer.html#fuzzer-usage

eqvinox a year ago

Here's another example of unit tests for hash tables (and RB-trees and skiplists and…)
https://github.com/FRRouting/frr/blob/master/tests/lib/test_...
(this file is #include'd from another, with some #define set up:)
https://github.com/FRRouting/frr/blob/master/tests/lib/test_...
The "tribe" of C developers with distaste for the preprocessor will absolutely hate how this is written. But it works and caught real issues. It doesn't use an external fuzzing tool but rather a deterministic PRNG (see prng_* calls) to get a good mix of call combinations. Coverage analysis was done to make sure it reaches everywhere.
(disclaimer: I wrote this. And it did catch bugs. Also it really should have a lot of comments.)
- cb321 a year ago
  If you are in a tribe not hating on pp-macros, you might find this approach for testing/debugging data structures (or even the surrounding pp-based "CTL"-like C templating idea or abstract ways of dealing with the BST zoo) interesting:
  https://github.com/c-blake/bst/blob/main/test/bst-interact.c
  Essentially, you just write a very simple (some might say trivial) interactive command-shell that can include an auto-checking upon edit of (in that case tree) invariants keyed-off an environment variable. The only missing piece is some little tool to generate random inserts/deletes/etc.
  If the micro-shell has a pretty printer then it is easy to "replay" any failed series of edits with some "p" commands before & after invariant failures.
  - eqvinox a year ago
    
    This is incredibly funny to me, because that's (almost) exactly how some of our other tests work :D
    (Almost: we have a full-featured command shell and just use that for testing)
    Shell bindings: https://github.com/FRRouting/frr/blob/master/tests/ospf6d/te...
    Input: https://github.com/FRRouting/frr/blob/master/tests/ospf6d/te...
    Expected output: https://github.com/FRRouting/frr/blob/master/tests/ospf6d/te...
    Absolutely agree this is a great option to have for some kinds of tests.
    
    cb321 a year ago
    
    One reason to have a razor thin custom command shell (perhaps obeying similar conventions across data structures..) is that the parsing for such can be so fast/consistent that you can also use it for benchmarking/perf regression testing with a workload generator. You might also find & record "anomalous" workloads that way or write a little "translator" (some might say "compiler") from a log to such a workload or etc.
    I have done this for data structures even as fast as integer-keyed hash tables (though in that hyper-fast case you might need to try to measure & subtract off parser-loop/IO dispatch overhead and/or need statistical testing for "actually even different at all", perhaps along the lines of https://github.com/c-blake/bu/blob/main/doc/tim.md).
dwattttt a year ago

fuzz testing vs HN-popular-post testing, go!

drfuchs a year ago

A bug, I believe: If you "put" three colliding strings A and then B and then C, and then "delete" B, you won't be able to find C anymore.

e-dantOP a year ago

Good catch! Yes, that looks like a bug :)
- attractivechaos a year ago
  
  You are implementing a closed hash table with linear probing. You need tombstones to mark deleted items, or better, move other items to replace deleted items [1]. Currently your library doesn't have either mechanism.
  [1] https://en.wikipedia.org/wiki/Linear_probing#Deletion
- munificent a year ago
  
  If it helps, my book Crafting Interpreters walks through a linear probing hash table implementation in C including handling deletion:
  https://craftinginterpreters.com/hash-tables.html#deleting-e...

eqvinox a year ago

Here's some food-for-thought questions:

First, a correctness question:

- if a struct contains padding, the value of these padding bytes is undefined. What happens if you use such a struct as a key?

And then some integration questions:

- how do you make this typesafe against mixing up 2 distinct uses of hashmaps in the application?

- how do you make this typesafe for its keys and values, i.e. remove accident-prone casts on read access?

- how do you not call the compare and hash functions through function pointers? (Calling through function pointers is both performance relevant as well as a target for exploitation by overwriting the pointer)

(None of these have "nice" answers. But thinking about them is IMHO extremely valuable since these are the most challenging in organizing and engineering an actual codebase itself.)

hoten a year ago

> - if a struct contains padding, the value of these padding bytes is undefined. What happens if you use such a struct as a key?
It's undefined for uninitialized objects. If I understand correctly, when an initializer is provided the padding bits are set to 0. Additionally, setting the whole chunk of memory to 0 w/ memset would work.
https://stackoverflow.com/a/37642061/24042444
> - how do you make this typesafe against mixing up 2 distinct uses of hashmaps in the application?
Lacking templates, the user of the library could create slim wrapper methods for each type (and coerce to void* to call the underlying hashmap lib); You could still accidentally call the wrong method but it would help a bit. I suppose you could wrap the real hashmap in a new type to help further (so each set of wrapper methods for a given type would be forced to use the same wrapper struct).
Alternatively, the library could provide helper macros to accomplish the same.
- eqvinox a year ago
  
  > If I understand correctly, when an initializer is provided the padding bits are set to 0.
  https://interrupt.memfault.com/blog/c-struct-padding-initial... looks at this (albeit a bit outdated; clang 13 and GCC 11) and claims to see undefined values even when initializers are given, at higher optimization levels
  > Additionally, setting the whole chunk of memory to 0 w/ memset would work.
  This seems to be the only thing that can be relied on to work (… sadly)
  An alternate answer is: don't treat structs as chunks of bytes when the chunk crosses padding (i.e. create functions to hash specific structs)
  Another alternate answer: remove the padding and add "_pad12[34]" members instead. But then you need to figure out where you actually have padding, which is architecture dependent…

jll29 a year ago

Thanks for sharing. Simplicity is a good thing.

In order to speed it up by reducing the number of malloc() calls, it may be worth adding a simple arena memory allocation measure: by allocating one larger block (e.g. 1 MB) initially and then doubling the memory size each time you run out, all malloc()/calloc() calls can become local salmagundi_alloc() calls that are just macro invocations that return an arena buffer pointer and also increment said pointer as a side effect.

I also recommend you have a look at Chris Hanson's book "C: Interfaces and Implementations", which has a few neat C API tricks that your code could benefit from (e.g. for reducing name space pollution, for avoiding null pointer argument errors, API method naming etc.).

attractivechaos a year ago

If you double a memory block with realloc, the memory may be relocated to a different address (at least on MacOS). Then all pointers pointing to the old block will be invalid. This can be addressed by adding new blocks to a list. It will be more complex than a minimal arena allocator.
Furthermore, a typical arena allocator only grows but doesn't shrink. A deleted item from the memory block will still hold the memory and will not be released. You will need a more sophisticated allocator for deletion. For fixed sized items, memory pool may be an option.

teo_zero a year ago

Row 100 of hm_put() clears all contents of the item struct, thus losing the pointer to where the previous value was stored. But then row 107 needs this pointer to pass it to realloc().

Additionally, in case of overwrite, the memory previously allocated for the key is leaked.

All in all, I don't think row 100 is really needed, and removing it wouldn't do any harm.

Finally, deletes are hard in hash maps. Either you decide that deletes are not allowed, or you implement them with tombstones or re-insertions. A half-backed solution is not an option.

Hope these suggestions help you.

acmj a year ago

I agree. How to deal with deletions is a 101 of hash table implementation. Many CS majors could get it right or at least know about the caveat. It is amusing that many high-quality posts on hash tables are buried in the "new" section of HN while this half-baked toy floats to the front page, with 83 github stars at the moment. People don't know what they are upvoting for.
eqvinox a year ago

> Finally, deletes are hard in hash maps.
Deletes are hard in open hash maps like here; they're trivial in chained hash maps.
(I'm only posting this because this seems to be an exercise/learning-ish HN post and I think it's a relevant point to not forget.)
- teo_zero a year ago
  
  Right.
  Thinking about it, I don't grasp why the author has opted for linear probing here. Given the amount of malloc(), they might have used linked lists as well.
  - eqvinox a year ago
    
    Well, chained hash maps are significantly slower in the face of collisions due to the memory fetch latency hit to grab the next item... even ignoring the academic O(n) lookup, memory is often the primary perf constraint these days...
    And for an exercise, it might be worth implementing linear probing just to get more learning experience out of it(?)
    
    teo_zero a year ago
    
    But in this specific implementation the keys are stored separately, each one at its own malloc()ed address. So its performance is limited by memory latency, cache (non-)locality, etc. just like the chained-lists variety.

codr7 a year ago

Considering there seems to be at least a one memory bug (hm_put/key from another comment), I would strongly recommend running the tests in valgrind or similar if you haven't already. Doing C without it usually ends in some kind of disaster for me.

ivanjermakov a year ago

What kind of disaster? Wouldn't OS security prevent userspace program do destructive things?
- chlorion a year ago
  
  It depends.
  You can corrupt memory in such a way that allows executing arbitrary code, in which you could do anything that the process has privilege to do, which is probably reading and writing files as the current user and a lot more.
- codr7 a year ago
  
  A userspace process can be capable of pretty destructive things, and they're all on the menu in C, especially without Valgrind.

vintagedave a year ago

This is neat, and remarkably small. I personally need more comments in order to follow, say, the growth logic. I see it rehashes but I don't see how it doesn't potentially overwrite entries on the way.

Algorithm for hash collisions is just to find the next empty slot (linear probing)? What happens when the original entry is removed, are the ones in subsequent slots cycled backwards? I don't see that...

Also the name is new to me! TIL 'salmagundi' is an English word (not even a loan-word) for a type of salad made of many ingredients: an excellent name for a hash map that can contain many effectively random items.

e-dantOP a year ago

I like the exercise of trying to find the simplest reasonable solution to some problem.
Many of my toy and hobby projects are exactly that. Some make allowances for the sake of performance and generality.
The hash map, though, is up there with sorting algorithms in the breadth of wild implementations. Salmagundi is unlikely to be the fastest, or the smartest. But it’s cute and portable.
- vintagedave a year ago
  
  It's neat. I like it!
  I edited my comment with some more questions, btw, likely as you were answering - apologies if that ended up confusing.
  - e-dantOP a year ago
    
    There’s no slot cycling logic. When a k,v pair at some index is deleted, and a different k,v pair has a matching hash, there’s a (small?) penalty in the hm_get logic that will eventually the right k,v pair.
    I need a bit more clarification on your first question. There should be no mutation of the underlying map when it grows — we’re just allocating more room for entries and adjusting the old map’s item “ownership” — moving pointers->items around.
    
    e-dantOP a year ago
    
    Nevermind — there are ways for deleting colliding items to cause other colliding items to be “forgotten”, and one solution would be to implement slot cycling on deletion
    
    skitter a year ago
    
    Another solution could be tombstones: When a slot gets cleared and the next slot has an item, insert a placeholder value. During linear probing, treat the placeholder as a normal item (that doesn't match); during insertion & deletion treat the placeholder as an empty slot.
andrelaszlo a year ago

The etymology of 'salmagundi' seems pretty obscure. It might be a loan word?
https://www.nytimes.com/1978/01/16/archives/the-salmagundi-d...
https://en.m.wiktionary.org/wiki/salmagundi

cozzyd a year ago

I'm curious about the #ifdef guard, which is of a form I've never encountered.

  #ifndef BD9DF82A4540BB19368E48E4747C0706
  #define BD9DF82A4540BB19368E48E4747C0706

Does some IDE do that for you? Did you just pick a random sequence? Why not just a variant of #ifndef _salmagundi_h that is more common?

Other than that, I strongly echo other's advice about avoiding so many mallocs.

quotemstr a year ago

Amazing how hard some people try to avoid the obviously correct and portable "#pragma once".
- uecker a year ago
  
  It is neither portable nor obviously correct.
  - quotemstr a year ago
    
    Yes, it's both. It works fine in real world million line codebases. I have never, not once, seen a bug or portability issue caused by it. I have seen multiple header guard macro names collide.
    Eschewing pragma once these days is pure ignorance and superstition --- then again, so is writing new code in C at all.
    
    uecker a year ago
    
    This good for you, other people reported problems with pragma once.
    And I think C is one of the better languages you could use to write code today. Also from the programs I use daily, the more stable and reliable ones tend to be written in C.
    
    iainmerrick a year ago
    
    Yep, header guard name collisions, and also typos in header guards so they don’t work as intended.
- cozzyd a year ago
  
  for me it's mostly muscle memory that takes a few seconds at the beginning of a file.
  i#ifndef __foo_h <escape>yyp:s/ifndef/define/<enter>o<enter>#endif<escape>
tobyhinloopen a year ago

What's wrong with `#pragma once`? (honest question, i only do C and C++ as a hobby)
- emsy a year ago
  
  You can't target ancient compilers.
- uecker a year ago
  
  It is not supported everywhere and it uses some heuristic to determine that the file is actually the same one as used before, and this heuristic sometimes fails in complicated scenarios, e.g. with multiple shared directories, link, or because a file was copied.
- Gibbon1 a year ago
  
  Nothing is wrong with that.
  - eqvinox a year ago
    
    One thing is wrong with it: a large part of C developers either doesn't know it exists, or doesn't know what it does.
    … which I'd fix by using it more :D
    
    kevin_thibedeau a year ago
    
    It doesn't exist in C. It's solely a compiler extension. Include guards get the same performance benefit as #pragma once on compilers that support it. There's no point in embracing laziness with no upside.
    
    eqvinox a year ago
    
    There is an upside: complex header files often contain a bunch of #ifdef … #endif blocks, especially when dealing with cross-platform support. Since indentation is generally not used there, they can be rather hard to read, especially when nested.
    Not having the include guard use the same mechanism makes these a bit easier to read and removes a little bit of cognitive load. (And every bit of cognitive load counts since that's a very hard limit of people's brains.)
    Personally (and for the projects I work on), I couldn't care less whether it exists in ISO C or not. I have a list of compilers and platforms I want/need to support. But of course your project may be different.
    
    kevin_thibedeau a year ago
    
    > Since indentation is generally not used there,
    Then indent your macros for readability. clang-format can do it for you if you don't want to maintain them.
    
    eqvinox a year ago
    
    Thanks but I'm not asking for a better way to handle #ifdef's, indenting them is extra work¹ and there's still an unnecessary outermost block for the include guard. You've also not made any further argument against the use of "#pragma once".
    ¹ especially across an open source project with more than a hundred contributors from a multitude of companies.
- iainmerrick a year ago
  
  Pedants don’t like it.
heavensteeth a year ago

It's the length of an MD5 hash, which I'd wager isn't a coincidence. It's not "salmagundi", "Salmagundi", or "salmagundi.h" though.

tralarpa a year ago

My C is quite rusty, so apologies for stupid questions.

In the hm_put function, when you overwrite, why do you malloc and copy the key again, and what happens with the old key pointer and the old value pointer? (no free for the key and the memset zeros everything?)

codr7 a year ago

Nice catch :)
Looks like the key logic is missing an overwrite case, the value is reallocated on overwrite.

alanmoraes a year ago

In the code example, I guess the variable `k_sz` is undefined in the call `hm_get(map, k, k_sz)`. Hope that helps.

ms9598 a year ago

In reviewing hm_put, a few things stand out in the area of handling cases where memory allocation fails. The first problems are at lines 113-116 where memory is allocated and written to, then the allocations are validated in line 117 after they are written to with memcpy. An allocation failure would cause a crash. Secondly, more about semantics, is at line 105 where item->v is free’d after it is found to be NULL, to me it is counter intuitive and unnecessary. Reading further (and back to the first area of concern) at lines 118-119, after an allocation failure, both are free’d which, to me, is acceptable and convenient. In light of that, it seems that line 105 is there for the sake of the reviewer who may not like the apparent asymmetry of the two sites.

e-dantOP a year ago

I want to thank everyone here who pointed out deficiencies in this little library.

Good catches :)

rhelz a year ago

Not getting all defensive is a superpower!

EricRiese a year ago

I followed the link in the docs to the page on the hash function. There's a DJB-2 algorithm on that page, but no DJB-1

e-dantOP a year ago

Huh — I must have written djb1 in error

zabzonk a year ago

Purely out of interest, and probably my bad, but what is:

    (void)k_sz;

doing? I've seen fussy people doing:

     (void) printf( "hello" );

but not what you have written. As I say, probably ignorance on my part.

Sesse__ a year ago

It's an idiom for telling the compiler “don't give a warning on this argument not being used, I know about it”. Not all compilers will heed it. Other choices include e.g. __attribute__((unused)) (GNU), [[maybe_unused]] (C++17/C23) or just not giving it a name (nonstandard, but very commonly supported).
- tredre3 a year ago
  
  > or just not giving it a name (nonstandard, but very commonly supported)
  It's only supported by gcc 11+ and clang 11+ (both from ~2021) because it was a suggested C2x extension. It's been accepted in C23 so other compilers will eventually support it, but I wouldn't hold my breath regarding when that happens.
  - Sesse__ a year ago
    
    Ah, of course, in C mode it's different. (I nearly always use C++ mode, where at least GCC 10 happily takes it.)
- zabzonk a year ago
  
  OK, thank you. But why do the functions have those unused parameters?
  - Sesse__ a year ago
    
    I would assume that is because the hash table takes in a “hash this key” function pointer and a “compare these keys” function pointer, and those must contain the size for variable-length keys. So even if your user-supplied functions know the length, or only care about the first byte, they have to conform to that API, and then you get an unused parameter.
    
    zabzonk a year ago
    
    Have functions with different names and different numbers of parameters?
    
    Sesse__ a year ago
    
    And then have a union between the two different kinds of pointers, and a tag to tell which one is in use, then test that tag everywhere? Why? What good does it bring for that extra complexity?
    
    zabzonk a year ago
    
    Doesn't the name of the function select for the types and use of the parameters? But I haven't written in C for a long time, and this I think could be done more clearly in C++.
    
    Sesse__ a year ago
    
    You're saying you would write the entire hash table four times, with everything duplicated and no reuse? Again: Why? What do you gain compared to one line to suppress a warning?
    
    zabzonk a year ago
    
    Clarity? And in C++ at least, I would re-write the functions.
    
    detaro a year ago
    
    The point is that you can pass any of the hash functions (or your own compatible) to the datastructure, so it'll obviously call of them with the same set of parameters. And hash functions that need all of them will ignore them.
    
    saagarjha a year ago
    
    That would be a bad API design.
lang4d a year ago

That would be one way to silence an unused variable warning
ryanpetrich a year ago

This is to suppress a compiler warning that the k_sz argument is unused.
tester756 a year ago

https://stackoverflow.com/questions/7354786/what-does-void-v...
secondcoming a year ago

It silences 'unused variable' warnings.

david2ndaccount a year ago

If you’re interested in this, I wrote a blogpost on a simple c hash table without deletion.

https://www.davidpriver.com/c-hash-table.html

TheDudeMan a year ago

I suggest improving the probing a little. Something like

  idx = (idx + 1) % map->cap;

becomes

  iterCount++;
  idx = (idx + iterCount) % map->cap;

meisel a year ago

It could be improved even more, performance wise. The potentially expensive modulo could be avoided entirely with an if statement. Or, only use powers of 2 for the capacity, and then you can also use bit wise ops

quotemstr a year ago

This web site is amazing. We have, on one hand, people explaining machine sentience and quantum breakthroughs, and on the other hand, we have people refusing to use "#pragma once" in C because it's still too novel and doesn't work in every compiler yet.

Never change, HN.

Settings

Show HN: A portable hash map in C

Keyboard Shortcuts