Settings

Theme

Write a hash table in C

github.com

189 points by jmlr 8 years ago · 57 comments

Reader

fizixer 8 years ago

I've heard of two possibilities (as a recommendation) for using data structures in C without having to develop them:

- BSD queue.h offers linked lists, queues, etc [1].

- UT-hash offers hash tables [2].

I've also read some people being frustrated with them, probably due to the quirky syntax/API. On the plus side (BSD at least) it's just a header file that you can download, include and start using.

[1] http://bxr.su/OpenBSD/sys/sys/queue.h

[2] https://troydhanson.github.io/uthash/

msarnoff 8 years ago

"The language doesn't come with [a hash table implementation] included"

The standard library does come with a half-baked hash table implementation[1] (hcreate/hdestroy/hsearch) that only allows creation of _one_ global hash table. GNU libc[2] adds hcreate_r/hdestroy_r/hsearch_r that allows multiple tables to be created.

The APIs are strange and antiquated. On the other hand, the hash table implementation described in the article presents a much nicer interface.

[1] http://pubs.opengroup.org/onlinepubs/009695399/functions/hcr...

[2] http://man7.org/linux/man-pages/man3/hsearch.3.html

  • to3m 8 years ago

    The hcreate nonsense is POSIX, not C99, so a conforming C implementation might not include it.

    • twic 8 years ago

      How does junk like this ever get into a standard? Didn't a whole committee of very experienced systems programmers have to sit down and say, "yes, these are routines that i think people will find generally useful"?

      It looks like the same header, search.h, also includes functions for working with binary trees, where the user passes in the root pointer, and so which support multiple trees:

      http://pubs.opengroup.org/onlinepubs/9699919799/functions/ts...

      So this committee must have signed off on the hashtable stuff even though they had an example of how to do it properly right next to it!

      • jws 8 years ago

        A standards process is frequently about documenting what already exists in common implementations so that there is a specification for future implementations. POSIX was very much one of these processes.

        • to3m 8 years ago

          OK... so how did junk like this ever get into enough implementations that people were forced to include it in the standard? ;)

          • emmelaich 8 years ago

            tsearch/hsearch/lsearch have been around for at least 30 years. My guess is that they've been around for a lot longer.

            Why not have these anyway? Otherwise people will make their own (probably buggy) equivalents.

            I'm actually surprised that people aren't aware of them.

            • to3m 8 years ago

              This is specifically about hcreate/hsearch, which are utterly useless, since you only get one hash table at a time. The reason you don't want crap like this is just that: it's crap.

              The _r versions look perfectly sensible.

    • mcguire 8 years ago

      I was just thinking, "Where the hell did that come from?"

      Signed,

      POSIX_ME_HARDER

  • blt 8 years ago

    This is absurd, who could have ever thought it is a good idea?

sillysaurus3 8 years ago

The link at the bottom of the intro is broken: https://github.com/jamesroutley/write-a-hash-table/blob/v0.1... preventing the reader from advancing.

I think the whole thing would be best as a single page, to be honest. Even if it's long, people have scroll wheels. But it's a stylistic choice.

pieguy 8 years ago

"(long)pow(a, len_s - (i+1))" is both inefficient and inaccurate. You should store the current power in a variable which you multiply by 'a' and reduce by 'm' on each iteration.

Also, your double hashing scheme has a bug. You identified that it's problematic if hash_b returns 0, but adding 1 to the result doesn't actually solve anything, because now you have the same problem if hash_b returns num_buckets-1.

__s 8 years ago

Weird there isn't a src directory collecting everything together

Hashtables are one of those data structures one can easily get away with never having to write on their own, yet still use every day. In scripting languages it's impractical to roll your own, & in the more abstracted low level languages there'll usually be a pretty good generic version. Yet there's much room for variation

I had the joy to be out of reach of off-the-shelf hashtables for luwa,

init: https://github.com/serprex/luwa/blob/436c7271120fcd116af30e9...

get/set: https://github.com/serprex/luwa/blob/436c7271120fcd116af30e9...

hash: https://github.com/serprex/luwa/blob/436c7271120fcd116af30e9...

Still need to implement Lua's sequential-portion-as-array logic. I do like the API not having a 'remove' method, opting intead for nil by default. nil value entries get vacuumed out when rehashing. This ends up acting as tombstone. In an earlier iteration I was using 0 as a marker distinct from nil, but an implementation detail is that nil is stored at address 0.. oops

nathan_long 8 years ago

If you just want to see hash tables work as a data structure, here's a talk-as-blog-post I made on the subject.

http://nathanmlong.com/2015/10/reimplementing-rubys-hash/

  • igravious 8 years ago

    Nathan, that was absolutely fan-fuckin-tastic. Pardon me for the potty mouthed tmesis but that was a highly highly amusing run through. How did you do the graphs? (I'm always wondering how people do graphs.) And is that really you on the sofa waking up from the dream?

dullgiulio 8 years ago

If your set is known at compile time, make a Perfect Hash Table with gperf instead. It's commonly used in compilers and other software with a parsing or interpreting stage.

_5ysi 8 years ago

In the intro section, on extending to unicode being non-trivial...is the argument that if your input has some non-ascii data you probably have all non-ascii data and the above 127 bucket will be too full? Honest question.

baron816 8 years ago

While the reasons for this that the author gave are valid, I think it would also be valuable to do this in a high level language. I'm sure lot of people who can read C code went through a CS program, and hopefully learned how hash tables work then. People without CS degrees and who mostly write JS/Ruby/Python/etc. and use hash tables on almost every line of code they write are much less likely to know how they work. Just saying, someone should do it for other languages.

  • eighthnate 8 years ago

    The problem with writing hash tables or any fundamental data structures in JS/Ruby/Python/etc is that most of the hard work is done for them.

    The hard part is allocation memory, keeping track of the structure(array) holding the hashtable, etc.

    What else is there? Implementing the hash function/code?

    If you don't have a CS degree, you probably don't even know how lists, queues, stacks, etc really work. But the kind of programming that is done at a "high" level, you might not even need know it.

    Instead of saying people should implement it JS/Ruby/Python/etc ( all of whom are OOPs ), they should learn C and implement it in C.

    • pfranz 8 years ago

      There's room for both. There's a reason why a lot of CS programs have shuffled from primarily C to Python (some had Java in between).

      Not having to deal with boilerplate and housekeeping lets you focus on learning the topic at hand. Especially in a classroom setting where you're probably focusing on the specific subject instead of real-world dirtiness.

      But after learning the basic concept, you do need an actual implementation if you want to do anything.

      • eighthnate 8 years ago

        > There's room for both.

        Sure. I'm all for learning ML, Scheme, C/C++, Python, Ruby, Java/C#, etc.

        > There's a reason why a lot of CS programs have shuffled from primarily C to Python (some had Java in between).

        The reason is called dumbing down the curriculum. My fall freshman year, my college tried to switch from C to Java in order to make it easier for the general population to get into CS and there was a major pushback from the CS student body, alumni and a bunch of the CS professors. We went back to C/C++ my sophomore year.

        > Not having to deal with boilerplate and housekeeping lets you focus on learning the topic at hand.

        I disagree. Having to deal with the boilerplate and housekeeping helps you learn the topic at hand.

        > But after learning the basic concept, you do need an actual implementation if you want to do anything.

        I disagree. Giving them "training wheels" will keep them dependent on training wheels. In an idealistic world, people go from Java/Python to C/Assembly, but that's rarely the case.

        CS should begin with Math ( boolean algebra/logic/etc ), Assembly and C. Then go on to functional paradigms. Then to OOP. Etc ( in the language ). Of course they should also learn about OS( kernel/filesystems/etc ), networks, computation theory/etc, compilers, etc.

        The dumbing down of the curriculum so that music/english majors can pass CS classes doesn't do anyone any favors.

        • dang 8 years ago

          > so that music/english majors can pass CS classes

          This breaks the HN guideline against flamebait. Please don't post like this here.

          https://news.ycombinator.com/newsguidelines.html

          • eighthnate 8 years ago

            It isn't flame bait. It was the reason our Dean gave for switching from C to Java. The thinking was that the learning curve for C was too high and the Math/Computer Science wanted to attract more humanities majors to take CS classes.

            Introduction to Programming became a requirement and the drop-out/failure rate for non-CS major students was excessively high so the Dean thought teaching the classes in java rather than c would help out those who were new to programming and help them pass these "newly required" classes.

        • pfranz 8 years ago

          > Sure. I'm all for learning ML, Scheme, C/C++, Python, Ruby, Java/C#, etc.

          Personally, I prefer to dig into 1 or 2 programming languages deeply instead of shallowly hop around half a dozen. I seem to retain the deep stuff a lot better. Even when coming back to languages I used heavily years ago (Bash or Perl, for example) I end up having to look up basic things like how loops or functions are defined when I pick them up again.

          > The reason is called dumbing down the curriculum.

          Do you feel the same way about SICP's choice of using Scheme? One of the innovations of SICP was that CS courses spent too much time explaining the details of a programming language [1]. That was 30 years ago. There are orthogonal discussions on the goals of CS changing over time.

          > In an idealistic world, people go from Java/Python to C/Assembly, but that's rarely the case.

          Maybe there's no business need? My wife works in video games. That's one industry where programmers often have deep knowledge of hardware and use assembly--but, there's still only a couple of them at each studio. Most people programming are doing game scripting or content creation.

          > CS should begin with Math

          Your argument seems to be against abstraction. Math is all about abstraction. So is programming. It's been interesting to see programmers tinker with hardware as things like Arduino and Raspberry Pi have become more popular. When writing software they take things for granted, but when interacting with hardware they have to account for things like debounce.

          [1] https://people.eecs.berkeley.edu/~bh/sicp.html

        • B1narySunset 8 years ago

          The dumbing down of the curriculum so that music/english majors can pass CS classes doesn't do anyone any favors.

          You sound butt-hurt

          • dang 8 years ago

            We ban accounts that post like this, so would you please not?

            https://news.ycombinator.com/newsguidelines.html

          • joeberon 8 years ago

            Level of discourse on Hacker News 2017

          • eighthnate 8 years ago

            Huh? I'm not butthurt. Seems like you are though.

            I'm sorry having standards and expectations is something that offends so many people.

            C isn't a "difficult" language. It's something any CS major should have an understanding of.

            I can't believe on hackernews, I'd get attacked for saying people should learn C.

            Nevermind that most OSes and most languages themselves are written in C. Even python interpreters are written in C.

            Since when is saying people should learn C for a CS degree considered being "butthurt"? What's next? Saying people should learn latin/ancient greek for a liberal arts degree is considered being butthurt?

            • B1narySunset 8 years ago

              You're being attacked because of your condescending tone implying that music and english majors aren't smart enough to pass the CS curriculum

              • grzm 8 years ago

                You're being attacked

                Regardless of how much you may disagree with what someone said or how they said it, attacking them for it (even verbally) is not an appropriate response. It only contributes to worsening the discourse. If you can't find a way to disagree or express an opinion constructively, it's better to refrain from commenting at all.

  • igravious 8 years ago

    See elsewhere here https://news.ycombinator.com/item?id=15082885 for an neat-o Ruby implementation.

  • flukus 8 years ago

    You could always port it.

clarry 8 years ago

Don't allocate these items individually; don't store pointers to such items. They're tiny, and the pointer chasing + allocator bookkeeping overhead + potential randomisation make it inefficient for no gain. Just allocate an array of items. This way you'll also deal with fewer potential errors. Speaking of errors, please handle them and fix the API so that the caller can know if there's a problem.

  • pacaro 8 years ago

    Yeah, reading through that jumped out at me.

    Also the use of strdup, while a reasonable choice, should also come with caveats about ownership and copying

    I stopped at the use of pow for exponentiation in the hash function. Just iteratively multiply your factor and take the modulus on each round.

    This would handle utf-8 just fine if characters are cast to unsigned in the hash function and a larger prime factor is used

    None of these are real criticisms, I'd just love to see a pointer to more in depth information on how to really write hash functions

    I hope that i missed it, but this definitely needs a clear statement that this is a pedagogical exercise and not a production ready implementation

smegel 8 years ago

This might seem academic, but I actually had a real reason to hand-roll a hash table implementation that would work in process-shared memory (as an Apache module). It was based on the hash table implemented in mod_ldap, and was part of an LRU cache.

jay-anderson 8 years ago

This is a good mostly simple overview. The appendix should also mention robin hood hashing as a method for addressing collisions. From my understanding it is cpu cache friendly and works well in practice.

peterbotond 8 years ago

berkeley db 185. it has served me very good over the years and it was fast enough for real time market data and indexing and many more.

7Anonymous 8 years ago

Just go back to the table of contents. The link works from there.

agumonkey 8 years ago

hah, just a few days ago I named a few files hshtbl.c trie.c and so on. All after a Raymond Hettinger talk about how they rethought Python 3 dicts.

cletusw 8 years ago

Link should be changed to https://github.com/jamesroutley/write-a-hash-table

Keyboard Shortcuts

j
Next item
k
Previous item
o / Enter
Open selected item
?
Show this help
Esc
Close modal / clear selection