Python Dictionaries

medium.com

7 points by ankit_it09 3 years ago · 13 comments

Reader

marzipanWhale 3 years ago

I really like Raymond Hettinger's video on Python dictionaries: https://www.youtube.com/watch?v=npw4s1QTmPg

ankit_it09OP 3 years ago

Yes, I liked it too.

eesmith 3 years ago

This is deeply wrong.

> Python then calculates the hash value for each key in the dictionary using the MurmurHash3 hash function.

Umm... this isn't right. At all. Depending on configuration, Python uses an external hash, a Modified Fowler-Noll-Vo (FNV) hash, or SipHash.

Here's a quote from Include/pyhash.h :

* The values for Py_HASH_* are hard-coded in the * configure script. * * - FNV and SIPHASH* are available on all platforms and architectures. * - With EXTERNAL embedders can provide an alternative implementation with::

and the implementations are in Python/pyhash.c .

The only "murmur" in the repo is in Tools/peg_generator/data/top-pypi-packages-365-days.json as a mention of a PyPI modules.

And the linked-to Wikipedia page at https://en.wikipedia.org/wiki/MurmurHash says: "The authors of the attack recommend to use their own SipHash instead" to avoid collision attacks.

(Also, the example hash is 5478795832145536229 which is a 64-bit hash, while the Wikipedia page says MurmurHash3 generates a 32-bit or 128-bit hash value.)

> When a hash collision occurs, Python stores multiple key-value pairs in the same bucket and uses a linked list to store them.

Umm ... that isn't right either.

A linked list is an "open hashing"/"separate chaining" hash table, but Python uses "closed hashing"/"open addressing." As the linked-to Python source explains, "Open addressing is preferred over chaining since the link overhead for chaining would be substantial (100% with typical malloc overhead)."

skilled's comment about this being ChatGPT generated is flagged and dead, but ChatGPT seems good at being both wrong and confident. Just like this essay.

ealhad 3 years ago

Replying to the last sentence: the essay is also full of repetitions. I didn't make anything of it the first time I read it, but now the AI-generated hypothesis is on the table it seems quite likely.
ankit_it09OP 3 years ago

Thanks for your inputs I have fixed the article.
- eesmith 3 years ago
  
  No, you have not fixed the article.
  You've made it a different sort of wrong.
  "Another probing sequence used in Python dictionaries is quadratic probing" is contradicted by the very source code you link to.
  Compare your "but it can also cause some slots to be skipped or repeated" with dictobject.c's "repeating that 2^i times generates each int in range(2^i) exactly once (see any text on random-number generation for proof)."
  How are you coming up with all of these wrong things? Start with why you thought Python used MurmurHash3.
  - ankit_it09OP 3 years ago
    
    I wrote it as a general article on python dictionaries.
    >>Regarding Compare your "but it can also cause some slots to be skipped or repeated" - I think its correct - Let me try to explain this means that some slots may never be probed or may be probed more than once, which wastes time and space. For example, if the array size is 8 and the original index is 3, then quadratic probing will try the following indexes: 3, 4, 7, 4 (repeated), 3 (repeated), 4 (repeated), and so on. As you can see, indexes 0, 1, 2, 5, and 6 are never probed.
    And yes you are correct - Python does not use the MurmurHash3 algorithm for hashing objects in dictionaries. My bad I mixed up two different things and though it uses this mmh3 for hashing. I have updated it now and made it more like my understanding of python dictionaries.
    Thanks, for your review and pointers to fix it. Will keep it in my mind to get it reviewed before publishing any article.
    
    eesmith 3 years ago
    
    > I wrote it as a general article on python dictionaries.
    Yet as written it shows no understanding of how Python dictionaries work, and instead presents falsehoods.
    "The hash function takes in the key as input and returns a unique integer value"
    This is false. They are not unique.
    >>> hash(9) == hash(2**64+1) True
    "A third probing sequence used in Python dictionaries is double hashing"
    This ie another falsehood. Python does not use double hashing. Python objects only have a single hash.
    "For small integers, Python dictionaries use linear probing with a small array size (8 or 16). For larger integers or strings, Python dictionaries use quadratic probing with a larger array size (32 or more). For other types of keys, Python dictionaries use double hashing with a larger array size (32 or more)."
    This is another falsehood. This specialization does not exist.
    If it does exist, show me in the code where it does this.
    > more like my understanding of python dictionaries
    Your article shows no understanding of Python dictionaries. It is a patchwork of different approaches to making a hash table, almost of which are not applicable to Python. Nor does it mention the places where Python is unusual compared to most hash table, even when clearly pointed out in the code comments.
    It is easier for me to believe you are using ChatGPT (or something similar) than to believe you understand the topic.
    
    ankit_it09OP 3 years ago
    
    Yes I am newbie, don't have any understanding. Thanks, I will not write anything now. I have deleted that story. Peace.

trishmapow2 3 years ago

Note that dictionaries are officially ordered since Python 3.7.

ankit_it09OP 3 years ago

I agree

Settings

Python Dictionaries

Keyboard Shortcuts