Data Structures and Algorithms
with Object-Oriented Design Patterns in Python

Strings of characters are represented in Python as instances
of the `str` class.
A character string is simply a sequence of characters.
Since such a sequence may be arbitrarily long,
to devise a suitable hash function
we must find a mapping from an unbounded domain
into the finite range of a 32-bit integer.

We can view a character string, *s*,
as a sequence of *n* characters,

$$\{s_0, s_1, \ldots, s_{n-1}\},$$

where *n* is the length of the string.
(The length of a string can be determined using
the Python built-in function `len`.)
One very simple way to hash such a string would be to
sum the numeric values associated with each character:

$$f(s) = \sum_{i=0}^{n-1} s_i$$

As it turns out,
this is not a particularly good way to hash character strings.
Assuming that the integer value of each character is an 8-bit quantity
(as in an 8-bit Python string),
$0 \le s_i \le 2^8 - 1$, for all $0 \le i < n$.
As a result, $0 \le f(s) \le n(2^8 - 1)$.
For example,
given a string of length *n*=5,
the value of *f*(*s*) falls between zero and 1275.
In fact, the situation is even worse:
in North America we typically use only
the *ASCII* character set.
The ASCII character set uses only the least-significant seven bits
of each character.
If the string is composed of only ASCII characters,
the result falls in the range between zero and about 640
(strictly, at most $5 \times 127 = 635$).
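
The narrow range of the additive hash can be illustrated with a minimal sketch. The names `additive_hash` and `max_additive_hash` are illustrative and do not come from the text, and the sketch assumes that every character code fits in the stated number of bits.

```python
def additive_hash(s):
    """Sum the integer codes of the characters in s."""
    return sum(ord(c) for c in s)


def max_additive_hash(n, bits):
    """Largest possible sum for n characters of the given bit width."""
    return n * (2 ** bits - 1)


print(additive_hash("hello"))     # 532
print(max_additive_hash(5, 8))    # 1275, the bound for 8-bit characters
print(max_additive_hash(5, 7))    # 635, the bound for 7-bit ASCII characters
```

For seven-bit characters the exact bound is 5 × 127 = 635, slightly below the round figure of 640.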

Essentially, the problem with a function *f* which produces a result
in a relatively small interval
arises when that function is composed with
the function $g(x) = x \bmod M$.
If the size of the range of the function *f* is less than *M*,
then $h = g \circ f$ does not spread its values uniformly
on the interval [0,*M*-1].
For example, if *M*=1031 only the first 640 values
(62% of the range) are used!
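
As a rough check of this figure, the following sketch enumerates every sum that five 7-bit character codes can produce and counts how many table positions those sums reach after reduction modulo *M*. The function name `reachable_buckets` is hypothetical, and the sketch assumes the additive hash described above.

```python
def reachable_buckets(n, bits, M):
    """Return the set of (sum of n character codes) mod M that can occur."""
    sums = {0}
    for _ in range(n):
        # Extend every partial sum by every possible character code.
        sums = {total + code for total in sums for code in range(2 ** bits)}
    return {total % M for total in sums}


M = 1031
buckets = reachable_buckets(5, 7, M)
print(len(buckets), len(buckets) / M)   # 636 positions, about 0.62 of the table
```

Every reachable position lies at the low end of the table; the remaining positions can never be selected by a five-character ASCII key.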

Alternatively,
suppose we have *a priori* knowledge
that character strings are limited to length *n*=4.
Then, we can construct an integer by concatenating
the binary representations of each of the characters.
For example, given $s = \{s_0, s_1, s_2, s_3\}$,
we can construct an integer with the function

$$f(s) = s_0 B^3 + s_1 B^2 + s_2 B + s_3$$

where $B = 2^7$.
Since *B* is a power of two,
this function is easy to write in Python:

    def f(s):
        # Pack four 7-bit character codes into one 28-bit integer.
        return (ord(s[0]) << 21 | ord(s[1]) << 14
                | ord(s[2]) << 7 | ord(s[3]))

While this function certainly has a larger range, it still has a problem: it cannot deal with strings of arbitrary length.
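
To see that the shifts really do evaluate the polynomial in *B*, the sketch below computes the value both ways and compares them. The names `f_polynomial` and `f_shifts` are illustrative, and the sketch assumes four-character strings of 7-bit ASCII codes.

```python
B = 2 ** 7  # assumed radix: seven bits per ASCII character


def f_polynomial(s):
    """Evaluate s0*B^3 + s1*B^2 + s2*B + s3 for a four-character string."""
    s0, s1, s2, s3 = (ord(c) for c in s)
    return s0 * B ** 3 + s1 * B ** 2 + s2 * B + s3


def f_shifts(s):
    """Same value, built with shifts and bitwise or."""
    return ord(s[0]) << 21 | ord(s[1]) << 14 | ord(s[2]) << 7 | ord(s[3])


for word in ("abcd", "temp", "fyra"):
    assert f_polynomial(word) == f_shifts(word)
    assert 0 <= f_shifts(word) < 2 ** 28   # four 7-bit fields
print(f_shifts("abcd"))
```

The two forms agree because the four 7-bit fields never overlap, so bitwise or and addition give the same result.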

The equation above can be generalized to deal with strings of arbitrary length as follows:

$$f(s) = \sum_{i=0}^{n-1} B^{n-i-1} s_i$$

This function produces a unique integer for every possible string.
Unfortunately, the range of *f*(*s*) is unbounded.
A simple modification of this algorithm suffices to bound the range:

$$f(s) = \left(\sum_{i=0}^{n-1} B^{n-i-1} s_i\right) \bmod W$$

where $W = 2^w$ such that *w* is the word size of the machine.
Unfortunately, since *W* and *B* are both powers of two,
the value computed by this hash function depends only on
roughly the last *w*/*b* characters in the character string, where $B = 2^b$.
For example, for $W = 2^{32}$ and $B = 2^7$,
this result depends only on the last five characters in the string: all character strings having exactly the same last five characters collide!
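
The collision can be reproduced with a short sketch that evaluates the polynomial directly and reduces it modulo *W*. The name `bounded_hash` and the sample strings are illustrative, and a 32-bit word is assumed.

```python
B = 2 ** 7    # radix from the text: seven bits per character
W = 2 ** 32   # assumed 32-bit word


def bounded_hash(s):
    """Evaluate the polynomial in B with coefficients s[i], reduced mod W."""
    n = len(s)
    total = sum(ord(c) * B ** (n - i - 1) for i, c in enumerate(s))
    return total % W


# Different prefixes, identical last five characters.
print(bounded_hash("alpha-hello") == bounded_hash("omega-hello"))  # True
print(bounded_hash("hello") == bounded_hash("jello"))              # False
```

Every character six or more positions from the end is multiplied by a power of *B* that is a multiple of *W*, so its contribution vanishes in the reduction.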

Writing the code to compute this function is actually
quite straightforward if we realize that *f*(*s*)
can be viewed as a polynomial in *B*,
the coefficients of which are $s_0$, $s_1$, ..., $s_{n-1}$.
Therefore, we can use *Horner's rule*
(see Section ) to compute *f*(*s*) as follows:

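    B = 2 ** 7  # radix: seven bits per ASCII character

    def f(s):
        result = 0
        for c in s:
            # Horner's rule: multiply the running value by B, then add s_i.
            # Python integers do not overflow, so reduce the result
            # modulo W if a bounded value is required.
            result = result * B + ord(c)
        return result

This implementation can be simplified even further if we make use of the fact that $B = 2^b$, where $b = 7$.
Multiplying by *B* is then a left shift by *b* bits, and because the low *b* bits of the shifted value are zero, a 7-bit character code can be combined with an exclusive or:

    b = 7  # B = 2 ** b

    def f(s):
        result = 0
        for c in s:
            result = result << b ^ ord(c)
        return result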
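
The following sketch checks that the multiply-and-add form and the shift-and-xor form agree. The names `f_multiply` and `f_shift` are illustrative, and *b* = 7 is assumed. The last line shows the hidden assumption: the two forms agree only while every character code is smaller than *B*.

```python
b = 7
B = 2 ** b


def f_multiply(s):
    """Horner's rule with multiplication and addition."""
    result = 0
    for c in s:
        result = result * B + ord(c)
    return result


def f_shift(s):
    """Horner's rule with a left shift and an exclusive or."""
    result = 0
    for c in s:
        result = (result << b) ^ ord(c)
    return result


for word in ("", "a", "temp1", "ece.uwaterloo.ca"):
    assert f_multiply(word) == f_shift(word)

# A code of 2**b or more overlaps the shifted bits, so xor and add differ.
print(f_multiply("a\u00e9") == f_shift("a\u00e9"))   # False: ord("\u00e9") is 233
```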

Of the 128 characters in the 7-bit ASCII character set, only 97 characters are printing characters including the space, tab, and newline characters (see Appendix ). The remaining characters are control characters which, depending on the application, rarely occur in strings. Furthermore, if we assume that letters and digits are the most common characters in strings, then only 62 of the 128 ASCII codes are used frequently. Notice that the letters (both upper and lower case) all fall between $101_8$ and $172_8$. All the information is in the least significant six bits. Similarly, the digits fall between $060_8$ and $071_8$; these differ in the least significant four bits. These observations suggest that using $B = 2^6$ should work well. That is, for $W = 2^{32}$, the hash value depends on the last five characters plus two bits of the sixth-last character.
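
These observations about the ASCII codes are easy to confirm. The sketch below uses only the standard `string` module; the variable names are illustrative.

```python
import string

letters = string.ascii_letters   # 'a'..'z' followed by 'A'..'Z'
digits = string.digits           # '0'..'9'

print(oct(min(map(ord, letters))), oct(max(map(ord, letters))))  # 0o101 0o172
print(oct(min(map(ord, digits))), oct(max(map(ord, digits))))    # 0o60 0o71

# Keeping only the low six bits still separates all 52 letters ...
low6 = {ord(c) & 0o77 for c in letters}
# ... and the low four bits separate the ten digits.
low4 = {ord(c) & 0o17 for c in digits}
print(len(low6), len(low4))   # 52 10
```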

We have developed a hashing scheme which works quite well
given strings which differ in the trailing letters.
For example, the strings `"temp1"`, `"temp2"`, and `"temp3"`,
all produce different hash values.
However, in certain applications the strings differ in the leading letters.
For example, the two *Internet domain names*
`"ece.uwaterloo.ca"` and `"cs.uwaterloo.ca"` collide
when using the bounded hash function above.
Essentially, the effect of the characters that differ is lost
because the corresponding bits have been shifted out of the hash value.
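
Both behaviours can be reproduced with a small sketch of the shift-and-xor hash, truncated to one word. The name `truncated_hash` is illustrative; *b* = 6 follows the text, and a 32-bit word is assumed.

```python
b = 6          # shift suggested in the text
W = 2 ** 32    # assumed 32-bit word


def truncated_hash(s):
    """Shift-and-xor hash whose high-order bits are discarded mod W."""
    result = 0
    for c in s:
        result = ((result << b) ^ ord(c)) % W
    return result


names = ["temp1", "temp2", "temp3"]
print(len({truncated_hash(name) for name in names}))   # 3 distinct values
print(truncated_hash("ece.uwaterloo.ca")
      == truncated_hash("cs.uwaterloo.ca"))            # True: a collision
```

The two domain names share their last thirteen characters, far more than the five-and-a-bit characters that survive the truncation.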

**Program:** `String` class `__hash__` method.

This suggests a final modification,
which is shown in the program captioned above.
Instead of losing the *b*=6 most significant bits
when the variable `result` is shifted left,
we retain those bits and *exclusive or* them back into
the shifted `result` variable.
Using this approach,
the two strings `"ece.uwaterloo.ca"` and `"cs.uwaterloo.ca"`
produce different hash values.
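
The program listing itself is not reproduced here, so the following is only a sketch of the idea described in this paragraph, not the book's `__hash__` method. The names `string_hash`, `SHIFT`, `WORD_BITS`, `WORD_MASK` and `HIGH_MASK` are hypothetical, and a 32-bit word is assumed.

```python
SHIFT = 6                                   # b in the text
WORD_BITS = 32                              # assumed word size
WORD_MASK = (1 << WORD_BITS) - 1
HIGH_MASK = ((1 << SHIFT) - 1) << (WORD_BITS - SHIFT)   # top SHIFT bits


def string_hash(s):
    """Shift-and-xor hash that folds the bits about to be lost back in."""
    result = 0
    for c in s:
        kept = result & HIGH_MASK                 # bits the shift would discard
        shifted = (result << SHIFT) & WORD_MASK   # shift within one word
        result = kept ^ shifted ^ ord(c)
    return result


print(string_hash("ece.uwaterloo.ca")
      != string_hash("cs.uwaterloo.ca"))    # True: no longer a collision
```

Because the retained bits stay in the top of the word, every character in the string continues to influence the final value, however long the string is.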

Table lists a number of different
character strings together with the hash values obtained
using Program .
For example, to hash the string `"fyra"`,
the following computation is performed
(all numbers in octal):

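| character | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|
| `f` | 1 | 4 | 6 | | | | | | |
| `y` | | | 1 | 7 | 1 | | | | |
| `r` | | | | | 1 | 6 | 2 | | |
| `a` | | | | | | | 1 | 4 | 1 |
| hash | 1 | 4 | 7 | 7 | 0 | 6 | 3 | 4 | 1 |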
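
The worked computation can be replayed in a few lines. This sketch assumes a shift of six bits per character and relies on the fact that a four-character string never reaches the top six bits of a 32-bit word, so no bits need to be folded back.

```python
chars = "fyra"
shift = 6

total = 0
for i, c in enumerate(chars):
    # Each character sits six bits (two octal digits) above the next one.
    term = ord(c) << (shift * (len(chars) - 1 - i))
    print(f"{c}: {term:09o}")
    total ^= term
print(f"   {total:09o}")   # 147706341
```

Adjacent characters overlap by one octal digit, which is why the third, fifth and seventh digits of the result are the exclusive or of two input digits.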

Copyright © 2003, 2004 by Bruno R. Preiss, P.Eng. All rights reserved.