dictionary

source:
http://www.cs.berkeley.edu/~jrs/61b/lec/22

                              CS 61B: Lecture 22
                               Friday, March 15

DICTIONARIES
============
Suppose you have a set of two-letter words and their definitions.  You want to
be able to look up the definition of any word, very quickly.  The two-letter
word is the _key_ that addresses the definition.

Since there are 26 English letters, there are 26 * 26 = 676 possible two-letter
words.  To implement a dictionary, we declare an array of 676 references, all
initially set to null.  To insert a Definition into the dictionary, we define
a function hashCode() that maps each two-letter word (key) to a unique integer
between 0 and 675.  We use this integer as an index into the table, and make
the corresponding bucket (table position) a reference to the Definition object.

public class Word {
  public String word;

  public int hashCode() {
    return 26 * (word.charAt(0) - 'a') + (word.charAt(1) - 'a');
  }
}

public class WordDictionary {
  private Definition[] defTable = new Definition[676]; 

  public void insertItem(Word w, Definition d) {
    defTable[w.hashCode()] = d;
  }

  Definition findElement(Word w) {
    return defTable[w.hashCode()];
  }
}

Returning to our dictionary:  what if we want to store every English word, not
just the two-letter words?  The table "defTable" must be long enough to
accommodate pneumonoultramicroscopicsilicovolcanoconiosis, 45 letters long.
Unfortunately, declaring an array of length 26^45 is out of the question.
English has fewer than one million words, so we should be able to do better.

Hash Tables (the most common implementation of dictionaries)
-----------
Suppose n is the number of keys (words) whose definitions we want to store, and
suppose we use a table of N buckets, where N is perhaps a little larger than n,
but much smaller than the number of _possible_ keys.  A hash table maps a very
large set of possible keys into N buckets by applying a _hash_function_ to each
hash code.  The obvious hash function is

  h(hashCode) = hashCode mod N.

Hash codes are often negative, so remember that mod is not the same as Java's
"%" operator.  If you compute hashCode % N, check if the result is negative,
and add N if it is.

With this hash function, no matter how long and variegated the keys are, we can
map them into a table whose size is not much greater than the actual number of
items we want to store.  However, we've created a new problem:  several keys
are hashed to the same bucket in the table if h(hashCode1) = h(hashCode2).
This circumstance is called a _collision_.

How do we handle collisions without losing items?  We use a simple idea called
_chaining_.  Instead of having each bucket in the table reference one item, we
have it reference a linked list of items, called a _chain_.  If several keys
are mapped to the same bucket, their definitions all reside in that bucket's
linked list.

Hashing creates a second problem:  how do we know which definition corresponds
to which word?  The answer is that we must store each key in the table with its
definition.  The easiest way to do this is to have each list node store two
references:  a key (the word) and an associated element (its definition).

             |---------------------------------------------------------
defTable >   |   .   |   .   |   X   |   .   |   X   |   .   |   .   | ...
             |---|-------|---------------|---------------|-------|-----
                 v       v               v               v       v
                ---     ---             ---             ---     ---     
                |.+>pus |.+>swirl       |.+>tough       |.+>yut |.+>mud
                |.+>goo |.+>a vortex    |.+>too bad     |.+>Yu  |.+>wet dirt
                |.|     |X|             |X|             |.|     |X|
                -+-     ---             ---             -+-     ---
                 |                                       |
                 v                                       v
                ---                      ^              ---
                |.+>sin              < chains >         |.+>gigantic
                |.+>having fun                          |.+>very large
                |X|                                     |X|
                ---                                     ---

Hash tables usually support at least three operations.  An "item" constitutes
a key and its associated element.

insertItem(key, element)
  Compute the key's hash code and hash it to determine the item's bucket.
  Insert the item (key and element together) into that bucket's list.
findElement(key)
  Hash the key to determine its bucket.  Search the list for an item with the
  given key.  If found, return the corresponding element; otherwise, return
  a sentinel element named NO_SUCH_KEY.
remove(key)
  Hash the key to determine its bucket.  Search the list for an item with the
  given key.  Remove it from the list if found.  Return the associated element
  or NO_SUCH_KEY.

What if two items with the same key are inserted?  There are two approaches.
Following Goodrich and Tamassia, we can insert both, and arbitrarily return
one.  Goodrich and Tamassia also propose methods named findAllElements and
removeAll that return/remove all the elements with a given key.  The other
approach is to replace the old element with the new one, so only one item with
a given key exists in the table.  Which approach is best?  It depends on the
application.

The _load_factor_ of a hash table is n/N, where n is the number of keys in the
table and N is the number of buckets.  If the load factor stays below one, the
hash code and hash function are "good," and there are no duplicate keys, then
the linked lists are all short, and each operation takes O(1) time.  However,
if the table is too small (N << n), performance is dominated by linked list
operations and degenerates to O(n) time (albeit with a much smaller constant
factor than if you used a single linked list).  A precise analysis requires
more probability theory than you want to learn.

Hash Codes and Hash Functions
-----------------------------
Hash codes and functions are a bit of a black art.  The ideal hash code and
hash function would map each key to a uniformly distributed random integer from
zero to N-1.  (By "random", I don't mean that the function is different each
time; a given key always hashes to the same integer.  I mean that two different
keys, however similar, will hash to independently chosen values, so the
probability they'll collide is 1/N.)  This ideal is tricky to obtain.

In practice, it's easy to mess up and create far more collisions than
necessary.  Let's consider bad hash functions first.  Suppose the keys are
integers, and each integer's hash code is itself, so hashCode(i) = i.

Suppose we use the hash function h(hashCode) = hashCode mod N, and the table
size N is 10,000.  Unfortunately, for some reason our application only ever
generates keys that are divisible by 4.  A number divisible by 4 mod 10,000 is
still a number divisible by 4, so three-quarters of the buckets are never used!
We have at least four times as many collisions as necessary.

The same hash function is much better if N is prime.  With N prime, even if the
hash codes are always divisible by 4, numbers larger than N often hash to
buckets not divisible by 4, so all the buckets can be used.

For reasons I won't explain (see Goodrich and Tamassia Section 8.3.4 if you're
interested),

  h(hashCode) = ((a * hashCode + b) mod p) mod N

is a yet better hash function.  Here, a, b, and p are positive integers, p is a
large prime, and p >> N.

I recommend always using a known good hash function like the two above.
Unfortunately, it's still possible to mess up by inventing a hash code that
creates lots of conflicts even before the hash function is used.  Since hash
codes often need to be designed specially for each new object, you're left to
your own wits.  Here is an example of a good hash code for Strings.

  private static int hashCode(String key) {
    int hashVal = 0;
    for (int i = 0; i < key.length(); i++) {
      hashVal = (127 * hashVal + key.charAt(i)) % 8388593;
    }
    return hashVal;
  }

By multiplying the hash code by 127 before adding in each new character, we
make sure that each character has a different effect on the final result.  The
"%" operator with a prime number tends to "mix up the bits" of the hash code.
The prime is chosen to be large, but not so large that 127 * hashVal will ever
exceed the maximum possible value of an int.

The best way to understand good hash codes is to understand why bad hash codes
are bad.  Here are some examples of bad hash codes on Words.

  [1]  Sum up the ASCII values of the characters.  Unfortunately, the sum will
       rarely exceed 500 or so, and most of the items will be bunched up in the
       first few hundred buckets.  Moreover, anagrams like "pat," "tap," and
       "apt" will collide.
  [2]  Use the first three letters of a word, in a table with 26^3 entries.
       Unfortunately, words beginning with "pre" are much more common than
       words beginning with "xzq", and the former will be bunched up in one
       long list.  This does not approach our uniformly distributed ideal.
  [3]  Consider the "good" hashCode() function written out above.  Suppose the
       prime modulus is 127 instead of 8388593.  Then the return value is just
       the last character of the word, because (127 * hashVal) % 127 = 0.
       That's why 127 and 8388593 were chosen to have no common factors.

Why is the hashCode() function presented above good?  Because we can find no
obvious flaws, and it seems to work well in practice.  (A black art indeed.)