Hash Tables

View Code

A Hash Table is a table that uses a hash method to get the elements. A hash function will take an element and given a nearly unique value for it. Therefore it can be found in the array at or near a secific location rather than having linear searches for it.

A basic hash function is to take a numeric value and divide it modulary (return the remainder not the quotient). Now if you had a table that can hold ten items, then modular division by ten will set elements into their positions quickly. However if by chance two elements share a hash result then a colision results. The main study in hash tables is how to deal with these colisions.

Hash Table: A table that uses a hash function to position its elements
Colision: When two elements added to a hash table have the same position
Rehashing: recalculating the keys for resizing a hash table
Load Factor: number of elements in a table divided by the size of the table's array
Clustering: the tendancy for new elements to be hashed in the same location
Relatively Prime: the property that two number share no common factors except the number one
Twin Primes: two numbers that are both relatively prime and differ by two
Linear Hashing: A hashing algorithm that uses linear searches for colisions
Open Addressing: hashing that uses direct location in the table's array
Double Hashing: A hashing algorithm that uses two modular division to deal with colisions
Chain Hashing: A hash table that utilizes a linked list to deal with colisions

Linear Hashing

This is a simple algorithm for hashing and is best used with an array based table. The hash code is modular division of the size of the table. When adding an element just get the hash code. If the spot if empty put it in, otherwise just place is in the next available spot. If the end is reached then loop to the begining like with the circular queue. Removing an element is a linear search, but since it starts from where the element should be it is much quicker.

The problem with linear hashing is adding when removals a have been done. This can create situations in which the element may already be in the table. In that case the table must be seached from the hash position to the first empty space. If it already is in there then that it. If a space is reached then it is not in the table and it can then be added. Here is the general adding algorithm, followed by some example code.

Linear Hashing Adding Algorithm

Get the hash code.
Check the index of the hash value.
1. If it is empty, search the table until a empty space is reached.
  1. If the element is found return.
  2. Else add the element at the hash value
2. Else search the table for an empty space and place the value there.
  1. If the element was found before empty space return.
  2. Else add the element at the empty.

Linear Hashing Example
Value	Call	Comments
	add Dick==5	Check 5, its empty. Check until Dick or empty. The very next is empty so add Dick to 5.




Dick

	add Jane==1	Check 1, its empty. Check until jane or empty. The very next is empty so add Jane to 1.
Jane



Dick

	add Spot==5	Check 5, colision. Now check until the next open spot. If Spot was found retun, otherwise and place Spot in the empty space.
Jane



Dick
Spot
Nan	add Nan==6	Check 6, colision. Now check until the next open spot and place Nan there if she wasn't foudn first. Since searching reaches the end of the table, loop back to zero.
Jane



Dick
Spot
Nan	add Tim==4	Check 4, empty. Check until another empty space or Dick. Add appropriatly.
Jane


Tim
Dick
Spot
~~Nan~~	remove Nan==6	Check 6, not Nan. Search after 6 till Nan then mark empty.
Jane


Tim
Dick
Spot
~~Nan~~	remove Spot==5	Check 5, not Spot. Search after 5 till Spot then mark empty.
Jane


Tim
Dick
~~Spot~~
~~Nan~~	add Rex==6	Check 5, used. Search for empty space. If Rex was found before return, otherwise add to table.
Jane


Tim
Dick
~~Spot~~Rex~~~~
~~Nan~~	add Rex==5	Check 5, used. Search for empty space. Rex was found before an empty space so do nothing.
Jane


Tim
Dick
~~Spot~~Rex~~~~

Rehashing

With various elements returing the same hash codes they can be mixed up in the table, this lessens the usefulness of the table and creates a new issue, resizing. When the table resizing all the keys have to recalculated and added to the table. This is relativly expensive given the tables quick performance, O(n). Since colisions occur more often as the table gets full a rehash is done when it is filled at a certain point. The value of how much is used by a table is called the load factor. Rehashing usually occurs with the load factor is 75%.

Double Hashing

Linear hashing is useful but it can often creates clustering. Clustering is when the hash fuction tends to send the elements in the same area of the table. To prevent this double hashing gets two hash codes. The first is the index of where the value show be. The second is a skip value. Instead of just searching till the next available spot or the element itself, search every hash2-th spot after it.

This skipping however can create problesm. What if there are not available spaces in the table on the skips? This is avoided by making sure that the two numbers for modular divsion are relatively prime. Two relativly prime numbers only have one for a common factor. This does not limit the choices to prime numbers alone. Take fourteen and fifteen. Fourteen's factors are one, two, seven, and fourteen. Fifteen's are one, three, five, and fifteen. The only common factor is one, so they are relativly prime.

An optoinal performance booster caculated by computer sciencetis Donald Knuth is to have the relavtively prime numbers be twin primes. Twin primes are relatively prime numbers whose difference are two. For example fivteen and seventeen. The factor's of fifteen are one, three, five, and fifteen. Seventeen's factors are one and seventeen. The greatest common factor is one, so they are relativly prime. Seventeen minus fifteen is two, so they are also twin primes.

Here's the tracing of the same hash calls with double hashing. Notice that the size of the table, seven, allows for the use of the twin prime five and seven.


int hash1(const string& s)         //notice constance pass by reference
{                                  //& returns the memory location
   return int(&s) % theTableSize;  //mod by table size gets position
}

int hash2(const string& s)         //only used in double hashing
{                                  //one is added so there is no skip of zero
   return 1 + (int(&2) % (theTableSize-2)) 
}

Double Hashing Example
Value	Call	Comments
	add Dick==5/3	Check 5, its empty so add.




Dick

	add Jane==1/2	Check 1, its empty so add.
Jane



Dick

	add Spot==5/1	Check 5. It's used so skip by ones until an empty spot and add it there.
Jane



Dick
Spot
	add Nan==6/2	Check 6, used, so skip by two's. (6+2)-7=1, taken by Jane so skip again. 2+2=4, empty so use it.
Jane

Nan

Dick
Spot
	add Tim==4/1	4 is taken by Nan, so skip by one's until empty.
Jane

Nan
Tim
Dick
Spot
	remove Nan==6/2	Look at 6. Nan is not there so search until blank or Nan by two.
Jane

~~Nan~~
Tim
Dick
Spot
	remove Spot==5/1	Spot's not at six so keep searching until blank or Spot.
Jane

~~Nan~~
Tim
Dick
~~Spot~~
Rex	add Rex==6/2	Six is taken so search by two's until an empty space or Rex. Taken by 1, 3, 5. Finaly a spot open at 0.
Jane

~~Nan~~
Tim
Dick
~~Spot~~
Rex	add Rex==6/2	Six is taken so search by two's until an empty space or Rex. Rex is found at 0.
Jane

~~Nan~~
Tim
Dick
~~Spot~~

Chain Hashing

This type of hashing takes advantage of the objects it uses rather than an algorithm. The hash table now holds linked lists instead of single values. In order to add or remove an element just get the hash value like a linear table. Then add or remove a node with its value appropriatly at the list at that index.

Chain Hashing Example
Value	Call	Comments
NULL	add Dick==5	Add Dick at 5.
NULL
NULL
NULL
NULL
->Dick
NULL
NULL	add Jane==1	Add Jane at 1.
Jane
NULL
NULL
NULL
->Dick
NULL
NULL	add Spot==1	Add Spot at 5 after Dick.
Jane
NULL
NULL
NULL
->Dick-->Spot
NULL
NULL	add Nan==6	Add Nan at 6.
Jane
NULL
NULL
NULL
->Dick-->Spot
-->Nan
NULL	add Tim==4	Add Tim at 4.
Jane
NULL
NULL
-->Tim
->Dick-->Spot
-->Nan
NULL	remove Nan==6	Remove Nan at 6.
Jane
NULL
NULL
-->Tim
->Dick-->Spot
NULL
NULL	remove Spot==5	Remove Spot at 5. Have to search for him though.
Jane
NULL
NULL
-->Tim
->Dick
NULL
NULL	add Rex==2	Add Rex at 2.
Jane
-->Rex
NULL
-->Tim
->Dick-->Spot
NULL
NULL	add Rex==2	Rex is already at 2.
Jane
-->Rex
NULL
-->Tim
->Dick-->Spot
NULL

Performance

Each implementation of the hash table creates different performances. Donald Knuth is the computer scientist who figured all this stuff out. The average performance of a hash table is based on its implementation and is load factor, α These value are how fast a serach is for an element in a tabel that is not full and has had no removals.

α = number of elements in a table / size of the table's array

Average Hash Table Search Performance Rates
Open Addressing with Linear Probing	.5(1+(1-α)^-1)
Open Addressiong with Double Hashing	α^-1(-ln(1-α))
Chained Hashing	1+.5α

Prev -- Back to Portal -- Next