ORACLE MASTERMINDS: Hash Joins

Wednesday, 16 January 2019

Hash Joins

HASH JOINS

Hash joins are used when two larger data sets are joined. Optimizer will look for the smaller one of two datasets and build the hash table based on the join key using a deterministic hash function to specify the location in the hash table in which to store each row. The database then scans the larger data set, probing the hash table to find the rows that meet the join condition.

When the Optimizer Considers Hash Joins

In general, the optimizer considers a hash join when a relatively large amount of data must be joined (or a large percentage of a small table must be joined), and the join is an equijoin.

A hash join is most cost effective when the smaller data set fits in memory. In this case, the cost is limited to a single read pass over the two data sets.

Because the hash table is in the PGA, Oracle Database can access rows without latching them. This technique reduces logical I/O by avoiding the necessity of repeatedly latching and reading blocks in the database buffer cache.

If the data sets do not fit in memory, then the database partitions the row sources, and the join proceeds partition by partition. This can use a lot of sort area memory, and I/O to the temporary tablespace. This method can still be the most cost effective, especially when the database uses parallel query servers.

How Hash Joins Work

A hashing algorithm takes a set of inputs and applies a deterministic hash function to generate a hash value between 1 and n, where n is the size of the hash table.

In a hash join, the input values are the join keys. The output values are indexes (slots) in an array, which is the hash table.

Hash Tables

To illustrate a hash table, assume that the database hashes hr.departments in a join of departments and employees. The join key column is department_id.

The first 5 rows of departments are as follows:

SQL> select * from departments where rownum < 6;

DEPARTMENT_ID DEPARTMENT_NAME MANAGER_ID LOCATION_ID

------------- ------------------------------ ---------- -----------

10 Administration 200 1700

20 Marketing 201 1800

30 Purchasing 114 1700

40 Human Resources 203 2400

50 Shipping 121 1500

The database applies the hash function to each department_id in the table, generating a hash value for each. For this illustration, the hash table has 5 slots (it could have more or less). Because n is 5, the possible hash values range from 1 to 5. The hash functions might generate the following values for the department IDs:

Hash Join: Basic Steps

The optimizer uses the smaller data source to build a hash table on the join key in memory, and then scans the larger table to find the joined rows.

The basic steps are as follows:

1. The database performs a full scan of the smaller data set, called the build table, and then applies a hash function to the join key in each row to build a hash table in the PGA.

In pseudocode, the algorithm might look as follows:

FOR small_table_row IN (SELECT * FROM small_table)

LOOP

slot_number := HASH(small_table_row.join_key);

INSERT_HASH_TABLE(slot_number,small_table_row);

END LOOP;

2. The database probes the second data set, called the probe table, using whichever access mechanism has the lowest cost.

Typically, the database performs a full scan of both the smaller and larger data set. The algorithm in pseudocode might look as follows:

FOR large_table_row IN (SELECT * FROM large_table)

LOOP

slot_number := HASH(large_table_row.join_key);

small_table_row = LOOKUP_HASH_TABLE(slot_number,large_table_row.join_key);

IF small_table_row FOUND

THEN

output small_table_row + large_table_row;

END IF;

END LOOP;

For each row retrieved from the larger data set, the database does the following:

a. Applies the same hash function to the join column or columns to calculate the number of the relevant slot in the hash table.

For example, to probe the hash table for department ID 30, the database applies the hash function to 30, which generates the hash value 4.

b. Probes the hash table to determine whether rows exists in the slot.

If no rows exist, then the database processes the next row in the larger data set. If rows exist, then the database proceeds to the next step.

c. Checks the join column or columns for a match. If a match occurs, then the database either reports the rows or passes them to the next step in the plan, and then processes the next row in the larger data set.

If multiple rows exist in the hash table slot, the database walks through the linked list of rows, checking each one. For example, if department 30 hashes to slot 4, then the database checks each row until it finds 30.