CSCI 103 – Homework 6

hw6

Due: See assignments page
Directory name in your github repository for this homework (case sensitive): hw6

Skeleton Code

Some skeleton code has been provided for you in the hw6 folder and has been pushed to the Github repository resources. If you already have this repository locally cloned, just perform a git pull. Otherwise you’ll need to clone it.

Written Portion

Problem 0 - Complete the Course Evalutions (0%)

Take a moment and complete the course evaluations. They should be available via Blackboard. Go to the home page (not for our course but for all of Blackboard) and find the tab in the upper left labelled “Course Evaluations”. Your feedback is appreciated and helps make the class better. If there is constructive or negative feedback, consider describing an alternative approach (solution) to the problem. Thanks!

Problem 1 - Number Thoery (30%)

To receive credit for each problem, show your work in such a way that allows the grader to recognize that you understand the relevant lecture material.

Place your answers in a file nt1.pdf.

Compute $14^{37} \mod{5}$ .
What is $\sum_{i=1}^{100} i! \mod{7}$ ?
Find the multiplicative inverse of 8 in $\mathbb{Z}_{13}$ .
Find the multiplicative inverse of 83 in $\mathbb{Z}_{191}$
Solve $58*x + 26*y \equiv gcd(58,26)$ . Show your work in such a way that allows the grader to recognize that you understand the relevant lecture material.
Prove that if p is prime and $p \equiv 1 \mod{5}$ then $p \equiv 1 \mod{10}$ .

Programming Portion

Problem 2 - Hash Table with Quadratic Probing and Performance (40%)

You will create a HashTable data structure that uses open-addressing (i.e. uses probing for collision resolution) and NOT chaining. Your table should support linear and quadratic probing via functors. Initially, we will use the std::hash<> functor provided with the C++ std library. In a future question, you will implement your own hash function for strings made of letters and digits only (no punctuation). You should complete the implementation in ht.h.

template<
    typename K, 
    typename V, 
    typename Prober = LinearProber,
    typename Hash = std::hash<K>, 
    typename KEqual = std::equal_to<K> >
class HashTable
{
    /* Implementation */
};

The template types are as follows:

K: the key type (just like a normal map)
V: the value type (just like a normal map)
Prober: a functor that supports an init() and next() function that the hash table can use to generate the probing sequence (e.g. linear, quadratic, double-hashing). A base Prober class has been written and a derived LinearProber is nearly complete. You will then write your own QuadraticProber class to implement quadratic probing.
Hash: The hash function that converts a value of type K to a std::size_t which we have typedef’d as HASH_INDEX_T.
KEqual: A functor that takes two K type objects and should return true if the two objects are equal and false, otherwise.

Internally, we will use a hash table of pointers to HashItems, which have the following members:

typedef std::pair<KeyType, ValueType> ItemType;
struct HashItem {
        ItemType item;  // key, value pair
        bool deleted;
};

You should allocate a HashItem when inserting and free them at an appropriate time based on the description below. If no location is available to insert the item, you should throw std::logic_error.

We have also provided a constant array of prime sizes that can be used, in turn, when resizing and rehashing is necessary. This sequence is: 11, 23, 47, 97, 197, 397, 797, 1597, 3203, 6421, 12853, 25717, 51437, 102877, 205759, 411527, 823117, 1646237, 3292489, 6584983, 13169977, 26339969, 52679969, 105359969, 210719881, 421439783, 842879579, 1685759167. Thus, when a HashTable is constructed it should start with a size of 11. The client will supply a loading factor, alpha, at construction, and before inserting any new element, you should resize (to the next larger size in the list above) if the current loading factor is at or above the given alpha.

For removal of keys, we will use a deferred strategy of simply marking an item as “deleted” using a bool value. These values should not be considered when calls are made to find or operator[] but should continue to count toward the loading factor until the next resize/rehash. At that point, when you resize, you will only rehash the non-deleted items and (permanently) remove the deleted items.

We will not ask you to implement an iterator for the hash table. So find() will simply return a pointer to the key,value pair if it exists and nullptr if it does not.

To facilitate tracking relevant statistics we will use for performance measurements, we have provided the core probing routine: probe(key). This routine, applies the hash function to get a starting index, then initializes the prober and repeatedly calls next() until it finds the desired key, reaches a hashtable location that contains nullptr (where a new item may be inserted if desired), or reaches its upper limit of calls (i.e. cannot find a null location). probe() utilizes the Prober. Prober::init is called to give the prober all relevant info that it may need, regardless of the probing strategy (i.e. starting index, table size, etc.). Then a sequence of calls to Prober::next() will ensue. If, for example we are using linear probing, the first call to next() would yield h(k), the subsequent call to next() would yield h(k)+1, the third call would yield h(k)+2, etc. Notice: the first call to next() just returns the h(k) value passed to Prober::init(). As probing progresses we will update statistics such as the total number of probes. Some accessor and mutator functions are provided to access those statistics.

Note: When probing to insert a new key, we could stop on a location marked as deleted and simply overwrite it with the key to be inserted. However, because our design uses the same probe(key) function for all three operations: insert, find, and remove, and certainly for find and remove we would NOT want to stop on a deleted item, but instead keep probing to find the desired key, we will simply take the more inefficient probing approach of not reusing locations marked for deletion when probing for insertion.

A debug function, reportAll is also provided to print out each key value pair. Use this when necessary to help you debug your code.

Probers

We have abstracted the probing sequence so various strategies may be implemented without the hash table implementation being modified. Complete the LinearProber and then write a similar QuadraticProber that uses quadratic probing. Recall that if a prime table size is used, insertion using quadratic probing is only guaranteed to find a free location if the loading factor is less than or equal to m/2. Thus after m/2 probes, return npos.

Testing Your Hash Table

A simple test driver program ht-test.cpp has been provided which you may modify to test various aspects of your hash table implementation. We will not grade this file so use it as you please.

Problem 3 - Hash Table Analysis (10%)

Hash Table Analysis

You will complete a program that will measure the average time it takes to fill a map (your hashtable with various probing strategies and std::map) with varying number of strings and then attempt to find each string from the file in the table/map.

Complete the program ht-perf in ht-perf.cpp that will be run as follows:

./ht-perf [input] p alpha reps

Where

p is the probing strategy to use (1 = LINEAR, 2 = QUADRATIC)
alpha is the loading factor to pass to the hash table constructor
reps is the number of repetitions of the test to perform (and average over) for the performance/timing anlaysis

The program current defines two templated test functions, one for your hash table (templated so you can pass in hash tables with various probing strategies or hash functions) and one for std::map. These are COMPLETE and should not be altered. Instead, your primary task is twofold:

Modify the code to repeat the calls to the doTest functions reps number of times and passing in a newly allocated map/hash table each call.
Add code to time how long the reps calls to the function took to execute and find the average time per call. See below for an example of how to time a sequence of code.
For your hash table tests, also save the statistics of how many probes were performed on the LAST call to the test function.

We have written two, complete helper functions. The first is to parse the command line arguments and the second is to parse the words from the input file. These need not be altered and already called from main().

We will guarantee the following for all input test cases we use on your program:

You will never need more than 1685759167 indices in your hash table.
All input words will be no longer than 28 legal characters.

Timing Your Code

To time your code, you can use the technique shown in the following program:

#include <iostream>
#include <ctime>

int main() {
    clock_t start;
    double duration;

    /* Preprocessing here that you don't want to time */

    start = clock();

    /* Code to time goes here */

    duration = ( clock() - start ) / (double) CLOCKS_PER_SEC;

    std::cout << duration/r << endl;
}

As a short explanation, the clock() returns the number of clock ticks from a fixed point in time. By getting a timestamp before and after your code runs and looking at the difference, the duration can be inferred. To relate that value to wall-clock time we can convert to seconds using the constant CLOCKS_PER_SEC.

Test Cases

We have provide a Python script, genWords.py to generate files of random strings. You can run it on Docker via:

python3 genWords.py <num-words> <output-filename>

So python3 genWords.py 100000 words-1e5.txt would output 100,000 random strings to the file words-1e5.txt. For testing, we have provided 4 files of size 1E4, 2E4, 4E4, and 8E4. You can generate your own if you like. PLEASE DO NOT ADD/COMMIT THESE FILES TO YOUR GIT REPO as they are large.

What to turn in

When you are sure your code is working (try to test it on small samples), run the experiments described below and submit the results in a file hash-analysis.txt. That file should contain:

A table showing your performance for average runtime for a loading factor of 0.5 and 10 repetitions (for averaging purposes)

            Hash Table (Linear) | std::map
1E4 words
2E4 words
4E4 words
8E4 words

A table comparing different probing schemes for various loading factors using the 8E4 words input case. While we should generally not put the loading factor above 0.5 for quadratic probing, you can try it at higher values for this exercise. If it crashes you can try to generate a different input file. Make a table for average time and also for number of probes.

Resize Loading Factor   | Hash Table Avg. Time (Linear) | Hash Table Avg. Time (Quadratic) | 
2
4
5
6
8

Resize Loading Factor   | Hash Table Num. of Probes (Linear) | Hash Table Num. of Probes (Quadratic) | 
2
4
5
6
8

Explain why you think the results turned out the way they did, and whether you were surprised by them. What aspects were suprising or unexpected.

NOTE 1: You may find that linear and quadratic probing perform roughly the same in terms of time but quadratic probing performs many less probes. In CS 356 you will see that due to caching and other hardware techniques time is not necessarily linearly related with the number of operations that must be performed. Suppose we asked you to buy 50 items at the grocery store but I ordered them randomly vs. listed them in the order you would find them as you go up and down each successive aisle. You got the same number of items but having them in successive order allows for faster access. Linear access in computer memory is extremely fast while taking larger jumps to new memory locations is slower, even for the same number of accesses.

NOTE 2: This portion of the assignment will be graded with human eyes, so you just need to make sure you record and report the data accurately for your implementation and system on which you are running.

Problem 4 - Write Your Own String Hash Function (10%)

In this problem you will write your own hash function for std::string in the provided hash.h file. While std::strings can be of arbitrary length and contain any legal ASCII character, we will assume the largest string you will ever receive is 28 letters long (e.g. antidisestablishmentarianism) and only contains letters and digits (‘0’-‘9’). For the letters, you should treat upper-case letters the same as you would lower-case letters.

The Approach

The basic approach will be to treat groups of letters/digits as base-36 and convert to an integer value (i.e. the decimal equivalent).

First translate each letter into a value between 0 and 35, where a=0, b=1, c=2, …., z=25 and digit '0'=26, … '9'=35. Be sure to convert an upper case letter to lower case before performing this mapping. Next you will translate a (sub)string of 6 letters a1 a2 a3 a4 a5 a6 into an (unsigned long long) 64-bit integer w[i] (essentially converting from base-36 to decimal), via the following mathematical formula:

$w[i] = 36^5*a1 + 36^4*a2 + 36^3*a3 + 36^2*a4 + 36*a5 + a6$

Place zeros in the leading positions if your (sub)string is shorter than 6 letters. So 104 would set a1, a2, and a3 all to zero and yield $36^2*27 + 36*26 + 30$ . You should use the base conversion approach taught in class to avoid repeated calls to pow(). Note that indexing is a bit tricky here so think carefully. The digit at the “end” of the string is assigned the power 36^0, and the 2nd to last letter is worth 36^1, etc.

If an input word is longer than 6 letters, then you should first do the above process for the last 6 letters in the word, then repeat the process for each previous group of 6 letters. Recall, you will never receive a word longer than 28 characters. The last group may not have 6 letters in which case you would treat it as a substring of less than 6 characters as described above. Since we have at most 28 characters this process should result in a sequence of no more than 5 integers: w[0] w[1] w[2] w[3] w[4], where w[4] was produced by the last 6 letters of the word.

Store these values in an array (of unsigned long long). Place zeros in the leading positions of w[i] if the string does not contain enough characters to make use of those values. So for a string of 12 letters, w[0], w[1], and w[2] would all be 0 and only w[3] and w[4] would be (potentially) non-zero.

We will now hash the string. Use the following formula to produce and return the final hash result using the formula below (where m is the size of the hashtable and the r values are explained below).

$h(k) = (r[0]*w[0] + r[1]*w[1] + r[2]*w[2] + r[3]*w[3] + r[4]*w[4]) % m$

where the $r[i]$ values are either:

5 random numbers created when you instantiate the hash function using the current time as your seed (i.e. srand(time(0))). Be sure to #include <ctime> and #include <cstdlib>.
5 preset values for debugging (so we all get the same hash results): r[0]=983132572, r[1]=62337998, r[2]=552714139, r[3]=984953261, r[4]=261934300 (these numbers were generated by random.org)

The constructor of your hash object will take a debug parameter, which, if true should set the r values to the above debugging values, and if false should select the ranomized values.

Note: Make sure you compute using unsigned long long variables or cast (constants or other variables) to that type at the appropriate places so that you don’t have an overflow error.

Testing

We have provided a simple test program, str-hash-test.cpp where you can provide a string on the command line and it will hash the string and output the hash result. We have provided a compiled solution version of this program that is compiled with our solution hash function using the debug r values. You may run it on an x86 based laptop running Docker: ./str-hash-test-sol <string>. It includes some additional debug output that shows the w[i] values computed.

Below is the output for a few strings:

./str-hash-test abc

yields

w[0] = 0
w[1] = 0
w[2] = 0
w[3] = 0
w[4] = 38
h(abc)=9953503400

./str-hash-test abc123

yields

w[0] = 0
w[1] = 0
w[2] = 0
w[3] = 0
w[4] = 1808957
h(abc123)=473827885525100

./str-hash-test antidisestablishmentarianism

yields

w[0] = 17540
w[1] = 195681115
w[2] = 2203855
w[3] = 732943745
w[4] = 484346964
h(antidisestablishmentarianism)=1137429692708383810

You may want to go back and run some of the performance tests but now using your hash function rather than std::hash and ensure you still get reasonable performance without errors or faults. You will likely notice some slowdown using this hash function which is not optimized compared to std::hash.

Problem 5 - Lost Pairs (10%)

Suppose you are presented with a target integer and a file containing a list of integers. The majority of the numbers in the file are pairs of numbers that add up to the target value. However, the integers in the file are not sorted or organized in any way, and some numbers do NOT have a pair that adds up to the target (i.e. they are “pairless”) . Your task is to output how many numbers lack a corresponding number (i.e. pair) that sums to the target AND you must do is time Theta(n) where n is the number of integers in the file.

Requirement: The only data structure (beyond IO and file streams) you may use are: std::vector, std::set, std::map, or your hash table implementation from earlier, but remember you must meet the Theta(n) runtime.

Note: If you use your HashTable, do so using quadratic probing, a loading factor of 0.4, and std::hash<int> for your hash algorithm.

Your Makefile should have a target pairless that produces an executable also named pairless and is built with the all target.

Assume an input file with the following contents and a target of 20:

We will expect to run your program at the command line as:

./pairless pairless1.txt 20

The should simply output the count of how many numbers did not have pair that summed to the target with no other text.

Since 6, 7, and 16 do not have a pairing number that sums to 20.

WAIT You aren’t done yet. Complete the last section below to ensure you’ve committed all your code.

Commit then Re-clone your Repository

Be sure to add, commit, and push your code in your hw6 directory to your hw-usc-username repository. Now double-check what you’ve committed, by following the directions below (failure to do so may result in point deductions):

Go to your home directory: $ cd ~
Create a verify directory: $ mkdir verify
Go into that directory: $ cd verify
Clone your hw_username repo: $ git clone git@github.com:usc-csci103/hw-usc-username.git
Go into your hw4 folder $ cd hw-username/hw6
Recompile and rerun your programs and tests to ensure that what you submitted works.