Termipankki
  1. A
    1. API
    2. ASAN
    3. Access specifiers
  2. C
    1. Constructor
    2. const
  3. G
    1. GDB
    2. g++
  4. M
    1. Make
    2. Memory Leak
  5. P
    1. Pointer
  6. R
    1. Recursion
  7. S
    1. std
  8. V
    1. Valgrind
Ratkaistu: / tehtävää

LAB_HASH: Clash of the Hashes!

Due date: 2024-04-29 23:59.

Assignment Description

In this lab you will be implementing functions on hash tables with two different collision resolution strategies - separate chaining and linear probing. These hash tables serve an implementation of the dictionary abstract data type. In the second part of the lab you will use the dictionary to solve some problems.
Note that there are a LOT of files in this lab - don't get too intimidated as you won't be modifying the vast majority of them.

Notes About list Iterators

When you are working with the Separate Chaining Hash Table, you will need to iterate over the linked list of a given bucket. Since the hash tables are templatized, however, this causes us a slight headache syntactically in C++. To define a list iterator on a given bucket, you will need to declare it as follows:
typename list< pair<K,V> >::iterator it = table[i].begin();
Alternatively you can just use auto and have the compiler worry about the type:
auto it = table[i].begin()
Even then, it's always good to give some thought to what type it is exactly.
Also note that if you do not insert your data at the head of the bucket, you may end up with your results being in a different order than ours. This is fine, but if you want to diff with the solution output (in the soln) folder, you will want to ensure that you always insert your data at the head of the list in each respective bucket.
Warning: If using the list::erase() function, be advised that if you erase the element pointed to by an iterator that the parameter iterator is no longer valid. For instance:
auto  it = table[i].begin();
table[i].erase(it);
it++;
is invalid because it is invalidated after the call to erase(). So, if you are looping with an iterator, remember a break statement after you call erase()!

Checking Out The Code

You can get the files from here: lab_hash.zip
If you look at the files in that folder, you may think it's a lot of files, but don't worry! You will only modify a few files, so just follow the instructions and it'll be fine.
In your directory there should be two folders as well: data and soln. The data folder contains text files used in the applications later, and the soln folder contains sample output to diff your code's output against.
Note: If you want to speed up compile time on a make all, try using make -jN all, where N is an integer (say 4 for instance).

remove and resizeTable on a Separate Chaining Hash Table

Open your schashtable.cpp. In this file, the above two functions have not been implemented - your job is to implement them.

remove

resizeTable

insert on a Linear Probing Hash Table

Open your lphashtable.cpp. In this file, you will be implementing the insert function.
You MUST handle collisions in your insert function, or your hash table will be broken!

Test Your Code Using charcount

CharFreq has already been implemented for you.
Type the following into the terminal:
make charcount
This will make the charcount executable. This file takes two arguments: the first is the file to analyze, and the second is the frequency for which to return characters. For example, a run on data/sherlock_holmes.txt for characters with frequency greater or equal to 10000 might look like:
$ ./charcount data/sherlock_holmes.txt 10000 schash
Finding chars in data/sherlock_holmes.txt with frequency >= 10000...
a 35944
c 10847
d 18981
e 54459
h 29449
i 30921
l 17548
m 12060
n 29436
o 34570
r 25298
s 27805
t 40038
u 13495
w 11432
Note: You should verify that your solution outputs the right values for both kinds of hash tables!! You can test each one by changing the third command line argument to "schash" or "lphash", respectively.

Implement the WordFreq Class

Now we will write our own application for a hash table! Open word_counter.cpp. You will be writing the getWords function. This function, given an integer threshold, should return a vector of string, int pairs. Each of these pairs is a string and its frequency in the file. Your vector should only contain those pairs whose frequencies are greater than or equal to threshold. Look at char_counter.cpp if you are stuck. You may find the TextFile class defined in textfile.h helpful.

Test Using the wordcount Executable

When you're done with the above, type the following into your terminal:
make wordcount
This will build the wordcount executable, which uses the WordFreq class to determine word frequencies in a file. It, like charcount, takes two arguments: the first is the path to the file to be analyzed, and the second is the threshold for which to grab words. If a word occurs >= threshold times, it will be printed to the console.
For example, a run of wordcount on data/metamorphoses.txt with a frequency 1000 might look like:
$ ./wordcount data/metamorphoses.txt 1000 lchash
$ Finding words in data/metamorphoses.txt with frequency >= 1000 using LPHashTable...
a 2246
by 1185
he 1241
in 1989
is 1306
of 5753
to 3078
that 1548
the 10473
was 1706
and 4739
with 1456
her 1500
his 1572
You should verify your solution works with both kinds of hash tables.

Implement the AnagramFinder Class

Finding anagrams is another clever use of hash tables. Open anagram_finder.cpp. You will be writing the checkWord function, which simply returns a boolean value. It should return true if the string test is an anagram of word, and false otherwise.
An anagram of a given word is a word which has the same letters. For example, "retinas" and "stainer" are anagrams of each other. Think carefully about what it means for a word to be an anagram of another. Is there something we can use a hash table to keep track of? Does counting letter frequency help you?

Test Using the anagramtest Executable

When you're done with the above, type the following into your terminal:
make anagramtest
This will make the anagramtest executable, which you can use to test your AnagramFinder class. It can be run with or without arguments. Without arguments will test a very simplistic example - with arguments is more interesting.
anagramtest can also take two arguments: the first is the path to the file to be analyzed, and the second is the word to find anagrams for. For example, a run of anagramtest on data/words.txt looking for anagrams of retinas might look like:
$ ./anagramtest data/words.txt retinas schash
Checking file data/words.txt for anagrams of retinas...
anestri is an anagram of retinas
asterin is an anagram of retinas
eranist is an anagram of retinas
nastier is an anagram of retinas
ratines is an anagram of retinas
resiant is an anagram of retinas
restain is an anagram of retinas
retains is an anagram of retinas
retinas is an anagram of retinas
retsina is an anagram of retinas
sainter is an anagram of retinas
stainer is an anagram of retinas
starnie is an anagram of retinas
stearin is an anagram of retinas
You should verify your solution works with both kinds of hash tables to verify their correctness as well.

Implement the LogfileParser Class

This task is a bit more abstract, but is highly applicable for companies. Given a logfile with the following content:
[customer_name]:    url    date
We want to answer the following questions (quickly!):
In order to do so, you will have to parse the logfile and do something with hash tables. The parsing occurs in the constructor forthe LogfileParser and the querying happens in the other two functions. You should design your constructor so that you can run the two helper functions as quickly as possible (that is, I should be able to do n queries on your LogfileParser in O(1) time per query after the constructor has finished).
Hint: You should not need to define a new hash function for this problem! Think about how to construct a key based on what data you have.
You are safe to assume that customers have distinct names, and that URLs always follow the pattern /product/XX/ where XX is an integer. Products (and thus URLs) have distinct integer IDs.

Test Your LogfileParser with lfparse

When you are done with the above, type the following into your terminal:
make lfparse
This will build the lfparse executable, which will test your LogfileParser class. It takes a single argument: the data file to be processed. For example, a run of lfparse on data/log2.txt might look like:
$ ./lfparse data/log2.txt
Parsing logfile: data/log2.txt...
Number of unique URLs: 10
Printing unique URLs:
        /product/3/
        /product/2/
        /product/9/
        /product/6/
        /product/7/
        /product/4/
        /product/5/
        /product/1/
        /product/8/
        /product/0/
Running sample visited queries...
        chase visited /product/0/ on Mon Apr 11 15:05:56 2011
        chase visited /product/1/ on Sat Apr  9 15:54:36 2011
        chase visited /product/2/ on Tue Apr 12 15:52:58 2011
        chase visited /product/3/ on Tue Apr 12 15:48:33 2011
        chase visited /product/4/ on Tue Apr 12 14:21:47 2011
        chase visited /product/5/ on Sat Apr  9 14:30:12 2011
        chase visited /product/6/ on Mon Apr 11 16:07:27 2011
        chase visited /product/7/ on Tue Apr 12 15:20:48 2011
        chase visited /product/8/ on Mon Apr 11 14:31:36 2011
        chase visited /product/9/ on Tue Apr 12 15:04:12 2011
You should verify your solution works for both kinds of hash tables to verify their correctness as well.

genlog More Logfiles!

If you would like, you can also generate more random logfiles to throw at your lfparse executable. To do so, type the following into your terminal:
make genlog
genlog takes three arguments: filename, products, and lines (in that order).
Note: Don't accidentally delete your data/log.txt or data/log2.txt!

Good luck!

Hash

Test that the following programs work as expected:
  • charcount
  • wordcount
  • anagramtest
  • lfparser
Randomized input will be used! This includes creating a fake anagram. Both SCHash and LPHash will be tested. Remember to also ensure that ASAN does not report memory leaks!

Sallitut tiedostojen nimet lphashtable.cpp, schashtable.cpp, word_counter.cpp, anagram_finder.cpp, logfile_parser.cpp

Varoitus: Et ole kirjautunut sisään. Et voi vastata.

Acknowledgments

We would like to express our gratitude to prof. Cinda Heeren and the student staff of the UIUC CS Data Structures course for creating and sharing the programming exercise materials that have served as the basis for the labs used in this course.
Revised by: Elmeri Uotila and Anna LaValle
?
API stands for Application Programming Interface. In the context of this course, you can consider the header files to be such interfaces, as they determine what class functions and properties are public and thus accessible from anywhere.
AddressSanitizer (ASAN) is a memory error detector for C/C++. In this course, the makefiles will typically compile an executable that uses ASAN, with "-asan" at the end of its name.
The two notable access specifiers are:
  • public: class members defined after the public keyword are accessible from outside the class.
  • private: class members are generally private by default and thus not accessible from the outside
Constructor is a special non-static member function of a class that is used to initialize objects of its class type. A constructor is called upon initialization of an object. A constructor without arguments is a default constructor, but constructors that take arguments can be defined.
GDB, the GNU Project debugger, allows you to see what is going on `inside' another program while it executes -- or what another program was doing at the moment it crashed.
GDB can do four main kinds of things (plus other things in support of these) to help you catch bugs in the act:
  • Start your program, specifying anything that might affect its behavior.
  • Make your program stop on specified conditions.
  • Examine what has happened, when your program has stopped.
  • Change things in your program, so you can experiment with correcting the effects of one bug and go on to learn about another.
GNU Make is a tool which controls the generation of executables and other non-source files of a program from the program's source files. Make gets its knowledge of how to build your program from a file called the makefile, which lists each of the non-source files and how to compute it from other files. When you write a program, you should write a makefile for it, so that it is possible to use Make to build and install the program.
Memory leak means that the program is not freeing up memory it reserves. The memory will be freed when the program terminates, but if a program keeps leaking more and more memory without terminating, it can become a huge issue!
A typical way for memory leaks to occur is reserving memory with new and not calling delete before the pointer goes out of scope.
Pointer variables store a memory address as their value. In other words, they point to some data. The data can be accessed by dereferencing the pointer. (Either like *p or p->...)
A recursive function calls itself from within it. The recursion call must be conditional or it would lead to an infinite loop.
Valgrind is another tool besides ASAN that you can use in this course. It can detect many memory-related errors that are common in C and C++ programs and that can lead to crashes and unpredictable behaviour.
const is a keyword meant to mark something as immutable
  • A const object cannot be modified: attempt to do so directly is a compile-time error, and attempt to do so indirectly (e.g., by modifying the const object through a reference or pointer to non-const type) results in undefined behavior.
  • const keyword on an object's member function prevents the object from being modified in the function
  • Pointer declarations can have 2 const keywords, one to make the data that's pointed at unable to be modified and one to make the pointer itself unable to be modified
Using const improves code readability and prevents accidental modification of objects.
g++ is a C++ compiler that we primarily use for this course. It is the C++ compiler for the GNU compiler collection. You may sometimes see gcc being used instead of g++, which was originally the GNU C compiler, but calling gcc generally also compiles .cpp files as C++. However, calling g++ is preferred.
In C++, std stands for Standard Library, which is a collection of commonly useful classes and functions. Typically, these are defined in the std namespace.