Overview
In this project you will implement a small part of a text search engine: using an existing inverted file index for a multi-word query to find the top-matching documents in a large document collection.
The project consists of 2 coding tasks and 1 written task. Each task is described in detail in the sections below.
Provided files
Before you attempt this project you will need to download the following code and data files from the LMS:
image.png
Makefile To assist with compilation. Edit if you add any extra files. data/ Directory containing many term-based inverted file index files. documents.txt List of all documents indexed.
At this point, you should be able to compile the supplied code (above) by running make. Once your project is completed, you will be able to execute it by running a command of the form
./a1 -t task -r num results -d num documents list of query terms
where
• task is either 1 or 2 representing which coding task to execute,
• num results is the number of search results to find and display,
• num documents is the maximum number of documents to consider (optional; the default is ‘all documents’, or 131,563), and
• list of query terms is a space-separated list of words making up a query for your program to ‘search’ for.
For ple, the command ./a1 -t 1 -r 12 algorithms are fun will run your solution to task 1 on the entire document collection, printing the top 12 results matching the combination of words ‘algorithms’, ‘are’, and ‘fun’.
Data structures
Your solution will need to work with the data structures defined in index.h and list.h. Here’s a brief overview (consult the header files for the finer details and a full list of available functions):
• An Index is used to represent a multi-term inverted file index. An Index has an array of string ‘terms’. For each term, it also stores a List of Documents.
• Each List in the Index is a linked list of Nodes, with each Node containing a pointer to a single
Document. These lists can easily be traversed using an appropriate while or for loop.
• A Document contains an integer id and a floating-point score. The id is an ordinal integer identifying a particular document in the overall collection (see documents.txt, which lists all documents by their id, starting with document 0). The score is a non-negative real number calculated based on the prevalence of the corresponding term within this document.
• Documents occur in each of the index’s lists in order of increasing id. Only documents with
score greater than zero for a term are present in the list for that term.
Additionally, you will require an array-based binary min-heap module to correctly implement each of the coding tasks. It is up to you exactly how to design and implement this module. Your solution to the week 4 lab tasks will be helpful here. Note that sample lab solutions will be released in the weeks after each lab and that these may be used in your project (with proper attribution).
Furthermore, you may create any additional modules necessary to support your solution. If you add additional modules, it is your responsibility to correctly extend your makefile—you must ensure that your solution can be compiled after submission simply by running make.
Coding tasks
The first two tasks of the project require you to implement functions inside query.c.
Task 1: Array-based accumulation approach (3 marks)
Implement the function print array results() defined in query.c. This function has three input parameters: index (an Index pointer) containing data for the query to perform; n results (an integer) describing the number of results to output; and n documents (another integer) containing the total number of documents in the collection to be queried.
This function should find the top n results matching document ids and their associated total scores. The total score for a document is the sum of its score from each term in the query. For this function, use the following method to find these top-scoring documents:
1. Initialise an array of n documents floating-point numbers, all set to 0.0