Overview
In this project you will implement a small part of a text search engine: using an existing inverted file index to find the top-matching documents in a large document collection for a multi-word query.
The project consists of 2 coding tasks and 1 written task. Each task is described in detail in the sections below.
Provided files
Before you attempt this project you will need to download the following code and data files from the LMS:
main.c
Entry point to program. Do not change.
list.h, list.c
Singly-linked list module. Do not change.
index.h, index.c
Inverted file index module. Do not change.
query.h
Prototypes for functions you must implement. Do not change.
query.c
Define your functions here, according to the specification below.
heap.h, heap.c
Binary min-heap module. Empty: see week 4 lab task.
Makefile
To assist with compilation. Edit if you add any extra files.
data/
Directory containing many term-based inverted file index files.
documents.txt
List of all documents indexed.
At this point, you should be able to compile the supplied code (above) by running make. Once your project is completed, you will be able to execute it by running a command of the form
./a1 -t <task> -r <num_results> -d <num_documents> <query terms>
where
<task> is either 1 or 2, representing which coding task to execute; <num_results> is the number of search results to find and display;
<num_documents> is the maximum number of documents to consider (optional; the default is `all documents', or 131,563); and
<query terms> is a space-separated list of words making up a query for your program to `search' for.
For example, the command ./a1 -t 1 -r 12 algorithms are fun will run your solution to task 1 on the entire document collection, printing the top 12 results matching the combination of words `algorithms', `are', and `fun'.
Data structures
Your solution will need to work with the data structures defined in index.h and list.h. Here's a brief overview (consult the header files for the finer details and a full list of available functions):
An Index is used to represent a multi-term inverted file index. An Index has an array of string `terms'. For each term, it also stores a List of Documents.
Each List in the Index is a linked list of Nodes, with each Node containing a pointer to a single Document. These lists can easily be traversed using an appropriate while or for loop.
A Document contains an integer id and a floating-point score. The id is an ordinal integer identifying a particular document in the overall collection (see documents.txt, which lists all documents by their id, starting with document 0). The score is a non-negative real number calculated based on the prevalence of the corresponding term within this document.
Documents occur in each of the index's lists in order of increasing id. Only documents with score greater than zero for a term are present in the list for that term.
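As an illustration, a list of this shape can be traversed with a simple pointer walk. The stand-in types and field names below (head, next, doc) are assumptions made for this sketch only; your solution should use the real definitions and accessor functions from list.h and index.h.

```c
#include <stdio.h>

/* Stand-in types for this sketch only; the real definitions live in
 * list.h and index.h and may use different field names. */
typedef struct { int id; double score; } Document;
typedef struct node { Document *doc; struct node *next; } Node;
typedef struct { Node *head; } List;

/* Walk one term's document list (sorted by increasing id) and
 * accumulate the total score stored in it. */
double sum_scores(const List *list) {
    double total = 0.0;
    for (const Node *node = list->head; node != NULL; node = node->next) {
        printf("document %d scores %.6f for this term\n",
               node->doc->id, node->doc->score);
        total += node->doc->score;
    }
    return total;
}
```

The same loop shape works for both coding tasks; only what you do with each Document changes.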
Additionally, you will require an array-based binary min-heap module to correctly implement each of the coding tasks. It is up to you exactly how to design and implement this module. Your solution to the week 4 lab tasks will be helpful here. Note that sample lab solutions will be released in the weeks after each lab and that these may be used in your project (with proper attribution).
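As a sketch of the kind of module this calls for, here is a minimal fixed-capacity array-based binary min-heap. The item type, function names, and signatures are illustrative assumptions only; your week 4 lab design may differ, and you are free to design the module however you like.

```c
#include <stdlib.h>
#include <assert.h>

/* Illustrative min-heap: items ordered by a double key, each carrying
 * an int payload. Names and signatures are assumptions, not a required
 * interface. */
typedef struct { double key; int value; } HeapItem;
typedef struct { HeapItem *items; int size; int capacity; } Heap;

Heap *heap_create(int capacity) {
    Heap *h = malloc(sizeof *h);
    h->items = malloc(capacity * sizeof *h->items);
    h->size = 0;
    h->capacity = capacity;
    return h;
}

static void swap_items(HeapItem *a, HeapItem *b) {
    HeapItem t = *a; *a = *b; *b = t;
}

void heap_insert(Heap *h, double key, int value) {
    assert(h->size < h->capacity);
    int i = h->size++;
    h->items[i] = (HeapItem){key, value};
    while (i > 0 && h->items[(i - 1) / 2].key > h->items[i].key) {
        swap_items(&h->items[i], &h->items[(i - 1) / 2]);  /* sift up */
        i = (i - 1) / 2;
    }
}

/* Remove and return the minimum-key item; requires h->size > 0. */
HeapItem heap_pop(Heap *h) {
    HeapItem min = h->items[0];
    h->items[0] = h->items[--h->size];
    int i = 0;
    for (;;) {  /* sift down */
        int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < h->size && h->items[l].key < h->items[smallest].key) smallest = l;
        if (r < h->size && h->items[r].key < h->items[smallest].key) smallest = r;
        if (smallest == i) break;
        swap_items(&h->items[i], &h->items[smallest]);
        i = smallest;
    }
    return min;
}

void heap_free(Heap *h) { free(h->items); free(h); }
```

Both coding tasks can be built on top of an interface of roughly this shape: top-k selection keys the heap on score, while the multi-way merge keys it on document id.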
Furthermore, you may create any additional modules necessary to support your solution. If you add additional modules, it is your responsibility to correctly extend your Makefile: you must ensure that your solution can be compiled after submission simply by running make.
Coding tasks
The first two tasks of the project require you to implement functions inside query.c.
Task 1: Array-based accumulation approach (3 marks)
Implement the function print_array_results() defined in query.c. This function has three input parameters: index (an Index pointer) containing data for the query to perform; n_results (an integer) describing the number of results to output; and n_documents (another integer) containing the total number of documents in the collection to be queried.
This function should find the top n_results matching document ids and their associated total scores. The total score for a document is the sum of its score for each term in the query. For this function, use the following method to find these top-scoring documents:
1. Initialise an array of n_documents floating-point numbers, all set to 0.0.
2. Iterate through each document list in the index. For each document in the list, add the document score to the corresponding position in the array (using the document id as an array index).
3. Use the priority queue-based top-k selection algorithm to find the maximum n_results total scores in the resulting array, and their associated document ids.
The function should print these results as per the instructions in the `Output format' section below.
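Step 3 above can be sketched as follows. This is an illustrative, self-contained version with an inline size-k min-heap keyed on score; in your solution the heap operations would come from your own heap module, and all names here are assumptions.

```c
#include <stdlib.h>

/* One candidate result: a total score and its document id. */
typedef struct { double score; int id; } Entry;

static void sift_down(Entry *heap, int size, int i) {
    for (;;) {
        int min = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < size && heap[l].score < heap[min].score) min = l;
        if (r < size && heap[r].score < heap[min].score) min = r;
        if (min == i) break;
        Entry t = heap[i]; heap[i] = heap[min]; heap[min] = t;
        i = min;
    }
}

/* Fill out[0..k) with the top-k (score, id) pairs, highest score first.
 * Returns the number actually written: documents with zero score are
 * skipped, so there may be fewer than k. */
int top_k(const double *scores, int n_documents, int k, Entry *out) {
    Entry *heap = malloc(k * sizeof *heap);
    int size = 0;
    for (int id = 0; id < n_documents; id++) {
        if (scores[id] <= 0.0) continue;        /* non-zero scores only */
        if (size < k) {                         /* heap not yet full */
            heap[size++] = (Entry){scores[id], id};
            if (size == k)                      /* heapify once full */
                for (int i = k / 2 - 1; i >= 0; i--) sift_down(heap, k, i);
        } else if (scores[id] > heap[0].score) {
            heap[0] = (Entry){scores[id], id};  /* evict current minimum */
            sift_down(heap, k, 0);
        }
    }
    if (size < k)                               /* heapify a partially full heap */
        for (int i = size / 2 - 1; i >= 0; i--) sift_down(heap, size, i);
    int count = size;
    for (int i = count - 1; i >= 0; i--) {      /* extract min repeatedly: */
        out[i] = heap[0];                       /* smallest remaining goes last */
        heap[0] = heap[i];
        sift_down(heap, i, 0);
    }
    free(heap);
    return count;
}
```

The heap never holds more than k entries, which is what gives this selection its advantage over fully sorting the score array when n_results is small.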
Task 2: Priority queue-based multi-way merge approach (3 marks)
Implement the function print_merge_results() defined in query.c. This function has two input parameters: index (an Index pointer) containing data for the query to perform; and n_results (an integer) describing the number of results to output.
This function should also find the top n_results matching document ids and their associated total scores. For this function, use the following method to find these top-scoring documents:
1. Use a priority queue-based multi-way merge algorithm to iterate through all of the index's document lists (one per query term) concurrently:
Initialise a priority queue to order the document lists based on the id of their rst documents. Repeatedly retrieve a document from the document list at the front of the priority queue, and rearrange the priority queue so that this list is positioned according to the id of its next document (stepping through the list). Stop after processing all documents in all lists.
2. While iterating through the lists in this way, accumulate total scores for each document id. Use the priority queue-based top-k selection algorithm to find the maximum n_results total scores and associated document ids.
The function should print the results as per the instructions in the `Output format' section below.
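The merge pattern in step 1 can be sketched in isolation. The version below merges sorted arrays of ids as a stand-in for the real document Lists; the heap holds one cursor per list, keyed on that list's next id. All names here are assumptions for illustration.

```c
#include <stdlib.h>

/* One cursor per document list: the id it currently points at, its
 * position within the list, and which list it belongs to. */
typedef struct { int next_id; int pos; int list; } Cursor;

static void sift_down(Cursor *h, int size, int i) {
    for (;;) {
        int min = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < size && h[l].next_id < h[min].next_id) min = l;
        if (r < size && h[r].next_id < h[min].next_id) min = r;
        if (min == i) break;
        Cursor t = h[i]; h[i] = h[min]; h[min] = t;
        i = min;
    }
}

/* Merge n_lists sorted id arrays into out[] in non-decreasing order;
 * returns how many ids were written (duplicates are kept, as they would
 * be when accumulating one document's score across several terms). */
int merge_ids(int *lists[], const int lens[], int n_lists, int *out) {
    Cursor *heap = malloc(n_lists * sizeof *heap);
    int size = 0, n_out = 0;
    for (int i = 0; i < n_lists; i++)        /* seed with each list's head */
        if (lens[i] > 0)
            heap[size++] = (Cursor){lists[i][0], 0, i};
    for (int i = size / 2 - 1; i >= 0; i--)  /* build the initial heap */
        sift_down(heap, size, i);
    while (size > 0) {
        Cursor c = heap[0];                  /* list with the smallest next id */
        out[n_out++] = c.next_id;
        if (c.pos + 1 < lens[c.list])        /* step this list forward */
            heap[0] = (Cursor){lists[c.list][c.pos + 1], c.pos + 1, c.list};
        else
            heap[0] = heap[--size];          /* list exhausted: shrink heap */
        sift_down(heap, size, 0);
    }
    free(heap);
    return n_out;
}
```

In the real task each cursor would hold a Node pointer instead of an array position, and equal ids arriving consecutively from different lists are where the score accumulation of step 2 happens.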
Output format
The output of both functions should follow the same format: the top-scoring n_results results with non-zero scores, printed one result per line to stdout (e.g. via the printf() function).
The results should be printed in descending order of total score (that is, with the highest total scoring document on the first line, and then the second, and so on). In the case of multiple documents with the same score, such documents may be printed in any order relative to each other. Each document should be on its own line. The line should be formatted with the document id printed first as a 6-digit decimal integer (padded on the left with spaces if necessary), followed by a single space, followed by the floating-point total score of the document to a precision of 6 digits after the decimal point. The following format string will be useful: "%6d %.6f\n"
There should never be more than n_results lines of output. If many documents have the same score as the document with the n_results-th highest score, any subset of these documents may be printed such that there are exactly n_results lines of output. Moreover, there should be no additional information printed to stdout (if you must print additional information, you can use stderr). Finally, only documents with non-zero total scores should ever be shown. Therefore, for some queries, there may actually be fewer than n_results lines of output.
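To illustrate the padding that format string produces, the sketch below renders one line into a buffer: id 42 with total score 1.5 becomes "    42 1.500000". A wrapper like this is purely illustrative; calling printf() directly inside your print functions is equally fine.

```c
#include <stdio.h>

/* Render one result line using the required format string: the id
 * right-aligned in 6 columns, then a space, then the score to 6
 * decimal places. */
void format_result(char *buf, size_t n, int id, double total_score) {
    snprintf(buf, n, "%6d %.6f\n", id, total_score);
}
```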
Sample output
To help you validate the output format of your functions, we provide some samples of correct output for three basic queries. Download the sample files from the LMS. Each file contains one sample of correct output from one of the commands described in the following table.
These samples are intended to help you confirm that your output follows the formatting instructions. Note that for some of these inputs there may be more than one correct output due to different possible orderings of documents with the same score. These samples represent only one ordering. Note also that these samples represent only a small subset of the inputs your solution will be tested against after submission: matching output for these inputs doesn't guarantee a correct solution. You are expected to test your functions comprehensively to ensure that they behave correctly for all inputs.
Filename           Command
helloworld-12.txt  ./a1 -t 1 -r 12 hello world
algorithms-99.txt  ./a1 -t 1 -r 99 algorithms are fun
catoutofbag-8.txt  ./a1 -t 1 -r 8 the cat is out of the bag
Since the expected output format for task 2 is identical to that of task 1, you can test task 2 by replacing -t 1 with -t 2 in each command.
Warning: These files contain Unix-style newlines. They will not display properly in some text editors, including the default Windows text editor Notepad.
Written task
The final task of the project requires you to write a report addressing the topics described below.
Task 3: Analysis of algorithms (4 marks)
First, analyse the different approaches from task 1 and task 2 in terms of their asymptotic time complexity. Then, explore the effect that the various input parameters (such as the number and length of document lists in the query and the number of results requested) have on the time taken by your program to produce its results. Support your investigation with experimental evidence gathered by timing your program with various input configurations. Include a plot to visualise your experimental results in your report. To conclude, give a precise characterisation of the circumstances where we should prefer one approach over the other.
Your report must be no more than two pages in length. You may use any document editor to create your report, but you must export the document as a PDF for submission. You should name the file report.pdf.
Submission
Via the LMS, submit a single archive file (e.g. .zip or .tar.gz) containing all files required to compile your solution (including the Makefile) plus your report (as a PDF). When extracted from this archive on the School of Engineering student machines (a.k.a. dimefox), your submission should compile without any errors simply by running the make command.
Please note that when compiling your program we will use the original versions of the files marked `Do not change' in the Provided files list. Any changes you have made to these files will be lost. This may lead to compile errors. Do not modify these files.
Submissions will close automatically at the deadline. As per the Subject Guide, the late penalty is 20% of the available marks for this project for each day (or part thereof) overdue. Note that network and machine problems right before the deadline are not sufficient excuses for a late or missing submission. Please see the Subject Guide for more information on late submissions and applying for an extension.
Marking
The two coding tasks will be marked as follows:
2 of the available marks will be for the correctness of your program's output on a suite of test inputs. You will lose partial marks for minor discrepancies in output formatting, or minor mistakes in your results. You may score no marks if your program crashes on certain inputs, produces completely incorrect answers, or causes compile errors. We will compile and test your program on the School of Engineering student machines (a.k.a. dimefox).
1 of the available marks will be for the quality and readability of your code. You will lose part or all of this mark if your program is poorly designed (e.g. with lots of repetition or poor functional decomposition), if your program is difficult to follow (e.g. due to missing or unhelpful comments or unclear variable names), or if your program has memory leaks.
The written report will be marked as follows:
1 of the available marks will be for the clarity and accuracy of your analysis of the asymptotic time complexity of the two approaches. You will lose partial marks for minor inaccuracies, or misuse of terminology or notation.
2 of the available marks will be for the quality of your investigation into the observed performance of both approaches. You will lose marks if your investigation is not supported by experimental evidence, if your report does not include a visualisation of your results, or if your investigation does not fully address the topic.
1 of the available marks will be for the clarity and accuracy of your conclusion.
Additional marks will be deducted if your report is too long (past the two-page limit) or is not a PDF document.
Academic honesty
All work is to be done on an individual basis. Any code sourced from third parties must be attributed. All submissions will be subject to automated similarity detection. Where academic misconduct is detected, all parties involved will be referred to the School of Engineering for handling under the University Discipline procedures. Please see the Subject Guide and the `Academic Integrity' section of the LMS for more information.