CS Case Study: Implementing an R-tree in Python for a 2D Points Dataset
2019-03-14

The objective is to implement the R-tree. Each submission will be graded based on correctness and efficiency. The rest of the document explains the details.


How Your Submission Will Be Tested: You will be given a dataset that contains 2D points. The dataset will be provided as a text file in the following format:


n
id_1 x_1 y_1
id_2 x_2 y_2
...
id_n x_n y_n


Specifically, the first line gives the number of points in the dataset. Each subsequent line gives the id, x-coordinate, and y-coordinate of one point.
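The file format above can be read with a few lines of standard-library Python. This is a minimal sketch, not a reference loader: the names `parse_points` and `load_points` are hypothetical, ids are kept as strings because the brief does not say whether they are numeric, and blank lines are skipped on the assumption that they carry no meaning.

```python
import time


def parse_points(text):
    """Parse 'n' followed by n lines of 'id x y' into a list of (id, x, y) tuples."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    n = int(rows[0][0])  # first non-blank line: number of points
    points = [(pid, float(x), float(y)) for pid, x, y in rows[1:n + 1]]
    if len(points) != n:
        raise ValueError(f"expected {n} points, found {len(points)}")
    return points


def load_points(path):
    """Read and parse a dataset file; also return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    with open(path) as f:
        points = parse_points(f.read())
    return points, time.perf_counter() - start
```

Calling `load_points("dataset.txt")` (the file name is an assumption) returns the point list together with the time spent reading it, which may be a convenient place to record a read-time measurement.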

Your program should build an R-tree in memory from the dataset. Then, we will measure its query efficiency as follows.


• First, your program should display the time taken to read the entire dataset once. This time serves as the sequential-scan benchmark, to be compared with the cost of your query algorithms that leverage the R-tree.


• [Range Query Testing] You will be given a set of 100 range queries in a text file whose format is:

x_1 x'_1 y_1 y'_1
x_2 x'_2 y_2 y'_2
...
x_100 x'_100 y_100 y'_100

That is, each line specifies a query whose rectangle is [x, x'] × [y, y'].

You should output:


– the number of points returned by each query, written to a disk file. Note: we need only the number of points retrieved, not the details of those points.

– the total running time of answering all 100 queries, and the average time per query (i.e., the total running time divided by 100).
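The query-testing steps above can be sketched as a small driver. Everything here is illustrative: `parse_queries`, `scan_count`, and `run_queries` are hypothetical names, the one-count-per-line output layout is an assumption (the brief only asks for counts in a disk file), and `scan_count` is a brute-force reference over in-memory points that is useful for checking an R-tree's answers, not the R-tree itself.

```python
import time


def parse_queries(text):
    """Parse lines of four numbers into (x_lo, x_hi, y_lo, y_hi) tuples."""
    queries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4:  # skip blank or malformed lines
            queries.append(tuple(float(v) for v in parts))
    return queries


def scan_count(points, query):
    """Brute force: test every (id, x, y) point against the closed rectangle."""
    x_lo, x_hi, y_lo, y_hi = query
    return sum(1 for _, x, y in points if x_lo <= x <= x_hi and y_lo <= y <= y_hi)


def run_queries(queries, count_fn, out_path):
    """Answer each query with count_fn, time the run, then write one count per line."""
    start = time.perf_counter()
    counts = [count_fn(q) for q in queries]
    total = time.perf_counter() - start  # file writing is excluded from the timing
    with open(out_path, "w") as f:
        f.writelines(f"{c}\n" for c in counts)
    average = total / len(queries) if queries else 0.0
    return counts, total, average
```

Passing `lambda q: scan_count(points, q)` as `count_fn` gives reference counts; swapping in an R-tree lookup with the same signature lets the two outputs be compared query by query.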


Marking: Your total mark earned for this project is based on:

• Queries: 80 marks, including

– Correctness: 40 marks. If your program correctly answers m (out of 100) queries, you get 40 · (m/100) marks for this part.

– Efficiency: 40 marks. If the average query time is at least 10 times faster than the sequential scan, you get 40 marks for this part. If it is at least 5 times faster (but less than 10 times), you get 20 marks. If it is less than 5 times faster, you get no marks.

• The Report: 20 marks, including

– Function Description: 15 marks. If your report includes a clear description of the functions in your source code, you get 15 marks. If only some of your functions are clearly described, you will be given marks in proportion to the part that is described correctly.

– Requirement Description: 5 marks. If your report includes a clear description of the requirements for executing your code, such as the OS environment, the placement of input files, and any input parameters, you will get 5 marks.
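As a worked example of the query marks described above, the arithmetic can be written out as follows. This is only a sketch of one reading of the scheme: `query_marks` is a hypothetical name, and the speedup is assumed to be the sequential-scan time divided by the average query time.

```python
def query_marks(correct, avg_query_time, scan_time, n_queries=100):
    """Return (correctness_marks, efficiency_marks) under the stated scheme."""
    correctness = 40 * correct / n_queries  # 40 * (m / 100)
    # Assumed definition of "k times faster": scan time over average query time.
    speedup = scan_time / avg_query_time if avg_query_time > 0 else float("inf")
    if speedup >= 10:
        efficiency = 40
    elif speedup >= 5:
        efficiency = 20
    else:
        efficiency = 0
    return correctness, efficiency
```

For instance, 90 correct answers with a scan time of 0.02 s and an average query time of 0.001 s correspond to a 20-fold speedup, so both components are earned in full except for the ten missed queries.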


You are required to implement the R-tree from scratch. This means that you can use only the standard libraries provided in the programming language of your choice (e.g., for C++, STL is considered a standard library).
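To illustrate what a from-scratch, standard-library-only solution might look like, here is a compact sketch of a static R-tree with a range-count query. Several choices are assumptions rather than requirements of the brief: the tree is bulk-loaded with Sort-Tile-Recursive (STR) packing instead of one-by-one insertion, each node caches its subtree point count so that fully covered nodes are counted without descending, points are `(id, x, y)` tuples, and the names `Node`, `build_rtree`, and `range_count` are hypothetical.

```python
import math


class Node:
    """R-tree node: a bounding rectangle over child nodes (internal) or points (leaf)."""

    def __init__(self, children=None, points=None):
        self.children = children  # list of Node, or None for a leaf
        self.points = points      # list of (id, x, y), or None for an internal node
        if points is not None:
            xs = [p[1] for p in points]
            ys = [p[2] for p in points]
            self.x1, self.x2, self.y1, self.y2 = min(xs), max(xs), min(ys), max(ys)
            self.count = len(points)
        else:
            self.x1 = min(c.x1 for c in children)
            self.x2 = max(c.x2 for c in children)
            self.y1 = min(c.y1 for c in children)
            self.y2 = max(c.y2 for c in children)
            self.count = sum(c.count for c in children)  # points in this subtree


def _tile(items, capacity, key_x, key_y):
    """Split items into groups of at most `capacity`, in Sort-Tile-Recursive order."""
    n_groups = math.ceil(len(items) / capacity)
    n_slices = math.ceil(math.sqrt(n_groups))
    slice_size = n_slices * capacity  # items per vertical slice
    by_x = sorted(items, key=key_x)
    groups = []
    for i in range(0, len(by_x), slice_size):
        strip = sorted(by_x[i:i + slice_size], key=key_y)
        groups.extend(strip[j:j + capacity] for j in range(0, len(strip), capacity))
    return groups


def build_rtree(points, capacity=16):
    """Bulk-load an R-tree bottom-up; returns the root Node, or None if no points."""
    if capacity < 2:
        raise ValueError("capacity must be at least 2")
    if not points:
        return None
    level = [Node(points=g)
             for g in _tile(points, capacity, lambda p: p[1], lambda p: p[2])]
    while len(level) > 1:  # pack nodes by rectangle centre until one root remains
        groups = _tile(level, capacity,
                       lambda n: n.x1 + n.x2, lambda n: n.y1 + n.y2)
        level = [Node(children=g) for g in groups]
    return level[0]


def range_count(node, qx1, qx2, qy1, qy2):
    """Count points inside the closed rectangle [qx1, qx2] x [qy1, qy2]."""
    if node is None or node.x1 > qx2 or node.x2 < qx1 or node.y1 > qy2 or node.y2 < qy1:
        return 0  # bounding rectangle misses the query entirely
    if qx1 <= node.x1 and node.x2 <= qx2 and qy1 <= node.y1 and node.y2 <= qy2:
        return node.count  # fully covered: use the cached subtree count
    if node.points is not None:
        return sum(1 for _, x, y in node.points
                   if qx1 <= x <= qx2 and qy1 <= y <= qy2)
    return sum(range_count(c, qx1, qx2, qy1, qy2) for c in node.children)
```

Bulk loading suits this task because the whole dataset is known before any query runs; a dynamic R-tree with insertion and node splitting would also satisfy the brief. The `capacity` default of 16 is an arbitrary starting point and is worth tuning against the actual dataset.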
