CN107562872B - SQL-based query method and device for measuring spatial data similarity - Google Patents

Info

Publication number
CN107562872B
Authority
CN
China
Prior art keywords
query
distance
partition
determining
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710771206.8A
Other languages
Chinese (zh)
Other versions
CN107562872A (en
Inventor
卢卫
杜小勇
侯佳佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin University of China
Original Assignee
Renmin University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renmin University of China filed Critical Renmin University of China
Priority to CN201710771206.8A priority Critical patent/CN107562872B/en
Publication of CN107562872A publication Critical patent/CN107562872A/en
Application granted granted Critical
Publication of CN107562872B publication Critical patent/CN107562872B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an SQL-based query method and device for measuring spatial data similarity. A data set is partitioned into a plurality of partitions, each comprising data objects and a reference point; a first distance between each data object in a partition and the reference point of that partition is determined; and an index structure of the data objects is determined from the first distance. For a query request, a second distance between the query object and the reference point of each partition is determined, and a query range of the query object in each partition is determined from the second distance and a preset distance threshold; the target data objects falling within the query range of each partition are then determined in the index structure of the data objects. The method enables similarity queries over measurement space data on a database based on SQL technology, improving the applicability and performance of similarity queries in an RDBMS.

Description

SQL-based query method and device for measuring spatial data similarity
Technical Field
The invention relates to the technical field of data processing, in particular to a query method and a query device for measuring spatial data similarity based on SQL.
Background
Given a data set R, a query object q, a similarity (distance) function and a user-specified threshold θ, a similarity query finds all data objects r in R whose distance from q is less than or equal to θ; such data objects r are considered similar to the query object q. Similarity queries are applied in many fields, including face recognition, fingerprint recognition, spatial location query, text error correction, and pattern recognition (such as on DNA or protein sequences). With the rapid growth of data volume and the increasing diversity of data types, the objects targeted by similarity queries have extended from the multidimensional data of Euclidean space and the character string data of text space, used in early work, to the more general measurement space (metric space) data, which is not limited to a particular data type and requires only that the distance between any two elements of the data set be definable.
A Relational Database Management System (RDBMS) provides a uniform Structured Query Language (SQL) interface, can filter data in combination with other query conditions, and natively supports data updates, so performing measurement space data similarity queries in an RDBMS can greatly improve query performance. However, existing query methods for measuring spatial data similarity generally build a novel index structure for a specific application field and perform similarity queries on that structure. The types of these novel index structures do not match the index types supported by the RDBMS, that is, the RDBMS cannot recognize the format of the novel indexes. As a result, similarity queries over measurement space data cannot rely on the powerful data management functions of the RDBMS, and query efficiency is low.
Therefore, a method is needed for implementing measurement space data similarity queries on a database based on SQL technology, such as an RDBMS, so as to improve the applicability and performance of similarity queries in the RDBMS.
Disclosure of Invention
The invention provides an SQL-based query method and device for measuring spatial data similarity, which solve the problem in the prior art that the index structure constructed by a query method for measuring spatial data similarity does not match the types of index structure supported in a relational database management system (RDBMS). Measurement space data similarity queries can thereby be implemented on a database based on SQL technology, improving the applicability and performance of similarity queries in the RDBMS.
The invention provides a query method for measuring spatial data similarity based on SQL, which comprises the following steps:
carrying out partition processing on the data set to obtain a plurality of partitions; wherein each partition comprises: data object, reference point;
determining a first distance between each data object in the partition and the reference point according to the reference point;
determining an index structure of each data object according to the first distance;
receiving a query request of a user, wherein the query request comprises a query object and a preset distance threshold value between the query object and a target data object;
determining a second distance between the query object and a reference point in each partition, and determining a query range of the query object in each partition according to the second distance and the preset distance threshold;
and in the query range of each partition, determining a target data object corresponding to the query object according to the index structure of the data object in the query range.
In an embodiment of the present invention, the partitioning the data set to obtain a plurality of partitions includes:
determining, in the data set, a distance between each data object and each reference point;
and assigning each data object to the partition of the reference point to which its distance is smallest, to obtain a plurality of partitions matching the number of the reference points.
In an embodiment of the present invention, the determining an index structure of each data object according to the first distance includes:
determining a sorting rule of the data objects according to the magnitude relation of the first distance;
and determining the index structure of each data object according to the first distance and the sorting rule.
In an embodiment of the present invention, the determining, according to the second distance and the preset distance threshold, a query range of the query object in each partition includes:
determining an upper limit value of the query range according to the sum of the second distance and the preset distance threshold;
and determining the lower limit value of the query range according to the difference between the second distance and the preset distance threshold.
In an embodiment of the present invention, after determining the second distance between the query object and the reference point in each partition, the method further includes: acquiring the minimum value of a second distance according to the second distance between the query object and the reference point in each partition;
correspondingly, the determining the upper limit value of the query range according to the sum of the second distance and the preset distance threshold includes:
and determining the upper limit value of the query range according to the sum of the minimum value of the second distance and the preset distance threshold.
Another aspect of the present invention provides an SQL-based query apparatus for measuring spatial data similarity, including: the system comprises a partitioning module, a first determining module, a constructing module, a receiving module, a second determining module and a query module;
the partitioning module is used for partitioning the data set to obtain a plurality of partitions; wherein each partition comprises: data object, reference point;
the first determining module is used for determining a first distance between each data object in the partition and the reference point according to the reference point;
the building module is used for determining an index structure of each data object according to the first distance;
the receiving module is used for receiving a query request of a user, wherein the query request comprises a query object and a preset distance threshold value between the query object and a target data object;
the second determining module is configured to determine a second distance between the query object and a reference point in each partition, and determine a query range of the query object in each partition according to the second distance and the preset distance threshold;
and the query module is used for determining a target data object corresponding to the query object in the query range of each partition according to the index structure of the data object in the query range.
In an embodiment of the present invention, the partition module includes: a first determination unit and a dividing unit;
wherein the first determining unit is configured to determine, in the data set, a distance between each data object and each reference point;
the dividing unit is used for assigning each data object to the partition of the reference point to which its distance is smallest, obtaining a plurality of partitions matching the number of the reference points.
In an embodiment of the present invention, the building module includes: a second determination unit and a third determination unit;
the second determining unit is configured to determine an ordering rule of the data object according to a magnitude relationship of the first distance;
and the third determining unit is configured to determine an index structure of each data object according to the first distance and the sorting rule.
In an embodiment of the present invention, the second determining module includes: a fourth determining unit and a fifth determining unit;
the fourth determining unit is configured to determine an upper limit value of the query range according to a sum of the second distance and the preset distance threshold;
the fifth determining unit is configured to determine a lower limit of the query range according to a difference between the second distance and the preset distance threshold.
In an embodiment of the present invention, the second determining module further includes: an acquisition unit;
the acquiring unit is used for acquiring the minimum value of a second distance between the query object and the reference point in each partition;
correspondingly, the fourth determining unit is further configured to determine the upper limit value of the query range according to a sum of the minimum value of the second distance and the preset distance threshold.
According to the technical scheme, the SQL-based query method and device for measuring the spatial data similarity, provided by the invention, have the advantages that a plurality of partitions are obtained by partitioning a data set, and each partition comprises: data object, reference point; determining a first distance between each data object in the partition and the reference point according to the reference point; determining an index structure of each data object according to the first distance; receiving a query request of a user, wherein the query request comprises a query object and a preset distance threshold value between the query object and a target data object; determining a second distance between the query object and a reference point in each partition, and determining a query range of the query object in each partition according to the second distance and the preset distance threshold; and in the query range of each partition, determining a target data object corresponding to the query object according to the index structure of the data object in the query range. The method can realize measurement of the spatial data similarity query based on the database of the SQL technology so as to improve the applicability and performance of the similarity query of the RDBMS database.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a query method for measuring spatial data similarity based on SQL according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of data set partitioning according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of a query method for measuring spatial data similarity based on SQL according to another embodiment of the present invention;
Fig. 4 is a schematic flowchart of a query method for measuring spatial data similarity based on SQL according to another embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an SQL-based query device for measuring spatial data similarity according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an SQL-based query device for measuring spatial data similarity according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The embodiment provides a query method for measuring spatial data similarity based on SQL, as shown in fig. 1, the method may include:
Step 101, carrying out partition processing on a data set to obtain a plurality of partitions; wherein each partition comprises: data object, reference point;
Because the data types in the measurement space are diverse and the data volume is large, the data in the database can be preprocessed by dividing it into a plurality of partitions in order to improve query efficiency. All data in the measurement space form a data set. After the data set is divided, each partition has a partition sequence number identifying it, and contains at least one reference point and at least one data object.
Specifically, as shown in fig. 2, a plurality of objects may be arbitrarily selected from the data set as reference points, and the data space is divided into disjoint partitions equal in number to the reference points. The reference points may be denoted $p_i$, where i is the partition sequence number identifying each partition, and i = 1, 2, 3, 4, 5. The data objects in each partition may be determined in various ways. For example, the data objects remaining in the data set after the reference points are excluded may be distributed among the reference points by count; alternatively, data objects whose distance to a reference point falls within a preset range may be assigned to that reference point.
Step 102, determining a first distance between each data object in the partition and a reference point according to the reference point;
Since the data objects in the data set are of diverse types and each has its own original attributes, a new attribute value can be added to each data object to identify it in the similarity query. Specifically, the distance from each data object to the reference point, i.e., the first distance, may be determined in each partition according to the reference point. The first distance may be used to identify each data object; it comprises the partition sequence number of the partition where the data object is located and the distance between the data object and the reference point of that partition.
For example, for a data object r in the data set, whether its original attribute is character string data, image data or any other type, a new attribute value is added to the data object r: the distance from r to the reference point $p_i$ of the partition i where it is located, i.e., the first distance. The first distance may be expressed as: $\langle i, |r, p_i| \rangle$.
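The first distance can be illustrated with a minimal sketch. It assumes the data objects are 2-D points compared with the Euclidean distance, although the method only requires some distance function; the function name `first_distance` and the sample values are hypothetical and do not come from the patent.

```python
import math

def first_distance(obj, partition_id, reference_point, dist=math.dist):
    """Return the pair <i, |r, p_i|> for an object already assigned to partition i.

    `dist` is any distance function; Euclidean distance is an assumption
    made only for this illustration.
    """
    return (partition_id, dist(obj, reference_point))

# Hypothetical data: object r = (3, 4) lies in partition 1,
# whose reference point p_1 is the origin.
key = first_distance((3.0, 4.0), 1, (0.0, 0.0))
print(key)
```

Whatever the original type of the object, the stored key is always a partition number paired with a single number, which is what makes it indexable by an ordinary database.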
Step 103, determining an index structure of each data object according to the first distance;
After the first distance of each data object is determined, an index structure may be constructed with the first distance as an attribute value of the data object; specifically, the index structure may be a B+ tree index.
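As a rough sketch of how the first distance could be stored and indexed in a relational database, the following uses Python's built-in `sqlite3` module. SQLite's built-in tree index stands in here for the B+ tree index of a full RDBMS; the table name `objects`, the column names `part` and `dist`, and the index name are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each row keeps the original attribute (payload) plus the first distance,
# split into the partition number and the distance to the reference point.
conn.execute(
    "CREATE TABLE objects ("
    " id INTEGER PRIMARY KEY, payload TEXT, part INTEGER, dist REAL)"
)

# Composite index on <part, dist>, i.e. on the first distance.
conn.execute("CREATE INDEX idx_first_distance ON objects (part, dist)")

rows = [(1, "a", 1, 2.0), (2, "b", 1, 9.5), (3, "c", 2, 4.0)]
conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?)", rows)

# A range condition on the indexed columns retrieves the matching objects.
hits = conn.execute(
    "SELECT id FROM objects WHERE part = ? AND dist BETWEEN ? AND ? ORDER BY id",
    (1, 8.0, 12.0),
).fetchall()
print(hits)
```

Because the index is an ordinary composite index on numeric columns, it needs no extension of the database engine.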
Step 104, receiving a query request of a user, wherein the query request comprises a query object and a preset distance threshold value between the query object and a target data object;
the query method for measuring spatial data similarity based on SQL provided by this embodiment provides a user-defined function as a query interface for a user, that is, the user only needs to submit an SQL statement when sending a query request. The SQL statement needs to include a data set name, a query object, and a preset distance threshold between the query object and a target data object, so as to find all data objects in the data set whose distance from the query object is less than or equal to the preset distance threshold, that is, the data objects are considered to be similar to the query object.
Step 105, determining a second distance between the query object and a reference point in each partition, and determining a query range of the query object in each partition according to the second distance and a preset distance threshold;
since each data object is converted into data represented by a first distance, if a similarity query needs to be implemented by using an SQL statement, the query object needs to be correspondingly converted into data represented by the distance between the query object and the reference point in each partition.
Specifically, since the query object is given by the user, the reference point within each partition is also determined, and thus the distance between the query object and the reference point within each partition, i.e., the second distance, can be determined. According to different partitions, the distances from the query object to the reference points in different partitions are different.
After the query object is converted into the data represented by the second distance, the query range of the query object in each partition needs to be determined according to the second distance and a preset distance threshold.
Specifically, a data object r is similar to the query object q when the distance between them is less than or equal to the preset distance threshold, i.e., $|q, r| \le \theta$. By the triangle inequality, which the distance function of the measurement (metric) space satisfies, this implies a necessary condition on the distance from r to the reference point, expressed by Formula I:
Formula I: $|p_i, q| - \theta \le |p_i, r| \le |p_i, q| + \theta$;
where $p_i$ is a reference point; i is a partition sequence number; q is the query object; $\theta$ is the preset distance threshold; and r is a data object.
After the second distance $|p_i, q|$ between the query object q and the reference point $p_i$ of each partition i is determined, the range that $|p_i, r|$ must satisfy in each partition i can be determined from Formula I. Since $|p_i, r|$ is the distance from a data object in partition i to the reference point $p_i$, this range is the query range of the query object within that partition.
For example, if $|p_1, q| = 10$ in partition 1, the query range of the query object in partition 1 is $[10 - \theta, 10 + \theta]$; if $|p_2, q| = 15$ in partition 2, the query range of the query object in partition 2 is $[15 - \theta, 15 + \theta]$. In this way, one query range is obtained for each partition.
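The per-partition query ranges of this example can be computed with a few lines of code. This is only a sketch: the function name `query_ranges` and the threshold value of 2 are hypothetical, and the second distances 10 and 15 are the ones used in the example.

```python
def query_ranges(second_distances, theta):
    """Map each partition number i to the (lower, upper) bounds that the
    distance from a data object to the reference point of i must satisfy."""
    return {i: (d - theta, d + theta) for i, d in second_distances.items()}

# Second distances from the example: 10 in partition 1 and 15 in partition 2.
ranges = query_ranges({1: 10.0, 2: 15.0}, theta=2.0)
print(ranges)
```

Since distances are non-negative, an implementation could additionally clamp the lower bound at zero; that refinement is not stated in the text and is left out here.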
Step 106, in the query range of each partition, determining a target data object corresponding to the query object according to the index structure of the data object in the query range.
Because the index structure is constructed according to the first distance, the first distance is used for representing the distance from the data object in the partition to the reference point, and the query range is also an interval range representing the distance from the data object in the partition to the reference point, after the query range of each partition is determined, the data object with the distance from the reference point in each partition meeting the query range can be queried in the index structure by taking the query range as a query condition, and the data object queried in each partition is merged to obtain the target data object.
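Steps 101 to 106 can be put together in a small end-to-end sketch using Python's `sqlite3` module. Several points are assumptions of this illustration rather than statements of the patent: objects are 2-D points under Euclidean distance, each object is assigned to its nearest reference point, the table and function names are invented, and a final distance check is applied to the candidates because the range condition is only a necessary condition for similarity.

```python
import math
import sqlite3

def build(conn, points, refs):
    """Store each point with its first distance <part, dist> and index it."""
    conn.execute(
        "CREATE TABLE objects ("
        " id INTEGER PRIMARY KEY, x REAL, y REAL, part INTEGER, dist REAL)"
    )
    conn.execute("CREATE INDEX idx_fd ON objects (part, dist)")
    for x, y in points:
        # Assumption: the nearest reference point defines the partition.
        part, d = min(
            ((i, math.dist((x, y), p)) for i, p in enumerate(refs)),
            key=lambda t: t[1],
        )
        conn.execute(
            "INSERT INTO objects (x, y, part, dist) VALUES (?, ?, ?, ?)",
            (x, y, part, d),
        )

def similarity_query(conn, refs, q, theta):
    """Return all stored points within distance theta of q."""
    results = []
    for i, p in enumerate(refs):
        d_q = math.dist(q, p)  # second distance for partition i
        rows = conn.execute(
            "SELECT x, y FROM objects WHERE part = ? AND dist BETWEEN ? AND ?",
            (i, d_q - theta, d_q + theta),
        )
        # Added verification step: keep only candidates truly within theta.
        results.extend((x, y) for x, y in rows if math.dist((x, y), q) <= theta)
    return sorted(results)

conn = sqlite3.connect(":memory:")
refs = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (2.0, 0.0), (9.0, 9.0), (10.0, 12.0), (5.0, 5.0)]
build(conn, points, refs)
result = similarity_query(conn, refs, q=(1.0, 0.0), theta=1.5)
print(result)
```

The range condition lets the index discard most objects cheaply; only the surviving candidates require an actual distance computation against the query object.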
In the query method for measuring spatial data similarity based on SQL provided in this embodiment, a data set is partitioned to obtain a plurality of partitions, where each partition includes: data object, reference point; determining a first distance between each data object in the partition and the reference point according to the reference point; determining an index structure of each data object according to the first distance; receiving a query request of a user, wherein the query request comprises a query object and a preset distance threshold value between the query object and a target data object; determining a second distance between the query object and a reference point in each partition, and determining a query range of the query object in each partition according to the second distance and a preset distance threshold; and in the query range of each partition, determining a target data object corresponding to the query object according to the index structure of the data object in the query range. According to the method and the device, measurement space data similarity query can be realized based on the database of the SQL technology, so that the applicability and the performance of the RDBMS database similarity query are improved.
Example two
The embodiment provides a query method for measuring spatial data similarity based on SQL, as shown in fig. 3, the method may include:
step 201, determining a plurality of reference points in a data set;
the reference points may be determined in various ways in the data set, for example, the data objects may be divided into a plurality of reference points according to the number of the data objects remaining in the data set except the reference points; it is also possible to divide data objects whose distances to the reference points are within a preset range into each of the reference points according to their distances to the reference points.
Since the distribution of the reference points and the data objects is irregular, the selection of the reference points affects the number of data objects in each partition and the distance of each data object from the reference point. Therefore, the quality of the reference point selection directly affects the performance of the similarity query.
Therefore, determining a plurality of reference points in the data set in the present embodiment may include:
Given a query set in which each query object is known, the distance from each query object to each data object in the data set can be determined. An ideal reference point satisfies the following two conditions: (1) the number of query objects closest to the point should be as large as possible; (2) the sum of the distances from the query objects to the point should be as small as possible.
According to the two selection conditions, in the union of the query set and the data set, the reference value score(r) of each data object is obtained according to Formula II, and the data objects with the largest reference values score(r) are selected as the plurality of reference points.
Formula II: [equation image BDA0001395070920000091]
where score(r) is the reference value of the data object r. The other quantities in Formula II are as follows, where r is a data object, R is the data set, Q is the query set, and q is a query object:
the number of query objects closest to the data object (symbol: image BDA0001395070920000092);
the maximum of the number of query objects closest to the data object (symbol: image BDA0001395070920000093; definition: image BDA0001395070920000094);
the minimum of the number of query objects closest to the data object (symbol: image BDA0001395070920000095; definition: image BDA0001395070920000096);
AVG(r), the average of the sum of the distances from each query object to the data object (definition: image BDA0001395070920000097), in which image BDA0001395070920000098 denotes the set of query objects closest to the data object and image BDA0001395070920000099 denotes the number of query objects closest to the data object;
$AVG_{max} = \max_{r \in (R \cup Q)} AVG(r)$, the maximum of the average of the sum of the distances from each query object to the data object;
$AVG_{min} = \min_{r \in (R \cup Q)} AVG(r)$, the minimum of the average of the sum of the distances from each query object to the data object.
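The two per-object quantities described in words above, the number of query objects closest to a candidate and their average distance to it, can be sketched as follows. The sketch does not reproduce Formula II, whose exact form is not available in this text. It assumes one reading of the definitions: every query object is assigned to its nearest candidate, and the average is taken over the query objects assigned to that candidate. The function name and sample data are hypothetical.

```python
import math
from collections import defaultdict

def nearest_query_stats(candidates, queries, dist=math.dist):
    """For each candidate r, return (count, average distance) over the
    query objects whose nearest candidate is r.

    Candidates that are nearest to no query object are omitted.
    """
    assigned = defaultdict(list)
    for q in queries:
        r = min(candidates, key=lambda c: dist(q, c))
        assigned[r].append(dist(q, r))
    return {r: (len(ds), sum(ds) / len(ds)) for r, ds in assigned.items()}

stats = nearest_query_stats(
    candidates=[(0.0, 0.0), (10.0, 0.0)],
    queries=[(1.0, 0.0), (3.0, 0.0), (9.0, 0.0)],
)
print(stats)
```

A candidate with a high count and a low average matches the two selection conditions stated earlier; how the two quantities are normalized and combined into score(r) is left to Formula II.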
Step 202, determining the distance between each data object and each reference point;
in order to divide each data, after the reference points are determined, the distance between each data object in the data set and each reference point needs to be determined.
Step 203, dividing the data object corresponding to each reference point and having the minimum distance to the reference point into a partition, and obtaining a plurality of partitions matched with the number of the reference points, wherein each partition comprises: data object, reference point;
after determining the distance between each data object in the data set and each reference point, dividing the data object corresponding to each reference point and having the smallest distance with each reference point into a partition. Namely, the magnitude relation of the distance between each data object and different reference points is compared, the reference point closest to each data object is determined, and each data object is divided into the partition where the reference point with the smallest distance is located.
For example, suppose the data set has M reference points $p_1, p_2, \ldots, p_M$. The distance from data object x to reference point $p_1$ is 1, and its distances to the other reference points $p_2, \ldots, p_M$ are all greater than 1; the distance from data object y to reference point $p_2$ is 1, and its distances to the other reference points $p_1, p_3, \ldots, p_M$ are all greater than 1. Data object x is then placed in the same partition as reference point $p_1$, and data object y in the same partition as reference point $p_2$. In this way, a plurality of partitions matching the number of reference points is obtained, wherein each partition comprises at least one data object and at least one reference point.
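The nearest-reference-point assignment described in this example can be sketched as below. The function name, the use of Euclidean distance on 2-D points, the 0-based partition numbers, and the rule that ties go to the lowest-numbered reference point are all assumptions of the sketch; the text does not specify how ties are broken.

```python
import math
from collections import defaultdict

def partition_by_nearest(objects, reference_points, dist=math.dist):
    """Assign every object to the partition of its nearest reference point.

    Partitions are numbered from 0 here; ties go to the lowest index.
    """
    partitions = defaultdict(list)
    for obj in objects:
        i = min(
            range(len(reference_points)),
            key=lambda k: dist(obj, reference_points[k]),
        )
        partitions[i].append(obj)
    return dict(partitions)

parts = partition_by_nearest(
    [(1.0, 0.0), (9.0, 9.0), (0.0, 2.0)],
    [(0.0, 0.0), (10.0, 10.0)],
)
print(parts)
```

Each object lands in exactly one partition, so the partitions are disjoint as the method requires.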
By partitioning the data objects in the data set and setting a reference point for each partition, the partition processing of the data objects is realized, and the efficiency of similarity query is improved.
Step 204, determining a first distance between each data object in the partition and the reference point according to the reference point;
since the types of the data objects in the data set have diversity, each data object has its original attribute, a new attribute value can be added to the data object for identifying the data object in the similarity query. Specifically, the distance between each data object to the reference point, i.e., the first distance, may be determined in each partition according to the reference point. The first distance may be used to identify each data object, and specifically, the first distance includes partition sequence number information of a partition where the data object is located, and distance information between the data object and a reference point of the partition where the data object is located.
For example, for a data object r in the data set, whether its original attribute is character string data, image data or any other type, a new attribute value is added to the data object r: the distance from r to the reference point $p_i$ of the partition i where it is located, i.e., the first distance. The first distance may be expressed as: $\langle i, |r, p_i| \rangle$.
Adding the first distance as a new attribute value of the data objects unifies the formats of the various types of data objects in the measurement space, improves the efficiency and convenience of updating and maintaining the data objects, and suits a database based on SQL technology.
Step 205, determining a sorting rule of the data objects according to the magnitude relation of the first distance;
Since the first distance can be expressed as $\langle i, |r, p_i| \rangle$, where i is the partition sequence number and $|r, p_i|$ is the distance from the data object r to the reference point $p_i$ of the partition i where it is located, the first distance may be written in simplified form as $\langle i, d \rangle$. Thus, given the first distances $\langle i_1, d_1 \rangle$ and $\langle i_2, d_2 \rangle$ of any two data objects, the sorting rule of the data objects may be determined as:
when $i_1 > i_2$, or $i_1 = i_2$ and $d_1 > d_2$, then $\langle i_1, d_1 \rangle > \langle i_2, d_2 \rangle$;
when $i_1 = i_2$ and $d_1 = d_2$, then $\langle i_1, d_1 \rangle = \langle i_2, d_2 \rangle$;
otherwise, $\langle i_1, d_1 \rangle < \langle i_2, d_2 \rangle$.
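The sorting rule is a lexicographic order on the pair (partition sequence number, distance). A small sketch, with the hypothetical name `compare_first_distance` and made-up sample keys, writes the three cases out explicitly and checks that they agree with Python's built-in tuple ordering:

```python
from functools import cmp_to_key

def compare_first_distance(a, b):
    """Three-way comparison of two first distances <i1, d1> and <i2, d2>."""
    (i1, d1), (i2, d2) = a, b
    if i1 > i2 or (i1 == i2 and d1 > d2):
        return 1        # <i1, d1> > <i2, d2>
    if i1 == i2 and d1 == d2:
        return 0        # <i1, d1> = <i2, d2>
    return -1           # otherwise <i1, d1> < <i2, d2>

# Illustrative first distances (partition sequence number, distance).
keys = [(2, 0.5), (1, 7.0), (1, 3.0), (2, 0.1)]
ordered = sorted(keys, key=cmp_to_key(compare_first_distance))
```

Because the rule coincides with plain tuple comparison, a composite key on (partition, distance) in an ordinary ordered index reproduces it without custom comparison logic.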
Step 206, determining an index structure of each data object according to the first distance and the sorting rule;
After the first distances of all the data objects are determined, they can be sorted according to the sorting rule, that is, first by partition sequence number and then by the distance from the data object to the reference point of its partition, and the index structure of each data object can then be determined from the first distances and the sorting rule. Specifically, the index structure may be sorted in ascending or descending order of partition sequence number and of the distance between the data object and the reference point of its partition. As a result, the first distances of the data objects of each partition are contiguous in the index structure and form a range interval. With this index structure, when the first distances of data objects in one partition are queried, only the index pages corresponding to that partition need to be accessed, and the first distance values of other partitions stored in the remaining index pages are not accessed. This avoids unnecessary data transfer and redundant processor computation and improves the efficiency and accuracy of the similarity query.
By determining the index structure of each data object according to the first distance of the data object, the representation mode of the index is independent of the original attribute of the data object, and therefore the storage compactness of the data object in the database is improved. Meanwhile, the index representation mode is a specific numerical value, so that the index is updated and maintained more conveniently, the similarity query processing speed is increased when index scanning is used, and the similarity query performance is integrally improved.
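To illustrate why each partition occupies one contiguous run of the sorted index, the sketch below uses a sorted Python list and `bisect` as a stand-in for the index pages. The helper name `partition_slice` and the sample keys are assumptions made for illustration only.

```python
import bisect

# First distances (partition sequence number, distance), sorted by the rule above.
keys = sorted([(1, 3.0), (2, 0.5), (1, 7.0), (3, 4.2), (2, 6.0)])

def partition_slice(sorted_keys, i):
    """Return the half-open index range [lo, hi) holding partition i's entries."""
    lo = bisect.bisect_left(sorted_keys, (i, float("-inf")))
    hi = bisect.bisect_right(sorted_keys, (i, float("inf")))
    return lo, hi

# Only this slice has to be touched when partition 2 is queried.
lo, hi = partition_slice(keys, 2)
partition_2_entries = keys[lo:hi]
```

Entries of partitions 1 and 3 fall outside the slice, which mirrors the statement that index pages of other partitions need not be read.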
Step 207, receiving a query request of a user, wherein the query request comprises a query object and a preset distance threshold value between the query object and a target data object;
The SQL-based query method for measuring spatial data similarity provided by this embodiment offers the user a user-defined function as the query interface; that is, the user only needs to submit an SQL statement to send a query request. The SQL statement needs to include the data set name, the query object, and the preset distance threshold between the query object and a target data object, so as to find all data objects in the data set whose distance from the query object is less than or equal to the preset distance threshold, i.e., the data objects considered similar to the query object.
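The following sketch shows what submitting such an SQL statement could look like, using Python's `sqlite3` module and a registered user-defined function. The table name `words`, the function name `dist`, and the Hamming metric are assumptions; the text does not specify the signature of the user-defined function. The query here is evaluated by brute force and only illustrates the interface, not the index-based processing described in the later steps.

```python
import sqlite3

def hamming(a: str, b: str) -> int:
    """Hamming distance between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

conn = sqlite3.connect(":memory:")
# Register the distance as a two-argument SQL function (hypothetical name).
conn.create_function("dist", 2, hamming)
conn.execute("CREATE TABLE words (obj TEXT)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("karolin",), ("kathrin",), ("kerstin",)])

# Data set name (words), query object, and distance threshold in one statement.
rows = conn.execute(
    "SELECT obj FROM words WHERE dist(obj, ?) <= ? ORDER BY obj",
    ("kathrin", 3),
).fetchall()
```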
Step 208, determining a second distance between the query object and the reference point in each partition, and determining a query range of the query object in each partition according to the second distance and a preset distance threshold;
Since each data object has been converted into data represented by its first distance, implementing the similarity query with an SQL statement requires the query object to be converted correspondingly into data represented by the distance between the query object and the reference point in each partition.
Specifically, since the query object is given by the user, the reference point within each partition is also determined, and thus the distance between the query object and the reference point within each partition, i.e., the second distance, can be determined. According to different partitions, the distances from the query object to the reference points in different partitions are different.
After the query object is converted into the data represented by the second distance, the query range of the query object in each partition needs to be determined according to the second distance and a preset distance threshold.
Optionally, determining the query range of the query object in each partition according to the second distance and the preset distance threshold specifically includes:
determining the upper limit value of the query range according to the sum of the second distance and the preset distance threshold, and determining the lower limit value of the query range according to the difference between the second distance and the preset distance threshold.
Specifically, a data object is similar to the query object when the distance from the data object to the query object is less than or equal to the preset distance threshold. By the triangle inequality of the metric space, a necessary condition for this is given by formula one:
Formula one: $|p_i, q| - \theta \le |p_i, r| \le |p_i, q| + \theta$;
where $p_i$ is a reference point; i is a partition sequence number; q is the query object; $\theta$ is the preset distance threshold; and r is a data object.
After the second distance $|p_i, q|$ between the query object q and the reference point $p_i$ of each partition i is determined, the range interval that $|p_i, r|$ must satisfy in each partition i can be determined according to formula one. Since $|p_i, r|$ is the distance from a data object in partition i to the reference point $p_i$, the query range of the query object within each partition can thereby be determined.
For example, if $|p_1, q|$ is determined to be 10 in partition 1, the query range of the query object in partition 1 is $[10 - \theta, 10 + \theta]$; if $|p_2, q|$ is 15 in partition 2, the query range of the query object in partition 2 is $[15 - \theta, 15 + \theta]$. In this way, multiple query ranges may be obtained.
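The per-partition ranges of formula one can be sketched in a few lines of Python. The function name `query_ranges` is hypothetical, and the threshold value 2.0 is chosen only to make the example from the text concrete.

```python
def query_ranges(second_distances, theta):
    """Map partition number -> (lower, upper) bound on |p_i, r| per formula one."""
    return {i: (d - theta, d + theta) for i, d in second_distances.items()}

# Second distances |p_1, q| = 10 and |p_2, q| = 15, as in the example above.
ranges = query_ranges({1: 10.0, 2: 15.0}, theta=2.0)
```

One range is produced per partition, so the number of ranges equals the number of reference points.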
Step 209, in the query range of each partition, determining the target data object corresponding to the query object according to the index structure of the data object in the query range.
Because the index structure is constructed according to the first distance, the first distance is used for representing the distance from the data object in the partition to the reference point, and the query range is also an interval range representing the distance from the data object in the partition to the reference point, after the query range of each partition is determined, the data object with the distance from the reference point in each partition meeting the query range can be queried in the index structure by taking the query range as a query condition, and the data object queried in each partition is merged to obtain the target data object.
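A minimal sketch of this lookup-and-merge step, again using a sorted list with `bisect` in place of the database index; the helper name `candidates`, the object labels, and the numeric values are illustrative assumptions.

```python
import bisect

# Index entries sorted by first distance: (partition, distance, object label).
entries = sorted([(1, 3.0, "a"), (1, 9.0, "b"), (1, 11.5, "c"),
                  (2, 2.0, "d"), (2, 14.0, "e")])
keys = [(i, d) for i, d, _ in entries]
objects = [obj for _, _, obj in entries]

def candidates(ranges):
    """Run one range lookup per partition and merge the matching objects."""
    merged = []
    for i, (low, high) in sorted(ranges.items()):
        start = bisect.bisect_left(keys, (i, low))    # first entry with d >= low
        end = bisect.bisect_right(keys, (i, high))    # past last entry with d <= high
        merged.extend(objects[start:end])
    return merged

targets = candidates({1: (8.0, 12.0), 2: (13.0, 17.0)})
```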
In the SQL-based query method for measuring spatial data similarity provided in this embodiment, a data set is partitioned to obtain a plurality of partitions, where each partition includes a data object and a reference point; a first distance between each data object in the partition and the reference point is determined according to the reference point; an index structure of each data object is determined according to the first distance; a query request of a user is received, where the query request includes a query object and a preset distance threshold between the query object and a target data object; a second distance between the query object and the reference point in each partition is determined, and a query range of the query object in each partition is determined according to the second distance and the preset distance threshold; and within the query range of each partition, the target data object corresponding to the query object is determined according to the index structure of the data objects in the query range. In this way, metric-space similarity queries can be implemented on a database based on SQL technology, improving the applicability and performance of similarity queries in an RDBMS.
Example three
The embodiment provides a query method for measuring spatial data similarity based on SQL, as shown in fig. 4, the method may include:
Step 301, performing partition processing on a data set to obtain a plurality of partitions; wherein each partition comprises: data object, reference point;
Because the data types in the metric space are diverse and the data volume is large, all data in the database can be divided into a plurality of partitions in a preprocessing step to improve query efficiency. All data in the metric space form a data set; each partition obtained by dividing the data set has a partition sequence number identifying the partition and includes at least one reference point and at least one data object.
Specifically, as shown in fig. 2, a plurality of objects may be arbitrarily selected from the data set as reference points, and the data space is divided into disjoint partitions equal in number to the reference points. A reference point may be denoted $p_i$, where i is the partition sequence number identifying each partition; in this example i = 1, 2, 3, 4, 5. The data objects of each partition may be determined in various ways. For example, the data objects remaining in the data set, other than the reference points, may be distributed among the reference points according to their number; alternatively, according to the distance from each data object to each reference point, the data objects whose distance to a reference point is within a preset range may be assigned to that reference point.
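As one possible concrete instance, the sketch below assigns each data object to its nearest reference point, which is the variant described for the dividing unit and in claim 2. The function name `partition`, the one-dimensional data, and the absolute-difference metric are assumptions made for illustration.

```python
def partition(data, refs, dist):
    """Assign every object to the partition of its closest reference point."""
    parts = {i: [] for i in refs}
    for r in data:
        nearest = min(refs, key=lambda i: dist(r, refs[i]))
        parts[nearest].append(r)
    return parts

# Hypothetical reference points keyed by partition sequence number.
refs = {1: 0.0, 2: 10.0}
parts = partition([1.0, 4.0, 6.0, 12.0], refs, lambda a, b: abs(a - b))
```

The resulting partitions are disjoint and there is exactly one per reference point.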
Step 302, determining a first distance between each data object in the partition and a reference point according to the reference point;
Because the data objects in the data set are of diverse types and each data object has its own original attributes, a new attribute value may be added to each data object to identify it in the similarity query. Specifically, the distance between each data object and the reference point of its partition, i.e., the first distance, may be determined in each partition. The first distance may be used to identify each data object; it includes the sequence number of the partition where the data object is located and the distance between the data object and the reference point of that partition.
For example, for a data object r in the data set, regardless of whether its original attribute is character string data, image data, or any other type, a new attribute value is added to r: the distance from r to the reference point $p_i$ of the partition i where r is located, i.e., the first distance. The first distance may be expressed as $\langle i, |r, p_i| \rangle$.
Step 303, determining an index structure of each data object according to the first distance;
After the first distance of each data object is determined, an index structure may be constructed with the first distance as an attribute value of the data object; specifically, the index structure may be a B+ tree index.
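A hedged sketch of such an index in an SQL database, using Python's `sqlite3` module: the table `data`, its columns `part` and `dist`, and the index name are assumptions, and SQLite's ordered B-tree index merely stands in for the B+ tree index named in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The first distance is stored as two ordinary columns next to the object.
conn.execute("CREATE TABLE data (obj TEXT, part INTEGER, dist REAL)")
# Composite index ordered by partition number, then distance.
conn.execute("CREATE INDEX idx_first_distance ON data (part, dist)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [("a", 1, 3.0), ("b", 1, 9.0), ("c", 1, 11.5), ("d", 2, 2.0), ("e", 2, 14.0)],
)

# A per-partition range condition that the composite index can serve.
rows = conn.execute(
    "SELECT obj FROM data WHERE part = ? AND dist BETWEEN ? AND ? ORDER BY dist",
    (1, 8.0, 12.0),
).fetchall()
```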
Step 304, receiving a query request of a user, wherein the query request comprises a query object and a preset distance threshold value between the query object and a target data object;
The SQL-based query method for measuring spatial data similarity provided by this embodiment offers the user a user-defined function as the query interface; that is, the user only needs to submit an SQL statement to send a query request. The SQL statement needs to include the data set name, the query object, and the preset distance threshold between the query object and a target data object, so as to find all data objects in the data set whose distance from the query object is less than or equal to the preset distance threshold, i.e., the data objects considered similar to the query object.
Furthermore, the query request sent by the user may further include a similarity function, and the similarity function is used for processing and filtering the target data object and returning a final result set.
Step 305, determining a second distance between the query object and the reference point in each partition;
Since each data object has been converted into data represented by its first distance, implementing the similarity query with an SQL statement requires the query object to be converted correspondingly into data represented by the distance between the query object and the reference point in each partition.
Specifically, since the query object is given by the user, the reference point within each partition is also determined, and thus the distance between the query object and the reference point within each partition, i.e., the second distance, can be determined. According to different partitions, the distances from the query object to the reference points in different partitions are different.
Step 306, acquiring the minimum value of the second distance according to the second distance between the query object and the reference point in each partition;
The minimum of the second distances from the query object to the reference points in the different partitions is determined, i.e., the distance between the query object and the reference point closest to it.
In particular, let $p_i$ be a reference point, i a partition sequence number, q the query object, and r a data object. The second distance $|p_i, q|$ between the query object q and the reference point $p_i$ of each partition i can be determined, and the minimum value $|p_q, q|$ is obtained from the plurality of second distances $|p_i, q|$, where $p_q$ is the reference point closest to the query object q among the reference points $p_i$.
Step 307, determining an upper limit value of the query range according to the sum of the minimum value of the second distance and a preset distance threshold, and determining a lower limit value of the query range according to the difference between the second distance and the preset distance threshold;
A data object is similar to the query object when the distance from the data object to the query object is less than or equal to the preset distance threshold. By the triangle inequality of the metric space, a necessary condition for this is given by formula one:
Formula one: $|p_i, q| - \theta \le |p_i, r| \le |p_i, q| + \theta$;
where $p_i$ is a reference point; i is a partition sequence number; q is the query object; $\theta$ is the preset distance threshold; and r is a data object.
For example, if $|p_1, q|$ is determined to be 10 in partition 1, the query range of the query object in partition 1 is $[10 - \theta, 10 + \theta]$; if $|p_2, q|$ is 15 in partition 2, the query range of the query object in partition 2 is $[15 - \theta, 15 + \theta]$.
After the minimum value $|p_q, q|$ of the second distances $|p_i, q|$ between the query object q and the reference points $p_i$ of the different partitions i is determined, formula one can be optimized to formula two:
Formula two: $|p_i, q| - \theta \le |p_i, r| \le |p_q, q| + \theta$;
Through formula two, the range interval that $|p_i, r|$ must satisfy in each partition i can be determined, where $|p_q, q|$ is the distance between the query object q and the reference point $p_q$ closest to it. For example, if $|p_q, q|$ is determined to be 5 and $|p_1, q|$ is 10 in partition 1, the query range of the query object in partition 1 is $[10 - \theta, 5 + \theta]$; if $|p_2, q|$ is 15 in partition 2, the query range of the query object in partition 2 is $[15 - \theta, 5 + \theta]$. In this way, multiple query ranges may be obtained.
Since $|p_q, q| \le |p_i, q|$, formula two yields a tighter upper limit value of the query range than formula one while keeping the lower limit value unchanged. The query range of the query object in each partition is therefore determined more precisely, which improves the filtering of data objects, improves the efficiency and performance of the similarity query, and significantly reduces CPU and I/O overhead.
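The tightened ranges of formula two can be sketched as follows. The upper limit is justified under the assumption that every data object is stored in the partition of its nearest reference point: then $|p_i, r| \le |p_q, r| \le |p_q, q| + |q, r| \le |p_q, q| + \theta$. The function name, the partition numbering, and the threshold value 4.0 are illustrative assumptions.

```python
def query_ranges_tightened(second_distances, theta):
    """Map partition number -> (lower, upper) bound on |p_i, r| per formula two."""
    d_min = min(second_distances.values())          # |p_q, q|
    return {i: (d - theta, d_min + theta) for i, d in second_distances.items()}

# |p_1, q| = 10, |p_2, q| = 15, and the closest reference point is at distance 5.
ranges = query_ranges_tightened({1: 10.0, 2: 15.0, 3: 5.0}, theta=4.0)

# A range whose lower limit exceeds its upper limit cannot match any object,
# so the corresponding partition can be skipped without scanning it.
non_empty = {i: (lo, hi) for i, (lo, hi) in ranges.items() if lo <= hi}
```

In this example partition 2 drops out entirely, whereas formula one would still have required scanning the interval [11, 19] there.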
Since a query range of the query object is obtained for each partition according to the query request, the number of query ranges equals the number of partitions. When the data set is large or high-dimensional, thousands of reference points may be selected, which sharply increases the number of query ranges. Each query condition triggers an index scan; the scans are executed one after another, and the target data objects produced by each index scan are merged to obtain a candidate set. This makes the similarity query expensive for the database. Moreover, when the number of query ranges exceeds a threshold, the query engine may use a sequential scan instead of index scans; the processing time of a sequential scan grows linearly with the number of records in the data set, so for a large or high-dimensional data set the query takes a long time and responsiveness is poor.
Therefore, to solve the above problem, a mother table R may first be generated for the data set; the mother table R stores, for each partition, the partition sequence number and the range interval of the first distances of all data objects in that partition;
Correspondingly, after the query range of the query object in each partition is determined, a temporary table S can be constructed in the database; the temporary table S stores each partition sequence number and the query range of the query object corresponding to that partition sequence number;
The temporary table S is then joined with the mother table R for the query, which further narrows the query range of the query object in each partition as stored in the temporary table S and thereby improves the efficiency of the similarity query.
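One way the join of S with R might look is sketched below with Python's `sqlite3` module. The schemas of `data`, `R`, and `S`, the column names, and the exact join condition are assumptions; the text only states which information each table holds. The join drops partitions whose query range misses the partition's stored interval and clips the remaining ranges to that interval.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (obj TEXT, part INTEGER, dist REAL);
    CREATE INDEX idx_first_distance ON data (part, dist);
    INSERT INTO data VALUES ('a', 1, 3.0), ('b', 1, 9.0), ('c', 1, 11.5),
                            ('d', 2, 2.0), ('e', 2, 4.0);
    -- Mother table R: per-partition interval of the stored first distances.
    CREATE TABLE R AS
        SELECT part, MIN(dist) AS min_dist, MAX(dist) AS max_dist
        FROM data GROUP BY part;
    -- Temporary table S: per-partition query range of the current query.
    CREATE TEMP TABLE S (part INTEGER, lo REAL, hi REAL);
    INSERT INTO S VALUES (1, 8.0, 12.0), (2, 13.0, 17.0);
""")

rows = conn.execute("""
    SELECT d.obj
    FROM S
    JOIN R ON R.part = S.part
    JOIN data AS d ON d.part = S.part
    WHERE S.lo <= R.max_dist AND S.hi >= R.min_dist          -- prune partitions
      AND d.dist BETWEEN max(S.lo, R.min_dist)               -- clip the range
                     AND min(S.hi, R.max_dist)
    ORDER BY d.obj
""").fetchall()
```

A single joined statement replaces one range predicate per partition, which is the situation the preceding paragraphs describe as costly.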
Step 308, in the query range of each partition, determining a target data object corresponding to the query object according to the index structure of the data object in the query range.
Because the index structure is constructed according to the first distance, the first distance is used for representing the distance from the data object in the partition to the reference point, and the query range is also an interval range representing the distance from the data object in the partition to the reference point, after the query range of each partition is determined, the data object with the distance from the reference point in each partition meeting the query range can be queried in the index structure by taking the query range as a query condition, and the data object queried in each partition is merged to obtain the target data object.
Further, after the target data object is obtained, the target data object may be processed and filtered according to the similarity function in the query request, so as to obtain a final result set. The target data object is processed by utilizing the similarity function, so that the accuracy of similarity query is further improved, and the performance of similarity query is improved.
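Because the range condition is only a necessary condition, the candidates can contain objects that are not actually within the threshold. A minimal sketch of the final filtering, assuming the similarity function is the distance itself and using one-dimensional numbers with a reference point at 0 (all names and values are illustrative):

```python
def refine(candidates, q, theta, dist):
    """Keep only candidates whose true distance to q is within theta."""
    return [r for r in candidates if dist(q, r) <= theta]

def abs_diff(a, b):
    return abs(a - b)

# With p = 0, q = 10 and theta = 2, the range [8, 12] on |p, r| also admits
# r = -9 (|p, r| = 9), although |q, r| = 19 is far above the threshold.
candidates = [9.0, -9.0, 11.5]
result = refine(candidates, q=10.0, theta=2.0, dist=abs_diff)
```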
In the SQL-based query method for measuring spatial data similarity provided in this embodiment, a data set is partitioned to obtain a plurality of partitions, where each partition includes a data object and a reference point; a first distance between each data object in the partition and the reference point is determined according to the reference point; an index structure of each data object is determined according to the first distance; a query request of a user is received, where the query request includes a query object and a preset distance threshold between the query object and a target data object; a second distance between the query object and the reference point in each partition is determined, and a query range of the query object in each partition is determined according to the second distance and the preset distance threshold; and within the query range of each partition, the target data object corresponding to the query object is determined according to the index structure of the data objects in the query range. In this way, metric-space similarity queries can be implemented on a database based on SQL technology, improving the applicability and performance of similarity queries in an RDBMS.
Example four
The present embodiment provides an SQL-based query apparatus for measuring spatial data similarity, as shown in fig. 5, the apparatus may include: a partitioning module 41, a first determining module 42, a constructing module 43, a receiving module 44, a second determining module 45, and a querying module 46;
the partitioning module 41 is configured to perform partitioning processing on the data set to obtain a plurality of partitions; wherein each partition comprises: data object, reference point;
a first determining module 42, configured to determine, according to the reference point, a first distance between each data object in the partition and the reference point;
a constructing module 43, configured to determine an index structure of each data object according to the first distance;
a receiving module 44, configured to receive a query request of a user, where the query request includes a query object and a preset distance threshold between the query object and a target data object;
a second determining module 45, configured to determine a second distance between the query object and the reference point in each partition, and determine a query range of the query object in each partition according to the second distance and a preset distance threshold;
and the query module 46 is configured to determine, within the query range of each partition, a target data object corresponding to the query object according to the index structure of the data object within the query range.
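The division of work among the six modules can be sketched as a single small Python class. This is only an illustration under stated assumptions: the class name, the nearest-reference-point partitioning, the in-memory sorted list standing in for the index, and the sample data are not taken from the text.

```python
import bisect

class SimilarityQueryDevice:
    def __init__(self, data, refs, dist):
        self.refs, self.dist = refs, dist
        entries = []
        for r in data:
            # Partitioning module: choose the partition of the nearest reference point.
            i = min(refs, key=lambda k: dist(r, refs[k]))
            # First determining module: first distance <i, |r, p_i|>.
            entries.append((i, dist(r, refs[i]), r))
        # Constructing module: order entries by (partition, distance).
        entries.sort(key=lambda e: (e[0], e[1]))
        self.keys = [(i, d) for i, d, _ in entries]
        self.objects = [r for _, _, r in entries]

    def query(self, q, theta):
        # Receiving module: the request carries q and the threshold theta.
        found = []
        for i, p in self.refs.items():
            d2 = self.dist(q, p)                     # second determining module
            start = bisect.bisect_left(self.keys, (i, d2 - theta))
            end = bisect.bisect_right(self.keys, (i, d2 + theta))
            found.extend(self.objects[start:end])    # query module
        return found

device = SimilarityQueryDevice([1.0, 4.0, 6.0, 12.0, -9.0],
                               {1: 0.0, 2: 10.0}, lambda a, b: abs(a - b))
found = device.query(q=5.0, theta=1.5)
```

The returned objects are candidates in the sense of formula one; here they happen to coincide with the exact answer.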
The specific manner in which the respective modules of the apparatus in this embodiment perform operations has been described in detail in the method embodiments and will not be elaborated upon here.
The SQL-based query device for measuring spatial data similarity provided in this embodiment partitions a data set to obtain a plurality of partitions, where each partition includes a data object and a reference point; determines a first distance between each data object in the partition and the reference point according to the reference point; determines an index structure of each data object according to the first distance; receives a query request of a user, where the query request includes a query object and a preset distance threshold between the query object and a target data object; determines a second distance between the query object and the reference point in each partition, and determines a query range of the query object in each partition according to the second distance and the preset distance threshold; and, within the query range of each partition, determines the target data object corresponding to the query object according to the index structure of the data objects in the query range. In this way, metric-space similarity queries can be implemented on a database based on SQL technology, improving the applicability and performance of similarity queries in an RDBMS.
Example five
The present embodiment provides an SQL-based query apparatus for measuring spatial data similarity, as shown in fig. 6, the apparatus may include: a partitioning module 51, a first determining module 52, a constructing module 53, a receiving module 54, a second determining module 55 and a querying module 56;
the partitioning module 51 is configured to perform partitioning processing on the data set to obtain a plurality of partitions; wherein each partition comprises: data object, reference point;
a first determining module 52, configured to determine, according to the reference point, a first distance between each data object in the partition and the reference point;
a constructing module 53, configured to determine an index structure of each data object according to the first distance;
a receiving module 54, configured to receive a query request of a user, where the query request includes a query object and a preset distance threshold between the query object and a target data object;
a second determining module 55, configured to determine a second distance between the query object and the reference point in each partition, and determine a query range of the query object in each partition according to the second distance and a preset distance threshold;
and the query module 56 is configured to determine, within the query range of each partition, a target data object corresponding to the query object according to the index structure of the data object within the query range.
Optionally, the partition module 51 may include: a first determination unit 511 and a division unit 512;
wherein the first determining unit 511 is configured to determine, in the data set, a distance between each data object and each reference point;
the dividing unit 512 is configured to assign each data object to the partition of the reference point to which its distance is smallest, so as to obtain a plurality of partitions matching the number of the reference points.
Further, the constructing module 53 may include: a second determination unit 531 and a third determination unit 532;
the second determining unit 531 is configured to determine an ordering rule of the data object according to a magnitude relationship of the first distance;
a third determining unit 532, configured to determine an index structure of each data object according to the first distance and the sorting rule.
Further, the second determining module 55 includes: a fourth determination unit 551, a fifth determination unit 552;
the fourth determining unit 551 is configured to determine an upper limit value of the query range according to a sum of the second distance and a preset distance threshold;
a fifth determining unit 552, configured to determine a lower limit value of the query range according to a difference between the second distance and a preset distance threshold.
Preferably, the second determining module 55 may further include: an obtaining unit 553, configured to obtain a minimum value of the second distance according to the second distance between the query object and the reference point in each partition;
correspondingly, the fourth determining unit 551 is further configured to determine the upper limit value of the query range according to a sum of the minimum value of the second distance and a preset distance threshold.
The specific manner in which the respective modules of the apparatus in this embodiment perform operations has been described in detail in the method embodiments and will not be elaborated upon here.
The SQL-based query device for measuring spatial data similarity provided in this embodiment partitions a data set to obtain a plurality of partitions, where each partition includes a data object and a reference point; determines a first distance between each data object in the partition and the reference point according to the reference point; determines an index structure of each data object according to the first distance; receives a query request of a user, where the query request includes a query object and a preset distance threshold between the query object and a target data object; determines a second distance between the query object and the reference point in each partition, and determines a query range of the query object in each partition according to the second distance and the preset distance threshold; and, within the query range of each partition, determines the target data object corresponding to the query object according to the index structure of the data objects in the query range. In this way, metric-space similarity queries can be implemented on a database based on SQL technology, improving the applicability and performance of similarity queries in an RDBMS.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments; the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A query method for measuring spatial data similarity based on SQL is characterized by comprising the following steps:
carrying out partition processing on the data set to obtain a plurality of partitions; wherein each partition comprises: data object, reference point;
determining a first distance between each data object in the partition and the reference point according to the reference point;
determining an index structure of each data object according to the first distance, wherein the index structure is a B+ tree index;
receiving a query request of a user, wherein the query request comprises a query object and a preset distance threshold value between the query object and a target data object;
determining a second distance between the query object and a reference point in each partition, and determining a query range of the query object in each partition according to the second distance and the preset distance threshold;
and in the query range of each partition, determining a target data object corresponding to the query object according to the index structure of the data object in the query range.
2. The method of claim 1, wherein partitioning the data set into a plurality of partitions comprises:
determining, in the data set, a distance between each data object and each reference point;
and dividing the data object corresponding to each reference point and having the minimum distance with the reference point into a partition, and obtaining a plurality of partitions matched with the number of the reference points.
3. The method of claim 1, wherein determining the index structure for each data object based on the first distance comprises:
determining a sorting rule of the data objects according to the magnitude relation of the first distance;
and determining the index structure of each data object according to the first distance and the sorting rule.
4. The method according to claim 1, wherein the determining a query range of the query object in each of the partitions according to the second distance and the preset distance threshold comprises:
determining an upper limit value of the query range according to the sum of the second distance and the preset distance threshold;
and determining the lower limit value of the query range according to the difference between the second distance and the preset distance threshold.
5. The method of claim 4, further comprising, after said determining a second distance between the query object and a reference point within each partition: acquiring the minimum value of a second distance according to the second distance between the query object and the reference point in each partition;
correspondingly, the determining the upper limit value of the query range according to the sum of the second distance and the preset distance threshold includes:
and determining the upper limit value of the query range according to the sum of the minimum value of the second distance and the preset distance threshold.
6. An SQL-based query device for measuring spatial data similarity, characterized by comprising: a partitioning module, a first determining module, a building module, a receiving module, a second determining module and a query module;
the partitioning module is used for partitioning the data set to obtain a plurality of partitions, wherein each partition comprises data objects and a reference point;
the first determining module is used for determining a first distance between each data object in the partition and the reference point according to the reference point;
the building module is configured to determine an index structure of each data object according to the first distance, where the index structure is a B+ tree index;
the receiving module is used for receiving a query request of a user, wherein the query request comprises a query object and a preset distance threshold value between the query object and a target data object;
the second determining module is configured to determine a second distance between the query object and a reference point in each partition, and determine a query range of the query object in each partition according to the second distance and the preset distance threshold;
and the query module is used for determining a target data object corresponding to the query object in the query range of each partition according to the index structure of the data object in the query range.
7. The apparatus of claim 6, wherein the partitioning module comprises: a first determining unit and a dividing unit;
wherein the first determining unit is configured to determine, in the data set, a distance between each data object and each reference point;
the dividing unit is configured to assign each data object to the partition of the reference point to which its distance is minimum, thereby obtaining a plurality of partitions whose number matches the number of reference points.
8. The apparatus of claim 6, wherein the building module comprises: a second determining unit and a third determining unit;
the second determining unit is configured to determine a sorting rule of the data objects according to the magnitude relation of the first distance;
and the third determining unit is configured to determine an index structure of each data object according to the first distance and the sorting rule.
9. The apparatus of claim 6, wherein the second determining module comprises: a fourth determining unit and a fifth determining unit;
the fourth determining unit is configured to determine an upper limit value of the query range according to a sum of the second distance and the preset distance threshold;
the fifth determining unit is configured to determine a lower limit value of the query range according to a difference between the second distance and the preset distance threshold.
10. The apparatus of claim 9, wherein the second determining module further comprises: an acquisition unit;
the acquisition unit is configured to acquire the minimum value of the second distance from the second distances between the query object and the reference points of the partitions;
correspondingly, the fourth determining unit is further configured to determine the upper limit value of the query range according to a sum of the minimum value of the second distance and the preset distance threshold.
CN201710771206.8A 2017-08-31 2017-08-31 SQL-based query method and device for measuring spatial data similarity Active CN107562872B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710771206.8A CN107562872B (en) 2017-08-31 2017-08-31 SQL-based query method and device for measuring spatial data similarity

Publications (2)

Publication Number Publication Date
CN107562872A (en) 2018-01-09
CN107562872B (en) 2020-03-24

Family

ID=60978424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710771206.8A Active CN107562872B (en) 2017-08-31 2017-08-31 SQL-based query method and device for measuring spatial data similarity

Country Status (1)

Country Link
CN (1) CN107562872B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815241B (en) * 2019-01-31 2021-05-11 上海达梦数据库有限公司 Data query method, device, equipment and storage medium
CN113792709B (en) * 2021-11-15 2022-01-11 湖南视觉伟业智能科技有限公司 Rapid large-scale face recognition method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6636849B1 (en) * 1999-11-23 2003-10-21 Genmetrics, Inc. Data search employing metric spaces, multigrid indexes, and B-grid trees
CN102163232A (en) * 2011-04-18 2011-08-24 国电南瑞科技股份有限公司 SQL (Structured Query Language) interface implementing method supporting IEC61850 object query
CN105138607A (en) * 2015-08-03 2015-12-09 山东省科学院情报研究所 Hybrid granularity distributional memory grid index-based KNN query method
CN106095920A (en) * 2016-06-07 2016-11-09 四川大学 Distributed index method towards extensive High dimensional space data
CN106528773A (en) * 2016-11-07 2017-03-22 山东首讯信息技术有限公司 Spark platform supported spatial data management-based diagram calculation system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CM-tree: A dynamic clustered index for similarity search in metric database; Lior Aronovich et al.; Data & Knowledge Engineering; 2007-12-31; Vol. 63, No. 3; pp. 919-946 *

Also Published As

Publication number Publication date
CN107562872A (en) 2018-01-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant