CN114896249A - Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm - Google Patents

Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm

Info

Publication number
CN114896249A
CN114896249A (application CN202210539416.5A)
Authority
CN
China
Prior art keywords
data
region
point
distance
nearest neighbor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210539416.5A
Other languages
Chinese (zh)
Inventor
朱亮
张士澜
宋鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University
Original Assignee
Hebei University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University filed Critical Hebei University
Priority to CN202210539416.5A priority Critical patent/CN114896249A/en
Publication of CN114896249A publication Critical patent/CN114896249A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22: Indexing; Data structures therefor; Storage structures
    • G06F 16/2228: Indexing structures
    • G06F 16/2246: Trees, e.g. B+trees
    • G06F 16/2282: Tablespace storage structures; Management thereof
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284: Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an unbalanced region tree (UR-Tree) index structure and an n-dimensional space inverse nearest neighbor query algorithm based on it, suitable for inverse nearest neighbor queries over n-dimensional data spaces. The algorithm comprises the following steps: Step 1, partition the data space and build the unbalanced region tree index structure; Step 2, filter all nodes of the index tree according to a filtering rule to find the regions that may contain candidate tuples, then refine those regions according to a refinement rule to form the final candidate set; Step 3, verify the data tuples contained in the candidate set to obtain the final result set. The invention provides an index structure built on a brand-new space partitioning method and, on that basis, a new pruning algorithm, which improves the efficiency of the inverse nearest neighbor algorithm. The method is suitable for inverse nearest neighbor queries in n-dimensional space, and the performance of the algorithm does not deteriorate rapidly as the dimension increases.

Description

Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm
Technical Field
The invention relates to a data set query method, in particular to an unbalanced area tree index structure and an n-dimensional space inverse nearest neighbor query algorithm based on that index structure.
Background
Existing inverse (reverse) nearest neighbor query strategies fall into two groups. One group pre-computes exact or approximate nearest-neighbor distances for all data objects and derives the inverse nearest neighbors of a query point from them; the other group uses a filter-refine framework, applying distance-based pruning techniques and an index structure to obtain the inverse nearest neighbors of the query point. The basic pre-computation-based inverse nearest neighbor algorithm computes the nearest neighbor and the nearest-neighbor circle of every data point in the data set before answering queries. The commonly used inverse nearest neighbor algorithms today follow the filter-refine framework: the whole data set is pruned according to certain rules to form a candidate set of query results, the data points in the candidate set are then refined to eliminate false hits, and finally the result set is output.
At present, inverse nearest neighbor queries are widely applied in many fields such as artificial intelligence, machine learning, databases, data mining, decision support, and geographic information systems. The inverse nearest neighbor query is an important query operation in spatial databases and has important applications in geographic information systems, location-based services, and similar fields, for example in resource allocation, facility siting, and route selection. In addition, the inverse nearest neighbor query is a basic operation in data mining and is very important in data mining models, especially in clustering and outlier detection.
In many applications, the feature space of data such as video, images, and shapes is often high dimensional. For many inverse nearest neighbor query algorithms that rely on pruning techniques, space partitioning and pruning become complex as the dimension increases; their overall performance is severely limited by the curse of dimensionality and may deteriorate sharply at as few as 10 dimensions. For high-dimensional data sets, the inverse nearest neighbor algorithm is both a key research topic and a difficulty.
The inverse nearest neighbor query algorithm is widely applied in many fields and is an important research subject that has received wide attention in academia, while the curse of dimensionality has been a challenging problem for many years. However, most algorithms do not efficiently solve the inverse nearest neighbor query in n-dimensional space.
Disclosure of Invention
The invention aims to provide an unbalanced area tree index structure and a construction method thereof, so as to solve the problem that the complexity of the inverse nearest neighbor algorithm grows sharply with the dimension when partitioning medium- and high-dimensional data spaces.
A second purpose of the invention is to provide an n-dimensional space inverse nearest neighbor query algorithm based on the unbalanced region tree index, so as to improve the query performance of the inverse nearest neighbor algorithm on medium- and high-dimensional data sets and to simplify inverse nearest neighbor queries on such data sets.
The first object of the invention is achieved as follows:
An unbalanced area tree index structure uses the PPS space partitioning algorithm to divide the space into different regions; besides the information of the data tuples it contains, the index structure also stores the feature information of the regions.
The PPS space partitioning algorithm divides the data space so that the number of data tuples contained in each region is not greater than a threshold M set on the number of data points per region. The nodes of the index structure contain both the information of the data tuples and the region feature information of the region. The region feature information includes the position of the region center point, the maximum distance from the center point to the region boundary, and the maximum of the nearest-neighbor distances of all data tuples contained in the region.
The PPS space partitioning algorithm is a midpoint-based partitioning algorithm for n-dimensional space. The invention provides an index structure built on a brand-new space partitioning method and, on that basis, uses a new pruning algorithm, which improves the efficiency of the inverse nearest neighbor algorithm.
The construction method of the unbalanced area tree index structure is as follows: the whole data space is divided into several regions using the PPS space partitioning algorithm, requiring that the number of data points in each region does not exceed a set threshold M (empty regions are allowed), so as to build the unbalanced area tree index structure; each node represents a region S, and the index structure contains the following information:
Information contained in an intermediate node: the center point c of region S, the maximum value b_max and minimum value b_min of the boundary of region S, the maximum distance r_max from the center point c to the region boundary, the number of contained data tuples tuple_num, and the maximum value d_max of the nearest-neighbor distances over all data tuples in the region.
Structure of the data tuple list: all data tuples are ordered by a chosen attribute, which is required to have the greatest discrimination; each entry stores the attribute values (A_1, ..., A_n), tid (the ID of the data tuple), tid' (the ID of the tuple's nearest neighbor), and d_NN (the distance between the data tuple and its nearest neighbor).
The second object of the invention is achieved as follows:
An n-dimensional space inverse nearest neighbor query algorithm based on the unbalanced region tree index comprises a filtering rule and its algorithm: given a query point Q, if the distance between Q and the center c of a region is greater than the sum of the maximum distance from the region's center point to its boundary and the region's maximum nearest-neighbor distance, the region can be pruned; otherwise it is kept as a candidate region. It further comprises a refinement rule and its algorithm: if the distance between a data tuple t and the query point Q in some dimension is greater than the nearest-neighbor distance of that data tuple, the tuple can be pruned; when the distance between the data tuple t and the query point Q in every dimension is smaller than the nearest-neighbor distance of the data tuple, the tuple is kept as one of the candidate points.
The n-dimensional space inverse nearest neighbor query algorithm based on the unbalanced region tree index specifically comprises the following steps:
Step 1, partition the data space and build the unbalanced area tree index structure;
Step 2, filter all nodes of the index tree according to the filtering rule, find the regions containing candidate tuples, and refine those regions according to the refinement rule to form the final candidate set;
Step 3, verify the data tuples contained in the candidate set to obtain the final result set.
For multi-dimensional numerical data sets, the invention uses a midpoint-based partitioning algorithm for n-dimensional space (also called the PPS space partitioning algorithm), which solves the difficult problem of partitioning n-dimensional regions. An unbalanced region tree index structure is created on this basis, pruning rules are established on the structure, and a two-stage pruning method is provided, which reduces the size of the candidate set and the cost consumed in the verification process.
The method is suitable for inverse nearest neighbor queries in n-dimensional space, and the performance of the algorithm does not deteriorate rapidly as the dimension increases.
Drawings
Fig. 1 is a schematic diagram of nearest neighbors and inverse nearest neighbors.
Fig. 2 is a schematic diagram of two-dimensional space partitioning.
Fig. 3 is a schematic diagram of the unbalanced area tree index structure of the invention.
Fig. 4 is a schematic diagram of a node of the unbalanced area tree index structure.
Fig. 5 is a schematic illustration of the proof of Lemma 1.
Fig. 6 is a flow chart of the PPS space partitioning algorithm of the invention.
Fig. 7 is a flow chart of the construction of the unbalanced area tree index structure.
Fig. 8 is a flow chart of the inverse nearest neighbor algorithm.
Fig. 9 is a bar graph of the time cost of queries on the CA data set.
Fig. 10 is a bar graph of the candidate set sizes formed by queries on the CA data set.
Fig. 11 is a diagram of the time cost of the inverse nearest neighbor algorithm querying the FCT data set.
Fig. 12 is a bar graph of the candidate set sizes formed by the inverse nearest neighbor algorithm querying the FCT data set.
Detailed Description
The invention is described in further detail below with reference to the figures and the detailed description.
Let $\mathbb{R}$ be the set of all real numbers, and let $(\mathbb{R}^n, d(\cdot,\cdot))$ be an n-dimensional real vector space with distance function $d(\cdot,\cdot)$. Suppose $R \subseteq \mathbb{R}^n$ is a finite data set (relation) with schema $R(tid, A_1, \dots, A_n)$, whose n attributes $(A_1, \dots, A_n)$ correspond to the n dimensions of $\mathbb{R}^n$, where the i-th dimension satisfies $A_i \subseteq \mathbb{R}$. Each tuple $t = (tid, t_1, \dots, t_n) \in R$ is associated with a tid (tuple identifier or primary key). R is typically stored as a base table in a relational database system, and $|R|$ denotes the size of R, i.e., the number of tuples in R.
The distance function $d(\cdot,\cdot)$ used in the invention is derived from the $\ell_p$ norm (p-norm) $\|\cdot\|_p$ with $1 \le p \le \infty$. For $x, y \in \mathbb{R}^n$, where $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$, $d(x, y)$ is:

$$d_p(x, y) = \|x - y\|_p = \Big(\sum_{i=1}^{n} |x_i - y_i|^p\Big)^{1/p}, \qquad 1 \le p < \infty,$$

$$d_\infty(x, y) = \|x - y\|_\infty = \max_{1 \le i \le n} |x_i - y_i|, \qquad p = \infty.$$

In addition, $\|x\|_p \to \|x\|_\infty$ as $p \to \infty$. For p = 1, 2 and ∞, $d_p(x, y)$ is the Manhattan distance $d_1(x, y)$, the Euclidean distance $d_2(x, y)$ and the maximum distance $d_\infty(x, y)$, respectively, all of which are useful in many applications; the Euclidean distance $d_2(x, y)$ and $d_\infty(x, y)$ are the ones used in the present invention.
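For concreteness, the distances above can be computed in a few lines. This is a minimal sketch; the function name and the example points are invented for illustration and are not part of the patent.

```python
# p-norm distances described above: Manhattan (p=1), Euclidean (p=2),
# maximum/Chebyshev (p=inf). Illustrative helper, not the patent's code.
from math import inf

def d_p(x, y, p=2.0):
    """l_p distance between two n-dimensional points x and y (p >= 1 or p = inf)."""
    if p == inf:
        return max(abs(xi - yi) for xi, yi in zip(x, y))
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

if __name__ == "__main__":
    x, y = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
    print(d_p(x, y, 1))    # Manhattan distance: 5.0
    print(d_p(x, y, 2))    # Euclidean distance: ~3.606
    print(d_p(x, y, inf))  # maximum distance: 3.0
```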
Consider a query point $Q \in \mathbb{R}^n$. An inverse (reverse) nearest neighbor query RNN(Q) finds, under the given distance function $d(\cdot,\cdot)$, the set of data tuples in R that have Q as their nearest neighbor, i.e.:

$$RNN(Q) = \{\, t \in R \mid d(t, Q) \le d(t, t') \ \text{for all}\ t' \in R \setminus \{t\} \,\}.$$

RNN(Q) may be an empty set or a set of one or more elements; the returned result should be the set of all qualifying points. Note that the nearest neighbor of the query point Q is quite different from the set of all points that have Q as their nearest neighbor.
As shown in Fig. 1, the nearest neighbor of point p is point q, but p is not the nearest neighbor of q; points q and r are each other's nearest neighbors. It follows that the inverse nearest neighbor set of p is empty, that of q is {p, r}, and that of r is {q}, that is: NN(p) = {q}, NN(q) = {r}, NN(r) = {q}, and RNN(p) = ∅, RNN(q) = {p, r}, RNN(r) = {q}.
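A brute-force sketch of these NN/RNN relations, used only to reproduce the Fig. 1 example: the three coordinates are invented (the figure gives no numbers) and the helper names are illustrative, not the patent's.

```python
# Naive nearest-neighbor and reverse-nearest-neighbor helpers matching the
# definitions above; chosen coordinates reproduce the Fig. 1 relations.
from math import dist  # Euclidean distance, Python 3.8+

def nn(points, tid):
    """ID of the nearest neighbor of data point `tid` within the data set."""
    return min((t for t in points if t != tid),
               key=lambda t: dist(points[tid], points[t]))

def rnn(points, qid):
    """RNN(q): data points (other than q) that have q as their nearest neighbor."""
    return {t for t in points if t != qid and nn(points, t) == qid}

points = {"p": (0.0, 0.0), "q": (3.0, 0.0), "r": (5.0, 0.0)}
print({t: nn(points, t) for t in points})            # {'p': 'q', 'q': 'r', 'r': 'q'}
print(rnn(points, "p"), rnn(points, "q"), rnn(points, "r"))
# set(), {'p', 'r'}, {'q'}  (set display order may vary)
```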
The invention provides an index structure built on a brand-new space partitioning method and, on that basis, uses a new pruning algorithm, improving the efficiency of the inverse nearest neighbor algorithm.
The first aspect of the invention provides an unbalanced region tree index structure and its construction method. As shown in Fig. 2, the PPS space partitioning algorithm is used to divide the entire data space into several regions, requiring that the number of data points in each region is not greater than M (empty regions are allowed), so as to build the unbalanced region tree index structure shown in Fig. 3. Each node represents a region S, and the index structure contains the following information (Fig. 4):
Information contained in an intermediate node: the center point c of region S, the maximum value b_max and minimum value b_min of the boundary of region S, the maximum distance r_max from the center point c to the region boundary, the number of contained data tuples tuple_num, and the maximum value d_max of the nearest-neighbor distances over all data tuples in the region. Compared with an intermediate node, a leaf node additionally holds a pointer to the list of data tuples it contains.
Structure of the data tuple list: all data tuples are ordered by a chosen attribute, which is required to have the greatest discrimination; each entry stores the attribute values (A_1, ..., A_n), tid (the identifier of the data tuple), tid' (the identifier of the tuple's nearest neighbor), and d_NN (the distance between the data tuple and its nearest neighbor).
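The node and tuple records described above can be summarized in code. This is a minimal sketch under stated assumptions: the field names mirror the description (center c, b_min, b_max, r_max, tuple_num, d_max, tid, tid', d_NN), but the class layout, type choices and the use of Python are illustrative and not prescribed by the patent.

```python
# Sketch of UR-Tree node and data-tuple records; illustrative layout only.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TupleEntry:
    tid: int                       # identifier / primary key of the data tuple
    values: Tuple[float, ...]      # attribute values (A_1, ..., A_n)
    tid_nn: Optional[int] = None   # identifier of its nearest neighbor (tid')
    d_nn: float = 0.0              # distance to that nearest neighbor (d_NN)

@dataclass
class URNode:
    center: Tuple[float, ...]      # center point c of region S
    b_min: Tuple[float, ...]       # minimum boundary of S in each dimension
    b_max: Tuple[float, ...]       # maximum boundary of S in each dimension
    r_max: float                   # max distance from c to the region boundary
    tuple_num: int                 # number of data tuples contained in S
    d_max: float                   # max nearest-neighbor distance over tuples in S
    children: List["URNode"] = field(default_factory=list)
    tuples: Optional[List[TupleEntry]] = None  # leaf nodes only: sorted tuple list
```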
The second aspect of the invention provides, on the basis of the unbalanced region tree index structure, a new inverse nearest neighbor query algorithm. Using the established unbalanced region tree index structure and a filtering rule, the data space is pruned a first time and the regions that cannot contain query results are filtered out; in the remaining regions, the data tuples are pruned a second time according to a refinement rule to form the final candidate set; finally, the candidate set is verified.
Lemma 1. Given a query point Q and a region S, when the distance between Q and the center point c of S is greater than the sum of the region's maximum boundary distance r_max and maximum nearest-neighbor distance d_max, the region contains no inverse nearest neighbor of Q; that is, no data point t contained in the region is an inverse nearest neighbor of Q.
That is, for region S: if d(Q, c) > r_max + d_max, then t ∉ RNN(Q) for every t ∈ S.
Proof: As shown in Fig. 5, it is given that d(Q, c) > r_max + d_max. By the triangle inequality, d(Q, t) ≥ d(Q, c) - d(c, t), so d(Q, t) > r_max + d_max - d(c, t), i.e., d(Q, t) - d_max > r_max - d(c, t). Since d(c, t) ≤ r_max, it follows that d(Q, t) - d_max > 0, i.e., d(Q, t) > d_max ≥ d(t, t_NN), so the query point Q is not the nearest neighbor (NN) of data point t. This completes the proof.
From Lemma 1 the following filtering rule, i.e. the first pruning rule, is obtained: given a query point Q, compute the distance d(Q, c) between Q and the region center c; if d(Q, c) > r_max + d_max, the region can be pruned, and otherwise it is kept as a candidate region. If a region contains only one point, its nearest neighbor must lie in another region, and the region's d_max is exactly the distance between that point and its nearest neighbor, so the pruning rule applies unchanged and no separate case is needed.
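As a concrete illustration, the first pruning rule reduces to a single predicate over a region node. This is a sketch assuming the URNode record from the earlier sketch (fields center, r_max, d_max) and Euclidean distance; the function name is invented.

```python
# First pruning rule (Lemma 1): prune region S when d(Q, c) > r_max + d_max.
from math import dist

def region_can_be_pruned(node, q):
    """True if region S provably contains no inverse nearest neighbor of q."""
    return dist(q, node.center) > node.r_max + node.d_max
```

Only nodes for which this predicate is false are kept as candidate regions or descended into.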
Lemma 2. Let t = (t_1, ..., t_n) and t' = (t'_1, ..., t'_n) be two points in n-dimensional space. If the distance between the two points in some dimension is greater than a constant D, then the distance d(t, t') is greater than D; that is:
if there exists i ∈ {1, 2, ..., n} such that |t_i - t'_i| > D, then d(t, t') > D.
Proof: For two points t = (t_1, ..., t_n) and t' = (t'_1, ..., t'_n) in n-dimensional space, the distance in any single dimension satisfies

$$|t_i - t'_i| \le \sqrt{\sum_{j=1}^{n} (t_j - t'_j)^2} = d(t, t'), \qquad i = 1, \dots, n.$$

Hence, if there exists i ∈ {1, 2, ..., n} with |t_i - t'_i| > D, then d(t, t') ≥ |t_i - t'_i| > D. This completes the proof.
From Lemma 2 a refinement rule, i.e. the second pruning rule, is obtained: if the distance between a data tuple t and the query point Q in some dimension is greater than the nearest-neighbor distance d(t, t_NN) of t, the point can be pruned; only when the distance between the data tuple t and the query point Q in every dimension is not greater than d(t, t_NN) is the tuple kept as one of the candidate points.
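A sketch of the second pruning rule as a per-tuple check, assuming the TupleEntry record from the earlier sketch (fields values, d_nn); the function name is invented.

```python
# Second pruning rule (Lemma 2): discard tuple t as soon as one coordinate
# differs from Q by more than d(t, t_NN).
def survives_refinement(entry, q):
    """Keep t only if |t_i - q_i| <= d_NN holds in every dimension i."""
    return all(abs(ti - qi) <= entry.d_nn for ti, qi in zip(entry.values, q))
```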
Lemma 3. Given a query point Q, if a point t satisfies that its distance d(t, Q) to the query point Q is not greater than the distance d(t, t_NN) from t to its nearest neighbor, then t is one of the inverse nearest neighbors of Q. Conversely, if a point t' satisfies that its distance d(t', Q) to the query point Q is greater than the distance d(t', t'_NN) from t' to its nearest neighbor, then t' is not an inverse nearest neighbor of Q.
In short, for a point t, if the query point Q lies inside t's nearest-neighbor region, then t must belong to RNN(Q); otherwise t must not belong to RNN(Q).
Proof: Given a query point Q, a data point t, and the nearest-neighbor distance d(t, t_NN) of t: when d(t, Q) ≤ d(t, t_NN), the query point Q lies inside the nearest-neighbor region of data point t, so Q becomes the nearest neighbor of t and t is an inverse nearest neighbor of Q; when d(t, Q) > d(t, t_NN), the query point Q lies outside the nearest-neighbor region of t, Q cannot be the nearest neighbor of t, and therefore t is not an inverse nearest neighbor of Q.
From Lemma 3 a verification rule for the candidate set is obtained: compute the distance d(t, Q) between a candidate tuple t and the query point Q; when d(t, Q) is not greater than the nearest-neighbor distance d(t, t_NN) of the candidate tuple t, the tuple is an inverse nearest neighbor of Q; otherwise it is not.
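The verification rule likewise reduces to one distance comparison per candidate. A sketch under the same TupleEntry assumption:

```python
# Verification rule (Lemma 3): t is an inverse nearest neighbor of Q
# exactly when d(t, Q) <= d(t, t_NN).
from math import dist

def is_inverse_nearest_neighbor(entry, q):
    return dist(entry.values, q) <= entry.d_nn
```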
As shown in Figs. 6-8, the inverse nearest neighbor algorithm based on the unbalanced region tree index structure provided by the invention mainly comprises an index-construction algorithm and a filtering-refinement-verification procedure.
S1, build the unbalanced region tree index structure, which includes the following two parts:
S1-1, use the PPS space partitioning algorithm to partition the space and build the tree (Fig. 6):
S1-1-1: compute the coordinate of the center point c of region S in the i-th dimension as [(b_max^(i) - b_min^(i)) / 2], compute the maximum distance r_max from the center point c to the boundary of region S, and judge whether the number of data points contained in the region is greater than M;
S1-1-2: if the number of data points contained in region S is not greater than the set threshold M, form a list of the contained data points and sort it by primary key; if the number of data points contained in region S is greater than the set threshold M, find the k longest edges e_1, e_2, ..., e_k of region S, divide each edge into n_1, n_2, ..., n_k equal parts so that region S is divided into h = n_1·n_2·...·n_k sub-regions {S_i | i = 1, 2, ..., h}, insert the sub-regions into the index tree as child nodes of region S, and recursively call the algorithm with every sub-region S_i (i = 1, ..., h) as a region to be divided.
When the data space is two-dimensional, k is taken as 2 and the 2 edges are each divided into 3 equal parts; when the dimension of the data space exceeds two, k can be taken as 3 and the 3 longest edges are divided into 2, 2 and 3 equal parts, respectively. A code sketch of this recursive partitioning step follows.
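This is a minimal sketch under stated assumptions: Euclidean space, box-shaped regions with n ≥ 2, the splitting choices above (two dimensions: both edges into 3 parts; more than two dimensions: the 3 longest edges into 3, 2 and 2 parts, as in the experiments), and the center taken as the box midpoint. The dict layout, helper names and the degenerate-box guard are illustrative additions, not the patent's code.

```python
# Recursive PPS-style partitioning (step S1-1), illustrative sketch only.
from itertools import product
from math import dist

def pps_partition(points, b_min, b_max, M=100):
    """Recursively partition the box [b_min, b_max] until each region holds <= M points."""
    n = len(b_min)
    center = tuple((lo + hi) / 2 for lo, hi in zip(b_min, b_max))
    node = {"center": center, "b_min": b_min, "b_max": b_max,
            "r_max": dist(center, b_max),        # distance from center to a box corner
            "tuple_num": len(points), "children": [], "points": None}
    # leaf when few enough points, or when the box is degenerate (duplicate points)
    if len(points) <= M or max(hi - lo for lo, hi in zip(b_min, b_max)) < 1e-9:
        node["points"] = sorted(points)          # leaf region (possibly empty); lexicographic sort
        return node
    # S1-1-2: choose the edges to split and the number of parts per edge
    order = sorted(range(n), key=lambda i: b_max[i] - b_min[i], reverse=True)
    splits = {order[0]: 3, order[1]: 3} if n == 2 else \
             {order[0]: 3, order[1]: 2, order[2]: 2}
    parts = [splits.get(i, 1) for i in range(n)]
    step = [(b_max[i] - b_min[i]) / parts[i] for i in range(n)]
    # assign every point to exactly one grid cell
    def cell(p):
        return tuple(min(int((p[i] - b_min[i]) / step[i]), parts[i] - 1) if step[i] > 0 else 0
                     for i in range(n))
    buckets = {}
    for p in points:
        buckets.setdefault(cell(p), []).append(p)
    # create the h = prod(parts) sub-regions and recurse on each of them
    for combo in product(*(range(k) for k in parts)):
        lo = tuple(b_min[i] + combo[i] * step[i] for i in range(n))
        hi = tuple(b_min[i] + (combo[i] + 1) * step[i] for i in range(n))
        node["children"].append(pps_partition(buckets.get(combo, []), lo, hi, M))
    return node
```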
S1-2, fill in the information to form the unbalanced region tree (UR-Tree) index (Fig. 7):
S1-2-1: compute the distances between a data point t and the other data points in its region, find the point t' closest to t, and record the identifier tid' of that data tuple and the distance d(t, t') as, respectively, the nearest-neighbor identifier tid' and nearest-neighbor distance d_NN of data point t;
S1-2-2: centered on data point t, draw a rectangle S' with side length 2·d_NN whose sides are parallel to the region boundary; if the rectangle does not exceed the boundary of the region containing data point t, then point t' is the nearest neighbor of data point t; otherwise, execute step S1-2-3 for further checking;
S1-2-3: compute the regions that intersect rectangle S', find the data points in those regions that lie inside S', check whether any of them is the nearest neighbor of data point t, and update the nearest-neighbor identifier tid' and nearest-neighbor distance d_NN of data point t;
S1-2-4: for each region, compute the maximum of the nearest-neighbor distances d_NN of all data points in the region, denoted d_max. A code sketch of this information-filling step follows.
S2, filtering-refinement-verification, comprising the following steps (Fig. 8):
S2-1: starting from the root node, access the child nodes of the root node in order;
S2-2: for the node being accessed, compute the distance d(Q, c) between its center c and the query point Q, and the sum (r_max + d_max) of the region's maximum boundary distance r_max and maximum nearest-neighbor distance d_max;
S2-3: compare d(Q, c) with r_max + d_max; when d(Q, c) > r_max + d_max, prune the node and all of its child nodes and continue the sequential access; when d(Q, c) ≤ r_max + d_max, if the node is a leaf node, add it to the candidate region set and continue the sequential access, otherwise add its child nodes to the access list, until the sequential access is finished;
S2-4: access the data point lists contained in all regions of the candidate region set; starting from the first attribute, compare the distance d' between data point t and query point Q on each attribute with the nearest-neighbor distance d(t, t_NN); when d' > d(t, t_NN), prune the point; if d' ≤ d(t, t_NN) holds on every attribute, add the point to the candidate set;
S2-5: access all data points in the candidate set in order, compute the distance d(t, Q) between data point t and query point Q, and compare it with the nearest-neighbor distance d(t, t_NN); when d(t, Q) ≤ d(t, t_NN), data point t is an inverse nearest neighbor of query point Q and is added to the result set, otherwise data point t is discarded. A code sketch of this filter-refine-verify procedure follows.
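This is a minimal end-to-end sketch, assuming the URNode / TupleEntry records sketched after the index-structure description; the three passes S2-3, S2-4 and S2-5 are fused into a single tree traversal here for brevity, which is an implementation choice rather than the patent's exact control flow.

```python
# Filter-refine-verify query over a built UR-Tree (Lemmas 1-3 combined).
from math import dist

def rnn_query(root, q):
    """Return the tids of all inverse nearest neighbors of query point q."""
    result, stack = [], [root]
    while stack:                                   # S2-1: traverse from the root
        node = stack.pop()
        # S2-2 / S2-3: first pruning rule (Lemma 1) on the region
        if dist(q, node.center) > node.r_max + node.d_max:
            continue                               # prune this node and its whole subtree
        if node.tuples is None:                    # intermediate node: descend
            stack.extend(node.children)
            continue
        for t in node.tuples:                      # leaf node: candidate region
            # S2-4: second pruning rule (Lemma 2), dimension by dimension
            if any(abs(ti - qi) > t.d_nn for ti, qi in zip(t.values, q)):
                continue
            # S2-5: verification rule (Lemma 3)
            if dist(t.values, q) <= t.d_nn:
                result.append(t.tid)
    return result
```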
The PPS space partitioning algorithm was implemented in Microsoft Visual Studio and run on a computer with an Intel(R) Core(TM) i5-9400 CPU @ 2.90 GHz, 16 GB of memory, and the Windows 10 operating system.
The experimental data sets are the California census data set (CA): the 2020 United States census data for the state of California contains two attributes, longitude and latitude, and is therefore a two-dimensional data set; the longitude and latitude in the original data were converted into floating-point values, giving 519723 data tuples.
Forest cover type data set (FCT): the forest cover type data set describes the terrain features of 581012 forest cells, each with an area of 900 m², covering observations of trees in four areas of the Roosevelt National Forest in Colorado, USA. The data set has 55 columns, 53 of which are attributes, containing information such as elevation, slope and vegetation type.
Comparison experiment between the PPS space partitioning algorithm of the invention and the CSD algorithm on the two-dimensional data set CA:
In the experiment, subsets of 10k, 50k, 100k and 200k tuples were extracted from the CA data set; together with the complete CA data set (about 500k tuples), they form the five data sets required by the experiment. The running efficiency of the UR and CSD algorithms was tested on these five data sets of different sizes. In each run, 100 groups of two-dimensional data generated with a random function were used as the set of query points; the experiment was repeated several times, the query time and the size of the resulting candidate set were recorded, and the average query time per query point and the average candidate set size were finally computed. In the experiments, the threshold for building the UR-Tree index structure was set to M = 100; since the data set is two-dimensional, at each split the region boundary in both dimensions is divided into 3 equal parts, forming 9 small regions.
Figure 9 shows the time cost of running the two RNN algorithms on data sets of different sizes, comparing the pruning time and verification time of each algorithm. As shown in Fig. 9, the influence of data set size on RNN query time cost is very limited: both algorithms are insensitive to changes in data set size and do not vary greatly with it. In addition, the UR algorithm runs more efficiently and at a lower time cost than the CSD algorithm. In the time consumed by the CSD and VR algorithms, verification accounts for about 8%-10% of the total; in the UR algorithm, verification accounts for only about 2%-4% of the total, so the verification time required by the UR algorithm is greatly reduced and the total time consumed is very small, because no complex nearest-neighbor computation is needed during verification, only low-complexity distance computations. Fig. 10 reflects the relationship between the candidate set sizes formed by the algorithms and the data set size in the experiment: neither grows with the data set size and both remain at a stable size, and the candidate set obtained by the UR algorithm is smaller than those of the other two algorithms, which reduces the number of data tuples that must be verified and, in turn, the time consumed by the verification process.
Experiments of the invention on the FCT data set:
In this experiment, inverse nearest neighbor queries were performed with the UR algorithm on the FCT data set at different dimensionalities. In each run, 100 groups of data were generated with a random function, the query time and the size of the candidate set formed by filtering were recorded, and the average query time and candidate set size were computed over multiple runs. The threshold for building the UR-Tree index structure was set to M = 100. When the data set is two-dimensional, the region boundary in both dimensions is divided into 3 equal parts at each split, forming 9 small regions; when the dimension of the data set is greater than two, the three longest edges of the region boundary are found at each split and divided into 3, 2 and 2 equal parts respectively, forming 12 small regions.
Fig. 11 shows the relationship between the time cost of the UR algorithm and the dimension of the data set when performing inverse nearest neighbor queries on the n-dimensional data set. As can be seen from Fig. 11, the time cost of the UR algorithm increases with the dimension of the data set; since the main work of the pruning process is distance computation, the time cost shows a nearly linear growth trend as the dimension increases, and the verification process likewise grows nearly linearly with the dimension. In addition, the pruning process accounts for the bulk of the total time of the inverse nearest neighbor query algorithm, more than 99%, while verification takes less than 1%. Fig. 12 shows the relationship between the size of the candidate set formed after pruning by the UR algorithm and the dimension of the data set: as the dimension increases, the candidate set size also gradually increases, growing nearly linearly on low-dimensional data and flattening out gradually on medium- and high-dimensional data.

Claims (6)

1. An unbalanced area tree index structure, characterized in that a PPS space partitioning algorithm is used to divide the space into different regions, and that, besides the information of the data tuples it contains, the index structure also contains the feature information of the regions.
2. The unbalanced region tree index structure of claim 1, wherein the PPS space partitioning algorithm is used to partition the data space such that the number of data tuples contained in each region is not greater than a threshold M set on the number of data points per region; the nodes of the index structure contain the information of the data tuples and the region feature information of the region; the region feature information includes the position of the region center point, the maximum distance from the center point to the region boundary, and the maximum of the nearest-neighbor distances of all data tuples contained in the region.
3. A method for constructing an unbalanced region tree index structure, characterized in that a PPS space partitioning algorithm is used to divide the whole data space into several regions, requiring that the number of data points in each region does not exceed a set threshold M (empty regions included), so as to build the unbalanced region tree index structure, each node representing a region S; the index structure contains the following information:
information contained in an intermediate node: the center point c of region S, the maximum value b_max and minimum value b_min of the boundary of region S, the maximum distance r_max from the center point c to the region boundary, the number of contained data tuples tuple_num, and the maximum value d_max of the distances between each data tuple in the region and its nearest neighbor;
structure of the data tuple list: all data tuples are ordered by a chosen attribute, which is required to have the greatest discrimination; each entry stores the attribute values (A_1, ..., A_n), tid (the identifier of the data tuple), tid' (the identifier of the tuple's nearest neighbor), and d_NN (the distance between the data tuple and its nearest neighbor).
4. An n-dimensional space inverse nearest neighbor query algorithm based on an unbalanced area tree index structure, characterized by a filtering rule and its algorithm: given a query point Q, if the distance between Q and the region center c is greater than the sum of the maximum distance from the region's center point to its boundary and the region's maximum nearest-neighbor distance, the region can be pruned, and otherwise it is kept as a candidate region; and by a refinement rule and its algorithm: if the distance between a data tuple t and the query point Q in some dimension is greater than the nearest-neighbor distance of that data tuple, the tuple can be pruned; when the distance between the data tuple t and the query point Q in every dimension is smaller than the nearest-neighbor distance of the data tuple, the data tuple is taken as one of the candidate points.
5. An n-dimensional space inverse nearest neighbor query algorithm based on an unbalanced area tree index structure is characterized by comprising the following steps:
step 1, dividing a data space and establishing an unbalanced area tree index structure;
step 2, filtering all nodes of the index tree according to a filtering rule, finding the regions containing candidate tuples, and refining those regions according to a refinement rule to form the final candidate set;
and step 3, verifying the data tuples contained in the candidate set to obtain the final result set.
6. An n-dimensional space inverse nearest neighbor query algorithm based on an unbalanced area tree index structure, characterized by comprising the following steps:
S1, building the unbalanced region tree index structure:
S1-1, using the PPS space partitioning algorithm to partition the space and build the tree:
S1-1-1: computing the coordinate of the center point c of region S in the i-th dimension as [(b_max^(i) - b_min^(i)) / 2], computing the maximum distance r_max from the center point c to the boundary of region S, and judging whether the number of data points contained in the region is greater than the threshold M set on the number of data points per small region;
S1-1-2: if the number of data points contained in region S is not greater than the set threshold M, forming a list of the contained data points and sorting it by primary key; if the number of data points contained in region S is greater than the set threshold M, finding the k longest edges e_1, e_2, ..., e_k of S, dividing each edge into n_1, n_2, ..., n_k equal parts so that region S is divided into h = n_1·n_2·...·n_k sub-regions S_i (i = 1, 2, ..., h), inserting the sub-regions into the index tree as child nodes of region S, and recursively calling the algorithm with every sub-region S_i (i = 1, ..., h) as a region to be divided;
when the data space is two-dimensional, k takes the value 2 and the 2 edges are each divided into 3 equal parts; when the dimension of the data space exceeds two, k can take the value 3 and the 3 longest edges are divided into 2, 2 and 3 equal parts, respectively;
S1-2, filling in the information to form the unbalanced region tree index:
S1-2-1: computing the distances between a data point t and the other data points in its region, finding the point t' closest to t, and recording the identifier tid' of its data tuple and the distance d(t, t') as, respectively, the nearest-neighbor identifier tid' and nearest-neighbor distance d_NN of data point t;
S1-2-2: centered on data point t, drawing a rectangle S' with side length 2·d_NN parallel to the region boundary; if the rectangle does not exceed the boundary of the region containing data point t, then point t' is the nearest neighbor of data point t; otherwise executing step S1-2-3 for further checking;
S1-2-3: computing the regions intersecting rectangle S', finding the data points in those regions that lie inside S', checking whether any of them is the nearest neighbor of data point t, and updating the nearest-neighbor identifier tid' and nearest-neighbor distance d_NN of data point t;
S1-2-4: computing, for each region, the maximum of the nearest-neighbor distances d_NN of all data points in the region, denoted d_max;
S2, filtering, refinement and verification:
S2-1: starting from the root node, accessing the child nodes of the root node in order;
S2-2: computing the distance d(Q, c) between the center c of the accessed node and the query point Q, and the sum (r_max + d_max) of the region's maximum boundary distance r_max and maximum nearest-neighbor distance d_max;
S2-3: comparing d(Q, c) with r_max + d_max; when d(Q, c) > r_max + d_max, pruning the node and all of its child nodes and continuing the sequential access; when d(Q, c) ≤ r_max + d_max, if the node is a leaf node, adding it to the candidate region set and continuing the sequential access, otherwise adding its child nodes to the access list, until the sequential access is finished;
S2-4: accessing the data point lists contained in all regions of the candidate region set; starting from the first attribute, comparing the distance d' between data point t and query point Q on each attribute with the nearest-neighbor distance d(t, t_NN); when d' > d(t, t_NN), pruning the point; if d' ≤ d(t, t_NN) holds on every attribute, adding the point to the candidate set;
S2-5: accessing all data points in the candidate set in order, computing the distance d(t, Q) between data point t and query point Q, and comparing it with the nearest-neighbor distance d(t, t_NN); when d(t, Q) ≤ d(t, t_NN), data point t is an inverse nearest neighbor of query point Q and is added to the result set, otherwise data point t is discarded.
CN202210539416.5A 2022-05-18 2022-05-18 Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm Pending CN114896249A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210539416.5A CN114896249A (en) 2022-05-18 2022-05-18 Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210539416.5A CN114896249A (en) 2022-05-18 2022-05-18 Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm

Publications (1)

Publication Number Publication Date
CN114896249A true CN114896249A (en) 2022-08-12

Family

ID=82723649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210539416.5A Pending CN114896249A (en) 2022-05-18 2022-05-18 Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm

Country Status (1)

Country Link
CN (1) CN114896249A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024098763A1 (en) * 2022-11-08 2024-05-16 苏州元脑智能科技有限公司 Text operation diagram mutual-retrieval method and apparatus, text operation diagram mutual-retrieval model training method and apparatus, and device and medium
CN116541420A (en) * 2023-07-07 2023-08-04 上海爱可生信息技术股份有限公司 Vector data query method
CN116541420B (en) * 2023-07-07 2023-09-15 上海爱可生信息技术股份有限公司 Vector data query method

Similar Documents

Publication Publication Date Title
Abbasifard et al. A survey on nearest neighbor search methods
Manolopoulos et al. R-Trees: Theory and Applications: Theory and Applications
Kao et al. Clustering uncertain data using voronoi diagrams and r-tree index
Angiulli et al. Outlier mining in large high-dimensional data sets
Traina et al. Fast indexing and visualization of metric data sets using slim-trees
Lee et al. Computational geometry—a survey
Nagpal et al. Review based on data clustering algorithms
Taot et al. Indexing multi-dimensional uncertain data with arbitrary probability density functions
Dai et al. Probabilistic spatial queries on existentially uncertain data
Ester et al. Clustering for mining in large spatial databases
CN114896249A (en) Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm
US20030004938A1 (en) Method of storing and retrieving multi-dimensional data using the hilbert curve
Qi et al. Theoretically optimal and empirically efficient r-trees with strong parallelizability
KR102005343B1 (en) Partitioned space based spatial data object query processing apparatus and method, storage media storing the same
Kollios et al. Indexing mobile objects using dual transformations
Pola et al. The NOBH-tree: Improving in-memory metric access methods by using metric hyperplanes with non-overlapping nodes
Lamrous et al. Divisive hierarchical k-means
Ali et al. Probabilistic voronoi diagrams for probabilistic moving nearest neighbor queries
KR101994871B1 (en) Apparatus for generating index to multi dimensional data
Papadias et al. Constraint-based processing of multiway spatial joins
Agarwal et al. Time responsive external data structures for moving points
Lin et al. Finding optimal region for bichromatic reverse nearest neighbor in two-and three-dimensional spaces
Karlsen et al. Qualitatively correct bintrees: an efficient representation of qualitative spatial information
Wang Nearest neighbor query processing using the network voronoi diagram
Mao et al. On data partitioning in tree structure metric-space indexes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination