CN114896249A - Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm - Google Patents

Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm

Info

Publication number
CN114896249A
CN114896249A (application CN202210539416.5A)
Authority
CN
China
Prior art keywords
data
region
point
distance
nearest neighbor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210539416.5A
Other languages
Chinese (zh)
Inventor
朱亮
张士澜
宋鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University
Original Assignee
Hebei University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University filed Critical Hebei University
Priority to CN202210539416.5A priority Critical patent/CN114896249A/en
Publication of CN114896249A publication Critical patent/CN114896249A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22: Indexing; Data structures therefor; Storage structures
    • G06F 16/2228: Indexing structures
    • G06F 16/2246: Trees, e.g. B+trees
    • G06F 16/2282: Tablespace storage structures; Management thereof
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284: Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an unbalanced region tree (UR-Tree) index structure and an n-dimensional space inverse nearest neighbor query algorithm based on it, suitable for inverse nearest neighbor queries over n-dimensional data spaces. The algorithm comprises the following steps: Step 1, partition the data space and build the unbalanced region tree index structure; Step 2, filter all nodes of the index tree according to a filtering rule to find the regions that may contain candidate tuples, then refine those regions according to a refinement rule to form the final candidate set; Step 3, verify the data tuples contained in the candidate set to obtain the final result set. The invention provides an index structure built on a brand-new space partitioning method and, on that basis, a new pruning algorithm, which improves the efficiency of the inverse nearest neighbor algorithm. The method is suitable for inverse nearest neighbor queries in n-dimensional space, and the performance of the algorithm does not deteriorate rapidly as the dimension increases.

Description

Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm
Technical Field
The invention relates to a data set query method, in particular to an unbalanced area tree index structure and an n-dimensional space inverse nearest neighbor query algorithm based on that index structure.
Background
Existing inverse (reverse) nearest neighbor query strategies fall into two groups. One group pre-computes exact or approximate nearest-neighbor distances for all data objects and derives the inverse nearest neighbors of a query point from them; the other group uses a filter-refine framework, applying distance-based pruning techniques and an index structure to obtain the inverse nearest neighbors of the query point. The basic pre-computation-based inverse nearest neighbor algorithm computes the nearest neighbor and the nearest-neighbor circle of every data point in the data set before answering queries. The commonly used inverse nearest neighbor algorithms today follow the filter-refine framework: the whole data set is pruned according to certain rules to form a candidate set of query results, the data points in the candidate set are then refined to eliminate false hits, and finally the result set is output.
At present, inverse nearest neighbor queries are widely applied in many fields such as artificial intelligence, machine learning, databases, data mining, decision support, and geographic information systems. The inverse nearest neighbor query is an important query operation in spatial databases and has important applications in geographic information systems, location-based services, and similar fields, for example in resource allocation, facility siting, and route selection. In addition, the inverse nearest neighbor query is a basic operation in data mining and is very important in data mining models, especially in clustering and outlier detection.
In many applications, the feature space of data such as video, images, and shapes is often high dimensional. For many inverse nearest neighbor query algorithms that rely on pruning techniques, space partitioning and pruning become complex as the dimension increases; their overall performance is severely limited by the curse of dimensionality and may deteriorate sharply at as few as 10 dimensions. For high-dimensional data sets, the inverse nearest neighbor algorithm is both a key research topic and a difficulty.
The inverse nearest neighbor query algorithm is widely applied in many fields and is an important research subject that has received wide attention in academia, while the curse of dimensionality has been a challenging problem for many years. However, most algorithms do not efficiently solve the inverse nearest neighbor query in n-dimensional space.
Disclosure of Invention
The invention aims to provide an unbalanced area tree index structure and a construction method thereof, so as to solve the problem that the complexity of the inverse nearest neighbor algorithm grows sharply with the dimension when partitioning medium- and high-dimensional data spaces.
A second purpose of the invention is to provide an n-dimensional space inverse nearest neighbor query algorithm based on the unbalanced region tree index, so as to improve the query performance of the inverse nearest neighbor algorithm on medium- and high-dimensional data sets and to simplify inverse nearest neighbor queries on such data sets.
The first object of the invention is achieved as follows:
An unbalanced area tree index structure uses the PPS space partitioning algorithm to divide the space into different regions; besides the information of the data tuples it contains, the index structure also stores the feature information of the regions.
The PPS space partitioning algorithm divides the data space so that the number of data tuples contained in each region is not greater than a threshold M set on the number of data points per region. The nodes of the index structure contain both the information of the data tuples and the region feature information of the region. The region feature information includes the position of the region center point, the maximum distance from the center point to the region boundary, and the maximum of the nearest-neighbor distances of all data tuples contained in the region.
The PPS space partitioning algorithm is a midpoint-based partitioning algorithm for n-dimensional space. The invention provides an index structure built on a brand-new space partitioning method and, on that basis, uses a new pruning algorithm, which improves the efficiency of the inverse nearest neighbor algorithm.
The construction method of the unbalanced area tree index structure is as follows: the whole data space is divided into several regions using the PPS space partitioning algorithm, requiring that the number of data points in each region does not exceed a set threshold M (empty regions are allowed), so as to build the unbalanced area tree index structure; each node represents a region S, and the index structure contains the following information:
Information contained in an intermediate node: the center point c of region S, the maximum value b_max and minimum value b_min of the boundary of region S, the maximum distance r_max from the center point c to the region boundary, the number of contained data tuples tuple_num, and the maximum value d_max of the nearest-neighbor distances over all data tuples in the region.
Structure of the data tuple list: all data tuples are ordered by a chosen attribute, which is required to have the greatest discrimination; each entry stores the attribute values (A_1, ..., A_n), tid (the ID of the data tuple), tid' (the ID of the tuple's nearest neighbor), and d_NN (the distance between the data tuple and its nearest neighbor).
The second object of the invention is achieved as follows:
An n-dimensional space inverse nearest neighbor query algorithm based on the unbalanced region tree index comprises a filtering rule and its algorithm: given a query point Q, if the distance between Q and the center c of a region is greater than the sum of the maximum distance from the region's center point to its boundary and the region's maximum nearest-neighbor distance, the region can be pruned; otherwise it is kept as a candidate region. It further comprises a refinement rule and its algorithm: if the distance between a data tuple t and the query point Q in some dimension is greater than the nearest-neighbor distance of that data tuple, the tuple can be pruned; when the distance between the data tuple t and the query point Q in every dimension is smaller than the nearest-neighbor distance of the data tuple, the tuple is kept as one of the candidate points.
The n-dimensional space inverse nearest neighbor query algorithm based on the unbalanced region tree index specifically comprises the following steps:
Step 1, partition the data space and build the unbalanced area tree index structure;
Step 2, filter all nodes of the index tree according to the filtering rule, find the regions containing candidate tuples, and refine those regions according to the refinement rule to form the final candidate set;
Step 3, verify the data tuples contained in the candidate set to obtain the final result set.
For multi-dimensional numerical data sets, the invention uses a midpoint-based partitioning algorithm for n-dimensional space (also called the PPS space partitioning algorithm), which solves the difficult problem of partitioning n-dimensional regions. An unbalanced region tree index structure is created on this basis, pruning rules are established on the structure, and a two-stage pruning method is provided, which reduces the size of the candidate set and the cost consumed in the verification process.
The method is suitable for inverse nearest neighbor queries in n-dimensional space, and the performance of the algorithm does not deteriorate rapidly as the dimension increases.
Drawings
Fig. 1 is a schematic diagram of nearest neighbors and inverse nearest neighbors.
Fig. 2 is a schematic diagram of two-dimensional space partitioning.
Fig. 3 is a schematic diagram of the unbalanced area tree index structure of the invention.
Fig. 4 is a schematic diagram of a node of the unbalanced area tree index structure.
Fig. 5 is a schematic illustration of the proof of Lemma 1.
Fig. 6 is a flow chart of the PPS space partitioning algorithm of the invention.
Fig. 7 is a flow chart of the construction of the unbalanced area tree index structure.
Fig. 8 is a flow chart of the inverse nearest neighbor algorithm.
Fig. 9 is a bar graph of the time cost of queries on the CA data set.
Fig. 10 is a bar graph of the candidate set sizes formed by queries on the CA data set.
Fig. 11 is a diagram of the time cost of the inverse nearest neighbor algorithm querying the FCT data set.
Fig. 12 is a bar graph of the candidate set sizes formed by the inverse nearest neighbor algorithm querying the FCT data set.
Detailed Description
The invention is described in further detail below with reference to the figures and the detailed description.
Let $\mathbb{R}$ be the set of all real numbers, and let $(\mathbb{R}^n, d(\cdot,\cdot))$ be an n-dimensional real vector space with distance function $d(\cdot,\cdot)$. Suppose $R \subseteq \mathbb{R}^n$ is a finite data set (relation) with schema $R(tid, A_1, \dots, A_n)$, whose n attributes $(A_1, \dots, A_n)$ correspond to the n dimensions of $\mathbb{R}^n$, where the i-th dimension satisfies $A_i \subseteq \mathbb{R}$. Each tuple $t = (tid, t_1, \dots, t_n) \in R$ is associated with a tid (tuple identifier or primary key). R is typically stored as a base table in a relational database system, and $|R|$ denotes the size of R, i.e., the number of tuples in R.
The distance function $d(\cdot,\cdot)$ used in the invention is derived from the $\ell_p$ norm (p-norm) $\|\cdot\|_p$ with $1 \le p \le \infty$. For $x, y \in \mathbb{R}^n$, where $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$, $d(x, y)$ is:

$$d_p(x, y) = \|x - y\|_p = \Big(\sum_{i=1}^{n} |x_i - y_i|^p\Big)^{1/p}, \qquad 1 \le p < \infty,$$

$$d_\infty(x, y) = \|x - y\|_\infty = \max_{1 \le i \le n} |x_i - y_i|, \qquad p = \infty.$$

In addition, $\|x\|_p \to \|x\|_\infty$ as $p \to \infty$. For p = 1, 2 and ∞, $d_p(x, y)$ is the Manhattan distance $d_1(x, y)$, the Euclidean distance $d_2(x, y)$ and the maximum distance $d_\infty(x, y)$, respectively, all of which are useful in many applications; the Euclidean distance $d_2(x, y)$ and $d_\infty(x, y)$ are the ones used in the present invention.
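For concreteness, the distances above can be computed in a few lines. This is a minimal sketch; the function name and the example points are invented for illustration and are not part of the patent.

```python
# p-norm distances described above: Manhattan (p=1), Euclidean (p=2),
# maximum/Chebyshev (p=inf). Illustrative helper, not the patent's code.
from math import inf

def d_p(x, y, p=2.0):
    """l_p distance between two n-dimensional points x and y (p >= 1 or p = inf)."""
    if p == inf:
        return max(abs(xi - yi) for xi, yi in zip(x, y))
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

if __name__ == "__main__":
    x, y = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
    print(d_p(x, y, 1))    # Manhattan distance: 5.0
    print(d_p(x, y, 2))    # Euclidean distance: ~3.606
    print(d_p(x, y, inf))  # maximum distance: 3.0
```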
Consider a query point $Q \in \mathbb{R}^n$. An inverse (reverse) nearest neighbor query RNN(Q) finds, under the given distance function $d(\cdot,\cdot)$, the set of data tuples in R that have Q as their nearest neighbor, i.e.:

$$RNN(Q) = \{\, t \in R \mid d(t, Q) \le d(t, t') \ \text{for all}\ t' \in R \setminus \{t\} \,\}.$$

RNN(Q) may be an empty set or a set of one or more elements; the returned result should be the set of all qualifying points. Note that the nearest neighbor of the query point Q is quite different from the set of all points that have Q as their nearest neighbor.
As shown in Fig. 1, the nearest neighbor of point p is point q, but p is not the nearest neighbor of q; points q and r are each other's nearest neighbors. It follows that the inverse nearest neighbor set of p is empty, that of q is {p, r}, and that of r is {q}, that is: NN(p) = {q}, NN(q) = {r}, NN(r) = {q}, and RNN(p) = ∅, RNN(q) = {p, r}, RNN(r) = {q}.
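A brute-force sketch of these NN/RNN relations, used only to reproduce the Fig. 1 example: the three coordinates are invented (the figure gives no numbers) and the helper names are illustrative, not the patent's.

```python
# Naive nearest-neighbor and reverse-nearest-neighbor helpers matching the
# definitions above; chosen coordinates reproduce the Fig. 1 relations.
from math import dist  # Euclidean distance, Python 3.8+

def nn(points, tid):
    """ID of the nearest neighbor of data point `tid` within the data set."""
    return min((t for t in points if t != tid),
               key=lambda t: dist(points[tid], points[t]))

def rnn(points, qid):
    """RNN(q): data points (other than q) that have q as their nearest neighbor."""
    return {t for t in points if t != qid and nn(points, t) == qid}

points = {"p": (0.0, 0.0), "q": (3.0, 0.0), "r": (5.0, 0.0)}
print({t: nn(points, t) for t in points})            # {'p': 'q', 'q': 'r', 'r': 'q'}
print(rnn(points, "p"), rnn(points, "q"), rnn(points, "r"))
# set(), {'p', 'r'}, {'q'}  (set display order may vary)
```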
The invention provides an index structure built on a brand-new space partitioning method and, on that basis, uses a new pruning algorithm, improving the efficiency of the inverse nearest neighbor algorithm.
The first aspect of the invention provides an unbalanced region tree index structure and its construction method. As shown in Fig. 2, the PPS space partitioning algorithm is used to divide the entire data space into several regions, requiring that the number of data points in each region is not greater than M (empty regions are allowed), so as to build the unbalanced region tree index structure shown in Fig. 3. Each node represents a region S, and the index structure contains the following information (Fig. 4):
Information contained in an intermediate node: the center point c of region S, the maximum value b_max and minimum value b_min of the boundary of region S, the maximum distance r_max from the center point c to the region boundary, the number of contained data tuples tuple_num, and the maximum value d_max of the nearest-neighbor distances over all data tuples in the region. Compared with an intermediate node, a leaf node additionally holds a pointer to the list of data tuples it contains.
Structure of the data tuple list: all data tuples are ordered by a chosen attribute, which is required to have the greatest discrimination; each entry stores the attribute values (A_1, ..., A_n), tid (the identifier of the data tuple), tid' (the identifier of the tuple's nearest neighbor), and d_NN (the distance between the data tuple and its nearest neighbor).
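The node and tuple records described above can be summarized in code. This is a minimal sketch under stated assumptions: the field names mirror the description (center c, b_min, b_max, r_max, tuple_num, d_max, tid, tid', d_NN), but the class layout, type choices and the use of Python are illustrative and not prescribed by the patent.

```python
# Sketch of UR-Tree node and data-tuple records; illustrative layout only.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TupleEntry:
    tid: int                       # identifier / primary key of the data tuple
    values: Tuple[float, ...]      # attribute values (A_1, ..., A_n)
    tid_nn: Optional[int] = None   # identifier of its nearest neighbor (tid')
    d_nn: float = 0.0              # distance to that nearest neighbor (d_NN)

@dataclass
class URNode:
    center: Tuple[float, ...]      # center point c of region S
    b_min: Tuple[float, ...]       # minimum boundary of S in each dimension
    b_max: Tuple[float, ...]       # maximum boundary of S in each dimension
    r_max: float                   # max distance from c to the region boundary
    tuple_num: int                 # number of data tuples contained in S
    d_max: float                   # max nearest-neighbor distance over tuples in S
    children: List["URNode"] = field(default_factory=list)
    tuples: Optional[List[TupleEntry]] = None  # leaf nodes only: sorted tuple list
```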
The second aspect of the invention provides, on the basis of the unbalanced region tree index structure, a new inverse nearest neighbor query algorithm. Using the established unbalanced region tree index structure and a filtering rule, the data space is pruned a first time and the regions that cannot contain query results are filtered out; in the remaining regions, the data tuples are pruned a second time according to a refinement rule to form the final candidate set; finally, the candidate set is verified.
Lemma 1. Given a query point Q and a region S, when the distance between Q and the center point c of S is greater than the sum of the region's maximum boundary distance r_max and maximum nearest-neighbor distance d_max, the region contains no inverse nearest neighbor of Q; that is, no data point t contained in the region is an inverse nearest neighbor of Q.
That is, for region S: if d(Q, c) > r_max + d_max, then t ∉ RNN(Q) for every t ∈ S.
Proof: As shown in Fig. 5, it is given that d(Q, c) > r_max + d_max. By the triangle inequality, d(Q, t) ≥ d(Q, c) - d(c, t), so d(Q, t) > r_max + d_max - d(c, t), i.e., d(Q, t) - d_max > r_max - d(c, t). Since d(c, t) ≤ r_max, it follows that d(Q, t) - d_max > 0, i.e., d(Q, t) > d_max ≥ d(t, t_NN), so the query point Q is not the nearest neighbor (NN) of data point t. This completes the proof.
From Lemma 1 the following filtering rule, i.e. the first pruning rule, is obtained: given a query point Q, compute the distance d(Q, c) between Q and the region center c; if d(Q, c) > r_max + d_max, the region can be pruned, and otherwise it is kept as a candidate region. If a region contains only one point, its nearest neighbor must lie in another region, and the region's d_max is exactly the distance between that point and its nearest neighbor, so the pruning rule applies unchanged and no separate case is needed.
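As a concrete illustration, the first pruning rule reduces to a single predicate over a region node. This is a sketch assuming the URNode record from the earlier sketch (fields center, r_max, d_max) and Euclidean distance; the function name is invented.

```python
# First pruning rule (Lemma 1): prune region S when d(Q, c) > r_max + d_max.
from math import dist

def region_can_be_pruned(node, q):
    """True if region S provably contains no inverse nearest neighbor of q."""
    return dist(q, node.center) > node.r_max + node.d_max
```

Only nodes for which this predicate is false are kept as candidate regions or descended into.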
Lemma 2. Let t = (t_1, ..., t_n) and t' = (t'_1, ..., t'_n) be two points in n-dimensional space. If the distance between the two points in some dimension is greater than a constant D, then the distance d(t, t') is greater than D; that is:
if there exists i ∈ {1, 2, ..., n} such that |t_i - t'_i| > D, then d(t, t') > D.
Proof: For two points t = (t_1, ..., t_n) and t' = (t'_1, ..., t'_n) in n-dimensional space, the distance in any single dimension satisfies

$$|t_i - t'_i| \le \sqrt{\sum_{j=1}^{n} (t_j - t'_j)^2} = d(t, t'), \qquad i = 1, \dots, n.$$

Hence, if there exists i ∈ {1, 2, ..., n} with |t_i - t'_i| > D, then d(t, t') ≥ |t_i - t'_i| > D. This completes the proof.
From Lemma 2 a refinement rule, i.e. the second pruning rule, is obtained: if the distance between a data tuple t and the query point Q in some dimension is greater than the nearest-neighbor distance d(t, t_NN) of t, the point can be pruned; only when the distance between the data tuple t and the query point Q in every dimension is not greater than d(t, t_NN) is the tuple kept as one of the candidate points.
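A sketch of the second pruning rule as a per-tuple check, assuming the TupleEntry record from the earlier sketch (fields values, d_nn); the function name is invented.

```python
# Second pruning rule (Lemma 2): discard tuple t as soon as one coordinate
# differs from Q by more than d(t, t_NN).
def survives_refinement(entry, q):
    """Keep t only if |t_i - q_i| <= d_NN holds in every dimension i."""
    return all(abs(ti - qi) <= entry.d_nn for ti, qi in zip(entry.values, q))
```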
Lemma 3. Given a query point Q, if a point t satisfies that its distance d(t, Q) to the query point Q is not greater than the distance d(t, t_NN) from t to its nearest neighbor, then t is one of the inverse nearest neighbors of Q. Conversely, if a point t' satisfies that its distance d(t', Q) to the query point Q is greater than the distance d(t', t'_NN) from t' to its nearest neighbor, then t' is not an inverse nearest neighbor of Q.
In short, for a point t, if the query point Q lies inside t's nearest-neighbor region, then t must belong to RNN(Q); otherwise t must not belong to RNN(Q).
Proof: Given a query point Q, a data point t, and the nearest-neighbor distance d(t, t_NN) of t: when d(t, Q) ≤ d(t, t_NN), the query point Q lies inside the nearest-neighbor region of data point t, so Q becomes the nearest neighbor of t and t is an inverse nearest neighbor of Q; when d(t, Q) > d(t, t_NN), the query point Q lies outside the nearest-neighbor region of t, Q cannot be the nearest neighbor of t, and therefore t is not an inverse nearest neighbor of Q.
From Lemma 3 a verification rule for the candidate set is obtained: compute the distance d(t, Q) between a candidate tuple t and the query point Q; when d(t, Q) is not greater than the nearest-neighbor distance d(t, t_NN) of the candidate tuple t, the tuple is an inverse nearest neighbor of Q; otherwise it is not.
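The verification rule likewise reduces to one distance comparison per candidate. A sketch under the same TupleEntry assumption:

```python
# Verification rule (Lemma 3): t is an inverse nearest neighbor of Q
# exactly when d(t, Q) <= d(t, t_NN).
from math import dist

def is_inverse_nearest_neighbor(entry, q):
    return dist(entry.values, q) <= entry.d_nn
```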
As shown in Figs. 6-8, the inverse nearest neighbor algorithm based on the unbalanced region tree index structure provided by the invention mainly comprises an index-construction algorithm and a filtering-refinement-verification procedure.
S1, build the unbalanced region tree index structure, which includes the following two parts:
S1-1, use the PPS space partitioning algorithm to partition the space and build the tree (Fig. 6):
S1-1-1: compute the coordinate of the center point c of region S in the i-th dimension as [(b_max^(i) - b_min^(i)) / 2], compute the maximum distance r_max from the center point c to the boundary of region S, and judge whether the number of data points contained in the region is greater than M;
S1-1-2: if the number of data points contained in region S is not greater than the set threshold M, form a list of the contained data points and sort it by primary key; if the number of data points contained in region S is greater than the set threshold M, find the k longest edges e_1, e_2, ..., e_k of region S, divide each edge into n_1, n_2, ..., n_k equal parts so that region S is divided into h = n_1·n_2·...·n_k sub-regions {S_i | i = 1, 2, ..., h}, insert the sub-regions into the index tree as child nodes of region S, and recursively call the algorithm with every sub-region S_i (i = 1, ..., h) as a region to be divided.
When the data space is two-dimensional, k is taken as 2 and the 2 edges are each divided into 3 equal parts; when the dimension of the data space exceeds two, k can be taken as 3 and the 3 longest edges are divided into 2, 2 and 3 equal parts, respectively. A code sketch of this recursive partitioning step follows.
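This is a minimal sketch under stated assumptions: Euclidean space, box-shaped regions with n ≥ 2, the splitting choices above (two dimensions: both edges into 3 parts; more than two dimensions: the 3 longest edges into 3, 2 and 2 parts, as in the experiments), and the center taken as the box midpoint. The dict layout, helper names and the degenerate-box guard are illustrative additions, not the patent's code.

```python
# Recursive PPS-style partitioning (step S1-1), illustrative sketch only.
from itertools import product
from math import dist

def pps_partition(points, b_min, b_max, M=100):
    """Recursively partition the box [b_min, b_max] until each region holds <= M points."""
    n = len(b_min)
    center = tuple((lo + hi) / 2 for lo, hi in zip(b_min, b_max))
    node = {"center": center, "b_min": b_min, "b_max": b_max,
            "r_max": dist(center, b_max),        # distance from center to a box corner
            "tuple_num": len(points), "children": [], "points": None}
    # leaf when few enough points, or when the box is degenerate (duplicate points)
    if len(points) <= M or max(hi - lo for lo, hi in zip(b_min, b_max)) < 1e-9:
        node["points"] = sorted(points)          # leaf region (possibly empty); lexicographic sort
        return node
    # S1-1-2: choose the edges to split and the number of parts per edge
    order = sorted(range(n), key=lambda i: b_max[i] - b_min[i], reverse=True)
    splits = {order[0]: 3, order[1]: 3} if n == 2 else \
             {order[0]: 3, order[1]: 2, order[2]: 2}
    parts = [splits.get(i, 1) for i in range(n)]
    step = [(b_max[i] - b_min[i]) / parts[i] for i in range(n)]
    # assign every point to exactly one grid cell
    def cell(p):
        return tuple(min(int((p[i] - b_min[i]) / step[i]), parts[i] - 1) if step[i] > 0 else 0
                     for i in range(n))
    buckets = {}
    for p in points:
        buckets.setdefault(cell(p), []).append(p)
    # create the h = prod(parts) sub-regions and recurse on each of them
    for combo in product(*(range(k) for k in parts)):
        lo = tuple(b_min[i] + combo[i] * step[i] for i in range(n))
        hi = tuple(b_min[i] + (combo[i] + 1) * step[i] for i in range(n))
        node["children"].append(pps_partition(buckets.get(combo, []), lo, hi, M))
    return node
```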
S1-2, fill in the information to form the unbalanced region tree (UR-Tree) index (Fig. 7):
S1-2-1: compute the distances between a data point t and the other data points in its region, find the point t' closest to t, and record the identifier tid' of that data tuple and the distance d(t, t') as, respectively, the nearest-neighbor identifier tid' and nearest-neighbor distance d_NN of data point t;
S1-2-2: centered on data point t, draw a rectangle S' with side length 2·d_NN whose sides are parallel to the region boundary; if the rectangle does not exceed the boundary of the region containing data point t, then point t' is the nearest neighbor of data point t; otherwise, execute step S1-2-3 for further checking;
S1-2-3: compute the regions that intersect rectangle S', find the data points in those regions that lie inside S', check whether any of them is the nearest neighbor of data point t, and update the nearest-neighbor identifier tid' and nearest-neighbor distance d_NN of data point t;
S1-2-4: for each region, compute the maximum of the nearest-neighbor distances d_NN of all data points in the region, denoted d_max. A code sketch of this information-filling step follows.
S2, filtering-refinement-verification, comprising the following steps (Fig. 8):
S2-1: starting from the root node, access the child nodes of the root node in order;
S2-2: for the node being accessed, compute the distance d(Q, c) between its center c and the query point Q, and the sum (r_max + d_max) of the region's maximum boundary distance r_max and maximum nearest-neighbor distance d_max;
S2-3: compare d(Q, c) with r_max + d_max; when d(Q, c) > r_max + d_max, prune the node and all of its child nodes and continue the sequential access; when d(Q, c) ≤ r_max + d_max, if the node is a leaf node, add it to the candidate region set and continue the sequential access, otherwise add its child nodes to the access list, until the sequential access is finished;
S2-4: access the data point lists contained in all regions of the candidate region set; starting from the first attribute, compare the distance d' between data point t and query point Q on each attribute with the nearest-neighbor distance d(t, t_NN); when d' > d(t, t_NN), prune the point; if d' ≤ d(t, t_NN) holds on every attribute, add the point to the candidate set;
S2-5: access all data points in the candidate set in order, compute the distance d(t, Q) between data point t and query point Q, and compare it with the nearest-neighbor distance d(t, t_NN); when d(t, Q) ≤ d(t, t_NN), data point t is an inverse nearest neighbor of query point Q and is added to the result set, otherwise data point t is discarded. A code sketch of this filter-refine-verify procedure follows.
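This is a minimal end-to-end sketch, assuming the URNode / TupleEntry records sketched after the index-structure description; the three passes S2-3, S2-4 and S2-5 are fused into a single tree traversal here for brevity, which is an implementation choice rather than the patent's exact control flow.

```python
# Filter-refine-verify query over a built UR-Tree (Lemmas 1-3 combined).
from math import dist

def rnn_query(root, q):
    """Return the tids of all inverse nearest neighbors of query point q."""
    result, stack = [], [root]
    while stack:                                   # S2-1: traverse from the root
        node = stack.pop()
        # S2-2 / S2-3: first pruning rule (Lemma 1) on the region
        if dist(q, node.center) > node.r_max + node.d_max:
            continue                               # prune this node and its whole subtree
        if node.tuples is None:                    # intermediate node: descend
            stack.extend(node.children)
            continue
        for t in node.tuples:                      # leaf node: candidate region
            # S2-4: second pruning rule (Lemma 2), dimension by dimension
            if any(abs(ti - qi) > t.d_nn for ti, qi in zip(t.values, q)):
                continue
            # S2-5: verification rule (Lemma 3)
            if dist(t.values, q) <= t.d_nn:
                result.append(t.tid)
    return result
```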
The PPS space partitioning algorithm was implemented in Microsoft Visual Studio and run on a computer with an Intel(R) Core(TM) i5-9400 CPU @ 2.90 GHz, 16 GB of memory, and the Windows 10 operating system.
The experimental data sets are the California census data set (CA): the 2020 United States census data for the state of California contains two attributes, longitude and latitude, and is therefore a two-dimensional data set; the longitude and latitude in the original data were converted into floating-point values, giving 519723 data tuples.
Forest cover type data set (FCT): the forest cover type data set describes the terrain features of 581012 forest cells, each with an area of 900 m², covering observations of trees in four areas of the Roosevelt National Forest in Colorado, USA. The data set has 55 columns, 53 of which are attributes, containing information such as elevation, slope and vegetation type.
Comparison experiment between the PPS space partitioning algorithm of the invention and the CSD algorithm on the two-dimensional data set CA:
In the experiment, subsets of 10k, 50k, 100k and 200k tuples were extracted from the CA data set; together with the complete CA data set (about 500k tuples), they form the five data sets required by the experiment. The running efficiency of the UR and CSD algorithms was tested on these five data sets of different sizes. In each run, 100 groups of two-dimensional data generated with a random function were used as the set of query points; the experiment was repeated several times, the query time and the size of the resulting candidate set were recorded, and the average query time per query point and the average candidate set size were finally computed. In the experiments, the threshold for building the UR-Tree index structure was set to M = 100; since the data set is two-dimensional, at each split the region boundary in both dimensions is divided into 3 equal parts, forming 9 small regions.
Figure 9 shows the time cost of running the two RNN algorithms on data sets of different sizes, comparing the pruning time and verification time of each algorithm. As shown in Fig. 9, the influence of data set size on RNN query time cost is very limited: both algorithms are insensitive to changes in data set size and do not vary greatly with it. In addition, the UR algorithm runs more efficiently and at a lower time cost than the CSD algorithm. In the time consumed by the CSD and VR algorithms, verification accounts for about 8%-10% of the total; in the UR algorithm, verification accounts for only about 2%-4% of the total, so the verification time required by the UR algorithm is greatly reduced and the total time consumed is very small, because no complex nearest-neighbor computation is needed during verification, only low-complexity distance computations. Fig. 10 reflects the relationship between the candidate set sizes formed by the algorithms and the data set size in the experiment: neither grows with the data set size and both remain at a stable size, and the candidate set obtained by the UR algorithm is smaller than those of the other two algorithms, which reduces the number of data tuples that must be verified and, in turn, the time consumed by the verification process.
Experiments of the invention on the FCT data set:
In this experiment, inverse nearest neighbor queries were performed with the UR algorithm on the FCT data set at different dimensionalities. In each run, 100 groups of data were generated with a random function, the query time and the size of the candidate set formed by filtering were recorded, and the average query time and candidate set size were computed over multiple runs. The threshold for building the UR-Tree index structure was set to M = 100. When the data set is two-dimensional, the region boundary in both dimensions is divided into 3 equal parts at each split, forming 9 small regions; when the dimension of the data set is greater than two, the three longest edges of the region boundary are found at each split and divided into 3, 2 and 2 equal parts respectively, forming 12 small regions.
Fig. 11 shows the relationship between the time cost of the UR algorithm and the dimension of the data set when performing inverse nearest neighbor queries on the n-dimensional data set. As can be seen from Fig. 11, the time cost of the UR algorithm increases with the dimension of the data set; since the main work of the pruning process is distance computation, the time cost shows a nearly linear growth trend as the dimension increases, and the verification process likewise grows nearly linearly with the dimension. In addition, the pruning process accounts for the bulk of the total time of the inverse nearest neighbor query algorithm, more than 99%, while verification takes less than 1%. Fig. 12 shows the relationship between the size of the candidate set formed after pruning by the UR algorithm and the dimension of the data set: as the dimension increases, the candidate set size also gradually increases, growing nearly linearly on low-dimensional data and flattening out gradually on medium- and high-dimensional data.

Claims (6)

1. An unbalanced area tree index structure, characterized in that a PPS space partitioning algorithm is used to divide the space into different regions, and that, besides the information of the data tuples it contains, the index structure also contains the feature information of the regions.
2. The unbalanced region tree index structure of claim 1, wherein the PPS space partitioning algorithm is used to partition the data space such that the number of data tuples contained in each region is not greater than a threshold M set on the number of data points per region; the nodes of the index structure contain the information of the data tuples and the region feature information of the region; the region feature information includes the position of the region center point, the maximum distance from the center point to the region boundary, and the maximum of the nearest-neighbor distances of all data tuples contained in the region.
3. A method for constructing an unbalanced region tree index structure, characterized in that a PPS space partitioning algorithm is used to divide the whole data space into several regions, requiring that the number of data points in each region does not exceed a set threshold M (empty regions included), so as to build the unbalanced region tree index structure, each node representing a region S; the index structure contains the following information:
information contained in an intermediate node: the center point c of region S, the maximum value b_max and minimum value b_min of the boundary of region S, the maximum distance r_max from the center point c to the region boundary, the number of contained data tuples tuple_num, and the maximum value d_max of the distances between each data tuple in the region and its nearest neighbor;
structure of the data tuple list: all data tuples are ordered by a chosen attribute, which is required to have the greatest discrimination; each entry stores the attribute values (A_1, ..., A_n), tid (the identifier of the data tuple), tid' (the identifier of the tuple's nearest neighbor), and d_NN (the distance between the data tuple and its nearest neighbor).
4. An n-dimensional space inverse nearest neighbor query algorithm based on an unbalanced area tree index structure, characterized by a filtering rule and its algorithm: given a query point Q, if the distance between Q and the region center c is greater than the sum of the maximum distance from the region's center point to its boundary and the region's maximum nearest-neighbor distance, the region can be pruned, and otherwise it is kept as a candidate region; and by a refinement rule and its algorithm: if the distance between a data tuple t and the query point Q in some dimension is greater than the nearest-neighbor distance of that data tuple, the tuple can be pruned; when the distance between the data tuple t and the query point Q in every dimension is smaller than the nearest-neighbor distance of the data tuple, the data tuple is taken as one of the candidate points.
5. An n-dimensional space inverse nearest neighbor query algorithm based on an unbalanced area tree index structure is characterized by comprising the following steps:
step 1, dividing a data space and establishing an unbalanced area tree index structure;
step 2, filtering all nodes of the index tree according to a filtering rule, finding the regions containing candidate tuples, and refining those regions according to a refinement rule to form the final candidate set;
and step 3, verifying the data tuples contained in the candidate set to obtain the final result set.
6. An n-dimensional space inverse nearest neighbor query algorithm based on an unbalanced area tree index structure, characterized by comprising the following steps:
S1, building the unbalanced region tree index structure:
S1-1, using the PPS space partitioning algorithm to partition the space and build the tree:
S1-1-1: computing the coordinate of the center point c of region S in the i-th dimension as [(b_max^(i) - b_min^(i)) / 2], computing the maximum distance r_max from the center point c to the boundary of region S, and judging whether the number of data points contained in the region is greater than the threshold M set on the number of data points per small region;
S1-1-2: if the number of data points contained in region S is not greater than the set threshold M, forming a list of the contained data points and sorting it by primary key; if the number of data points contained in region S is greater than the set threshold M, finding the k longest edges e_1, e_2, ..., e_k of S, dividing each edge into n_1, n_2, ..., n_k equal parts so that region S is divided into h = n_1·n_2·...·n_k sub-regions S_i (i = 1, 2, ..., h), inserting the sub-regions into the index tree as child nodes of region S, and recursively calling the algorithm with every sub-region S_i (i = 1, ..., h) as a region to be divided;
when the data space is two-dimensional, k takes the value 2 and the 2 edges are each divided into 3 equal parts; when the dimension of the data space exceeds two, k can take the value 3 and the 3 longest edges are divided into 2, 2 and 3 equal parts, respectively;
S1-2, filling in the information to form the unbalanced region tree index:
S1-2-1: computing the distances between a data point t and the other data points in its region, finding the point t' closest to t, and recording the identifier tid' of its data tuple and the distance d(t, t') as, respectively, the nearest-neighbor identifier tid' and nearest-neighbor distance d_NN of data point t;
S1-2-2: centered on data point t, drawing a rectangle S' with side length 2·d_NN parallel to the region boundary; if the rectangle does not exceed the boundary of the region containing data point t, then point t' is the nearest neighbor of data point t; otherwise executing step S1-2-3 for further checking;
S1-2-3: computing the regions intersecting rectangle S', finding the data points in those regions that lie inside S', checking whether any of them is the nearest neighbor of data point t, and updating the nearest-neighbor identifier tid' and nearest-neighbor distance d_NN of data point t;
S1-2-4: computing, for each region, the maximum of the nearest-neighbor distances d_NN of all data points in the region, denoted d_max;
S2, filtering, refinement and verification:
S2-1: starting from the root node, accessing the child nodes of the root node in order;
S2-2: computing the distance d(Q, c) between the center c of the accessed node and the query point Q, and the sum (r_max + d_max) of the region's maximum boundary distance r_max and maximum nearest-neighbor distance d_max;
S2-3: comparing d(Q, c) with r_max + d_max; when d(Q, c) > r_max + d_max, pruning the node and all of its child nodes and continuing the sequential access; when d(Q, c) ≤ r_max + d_max, if the node is a leaf node, adding it to the candidate region set and continuing the sequential access, otherwise adding its child nodes to the access list, until the sequential access is finished;
S2-4: accessing the data point lists contained in all regions of the candidate region set; starting from the first attribute, comparing the distance d' between data point t and query point Q on each attribute with the nearest-neighbor distance d(t, t_NN); when d' > d(t, t_NN), pruning the point; if d' ≤ d(t, t_NN) holds on every attribute, adding the point to the candidate set;
S2-5: accessing all data points in the candidate set in order, computing the distance d(t, Q) between data point t and query point Q, and comparing it with the nearest-neighbor distance d(t, t_NN); when d(t, Q) ≤ d(t, t_NN), data point t is an inverse nearest neighbor of query point Q and is added to the result set, otherwise data point t is discarded.
CN202210539416.5A 2022-05-18 2022-05-18 Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm Pending CN114896249A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210539416.5A CN114896249A (en) 2022-05-18 2022-05-18 Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210539416.5A CN114896249A (en) 2022-05-18 2022-05-18 Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm

Publications (1)

Publication Number Publication Date
CN114896249A true CN114896249A (en) 2022-08-12

Family

ID=82723649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210539416.5A Pending CN114896249A (en) 2022-05-18 2022-05-18 Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm

Country Status (1)

Country Link
CN (1) CN114896249A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024098763A1 (en) * 2022-11-08 2024-05-16 苏州元脑智能科技有限公司 Text operation diagram mutual-retrieval method and apparatus, text operation diagram mutual-retrieval model training method and apparatus, and device and medium
CN116541420A (en) * 2023-07-07 2023-08-04 上海爱可生信息技术股份有限公司 Vector data query method
CN116541420B (en) * 2023-07-07 2023-09-15 上海爱可生信息技术股份有限公司 Vector data query method

Similar Documents

Publication Publication Date Title
Abbasifard et al. A survey on nearest neighbor search methods
Manolopoulos et al. R-Trees: Theory and Applications: Theory and Applications
Kao et al. Clustering uncertain data using voronoi diagrams and r-tree index
Angiulli et al. Outlier mining in large high-dimensional data sets
Traina et al. Fast indexing and visualization of metric data sets using slim-trees
Lee et al. Computational geometry—a survey
Nagpal et al. Review based on data clustering algorithms
Taot et al. Indexing multi-dimensional uncertain data with arbitrary probability density functions
Dai et al. Probabilistic spatial queries on existentially uncertain data
Ester et al. Clustering for mining in large spatial databases
CN114896249A (en) Unbalanced area tree index structure and n-dimensional space inverse nearest neighbor query algorithm
US20030004938A1 (en) Method of storing and retrieving multi-dimensional data using the hilbert curve
Qi et al. Theoretically optimal and empirically efficient r-trees with strong parallelizability
KR102005343B1 (en) Partitioned space based spatial data object query processing apparatus and method, storage media storing the same
Kollios et al. Indexing mobile objects using dual transformations
Pola et al. The NOBH-tree: Improving in-memory metric access methods by using metric hyperplanes with non-overlapping nodes
Lamrous et al. Divisive hierarchical k-means
Ali et al. Probabilistic voronoi diagrams for probabilistic moving nearest neighbor queries
KR101994871B1 (en) Apparatus for generating index to multi dimensional data
Papadias et al. Constraint-based processing of multiway spatial joins
Agarwal et al. Time responsive external data structures for moving points
Lin et al. Finding optimal region for bichromatic reverse nearest neighbor in two-and three-dimensional spaces
Karlsen et al. Qualitatively correct bintrees: an efficient representation of qualitative spatial information
Wang Nearest neighbor query processing using the network voronoi diagram
Mao et al. On data partitioning in tree structure metric-space indexes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination