CN105868414A - Clustering separation distributive indexing method

Info

Publication number: CN105868414A
Application number: CN201610287204.7A
Authority: CN
Inventors: 袁鑫攀; 汪灿飞; 何频捷; 梁圣; 满君丰; 向平; 向一平
Original assignee: Hunan University of Technology
Current assignee: Hunan University of Technology
Priority date: 2016-05-03
Filing date: 2016-05-03
Publication date: 2016-08-17
Anticipated expiration: 2036-05-03
Also published as: CN105868414B

Abstract

The invention discloses a distributed indexing method based on cluster separation, called CS-Chord (Clustering Separation-Chord) for short. In the M-Chord distributed index, the vectors at the margin of a cluster are generally sparse, and these sparse vectors make the radius of each cluster large. During a range query, clusters with a larger radius intersect the query region more easily, which enlarges the area to be searched. The marginal vectors of a cluster are also usually the most frequently accessed, which degrades performance further. CS-Chord separates the sparse vectors at the cluster margins and stores them centrally on an independent server, while the dense vectors are stored in the Chord ring. During a search, the high-frequency queries are concentrated on the vectors held by the independent server and the search range on the Chord ring is reduced, so retrieval efficiency improves.

Description

A distributed indexing method based on cluster separation

Technical field

The present invention relates to the field of distributed indexing, and more particularly to a distributed indexing method based on cluster separation.

Background art

A P2P (peer-to-peer) network does not depend on a dedicated centralized server: all nodes in the network are equal and interconnect freely, sharing computer resources and services by direct exchange. A P2P distributed index structure makes full use of the capacity of every node in the network and offers good scalability and high resource utilization. In recent years, distributed indexing has become a focus of research. The M-Chord algorithm proposed by Novak, D. et al. is a distributed indexing algorithm for similarity retrieval of high-dimensional vectors over a P2P network. It combines the IDistance algorithm with the Chord protocol: IDistance reduces the dimensionality of the high-dimensional vectors, and Chord stores and retrieves the distributed vectors.

Chord is a structured distributed lookup protocol that uses DHT technology to locate resources quickly in a P2P network. To make lookups fast, each node on the Chord ring maintains a routing table of length O(log₂ n), where n is the total number of nodes in the ring. In the Chord protocol, nodes and data are both mapped to m-bit identifiers in the same identifier space, and virtual nodes are introduced so that each node stores roughly the same amount of data; that is, the Chord protocol is load-balanced. The routing tables are decentralized: each node only needs routing information for a small number of other nodes, and the query path is obtained by forwarding the query from node to node. A single query generates only O(log₂ n) messages in the ring.

A distributed hash table (DHT) assigns each node in the network a unique identifier according to a fixed rule. In the Chord protocol, data resources are assigned unique identifiers by the same rule. Chord uses a consistent hashing algorithm (Consistent Hash) to map nodes and resources; the hash result is reduced modulo 2^m to obtain an m-bit identifier in the range [0, 2^m-1]. For a node, the IP address is unique, so the node identifier is obtained by hashing the node's IP address. For data, the identifier is obtained by hashing the key value. A Chord ring (M=2, N₄) is shown in Fig. 1(a), where Ni denotes a node and Ki denotes a resource.
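The identifier mapping described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the choice of SHA-1 as the hash, the function name chord_id, and the sample address and resource key are all assumptions made for the example.

```python
import hashlib

def chord_id(key: str, m: int) -> int:
    """Map a node address or a resource key to an m-bit identifier in [0, 2^m - 1].

    SHA-1 is an assumed hash function; the text only requires a consistent hash
    whose result is reduced modulo 2^m.
    """
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

# Nodes are identified by hashing their IP address, data by hashing its key.
node_id = chord_id("192.168.0.7", m=6)       # hypothetical node address
resource_id = chord_id("image-0042", m=6)    # hypothetical resource key
```

Because nodes and resources share one identifier space, a resource can be assigned to a node purely by comparing identifiers on the ring.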

Fast resource location relies mainly on the routing information kept by each node. The data structure of each node contains a routing table that stores the data and address information of a subset of the nodes, as shown in Fig. 1(b).

A Chord lookup proceeds in the following steps:

(1) A node N receives the key value key to be looked up and first checks whether its local resources contain this key value. If they do, the lookup ends and the resource is returned; otherwise go to step (2).

(2) Check the pointer table (routing table) of the requested node, find the node whose identifier is less than the mapped key value and closest to it, send the lookup request to that node, and repeat step (1).
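The two lookup steps above can be sketched on a small in-memory ring. This is a simplified illustration under stated assumptions: the class Node, the helpers build_ring and lookup, the ring size M = 4, and the node identifiers are all hypothetical, and nodes call each other directly instead of exchanging network messages.

```python
from bisect import bisect_left

M = 4                 # identifier bits (illustrative); the ring has 2**M positions
RING = 2 ** M

class Node:
    def __init__(self, ident):
        self.id = ident
        self.store = {}     # local resources: identifier -> value
        self.fingers = []   # routing ("pointer") table: list of Node

def in_open_interval(x, a, b):
    """True if x lies strictly between a and b going clockwise on the ring."""
    if a < b:
        return a < x < b
    return x > a or x < b

def build_ring(ids):
    """Create the nodes and fill each routing table with successor(id + 2**k)."""
    nodes = [Node(i) for i in sorted(ids)]
    node_ids = [n.id for n in nodes]

    def successor(ident):
        return nodes[bisect_left(node_ids, ident % RING) % len(nodes)]

    for n in nodes:
        n.fingers = [successor(n.id + 2 ** k) for k in range(M)]
    return nodes, successor

def lookup(node, key, hops=0):
    # Step (1): answer locally if this node holds the key.
    if key in node.store:
        return node.store[key], hops
    # Step (2): forward to the known node that precedes the key most closely.
    for finger in reversed(node.fingers):
        if in_open_interval(finger.id, node.id, key):
            return lookup(finger, key, hops + 1)
    # No known node precedes the key, so the immediate successor is responsible.
    return node.fingers[0].store.get(key), hops + 1

nodes, successor = build_ring([1, 4, 9, 11, 14])
successor(7).store[7] = "resource-7"      # key 7 is held by its successor node
value, hops = lookup(nodes[0], 7)         # start the lookup at node 1
```

Each forwarding step moves strictly closer to the key on the ring, which is what bounds the number of messages per lookup.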

IDistance is a high-dimensional vector indexing method based on metric space. The basic idea of index construction is as follows. Several anchor points are chosen in the data space, each corresponding to one cluster subset, and every data point is assigned to the cluster subset of its nearest anchor point. Each high-dimensional vector is then converted into a one-dimensional key value iDist derived from its distance to the anchor point, and a B⁺-Tree organizes and manages the iDist key values of all vectors. The key value is computed as iDist(x) = dist(P_i, x) + i*c. As shown in Fig. 2, P₀, P₁ and P₂ are anchor points; C_i is the distance from P_i to the farthest data point in its data subset, i.e. the radius of the data subset of P_i; and c is a constant greater than every C_i.
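The key computation iDist(x) = dist(P_i, x) + i*c can be sketched as below. Euclidean distance, the function name idist_key, and the sample anchors are assumptions for illustration; the method itself only requires a metric.

```python
import math

def idist_key(x, anchors, c):
    """Return the one-dimensional key of x: distance to its nearest anchor plus i*c."""
    # i is the index of the nearest anchor point, i.e. the cluster that x belongs to.
    i = min(range(len(anchors)), key=lambda j: math.dist(anchors[j], x))
    return i * c + math.dist(anchors[i], x)

anchors = [(0.0, 0.0), (10.0, 0.0)]   # hypothetical anchor points P_0, P_1
c = 100.0                             # constant larger than every cluster radius
key_a = idist_key((3.0, 4.0), anchors, c)   # nearest anchor is P_0, distance 5
key_b = idist_key((9.0, 0.0), anchors, c)   # nearest anchor is P_1, distance 1
```

Because c exceeds every cluster radius, the keys of cluster i all fall inside [i*c, (i+1)*c), so the clusters occupy disjoint stretches of the one-dimensional axis.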

Let the full data set be D. A similarity range query Range(q, r) retrieves the set of data points whose distance to the query point q is less than the radius r: Range(q, r) = {x ∈ D, dist(q, x) < r}, where dist(q, x) denotes the distance from the vector q to the data point x.

The IDistance retrieval process is:

(1) Using the distance to each anchor point P_i, determine whether the search sphere of q intersects the data subset of P_i.

The condition for intersection is: dist(q, P_i) < C_i + r

The condition for no intersection is: dist(q, P_i) > C_i + r

(2) If they do not intersect, the data subset of this anchor point contains no target points. If they intersect, determine the annular region to be searched:

{x ∈ P_i, max(dist(P_i, q) - r, 0) < dist(P_i, x) < min(dist(P_i, q) + r, C_i)}

(3) Determine the search range of the one-dimensional key value iDist so that the B⁺-Tree can be searched quickly; the data points found enter the candidate set. The search range of iDist is:

{x ∈ P_i, i*c + max(dist(P_i, q) - r, 0) < iDist(x) < i*c + min(dist(P_i, q) + r, C_i)}

(4) Compute the distance between q and each data point in the candidate set; points whose distance is less than r enter the final result set.
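The four retrieval steps above can be sketched as follows. This is an illustration under assumptions: a sorted Python list searched with bisect stands in for the B⁺-Tree, distances are Euclidean, and the names build_index and range_query are hypothetical. The candidate key interval is taken as closed, which is a superset of the open interval in the text and is filtered exactly in step (4).

```python
import math
from bisect import bisect_left, bisect_right

def build_index(points, anchors, c):
    """Return the sorted (key, point) entries and the cluster radii C_i."""
    radii = [0.0] * len(anchors)
    entries = []
    for x in points:
        i = min(range(len(anchors)), key=lambda j: math.dist(anchors[j], x))
        d = math.dist(anchors[i], x)
        radii[i] = max(radii[i], d)
        entries.append((i * c + d, x))
    entries.sort()
    return entries, radii

def range_query(q, r, entries, anchors, radii, c):
    keys = [k for k, _ in entries]
    result = []
    for i, p in enumerate(anchors):
        d = math.dist(p, q)
        if d >= radii[i] + r:            # step (1): search sphere misses this cluster
            continue
        lo = i * c + max(d - r, 0.0)     # steps (2)-(3): annulus mapped to a key range
        hi = i * c + min(d + r, radii[i])
        for _, x in entries[bisect_left(keys, lo):bisect_right(keys, hi)]:
            if math.dist(q, x) < r:      # step (4): exact distance check
                result.append(x)
    return result

anchors = [(0, 0), (10, 0)]
points = [(1, 0), (2, 0), (9, 0), (10, 1), (4, 0)]
entries, radii = build_index(points, anchors, c=100)
hits = range_query((1.5, 0), 1.0, entries, anchors, radii, c=100)
```

Only the entries whose keys fall in the computed interval are compared with q, which is where the saving in distance computations comes from.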

By choosing anchor points, IDistance reduces the indexing problem for high-dimensional vectors to one dimension. The one-dimensional index is organized by a B⁺-Tree, so searching is fast and a large number of distance computations are avoided.

M-Chord (M stands for Metric), proposed by Novak, D. et al., is a distributed indexing algorithm for metric spaces. In a distributed P2P network it not only locates resources (exact-match lookup) but also supports similarity search (range search). The algorithm converts each high-dimensional vector into a one-dimensional key value with IDistance, maps the key value into the Chord identifier space with a hash function, and inserts and retrieves the data through the Chord ring, as shown in Fig. 3. IDistance handles the dimensionality reduction and Chord handles the distributed storage, which makes similarity search over high-dimensional vectors possible in a distributed environment.

When a node of the M-Chord algorithm receives a range query Range(Q, r), where Q is the query vector and r is the query radius, the process is as follows.

(1) Use IDistance to compute the regions where Range(Q, r) intersects the clusters, and map them to several key-value intervals [xi, yi].

(2) Hash xi and yi with the position-preserving hash function h to obtain the key-value range [h(xi), h(yi)] on the Chord ring. Locate the node holding key value h(xi) through the routing table. If h(yi) is greater than Key_max, the largest key value stored on that node, send the range [Key_max, h(yi)] to the node's successor. If h(yi) is also greater than the successor's Key_max, keep forwarding the query to the next successor.

(4) Each node (including the server and the nodes in the Chord ring) that receives the query request searches its B⁺-Tree for vectors within the key-value range. Any vector found is compared with the query vector Q, and if the distance is less than r it is returned to the node that originally issued the request.

The marginal vectors of a cluster in M-Chord are generally sparse, and these sparse vectors make the radius of each cluster large. During a range query, the larger the radius, the more easily the cluster intersects the query region, so the region to be searched grows. Whenever the query region intersects a cluster, a resource-location operation must be performed in the Chord ring, no matter how little data lies in the intersection. This small amount of data therefore greatly increases the number of resource-location operations in the Chord ring and reduces the performance of M-Chord.

Fig. 4 shows the data distribution of one cluster obtained by K-means clustering of the color-histogram features of 68,040 images. The radius of this cluster is 0.62, yet most of the data lies between 0.09 and 0.35. A very small amount of marginal data roughly doubles the radius of the cluster.

Fig. 5 shows the data access frequency for 1000 random range searches in this cluster space. Because a range query cannot know in advance whether the queried range contains data, intervals that hold no data are also accessed. Comparing Fig. 4 and Fig. 5 shows that the marginal vectors are sparse but are accessed very frequently, with an access frequency substantially above 80%.

Summary of the invention

To overcome at least one of the defects (deficiencies) of the prior art described above, the present invention provides a distributed indexing method based on cluster separation, CS-Chord (Clustering Separation-Chord). This indexing method reduces the search range on the Chord ring and improves retrieval efficiency.

To solve the above technical problem, the technical solution of the present invention is as follows:

A distributed indexing method based on cluster separation, comprising the following steps:

Step 1: separate the sparse edge vectors and store them centrally on an independent server;

Step 2: build the distributed index. Compute the one-dimensional key value Key(S) of a vector S to be added to the Chord ring and insert the vector into the distributed index. The insertion proceeds as follows:

(21) If Key(S) ≥ n*C, where n is the number of cluster subspaces and C is a constant greater than every value obtained when the vectors in the annular regions of the IDistance index structure are mapped onto the one-dimensional axis, send the key value Key(S) and the vector S to the independent server and insert S into the B⁺-Tree index of that server; the insertion of the new vector is then complete. If Key(S) < n*C, go to step (22);

(22) Hash Key(S) with the position-preserving hash function to obtain the key value Key_Chord assigned on the Chord ring. Use the Chord location algorithm to find the IP address of the node that should store Key_Chord, send Key_Chord and the vector S to that node, and insert S into the B⁺-Tree index of the node; index construction is then complete;

Step 3: perform range queries on the constructed index. For a CS-Chord range query Range(Q, r), where Q is the query vector and r is the query radius, the steps are as follows:

(31) Use IDistance to compute the regions where Range(Q, r) intersects the clusters, and map them to several key-value intervals [xi, yi];

(32) If xi ≥ n*C, send the intersection information computed in step (31) to the independent server and go to step (34); if xi < n*C, go to step (33);

(33) Generate the key-value range [h(xi), h(yi)] on the Chord ring and locate the node holding key value h(xi) through the routing table. If h(yi) is greater than Key_max, the largest key value stored on that node, send the range [Key_max, h(yi)] to the node's successor; if h(yi) is still greater than the successor's Key_max, keep forwarding the query to the next successor;

(34) Each node that receives the query request searches its B⁺-Tree (which stores the vectors S that satisfy the corresponding condition) for vectors within the key-value range. If a vector Z exists, compute its distance to the query vector Q: if the distance is less than the query radius r, return Z to the node that originally issued the request; if the distance is greater than or equal to r, return a null value. If no vector exists, also return a null value.

Preferably, the detailed process of Step 1, separating the sparse edge vectors and storing them centrally on an independent server, is:

Let R_b denote the boundary between the dense vectors and the sparse edge vectors of a cluster, and let R be the radius of the cluster. The region at distance [0, R_b] from the cluster center is the dense vector region, and the region [R_b, R] is the sparse vector region;

With n cluster spaces, the key value Key(S) for the sparse vector region is computed by adding an offset of n*C to the original value. Let S be a vector to be added to the Chord ring and P the center point of the cluster containing S. The key value Key(S) of S is then computed as follows:

$$\mathrm{Key}(S) = \begin{cases} i \cdot C + \mathrm{dist}(S, P), & \mathrm{dist}(S, P) \leq R_b \\ (i + n) \cdot C + \mathrm{dist}(S, P), & \mathrm{dist}(S, P) > R_b \end{cases} \tag{1}$$

where 0 ≤ i < n;

Formula (1) separates the sparse vectors, which are stored centrally on an independent server. When the key value satisfies Key(S) ≥ n*C, the centralized server is queried directly.
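Formula (1) and the routing test Key(S) ≥ n*C can be sketched as below. The sketch assumes Euclidean distance, a single boundary R_b shared by all clusters, and that i is the index of the nearest anchor; the names cs_chord_key and goes_to_server are hypothetical.

```python
import math

def cs_chord_key(s, anchors, c, r_b):
    """Formula (1): dense vectors keep the IDistance key, sparse ones are shifted by n*C."""
    n = len(anchors)
    i = min(range(n), key=lambda j: math.dist(anchors[j], s))  # cluster containing s
    d = math.dist(anchors[i], s)
    if d <= r_b:
        return i * c + d            # dense region: key below n*C
    return (i + n) * c + d          # sparse region: key at or above n*C

def goes_to_server(key, n, c):
    """Sparse keys are stored on the independent server, dense keys on the Chord ring."""
    return key >= n * c

anchors = [(0.0, 0.0), (10.0, 0.0)]   # hypothetical anchor points
dense_key = cs_chord_key((1.0, 0.0), anchors, c=100.0, r_b=2.0)
sparse_key = cs_chord_key((10.0, 3.0), anchors, c=100.0, r_b=2.0)
```

With these sample values the first vector lies within R_b of its anchor and stays below n*C, while the second lies outside R_b and is shifted into the range served by the independent server.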

Preferably, the position-preserving hash function h in Step 2 is defined as follows:

A data interval [X_min, X_max] is to be mapped to the interval [Y_min, Y_max] while preserving the consistency (order) of the data. Suppose X_i ∈ [X_min, X_max] and its value after mapping by h is Y_i ∈ [Y_min, Y_max]. The hash function is then defined as:

$$h(X_i) = Y_{\min} + \frac{(Y_{\max} - Y_{\min}) \cdot (X_i - X_{\min})}{X_{\max} - X_{\min}} \tag{2}$$

Because the key-value interval of the CS-Chord index is [0, K_max] and the identifier space of the Chord ring is [0, 2^m-1], substituting X_min=0, X_max=K_max, Y_min=0, Y_max=2^m-1 into formula (2) gives:

$$h(K_i) = \frac{(2^m - 1) \cdot K_i}{K_{\max}} \tag{3}$$

Note that the position-preserving hash function is needed for the following reason. An ordinary mapping first hashes the key value of a data point and then takes the result modulo 2^m as the key value on the Chord ring. This maps originally adjacent data points to different nodes, which is unsuitable for an index algorithm such as IDistance that needs to search contiguous ranges. A position-preserving hash function is therefore required so that the order of the data is kept.
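Formula (3) can be sketched as a small function. Rounding the result down to an integer identifier is an assumption added here, since the formula itself yields a real value; the name position_hash and the sample values are hypothetical.

```python
def position_hash(k, k_max, m):
    """Formula (3): map a key in [0, k_max] linearly onto [0, 2^m - 1], preserving order."""
    if not 0 <= k <= k_max:
        raise ValueError("key outside [0, k_max]")
    # Truncation to an integer identifier is an assumption of this sketch.
    return int((2 ** m - 1) * k / k_max)

keys = [0.0, 12.5, 150.0, 299.0, 300.0]              # hypothetical key values, sorted
ids = [position_hash(k, k_max=300.0, m=6) for k in keys]
```

Because the map is monotonic, keys that are adjacent on the one-dimensional axis land on the same node or on neighbouring nodes, so a key interval [xi, yi] corresponds to one contiguous arc [h(xi), h(yi)] of the ring.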

Compared with the prior art, the beneficial effects of the technical solution of the present invention are:

The present invention proposes a distributed indexing method based on cluster separation, CS-Chord (Clustering Separation-Chord) for short. In the M-Chord distributed index, the marginal vectors of a cluster are generally sparse, and these sparse vectors make the radius of each cluster large. During a range query, clusters with a larger radius intersect the query region more easily, so the candidate search region grows. The marginal vectors of a cluster are also usually the most frequently accessed, which reduces performance further. CS-Chord separates the sparse vectors at the cluster edges and stores them centrally on an independent server, while the dense vectors are stored in the Chord ring. During a search, the high-frequency queries are concentrated on the vectors held by the independent server and the search range on the Chord ring is reduced, so retrieval efficiency improves.

Brief description of the drawings

Fig. 1 is a schematic diagram of Chord.

Fig. 2 is a schematic diagram of IDistance.

Fig. 3 is a schematic diagram of M-Chord.

Fig. 4 is the data distribution of a cluster space.

Fig. 5 is the access frequency plot of a randomly selected cluster.

Fig. 6 is a schematic diagram of cluster edge separation in a two-dimensional space.

Fig. 7 is a schematic diagram of the CS-Chord index.

Fig. 8 is a schematic diagram of a CS-Chord range search.

Detailed description of the invention

The drawings are for illustrative purposes only and shall not be construed as limiting this patent. To better illustrate the embodiment, some parts in the drawings may be omitted, enlarged, or reduced and do not represent the dimensions of the actual product.

Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings. The technical solution of the present invention is further described below with reference to the drawings and the embodiment.

The distributed indexing method based on cluster separation (CS-Chord) of the present invention separates the sparse vectors at the cluster edges and stores them centrally on an independent server, while the dense vectors are stored in the Chord ring. During a search, the high-frequency queries are concentrated on the sparse vectors held by the independent server, and the search range on the Chord ring is reduced, so retrieval efficiency improves. The specific steps are as follows:

Step 1: separation of the sparse edge vectors

Marginal data is sparse but accessed frequently, so it should be stored centrally, which eliminates the time spent locating resources in the Chord ring. To achieve this, the sparse edge data must first be separated from the dense data.

Let R_b denote the boundary between the dense vectors and the sparse vectors of a cluster, and let R be the radius of the cluster. The region at distance [0, R_b] from the cluster center is the dense data region, and the region [R_b, R] is the sparse data region. As shown in Fig. 6, the data of a two-dimensional space is divided into three cluster spaces; the dark gray part of each cluster is the dense data region and the light gray part is the sparse data region. The key values of the dense regions fall in [0, 3C], and those of the sparse regions fall in [3C, 6C].

Suppose there are n cluster spaces in total. The key value (Key) of the sparse data region is computed by adding an offset of n*C to the original value. Let S be a vector to be added to the Chord ring and P the center point (anchor point) of the cluster containing S. The key value Key(S) of S is then computed as follows:

$$\mathrm{Key}(S) = \begin{cases} i \cdot C + \mathrm{dist}(S, P), & \mathrm{dist}(S, P) \leq R_b \\ (i + n) \cdot C + \mathrm{dist}(S, P), & \mathrm{dist}(S, P) > R_b \end{cases} \tag{1}$$

where 0 ≤ i < n.

Formula (1) separates the sparse data and places it in one contiguous region of the key space. Consequently, even if all of this data were placed in the Chord ring, it would not cause a large number of resource-location operations. However, because the sparse data is accessed frequently, this patent stores it centrally on an independent server. When the query range satisfies Key(S) ≥ n*C, the centralized server is queried directly.
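As a worked illustration of formula (1) for the three-cluster case of Fig. 6 (n = 3), the key ranges can be written out explicitly. The sketch assumes a single boundary R_b shared by all clusters and R_b < R < C; both are assumptions made here for the illustration.

```latex
\begin{aligned}
\text{dense region of cluster } i:&\quad \mathrm{Key}(S) \in [\,iC,\; iC + R_b\,] \subset [0,\, 3C), \\
\text{sparse region of cluster } i:&\quad \mathrm{Key}(S) \in (\,(i+3)C + R_b,\; (i+3)C + R\,] \subset [3C,\, 6C),
\qquad i = 0, 1, 2.
\end{aligned}
```

Every sparse key is therefore at least n*C = 3C, which is exactly the test used to route a vector or a query interval to the independent server.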

Step 2: building the distributed index

A node computes the one-dimensional key value Key of a vector S by formula (1). The vector is inserted into the distributed index as follows:

(21) If Key ≥ n*C, send the key value Key and the vector S to the independent server and insert them into the B⁺-Tree index of the server; the insertion of the new vector is then complete. If Key < n*C, go to step (22).

(22) Hash Key with the position-preserving hash function to obtain the key value Key_Chord assigned on the Chord ring. Use the Chord location algorithm to find the IP address of the node that should store Key_Chord, send the data point to that node, and insert it into the B⁺-Tree index of that node; index construction is then complete.

The position-preserving hash function h is defined as follows:

A data interval [X_min, X_max] is to be mapped to the interval [Y_min, Y_max] while preserving the consistency (order) of the data. Suppose X_i ∈ [X_min, X_max] and its value after mapping by h is Y_i ∈ [Y_min, Y_max]. The hash function can then be defined as:

$$h(X_i) = Y_{\min} + \frac{(Y_{\max} - Y_{\min}) \cdot (X_i - X_{\min})}{X_{\max} - X_{\min}} \tag{2}$$

Substituting X_min=0, X_max=K_max, Y_min=0, Y_max=2^m-1 into formula (2) gives:

$$h(K_i) = \frac{(2^m - 1) \cdot K_i}{K_{\max}} \tag{3}$$

Fig. 7 is a schematic diagram of building the CS-Chord distributed index in a two-dimensional space.
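The insertion procedure of Step 2 can be sketched end to end as below. Several simplifications are assumptions of this sketch: a sorted list stands in for each B⁺-Tree, the responsible ring node is found by a direct successor search instead of the Chord location algorithm, K_max is taken as n*C, and the names Store and insert_vector are hypothetical.

```python
import math
from bisect import insort

class Store:
    """Stand-in for one B+-Tree: a sorted list of (key, vector) pairs."""
    def __init__(self):
        self.entries = []

    def insert(self, key, vector):
        insort(self.entries, (key, vector))

def insert_vector(s, anchors, c, r_b, m, server, ring_nodes):
    """Insert s and return 'server' or the identifier of the ring node that stored it."""
    n = len(anchors)
    i = min(range(n), key=lambda j: math.dist(anchors[j], s))
    d = math.dist(anchors[i], s)
    key = i * c + d if d <= r_b else (i + n) * c + d      # formula (1)
    if key >= n * c:                                      # step (21): sparse vector
        server.insert(key, s)
        return "server"
    key_chord = int((2 ** m - 1) * key / (n * c))         # formula (3), K_max = n*C assumed
    # Step (22): first node whose identifier is >= key_chord, wrapping around the ring.
    node_id = next((nid for nid in sorted(ring_nodes) if nid >= key_chord),
                   min(ring_nodes))
    ring_nodes[node_id].insert(key_chord, s)
    return node_id

anchors = [(0.0, 0.0), (10.0, 0.0)]
server = Store()
ring = {16: Store(), 32: Store(), 48: Store(), 63: Store()}   # hypothetical node ids
placed = [insert_vector(v, anchors, 100.0, 2.0, 6, server, ring)
          for v in [(1.0, 0.0), (10.0, 1.0), (10.0, 3.0)]]
```

In this toy run the two dense vectors are placed on ring nodes and the vector beyond R_b is placed on the server, mirroring the split shown in Fig. 7.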

Step 3: range query

Fig. 8 is a schematic diagram of a CS-Chord range query Range(Q, r), where Q is the query vector and r is the query radius. The steps are as follows:

(31) Use IDistance to compute the regions where Range(Q, r) intersects the clusters, and map them to several key-value intervals [xi, yi].

(32) If xi ≥ n*C, send the information computed in step (31) to the independent server and go to step (34). If xi < n*C, go to step (33).

(33) Generate the key-value range [h(xi), h(yi)] on the Chord ring. Locate the node holding key value h(xi) through the routing table. If h(yi) is greater than Key_max, the largest key value stored on that node, send the range [Key_max, h(yi)] to the node's successor. If h(yi) is also greater than the successor's Key_max, keep forwarding the query to the next successor.

(34) Each node (including the server and the nodes in the Chord ring) that receives the query request searches its B⁺-Tree for vectors within the key-value range. Any vector found is compared with the query vector Q, and if the distance is less than r it is returned to the node that originally issued the request.

As shown in Fig. 8, the query Q intersects both the dense region and the sparse region of cluster P0 and intersects the sparse region of cluster P1. The mapped one-dimensional key-value ranges are [x1, y1], [x2, y2] and [x3, y3]. The interval [x1, y1] is sent to the Chord ring for retrieval, and the intervals [x2, y2] and [x3, y3] are sent to the server.

Obviously, the above embodiment is merely an example given to illustrate the present invention clearly and does not limit its embodiments. Those of ordinary skill in the art can make other changes in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.
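Steps (31) and (32) can be sketched as a query planner that produces the key intervals and their destinations, in the spirit of the Fig. 8 example. How each annulus is split at R_b is an interpretation made for this sketch, as are the Euclidean distance, the single shared R_b, the name plan_range_query, and all sample values.

```python
import math

def plan_range_query(q, r, anchors, radii, r_b, c):
    """Return a list of (destination, x, y) key intervals for Range(q, r)."""
    n = len(anchors)
    plan = []
    for i, p in enumerate(anchors):
        d = math.dist(p, q)
        if d >= radii[i] + r:                 # the query misses this cluster
            continue
        lo, hi = max(d - r, 0.0), min(d + r, radii[i])
        intervals = []
        if lo <= r_b:                         # part of the annulus in the dense region
            intervals.append((i * c + lo, i * c + min(hi, r_b)))
        if hi > r_b:                          # part of the annulus in the sparse region
            intervals.append(((i + n) * c + max(lo, r_b), (i + n) * c + hi))
        for x, y in intervals:                # step (32): route by comparing x with n*C
            plan.append(("server" if x >= n * c else "chord", x, y))
    return plan

anchors = [(0.0, 0.0), (10.0, 0.0)]           # hypothetical anchors P0 and P1
plan = plan_range_query((3.5, 0.0), 2.0, anchors, radii=[5.0, 5.0], r_b=2.0, c=100.0)
destinations = [dest for dest, _, _ in plan]
```

For this sample query the plan contains one interval for the Chord ring (dense region of P0) and two for the server (sparse regions of P0 and P1), the same three-way split as [x1, y1], [x2, y2], [x3, y3] in the text.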

Claims

1. A distributed indexing method based on cluster separation, characterized by comprising the following steps:

(34) Each node that receives the query request searches its B⁺-Tree for vectors within the key-value range. If a vector Z exists, compute its distance to the query vector Q: if the distance is less than the query radius r, return Z to the node that originally issued the request; if the distance is greater than or equal to r, return a null value. If no vector exists, also return a null value.

2. The distributed indexing method based on cluster separation according to claim 1, characterized in that the detailed process of Step 1, separating the sparse edge vectors and storing them centrally on an independent server, is:

$$\mathrm{Key}(S) = \begin{cases} i \cdot C + \mathrm{dist}(S, P), & \mathrm{dist}(S, P) \leq R_b \\ (i + n) \cdot C + \mathrm{dist}(S, P), & \mathrm{dist}(S, P) > R_b \end{cases} \tag{1}$$

where 0 ≤ i < n;

3. The distributed indexing method based on cluster separation according to claim 1, characterized in that the position-preserving hash function h in Step 2 is defined as follows:

$$h(X_i) = Y_{\min} + \frac{(Y_{\max} - Y_{\min}) \cdot (X_i - X_{\min})}{X_{\max} - X_{\min}} \tag{2}$$

$$h(K_i) = \frac{(2^m - 1) \cdot K_i}{K_{\max}} \tag{3}$$