CN105868414B

CN105868414B - A cluster-separated distributed indexing method

Info

Publication number: CN105868414B
Application number: CN201610287204.7A
Authority: CN
Inventors: 袁鑫攀; 汪灿飞; 何频捷; 梁圣; 满君丰; 向平; 向一平
Original assignee: Hunan University of Technology
Current assignee: Hunan University of Technology
Priority date: 2016-05-03
Filing date: 2016-05-03
Publication date: 2019-03-26
Anticipated expiration: 2036-05-03
Also published as: CN105868414A

Abstract

The invention proposes a cluster-separated distributed indexing method, abbreviated CS-Chord (Clustering Separation-Chord). In the M-Chord distributed index, the vectors at the edge of a cluster are generally sparse, and these sparse vectors make the radius of each cluster very large. During a range query, a cluster with a larger radius is more likely to intersect the query region, which enlarges the candidate region to be searched. Moreover, the edge vectors of a cluster are usually accessed very frequently, which degrades performance further. CS-Chord separates the sparse vectors at the cluster edges and stores them centrally on an independent server, while the dense vectors are stored in the Chord ring. During a lookup, the high-frequency queries are therefore concentrated on the vectors held by the independent server, and the search range on the Chord ring is reduced, which improves retrieval efficiency.

Description

A cluster-separated distributed indexing method

Technical field

The present invention relates to the field of distributed indexing, and more particularly to a cluster-separated distributed indexing method.

Background art

A P2P (peer-to-peer) network does not depend on a dedicated centralized server; all nodes in the network are equal and connect to one another freely. They share computing resources by exchanging resources and services. A P2P distributed index structure makes full use of the performance of every node in the network and offers advantages such as good scalability and high resource utilization. In recent years, distributed indexing has increasingly become a research hotspot. M-Chord, proposed by Novak, D. et al., is a distributed index algorithm for high-dimensional vector similarity retrieval over a P2P network. The algorithm combines the iDistance algorithm with the Chord protocol: iDistance is responsible for reducing the dimensionality of the high-dimensional vectors, and Chord is responsible for distributed vector storage and retrieval.

Chord is a structured distributed lookup protocol that uses DHT techniques to locate resources in a P2P network quickly. To achieve fast resource lookup, each node on the Chord ring maintains a routing table of length O(log₂ n), where n is the total number of nodes in the Chord ring. In the Chord protocol, nodes and data are both mapped to m-bit identifiers in the same identifier space, and virtual nodes are introduced so that each node stores a roughly equal amount of data; that is, the Chord protocol is load-balanced. The routing tables are decentralized: each node only needs to know the routing information of a small number of nodes in the system and obtains the query path by repeatedly hopping from node to node. A single query generates only O(log₂ n) messages in the ring.

A distributed hash table (DHT) assigns a unique identifier to each node in the network according to a fixed rule. In the Chord protocol, data resources are assigned unique identifiers by the same rule. Chord uses a consistent hashing algorithm (Consistent Hash) to map nodes and resources; the result of the mapping is taken modulo 2^m to obtain an m-bit identifier in the range [0, 2^m - 1]. For a node, the IP address is unique, so consistent hashing obtains the node identifier by hashing the node's IP address. For data, the identifier is obtained by hashing the key value. The Chord ring for M=2, N₄ is shown in Fig. 1(a), where Ni is a node and Ki is a resource.

Fast resource location relies mainly on the routing information kept by each node. The data structure of each node contains a routing table that stores the data and address information of some of the other nodes, as shown in Fig. 1(b).

A Chord lookup can be divided into the following steps:

(1) A node N receives the key value key to be looked up and first searches its local resources for that key value. If node N holds the key value, the lookup ends and the node's resource is returned; otherwise go to step (2).

(2) Check the finger table of the requested node, find the node whose identifier is smaller than the identifier to which the key value maps and is nearest to it, send the lookup request to that node, and repeat step (1).
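The two lookup steps above can be sketched in a few lines of Python. This is an illustrative in-memory model only, not the patented implementation: the names `Node`, `build_ring` and `find_successor`, the identifier width `M = 3`, and the example node identifiers are all assumptions, and the local check of step (1) is expressed in the usual Chord form (the key belongs to the successor if it lies between this node and its successor).

```python
# Illustrative sketch only: a tiny in-memory Chord ring (no networking, no churn).
M = 3                      # identifier bits (assumed for the example)
RING = 2 ** M


def in_open_interval(x, a, b):
    """True if x lies strictly between a and b going clockwise on the ring."""
    if a < b:
        return a < x < b
    return x > a or x < b      # interval wraps past 0


def in_half_open(x, a, b):
    """True if x lies in (a, b] going clockwise on the ring."""
    return in_open_interval(x, a, b) or x == b


class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.fingers = []      # finger i points at successor(id + 2**i)

    def closest_preceding(self, key):
        # Scan the routing table from the farthest finger to the nearest.
        for f in reversed(self.fingers):
            if in_open_interval(f.id, self.id, key):
                return f
        return self

    def find_successor(self, key, hops=0):
        # Step (1): the key is stored on our successor if it lies in (id, successor.id].
        if in_half_open(key, self.id, self.successor.id):
            return self.successor, hops
        # Step (2): otherwise forward to the closest finger preceding the key.
        nxt = self.closest_preceding(key)
        if nxt is self:
            return self.successor, hops
        return nxt.find_successor(key, hops + 1)


def build_ring(ids):
    """Create nodes with correct successors and finger tables for a static ring."""
    nodes = [Node(i) for i in sorted(ids)]

    def successor_of(k):
        k %= RING
        for n in nodes:
            if n.id >= k:
                return n
        return nodes[0]        # wrap around the ring

    for n in nodes:
        n.successor = successor_of(n.id + 1)
        n.fingers = [successor_of(n.id + 2 ** i) for i in range(M)]
    return nodes
```

For example, on a ring built with `build_ring([0, 1, 3])`, a lookup for key 6 started at node 0 is forwarded once to node 3 and resolves to node 0, which is the first node at or after identifier 6 when the ring wraps around.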

iDistance is a high-dimensional vector indexing method based on metric space. The basic idea of its index construction is to choose several anchor points in the whole data space, each anchor point corresponding to one cluster subset. Every data point in the data space is assigned to the cluster subset of the anchor point nearest to it. Each high-dimensional vector is then converted, through its distance to the anchor point, into a comparable one-dimensional key value iDist, and a B⁺-Tree is used to organize the key values iDist of all high-dimensional vectors. The key value iDist is calculated as: iDist(x) = dist(p_i, x) + i*c. As shown in Fig. 2, P₀, P₁ and P₂ are anchor points; C_i is the distance from P_i to the farthest data point in the data subset of P_i, i.e., the radius of the data subset of P_i; and c is a constant greater than every C_i.
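A minimal sketch of the key computation iDist(x) = dist(p_i, x) + i*c follows. It assumes Euclidean distance (the method only requires a metric) and uses the hypothetical helper name `idist_key`; the anchor coordinates and the constant are made-up example values.

```python
import math


def idist_key(x, anchors, c):
    """Return (i, key): the index of the nearest anchor and dist(p_i, x) + i*c.

    Euclidean distance is an assumption; any metric distance could be used.
    """
    i = min(range(len(anchors)), key=lambda j: math.dist(anchors[j], x))
    return i, math.dist(anchors[i], x) + i * c


anchors = [(0.0, 0.0), (10.0, 0.0)]   # example anchor points P_0, P_1
c = 100.0                             # constant larger than every cluster radius
```

Because c exceeds every cluster radius, the keys of cluster i fall in [i*c, (i+1)*c), so the clusters occupy disjoint stretches of the one-dimensional axis.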

Let the full data set be D. A similarity range query Range(q, r) retrieves the set of data points whose distance to the data point q is less than the radius r: Range(q, r) = {x ∈ D, dist(q, x) < r}, where the function dist(q, x) denotes the distance from the vector q to the data point x.

The retrieval process of iDistance is as follows:

(1) Using the distance to each anchor point P_i, determine whether the search sphere of q intersects the data subset of the anchor point P_i.

The condition for intersection is: dist(q, P_i) < C_i + r

The condition for non-intersection is: dist(q, P_i) > C_i + r

(2) If they do not intersect, the data subset of that anchor point contains no target points. If they intersect, determine the annulus to be searched. The annulus to be searched is:

{x ∈ P_i, max(dist(P_i, q) - r, 0) < dist(P_i, x) < min(dist(P_i, q) + r, C_i)}

(3) Determine the search range of the one-dimensional key value iDist so that it can be searched quickly in the B⁺-Tree; the data points found enter the candidate set. The search range of the one-dimensional key value iDist is:

{x ∈ P_i, i*c + max(dist(P_i, q) - r, 0) < iDist(x) < i*c + min(dist(P_i, q) + r, C_i)}

(4) Compute the distance between q and each data point in the candidate set; if the distance is less than r, the data point enters the final result set.
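The four retrieval steps can be sketched as follows. This is a simplified illustration under stated assumptions: a sorted Python list searched with `bisect` stands in for the B⁺-Tree, distances are Euclidean, the function names `build_index` and `range_query` are hypothetical, and the key interval is scanned with closed bounds so that the candidate set is a superset that step (4) then filters.

```python
import bisect
import math


def build_index(points, anchors, c):
    """Return (sorted list of (key, point), list of cluster radii C_i)."""
    radii = [0.0] * len(anchors)
    entries = []
    for p in points:
        i = min(range(len(anchors)), key=lambda j: math.dist(anchors[j], p))
        d = math.dist(anchors[i], p)
        radii[i] = max(radii[i], d)
        entries.append((d + i * c, p))     # iDist(p) = dist(P_i, p) + i*c
    entries.sort()
    return entries, radii


def range_query(q, r, entries, anchors, radii, c):
    """Return all indexed points x with dist(q, x) < r."""
    keys = [k for k, _ in entries]
    result = []
    for i, p_i in enumerate(anchors):
        d = math.dist(p_i, q)
        if d > radii[i] + r:               # step (1): sphere misses this cluster
            continue
        lo = i * c + max(d - r, 0.0)       # steps (2)-(3): annulus -> key interval
        hi = i * c + min(d + r, radii[i])
        start = bisect.bisect_left(keys, lo)
        stop = bisect.bisect_right(keys, hi)
        for _, x in entries[start:stop]:   # step (4): refine the candidates
            if math.dist(q, x) < r:
                result.append(x)
    return result
```

The saving comes from step (1) and the key interval: clusters that cannot contain a result are skipped entirely, and only the points inside the annulus are compared with q.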

By choosing anchor points, iDistance neatly reduces the indexing problem for high-dimensional vectors to one dimension and organizes the one-dimensional index with a B⁺-Tree. It searches quickly and saves a large number of distance computations.

M-Chord (M stands for Metric), proposed by Novak, D. et al., is a distributed index algorithm for metric spaces. In a distributed P2P network it supports not only resource location (exact-match lookup) but also similarity search (range search). The algorithm combines iDistance with the Chord protocol: iDistance reduces the dimensionality of the high-dimensional vectors and Chord handles distributed data storage, which together realize similarity retrieval of high-dimensional vectors in a distributed environment. Specifically, M-Chord converts each high-dimensional vector into a one-dimensional key value through iDistance, maps the one-dimensional key value into the identifier space of Chord through a hash function, and inserts and retrieves data through the Chord ring, as shown in Fig. 3.

When a node in the M-Chord algorithm receives a range query Range(Q, r), where Q is the query vector and r is the query radius, the process is as follows.

(1) Compute the regions where the range query Range(Q, r) intersects the clusters by iDistance, and map them to multiple key value intervals [xi, yi].

(2) Apply the position-preserving hash function h to xi and yi to generate the key value range [h(xi), h(yi)] on the Chord ring. Locate the node responsible for the key value h(xi) by querying the routing table. If h(yi) is greater than the maximum key value Key_max of the data stored on that node, send the range [Key_max, h(yi)] to the successor of that node. If h(yi) is also greater than the Key_max of the successor, continue sending the query to its successor.

(3) Each node that receives the query request (including the server and the nodes in the Chord ring) searches its local B⁺-Tree for vectors within the key value range. If such vectors exist, their distance to the query vector Q is computed, and those whose distance is less than r are returned to the node that originally issued the request.
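The successor forwarding in step (2) can be illustrated with a small sketch. It models the ring as an ordered list of nodes, each holding its maximum key Key_max and a sorted list of stored keys; the function name `ring_range_search` and this list-based layout are assumptions made for illustration, and the keys are taken to be already hashed onto the ring.

```python
import bisect


def ring_range_search(nodes, lo, hi):
    """Collect stored keys in [lo, hi] by walking successors.

    nodes: list of (key_max, sorted_keys) in ring order.
    Returns (found_keys, number_of_nodes_visited).
    """
    maxes = [key_max for key_max, _ in nodes]
    idx = bisect.bisect_left(maxes, lo)      # node responsible for h(xi)
    found, visited = [], 0
    while idx < len(nodes):
        key_max, keys = nodes[idx]
        visited += 1
        a = bisect.bisect_left(keys, lo)
        b = bisect.bisect_right(keys, hi)
        found.extend(keys[a:b])
        if hi <= key_max:                    # the range ends on this node
            break
        idx += 1                             # forward [Key_max, h(yi)] to the successor
    return found, visited


ring = [(10, [2, 5, 9]), (20, [12, 18]), (31, [25, 30])]
```

With this example ring, the range [8, 19] starts on the first node and is forwarded once, so two nodes are visited. The wider the key range, the more successors are contacted, which is why enlarged cluster radii are costly.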

The vectors at the edge of an M-Chord cluster are generally sparse, and these sparse vectors make the radius of each cluster very large. During a range query, the larger the radius, the more likely the cluster is to intersect the query region, so the search region grows. This means that as long as the query region intersects a cluster, a resource location must be performed in the Chord ring no matter how little data lies in the intersection. The number of resource locations performed in the Chord ring for such very small amounts of data increases greatly, which degrades the performance of M-Chord.

Fig. 4 shows the data distribution of one cluster obtained by applying K-means clustering to the color histogram features of 68,040 images. The figure shows that the radius of this cluster is 0.62, yet most of the data lies between 0.09 and 0.35. A very small amount of edge data thus nearly doubles the radius of the cluster.

Fig. 5 shows the data access frequency in this cluster space under 1,000 random range queries. Because a range query cannot know in advance whether the queried range contains data, intervals that contain no data are also accessed during retrieval. Comparing Fig. 4 and Fig. 5 shows that the edge vectors are rare but are accessed very frequently, with access frequencies mostly above 80%.

Summary of the invention

To overcome at least one of the above defects (deficiencies) of the prior art, the present invention provides a cluster-separated distributed indexing method, CS-Chord (Clustering Separation-Chord), which reduces the search range on the Chord ring and improves retrieval efficiency.

To solve the above technical problem, the technical solution of the present invention is as follows:

A cluster-separated distributed indexing method, comprising the following steps:

Step 1: separate the edge sparse vectors, and store the edge sparse vectors centrally on an independent server;

Step 2: build the distributed index: calculate the one-dimensional key value Key(S) of the edge sparse vector S to be added to the Chord ring, and insert the vector into the distributed index; the vector insertion process is as follows:

(21) If Key(S) ≥ n*C, where n is the number of cluster subspaces and C is a constant whose value is greater than every value obtained by mapping the vectors inside the annuli of the iDistance index structure onto the one-dimensional axis, send the key value Key(S) and the vector S to the independent server and insert the vector S into the B⁺-Tree index of that server, which completes the insertion of the new vector; if Key(S) < n*C, go to step (22);

(22) Hash Key(S) with the position-preserving hash function to generate the key value Key_Chord assigned on the Chord ring; use the Chord location algorithm to find the IP address of the node on which the key value Key_Chord should be stored; send Key_Chord and the vector S to that node; then insert the vector S into the B⁺-Tree index of the node, which completes the index construction;

Step 3: perform a range query on the constructed index. Let the range query of the cluster-separated distributed indexing method CS-Chord be Range(Q, r), where Q is the query vector and r is the query radius; the steps are as follows:

(31) Compute the regions where the range query Range(Q, r) intersects the clusters by iDistance, and map them to multiple key value intervals [xi, yi];

(32) If xi ≥ n*C, send the intersection region of the range query Range(Q, r) and the cluster computed in step (31) to the independent server and go to step (34); if xi < n*C, go to step (33);

(33) Generate the key value range [h(xi), h(yi)] on the Chord ring and locate the node responsible for the key value h(xi) by querying the routing table; if h(yi) is greater than the maximum key value Key_max of the data stored on that node, send the range [Key_max, h(yi)] to the successor of that node; if h(yi) is still greater than the Key_max of the successor, continue sending the query to its successor;

(34) Each node that receives the query request searches its local B⁺-Tree (which stores the S vectors that satisfy the conditions) for vectors within the key value range. If a vector Z exists, its distance to the query vector Q is computed; when the distance is less than the query radius r, the vector Z is returned to the node that originally issued the request, and when the distance is greater than or equal to the query radius r, a null value is returned. If no vector exists, a null value is also returned.

Preferably, the specific process of Step 1 above, separating the edge sparse vectors and storing the edge sparse vector data centrally on an independent server, is as follows:

Let the boundary between the dense vectors and the edge sparse vectors of a cluster be denoted R_b, and let the radius of the cluster be R. Then the region at distance [0, R_b] from the cluster center is the dense vector area, and the region [R_b, R] is the sparse vector area;

In n cluster spaces, the calculation of the key value Key(S) in the sparse vector area adds a distance of n*C to the original value. Let vector S be a vector to be added to the Chord ring, and let P be the center point of the cluster in which the vector S lies. Then the key value Key(S) of the vector S is calculated as follows:

where 0 ≤ i < n;

By formula (1), the sparse vectors are separated, and the sparse data is stored centrally on an independent server; then, when the key value of the query vector S satisfies Key(S) ≥ n*C, the query directly accesses the centralized storage server.
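Formula (1) itself is not reproduced in this text. Based on the surrounding description (the iDistance key dist(P_i, S) + i*C for dense vectors, plus an additional offset of n*C for vectors in the sparse area), a reconstruction consistent with that description would be the following. Which side of the boundary R_b is treated as closed is an assumption.

```latex
\mathrm{Key}(S) =
\begin{cases}
\mathrm{dist}(P_i, S) + i \cdot C, & 0 \le \mathrm{dist}(P_i, S) \le R_b \\[4pt]
\mathrm{dist}(P_i, S) + i \cdot C + n \cdot C, & R_b < \mathrm{dist}(P_i, S) \le R
\end{cases}
\qquad (1)
```

Under this reading, every dense vector receives a key below n*C and every sparse vector a key of at least n*C, which is what allows the test Key(S) ≥ n*C to decide whether a vector belongs on the independent server.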

Preferably, in Step 2 above, the position-preserving hash function h is defined as follows:

For a data interval [X_min, X_max] that is to be mapped onto an interval [Y_min, Y_max] while keeping the order of the data consistent, assume X_i ∈ [X_min, X_max] and that its value after mapping by the function h is Y_i, with Y_i ∈ [Y_min, Y_max]; the hash function is then defined as:

Here, because the interval of the index key values of CS-Chord is [0, K_max] and the identifier space of the Chord ring is [0, 2^m - 1], substituting X_min = 0, X_max = K_max, Y_min = 0 and Y_max = 2^m - 1 into formula (2) yields:

It should be noted that the position-preserving hash function exists for the following reason. The usual mapping method first hashes the key value of a data point and then takes the result modulo 2^m to obtain the key value on the Chord ring. This, however, maps data points that were originally adjacent onto different nodes. For an index algorithm such as iDistance, which needs to search contiguous ranges, this is undesirable. A position-preserving hash function is therefore needed so that the order of the data is kept consistent.
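Formulas (2) and (3) are not reproduced in this text. Assuming the position-preserving hash is the linear map implied by the description (it sends [X_min, X_max] onto [Y_min, Y_max] and preserves order), one reconstruction would be:

```latex
Y_i = h(X_i) = \frac{X_i - X_{\min}}{X_{\max} - X_{\min}} \,\bigl(Y_{\max} - Y_{\min}\bigr) + Y_{\min} \qquad (2)

h(X_i) = \frac{X_i}{K_{\max}} \,\bigl(2^{m} - 1\bigr) \qquad (3)
```

Formula (3) follows from (2) by the substitution X_min = 0, X_max = K_max, Y_min = 0, Y_max = 2^m - 1. Since Chord identifiers are integers, an implementation would presumably round or truncate the result; the text does not say which.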

Compared with the prior art, the beneficial effects of the technical solution of the present invention are:

The invention proposes a cluster-separated distributed indexing method, abbreviated CS-Chord (Clustering Separation-Chord). In the M-Chord distributed index, the vectors at the edge of a cluster are generally sparse, and these sparse vectors make the radius of each cluster very large. During a range query, a cluster with a larger radius is more likely to intersect the query region, which enlarges the candidate region to be searched. Moreover, the edge vectors of a cluster are usually accessed very frequently, which degrades performance further. CS-Chord separates the sparse vectors at the cluster edges and stores them centrally on an independent server, while the dense vectors are stored in the Chord ring. During a lookup, the high-frequency queries are therefore concentrated on the vectors held by the independent server, and the search range on the Chord ring is reduced, which improves retrieval efficiency.

Brief description of the drawings

Fig. 1 is Chord schematic diagram.

Fig. 2 is the schematic diagram of IDistance.

Fig. 3 is the schematic diagram of M-Chord.

Fig. 4 is the data distribution diagram of a cluster space.

Fig. 5 is the access frequency diagram of a randomly chosen cluster.

Fig. 6 is a schematic diagram of cluster edge separation in two-dimensional space.

Fig. 7 is CS-Chord index schematic diagram.

Fig. 8 is CS-Chord range-based searching schematic diagram.

Detailed description of the embodiments

The drawings are for illustrative purposes only and shall not be construed as limiting this patent. To better illustrate the embodiment, some components in the drawings may be omitted, enlarged or reduced, and do not represent the dimensions of the actual product.

Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings. The technical solution of the present invention is further described below with reference to the drawings and embodiments.

The cluster-separated distributed indexing method (CS-Chord) of the present invention separates the sparse vectors at the cluster edges and stores them centrally on an independent server, while the dense vectors are stored in the Chord ring. During a lookup, the high-frequency queries are concentrated on the sparse vectors held by the independent server, and the search range on the Chord ring is reduced, which improves retrieval efficiency. The specific steps are as follows:

Step 1: edge sparse vector separation

Edge data is sparse and frequently accessed, so it should be stored centrally, which eliminates the time spent locating resources in the Chord ring. To achieve this, the edge sparse data must first be separated.

Let the boundary between the dense vectors and the sparse vectors of a cluster be denoted R_b, and let the radius of the cluster be R. Then the region at distance [0, R_b] from the cluster center is the data-dense area, and the region [R_b, R] is the data-sparse area. As shown in Fig. 6, the data in the two-dimensional space is divided into three cluster spaces; the dark gray part of each cluster is the data-dense area and the light gray part is the data-sparse area. The data of the dense areas is distributed in [0, 3C], and the data of the sparse areas is distributed in [3C, 6C].

Assume there are n cluster spaces in total; the calculation of the key value (Key) in the data-sparse area adds a distance of n*C to the original value. Assume vector S is a vector to be added to the Chord ring and P is the center point (anchor point) of the cluster in which the vector S lies. Then the key value Key(S) of the vector S is calculated as follows:

where 0 ≤ i < n.

With formula (1), the sparse data can be separated and placed in one continuous interval. Therefore, even if all of this data were put into the Chord ring, it would not cause a large number of resource location operations. However, because the sparse data is accessed frequently, this patent stores the sparse data centrally on an independent server. In this way, when the key value of the query range satisfies Key(S) ≥ n*C, the query directly accesses the centralized storage server.
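The key assignment described in this step can be sketched as follows. Formula (1) is not reproduced in this text, so the sketch assumes the form implied by the description: the iDistance key dist(P_i, S) + i*C for dense vectors, with n*C added for vectors beyond the boundary R_b. The function names, the Euclidean metric, the single shared R_b and the example values are all assumptions.

```python
import math


def cs_chord_key(s, anchors, c, r_b):
    """Key(S) under the assumed reading of formula (1).

    Dense vectors keep dist(P_i, S) + i*c; edge sparse vectors
    (distance from the cluster centre greater than r_b) get a further n*c.
    """
    n = len(anchors)
    i = min(range(n), key=lambda j: math.dist(anchors[j], s))
    d = math.dist(anchors[i], s)
    key = d + i * c
    if d > r_b:                 # sparse area (R_b, R]
        key += n * c
    return key


def goes_to_server(key, n, c):
    """Routing test used by the method: keys of at least n*C belong to the server."""
    return key >= n * c


anchors = [(0.0, 0.0), (10.0, 0.0)]   # example cluster centres P_0, P_1
C = 5.0                               # constant larger than every cluster radius
R_B = 1.0                             # example dense/sparse boundary
```

With two clusters and C = 5, the dense keys occupy [0, 10) and the sparse keys occupy [10, 20), mirroring the [0, 3C] and [3C, 6C] split that Fig. 6 describes for three clusters.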

Step 2: distributed index is established

A node calculates the one-dimensional key value Key of the vector S by formula (1). The process of inserting the vector into the distributed index is as follows:

(21) If Key ≥ n*C, send the key value Key and the vector S to the independent server and insert them into the B⁺-Tree index of the server, which completes the insertion of the new vector. If Key < n*C, go to step (22).

(22) Hash Key with the position-preserving hash function to generate the key value Key_Chord assigned on the Chord ring. Use the Chord location algorithm to find the IP of the node on which the key value Key_Chord should be stored. Then send the information of the data point to that node and insert it into the B⁺-Tree index of the node, which completes the index construction.

The position-preserving hash function h is defined as follows:

For a data interval [X_min, X_max] that is to be mapped onto an interval [Y_min, Y_max] while keeping the order of the data consistent, assume X_i ∈ [X_min, X_max] and that its value after mapping by the function h is Y_i, with Y_i ∈ [Y_min, Y_max]. The hash function can then be defined as:

Fig. 7 is a schematic diagram of the CS-Chord distributed index construction process in two-dimensional space.
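Steps (21) and (22) can be sketched together with a position-preserving hash. The hash formula is not reproduced in this text, so the sketch assumes a linear, order-preserving map of [0, K_max] onto [0, 2^m - 1] with truncation to an integer. It also assumes K_max = n*C for keys stored on the ring, uses plain lists in place of B⁺-Trees, and replaces finger-table routing with a direct scan; the class name `CSChordIndex` and its methods are hypothetical.

```python
def position_preserving_hash(x, k_max, m):
    """Linear, order-preserving map of [0, k_max] onto [0, 2**m - 1] (assumed form)."""
    return int(x / k_max * (2 ** m - 1))


class CSChordIndex:
    def __init__(self, node_ids, n_clusters, c, m):
        self.n, self.c, self.m = n_clusters, c, m
        self.k_max = n_clusters * c          # assumption: ring keys lie in [0, n*C)
        self.server = []                     # stand-in for the server's B+-Tree
        self.ring = {nid: [] for nid in sorted(node_ids)}

    def owner(self, ident):
        """First node at or after ident on the ring (direct scan, not finger routing)."""
        for nid in self.ring:
            if nid >= ident:
                return nid
        return next(iter(self.ring))         # wrap around to the smallest identifier

    def insert(self, key, vector):
        if key >= self.n * self.c:           # step (21): edge sparse vector
            self.server.append((key, vector))
            return "server"
        key_chord = position_preserving_hash(key, self.k_max, self.m)  # step (22)
        nid = self.owner(key_chord)
        self.ring[nid].append((key, vector))
        return nid
```

Because the hash is monotone, keys that are adjacent on the one-dimensional axis land on the same node or on neighbouring nodes, which is what lets a key interval be scanned by walking successors.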

Step 3: range query

Fig. 8 is a schematic diagram of the range query Range(Q, r) of CS-Chord, where Q is the query vector and r is the query radius. The steps are as follows:

(31) Compute the regions where the range query Range(Q, r) intersects the clusters by iDistance, and map them to multiple key value intervals [xi, yi].

(32) If xi ≥ n*C, send the information computed in step (31) to the independent server and go to step (34). If xi < n*C, go to step (33).

(33) Generate the key value range [h(xi), h(yi)] on the Chord ring. Locate the node responsible for the key value h(xi) by querying the routing table. If h(yi) is greater than the maximum key value Key_max of the data stored on that node, send the range [Key_max, h(yi)] to the successor of that node. If h(yi) is also greater than the Key_max of the successor, continue sending the query to its successor.

(34) Each node that receives the query request (including the server and the nodes in the Chord ring) searches its local B⁺-Tree for vectors within the key value range. If such vectors exist, their distance to the query vector Q is computed, and those whose distance is less than r are returned to the node that originally issued the request.

As shown in Fig. 8, the query Q intersects both the dense area and the sparse area of cluster P0, and also intersects the sparse area of cluster P1. The mapped one-dimensional key value ranges are [x1, y1], [x2, y2] and [x3, y3]. The interval [x1, y1] is sent to the Chord ring for retrieval, and the intervals [x2, y2] and [x3, y3] are sent to the server for retrieval.
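The interval computation and routing of steps (31) and (32) can be sketched as below. The sketch assumes Euclidean distance, a single boundary R_b shared by all clusters, and sparse keys offset by n*C; the function names `query_intervals` and `route` and the numeric values are hypothetical, chosen so that the outcome mirrors the situation described for Fig. 8.

```python
import math


def query_intervals(q, r, anchors, radii, r_b, c):
    """Step (31): key intervals of the annuli hit by Range(q, r), split at R_b."""
    n = len(anchors)
    intervals = []
    for i, p in enumerate(anchors):
        d = math.dist(p, q)
        if d > radii[i] + r:
            continue                          # query sphere misses cluster i
        lo, hi = max(d - r, 0.0), min(d + r, radii[i])
        if lo <= r_b:                         # part inside the dense area [0, R_b]
            intervals.append((i * c + lo, i * c + min(hi, r_b)))
        if hi > r_b:                          # part inside the sparse area (R_b, R]
            base = n * c + i * c
            intervals.append((base + max(lo, r_b), base + hi))
    return intervals


def route(intervals, n, c):
    """Step (32): intervals starting at or above n*C go to the server."""
    to_ring = [iv for iv in intervals if iv[0] < n * c]
    to_server = [iv for iv in intervals if iv[0] >= n * c]
    return to_ring, to_server


anchors = [(0.0, 0.0), (10.0, 0.0)]   # example cluster centres P0, P1
radii = [4.0, 4.0]                    # example cluster radii
```

For the query Q = (4, 0) with r = 2.5, R_b = 2 and C = 5, the query touches the dense and sparse areas of P0 and only the sparse area of P1, so one interval is routed to the Chord ring and two to the server.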

Obviously, the above embodiment of the present invention is merely an example given to illustrate the invention clearly and does not limit the embodiments of the invention. Those of ordinary skill in the art may make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A cluster-separated distributed indexing method, characterized by comprising the following steps:

Step 2: build the distributed index: calculate the one-dimensional key value Key(S) of the edge sparse vector S to be added to the Chord ring, and insert the vector into the distributed index; the vector insertion process is as follows:

(21) If Key(S) ≥ n*C, where n is the number of cluster subspaces and C is a constant whose value is greater than every value obtained by mapping the vectors inside the annuli of the iDistance index structure onto the one-dimensional axis, send the key value Key(S) and the vector S to the independent server and insert the vector S into the B⁺-Tree index of that server, which completes the insertion of the new vector; if Key(S) < n*C, go to step (22);

The position-preserving hash function h is defined as follows:

For a data interval [X_min, X_max] that is to be mapped onto an interval [Y_min, Y_max] while keeping the order of the data consistent, assume X_i ∈ [X_min, X_max] and that its value after mapping by the function h is Y_i, with Y_i ∈ [Y_min, Y_max]; the hash function is then defined as:

Here, because the interval of the index key values of CS-Chord is [0, K_max] and the identifier space of the Chord ring is [0, 2^m - 1], substituting X_min = 0, X_max = K_max, Y_min = 0 and Y_max = 2^m - 1 into formula (2) yields:

Step 3: perform a range query on the constructed index. Let the range query of the cluster-separated distributed indexing method CS-Chord be Range(Q, r), where Q is the query vector and r is the query radius; the steps are as follows:

(31) Compute the regions where the range query Range(Q, r) intersects the clusters by iDistance, and map them to multiple key value intervals [xi, yi];

(32) If xi ≥ n*C, send the intersection region of the range query Range(Q, r) and the cluster computed in step (31) to the independent server and go to step (34); if xi < n*C, go to step (33);

(33) Generate the key value range [h(xi), h(yi)] on the Chord ring and locate the node responsible for the key value h(xi) by querying the routing table; if h(yi) is greater than the maximum key value Key_max of the data stored on that node, send the range [Key_max, h(yi)] to the successor of that node; if h(yi) is still greater than the Key_max of the successor, continue sending the query to its successor;

(34) Each node that receives the query request searches its local B⁺-Tree for vectors within the key value range; if a vector Z exists, its distance to the query vector Q is computed; when the distance is less than the query radius r, the vector Z is returned to the node that originally issued the request; if the distance is greater than or equal to the query radius r, a null value is returned; if no vector exists, a null value is also returned.

2. The cluster-separated distributed indexing method according to claim 1, characterized in that the specific process of Step 1 above, separating the edge sparse vectors and storing the edge sparse vector data centrally on an independent server, is as follows:

Let the boundary between the dense vectors and the edge sparse vectors of a cluster be denoted R_b, and let the radius of the cluster be R; then the region at distance [0, R_b] from the cluster center is the dense vector area, and the region [R_b, R] is the sparse vector area;

In n cluster spaces, the calculation of the key value Key(S) in the sparse vector area adds a distance of n*C to the original value; let vector S be a vector to be added to the Chord ring, and let P be the center point of the cluster in which the vector S lies; then the key value Key(S) of the vector S is calculated as follows:

where 0 ≤ i < n;

By formula (1), the sparse vectors are separated, and the sparse data is stored centrally on an independent server; then, when the key value of the query vector S satisfies Key(S) ≥ n*C, the query directly accesses the centralized storage server.