CN106777133A

CN106777133A - A MapReduce-based similarity join processing method for metric spaces

Info

Publication number: CN106777133A
Application number: CN201611173516.1A
Authority: CN
Inventors: 高云君; 杨克宇; 陈璐; 陈刚; 陈纯
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2016-12-16
Filing date: 2016-12-16
Publication date: 2017-05-31

Abstract

The invention discloses a MapReduce-based similarity join processing method for metric spaces. Built on the MapReduce distributed computing framework, the method performs efficient metric-space similarity joins in order to detect and delete duplicate and redundant data. The given data set is first partitioned in the Map phase; similarity computation is then carried out in the Reduce phase to identify the duplicate data, which is subsequently deleted. In the Map phase, the method samples the data set, selects high-quality pivots from the sample, uses the pivots to map the whole data set from the metric space to a vector space, and finally applies a KD-tree-based partitioning technique to divide the data set as evenly as possible. In the Reduce phase, the method combines region filtering with a plane-sweep technique to compute similarities with effective pruning and obtain the similarity join result. The invention greatly improves the efficiency of similarity join processing and delivers high performance.

Description

A MapReduce-based similarity join processing method for metric spaces

Technical field

The present invention relates to join processing techniques for metric spaces in the field of computer databases, and more particularly to a MapReduce-based similarity join processing method for metric spaces.

Background

A similarity join in a metric space is defined as follows: given two data sets in a metric space, find all pairs in their Cartesian product whose similarity is higher than (or whose distance is lower than) a given threshold. Metric-space similarity joins are widely used across many fields, including duplicate data detection and deletion.
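The definition above can be made concrete with a brute-force baseline. The following is a minimal sketch, not the method of the invention: the function names `naive_similarity_join` and `edit_distance`, the example strings and the use of `<=` for the threshold test are all assumptions chosen for illustration. It evaluates the metric on every pair of the Cartesian product, which is exactly the cost that partitioning and pruning aim to avoid.

```python
from itertools import product

def naive_similarity_join(Q, O, dist, eps):
    """Return every pair (q, o) of Q x O whose distance is at most eps."""
    return [(q, o) for q, o in product(Q, O) if dist(q, o) <= eps]

def edit_distance(a, b):
    """Levenshtein distance, a metric on strings (illustrative choice)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # deletion, insertion, substitution (free when characters match)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Hypothetical near-duplicate detection on a handful of strings.
pairs = naive_similarity_join(
    ["smith", "jones"], ["smyth", "johns", "brown"], edit_distance, eps=1
)
```

With |Q| and |O| objects, this baseline performs |Q| x |O| distance computations, which is impractical for massive data and for expensive metrics.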

With the continual emergence of new ways of publishing information, typified by social networks and e-commerce, and the rise of computing technologies such as cloud computing and the Internet of Things, data is growing and accumulating at an unprecedented rate. In response, big-data distributed systems of all kinds, typified by MapReduce, have flourished, and the era of big data has arrived. In this era, traditional centralized similarity join algorithms are increasingly unable to meet the demand for fast detection and deletion of complex data at massive scale. Designing a highly scalable, efficient distributed similarity join processing method has therefore become an urgent need in both academia and industry.

Researchers in China and abroad have already done some work on MapReduce-based similarity join processing in metric spaces. The most representative algorithms are MAPSS, which is based on ball (spherical) partitioning, and ClusterJoin, which is based on bisecting-hyperplane partitioning. These methods, however, have two main shortcomings: (1) they select partition centers at random, which can produce unbalanced partitions that then require a further round of partitioning; (2) they focus on the data partitioning scheme and neglect to design pruning strategies that would speed up the similarity computation among the data inside each partition once partitioning is complete. The method of the present invention remedies both shortcomings, improving the efficiency of similarity join processing so that duplicate data can be detected and deleted efficiently.

Summary of the invention

In view of the shortcomings of the prior art, the present invention provides a MapReduce-based similarity join processing method for metric spaces. Built on the MapReduce distributed computing framework, the method first partitions the given data set in the Map phase, then performs similarity computation in the Reduce phase to identify duplicate data, which is subsequently deleted.

To achieve the above object, the present invention adopts the following technical solution: a MapReduce-based similarity join processing method for metric spaces, comprising the following steps:

(1) randomly sample the metric-space data set given by the application to obtain sample data;

(2) select pivots from the obtained sample data;

(3) map the whole data set given by the application (including the sample data) from the metric space to a vector space;

(4) build a KD-tree over the sample data mapped to the vector space in step (3) to obtain the corresponding space partitioning;

(5) in the Map phase, partition the whole data set obtained in step (3) according to the space partitioning obtained in step (4);

(6) in the Reduce phase, perform similarity computation on the partitioned data to obtain the similarity join result.

Further, step (2) is specifically:

(2.1) find the outliers in the sample data and use them as the candidate set of pivots;

(2.2) according to the pivot selection objective, perform an incremental greedy selection over the points in the candidate set.

Further, step (3) is specifically: for each data object in the metric space, compute its distance to each pivot obtained in step (2), and use these distances as the coordinate values of the respective dimensions of the vector space, thereby obtaining the coordinates of the metric-space data in the vector space.

Further, step (4) is specifically: build a KD-tree over the sample data obtained in step (3); the resulting KD-tree has leaf nodes that each contain the same number of data points, and the spatial region corresponding to each leaf node constitutes the space partitioning result.

Further, in step (5), in the Map phase, the whole data set mapped to the vector space in step (3) is assigned to the corresponding space partitions obtained in step (4).

Further, step (6) is specifically:

(6.1) in the Reduce phase, for each partition, sort the data inside the partition along a randomly selected dimension using the quicksort algorithm;

(6.2) using the plane-sweep method, compute metric-space distances over the sorted data to verify candidate results, with region filtering used to prune distance computations.

Further, the region filtering technique is as follows: if the difference between two data objects in any dimension of the vector space is greater than the given distance threshold, the pair cannot be part of the final result and can therefore be pruned without computing its metric-space distance.
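The region filter is safe because of the triangle inequality. The short derivation below is a sketch in notation introduced here for illustration and not taken from the original text: d is the metric, p_1, ..., p_n are the pivots of step (2), φ is the pivot mapping of step (3), and ε is the distance threshold.

```latex
% Pivot mapping of step (3): one coordinate per pivot
\phi(o) = \bigl(d(o,p_1),\, d(o,p_2),\, \dots,\, d(o,p_n)\bigr)

% Triangle inequality applied to q, o and pivot p_i
\lvert d(q,p_i) - d(o,p_i) \rvert \;\le\; d(q,o) \qquad \text{for every } i

% Hence a gap larger than the threshold in any single dimension
% rules the pair out without evaluating d(q,o)
\exists\, i:\ \lvert d(q,p_i) - d(o,p_i) \rvert > \varepsilon
\;\Longrightarrow\; d(q,o) > \varepsilon
```

The filter therefore never discards a true result; it can only let through false candidates, which the subsequent metric-space distance computation removes.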

The beneficial effects of the invention are as follows: under the MapReduce distributed computing framework, the invention makes full use of metric-space similarity join techniques from the field of computer databases. In the Map phase it partitions the data set as evenly as possible while guaranteeing correct results, and in the Reduce phase it computes similarities using an effective pruning strategy. This greatly reduces CPU time, network communication cost and I/O overhead and provides efficient similarity join processing, enabling fast duplicate detection and deletion on massive data.

Brief description of the drawings

Fig. 1 is a flow chart of the implementation steps of the invention;

Fig. 2 is a schematic diagram of KD-tree-based space partitioning;

Fig. 3 is a schematic diagram of KD-tree-based data partitioning;

Fig. 4 is a schematic diagram of similarity join processing in the Reduce phase.

Detailed description of the embodiments

The technical solution of the invention is further described below with reference to the accompanying drawings and a specific implementation:

As shown in Fig. 1, the implementation process and working principle of the invention are as follows:

Step (1): Randomly sample the metric-space data set given by the application to obtain sample data.

Step (2): Select pivots from the obtained sample data. The selected pivots should keep the distance between data objects in the vector space as close as possible to their distance in the original metric space. The selection proceeds as follows:

1) find the outliers in the sample data and use them as the candidate set of pivots;

2) according to the pivot selection objective, perform an incremental greedy selection over the points in the candidate set.
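The two sub-steps above can be sketched as follows. This is only one possible reading under stated assumptions: the text does not specify how outliers are detected or how the selection objective is scored, so the farthest-first traversal in `outlier_candidates`, the scoring rule in `greedy_pivots` (maximize the summed per-pair lower bound on the true distance over sample pairs) and the parameter `n_candidates` are all hypothetical choices.

```python
import math
from itertools import combinations

def outlier_candidates(sample, dist, k):
    """Farthest-first traversal: collect up to k mutually distant sample points.

    Assumption: stands in for the unspecified outlier detection of sub-step 1).
    """
    cands = [sample[0]]
    gap = [dist(x, sample[0]) for x in sample]  # distance to nearest candidate
    while len(cands) < min(k, len(sample)):
        idx = max(range(len(sample)), key=gap.__getitem__)
        cands.append(sample[idx])
        gap = [min(g, dist(x, sample[idx])) for g, x in zip(gap, sample)]
    return cands

def greedy_pivots(sample, dist, n_pivots, n_candidates=8):
    """Incrementally add the candidate that most tightens the pivot-space bound."""
    remaining = outlier_candidates(sample, dist, n_candidates)
    pairs = list(combinations(sample, 2))
    bound = [0.0] * len(pairs)  # current lower bound on d(x, y) for each pair
    pivots = []

    def gaps(p):
        # Lower bounds per pair if candidate p were added to the pivot set.
        return [max(b, abs(dist(x, p) - dist(y, p)))
                for b, (x, y) in zip(bound, pairs)]

    for _ in range(min(n_pivots, len(remaining))):
        best = max(remaining, key=lambda p: sum(gaps(p)))
        bound = gaps(best)
        pivots.append(best)
        remaining.remove(best)
    return pivots

# Hypothetical 2-D sample with Euclidean distance as the metric.
sample = [(0, 0), (1, 0), (0, 1), (5, 5), (9, 0), (0, 9)]
pivots = greedy_pivots(sample, math.dist, n_pivots=2, n_candidates=4)
```

Because each pivot-space difference is a lower bound on the true distance, raising the summed bound pushes distances in the vector space toward the distances in the original metric space, which is the stated goal of the selection.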

Step (3): Map the whole data set given by the application (including the sample data) from the metric space to the vector space. The mapping works as follows: for each data object in the metric space, compute its distance to each pivot obtained in step (2), and use these distances as the coordinate values of the respective dimensions of the vector space, thereby obtaining the coordinates of the metric-space data in the vector space.
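A minimal sketch of this mapping is given below. The function names `map_to_vector_space` and `linf`, the two example pivots and the choice of Euclidean distance as the metric are assumptions made for illustration only.

```python
import math

def map_to_vector_space(obj, pivots, dist):
    """Coordinate i of the image is the distance from obj to pivot i."""
    return tuple(dist(obj, p) for p in pivots)

def linf(u, v):
    """Largest per-dimension difference between two mapped vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

# Hypothetical pivots and objects; any metric could replace math.dist.
pivots = [(0, 0), (10, 0)]
q, o = (1, 1), (4, 5)
vq = map_to_vector_space(q, pivots, math.dist)
vo = map_to_vector_space(o, pivots, math.dist)
lower_bound = linf(vq, vo)   # never exceeds the true distance
true_distance = math.dist(q, o)
```

The number of pivots fixes the dimensionality of the vector space, so more pivots give tighter bounds at the cost of more distance computations per object during mapping.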

Step (4): Build a KD-tree using the sample data obtained in step (3) to obtain the corresponding space partitioning. Specifically, build a KD-tree over the sample data obtained in step (3); the resulting KD-tree has leaf nodes that each contain the same number of data points, and the spatial region corresponding to each leaf node constitutes the space partitioning result. The construction of the KD-tree is illustrated below using Fig. 2 as an example, where the sample data is {q₂, o₃, q₄, o₄, o₅, q₅, o₇, q₈}:

1) sort all sample data along a randomly selected dimension (dimension y in Fig. 2(a)) and split the sample data into two nodes A and B, i.e. A = {q₂, o₃, q₄, o₄} and B = {o₅, q₅, o₇, q₈};

2) recursively split nodes A and B in the same way, finally obtaining the four nodes shown in Fig. 2(b), i.e. P₁ = {q₂, o₄}, P₂ = {o₃, q₄}, P₃ = {o₅, q₅} and P₄ = {o₇, q₈};

3) finally obtain the space partition corresponding to each leaf node, namely the bounding boxes BB(P₁), BB(P₂), BB(P₃) and BB(P₄) of nodes P₁, P₂, P₃ and P₄ in Fig. 2(b).
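The construction above can be sketched as a recursive median split. This is a hedged illustration rather than the patented implementation: the names `kd_cells` and `contains`, the half-open cell convention [lo, hi), the unbounded outer cells and the sample coordinates are assumptions, and the sketch presumes distinct coordinate values so that every leaf receives the same number of sample points.

```python
import math
import random

def kd_cells(points, leaf_size, lo=None, hi=None, rng=random):
    """Split the sample at the median of a random dimension until each
    leaf holds at most leaf_size points; return the leaf cells (lo, hi)."""
    dims = len(points[0])
    lo = lo if lo is not None else (-math.inf,) * dims
    hi = hi if hi is not None else (math.inf,) * dims
    if len(points) <= leaf_size:
        return [(lo, hi)]
    d = rng.randrange(dims)                      # randomly selected dimension
    ordered = sorted(points, key=lambda p: p[d])
    mid = len(ordered) // 2
    split = ordered[mid][d]                      # boundary between the halves
    left_hi = hi[:d] + (split,) + hi[d + 1:]
    right_lo = lo[:d] + (split,) + lo[d + 1:]
    return (kd_cells(ordered[:mid], leaf_size, lo, left_hi, rng)
            + kd_cells(ordered[mid:], leaf_size, right_lo, hi, rng))

def contains(cell, p):
    """True if p lies in the half-open cell [lo, hi)."""
    lo, hi = cell
    return all(l <= x < h for l, x, h in zip(lo, p, hi))

# Eight hypothetical mapped sample points, split into four leaves of two.
sample = [(1, 8), (2, 3), (3, 6), (4, 1), (5, 7), (6, 2), (7, 5), (8, 4)]
cells = kd_cells(sample, leaf_size=2, rng=random.Random(0))
```

Since the cells tile the whole vector space, any object of the full data set, not only the sampled ones, falls into exactly one cell in the Map phase.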

Step (5): In the Map phase, partition the whole data set obtained in step (3) according to the space partitioning obtained in step (4). Specifically, the whole data set mapped to the vector space in step (3) is assigned to the corresponding space partitions obtained in step (4). Taking Fig. 3 as an example, suppose the data sets given by the application are Q = {q₁, q₂, …, q₈} and O = {o₁, o₂, …, o₈}; the partitioning steps are as follows:

1) as shown in Fig. 3(a), assign data set Q to the corresponding partitions, obtaining four partitions of Q: P₁^Q = {q₁, q₂}, P₂^Q = {q₃, q₄}, P₃^Q = {q₅} and P₄^Q = {q₆, q₇, q₈};

2) as shown in Fig. 3(a), after obtaining each partition P_i^Q, compute the minimum bounding box that encloses all data objects in P_i^Q, i.e. MBB(P₁^Q), MBB(P₂^Q), MBB(P₃^Q) and MBB(P₄^Q);

3) compute the search range of each partition P_i^Q, defined as the region obtained by extending its minimum bounding box outward by the distance threshold; as shown in Fig. 3(b), the region outlined by the dashed line is the search range SR(P₂^Q) of MBB(P₂^Q);

4) using the search range of each partition, assign each object of data set O to every partition whose search range contains it; as shown in Fig. 3(b), the resulting partitions of O are P₁^O = {o₂}, P₂^O = {o₂, o₃, o₅, o₆}, P₃^O = {o₃, o₅} and P₄^O = {o₃, o₆, o₇}.
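The four partitioning steps above can be sketched in a single process as follows. This is an illustration under assumptions, not a Hadoop job: the dictionaries returned by `map_phase` stand in for the (partition id, object) pairs a mapper would emit, and all function names, the two hand-written cells and the example vectors are hypothetical.

```python
import math

def cell_of(v, cells):
    """Index of the half-open cell [lo, hi) that contains vector v."""
    for i, (lo, hi) in enumerate(cells):
        if all(l <= x < h for l, x, h in zip(lo, v, hi)):
            return i
    raise ValueError("cells do not cover the vector")

def mbb(vectors):
    """Minimum bounding box of a non-empty list of vectors."""
    return tuple(map(min, zip(*vectors))), tuple(map(max, zip(*vectors)))

def search_range(box, eps):
    """Extend a bounding box outward by the distance threshold eps."""
    lo, hi = box
    return tuple(x - eps for x in lo), tuple(x + eps for x in hi)

def in_box(v, box):
    lo, hi = box
    return all(l <= x <= h for l, x, h in zip(lo, v, hi))

def map_phase(Q, O, cells, eps):
    """Assign Q by cell, then assign O to every partition whose SR contains it."""
    parts_q = {}
    for name, v in Q.items():
        parts_q.setdefault(cell_of(v, cells), []).append(name)
    parts_o = {}
    for i, names in parts_q.items():
        sr = search_range(mbb([Q[n] for n in names]), eps)
        parts_o[i] = [n for n, v in O.items() if in_box(v, sr)]
    return parts_q, parts_o

# Two hypothetical cells split at x = 5, and already-mapped vectors.
cells = [((-math.inf, -math.inf), (5, math.inf)),
         ((5, -math.inf), (math.inf, math.inf))]
Q = {"q1": (1, 1), "q2": (2, 3), "q3": (7, 2)}
O = {"o1": (2.5, 2), "o2": (5.5, 2), "o3": (9, 9)}
parts_q, parts_o = map_phase(Q, O, cells, eps=1.5)
```

An object of O may be replicated to several partitions or sent to none, as with o₂ and o₁ in Fig. 3(b); objects that fall outside every search range cannot join with anything and are dropped before the Reduce phase.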

Step (6): In the Reduce phase, perform similarity computation on the partitioned data to obtain the similarity join result. The specific steps are:

1) in the Reduce phase, for each partition, sort the data inside the partition along a randomly selected dimension using the quicksort algorithm; as shown in Fig. 4, when partition P₂ is processed, dimension x is selected for sorting the data;

2) using the plane-sweep method, compute metric-space distances over the sorted data to verify candidate results, with region filtering used to prune distance computations. As shown in Fig. 4(a), a sweep plane moves from left to right and has currently reached data object q₂; for q₂, which now lies on the sweep plane, the data objects o₅, o₂ and o₃ located to the right of the sweep plane within the distance threshold need to be verified. In addition, as shown in Fig. 4(b), region filtering prunes o₅ and o₃ because they lie outside the search range SR(q₂) of q₂, so in the end only the distance between q₂ and o₂ has to be verified.
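The processing of one partition in the Reduce phase can be sketched as below. Several details are assumptions made for illustration: the function name `sweep_join`, the fixed default `dim=0` in place of a randomly selected dimension, Python's built-in sort in place of quicksort, and a sliding window that checks objects on both sides of the current position rather than only to the right of the sweep plane. The example pivots and points are hypothetical.

```python
import math

def sweep_join(Q, O, dist, eps, dim=0):
    """Join one partition. Q and O are lists of (object, mapped_vector).

    Returns the result pairs and the number of metric distances computed.
    """
    Qs = sorted(Q, key=lambda t: t[1][dim])
    Os = sorted(O, key=lambda t: t[1][dim])
    results, computed, start = [], 0, 0
    for q, qv in Qs:
        # Drop O-objects that fell more than eps behind the sweep position.
        while start < len(Os) and Os[start][1][dim] < qv[dim] - eps:
            start += 1
        j = start
        while j < len(Os) and Os[j][1][dim] <= qv[dim] + eps:
            o, ov = Os[j]
            j += 1
            # Region filter: a gap above eps in any dimension prunes the pair.
            if any(abs(a - b) > eps for a, b in zip(qv, ov)):
                continue
            computed += 1
            if dist(q, o) <= eps:
                results.append((q, o))
    return results, computed

# Hypothetical partition: 2-D points, Euclidean metric, two pivots.
pivots = [(0, 0), (10, 0)]
embed = lambda x: (x, tuple(math.dist(x, p) for p in pivots))
Q = [embed((1, 1))]
O = [embed(o) for o in [(1.5, 1.2), (1, 6), (-1, 1)]]
pairs, computed = sweep_join(Q, O, math.dist, eps=1.0)
```

In this example the point (1, 6) never enters the sweep window, (-1, 1) enters the window but is removed by the region filter on the second dimension, and only one metric distance is actually computed.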

Claims

1. A MapReduce-based similarity join processing method for metric spaces, characterized in that the steps of the method are as follows:

(1) randomly sample the metric-space data set given by the application to obtain sample data.

(2) select pivots from the obtained sample data.

(3) map the whole data set given by the application (including the sample data) from the metric space to a vector space.

(4) build a KD-tree over the sample data mapped to the vector space in step (3) to obtain the corresponding space partitioning.

(5) in the Map phase, partition the whole data set obtained in step (3) according to the space partitioning obtained in step (4).

(6) in the Reduce phase, perform similarity computation on the partitioned data to obtain the similarity join result.

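2. The MapReduce-based similarity join processing method for metric spaces according to claim 1, characterized in that step (2) is specifically:

(2.1) find the outliers in the sample data and use them as the candidate set of pivots;

(2.2) according to the pivot selection objective, perform an incremental greedy selection over the points in the candidate set.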

3. The MapReduce-based similarity join processing method for metric spaces according to claim 1, characterized in that step (3) is specifically: for each data object in the metric space, compute its distance to each pivot obtained in step (2), and use these distances as the coordinate values of the respective dimensions of the vector space, thereby obtaining the coordinates of the metric-space data in the vector space.

4. The MapReduce-based similarity join processing method for metric spaces according to claim 1, characterized in that step (4) is specifically: build a KD-tree over the sample data obtained in step (3); the resulting KD-tree has leaf nodes that each contain the same number of data points, and the spatial region corresponding to each leaf node constitutes the space partitioning result.

5. The MapReduce-based similarity join processing method for metric spaces according to claim 1, characterized in that in step (5), in the Map phase, the whole data set mapped to the vector space in step (3) is assigned to the corresponding space partitions obtained in step (4).

6. The MapReduce-based similarity join processing method for metric spaces according to claim 1, characterized in that step (6) is specifically:

(6.1) in the Reduce phase, for each partition, sort the data inside the partition along a randomly selected dimension using the quicksort algorithm;

(6.2) using the plane-sweep method, compute metric-space distances over the sorted data to verify candidate results, with region filtering used to prune distance computations.

7. The MapReduce-based similarity join processing method for metric spaces according to claim 6, characterized in that the region filtering technique is as follows: if the difference between two data objects in any dimension of the vector space is greater than the given distance threshold, the pair cannot be part of the final result and can therefore be pruned without computing its metric-space distance.