CN106777133A - A MapReduce-based similarity join processing method for metric spaces - Google Patents

A MapReduce-based similarity join processing method for metric spaces

Info

Publication number
CN106777133A
Authority
CN
China
Prior art keywords
data
space
metric space
similarity join
processing method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611173516.1A
Other languages
Chinese (zh)
Inventor
高云君
杨克宇
陈璐
陈刚
陈纯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201611173516.1A priority Critical patent/CN106777133A/en
Publication of CN106777133A publication Critical patent/CN106777133A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2264 Multidimensional index structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a MapReduce-based similarity join processing method for metric spaces. Built on the MapReduce distributed computing framework, the method performs metric space similarity joins efficiently in order to detect and delete duplicate, redundant data. The given data sets are first partitioned in the Map phase; similarity computation is then performed in the Reduce phase to identify duplicate data, which is subsequently deleted. In the Map phase, the method samples the data set; selects high-quality pivots from the sample data; maps the whole data set from the metric space to a vector space by means of the pivots; and finally partitions the data set as evenly as possible using a KD-tree-based partitioning technique. In the Reduce phase, the method uses region filtering and a plane sweep to perform similarity computation with effective pruning, obtaining the similarity join result. The invention greatly improves the efficiency of similarity join processing and, according to the applicants, provides optimal performance.

Description

A MapReduce-based similarity join processing method for metric spaces
Technical field
The present invention relates to similarity join processing techniques for metric spaces in the field of computer databases, and in particular to a MapReduce-based similarity join processing method for metric spaces.
Background
A metric space similarity join is defined as follows: given two data sets in a metric space, find, within their Cartesian product, all pairs of data objects whose similarity is higher than (or whose distance is less than) a given threshold. Metric space similarity joins are widely used in many fields, including duplicate data detection and deletion.
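As a point of reference, the definition above can be written as a brute-force join over the Cartesian product. The sketch below is only a baseline for illustration and is not the method of the invention; the function names, the Hamming distance used as the example metric, and the inclusive threshold test (`<=`) are assumptions made for this example.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def naive_similarity_join(
    Q: Sequence[T], O: Sequence[T], dist: Callable[[T, T], float], eps: float
) -> List[Tuple[T, T]]:
    """Return every pair (q, o) of the Cartesian product with dist(q, o) <= eps."""
    return [(q, o) for q, o in product(Q, O) if dist(q, o) <= eps]


def hamming(a: str, b: str) -> int:
    """Hamming distance between equal-length strings; it satisfies the metric axioms."""
    return sum(x != y for x, y in zip(a, b))


pairs = naive_similarity_join(["abcd", "wxyz"], ["abce", "wxyz", "qrst"], hamming, 1)
print(pairs)
```

This baseline evaluates the distance function for all |Q| x |O| pairs, which is the cost that partitioning and pruning are meant to avoid.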
With the continuing emergence of new ways of publishing information, typified by social networks and e-commerce, and with the rise of computing technologies such as cloud computing and the Internet of Things, data is growing and accumulating at an unprecedented rate. Big data distributed systems, typified by MapReduce, have flourished in response, and the era of big data has arrived. In this era, traditional centralized similarity join algorithms can no longer meet the demand for fast duplicate detection and deletion over massive data. Designing a highly scalable, efficient distributed similarity join processing method has therefore become an urgent need in both academia and industry.
Researchers in China and abroad have already done some work on MapReduce-based similarity join processing for metric spaces. The most representative algorithms are the MAPSS method, based on ball partitioning, and the ClusterJoin method, based on bisecting-hyperplane partitioning. However, these methods have two main drawbacks: (1) they choose partition centers at random, which can leave the data partitioning unbalanced, so that the data must be partitioned further; (2) they focus on the data partitioning scheme and neglect to design pruning strategies that would improve efficiency when similarity is computed among the data inside each partition after partitioning is complete. The present method addresses both drawbacks, improving the efficiency of similarity join processing so that duplicate data can be detected and deleted efficiently.
Summary of the invention
To address the shortcomings of the prior art, the present invention provides a MapReduce-based similarity join processing method for metric spaces. Built on the MapReduce distributed computing framework, the method first partitions the given data sets in the Map phase, then performs similarity computation in the Reduce phase to identify duplicate data, which is subsequently deleted.
To achieve the above object, the present invention adopts the following technical solution: a MapReduce-based similarity join processing method for metric spaces, comprising the following steps:
(1) randomly sample the metric space data set given by the application to obtain sample data;
(2) select pivots from the sample data obtained;
(3) map the whole data set given by the application (including the sample data) from the metric space to a vector space;
(4) build a KD-tree from the sample data mapped to the vector space in step (3) to obtain the corresponding space partition;
(5) in the Map phase, partition the whole data set obtained in step (3) according to the space partition obtained in step (4);
(6) in the Reduce phase, perform similarity computation on the partitioned data to obtain the similarity join result.
Further, step (2) specifically comprises:
(2.1) find outliers in the sample data to form the candidate set of pivots;
(2.2) perform incremental greedy selection over the points in the candidate set according to the pivot selection objective.
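Steps (2.1) and (2.2) can be sketched as follows. The patent does not spell out how outliers are found or how the selection objective is scored, so both are assumptions here: outliers are taken to be the sample points with the largest total distance to the rest of the sample, and the objective is the mean ratio between the pivot-space lower bound and the true metric distance (closer to 1 is better). The function name `select_pivots` and its parameters are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_pivots(
    sample: Sequence[T],
    dist: Callable[[T, T], float],
    num_pivots: int,
    num_candidates: int,
) -> List[T]:
    # (2.1) Candidate set: assumed here to be the points with the largest
    # total distance to all other sample points.
    totals = [sum(dist(s, t) for t in sample) for s in sample]
    order = sorted(range(len(sample)), key=lambda i: totals[i], reverse=True)
    candidates = [sample[i] for i in order[:num_candidates]]

    pairs = [(a, b) for a, b in combinations(sample, 2) if dist(a, b) > 0]

    def precision(pivots: List[T]) -> float:
        # Assumed objective: mean ratio of the pivot-space lower bound
        # max_i |d(a, p_i) - d(b, p_i)| to the true distance d(a, b).
        if not pairs:
            return 0.0
        total = 0.0
        for a, b in pairs:
            lower = max(abs(dist(a, p) - dist(b, p)) for p in pivots)
            total += lower / dist(a, b)
        return total / len(pairs)

    # (2.2) Incremental greedy selection: add the candidate that most
    # improves the objective, one pivot at a time.
    pivots: List[T] = []
    while len(pivots) < num_pivots and candidates:
        best = max(candidates, key=lambda c: precision(pivots + [c]))
        pivots.append(best)
        candidates.remove(best)
    return pivots


sample = [0.0, 1.0, 2.0, 10.0]
pivots = select_pivots(sample, lambda a, b: abs(a - b), num_pivots=1, num_candidates=3)
print(pivots)
```

On this one-dimensional sample the extreme points reproduce every pairwise distance exactly, so the greedy step prefers an extreme point over the interior candidate 1.0.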
Further, step (3) is specifically: for each data object in the metric space, compute its distances to the pivots obtained in step (2), and use these distances as the coordinate values in the respective dimensions of the vector space, thereby obtaining the coordinates of the metric space data in the vector space.
Further, step (4) is specifically: build a KD-tree over the sample data obtained in step (3); the resulting KD-tree contains leaf nodes holding equal numbers of data points, and the spatial region corresponding to each leaf node is one part of the resulting space partition.
Further, in step (5), in the Map phase, the whole data set obtained in step (3) after mapping to the vector space is assigned to the corresponding parts of the space partition obtained in step (4).
Further, step (6) specifically comprises:
(6.1) in the Reduce phase, for each partition, sort the data inside the partition along a randomly selected dimension using the quicksort algorithm;
(6.2) using the plane sweep method, perform metric space distance computation and result verification on the sorted data, and prune distance computations using the region filtering technique.
Further, the region filtering technique is as follows: if the difference between two data objects in any dimension of the vector space exceeds the given distance threshold, they cannot appear in the final result and can therefore be pruned without computing their metric space distance.
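The reason this filter never discards a true result follows from the triangle inequality. The short derivation below uses notation that does not appear in the original text and is introduced only for illustration: d is the metric, p_1, ..., p_n are the pivots, φ is the mapping of step (3), and ε is the distance threshold.

```latex
\phi(x) = \bigl(d(x,p_1), \dots, d(x,p_n)\bigr), \qquad
\lvert \phi_i(x) - \phi_i(y) \rvert
  = \lvert d(x,p_i) - d(y,p_i) \rvert
  \le d(x,y) \quad (1 \le i \le n),
\qquad\text{hence}\qquad
\exists\, i:\ \lvert \phi_i(x) - \phi_i(y) \rvert > \varepsilon
  \;\Longrightarrow\; d(x,y) > \varepsilon .
```

In other words, each coordinate difference in the vector space is a lower bound on the metric distance, so a pair that fails the test in any single dimension cannot be within the threshold.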
The invention has the following beneficial effects: within the MapReduce distributed computing framework, it makes full use of techniques from the computer database field related to metric space similarity join processing. In the Map phase, it partitions the data set as evenly as possible while guaranteeing correct results; in the Reduce phase, it performs similarity computation with an effective pruning strategy. This greatly reduces CPU time, network communication cost and I/O overhead, and provides efficient similarity join processing, so that duplicate detection and deletion can be carried out rapidly over massive data.
Brief description of the drawings
Fig. 1 is a flow chart of the implementation steps of the invention;
Fig. 2 is a schematic diagram of KD-tree-based space partitioning;
Fig. 3 is a schematic diagram of KD-tree-based data partitioning;
Fig. 4 is a schematic diagram of similarity join processing in the Reduce phase.
Detailed description
The technical solution of the invention is further described below with reference to the drawings and a specific implementation:
As shown in Fig. 1, the specific implementation process and working principle of the invention are as follows:
Step (1): Randomly sample the metric space data set given by the application to obtain sample data.
Step (2): Select pivots from the sample data obtained. The selected pivots should ensure that the distances between data objects in the vector space are as close as possible to their distances in the original metric space. The selection steps are:
1) find outliers in the sample data to form the candidate set of pivots;
2) perform incremental greedy selection over the points in the candidate set according to the pivot selection objective.
Step (3): Map the whole data set given by the application (including the sample data) from the metric space to a vector space. The mapping works as follows: for each data object in the metric space, compute its distances to the pivots obtained in step (2), and use these distances as the coordinate values in the respective dimensions of the vector space, yielding the coordinates of the metric space data in the vector space.
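The mapping in step (3) amounts to one distance evaluation per object and pivot. A minimal sketch, assuming strings under edit distance as the metric space; the function names and the example words and pivots are made up for illustration:

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def map_to_vector_space(
    objects: Sequence[T], pivots: Sequence[T], dist: Callable[[T, T], float]
) -> List[List[float]]:
    """Coordinate i of an object is its distance to pivot i."""
    return [[dist(o, p) for p in pivots] for o in objects]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


words = ["kitten", "sitting", "mitten"]
pivots = ["kitten", "fitting"]
vectors = map_to_vector_space(words, pivots, edit_distance)
print(vectors)
```

With two pivots every string becomes a point in a two-dimensional vector space, which is what allows the KD-tree partitioning of step (4) to be applied to data that has no coordinates of its own.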
Step (4): Build a KD-tree from the sample data obtained in step (3) to obtain the corresponding space partition. Specifically: build a KD-tree over the sample data obtained in step (3); the resulting KD-tree contains leaf nodes holding equal numbers of data points, and the spatial region corresponding to each leaf node is one part of the space partition. KD-tree construction is illustrated below using Fig. 2 as an example, where the sample data is {q2, o3, q4, o4, o5, q5, o7, q8}:
1) Along a randomly selected dimension (dimension y in Fig. 2(a)), sort all the sample data and then split it into two nodes A and B, i.e. A = {q2, o3, q4, o4} and B = {o5, q5, o7, q8};
2) Iteratively split nodes A and B in turn, finally obtaining the four nodes shown in Fig. 2(b), i.e. P1 = {q2, o4}, P2 = {o3, q4}, P3 = {o5, q5} and P4 = {o7, q8};
3) Finally, obtain the space partition corresponding to each leaf node, namely the bounding boxes BB(P1), BB(P2), BB(P3) and BB(P4) of nodes P1, P2, P3 and P4 in Fig. 2(b).
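The construction walked through above can be sketched as a recursive median split. This is an illustrative reading rather than the patent's exact procedure: the function name `kd_partition`, the fixed recursion depth, the half-open box convention (lower bound inclusive, upper bound exclusive) and the sample coordinates are all assumptions.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Box = Tuple[List[float], List[float]]  # (lower corner, upper corner)


def kd_partition(
    points: Sequence[Sequence[float]],
    depth: int,
    rng: random.Random,
    lo: Optional[List[float]] = None,
    hi: Optional[List[float]] = None,
) -> List[Box]:
    """Split the sample at the median of a random dimension, `depth` times."""
    dims = len(points[0])
    lo = [-math.inf] * dims if lo is None else lo
    hi = [math.inf] * dims if hi is None else hi
    if depth == 0 or len(points) < 2:
        return [(lo, hi)]
    axis = rng.randrange(dims)                     # randomly selected dimension
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    split = ordered[mid][axis]                     # splitting plane
    left_hi = hi[:]
    left_hi[axis] = split
    right_lo = lo[:]
    right_lo[axis] = split
    return kd_partition(ordered[:mid], depth - 1, rng, lo, left_hi) + kd_partition(
        ordered[mid:], depth - 1, rng, right_lo, hi
    )


def contains(box: Box, p: Sequence[float]) -> bool:
    lo, hi = box
    return all(l <= x < h for l, x, h in zip(lo, p, hi))


sample = [(1, 7), (2, 3), (3, 8), (4, 1), (5, 6), (6, 2), (7, 5), (8, 4)]
boxes = kd_partition(sample, depth=2, rng=random.Random(0))
counts = [sum(contains(b, p) for p in sample) for b in boxes]
print(counts)
```

Because the leaf regions are bounded only by the splitting planes, together they cover the whole vector space, so any object of the full data set falls into exactly one region even though the tree was built from the sample alone. When several sample points share the coordinate at a splitting plane, leaf counts can differ slightly under this convention.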
Step (5): In the Map phase, partition the whole data set obtained in step (3) according to the space partition obtained in step (4). Specifically, the whole data set obtained in step (3) after mapping to the vector space is assigned to the corresponding parts of the space partition obtained in step (4). Taking Fig. 3 as an example, suppose the data sets given by the application are Q = {q1, q2, …, q8} and O = {o1, o2, …, o8}. The partitioning steps are as follows:
1) As shown in Fig. 3(a), assign data set Q to the corresponding partitions, obtaining four partitions of Q: P1^Q = {q1, q2}, P2^Q = {q3, q4}, P3^Q = {q5} and P4^Q = {q6, q7, q8};
2) As shown in Fig. 3(a), after obtaining the partitions Pi^Q, compute the minimum bounding boxes MBB(P1^Q), MBB(P2^Q), MBB(P3^Q) and MBB(P4^Q) that enclose all the data objects in each Pi^Q;
3) Compute the search range of each partition Pi^Q. The search range of Pi^Q is the region obtained by extending its bounding box outward by the distance threshold; as shown in Fig. 3(b), the dashed region is the search range SR(P2^Q) of MBB(P2^Q);
4) According to the search ranges obtained for the partitions, assign data set O to the search ranges of the corresponding partitions. As shown in Fig. 3(b), the result of partitioning data set O is P1^O = {o2}, P2^O = {o2, o3, o5, o6}, P3^O = {o3, o5} and P4^O = {o3, o6, o7}.
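The four partitioning steps above can be simulated in plain Python. This sketch does not use the Hadoop API; it only shows which partition each object would be routed to. The function name `map_phase`, the box representation (lower corner inclusive, upper corner exclusive) and the sample coordinates are assumptions made for the example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Box = Tuple[List[float], List[float]]  # (lower corner, upper corner)


def map_phase(
    Q: Sequence[Point], O: Sequence[Point], boxes: Sequence[Box], eps: float
) -> Tuple[Dict[int, List[Point]], Dict[int, List[Point]]]:
    # 1) Assign each q to the single partition whose region contains it.
    parts_q: Dict[int, List[Point]] = defaultdict(list)
    for q in Q:
        for pid, (lo, hi) in enumerate(boxes):
            if all(l <= x < h for l, x, h in zip(lo, q, hi)):
                parts_q[pid].append(q)
                break
    parts_o: Dict[int, List[Point]] = defaultdict(list)
    for pid, qs in parts_q.items():
        # 2) Minimum bounding box of the q objects in this partition.
        mbb_lo = [min(c) for c in zip(*qs)]
        mbb_hi = [max(c) for c in zip(*qs)]
        # 3) + 4) Search range = MBB extended by eps; an o object is copied
        # into every partition whose search range contains it.
        for o in O:
            if all(l - eps <= x <= h + eps for l, x, h in zip(mbb_lo, o, mbb_hi)):
                parts_o[pid].append(o)
    return dict(parts_q), dict(parts_o)


boxes = [([-math.inf, -math.inf], [5.0, math.inf]), ([5.0, -math.inf], [math.inf, math.inf])]
Q = [(1, 1), (4, 3), (6, 2)]
O = [(5, 2), (7, 3), (20, 20)]
parts_q, parts_o = map_phase(Q, O, boxes, eps=1.5)
print(parts_q, parts_o)
```

As in the Fig. 3 example, where o2 and o3 appear in several partitions, an object of O near a partition boundary is replicated (here (5, 2) lands in both partitions), while an object far from every search range, such as (20, 20), is sent nowhere and never reaches a reducer.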
Step (6): In the Reduce phase, perform similarity computation on the partitioned data to obtain the similarity join result. The specific steps are:
1) In the Reduce phase, for each partition, sort the data inside the partition along a randomly selected dimension using the quicksort algorithm. As shown in Fig. 4, when partition P2 is processed, dimension x is selected for sorting the data;
2) Using the plane sweep method, perform metric space distance computation and result verification on the sorted data, and prune distance computations using the region filtering technique. As shown in Fig. 4(a), a sweep plane moves from left to right and has currently reached data object q2; for q2, which now lies on the sweep plane, the data objects o5, o2 and o3 that lie to the right of the sweep plane within the distance threshold must be verified. In addition, according to the region filtering technique, as shown in Fig. 4(b), o5 and o3 can be pruned because they lie outside the search range SR(q2) of q2, so in the end only the distance between q2 and o2 needs to be verified.
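The Reduce-phase processing of one partition can be sketched as below. It is a variant rather than a reproduction of Fig. 4: it keeps Q and O in separate sorted lists and slides a two-sided window over O, it relies on Python's built-in sort instead of quicksort, and it fixes the sort dimension instead of choosing it at random. The function name `reduce_partition`, the Euclidean example metric and the pivots are assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Mapped = Tuple[T, Sequence[float]]  # (original object, pivot-space vector)


def reduce_partition(
    qs: Sequence[Mapped],
    os_: Sequence[Mapped],
    dist: Callable[[T, T], float],
    eps: float,
    axis: int = 0,
) -> List[Tuple[T, T]]:
    qs = sorted(qs, key=lambda t: t[1][axis])
    os_ = sorted(os_, key=lambda t: t[1][axis])
    results = []
    start = 0
    for q, qv in qs:
        # Advance the window start past objects too far to the left of q.
        while start < len(os_) and os_[start][1][axis] < qv[axis] - eps:
            start += 1
        for o, ov in os_[start:]:
            if ov[axis] > qv[axis] + eps:
                break  # everything further right is out of range on this axis
            # Region filter: a gap larger than eps in any dimension rules the
            # pair out without evaluating the metric distance.
            if any(abs(a - b) > eps for a, b in zip(qv, ov)):
                continue
            if dist(q, o) <= eps:  # verification with the real metric
                results.append((q, o))
    return results


pivots = [(0, 0), (10, 0)]
to_vec = lambda p: [math.dist(p, pv) for pv in pivots]
Q = [(1, 1), (5, 5)]
O = [(1, 2), (5, 4), (9, 9)]
results = reduce_partition(
    [(q, to_vec(q)) for q in Q], [(o, to_vec(o)) for o in O], math.dist, eps=1.5
)
print(results)
```

Sorting along one dimension bounds the candidates each q has to look at, and the region filter then removes most of the remaining candidates with cheap coordinate comparisons, so the potentially expensive metric distance is evaluated only for pairs that survive both checks.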

Claims (7)

1. A MapReduce-based similarity join processing method for metric spaces, characterized in that the method comprises the following steps:
(1) randomly sampling the metric space data set given by the application to obtain sample data;
(2) selecting pivots from the sample data obtained;
(3) mapping the whole data set given by the application (including the sample data) from the metric space to a vector space;
(4) building a KD-tree from the sample data mapped to the vector space in step (3) to obtain the corresponding space partition;
(5) in the Map phase, partitioning the whole data set obtained in step (3) according to the space partition obtained in step (4);
(6) in the Reduce phase, performing similarity computation on the partitioned data to obtain the similarity join result.
2. The MapReduce-based similarity join processing method for metric spaces according to claim 1, characterized in that step (2) specifically comprises:
(2.1) finding outliers in the sample data to form the candidate set of pivots;
(2.2) performing incremental greedy selection over the points in the candidate set according to the pivot selection objective.
3. The MapReduce-based similarity join processing method for metric spaces according to claim 1, characterized in that step (3) is specifically: for each data object in the metric space, computing its distances to the pivots obtained in step (2), and using these distances as the coordinate values in the respective dimensions of the vector space, so as to obtain the coordinates of the metric space data in the vector space.
4. The MapReduce-based similarity join processing method for metric spaces according to claim 1, characterized in that step (4) is specifically: building a KD-tree over the sample data obtained in step (3), wherein the resulting KD-tree contains leaf nodes holding equal numbers of data points, and the spatial region corresponding to each leaf node is one part of the resulting space partition.
5. The MapReduce-based similarity join processing method for metric spaces according to claim 1, characterized in that in step (5), in the Map phase, the whole data set obtained in step (3) after mapping to the vector space is assigned to the corresponding parts of the space partition obtained in step (4).
6. The MapReduce-based similarity join processing method for metric spaces according to claim 1, characterized in that step (6) specifically comprises:
(6.1) in the Reduce phase, for each partition, sorting the data inside the partition along a randomly selected dimension using the quicksort algorithm;
(6.2) using the plane sweep method, performing metric space distance computation and result verification on the sorted data, and pruning distance computations using the region filtering technique.
7. The MapReduce-based similarity join processing method for metric spaces according to claim 6, characterized in that the region filtering technique is as follows: if the difference between two data objects in any dimension of the vector space exceeds the given distance threshold, they cannot appear in the final result and can therefore be pruned without computing their metric space distance.
CN201611173516.1A 2016-12-16 2016-12-16 A MapReduce-based similarity join processing method for metric spaces Pending CN106777133A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611173516.1A CN106777133A (en) 2016-12-16 2016-12-16 A MapReduce-based similarity join processing method for metric spaces

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611173516.1A CN106777133A (en) 2016-12-16 2016-12-16 A MapReduce-based similarity join processing method for metric spaces

Publications (1)

Publication Number Publication Date
CN106777133A true CN106777133A (en) 2017-05-31

Family

ID=58891188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611173516.1A Pending CN106777133A (en) 2016-12-16 2016-12-16 A MapReduce-based similarity join processing method for metric spaces

Country Status (1)

Country Link
CN (1) CN106777133A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273464A (en) * 2017-06-02 2017-10-20 浙江大学 A kind of similar inquiry processing method of non-distributive measure based on publish/subscribe pattern
CN107506394A (en) * 2017-07-31 2017-12-22 武汉工程大学 Optimization method for eliminating big data standard relation connection redundancy
CN107659560A (en) * 2017-08-28 2018-02-02 国家计算机网络与信息安全管理中心 A kind of abnormal auditing method for mass network data flow log processing
CN108759902A (en) * 2018-03-30 2018-11-06 深圳大图科创技术开发有限公司 A kind of gas ductwork intelligent monitor system based on big data
CN112666451A (en) * 2021-03-15 2021-04-16 南京邮电大学 Integrated circuit scanning test vector generation method
CN113435501A (en) * 2021-06-25 2021-09-24 深圳大学 Clustering-based measurement space data partitioning and performance measuring method and related components
WO2022267096A1 (en) * 2021-06-22 2022-12-29 深圳计算科学研究院 Performance measurement method and apparatus for metric space partitioning boundaries, and related device

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273464A (en) * 2017-06-02 2017-10-20 浙江大学 A kind of similar inquiry processing method of non-distributive measure based on publish/subscribe pattern
CN107273464B (en) * 2017-06-02 2020-05-12 浙江大学 Distributed measurement similarity query processing method based on publish/subscribe mode
CN107506394A (en) * 2017-07-31 2017-12-22 武汉工程大学 Optimization method for eliminating big data standard relation connection redundancy
CN107659560A (en) * 2017-08-28 2018-02-02 国家计算机网络与信息安全管理中心 A kind of abnormal auditing method for mass network data flow log processing
CN108759902A (en) * 2018-03-30 2018-11-06 深圳大图科创技术开发有限公司 A kind of gas ductwork intelligent monitor system based on big data
CN112666451A (en) * 2021-03-15 2021-04-16 南京邮电大学 Integrated circuit scanning test vector generation method
CN112666451B (en) * 2021-03-15 2021-06-29 南京邮电大学 Integrated circuit scanning test vector generation method
WO2022267096A1 (en) * 2021-06-22 2022-12-29 深圳计算科学研究院 Performance measurement method and apparatus for metric space partitioning boundaries, and related device
CN113435501A (en) * 2021-06-25 2021-09-24 深圳大学 Clustering-based measurement space data partitioning and performance measuring method and related components
CN113435501B (en) * 2021-06-25 2023-07-07 深圳大学 Clustering-based metric space data partitioning and performance measuring method and related components

Similar Documents

Publication Publication Date Title
CN106777133A (en) A MapReduce-based similarity join processing method for metric spaces
CN104376053B (en) A kind of storage and retrieval method based on magnanimity meteorological data
CN104462260B (en) A kind of community search method in social networks based on k- cores
CN106777093B (en) Skyline inquiry system based on space time sequence data flow application
CN111222418B (en) Crowdsourcing data rapid fusion optimization method for multiple road segments of lane line
CN104408055B (en) The storage method and device of a kind of laser radar point cloud data
CN106951526B (en) Entity set extension method and device
CN105654483A (en) Three-dimensional point cloud full-automatic registration method
CN104850712B (en) Surface sampled data topology Region Queries method in kind
CN109241355A (en) Accessibility querying method, system and the readable storage medium storing program for executing of directed acyclic graph
CN106780721A (en) Three-dimensional laser spiral scanning point cloud three-dimensional reconstruction method
CN106909539A (en) Image indexing system, server, database and related methods
CN105550332A (en) Dual-layer index structure based origin graph query method
CN103926879A (en) Aviation engine crankcase feature recognition method
CN105740521B (en) Small grid elimination method and device during reservoir numerical simulation system solution
CN110275929A (en) A kind of candidate road section screening technique and mesh segmentation method based on mesh segmentation
CN114783068A (en) Gesture recognition method, gesture recognition device, electronic device and storage medium
CN110505322A (en) A kind of IP address section lookup method and device
CN102043857B (en) All-nearest-neighbor query method and system
CN108564116A (en) A kind of ingredient intelligent analysis method of camera scene image
CN107888494B (en) Community discovery-based packet classification method and system
CN114358252A (en) Operation execution method and device in target neural network model and storage medium
CN104573036A (en) Distance-based algorithm for solving representative node set in two-dimensional space
CN106909552A (en) Image retrieval server, system, coordinate indexing and misarrangement method
CN103761298A (en) Distributed-architecture-based entity matching method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170531
