CN106777133A - A metric-space similarity join processing method based on MapReduce - Google Patents
A metric-space similarity join processing method based on MapReduce
- Publication number
- CN106777133A (application CN201611173516.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- space
- metric space
- similarity join
- processing method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2264—Multidimensional index structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a metric-space similarity join processing method based on MapReduce. Built on the MapReduce distributed computing framework, the method provides an efficient metric-space similarity join that detects and deletes redundant duplicate data. The given data set is first partitioned in the Map stage; similarity computation is then performed in the Reduce stage to obtain the duplicate-data results, which are subsequently deleted. In the Map stage, the method samples the data set, selects high-quality pivots from the sample data, maps the entire data set from the metric space to a vector space by means of the pivots, and finally partitions the data set as evenly as possible using a KD-tree-based partitioning technique. In the Reduce stage, the method uses region filtering and a plane-sweep technique to compute similarity with effective pruning, yielding the similarity join result. The invention greatly improves the efficiency of similarity join processing and delivers high performance.
Description
Technical field
The present invention relates to join processing techniques for metric spaces in the field of computer databases, and more particularly to a metric-space similarity join processing method based on MapReduce.
Background
A metric-space similarity join is defined as follows: given two data sets in a metric space, find every pair of data objects in their Cartesian product whose similarity is above (or whose distance is below) a given threshold. Metric-space similarity joins are widely used in many fields, including duplicate data detection and deletion.
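The definition above can be made concrete with a brute-force baseline. The following is a minimal sketch, not part of the patent: the function names, the choice of edit distance as the metric, and the sample strings are all illustrative assumptions. It compares every pair in the Cartesian product, which is exactly the cost that the partitioning and pruning described later aim to avoid.

```python
from itertools import product

def naive_similarity_join(Q, O, dist, eps):
    """Return every pair (q, o) from Q x O with dist(q, o) <= eps."""
    return [(q, o) for q, o in product(Q, O) if dist(q, o) <= eps]

def edit_distance(a, b):
    """Levenshtein distance, a standard metric on strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical data: two small string sets and a distance threshold of 2.
Q = ["mapreduce", "metric"]
O = ["mapreduse", "matrix", "kdtree"]
pairs = naive_similarity_join(Q, O, edit_distance, 2)
```

The baseline performs |Q| x |O| distance computations regardless of the threshold, so it serves only as a correctness reference for small inputs.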
With the continuing emergence of new ways of publishing information, typified by social networks and e-commerce, and with the rise of cloud computing and the Internet of Things, data is growing and accumulating at an unprecedented rate. Big-data distributed systems typified by MapReduce have flourished in response, and the era of big data has arrived. In this era, traditional centralized similarity join algorithms can no longer meet the requirement of rapidly detecting and deleting duplicates in massive, complex data. Designing a highly scalable and efficient distributed similarity join processing method has therefore become an urgent need in both academia and industry.
Researchers in China and abroad have already done some work on MapReduce-based similarity join processing in metric spaces. The most representative algorithms are the MAPSS method, based on ball (spherical) partitioning, and the ClusterJoin method, based on bisecting-hyperplane partitioning. These methods have two main shortcomings: (1) they choose the partition centers at random, which may lead to unbalanced partitions that then have to be partitioned further; (2) they focus on the data partitioning scheme and neglect pruning strategies that would make the similarity computation among the data inside each partition more efficient once partitioning is complete. The method of the present invention addresses both shortcomings, improving the efficiency of similarity join processing so that duplicate data can be detected and deleted efficiently.
Summary of the invention
To address the shortcomings of the prior art, the present invention provides a metric-space similarity join processing method based on MapReduce. Built on the MapReduce distributed computing framework, the method first partitions the given data set in the Map stage, then performs similarity computation in the Reduce stage to obtain the duplicate-data results, which are subsequently deleted.
To achieve the above object, the present invention adopts the following technical solution: a metric-space similarity join processing method based on MapReduce, comprising the following steps:
(1) Randomly sample the metric-space data set given by the application to obtain sample data;
(2) Select pivots from the sample data obtained;
(3) Map the entire data set given by the application (including the sample data) from the metric space to a vector space;
(4) Build a KD-tree from the sample data mapped to the vector space in step (3) to obtain the corresponding space partitioning;
(5) In the Map stage, partition the entire data set obtained in step (3) according to the space partitioning obtained in step (4);
(6) In the Reduce stage, perform similarity computation on the partitioned data to obtain the similarity join result.
Further, step (2) specifically comprises:
(2.1) finding the outliers in the sample data and using them as the candidate set of pivots;
(2.2) performing incremental greedy selection on the points in the candidate set according to the pivot selection objective.
Further, step (3) specifically comprises: for each data object in the metric space, computing its distance to each pivot obtained in step (2) and using these distances as its coordinate values in the corresponding dimensions of the vector space, thereby obtaining the coordinates of the metric-space data in the vector space.
Further, step (4) specifically comprises: building a KD-tree over the sample data obtained in step (3) such that the leaf nodes of the resulting KD-tree contain equal numbers of data points; the spatial region corresponding to each leaf node constitutes the space partitioning.
Further, in step (5), during the Map stage, the entire data set mapped to the vector space in step (3) is assigned to the corresponding space partitions obtained in step (4).
Further, step (6) specifically comprises:
(6.1) in the Reduce stage, for each partition, sorting the data inside the partition on one randomly selected dimension using quicksort;
(6.2) using the plane-sweep method, computing and verifying metric-space distances over the sorted data, with region filtering applied to prune distance computations.
Further, the region filtering technique is as follows: if the difference between two data objects in any dimension of the vector space exceeds the given distance threshold, the pair cannot be part of the final result and can therefore be pruned without computing its metric-space distance.
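The correctness of region filtering follows from the triangle inequality. The derivation below is a short sketch in notation that is not in the original text: d is the metric, p_1, ..., p_n are the pivots, \phi is the pivot mapping of step (3), and \varepsilon is the distance threshold.

```latex
\begin{align*}
\phi(x) &= \bigl(d(x,p_1),\, d(x,p_2),\, \dots,\, d(x,p_n)\bigr) \\
d(a,p_i) &\le d(a,b) + d(b,p_i), \qquad d(b,p_i) \le d(a,b) + d(a,p_i) \\
\Rightarrow\quad \lvert d(a,p_i) - d(b,p_i) \rvert &\le d(a,b) \quad \text{for every } i \\
\text{hence}\quad \exists\, i:\ \lvert \phi_i(a) - \phi_i(b) \rvert > \varepsilon
  &\;\Longrightarrow\; d(a,b) > \varepsilon
\end{align*}
```

The filter is therefore safe (it never discards a true result) but not exact: a pair that passes it must still be verified with the original metric.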
The beneficial effects of the present invention are as follows. Within the MapReduce distributed computing framework, the invention makes full use of techniques from the database field related to metric-space similarity join processing. In the Map stage it partitions the data set as evenly as possible while guaranteeing correct results; in the Reduce stage it applies an effective pruning strategy to the similarity computation. This greatly reduces CPU time, network communication cost, and I/O overhead, and provides efficient similarity join processing so that duplicates in massive data can be detected and deleted rapidly.
Brief description of the drawings
Fig. 1 is a flowchart of the implementation steps of the invention;
Fig. 2 is a schematic diagram of KD-tree-based space partitioning;
Fig. 3 is a schematic diagram of KD-tree-based data partitioning;
Fig. 4 is a schematic diagram of similarity join processing in the Reduce stage.
Detailed description
The technical solution of the invention is further described below with reference to the drawings and a specific embodiment.
As shown in Fig. 1, the specific implementation process and working principle of the invention are as follows:
Step (1): Randomly sample the metric-space data set given by the application to obtain sample data.
Step (2): Select pivots from the sample data obtained. The selected pivots should ensure that the distances between data objects in the vector space are as close as possible to their distances in the original metric space. The selection proceeds as follows:
1) find the outliers in the sample data and use them as the candidate set of pivots;
2) perform incremental greedy selection on the points in the candidate set according to the pivot selection objective.
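The two sub-steps of pivot selection can be sketched as follows. The text does not specify how outliers are identified or how the selection objective is scored, so this sketch makes two labeled assumptions: outliers are the sample objects with the largest total distance to the rest of the sample, and the objective is to maximize the summed L-infinity distance between mapped sample pairs (the mapped distance never exceeds the true distance, so a larger value means a closer approximation). All function and parameter names are hypothetical.

```python
import itertools

def outlier_candidates(sample, dist, k):
    """Assumed outlier criterion: the k objects farthest from the rest in total."""
    score = lambda x: sum(dist(x, y) for y in sample)
    return sorted(sample, key=score, reverse=True)[:k]

def mapped_dist(a, b, pivots, dist):
    """L-infinity distance between a and b after mapping them with `pivots`."""
    return max(abs(dist(a, p) - dist(b, p)) for p in pivots)

def select_pivots(sample, dist, n_pivots, n_candidates):
    """Incremental greedy selection; assumes n_pivots <= n_candidates."""
    candidates = outlier_candidates(sample, dist, n_candidates)
    pairs = list(itertools.combinations(sample, 2))
    pivots = []
    for _ in range(n_pivots):
        # Assumed objective: total mapped distance over all sample pairs
        # when candidate c is added to the pivots chosen so far.
        def quality(c):
            return sum(mapped_dist(a, b, pivots + [c], dist) for a, b in pairs)
        remaining = [c for c in candidates if c not in pivots]
        pivots.append(max(remaining, key=quality))
    return pivots

# Hypothetical one-dimensional sample with absolute difference as the metric.
sample = [0, 1, 2, 3, 10]
abs_diff = lambda a, b: abs(a - b)
candidates = outlier_candidates(sample, abs_diff, 2)
pivots = select_pivots(sample, abs_diff, n_pivots=2, n_candidates=2)
```

Restricting the greedy search to a small candidate set keeps the cost of scoring manageable, since each quality evaluation touches every sample pair.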
Step (3): Map the entire data set given by the application (including the sample data) from the metric space to the vector space. The mapping works as follows: for each data object in the metric space, compute its distance to each pivot obtained in step (2) and use these distances as its coordinate values in the corresponding dimensions of the vector space, which gives the coordinates of the metric-space data in the vector space.
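The mapping in step (3) is small enough to show directly. This is a minimal sketch assuming bit strings under Hamming distance as the metric space; the pivots, data, and function names are illustrative and not taken from the patent.

```python
def pivot_map(obj, pivots, dist):
    """Coordinates of obj in the vector space: one distance per pivot."""
    return tuple(dist(obj, p) for p in pivots)

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical pivots and data objects.
pivots = ["0000", "1111"]
data = ["0001", "0111", "1010"]
vectors = {s: pivot_map(s, pivots, hamming) for s in data}
```

With two pivots every object becomes a point in a two-dimensional vector space, whatever the structure of the original objects, which is what allows a KD-tree to be used in the next step.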
Step (4): Build a KD-tree from the sample data obtained in step (3) to obtain the corresponding space partitioning. Specifically, a KD-tree is built over the sample data obtained in step (3) such that its leaf nodes contain equal numbers of data points, and the spatial region corresponding to each leaf node constitutes the space partitioning. The construction of the KD-tree is illustrated below using Fig. 2 as an example, where the sample data is {q2, o3, q4, o4, o5, q5, o7, q8}:
1) On a randomly selected dimension (dimension y in Fig. 2(a)), sort all the sample data and split it into two nodes A and B, i.e. A = {q2, o3, q4, o4} and B = {o5, q5, o7, q8};
2) Split nodes A and B iteratively in the same way, finally obtaining the four nodes shown in Fig. 2(b), i.e. P1 = {q2, o4}, P2 = {o3, q4}, P3 = {o5, q5} and P4 = {o7, q8};
3) Finally obtain the space partition corresponding to each leaf node, namely the bounding boxes BB(P1), BB(P2), BB(P3) and BB(P4) of nodes P1, P2, P3 and P4 in Fig. 2(b).
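The construction just described can be sketched as a recursive median split. This is an illustration under assumptions, not the patent's implementation: the tuple-based tree layout, the function names, and the sample coordinates are hypothetical, and the sample size is assumed to be a power-of-two multiple of the leaf size so that the leaves come out equal.

```python
import random

def build_kd_tree(sample, leaf_size, rng):
    """Return a nested tuple tree; leaves are ('leaf', points)."""
    if len(sample) <= leaf_size:
        return ("leaf", sample)
    dim = rng.randrange(len(sample[0]))       # randomly chosen split dimension
    ordered = sorted(sample, key=lambda p: p[dim])
    mid = len(ordered) // 2
    split = ordered[mid][dim]                 # new points with p[dim] < split are routed left
    return ("node", dim, split,
            build_kd_tree(ordered[:mid], leaf_size, rng),
            build_kd_tree(ordered[mid:], leaf_size, rng))

def leaves(tree):
    """List the point sets stored in the leaves, left to right."""
    if tree[0] == "leaf":
        return [tree[1]]
    return leaves(tree[3]) + leaves(tree[4])

def route(tree, p):
    """Follow the split values down to the leaf whose region contains p."""
    while tree[0] == "node":
        _, dim, split, left, right = tree
        tree = left if p[dim] < split else right
    return tree[1]

# Hypothetical mapped sample: eight 2-D vectors, two per leaf.
sample = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2), (1, 9), (6, 8)]
tree = build_kd_tree(sample, leaf_size=2, rng=random.Random(42))
parts = leaves(tree)
```

Because each split is at the median of the sample, every leaf holds the same number of sample points, which is what makes the resulting partitions roughly balanced when the full data set follows the same distribution as the sample.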
Step (5): In the Map stage, partition the entire data set obtained in step (3) according to the space partitioning obtained in step (4). Specifically, the entire data set mapped to the vector space in step (3) is assigned to the corresponding space partitions obtained in step (4). Taking Fig. 3 as an example, assume the data sets given by the application are Q = {q1, q2, …, q8} and O = {o1, o2, …, o8}. The partitioning steps are as follows:
1) As shown in Fig. 3(a), assign data set Q to the corresponding partitions, obtaining its four partitions P1^Q = {q1, q2}, P2^Q = {q3, q4}, P3^Q = {q5} and P4^Q = {q6, q7, q8};
2) As shown in Fig. 3(a), after the partitions Pi^Q are obtained, compute the minimum bounding boxes MBB(P1^Q), MBB(P2^Q), MBB(P3^Q) and MBB(P4^Q), each of which encloses all the data objects in the corresponding Pi^Q;
3) Compute the search range of each partition Pi^Q, which is the region obtained by extending its minimum bounding box outward by the distance threshold; as shown in Fig. 3(b), the region enclosed by the dashed line is the search range SR(P2^Q) of MBB(P2^Q);
4) According to the search ranges obtained, assign data set O to the search ranges of the corresponding partitions; as shown in Fig. 3(b), the result of partitioning O is P1^O = {o2}, P2^O = {o2, o3, o5, o6}, P3^O = {o3, o5} and P4^O = {o3, o6, o7}.
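The four partitioning steps can be simulated in plain Python without a MapReduce runtime. In this sketch the `locate` callable stands in for routing a vector through the KD-tree of step (4); it, the helper names, and the data are assumptions made for illustration. In a real job the mapper would emit (partition id, object) pairs instead of filling dictionaries.

```python
from collections import defaultdict

def mbb(points):
    """Minimum bounding box of a non-empty list of vectors, as (lo, hi)."""
    dims = range(len(points[0]))
    return (tuple(min(p[d] for p in points) for d in dims),
            tuple(max(p[d] for p in points) for d in dims))

def expand(box, eps):
    """Search range: the box extended outward by eps in every dimension."""
    lo, hi = box
    return tuple(v - eps for v in lo), tuple(v + eps for v in hi)

def contains(box, p):
    lo, hi = box
    return all(l <= v <= h for l, v, h in zip(lo, p, hi))

def map_stage(Q, O, locate, eps):
    parts_q = defaultdict(list)
    for q in Q:                                   # 1) assign Q to partitions
        parts_q[locate(q)].append(q)
    search = {pid: expand(mbb(pts), eps)          # 2) + 3) MBB, then search range
              for pid, pts in parts_q.items()}
    parts_o = defaultdict(list)
    for o in O:                                   # 4) replicate O into every matching range
        for pid, box in search.items():
            if contains(box, o):
                parts_o[pid].append(o)
    return dict(parts_q), dict(parts_o)

# Hypothetical two-cell space partitioning split at x = 5.
locate = lambda p: 0 if p[0] < 5 else 1
Q = [(1, 1), (2, 3), (7, 7), (8, 6)]
O = [(3, 3), (5, 5), (6, 8), (20, 20)]
parts_q, parts_o = map_stage(Q, O, locate, eps=3)
```

An object of O that falls in several search ranges, such as (5, 5) here, is copied to each of them, while an object outside every range, such as (20, 20), is dropped before the Reduce stage.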
Step (6): In the Reduce stage, perform similarity computation on the partitioned data to obtain the similarity join result. The specific steps are:
1) In the Reduce stage, for each partition, sort the data inside the partition on one randomly selected dimension using quicksort; as shown in Fig. 4, when partition P2 is processed, dimension x is selected for sorting the data;
2) Using the plane-sweep method, compute and verify metric-space distances over the sorted data, with region filtering applied to prune distance computations. As shown in Fig. 4(a), a sweep plane moves from left to right and has currently reached data object q2; q2, now on the sweep plane, must be verified against the data objects o5, o2 and o3 that lie within the distance threshold to the right of the sweep plane. In addition, according to the region filtering technique, as shown in Fig. 4(b), o5 and o3 can be pruned because they lie outside the search range SR(q2) of q2, so in the end only the distance between q2 and o2 needs to be verified.
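The Reduce-stage processing of one partition can be sketched as a sort followed by a sweep. This is an illustrative sketch under assumptions: the function name, the (vector, object) pair representation, the fixed sort dimension, and the Euclidean example data are not from the patent, and Python's built-in sort is used in place of quicksort.

```python
import math

def reduce_partition(part_q, part_o, dist, eps, dim=0):
    """part_q / part_o: lists of (vector, obj) pairs belonging to one partition."""
    items = sorted([(v, x, "Q") for v, x in part_q] +
                   [(v, x, "O") for v, x in part_o],
                   key=lambda it: it[0][dim])
    result = []
    for i, (v1, x1, s1) in enumerate(items):
        for v2, x2, s2 in items[i + 1:]:
            if v2[dim] - v1[dim] > eps:           # beyond the sweep window: stop
                break
            if s1 == s2:                          # only Q-O pairs are join candidates
                continue
            if any(abs(a - b) > eps for a, b in zip(v1, v2)):
                continue                          # region filter: pruned, no metric call
            q, o = (x1, x2) if s1 == "Q" else (x2, x1)
            if dist(q, o) <= eps:                 # verification in the metric space
                result.append((q, o))
    return result

# Hypothetical example: 2-D points, Euclidean metric, pivots p1 and p2.
p1, p2 = (0, 0), (10, 0)
to_vec = lambda x: (math.dist(x, p1), math.dist(x, p2))
part_q = [(to_vec(x), x) for x in [(3, 4)]]
part_o = [(to_vec(x), x) for x in [(3, 5), (3, -4), (9, 0), (-5, 0)]]
result = reduce_partition(part_q, part_o, math.dist, eps=2)
```

In this example (9, 0) is cut off by the sweep window, (-5, 0) is pruned by the region filter, and (3, -4) passes the filter because it maps to the same vector as (3, 4) but is rejected by verification, which shows why the final metric-space check is still required.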
Claims (7)
1. A metric-space similarity join processing method based on MapReduce, characterized in that the method comprises the following steps:
(1) randomly sampling the metric-space data set given by the application to obtain sample data;
(2) selecting pivots from the sample data obtained;
(3) mapping the entire data set given by the application (including the sample data) from the metric space to a vector space;
(4) building a KD-tree from the sample data mapped to the vector space in step (3) to obtain the corresponding space partitioning;
(5) in the Map stage, partitioning the entire data set obtained in step (3) according to the space partitioning obtained in step (4);
(6) in the Reduce stage, performing similarity computation on the partitioned data to obtain the similarity join result.
2. The metric-space similarity join processing method based on MapReduce according to claim 1, characterized in that step (2) specifically comprises:
(2.1) finding the outliers in the sample data and using them as the candidate set of pivots;
(2.2) performing incremental greedy selection on the points in the candidate set according to the pivot selection objective.
3. The metric-space similarity join processing method based on MapReduce according to claim 1, characterized in that step (3) specifically comprises: for each data object in the metric space, computing its distance to each pivot obtained in step (2) and using these distances as its coordinate values in the corresponding dimensions of the vector space, thereby obtaining the coordinates of the metric-space data in the vector space.
4. The metric-space similarity join processing method based on MapReduce according to claim 1, characterized in that step (4) specifically comprises: building a KD-tree over the sample data obtained in step (3) such that the leaf nodes of the resulting KD-tree contain equal numbers of data points, the spatial region corresponding to each leaf node constituting the space partitioning.
5. The metric-space similarity join processing method based on MapReduce according to claim 1, characterized in that in step (5), during the Map stage, the entire data set mapped to the vector space in step (3) is assigned to the corresponding space partitions obtained in step (4).
6. The metric-space similarity join processing method based on MapReduce according to claim 1, characterized in that step (6) specifically comprises:
(6.1) in the Reduce stage, for each partition, sorting the data inside the partition on one randomly selected dimension using quicksort;
(6.2) using the plane-sweep method, computing and verifying metric-space distances over the sorted data, with region filtering applied to prune distance computations.
7. The metric-space similarity join processing method based on MapReduce according to claim 6, characterized in that the region filtering technique is as follows: if the difference between two data objects in any dimension of the vector space exceeds the given distance threshold, the pair cannot be part of the final result and can therefore be pruned without computing its metric-space distance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611173516.1A CN106777133A (en) | 2016-12-16 | 2016-12-16 | A metric-space similarity join processing method based on MapReduce |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611173516.1A CN106777133A (en) | 2016-12-16 | 2016-12-16 | A metric-space similarity join processing method based on MapReduce |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106777133A (en) | 2017-05-31 |
Family
ID=58891188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611173516.1A Pending CN106777133A (en) | 2016-12-16 | 2016-12-16 | A kind of similar connection processing method of metric space based on MapReduce |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106777133A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107273464A (en) * | 2017-06-02 | 2017-10-20 | 浙江大学 | A kind of similar inquiry processing method of non-distributive measure based on publish/subscribe pattern |
CN107273464B (en) * | 2017-06-02 | 2020-05-12 | 浙江大学 | Distributed measurement similarity query processing method based on publish/subscribe mode |
CN107506394A (en) * | 2017-07-31 | 2017-12-22 | 武汉工程大学 | Optimization method for eliminating big data standard relation connection redundancy |
CN107659560A (en) * | 2017-08-28 | 2018-02-02 | 国家计算机网络与信息安全管理中心 | A kind of abnormal auditing method for mass network data flow log processing |
CN108759902A (en) * | 2018-03-30 | 2018-11-06 | 深圳大图科创技术开发有限公司 | A kind of gas ductwork intelligent monitor system based on big data |
CN112666451A (en) * | 2021-03-15 | 2021-04-16 | 南京邮电大学 | Integrated circuit scanning test vector generation method |
CN112666451B (en) * | 2021-03-15 | 2021-06-29 | 南京邮电大学 | Integrated circuit scanning test vector generation method |
WO2022267096A1 (en) * | 2021-06-22 | 2022-12-29 | 深圳计算科学研究院 | Performance measurement method and apparatus for metric space partitioning boundaries, and related device |
CN113435501A (en) * | 2021-06-25 | 2021-09-24 | 深圳大学 | Clustering-based measurement space data partitioning and performance measuring method and related components |
CN113435501B (en) * | 2021-06-25 | 2023-07-07 | 深圳大学 | Clustering-based metric space data partitioning and performance measuring method and related components |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106777133A (en) | A metric-space similarity join processing method based on MapReduce | |
CN104376053B (en) | A kind of storage and retrieval method based on magnanimity meteorological data | |
CN104462260B (en) | A kind of community search method in social networks based on k- cores | |
CN106777093B (en) | Skyline inquiry system based on space time sequence data flow application | |
CN111222418B (en) | Crowdsourcing data rapid fusion optimization method for multiple road segments of lane line | |
CN104408055B (en) | The storage method and device of a kind of laser radar point cloud data | |
CN106951526B (en) | Entity set extension method and device | |
CN105654483A (en) | Three-dimensional point cloud full-automatic registration method | |
CN104850712B (en) | Surface sampled data topology Region Queries method in kind | |
CN109241355A (en) | Accessibility querying method, system and the readable storage medium storing program for executing of directed acyclic graph | |
CN106780721A (en) | Three-dimensional laser spiral scanning point cloud three-dimensional reconstruction method | |
CN106909539A (en) | Image indexing system, server, database and related methods | |
CN105550332A (en) | Dual-layer index structure based origin graph query method | |
CN103926879A (en) | Aviation engine crankcase feature recognition method | |
CN105740521B (en) | Small grid elimination method and device during reservoir numerical simulation system solution | |
CN110275929A (en) | A kind of candidate road section screening technique and mesh segmentation method based on mesh segmentation | |
CN114783068A (en) | Gesture recognition method, gesture recognition device, electronic device and storage medium | |
CN110505322A (en) | A kind of IP address section lookup method and device | |
CN102043857B (en) | All-nearest-neighbor query method and system | |
CN108564116A (en) | A kind of ingredient intelligent analysis method of camera scene image | |
CN107888494B (en) | Community discovery-based packet classification method and system | |
CN114358252A (en) | Operation execution method and device in target neural network model and storage medium | |
CN104573036A (en) | Distance-based algorithm for solving representative node set in two-dimensional space | |
CN106909552A (en) | Image retrieval server, system, coordinate indexing and misarrangement method | |
CN103761298A (en) | Distributed-architecture-based entity matching method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170531 |