CN107657050A

CN107657050A - An aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm"

Info

Publication number: CN107657050A
Application number: CN201710950911.4A
Authority: CN
Inventors: 蒋步星
Original assignee: BEIJING RUNQIAN INFORMATION SYSTEM TECHNOLOGY Co Ltd
Current assignee: BEIJING RUNQIAN INFORMATION SYSTEM TECHNOLOGY Co Ltd
Priority date: 2017-10-13
Filing date: 2017-10-13
Publication date: 2018-02-02

Abstract

The invention provides an aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm". Data set A is divided into N segments, and the primary key value of the first record of each segment is read out. The data set B associated with data set A must be segmented correspondingly: using the primary key value of the first record of each segment of A, binary search locates the association key value of the first record of the corresponding segment in B. Once the starting point of each segment in B has been found, B is segmented at those starting points. Each thread then independently uses the merge algorithm to compute the one-to-one join of one pair of corresponding segments of the same-dimension data sets A and B, traversing each segment only once, and produces a merged data set C. Finally, the merged data sets C of all segments are combined into the final data set D, from which the required records can be referenced.

Description

An aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm"

Technical field

The present invention relates to the parallel computation of one-to-one joins and one-to-many joins, and more particularly to an aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm".

Background art

As intellectual property receives increasing attention worldwide, invention patents are developing rapidly, and the number of invention patents applied for and granted in every field and industry is growing ever larger, particularly in machinery, biology, the chemical industry, medicine and the internet industry. Faced with this large body of granted invention patents, making reasonable use of the known technology they disclose is of the utmost importance. Consequently, many invention patents derived using genetic resources have appeared, which transform and upgrade granted, known technology. The present invention is derived from the invention "A method for computing one-to-one joins and one-to-many joins with a merge algorithm".

In accordance with the requirements of the Patent Law, it is stated that the direct source of the present patent is the invention patent with application number 201710931999.5, which was published early during its application process. Its applicant, like that of the present application, is Beijing Runqian Information System Technology Co., Ltd., and the inventor is Jiang Buxing. Through in-depth study of the invention patent with application number 201710931999.5, the inventor achieved a breakthrough at the technical level of the original patent. The present patent applies to the parallel computation of one-to-one joins and one-to-many joins, which has been difficult to achieve.

Because the speed of a single-core CPU can be raised only so far, the main way to increase computer speed is to use multi-core CPUs. The spread of multi-core CPUs has brought a historic change to the development of numerical computation. Because multi-core and single-core CPUs follow different design philosophies, software and algorithms that run well on a single-core CPU cannot make a multi-core CPU deliver its full performance. To use a multi-core CPU efficiently, the legacy code must be improved and reworked around parallel algorithms, turning the original serial computation into parallel computation. Parallel computation requires each thread to process part of the data, so the data must be segmented and distributed to the threads.

Sometimes a complete result can only be obtained from two or more data sets. In that case a join must be computed.

A one-to-one join relation is a relation between two data sets in which a single row of the first data set is related to a single row of the second data set. If the keys on which the two data sets are associated are both primary keys, the two data sets are called same-dimension data sets.

A one-to-many join relation is a relation between two data sets in which a single row of the first data set is related to one or more rows of the second data set, while a row of the second data set can be related to only one row of the first data set. If, in an application, a data set contains one or more regions that all refer to the same object but each present a different subject, this one-to-many join relation is called a master-sub data set relation. Each such region is called a sub data set, and the data set formed by the sub data sets is called the master data set. The distinguishing feature of master-sub data sets is that the association key of the sub data set is part of the primary key of the master data set.

Data sets can be linked to one another through keys. A primary key is a column in which the value of every row is unique. In a data set, each primary key value is unique. The purpose is to tie the data of different data sets together without repeating all the data in each data set.

In external-storage parallel computation of one-to-one and one-to-many joins over large data volumes, why can the two data sets not each be segmented independently? Because the two data sets are associated through the join relation. If they are segmented independently, the segments may be misaligned, so that the records of the two data sets no longer correspond. Independent segmentation is therefore not possible.
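The misalignment can be illustrated with a small Python sketch. The key values and the helper `split_evenly` are hypothetical and chosen only for illustration; the sketch assumes two sorted key lists held in memory, split into equal-length segments without regard to each other.

```python
# Hypothetical data: A holds one row per key, B holds one or more rows per key.
a_keys = [1, 2, 3, 4]                  # primary keys of A, sorted
b_keys = [1, 1, 1, 1, 1, 2, 3, 4]      # association keys of B, sorted


def split_evenly(rows, n):
    """Split rows into n contiguous segments of (almost) equal length."""
    size, extra = divmod(len(rows), n)
    segments, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        segments.append(rows[start:end])
        start = end
    return segments


# Segment A and B independently, ignoring the join relation between them.
a_segments = split_evenly(a_keys, 2)
b_segments = split_evenly(b_keys, 2)

# For each segment pair, list the B keys that have no partner in the A segment.
misplaced = [
    [k for k in b_seg if k not in a_seg]
    for a_seg, b_seg in zip(a_segments, b_segments)
]
print(misplaced)
```

In this example the second segment of B still contains rows for keys 1 and 2, whose matching rows of A sit in the first segment of A, so a thread that joins only the second pair of segments would miss them.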

The existing technique for external-storage parallel computation of one-to-one and one-to-many joins over large data volumes works as follows: first partition the data to be joined, then compute the one-to-one or one-to-many join of each partition by hashing, and finally merge the results. One shortcoming is that the number of partitions is fixed before the computation, so the number of parallel threads is not necessarily equal to the number of partitions, and computing capacity is lost when they differ. If there are more threads than partitions, each partition gets one thread and the surplus threads sit idle. If there are fewer threads than partitions, not every partition gets its own thread, so some partitions can only be computed serially; to balance the load across threads, the partitions must be made very fine, which adds management complexity. A further shortcoming is that a traditional hash join, when processing big data in parallel, writes to external storage concurrently, and concurrent writes to external storage are very inefficient.

In view of these problems, an aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" is invented, which addresses the currently hard-to-solve problem of external-storage parallel computation of one-to-one and one-to-many joins over large data volumes.

Summary of the invention

To overcome the foregoing problems, the object of the invention is to provide an aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm".

The conditions for implementing the aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" are as follows:

The computer implementing the invention has a multi-core CPU. Data sets A and B are associated with each other and are too large to be loaded into memory, and their association key is known and determined in advance (a necessary condition of the invention). If the primary key of data set A corresponds to the primary key of data set B, A and B are one-to-one join same-dimension data sets. If the primary key of data set A corresponds to the association key of sub data set B (the association key of sub data set B being part of the primary key of master data set A), A and B are one-to-many join master-sub data sets.

The steps are as follows:

1. Prepare data sets A and B in external storage as follows. For a same-dimension relation, sort both data sets by their primary keys. For a master-sub relation, sort master data set A by its primary key, and sort sub data set B by the key associated with that primary key, or by all the keys related to the primary key (with the association key placed first). Retain all the data sets prepared in this way for the join computation. (This preparation is done once; later parallel join computations on A and B do not need to repeat it.)

2. Divide data set A into N segments (N is the number of threads used for the parallel computation; if the number of threads is not fixed, any number of segments may be used), and read out the primary key value of the first record of each segment. The data set B corresponding to A must be segmented accordingly; if the segments do not correspond, the data will be misaligned.

3. Because the association key of data set B corresponds to the primary key of data set A, and B is sorted by its association key, binary search can be used to find, from the primary key value of the first record of each segment of A, the association key value of the first record of the corresponding segment in B.

4. After the starting point of each segment in data set B (the association key value of the first record of each segment) has been found, segment B at those starting points.

5. After data sets A and B have each been divided into N segments, each thread independently uses the merge algorithm to compute the one-to-one join of one pair of corresponding segments of the same-dimension data sets A and B. Each segment of A and B needs to be traversed only once during the computation, and the result is a merged data set C. Finally, the merged data sets C of all segments are combined into the final data set D, from which the required records can be referenced.

When the one-to-many join of master-sub data sets A and B is computed in parallel, the principle and method are the same as in steps 3, 4 and 5 above, except that the primary key of the sub data set is replaced by the association key through which the sub data set corresponds to the master data set.
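The steps above can be sketched in Python as follows. This is only an illustrative sketch under simplifying assumptions, not the implementation described by the patent: the data sets are small, already sorted, non-empty in-memory lists of (key, value) tuples rather than files in external storage, and the names `aligned_bounds`, `merge_join` and `parallel_join` are hypothetical. Threads are used only to mirror the structure of the method; in CPython they would not necessarily speed up CPU-bound work.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


def aligned_bounds(a_keys, b_keys, n_segments):
    """Split A evenly, then find each matching B segment start by binary search.

    Assumes a_keys is non-empty and both key lists are sorted ascending.
    """
    a_starts = [i * len(a_keys) // n_segments for i in range(n_segments)]
    # The first key of each A segment marks where the aligned B segment begins.
    b_starts = [bisect_left(b_keys, a_keys[s]) for s in a_starts]
    a_bounds = list(zip(a_starts, a_starts[1:] + [len(a_keys)]))
    b_bounds = list(zip(b_starts, b_starts[1:] + [len(b_keys)]))
    return a_bounds, b_bounds


def merge_join(a_rows, b_rows):
    """Inner merge join: a_rows has unique sorted keys, b_rows is sorted by key.

    Each input is traversed once; works for one-to-one and one-to-many.
    """
    out, j = [], 0
    for key, a_val in a_rows:
        while j < len(b_rows) and b_rows[j][0] < key:
            j += 1  # skip B rows that have no partner in A
        while j < len(b_rows) and b_rows[j][0] == key:
            out.append((key, a_val, b_rows[j][1]))
            j += 1
    return out


def parallel_join(a_rows, b_rows, n_threads):
    """Join each aligned segment pair in its own task, then concatenate."""
    a_bounds, b_bounds = aligned_bounds(
        [k for k, _ in a_rows], [k for k, _ in b_rows], n_threads
    )

    def join_segment(bounds):
        (a_lo, a_hi), (b_lo, b_hi) = bounds
        return merge_join(a_rows[a_lo:a_hi], b_rows[b_lo:b_hi])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(pool.map(join_segment, zip(a_bounds, b_bounds)))
    # pool.map preserves segment order, so the result stays sorted by key.
    return [row for part in parts for row in part]


# Hypothetical data: A has unique keys, B may repeat a key (one-to-many).
a = [(1, "x"), (2, "y"), (3, "z"), (4, "w")]
b = [(1, 10), (1, 11), (2, 20), (4, 40), (4, 41)]
print(parallel_join(a, b, 2))
```

Because the segment boundaries of B are derived from those of A, the number of segments can be chosen freely to match the number of threads, and the parallel result matches what a single serial merge join over the whole of A and B would produce.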

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that the data of data set A are ordered and data set A is divided into N segments.

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that the data of data set B are ordered, and binary search is used to find the association key value of the first record of each corresponding segment in data set B.

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that the parallel computation of this invention is much faster than the serial computation of "A method for computing one-to-one joins and one-to-many joins with a merge algorithm".

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that the computer hardware implementing the invention supports parallel computation.

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that data sets A and B must be segmented in step with each other, because A and B are in an associated, corresponding relation; otherwise the data will be misaligned.

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that each segment of data set B is derived from the corresponding segment of data set A, so the starting point of each segment of B can be found at very low cost.

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that the aligned-segmentation method exploits the fact that the data in data sets A and B are ordered.

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that the implementation conditions are that the data sets are one-to-one join same-dimension data sets or one-to-many join master-sub data sets, and that the association key is known and determined in advance.

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that data set A is divided into N segments, where N is the number of threads used for the parallel computation; if the number of threads is not fixed, any number of segments may be used.

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that once a data set has been sorted and saved, it can be used repeatedly in merge-algorithm computations.

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that when the one-to-many join of master-sub data sets is computed in parallel with the merge algorithm, the method and principle are the same as for the parallel computation of the one-to-one join of same-dimension data sets, except that the primary key of the sub data set is replaced by the association key through which the sub data set corresponds to the master data set.

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that when a one-to-one or one-to-many join is computed in parallel with the merge algorithm, the advantage is that big-data computation can be completed with only a small amount of memory.

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that when a one-to-one or one-to-many join is computed in parallel with the merge algorithm, compared with the earlier hash method for the same problem, no computing capacity is lost, and efficiency is much higher than with concurrent writes to external storage.

The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as described above is characterised in that the method is applicable to all systems, platforms, software and languages.

The above is a theoretical description. Various optimizations are possible in actual implementation, but the basic principle does not change. Those skilled in the art can make various changes and modifications to the invention without departing from its scope of protection.

Beneficial effects of the present invention

Compared with the earlier technique "A method for computing one-to-one joins and one-to-many joins with a merge algorithm", the advantage is that, for the same problem, parallel computation is much faster than serial computation and better meets the needs of today's users (computers with multi-core CPUs are very common). Compared with the earlier hash method for the same problem, no computing capacity is lost, and efficiency is much higher than with concurrent writes to external storage. A relational database is in theory a system of unordered sets that cannot preserve the order of a data set, so it is hard for it to exploit ordered data; the present invention, by contrast, can exploit the order of the data sets.

The invention is further described below with reference to the drawing and an embodiment.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention.

Embodiment

An information system technology company specializes in producing and selling reporting software. It frequently encounters the following situation: the computer compiles statistics on orders using an order data set (sales order number, sales date, customer, handler, status, remarks, region) and an order detail data set (sales order number, commodity, quantity, unit price, cost price, executed quantity, amount). The order data set and the order detail data set are one-to-many join master-sub data sets, and the association key, the sales order number, is determined. The data in the two data sets are unordered and too large to load into memory. The software's shortcoming is that every one-to-many join is computed with a hash join (the results are fragmented, concurrent writes to external storage are very inefficient, and it is hard to use). The users' computers all have high-end multi-core CPU configurations, and the original "method for computing one-to-one joins and one-to-many joins with a merge algorithm" cannot meet the demand, because much of the computer's resources is wasted; the original serial method has to be changed into a parallel method to use the computer's resources reasonably. Given this situation and the characteristics of the product, the programmers made the following design to improve computation speed, which fully matches the implementation scenario of the present patent.

The computer now needs to compute the total amount of the order detail items in the Beijing region whose amount exceeds 100 yuan. The region must be taken from order data set A and the amount from order detail data set B, so data sets A and B must be joined one-to-many. This calls for computing the one-to-many join of A and B in parallel with the merge algorithm.

Order data set A and order detail data set B in external storage are sorted in ascending order by the primary key, the sales order number, and the sorted data sets are saved.

Data set A is divided into four segments (if the number of threads is not fixed, any number of segments may be used), and the sales order number of the first record of each of the four segments is read out: 2017061203, 2017071201, 2017082531 and 2017092501.

Using the sales order number of the first record of each segment of data set A (2017061203, 2017071201, 2017082531, 2017092501), binary search finds the sales order number of the first record of each corresponding segment in data set B (2017061203, 2017071201, 2017082531, 2017092501). (The search starts by splitting the data set into two halves and determining which half contains key value 2017061203; that half is again split in two and checked for 2017061203, and so on until the key value is found.)

Data set B is segmented using the records with sales order numbers 2017061203, 2017071201, 2017082531 and 2017092501 as the segment starting points.
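The boundary search in this example can be sketched with Python's standard `bisect` module. The four segment keys are the ones given above; the contents of `b_order_numbers` are hypothetical and serve only to make the sketch runnable, and an in-memory list stands in for the sorted data set in external storage.

```python
from bisect import bisect_left

# First sales order number of each of the four segments of A (from the example).
a_segment_first_keys = [2017061203, 2017071201, 2017082531, 2017092501]

# Hypothetical sales order numbers of B, sorted ascending.
# An order may have several detail rows, so values can repeat.
b_order_numbers = [
    2017061203, 2017061203, 2017061507,
    2017071201, 2017071201, 2017071201, 2017080102,
    2017082531, 2017090411,
    2017092501, 2017092501,
]

# bisect_left returns the position of the first row with the given key,
# which is the starting point of the corresponding segment of B.
b_starts = [bisect_left(b_order_numbers, key) for key in a_segment_first_keys]

# Each segment of B runs from its start up to the next segment's start.
b_segments = [
    b_order_numbers[start:end]
    for start, end in zip(b_starts, b_starts[1:] + [len(b_order_numbers)])
]
print(b_starts)
```

Each lookup costs a number of comparisons proportional to the logarithm of the size of B, which is why the starting points can be found cheaply once B is sorted.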

The first thread then independently uses the merge algorithm to compute the one-to-many join of the first segment of data set A (starting at the record with sales order number 2017061203 and ending at the record immediately before the record with sales order number 2017071201) and the first segment of data set B (starting at the first record with sales order number 2017061203 and ending at the record immediately before the first record with sales order number 2017071201). The first segments of A and B each need to be traversed only once, and the result is a merged data set C (sales order number, region, amount). The one-to-many joins of the other segments are computed in the same way.

Finally, the merged data sets C computed for the four segments are combined into the final data set D (sales order number, region, amount). Data set D is traversed with the filter condition that the region is Beijing and the amount exceeds 100 yuan; the qualifying records are found and their amounts are summed.

The example above is the parallel computation of one-to-many join master-sub data sets; the method and principle for one-to-one join same-dimension data sets are the same.
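A single-pass merge join over one pair of segments can be sketched as a Python generator. The function name `merge_join_segment` and the sample rows are hypothetical; the sketch assumes each segment is an iterable sorted by sales order number, and it reduces the rows to the fields needed for the result (sales order number, region, amount).

```python
def merge_join_segment(orders, details):
    """Yield (order_no, region, amount) for one aligned pair of segments.

    orders:  iterable of (order_no, region), sorted, order_no unique
    details: iterable of (order_no, amount), sorted by order_no
    Both inputs are consumed exactly once, front to back.
    """
    details = iter(details)
    detail = next(details, None)
    for order_no, region in orders:
        # Skip detail rows whose order number is smaller than the current order.
        while detail is not None and detail[0] < order_no:
            detail = next(details, None)
        # Emit one output row per detail row of the current order.
        while detail is not None and detail[0] == order_no:
            yield (order_no, region, detail[1])
            detail = next(details, None)


# Hypothetical rows of the first segments of A and B.
orders_segment = [(2017061203, "Beijing"), (2017061507, "Shanghai")]
details_segment = [(2017061203, 120.0), (2017061203, 80.0), (2017061507, 300.0)]

segment_c = list(merge_join_segment(orders_segment, details_segment))
print(segment_c)
```

Because the function only ever asks for the next row of each input, the inputs could just as well be readers over files in external storage, so only a few rows need to be held in memory at a time.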
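The final traversal of data set D amounts to a filter followed by a sum. A minimal Python sketch, with hypothetical rows standing in for data set D and amounts assumed to be in yuan:

```python
# Hypothetical rows of the final data set D: (sales order number, region, amount).
merged_d = [
    (2017061203, "Beijing", 120.0),
    (2017061203, "Beijing", 80.0),
    (2017061507, "Shanghai", 300.0),
    (2017071201, "Beijing", 250.5),
]

# Keep only Beijing rows whose amount exceeds 100 yuan, then total the amounts.
total = sum(
    amount
    for _order_no, region, amount in merged_d
    if region == "Beijing" and amount > 100
)
print(total)
```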

Claims

1. An aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm", comprising the following steps:

A. Prepare data sets A and B in external storage as follows. For a same-dimension relation, sort both data sets by their primary keys. For a master-sub relation, sort master data set A by its primary key, and sort sub data set B by the key associated with that primary key, or by all the keys related to the primary key. Retain all the data sets prepared in this way for the join computation.

B. Divide data set A into N segments, and read out the primary key value of the first record of each segment. The data set B corresponding to A must be segmented accordingly; if the segments do not correspond, the data will be misaligned.

C. Because the association key of data set B corresponds to the primary key of data set A, and B is sorted by its association key, binary search can be used to find, from the primary key value of the first record of each segment of A, the association key value of the first record of the corresponding segment in B.

D. After the starting point of each segment in data set B has been found, segment B at those starting points.

E. After data sets A and B have each been divided into N segments, each thread independently uses the merge algorithm to compute the one-to-one join of one pair of corresponding segments of the same-dimension data sets A and B. Each segment of A and B needs to be traversed only once during the computation, and the result is a merged data set C. Finally, the merged data sets C of all segments are combined into the final data set D, from which the required records can be referenced.

When the one-to-many join of master-sub data sets A and B is computed in parallel, the principle and method are the same as in steps C, D and E above, except that the primary key of the sub data set is replaced by the association key through which the sub data set corresponds to the master data set.

2. The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as claimed in claim 1, characterised in that the computer implementing the invention has a multi-core CPU; data sets A and B are associated with each other and are too large to be loaded into memory, and their association key is known and determined in advance (a necessary condition of the invention). If the primary key of data set A corresponds to the primary key of data set B, A and B are one-to-one join same-dimension data sets. If the primary key of data set A corresponds to the association key of sub data set B (the association key of sub data set B being part of the primary key of master data set A), A and B are one-to-many join master-sub data sets.

3. The aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm" as claimed in claim 1, characterised in that the data of data set A are ordered and data set A is divided into N segments.

4. An aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm", characterised in that the data of data set B are ordered, and binary search is used to find the association key value of the first record of each corresponding segment in data set B.

5. An aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm", characterised in that the parallel computation of this invention is much faster than the serial computation of "A method for computing one-to-one joins and one-to-many joins with a merge algorithm".

6. An aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm", characterised in that the computer hardware implementing the invention supports parallel computation.

7. An aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm", characterised in that data sets A and B must be segmented in step with each other, because A and B are in an associated, corresponding relation; otherwise the data will be misaligned.

8. An aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm", characterised in that each segment of data set B is derived from the corresponding segment of data set A, so the starting point of each segment of B can be found at very low cost.

9. An aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm", characterised in that the aligned-segmentation method exploits the fact that the data in data sets A and B are ordered.

10. An aligned-segmentation parallel method based on "computing one-to-one joins and one-to-many joins with a merge algorithm", characterised in that the method is applicable to all systems, platforms, software and languages.