CN109033295A - Method and device for merging ultra-large data sets
- Publication number: CN109033295A
- Application number: CN201810772324.5A
- Authority: CN (China)
- Prior art keywords: data, fragmentation, data set, major key, data fragmentation
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a method and device for merging ultra-large data sets. It relates to the technical field of data processing and addresses the low efficiency with which ultra-large data sets are merged in the prior art. The method and device are implemented on the basis of distributed computing and include: converting a first association primary key of a first data set into data of a preset field type, sharding the first data set according to the first association primary key to obtain a preset quantity of first data shards, and caching the first data shards in a preset cache system; converting a second association primary key of a second data set into data of the preset field type, and sharding the second data set according to the second association primary key to obtain the preset quantity of second data shards; and reading the first data shards from the preset cache system, matching the first data shards with the second data shards, and merging the matched first and second data shards. The invention can be used to merge ultra-large data sets.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and device for merging ultra-large data sets.
Background
An ultra-large data set is a collection made up of data sets of very large data magnitude. Such data sets are already very large, and the volume of newly added data is also very large. Because of limits on system resources such as memory and disk, ultra-large data sets are difficult to store in a relational database. Therefore, to perform association analysis on two associated ultra-large data sets, the two data sets are usually merged first. This turns association analysis across two data sets into retrieval, classification, and statistics over a single data set, which makes it easier to mine information from the associated ultra-large data sets.
In practice, however, when two associated ultra-large data sets are merged, the data set of smaller magnitude is usually stored in a database first, and the data set of larger magnitude is split into multiple data shards, which are then merged in parallel with the data set stored in the database. This process has two main problems. First, sharding a data set as in the prior art tends to produce data skew: a large amount of data is concentrated on one or a few shards for merge computation, while the small remainder is spread across the many other shards. The overall efficiency of the merge computation is determined by the computational efficiency of all shards, and the few heavily loaded shards merge far more slowly than the average shard, so the overall merging efficiency of the data set drops sharply. Second, during merging, the data of every data shard must be traversed one by one: each time a data item is read, an access request must be sent to the database to look up whether the database holds corresponding data that can be merged with it. This causes an excessive number of database requests and increases network pressure. Moreover, database performance limits the number of requests that most databases can serve, which severely restricts any improvement in the overall merging efficiency of the data set.
Summary of the invention
Embodiments of the present invention provide a method and device for merging ultra-large data sets, to solve the prior-art problem of low merging efficiency for ultra-large data sets.
To achieve the above object, the present invention adopts the following technical solutions:
In a first aspect, the present invention provides a method for merging ultra-large data sets. The method is implemented on the basis of distributed computing and includes:
converting a first association primary key of a first data set into data of a preset field type, sharding the first data set according to the first association primary key to obtain a preset quantity of first data shards, and caching the first data shards in a preset cache system;
converting a second association primary key of a second data set into data of the preset field type, and sharding the second data set according to the second association primary key to obtain the preset quantity of second data shards;
reading the first data shards from the preset cache system, matching the first data shards with the second data shards, and merging the matched first and second data shards.
In the method for merging ultra-large data sets provided by the present invention, the association primary keys of the two data sets to be merged can be converted into data of a preset field type. When the two data sets are then sharded on the basis of distributed computing, the data in each data set can be distributed as evenly as possible across the shards according to the converted association primary key. This effectively avoids data skew in the sharding result and improves the overall merging efficiency of the data sets. In addition, the present invention caches the data set to be merged in a preset cache system instead of storing it in a database, so the merging process is no longer limited by database access pressure or by the number of database accesses, which further improves the overall merging efficiency of the data sets.
Optionally, the preset field type is the byte type, and the data of the preset field type is a long value of the byte type.
Optionally, converting the first association primary key of the first data set into data of the preset field type and then sharding the first data set according to the first association primary key specifically includes:
reading first data in the first data set in advance, and extracting from the first data a first field that needs to be merged;
judging whether the first field is valid;
if the judgment result is yes, converting the first association primary key of the first data set to the byte type and obtaining the long value of the first association primary key;
calculating a first hash value of the long value of the first association primary key of the first data set, sharding the first data set according to the first hash value to obtain the preset quantity of first data shards, and caching the first data shards in the preset cache system.
Optionally, converting the second association primary key of the second data set into data of the preset field type and then sharding the second data set according to the second association primary key specifically includes:
reading second data in the second data set in advance, and extracting from the second data a second field that needs to be merged;
judging whether the second field is valid;
if the judgment result is yes, converting the second association primary key of the second data set to the byte type and obtaining the long value of the second association primary key;
calculating a second hash value of the long value of the second association primary key, and sharding the second data set according to the second hash value.
Optionally, a corresponding shard number is assigned to each first data shard, and a corresponding shard number is assigned to each second data shard;
in that case, reading the first data shards from the preset cache system and matching the first data shards with the second data shards specifically includes:
reading the first data shards from the preset cache system, and matching the first data shards with the second data shards according to the shard numbers of the first data shards and the shard numbers of the second data shards.
Optionally, merging the matched first data shard and second data shard specifically includes:
reorganizing in advance the second shard data of the second data shard;
reading the first shard data in the first data shard one by one from the preset cache system, and looking up whether the first association primary key of the first shard data exists in the second data shard matched with the first data shard;
if so, looking up, according to the first association primary key of the first shard data and a preset association condition, whether the second data shard contains second shard data that matches the first shard data;
if so, merging the first shard data and the second shard data.
Optionally, the preset quantity is determined according to the data sizes of the first data set and the second data set.
In a second aspect, the present invention provides a device for merging ultra-large data sets. The device is implemented on the basis of distributed computing and includes:
a first sharding module, configured to convert a first association primary key of a first data set into data of a preset field type, shard the first data set according to the first association primary key to obtain a preset quantity of first data shards, and cache the first data shards in a preset cache system;
a second sharding module, configured to convert a second association primary key of a second data set into data of the preset field type, and shard the second data set according to the second association primary key to obtain the preset quantity of second data shards;
a matching module, configured to read the first data shards from the preset cache system and match the first data shards with the second data shards;
a merging module, configured to merge the matched first and second data shards.
Optionally, the preset field type is the byte type, and the data of the preset field type is a long value of the byte type.
Optionally, the first sharding module is specifically configured to:
read first data in the first data set in advance, and extract from the first data a first field that needs to be merged;
judge whether the first field is valid;
if the judgment result is yes, convert the first association primary key of the first data set to the byte type and obtain the long value of the first association primary key;
calculate a first hash value of the long value of the first association primary key of the first data set, shard the first data set according to the first hash value to obtain the preset quantity of first data shards, and cache the first data shards in the preset cache system.
Optionally, the second sharding module is specifically configured to:
read second data in the second data set in advance, and extract from the second data a second field that needs to be merged;
judge whether the second field is valid;
if the judgment result is yes, convert the second association primary key of the second data set to the byte type and obtain the long value of the second association primary key;
calculate a second hash value of the long value of the second association primary key, and shard the second data set according to the second hash value.
Optionally, a corresponding shard number is assigned to each first data shard, and a corresponding shard number is assigned to each second data shard;
in that case, the matching module is specifically configured to:
read the first data shards from the preset cache system, and match the first data shards with the second data shards according to the shard numbers of the first data shards and the shard numbers of the second data shards.
Optionally, the merging module is specifically configured to:
reorganize in advance the second shard data of the second data shard;
read the first shard data in the first data shard one by one from the preset cache system, and look up whether the first association primary key of the first shard data exists in the second data shard matched with the first data shard;
if so, look up, according to the first association primary key of the first shard data and a preset association condition, whether the second data shard contains second shard data that matches the first shard data;
if so, merge the first shard data and the second shard data.
Optionally, the preset quantity is determined according to the data sizes of the first data set and the second data set.
In a third aspect, a device for merging ultra-large data sets is provided, including a communication interface, a processor, a memory, and a bus. The memory is used to store computer-executable instructions, and the processor is connected to the memory through the bus. When the device runs, the processor executes the computer-executable instructions stored in the memory, so that the device for merging ultra-large data sets performs the method of the first aspect.
In a fourth aspect, a system for merging ultra-large data sets is provided, including the above device for merging ultra-large data sets and a preset cache system.
In a fifth aspect, a storage medium is provided. The storage medium stores instruction code, and the instruction code is used to perform the method of the first aspect.
In a sixth aspect, a computer program product is provided. The computer program product includes instruction code, and the instruction code is used to perform the method of the first aspect.
It should be understood that any of the merging device, merging system, storage medium, or computer program product provided above is used to perform the corresponding method of the first aspect. For the beneficial effects they can achieve, refer to the beneficial effects of the method of the first aspect above and of the corresponding solutions in the detailed description below, which are not repeated here.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the steps of a method for merging ultra-large data sets provided by an embodiment of the present invention;
Fig. 2 is a flow chart of the steps of another method for merging ultra-large data sets provided by an embodiment of the present invention;
Fig. 3 is a flow chart of the steps of another method for merging ultra-large data sets provided by an embodiment of the present invention;
Fig. 4 is a flow chart of the steps of another method for merging ultra-large data sets provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a device for merging ultra-large data sets provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another device for merging ultra-large data sets provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the protection scope of the present application. Terms such as "first" and "second" do not indicate any order and can be read simply as names for the objects they describe. In the embodiments of the present application, words such as "illustrative" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "illustrative" or "for example" should not be construed as preferable to or more advantageous than other embodiments or designs. Rather, such words are intended to present a related concept in a concrete manner. In addition, in the description of the embodiments of the present application, unless otherwise stated, "a plurality of" means two or more.
Fig. 1 is a flow chart of the steps of a method for merging ultra-large data sets provided by an embodiment of the present invention. The method in this embodiment is implemented on the basis of distributed computing. As shown in Fig. 1, the method in this embodiment includes the following steps:
Step S101: convert the first association primary key of the first data set into data of a preset field type, shard the first data set according to the first association primary key to obtain a preset quantity of first data shards, and cache the first data shards in a preset cache system.
Here, the first data set is a data set that needs to be merged, and the first association primary key is the primary key of each data table in the first data set. The data of the preset field type may contain all of the field information of the first association primary key. The preset field type may be the byte (Byte) type, and the data of the preset field type may be, for example, the long value of the first association primary key obtained by operating on the first association primary key according to the preset field type and a preset algorithm.
It should be noted that the prior art generally uses an IP address as the association primary key and shards the data set by applying a hash algorithm to the IP address. The data can generally be divided into at most 256 parts, and, to minimize data skew in the sharding result, the last field of the IP address is hashed and the remainder is taken to shard the data set. However, the last field of an IP address often takes a reserved value such as xxx.xxx.xxx.0; for example, among 10 IP addresses, 5 may have a last field that is a reserved value such as xxx.xxx.xxx.0. The IP address suffix (that is, the last field) is therefore not sufficiently dispersed, so the data in a sharding result based on the last field is not sufficiently dispersed either, and data skew cannot be avoided. When the number of shards is less than 256, for example 255, sharding the data in this way further damages the dispersion of the data in the sharding result and makes the data skew even more serious. In addition, this hash-based sharding applies only when the association primary key is an IP address; if the association primary key is set to another type of data, the hash algorithm cannot be used to shard the data on that key.
To solve these problems, in the present invention the first association primary key of the first data set can be converted into data of a preset field type. The data of the preset field type can contain all of the field information of the first association primary key and can be hashed. Converting the first association primary key into data of the preset field type therefore ensures that every part of the first association primary key takes part in the computation when the data is sharded. This gives the sharding result of the first data set the highest possible degree of dispersion, effectively avoids data skew in the sharding result, and also widens the range of keys to which the hash algorithm applies.
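The skew argument above can be illustrated with a small sketch. It is not the patented procedure: the sample addresses, the shard count of 16, and the use of SHA-256 as a stand-in for "hash the whole key" are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

def shard_by_last_field(ip: str, num_shards: int) -> int:
    """Prior-art style: only the last field of the IP address decides the shard."""
    return int(ip.rsplit(".", 1)[1]) % num_shards

def shard_by_whole_key(ip: str, num_shards: int) -> int:
    """Every byte of the key contributes (SHA-256 is an assumed stand-in)."""
    digest = hashlib.sha256(ip.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def largest_shard_size(ips, assign, num_shards):
    """Size of the most heavily loaded shard under a given assignment function."""
    counts = Counter(assign(ip, num_shards) for ip in ips)
    return max(counts.values())

# Hypothetical sample: half of the addresses end in the reserved suffix .0
reserved = [f"10.{i // 250}.{i % 250}.0" for i in range(500)]
ordinary = [f"10.{i // 250}.{i % 250}.{i % 250 + 1}" for i in range(500)]
sample_ips = reserved + ordinary

skewed_max = largest_shard_size(sample_ips, shard_by_last_field, 16)
spread_max = largest_shard_size(sample_ips, shard_by_whole_key, 16)
print(skewed_max, spread_max)
```

With the last-field rule, every address ending in .0 lands on shard 0, so one shard holds at least half of the sample, while hashing the whole key spreads the same addresses far more evenly.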
In a specific implementation, the way the long value of the first association primary key is obtained in the above process can be set by a person skilled in the art according to the actual situation, and the present invention does not limit it. In one preferred approach, the long value can be obtained as follows: after the field type of the first association primary key is converted to the byte type, the byte-type first association primary key and an algorithm seed of a specific length are repeatedly combined using bitwise operations such as AND and OR, and the absolute value of the result is taken as the long value of the first association primary key. The algorithm seed of the specific length, and the process of repeatedly combining the first association primary key with that seed, can be set by a person skilled in the art according to the actual situation, and the present invention does not limit them. It can of course be understood that, in a specific implementation, the preset field type may also be another type, and the data of the preset field type may be data other than the above long value, as long as data skew in the sharding result is effectively avoided.
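One possible way to derive such a long value is sketched below. The text leaves both the seed and the exact bit operations open, so this sketch swaps in an FNV-1a-style XOR-and-multiply mixing step; the seed constant, the function name `key_to_long`, and the 64-bit width are assumptions, not details taken from the source.

```python
SEED = 0x9E3779B97F4A7C15   # hypothetical 64-bit algorithm seed
FNV_PRIME_64 = 0x100000001B3
MASK_64 = (1 << 64) - 1

def key_to_long(key: str, seed: int = SEED) -> int:
    """Convert a key to bytes, mix every byte with the seed, return a non-negative value."""
    acc = seed
    for byte in key.encode("utf-8"):             # byte-type form of the key
        acc = ((acc ^ byte) * FNV_PRIME_64) & MASK_64
    if acc >= 1 << 63:                           # reinterpret as a signed 64-bit value
        acc -= 1 << 64
    return abs(acc)                              # absolute value, as the text describes

print(key_to_long("192.168.1.0"))
print(key_to_long("device-0001"))                # keys need not be IP addresses
```

Because every byte of the key is mixed in, two keys that differ only in an early field still produce unrelated values. In a language with fixed-width integers, the absolute value of the minimum 64-bit value would need separate handling.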
In a specific implementation, the preset quantity can be determined according to the data sizes of the first data set and the second data set (that is, the second data set in the subsequent step S102). When the data volume of the first data set and the second data set is larger, the preset quantity is increased accordingly; otherwise, it is reduced accordingly.
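A minimal sketch of such a sizing rule follows. The source only states that the quantity grows and shrinks with the data volume; the target of 128 MiB per shard and the function name `choose_shard_count` are assumptions for illustration.

```python
import math

def choose_shard_count(first_bytes: int, second_bytes: int,
                       target_shard_bytes: int = 128 * 1024 * 1024) -> int:
    """Pick a shard count so that an average shard holds about target_shard_bytes."""
    total = first_bytes + second_bytes
    return max(1, math.ceil(total / target_shard_bytes))

GIB = 1024 ** 3
print(choose_shard_count(10 * GIB, 30 * GIB))
```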
The preset cache system may be HDFS (Hadoop Distributed File System) or another similar distributed file storage system. In a specific implementation, this step may be executed with the big data computing framework Spark, or with similar big data processing technologies such as Hadoop MR (MapReduce), the Hive execution engine, or the real-time computing engine Storm.
Step S102: convert the second association primary key of the second data set into data of the preset field type, and shard the second data set according to the second association primary key to obtain the preset quantity of second data shards.
Specifically, the data of the preset field type and the process of sharding the second data set according to the second association primary key in this step are similar to those in step S101; refer to the corresponding description in step S101, which is not repeated here.
The technology used to obtain the preset quantity of second data shards is the same as that used in step S101; refer to the corresponding description in step S101, which is not repeated here.
Step S103: read the first data shards from the preset cache system, match the first data shards with the second data shards, and merge the matched first and second data shards.
Specifically, the first data shards and the second data shards can be matched in various ways. For example, if steps S101 and S102 are executed with Spark, Spark automatically builds an index for the first data shards and an index for the second data shards, and the first and second data shards can then be matched according to these indexes. Suppose the first data shards comprise three shards, data shard 1, data shard 2, and data shard 3, and Spark builds the following index for them: 1 - data shard 1, 2 - data shard 2, 3 - data shard 3. Suppose the second data shards comprise three shards, data shard a, data shard b, and data shard c, and Spark builds the following index for them: 1 - data shard a, 2 - data shard b, 3 - data shard c. Shards with the same index in the first and second data shards can then be matched: data shard 1 matches data shard a (both have index 1), data shard 2 matches data shard b (both have index 2), and data shard 3 matches data shard c (both have index 3).
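The index-based matching in this example can be sketched in a few lines. Plain dictionaries stand in for the indexes that the framework would maintain, and the function name `match_shards` is hypothetical.

```python
# Index -> shard, mirroring the example in the text
first_shards = {1: "data shard 1", 2: "data shard 2", 3: "data shard 3"}
second_shards = {1: "data shard a", 2: "data shard b", 3: "data shard c"}

def match_shards(first: dict, second: dict) -> list:
    """Pair up the shards that carry the same index in both data sets."""
    shared_indexes = sorted(first.keys() & second.keys())
    return [(first[i], second[i]) for i in shared_indexes]

pairs = match_shards(first_shards, second_shards)
print(pairs)
```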
It can of course be understood that, in a specific implementation, the first and second data shards may also be matched in ways other than the one listed above, for example according to IP address or according to the time at which the data shards were generated. The present invention does not limit the way in which the first data shards and the second data shards are matched.
When the first data shard and the second data shard are merged, the first association primary key of the first data shard, the second association primary key of the second data shard, and a preset association condition can be used to determine whether the two shards contain data that can be merged. If such data exists, it is merged. The preset association condition can be set by a person skilled in the art according to the actual situation, and the present invention does not limit it.
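The merge of one matched pair of shards can be sketched as a lookup-based join. The record layout, the field names, and the default association condition (always true) are assumptions made for illustration; the source leaves the association condition to the implementer.

```python
def merge_shard_pair(first_shard, second_shard, key_field,
                     condition=lambda first_row, second_row: True):
    """Merge rows of two matched shards that share a key and satisfy the condition."""
    # Reorganize the second shard once so that each key lookup is cheap
    rows_by_key = {}
    for row in second_shard:
        rows_by_key.setdefault(row[key_field], []).append(row)

    merged = []
    for row in first_shard:                          # read first-shard rows one by one
        for candidate in rows_by_key.get(row[key_field], []):
            if condition(row, candidate):            # preset association condition
                merged.append({**candidate, **row})  # first-shard fields win on conflict
    return merged

first_shard = [{"ip": "10.0.0.1", "bytes": 10}, {"ip": "10.0.0.9", "bytes": 5}]
second_shard = [{"ip": "10.0.0.1", "host": "a"}]
merged_rows = merge_shard_pair(first_shard, second_shard, "ip")
print(merged_rows)
```

Because both shards are already in memory or in the cache system, no per-row database request is needed during the merge.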
It can be seen that, in the method for merging ultra-large data sets provided by the present invention, the association primary keys of the two data sets to be merged can be converted into data of a preset field type, so that when the two data sets are sharded on the basis of distributed computing, the data in each data set is distributed as evenly as possible across the shards according to the converted association primary key. This effectively avoids data skew in the sharding result and improves the overall merging efficiency of the data sets. In addition, the present invention caches the data set to be merged in a preset cache system instead of storing it in a database, so the merging process no longer requires a large number of database accesses, which further improves the overall merging efficiency of the data sets.
Step S101 may be implemented with the following scheme, in which Spark is used to shard the first data set and HDFS is used as the preset cache system. As shown in Fig. 2, the scheme includes the following steps:
Step S201: read the first data in the first data set in advance, and extract from the first data the first field that needs to be merged.
An ultra-large data set consists of multiple files. Because system memory is limited, reading and processing the entire ultra-large data set at once would exceed the memory capacity, so the data set can be read and processed in batches. For example, each batch can read the first data of several files into system memory, perform the processing of this step and the subsequent steps (steps S202 to S205) on that data in memory, and, once processing is complete, combine the results for output in a preset manner (for example, with the union() function), to achieve better read and write performance.
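The batch-wise reading described above can be sketched without any cluster framework. Here `read_file` and `process` are hypothetical callables supplied by the caller, the batch size of two files is an arbitrary assumption, and a plain list concatenation stands in for the union of per-batch results.

```python
from itertools import islice

def batches(paths, batch_size):
    """Yield successive lists of at most batch_size file paths."""
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def process_in_batches(paths, read_file, process, batch_size=2):
    """Read a few files at a time, process them in memory, and combine the results."""
    results = []
    for batch in batches(paths, batch_size):
        rows = [row for path in batch for row in read_file(path)]
        results.extend(process(rows))      # union of the per-batch outputs
    return results

# In-memory stand-in for files on disk
fake_files = {"f1": [1, 2], "f2": [3], "f3": [4, 5]}
combined = process_in_batches(
    ["f1", "f2", "f3"],
    read_file=lambda path: fake_files[path],
    process=lambda rows: [value * 10 for value in rows],
)
print(combined)
```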
The map() function can be used to process each item of first data read from the first data set, one by one, and extract the first field that needs to be merged. The first field that needs to be merged is the field required when the data is finally merged. The first field can be set by a person skilled in the art according to the actual situation, and the present invention does not limit it.
Step S202: judge whether the first field is valid. If the judgment result is no, perform step S203; if the judgment result is yes, perform step S204.
The validity of the first field is checked against a preset verification rule to judge whether the first field is valid. The preset verification rule can be set by a person skilled in the art according to the actual situation, and the present invention does not limit it.
Step S203: discard the first data corresponding to the first field.
Step S204: convert the first association primary key of the first data set to the byte type and obtain the long value of the first association primary key.
For the specific way of obtaining the long value of the first association primary key, refer to the corresponding description in step S101, which is not repeated here.
Step S205: calculate the first hash value of the long value of the first association primary key of the first data set, shard the first data set according to the first hash value to obtain the preset quantity of first data shards, and cache the first data shards in the preset cache system.
Specifically, the first hash value is taken modulo the number of shards (that is, the preset quantity), the resulting remainder is used as the shard position of the data corresponding to the first association primary key, and that data is stored at that shard position, thereby sharding the first data set. The sharded data is then cached in the preset cache system.
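Steps S202 to S205 can be sketched end to end as follows. Several pieces are assumptions rather than details from the source: the verification rule (a non-empty string), the SHA-256-based stand-in for the long value, the folding of the 64-bit value into 32 bits as the hash, and all function names.

```python
import hashlib

def is_valid(field) -> bool:
    """Assumed verification rule: the field must be a non-empty string."""
    return isinstance(field, str) and field.strip() != ""

def long_value(key: str) -> int:
    """Stand-in for the long value: a non-negative 63-bit integer derived from all key bytes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> 1

def hash_of_long(value: int) -> int:
    """Fold the upper half into the lower half to get a 32-bit hash (an assumed choice)."""
    return (value ^ (value >> 32)) & 0xFFFFFFFF

def shard_position(key: str, num_shards: int) -> int:
    """Remainder of the hash value divided by the number of shards."""
    return hash_of_long(long_value(key)) % num_shards

def shard_data_set(records, key_field, num_shards):
    """Drop invalid records, then place each record on the shard chosen by its key."""
    shards = [[] for _ in range(num_shards)]
    for record in records:
        key = record.get(key_field)
        if not is_valid(key):
            continue                       # step S203: discard the record
        shards[shard_position(key, num_shards)].append(record)
    return shards

records = [{"ip": "10.0.0.1"}, {"ip": ""}, {"ip": "10.0.0.2"},
           {"ip": "10.0.0.1", "x": 1}, {"name": "no key"}]
shards = shard_data_set(records, "ip", 4)
print([len(shard) for shard in shards])
```

Records with the same key always land on the same shard position, which is what later allows shards to be matched by position alone.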
In addition, after the first data set is sharded, a corresponding shard number may be assigned to each first data shard, and the shard number of each first data shard may be indexed to the first hash value of that shard through the index() function. This establishes a correspondence between shard numbers and first data shards, which is used when matching the first data shards with the second data shards.
On the basis of the scheme shown in Fig. 2, step S102 may be implemented using the following scheme which, as shown in Fig. 3, includes the following steps:
Step S301: Pre-read the second data in the second data set and extract the second field that needs to be merged from the second data.
Step S302: Judge whether the second field is valid; if the result is no, execute step S303; if the result is yes, execute step S304.
Step S303: Discard the second data corresponding to the second field.
Step S304: Convert the second association primary key of the second data set to the byte type and obtain the long value of the second association primary key.
For steps S301 to S304, refer to the corresponding description of steps S201 to S204; details are not repeated here.
Step S305: Calculate the second hash value of the long value of the second association primary key, and shard the second data set according to the second hash value.
The second hash value is taken modulo the number of shards (that is, the preset number), and the remainder is used as the shard position of the data corresponding to the second association primary key. That data is stored at this shard position, which shards the second data set.
In addition, after the second data set is sharded, a corresponding shard number may be assigned to each second data shard, and the shard number of each second data shard may be indexed to the second hash value of that shard through the index() function. This establishes a correspondence between shard numbers and second data shards, which is used when matching the first data shards with the second data shards.
According to the implementation processes shown in Fig. 2 and Fig. 3, step S103 may be implemented using the following scheme which, as shown in Fig. 4, includes the following steps:
Step S401: Read the first data shards from the preset cache system and match the first data shards with the second data shards.
Specifically, the first data shards and the second data shards are matched according to their shard numbers: a first data shard and a second data shard with the same shard number are treated as a matched pair of data shards.
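The matching in step S401 can be sketched as pairing shards that share a shard number. The dictionary layout (shard number mapped to a list of rows) and the function name match_shards are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Shard = List[dict]


def match_shards(first_shards: Dict[int, Shard],
                 second_shards: Dict[int, Shard]) -> List[Tuple[int, Shard, Shard]]:
    """Pair each first data shard with the second data shard of the same number.

    Shard numbers that occur on only one side produce no pair.
    """
    common = sorted(first_shards.keys() & second_shards.keys())
    return [(number, first_shards[number], second_shards[number])
            for number in common]
```

When both data sets are sharded with the same preset number, every shard number occurs on both sides and the result holds exactly that many pairs.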
Step S402: Pre-reorganize the second data shard data of the second data shard.
The second data shard data is converted into a Map by means of an iterator, to reduce the number of traversals and improve merging efficiency. Specifically, the second data shard data items that share the same second association primary key may be collected into a list, and the Map is defined with the second association primary key as the key and that list as the value. For example, after map.put("A", "B"), where A is the second association primary key and B is the second data shard data sharing that key, map.get("A") returns B.
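The reorganization in step S402 amounts to grouping rows by key in a single pass. A minimal sketch, assuming each row is a dictionary and the key field name is passed in by the caller:

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Row = Dict[str, str]


def regroup(second_shard: Iterable[Row], key_field: str) -> Dict[str, List[Row]]:
    """Build key -> list of rows, consuming the shard's iterator exactly once."""
    grouped: Dict[str, List[Row]] = defaultdict(list)
    for row in second_shard:
        grouped[row[key_field]].append(row)
    return dict(grouped)
```

After this step, finding all second-shard rows for a given key is a single dictionary lookup instead of a scan over the whole shard.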
Step S403: Read the first data shard data in the first data shard one item at a time from the preset cache system, and check whether the first association primary key of that first data shard data exists in the second data shard matched with the first data shard; if not, execute step S404; if so, execute step S405.
Specifically, in this scheme it is preferable to read the first data shard data of the first data shard from the preset cache system one item at a time, so that the merge processing in the subsequent steps (steps S404 to S407) can be performed item by item, which reduces memory requirements. It should be understood that this reading and processing mode is only a preferred one; in a specific implementation, multiple items of first data shard data may also be read from the preset cache system at once and merged together.
After an item of first data shard data is read from the preset cache system, it may be decomposed into fields. Then, according to the hash value of its first association primary key, the Map defined in step S402 is searched for a second association primary key whose hash value is identical to that of the first association primary key. If none is found, the second data shard matched with the first data shard contains no second association primary key identical to the first association primary key of this item, and step S404 is executed; if one is found, step S405 is executed.
Step S404: Discard the first data shard data.
Step S405: According to the first association primary key of the first data shard data and a preset association condition, check whether the second data shard contains second data shard data that matches the first data shard data; if not, execute step S406; if so, execute step S407.
According to the first association primary key of the first data shard data, the list corresponding to that key is retrieved from the Map defined in step S402, and the list is searched, according to the preset association condition, for second data shard data that matches the first data shard data. If none is found, the second data shard contains no second data shard data that is associated with the first data shard data and can be merged with it, and step S406 is executed; if one is found, step S407 is executed.
The preset association condition may be configured by those skilled in the art according to the actual situation, and the present invention does not limit it.
Step S406: Discard the first data shard data.
Step S407: Merge the first data shard data with the second data shard data.
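Steps S403 to S407 for one matched pair of shards can be sketched as the loop below. It makes several assumptions: rows are dictionaries, the Map from step S402 is keyed by the association key itself rather than by its hash value, the association condition is a caller-supplied predicate, and on a field-name clash the first data set's value wins.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


def merge_shard_pair(first_rows: Iterable[Row],
                     second_by_key: Dict[str, List[Row]],
                     key_field: str,
                     condition: Callable[[Row, Row], bool]) -> List[Row]:
    """Merge one first data shard with its matched, regrouped second data shard."""
    merged: List[Row] = []
    for first in first_rows:                      # S403: one item at a time
        candidates = second_by_key.get(first[key_field])
        if candidates is None:
            continue                              # S404: key absent, discard
        matches = [s for s in candidates if condition(first, s)]  # S405
        if not matches:
            continue                              # S406: no row meets the condition
        for second in matches:                    # S407: merge each matching row
            merged.append({**second, **first})
    return merged
```

Passing first_rows as a generator keeps only one first-shard row in memory at a time, which reflects the item-by-item reading described for step S403.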
Preferably, the mapPartitionsWithIndex() function may be used to merge the first data shard data and the second data shard data in parallel.
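The function name above appears to match the Apache Spark RDD operation mapPartitionsWithIndex, which applies a function to each partition together with its index. The sketch below does not use Spark; it substitutes a standard-library thread pool to show the same idea of running one merge task per shard pair, keyed by shard number. The field name "id" and the simple equality join are assumptions, and in CPython threads illustrate the structure rather than provide real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Row = Dict[str, str]
Pair = Tuple[int, List[Row], List[Row]]


def merge_pair(pair: Pair) -> Tuple[int, List[Row]]:
    """Merge one (shard number, first shard, second shard) triple on the 'id' field."""
    shard_no, first_rows, second_rows = pair
    second_by_key: Dict[str, List[Row]] = {}
    for row in second_rows:
        second_by_key.setdefault(row["id"], []).append(row)
    merged = [{**s, **f}
              for f in first_rows
              for s in second_by_key.get(f["id"], [])]
    return shard_no, merged


def merge_all(pairs: List[Pair], parallelism: int) -> Dict[int, List[Row]]:
    """Run merge_pair over all shard pairs with the given degree of parallelism."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return dict(pool.map(merge_pair, pairs))
```

Each task touches only its own pair of shards, so no coordination between tasks is needed beyond collecting the per-shard results.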
With the method provided by the present invention, assume that the degree of parallelism of the merge is n; that is, the first data set is divided into n first data shards and the second data set into n second data shards, so that matching the first data shards with the second data shards yields n pairs of small data sets, which are then merged with degree of parallelism n. Assume further that the sizes of the two associated super large data sets are x and y respectively. In the prior art, the complexity of merging by one-by-one traversal is x*y; in the present invention, merging the n pairs of small data sets in parallel reduces the merge complexity to (x/n)*(y/n). If the time spent on hash operations is ignored, the time consumed by the present invention is approximately 1/n/n of the time required for merging by one-by-one traversal, so the present invention can effectively improve the overall efficiency of merging data sets. In addition, the present invention does not need to take system resource limits into account (such as limits on database access pressure and on the number of accesses), so no data is lost. The merging process fully complies with the preset association condition, so if that condition is correct, the success rate and accuracy of data set merging in the present invention can reach 100%.
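The arithmetic behind these estimates can be checked with a small worked example. The figures below are illustrative only and assume that the hash spreads keys evenly, so each shard holds x/n or y/n rows, and that all n pairs run at the same time.

```python
def naive_cost(x: int, y: int) -> int:
    """Comparisons for one-by-one traversal of two data sets of sizes x and y."""
    return x * y


def pair_cost(x: int, y: int, n: int) -> int:
    """Comparisons inside one shard pair under an even split into n shards."""
    return (x // n) * (y // n)


def total_sharded_cost(x: int, y: int, n: int) -> int:
    """Comparisons summed over all n shard pairs."""
    return n * pair_cost(x, y, n)


x, y, n = 1_000_000, 2_000_000, 100
speedup_when_parallel = naive_cost(x, y) // pair_cost(x, y, n)   # equals n * n
speedup_in_total_work = naive_cost(x, y) // total_sharded_cost(x, y, n)  # equals n
```

With these numbers, one pair costs 10,000 * 20,000 comparisons against 1,000,000 * 2,000,000 for the traversal, a factor of n*n when the pairs run concurrently; the total work summed over all pairs falls by a factor of n.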
Fig. 5 is a schematic structural diagram of a merging device for super large data sets provided by an embodiment of the present invention. As shown in Fig. 5, the device includes:
A first sharding module 51, configured to convert the first association primary key of the first data set into data of a preset field type, then shard the first data set according to the first association primary key to obtain a preset number of first data shards, and cache them in the preset cache system.
The preset field type is the byte type, and the data of the preset field type is a long value of the byte type.
The first sharding module 51 is specifically configured to:
pre-read the first data in the first data set and extract the first field that needs to be merged from the first data;
judge whether the first field is valid;
if the result is yes, convert the first association primary key of the first data set to the byte type and obtain the long value of the first association primary key;
calculate the first hash value of the long value of the first association primary key of the first data set, shard the first data set according to the first hash value to obtain a preset number of first data shards, and cache them in the preset cache system.
A second sharding module 52, configured to convert the second association primary key of the second data set into data of the preset field type, then shard the second data set according to the second association primary key to obtain a preset number of second data shards.
The preset field type is the byte type, and the data of the preset field type is a long value of the byte type.
The second sharding module 52 is specifically configured to:
pre-read the second data in the second data set and extract the second field that needs to be merged from the second data;
judge whether the second field is valid;
if the result is yes, convert the second association primary key of the second data set to the byte type and obtain the long value of the second association primary key;
calculate the second hash value of the long value of the second association primary key, and shard the second data set according to the second hash value.
A matching module 53, configured to read the first data shards from the preset cache system and match the first data shards with the second data shards. The shard number of each first data shard is determined according to the first hash value, and the shard number of each second data shard is determined according to the second hash value;
the matching module 53 is then specifically configured to:
read the first data shards from the preset cache system, and match the first data shards with the second data shards according to the shard numbers of the first data shards and the shard numbers of the second data shards.
A merging module 54, configured to merge the matched first data shards and second data shards.
The merging module 54 is specifically configured to:
pre-reorganize the second data shard data of the second data shard;
read the first data shard data in the first data shard one item at a time from the preset cache system, and check whether the first association primary key of the first data shard data exists in the second data shard matched with the first data shard;
if so, check, according to the first association primary key of the first data shard data and the preset association condition, whether the second data shard contains second data shard data that matches the first data shard data;
if so, merge the first data shard data with the second data shard data.
All related content of the steps in the above method embodiment may be cited in the functional descriptions of the corresponding functional modules; their effects are not repeated here.
When integrated modules are used, the merging device for super large data sets includes a storage unit, a processing unit, and an interface unit. The processing unit controls and manages the actions of the merging device; for example, the processing unit supports the merging device in executing the steps in Figs. 1 to 4. The interface unit supports interaction between the merging device and other devices. The storage unit stores the program code and data of the merging device.
Take as an example the case in which the processing unit is a processor, the storage unit is a memory, and the interface unit is a communication interface. As shown in Fig. 6, the merging device for super large data sets then includes a communication interface 601, a processor 602, a memory 603, and a bus 604; the communication interface 601 and the processor 602 are connected to the memory 603 through the bus 604.
The processor 602 may be a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), or one or more integrated circuits for controlling the execution of the programs of the solution of the present application.
The memory 603 may be a read-only memory (Read-Only Memory, ROM) or another type of static storage device that can store static information and instructions; a random access memory (Random Access Memory, RAM) or another type of dynamic storage device that can store information and instructions; an electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM); a compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like); a magnetic disk storage medium or other magnetic storage device; or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but it is not limited to these. The memory may exist independently and be connected to the processor through the bus, or it may be integrated with the processor.
The memory 603 is used to store the application program code for executing the solution of the present application, and its execution is controlled by the processor 602. The communication interface 601 supports interaction between the merging device for super large data sets and other devices. The processor 602 executes the application program code stored in the memory 603 to implement the merging method for super large data sets in the embodiments of the present application.
The present invention also provides a merging system for super large data sets, including any of the above merging devices for super large data sets and a preset cache system. For the preset cache system, refer to the corresponding description in step S101; details are not repeated here.
The present invention also provides a computer storage medium (or media) including instructions which, when executed, carry out the operations of the method in the above embodiments; when the instructions run on a computer, they cause the computer to execute the above method embodiments.
In addition, the present invention also provides a computer program product including the above computer storage medium (or media).
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above specific embodiments, which are merely illustrative rather than restrictive. Under the teaching of the present application, those of ordinary skill in the art can devise many other forms without departing from the purpose of the present application and the scope protected by the claims, and all of these fall within the protection of the present application.
Claims (14)
1. A merging method for super large data sets, characterized in that the method is implemented based on distributed computing and comprises:
converting the first association primary key of a first data set into data of a preset field type, then sharding the first data set according to the first association primary key to obtain a preset number of first data shards, and caching them in a preset cache system;
converting the second association primary key of a second data set into data of the preset field type, then sharding the second data set according to the second association primary key to obtain a preset number of second data shards;
reading the first data shards from the preset cache system, matching the first data shards with the second data shards, and merging the matched first data shards and second data shards.
2. The merging method for super large data sets according to claim 1, characterized in that the preset field type is the byte type, and the data of the preset field type is a long value.
3. The merging method for super large data sets according to claim 2, characterized in that converting the first association primary key of the first data set into data of the preset field type and then sharding the first data set according to the first association primary key specifically comprises:
pre-reading the first data in the first data set and extracting the first field that needs to be merged from the first data;
judging whether the first field is valid;
if the result is yes, converting the first association primary key of the first data set to the byte type and obtaining the long value of the first association primary key;
calculating the first hash value of the long value of the first association primary key of the first data set, sharding the first data set according to the first hash value to obtain a preset number of first data shards, and caching them in the preset cache system.
4. The merging method for super large data sets according to claim 3, characterized in that converting the second association primary key of the second data set into data of the preset field type and then sharding the second data set according to the second association primary key specifically comprises:
pre-reading the second data in the second data set and extracting the second field that needs to be merged from the second data;
judging whether the second field is valid;
if the result is yes, converting the second association primary key of the second data set to the byte type and obtaining the long value of the second association primary key;
calculating the second hash value of the long value of the second association primary key, and sharding the second data set according to the second hash value.
5. The merging method for super large data sets according to claim 4, characterized in that a corresponding shard number is assigned to each first data shard and a corresponding shard number is assigned to each second data shard;
reading the first data shards from the preset cache system and matching the first data shards with the second data shards then specifically comprises:
reading the first data shards from the preset cache system, and matching the first data shards with the second data shards according to the shard numbers of the first data shards and the shard numbers of the second data shards.
6. The merging method for super large data sets according to claim 1, characterized in that merging the matched first data shards and second data shards specifically comprises:
pre-reorganizing the second data shard data of the second data shard;
reading the first data shard data in the first data shard one item at a time from the preset cache system, and checking whether the first association primary key of the first data shard data exists in the second data shard matched with the first data shard;
if so, checking, according to the first association primary key of the first data shard data and a preset association condition, whether the second data shard contains second data shard data that matches the first data shard data;
if so, merging the first data shard data with the second data shard data.
7. The merging method for super large data sets according to claim 1, characterized in that the preset number is determined according to the data sizes of the first data set and the second data set.
8. A merging device for super large data sets, characterized in that the device is implemented based on distributed computing and comprises:
a first sharding module, configured to convert the first association primary key of a first data set into data of a preset field type, then shard the first data set according to the first association primary key to obtain a preset number of first data shards, and cache them in a preset cache system;
a second sharding module, configured to convert the second association primary key of a second data set into data of the preset field type, then shard the second data set according to the second association primary key to obtain a preset number of second data shards;
a matching module, configured to read the first data shards from the preset cache system and match the first data shards with the second data shards;
a merging module, configured to merge the matched first data shards and second data shards.
9. The merging device for super large data sets according to claim 8, characterized in that the preset field type is the byte type, and the data of the preset field type is a long value of the byte type.
10. The merging device for super large data sets according to claim 9, characterized in that the first sharding module is specifically configured to:
pre-read the first data in the first data set and extract the first field that needs to be merged from the first data;
judge whether the first field is valid;
if the result is yes, convert the first association primary key of the first data set to the byte type and obtain the long value of the first association primary key;
calculate the first hash value of the long value of the first association primary key of the first data set, shard the first data set according to the first hash value to obtain a preset number of first data shards, and cache them in the preset cache system.
11. The merging device for super large data sets according to claim 10, characterized in that the second sharding module is specifically configured to:
pre-read the second data in the second data set and extract the second field that needs to be merged from the second data;
judge whether the second field is valid;
if the result is yes, convert the second association primary key of the second data set to the byte type and obtain the long value of the second association primary key;
calculate the second hash value of the long value of the second association primary key, and shard the second data set according to the second hash value.
12. The merging device for super large data sets according to claim 11, characterized in that a corresponding shard number is assigned to each first data shard and a corresponding shard number is assigned to each second data shard;
the matching module is then specifically configured to:
read the first data shards from the preset cache system, and match the first data shards with the second data shards according to the shard numbers of the first data shards and the shard numbers of the second data shards.
13. The merging device for super large data sets according to claim 8, characterized in that the merging module is specifically configured to:
pre-reorganize the second data shard data of the second data shard;
read the first data shard data in the first data shard one item at a time from the preset cache system, and check whether the first association primary key of the first data shard data exists in the second data shard matched with the first data shard;
if so, check, according to the first association primary key of the first data shard data and a preset association condition, whether the second data shard contains second data shard data that matches the first data shard data;
if so, merge the first data shard data with the second data shard data.
14. The merging device for super large data sets according to claim 8, characterized in that the preset number is determined according to the data sizes of the first data set and the second data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810772324.5A CN109033295B (en) | 2018-07-13 | 2018-07-13 | Method and device for merging super-large data sets |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109033295A true CN109033295A (en) | 2018-12-18 |
CN109033295B CN109033295B (en) | 2021-07-02 |
Family
ID=64642826
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810772324.5A Active CN109033295B (en) | 2018-07-13 | 2018-07-13 | Method and device for merging super-large data sets |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033295B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100146003A1 (en) * | 2008-12-10 | 2010-06-10 | Unisys Corporation | Method and system for building a B-tree |
CN107657050A (en) * | 2017-10-13 | 2018-02-02 | 北京润乾信息系统技术有限公司 | One kind is based on " with the one-to-one join of conflation algorithm calculating, one-to-many join " contraposition segmentation parallel method |
CN107704587A (en) * | 2017-10-10 | 2018-02-16 | 北京润乾信息系统技术有限公司 | A kind of method that one-to-one join, one-to-many join are calculated with conflation algorithm |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110505276A (en) * | 2019-07-17 | 2019-11-26 | 北京三快在线科技有限公司 | Object matching method, apparatus and system, electronic equipment and storage medium |
CN111198847A (en) * | 2019-12-30 | 2020-05-26 | 广东奡风科技股份有限公司 | Data parallel processing method, device and system suitable for large data set |
CN111611243A (en) * | 2020-05-13 | 2020-09-01 | 第四范式(北京)技术有限公司 | Data processing method and device |
CN111611243B (en) * | 2020-05-13 | 2023-06-13 | 第四范式(北京)技术有限公司 | Data processing method and device |
CN112732650A (en) * | 2020-12-31 | 2021-04-30 | 中国工商银行股份有限公司 | File fragmentation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN109033295B (en) | 2021-07-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||