CN109033295A - Method and device for merging super-large data sets - Google Patents

Method and device for merging super-large data sets

Info

Publication number
CN109033295A
Authority
CN
China
Prior art keywords
data
fragment
data set
primary key
data fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810772324.5A
Other languages
Chinese (zh)
Other versions
CN109033295B (en)
Inventor
史贵振
高福海
张莹莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Yaxin Network Security Industry Technology Research Institute Co Ltd
Original Assignee
Chengdu Yaxin Network Security Industry Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Yaxin Network Security Industry Technology Research Institute Co Ltd
Priority to CN201810772324.5A
Publication of CN109033295A
Application granted
Publication of CN109033295B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method and device for merging super-large data sets. It relates to the technical field of data processing and addresses the low merging efficiency of super-large data sets in the prior art. The method and device are implemented on distributed computing and include: converting the first associated primary key of a first data set into data of a preset field type, partitioning the first data set according to the first associated primary key to obtain a preset quantity of first data fragments, and caching them in a preset cache system; converting the second associated primary key of a second data set into data of the preset field type, and partitioning the second data set according to the second associated primary key to obtain the preset quantity of second data fragments; reading the first data fragments from the preset cache system, matching the first data fragments with the second data fragments, and merging the matched first and second data fragments. The invention can be used to merge super-large data sets.

Description

Method and device for merging super-large data sets
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and device for merging super-large data sets.
Background
A super-large data set is a collection of data sets, each of which has a very large data magnitude. Because such data sets are very large and the volume of newly added data is also very large, system resources such as memory and disk make it difficult to store them in a relational database. Therefore, to perform association analysis on two associated super-large data sets, the two data sets are usually merged first. This converts the association analysis across two data sets into retrieval, classification, and statistics over a single data set, which reduces the difficulty of mining information from the associated super-large data sets.
In practice, however, when two associated super-large data sets are merged, the data set with the smaller data magnitude is usually stored in a database first, and the data set with the larger data magnitude is split into multiple data fragments, which are then merged in parallel with the data set stored in the database. This process has the following main problems. First, partitioning a data set in the prior art can produce data skew in the fragment result: a large amount of data is concentrated on one or a few fragments for the merge computation, while the small remainder is spread over the many other fragments. Because the overall efficiency of the merge computation is determined by the computational efficiency of all fragments, and the merge efficiency of those few overloaded fragments is far below the average across all fragments, the overall merging efficiency of the data sets drops sharply. Second, during merging the data of each data fragment must be traversed one by one: for every record read, an access request must be sent to the database to look up whether it holds corresponding data that can be merged with that record. This leads to an excessive number of database requests and increased network pressure. Moreover, the number of requests most databases can serve is limited by database performance, which severely restricts any improvement in the overall merging efficiency.
Summary of the invention
Embodiments of the present invention provide a method and device for merging super-large data sets, to solve the problem of low merging efficiency of super-large data sets in the prior art.
To achieve the above object, the present invention adopts the following technical solutions:
In a first aspect, the present invention provides a method for merging super-large data sets. The method is implemented on distributed computing and includes:
converting the first associated primary key of a first data set into data of a preset field type, then partitioning the first data set according to the first associated primary key to obtain a preset quantity of first data fragments, and caching the first data fragments in a preset cache system;
converting the second associated primary key of a second data set into data of the preset field type, then partitioning the second data set according to the second associated primary key to obtain the preset quantity of second data fragments;
reading the first data fragments from the preset cache system, matching the first data fragments with the second data fragments, and merging the matched first data fragments and second data fragments.
In the merging method provided by the present invention, the associated primary keys of the two data sets to be merged can be converted into data of a preset field type. When the two data sets are partitioned on distributed computing, the data in each data set can then be distributed as evenly as possible across its fragments according to the converted associated primary key, which effectively avoids data skew in the fragment result and improves the overall merging efficiency. In addition, the invention caches the data set to be merged in a preset cache system instead of storing it in a database, so the merging process is no longer limited by database access pressure or by the number of database accesses, which effectively improves the overall merging efficiency.
Optionally, the preset field type is the byte type, and the data of the preset field type is a long value obtained from the byte-type data.
Optionally, converting the first associated primary key of the first data set into data of the preset field type and then partitioning the first data set according to the first associated primary key specifically includes:
reading the first data in the first data set in advance, and extracting from the first data the first field that needs to be merged;
judging whether the first field is valid;
if the judgment result is yes, converting the first associated primary key of the first data set to the byte type and obtaining the long value of the first associated primary key;
calculating a first hash value of the long value of the first associated primary key, partitioning the first data set according to the first hash value to obtain the preset quantity of first data fragments, and caching them in the preset cache system.
Optionally, converting the second associated primary key of the second data set into data of the preset field type and then partitioning the second data set according to the second associated primary key specifically includes:
reading the second data in the second data set in advance, and extracting from the second data the second field that needs to be merged;
judging whether the second field is valid;
if the judgment result is yes, converting the second associated primary key of the second data set to the byte type and obtaining the long value of the second associated primary key;
calculating a second hash value of the long value of the second associated primary key, and partitioning the second data set according to the second hash value.
Optionally, a corresponding fragment number is assigned to each first data fragment, and a corresponding fragment number is assigned to each second data fragment;
in that case, reading the first data fragments from the preset cache system and matching the first data fragments with the second data fragments specifically includes:
reading the first data fragments from the preset cache system, and matching the first data fragments with the second data fragments according to the fragment numbers of the first data fragments and the fragment numbers of the second data fragments.
Optionally, merging the matched first data fragment and second data fragment specifically includes:
reorganizing in advance the second data fragment data of the second data fragment;
reading the first data fragment data in the first data fragment one by one from the preset cache system, and searching the second data fragment matched with the first data fragment for the first associated primary key of the first data fragment data;
if it is found, searching the second data fragment, according to the first associated primary key of the first data fragment data and a preset correlation condition, for second data fragment data that matches the first data fragment data;
if such data is found, merging the first data fragment data and the second data fragment data.
Optionally, the preset quantity is determined according to the data sizes of the first data set and the second data set.
In a second aspect, the present invention provides a device for merging super-large data sets. The device is implemented on distributed computing and includes:
a first fragment module, configured to convert the first associated primary key of a first data set into data of a preset field type, then partition the first data set according to the first associated primary key to obtain a preset quantity of first data fragments, and cache them in a preset cache system;
a second fragment module, configured to convert the second associated primary key of a second data set into data of the preset field type, then partition the second data set according to the second associated primary key to obtain the preset quantity of second data fragments;
a matching module, configured to read the first data fragments from the preset cache system and match the first data fragments with the second data fragments;
a merging module, configured to merge the matched first data fragments and second data fragments.
Optionally, the preset field type is the byte type, and the data of the preset field type is a long value obtained from the byte-type data.
Optionally, the first fragment module is specifically configured to:
read the first data in the first data set in advance, and extract from the first data the first field that needs to be merged;
judge whether the first field is valid;
if the judgment result is yes, convert the first associated primary key of the first data set to the byte type and obtain the long value of the first associated primary key;
calculate a first hash value of the long value of the first associated primary key, partition the first data set according to the first hash value to obtain the preset quantity of first data fragments, and cache them in the preset cache system.
Optionally, the second fragment module is specifically configured to:
read the second data in the second data set in advance, and extract from the second data the second field that needs to be merged;
judge whether the second field is valid;
if the judgment result is yes, convert the second associated primary key of the second data set to the byte type and obtain the long value of the second associated primary key;
calculate a second hash value of the long value of the second associated primary key, and partition the second data set according to the second hash value.
Optionally, a corresponding fragment number is assigned to each first data fragment, and a corresponding fragment number is assigned to each second data fragment;
in that case, the matching module is specifically configured to:
read the first data fragments from the preset cache system, and match the first data fragments with the second data fragments according to the fragment numbers of the first data fragments and the fragment numbers of the second data fragments.
Optionally, the merging module is specifically configured to:
reorganize in advance the second data fragment data of the second data fragment;
read the first data fragment data in the first data fragment one by one from the preset cache system, and search the second data fragment matched with the first data fragment for the first associated primary key of the first data fragment data;
if it is found, search the second data fragment, according to the first associated primary key of the first data fragment data and a preset correlation condition, for second data fragment data that matches the first data fragment data;
if such data is found, merge the first data fragment data and the second data fragment data.
Optionally, the preset quantity is determined according to the data sizes of the first data set and the second data set.
In a third aspect, a device for merging super-large data sets is provided, including a communication interface, a processor, a memory, and a bus. The memory stores computer-executable instructions, and the processor is connected to the memory through the bus. When the intelligent probe device runs, the processor executes the computer-executable instructions stored in the memory, so that the merging device performs the method of the first aspect.
In a fourth aspect, a system for merging super-large data sets is provided, including the above merging device and a preset cache system.
In a fifth aspect, a storage medium is provided. The storage medium stores instruction code, and the instruction code is used to perform the method of the first aspect.
In a sixth aspect, a computer program product is provided. The computer program product includes instruction code, and the instruction code is used to perform the method of the first aspect.
It should be understood that any of the merging devices, merging systems, storage media, or computer program products provided above is used to perform the corresponding method of the first aspect. For the beneficial effects they can achieve, refer to the beneficial effects of the method of the first aspect and of the corresponding solutions in the detailed description below; they are not repeated here.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the steps of a method for merging super-large data sets provided by an embodiment of the present invention;
Fig. 2 is a flow chart of the steps of another method for merging super-large data sets provided by an embodiment of the present invention;
Fig. 3 is a flow chart of the steps of another method for merging super-large data sets provided by an embodiment of the present invention;
Fig. 4 is a flow chart of the steps of another method for merging super-large data sets provided by an embodiment of the present invention;
Fig. 5 is a structural schematic diagram of a device for merging super-large data sets provided by an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of another device for merging super-large data sets provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the protection scope of the present application. The terms "first", "second", and the like do not indicate any order; they may be read as names of the objects described. In the embodiments of the present application, words such as "illustrative" or "for example" indicate an example, illustration, or explanation. Any embodiment or design described as "illustrative" or "for example" should not be construed as preferred or more advantageous than other embodiments or designs; rather, these words are intended to present a related concept in a concrete way. In addition, in the description of the embodiments of the present application, unless otherwise stated, "a plurality of" means two or more.
Fig. 1 is a flow chart of the steps of a method for merging super-large data sets provided by an embodiment of the present invention. The method in this embodiment is implemented on distributed computing. As shown in Fig. 1, the method includes the following steps:
Step S101: convert the first associated primary key of the first data set into data of a preset field type, then partition the first data set according to the first associated primary key to obtain a preset quantity of first data fragments, and cache them in a preset cache system.
Here, the first data set is a data set that needs to be merged, and the first associated primary key is the primary key of each data table in the first data set. The data of the preset field type may contain all the field information of the first associated primary key. The preset field type may be the byte (Byte) type, and the data of the preset field type may be, for example, the long value of the first associated primary key obtained by applying the corresponding operations to the first associated primary key according to the preset field type and a preset algorithm. Note that the prior art generally uses the IP address as the associated primary key and partitions the data set by applying a hash algorithm to the IP address. To minimize data skew in the fragment result, the data is generally divided into at most 256 parts, and the last field of the IP address is hashed and reduced by remainder to partition the data set. However, the last field of an IP address often takes a reserved value such as xxx.xxx.xxx.0; for instance, 5 out of 10 IP addresses may have a last field of this reserved form. The IP address suffix (i.e. the last field) is therefore not discrete enough, so the data in a fragment result produced from the last field alone is not discrete enough either, and data skew cannot be avoided. When the fragment quantity is less than 256, for example 255, partitioning in this way further degrades the discreteness of the data in the fragment result and makes the skew even more serious. Moreover, the hash algorithm as used in this way only applies when the associated primary key is an IP address; if the associated primary key is set to some other type of data, that data cannot be partitioned with it. To solve these problems, in the present invention the first associated primary key of the first data set can be converted into data of the preset field type. This data can contain all the field information of the first associated primary key and can be hashed, so converting the first associated primary key guarantees that every part of the key participates in the computation during partitioning. This maximizes the dispersion of the data in the fragment result of the first data set, effectively avoids data skew, and also broadens the applicability of the hash algorithm.
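The skew argument above can be illustrated with a minimal sketch. It compares partitioning by the last field of an IP address with partitioning by a hash over the whole key. The sample addresses, the function names, and the use of MD5 as the full-key hash are all assumptions made for illustration; the patent does not prescribe them.

```python
import hashlib
from collections import Counter

def fragment_by_last_field(ip: str, fragment_count: int) -> int:
    # Prior-art style: only the last field of the IP address decides the fragment.
    return int(ip.rsplit(".", 1)[1]) % fragment_count

def fragment_by_full_key(ip: str, fragment_count: int) -> int:
    # Every byte of the key participates. MD5 is only a stand-in here,
    # not the conversion described in the patent.
    digest = hashlib.md5(ip.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % fragment_count

def largest_fragment(keys, fragment_fn, fragment_count: int) -> int:
    # Size of the most heavily loaded fragment, a simple measure of skew.
    counts = Counter(fragment_fn(key, fragment_count) for key in keys)
    return max(counts.values())

# Hypothetical sample: half of the addresses end in the reserved ".0" field.
reserved = [f"10.{a}.{b}.0" for a in range(10) for b in range(50)]
regular = [f"10.{a}.{b}.{(a * 50 + b) % 254 + 1}" for a in range(10) for b in range(50)]
sample_ips = reserved + regular

skewed = largest_fragment(sample_ips, fragment_by_last_field, 16)
balanced = largest_fragment(sample_ips, fragment_by_full_key, 16)
```

With this sample, every address ending in `.0` lands in the same fragment under the last-field scheme, so `skewed` is at least 500 out of 1000 records, while the full-key scheme spreads the records far more evenly.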
In a specific implementation, the way the long value of the first associated primary key is obtained can be set by a person skilled in the art according to the actual situation, and the present invention does not limit it. In one preferred approach, the long value is obtained as follows: the field type of the first associated primary key is converted to the byte type; the byte-type first associated primary key is then repeatedly combined with an algorithm seed of a specific length using AND, OR, or similar operations; and the absolute value of the result is taken as the long value of the first associated primary key. The algorithm seed of a specific length, and the process of repeatedly combining it with the first associated primary key, can likewise be set by a person skilled in the art according to the actual situation, and the present invention does not limit them. Of course, in a specific implementation the preset field type may also be another type, and the data of the preset field type may be data other than the above long value, as long as data skew in the fragment result is effectively avoided.
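As a rough illustration of the conversion just described, the sketch below turns a key into bytes, mixes each byte with a seed, and takes the absolute value of the signed 64-bit result. The patent leaves the seed and the exact bitwise operations open, so the XOR-and-multiply mixing, the seed constant, and the name `key_to_long` are assumptions rather than the patented procedure.

```python
MASK64 = (1 << 64) - 1

def key_to_long(key: str, seed: int = 0x9E3779B97F4A7C15) -> int:
    """Convert a key to bytes, mix it with a seed, and return a non-negative long."""
    value = seed & MASK64
    for byte in key.encode("utf-8"):
        # Assumed mixing step: XOR the byte in, then multiply by an odd constant.
        # This stands in for the unspecified "repeated AND, OR, or similar operations".
        value = ((value ^ byte) * 0x100000001B3) & MASK64
    # Reinterpret the 64-bit pattern as a signed long, then take the absolute value.
    if value >= 1 << 63:
        value -= 1 << 64
    return abs(value)

ip_long = key_to_long("192.168.1.0")
name_long = key_to_long("user-42")  # keys other than IP addresses work the same way
```

Because the whole byte sequence of the key feeds the result, the long value does not depend on the key being an IP address. In a language with fixed-width longs, the absolute value of the minimum long would overflow and need separate handling; Python integers do not have that problem.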
In a specific implementation, the preset quantity can be determined according to the data sizes of the first data set and the second data set (i.e. the second data set in the subsequent step S102). When the data volumes of the first data set and the second data set are larger, the preset quantity is increased accordingly; otherwise it is reduced accordingly.
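One simple way to realize this rule is sketched below. The patent only states that the preset quantity grows with the data sizes; sizing by the larger of the two data sets and the 128 MiB target per fragment are assumptions chosen for illustration.

```python
def choose_fragment_count(first_bytes: int, second_bytes: int,
                          target_fragment_bytes: int = 128 * 1024 * 1024) -> int:
    """Pick a fragment count that grows with the size of the larger data set."""
    larger = max(first_bytes, second_bytes)
    # Integer ceiling division, with a floor of one fragment.
    return max(1, -(-larger // target_fragment_bytes))

small = choose_fragment_count(10 * 1024 * 1024, 1024)
large = choose_fragment_count(10 * 128 * 1024 * 1024, 3 * 128 * 1024 * 1024)
```

Both data sets share the same count so that fragments with equal numbers can later be paired one to one.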
The preset cache system can be HDFS (Hadoop Distributed File System) or another similar distributed file storage system. In a specific implementation, this step can be executed with the big-data computing framework Spark, or with similar big-data processing technologies such as Hadoop's MapReduce (MR), the Hive execution engine, or the real-time computing engine Storm.
Step S102: convert the second associated primary key of the second data set into data of the preset field type, then partition the second data set according to the second associated primary key to obtain the preset quantity of second data fragments.
Specifically, the data of the preset field type in this step and the process of partitioning the second data set according to the second associated primary key are similar to step S101; refer to the corresponding description in step S101, which is not repeated here.
The technology used to obtain the preset quantity of second data fragments is the same as that used in step S101; refer to the corresponding description in step S101, which is not repeated here.
Step S103: read the first data fragments from the preset cache system, match the first data fragments with the second data fragments, and merge the matched first data fragments and second data fragments.
Specifically, the first data fragments and the second data fragments can be matched in several ways. For example, if steps S101 and S102 are executed with Spark, Spark automatically builds an index for the first data fragments and an index for the second data fragments, and the fragments can then be matched according to these two indexes. Suppose the first data fragments comprise three fragments, data fragment 1, data fragment 2, and data fragment 3, and the index Spark builds for them is: 1 - data fragment 1, 2 - data fragment 2, 3 - data fragment 3. Suppose the second data fragments comprise three fragments, data fragment a, data fragment b, and data fragment c, and the index Spark builds for them is: 1 - data fragment a, 2 - data fragment b, 3 - data fragment c. Fragments with the same index in the first and second data fragments are then matched: data fragment 1 with data fragment a (both have index 1), data fragment 2 with data fragment b (both have index 2), and data fragment 3 with data fragment c (both have index 3).
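The index-based pairing in this example can be sketched in a few lines. Plain dictionaries stand in for the indexes that Spark would maintain, and `match_fragments` is a hypothetical helper name.

```python
def match_fragments(first_index: dict, second_index: dict) -> list:
    """Pair first and second data fragments that share the same index."""
    shared = sorted(first_index.keys() & second_index.keys())
    return [(first_index[i], second_index[i]) for i in shared]

first_index = {1: "data fragment 1", 2: "data fragment 2", 3: "data fragment 3"}
second_index = {1: "data fragment a", 2: "data fragment b", 3: "data fragment c"}
pairs = match_fragments(first_index, second_index)
```

Here `pairs` holds data fragment 1 with data fragment a, data fragment 2 with data fragment b, and data fragment 3 with data fragment c, as in the example above.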
Of course, in a specific implementation the first data fragments and the second data fragments can also be matched in ways other than the one listed above, for example according to the IP address or according to the time at which the data fragments were generated. The present invention does not limit the way in which the first data fragments and the second data fragments are matched.
When the first data fragment and the second data fragment are merged, the first associated primary key of the first data fragment, the second associated primary key of the second data fragment, and a preset correlation condition can be used to determine whether the two fragments contain data that can be merged. If such data exists, it is merged. The preset correlation condition can be set by a person skilled in the art according to the actual situation, and the present invention does not limit it.
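A minimal sketch of merging one matched pair of fragments follows. It assumes records are dictionaries, that both fragments name the associated primary key with the same field, and that the preset correlation condition is a caller-supplied predicate; none of these details come from the patent.

```python
def merge_fragment_pair(first_fragment, second_fragment, key_field, condition):
    """Merge records of two matched fragments that share a key and satisfy a condition."""
    # Reorganize the second fragment into a lookup table keyed by the primary key,
    # so each first-fragment record needs one in-memory lookup, not a database request.
    lookup = {}
    for record in second_fragment:
        lookup.setdefault(record[key_field], []).append(record)

    merged = []
    for left in first_fragment:
        for right in lookup.get(left[key_field], []):
            if condition(left, right):
                # On conflicting field names the first-fragment value wins (an assumption).
                merged.append({**right, **left})
    return merged

first_fragment = [{"ip": "10.0.0.1", "host": "a"}, {"ip": "10.0.0.2", "host": "b"}]
second_fragment = [{"ip": "10.0.0.1", "bytes": 10}, {"ip": "10.0.0.3", "bytes": 5}]
merged = merge_fragment_pair(first_fragment, second_fragment, "ip",
                             lambda left, right: right["bytes"] > 0)
```

Only the record with key `10.0.0.1` appears in both fragments and satisfies the sample condition, so it is the only one merged.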
It can be seen that, in the merging method provided by the present invention, the associated primary keys of the two data sets to be merged can be converted into data of a preset field type. When the two data sets are partitioned on distributed computing, the data in each data set is then distributed as evenly as possible across its fragments according to the converted associated primary key, which effectively avoids data skew in the fragment result and improves the overall merging efficiency. In addition, the invention caches the data set to be merged in a preset cache system instead of storing it in a database, so the merging process no longer requires a large number of database accesses, which further improves the overall merging efficiency.
Step S101 can be implemented with the following scheme, in which Spark is used to partition the first data set and HDFS is used as the preset cache system. As shown in Fig. 2, the scheme includes the following steps:
Step S201: read the first data in the first data set in advance, and extract from the first data the first field that needs to be merged.
A super-large data set consists of multiple files. When the first data is read, system memory is limited, and reading and processing the whole super-large data set at once would exceed the memory capacity, so the data set can be read and processed in batches. For example, each batch can read the first data of several files into system memory; the processing of this step and the subsequent steps (steps S202-S205) is then applied in memory to the first data of those files; and after processing, the results are combined and output in a preset way (for example with the union() function), which gives better read and write performance.
Each piece of first data read from the first data set can be processed one by one with the map() function to extract the first field that needs to be merged. The first field that needs to be merged is the field required in the final merged data; it can be set by a person skilled in the art according to the actual situation, and the present invention does not limit it.
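The batch reading and field extraction described in step S201 can be sketched in plain Python as follows. The helper names, the `load_file` callback, and the dictionary record layout are assumptions for illustration; an actual implementation would rely on the framework's own map and union operations.

```python
from itertools import islice

def read_in_batches(file_names, batch_size):
    """Yield lists of at most batch_size file names, so memory holds one batch at a time."""
    iterator = iter(file_names)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def extract_fields(record: dict, fields) -> dict:
    """Keep only the fields that are needed in the final merged data."""
    return {name: record.get(name) for name in fields}

def process_files(file_names, batch_size, fields, load_file):
    """Process the data set batch by batch and combine the per-batch results."""
    combined = []
    for batch in read_in_batches(file_names, batch_size):
        batch_result = [extract_fields(record, fields)
                        for name in batch
                        for record in load_file(name)]
        combined.extend(batch_result)  # plays the role of the union of batch outputs
    return combined
```

For example, with `load_file` returning the records of an in-memory file table, `process_files(["f1", "f2", "f3"], 2, ["ip", "host"], load_file)` processes the files in two batches and returns records reduced to the `ip` and `host` fields.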
Step S202: judge whether the first field is valid. If not, execute step S203; if so, execute step S204.
The validity of the first field is checked against a preset check rule to judge whether the first field is valid. The preset check rule can be set by those skilled in the art according to the actual situation, and the present invention does not limit it.
Step S203: discard the first data corresponding to the first field.
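Steps S202 and S203 amount to filtering records with a check rule. Since the check rule is left open, the sketch below assumes a hypothetical rule (every required field must be a non-empty string); the names is_valid and drop_invalid are illustrative only.

```python
def is_valid(fields, required=("id",)):
    """Hypothetical check rule: each required field is a non-blank string."""
    return all(
        isinstance(fields.get(name), str) and fields[name].strip() != ""
        for name in required
    )

def drop_invalid(records, required=("id",)):
    """Keep valid records (continue to S204) and discard the rest (S203)."""
    return [r for r in records if is_valid(r, required)]

sample = [{"id": "a"}, {"id": " "}, {"name": "x"}, {"id": None}]
kept = drop_invalid(sample)
```

A different check rule only requires replacing the body of is_valid; the filtering step stays the same.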
Step S204: convert the first association primary key of the first data set to the byte type and obtain the long value of the first association primary key.
For the specific manner of obtaining the long value of the first association primary key, refer to the corresponding description in step S101, which is not repeated here.
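The exact conversion is defined in the description of step S101, which is not reproduced here. Purely as an assumed illustration, one way to turn a key into bytes and then into a 64-bit long value is to encode the key as UTF-8 and fold the bytes into a signed 64-bit integer; the folding rule and the name key_to_long are hypothetical.

```python
MASK64 = (1 << 64) - 1

def key_to_long(key: str) -> int:
    """Assumed conversion: UTF-8 bytes folded into a signed 64-bit value."""
    value = 0
    for b in key.encode("utf-8"):          # key -> byte type
        value = (value * 31 + b) & MASK64  # fold bytes, wrap at 64 bits
    # Reinterpret the unsigned result as a signed 64-bit long.
    return value - (1 << 64) if value >= (1 << 63) else value

short_key = key_to_long("AB")
long_key = key_to_long("x" * 50)
```

Whatever conversion is actually used, it must be deterministic and identical for both data sets, so that equal keys yield equal long values.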
Step S205: calculate the first hash value of the long value of the first association primary key of the first data set, fragment the first data set according to the first hash value, obtain a preset quantity of first data fragments, and cache them in the preset cache system.
Specifically, the first hash value is taken modulo the number of fragments (i.e. the preset quantity), and the remainder is used as the fragment position of the data corresponding to the first association primary key. That data is stored at the fragment position, thereby fragmenting the first data set, and the fragmented data is then cached in the preset cache system.
In addition, after the first data set is fragmented, a corresponding fragment number can be assigned to each first data fragment, and the fragment number of each first data fragment can be indexed, through the index() function, to the first hash value of that first data fragment. This establishes a correspondence between fragment numbers and first data fragments, which is used to match the first data fragments with the second data fragments.
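The hash-and-remainder rule of step S205 can be sketched as follows. The hash function (folding the high 32 bits of the long value into the low 32 bits) is an assumption, as are the names hash_long, fragment_number and fragment_dataset; caching the fragments to HDFS is omitted.

```python
from collections import defaultdict

def hash_long(value: int) -> int:
    """Assumed hash: fold the high 32 bits into the low 32 bits."""
    return (value ^ (value >> 32)) & 0xFFFFFFFF

def fragment_number(key_long: int, n: int) -> int:
    """Remainder of the hash value by the number of fragments."""
    return hash_long(key_long) % n

def fragment_dataset(records, n):
    """Group (key_long, payload) records by fragment number."""
    fragments = defaultdict(list)
    for key_long, payload in records:
        fragments[fragment_number(key_long, n)].append((key_long, payload))
    return dict(fragments)

first = [(10, "a"), (11, "b"), (14, "c"), (10, "d")]
fragments = fragment_dataset(first, n=4)
```

Records with the same key always receive the same fragment number, and the dictionary key plays the role of the fragment number assigned to each first data fragment.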
On the basis of the scheme shown in Fig. 2, step S102 may be implemented using the following scheme. As shown in Fig. 3, it includes the following steps:
Step S301: read the second data in the second data set in advance, and extract from the second data the second field that needs to be merged.
Step S302: judge whether the second field is valid. If not, execute step S303; if so, execute step S304.
Step S303: discard the second data corresponding to the second field.
Step S304: convert the second association primary key of the second data set to the byte type and obtain the long value of the second association primary key.
For the process of steps S301-S304, refer to the corresponding description of steps S201-S204, which is not repeated here.
Step S305: calculate the second hash value of the long value of the second association primary key, and fragment the second data set according to the second hash value.
The second hash value is taken modulo the number of fragments (i.e. the preset quantity), and the remainder is used as the fragment position of the data corresponding to the second association primary key. That data is stored at the fragment position, thereby fragmenting the second data set.
In addition, after the second data set is fragmented, a corresponding fragment number can be assigned to each second data fragment, and the fragment number of each second data fragment can be indexed, through the index() function, to the second hash value of that second data fragment. This establishes a correspondence between fragment numbers and second data fragments, which is used to match the first data fragments with the second data fragments.
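Because both data sets are fragmented with the same hash function and the same number of fragments, a key that occurs in both data sets always lands in fragments with the same number. The sketch below checks this property on hypothetical long key values; the hash function and all names are assumptions made for the example.

```python
def fragment_number(key_long, n):
    """Assumed hash (fold the high 32 bits into the low 32 bits), then modulo."""
    return ((key_long ^ (key_long >> 32)) & 0xFFFFFFFF) % n

def fragment_keys(keys, n):
    """Return {fragment number: set of keys} for one data set."""
    fragments = {}
    for key in keys:
        fragments.setdefault(fragment_number(key, n), set()).add(key)
    return fragments

n = 4
first = fragment_keys([3, 42, 2**40 + 5, -17], n)
second = fragment_keys([42, -17, 99], n)

# Pairs of differently numbered fragments that share a key: always empty.
cross_hits = [
    (i, j) for i in first for j in second
    if i != j and first[i] & second[j]
]
```

This is why only fragments with equal fragment numbers need to be compared during merging.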
Based on the implementation processes shown in Fig. 2 and Fig. 3, step S103 may be implemented using the following scheme. As shown in Fig. 4, the scheme includes the following steps:
Step S401: read the first data fragments from the preset cache system, and match the first data fragments with the second data fragments.
Specifically, the first data fragments and the second data fragments are matched according to their fragment numbers: a first data fragment and a second data fragment with the same fragment number are treated as a matched pair of data fragments.
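Matching by fragment number can be sketched as a lookup on the numbers common to both sides. The dictionaries below, keyed by fragment number, and the name match_fragments are assumptions made for illustration.

```python
def match_fragments(first_fragments, second_fragments):
    """Pair the fragments that carry the same fragment number."""
    common = sorted(first_fragments.keys() & second_fragments.keys())
    return [(num, first_fragments[num], second_fragments[num]) for num in common]

first_fragments = {0: ["f0"], 1: ["f1"], 3: ["f3"]}
second_fragments = {1: ["s1"], 2: ["s2"], 3: ["s3"]}
pairs = match_fragments(first_fragments, second_fragments)
```

In this sketch a fragment whose number has no counterpart on the other side is simply not paired, because none of its keys can occur in the other data set.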
Step S402: reorganize the second data fragment data of the second data fragment in advance.
The second data fragment data is converted into a Map through an iterator, which reduces the number of traversals and improves merging efficiency. Specifically, the second data fragment data that share the same second association primary key can be collected into a list, and a Map is defined with the second association primary key as the key and that list as the value. For example, after map.put("A", "B"), where A is the second association primary key and B is the second data fragment data sharing that key, B can be obtained with map.get("A").
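The reorganization of step S402 can be sketched with a dictionary that maps each second association primary key to the list of rows sharing it, built in a single pass over an iterator. The row layout and the name regroup are assumptions made for the example.

```python
from collections import defaultdict

def regroup(second_fragment):
    """Build {second association primary key: [rows with that key]} in one pass."""
    index = defaultdict(list)
    for key, row in second_fragment:  # consumes the iterator exactly once
        index[key].append(row)
    return dict(index)

second_fragment = [("A", {"v": 1}), ("B", {"v": 2}), ("A", {"v": 3})]
index = regroup(iter(second_fragment))
rows_for_a = index.get("A")
```

After this step, finding all second data fragment data for a given key is a single dictionary lookup instead of a scan of the whole fragment.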
Step S403: read the first data fragment data in the first data fragment one by one from the preset cache system, and check whether the first association primary key of the first data fragment data exists in the second data fragment matched with the first data fragment. If not, execute step S404; if so, execute step S405.
Specifically, in this scheme each piece of first data fragment data in the first data fragment is preferably read from the preset cache system one by one, so that the merge processing of the first data fragment data in the subsequent steps (steps S404-S407) can be executed piece by piece, reducing memory requirements. Of course, this way of reading and processing is only a preferred scheme; in a specific implementation, multiple pieces of first data fragment data can also be read from the preset cache system at once and merged together.
After a piece of first data fragment data is read from the preset cache system, it can be decomposed into fields. Then, according to the hash value of the first association primary key of that first data fragment data, the Map defined in step S402 is searched for a second association primary key whose hash value equals the hash value of the first association primary key. If none is found, the second data fragment matched with the first data fragment contains no second association primary key identical to the first association primary key of this piece of first data fragment data, and step S404 is executed; otherwise, step S405 is executed.
Step S404: discard the first data fragment data.
Step S405: according to the first association primary key of the first data fragment data and a preset association condition, check whether the second data fragment contains second data fragment data that matches the first data fragment data. If not, execute step S406; if so, execute step S407.
According to the first association primary key of the first data fragment data, the list corresponding to that key is extracted from the Map defined in step S402, and the list is searched according to the preset association condition for second data fragment data that matches the first data fragment data. If none is found, the second data fragment contains no second data fragment data that is associated with the first data fragment data and can be merged with it, and step S406 is executed; otherwise, step S407 is executed.
The preset association condition can be configured by those skilled in the art according to the actual situation, and the present invention does not limit it.
Step S406: discard the first data fragment data.
Step S407: merge the first data fragment data and the second data fragment data.
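Steps S403 to S407 for one matched pair of fragments can be sketched as the loop below. The row layout, the association condition and the name merge_fragment_pair are assumptions; in particular, merging a first row with every matching second row is one possible reading, since the handling of multiple matches is not specified above.

```python
def merge_fragment_pair(first_fragment, second_index, condition):
    """Merge one first data fragment against the Map built in step S402."""
    merged = []
    for key, first_row in first_fragment:        # S403: read one by one
        candidates = second_index.get(key)
        if candidates is None:                   # S404: key absent, discard
            continue
        matches = [row for row in candidates if condition(first_row, row)]  # S405
        if not matches:                          # S406: no associated row, discard
            continue
        for second_row in matches:               # S407: merge the two rows
            merged.append({**first_row, **second_row})
    return merged

first_fragment = [
    ("A", {"id": "A", "name": "x"}),
    ("C", {"id": "C", "name": "z"}),
    ("B", {"id": "B", "name": "y"}),
]
second_index = {
    "A": [{"id": "A", "city": "cd", "active": True}],
    "B": [{"id": "B", "city": "bj", "active": False}],
}
merged = merge_fragment_pair(first_fragment, second_index,
                             condition=lambda first, second: second["active"])
```

Here the row with key "C" is discarded at step S404 and the row with key "B" at step S406, so only the row with key "A" is merged.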
Preferably, the mapPartitionsWithIndex() function can be used to merge the first data fragment data and the second data fragment data in parallel.
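The parallel merge can be illustrated outside Spark with a worker pool in which each task receives a fragment number together with the matched pair of fragments, loosely mirroring a per-partition function that is given the partition index. This is a structural sketch only: the thread pool, the tuple layout and the names merge_pair and merge_all are assumptions, the association condition is omitted, and Python threads do not provide true CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def merge_pair(indexed_pair):
    """Merge one matched pair given as (fragment number, first rows, second rows)."""
    number, first_rows, second_rows = indexed_pair
    second_by_key = {}
    for key, value in second_rows:
        second_by_key.setdefault(key, []).append(value)
    merged = [(key, a, b) for key, a in first_rows
              for b in second_by_key.get(key, [])]
    return number, merged

def merge_all(pairs, parallelism):
    """Run merge_pair over all matched pairs with the given degree of parallelism."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return dict(pool.map(merge_pair, pairs))

pairs = [
    (0, [(4, "a")], [(4, "x"), (8, "y")]),
    (1, [(5, "b")], [(9, "z")]),
]
results = merge_all(pairs, parallelism=2)
```

Each pair is independent of the others, which is what allows the pairs to be processed concurrently without coordination.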
Using the method provided by the present invention, assume that the degree of parallelism of the merge is n; that is, the first data set is divided into n first data fragments and the second data set into n second data fragments. Matching the first data fragments with the second data fragments then yields n pairs of small data sets, which are merged with a degree of parallelism of n. Assume that the sizes of the two associated super-large data sets are x and y respectively. In the prior art, the complexity of merging by one-by-one traversal is x*y. In the present invention, merging the n pairs of small data sets in parallel reduces the complexity to (x/n)*(y/n). If the time consumed by the hash operations is ignored, the time required by the present invention is therefore approximately 1/n² of the time required for merging by one-by-one traversal, so the present invention can effectively improve the overall efficiency of merging data sets. In addition, the present invention is not subject to system resource limits (such as limits on database access pressure and access counts), so no data is lost. The merging process in the present invention fully complies with the preset association condition; if the association condition is correct, the success rate and accuracy of data set merging in the present invention can reach 100%.
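The estimate above can be written out as follows. The derivation assumes that the hash distributes both data sets evenly over the n fragments and that all n pairs really run at the same time; the symbols T and W are introduced here only for illustration.

```latex
\begin{aligned}
T_{\text{traversal}} &\propto x \cdot y \\
T_{\text{pair}} &\propto \frac{x}{n} \cdot \frac{y}{n} = \frac{x y}{n^{2}} \\
\frac{T_{\text{pair}}}{T_{\text{traversal}}} &= \frac{1}{n^{2}} \\
W_{\text{total}} &\propto n \cdot \frac{x y}{n^{2}} = \frac{x y}{n}
\end{aligned}
```

For example, with x = y = 10^6 and n = 100, one-by-one traversal performs on the order of 10^12 comparisons, each pair performs on the order of 10^8, and the total work over all pairs is on the order of 10^10. The 1/n² factor therefore describes elapsed time under full parallelism, while the total amount of work falls by a factor of n.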
Fig. 5 is a schematic structural diagram of a device for merging super-large data sets provided by an embodiment of the present invention. As shown in Fig. 5, the device includes:
A first fragmentation module 51, configured to convert the first association primary key of the first data set into data of a preset field type, fragment the first data set according to the first association primary key, obtain a preset quantity of first data fragments, and cache them in the preset cache system.
The preset field type is the byte type, and the data of the preset field type is a long value of the byte type.
The first fragmentation module 51 is specifically configured to:
read the first data in the first data set in advance, and extract from the first data the first field that needs to be merged;
judge whether the first field is valid;
if so, convert the first association primary key of the first data set to the byte type and obtain the long value of the first association primary key;
calculate the first hash value of the long value of the first association primary key of the first data set, fragment the first data set according to the first hash value, obtain a preset quantity of first data fragments, and cache them in the preset cache system.
A second fragmentation module 52, configured to convert the second association primary key of the second data set into data of the preset field type, fragment the second data set according to the second association primary key, and obtain a preset quantity of second data fragments.
The preset field type is the byte type, and the data of the preset field type is a long value of the byte type.
The second fragmentation module 52 is specifically configured to:
read the second data in the second data set in advance, and extract from the second data the second field that needs to be merged;
judge whether the second field is valid;
if so, convert the second association primary key of the second data set to the byte type and obtain the long value of the second association primary key;
calculate the second hash value of the long value of the second association primary key, and fragment the second data set according to the second hash value.
A matching module 53, configured to read the first data fragments from the preset cache system and match the first data fragments with the second data fragments. The fragment number of each first data fragment is determined according to the first hash value, and the fragment number of each second data fragment is determined according to the second hash value.
The matching module 53 is then specifically configured to:
read the first data fragments from the preset cache system, and match the first data fragments with the second data fragments according to the fragment numbers of the first data fragments and the fragment numbers of the second data fragments.
A merging module 54, configured to merge the matched first data fragment and second data fragment.
The merging module 54 is specifically configured to:
reorganize the second data fragment data of the second data fragment in advance;
read the first data fragment data in the first data fragment one by one from the preset cache system, and check whether the first association primary key of the first data fragment data exists in the second data fragment matched with the first data fragment;
if so, check, according to the first association primary key of the first data fragment data and the preset association condition, whether the second data fragment contains second data fragment data that matches the first data fragment data;
if so, merge the first data fragment data and the second data fragment data.
All relevant content of the steps involved in the above method embodiment can be cited in the functional descriptions of the corresponding functional modules, and their effects are not repeated here.
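How modules 51 to 54 fit together can be sketched with a small class. This is a toy illustration under several assumptions: keys are already long values, the hash is an assumed fold of the high 32 bits into the low 32 bits, an in-memory dictionary stands in for the preset cache system, and the validity check and the preset association condition are omitted. The class and method names are hypothetical.

```python
class MergeDevice:
    """Toy counterpart of modules 51-54; keys are already long values."""

    def __init__(self, fragment_count):
        self.fragment_count = fragment_count
        self.cache = {}  # stands in for the preset cache system

    def _fragment(self, records):
        fragments = {}
        for key, row in records:
            number = ((key ^ (key >> 32)) & 0xFFFFFFFF) % self.fragment_count
            fragments.setdefault(number, []).append((key, row))
        return fragments

    def first_fragment_module(self, first_records):    # module 51
        self.cache = self._fragment(first_records)

    def second_fragment_module(self, second_records):  # module 52
        return self._fragment(second_records)

    def matching_module(self, second_fragments):       # module 53
        return [(self.cache[n], second_fragments[n])
                for n in sorted(self.cache.keys() & second_fragments.keys())]

    def merging_module(self, pairs):                   # module 54
        merged = []
        for first_rows, second_rows in pairs:
            by_key = {}
            for key, row in second_rows:
                by_key.setdefault(key, []).append(row)
            for key, row in first_rows:
                for other in by_key.get(key, []):
                    merged.append((key, row, other))
        return merged

device = MergeDevice(fragment_count=3)
device.first_fragment_module([(1, "a"), (2, "b"), (4, "c")])
second = device.second_fragment_module([(4, "x"), (2, "y"), (7, "z")])
merged = device.merging_module(device.matching_module(second))
```

Keys 1 and 7 have no counterpart in the other data set and therefore do not appear in the merged output.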
Where integrated modules are used, the device for merging super-large data sets includes a storage unit, a processing unit, and an interface unit. The processing unit controls and manages the actions of the device; for example, the processing unit supports the device in executing the steps in Figs. 1-4. The interface unit supports interaction between the device and other devices. The storage unit stores the program code and data of the device.
For example, the processing unit is a processor, the storage unit is a memory, and the interface unit is a communication interface. As shown in Fig. 6, the device for merging super-large data sets includes a communication interface 601, a processor 602, a memory 603, and a bus 604; the communication interface 601 and the processor 602 are connected to the memory 603 through the bus 604.
The processor 602 may be a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), or one or more integrated circuits for controlling the execution of the programs of the present application.
The memory 603 may be a read-only memory (Read-Only Memory, ROM) or another type of static storage device that can store static information and instructions, a random access memory (Random Access Memory, RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-only Memory, EEPROM), a compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM) or other compact disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the bus. The memory may also be integrated with the processor.
The memory 603 is used to store the application program code for executing the scheme of the present application, and its execution is controlled by the processor 602. The communication interface 601 is used to support interaction between the device for merging super-large data sets and other devices. The processor 602 is used to execute the application program code stored in the memory 603, so as to implement the method for merging super-large data sets in the embodiments of the present application.
The present invention also provides a system for merging super-large data sets, including any of the above devices for merging super-large data sets and a preset cache system. For the preset cache system, refer to the corresponding description in step S101, which is not repeated here.
The present invention also provides a computer storage medium (or media), including instructions that, when executed, perform the operations of the method in the above embodiments; when the instructions are run on a computer, the computer executes the above method embodiments.
In addition, the present invention also provides a computer program product, including the above computer storage medium (or media).
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes that element.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above specific embodiments, which are merely illustrative rather than restrictive. Under the teaching of the present application, those of ordinary skill in the art can devise many other forms without departing from the purpose of the present application and the scope protected by the claims, all of which fall within the protection of the present application.

Claims (14)

1. A method for merging super-large data sets, characterized in that the method is implemented based on distributed computing and comprises:
converting the first association primary key of a first data set into data of a preset field type, then fragmenting the first data set according to the first association primary key to obtain a preset quantity of first data fragments, and caching them in a preset cache system;
converting the second association primary key of a second data set into data of the preset field type, then fragmenting the second data set according to the second association primary key to obtain a preset quantity of second data fragments;
reading the first data fragments from the preset cache system, matching the first data fragments with the second data fragments, and merging the matched first data fragments and second data fragments.
2. The method for merging super-large data sets according to claim 1, characterized in that the preset field type is the byte type, and the data of the preset field type is a long value.
3. The method for merging super-large data sets according to claim 2, characterized in that converting the first association primary key of the first data set into data of the preset field type and then fragmenting the first data set according to the first association primary key specifically comprises:
reading the first data in the first data set in advance, and extracting from the first data the first field that needs to be merged;
judging whether the first field is valid;
if so, converting the first association primary key of the first data set to the byte type and obtaining the long value of the first association primary key;
calculating the first hash value of the long value of the first association primary key of the first data set, fragmenting the first data set according to the first hash value, obtaining a preset quantity of first data fragments, and caching them in the preset cache system.
4. The method for merging super-large data sets according to claim 3, characterized in that converting the second association primary key of the second data set into data of the preset field type and then fragmenting the second data set according to the second association primary key specifically comprises:
reading the second data in the second data set in advance, and extracting from the second data the second field that needs to be merged;
judging whether the second field is valid;
if so, converting the second association primary key of the second data set to the byte type and obtaining the long value of the second association primary key;
calculating the second hash value of the long value of the second association primary key, and fragmenting the second data set according to the second hash value.
5. The method for merging super-large data sets according to claim 4, characterized in that a corresponding fragment number is assigned to each first data fragment and a corresponding fragment number is assigned to each second data fragment;
reading the first data fragments from the preset cache system and matching the first data fragments with the second data fragments then specifically comprises:
reading the first data fragments from the preset cache system, and matching the first data fragments with the second data fragments according to the fragment numbers of the first data fragments and the fragment numbers of the second data fragments.
6. The method for merging super-large data sets according to claim 1, characterized in that merging the matched first data fragment and second data fragment specifically comprises:
reorganizing the second data fragment data of the second data fragment in advance;
reading the first data fragment data in the first data fragment one by one from the preset cache system, and checking whether the first association primary key of the first data fragment data exists in the second data fragment matched with the first data fragment;
if so, checking, according to the first association primary key of the first data fragment data and a preset association condition, whether the second data fragment contains second data fragment data that matches the first data fragment data;
if so, merging the first data fragment data and the second data fragment data.
7. The method for merging super-large data sets according to claim 1, characterized in that the preset quantity is determined according to the data sizes of the first data set and the second data set.
8. A device for merging super-large data sets, characterized in that the device is implemented based on distributed computing and comprises:
a first fragmentation module, configured to convert the first association primary key of a first data set into data of a preset field type, then fragment the first data set according to the first association primary key, obtain a preset quantity of first data fragments, and cache them in a preset cache system;
a second fragmentation module, configured to convert the second association primary key of a second data set into data of the preset field type, then fragment the second data set according to the second association primary key, and obtain a preset quantity of second data fragments;
a matching module, configured to read the first data fragments from the preset cache system and match the first data fragments with the second data fragments;
a merging module, configured to merge the matched first data fragments and second data fragments.
9. The device for merging super-large data sets according to claim 8, characterized in that the preset field type is the byte type, and the data of the preset field type is a long value of the byte type.
10. The device for merging super-large data sets according to claim 9, characterized in that the first fragmentation module is specifically configured to:
read the first data in the first data set in advance, and extract from the first data the first field that needs to be merged;
judge whether the first field is valid;
if so, convert the first association primary key of the first data set to the byte type and obtain the long value of the first association primary key;
calculate the first hash value of the long value of the first association primary key of the first data set, fragment the first data set according to the first hash value, obtain a preset quantity of first data fragments, and cache them in the preset cache system.
11. The device for merging super-large data sets according to claim 10, characterized in that the second fragmentation module is specifically configured to:
read the second data in the second data set in advance, and extract from the second data the second field that needs to be merged;
judge whether the second field is valid;
if so, convert the second association primary key of the second data set to the byte type and obtain the long value of the second association primary key;
calculate the second hash value of the long value of the second association primary key, and fragment the second data set according to the second hash value.
12. The device for merging super-large data sets according to claim 11, characterized in that a corresponding fragment number is assigned to each first data fragment and a corresponding fragment number is assigned to each second data fragment;
the matching module is then specifically configured to:
read the first data fragments from the preset cache system, and match the first data fragments with the second data fragments according to the fragment numbers of the first data fragments and the fragment numbers of the second data fragments.
13. The device for merging super-large data sets according to claim 8, characterized in that the merging module is specifically configured to:
reorganize the second data fragment data of the second data fragment in advance;
read the first data fragment data in the first data fragment one by one from the preset cache system, and check whether the first association primary key of the first data fragment data exists in the second data fragment matched with the first data fragment;
if so, check, according to the first association primary key of the first data fragment data and a preset association condition, whether the second data fragment contains second data fragment data that matches the first data fragment data;
if so, merge the first data fragment data and the second data fragment data.
14. The device for merging super-large data sets according to claim 8, characterized in that the preset quantity is determined according to the data sizes of the first data set and the second data set.
CN201810772324.5A 2018-07-13 2018-07-13 Method and device for merging super-large data sets Active CN109033295B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810772324.5A CN109033295B (en) 2018-07-13 2018-07-13 Method and device for merging super-large data sets

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810772324.5A CN109033295B (en) 2018-07-13 2018-07-13 Method and device for merging super-large data sets

Publications (2)

Publication Number Publication Date
CN109033295A true CN109033295A (en) 2018-12-18
CN109033295B CN109033295B (en) 2021-07-02

Family

ID=64642826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810772324.5A Active CN109033295B (en) 2018-07-13 2018-07-13 Method and device for merging super-large data sets

Country Status (1)

Country Link
CN (1) CN109033295B (en)


Also Published As

Publication number Publication date
CN109033295B (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN109033295A (en) The merging method and device of super large data set
US10452691B2 (en) Method and apparatus for generating search results using inverted index
US8719237B2 (en) Method and apparatus for deleting duplicate data
CN112800095B (en) Data processing method, device, equipment and storage medium
US9529849B2 (en) Online hash based optimizer statistics gathering in a database
US11074242B2 (en) Bulk data insertion in analytical databases
US10915534B2 (en) Extreme value computation
WO2018036549A1 (en) Distributed database query method and device, and management system
US20140089258A1 (en) Mail indexing and searching using hierarchical caches
CN111797096A (en) Data indexing method and device based on ElasticSearch, computer equipment and storage medium
CN113297250A (en) Method and system for multi-table association query of distributed database
CN109117426A (en) Distributed networks database query method, apparatus, equipment and storage medium
CN117033424A (en) Query optimization method and device for slow SQL (structured query language) statement and computer equipment
CN112445776B (en) Presto-based dynamic barrel dividing method, system, equipment and readable storage medium
CN109101621A (en) A kind of batch processing method and system of data
Beedkar et al. Closing the gap: Sequence mining at scale
CN104794237A (en) Web page information processing method and device
CN104750846A (en) Method and device for finding substring
CN107169313A (en) The read method and computer-readable recording medium of DNA data files
CN103891244B (en) A kind of method and device carrying out data storage and search
KR101299555B1 (en) Apparatus and method for text search using index based on hash function
CN110489601A (en) A kind of quick dynamic updating method of real time data index based on caching mechanism
Albers et al. Quantifying competitiveness in paging with locality of reference
US20240095246A1 (en) Data query method and apparatus based on doris, storage medium and device
Huston et al. Sketch-based indexing of n-words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant